Goodreads Book Graph Datasets
Overview
We collected three groups of datasets: (1) meta-data of the books, (2) user-book interactions (users' public shelves) and (3) users' detailed book reviews. These datasets can be merged together by joining on book/user/review ids.
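As a rough illustration of the joining step described above, the sketch below attaches reviews to the book records that share an id. The field names (`book_id`, `review_id`, `user_id`, `title`, `rating`) and the toy records are assumptions made for this example, not taken from the dataset files.

```python
def join_on_id(books, reviews, key="book_id"):
    """Inner-join two lists of dict records on a shared id field."""
    # Index the book records once so each review lookup is O(1).
    books_by_id = {book[key]: book for book in books}
    joined = []
    for review in reviews:
        book = books_by_id.get(review[key])
        if book is not None:  # drop reviews whose book is not in `books`
            joined.append({**book, **review})
    return joined


# Toy records with hypothetical field names.
books = [
    {"book_id": "1", "title": "A"},
    {"book_id": "2", "title": "B"},
]
reviews = [
    {"review_id": "r1", "book_id": "2", "user_id": "u1", "rating": 4},
    {"review_id": "r2", "book_id": "3", "user_id": "u2", "rating": 5},
]
merged = join_on_id(books, reviews)
```

Only review `r1` survives the join here, because no book record with id `3` is present. The same pattern applies to joins on user or review ids by changing `key`.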
Basic Statistics of the Complete Book Graph:
- 2,360,655 books (1,521,962 works, 400,390 book series, 829,529 authors)
- 876,145 users; 228,648,342 user-book interactions in users' shelves (including 112,131,203 reads and 104,551,549 ratings)
Note that the complete interaction dataset is very large! We extracted several medium-sized subsets by genre and recommend experimenting with these subsets first (see "By Genre" in the Datasets section for details).
Latest News
- [May 2023] Our datasets have been moved! Please refer to this webpage for download instructions. The previous Google Drive links will be deprecated soon.
Code Samples
- Download datasets without GUI: download.ipynb
- Display sample records: samples.ipynb
- Calculate basic statistics: statistics.ipynb
- Explore the interaction data: distributions.ipynb
- Explore the review data: reviews.ipynb
Citations
- Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", in RecSys'18. [bibtex]
- Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, "Fine-Grained Spoiler Detection from Large-Scale Review Corpora", in ACL'19. [bibtex]
Datasets
Meta-Data of Books
- Detailed book graph (~2gb, about 2.3m books): goodreads_books.json.gz
- Detailed information of authors: goodreads_book_authors.json.gz
- Detailed information of works (i.e., the abstract version of a book, regardless of any particular edition): goodreads_book_works.json.gz
- Detailed information of book series (Note: Unfortunately, the series id included here cannot be used for URL hack): goodreads_book_series.json.gz
- Extracted fuzzy book genres (genre tags are extracted from users' popular shelves by a simple keyword matching process): goodreads_book_genres_initial.json.gz
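Because the largest of these files is around 2 GB compressed, it can help to stream records rather than load a whole file into memory. The sketch below assumes that each `.json.gz` file stores one JSON object per line; that layout is an assumption here and is worth checking against the sample notebooks. The helper name `iter_records` is hypothetical.

```python
import gzip
import json


def iter_records(path, limit=None):
    """Yield one dict per non-empty line of a gzipped, line-delimited JSON file.

    Assumes one JSON object per line; `limit` caps the number of records.
    """
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if limit is not None and count >= limit:
                break
            line = line.strip()
            if not line:
                continue  # skip blank lines
            yield json.loads(line)
            count += 1
```

For a quick look at a file, something like `list(iter_records("goodreads_books.json.gz", limit=3))` would read only the first three records, assuming the file has been downloaded to the working directory.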
Book Shelves
- Complete user-book interactions in 'csv' format (~4.1gb): goodreads_interactions.csv
  User Ids and Book Ids in this file can be reconstructed by joining on the following two files: book_id_map.csv, user_id_map.csv.
- Detailed information of the complete user-book interactions (~11gb, ~229m records): goodreads_interactions_dedup.json.gz
- User-Book Club mapping information: book_clubs.json
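The id reconstruction mentioned for goodreads_interactions.csv can be sketched with the standard `csv` module. The column names used in the usage note (`user_id_csv`, `book_id_csv`, `user_id`, `book_id`) are assumptions about the file headers, so they are passed as parameters rather than hard-coded; check the actual headers before relying on them. The helper names are hypothetical.

```python
import csv


def load_id_map(path, csv_col, original_col):
    """Read a two-column map file into a dict: csv id -> original id."""
    with open(path, newline="", encoding="utf-8") as fh:
        return {row[csv_col]: row[original_col] for row in csv.DictReader(fh)}


def restore_ids(interactions_path, user_map, book_map,
                user_col="user_id", book_col="book_id"):
    """Stream interaction rows, replacing csv ids with the original ids.

    Ids missing from a map become None instead of raising an error.
    """
    with open(interactions_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            row[user_col] = user_map.get(row[user_col])
            row[book_col] = book_map.get(row[book_col])
            yield row
```

Under the assumed headers, `load_id_map("user_id_map.csv", "user_id_csv", "user_id")` and `load_id_map("book_id_map.csv", "book_id_csv", "book_id")` would build the two lookups. Streaming the interactions file row by row avoids holding the full ~4.1gb file in memory; only the two map files are loaded whole.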
Book Reviews
- Complete book reviews (~15m multilingual reviews about ~2m books and 465k users): goodreads_reviews_dedup.json.gz
- English review subset for spoiler detection (~1.3m book reviews about ~25k books and ~19k users, parsed at sentence-level): goodreads_reviews_spoiler.json.gz
- English review subset for spoiler detection (~1.3m book reviews about ~25k books and ~19k users, raw texts): goodreads_reviews_spoiler_raw.json.gz
By Genre
Note in these datasets:
- Books may overlap across different genres (i.e., one book may belong to multiple genres);
- The subgraph for each genre may not be self-contained. Those are subsets of the nodes on the complete book graph. Detailed information about authors, works, book series etc. can be found in the meta-data section.
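One way to see what "not self-contained" means in practice is to look for ids that a genre subset refers to but does not itself contain. The sketch below is illustrative only: the `book_id` field name and the toy records are assumptions, and `dangling_ids` is a hypothetical helper.

```python
def dangling_ids(referencing, present, key="book_id"):
    """Return ids used in `referencing` records that are absent from `present`."""
    known = {record[key] for record in present}
    return {record[key] for record in referencing} - known


# Toy genre subset: one interaction points at a book outside the subset.
genre_books = [{"book_id": "10"}, {"book_id": "11"}]
genre_interactions = [
    {"user_id": "u1", "book_id": "10"},
    {"user_id": "u2", "book_id": "12"},
]
missing = dangling_ids(genre_interactions, genre_books)
```

Here `missing` contains only the id `12`. Ids found this way would need to be resolved against the complete book graph files.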
Comics & Graphic (89,411 books, 7,347,630 interactions, 542,338 detailed reviews)
Fantasy & Paranormal (258,585 books, 55,397,550 interactions, 3,424,641 detailed reviews)
History & Biography (302,935 books, 31,479,229 interactions, 2,066,193 detailed reviews)
Mystery, Thriller & Crime (219,235 books, 24,799,896 interactions, 1,849,236 detailed reviews)
Poetry (36,514 books, 2,734,350 interactions, 154,555 detailed reviews)
Romance (335,449 books, 42,792,856 interactions, 3,565,378 detailed reviews)
Young Adult (93,398 books, 34,919,254 interactions, 2,389,900 detailed reviews)