These datasets were collected in late 2017 from goodreads.com. We scraped only users' public shelves, i.e., shelves that anyone can view on the web without logging in. User IDs and review IDs are anonymized.
We collected these datasets for academic use only. Please do not redistribute them or use them for commercial purposes.
We collected three groups of datasets: (1) meta-data of the books, (2) user-book interactions (users' public shelves), and (3) users' detailed book reviews. These datasets can be merged by joining on book, user, and review IDs.
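The join described above can be sketched on a few toy records. This is only an illustration: the field names (`book_id`, `user_id`, `review_id`, `title`, `rating`, `is_read`, `review_text`) and the helper `join_records` are assumptions made for the example and should be checked against the actual files.

```python
# Toy records standing in for the three dataset groups.
# All field names and values here are hypothetical, chosen for illustration.
books = [
    {"book_id": "b1", "title": "Example Book One"},
    {"book_id": "b2", "title": "Example Book Two"},
]
interactions = [
    {"user_id": "u1", "book_id": "b1", "is_read": True, "rating": 4},
    {"user_id": "u1", "book_id": "b2", "is_read": False, "rating": 0},
    {"user_id": "u2", "book_id": "b1", "is_read": True, "rating": 5},
]
reviews = [
    {"review_id": "r1", "user_id": "u2", "book_id": "b1", "review_text": "Loved it."},
]


def join_records(books, interactions, reviews):
    """Attach the book title and (if any) review ID to each interaction."""
    # Index books by book ID and reviews by (user ID, book ID) for fast lookup.
    titles = {b["book_id"]: b["title"] for b in books}
    review_by_pair = {(r["user_id"], r["book_id"]): r for r in reviews}

    merged = []
    for it in interactions:
        review = review_by_pair.get((it["user_id"], it["book_id"]))
        merged.append({
            **it,
            "title": titles.get(it["book_id"]),  # None if the book is missing
            "review_id": review["review_id"] if review else None,
        })
    return merged


merged = join_records(books, interactions, reviews)
```

Using `dict.get` keeps interactions whose book or review is absent, which behaves like a left join on the interaction records.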
Basic Statistics of the Complete Book Graph:
2,360,655 books (1,521,962 works, 400,390 book series, 829,529 authors)
876,145 users; 228,648,342 user-book interactions in users' shelves (including 112,131,203 reads and 104,551,549 ratings)
Download links to these datasets can be found in the Datasets section below.
Note that the complete interaction dataset is very large! We extracted several medium-size subsets by genre and recommend experimenting with these subsets first (see "By Genre" in the Datasets section for details).
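When a file is too large to load at once, one option is to stream it record by record and stop after a fixed number of records. The sketch below assumes the file is gzip-compressed with one JSON object per line; that format, and the helper name `iter_json_lines`, are assumptions for illustration and may not match the actual downloads.

```python
import gzip
import json
from itertools import islice


def iter_json_lines(path, limit=None):
    """Yield parsed records from a gzip file with one JSON object per line.

    Reads at most `limit` lines (all lines if `limit` is None), so memory
    use stays flat regardless of file size.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in islice(f, limit):
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `sum(1 for _ in iter_json_lines(path, limit=100000))` would count the records found in the first 100,000 lines without holding them in memory, where `path` is whatever local file you downloaded.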
[May 2023] Our datasets have been moved! Please refer to this webpage for how to download the datasets. The previous Google Drive links will be deprecated soon.
You can find code samples for loading the datasets and doing basic data exploration in our dataset GitHub repository.
Books may overlap across genres (i.e., one book may belong to multiple genres).
The subgraph for each genre may not be self-contained. Each one is a subset of the nodes of the complete book graph.
Detailed information about authors, works, book series etc. can be found in the meta-data section.
Children (124,082 books, 10,059,349 interactions, 734,640 detailed reviews)