1 Background and Introduction

Reading is one of the most ubiquitous activities in our daily lives. We have limited knowledge about historical everyday readers and their reading behavior due to a lack of records left by or collected from them [108]. In the last two decades, the increasing availability of user-generated book reviews from online sources (Footnote 1) has opened up unprecedented opportunities for computational and empirical research on readerships and everyday reading behavior. Scholars from different fields, e.g., library science, digital humanities, communication studies, and natural language processing, have leveraged such data to examine a variety of topics, such as review classification, social network analyses of readers, impact assessment and sales prediction of books [20, 64, 69, 80, 85, 99, 150, 152, 163]. With the evolution of book review studies, challenges and limitations have also emerged, ranging from disciplinary divergences (such as reader-orientated theories vs. book-centric models [31, 66]) to limitations of the scholarly usability of review corpora (such as review credibility and inclusiveness [66, 67, 70, 73, 115, 118, 168]). This paper asks a question that has yet to be fully explored in prior empirical and computational studies of user-generated book reviews: How can online book reviews best be used for scholarly research from legal, ethical, and compliance perspectives?

This paper is motivated by two factors: First, while the ethical use of user-generated content and social media data for research purposes has been critically discussed [33, 34, 46, 107, 168], contextualized investigations of specific genres remain very much in need. As Crawford and Finn have pointed out, “social and mobile datasets have limitations that, if not sufficiently understood and accounted for, can produce specific kinds of analytical and ethical oversights” [29]. In their own research on crisis data, they demonstrated the necessity and potential of this research direction by critically examining (1) what crisis data actually represents; and, (2) how these data were used in crisis research [29]. Other studies that focused on specific datasets and use cases have also shed light on specific research challenges and responsibilities by examining ethical issues that stem from work in specific domains or research contexts [22, 35, 37, 48, 50, 98].

Following these exemplary studies, we propose to scrutinize the challenge of legal, ethical, and compliant research conduct in the context of user-generated book reviews. We argue for a deeper engagement with these datasets because of the dual role that many book reviewers play as (1) social media producers and content consumers; and, (2) readers. When people voluntarily post book reviews, they also reveal aspects of their reading history, whether they are aware of that or not. As a result, user-generated book reviews, like most social media data, may contain directly or indirectly personally identifiable information (Footnote 2) [10]. At the same time, similar to library patron data, user-generated book reviews record activities and thoughts that are protected as part of people’s intellectual freedom and valuable contributions to the diversity of viewpoints in society [8]. For instance, online book reviewers might express opinions, values, and beliefs, which can be vehement, controversial, or even illegal (e.g., acquiring and reading banned books). Reviewers may also share personal experiences, information about their physical and mental health, and their socio-demographic identities. These types of sensitive information lead to concerns about the legitimacy and ethics of using such data in scholarly research.

Second, we examine the usage of book reviews to further minimize potential risks for reviewers and researchers. In order to ensure library patrons’ freedom to read, an unfettered exchange of ideas, and equal access to diverse materials and services, library professionals have long protested against policies that would harm the confidentiality of their patrons’ data (e.g., search records, book loans, reference interviews) [6, 7, 8, 9, 77, 78, 134]. In practice, many libraries regularly remove circulation records and decline to keep certain patron data in order to protect their patrons’ privacy from “irresistible government requests” [43, 90]. For similar reasons, book reviewers’ reading records and opinions also need protection because reviews might be subject to censorship and could be used against those reviewers. However, online book reviews have not been protected or managed like library patron data, possibly because they have not been conceptualized in this way, but rather as reviews of consumer goods. This is problematic because censorship, trolling, scams, and harassment targeted at online book reviewers have increased [83, 104]. Disliked online book reviews have led to cyber doxing and personal attacks on individual reviewers from book authors, translators, and the public, both online and offline, around the world [17, 83, 92, 116, 128]. For instance, in 2014, a teenage girl in the U.K. was tracked down and assaulted by an author because the girl had left a negative review about one of the author’s books on Wattpad (Footnote 3) [17, 117]. Although this horrifying incident was an unexpected result of the review posting itself, without any research involved, researchers need to consider the potential for actual harm when designing their studies and reproducing (or even amplifying) potentially harmful content.

At the same time, researchers might be exposed to professional, institutional, and legal consequences of scraping and analyzing user-generated book reviews, such as copyright infringement, violations of policies and end-user license agreements (EULA)/terms of service (TOS) (Footnote 4), and conflicts of interest with various stakeholders’ policies. Most user-generated book reviews are considered copyrighted material and/or material governed by TOS/EULAs. Some platforms that make a profit with their user-generated book reviews have explicitly forbidden unauthorized third-party use of their data via TOS, which means researchers are expected to acknowledge the potential legal hazards that come with accessing and using reviews. Also, for research based on copyrighted data that is not subject to fair use, scholarly use of the data for non-commercial purposes or the public good does not serve as an exemption from the possibility of legal consequences. For example, the HathiTrust (Footnote 5) was sued by The Authors Guild for copyright infringement because of the use of books scanned by Google [15], and the Internet Archive (Footnote 6) was sued by major book publishers for “grossly” exceeding what libraries were permitted to do by providing “emergency” access to digital teaching materials during the COVID pandemic [57, 146]. These cases are reminders that even for public institutions, it is difficult to manage the legal risks associated with their use of data. We conclude that researchers need to understand how they can access and use user-generated book reviews in ways that protect both their human research subjects and themselves from harm and risks.

Therefore, this paper examines the legitimacy and ethics of leveraging user-generated book reviews in scholarly research. We draw upon library standards and practices in addition to existing scholarly discussions to identify potential pitfalls and solutions. Specifically, we investigate (1) relevant laws; (2) platform policies; (3) user rights and expectations; and, (4) existing research on the ethical use of user-generated data at large. Here are the two primary questions we posit and how we analyze them:

  1. Question: What does prior research say about compliance and ethical conduct of research that uses user-generated book reviews?

     Analysis: We review 100 research articles that feature empirical analyses of user-generated book review datasets and their creators/users. We collected these references as part of our empirical and computational research on book reviews [25, 72, 73, 93, 127]. (Footnote 7) The findings are presented in Sect. 2.

  2. Question: What factors should researchers consider for assessing the appropriateness of their use of data while minimizing potential risks caused by their research?

     Analysis: We analyze a broader range of literature to understand the norms, regulations, and concerns for employing user-generated content (book reviews included) from the perspective of legislation, platform providers, users, and researchers. The analyses are presented in Sect. 3.

Then, in Sect. 4, we discuss the findings and limitations of our investigation. In Sect. 5, we summarize our research contributions and propose topics for future work. Due to variance in legislation, expectations, and norms for ethics and compliance across place, time, and disciplines, this paper does not provide a comprehensive review of prior research on user-generated book reviews, but is consciously situated primarily in a contemporary, U.S.-centric context. We invite readers to extend our approach to their own disciplinary and local contexts.

2 Literature Review of Computational and Empirical Studies that Use User-Generated Book Reviews

Existing research on user-generated book reviews has investigated a variety of datasets from different sources around the globe and in a variety of languages [115], such as reviews in Chinese [59, 64, 111, 164, 165], Dutch [19, 86], and German [40, 119]. Among these, book reviews in English obtained from Amazon, Goodreads, and LibraryThing [12, 20, 68, 150, 152, 162, 163] (Footnote 8) are most frequently used. Data leveraged include (1) actual review texts, crowdsourced tags, book ratings, rankings, and lists; (2) reviewers’ public profiles and networks; (3) forum discussions and social media posts; and, (4) information about book sales and price [4, 12, 20, 32, 60, 64, 68, 69, 99, 139, 150, 152, 162, 163]. The scale and granularity of previously compiled and referenced datasets vary drastically, ranging from hundreds to millions of records [4, 60, 115, 124, 130, 152]. For instance, Wan and colleagues scraped 1,378,033 English book reviews for spoiler detection [152], while Tan and He qualitatively compared 200 book reviews in Chinese and English as part of a multi-method analysis on cross-cultural reception [130].

These book review datasets have enabled computational and empirical research in various disciplines, including library and information sciences (LIS) [162, 163], digital humanities and cultural analytics [20, 85, 150], computer supported cooperative work [12], social network analysis [99], computational linguistics [152], recommender systems and marketing [27, 151], and decision making [64, 68, 69]. In turn, each discipline has brought its own topics to the research. For instance, LIS scholars have studied reviews through the lenses of crowd cataloging and social tagging [16, 24, 97, 139, 149]; citation index and impact assessment [111, 153, 166, 169]; and readers’ social networks and activities [110, 136, 137, 138, 162]. Cultural historians and literary scholars have asked questions about the evolution of literary genres, the formation of literary canons, and reception of literary works [20, 39, 42, 127, 150]. Marketing, economics, and system scientists have examined the relationship between book reviews and book sales [27, 64, 99, 129]. Natural language processing scholars and computational linguists have built models for review classification (e.g., fake, spoiler, and most helpful reviews) [50, 68, 141, 152], sentiment analysis and opinion mining [69, 96], and extracting narratives and relationships among characters [63, 125]. Several taxonomies and conceptual frameworks have been proposed to map and synthesize prior work on user-generated book reviews [89, 118].

Despite the differences between previously used datasets in terms of language, source, scale, and research topic, most datasets are collected via web scraping [26, 75, 150, 152], using application programming interfaces (APIs) provided by the hosting platforms (for example, Goodreads used to provide an API, and Amazon Web Services (AWS) provides an API for individuals) [69, 71, 122, 153, 169], or a combination of the two [36, 99]. (Footnote 9) Robots.txt files are a server-side solution for determining what data can be accessed and how, and can inform web scraping efforts. APIs implement the rules for data collection that providers define for their service, and are therefore a recommended route for data gathering. Not all platforms provide APIs, however, because enabling research may not be part of a provider’s business model or might conflict with their user agreements. For instance, Goodreads shut down its API for accessing book review data in 2020 and made large-scale data scraping difficult by restricting its webpage content (e.g., sorting reviews with its proprietary algorithms) [36, 150]. Given such implementations, data scraping is broadly adopted for data collection, although it might violate copyright and the EULA/TOS of a platform.

Legal risks and ethical concerns associated with book review scraping and related downstream tasks have been discussed before, but only rarely. One of the articles we reviewed mentioned copyright exemptions for research [114]. A few articles have discussed the acquisition of permissions for data collection [162, 163] and attempts to request permissions [114] from the provider platforms. Considerations of human subjects research and institutional ethics review are also often absent. (Footnote 10) Within publications of U.S.-based scholars, we only found two articles where consideration of and exemption from Institutional Review Board (IRB) (Footnote 11) oversight was explicitly mentioned [12, 102]. Relatedly, only a small number of articles explicitly discussed actions taken to protect the identities of the book reviewers, such as (1) removing user names and other user profile information that might reveal a reviewer’s real-world identity (e.g., self-reported non-binary gender identities) [4, 32, 38, 88, 102, 114, 120, 122, 130] or paraphrasing quoted reviews [20]; and/or (2) not publishing the original data scraped, the release of which might also violate copyright and EULAs [12, 131, 150]. In contrast, most research did not describe how researchers pre-processed potential personally identifiable information; such information might remain accessible in existing book review datasets [62, 103]. (Footnote 12)
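To make the first of these protective actions concrete, the following Python sketch shows one possible way to drop identifying profile fields and replace user names with keyed pseudonyms before analysis. It is an illustration under stated assumptions, not a procedure taken from any of the reviewed studies: the field names (`username`, `profile_url`, `location`, `gender`), the `reviewer_` prefix, and the secret key are all hypothetical, and the review text itself is left untouched, so re-identification through verbatim quotes remains possible.

```python
import hashlib
import hmac

# Hypothetical profile fields to strip; real datasets will use other names.
IDENTIFYING_FIELDS = {"username", "profile_url", "location", "gender"}


def pseudonymize(username: str, secret_key: bytes) -> str:
    """Map a user name to a stable pseudonym using a keyed hash (HMAC-SHA256).

    The same user name and key always give the same pseudonym, so reviews by
    one reviewer can still be grouped. Without the key, the pseudonym cannot
    be recomputed from a guessed user name.
    """
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256)
    return "reviewer_" + digest.hexdigest()[:12]


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Return a copy of a review record without identifying profile fields."""
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    cleaned["reviewer_id"] = pseudonymize(record["username"], secret_key)
    return cleaned


# Usage with an invented record and an invented key.
example_key = b"keep-this-key-out-of-the-published-dataset"
example_record = {
    "username": "bookworm_1987",
    "location": "Springfield",
    "rating": 4,
    "text": "A slow start, but the ending is worth it.",
}
cleaned_record = deidentify(example_record, example_key)
```

Keeping the key private and separate from any released data is the design choice that matters here: an unkeyed hash of a user name could be reversed by hashing candidate user names scraped from the same platform.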

In conclusion, our literature review indicates a general absence of (1) informed consent from authors of book reviews; (2) permissions obtained from data sources; or (3) institutional ethics review in existing computational studies of user-generated book reviews. Discussions of legal and ethical risks associated with such practices were also largely absent. As discussed in the introduction, failure to consider these issues could pose risks to online users/readers, researchers, and academia alike. Therefore, we survey a broader range of literature and guidelines to fill this gap in legal and ethical considerations of the scholarly usage of user-generated book reviews.

3 Analysis and Findings

We analyze (1) relevant laws; (2) platform policies; (3) user rights and expectations; and, (4) researchers’ discussion of ethical issues in user-generated data research. We combine our analysis with real-world and research cases, particularly studies on book reviews. Our findings are presented in the following four subsections. The four aspects we consider are not isolated; in practice, they intertwine with each other in complementary or sometimes conflicting ways (as exemplified in the following discussions). For example, some research aspects might be ethical but not legal, e.g., violating TOS to scrape publicly available book information, or legal but not ethical, for example, quoting snippets from identifying public information of vulnerable communities.

3.1 Legal Permissions and Risks

One primary legal risk associated with research based on user-generated book reviews comes from data scraping. Various data-scraping lawsuits have been initiated, claiming violations of TOS, copyright infringements, or unfair competition [15, 57]. In this subsection, we consider cases in the U.S. as an example. Researchers from other jurisdictions should refer to the corresponding regulations that apply to their research scenarios. For U.S.-based studies, researchers should first consult the Copyright Law of the United States [143] and the fair use doctrine for risks associated with copyright infringement, and the Computer Fraud and Abuse Act (CFAA) [30] to minimize the risks of being sued. Fair use only conditionally permits unlicensed use of copyright-protected work under certain circumstances (Footnote 13). Scholarship and research activities are typically protected by the fair use doctrine [143], but a self-assessment of each use case and/or consultation with a copyright specialist can help to make responsible decisions.

For research based on large-scale scraped data [122, 152], to reduce legal risks associated with copyrighted content, researchers may consider making transformative and non-consumptive use of the data (Footnote 14), which has been increasingly adopted in computational studies of massive cultural data [79, 113, 123]. Furthermore, scholarly use of book review data might not fall under the concerns of the CFAA as the usage is non-commercial and for educational/research purposes only [5]. However, it is essential for researchers to understand the CFAA and address other potential conflicts between their intended use of data and the provider platforms’ policies (e.g., TOS/EULAs), which are discussed in the following subsections.

Second, researchers need to comply with laws that govern the use of personal data and privacy. In the U.S., applicable laws include privacy laws [3, 82], state laws like the California Consumer Privacy Act (CCPA) [23], and state laws protecting the privacy of library records. Library records typically include online search records, circulation records, interlibrary loan records, personally identifiable uses of library materials and services, etc. Although no federal legislation or case law has been established to protect the privacy of library records, forty-eight states and the District of Columbia have established laws regarding the confidentiality of library records [7, 90]. (Footnote 15) While accessing and presenting publicly accessible user-generated book reviews obtained from commercial websites is different from disclosing confidential user records held by libraries, both actions might expose individual reviewers’ personal data to a third party or the public. Therefore, we advise researchers to check relevant laws on library records to understand legal requirements associated with library patron records and data alike.

Last but not least, researchers should note that user-generated content is often contributed by users from around the globe, regardless of where the platforms are based. For instance, while Goodreads is based in the U.S., its user base is global [122, 140]. Therefore, researchers working on data collection from U.S.-based providers should examine international and regional regulations as well, such as the World Intellectual Property Organization (WIPO) Copyright Treaty [161], Europe’s Directive on Copyright in the Digital Single Market [135] and General Data Protection Regulation (GDPR) [44], and China’s Personal Information Protection Law (PIPL) [155]. This recommendation applies to research based in other areas of the world, too.

3.2 Policies and Guidelines Issued by Platform Providers

Three types of documents from platform providers are most relevant for understanding the permitted use of book review data (any of them, or none, may be available): data access solutions provided by the platform (e.g., APIs, AWS), TOS/EULAs, and “robots.txt” files (Footnote 16). These documents specify, among other things, which data from these services can or cannot be used, and how. For instance, the TOS of Goodreads [56] (Footnote 17) severely limit use of data to prevent inappropriate commercial competition, copyright infringement, and violations of user privacy rights. They state that the allowable use of Goodreads data does not include “any use of data mining, robots, or similar data gathering and extraction tools” [56], and Goodreads restricts the data that people can access from its front page via review sorting algorithms and user-interface design [150]. In addition, Goodreads’ robots.txt excludes a list of sitemaps and webpages from web scraping even though they are publicly accessible [55], and the site retired its API in 2020 [122]. Given these limitations, the next question for researchers might be: what are the consequences of scraping data from platforms that explicitly or implicitly prohibit scraping?

On the one hand, researchers might argue for their use of data scraping or scraped data against the platforms’ policies under certain conditions, e.g., when the research’s “benefits to society outweigh the harm of violating terms of service” [145]. One important aspect in advocating for this position is to consider how “public” the scraped data are: while dominant social media platforms are likely to “continue to push the boundaries on allowable methods to limit data scraping”, the court decisions in the case of hiQ Labs v. LinkedIn signaled “a shift in the way courts may be viewing attempts to restrict data scraping” [53] in the U.S. (Footnote 18) While heated debates on the implications of these rulings continue, a widely recognized takeaway for researchers is that scraping data that is publicly accessible without access control, such as passwords, paywalls, physical or technical barriers (e.g., software verification), is not necessarily unlawful, even if such scraping is prohibited by the platform’s TOS/EULAs [14, 49]. These rulings, to some extent, suggest that researchers are not doomed to be criminalized for scraping publicly accessible data without a platform’s permission or against its policies. On the other hand, the legal and ethical consequences of violating TOS/EULAs in data collection for research purposes remain an open question [46, 145, 148], as the feasibility and enforceability of platforms’ TOS, particularly their prohibitions, are subject to further examination [5, 28]. Existing research on the TOS of over a hundred global social media platforms found that “though these provisions are very common, they are also ambiguous, inconsistent, and lack context” and “may reflect possibly conflicting values” [3, 46]. It is also important to note that platform policies might not align with the best interests of their users or researchers’ ethical considerations [3, 46].

In short, although there is no clear answer to “whether researchers should be permitted to violate TOS when collecting data” [46], a violation of TOS alone does not necessarily criminalize researchers’ data scraping. In the U.S., current federal regulation does not compel researchers to follow EULAs and does not criminalize scraping as a violation of the CFAA (although scraping might still violate copyright and privacy laws and regulations). Researchers whose plans for data scraping do not align with the platform policies are advised to conduct a careful assessment of their use case. For instance, they should consider whether the data to be scraped are publicly accessible, and they should avoid scraping from disallowed webpages/websites that are specified in robots.txt files/EULA/TOS. Finally, even if data collection procedures follow the requirements and guidelines of a platform, researchers also need to consider how to protect users, as EULAs/TOS do not necessarily align with the best interest of users [3, 47].
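As one hedged illustration of the robots.txt part of such an assessment, the sketch below uses Python's standard-library `urllib.robotparser` to filter a list of candidate URLs against a set of robots.txt rules. The rules, the host `books.example.org`, and the user agent name `research-bot` are invented for this example and do not reproduce any real platform's file; in practice the rules would be read from the platform's own robots.txt. Passing this check says nothing about TOS/EULAs, copyright, or user expectations, which still need separate consideration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the sketch runs without network access.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /review/
Disallow: /user/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been retrieved as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list) -> list:
    """Keep only the URLs that the parsed rules allow for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(EXAMPLE_ROBOTS_TXT)
candidates = [
    "https://books.example.org/book/show/1",
    "https://books.example.org/review/show/42",
    "https://books.example.org/user/show/7",
]
permitted = allowed_urls(parser, "research-bot", candidates)

# Seconds to wait between requests, or None if the file sets no crawl delay.
delay = parser.crawl_delay("research-bot")
```

With these invented rules, only the book page remains in `permitted`, and `delay` carries the requested pause between requests.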

3.3 User Rights and Expectations

Relevant laws and platform policies may fail to protect user rights or meet their privacy expectations: “Users care about how their content can be used yet lack critical information” [47]. Therefore, researchers should assess how their planned work might conflict with the interests of users. To help with that, based on our literature review, we identified four potential pitfalls and approaches for avoiding them. First, a user’s acceptance of TOS is not the same as their “informed consent” to any third-party use of their data. Prior surveys have shown that most users do not read the TOS they accept or consent to due to “lack of choice, inaptitude, or habituation” [18, 105]. Meanwhile, without prior knowledge or additional information, it is beyond any individual user’s capability to predict the third-party use of their data and potential hazards of that use. Therefore, responsible researchers should not assume that their use of user-generated data is within the expectations of the data creators simply due to their acceptance of a platform’s TOS.

Second, researchers should not necessarily take publicly accessible data as “data open for use”. This false assumption has led to various problems, such as re-identification of users in data shares and violations of user privacy [35, 167, 168]. There is a fundamental difference between (1) data that is public; and (2) data that has been consciously made public by users. The degree to which user (generated) data is public varies: some data are actively created and shared by users (e.g., book reviews that are set to be visible to all), while other data are passive traces automatically generated by algorithms based on user activities (e.g., location information based on IP addresses, time stamps associated with user activities, etc.) [65]. For the first case, some platforms, such as LibraryThing, allow users to set and alter the level of visibility of their contributed content (e.g., write a review that is public to all or kept to oneself) [94]. If reviewers explicitly choose to make their data public, researchers can assume that users are aware of their choice, even though they might not anticipate use cases beyond the visibility of the given site. (Footnote 19) Even in this case, and moreover in general, users might not be aware that their data is part of passive digital traces or is available for third-party use.

Third, using public user data does not free the researchers from responsibility to avoid accidental or inappropriate use of private information, even though it might have been the users who disclosed their private information in the first place. As mentioned, user-generated book reviews may disclose personally identifiable and other personal information [62, 103, 122]. Additionally, online book reviews may disclose the identities of people other than the reviewers themselves [94], including vulnerable groups of people who have no knowledge of or control over the existence of a review. For instance, in online book reviews of children’s books, ages, gender identities, grades, and first names of children are frequently shared by adult reviewers [106]. Such information, when cross-referenced with reviewer profiles, can put a child’s real-world identity at risk. Responsible researchers are advised to remove any personally identifying information from their datasets.
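As a rough illustration of what a first pass at such removal might look like for review text, the Python sketch below masks three easily patterned identifiers: e-mail addresses, web links, and @-style handles. The patterns and placeholder labels are assumptions made for this example, not an established standard. Pattern matching of this kind cannot detect the children's first names, ages, or grades mentioned above, so it would need to be combined with manual review or other methods.

```python
import re

# Illustrative patterns only; order matters: e-mail addresses are masked
# first so that their "@domain" part is not mistaken for a handle.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"https?://\S+"), "[URL]"),
    (re.compile(r"(?<!\w)@\w+"), "[HANDLE]"),
]


def redact(text: str) -> str:
    """Replace e-mail addresses, links, and @-handles with placeholders."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


# Usage with an invented review snippet.
sample = "Contact jane.doe@example.org or visit https://example.org/me and follow @jane_d for more."
redacted = redact(sample)
```

Placeholders are used instead of deleting the matches so that the sentence structure of a review stays readable for qualitative analysis.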

Fourth, ethical research should respect and protect the book reviewers’ intellectual freedom and freedom of speech, both of which are particularly pertinent to the missions and values of LIS [112]. Book reviews may contain controversial opinions that may not only frustrate or irritate other readers but also unsettle the public at large [104]. Taking library practices in the U.S. as an example, as long as a review does not break any laws or TOS, a reviewer is entitled to “write what they think” and “dispute ideas and words without limitation” [94], even though others may oppose them. Such principles are debated among online book reviewers. For example, a group of book reviewers on Goodreads repeatedly gave one-star reviews to LGBTQIA+ books, sometimes even before the release of advance copies or as part of book campaigns [126]. Many users consider such behavior to be trans- and homophobic actions targeting LGBTQIA+ groups and marginalized authors, and demanded moderation from Goodreads to remove these reviews [126]. However, Goodreads did not remove the ratings as requested because one-star ratings themselves did not directly violate any platform regulations (while personal attacks and hate speech, for example, would violate their guidelines) [41]. In controversial cases, researchers from different disciplines and cultural backgrounds could potentially approach the data in different ways, which may or may not align with the interests and expectations of either the users or the platforms involved. We are not in a position to question anyone’s research priorities or personal stances; we simply remind researchers that every reader is entitled to their intellectual freedom and freedom of speech, and that library professionals adhere to these principles [84, 91]. Responsible researchers should stay alert to any personal biases and feelings toward different groups of reviewers. All users/readers should be equally protected from unexpected and unwanted surveillance, tracking, blaming, and attacks in scholarly research.

3.4 Discussions and Concerns from the Research Community

There have been various case studies, guidelines, and statements for how to conduct compliant, responsible, and ethical research on user-generated data in general and for specific genres [1, 2, 11, 46, 52, 58, 81, 87, 121, 148], as well as more specialized discussions on this topic from LIS perspectives [13, 100, 101]. Here we zoom in on three topics that have been heatedly discussed: (1) explicit informed consent from human research subjects; (2) institutional/administrative review and approval; and (3) platform restrictions.

As for informed consent and institutional/administrative review, while some researchers argue that such conventional research practices should be applied to research on user-generated data from online sources [46, 51, 147], others disagree [74, 87, 147]. The latter group argues that scholarly research of such data may be exempt from informed consent under certain conditions, e.g., when it is almost impossible to obtain “retrospective” informed consent for archival research [87] (Footnote 20); and when research projects involve “no more than minimal risk to the subjects” and “could not practicably be carried out without the waiver or alteration” [74]. Other researchers claim that institutional/administrative review and approval, such as IRBs in the U.S., tend to apply “overly restrictive guidelines developed for biomedical research to lower risk studies”, and sometimes lack “the expertise to effectively evaluate technical proposals” [147]. They also argue that tensions between conventional requirements (such as IRBs) and social computing research could actually “increase risks to participants, delay data collection, or substantively change a research project” [147]. Furthermore, researchers’ attitudes toward platform restrictions also diverge. For example, some researchers insist that the legitimacy and enforceability of TOS are questionable [46, 148], which raises concerns about the legal consequences and ethics of either following or violating the TOS. So far, no consensus has been reached on these three topics with regard to the unobtrusive analysis of user-generated content [147], although opinions are converging on other aspects of ethical social computing, such as ensuring participants’ access to the research outcomes [148].

Nevertheless, there exists consensus on the holism, contextuality, and complexity of the ethical conduct of research [45, 167]. It has been broadly acknowledged that weighing potential harms and intended benefits for all stakeholders (e.g., users, platforms, and society at large) and reconciling different considerations are hard [46, 147]. We have consistently found such dilemmas and trade-offs in existing book review studies. For example, some studies de-identified reviewers by removing their original usernames and partial user profiles (e.g., location, gender identities) [4, 122]. This makes reviewers less likely to be tracked down, although risks of re-identification remain [122, 168]. However, such de-identification deprives the book reviewers of credit for their intellectual contributions and copyrighted work, to which they are entitled as content creators [22]. To overcome this limitation, some researchers choose to seek informed consent from book reviewers they intend to quote in their research publications, particularly as to whether the reviewers want to be quoted verbatim under their scraped usernames [12, 150]. However, getting permission from individual reviewers requires personal contact with human research subjects, which means their data collection is no longer unobtrusive. For U.S.-based studies, unless an IRB review is conducted, this strategy would be considered risky and inappropriate (Footnote 21). Similar trade-offs have emerged from data publication as well. Some researchers chose to selectively publish their scraped data, or not to publish any of their scraped data at all, in order to protect reviewers’ data from inappropriate use [122, 150]. However, this raises questions about research reproducibility and transparency [76, 132].

4 Discussion and Limitations

When planning responsible research projects, different factors and considerations may fail to align, or may even conflict with each other, in actual practice, leaving researchers with a number of dilemmas to solve and difficult decisions to make. For instance, as book review platforms often neither provide APIs nor permit scraping, researchers need to evaluate the risks associated with violating platform policies or even laws. Researchers are furthermore expected to honor readers’ rights and expectations, which are crucial concepts that are not always prioritized by platforms’ policies. There are trade-offs and risks associated with many decisions that have to be made by researchers. While researchers might not always be able to resolve them, they should minimize potential harm and make situation-specific decisions so that the benefits of their research to society outweigh the risks of potential problems. Institutional review and oversight, such as IRBs, share this goal, but they might not apply to work with archival and/or online data. Researchers therefore need not only to understand these risks, but also to have the knowledge and skills to mitigate them. Although our research emphasizes legal risks and ethical problems associated with research on user-generated book reviews, by no means do we intend to discourage research with this genre or type of data. Rather, we hope to critically engage with this research area by contributing LIS perspectives and to facilitate future research by flagging potential pitfalls and suggesting potential solutions.

Our investigation is limited in several ways. First, as we are neither law practitioners nor policymakers, we are not in a position to give legal advice. Besides, given the broad multidisciplinary reach of user-generated data research, discussions about our research questions remain controversial, without a clear consensus or cross-disciplinary norms. Most importantly, scientific research often comes with risks and uncertainties, and decisions should be made based on the specific context of a research problem. As there is no panacea for minimizing research risks or guaranteeing ethical practice, instead of crafting “guidelines for everyone”, we synthesized prior relevant literature, case studies, and library practices to understand (1) what researchers should look out for; and (2) what they should leverage to guide and assess their scholarly usage of user-generated book review data. Second, given the breadth and multidisciplinarity of book review research, our scope of analysis was necessarily narrowed down. For instance, we took a U.S.-centric perspective, and some of our observations might not apply to other regions of the world. Nevertheless, the U.S. context serves to contextualize and exemplify the complexities of the legal and ethical issues in book review studies, and provides a regional research case. As an overview, our research outlines the primary legal and ethical concerns about scholarly usage of user-generated book reviews, which are not limited to research based in the U.S.

Finally, while we put legal and ethical considerations forward as an insufficiently discussed problem in research practice of user-generated book reviews, these considerations are by no means overlooked in research at large. Instead, as our discussion shows, there exist plenty of generally applicable and insightful papers and guidelines to refer to. Thus, this paper calls for more attention to both (1) the paucity of scholarly discussions about legal and ethical concerns in book review research; and, (2) how researchers can leverage existing resources to address this particular problem.

5 Conclusions and Future Work

This paper presents an overview of legal risks and ethical concerns associated with scholarly usage of user-generated book reviews. Our review was primarily motivated by (1) the lack of attention to this problem in prior computational and empirical studies of user-generated book reviews; and, (2) the dual role of the users and readers who are subject to potential harm caused by scholarly use of their data. We reviewed relevant laws, platform policies, user expectations, and prior research to inform future researchers of potential legal and ethical pitfalls, and offer some suggestions for how to avoid them through practical solutions. We also drew on library practices and guidelines to better understand why and how researchers should protect data generated by users/readers. The pitfalls identified and discussed include copyright infringement, violations of TOS/EULAs, conflicts with user rights and expectations, and the role of informed consent and institutional reviews.

The intended contributions of this paper are threefold. First, given the dual role of online book reviewers as (1) content consumers and producers; and, (2) readers, we emphasized the significance of evaluating and reducing risks associated with scholarly usage of user-generated book reviews. Second, we analyzed legal and ethical concerns that have been under-investigated in the context of user-generated book reviews. We hope these insights help to inform future studies on how to reduce potential risks and better protect the users/readers. Third, under the overarching umbrella of responsible data-driven research, we demonstrated how to assess legal and ethical issues associated with the characteristics, stakeholders, and research contexts of book reviews.

For future work, there are more questions to scrutinize. First, there is a variety of data analyses on user-generated book reviews: some studies annotate individual book reviews word by word while others only map high-level patterns in corpora (e.g., average book ratings). Should different ethical expectations be applied to different use cases depending on the research scale, granularity, and “distance from the readers”? For instance, can researchers consider informed consent inapplicable for de-identified and paraphrased quotations or non-consumptive text mining of book reviews? To answer these questions, we need to examine more prior research to understand the needs and costs (e.g., time and administrative procedures) of different actions taken. There are also open questions from the perspective of libraries, such as the argument that libraries are losing competency as a result of their “hands off user data” practice, which sometimes limits their ability to serve their patrons [43, 90]. Are user-generated book review datasets filling the gaps or taking advantage of libraries’ “moral absence”, and if so, where do researchers stand on this question? To explore this question, qualitative studies, such as interviews with researchers working with user-generated book reviews and/or questionnaires among online book reviewers, might be effective methods for gaining a nuanced understanding of different stakeholders’ needs, expectations, and concerns. We also encourage collaborations among researchers from diverse communities and different cultures or regions to cross-examine and broaden our knowledge of this issue.