1 Introduction

It is not uncommon for data-intensive research to use works that may be protected by copyright (papers, news articles, images, etc.), especially in the training phase.Footnote 1 Text and data mining (TDM) may be a powerful tool in the knowledge discovery process,Footnote 2 and an essential step in the process of training Artificial Intelligence (AI) systems. Whether forms of use needed for TDM conflict with copyright rules is still a matter for debate within the specialized literature and when designing new legislation. Legal certainty about the status of TDM-related practices, and, in general, about uses for research purposes, may be decisive for technological, scientific, and economic development. Previous studies have shown that Latin America brings together in one area a substantial number of countries in which either legislation on the matter is restrictive (in the case of uses for research purposes in general) or the subject is unregulated, as is the case for most copyright laws in the region when it comes to TDM.Footnote 3

There is already an extensive body of literature analyzing TDM exceptions. However, even though the borderless nature of research makes the interplay between TDM and copyright a matter of interest to all regions, most of the focus in the existing literature is on countries or examples in the Global North.Footnote 4 In the EU, such issues as the potentially negative impact of existing copyright rules on TDM activities,Footnote 5 harmonization throughout the Union,Footnote 6 realization of the existing TDM exceptions on research and innovation,Footnote 7 and the narrow scope of the TDM exceptionsFootnote 8 are just some of the topics addressed in the copyright literature.Footnote 9 In the United States, given its common law system, the literature analyzes the lawfulness of the acts carried out within the context of TDM research under the fair use doctrine,Footnote 10 often commenting on landmark cases.Footnote 11 The Japanese exception is also often referenced due to its particular wording and permissive scope.Footnote 12

While the debate in Latin America may benefit from literature on international copyright lawFootnote 13 and existing studies addressing the interplay between TDM and copyright on a global level, there are normative, socioeconomic and cultural characteristics unique to either the region or countries within it that must be considered in any analysis and design of a legal framework for research and innovation. This study aims to provide an analysis of the current debate and regulation on the topic in Latin America, the role of Latin American TDM research in the global research community, and examples of TDM practices that are key to research practices and uses.

Part 2 will outline the definition of TDM adopted herein and some of the main forms of use within TDM research, highlighting potential copyright issues in such practices already documented in the dominant literature. Part 3 will provide a broad overview of the international debate on the regulation of research practices and, more specifically, forms of use needed for TDM under copyright law, as well as the potential relationship between the openness of copyright law and research. Part 4 will present the current state of the copyright laws in the Latin American region when it comes to TDM and research exceptions, exemplify some TDM practices that are key to research, and analyze the role of Latin American TDM research in the global research community.Footnote 14 Finally, it will discuss the need for a well-tailored and balanced legal framework in the region.

2 Text and Data Mining and Copyright

Even though the training of AI systems cannot be reduced to TDM, and the two are not synonymous, there is a clear relationship between them, with the latter being an important step for the former. TDM practices often involve the automated analysis of works that could be protected by copyright (papers, news articles, images, etc.). Whether the use that needs to be made of these for TDM may trigger copyright rules is still a matter for debate within the specialized literature. While part of the literature argues that there must be an exception for TDM, other parts disagree and understand that such an exception would reinforce the misconception that TDM involves uses that, by law, fall under copyright holders’ exclusive rights. This section aims at elucidating some key applications within TDM research, and highlighting potential copyright concerns connected therewith, including nuances of the Latin American context concerning the intersection of TDM and copyright.

2.1 What is Text and Data Mining?

When it comes to existing laws for regulating TDM practices, one of the most cited is the European Union’s Directive 2019/790 on copyright and related rights in the Digital Single Market (CDSM). Article 2(2) thereof defines TDM as “any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations”.Footnote 15 An example from Latin America is Brazil’s bill on AI, which describes TDM as “the process of extracting and analyzing large amounts of data or partial or full excerpts of textual content, from which patterns and correlations are extracted that will generate relevant information for the development or use of artificial intelligence systems”.Footnote 16

In the literature, data mining is defined by Han, Kamber and PeiFootnote 17 as “the process of discovering interesting patterns and knowledge from large amounts of data”. IzquierdoFootnote 18 defines it as “AI’s ability to interpret large quantities of raw data, [...] [and] by identifying patterns, to process them”.Footnote 19 Similarly, Ruiz Lobaina and Romero SuárezFootnote 20 propose the following definition: “Data Mining is the process [...] that deals with the non-trivial extraction of useful, hidden patterns that are inherent in data, and also the fastest way to study large volumes of information”.Footnote 21

Although it could be challenging to give details of each and every phase of research employing TDM, given the multiple research fields,Footnote 22 projects and techniques,Footnote 23 the description provided in Carroll is helpful for illustrating some of the main steps in the process:

a multi-step process involving first the compilation of a dataset of text-based and related works into a format amenable to software-based statistical and related forms of pattern analysis. Researchers make multiple copies of the data during the TDM process. They make copies when they: (1) collect and compile the data; (2) format the data for computational processing; (3) process the data in a computer’s active memory; and (4) store or archive the data to enable reanalysis or to enable validation through reproducing the analysis.Footnote 24

As seen in many of the available definitions on TDM, including those listed above, one of the common aspects is that the primary output of the employment of TDM techniques is the extraction of knowledge, patterns and correlations from a large amount of data. The output of TDM research is not supposed to reproduce any of the works used in the mining process in the final result and, practically, “little or none of the text, images or other forms of expression in the data appear in the TDM results”.Footnote 25
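The four copying stages listed by Carroll, and the point that the result consists of patterns rather than the works themselves, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two sample sentences, the function names and the JSON archive format are hypothetical and are not drawn from Carroll or from any study cited here.

```python
import json
import re
from collections import Counter

# Hypothetical stand-ins for source works (assumption for illustration only).
SOURCE_WORKS = {
    "doc_a": "Text mining finds patterns in large text collections.",
    "doc_b": "Data mining finds patterns and correlations in data.",
}

def collect(works):
    """Stage 1: collect and compile the data (a first copy of each work)."""
    return dict(works)

def normalise(corpus):
    """Stage 2: format the data for computational processing (a second copy)."""
    return {key: re.findall(r"[a-z]+", text.lower()) for key, text in corpus.items()}

def analyse(tokens_by_doc):
    """Stage 3: process the data in memory; the result is aggregate counts."""
    counts = Counter()
    for tokens in tokens_by_doc.values():
        counts.update(tokens)
    return counts

def archive(tokens_by_doc):
    """Stage 4: serialise the formatted data so the analysis can be re-run."""
    return json.dumps(tokens_by_doc, sort_keys=True)

corpus = collect(SOURCE_WORKS)
tokens = normalise(corpus)
term_counts = analyse(tokens)
snapshot = archive(tokens)

# The output is a frequency table, not a reproduction of either sentence.
print(term_counts.most_common(3))
```

In this sketch, copies of the source texts exist at stages 1, 2 and 4, whereas the value returned by stage 3 contains only term frequencies. That is the distinction on which the debate discussed in the following section turns.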

Yet, whether the uses made of copyrighted works within TDM require prior authorization from copyright holders remains the subject of ongoing and growing debate within the legal literature, as will be further discussed below.

2.2 Is TDM a Copyright Issue?

As stated by Carroll, “researchers make multiple copies of the data during the TDM process”.Footnote 26 As will be further discussed in this section, there is a debate in the literature about whether these copies and, in general, the use of protected works for TDM may constitute copyright infringement. On the one hand, given that some parts of the process may involve copies or other uses of copyrighted material that may require the prior authorization of the rightholder,Footnote 27 researchers engaging in TDM practices can be liable for copyright infringement.Footnote 28 On the other hand, part of the legal literature considers that some uses made for TDM projects do not trigger copyright rules for multiple reasons.

Building upon debates on this matter in the dominant literature, this section will focus on the discussion about the lawfulness under copyright law of the acts carried out (e.g. reproductions made) during the TDM process. It is important to mention that this is just one facet of a more general discussion concerning the lawfulness of acts carried out within a project employing TDM. This broader discussion is further developed in the literature and may also raise related issues, such as cross-border uses; contract prohibitions on TDM practices; the lawful access requirement; and the legal, social and economic differences between countries in the Global South and Global North.Footnote 29

One of the main theories discussed in the literature about the lawfulness of TDM practices when copyright rules are considered is the one on expressive and non-expressive uses. A non-expressive use, according to Sag, “refers to any act of reproduction that is not intended to enable human enjoyment, appreciation, or comprehension of the copied expression as expression”, while the expressive use of a work would be one connected to “human appreciation of the expressive qualities of that work”, as would be the case were one to “download a film to watch it, or photocopy a magazine article to read it”.Footnote 30 Sag further expands on non-expressive uses and the relationship with TDM by referring to the traditional idea-expression dichotomy:

The idea-expression dichotomy limits the rights of the copyright owner to the expressive elements of the author’s work. In a world of analog works printed on paper or etched in vinyl, this is achieved by simply holding that the copying of facts and ideas alone does not infringe. Preserving the idea-expression dichotomy in the digital world means recognizing that copying a work for purely non-expressive purposes also does not infringe.Footnote 31

Under this proposition, if the necessary acts carried out (e.g. reproductions made) within a research project employing TDM are considered non-expressive, there would not be any copyright infringement.Footnote 32 Therefore, proposing and building TDM exceptions could reinforce the argument that, if there is no exception for this kind of practice, the uses made therein could be considered infringing.Footnote 33 While analyzing Arts. 3 and 4 of the CDSM, Margoni and Kretschmer provide an illustration of the argument:

Nevertheless, the effect of the dispositions contained in Arts. 3 and 4 CDSM is to formalise an interpretation that significantly reduces the ambit of application of the idea/fact/expression doctrines. This is achieved through the affirmation that non-protected mere facts and data when contained in protected works receive some sort of derivative or reflected form of protection since their (non-protected) reuse requires the making of some sort of transient or temporary copy of the (protected) containing work. In other words, the content is not protected in its own right, the container is. But because there is no viable form of using the content without also using the container, the protection of the latter extends to the former.Footnote 34

Given that the major economies in Latin America are parties to key international treaties on copyright, such as the Berne Convention and TRIPS Agreement, as well as human rights treaties, the fundamental issues concerning TDM and the reach of copyright may, to some extent, be similar to those raised in this section. However, potential similarities do not extend much beyond this point, as will be further demonstrated in Part 4 of this article.

2.3 What Changes with Generative AI?

Recently, the popularization of generative AIFootnote 35 systems has highlighted certain copyright-related – and many otherFootnote 36 – issues relating to their training and use. When it comes to copyright, the literature has discussed how far copyright rules extend to AI-generated output, and the use of copyrighted works to train these systems. Considering the commercial purpose behind some of the popular generative AI systems and their outputs’ potential substitutive effect, the use of copyrighted works for training them may require a different treatment by copyright law than uses for research and public-interest-related purposes. Although the use of AI to develop products that are equivalent in ‘artistic’ or literary terms is no longer new, the speed and intensity with which technology has evolved in the last couple of years have brought urgency to the legislative discussion both in Latin America and around the globe.Footnote 37 Projects such as the well-known Next Rembrandt,Footnote 38 Portrait of Edmond Belamy,Footnote 39 or SunspringFootnote 40 are still very popular examples for sparking discussion on the interplay between AI and copyright. However, more recently, the focus appears to have shifted towards generative AI systems like those offered by OpenAI.Footnote 41

Although some of the technical aspects of the use of copyrighted works to train these generative AI systems may not differ substantially from those for the general use of TDM in research,Footnote 42 other aspects may need to be regulated differently.Footnote 43 One of the differences in these recent generative AI systems, when compared with both the use of TDM for research and the projects popularized up to 2020 (e.g. Next Rembrandt and Sunspring), is that, unlike their predecessors, they are typically accessible to the public, and often provided as a service with both free and premium/pro (paid) versions. Another difference lies in their expected output: while the results expected from TDM practices in research, for example, are generally patterns and correlations (elements not protected by copyright),Footnote 44 the output expected from a generative AI system consists, as a rule, of products (e.g. illustrations and texts) that are objectively indistinguishable from human creations,Footnote 45 and may directly compete with works used in their training.Footnote 46 That may affect markets once exclusively populated by humans (e.g. translators, dubbing actors, designers, illustrators). Moreover, the fact that most of these systems are freely available to anyone interested in operating them allows the simultaneous creation of countless potentially competing products at little or no cost. On the other hand, some applications employing generative AI systems may assist human creators in the creative process.

Questions around fairness in the use of copyrighted works for training generative AI systems have been the object of many lawsuits.Footnote 47 However important, a full analysis of generative AI goes beyond the scope of this article. For now, it is fair to say that there are solid arguments for advocating that these activities be regulated differently than the forms of use needed for TDM for research and/or by public interest institutions when carrying out their activities.

3 TDM Exceptions Worldwide

In the past decade, there have been developments both in the field of legislation relating to TDM practices and in the related copyright literature. While these developments may have different dynamics and results according to local practices and their jurisdiction, they mostly share at least one common concern: how should copyright law confront and regulate the forms of use needed in the context of TDM? This section will provide an overview of some of the current research on TDM exceptions worldwide.

3.1 Research (and TDM) Exceptions Worldwide

TDM is a powerful tool for processing and analyzing large amounts of data within the scope of research activities and is key for contemporary computational research. When applied within the scope of research uses, TDM may be enabled, or even incentivized, by a permissive and general provision addressing such research uses in copyright law, or, alternatively, may be constricted or even made legally impossible. This section of our study will address some recent studies that have mapped and analyzed the text of copyright laws around the world, focusing on the available research exceptions and others from which TDM could benefit.

Flynn, Schirru, Palmedo and Izquierdo categorize “the world’s copyright laws according to the degree to which they provide exceptions to copyright exclusivity for research uses”.Footnote 48 One of the typologies adopted in the article considered six different categories, classifying research exceptions from the most open (green) to the most restrictive (red). For the purposes of this study, it is enough to note that, as seen from the map below (Fig. 1), most of the countries colored red are concentrated in Latin America. These countries, as described in the study, “specifically limit research exceptions to uses of excerpts of works, which is the minimum standard for limitations and exceptions required by Article 10(1) of the Berne Convention”. This could be inadequate for TDM purposes.Footnote 49

Fig. 1 Flynn et al. (2022a) “Figure 2. Research Exceptions in Comparative Copyright: Six Color”

Palmedo et al. (2023) expanded on the analysis in Flynn, Schirru, Palmedo, IzquierdoFootnote 50 by analyzing “a new dataset of copyright exceptions for researchers in 165 countries over 21 years”, with the goal of tracing changes in said laws.Footnote 51 The color scheme adopted in the previous study was adapted into a coded scoring system (switching from colors to numbers), and one of the findings concerns the fact that “[w]ealthier countries, on average, are more likely to have copyright exceptions allowing greater unauthorized uses for research purposes than other countries”.Footnote 52 When it comes to variations over time, Table 2 of the said study shows Latin American countries mostly present in the list of “Countries with Decreasing Scores” (a movement towards more restrictive legislation), while the list “Countries with Increasing Scores” contains more countries located in the Global North (Fig. 2):

Fig. 2 Palmedo et al. (2023) “Table 2: Nations with change in their scores”

Using the dataset in Palmedo et al.,Footnote 53 and carrying out a similar analysis focused only on the Latin American countries’ scores, it can be seen that, apart from Ecuador, the trend in Latin America has been for legislation to become more restrictive over the selected time period. Antigua and Barbuda (5 to 1), Brazil (4 to 0), Dominica (5 to 1), and Grenada (5 to 1) each decreased by 4 points, and Panama decreased by 3 points (3 to 0). On the other hand, Ecuador had an increase of 2 points (Fig. 3).
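The score changes reported above follow from simple subtraction, as the short Python sketch below illustrates. It uses only the start and end scores stated in the paragraph above. Ecuador is omitted because the text gives its change but not its start and end scores. The variable names are assumptions made for illustration.

```python
# Start and end scores as stated in the text above (dataset in Palmedo et al.).
# Ecuador is left out: only its change (+2) is reported, not its two scores.
scores = {
    "Antigua and Barbuda": (5, 1),
    "Brazil": (4, 0),
    "Dominica": (5, 1),
    "Grenada": (5, 1),
    "Panama": (3, 0),
}

# Change = end score minus start score; a negative value means the law
# moved towards the more restrictive end of the scale.
changes = {country: end - start for country, (start, end) in scores.items()}

for country, delta in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{country}: {delta:+d}")
```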

Fig. 3 Created by the authors with information available in Palmedo et al. (2023)

When considering the available research exceptions, it becomes evident that not only does the present legal framework fall short of being optimal for research purposes, but it also reflects a broader trend towards increased restrictiveness in the region. This trend contradicts the movement observed in developed countries and, as will be further developed in Section III.C, may negatively impact research and innovation in the region.

When it comes to specific TDM exceptions, and building on the color-coding scheme previously mentioned, Flynn, Schirru, Palmedo and IzquierdoFootnote 54 also provide an overview of existing provisions. As shown in the figure below, most of the existing exceptions were concentrated in Global North countries. In general, existing TDM exceptions enable TDM practices for research purposes, even where there are restrictions on the used works or permitted forms of use. Additional requirements seen in some jurisdictions, for example the need for lawful access and protection against contractual overridability, are also mapped.Footnote 55 As analyzed by the authors, the only TDM exception found in Latin America was in Ecuador, which was flagged red owing to potential restrictions when it comes to the forms of use needed for conducting TDM research (Fig. 4).Footnote 56

Fig. 4 Flynn et al. (2022a) “Figure 3. TDM Exceptions in Comparative Copyright”

Latin American countries also make up a significant proportion of those countries in which copyright law restricts research uses and is mostly silent on TDM practices. Next, we will analyze the potential negative consequences of this legal landscape on research in the region.

3.2 The Relationship between Balanced Copyright Regimes and TDM Research

Although the link between stronger and more restrictive intellectual property rights and higher innovation remains unclear,Footnote 57 recent empirical evidence in the literature shows a negative association between restrictive copyright rules and innovation,Footnote 58 as well as a positive relationship between more open and permissive copyright systems and research.Footnote 59

By building a “User Rights Database”Footnote 60 and applying econometric tests to the data available therein, Flynn and Palmedo analyzed 21 countries’ copyright laws between 1970 and 2016.Footnote 61 The study found that “researchers in countries with more open user rights environments produce more scholarly output and more high-quality output”.Footnote 62 When it comes to the impact of more open copyright laws on copyright-intensive industries, the findings were either neutralFootnote 63 or positive. The study found that “more open user rights environments are associated with higher firm revenues” for information industries, and that “more open user rights environments are not associated with harm to industries […] such as publishing and entertainment”.Footnote 64

When comparing developed and developing countries, the study illustrates the gap between the level of openness of the legislation in both categories, concluding that: “developing countries in our sample are now at the level of openness that existed in the wealthy countries about thirty years ago” (Fig. 5).Footnote 65

Fig. 5 Flynn and Palmedo (2019a) p. 16 “Growth of Openness”

Handke, Guibault and VallbéFootnote 66 found “strong evidence for stricter copyright hindering the wide adoption of novel ways to build on copyright works and generate derivative works”. The study analyzed “bibliometric data to establish how various copyright policies affect the application of DM [data mining] in academic research”,Footnote 67 covering data available in the Web of Science between 1992 and 2014. Amongst the findings of the study are the following:

countries in which academic researchers must acquire the express consent of rights holders [which includes all researched Latin American countries] to conduct lawful DM exhibit a lower share of DM research output in their total research output […] This implies that an application of copyright exceptions or limitations that establish the right to mine for academic researchers – if they have lawful access to input works and irrespective of explicit rights holder consent – boosts DM research.Footnote 68

Recently, some of the authors involved in the previous study presented a related study involving a larger research team and additional data. The research worked with a dataset of 1.5 million TDM-related articles,Footnote 69 and shared similar objectives to the research from 2021, i.e. to understand how much TDM research was published per country and per year,Footnote 70 and – as there may be additional factors influencing the outcomes – to analyze the potential correlation between the amount of TDM research being produced and the degree of “openness” of the copyright law of these countries (Fig. 6).

Fig. 6 Vallbé (2023) slide “Leading DM Research Countries”

The conclusions of the two studies were aligned: both found that countries with more restrictive laws tended to be less productive when it came to TDM-related research than countries with more open copyright laws.

By comparing the table above, which gives updated numbers of TDM research articles, with the results presented in Flynn and Palmedo, Handke, Guibault and Vallbé, Flynn, Schirru, Palmedo, and Izquierdo, and Palmedo et al.,Footnote 71 it can be seen that countries commonly referred to as having more “open” and “permissive” copyright law exceptions (e.g. USA, Germany, Japan, Australia, and Canada) are usually also amongst the countries with more published research on TDM.

4 Text and Data Mining in Latin America

Legal clarity, indeed certainty, on research uses and, more specifically, TDM-related practices under copyright law may be decisive for technological, scientific, and economic development. The current legal framework in Latin America, in general, is not well equipped to deal with TDM practices. This may significantly affect research and the development of the region’s AI industry, leaving countries and researchers with fewer options, which are also far from optimal, such as the adoption of pre-trained models and datasets prone to bias and local inadequacy. This section of our analysis focuses on the current legal status of TDM and research exceptions in Latin America and gives examples of TDM practices that are key for research. Finally, this section will address some issues arising from qualitative interviews with practitioners from different fields that may contribute to the design of a well-tailored legal framework in the region.

4.1 TDM and Research Exceptions in Latin America

Previous studies on research exceptions in copyright law illustrate the alarming scenario for researchers working with TDM in Latin America, as most of the laws in the region are restrictive and insufficient for fostering innovation,Footnote 72 and/or do not provide a general exception for research/TDM.Footnote 73 Moreover, apart from recent and localized developments towards a more extensive approach (based on fundamental rights),Footnote 74 limitations and exceptions (L&Es) are largely interpreted restrictively in Latin American countries.Footnote 75 This creates substantial legal obstacles for different common practices, especially for libraries, archives, museums, and research and education institutions, and their agents.Footnote 76

Regarding TDM specifically, the situation is no different. By analyzing the copyright laws of the five largest economies of South America at the time (Argentina, Brazil, Chile, Colombia and Peru), Bertón (2021) concluded that the copyright systems in all five countries were “not prepared for digital research techniques such as text and data mining” and imposed “limits that do not cover the needs of TDM researchers and put the region at a competitive disadvantage for keeping up with the latest developments in AI.”Footnote 77

A recent studyFootnote 78 presented in interactive mapsFootnote 79 divides the copyright exceptions of 19 countriesFootnote 80 into three different categories: (i) educational purposes; (ii) libraries and archives; and (iii) research and new technologies. When compared with the results obtained by Flynn, Schirru, Palmedo and Izquierdo (2022), it is clear that, by 2022, the only TDM provision in Latin America was the one enacted by Ecuador, which was further analyzed in the latter study.

However, this does not mean that TDM policies have not been addressed in Latin America. Recently, there have been multiple initiatives in the region concerning the creation of a TDM provision or amendments in the copyright law that may positively impact research practices. In July 2020, Senator Antares Guadalupe proposed including a TDM limitation in Mexican copyright law. The provision would allow reproductions and extractions for TDM purposes, conditional upon lawful access.Footnote 81 A Uruguayan proposal first dated 2020 was to change the copyright law to allow works to be reproduced for computational analysis within the scope of non-commercial research.Footnote 82 In Brazil, a TDM limitation is currently being discussed in the Senate as part of a bill on AI.Footnote 83 Article 42 of that bill (Bill 2338/2023) authorizes “automated use of works, such as extraction, reproduction, storage, and transformation, in data and text mining processes in artificial intelligence systems”, limited to “activities carried out by research and journalism organizations and institutions, and by museums, archives, and libraries”.Footnote 84

Despite important initiatives at national level to foster the debate on TDM and copyright, its importance to research, and the need for such exceptions in copyright laws,Footnote 85 the actual revision of the relevant legal text still has to be addressed. Legal harmonization, and not only at regional level, may also be of crucial importance for narrowing the gap that exists between developing and developed countries.Footnote 86 The current legal landscape and the fact that developing countries are currently not updating their legislation to properly regulate data-driven research and the use of TDM-related tools create unjustifiable legal obstacles for researchers and research institutions,Footnote 87 both at national and regional level and in cross-border collaboration.Footnote 88

4.2 TDM Practices and Research in Latin America

Data-driven research has many applications in a wide variety of areas, and this is no different in Latin America. Examples of data-intensive research can be found in health,Footnote 89 including, very specifically, regarding neglected diseases common in the region;Footnote 90 in information management by librarians;Footnote 91 in comparing the quality of external communication by universities;Footnote 92 and in avoiding the spread of misinformation on the web.Footnote 93 This section will provide further details on examples in the area of health, focusing on the impact of the recent COVID-19 pandemic and the struggle with neglected diseases. In addition, we will report on some of the main results of an empirical study carried out in Latin America in 2021.

4.2.1 The Role of TDM in Health Research

Brazil has been hit hard by the COVID-19 pandemic in terms of number of cases and deaths. By 4 February 2024, Brazil had the sixth highest number of confirmed cases (37.5 million) and the second highest number of confirmed deaths (702,100).Footnote 94 These numbers may be even higher owing to possible underreporting.

Recent years have seen research goals all over the world redirected in order to fight the SARS-CoV-2 virus.Footnote 95 By 4 February 2024, there were 28,592 articles addressing matters related to COVID-19 in the medRxiv and bioRxiv repositories. Between 1 March 2020 and 1 March 2021 alone, a search for the term “coronavirus” in the bioRxiv and medRxiv repositories brought up 13,038 results.Footnote 96 The articles address issues related to the treatment, diagnostics and other medical procedures for the new coronavirus.Footnote 97 These numbers show that in these two repositories alone, a significant amount of data was already available to deal with the SARS-CoV-2 virus. Considering also the content made available in other repositories, it would be humanly impossible for health professionals and scientists to read and extract all the information available on the internet.

Here, automatic and computational text and data analysis for uncovering patterns and correlations makes a difference in the speed and scope of findings.Footnote 98 However, while medRxiv expressly allows text and data mining in its database,Footnote 99 and there were several initiatives towards openness and access for research purposes during the COVID-19 pandemic,Footnote 100 this study provides evidence that there are not enough legal possibilities in the copyright laws of Latin American countries for carrying out the acts of use needed for research purposes. That is likely to affect the response to future challenges. Taking the COVID-19 pandemic as an example, there was no literature before 2020 on how to deal with the SARS-CoV-2 virus specifically or its effects on the human body. The potential emergence of new diseases could pose similar challenges, and any restrictions on accessing and mining data and information from these studies may hinder the development of critical knowledge essential for crafting an effective response to disease and ultimately saving lives.

On the other hand, owing to access to genetic dataFootnote 101 free of charge,Footnote 102 scientists involved in a study in Brazil could “in just 24 hours […] conduct a sequencing of the samples collected and discern the regions of origins of the virus”, allowing them to reach an important finding in the midst of the health emergency caused by COVID-19: “successfully performing these sequencing methods constituted a crucial step towards understanding the main characteristics of the pathogen and how much it mutated”.Footnote 103 TDM research in the region is also important for neglected tropical diseases,Footnote 104 which affect millions of inhabitants of certain areas.Footnote 105 By way of illustration, TDM was used in a study addressing schistosomiasis, with the main objective of “creating a knowledge base about schistosomiasis and classifying information using text mining techniques”.Footnote 106 The study also relied on a database whose data is available under the open access regime.Footnote 107

4.3 Evidence on the Interplay between Copyright and (TDM) Research

In a parallel study,Footnote 108 also conducted by the authors of this article,Footnote 109 we delved into some of the practices and perceptions surrounding copyright exceptions for research activities, particularly concerning TDM uses. We aimed to understand how stakeholders who need to perform such activities perceive the lack of express limitations and exceptions for research, particularly for TDM. We gathered anecdotal evidence rather than comprehensive statistical data. Between August and December 2021 we conducted 53 interviews with stakeholders from research entities,Footnote 110 government bodies,Footnote 111 and third-sector organizations,Footnote 112 across six Latin American countries.Footnote 113 The main findings reveal a significant disparity in knowledge about copyright laws among the three sectors, ranging from a lack of awareness of the connection between copyright and research to explicit demands for copyright reform.

It is notable that there is limited awareness of existing copyright protection for databases, particularly among researchers regarding access to scientific articles.Footnote 114 None of the interviewees made a distinction between copyright protection of the contents (e.g. scientific articles, videos, photos) and of the database itself. Some of the researchers and other stakeholders interviewed showed little comprehension of how current copyright law works as a whole.Footnote 115

On the other hand, stakeholders who were more informed about copyright issues advocated for necessary reforms. Argentine researchers in particular explicitly called for national and international changes in copyright law to address perceived obstacles to research activities.Footnote 116 Concerning the right to research in particular, 9 out of 53 interviewees considered that there was a need for clear limitations and exceptions allowing for the use of copyrighted materials for research purposesFootnote 117 or that the law should clearly set out what was and what was not allowed for research activities, as a way of protecting researchers and institutions.Footnote 118 Across many countries, private-sector stakeholders exhibited a deeper understanding, possibly owing to the heightened risks associated with their activities.

Lateral findings highlight stakeholders’ support for public access to databases. Some defended the need for publicly funded data to remain open despite challenges in academic publishing dynamics. Interviews also uncovered explicit references to “piracy” and alternative practices like accessing shadow libraries – such as Sci-Hub and LibGen – in order to be able to conduct their research and develop the knowledge that they needed. While researchers acknowledged that these platforms were probably illegal, they still viewed them positively.

Even if sometimes expressed indirectly, this concern reflects the limited access that researchers in this region have to information they need for their work. Their preference for open databases, along with their criticism of knowledge privatization models, and concerns about contractual challenges in licensing private databases (particularly within universities and libraries), highlights significant barriers to democratizing research and knowledge. Researchers commonly refer to shadow libraries as crucial sources, emphasizing how the enforcement of copyright rules is perceived in the region. This, in practical terms, is viewed as a “balance” to the licensing issues faced by universities and libraries.

Another relevant finding was that several stakeholders in all three sectors spontaneously expressed concern about personal data protection. Given that some countries in the Latin American region have recently approved data protection laws, this seems to indicate that there has been some success in creating awareness in the field.

To sum up, the research underscores the need for awareness raising and community action among researchers and institutions, as well as legislative reform to enhance legal possibilities and security for research activities across Latin America. It has also highlighted the fact that, when interviewees were more informed about the impact of copyright on research, they were usually more in favor of legislative reform. The findings also highlight the nuanced views on openness and accessibility, the prevalence of alternative practices based on necessity, and the intertwining of copyright with broader regulatory issues like AI and data protection.

5 Conclusion: The (Urgent) Need for a Proper Legal Framework for Research

Previous studies illustrated the alarming scenario in Latin America regarding the legal landscape applicable to research using copyrighted works. The region brings together in one area a substantial number of countries in which legislation on the matter is either restrictive (in the case of uses for research purposes in general) or non-existent in terms of a specific TDM exception. This is not exclusively a new development: as shown above,Footnote 119 there has been a trend towards increasingly restrictive regimes in Latin America over recent decades.

As empirical evidence suggests, there is a negative relationship between restrictive copyright regimes, on the one hand, and research (including research on TDM) and innovation, on the other. While the restrictiveness of copyright laws is just one of several factors affecting innovation in the Global South, ensuring the legal possibility of and certainty for uses for research purposes, and maintaining a well-balanced copyright regime may be important steps in fostering technological, scientific, and economic development.

Most Latin American countries that would benefit greatly from a TDM exception or from a legal framework conducive to innovation have not yet achieved either. There may be several reasons for that, including the fact that these countries may find their position in international trade affected if they do not comply with, or agree to, standards often proposed or advocated by countries in the Global North.Footnote 120 In addition, and given the cross-border nature of many contemporary research practices, the inadequacy of the current legal framework also stems from the absence of international legal standards on limitations and exceptions, both within the region and globally.

The legal uncertainty around TDM research may make it more costly, both economically and otherwise, for researchers and organizations (e.g. owing to the need to negotiate with and pay for licenses from each owner for fear of legal action). This may ultimately compel them to abandon their research, or conduct it abroad,Footnote 121 or may affect collaboration with regional and global research partners. In addition, by not clearly allowing, or by substantially restricting, the training of and research on AI systems in Latin America, countries in this region may have to rely on pre-trained models, whose opacity raises concerns – and risks – of bias and other potentially harmful consequences, owing to the lack of control over the materials used in the training dataset.Footnote 122 On a cultural and social level, countries will lose an important opportunity to train models with materials that reflect their own characteristics, languages, demands and desires in different fields, including but not limited to health (e.g. neglected diseases) and culture (e.g. linguistic peculiarities).

Copyright today can be seen as much more than the set of rules for protecting and using original expressions of the human spirit in the arts, literature and science. By restricting the scope of what can be used to train AI systems and under which circumstances, copyright rules go beyond protecting expressions and may even hinder the development of a country’s AI industry and, more broadly, its economic and technological development.Footnote 123 Meanwhile, countries like the U.S. have a general “Fair Use” clause and have already discussed TDM-related issues in their case law,Footnote 124 while jurisdictions like the European Union, Japan, and Singapore already have a specific and express TDM exception.Footnote 125 By contrast, the existing legal framework in Latin America is not suited to regulating and allowing for TDM practices for research purposes.Footnote 126

Therefore, in order to provide the necessary incentive and legal assurance for researchers and research institutions, it is crucial that copyright laws are balanced and updated to take into consideration and regulate data-intensive and computational research and other public-interest activities in a way that actually promotes development and innovation. We argue that TDM exceptions, adequately designed, have the potential to play a significant role in this.

On a more abstract level, new TDM provisions should consider the economic, cultural and technological context and aim to promote national systems of innovation, but also be tailored so as not to isolate any country or region, given the borderless nature of contemporary research. In order for legal reform to happen, more of the stakeholders involved must become aware of the dynamics of the impact of copyright on research. The legal text itself must be sufficiently clear about permitted uses, users, purposes and the relationship with existing related provisions in copyright law (e.g. by clarifying that technological protection measures cannot be imposed to restrict or impede the enjoyment of the authorized uses), as well as with wider legislation outside the copyright system (e.g. laws regulating the use of personal and non-personal data). It is important that provisions allow for all uses necessary in the context of research activities or activities carried out by certain categories of users (e.g. public-interest-oriented institutions),Footnote 127 and not be overridden by private agreements. These recommendations represent no more than a few suggestions that could potentially help craft the research and TDM provisions required in Latin America. In any event, they must take into account the rich and diverse cultures and legal systems coexisting in that region.