1 Introduction

Scholars refer to existing materials to support claims in their scholarly work. Citations to books, journal articles, and conference papers can be detected by automated systems and used as a basis for search or bibliometrics (e.g., using citation databases such as Google Scholar, Web of Science, or Scopus). However, there are no comparable databases for citations to the rare, and often unique, unpublished materials held in archival repositories. Our goal in this paper is to begin to change that by automating the detection of scholarly citations to materials in an archive. We call such citations Archival References (AR).

Our work is motivated by the task of discovering archival content. A recent survey of users of 12 U.S. archival aggregators (e.g., ArchiveGrid, or the Online Archive of California) found a broad range of users for such search services [26]. One limitation of archival aggregation, however, is that it presently relies on sharing metadata that is manually constructed by individual repositories. In the long run, we aim to augment that metadata with descriptions mined from the written text that authors use to cite the archival resources on which they have relied. To do that at scale, we must first automate the process of finding citations that contain archival references. That is our focus in this paper.

Prior studies suggest that very substantial numbers of archival references exist to be found [3, 18, 22]. As an example, Bronstad [3] manually coded citations in 136 books on history, finding 895 citations (an average of 6.6 per book) that cite archival repositories. HathiTrust, for example, includes more than 6.5 million open access publications, so we would expect to find millions more archival references there.

As the examples in Table 1 illustrate, archival references differ in important ways from references to published content. Most obviously, conventions used to cite unpublished materials differ from those used to cite published materials [1, 25]. The elements of an archival reference (e.g., repository name, box and folder) are different from the elements for published sources (e.g., journal name, volume and pages). It is also common for archival references to include free-form explanatory text within the same footnote or endnote [7].

Table 1. Examples of citations containing archival references. “Strict” citations contain only archival references; “About” citations also include other accompanying text.

In this paper, we aim to begin the process of assembling large collections of archival references by building systems capable of automatically detecting them at large scale. To do this, we have collected documents, automatically detected footnotes and endnotes, annotated some of those “citations” for use in training and evaluation, and compared several classifiers. Our results indicate that automatically detecting archival references is tractable, with levels of recall and precision that we expect would be useful in practical applications.

2 Related Work

Studies of scholars who make use of archival repositories indicate that references in the scholarly literature are among the most useful ways of initially finding the repositories in which they will look. For example, Tibbo reported in 2003 that 98% of historians followed leads found in published literature [24], and Marsh et al. found in 2023 that 73% of anthropologists did so [16]. There is thus good reason to believe that archival references could be useful as a basis for search. While expert scholars may already know where to look for what they need, search tools that mimic that expert behavior could be useful to novices and itinerant users, who comprise the majority of users in the survey mentioned above [26].

Researchers interested in information behavior and in the use of archives have long looked to citations as a source of evidence, but such studies have almost invariably relied on manual coding of relatively small sets of such references [4, 8, 10,11,12,13, 17, 18, 22, 23]. In recent years, rule-based techniques have been applied to detect archival references [2, 3], but we are not aware of any cases in which trained classifiers that rely on supervised machine learning have yet been built.

3 Methods

Here we describe how we assemble documents, find citations in those documents, and decide which citations contain archival references.

3.1 Crawling Documents and Extracting Citations

Our first challenge is to find documents that might include citations that contain archival references. Since we know that historians cite archival sources, we chose to focus on papers in history. We therefore crawled papers with a discipline label of History and a rights label of Open Access by using the public Semantic Scholar API.Footnote 1 That API requires one or more query terms. To obtain query terms, we collected the abstracts of the 2,000 most highly cited papers from Scopus that were published in 2021 with a discipline label of Arts and Humanities, and sorted the terms in those abstracts by frequency. We then issued those terms one at a time to Semantic Scholar and retrieved the PDF files of the matching papers. After repeating this process for some number of keywords, we merged the resulting sets of PDF files. Most of our experiments were run on the 1,204 unique documents that resulted from using the 5 most frequent keywords (the KW5 document set), but we also conducted some experiments with the roughly 13,000 unique documents that resulted from using the 14 most frequent keywords (KW14).
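To make this crawling step concrete, the sketch below shows one way such a per-keyword query could be issued against the Semantic Scholar Graph API. It is a minimal illustration under our own assumptions (the endpoint, requested fields, and client-side filtering), not necessarily the exact calls used to build KW5 and KW14.

```python
import requests

# Assumed endpoint of the public Semantic Scholar Graph API paper search.
API = "https://api.semanticscholar.org/graph/v1/paper/search"

def crawl_keyword(keyword, limit=100):
    """Query one keyword and keep open-access papers labeled History."""
    params = {
        "query": keyword,
        "fields": "title,fieldsOfStudy,openAccessPdf",
        "limit": limit,
    }
    response = requests.get(API, params=params, timeout=30)
    response.raise_for_status()
    papers = response.json().get("data", [])
    return [
        p for p in papers
        if p.get("openAccessPdf") and "History" in (p.get("fieldsOfStudy") or [])
    ]

# Example: collect PDF URLs for a single (hypothetical) query term.
pdf_urls = {p["openAccessPdf"]["url"] for p in crawl_keyword("culture")}
```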

We then parsed the documents using GROBID [15], an open-source toolkit for text extraction from academic papers. In the KW5 document set, GROBID found at least one footnote or reference (i.e., at least one citation) in 690 documents. In KW14, GROBID found at least one citation in 5,067 documents.
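For reference, a minimal sketch of this extraction step is shown below. It assumes a locally running GROBID service; footnotes appear as note elements and bibliography entries as biblStruct elements in GROBID's TEI output, but the simple traversal here is our illustrative approximation, not GROBID's recommended client.

```python
import requests
from xml.etree import ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI XML namespace used by GROBID

def extract_citations(pdf_path, grobid_url="http://localhost:8070"):
    """Send one PDF to a running GROBID service and return the text of
    footnotes and bibliography entries from the TEI XML it produces."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(
            f"{grobid_url}/api/processFulltextDocument",
            files={"input": f},
            timeout=120,
        )
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    footnotes = ["".join(n.itertext()).strip() for n in root.iter(TEI + "note")]
    references = ["".join(b.itertext()).strip()
                  for b in root.iter(TEI + "biblStruct")]
    return footnotes + references
```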

3.2 Detecting Archival References

For this paper, we built three types of classifiers to detect archival references.

Rule-Based (RB) Classifier. Our RB classifier has a single rule: IF a citation includes any of the strings “Box”, “Folder”, “Series”, “Fond”, “Container”, “Index”, “index”, “Manuscript”, “manuscript”, “Collection”, “collection”, “Library”, “library”, “Archive”, or “archive” THEN it contains an archival reference. Regular expression matching is done without tokenization, lowercasing, or stemming. This is similar to an approach used by Bronstad [3] to search the full text of papers for mentions of repositories. We selected our terms after examining the results from Subset 1 (described below).
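A minimal re-implementation of this single rule might look like the following, where the trigger strings are matched verbatim, with no tokenization, lowercasing, or stemming:

```python
import re

# Trigger strings from the rule described above, matched as-is.
RB_TERMS = ["Box", "Folder", "Series", "Fond", "Container", "Index", "index",
            "Manuscript", "manuscript", "Collection", "collection",
            "Library", "library", "Archive", "archive"]
RB_PATTERN = re.compile("|".join(re.escape(term) for term in RB_TERMS))

def rb_classify(citation: str) -> bool:
    """Return True if the citation is judged to contain an archival reference."""
    return RB_PATTERN.search(citation) is not None
```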

Repository Name (RN) Classifier. Our RN classifier looked for any of the 25,000 U.S. repository names in the RepoData list [9]. However, across all our experiments RN found only one match that had not also matched an RB classifier term. We did use RN to guide sampling, but we omit RN results for space.
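A comparable sketch of the RN lookup, assuming the RepoData names are available as a plain-text file with one repository name per line (the file name below is our assumption), is:

```python
def load_repository_names(path="repodata_us_names.txt"):
    """Load U.S. repository names, one per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def rn_classify(citation: str, names) -> bool:
    """Flag the citation if any known repository name appears verbatim."""
    return any(name in citation for name in names)
```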

Support Vector Machine (SVM) Classifiers. We experimented with three SVM variants, all using radial basis function kernels, which we found to be better than linear kernels in early experiments. In “SVMterm” the features are the frequencies of terms found in citations. Specifically, we tokenized every citation on whitespace or punctuation and removed stopwords. Our tokenizer does not split URLs, so each URL is processed as a single term. For our other SVMs, we tokenized each citation using NLTK, used a lookup table to select the pretrained GloVe embedding for each term [20], and then performed mean pooling to create a single embedding per citation. We experimented with both 50 dimensions (SVM50) and 300 dimensions (SVM300). We report results for SVM300, which were better than those for SVM50 with larger training sets. For each SVM we swept C from 1 to 100 in steps of 5 and used the value (20) that gave the best results. We set the gamma for the radial basis function to the inverse of the number of feature dimensions (e.g., 1/300 for SVM300).
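The sketch below illustrates the GloVe variant (SVM300) with the hyperparameters reported above. The GloVe file name, the lowercasing before lookup, and the training-data arguments are our assumptions for illustration.

```python
# Requires the NLTK "punkt" tokenizer models: nltk.download("punkt")
import numpy as np
from nltk.tokenize import word_tokenize
from sklearn.svm import SVC

DIM = 300  # 50 for SVM50, 300 for SVM300

def load_glove(path="glove.6B.300d.txt"):
    """Load pretrained GloVe vectors from a whitespace-separated text file."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed(citation, glove):
    """Tokenize with NLTK and mean-pool the tokens' GloVe embeddings."""
    tokens = word_tokenize(citation.lower())
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM, dtype=np.float32)

def train_svm(train_citations, train_labels, glove):
    """Train an RBF-kernel SVM with C = 20 and gamma = 1/DIM."""
    X = np.stack([embed(c, glove) for c in train_citations])
    y = np.asarray(train_labels)
    clf = SVC(kernel="rbf", C=20, gamma=1.0 / DIM)
    clf.fit(X, y)
    return clf
```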

3.3 Sampling Citations for Annotation

We drew five samples from KW5 and one from KW14. One approach (“by document”) was to randomly order the documents and then sample citations in their order of occurrence. The other (“by citation”) was to randomly order all citations regardless of their source document and then sample some number of citations from the head of that list. Subsets are numbered in order of their creation. Focusing first on the 59,261 citations in KW5, random selection for Subsets 1 and 6 found 45 archival references among 3,500 sampled citations, a prevalence of 1.3%. This skewed distribution would make it expensive to find enough positive examples for supervised learning, so we turned to system-guided sampling. We merged positive classification results from our RB and RN classifiers to create Subset 2, annotating the first 600 citations (randomized by document). We then trained SVM50 on Subset 2 and used it to guide our draw of Subset 3, manually annotating all 760 citations (randomized by document). To create Subset 4, we first randomly selected and annotated 1,000 of the 59,261 citations (randomized by citation) and then added 259 citations that RB or RN classified as archival references. GROBID found 346,529 citations in the KW14 document set. We randomly sampled 20,000 of those and ran four classifiers on that sample: RB, RN, SVM300, and a BERT classifier (which did not perform well; we omit its description for space reasons). We merged and deduplicated the positive results from those classifiers, resulting in 880 citations. We call that Subset 5.
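As an illustration of the two sampling orders, a minimal sketch (our own, not the code used to draw the subsets) could look like this:

```python
import random

def sample_by_document(citations_by_doc, n, seed=0):
    """Randomly order documents, then take citations in their order of
    occurrence within each document until n citations are collected."""
    rng = random.Random(seed)
    docs = list(citations_by_doc)
    rng.shuffle(docs)
    sample = []
    for doc in docs:
        for citation in citations_by_doc[doc]:
            sample.append(citation)
            if len(sample) == n:
                return sample
    return sample

def sample_by_citation(all_citations, n, seed=0):
    """Randomly order all citations regardless of source document, then
    take the first n from the head of that list."""
    rng = random.Random(seed)
    pool = list(all_citations)
    rng.shuffle(pool)
    return pool[:n]
```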

3.4 Annotation Criteria and Annotation Process

Our annotation goal was to label whether each extracted citation is an archival reference, using two criteria: “Strict” if the citation included one or more archival references and no other text; or “About” if it included one or more archival references together with explanatory text. Table 1 shows examples. The first has two archival references and nothing else, satisfying our Strict criterion. The second has one archival reference and some explanatory text, satisfying our About criterion.

Annotation was done by two annotators. Annotator A1, the first author of this paper (a computer scientist), annotated Subsets 1 through 4 and Subset 6. Before performing any annotation, he examined the citation practice in 207 pages of endnotes from three published books in history [5, 21, 27] and from one journal article in history [19]. A1’s initial annotations of Subsets 1 and 2 were reviewed by the second author of this paper (an iSchool faculty member). A1 then reannotated Subsets 1 and 2 before annotating Subsets 3, 4, and 6. For time reasons, Subsets 3 and 4 were annotated using only the Strict criterion. Annotation requires some degree of interpretation, so additional research was conducted using Google when necessary (e.g., to see whether an unfamiliar word might be a repository name).

Subset 5 was assessed by annotator A2, a Library Science Ph.D. student studying archives. We trained A2 in three phases. First, we demonstrated how to judge whether a citation is an archival reference (by either criterion) using 50 examples from Subset 4. A2 then annotated 50 more citations from KW14 with the same criteria and prevalence. The first three authors then met with A2 to discuss those annotations, after which A2 coded 120 more citations from KW14. We computed Cohen’s Kappa [6] between A1 and A2 on those 120 citations as 0.80 (substantial agreement, according to Landis and Koch [14]). Finally, A2 annotated the 880 citations in Subset 5. All our annotations are on GitHub.Footnote 2

4 Results

As measures of effectiveness we report Precision (P), Recall (R), and \(F_1\). Table 2 shows results with Strict+About training. We used two approaches to choosing training and test data. In one, we used separate training and test sets. Because of distributional differences between the training and test sets, this yields conservative estimates of the Recall and Precision that could be obtained in practice with more careful attention to that factor. To avoid distributional differences, we also experimented with training and testing on the same subset(s), using five-fold cross-validation. Cross-validation yields somewhat optimistic estimates of Recall and Precision, since it eliminates systematic differences in the decisions made by different annotators and removes the differences between the distributional characteristics of the training and test sets. Considering results from the two approaches together thus allows us to characterize the range of Precision, Recall, and \(F_1\) values that we might expect to see in practice.
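For reference, these measures have their standard definitions (our restatement, in terms of true positive (TP), false positive (FP), and false negative (FN) counts):

\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,P\,R}{P + R}.
\]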

Table 2. Results for classifiers trained with both Strict (S) and About (A) annotations as positive examples. P = Precision, R = Recall, best \(F_1\) bold. Train or test, with number of positive S and A annotations (after removal of any training citations from the test set). Top block: detecting all citations containing archival references; subsequent blocks: same classifiers evaluated only on citations with S or A annotations.
Table 3. Results for detecting Strict (S) annotations by classifiers trained on only Strict annotations as positive examples. Notation as in Table 2.

Focusing first on the Eval S+A block in Table 2, we see that detecting archival references is not hard. The RB classifier achieves excellent Recall with no training at all, although its Precision is quite poor. Among SVMs, SVM300 does best in every case by \(F_1\). It seems that distributional differences are adversely affecting Recall and Precision when the training and test sets differ (although 95% confidence intervals are about \(\pm 0.2\) on the low-prevalence Subset 6 test set). From the Eval: S and Eval: A blocks of Table 2, we see that a classifier with both S and A annotations for training is much better at finding S than A.

As Table 3 shows, removing A from training doesn’t help to find more S. Compare, for example, Recall in the second set of experiments in both Tables 2 and 3, both of which were trained and tested on Subset 5. There, training with S+A correctly found more S annotations than did training with only S.

5 Conclusion and Future Work

We have shown that archival references can be detected fairly reliably, with \(F_1\) values between 0.5 and 0.83, depending on how well the training and test sets are matched. We have also developed and shared collections that can be used to train and evaluate such systems. Annotator agreement indicates that our Strict and About criteria for characterizing archival references are well defined and replicable. Most archival references satisfy our Strict criterion, and, unsurprisingly, it is on Strict classification decisions that we do best. Experiments with separate training and test sets point to potential challenges from systematic differences in prevalence that result from sampling differences. This work is thus a starting point: second-generation collections might be built with even better control over prevalence matching between training and test sets, and more robust classification results might be achieved using classifier ensembles. Given our promising results for this archival reference detection task, our next step will be to develop algorithms to segment individual archival references, and then to extract specific elements (e.g., repository name or container).