1 Introduction

Environmental sciences have become both complex and data intensive, requiring access to heterogeneous data collected at multiple places, times, and thematic scales. For example, research on bird migration may involve exploring and analyzing observational data such as animal movements and temperature shifts across the world over time. While the need to access such heterogeneous data is apparent, the rapid expansion of observational data, in both quantity and heterogeneity, makes it challenging for data seekers to obtain the right information for their research. Such problems call for tools that automatically manage, discover, and link big data from diverse sources, and present the data in forms that are easily accessible and comprehensible.

1.1 ONEMercury search service

Recently, DataONE, a federated data network built to facilitate access to and preserve environmental and ecological science data across the world, has become increasingly popular [18, 26, 27]. DataONE harvests metadata from different environmental data providers and makes it searchable via the search interface ONEMercury, built on Mercury, a distributed metadata management system. Figure 1 shows sample screen shots of the ONEMercury search interface (left) and the search result page for the query ‘soil’. ONEMercury offers full-text search over the metadata records. The user can also specify, using an interactive map, the geographic boundary within which the desired data were collected or published. On the result page, the user can further filter the results by Member Node, Author, Project, and Keywords. The set of keywords used in the system is static (users cannot arbitrarily add new keywords or remove existing ones) and is managed by the administrator to prevent spurious new keywords from being created. These keywords are used for manually annotating metadata during the data curation process.

Fig. 1 Screen shots of the ONEMercury search interface and result page using the query ‘soil’

1.2 Challenge and proposed solution

Linking data from heterogeneous sources always has a cost. One of the biggest problems that ONEMercury faces is the varying level of annotation in the harvested metadata records, caused by differing metadata curation standards. For example, one data center may have specialized personnel whose sole duty is to provide rich descriptions and useful keywords for each metadata record, while another collects data directly from scientists who are busy with their experiments and do not have time to curate their data. Poorly annotated metadata records tend to be missed during the search process because they lack meaningful keywords. Furthermore, such records are not compatible with the advanced mode offered by ONEMercury, which requires metadata records to be annotated with predefined keywords from the keyword library. The explosion in the number of metadata records harvested from an increasing number of data repositories makes manual annotation infeasible, necessitating a tool capable of automatically annotating poorly annotated metadata records.

In this paper, we address the problem of automatic annotation of metadata records. Our goal is to build a fast and robust system that annotates a given metadata record with related keywords from a given keyword library. The idea is to annotate a given record with keywords associated with the well-annotated records to which it is semantically relevant. We approach this by first transforming the problem into a tag recommendation problem with a controlled tag library, where the set of recommended tags is used to annotate the given document, and then proposing a set of algorithms to solve it.

1.3 Problem definition

We define a document as a tuple of textual content and a set of tags, that is, \(\mathtt{d} = \langle \mathtt{c, e} \rangle \), where c is the textual content of the document d, represented by a sequence of terms, and e is a set of tags associated with the document. Given a tag library \(T\), a set of annotated documents \(D\), and a non-annotated query document \(q\), our task is to recommend a ranked list of \(K\) tags taken from \(T\) for the query \(q\). A document is said to be annotated if it has at least one tag; otherwise, it is non-annotated. The formal description of each variable is given below:

$$\begin{aligned} T&= \left\{ t_{1},t_{2},\ldots , t_{M} \right\} ;\quad t_{i} \hbox { is a tag}\\ D&= \left\{ d_{1},d_{2},\ldots , d_{N} \right\} ;\quad d_i = \left\langle c_{d_i},e_{d_i} \right\rangle ,\ e_{d_i} \subseteq T, \hbox { and } e_{d_i} \ne \oslash \\ q&= \left\langle c_q, \oslash \right\rangle \end{aligned}$$

1.4 Contributions

This paper has five key contributions as follows:

1. We address a real-world problem of metadata annotation faced by ONEMercury. We transform it into a tag recommendation problem and generalize it so that the proposed solution can also be applied to other domains.

2. We propose a novel technique for tag recommendation. Given a document query q, we first compute a probability distribution over the tags in the library; the top-ranked tags are then recommended. We propose two variants of our algorithm: term frequency-inverse document frequency (TF-IDF) based and topic model (TM) based.

3. We crawl environmental science metadata records from four different archives for our datasets: the Oak Ridge National Laboratory Distributed Active Archive Center (DAAC), the Dryad Digital Repository, the Knowledge Network for Biocomplexity (KNB), and TreeBASE, a repository of phylogenetic information. We select roughly 1000 records from each archive for the experiments.

4. We validate the proposed methodology using rigorous empirical evaluation. We use document-wise tenfold cross-validation to evaluate our methods with five evaluation metrics: precision, recall, F1, MRR (mean reciprocal rank), and Bpref (binary preference). These metrics are typically used together to evaluate recommendation systems.

5. We further discuss relevant issues, namely (i) the limitations and scalability of our proposed methods, and (ii) the use of topic coherence to fine-tune the optimal parameters.

2 Preliminaries

Our proposed solution is built upon the concepts of cosine similarity, term frequency-inverse document frequency (TF-IDF), and latent Dirichlet allocation (LDA). We briefly introduce them here before going further.

2.1 Cosine similarity

Cosine similarity is a measure of similarity between two vectors obtained by measuring the cosine of the angle between them. Given two vectors \(A\) and \(B\), the cosine similarity is defined using a dot product and magnitude as:

$$\begin{aligned} \hbox {CosineSim}(A,B) = \frac{A\cdot B}{\left\| A \right\| \left\| B \right\| } = \frac{\sum _{i = 1}^{N}A_i\times B_i}{\sqrt{\sum _{i=1}^{N}(A_i)^2}\times \sqrt{\sum _{i=1}^{N}(B_i)^2}} \end{aligned}$$
(1)

In the information retrieval literature [16], cosine similarity is heavily used to calculate the similarity between two vectorized documents. Under the assumption that each element in a document vector is a real non-negative number (such as a term frequency or a TF-IDF score), CosineSim(A,B) returns a value in [0,1], with higher values indicating greater similarity.
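For concreteness, Eq. (1) amounts to the short function below (an illustrative NumPy sketch, not code from the ONEMercury system; the zero-magnitude guard is our own choice for empty documents):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two non-negative vectors (Eq. 1).

    Returns 0.0 when either vector has zero magnitude, so empty
    documents do not cause a division by zero.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Example with two small term-frequency vectors
print(cosine_sim([1, 2, 0, 3], [0, 2, 1, 3]))  # ~0.93
```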

2.2 Term frequency-inverse document frequency

TF-IDF, used extensively in the information retrieval field [16, 29], quantifies how important a term is to a document in a corpus. TF-IDF has two components: the term frequency (TF) and the inverse document frequency (IDF). The TF is the frequency of a term appearing in a document. The IDF of a term quantifies how discriminative the term is across the corpus and is inversely related to the document frequency (the number of documents in which the term appears). Formally, given a term \(t\), a document \(d\), and a corpus (document collection) \(D\):

$$\begin{aligned} tf(t, d) = \sqrt{\hbox {count}(t,d)} \end{aligned}$$
(2)
$$\begin{aligned} idf(t, D) = \sqrt{\log \left( \frac{|D|}{|\{d\in D : t \in d\}|} \right) } \end{aligned}$$
(3)
$$\begin{aligned} \hbox {TFIDF}_\mathrm{Term}(t, d, D) = tf(t,d)\cdot idf(t,D) \end{aligned}$$
(4)

We can then construct a TF-IDF vector for a document d given a corpus D as follows:

$$\begin{aligned}&\hbox {TFIDF}_\mathrm{Doc}(d, D) \nonumber \\&\quad = \langle \hbox {TFIDF}_\mathrm{Term}(t_1, d, D), \ldots , \hbox {TFIDF}_\mathrm{Term}(t_n, d, D)\rangle \end{aligned}$$
(5)

Consequently, if one wishes to compute the similarity score between two documents \(d_1\) and \(d_2\), the cosine similarity can be computed between the TF-IDF vectors representing the two documents:

$$\begin{aligned}&\hbox {DocSim}_{\mathrm{TF}-\mathrm{IDF}}(d_{1}, d_{2}, D) \nonumber \\&\quad = \hbox {CosineSim}\left( \hbox {TFIDF}_\mathrm{Doc}(d_1,D), \hbox {TFIDF}_\mathrm{Doc}(d_2,D)\right) \end{aligned}$$
(6)
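The pipeline of Eqs. 5 and 6 can be sketched with scikit-learn as follows. This is only an illustration: our experiments use LingPipe for indexing, and scikit-learn's default weighting differs from the square-rooted TF and IDF of Eqs. 2 and 3; the toy corpus and query are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the source documents D
corpus = [
    "soil moisture measurements across boreal forest sites",
    "land cover classification from satellite imagery",
    "soil carbon flux in temperate forest ecosystems",
]
query = "forest soil carbon observations"

vectorizer = TfidfVectorizer()                   # default TF-IDF weighting, not Eqs. 2-3
doc_vectors = vectorizer.fit_transform(corpus)   # TFIDF_Doc(d, D) for each d in D
query_vector = vectorizer.transform([query])     # TFIDF_Doc(q, D)

# DocSim_TF-IDF(q, d, D) for every source document d (Eq. 6)
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(scores)
```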

2.3 Latent Dirichlet allocation

In text mining, latent Dirichlet allocation (LDA) [3] is a generative model that allows a document to be represented by a mixture of topics. Past literature [12, 28, 30–33] demonstrates successful usage of LDA to model topics from given corpora. The basic intuition of LDA is that an author has a set of topics in mind when writing a document. A topic is defined as a distribution of terms. The author then chooses a set of terms from the topics to compose the document. The whole document can then be represented as a mixture of different topics. LDA serves as a means to trace back the latent topics in the author’s mind before the document is written. Mathematically, the LDA model is described as follows:

$$\begin{aligned} P\left( t_i|d \right) = \sum _{j=1}^{|Z|} P\left( t_i|z_i=j \right) \cdot P(z_i=j|d) \end{aligned}$$
(7)

\(P(t_i|d)\) is the probability of term \(t_i\) being in document \(d\). \(z_i\) is the latent (hidden) topic. \(|Z|\) is the number of all topics. This number needs to be predefined. \(P(t_i|z_i = j)\) is the probability of term \(t_i\) being in topic \(j\). \(P(z_i=j|d)\) is the probability of picking a term from topic \(j\) in the document \(d\).

Essentially, the LDA model is used to find \(P(z|d)\), the topic distribution of document \(d\), with each topic described by a term distribution \(P(t|z)\). After the topics are modeled, we can assign a distribution of topics to a given document using statistical inference [2]. A document can then be represented by a vector of numbers, each of which is the probability of the document belonging to a topic.

$$\begin{aligned} \hbox {Infer}(d, Z) = \left\langle z_1, z_2, \ldots , z_Q \right\rangle ; \quad |Z| = Q \end{aligned}$$
(8)

where \(Z\) is the set of topics, \(d\) is a document, and \(z_i\) is the probability of the document \(d\) falling into topic \(i\). Since a document can be represented by a vector of real non-negative numbers, the topic similarity between two documents \(d_1\) and \(d_2\) can be computed using cosine similarity as follows:

$$\begin{aligned}&\hbox {DocSim}_\mathrm{TM}(d_{1}, d_{2}, Z) \nonumber \\&\quad = \hbox {CosineSim}\left( \hbox {Infer}(d_1,Z), \hbox {Infer}(d_2,Z)\right) \end{aligned}$$
(9)
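A corresponding sketch of Eqs. 8 and 9 using scikit-learn's LDA implementation is shown below. The experiments in Sect. 6 use the Stanford Topic Modeling Toolbox with collapsed variational Bayes inference, so treat this purely as an illustration; the corpus, topic count, and iteration count are toy values.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; in practice the source documents D are used to fit the model
corpus = [
    "bird migration routes and seasonal temperature shifts",
    "soil carbon flux measurements in boreal forests",
    "phylogenetic trees inferred from gene sequences",
]

counts = CountVectorizer().fit_transform(corpus)

# Learn |Z| topics from the source corpus
lda = LatentDirichletAllocation(n_components=5, max_iter=100, random_state=0)
lda.fit(counts)

# Infer(d, Z): per-document topic distributions (Eq. 8)
theta = lda.transform(counts)

# DocSim_TM(d1, d2, Z): cosine similarity between topic vectors (Eq. 9)
print(cosine_similarity(theta[0:1], theta[1:2])[0, 0])
```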

3 Related work

The literature on document annotation is extensive; hence, we present only the work most closely related to ours.

3.1 Automatic document annotation

Newman et al. [20] discuss approaches for enriching metadata records using probabilistic topic modeling. Their approach treats each metadata record as a bag of words and consists of two main steps: (1) generate topics from a given corpus of metadata, and (2) assign relevant topics to each metadata record. A metadata record is thus annotated with the top terms of its assigned topics. They propose three variations of their approach. The first method, used as the baseline, uses the full vocabulary (every word) in the corpus. The other two methods filter the vocabulary by deleting unhelpful words, resulting in more meaningful topics. They compare the three approaches on three aspects: the percentage of usable topics, the percentage of enhanced records, and the average coverage by the top 4 chosen topics. They acquire the datasets from 700 repositories hosted by the OAISter Digital Library. The results show that, overall, the second method performs the best; however, it requires manual modification of the vocabulary and hence would not scale well. The third method performs somewhere in between.

Bron et al. [4] address the problem of document annotation by linking a poorly annotated document to well-annotated documents using TF-IDF cosine similarity. One corpus consists of textually rich documents (\(A_s\)), while the other contains sparse documents (\(A_t\)). They address two research problems: document expansion and term selection. For the document expansion task, each target document (a document in the sparse set) is mapped to one or more documents in the rich set using a simple cosine-similarity measure. The top \(N\) documents are chosen from the rich corpus, and their texts are added to the target document as supplemental content. The term selection task was introduced because using whole documents from the source corpus to enrich the target document might introduce noise and carries a fair chance of topic drift. This task aims to select only meaningful words from each document in the source corpus to add to the target documents. Specifically, the top K% of the words in each document, ranked by TF-IDF score, are selected as representative words of the document.

This work has a similar problem setting to ours, except that we aim to annotate a query document with keywords taken from the library, while their approaches extract keywords from the full content of documents.

Witten et al. [36] propose KEA, a machine learning-based algorithm for extracting keyphrases from documents. The algorithm can also be applied to annotate documents with relevant keyphrases. It first selects candidate keyphrases from the document and extracts two features from each candidate: its TF-IDF score and the distance of its first occurrence from the beginning of the document. A binary Naive Bayes classifier is trained on the extracted features to build a classification model, which is used for identifying important keyphrases. The algorithm was later enhanced by Medelyan et al. [17] to improve performance and add functionality such as document annotation and keyphrase recommendation from a controlled vocabulary, where the list of keyphrases to be recommended is predefined in the vocabulary. In our research, we use the controlled-vocabulary keyphrase recommendation feature of this improved version of the KEA algorithm as our baseline.

3.2 Automatic tag recommendation

Since we transform the metadata annotation problem into a tag recommendation problem, we briefly cover the related literature. Tag recommendation has gained a substantial amount of interest in recent years. Most work, however, focuses on personalized tag recommendation, suggesting tags for a user’s object based on the user’s preferences and social connections. Mishne et al. [19] employ users’ social connections to recommend tags for weblogs, based on similar weblogs tagged by the same users. Wu et al. [37] utilize the social network and the similarity between the contents of objects to learn a model for recommending tags; their system targets Flickr photo objects. While such personalized schemes have proven useful, some data domains have limited information about authors (users) and their social connections. Liu et al. [14] propose a tag recommendation model based on machine translation; their algorithm trains a translation model to translate the textual description of a document in the training set into its tags. Krestel et al. [13] employ topic modeling for recommending tags. They use the latent Dirichlet allocation algorithm to mine topics from a training corpus in which tags are used as the textual content, and evaluate their method against the association rule-based method proposed in [8]. Their method, however, is designed for tag recommendation for social documents where a network of users is assumed to exist, while our methods do not rely on such an assumption.

4 Datasets

We obtain four different datasets of environmental metadata records for the experiments: the Oak Ridge National Laboratory Distributed Active Archive Center (DAAC), the Dryad Digital Repository (DRYAD), the Knowledge Network for Biocomplexity (KNB), and TreeBASE, a repository of phylogenetic information (TreeBASE). The statistics of the datasets, including the number of documents, total number of tags, average number of tags per document, number of unique tags (tag library size), tag utilization, number of all words (dataset size), and average number of words per document, are summarized in Table 1. Tag utilization is the average number of documents in which a tag appears and is defined as \(\frac{\#\hbox { all tags}}{\#\hbox { unique tags}}\); it quantifies how often, on average, a tag is used for annotation.

Table 1 Statistics of the 4 datasets

The Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) is one of the NASA Earth Observing System Data and Information System (EOSDIS) data centers managed by the Earth Science Data and Information System (ESDIS) Project, which is responsible for providing scientific and other users access to data from NASA’s Earth Science Missions. The biogeochemical and ecological data provided by ORNL DAAC can be categorized into four groups: Field Campaigns, Land Validation, Regional and Global Data, and Model Archive. After raw data are collected, the data collector describes the data and annotates it using topic-represented keywords from the topic library.

Dryad is a nonprofit organization and an international repository of data underlying scientific and medical publications. The scientific, educational, and charitable mission of Dryad is to promote the availability of data underlying findings in the scientific literature for research and educational reuse. As of January 24, 2013, Dryad hosts 2570 data packages and 7012 data files, associated with articles in 186 journals. Metadata associated with each data package are annotated by the author with arbitrary choices of keywords.

The Knowledge Network for Biocomplexity (KNB) is a national network intended to facilitate ecological and environmental research on biocomplexity. For scientists, the KNB is an efficient way to discover, access, interpret, integrate and analyze complex ecological data from a highly distributed set of field stations, laboratories, research sites, and individual researchers. Each data package hosted by KNB is described and annotated with keywords from the taxonomy by the data collector.

TreeBASE is a repository of phylogenetic information, specifically user-submitted phylogenetic trees and the data used to generate them. TreeBASE accepts all types of phylogenetic data (e.g., trees of species, trees of populations, trees of genes) representing all biotic taxa. Data in TreeBASE are exposed to the public if they are used in a publication that is in press or published in a peer-reviewed scientific journal, book, conference proceedings, or thesis. TreeBASE is produced and governed by the Phyloinformatics Research Foundation, Inc.

In our setting, we assume that the documents are independently annotated, so that the tags in our training sets represent the gold standard. However, some metadata records may not be independent, since they may originate from the same projects or authors and hence be annotated in similar styles and with similar sets of keywords. To mitigate this problem, we randomly select a subset of 1000 annotated documents from each archive for our experiments (except for the DAAC dataset, which has only 978 documents on land terrestrial ecology, so we select them all). We combine all the textual attributes (i.e., Title, Abstract, Description) as the textual content of each document. We preprocess the textual content of each document by removing 664 common stop words and punctuation, and by stemming the words using the Porter2 stemming algorithm.
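A sketch of this preprocessing step is given below, using NLTK's Snowball stemmer (equivalent to Porter2) and NLTK's English stop-word list as a stand-in for our 664-word list; the field names in `record` are illustrative, not the actual metadata schema.

```python
import re
from nltk.corpus import stopwords             # requires: nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")           # Snowball English stemmer = Porter2
stop_words = set(stopwords.words("english"))   # stand-in for the 664-word stop list

def preprocess(record):
    # Combine the textual attributes of a metadata record into one string
    text = " ".join([record.get("title", ""),
                     record.get("abstract", ""),
                     record.get("description", "")]).lower()
    tokens = re.findall(r"[a-z]+", text)       # drop punctuation and numbers
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess({"title": "Soil moisture and temperature observations",
                  "abstract": "Daily measurements collected at forest sites."}))
```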

5 Methodology

The metadata annotation problem is transformed into the tag recommendation problem with a controlled tag library. A document is a tuple of textual information and a set of tags, i.e. \(\langle \hbox {text, tags}\rangle \). A document query is a document without tags, \(\langle \hbox {text}, \oslash \rangle \). Specifically, given a tag library \(T=\langle t_1, t_2,\ldots ,t_m\rangle \), a document corpus \(D=\langle d_1, d_2,\ldots , d_n\rangle \), and a document query \(q\), the algorithm outputs a ranked list \(T_{K}^{*}= \langle t_1, t_2,\ldots , t_K \rangle \), where \(t_i \in T\), of \(K\) tags relevant to the document query \(q\).

Our proposed algorithm comprises two main steps:

STEP 1: \(P(t|q,T,D,M)\), the probability of tag \(t\) being relevant to \(q\), is computed for each \(t \in T\). \(M\) is the document similarity measure, which can be either TF-IDF or TM.

STEP 2: Return the top \(K\) tags ranked by the probability \(P(t|q,T,D,M)\).

\(P(t|q,T,D,M)\) is the normalization of the relevance score of the tag \(t\) to the document query \(q\) and is defined as:

$$\begin{aligned}&P(t|q,T,D,M) = \frac{\hbox {TagScore}_M(t,q,D)}{\sum \nolimits _{\tau \in T} \hbox {TagScore}_M(\tau ,q,D)} \end{aligned}$$
(10)
$$\begin{aligned} \hbox {TagScore}_M(t,q,D) = \sum _{d \in D} \hbox {DocSim}_M(q,d,D) \cdot \hbox {isTag}(t,d) \end{aligned}$$
(11)

\(\hbox {TagScore}_M(t,q,D)\) calculates the tag score determining how relevant the tag \(t\) is to the document query \(q\); this score can be any real non-negative number. \(\hbox {DocSim}_M(q,d,D)\) measures the similarity between two documents, i.e., \(q\) and \(d\), given a document corpus \(D\), and returns a similarity score in [0,1]. \(\hbox {isTag}(t,d)\) is a binary function that returns 1 if \(t \in \hbox {d.tags}\) and 0 otherwise. We propose two approaches to compute the document similarity: term frequency-inverse document frequency (TF-IDF) based (\(\hbox {DocSim}_{\mathrm{TF}-\mathrm{IDF}}(q,d,D)\)) and topic modeling (TM) based (\(\hbox {DocSim}_{TM}(q,d,D)\)). These two approaches are described in the following subsections, and a short sketch of the overall scoring procedure is given below.
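The sketch treats the two steps as a similarity-weighted vote over the tags of the source documents (Eqs. 10 and 11). Here `doc_sim` is an assumed two-argument helper standing in for either \(\hbox {DocSim}_{\mathrm{TF}-\mathrm{IDF}}\) or \(\hbox {DocSim}_\mathrm{TM}\); this is an illustration of the scoring logic, not the production implementation.

```python
from collections import defaultdict

def recommend_tags(query, documents, doc_sim, K=5):
    """Rank tags for `query` following Eqs. 10-11.

    `documents` is a list of (text, tags) tuples; `doc_sim(q, d)` returns a
    similarity in [0, 1] (TF-IDF- or TM-based, precomputed over the corpus).
    """
    scores = defaultdict(float)
    for text, tags in documents:
        sim = doc_sim(query, text)        # DocSim_M(q, d, D)
        for tag in tags:                  # isTag(t, d) = 1 for these tags
            scores[tag] += sim            # accumulates TagScore_M(t, q, D)

    total = sum(scores.values())
    if total == 0:
        return []
    # P(t | q, T, D, M): normalized tag scores; return the top K
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(tag, score / total) for tag, score in ranked[:K]]
```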

5.1 TF-IDF-based document similarity

The TF-IDF-based document similarity scoring function, \(\hbox {DocSim}_{\mathrm{TF}-\mathrm{IDF}}(q,d,D)\), relies on the TF-IDF principle discussed in Sect. 2.2. The function quantifies content similarity based on the term overlap between two documents. To compute the IDF part of the equation, all documents in \(D\) are first indexed; hence, the training phase (preprocessing) involves indexing all the documents. The similarity between the query \(q\) and a source document \(d\) is then computed using \(\hbox {DocSim}_{\mathrm{TF}-\mathrm{IDF}}(q, d, D)\) as defined in Eq. 6.

5.2 TM-based document similarity

The TM-based document similarity, \(\hbox {DocSim}_{TM}(q,d,D)\), utilizes the topic distributions of the documents computed by the LDA algorithm described in Sect. 2.3. LDA captures the semantics of a document through its topic distribution, so one can measure the semantic similarity between two documents by quantifying the similarity between their topic distributions. Our TM-based algorithm therefore transforms the topic distribution of a document into a numerical vector and computes the topic similarity between two documents with cosine similarity, using Eq. 9.

6 Evaluation and results

We evaluate our methods using the tag prediction protocol: we artificially create a test query document by removing the tags from an annotated document, and the task is to predict the removed tags. There are two reasons for choosing this evaluation scheme:

1. The evaluation can be done fully automatically. Since our datasets are large, manual evaluation (i.e., having humans judge whether a recommended tag is relevant) would be infeasible.

2. The evaluation is done against an existing gold standard established (manually tagged) by expert annotators (i.e., data collectors, project principal investigators, etc.) who have a good understanding of the data, whereas manual evaluation by individuals who are not familiar with the data could introduce evaluation biases.

We evaluate our TF-IDF- and TM-based algorithms against the baseline KEA document annotation algorithm with a controlled vocabulary; in our setting, the tag library is used as the vocabulary by the KEA algorithm. Document-wise tenfold cross-validation is performed: each dataset is first split into 10 equal subsets, and for each fold \(i \in \{1,2,3,\ldots ,10\}\), subset \(i\) is used as the testing set while the other nine subsets are combined and used as the source (training set). The results of each fold are then averaged and reported.
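The cross-validation loop can be set up as in the following sketch; the toy documents, the shuffling and random seed, and the placeholder `evaluate_fold` are illustrative assumptions, not our actual evaluation code.

```python
from sklearn.model_selection import KFold

# Toy stand-in for an annotated corpus: (text, tags) pairs
documents = [(f"document text {i}", {f"tag{i % 7}"}) for i in range(50)]

def evaluate_fold(source, queries):
    # Placeholder per-fold evaluation; in practice this would compute
    # precision/recall/F1, MRR, and Bpref on the held-out queries
    return len(queries) / len(documents)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
fold_results = []
for train_idx, test_idx in kfold.split(documents):
    source = [documents[i] for i in train_idx]    # nine subsets form the training source
    queries = [documents[i] for i in test_idx]    # held-out fold; tags hidden at test time
    fold_results.append(evaluate_fold(source, queries))

print(sum(fold_results) / len(fold_results))      # averaged over the ten folds
```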

For the TF-IDF-based algorithm, we use LingPipe to perform the indexing and to calculate the TF-IDF-based similarity. For the TM-based algorithm, the training process involves modeling topics from the source using the LDA algorithm as discussed in Sect. 2.3. We use the Stanford Topic Modeling Toolbox with the collapsed variational Bayes approximation [2] to learn topics in the source documents. For each document we generate uni-grams, bi-grams, and tri-grams, and combine them to represent the textual content of the document. The algorithm takes two input parameters: the number of topics to be identified and the maximum number of training iterations. After some experiments varying the two parameters, we fix them at 300 and 1000, respectively. The inference method proposed by Asuncion et al. [2] is used to assign a topic distribution to a given document. The evaluation is done on a Windows 7 PC with an Intel Core i7 2600 CPU at 3.4 GHz and 16 GB of RAM.

6.1 Evaluation metrics

This section presents the evaluation metrics used in our tasks: precision, recall, F1, mean reciprocal rank (MRR), and binary preference (Bpref). These metrics, used in combination, have been shown to be effective for evaluating recommender systems [10, 15, 38].

6.1.1 Precision, recall, and F1

Precision, recall, and F1 (F-measure) are well-known evaluation metrics in the information retrieval literature [16]. For each document query in the test set, we use the original set of tags as the ground truth \(T_\mathrm{g}\). Let the set of recommended tags be \(T_\mathrm{r}\), so that the correctly recommended tags are \(T_\mathrm{g} \cap T_\mathrm{r}\). Precision, recall, and F1 are defined as follows:

$$\begin{aligned} \hbox {precision} = \frac{|T_\mathrm{g} \cap T_\mathrm{r}|}{|T_\mathrm{r}|}, \quad \hbox {recall}=\frac{|T_\mathrm{g} \cap T_\mathrm{r}|}{|T_\mathrm{g}|}, \quad F1 = \frac{2 \cdot \hbox {precision} \cdot \hbox {recall}}{\hbox {precision} + \hbox {recall}} \end{aligned}$$

In our experiments, the number of recommended tags ranges from 1 to 30. Note that better tag recommendation systems tend to rank correct tags higher than incorrect ones, yet precision, recall, and F1 do not take ranking into account. To evaluate the quality of the ranked results, we employ the following additional metrics.

6.1.2 Mean reciprocal rank

The mean reciprocal rank (MRR) takes ordering into account [34]: it measures how highly the first correctly recommended tag is ranked. The reciprocal rank of a query is the multiplicative inverse of the rank of the first correctly recommended tag, and the MRR is the average of the reciprocal ranks over the query set \(Q\). Formally, given a testing set \(Q\), let \(\hbox {rank}_q\) be the rank of the first correct answer for the query \(q \in Q\); the MRR of the query set \(Q\) is then defined as follows:

$$\begin{aligned} \hbox {MRR} = \frac{1}{|Q|}\sum _{q \in Q}\frac{1}{\hbox {rank}_q} \end{aligned}$$

If the set of recommended tags does not contain a correct tag at all, \(\frac{1}{\hbox {rank}_q}\) is defined to be \(0\).

6.1.3 Binary preference

The binary preference (Bpref) measure considers the order of each correctly recommended tag [5]. Let \(S\) be the set of tags recommended by the system, \(R\) be the set of correct tags (note that \(R\) need not be a subset of \(S\)), \(r \in R\) be a correct recommendation, and \(i \in S - R\) be an incorrect recommendation. Bpref is defined as follows:

$$\begin{aligned} \hbox {Bpref} = \frac{1}{|R|}\sum _{r \in R}\left( 1-\frac{|i\ \text{ ranked } \text{ higher } \text{ than } r|}{|S|}\right) \end{aligned}$$

Bpref can be thought of as the inverse of the fraction of irrelevant tags that are recommended before relevant ones. Bpref and mean average precision (MAP) behave similarly when used with complete judgments; however, Bpref normally gives a more reliable evaluation when the judgments are incomplete.
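For a single query, the metrics of this section can be computed as in the sketch below, where `recommended` is the ranked tag list \(S\) returned by the system and `relevant` is the ground-truth set \(R\). Correct tags that are never recommended contribute zero to Bpref here, following the usual TREC convention; this is an illustrative sketch, not our evaluation code.

```python
def precision_recall_f1(recommended, relevant):
    hits = len(set(recommended) & set(relevant))
    p = hits / len(recommended) if recommended else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

def reciprocal_rank(recommended, relevant):
    # 1 / rank of the first correct tag; 0 if no recommended tag is correct.
    # MRR averages this quantity over the query set Q.
    for rank, tag in enumerate(recommended, start=1):
        if tag in relevant:
            return 1.0 / rank
    return 0.0

def bpref(recommended, relevant):
    # Each correct tag is penalised by the fraction of incorrect tags ranked
    # above it; correct tags never recommended contribute 0 (TREC convention).
    relevant = set(relevant)
    score, incorrect_above = 0.0, 0
    for tag in recommended:
        if tag in relevant:
            score += 1.0 - incorrect_above / len(recommended)
        else:
            incorrect_above += 1
    return score / len(relevant) if relevant else 0.0

print(precision_recall_f1(["soil", "carbon", "ndvi"], {"soil", "temperature"}))
print(reciprocal_rank(["carbon", "soil"], {"soil", "temperature"}))   # 0.5
print(bpref(["carbon", "soil"], {"soil", "temperature"}))             # 0.25
```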

6.2 Results

Figures 2, 3, and 4 plot precision@K, recall@K, and F1@K, respectively, evaluated at the top \(K\) tags recommended by the proposed TF-IDF- and TM-based algorithms against the baseline KEA algorithm on each dataset. Figure 5 summarizes precision vs. recall on each dataset.

Fig. 2 Precision of the TF-IDF, TM, and KEA (baseline) algorithms on the four datasets

Fig. 3 Recall of the TF-IDF, TM, and KEA (baseline) algorithms on the four datasets

Fig. 4 F1 of the TF-IDF, TM, and KEA (baseline) algorithms on the four datasets

Fig. 5 Precision vs. recall of the TF-IDF, TM, and KEA (baseline) algorithms on the four datasets

According to the results, our proposed algorithms outperform the baseline KEA algorithm on the DAAC and KNB datasets (the TM-based approach outperforms it at every \(K\), and the TF-IDF-based approach outperforms it at larger \(K\)). This is because the tags used to annotate DAAC and KNB documents are drawn from libraries of topics; hence, there is a high chance that a tag is reused multiple times, resulting in high tag utilization. Since our algorithms give higher weight to tags that have been used frequently, datasets with high tag utilization (such as DAAC and KNB) tend to benefit from our algorithms.

However, our proposed algorithms perform worse than the baseline on the DRYAD dataset. This is because the tags of each DRYAD document are made up manually during the curation process, which results in a large tag library in which each tag is used only a few times, leading to low tag utilization. Datasets with low tag utilization do not benefit from our proposed algorithms, since the probability distribution assigned to the tags tends to be close to uniform and not very discriminative.

All the algorithms perform poorly on the TreeBASE dataset. This is because TreeBASE documents are very sparse (some do not even have textual content) and have very few tags: according to the dataset statistics, each TreeBASE document has only 11 words and 0.7 tags on average. Such sparse text provides a weak signal for finding textually similar documents in the TF-IDF-based approach and leads to a poor-quality topic model in the TM-based approach. The small number of tags per document makes it even harder to predict the right tags.

Table 2 lists the MRR, Bpref, average learning time (in seconds) per fold, and average testing time (in seconds) per fold of the proposed TF-IDF- and TM-based algorithms against the baseline KEA algorithm on each dataset. MRR quantifies how highly the first correct recommendation is ranked. In terms of MRR, our TM-based algorithm performs the best on the DAAC and KNB datasets, the TF-IDF-based algorithm performs the best on the TreeBASE dataset, and the KEA algorithm performs the best on the DRYAD dataset. The TM-based algorithm achieves notable MRR scores of 0.75 and 0.92 on the DAAC and KNB datasets, respectively, outperforming the baseline by 47.70 and 33.22%, respectively.

Table 2 MRR, Bpref, average learning time (ALT) and average test time (ATT) of TF-IDF, TM, KEA (baseline) algorithms on the four datasets

Bpref measures the ranking of all the correctly recommended keywords. In terms of Bpref, our TM-based algorithm performs the best on the DAAC, DRYAD, and KNB datasets, with Bpref scores of 0.90, 0.49, and 0.91, respectively; the TF-IDF-based algorithm performs the best on the TreeBASE dataset. Similar to the MRR results, notable Bpref scores are achieved by the TM-based algorithm on the DAAC and KNB datasets, outperforming the baseline by 285.32 and 274.33%, respectively.

Table 3 shows sample tags recommended by our proposed TF-IDF- and TM-based algorithms and by the baseline KEA algorithm for the DAAC metadata record titled “ISLSCP II IGBP DISCOVER AND SIB LAND COVER, 1992-1993”, against the 15 actual ground-truth tags associated with the record. Our TM-based algorithm performs well on this particular example, capturing all the actual tags within the top 15 recommended tags.

7 Discussion

This section provides additional discussions about the proposed algorithms.

7.1 TM- vs. TF-IDF-based approaches

According to the results, our TM-based approach performs better than the TF-IDF-based approach on the DAAC, DRYAD, and KNB datasets in terms of precision, recall, and F1, while the TF-IDF-based approach performs better on the TreeBASE dataset. Since the only difference between the two proposed methods is the document similarity function \(\hbox {DocSim}(q,d,D)\), which computes the similarity score between the query document \(q\) and a source document \(d \in D\), analyzing the differences between the two document similarity measures can explain the performance difference.

The TF-IDF document similarity is the cosine similarity between the two TF-IDF vectors representing the documents. Loosely speaking, it measures the amount of term overlap between the two documents, with each term weighted differently.

The TM-based approach first derives a set of topics from the document source, each of which is represented by a distribution of terms. The top-ranked terms in each topic bear coherent semantic meanings. Table 4 provides an example of the top 10 terms of each of 9 sample topics derived from the DAAC dataset using the LDA algorithm with 300 topics and 1000 iterations. Once the set of topics has been determined, a document is assigned a distribution of topics using the inference algorithm mentioned in Sect. 2.3. The TM document similarity then measures the cosine similarity between the topic distribution vectors representing the two documents. Loosely speaking, the TM document similarity quantifies the topic similarity between the two documents.

Table 4 Top 10 terms of 9 sample topics derived using the LDA algorithm from the DAAC dataset

The performance difference between the two proposed methods can be attributed to how each represents document semantics. The experimental results on the DAAC, DRYAD, and KNB datasets show that representing a document as a mixture of topics leads to a more accurate interpretation of semantic similarity and therefore better recommendations. The reason the TM-based approach performs worse than the TF-IDF-based approach on the TreeBASE dataset could be that the documents in that dataset are very sparse (each TreeBASE document has only 11 words on average); such sparsity can lead to a poor set of topics consisting of idiosyncratic word combinations.

Hence, we recommend the TM-based algorithm for datasets whose documents are rich in textual content and the TF-IDF-based algorithm for those with textually sparse documents.

7.2 Limitations

Despite their promising performance, our proposed document annotation algorithms may face the following limitations:

1. The proposed algorithms rely on the existence of a good document source (training set). The quality of the resulting annotation directly reflects the quality of the annotation of each document in the training data. Fortunately, the current ONEMercury system only retrieves metadata from archives in which each metadata record is manually and carefully annotated by principal investigators and data managers. In the future, however, the system may expand to collect metadata from sources whose records have poor or no annotation. Such problems call for a method that allows an automatic annotator trained on a high-quality dataset to annotate documents in different datasets. We briefly discuss the possibility of applying the proposed method to cross-archive annotation in Sect. 7.5.

2. Our TM-based algorithm needs to model topics from scratch every time a significant number of new documents is added to the training corpus, so that the modeled topics reflect the newly added documents. Since our TM-based algorithm uses the traditional LDA algorithm, which does not support incremental training, we plan to explore methods such as [1] and [11] that may enable our algorithm to adaptively model topics from a dynamic corpus.

3. Despite the promising performance of our proposed TM-based algorithm, scalability can become an issue when mining topics from a larger corpus of documents. The scalability issues of our TM-based algorithm are discussed in detail in the next subsection.

7.3 Scalability of the TM approach

Scalability issues should be taken into account, since the algorithms will eventually be incorporated into the ONEMercury system, which currently hosts much larger datasets than the ones we use in the experiments. Since the theoretical time and space complexities of the underlying LDA algorithm have been extensively investigated (see [22]), we instead focus on scalability from a practical point of view. This section discusses two scalability issues of the TM-based algorithm: the increase in the number of topics and the increase in the size of the corpus.

We examine the scalability issues of the proposed TM-based algorithm on the KNB dataset, using the Stanford Topic Modeling Toolbox with collapsed variational Bayes approximation and fixed 1000 iterations, on the same machine we use for earlier experiments.

As the data grow larger, new topics emerge, requiring a new model that captures this increasing variety of topics. Figure 6 plots the training time (in seconds) as a function of the number of topics. The training time grows approximately linearly with the number of topics up to 400 topics. At 500 topics, however, the program runs out of physical memory, leading to a dramatic increase in the training time. This observation indicates that a more memory-efficient topic modeling algorithm should be explored.

Fig. 6 Learning time in seconds of the TM-based algorithm as a function of the number of topics

Another scalability concern lies with the projected increase in the number of training documents. Figure 7 shows the training time of the TM-based algorithm as the number of documents increases; the training time again scales linearly with the number of training documents. Note that the experiment uses only up to 1000 documents, while there are roughly 47,000 metadata records in the current system, a number that will keep increasing. Even at the current size of the ONEMercury repository, the algorithm would take approximately 5.3 h to model topics, which is not feasible in practice. Hence, a large-scale parallel approach such as MapReduce [6] should be investigated.

Fig. 7 Learning time in seconds of the TM-based algorithm as a function of the number of training documents

7.4 Employing topic coherence to find optimum numbers of topics

Multiple studies on topic modeling have shown that the coherence of the term distribution in each topic has a direct impact on the effectiveness of the learned topics in various applications [3, 7, 9, 35]. Newman et al. [21] define the coherence of a topic as its ability to be interpreted by humans as a semantically meaningful topic. Since our TM-based method utilizes LDA to learn topical knowledge from the source documents and to compute topical similarity between documents, the coherence of the learned topics could have an impact on the relevance of the recommended tags.

In this section, the coherence of topics learned from each dataset is investigated. The results not only shed light on the quality of the learned topics, but also help determine the optimal number of topics to be learned from each archive. Too few topics typically result in very broad topics, while too many topics result in random, meaningless topics that pick out idiosyncratic word combinations [24]. Newman et al. [21] propose a set of schemes for the automatic evaluation of topic coherence, divided into three groups: WordNet-, Wikipedia-, and Google search engine-based methods. For our topic coherence analysis, we adopt an evaluation scheme similar to their Wikipedia-based method, using pointwise mutual information (PMI) as the word-pair scoring function, since this scheme was reported to be the most accurate in their work.

Since we aim to find the optimal number of topics to learn from each archive, the aggregate topic coherence score is calculated for each topic set. In particular, let \(Z_T = \{z_1, z_2, \ldots , z_T\}\) be the set of \(T\) learned topics. We aim to calculate the aggregate topic coherence (ATC) score for the topic set \(Z_T\) by taking the arithmetic mean and median of the coherence scores of all the topics in \(Z_T\), as follows:

$$\begin{aligned}&\hbox {ATC}_\mathrm{mean}(Z_T)= \hbox {mean}\{C(z_i); i \in 1 \ldots T\} \end{aligned}$$
(12)
$$\begin{aligned}&\hbox {ATC}_\mathrm{median}(Z_T) = \hbox {median}\{C(z_i); i \in 1 \ldots T\} \end{aligned}$$
(13)

\(C(z)\) is a coherence score of the topic \(z\) and is calculated as follows:

Let \(z\) be a topic and \(W_{10}=\{w_1,\ldots ,w_{10}\}\) be the top 10 words in \(z\). The coherence score of the topic \(z\) is then the average of the pointwise mutual information (PMI) scores of all unique pairs of words in \(W_{10}\).

$$\begin{aligned}&C(z) = \hbox {mean}\{\hbox {PMI}(w_i,w_j);\ i,j \in 1 \ldots 10;\ i<j\} \end{aligned}$$
(14)
$$\begin{aligned}&\hbox {PMI}(w_i,w_j) = \log \left( \frac{p(w_i,w_j)}{p(w_i)\cdot p(w_j)}\right) \end{aligned}$$
(15)

\(p(w_i)\) and \(p(w_j)\) are calculated as the proportions of documents that contain at least one occurrence of \(w_i\) and \(w_j\), respectively. \(p(w_i,w_j)\) is the proportion of documents that contain both \(w_i\) and \(w_j\). Mathematically, let \(D\) be the document collection and \(D(w) \subseteq D\) be the set of documents containing at least one occurrence of \(w\).

$$\begin{aligned}&p(w_i) = \frac{|D(w_i)|}{|D|}\ ,\ p(w_j) = \frac{|D(w_j)|}{|D|} \end{aligned}$$
(16)
$$\begin{aligned}&p(w_i,w_j) = \frac{|D(w_i) \cap D(w_j)|}{|D|} \end{aligned}$$
(17)

Instead of using Wikipedia articles as the external knowledge source as in [21], we use the documents in the datasets themselves as the external knowledge. This is because most metadata records in our datasets come from very specific subfields of the ecological and environmental sciences, which Wikipedia articles do not cover well; moreover, these metadata records contain many technical and scientific keywords that are not normally used in general encyclopedias. For each dataset, a set of 1000 randomly chosen documents is used to model the topics. The number of topics \(T\) is varied in steps of 50 from 0 to 600 topics and in steps of 100 from 600 to 2000 topics. At each \(T\), the LDA algorithm is run with 1000 iterations to learn a set of \(T\) topics, \(Z_T\), and the mean and median aggregate topic coherence scores are then calculated for each \(Z_T\).
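Equations (14)–(17) can be computed directly from document co-occurrence counts, as in the sketch below; each document is treated as a set of tokens from the same dataset that serves as the reference corpus, and word pairs that never co-occur are assigned a PMI of zero, which is one pragmatic choice rather than part of the original definition.

```python
import math
from itertools import combinations

def topic_coherence(top_words, documents):
    """C(z) of Eq. (14): mean pairwise PMI over a topic's top words."""
    n = len(documents)

    def p(*words):
        # Eqs. (16)-(17): proportion of documents containing all given words
        return sum(1 for d in documents if all(w in d for w in words)) / n

    pmis = []
    for wi, wj in combinations(top_words, 2):
        joint = p(wi, wj)
        # Pairs that never co-occur are given PMI 0 here (an assumed convention)
        pmis.append(math.log(joint / (p(wi) * p(wj))) if joint > 0 else 0.0)
    return sum(pmis) / len(pmis) if pmis else 0.0

# Toy documents represented as token sets
docs = [{"soil", "carbon", "flux"}, {"soil", "moisture"}, {"carbon", "flux", "forest"}]
print(topic_coherence(["soil", "carbon", "flux"], docs))
```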

Figures 8 and 9 plot the mean and median aggregate topic coherence scores, respectively, of each dataset as a function of the number of topics. According to Fig. 8, the optimal numbers of topics to be learned from the DAAC, DRYAD, KNB, and TreeBASE datasets are 550, 1000, 450, and 250, respectively. Note that, according to Table 1, the approximate data sizes for DAAC, DRYAD, KNB, and TreeBASE used in this analysis are 104,261, 129,926, 63,324, and 11,405 words, respectively. Surprisingly, there is a correlation between the optimal number of topics and the size of the document collection. An explanation for this phenomenon could be that richer archives tend to have more content and hence are composed of more topical subjects. It is also interesting to note that the effect of too many topics is apparent in the TreeBASE dataset, where the mean aggregate topic coherence score drops close to zero beyond 700 topics. This is because the TreeBASE document collection used in this analysis is so small that topics beyond roughly 700 become random and spurious, impeding the overall quality of the learned topics.

Fig. 8 Mean aggregate topic coherence scores of the four datasets as a function of the number of topics

Fig. 9 Median aggregate topic coherence scores of the four datasets as a function of the number of topics

Figure 10 plots the standard deviation of the topic coherence scores of the four datasets at different numbers of topics. Interestingly, the standard deviation correlates directly with the aggregate topic coherence scores in Figs. 8 and 9. This is because, with too few topics (fewer than the optimum), the learned topics tend to be general, resulting in similar semantics across all topics and hence a lower standard deviation. At the other extreme, too many topics (beyond the optimum) may cause all the learned topics to be equally random, again lowering the standard deviation.

Fig. 10 Standard deviation of the aggregate topic coherence scores of the four datasets as a function of the number of topics

7.5 Experiments on cross-archive annotation

In most cases, the documents used to model the annotator are selected from the same archive as the target document (self-archive annotation), with the intuition that documents in the same archive tend to have similar topical composition. However, an annotator modeled from multiple archives may also be useful. This method is called cross-archive annotation and can provide the following benefits:

1. Mitigating the cold-start problem. Cross-archive annotation can address the cold-start problem, in which the documents to be annotated do not have an associated richly annotated source archive on which to train the annotator.

2. Introducing new but relevant topical knowledge. Different archives carry a wide variety of topical knowledge and annotations. Modeling an annotator from multiple sources therefore introduces new concepts and tags to the annotator.

To investigate the applicability of the proposed methodology to cross-archive annotation, an experiment is conducted using the TM-based method to compare the performance of self-archive and cross-archive annotation. For the self-archive evaluation, the documents in the training set are selected from the same dataset as the target document, while the cross-archive evaluation combines the training documents from all four datasets. We evaluate the proposed TM-based algorithm under both source modes using document-wise tenfold cross-validation, where each dataset is split into ten equal subsets, and for each fold \(i \in \{1,2,3,\ldots ,10\}\) subset \(i\) is used as the testing set while the other nine subsets are combined and used as the source (training set).

Figure 11 compares the precision, recall, F1, and precision vs. recall of the self- and cross-archive evaluations on the test documents from the four datasets. Interestingly, the cross-archive performance is worse than the self-archive performance on all four test datasets. This is because the prediction protocol is used as the evaluation criterion, where we try to predict the pre-existing tags of the test documents; tags drawn from the other archives' tag vocabularies are therefore unknown to the target documents. Hence, even though cross-archive annotation has the potential to bring a new variety of relevant annotations to the target documents, the automatic evaluation criteria used here are too strict, and an expert evaluation in which professionals manually review the annotation results would be needed to properly assess the cross-archive annotation methodology.

Fig. 11 Precision, recall, F1, and precision vs. recall of the TM-based method on the four datasets and the two source selection modes: a precision, b recall, c F1, d precision vs. recall

8 Conclusion and future work

This paper presents a set of algorithms for the automatic annotation of metadata. We are motivated by real-world problems faced by ONEMercury, a search system for environmental science metadata harvested from multiple data archives. An important problem is the varying level of curation of metadata from different archives, which means that the system must automatically annotate poorly annotated metadata records. We treat each metadata record as a tagged document and transform the problem into a tag recommendation problem with a controlled tag library.

We propose two algorithms for tag recommendation, one based on term frequency-inverse document frequency (TF-IDF) and the other based on topic modeling (TM) using latent Dirichlet allocation. The evaluation is done on four different datasets of environmental science metadata using the tag prediction evaluation protocol, against the well-known KEA document annotation algorithm. The results show that our TM-based approach yields better results on datasets characterized by high tag utilization and rich textual content (i.e., DAAC and KNB) than on those that are not (i.e., DRYAD and TreeBASE), though at the cost of longer learning times. The scalability issues of the TM-based algorithm necessitate investigation into more memory-efficient and scalable approaches. Finally, future work includes implementing automatic metadata annotation in the ONEMercury search service and exploring online tagging [23].