Introduction

Usage metrics of academic articles have become increasingly popular in scientometrics. Article views and downloads are frequently used as supplementary indicators to measure scholarly impact, identify the latest research trends in disciplines and explore usage patterns. Researchers have also used correlation analysis to investigate the relationships among article views, downloads, citations, co-author counts, funding data and so on, in an attempt to explain why articles are viewed or downloaded. However, most of these studies are limited to bibliographic data.

Usage metrics need to be studied from a broader perspective. The increasing availability of the full text of scientific articles in machine-readable electronic formats offers an opportunity to greatly advance scientometrics. In-text citations and entity metrics are typical examples of full-text analysis in scientometrics (Ding et al. 2013). Similarly, introducing full-text analysis to usage metrics is potentially valuable.

In this study, it is assumed that linguistic characteristics influence article views and downloads to some extent. To understand this relationship, linguistic characteristics (title length, abstract length, full-text length, sentence length, lexical diversity, lexical density and lexical sophistication) are investigated jointly with usage metrics.

Literature review

Usage metrics

Usage metrics research in scientometrics has mainly focused on the following topics. Firstly, user behavior patterns, such as scientists’ working timetable (Wang et al. 2012, 2013a), user preferences (Chen 2018; Chen et al. 2018; Davis and Solla 2003; Davis 2006; Wang et al. 2016a, b) and user temporal usage patterns (Chen et al. 2017; Khan and Younas 2017). Secondly, obsolescence of articles from a diachronic or synchronic perspective. For example, Moed (2005) and Moed and Halevi (2016) studied diachronic and synchronic obsolescence of usage data from the perspective of journals and countries. Gorraiz et al. (2014) conducted similar research from the perspective of disciplines.

Thirdly, identifying the latest research trends of disciplines (Bollen et al. 2002; Wang et al. 2013b). Fourthly, indicators to evaluate the performance of journals, authors, groups and countries (Chi and Glänzel 2018; De Sordi et al. 2016; Wan et al. 2010) or supplementary metrics used jointly with altmetrics measures (Bollen et al. 2005; Kurtz and Henneken 2016). Finally, correlations between specific usage types, including downloads and citations (Kurtz and Bollen 2010; O’Leary 2008; Schloegl et al. 2014; Subotic and Mukherjee 2014; Zhao 2017), usage data among different platforms (Chen 2018; Chen et al. 2017), usage data and co-author counts (Chi and Glänzel 2017), or funding data (Zhao et al. 2018).

Full-text analysis in scientometrics

Full text contains additional information that has not been available in bibliographic data. At a minimum this includes reference position, proximity of cited references within the text, multiple references at the same reference point, multiple mentions of references (so‐called op. cit.), section information, and words indicating how an author feels about a reference (i.e., citation contexts or sentiments). Full text also contains a relatively high level of detail about motivation, methods, data, instruments, results, and conclusions that authors typically report when documenting and submitting their work for publication (Boyack et al. 2013).

Full-text analysis in scientometrics has mainly focused on the following topics. Firstly, in-text citations, such as reference position (e.g. Hu et al. 2013; Boyack et al. 2018), proximity of cited references (e.g. Gipp and Beel 2009; Liu and Chen 2012; Boyack et al. 2013; Kim et al. 2016), citation contexts or sentiments (e.g. Small 2011; Liu and Chen 2013; Ding et al. 2014; Lu et al. 2017), and citation motivation or behavior (e.g. Brooks 1986; Cano 1989; Bonzi and Snyder 1991; Case and Higgins 2000; Zhang et al. 2018). Secondly, entity metrics, such as scientific concepts (e.g. Ding et al. 2013; Mckeown et al. 2016), datasets (e.g. Belter 2014), software (Pan et al. 2015, 2016, 2018, 2019) and algorithms (e.g. Wang and Zhang 2018). Finally, linguistic complexity of scientific writing styles and scientific impacts (e.g. Lu et al. 2019a, b) and characteristics of a highly cited article (e.g. Elgendi 2019).

Most studies of usage metrics focus on numerical analysis. A few analyze textual content jointly with usage metrics, but they are limited to traditional bibliometric methods (e.g. keyword frequency and ratio, bibliographic coupling, co-word analysis and correlation analysis) and to bibliographic data. Full-text analysis in scientometrics mainly focuses on in-text citations and entity metrics, and it is increasingly expanding to other active topics in scientometrics, such as scientific writing styles and scientific impact.

Research questions

The application of full-text analysis to understand the relationship between linguistic characteristics and article views and downloads has not been thoroughly investigated. To address this research gap, full-text analysis is used to explore the linguistic characteristics of highly browsed and downloaded papers. In this study, we are interested in the following research questions in the context of seven journals published by PLoS:

  1. Are there any relationships between linguistic characteristics and highly browsed academic articles?

  2. Are there any relationships between linguistic characteristics and highly downloaded academic articles?

  3. Are there journal and platform differences of linguistic characteristics in highly browsed and downloaded academic articles?

Methodology

Data

The data in this study consist of 63,002 full-text articles published from 2014 to 2015 in the PLoS journal family, a set of peer-reviewed journals covering various disciplines. PLoS usage counts, along with other metadata, were collected between November 1st and November 7th, 2018. The PLoS journals are also indexed by PMC (PubMed Central) and Web of Science (WoS), and usage counts and other metadata were crawled from these two platforms over the same period. The PLoS, PMC and WoS usage counts of each article, along with other metadata, were aggregated by DOI or article title.
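The aggregation step can be sketched as follows. This is only an illustration under stated assumptions: the record fields (`doi`, `title`, `views`), the title normalisation rule and the function names are hypothetical, since the paper does not describe how DOI and title matching were implemented.

```python
def normalise_title(title):
    """Lower-case a title and collapse whitespace (an assumed matching rule)."""
    return " ".join(title.lower().split())


def merge_usage(plos_records, pmc_records):
    """Attach PMC view counts to PLoS records, matching on DOI first, then on title."""
    by_doi = {r["doi"]: r for r in pmc_records if r.get("doi")}
    by_title = {normalise_title(r["title"]): r for r in pmc_records}
    merged = []
    for rec in plos_records:
        # Prefer the DOI match; fall back to the normalised title when no DOI matches.
        match = by_doi.get(rec.get("doi")) or by_title.get(normalise_title(rec["title"]))
        merged.append({**rec, "pmc_views": match["views"] if match else None})
    return merged


# Hypothetical records, for illustration only.
plos = [
    {"doi": "10.1371/example.0001", "title": "A Study", "views": 100},
    {"doi": None, "title": "Another  study", "views": 50},
]
pmc = [
    {"doi": "10.1371/example.0001", "title": "A study", "views": 10},
    {"doi": None, "title": "another study", "views": 5},
]
merged = merge_usage(plos, pmc)
```

Matching on DOI first avoids false matches between articles with near-identical titles; the title fallback only covers records for which no DOI is available.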

The articles in the data set were published between January 2014 and December 2015, because usage counts, especially citations, accumulate to a steady level in the first 2 or 3 years after publication (Lippi and Favaloro 2013). Editorials and letters are excluded, and only research articles pre-labeled by PLoS are kept. The final datasets are shown in Table 1.

Table 1 Datasets

Usage counts of PLoS

PLoS offers Article-Level Metrics (ALMs) for each journal article. PLoS ALMs draw from the sources below. Viewed: PLoS Journals (HTML, PDF, XML), PubMed Central (HTML, PDF); Saved: CiteULike, Mendeley; Cited: CrossRef, Datacite, Europe PMC, PubMed Central, Scopus, Web of Science; Recommended: F1000 Prime; Discussed: PLoS Comments, Facebook, Reddit, Twitter, Wikipedia.

PLoS articles are provided in three different formats (page views, PDF downloads, and XML downloads), and PLoS records the online activity of users across these three formats. This “usage”, comprising the three types, is provided as an aggregate metric or broken down month by month in graphical format. Online usage via the PLoS platform is presented according to industry-standard definitions of usage and is COUNTER-compliant.

PLoS also displays COUNTER 3-compliant PMC usage data for each article. PMC separately counts the number of page views and PDF downloads of the article on its site. The results are made available to PLoS only once a month, not in real time. As a result, the display of PMC data may lag by up to one month. This also affects recently published articles, which may not show PMC usage data for their first month of publication.

Usage counts of WoS

The usage count is a measure of the level of interest in a specific item on the WoS platform. The count reflects the number of times the article has met a user’s information needs, as demonstrated by clicking links to the full-length article at the publisher’s website (via direct link or Open-Url) or by saving the article for use in a bibliographic management tool (via direct export or in a format to be imported later). The usage count is a record of all activity performed by all WoS users, not just activity performed by users at a single institution. Usage counts for different versions of the same item on the WoS platform are unified. Usage counts are updated daily. There are two kinds of usage counts on the WoS platform.

  • Last 180 days: the number of times the full text of a record has been accessed or a record has been saved in the last 180 days. This count can move up or down as the end date of the fixed period advances.

  • Since 2013: the number of times the full text of a record has been accessed or a record has been saved since February 1, 2013. This count can increase or remain static over time.

In general, “usage” often refers to HTML views and PDF downloads. The HTML views and PDF downloads of PLoS and PMC are traditional usage data, whereas the XML downloads of PLoS are a new kind of usage data. Departing from the traditional definition, WoS defines “usage” as “clicking” and “saving” (Wang et al. 2016a); more accurately, WoS usage should be described as “HTML view” and “saving” (Chen 2018). Tables 2 and 3 show that usage counts in WoS and XML downloads of PLoS are considerably lower than the HTML views and PDF downloads of PLoS and PMC. In keeping with the traditional definition of usage data, only the HTML views and PDF downloads of PLoS and PMC are investigated in this study. In addition, usage counts in PMC are significantly lower than those in PLoS, which means that the official PLoS websites are the primary channel through which users view and download articles.

Table 2 Descriptive statistics of views per journal
Table 3 Descriptive statistics of downloads per journal

Methods

Article classification strategy

Following the Pareto principle (or 80/20 rule), highly browsed and downloaded academic articles in this study are defined as the Top 20% of papers ranked by HTML views and PDF downloads on the PLoS and PMC platforms, respectively. To bring out the linguistic characteristics of the Top 20% papers by comparison, the total papers and the Bottom 20% papers are also included (the number of publications in each category is shown in Table 4).

Table 4 Category assignment based on different usage data
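The category assignment can be illustrated with a short sketch. It assumes that usage counts are held in a dictionary keyed by DOI and that the cut-off is the floor of 20% of the article count; the paper does not state how rounding or tied counts were handled, so those details, like the function name, are assumptions.

```python
def split_by_usage(usage_by_doi, share=0.2):
    """Return (top, bottom) DOI lists for the given share of articles ranked by usage."""
    # Rank DOIs from the most used to the least used.
    ranked = sorted(usage_by_doi, key=usage_by_doi.get, reverse=True)
    k = int(len(ranked) * share)  # assumed rounding rule: floor
    top = ranked[:k]
    bottom = ranked[-k:] if k else []
    return top, bottom


# Hypothetical HTML view counts for ten articles of one journal.
views = {f"10.1371/example.{i:04d}": i * 100 for i in range(1, 11)}
top20, bottom20 = split_by_usage(views)
```

The same function would be applied separately to each usage type (HTML views, PDF downloads) and each platform (PLoS, PMC), since the text defines the categories per usage type and platform.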

Indicators measuring linguistic characteristics

Linguistic complexity comprises two aspects: syntactic and lexical complexity. Syntactic complexity consists of quantitative variables on sentence length, sentence complexity, and others (Ferris 1994; Kormos 2011; Ojima 2006). Lexical complexity is made up of lexical diversity, lexical density, and lexical sophistication (Vajjala and Meurers 2012). Lu et al. (2019a, b) selected several indicators (sentence length, sentence complexity, lexical diversity, lexical density and lexical sophistication) to measure linguistic complexity.

In this study, a more comprehensive set of indicators of linguistic characteristics is adopted than in previous research. The indicators follow the structure of an academic article: “Title–Abstract–Keyword–Full text–Sentence–Word”. Specifically, title length, abstract length, full-text length, sentence length, lexical diversity, lexical density and lexical sophistication are included (shown in Table 5). “Keyword number” is not used because the original articles on the PLoS platform have no keywords. “Co-author number” is not selected because it measures co-authorship (Chi and Glänzel 2017) and, strictly speaking, cannot be classified as a linguistic characteristic. In addition, punctuation marks are removed before the calculations.

Table 5 Indicators measuring linguistic characteristics
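As a rough illustration of the length and diversity indicators, the sketch below computes word counts, average sentence length and the Type-Token Ratio with the standard library only. The tokenisation pattern, the sentence-splitting rule and the lower-casing of tokens before counting types are assumptions made for this example, not the procedure used in the study.

```python
import re

# Assumed tokenisation: runs of letters or digits, allowing internal hyphens or apostrophes.
WORD = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def words(text):
    """Return the word tokens of a text, with punctuation dropped."""
    return WORD.findall(text)


def word_length(text):
    """Length of a title, abstract or full text, in words."""
    return len(words(text))


def average_sentence_length(text):
    """Mean number of words per sentence, splitting after '.', '!' or '?'."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if words(s)]
    if not sentences:
        return 0.0
    return sum(len(words(s)) for s in sentences) / len(sentences)


def type_token_ratio(text):
    """Lexical diversity: distinct word forms divided by total word tokens."""
    tokens = [w.lower() for w in words(text)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


sample = "The cat sat. The dog sat on the mat."
ttr = type_token_ratio(sample)
```

A naive sentence splitter of this kind will miscount around abbreviations such as “et al.” or “Fig.”, so a dedicated sentence tokenizer would be preferable for real full texts.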

Applicability of indicators to PLoS and PubMed views and downloads

In this study, it is assumed that linguistic characteristics influence article views and downloads to some extent; that is, linguistic characteristics may be one reason why a paper is viewed or downloaded. Trial browsing and downloading of papers on the PLoS and PubMed platforms showed that, before the full text of a paper is browsed, only the paper title and author names are visible; after the title is clicked, the abstract and full text can be read. Readers can then choose to download the paper by clicking the “Download” button on the full-text page. Thus, before a paper is viewed, readers know only its title and author names, whereas before it is downloaded they know the title, author names, abstract, full text and so on. Therefore, only the indicator “title length” is applicable to PLoS and PubMed views, while the indicators “title length, abstract length, full-text length, sentence length, lexical diversity, lexical density and lexical sophistication” are applicable to PLoS and PubMed downloads.

Results

Title length distribution

In Fig. 1, each journal is plotted in a unique color, and the articles of the three categories in each journal are plotted as points inside or outside the boxes. The vertical line and the hollow square inside or outside each box denote the median and the mean, respectively. The mean title lengths of the three categories in each journal are listed in Table 6.

Fig. 1 Title word length distribution in PLoS and PMC platforms (color figure online)

Table 6 Mean of title and abstract length by three categories in each journal

Fig. 1 and Table 6 reveal that, on the PLoS platform, the Top 20% viewed and downloaded papers have shorter titles on average than the total and Bottom 20% papers in each journal (the only exceptions are the Top 20% viewed papers of PLoS Pathogens and the Top 20% downloaded papers of PLoS Biology). On the PMC platform, no regular pattern appears across journals. In general, the Top 20% viewed and downloaded papers of each journal have shorter average titles on the PLoS platform than on the PMC platform. Finally, among the Top 20% viewed and downloaded papers, PLoS Medicine has the longest titles on average, followed by PLoS One, PLoS Pathogens and PLoS Neglected Tropical Diseases, while PLoS Genetics, PLoS Computational Biology and PLoS Biology generally have the shortest.

A check of the submission guidelines of each PLoS journal shows that titles are subject to character limits: no more than 250 characters in PLoS Biology, PLoS One and PLoS Neglected Tropical Diseases, and no more than 200 characters in PLoS Computational Biology, PLoS Genetics, PLoS Pathogens and PLoS Medicine. Because these limits are expressed in characters rather than words, the distribution of title character length (with spaces removed) on the PLoS and PMC platforms is shown in Fig. 2. It reveals that the titles of most articles in each journal are within the limits of the submission guidelines. Within 200 characters, however, each journal has its own characteristic title character length.

Fig. 2 Title character length distribution in PLoS and PMC platforms (color figure online)

Abstract length and full-text length distribution

Fig. 3a and Table 6 reveal that the Top 20% downloaded papers have longer abstracts on average than the total or Bottom 20% papers in most journals, but the differences are marginal. Among the Top 20% downloaded papers, PLoS Medicine has the longest abstracts on average, followed by PLoS Neglected Tropical Diseases; the other journals generally have the shortest. According to the submission guidelines of each journal, there is no word limit for abstracts in PLoS Biology, a limit of no more than 300 words in PLoS Computational Biology, PLoS Genetics, PLoS One and PLoS Pathogens, fewer than 250–300 words in PLoS Neglected Tropical Diseases and fewer than 500 words in PLoS Medicine. The average abstract length of each journal is therefore probably affected by its submission guidelines.

Fig. 3 Abstract length and full-text length (≤ 30,000 words) distribution in PLoS and PMC platforms (color figure online)

To reveal more detail, only papers with “full-text length ≤ 30,000 words” are shown, taken from the global graph. Fig. 3b and Table 7 reveal that only on the PLoS platform (top right) do the Top 20% downloaded papers have a longer average full text than the total and Bottom 20% papers. In general, among the Top 20% downloaded papers, PLoS Biology, PLoS Computational Biology, PLoS Genetics and PLoS Pathogens have the longest full texts on average, followed by PLoS Medicine, PLoS One and PLoS Neglected Tropical Diseases. Finally, the Top 20% downloaded papers of each journal have a longer average full text on the PLoS platform than on the PMC platform (PLoS Pathogens is the only exception).

Table 7 Download mean of full-text, sentence length and lexical diversity by three categories in each journal

Sentence length and lexical diversity distribution

To reveal more detail, only papers with “average sentence length ≤ 50 words” are shown, taken from the global graph. Fig. 4a and Table 7 reveal that the Top 20% downloaded papers generally have a shorter average sentence length than the total and Bottom 20% papers, especially on the PMC platform, although the differences are marginal. Among the Top 20% downloaded papers, PLoS Medicine generally has the longest average sentence length, followed by PLoS Biology, PLoS Computational Biology, PLoS Genetics, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens.

Fig. 4 Sentence length (≤ 50 words) and lexical diversity distribution in PLoS and PMC platforms (color figure online)

Fig. 4b and Table 7 reveal that the average Type-Token Ratios of the Top 20% downloaded papers are greater than 20%. On the PLoS platform, the Top 20% downloaded papers have lower average Type-Token Ratios than the total and Bottom 20% papers in most journals, but the opposite holds on the PMC platform. Among the Top 20% downloaded papers, PLoS Neglected Tropical Diseases and PLoS One have the highest lexical diversity on average, followed by PLoS Pathogens, PLoS Medicine, PLoS Genetics and PLoS Biology; PLoS Computational Biology has the lowest.

Lexical density distribution

Lexical density is measured only by lexical items, namely nouns, verbs, adjectives and adverbs; other types of words, such as prepositions, are not considered in this study. To reveal more detail, only papers with “noun ratio ≤ 0.6” are shown, taken from the global graph. Figs. 5 and 6 show that, among the lexical items, nouns are used most frequently, followed by verbs and adjectives, and adverbs are used least.
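A minimal sketch of how such part-of-speech ratios could be computed is given below. It assumes that each paper has already been tokenised and tagged with coarse tags (`NOUN`, `VERB`, `ADJ`, `ADV`, in the style of the Universal POS tag set) and that each ratio is the count of that word class divided by all word tokens; neither the tagger nor the exact denominator is specified in the text, so both are assumptions.

```python
from collections import Counter

LEXICAL_TAGS = ("NOUN", "VERB", "ADJ", "ADV")


def lexical_ratios(tagged_tokens):
    """Return the share of nouns, verbs, adjectives and adverbs among all word tokens."""
    total = len(tagged_tokens)
    counts = Counter(tag for _, tag in tagged_tokens)
    return {tag: (counts[tag] / total if total else 0.0) for tag in LEXICAL_TAGS}


# Hypothetical pre-tagged tokens; punctuation is assumed to be removed already.
tagged = [
    ("articles", "NOUN"), ("attract", "VERB"), ("many", "ADJ"), ("views", "NOUN"),
    ("very", "ADV"), ("quickly", "ADV"), ("in", "ADP"), ("journals", "NOUN"),
]
ratios = lexical_ratios(tagged)
```

Under this definition the four ratios do not sum to one, because function words such as prepositions remain in the denominator.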

Fig. 5 Noun ratio (≤ 0.6) and verb ratio distribution in PLoS and PMC platforms (color figure online)

Fig. 6 Adjective and adverb ratio distribution in PLoS and PMC platforms (color figure online)

Fig. 5a reveals that, on the PMC platform, the Top 20% downloaded papers have a higher average noun ratio than the total and Bottom 20% papers in most journals, whereas on the PLoS platform there are no regular differences among the three categories. Among the Top 20% downloaded papers, PLoS Pathogens and PLoS Genetics have the highest average noun ratio (about 38–39%), followed by PLoS Neglected Tropical Diseases, PLoS One, PLoS Medicine and PLoS Biology (about 36–37%); PLoS Computational Biology has the lowest (around 34%).

Fig. 5b reveals that the differences in average verb ratio among the Top 20%, total and Bottom 20% downloaded papers of each journal are marginal; the ratio is about 15% throughout. Similarly, Fig. 6a reveals that the differences in average adjective ratio among the three categories of each journal are marginal, at around 11%, although PLoS Pathogens shows a lower average adjective ratio than the others (10.5% vs. 11%). Fig. 6b reveals that the differences in average adverb ratio among the three categories of each journal are also marginal, at around 3%, although PLoS Computational Biology shows a higher average adverb ratio than the others (4% vs. 3%).

Lexical sophistication distribution

Figures 7 and 8 show the distributions of the average length of lexical words (nouns, verbs, adjectives and adverbs) by category. In the “total papers” category, the average length of nouns (6.68) is greater than that of verbs (6.13) and adverbs (6.48), but less than that of adjectives (7.92). In the “Top 20% downloaded papers” category, adjectives have the greatest average length (7.96), followed by nouns (6.72), adverbs (6.53) and verbs (6.17).
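The average word length per word class could be obtained along the following lines. The sketch assumes pre-tagged tokens with coarse tags and measures length in characters; the unit is not stated explicitly in the text, so it is an assumption here, as is the function name.

```python
def average_word_length(tagged_tokens, tag):
    """Mean length, in characters, of the tokens carrying the given part-of-speech tag."""
    lengths = [len(token) for token, token_tag in tagged_tokens if token_tag == tag]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Hypothetical pre-tagged tokens for one paper.
tagged_sample = [
    ("articles", "NOUN"), ("attract", "VERB"), ("many", "ADJ"), ("views", "NOUN"),
    ("very", "ADV"), ("quickly", "ADV"), ("in", "ADP"), ("journals", "NOUN"),
]
noun_length = average_word_length(tagged_sample, "NOUN")
```

Averaging these per-paper values within each category would then give figures comparable in form to those quoted for the “total papers” and “Top 20% downloaded papers” categories.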

Fig. 7 Average noun and verb length distribution in PLoS and PMC platforms (color figure online)

Fig. 8 Average adjective and adverb length distribution in PLoS and PMC platforms (color figure online)

In Fig. 7a, PLoS Computational Biology and PLoS Medicine have the longest average noun length, followed by PLoS Neglected Tropical Diseases, PLoS One and PLoS Genetics; PLoS Biology and PLoS Pathogens have the shortest. In Fig. 7b, PLoS Pathogens has the longest average verb length, followed by PLoS Biology, PLoS Genetics, PLoS One, PLoS Medicine and PLoS Neglected Tropical Diseases; PLoS Computational Biology has the shortest. In Fig. 8a, PLoS Pathogens has the longest average adjective length, followed by PLoS Biology, PLoS Genetics, PLoS One, PLoS Medicine, PLoS Computational Biology and PLoS Neglected Tropical Diseases. In Fig. 8b, PLoS Pathogens and PLoS Genetics have the longest average adverb length, followed by PLoS One, PLoS Biology, PLoS Neglected Tropical Diseases and PLoS Medicine; PLoS Computational Biology has the shortest.

Statistical significance test

Two-sample Kolmogorov–Smirnov (K–S) tests of linguistic characteristics between the “Top 20% papers” and “BTM 20% papers” categories were performed, and the p values are shown in Tables 8 and 10; they indicate whether the differences in the characteristics between the categories are statistically significant. Spearman’s correlation coefficients between the usage data of Top 20% articles and their linguistic characteristics were also calculated and are shown in Tables 9 and 11.
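For readers unfamiliar with the two statistics, the sketch below implements both with the standard library. It is an illustration only: the paper does not say which software was used, the K–S p value here relies on the asymptotic Kolmogorov series with a small-sample correction (an approximation), and in practice library routines such as `scipy.stats.ks_2samp` and `scipy.stats.spearmanr` would normally be preferred.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Return (D, approximate p value) for the two-sample K-S test."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # D is the largest gap between the two empirical CDFs over all observed values.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    if d == 0:
        return 0.0, 1.0
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d  # small-sample correction (approximation)
    total, sign = 0.0, 1.0
    for k in range(1, 101):
        term = sign * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-10:
            return d, min(max(2.0 * total, 0.0), 1.0)
        sign = -sign
    return d, 1.0  # series did not converge: D is negligibly small


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    sx = math.sqrt(sum((p - mx) ** 2 for p in rx))
    sy = math.sqrt(sum((q - my) ** 2 for q in ry))
    return cov / (sx * sy) if sx and sy else float("nan")
```

In the setting described above, `ks_two_sample` would receive, for example, the title lengths of the Top 20% and BTM 20% papers of one journal on one platform, and `spearman_rho` the usage counts and one linguistic indicator of the Top 20% papers.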

Table 8 p values of Kolmogorov–Smirnov test of title length between “Top 20% viewed” and “BTM 20% viewed” in different journals and platforms
Table 9 Spearman’s correlation coefficient between views and title length of Top 20% viewed articles in different journals and platforms

From Tables 8 and 10, about 40% of the K–S test results for linguistic characteristics between the “Top 20% viewed and downloaded” and “BTM 20% viewed and downloaded” categories across journals and platforms are statistically significant. Between the “Top 20% viewed” and “BTM 20% viewed” categories, the difference in title length is statistically significant for PLoS Computational Biology, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens, especially on the PLoS platform. Between the “Top 20% downloaded” and “BTM 20% downloaded” categories, the difference in title length is statistically significant for PLoS Computational Biology, PLoS Genetics, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens, especially on the PLoS platform, and the difference in average sentence length is statistically significant for PLoS Biology, PLoS Computational Biology, PLoS Genetics, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens, especially on the PMC platform. The other linguistic characteristics also show statistically significant differences between these two categories, but only in particular journals and platforms.

Table 10 p values of Kolmogorov–Smirnov test of linguistic characteristics between “Top 20% downloaded” and “BTM 20% downloaded” in different journals and platforms

Tables 9 and 11 show very weak positive or slightly negative correlations between usage data and linguistic characteristics in general. Among the Top 20% viewed articles, views are slightly negatively correlated with title length in PLoS Genetics, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens, especially on the PLoS platform. Among the Top 20% downloaded articles, downloads are slightly negatively correlated with title length in PLoS Neglected Tropical Diseases and PLoS One on the PLoS platform. For the Top 20% downloaded articles in PLoS One, downloads are slightly negatively correlated with lexical diversity, noun ratio, verb length, adjective length and adverb length, especially on the PLoS platform. For the Top 20% downloaded articles in PLoS Biology, downloads show a weak negative correlation with adjective ratio and adverb length on the PMC platform and a moderate negative correlation with adverb length on the PLoS platform.

Table 11 Spearman’s correlation coefficient between downloads and linguistic characteristics of Top 20% downloaded articles in different journals and platforms

Discussions and conclusions

This paper applied computational linguistics to understand the relationship between linguistic characteristics and article views and downloads. The means and medians show marginal differences among categories for most linguistic characteristics, and the significance tests generally indicate no statistical significance. However, for certain linguistic characteristics (e.g. title length and average sentence length) in particular PLoS journals and platforms, the differences are statistically significant.

Although most linguistic characteristics play little role in article views and downloads in our data sets overall, some (e.g. title length and average sentence length) do play a role in specific PLoS journals and platforms. The academic papers in this study also follow certain linguistic patterns. For example, the average sentence length of the sample papers is usually greater than 22 words; their average Type-Token Ratios are greater than 20%; and the average ratios of nouns, verbs, adjectives and adverbs are about 35–39%, 15%, 11% and 3%, respectively. In addition, each journal has its own linguistic characteristics, and differences in linguistic characteristics also exist between the two platforms.

Jamali and Nikzad (2011) found that articles with longer titles were downloaded slightly less than articles with shorter titles, whereas Duan and Xiong (2017) found only weak correlations between total downloads and title length and held that the correlation could differ because of differences in the data. In our view, social factors should also be considered; for example, each journal has its own submission guidelines that limit the number of characters or words in an article.

Apart from the submission guidelines of journals (e.g. word limits on title, abstract and full-text length), other social factors should also be taken into account in order to understand the linguistic characteristics of academic articles. First, each discipline follows its own research paradigm and has its own terminology. Second, users of different ages, positions and academic backgrounds prefer different academic platforms for acquiring papers. In empirical research, attention should also be paid to balancing disciplines and journals in sampling and grouping, and to comparing or combining usage data from different academic platforms.

This study has some limitations. Only papers published between 2014 and 2015 in PLoS journals are investigated; the conclusions might therefore differ for larger samples and other journals, which requires further experiments. Only a few basic indicators of linguistic characteristics are adopted; more diverse and semantic indicators could be incorporated. Despite these limitations, we hope that this first introduction of multi-granularity linguistic characteristics to usage metrics provides a new perspective. In future work, in-depth interviews and experiments on user behavior will be combined with linguistic characteristics to investigate user motivation and behavior patterns in usage metrics.