Abstract
Views and downloads of academic articles have become important supplementary indicators of scholarly impact, and it is plausible that linguistic characteristics influence them to some extent. To understand the relationship between linguistic characteristics and article views and downloads, this study selected 63,002 full-text articles published from 2014 to 2015 in the PLoS (Public Library of Science) journals (PLoS Biology, PLoS Computational Biology, PLoS Genetics, PLoS Medicine, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens) and introduced seven indicators (title length, abstract length, full-text length, sentence length, lexical diversity, lexical density and lexical sophistication) to measure the linguistic characteristics of articles, grouped into Top 20% viewed and downloaded (a proxy for highly browsed and downloaded articles), total, and Bottom 20% viewed and downloaded categories. The results suggest that, in general, most linguistic characteristics played little role in article views and downloads in our data sets, but some characteristics (e.g. title length and average sentence length) played a certain role in specific PLoS journals and platforms (the PLoS platform or the PubMed Central platform). Journal and platform differences in the linguistic characteristics of highly viewed and downloaded articles also exist.
Introduction
Usage metrics of academic articles have become increasingly popular in scientometrics. Article views and downloads are frequently used as important supplementary indicators to measure scholarly impact, identify the latest research trends in disciplines and explore usage patterns. Meanwhile, researchers conduct correlation analyses to investigate the relationships among article views, downloads, citations, co-author counts, funding data and so on, trying to probe why articles are viewed or downloaded. However, most of the research above is limited to bibliographic data.
Usage metrics need to be studied from a broader perspective. The increasing availability of the full text of scientific articles in machine-readable electronic formats is an opportunity to greatly advance scientometrics. In-text citations and entity metrics are typical examples of full-text analysis in scientometrics (Ding et al. 2013). Similarly, it is promising and valuable to introduce full-text analysis to usage metrics.
In this study, it is assumed that linguistic characteristics have an influence on article views and downloads to some extent. To understand the relationship between linguistic characteristics and article views and downloads, linguistic characteristics (including title length, abstract length, full text length, sentence length, lexical diversity, lexical density and lexical sophistication) jointly with usage metrics are investigated.
Literature review
Usage metrics
Usage metrics research in scientometrics has mainly focused on the following topics. Firstly, user behavior patterns, such as scientists’ working timetables (Wang et al. 2012, 2013a), user preferences (Chen 2018; Chen et al. 2018; Davis and Solla 2003; Davis 2006; Wang et al. 2016a, b) and temporal usage patterns (Chen et al. 2017; Khan and Younas 2017). Secondly, the obsolescence of articles from a diachronic or synchronic perspective. For example, Moed (2005) and Moed and Halevi (2016) studied the diachronic and synchronic obsolescence of usage data from the perspective of journals and countries, and Gorraiz et al. (2014) conducted similar research from the perspective of disciplines.
Thirdly, identifying the latest research trends in disciplines (Bollen et al. 2002; Wang et al. 2013b). Fourthly, indicators to evaluate the performance of journals, authors, groups and countries (Chi and Glänzel 2018; De Sordi et al. 2016; Wan et al. 2010), or supplementary metrics used jointly with altmetrics measures (Bollen et al. 2005; Kurtz and Henneken 2016). Finally, correlations between specific usage types, including downloads and citations (Kurtz and Bollen 2010; O’Leary 2008; Schloegl et al. 2014; Subotic and Mukherjee 2014; Zhao 2017), usage data across different platforms (Chen 2018; Chen et al. 2017), and usage data versus co-author counts (Chi and Glänzel 2017) or funding data (Zhao et al. 2018).
Full-text analysis in scientometrics
Full text contains additional information that has not been available in bibliographic data. At a minimum this includes reference position, proximity of cited references within the text, multiple references at the same reference point, multiple mentions of references (so‐called op. cit.), section information, and words indicating how an author feels about a reference (i.e., citation contexts or sentiments). Full text also contains a relatively high level of detail about motivation, methods, data, instruments, results, and conclusions that authors typically report when documenting and submitting their work for publication (Boyack et al. 2013).
Full-text analysis in scientometrics has mainly focused on the following topics. Firstly, in-text citations, such as reference position (e.g. Hu et al. 2013; Boyack et al. 2018), proximity of cited references (e.g. Gipp and Beel 2009; Liu and Chen 2012; Boyack et al. 2013; Kim et al. 2016), citation contexts or sentiments (e.g. Small 2011; Liu and Chen 2013; Ding et al. 2014; Lu et al. 2017), and citation motivation or behavior (e.g. Brooks 1986; Cano 1989; Bonzi and Snyder 1991; Case and Higgins 2000; Zhang et al. 2018). Secondly, entity metrics, such as scientific concepts (e.g. Ding et al. 2013; Mckeown et al. 2016), datasets (e.g. Belter 2014), software (Pan et al. 2015, 2016, 2018, 2019) and algorithms (e.g. Wang and Zhang 2018). Finally, the linguistic complexity of scientific writing styles and scientific impact (e.g. Lu et al. 2019a, b) and the characteristics of highly cited articles (e.g. Elgendi 2019).
Most research on usage metrics focuses on numerical analysis. A few studies analyze textual content jointly with usage metrics, but they are limited to traditional bibliometric methods (e.g. keyword frequency and ratio, bibliographic coupling, co-word analysis and correlation analysis) and bibliographic data. Full-text analysis in scientometrics mainly focuses on in-text citations and entity metrics, and it is increasingly expanding to other hot topics in scientometrics, such as scientific writing styles and scientific impact.
Research questions
The application of full-text analysis to understand the relationship between linguistic characteristics and article views and downloads has not been thoroughly investigated. To address this research gap, full-text analysis is used to explore the linguistic characteristics of highly browsed and downloaded papers. In this study, we are interested in the following research questions in the context of seven journals published by PLoS:
1. Are there any relationships between linguistic characteristics and highly browsed academic articles?
2. Are there any relationships between linguistic characteristics and highly downloaded academic articles?
3. Are there journal and platform differences in the linguistic characteristics of highly browsed and downloaded academic articles?
Methodology
Data
The data in this study consist of 63,002 full-text articles published from 2014 to 2015 in the PLoS journal family, a set of peer-reviewed journals covering various disciplines. Usage counts and other metadata were collected from PLoS between November 1 and November 7, 2018. The PLoS journals are also indexed by PMC (PubMed Central) and Web of Science (WoS), from which usage counts and other metadata were crawled during the same period. The PLoS, PMC and WoS usage counts of each article, along with other metadata, were aggregated by DOI or article title.
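This aggregation step can be sketched as a DOI-keyed join of the three platforms' records. The sketch below uses pandas with illustrative column names and toy values (not the study's actual crawler output), and omits the fallback match on article title:

```python
import pandas as pd

# Hypothetical per-platform usage records, keyed by DOI.
plos = pd.DataFrame({"doi": ["10.1371/a", "10.1371/b"],
                     "plos_html": [500, 120], "plos_pdf": [80, 30]})
pmc = pd.DataFrame({"doi": ["10.1371/a", "10.1371/b"],
                    "pmc_html": [200, 60], "pmc_pdf": [40, 10]})
wos = pd.DataFrame({"doi": ["10.1371/a", "10.1371/b"],
                    "wos_usage": [15, 3]})

# Left-join on DOI so every PLoS article keeps a row even if a
# platform is missing counts for it.
merged = plos.merge(pmc, on="doi", how="left").merge(wos, on="doi", how="left")
```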
The publication time span of the data set ranges from January 2014 to December 2015, because usage counts, and especially citations, accumulate to a steady level in the first 2 or 3 years after publication (Lippi and Favaloro 2013). Editorials and letters are excluded and only research articles pre-labeled by PLoS are kept; the final datasets are shown in Table 1.
Usage counts of PLoS
PLoS offers Article-Level Metrics (ALMs) for each journal article. PLoS ALMs draw from the following sources:
- Viewed: PLoS Journals (HTML, PDF, XML), PubMed Central (HTML, PDF)
- Saved: CiteULike, Mendeley
- Cited: CrossRef, DataCite, Europe PMC, PubMed Central, Scopus, Web of Science
- Recommended: F1000 Prime
- Discussed: PLoS Comments, Facebook, Reddit, Twitter, Wikipedia
PLoS articles are provided in three different formats (page views, PDF downloads, and XML downloads), and the online activity of users is recorded across all three. This "usage", comprising the three types, is provided as an aggregate metric or broken down month by month in graphical format. Online usage via the PLoS platform is presented according to industry-standard definitions of usage and is COUNTER-compliant.
PLoS also displays COUNTER 3-compliant PMC usage data for each article. PMC separately counts the number of page views and PDF downloads of the article on its site. The results are made available to PLoS only once a month, not in real time, so the display of PMC data may lag by up to one month. This also affects recently published articles, which may not show PMC usage data for their first month of publication.
Usage counts of WoS
The usage count is a measure of the level of interest in a specific item on the WoS platform. The count reflects the number of times the article has met a user’s information needs, as demonstrated by clicking links to the full-length article at the publisher’s website (via direct link or OpenURL) or by saving the article for use in a bibliographic management tool (via direct export or in a format to be imported later). The usage count is a record of all activity performed by all WoS users, not just users at a particular institution. Usage counts for different versions of the same item on the WoS platform are unified, and usage counts are updated daily. There are two kinds of usage counts on the WoS platform.
- Last 180 days: the number of times the full text of a record has been accessed or the record has been saved in the last 180 days. This count can move up or down as the end date of the fixed period advances.
- Since 2013: the number of times the full text of a record has been accessed or the record has been saved since February 1, 2013. This count can increase or remain static over time.
In general, "usage" often refers to HTML views and PDF downloads. HTML views and PDF downloads in PLoS and PMC are traditional usage data, while XML downloads in PLoS are a new type of usage data. Departing from traditional definitions, WoS defines "usage" as "clicking" and "saving" (Wang et al. 2016a); more accurately, WoS usage comprises "HTML views" and "saving" (Chen 2018). Tables 2 and 3 show that usage counts in WoS and XML downloads in PLoS are considerably lower than HTML views and PDF downloads in PLoS and PMC. Considering the traditional definitions of usage data, only HTML views and PDF downloads in PLoS and PMC are investigated in this study. Besides, usage counts in PMC are also significantly lower than those in PLoS, which suggests that the PLoS official websites are the primary channel through which users view and download articles.
Methods
Article classification strategy
Following the Pareto principle (the 80/20 rule), highly browsed and downloaded academic articles in this study are defined as the Top 20% of papers ranked by HTML views and PDF downloads on the PLoS and PMC platforms respectively. In order to comparatively uncover the linguistic characteristics of the Top 20% papers, total papers and Bottom 20% papers are also incorporated (the number of publications in each category is shown in Table 4).
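The classification above amounts to ranking articles by a usage count and slicing off the top and bottom fifths. A minimal sketch, with hypothetical record fields and toy view counts:

```python
def split_top_bottom(articles, key, share=0.2):
    """Rank `articles` by `key` (descending) and return the top and
    bottom `share` slices, e.g. the Top 20% and Bottom 20% groups."""
    ranked = sorted(articles, key=lambda a: a[key], reverse=True)
    k = max(1, int(len(ranked) * share))
    return ranked[:k], ranked[-k:]

# Ten toy articles with illustrative HTML view counts.
papers = [{"doi": f"10.1371/p{i}", "html_views": v}
          for i, v in enumerate([900, 40, 300, 25, 610, 75, 150, 500, 10, 220])]
top, bottom = split_top_bottom(papers, "html_views")
```

The same split would be run once per platform and per usage type (HTML views and PDF downloads), giving the Top 20% and Bottom 20% categories compared in the Results.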
Indicators measuring linguistic characteristics
Linguistic complexity comprises two aspects: syntactic and lexical complexity. Syntactic complexity consists of quantitative variables on sentence length, sentence complexity, and others (Ferris 1994; Kormos 2011; Ojima 2006). Lexical complexity is made up of lexical diversity, lexical density, and lexical sophistication (Vajjala and Meurers 2012). Lu et al. (2019a, b) selected several indicators (sentence length, sentence complexity, lexical diversity, lexical density and lexical sophistication) to measure linguistic complexity.
In this study, a more comprehensive set of indicators of linguistic characteristics was adopted compared with former research. The indicators selected follow the structure of an academic article: “Title–Abstract–Keyword–Full text–Sentence–Word”. Specifically, title length, abstract length, full-text length, sentence length, lexical diversity, lexical density and lexical sophistication are incorporated (shown in Table 5). “Keyword number” is not applied because the original articles on the PLoS platform carry no keywords. “Co-author number” is not selected because it measures co-authorship (Chi and Glänzel 2017) and, strictly speaking, cannot be classified as a linguistic characteristic. In addition, punctuation marks are removed from the calculations.
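Several of the indicators in Table 5 reduce to simple counts over tokenized text. The sketch below illustrates them on a toy two-sentence abstract (the regular-expression tokenizer and the example text are ours, not the study's pipeline; note that punctuation is stripped, as in the study):

```python
import re

text = ("Usage metrics supplement citation counts. "
        "Linguistic characteristics may influence usage metrics.")

# Split into sentences on terminal punctuation, then tokenize into
# alphabetic words only (punctuation removed).
sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
tokens = re.findall(r"[A-Za-z]+", text.lower())

full_text_length = len(tokens)                           # words in the text
avg_sentence_length = full_text_length / len(sentences)  # words per sentence
lexical_diversity = len(set(tokens)) / len(tokens)       # Type-Token Ratio
avg_word_length = sum(len(t) for t in tokens) / len(tokens)  # sophistication proxy
```

Lexical density (the share of nouns, verbs, adjectives and adverbs) additionally requires a part-of-speech tagger and is not shown here.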
Applicability of indicators to PLoS and PubMed views and downloads
In this study, it is assumed that linguistic characteristics influence article views and downloads to some extent, i.e. that linguistic characteristics may be part of the reason why a paper is viewed or downloaded. Experiments with browsing and downloading papers on the PLoS and PubMed platforms show that before browsing the full text of a paper, only the paper title and author names are visible; after clicking the paper title, the abstract and full text can be read. Readers can then choose to download the paper by clicking the “Download” button on the full-text page. So, before a paper is viewed, readers only know its title and author names, and before it is downloaded, readers know its title, author names, abstract, full text and so on. Therefore, only the indicator “title length” is applicable to PLoS and PubMed views, whereas the indicators “title length, abstract length, full-text length, sentence length, lexical diversity, lexical density and lexical sophistication” are all applicable to PLoS and PubMed downloads.
Results
Title length distribution
In Fig. 1, each journal is plotted in a unique color, articles of the three categories in each journal are plotted as points inside or outside the boxes, and the vertical line and hollow square inside or outside each box denote the median and mean respectively. The mean title length of the three categories in each journal is listed in Table 6.
Figure 1 and Table 6 reveal that Top 20% viewed and downloaded papers have shorter titles on average than total and Bottom 20% papers in each journal on the PLoS platform (the only exceptions being Top 20% viewed papers of PLoS Pathogens and Top 20% downloaded papers of PLoS Biology), whereas no regular pattern emerges on the PMC platform. In general, Top 20% viewed and downloaded papers of each journal have shorter average titles on the PLoS platform than on the PMC platform. Finally, among Top 20% viewed and downloaded papers, PLoS Medicine has the longest titles on average, followed by PLoS One, PLoS Pathogens and PLoS Neglected Tropical Diseases, then PLoS Genetics, with PLoS Computational Biology and PLoS Biology generally the shortest.
Tracing the submission guidelines of each PLoS journal shows that there are character limits for titles: no more than 250 characters in PLoS Biology, PLoS One and PLoS Neglected Tropical Diseases, and no more than 200 characters in PLoS Computational Biology, PLoS Genetics, PLoS Pathogens and PLoS Medicine. Since words consist of characters, the distribution of title character length (with spaces removed) on the PLoS and PMC platforms is shown in Fig. 2, which reveals that the character length of most articles in each journal is within the limits of the submission guidelines. Within 200 characters, however, each journal has its own title character length distribution.
Abstract length and full-text length distribution
Figure 3a and Table 6 reveal that Top 20% downloaded papers have longer abstracts on average than total or Bottom 20% papers in most journals, but the differences are marginal. Among Top 20% downloaded papers, PLoS Medicine has the longest abstracts on average, followed by PLoS Neglected Tropical Diseases, with the other journals generally the shortest. Checking the submission guidelines of each journal shows that there is no word limit for abstracts in PLoS Biology, a limit of no more than 300 words in PLoS Computational Biology, PLoS Genetics, PLoS One and PLoS Pathogens, 250–300 words in PLoS Neglected Tropical Diseases, and fewer than 500 words in PLoS Medicine. The average abstract length of each journal is probably affected by its submission guidelines.
In order to reveal more details, only papers with a full-text length of at most 30,000 words are shown from the global graph. Figure 3b and Table 7 reveal that only Top 20% downloaded papers on the PLoS platform (top right) have longer full texts on average than total and Bottom 20% papers. Generally, among Top 20% downloaded papers, PLoS Biology, PLoS Computational Biology, PLoS Genetics and PLoS Pathogens have the longest full texts on average, followed by PLoS Medicine, PLoS One and PLoS Neglected Tropical Diseases. Finally, Top 20% downloaded papers of each journal have longer average full texts on the PLoS platform than on the PMC platform (with PLoS Pathogens the only exception).
Sentence length and lexical diversity distribution
In order to reveal more details, only papers with an average sentence length of at most 50 words are shown from the global graph. Figure 4a and Table 7 reveal that Top 20% downloaded papers generally have shorter average sentences than total and Bottom 20% papers, especially on the PMC platform, although the differences are marginal. Among Top 20% downloaded papers, PLoS Medicine has the longest average sentences, followed by PLoS Biology, PLoS Computational Biology, PLoS Genetics, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens.
Figure 4b and Table 7 reveal that the average Type-Token Ratios of Top 20% downloaded papers are greater than 20%. They also reveal that Top 20% downloaded papers have lower average Type-Token Ratios than total and Bottom 20% papers in most journals on the PLoS platform, whereas the results are the opposite on the PMC platform. Among Top 20% downloaded papers, PLoS Neglected Tropical Diseases and PLoS One have the highest lexical diversity on average, followed by PLoS Pathogens, PLoS Medicine, PLoS Genetics and PLoS Biology, with PLoS Computational Biology the lowest.
Lexical density distribution
Lexical density is measured using lexical items only, namely nouns, verbs, adjectives and adverbs; other types of words, for example prepositions, are not considered in this study. In order to reveal more details, only papers with a noun ratio of at most 0.6 are shown from the global graph. Figures 5 and 6 show that, among the lexical items, nouns are used most frequently, followed by verbs and adjectives, with adverbs the least frequent.
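Given part-of-speech tags, lexical density and the individual content-word ratios are simple proportions. The sketch below uses a tiny hand-made POS lookup for illustration; the study would rely on a real POS tagger over full article text:

```python
# Toy POS lexicon and token sequence (illustrative, not the study's tagger).
pos = {"usage": "NOUN", "metrics": "NOUN", "may": "AUX", "strongly": "ADV",
       "influence": "VERB", "scholarly": "ADJ", "impact": "NOUN", "the": "DET"}
tokens = ["usage", "metrics", "may", "strongly", "influence",
          "the", "scholarly", "impact"]

content_tags = {"NOUN", "VERB", "ADJ", "ADV"}  # lexical items only
tags = [pos[t] for t in tokens]

# Lexical density: share of content words among all tokens.
lexical_density = sum(tag in content_tags for tag in tags) / len(tokens)
# Per-class ratios, e.g. the noun ratio reported in Fig. 5a.
noun_ratio = tags.count("NOUN") / len(tokens)
```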
Figure 5a reveals that Top 20% downloaded papers have higher average noun ratios than total and Bottom 20% papers in most journals on the PMC platform, but show no regular differences among the three categories on the PLoS platform. Among Top 20% downloaded papers, PLoS Pathogens and PLoS Genetics have the highest average noun ratios (about 38–39%), followed by PLoS Neglected Tropical Diseases, PLoS One, PLoS Medicine and PLoS Biology (about 36–37%), with PLoS Computational Biology the lowest (around 34%).
Figure 5b reveals that the differences in average verb ratio among Top 20%, total and Bottom 20% downloaded papers of each journal are marginal, at roughly 15%. Similarly, Fig. 6a reveals that the differences in average adjective ratio among the three categories of each journal are marginal, at around 11%, although PLoS Pathogens shows a lower average adjective ratio than the others (10.5% vs. 11%). Figure 6b reveals that the differences in average adverb ratio among the three categories of each journal are also marginal, at around 3%, although PLoS Computational Biology shows a higher average adverb ratio than the others (4% vs. 3%).
Lexical sophistication distribution
Figures 7 and 8 show the distributions of the average length of lexical words (nouns, verbs, adjectives and adverbs) by category. Generally, in the “total papers” category, the average length of nouns (6.68) is greater than that of verbs (6.13) and adverbs (6.48) but less than that of adjectives (7.92). In the “Top 20% downloaded papers” category, the average length of adjectives (7.96) is the greatest, followed by nouns (6.72), adverbs (6.53) and verbs (6.17).
From Fig. 7a, PLoS Computational Biology and PLoS Medicine have the longest nouns on average, followed by PLoS Neglected Tropical Diseases, PLoS One and PLoS Genetics, with PLoS Biology and PLoS Pathogens the shortest. From Fig. 7b, PLoS Pathogens has the longest verbs on average, followed by PLoS Biology, PLoS Genetics, PLoS One, PLoS Medicine and PLoS Neglected Tropical Diseases, with PLoS Computational Biology the shortest. From Fig. 8a, PLoS Pathogens has the longest adjectives on average, followed by PLoS Biology, PLoS Genetics, PLoS One, PLoS Medicine, PLoS Computational Biology and PLoS Neglected Tropical Diseases. From Fig. 8b, PLoS Pathogens and PLoS Genetics have the longest adverbs on average, followed by PLoS One, PLoS Biology, PLoS Neglected Tropical Diseases and PLoS Medicine, with PLoS Computational Biology the shortest.
Statistical significance test
Two-sample Kolmogorov–Smirnov (K–S) tests of the linguistic characteristics between the “Top 20% papers” and “Bottom 20% papers” categories were conducted; the p-values, shown in Tables 8 and 10, indicate whether the differences in each characteristic between categories are statistically significant. Spearman’s correlation coefficients between the usage data of Top 20% articles and their linguistic characteristics were also investigated, and are shown in Tables 9 and 11.
From Tables 8 and 10, about 40% of the K–S test results for linguistic characteristics between the “Top 20% viewed and downloaded” and “Bottom 20% viewed and downloaded” categories across journals and platforms suggest statistical significance. Between the “Top 20% viewed” and “Bottom 20% viewed” categories, title length suggests statistical significance in PLoS Computational Biology, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens, especially on the PLoS platform. Between the “Top 20% downloaded” and “Bottom 20% downloaded” categories, title length suggests statistical significance in PLoS Computational Biology, PLoS Genetics, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens, especially on the PLoS platform, and average sentence length suggests statistical significance in PLoS Biology, PLoS Computational Biology, PLoS Genetics, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens, especially on the PMC platform. The other linguistic characteristics also show statistical significance between the “Top 20% downloaded” and “Bottom 20% downloaded” categories, but the results depend on the journal and platform.
Tables 9 and 11 show very weak positive or slightly negative correlations between usage data and linguistic characteristics in general. Among Top 20% viewed articles, views are slightly negatively correlated with title length in PLoS Genetics, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens, especially on the PLoS platform. Among Top 20% downloaded articles, downloads are slightly negatively correlated with title length in PLoS Neglected Tropical Diseases and PLoS One on the PLoS platform. For Top 20% downloaded articles in PLoS One, downloads are slightly negatively correlated with lexical diversity, noun ratio, verb length, adjective length and adverb length, especially on the PLoS platform. For Top 20% downloaded articles in PLoS Biology, downloads are weakly negatively correlated with adjective ratio and adverb length on the PMC platform and moderately negatively correlated with adverb length on the PLoS platform.
Discussions and conclusions
This paper applied computational linguistics to understand the relationship between linguistic characteristics and article views and downloads. The mean and median results show marginal differences for most linguistic characteristics among categories, and the statistical significance tests generally indicate no statistical significance; however, certain linguistic characteristics (e.g. title length and average sentence length) are statistically significant in particular PLoS journals and platforms.
Although most linguistic characteristics play little role in article views and downloads in our data sets in general, some (e.g. title length and average sentence length) play a certain role in specific PLoS journals and platforms. The academic papers in this study also follow some patterns of linguistic characteristics. For example, the average sentence length of the sample papers is usually greater than 22 words; their average Type-Token Ratios are greater than 20%; and the average ratios of nouns, verbs, adjectives and adverbs are about 35–39%, 15%, 11% and 3% respectively. Besides, each journal has its own linguistic characteristics, and differences in linguistic characteristics between the two platforms also exist.
Jamali and Nikzad (2011) found that articles with longer titles were downloaded slightly less than articles with shorter titles, whereas Duan and Xiong (2017) found only weak correlations between total downloads and title length and held that the correlation between downloads and title length could differ with the data. In our view, social factors should be considered; for example, each journal has unique submission guidelines that limit the length of titles, abstracts or full texts in characters or words.
Apart from the submission guidelines of journals (e.g. word limits on title, abstract and full-text length), other social factors should also be incorporated to understand the linguistic characteristics of academic articles. First of all, each discipline follows its own research paradigm and covers unique terminology. In addition, diverse users of various ages, positions and academic backgrounds prefer different academic platforms for acquiring academic papers. In empirical research, keeping a balance between disciplines and journals in sampling and grouping, and comparing or combining usage data from different academic platforms, should also be valued.
There are also some limitations in this study. Only papers published between 2014 and 2015 in PLoS journals are investigated, so the conclusions might differ for larger samples and other journals, which calls for further experiments. Only several basic indicators of linguistic characteristics are adopted; more diversified and semantic indicators could be incorporated. Despite these limitations, we hope that this first introduction of multi-granularity linguistic characteristics to usage metrics provides a new perspective. In further study, in-depth interviews and experiments on user behavior will be combined with linguistic characteristics to investigate user motivation and behavior patterns in usage metrics.
References
Belter, C. W. (2014). Measuring the value of research data: A citation analysis of oceanographic data sets. PLoS ONE,9(3), e92590.
Bollen, J., Luce, R., Vemulapalli, S. S., & Xu, W. (2002). Usage analysis for the identification of research trends in digital libraries. D-Lib Magazine. https://doi.org/10.1045/may2003-bollen.
Bollen, J., Sompel, H. V. D., Smith, J. A., & Luce, R. (2005). Toward alternative metrics of journal impact: A comparison of download and citation data. Information Processing and Management,41(6), 1419–1440.
Bonzi, S., & Snyder, H. W. (1991). Motivations for citation: A comparison of self citation and citation to others. Scientometrics,21(2), 245–254.
Boyack, K. W., Eck, N. J. V., Colavizza, G., & Waltman, L. (2018). Characterizing in-text citations in scientific articles: A large-scale analysis. Journal of Informetrics,12(1), 59–73.
Boyack, K. W., Small, H., & Klavans, R. (2013). Improving the accuracy of co-citation clustering using full text. Journal of the American Society for Information Science and Technology,64(9), 1759–1767.
Brooks, T. A. (1986). Evidence of complex citer motivations. Journal of the Association for Information Science and Technology,37(1), 34–36.
Cano, V. (1989). Citation behavior: Classification, utility, and location. Journal of the Association for Information Science and Technology,40(4), 284–290.
Case, D. O., & Higgins, G. M. (2000). How can we investigate citation behavior? A study of reasons for citing literature in communication. Journal of the American Society for Information Science,51(7), 635–645.
Chen, B. (2018). Usage pattern comparison of the same scholarly articles between Web of Science (WoS) and Springer. Scientometrics,115(1), 519–537.
Chen, B., Zhong, Z., & Zhan, C. (2017). Usage pattern analysis of academic articles from two Chinese journals. In K. Holmberg & J. Vainio (Eds.), Proceedings of ISSI 2017 (pp. 366–375). Wuhan: Wuhan University.
Chen, B., Zhou, H., Zhong, Z., & Wang, Y. (2018). Exploring the user platform preference and user interest preference of chinese scholarly articles: A comparison based on usage metrics. Journal of Library Science in China,44(6), 90–104. (in Chinese).
Acknowledgements
This work was supported by the Youth Program of the National Social Science Fund of China (15CTQ035), the Social Public Safety S&T Collaborative Innovation Center of Universities in Jiangsu Province, and the China Scholarship Council (No. 201906845042).
Cite this article
Chen, B., Deng, D., Zhong, Z. et al. Exploring linguistic characteristics of highly browsed and downloaded academic articles. Scientometrics 122, 1769–1790 (2020). https://doi.org/10.1007/s11192-020-03361-4