Abstract
Views and downloads of academic articles have become important supplementary indicators of scholarly impact, and it is plausible that linguistic characteristics influence them to some extent. To understand the relationship between linguistic characteristics and article views and downloads, this study selected 63,002 full-text articles published from 2014 to 2015 in the PLoS (Public Library of Science) journals (PLoS Biology, PLoS Computational Biology, PLoS Genetics, PLoS Medicine, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens) and introduced seven indicators (title length, abstract length, full-text length, sentence length, lexical diversity, lexical density and lexical sophistication) to measure the linguistic characteristics of articles, grouped into Top 20% viewed and downloaded (a proxy for highly browsed and downloaded articles), total, and Bottom 20% viewed and downloaded categories. The results suggest that, in general, most linguistic characteristics played little role in article views and downloads in our data sets, but some characteristics (e.g. title length and average sentence length) played a certain role in specific PLoS journals and platforms (the PLoS platform or the PubMed Central platform). Journal and platform differences in the linguistic characteristics of highly viewed and downloaded articles also exist.
Introduction
Usage metrics of academic articles have become increasingly popular in scientometrics. Article views and downloads are frequently used as important supplementary indicators to measure scholarly impact, identify the latest research trends in disciplines and explore usage patterns. Meanwhile, researchers conduct correlation analyses to investigate the relationships among article views, downloads, citations, co-author counts, funding data and so on, trying to probe why articles are viewed or downloaded. However, most of the research above is limited to bibliographic data.
Usage metrics need to be studied from a broader perspective. The increasing availability of the full text of scientific articles in machine-readable electronic formats is an opportunity to greatly advance scientometrics. In-text citations and entity metrics are typical examples of full-text analysis in scientometrics (Ding et al. 2013). Similarly, it is promising and valuable to introduce full-text analysis to usage metrics.
In this study, it is assumed that linguistic characteristics have an influence on article views and downloads to some extent. To understand the relationship between linguistic characteristics and article views and downloads, linguistic characteristics (including title length, abstract length, full text length, sentence length, lexical diversity, lexical density and lexical sophistication) jointly with usage metrics are investigated.
Literature review
Usage metrics
Usage metrics research in scientometrics has mainly focused on the following topics. Firstly, user behavior patterns, such as scientists’ working timetables (Wang et al. 2012, 2013a), user preferences (Chen 2018; Chen et al. 2018; Davis and Solla 2003; Davis 2006; Wang et al. 2016a, b) and temporal usage patterns (Chen et al. 2017; Khan and Younas 2017). Secondly, the obsolescence of articles from a diachronic or synchronic perspective. For example, Moed (2005) and Moed and Halevi (2016) studied the diachronic and synchronic obsolescence of usage data from the perspective of journals and countries, and Gorraiz et al. (2014) conducted similar research from the perspective of disciplines.
Thirdly, identifying the latest research trends in disciplines (Bollen et al. 2002; Wang et al. 2013b). Fourthly, indicators to evaluate the performance of journals, authors, groups and countries (Chi and Glänzel 2018; De Sordi et al. 2016; Wan et al. 2010), or supplementary metrics used jointly with altmetrics measures (Bollen et al. 2005; Kurtz and Henneken 2016). Finally, correlations between specific usage types, including downloads and citations (Kurtz and Bollen 2010; O’Leary 2008; Schloegl et al. 2014; Subotic and Mukherjee 2014; Zhao 2017), usage data across different platforms (Chen 2018; Chen et al. 2017), and usage data versus co-author counts (Chi and Glänzel 2017) or funding data (Zhao et al. 2018).
Full-text analysis in scientometrics
Full text contains additional information that has not been available in bibliographic data. At a minimum this includes reference position, proximity of cited references within the text, multiple references at the same reference point, multiple mentions of references (so‐called op. cit.), section information, and words indicating how an author feels about a reference (i.e., citation contexts or sentiments). Full text also contains a relatively high level of detail about motivation, methods, data, instruments, results, and conclusions that authors typically report when documenting and submitting their work for publication (Boyack et al. 2013).
Full-text analysis in scientometrics has mainly focused on the following topics. Firstly, in-text citations, such as reference position (e.g. Hu et al. 2013; Boyack et al. 2018), proximity of cited references (e.g. Gipp and Beel 2009; Liu and Chen 2012; Boyack et al. 2013; Kim et al. 2016), citation contexts or sentiments (e.g. Small 2011; Liu and Chen 2013; Ding et al. 2014; Lu et al. 2017), and citation motivation or behavior (e.g. Brooks 1986; Cano 1989; Bonzi and Snyder 1991; Case and Higgins 2000; Zhang et al. 2018). Secondly, entity metrics, such as scientific concepts (e.g. Ding et al. 2013; Mckeown et al. 2016), datasets (e.g. Belter 2014), software (Pan et al. 2015, 2016, 2018, 2019) and algorithms (e.g. Wang and Zhang 2018). Finally, the linguistic complexity of scientific writing styles and scientific impact (e.g. Lu et al. 2019a, b) and the characteristics of highly cited articles (e.g. Elgendi 2019).
Most research on usage metrics focuses on numerical analysis. A few studies analyze textual content jointly with usage metrics, but they are limited to traditional bibliometric methods (e.g. keyword frequency and ratio, bibliographic coupling, co-word analysis and correlation analysis) and bibliographic data. Full-text analysis in scientometrics mainly focuses on in-text citations and entity metrics, and it is increasingly expanding to other hot topics in scientometrics, such as scientific writing styles and scientific impact.
Research questions
The application of full-text analysis to understand the relationship between linguistic characteristics and article views and downloads has not been thoroughly investigated. To address this research gap, full-text analysis is used to explore the linguistic characteristics of highly browsed and downloaded papers. In this study, we are interested in the following research questions in the context of seven journals published by PLoS:
1. Are there any relationships between linguistic characteristics and highly browsed academic articles?
2. Are there any relationships between linguistic characteristics and highly downloaded academic articles?
3. Are there journal and platform differences in the linguistic characteristics of highly browsed and downloaded academic articles?
Methodology
Data
The data in this study consist of 63,002 full-text articles published from 2014 to 2015 in the PLoS journal family, a set of peer-reviewed journals covering various disciplines. Usage counts and other metadata were collected from PLoS between November 1 and November 7, 2018. The PLoS journals are also indexed by PMC (PubMed Central) and Web of Science (WoS), from which usage counts and other metadata were crawled during the same period. The PLoS, PMC and WoS usage counts of each article, along with other metadata, were aggregated by DOI or article title.
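This aggregation step can be sketched as a DOI-keyed join of the three platforms' records. The sketch below uses pandas with illustrative column names and toy values (not the study's actual crawler output), and omits the fallback match on article title:

```python
import pandas as pd

# Hypothetical per-platform usage records, keyed by DOI.
plos = pd.DataFrame({"doi": ["10.1371/a", "10.1371/b"],
                     "plos_html": [500, 120], "plos_pdf": [80, 30]})
pmc = pd.DataFrame({"doi": ["10.1371/a", "10.1371/b"],
                    "pmc_html": [200, 60], "pmc_pdf": [40, 10]})
wos = pd.DataFrame({"doi": ["10.1371/a", "10.1371/b"],
                    "wos_usage": [15, 3]})

# Left-join on DOI so every PLoS article keeps a row even if a
# platform is missing counts for it.
merged = plos.merge(pmc, on="doi", how="left").merge(wos, on="doi", how="left")
```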
The publication time span of the data set ranges from January 2014 to December 2015, because usage counts, and especially citations, accumulate to a steady level in the first 2 or 3 years after publication (Lippi and Favaloro 2013). Editorials and letters are excluded and only research articles pre-labeled by PLoS are kept; the final datasets are shown in Table 1.
Usage counts of PLoS
PLoS offers Article-Level Metrics (ALMs) for each journal article. PLoS ALMs draw from the following sources:
- Viewed: PLoS Journals (HTML, PDF, XML), PubMed Central (HTML, PDF)
- Saved: CiteULike, Mendeley
- Cited: CrossRef, DataCite, Europe PMC, PubMed Central, Scopus, Web of Science
- Recommended: F1000 Prime
- Discussed: PLoS Comments, Facebook, Reddit, Twitter, Wikipedia
PLoS articles are provided in three different formats (page views, PDF downloads, and XML downloads), and the online activity of users is recorded across all three. This "usage", comprising the three types, is provided as an aggregate metric or broken down month by month in graphical format. Online usage via the PLoS platform is presented according to industry-standard definitions of usage and is COUNTER-compliant.
PLoS also displays COUNTER 3-compliant PMC usage data for each article. PMC separately counts the number of page views and PDF downloads of the article on its site. The results are made available to PLoS only once a month, not in real time, so the display of PMC data may lag by up to one month. This also affects recently published articles, which may not show PMC usage data for their first month of publication.
Usage counts of WoS
The usage count is a measure of the level of interest in a specific item on the WoS platform. The count reflects the number of times the article has met a user’s information needs, as demonstrated by clicking links to the full-length article at the publisher’s website (via direct link or OpenURL) or by saving the article for use in a bibliographic management tool (via direct export or in a format to be imported later). The usage count is a record of all activity performed by all WoS users, not just users at a particular institution. Usage counts for different versions of the same item on the WoS platform are unified, and usage counts are updated daily. There are two kinds of usage counts on the WoS platform.
- Last 180 days: the number of times the full text of a record has been accessed or the record has been saved in the last 180 days. This count can move up or down as the end date of the fixed period advances.
- Since 2013: the number of times the full text of a record has been accessed or the record has been saved since February 1, 2013. This count can increase or remain static over time.
In general, "usage" often refers to HTML views and PDF downloads. HTML views and PDF downloads in PLoS and PMC are traditional usage data, while XML downloads in PLoS are a new type of usage data. Departing from traditional definitions, WoS defines "usage" as "clicking" and "saving" (Wang et al. 2016a); more accurately, WoS usage comprises "HTML views" and "saving" (Chen 2018). Tables 2 and 3 show that usage counts in WoS and XML downloads in PLoS are considerably lower than HTML views and PDF downloads in PLoS and PMC. Considering the traditional definitions of usage data, only HTML views and PDF downloads in PLoS and PMC are investigated in this study. Besides, usage counts in PMC are also significantly lower than those in PLoS, which suggests that the PLoS official websites are the primary channel through which users view and download articles.
Methods
Article classification strategy
Following the Pareto principle (the 80/20 rule), highly browsed and downloaded academic articles in this study are defined as the Top 20% of papers ranked by HTML views and PDF downloads on the PLoS and PMC platforms respectively. In order to comparatively uncover the linguistic characteristics of the Top 20% papers, total papers and Bottom 20% papers are also incorporated (the number of publications in each category is shown in Table 4).
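The classification above amounts to ranking articles by a usage count and slicing off the top and bottom fifths. A minimal sketch, with hypothetical record fields and toy view counts:

```python
def split_top_bottom(articles, key, share=0.2):
    """Rank `articles` by `key` (descending) and return the top and
    bottom `share` slices, e.g. the Top 20% and Bottom 20% groups."""
    ranked = sorted(articles, key=lambda a: a[key], reverse=True)
    k = max(1, int(len(ranked) * share))
    return ranked[:k], ranked[-k:]

# Ten toy articles with illustrative HTML view counts.
papers = [{"doi": f"10.1371/p{i}", "html_views": v}
          for i, v in enumerate([900, 40, 300, 25, 610, 75, 150, 500, 10, 220])]
top, bottom = split_top_bottom(papers, "html_views")
```

The same split would be run once per platform and per usage type (HTML views and PDF downloads), giving the Top 20% and Bottom 20% categories compared in the Results.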
Indicators measuring linguistic characteristics
Linguistic complexity comprises two aspects: syntactic and lexical complexity. Syntactic complexity consists of quantitative variables on sentence length, sentence complexity, and others (Ferris 1994; Kormos 2011; Ojima 2006). Lexical complexity is made up of lexical diversity, lexical density, and lexical sophistication (Vajjala and Meurers 2012). Lu et al. (2019a, b) selected several indicators (sentence length, sentence complexity, lexical diversity, lexical density and lexical sophistication) to measure linguistic complexity.
In this study, a more comprehensive set of indicators of linguistic characteristics was adopted compared with former research. The indicators selected follow the structure of an academic article: “Title–Abstract–Keyword–Full text–Sentence–Word”. Specifically, title length, abstract length, full-text length, sentence length, lexical diversity, lexical density and lexical sophistication are incorporated (shown in Table 5). “Keyword number” is not applied because the original articles on the PLoS platform carry no keywords. “Co-author number” is not selected because it measures co-authorship (Chi and Glänzel 2017) and, strictly speaking, cannot be classified as a linguistic characteristic. In addition, punctuation marks are removed from the calculations.
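Several of the indicators in Table 5 reduce to simple counts over tokenized text. The sketch below illustrates them on a toy two-sentence abstract (the regular-expression tokenizer and the example text are ours, not the study's pipeline; note that punctuation is stripped, as in the study):

```python
import re

text = ("Usage metrics supplement citation counts. "
        "Linguistic characteristics may influence usage metrics.")

# Split into sentences on terminal punctuation, then tokenize into
# alphabetic words only (punctuation removed).
sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
tokens = re.findall(r"[A-Za-z]+", text.lower())

full_text_length = len(tokens)                           # words in the text
avg_sentence_length = full_text_length / len(sentences)  # words per sentence
lexical_diversity = len(set(tokens)) / len(tokens)       # Type-Token Ratio
avg_word_length = sum(len(t) for t in tokens) / len(tokens)  # sophistication proxy
```

Lexical density (the share of nouns, verbs, adjectives and adverbs) additionally requires a part-of-speech tagger and is not shown here.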
Applicability of indicators to PLoS and PubMed views and downloads
In this study, it is assumed that linguistic characteristics influence article views and downloads to some extent, i.e. that linguistic characteristics may be part of the reason why a paper is viewed or downloaded. Experiments with browsing and downloading papers on the PLoS and PubMed platforms show that before browsing the full text of a paper, only the paper title and author names are visible; after clicking the paper title, the abstract and full text can be read. Readers can then choose to download the paper by clicking the “Download” button on the full-text page. So, before a paper is viewed, readers only know its title and author names, and before it is downloaded, readers know its title, author names, abstract, full text and so on. Therefore, only the indicator “title length” is applicable to PLoS and PubMed views, whereas the indicators “title length, abstract length, full-text length, sentence length, lexical diversity, lexical density and lexical sophistication” are all applicable to PLoS and PubMed downloads.
Results
Title length distribution
In Fig. 1, each journal is plotted in a unique color, articles of the three categories in each journal are plotted as points inside or outside the boxes, and the vertical line and hollow square inside or outside each box denote the median and mean respectively. The mean title length of the three categories in each journal is listed in Table 6.
Figure 1 and Table 6 reveal that Top 20% viewed and downloaded papers have shorter titles on average than total and Bottom 20% papers in each journal on the PLoS platform (the only exceptions being Top 20% viewed papers of PLoS Pathogens and Top 20% downloaded papers of PLoS Biology), whereas no regular pattern emerges on the PMC platform. In general, Top 20% viewed and downloaded papers of each journal have shorter average titles on the PLoS platform than on the PMC platform. Finally, among Top 20% viewed and downloaded papers, PLoS Medicine has the longest titles on average, followed by PLoS One, PLoS Pathogens and PLoS Neglected Tropical Diseases, then PLoS Genetics, with PLoS Computational Biology and PLoS Biology generally the shortest.
Tracing the submission guidelines of each PLoS journal shows that there are character limits for titles: no more than 250 characters in PLoS Biology, PLoS One and PLoS Neglected Tropical Diseases, and no more than 200 characters in PLoS Computational Biology, PLoS Genetics, PLoS Pathogens and PLoS Medicine. Since words consist of characters, the distribution of title character length (with spaces removed) on the PLoS and PMC platforms is shown in Fig. 2, which reveals that the character length of most articles in each journal is within the limits of the submission guidelines. Within 200 characters, however, each journal has its own title character length distribution.
Abstract length and full-text length distribution
Figure 3a and Table 6 reveal that Top 20% downloaded papers have longer abstracts on average than total or Bottom 20% papers in most journals, but the differences are marginal. Among Top 20% downloaded papers, PLoS Medicine has the longest abstracts on average, followed by PLoS Neglected Tropical Diseases, with the other journals generally the shortest. Checking the submission guidelines of each journal shows that there is no word limit for abstracts in PLoS Biology, a limit of no more than 300 words in PLoS Computational Biology, PLoS Genetics, PLoS One and PLoS Pathogens, 250–300 words in PLoS Neglected Tropical Diseases, and fewer than 500 words in PLoS Medicine. The average abstract length of each journal is probably affected by its submission guidelines.
In order to reveal more details, only papers with a full-text length of at most 30,000 words are shown from the global graph. Figure 3b and Table 7 reveal that only Top 20% downloaded papers on the PLoS platform (top right) have longer full texts on average than total and Bottom 20% papers. Generally, among Top 20% downloaded papers, PLoS Biology, PLoS Computational Biology, PLoS Genetics and PLoS Pathogens have the longest full texts on average, followed by PLoS Medicine, PLoS One and PLoS Neglected Tropical Diseases. Finally, Top 20% downloaded papers of each journal have longer average full texts on the PLoS platform than on the PMC platform (with PLoS Pathogens the only exception).
Sentence length and lexical diversity distribution
In order to reveal more details, only papers with an average sentence length of at most 50 words are shown from the global graph. Figure 4a and Table 7 reveal that Top 20% downloaded papers generally have shorter average sentences than total and Bottom 20% papers, especially on the PMC platform, although the differences are marginal. Among Top 20% downloaded papers, PLoS Medicine has the longest average sentences, followed by PLoS Biology, PLoS Computational Biology, PLoS Genetics, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens.
Figure 4b and Table 7 reveal that the average Type-Token Ratios of Top 20% downloaded papers are greater than 20%. They also reveal that Top 20% downloaded papers have lower average Type-Token Ratios than total and Bottom 20% papers in most journals on the PLoS platform, whereas the results are the opposite on the PMC platform. Among Top 20% downloaded papers, PLoS Neglected Tropical Diseases and PLoS One have the highest lexical diversity on average, followed by PLoS Pathogens, PLoS Medicine, PLoS Genetics and PLoS Biology, with PLoS Computational Biology the lowest.
Lexical density distribution
Lexical density is measured using lexical items only, namely nouns, verbs, adjectives and adverbs; other types of words, for example prepositions, are not considered in this study. In order to reveal more details, only papers with a noun ratio of at most 0.6 are shown from the global graph. Figures 5 and 6 show that, among the lexical items, nouns are used most frequently, followed by verbs and adjectives, with adverbs the least frequent.
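Given part-of-speech tags, lexical density and the individual content-word ratios are simple proportions. The sketch below uses a tiny hand-made POS lookup for illustration; the study would rely on a real POS tagger over full article text:

```python
# Toy POS lexicon and token sequence (illustrative, not the study's tagger).
pos = {"usage": "NOUN", "metrics": "NOUN", "may": "AUX", "strongly": "ADV",
       "influence": "VERB", "scholarly": "ADJ", "impact": "NOUN", "the": "DET"}
tokens = ["usage", "metrics", "may", "strongly", "influence",
          "the", "scholarly", "impact"]

content_tags = {"NOUN", "VERB", "ADJ", "ADV"}  # lexical items only
tags = [pos[t] for t in tokens]

# Lexical density: share of content words among all tokens.
lexical_density = sum(tag in content_tags for tag in tags) / len(tokens)
# Per-class ratios, e.g. the noun ratio reported in Fig. 5a.
noun_ratio = tags.count("NOUN") / len(tokens)
```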
Figure 5a reveals that Top 20% downloaded papers have higher average noun ratios than total and Bottom 20% papers in most journals on the PMC platform, but show no regular differences among the three categories on the PLoS platform. Among Top 20% downloaded papers, PLoS Pathogens and PLoS Genetics have the highest average noun ratios (about 38–39%), followed by PLoS Neglected Tropical Diseases, PLoS One, PLoS Medicine and PLoS Biology (about 36–37%), with PLoS Computational Biology the lowest (around 34%).
Figure 5b reveals that the differences in average verb ratio among Top 20%, total and Bottom 20% downloaded papers of each journal are marginal, at roughly 15%. Similarly, Fig. 6a reveals that the differences in average adjective ratio among the three categories of each journal are marginal, at around 11%, although PLoS Pathogens shows a lower average adjective ratio than the others (10.5% vs. 11%). Figure 6b reveals that the differences in average adverb ratio among the three categories of each journal are also marginal, at around 3%, although PLoS Computational Biology shows a higher average adverb ratio than the others (4% vs. 3%).
Lexical sophistication distribution
Figures 7 and 8 show the distributions of the average length of lexical words (nouns, verbs, adjectives and adverbs) by category. Generally, in the “total papers” category, the average length of nouns (6.68) is greater than that of verbs (6.13) and adverbs (6.48) but less than that of adjectives (7.92). In the “Top 20% downloaded papers” category, the average length of adjectives (7.96) is the greatest, followed by nouns (6.72), adverbs (6.53) and verbs (6.17).
From Fig. 7a, PLoS Computational Biology and PLoS Medicine have the longest nouns on average, followed by PLoS Neglected Tropical Diseases, PLoS One and PLoS Genetics, with PLoS Biology and PLoS Pathogens the shortest. From Fig. 7b, PLoS Pathogens has the longest verbs on average, followed by PLoS Biology, PLoS Genetics, PLoS One, PLoS Medicine and PLoS Neglected Tropical Diseases, with PLoS Computational Biology the shortest. From Fig. 8a, PLoS Pathogens has the longest adjectives on average, followed by PLoS Biology, PLoS Genetics, PLoS One, PLoS Medicine, PLoS Computational Biology and PLoS Neglected Tropical Diseases. From Fig. 8b, PLoS Pathogens and PLoS Genetics have the longest adverbs on average, followed by PLoS One, PLoS Biology, PLoS Neglected Tropical Diseases and PLoS Medicine, with PLoS Computational Biology the shortest.
Statistical significance test
Two-sample Kolmogorov–Smirnov (K–S) tests of the linguistic characteristics between the “Top 20% papers” and “Bottom 20% papers” categories were conducted; the p-values, shown in Tables 8 and 10, indicate whether the differences in each characteristic between categories are statistically significant. Spearman’s correlation coefficients between the usage data of Top 20% articles and their linguistic characteristics were also investigated, and are shown in Tables 9 and 11.
From Tables 8 and 10, about 40% of the K–S test results for linguistic characteristics between the “Top 20% viewed and downloaded” and “Bottom 20% viewed and downloaded” categories across journals and platforms suggest statistical significance. Between the “Top 20% viewed” and “Bottom 20% viewed” categories, title length suggests statistical significance in PLoS Computational Biology, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens, especially on the PLoS platform. Between the “Top 20% downloaded” and “Bottom 20% downloaded” categories, title length suggests statistical significance in PLoS Computational Biology, PLoS Genetics, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens, especially on the PLoS platform, and average sentence length suggests statistical significance in PLoS Biology, PLoS Computational Biology, PLoS Genetics, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens, especially on the PMC platform. The other linguistic characteristics also show statistical significance between the “Top 20% downloaded” and “Bottom 20% downloaded” categories, but the results depend on the journal and platform.
Tables 9 and 11 show very weak positive or slightly negative correlations between usage data and linguistic characteristics in general. Among Top 20% viewed articles, views are slightly negatively correlated with title length in PLoS Genetics, PLoS Neglected Tropical Diseases, PLoS One and PLoS Pathogens, especially on the PLoS platform. Among Top 20% downloaded articles, downloads are slightly negatively correlated with title length in PLoS Neglected Tropical Diseases and PLoS One on the PLoS platform. For Top 20% downloaded articles in PLoS One, downloads are slightly negatively correlated with lexical diversity, noun ratio, verb length, adjective length and adverb length, especially on the PLoS platform. For Top 20% downloaded articles in PLoS Biology, downloads are weakly negatively correlated with adjective ratio and adverb length on the PMC platform and moderately negatively correlated with adverb length on the PLoS platform.
Discussions and conclusions
This paper applied computational linguistics to understand the relationship between linguistic characteristics and article views and downloads. The mean and median results show marginal differences for most linguistic characteristics among categories, and the statistical significance tests generally indicate no statistical significance; however, certain linguistic characteristics (e.g. title length and average sentence length) are statistically significant in particular PLoS journals and platforms.
Although most linguistic characteristics play little role in article views and downloads in our data sets in general, some (e.g. title length and average sentence length) play a certain role in specific PLoS journals and platforms. The academic papers in this study also follow some patterns of linguistic characteristics. For example, the average sentence length of the sample papers is usually greater than 22 words; their average Type-Token Ratios are greater than 20%; and the average ratios of nouns, verbs, adjectives and adverbs are about 35–39%, 15%, 11% and 3% respectively. Besides, each journal has its own linguistic characteristics, and differences in linguistic characteristics between the two platforms also exist.
Jamali and Nikzad (2011) found that articles with longer titles were downloaded slightly less than articles with shorter titles, whereas Duan and Xiong (2017) found only weak correlations between total downloads and title length and held that the correlation between downloads and title length could differ with the data. In our view, social factors should be considered; for example, each journal has unique submission guidelines that limit the length of titles, abstracts or full texts in characters or words.
Apart from the submission guidelines of journals (e.g. word limits on title, abstract and full-text length), other social factors should also be incorporated to understand the linguistic characteristics of academic articles. First of all, each discipline follows its own research paradigm and covers unique terminology. In addition, diverse users of various ages, positions and academic backgrounds prefer different academic platforms for acquiring academic papers. In empirical research, keeping a balance between disciplines and journals in sampling and grouping, and comparing or combining usage data from different academic platforms, should also be valued.
There are also some limitations in this study. Only papers published between 2014 and 2015 in PLoS journals are investigated, so the conclusions might differ for larger samples and other journals, which calls for further experiments. Only several basic indicators of linguistic characteristics are adopted; more diversified and semantic indicators could be incorporated. Despite these limitations, we hope that this first introduction of multi-granularity linguistic characteristics to usage metrics provides a new perspective. In further study, in-depth interviews and experiments on user behavior will be combined with linguistic characteristics to investigate user motivation and behavior patterns in usage metrics.
References
Belter, C. W. (2014). Measuring the value of research data: A citation analysis of oceanographic data sets. PLoS ONE,9(3), e92590.
Bollen, J., Luce, R., Vemulapalli, S. S., & Xu, W. (2002). Usage analysis for the identification of research trends in digital libraries. D-Lib Magazine. https://doi.org/10.1045/may2003-bollen.
Bollen, J., Sompel, H. V. D., Smith, J. A., & Luce, R. (2005). Toward alternative metrics of journal impact: A comparison of download and citation data. Information Processing and Management,41(6), 1419–1440.
Bonzi, S., & Snyder, H. W. (1991). Motivations for citation: A comparison of self citation and citation to others. Scientometrics,21(2), 245–254.
Boyack, K. W., Eck, N. J. V., Colavizza, G., & Waltman, L. (2018). Characterizing in-text citations in scientific articles: A large-scale analysis. Journal of Informetrics,12(1), 59–73.
Boyack, K. W., Small, H., & Klavans, R. (2013). Improving the accuracy of co-citation clustering using full text. Journal of the American Society for Information Science and Technology,64(9), 1759–1767.
Brooks, T. A. (1986). Evidence of complex citer motivations. Journal of the Association for Information Science and Technology,37(1), 34–36.
Cano, V. (1989). Citation behavior: Classification, utility, and location. Journal of the Association for Information Science and Technology,40(4), 284–290.
Case, D. O., & Higgins, G. M. (2000). How can we investigate citation behavior? A study of reasons for citing literature in communication. Journal of the American Society for Information Science,51(7), 635–645.
Chen, B. (2018). Usage pattern comparison of the same scholarly articles between Web of Science (WoS) and Springer. Scientometrics,115(1), 519–537.
Chen, B., Zhong, Z., & Zhan, C. (2017). Usage pattern analysis of academic articles from two Chinese journals. In K. Holmberg & J. Vainio (Eds.), Proceedings of ISSI 2017 (pp. 366–375). Wuhan: Wuhan University.
Chen, B., Zhou, H., Zhong, Z., & Wang, Y. (2018). Exploring the user platform preference and user interest preference of chinese scholarly articles: A comparison based on usage metrics. Journal of Library Science in China,44(6), 90–104. (in Chinese).
Acknowledgements
This work was supported by the Youth Program of the National Social Science Fund of China (15CTQ035), the Social Public Safety S&T Collaborative Innovation Center of Universities in Jiangsu Province, and the China Scholarship Council (No. 201906845042).
Cite this article
Chen, B., Deng, D., Zhong, Z. et al. Exploring linguistic characteristics of highly browsed and downloaded academic articles. Scientometrics 122, 1769–1790 (2020). https://doi.org/10.1007/s11192-020-03361-4