Introduction

Academic publications represent a substantial investment of expert time to create, referee, edit and publish. It is therefore worrying for the participants as well as the funders or taxpayers that financed the study if the results are rarely read. One reason why something might be ignored is that it is, or appears to be, about a rarely researched topic so that few people find it or think that it is relevant (e.g., Fox and Burns 2015). Conversely, some believe that obscure research is essential to science and has been highly successful in the past (Gamboa 2015; James 2014; Mexal 2010). This article uses a term frequency approach to assess the hypothesis that obscure topics are rarely cited. Assuming that obscure (i.e., rarely researched) topics will often be reflected by the presence of unusual title terms, this article assesses whether the presence of such terms associates with low citation rates. Whilst there will be articles on obscure topics without any obscure terms in their titles (e.g., Fish farming technology in South Turkey during the Bronze Age) and articles on common topics with obscure words in their titles (e.g., Is interindexer consistency a hobgoblin?) the hypothesis is that these are the exceptions rather than the rule.

Many paths may lead to an article being found and read (Tenopir et al. 2010). These include keyword searching, citation chaining and journal browsing. The choice of an obscure topic, or at least an obscure title for an article, may reduce the likelihood that searchers will enter a relevant keyword and find the article, unless the author keywords are more relevant (e.g., Rostami et al. 2014). Similarly, people that cherry-pick interesting articles to read in current issues of journals may ignore topics that are apparently not relevant to their needs. Thus, titles indicating a rarely researched topic (including one that is just very specific) may tend to alienate potential readers (e.g., Sagan 2013). Conversely, some strange titles may provoke enough interest amongst people to read an article for current awareness even if it does not seem to be directly relevant.

Article title lengths may affect the decision to read an article. The American Psychological Association (APA) guide recommends using a maximum of 12 words but most article titles tend to be longer, and this length has increased over time (Hallock and Dillner 2016; Guo et al. 2015). An analysis of the titles of the 25 most cited and 25 least cited articles in medical journals from 2005 found that longer titles were more cited (Jacques and Sebire 2010). Conversely, longer titles associate with fewer citations in both biology (including biochemistry) and the social sciences and there is no relationship for chemistry (Didegah and Thelwall 2013) or management science (Nair and Gibbert 2016). A negative relationship between title length and citations has also been found for UK-authored articles in health and life sciences, natural sciences, geography and economics (Hudson 2016). The most cited psychology articles tend to have shorter titles than typical for psychology articles, but this may be due to higher impact journals tending to have shorter title styles (Subotic and Mukherjee 2014). Thus, evidence about title length is mixed but suggests that shorter titles are beneficial.

The content of article titles affects decisions to read them. In computer science, journals have different styles for titles (Anthony 2001) and in one linguistics journal, about a third of titles were found to describe each of: the topic (only); methods; and results (Sahragard and Meihami 2016). In psychology, articles with amusing titles 1985–1994 tended to receive a below average number of citations (Sagi and Yechiam 2008), but this factor does not affect management science (Nair and Gibbert 2016). In medicine, short titles mentioning the results are more frequently cited (Paiva et al. 2012). This supports a previous argument that informative titles are more useful (Hartley 2005; McGowan and Tugwell 2005). Articles with questions in their titles may be less frequently cited in most disciplines (Jamali and Nikzad 2011; Hudson 2016). More generally, the presence of non-alphanumeric characters, such as colons and hyphens, within article titles is common throughout academia and associates with higher citation rates, perhaps because their absence marks articles as unusual (Buter and van Raan 2011).

Some research has used a term frequency approach to analyse the individual words or phrases within article titles. For example, the distribution of nanotechnology-related terms within the titles of relevant journals follows a power law (Bartol and Stopar 2015). A comparison of word frequencies within article titles in history, sociology, economics and education found history to use substantially rarer terms than the other fields and these were often people or place names (Nagano 2009). An analysis of changing computing-related term frequencies over time in the titles, abstracts or keywords of library and information science articles discovered that terms that rose and declined in frequency tended to be associated with topical issues or terminologies (Thelwall and Maflahi 2015). A study of research-related clichés in medical article titles (e.g., “paradigm shift”, “out of the box”) also found these to rise and fall in popularity over time (Goodman 2012). Within economics 1890–2012 there have also been similar popularity changes in individual terms, such as tax, which was the second most popular substantive title term in the 1950s but was out of the top 10 before then and again after the 1960s (Guo et al. 2015). A comparison of the most frequently used terms in scholarly and trade technical communication publications found trade publication terms to relate to people more (e.g., you, your) whilst scholarly publication terms were more often about the research process (e.g., study, design, research) (Boettger and Friess 2014). The inclusion of a country name within a medical article title may associate with fewer citations (Jacques and Sebire 2010). Country names suggest that an article is primarily or exclusively of interest to a single nation and so their association with low cited articles is unsurprising. Similarly, within ecology, article titles that mention a specific organism are likely to be less cited and articles that mention broad issues are likely to be more cited (Fox and Burns 2015). In both cases research that seems to be obscure, in the sense of being specific, is less cited. The last paper is the closest to the topic of the current study.

Function words are a class of common words that have no topic meaning but serve to bind sentences together. They are in this sense the opposite of rare terms describing obscure topics. Function words include articles (e.g., the, a), pronouns (e.g., it, my, she), conjunctions (e.g., and, but), particles (e.g., if, then), prepositions (e.g., in, under), and auxiliary verbs (e.g., some uses of: has, do) (Selkirk 1996). Function words are useful to analyse as the polar opposite to obscure terms in order to detect whether it is possible for topic-neutral terms to associate with high or low citation impact. Function words do not seem to have been analysed in the field of scientometrics before. Despite their apparently neutral and topic independent nature, function words can convey useful meta-information about texts and authors. For example, an increased frequency of the personal pronoun I can occur in periods of stress (Chung and Pennebaker 2007). More relevant to the current paper, translations from Japanese to English have been shown to use fewer articles (a, an, the) (Chung and Pennebaker 2007) and so, extrapolating, it is possible that the presence or absence of specific function words in a title can be faint evidence that an article was originally not written in English but has been translated. Function words in the full text of articles can also point to the likely author gender in some types of texts such as blogs (Koppel et al. 2009). This association can be due to an indirect connection to topics. For example, my in blogs is an indicator of a likely female author and associates with relationships (e.g., mom, boyfriend, love) whereas the in blogs is a male authorship indicator, associating with computing (e.g., software, system, game). From this it is possible, but not obvious, that function words in article titles could associate with topics, and thus, indirectly, with higher or lower citation areas of a field.

Research questions

Despite the above findings, and with one partial exception (Fox and Burns 2015), no previous study has addressed the general issue of whether obscure research is less cited across academia. This article uses a term frequency approach to address this question, using the presence of unusual words in article titles as an indicator of likely topic obscurity. In addition, the overall relationship between term frequency and citations does not seem to have been addressed at all and so this is a secondary research question. The research questions therefore target these two gaps, with RQ3 serving to counterbalance RQ1 with an analysis of common neutral terms.

  • RQ1 Do articles containing unusual words in their titles (i.e., words that are rarely used in other titles from articles in the same field and time period) tend to be less cited?

  • RQ2 Does the relationship between the relative frequency of title words and the citation impact of the articles differ between subjects?

  • RQ3 Do function words within article titles tend to have a neutral association with citation counts? (i.e., do articles with titles containing a given function word tend to attract the same number of citations as other articles in the same subject?)

Although these questions address all areas of academia, the arts and humanities and social sciences (except within the health sciences) are not included in the results for the reasons outlined in the methods. Hence, the scope of the research questions is the natural, life and formal sciences, as well as engineering and the health sciences.

Methods

The data used was recycled from a previous article (Fairclough and Thelwall 2015) that did not analyse term frequencies. The data consists of every third subject in each broad Scopus category, giving a wide sample of subject areas encompassing most of academia. The choice of every third subject was arbitrary, driven only by the need for a systematic selection procedure. The non-English versions of article titles were excluded in cases where two languages were provided (e.g., there were Spanish and English versions of the same title for some journals). Categories including any journals publishing articles with exclusively non-English titles were rejected because the presence of other languages would affect the results. Although it would have been possible to remove these non-English journals, this would have changed the nature of the categories and so this was not attempted. This left 18 Scopus categories out of the initial set of 25, each containing journal articles (excluding reviews and non-article documents) from all years from 2009 to 2015 (with partial coverage), as gathered in April and May 2015.

The 7-year time period 2009–2015 seems to be long enough to get enough data on article titles to reliably identify unusual terms. Nevertheless, terms at the start of the period may have been common in a previous period and terms at the end may be common in the future and so this choice has limitations.

Citation counts are widely used in scientometrics as an indicator of scholarly impact for articles. For statistical analyses, citation counts for sets of articles have the disadvantage that they are highly skewed and so a logarithmic transformation is needed to eliminate this (Lundberg 2007). Another disadvantage of citation counts is that the average number of citations per article varies greatly between years, so the raw citation counts need to be normalised in order to be comparable between years. A simple way to do this is to divide each article’s citation count by the average citation count in its field and year (Waltman et al. 2011). After this, normalised citation counts from different years can fairly be grouped together. The following paragraph explains how these ideas were applied to the raw data.

For each category, the subject and year normalised log-transformed citation count \(\check{c}\) was obtained for each article so that a value of 1 always indicates an average number of citations, irrespective of subject and year. For this, the Scopus citation count c for each article was first log transformed using ln(1 + c) to reduce skewing (Lundberg 2007; Thelwall 2016). The arithmetic mean of the log transformed citation counts \({\overline{\ln \left( {{1} + c} \right)}}\) was then calculated separately for each subject and year. The subject and year normalised log-transformed citation count for each article was computed using \(\check{c} = \ln \left({1 + c}\right)/{\overline{\ln \left( {{1} + c} \right)}}\) where the average in the denominator is taken over all articles in the subject and year containing the article. For each subject and year, the resulting set of subject and year normalised log-transformed citation counts should be approximately normally distributed (Thelwall 2016) with an arithmetic mean of 1. This property is retained if different subjects and/or years are merged.
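The normalisation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the study: the record keys `subject`, `year` and `citations` are hypothetical names, and returning NaN for a subject and year in which every article is uncited is a choice made here because the source does not discuss that case.

```python
import math
from collections import defaultdict


def normalised_log_citations(articles):
    """Return ln(1 + c) divided by its subject-and-year mean, one value per article.

    `articles` is a list of dicts with the hypothetical keys
    'subject', 'year' and 'citations' (assumed names, not from the source).
    """
    # Log-transform every citation count to reduce skewing.
    logs = [math.log(1 + a["citations"]) for a in articles]

    # Collect the transformed values separately for each subject and year.
    groups = defaultdict(list)
    for article, value in zip(articles, logs):
        groups[(article["subject"], article["year"])].append(value)
    means = {key: sum(vals) / len(vals) for key, vals in groups.items()}

    # Divide each value by the mean of its own subject and year.
    result = []
    for article, value in zip(articles, logs):
        mean = means[(article["subject"], article["year"])]
        # Assumption: if every article in the group is uncited the ratio
        # is undefined, so NaN is returned rather than raising an error.
        result.append(value / mean if mean > 0 else float("nan"))
    return result
```

With this definition, the values for any single subject and year average to 1 by construction, which matches the property stated in the text.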

For each subject separately (but combining all years) a vocabulary was created recording all words in all article titles in all 7 years, together with the number of article titles containing each word. For example, a term with frequency 2 would be in exactly two different article titles within the subject (2009–2015), but they might be from different years and the term could occur multiple times in one or both titles.
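A minimal sketch of such a vocabulary is shown below. The tokenisation rule is an assumption, since the text does not say how titles were split into words: here titles are lower-cased and split into runs of letters, digits, apostrophes and hyphens. The function names are illustrative only.

```python
import re
from collections import Counter


def title_words(title):
    """Return the set of distinct words in one title (assumed tokenisation)."""
    # Assumption: a word starts with a letter or digit and may continue
    # with letters, digits, apostrophes or hyphens; case is ignored.
    return set(re.findall(r"[a-z0-9][a-z0-9'\-]*", title.lower()))


def build_vocabulary(titles):
    """Count, for each word, the number of titles that contain it."""
    vocabulary = Counter()
    for title in titles:
        # Using a set means a word repeated within one title counts once.
        vocabulary.update(title_words(title))
    return vocabulary
```

Counting each word once per title reproduces the convention in the text: a term with frequency 2 is in exactly two titles, however often it is repeated within either of them.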

Within each subject, the average citation impact \(\check{c}_{t}\) of each term t was calculated by taking the arithmetic mean of the subject and year normalised log-transformed citation counts \(\check{c}\) for all article titles in the subject containing the term. For example, if there were ten article titles containing the term study, then \(\check{c}_{{\mathrm{study}}}\) would be the arithmetic mean of the ten \(\check{c}\) values for these articles.
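The per-term average can be sketched as follows, assuming each article in one subject has already been reduced to a pair holding its set of distinct title words and its normalised log-transformed citation count. The input layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def term_impacts(articles):
    """Map each term to (average impact, number of titles containing it).

    `articles` is an iterable of (words, c_check) pairs for one subject,
    where `words` holds the distinct words of one title (assumed layout).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for words, c_check in articles:
        for word in set(words):  # guard against repeated words
            totals[word] += c_check
            counts[word] += 1
    # Arithmetic mean of the normalised counts of the titles with the term.
    return {word: (totals[word] / counts[word], counts[word]) for word in totals}
```

The second element of each pair is the term frequency, which is convenient for the grouping by frequency in the next step.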

Within each subject, the average citation impact \(\check{c}_{f}\) of each term frequency f was calculated by taking the arithmetic mean of the average citation impact \(\check{c}_{t}\) of each term t with frequency f. For example, if 1000 terms each only occurred in one article title in a subject then the average citation impact \(\check{c}_{1}\) of term frequency 1 for that subject would be the average citation impact of these 1000 terms. Similarly, if there were 100 terms with frequency 2 then \(\check{c}_{2}\) would be the arithmetic mean of the \(\check{c}_{t}\) values for all of these 100 terms.
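Grouping the per-term averages by frequency can be sketched in the same style. The input is assumed to be a dictionary from each term to a pair of its average impact and its frequency; this layout is a convenience chosen here, not something specified in the text.

```python
from collections import defaultdict


def frequency_impacts(term_stats):
    """Map each term frequency f to the mean impact of the terms with that frequency.

    `term_stats` maps term -> (average impact, frequency), an assumed layout.
    """
    by_frequency = defaultdict(list)
    for impact, frequency in term_stats.values():
        by_frequency[frequency].append(impact)
    # Unweighted arithmetic mean over the terms sharing each frequency.
    return {f: sum(vals) / len(vals) for f, vals in by_frequency.items()}
```

Because every term with frequency f contributes exactly f articles, this unweighted mean of term averages equals the mean of the pooled article values for that frequency.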

Approximate confidence intervals were calculated for each word frequency average citation impact \(\check{c}_{f}\) using the standard normal distribution formula applied to the complete set of subject and year normalised log-transformed citation counts used to calculate it. If \(n_{f}\) is the number of terms with frequency f, then the sample size would be \(fn_{f}\) because each term occurs in f different articles and there are \(n_{f}\) terms. Here the same article can be counted multiple times if its title contains different terms with the same frequency f. This is a hybrid calculation in most cases. For a frequency count of 1, it is a precise confidence interval for the average impact of all unique terms. For frequency counts with only one associated term (e.g., if only one term occurred in exactly 500 articles) then the confidence interval is for the average impact of the individual term. Between these two extremes, the confidence interval is a purely illustrative hybrid between the two.
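One plausible reading of this calculation is sketched below: the normalised citation counts behind a given frequency are pooled into a single list and a normal-theory interval is placed around their mean. The use of the sample standard deviation is an assumption, since the text does not say which variance estimator was used, and the function name is illustrative.

```python
import math
from statistics import NormalDist, mean, stdev


def normal_confidence_interval(values, level=0.95):
    """Return (lower, upper) for the mean of `values` using the normal formula.

    `values` is the pooled list of normalised log-transformed citation
    counts for one term frequency, with an article repeated once for each
    of its title terms that has that frequency.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed to estimate spread")
    # Two-sided critical value, about 1.96 for a 95 % interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = mean(values)
    # Assumption: sample standard deviation (n - 1 denominator).
    half_width = z * stdev(values) / math.sqrt(n)
    return centre - half_width, centre + half_width
```

Repeating an article in the pooled list treats its values as independent observations, which is one reason the intervals are described as approximate.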

Results

Unique words (i.e., terms that occur in only one article title in a subject) were analysed to address the first research question, since unique words are the most unusual in the corpus in terms of frequency in article titles. In all subjects, unique words in article titles associate with lower citation counts (Table 1). Except for Assessment and Diagnosis, the 95 % confidence intervals for the citation counts exclude 1, giving statistical evidence of having a below average citation count for the subject. In other words, in all subject areas except Assessment and Diagnosis, if an article from 2009 to 2015 includes within its title at least one term that is in no other title in the subject area during 2009–2015 then that article can be expected to receive a below average number of citations for its subject and year.

Table 1 Field and year normalised average citation impacts of articles containing unique words in their titles (i.e., words occurring in no other article title from the subject 2009–2015) together with 95 % confidence intervals

Some of the unique terms are specialist and rare words, such as amentacea (ciboria amentacea is a fungus species that grows on willow and elder tree catkins), Boswelic (Boswelic acid is a tree resin traditionally used in Ayurvedic medicine and being investigated for its anti-inflammatory properties), FACSCanto (in title: Comparison of two single-platform ISHAGE-based CD34 enumeration protocols on BD FACSCalibur and FACSCanto flow cytometers), sunnhemp (referring to a hemp plant) and BMP5 (Bone Morphogenetic Protein 5, a protein coding gene). The apparent obscurity of the topics associated with these terms shows that the hypothesis that rare title terms associate with unusual topics has some support in the data. Not all unique terms associate with unusual topics, however. Some appear to be typographically unusual, such as 10’s (article title: The HI Chronicles of LITTLE THINGS BCDs II: The Origin of IC 10’s HI Structure). Some are lists, such as b2–b8 (article title: Effect of peptide fragment size on the propensity of cyclization in collision-induced dissociation: Oligoglycine b2–b8). Others are more common words that may be rarely used in titles within a field, such as issuing, tigress, and algorithmically. Overall, then, the results (Table 1) are consistent with the hypothesis that articles on obscure (i.e., rarely researched) topics are more rarely cited because a substantial proportion of the unique title word terms associate with apparently obscure topics. The results are not definitive, however, because the judgement of topics being obscure is qualitative and it is possible that the unique words not referring to obscure topics have more influence, for example if they reflect awkward language constructions by junior researchers or researchers with low fluency in English.

In answer to the second research question, a visual inspection of the overall term frequency pattern of each subject (see a complete set of graphs in the online supplement: https://dx.doi.org/10.6084/m9.figshare.3806265.v1) suggests that they are all broadly similar, with one partial exception, Assessment and Diagnosis (Fig. 1). This subject is unusual because most term frequencies have an above average citation impact. This counterintuitive attribute is due to the presence of many articles from Nursing magazine with short titles (e.g., Break through your fears) and no citations. Thus, whilst the overall per-article average normalised citation impact is 1, the overall per-term average normalised citation impact is 1.4.
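The gap between the per-article and per-term averages can be illustrated with a toy calculation. The numbers below are invented purely for illustration and are not taken from the data; the sketch also assumes that every term occurs in only one title, so that each term simply inherits the impact of its article.

```python
def per_article_and_per_term_means(articles):
    """Return (per-article mean, per-term mean) of normalised impact.

    `articles` is a list of (number_of_title_terms, normalised_impact)
    pairs. Assumption: every term is unique to its title, so the
    per-term mean is the impact weighted by title length.
    """
    per_article = sum(impact for _, impact in articles) / len(articles)
    total_terms = sum(n_terms for n_terms, _ in articles)
    per_term = sum(n_terms * impact for n_terms, impact in articles) / total_terms
    return per_article, per_term


# Invented example: two uncited articles with 4-word titles and
# two well-cited articles with 12-word titles.
toy = [(4, 0.0), (4, 0.0), (12, 2.0), (12, 2.0)]
print(per_article_and_per_term_means(toy))
```

In this toy case the per-article mean is 1 but the per-term mean is 1.5, because the uncited articles with short titles contribute few terms. This is the same mechanism that the text attributes to the Nursing magazine articles.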

Fig. 1

The year-normalised, log-transformed average citation count of title words by frequency for Assessment and Diagnosis 2009–2015. Error bars show estimated 95 % confidence intervals. The highest and lowest impact points are annotated with the generating term

Catalysis, one of the two middle subjects in Table 1, has a typical shape for subjects other than Assessment and Diagnosis (Fig. 2), in the sense of an increasing slope on the left, a fuzzy shape with an average value of 1 in the middle, and a jagged line of high frequency values on the right.

The individual high and low impact Catalysis terms (Table 2) associate with high or low impact research topics (e.g., batteries: 1.455; propane: 0.771), research types (frameworks: 1.492; study: 0.767; investigation: 0.757; theoretical: 0.713), or claims (e.g., first: 0.821). The low impact association of the general terms in this list contrasts with a previous study for ecology (Fox and Burns 2015). The low impact association of the term first is surprising given that it sometimes signals an explicit novelty claim (e.g., The first example of asymmetric hydrogenation of imines with Co2(CO)8/(R)-BINAP as catalytic precursor). The reason is that it was often used within the phrase “first principles”, to denote a research approach that was perhaps less cited than others (e.g., Selectivity in propene dehydrogenation on Pt and Pt3Sn surfaces from first principles).

Table 2 The 10 highest and 10 lowest average citation impact terms for catalysis 2009–2015, together with 95 % confidence intervals

The jagged line on the right hand side of Fig. 2 indicates, because of the non-overlapping confidence intervals, small but significant differences in the average impact of individual high frequency terms. These differences pervade all subjects, but are not always the same (Fig. 3).

Fig. 2

The year-normalised, log-transformed average citation count of title words by frequency for catalysis 2009–2015. Error bars show estimated 95 % confidence intervals. The highest and lowest impact points are annotated, as are the three highest frequency terms

Fig. 3

The average citation impact of terms occurring in the titles of all subject areas except Assessment and Diagnosis, which generates much larger outliers. This graph shows the same data as Tables 3 and 4 but visualises the variability between disciplines for individual terms

Although most function words associate with slightly higher impact research overall (Table 3), this is not true for on and the, both of which may be indicators of specificity. The highest impact common function words are for, perhaps indicating an application or purpose, and and, suggesting multiple results or applications.

Table 3 The average citation impact of all function words occurring in the titles of all subject areas

Two non-function words were also present in all subject areas, analysis and effects (Table 4). Ignoring the Assessment and Diagnosis outlier, the term analysis associates with lower impact both overall and in most subjects. Analysis seems to be particularly unvalued in electrochemistry (0.84) and catalysis (0.87), perhaps because this term suggests a lack of empirical data. Conversely, in Complementary and Manual Therapy (1.27) it might associate with meta-analyses, which tend to be highly cited. There are also disciplinary variations in the average impact associated with the term effects, although the reason is unclear.

Table 4 The average citation impact of the terms analysis and effects in the titles of all subject areas

The specialism with the lowest average citation impact for term frequency 1 is Automotive Engineering (Fig. 4). The cause in this case is the presence of trade magazines, such as Public Transport International and Automotive Industries AI, that contain rarely cited articles and news about specific localities, or industry events. These include “Public transport in Vienna: popular, accepted, high quality” and “LG Chem to supply GM with battery cells and electronic components for Chevrolet Volt”. Locality, product and company names provide a collection of low frequency uncited terms (similar to the organism specificity issue within ecology: Fox and Burns 2015). These are obscure topics in the sense of being highly specific to a company, event or locality rather than focusing on a topic that would be part of the general knowledge for a discipline. To illustrate this, it seems unlikely that many future articles would need to cite information about the electronics supplier for the Chevrolet Volt.

Fig. 4

The year-normalised, log-transformed average citation count of title words by frequency for Automotive Engineering 2009–2015. Error bars show estimated 95 % confidence intervals. The highest and lowest impact points are annotated

Discussion

The main limitation of the word frequency analysis reported here is that an individual term can have different meanings (polysemy) and there are also different terms that mean the same thing (synonyms), so word frequency comparisons are simplifications. In addition, a word can be used in different typical contexts that alter its meaning. For instance, the term analysis is part of the name of an area of maths (functional analysis), and a specific method (social network analysis) as well as being a general term for generating knowledge and understanding. Another important limitation is the absence of the arts and humanities as well as all core social sciences so that the findings only relate to natural, formal and life sciences, health sciences and engineering. Since only a minority of subjects in these areas have been examined, there may be other fields that follow a different pattern. A technical limitation is that gathering articles over a longer time span would have increased the term frequency counts of some of the words in each category, but probably would not have affected the overall patterns and findings.

The evidence clearly points to unique terms in article titles associating with lower citation impact in all disciplines. This suggests that rarely researched topics tend to attract fewer citations. Although this seems to be the most likely reason, there are alternative explanations. It is possible that authors describe their topics in an unusual way (e.g., due to a lack of language skills or extreme language proficiency leading to obscure word choices) that alienates potential readers. Authors may also fail to incorporate the generality of the findings into the title, missing out on part of their audience. More seriously, weaker researchers may fail to adequately generalise their findings or may pick narrow topics (e.g., Finberg 2015) and so their overly specific research has lower impact. The different citation associations of function words undermine the findings somewhat by showing that even the presence of specific neutral words in titles (e.g., and) can associate with higher (or lower) average citation impact in different subjects. Since words that are not content words can associate with differences in expected citation rates, the low citation impact of articles with rarely used title words could also be due to causes other than the topics of the articles.

The results also show the same basic pattern in the term frequency graphs for each subject, but with clear disciplinary differences in the citation impact associated with individual terms. It is perhaps surprising that individual function words, such as the, can associate with higher impact research in some fields but lower impact research in others. This could be due to different styles adopted within high and low impact journals, the presence of the within phrases associated with high or low impact sub-fields, the scarcity of definite articles from translated documents in some languages, or the tendency of the definite article to denote a more specific topic.

The almost universally higher citation impact association of the term and (i.e., articles with titles containing and tend to be more cited) is surprising since the presence of a conjunction seems to connote a longer, more complex title (although three word titles can include “and”), whereas most previous research (reviewed in the Introduction) has found longer titles to associate with fewer citations.

Conclusions

Focusing on the end of the time period examined, the data suggests that in all subject areas except one, if a new article is published with a title that includes at least one term that has not been used in a title in the subject area within the previous 6 years then this article can be expected to receive fewer citations than average for its subject and year. Assuming, with some support from Table 1 and the surrounding discussion, that the cause of this association is that articles with unique title terms tend to be describing obscure topics, then a generalisation of this is that new articles on obscure topics will tend to attract fewer citations than average for their subject and year.

A simple conclusion from this research is that, except perhaps in the arts, humanities and social sciences, researchers should avoid creating titles that make their research seem obscure (i.e., rarely researched) because they may not be read. It seems likely that researchers should also attempt to generalise their studies as far as possible and to highlight this generality when writing their titles. This strategy should lead to research that is more useful to more people and may result in more citations. This advice should be incorporated into the guidelines given to beginning researchers about writing articles (e.g., Hartley 2005, 2008). Ultimately, the purpose of most research publishing is to attract an audience and composing article titles should be a key part of a strategy to achieve this. Of course, this is only general advice and researchers should not be deterred from attempting to conduct unusual research if they believe that it will attract an audience anyway.

A secondary tentative conclusion, which is a by-product of the research rather than part of the aims, derives from the higher citation association of the term and in almost all subjects, which presumably stems from more complex titles since it is a conjunction. It seems that authors should not be afraid to mention multiple things within their article titles as this may show more comprehensive research or may relate to more researchers’ topics of interest. This is a tentative conclusion, however, since title lengths do not have a clear association with citation counts. Similarly, the inclusion of for within a title suggests a purpose for the research, which seems to be a logical way to attract readers. For future research, it would be useful to investigate the citation association of function words in more detail.