Introduction

The use of citation analysis has grown in importance during the past few years. The vast increase in scientific production has made it very difficult for scientists to keep track of publications they might be interested in. Many indicators have been developed to rank scientific journals, authors and scientific publications by measuring their importance.

The most widely used ranking indicator for journals is the Impact Factor proposed by Garfield (1955, 1999, 2005). The ranking is based on the average number of citations received per citable item in the journal in question during a predefined period of time (the past 2 years).

In order to measure the importance of a researcher’s work, other metrics have been proposed that use the collection of all articles a researcher has (co-) authored, plus the sum of all direct citations received. Such indexes are the h-Index (Hirsch 2005), g-Index (Egghe 2006), and their variations.

For example, there have been variations of the h-index that take into account: (a) the total number of citations included in the Hirsch-core (A-index, R-index) (Jin et al. 2007), (b) the age of the publications included in the Hirsch-core (AR-index) (Jin et al. 2007), (c) the age of the publications of an author (contemporary h-index) (Sidiropoulos et al. 2007), (d) the age of the citations (trend h-index) (Sidiropoulos et al. 2007), (e) the combination of the above two (age-decaying h-index) (Katsaros et al. 2007), and, (f) not only the citations inside the Hirsch-core but also the ones received by publications currently not included in the Hirsch-core (tapered h-index) (Anderson et al. 2008).

There have been some variations of the g-index as well, like the gr-index and the grat-index (Guns and Rousseau 2009).

The importance of a scientific publication is most commonly measured by the number of citations it has received. A different approach was proposed by Rousseau (1987), who claims that publications mentioned in the reference list have an impact on the publication in question. More recently, it has been proposed to apply the philosophy of Page Rank (Brin and Page 1998) to a Citation Graph (Ma et al. 2008). Finally, the Cascading Citations Indexing Framework approach (Dervos and Kalkanis 2005; Dervos et al. 2006; Dervos and Klimis 2008) suggests that citations should be addressed at the (article, author) level in order to rank the contribution of each author’s scientific work.

We suggest a new indicator for measuring the importance of a research article, the f-value. We produce a ranking of the publications included in the CiteSeer bibliographic database (Citeseer 1997; Giles et al. 1998) and compare our results with the ones obtained by other indicators.

In the “Related work” section, the Number of Citations, the Cascading Citations Indexing Framework, and the Page Rank for citation graphs approaches are presented. The “f-Value description” section describes the basic concept of the f-value, and in the “Determining the reducing factor” section we justify the selection of the specific reducing factor used in the calculation of the f-value. The paper continues by presenting the f-value algorithm in the “f-Value algorithm” section and the different rankings produced by three different indicators in the “Experimental results” section. The “Discussion” section describes the similarities and differences of the f-value with the other indicators, and, finally, the last section concludes the paper.

Related work

A citation graph is a representation of the relationships that exist between research articles based on the references that each article provides. In Fig. 1, articles are shown as nodes of a directed graph. In this example there are seven articles labeled A to G.

Fig. 1 Citation Graph 1

The arcs of the graph represent references among articles. For example, the arc leaving node B can be interpreted as “article B references article D”. The incoming arcs are the direct citations received by a specific article. For article D we can state that “article D receives one direct citation from article B”.

Number of citations

This approach produces a ranking of scientific publications based on the number of citations they receive. It is by far the simplest approach, but it is widely used. For example, in the citation graph of Fig. 1, articles A and F receive zero citations, articles B, D, E and G receive one citation each, and article C receives two citations.
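As a minimal sketch of this indicator, the snippet below counts direct citations from a list of (citing, cited) pairs and sorts the articles accordingly. The edge list is an assumption: it is one graph that reproduces the counts quoted above and the path A → B → D → G, but it is not necessarily the exact set of arcs drawn in Fig. 1. The names `references`, `citation_counts` and `ranking` are illustrative only.

```python
from collections import Counter

# Hypothetical edge list (citing, cited). It is consistent with the counts
# quoted in the text, but not necessarily identical to the arcs of Fig. 1.
references = [
    ("A", "B"), ("B", "D"), ("D", "G"),
    ("F", "E"), ("F", "C"), ("E", "C"),
]
articles = ["A", "B", "C", "D", "E", "F", "G"]


def citation_counts(articles, references):
    """Return {article: number of direct citations received}."""
    received = Counter(cited for _citing, cited in references)
    return {a: received.get(a, 0) for a in articles}


counts = citation_counts(articles, references)
# Highest count first; ties are broken alphabetically (an arbitrary choice).
ranking = sorted(articles, key=lambda a: (-counts[a], a))
```

Ties are common under this indicator, which is one reason the remaining approaches look beyond direct citations.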

The cascading citations indexing framework (c2-IF)

The fundamental concept in the c2-IF approach (Dervos and Kalkanis 2005; Dervos et al. 2006) is the n-gen citation. According to c2-IF, direct citations like the ones discussed in the previous section are called 1-gen citations. If we carefully examine the citation graph in Fig. 1, we observe that article D also receives an indirect citation from article A, via article B. This is considered to be a 2-gen citation. In general, an n-gen citation exists between a source article S and a target article T, if there is a directed path in the citation graph from node S to node T. In the example of Fig. 1, the highest n-gen citation present is of depth 3: the one from article A to article G, along the citation path A → B → D → G.

According to c2-IF, the citations that an (article, author) pair receives can be calculated up to depth n, thus producing a number of distinct values. So, if we choose to consider the citations up to depth 3, the following values will be calculated: 1-gen citations, 2-gen citations, and 3-gen citations. These values are stored in a table called Medal Standings Output (MSO).

We also stress that the c2-IF approach is not to be considered a ranking method but merely a framework that extends the citation indexing paradigm to include 2-, 3-, \(\ldots\), k-gen citations. We should also point out that in the c2-IF approach k is predefined and its value can range from 2 to n, where n is the length of the longest path present in the specific citation graph. In other words, \(k \in [2, n]\), and consequently that many distinct values are going to be calculated for each article in the citation graph.
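The n-gen counts described in this section can be sketched as follows, under the assumption that one n-gen citation is counted per distinct directed path of length n ending at the target article. The function name `n_gen_citations` and the toy edge list are hypothetical; this is not the c2-IF implementation itself, which also works at the (article, author) level.

```python
from collections import defaultdict


def n_gen_citations(references, max_depth):
    """Count citation paths of length 1..max_depth ending at each article.

    references: iterable of (citing, cited) pairs.
    Returns {article: [1-gen, 2-gen, ..., max_depth-gen]}.
    """
    citers = defaultdict(list)
    nodes = set()
    for citing, cited in references:
        citers[cited].append(citing)
        nodes.update((citing, cited))

    # paths[a] = number of paths of the current length that end at a.
    # A path of length 0 is the article itself.
    paths = {a: 1 for a in nodes}
    table = {a: [] for a in nodes}
    for _ in range(max_depth):
        paths = {a: sum(paths[c] for c in citers[a]) for a in nodes}
        for a in nodes:
            table[a].append(paths[a])
    return table


# Hypothetical graph containing the path A -> B -> D -> G from the text.
references = [
    ("A", "B"), ("B", "D"), ("D", "G"),
    ("F", "E"), ("F", "C"), ("E", "C"),
]
table = n_gen_citations(references, max_depth=3)
```

Each row of `table` plays the role of one row of an MSO-style table restricted to depth 3.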

Page rank

The original Page Rank (Brin and Page 1998) produces a ranking of web pages by taking into account the number and importance of pages linking to each web page. The formula used by the Page Rank algorithm is

$$ PR(A)=(1-d)+d*\sum_{i}{\frac{PR(T_{i})}{C(T_{i})}} $$
(1)

where \(PR(T_{i})\) is the Page Rank value of page \(T_{i}\) linking to page A, whose Page Rank value we wish to calculate, and \(C(T_{i})\) is the number of outbound links of page \(T_{i}\). Finally, d is the damping factor. In order to better explain the damping factor, we should first give a general description of the concept of Page Rank.

The Page Rank algorithm is based on the Random Surfer model, which states that a person, the “random surfer”, navigates through the web randomly by clicking on links present on a web page. So, how high a web page ranks has to do with the probability that this “random surfer” eventually visits the web page in question. The probability increases as the number of incoming links increases, and the effect is even stronger if these links come from web pages which score high and are therefore themselves likely to be visited. However, there is always a chance that our “random surfer” gets bored and chooses to simply leave, a reaction governed by the damping factor, which in the original article was chosen to be 0.85. In most discussions about Page Rank, 0.85 is the value used for the damping factor, but there is at least one article that we know of that examines the behavior of the original Page Rank algorithm when different values are chosen (Boldi et al. 2009). So, for the most common value of the damping factor, Eq. 1 becomes

$$ PR(A)=0.15+0.85*\sum_{i}{\frac{PR(T_{i})}{C(T_{i})}} $$
(2)

Ma et al. (2008) apply a variation of the original Page Rank algorithm to citation graphs. The authors apply Eq. 1 with d = 0.5. They choose this value based on an empirical study suggesting that researchers typically follow a chain of about two articles before stopping, rather than about six.
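Equation 1 can be evaluated by simple fixed-point iteration, as in the following sketch. The function name `page_rank`, the fixed iteration count and the two-article example are assumptions made for illustration; this is neither the implementation used later in the paper nor the one from Ma et al. (2008).

```python
def page_rank(out_links, d=0.85, iterations=50):
    """Iterate Eq. 1: PR(A) = (1 - d) + d * sum(PR(T) / C(T)).

    out_links: {page: list of pages it links to (or articles it cites)}.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # Invert the graph so that each page knows which pages link to it.
    in_links = {p: [] for p in pages}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)

    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[t] / len(out_links[t]) for t in in_links[p])
            for p in pages
        }
    return pr


# Toy citation graph: article A cites article B; d = 0.5 as in Ma et al.
ranks = page_rank({"A": ["B"], "B": []}, d=0.5)
```

In this toy graph the uncited article A settles at 1 − d, and B receives an additional d · PR(A) because A has a single outbound link.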

f-Value description

The Cascading Citations Indexing Framework introduces the k-gen (indirect) citations as a means of acknowledging the importance of a research article based not only on its direct influence (number of 1-gen citations) but also on the influence the citing articles represent in their scientific field.

In this paper, we introduce the f-value, a new indicator that quantifies the importance of a research article. The f-value considers the accumulated importance of all articles that have based their scientific contribution on the article in question, directly or indirectly. In other words, each article’s importance is represented by a single value, the f-value. The method used to calculate the f-values of articles in a citation graph is based on our complete knowledge of the graph; thus, it is exhaustive in nature and considers all citation paths present up to the maximum depth n.

Let us consider the following example. We have six articles, labeled A to F related as shown in Fig. 2, thus producing the MSO table shown in Table 1.

Fig. 2 Citation Graph 2

Table 1 MSO table for Citation Graph 2

A possible way to calculate the f-value of an article A by taking into account the indirect citations could be

$$ f(A)=1+(f(A_{1})+f(A_{2})+\cdots+f(A_{m})) $$
(3)

where f(A) is the f-value of article A, and \(A_{i}\), \(i = 1, \ldots, m\), are the articles citing article A. According to the equation, the minimum f-value for a published article is 1. Thus, the f-value of article A is 1 plus the sum of the f-values of all articles citing article A.

By performing the calculations for the articles of citation graph in Fig. 2, we produce the graph shown in Fig. 3, with the number on top of the nodes representing the f-values for the corresponding articles.

Fig. 3 f-Values for Citation Graph 2

Such an approach results in each article eventually receiving as much credit as the sum of the credit received by all articles that cite it, making no distinction between direct and indirect citations. This is also evident from the results shown in Fig. 3. The f-value of each article is 1 plus the f-values of all articles that cite it directly. Of special interest are the f-values of articles C and D, which are both 3. This means that, based on Eq. 3, these two articles are equally important even though article C has received two 1-gen citations whereas article D has received one 1-gen citation and one 2-gen citation.
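The tie can be reproduced by hand on a hypothetical configuration that matches the citation counts just described (the labels P, Q, X, Y and Z are illustrative and are not the nodes of Fig. 2). Suppose article P is cited by two uncited articles \(X_{1}\) and \(X_{2}\), while article Q is cited by a single article Y, which is in turn cited by one uncited article Z. Applying Eq. 3 gives

$$ f(P)=1+(f(X_{1})+f(X_{2}))=1+(1+1)=3 $$

$$ f(Y)=1+f(Z)=1+1=2, \qquad f(Q)=1+f(Y)=1+2=3 $$

so the indirect citation from Z contributes to Q exactly as much as a second direct citation would.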

So, there must be some factor that will assist us in differentiating direct and indirect citations. This is going to be a value that will reduce the cascaded f-value passed to an article’s direct citations. Here is the new equation that calculates the f-value of an article:

$$ f(A)=1+RF*(f(A_{1})+f(A_{2})+\cdots +f(A_{m})) $$
(4)

For the dataset used in this paper we have calculated that RF = 1/2.2. The method for calculating it is presented in the “c2-IF algorithm results and statistical analysis” section. Figure 4 demonstrates the use of RF = 1/2.2 on Citation Graph 2.
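To illustrate how the reducing factor separates direct from indirect citations, the following sketch evaluates Eq. 4 recursively on a small hypothetical graph, once with RF = 1 (which reduces to Eq. 3) and once with the reducing factor of 1/2.2 chosen in the “Determining the reducing factor” section. The helper `f_values` and the article labels are assumptions for illustration, and plain recursion is only suitable for small acyclic graphs.

```python
from functools import lru_cache


def f_values(citers, rf):
    """Evaluate Eq. 4 recursively: f(A) = 1 + RF * sum of f over A's citers.

    citers: {article: list of articles that cite it}; must be acyclic.
    """
    @lru_cache(maxsize=None)
    def f(article):
        return 1 + rf * sum(f(c) for c in citers.get(article, []))

    return {a: f(a) for a in citers}


# Hypothetical toy graph: P has two direct citers; Q has one direct citer (Y)
# that is itself cited once (by Z).
citers = {"P": ["X1", "X2"], "Q": ["Y"], "Y": ["Z"], "X1": [], "X2": [], "Z": []}

plain = f_values(citers, rf=1.0)        # same as Eq. 3
reduced = f_values(citers, rf=1 / 2.2)  # Eq. 4 with the reducing factor
```

With RF = 1 both P and Q score 3, whereas with RF = 1/2.2 the article with two direct citations (P) scores higher than the one with one direct and one indirect citation (Q).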

Fig. 4 f-Values for Citation Graph 2

Determining the reducing factor

In this section we explain how the reducing factor (RF) is calculated. First, we provide a description of the CiteSeer database and the preprocessing we performed on it. Then, we use c2-IF information up to depth 3 to compute statistical information, which we then use to calculate the reducing factor of the CiteSeer database.

Data used

We chose the CiteSeer database because:

  • It indexes a sufficient number of research articles and is not limited to certain journals

  • It mostly covers the scientific area of Computer and Information Science

  • It uses the Open Archives Initiative (OAI) format, which is XML based.

A sample record is shown in Fig. 5. For simplicity, only the identifiers that are used by the algorithm are listed.

Fig. 5 CiteSeer Record

Each article is defined by a unique <identifier> tag generated by CiteSeer, as shown in Fig. 5. Other fields required by the algorithm are the title (<dc:title> tag) and the list of references included in each article (<oai_citeseer:relation> tag).
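Extracting these three fields might look like the sketch below, which uses Python's standard `xml.etree.ElementTree`. The sample record is hypothetical and heavily simplified: its element nesting, the identifier values and the `oai_citeseer` namespace URI are placeholders rather than the actual CiteSeer schema, so a real parser would need to be adapted to the layout shown in Fig. 5.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified record. The element layout and the oai_citeseer
# namespace URI are placeholders, not the exact CiteSeer schema.
SAMPLE = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:oai_citeseer="http://example.org/oai_citeseer/">
  <identifier>oai:example:1</identifier>
  <dc:title>An Example Title</dc:title>
  <oai_citeseer:relation>oai:example:2</oai_citeseer:relation>
  <oai_citeseer:relation>oai:example:3</oai_citeseer:relation>
</record>
"""

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "oai_citeseer": "http://example.org/oai_citeseer/",  # placeholder URI
}


def parse_record(xml_text):
    """Extract the identifier, title and reference list from one record."""
    record = ET.fromstring(xml_text.strip())
    return {
        "identifier": record.findtext("identifier"),
        "title": record.findtext("dc:title", namespaces=NS),
        "references": [
            r.text for r in record.findall("oai_citeseer:relation", NS)
        ],
    }


article = parse_record(SAMPLE)
```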

Preprocessing

The original data consisted of the entire CiteSeer database; a total of 72 files, each holding 10,000 articles with their corresponding bibliographic details. Articles appearing in the list of references of a particular article are also part of the CiteSeer database. In order to retrieve the necessary information and to store it in the relational database we developed a parsing algorithm.

During the parsing process certain errors occurred, mainly concerning articles with insufficient information. For the algorithms presented here, articles lacking information about their authors (26,040 in total) or their publication year (280,098 in total) were excluded from the procedure.

c2-IF algorithm results and statistical analysis

The c2-IF algorithm presented in Fragkiadaki et al. (2009) calculates the numbers of direct and indirect citations present in a Citation Graph, up to a pre-specified depth (in this case up to depth 3). Moreover, it stores in the relational database all the paths in the citation graph that produce these citations, thus giving us complete knowledge of the graph. We note that the database stores information about 410,025 articles, with 265,563 identified authors and 1,245,171 direct references among the articles.

During the processing of the data stored in the database we detected many cases where an article cites articles with later publication dates; for example, article A published in 1995 cites article B published in 2000. Such citations can create cycles in the citation graph, which lead to inaccurate results. In order to avoid such anomalies, we remove from the reference list of every citing article the articles published in the same year as the citing article or in a later year. In other words, every article in the database is “allowed” to cite only articles published prior to itself. All other citations (arcs) are excluded from the original dataset. Thus, the direct references among articles in the database were reduced from 1,245,171 to 1,000,077.
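The filtering rule can be sketched in a few lines. The function name `drop_non_prior_references` and the toy publication years are hypothetical. Because every retained arc points to a strictly earlier year, no path can return to its starting article, so the filtered graph is acyclic.

```python
def drop_non_prior_references(references, year):
    """Keep (citing, cited) pairs only if cited was published strictly earlier."""
    return [
        (citing, cited)
        for citing, cited in references
        if year[cited] < year[citing]
    ]


# Toy data for illustration only.
year = {"A": 1995, "B": 2000, "C": 1995, "D": 1990}
references = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "A")]

# A -> B (later year) and A -> C (same year) are dropped.
kept = drop_non_prior_references(references, year)
```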

After the execution of the algorithm, 1,000,077 1-gen citations, 4,095,493 2-gen citations and 14,924,150 3-gen citations were detected among the articles and that many paths were stored in the database. An interesting fact is that from the 410,025 articles originally included in the database only 133,658 receive at least one citation. To gain a better understanding of our data we calculated the summary statistics for each n-gen (n = 1, 2, 3) citation type (see Table 2).

Table 2 Summary statistics for 1-gen, 2-gen and 3-gen citations

If we compare the mean to the median we observe that in all three cases the median is lower than the mean. This suggests that the means, although high, are driven mostly by a small number of articles with very high values. The quartile information supports this hypothesis. For example, for 1-gen citations we find that at least 75% of the articles in our database have fewer 1-gen citations than the corresponding mean value, whereas the maximum value is 1,280, which is much larger than the values typically calculated for articles. The differences are even greater for 2-gen and 3-gen citations.

Finally we identified the ratios

$$ \frac{\hbox{number\,of\,2-gen\,citations}} {\hbox{number\,of\,1-gen\,citations}} $$
(5)

and

$$ \frac{\hbox{number\,of\,3-gen\,citations}} {\hbox{number\,of\,2-gen\,citations}} $$
(6)

for all articles in our database and we calculated the corresponding summary statistics shown in Table 3.

Table 3 Summary statistics for the ratios in Eqs. 5 and 6

We observe that, on average, for each 1-gen citation an article receives from within our database, it also receives 2.22 2-gen citations, and for each 2-gen citation it receives 1.54 3-gen citations. This is an expected result since, according to the definition of n-gen citations, the number of (n+1)-gen citations an article receives is the sum of all 1-gen citations received by the n-gen citations of the article. For example, the 2-gen citations received by an article are the sum of all 1-gen citations received by the articles directly citing the article in question (1-gen citations). We also mention that there are 44,280 articles for which we cannot calculate ratio 6 because the number of 2-gen citations they have received so far is 0.

Based on these statistical data we chose to use 1/2.2 as a reducing factor for the calculation of the f-value. We expect this value to differ among scientific areas or bibliographic databases.
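One way to read this procedure is sketched below: compute the ratio of Eq. 5 per article, average it over the articles for which it is defined, and take the reciprocal. This is an interpretation, not the authors' code; the function name `reducing_factor` and the toy counts are assumptions, and the paper additionally rounds the mean ratio (2.22) to 2.2.

```python
from statistics import mean


def reducing_factor(gen1, gen2):
    """Return 1 / mean(2-gen / 1-gen) over articles with a 1-gen citation."""
    # Articles with no 1-gen citations have an undefined ratio and are skipped.
    ratios = [gen2[a] / gen1[a] for a in gen1 if gen1[a] > 0]
    return 1 / mean(ratios)


# Toy counts for illustration only (not taken from the CiteSeer data).
gen1 = {"P": 2, "Q": 1, "R": 4, "S": 0}
gen2 = {"P": 4, "Q": 3, "R": 4, "S": 0}

rf = reducing_factor(gen1, gen2)  # ratios 2, 3 and 1 average to 2
```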

f-Value algorithm

In this section we present the algorithm that calculates the f-values of all articles in our bibliographic database. This algorithm requires a finite number of iterations to calculate the f-values.

The algorithm receives as input the list of articles to be processed (\(I\)), the Article Direct Citations (\(ADC\)) data structure, which includes for each article the list of articles that cite it, and the Article F-Values (\(AFV\)) data structure, which includes the articles that need to be processed plus their current f-value and a flag that denotes whether this value has changed since the last iteration. In other words, if we denote an article by \(\hbox{R}_{x}\), then for a database with m articles, the list of all articles that need to be processed is \(I=[\hbox{R}_{1}, \hbox{R}_{2}, \hbox{R}_{3}, \ldots, \hbox{R}_{m}]\). Let \(\hbox{CR}_{x}\) denote the list of articles that reference \(\hbox{R}_{x}\). Thus, \(\hbox{CR}_{x}\) is a subset of \(I\) and the Article Direct Citations data structure is \(ADC=[\hbox{CR}_{1}, \hbox{CR}_{2}, \hbox{CR}_{3}, \ldots, \hbox{CR}_{m}]\). Additionally, for each article \(\hbox{R}_{x}\), let \(\hbox{VR}_{x}\) denote the information required for this article during the execution of the algorithm. This information consists of the f-value calculated so far for this article and of a flag indicating whether the f-value has changed since the last iteration of the algorithm. Thus, \(\hbox{VR}_{x}\) = [fval = 1, changed = 0] for every article \(\hbox{R}_{x}\) at the beginning of the algorithm. Finally, the Article F-Values structure is \(AFV=[\hbox{VR}_{1}, \hbox{VR}_{2}, \ldots, \hbox{VR}_{m}]\). The algorithm returns the AFV structure with the calculated f-values for all articles in the database.

During the first iteration of the algorithm, all articles have an f-value equal to 1. At each iteration, the algorithm calculates the f-values of all articles in the database based on the f-values calculated during the previous iteration and records whether any f-value has changed between the two iterations. If there is at least one changed value, the algorithm requires one more iteration because that change could propagate to more articles in the following iteration. If there is no f-value change then all f-values have been calculated and the algorithm terminates.

Algorithm 1 f-Value algorithm

Input:
    I    list of articles to be processed
    ADC  data structure with direct citations of each article
    AFV  data structure with initial f-values and flags
Output:
    AFV  data structure with calculated f-values and flags

ADC = remove_cycles(ADC)
NChanged = 0
first = true
while (first || NChanged > 0) do
    first = false
    NChanged = 0
    PREV_AFV = copy of AFV
    for each R in I do
        prev_fval = AFV[R][fval]
        AFV[R][fval] = 1
        RCIT = ADC[R]
        for T in RCIT do
            AFV[R][fval] = AFV[R][fval] + RF*PREV_AFV[T][fval]
        if AFV[R][fval] != prev_fval then
            AFV[R][changed] = 1
            NChanged = NChanged + 1
        else
            AFV[R][changed] = 0
return AFV

In order to avoid possible errors in the execution of the algorithm we must ensure that no cycles exist in the collection of articles stored in our database. Since the algorithm calculates the f-value of an article based on the f-values of the articles that cite it, if there is a cycle the algorithm will enter an infinite loop.
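A runnable sketch of Algorithm 1 in Python is given below. It departs from the pseudocode in two labeled ways: changes are detected with a small tolerance `tol` rather than exact inequality (to be robust to floating-point rounding), and a `max_iter` guard raises an error instead of looping forever if a cycle was left in the graph. Both parameters, the function name `compute_f_values` and the toy graph are assumptions; the input is expected to be acyclic already.

```python
def compute_f_values(articles, adc, rf=1 / 2.2, tol=1e-12, max_iter=10_000):
    """Iterate Eq. 4 until no f-value changes.

    articles: list of article ids (I).
    adc: {article: list of articles citing it} (ADC), assumed acyclic.
    Returns {article: f-value} (the fval part of AFV).
    """
    fval = {r: 1.0 for r in articles}
    for _ in range(max_iter):
        prev = dict(fval)  # snapshot, so updates use last iteration's values
        changed = 0
        for r in articles:
            fval[r] = 1.0 + rf * sum(prev[t] for t in adc.get(r, []))
            if abs(fval[r] - prev[r]) > tol:
                changed += 1
        if changed == 0:
            return fval
    raise RuntimeError("f-values did not stabilise; is the graph acyclic?")


# Hypothetical toy graph: P is cited by X1 and X2; Q by Y; Y by Z.
articles = ["P", "Q", "Y", "Z", "X1", "X2"]
adc = {"P": ["X1", "X2"], "Q": ["Y"], "Y": ["Z"]}
result = compute_f_values(articles, adc)
```

On an acyclic graph the values stabilise after at most one iteration more than the length of the longest citation path, since each iteration propagates a change one arc further.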

Experimental results

In order to compare the three different indicators for measuring an article’s scientific impact, we tested them against our database and report the obtained rankings per indicator. Recall that only 133,658 out of 410,025 articles listed in our database actually receive at least one 1-gen citation. In addition, there are 203,607 articles that do not give any citation, 38,100 of which receive citations from other articles while the rest neither give nor receive any citations. Apart from presenting the rankings, the tables are complemented with the c2-IF information about the n-gen citations received by the articles up to depth 3. This information derives from the c2-IF algorithm originally introduced in Fragkiadaki et al. (2009). The algorithm was modified for the needs of the present paper. Table 4 shows the top 10 articles according to the received number of citations.

Table 4 Number of citations: top 10 ranked articles

In order to test the Page Rank algorithm for citation graphs against our bibliographic database, we used an implementation written by Vincent Kräutler in Python (Kräutler 2006), which is based on a mathematical essay by Austin (2006). The implementation of the Page Rank algorithm as a package was imported to a Python script created for handling the reading/writing from/to the database and transforming the data into the appropriate format. The results are shown in Table 5.

Table 5 Page Rank: top 10 ranked articles

Algorithm 1 was implemented and executed against our database. Table 6 shows information about the top 10 ranked articles.

Table 6 f-Value: top 10 ranked articles

Finally, Table 7 shows the summary statistics for all three approaches.

Table 7 Summary statistics

Discussion

In this section, we comment on the similarities and differences of the three indicators. In addition, we attempt to interpret the experimental results we obtained.

The Number of Citations, a measure used traditionally in citation analysis, plays an important role in all indicators. In Page Rank, the direct citations a publication receives are referred to as inbound links to its node in the citation graph and they are similarly used in the calculations of the f-value.

In general, the latter two approaches are based on the assumption that the use of the Number of Citations as a measurement of the importance of a scientific publication is insufficient. The resulting ranking is solely based on the direct impact the article has without taking into account its present state (whether it remains in the researchers’ preferences) or its derived contribution (the impact it has on the research in the specific scientific field). The f-value indicator and Page Rank appear to be very similar in nature, thus, before elaborating on their experimental results, we discuss their main differences and similarities. These are summarized in the following:

  1. The logic behind the equation: Page Rank focuses on a person (the “random scientist”) moving from article to article randomly by choosing to read next an article that appears as a citation in the List of References of the article she reads. All cited articles have the same probability of being selected. The f-value is not based on such a probability, but on the cumulative value of the n-gen citations that an article has received.

  2. How citations are treated: Page Rank for Citation graphs divides the value of an article equally among its cited articles. Such a division implies that, for two articles A and B with equal values, if A cites 10 articles and B cites 20 articles, then articles cited by A will receive twice as much recognition as articles cited by B, just because A has cited fewer articles. Since we cannot assume that cited articles have less impact when they are encountered in longer reference lists, we claim that this division of value does not correspond to real-world behavior; thus, it is not included in the calculations of an article’s f-value.

  3. The damping factor: In the f-value calculation there is no damping factor. Instead, there is a reducing factor used to decrease the accumulated value of the n-gen citations. This factor has been chosen to be \({\frac{1}{2.2}}\) (see the “Determining the reducing factor” section). In addition, the f-value also has a minimum value of 1 for all articles. The f-value of an article always increases as more articles cite directly and/or indirectly the article in question.

Even though the equations used in the calculation of the Page Rank for Citation Analysis and the f-value appear similar, the logic behind each approach is different.
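The second difference can be written out explicitly for the example of two citing articles A and B with equal scores, where A cites 10 articles and B cites 20. Following Eq. 1 and Eq. 4, the contribution that each of them passes to a single cited article is

$$ \hbox{Page Rank:}\quad d\cdot\frac{PR(A)}{10}=2\cdot d\cdot\frac{PR(B)}{20}\quad \hbox{when}\ PR(A)=PR(B) $$

$$ \hbox{f-value:}\quad RF\cdot f(A)=RF\cdot f(B)\quad \hbox{when}\ f(A)=f(B) $$

so under Page Rank an article cited by A receives twice the contribution of an article cited by B, whereas under the f-value both receive the same contribution regardless of the length of the reference list.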

We now proceed to discuss the experimental results in an effort to better understand the differences and similarities among the three indicators. Examining the top 10 ranked articles based on the Number of Citations (Table 4), it is very interesting to notice the c2-IF information provided, especially for the top four ranked articles. We observe that according to this indicator, the “Congestion Avoidance and Control” article is ranked 3rd, because it has received fewer direct citations than the two articles above it. On the other hand, if we examine the c2-IF information, we can clearly see that it has received considerably more 2-gen citations and 3-gen citations than the first and second ranked articles. The same is true to a lesser extent for the fourth ranked article. However, this information is not taken into consideration for this ranking.

Table 5 shows the top 10 articles based on Page Rank along with the corresponding c2-IF information. The ranking is different here, and, by inspecting the c2-IF information of the top two articles, we observe that the first ranked article has fewer 1-gen, 2-gen and even 3-gen citations than the second ranked article. This ordering can only be explained if we consider the way Page Rank values are calculated. Apparently, the “Optimization by Simulated Annealing” article has received fewer 1-gen, 2-gen and 3-gen citations than the second article in absolute numbers, but the prestige (Page Rank value) of the articles that cite it played an important role in the calculations. In addition, the number of citations made by the citing articles has also affected the result. So, we have to assume that although the citations up to 3-gen of the first article are fewer than the ones received by the “Graph-Based Algorithms for Boolean Function Manipulation” article, they are of higher value and/or have a smaller number of outbound links.

The f-value results are presented in Table 6 along with the corresponding c2-IF information. Let us examine the first ranked article. This article was ranked third according to the Number of Citations. This is explained by the fact that the calculation of the f-value is exhaustive in nature and takes into consideration all the knowledge present in the citation graph. In other words, an article’s f-value increases as it receives more citations at each depth, all the way to the longest citation path.

Finally, Table 8 shows all articles listed in Tables 4, 5 and 6 along with their c2-IF information. The articles are ordered by their f-value rank. Again, we observe that the rankings vary significantly depending on the indicator used.

Table 8 Summarized results of Top article rankings based on all three approaches

The first approach, Number of Citations, only takes into account the direct impact an article has, based on the number of citations it receives. Page Rank, on the other hand, does not take into account the direct impact alone but also considers, to some extent, the added value provided by the citing articles of the article in question. We should point out, though, that Page Rank is not an exhaustive method; that is, for the calculation of the importance of a research article one does not traverse the entire citation graph. Finally, in the calculations of the f-value, the indirect impact an article has is fully accumulated. The whole citation graph is traversed and the value of each article is partially propagated to all articles that it cites, thus producing an exhaustive method that uses all the information present in the citation graph.

The calculations for the f-value indicator are based on historical data; that is, they are dependent on the dataset. It is very likely that the reducing factor will be different for different datasets. A different reducing factor is expected to alter the resulting ranking, but the extent to which the ranking is affected requires more research.

Conclusions

Based on the Cascading Citations Indexing Framework, we proposed a new indicator for measuring the importance of a research article. The f-value represents a unique value for each article that takes into consideration the n-gen citations received by the specific article. We developed an algorithm that calculates the f-value for all articles in a bibliographic database, and we experimentally compared it to two other indicators.

Future work in this field will: (a) try to incorporate other aspects of the c2-IF in the calculation of the f-value, (b) examine the impact that different values of the reducing factor have on the final ranking of the articles, and (c) examine whether there can be a unified f-value for interdisciplinary articles.