Abstract
This work stems from the idea of describing the scientific productivity of Italian statisticians. Achieving this goal raises several questions: What data should be used? Have the data been cleaned? What techniques can be used? We propose using multiple sources and multiple metrics to build a complete information base. We check the correctness of the data using multivariate outlier identification techniques, and we transform the data appropriately. We apply robust clustering to verify the existence of homogeneous groups, and we suggest using the forward search to establish a ranking among scholars. The proposed methodology, which in this case allowed us to group scholars into four homogeneous groups and to sort them according to multidimensional data, can be applied to similar problems in bibliometrics.
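As a rough illustration of the transformation and outlier-checking steps, the following Python sketch applies a Box-Cox transformation to two bibliometric counts and then flags units with a large squared Mahalanobis distance. It is a simplified stand-in under stated assumptions, not the procedure used in the article: the data are invented, the function names `box_cox` and `mahalanobis_sq_2d` are hypothetical, the transformation parameter is fixed at zero (the logarithm) rather than estimated, and the distances use the classical mean and covariance rather than robust estimates or the forward search.

```python
import math


def box_cox(x, lam):
    """Box-Cox power transform of a single positive value."""
    if x <= 0:
        # Zero counts would need a shift or a different transformation family.
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean.

    Uses the classical (non-robust) mean and covariance, so it is only an
    illustration; robust estimates would resist masking far better.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical (papers, citations) pairs; the eighth record mimics a data error.
scholars = [(12, 40), (15, 55), (9, 30), (20, 80), (11, 35),
            (14, 60), (18, 70), (10, 900), (16, 65), (13, 45)]

# Assumed lambda = 0 for both variables, i.e. a log transformation.
transformed = [(box_cox(p, 0), box_cox(c, 0)) for p, c in scholars]
d2 = mahalanobis_sq_2d(transformed)

# The 0.975 quantile of a chi-square with 2 degrees of freedom has the
# closed form -2 * ln(1 - 0.975), since that distribution is exponential.
cutoff = -2.0 * math.log(1.0 - 0.975)
flagged = [i for i, d in enumerate(d2) if d > cutoff]
print(flagged)
```

With real data the cutoff and the estimates of location and scatter would both need more care, because a classical covariance matrix is itself inflated by the very outliers it is meant to detect.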
Cite this article
De Battisti, F., Salini, S. Robust analysis of bibliometric data. Stat Methods Appl 22, 269–283 (2013). https://doi.org/10.1007/s10260-012-0217-0
DOI: https://doi.org/10.1007/s10260-012-0217-0