Abstract
Large digital libraries have become available in recent years through digitisation and aggregation projects. These large collections present a challenge to new users who wish to discover what the collections contain. Subject classification can help with this task, but in large collections it is frequently incomplete or inconsistent. Automatic clustering algorithms offer a solution; however, the question remains whether they produce clusters that are sufficiently cohesive and distinct to support discovery and exploration in digital libraries. In this paper we present a novel approach to investigating cluster cohesion that is based on identifying intruders in a cluster. The results from a human-subject experiment show that clustering algorithms produce clusters that are sufficiently cohesive to be used where no (consistent) manual classification exists.
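The idea of identifying intruders in a cluster can be illustrated with a minimal sketch. The abstract does not give the experimental protocol, so everything here is an assumption made for illustration: the function names `build_intruder_task` and `intruder_detection_rate`, the choice of four cluster members per task, and the use of the fraction of correctly identified intruders as a cohesion signal.

```python
import random


def build_intruder_task(clusters, target_id, n_members=4, rng=None):
    """Build one intruder task (hypothetical helper, not the authors' code).

    clusters: dict mapping a cluster id to a list of item labels.
    Returns (items, intruder_position), where items holds n_members items
    drawn from the target cluster plus one item from a different cluster.
    """
    rng = rng or random.Random()
    # Items that genuinely belong to the cluster under evaluation.
    members = rng.sample(clusters[target_id], n_members)
    # The intruder comes from any other non-empty cluster.
    other_ids = [cid for cid in clusters if cid != target_id and clusters[cid]]
    intruder = rng.choice(clusters[rng.choice(other_ids)])
    # Place the intruder at a random position so its location gives no hint.
    position = rng.randrange(n_members + 1)
    items = list(members)
    items.insert(position, intruder)
    return items, position


def intruder_detection_rate(responses):
    """Fraction of tasks in which the participant picked the true intruder.

    responses: list of (chosen_position, intruder_position) pairs.
    """
    if not responses:
        return 0.0
    correct = sum(1 for chosen, truth in responses if chosen == truth)
    return correct / len(responses)
```

Under these assumptions, a cluster whose intruders are found reliably would be read as cohesive, while a detection rate near chance (one in five for the task size used here) would suggest that participants cannot tell the cluster's items apart from unrelated ones.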
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hall, M., Clough, P., Stevenson, M. (2012). Evaluating the Use of Clustering for Automatically Organising Digital Library Collections. In: Zaphiris, P., Buchanan, G., Rasmussen, E., Loizides, F. (eds) Theory and Practice of Digital Libraries. TPDL 2012. Lecture Notes in Computer Science, vol 7489. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33290-6_35
DOI: https://doi.org/10.1007/978-3-642-33290-6_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33289-0
Online ISBN: 978-3-642-33290-6
eBook Packages: Computer Science, Computer Science (R0)