Abstract
Clustering large amounts of unstructured data is an important challenge in contemporary medicine and biology. This article presents an automatic clustering method for unstructured medical data. The method consists of the following main steps: transformation of the document corpus into a term-frequency matrix; dimensionality reduction of that matrix using principal component analysis (PCA); direct pairwise comparison of documents using the cosine and correlation distances as similarity measures; and, for expertly labelled data sets, finding the optimal number of groups by treating clustering as an optimization problem in which the objective function is the F-measure, maximized by selecting parameter values such as the PCA resolution and the similarity threshold for pairs of documents. The usefulness of the proposed methodology was demonstrated by performing calculations on three data sets: short sentences divided into three themes, radiological reports of aneurysms, and radiological reports of abdomen studies. A common barrier in clustering unstructured data is the difficulty of interpreting the results. To overcome this limitation, several presentation methods are demonstrated: group histograms, similarity matrices, plots of document assignment to the found clusters, F-measure interpolation, and alphabetical and term-frequency dictionaries. Excluding the labelling step, the presented method is completely automated and can be used as a preliminary data analysis method for large bodies of text to discover potential groups of interesting topics.
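The steps listed in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the whitespace tokenizer, the grouping rule (documents whose cosine similarity reaches the threshold are linked, and connected components become clusters), the pairwise definition of the F-measure, and all function names and toy documents. The PCA step and the correlation distance are omitted to keep the example short, so only the similarity threshold is tuned here.

```python
import math
from collections import Counter
from itertools import combinations


def term_frequency_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_by_threshold(rows, threshold):
    """Assumed grouping rule: link pairs with similarity >= threshold and
    return one cluster id per document (connected components, union-find)."""
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(rows)), 2):
        if cosine_similarity(rows[i], rows[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(rows))]


def pairwise_f_measure(clusters, labels):
    """F-measure over document pairs: a pair is a true positive when the
    two documents share both a cluster and an expert label."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_label = labels[i] == labels[j]
        if same_cluster and same_label:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(rows, labels, candidates):
    """Grid search: return (F-measure, threshold) with the highest F-measure."""
    return max(
        (pairwise_f_measure(group_by_threshold(rows, t), labels), t)
        for t in candidates
    )


# Toy corpus with hypothetical expert labels.
docs = [
    "aneurysm of the cerebral artery",
    "cerebral artery aneurysm detected",
    "liver lesion in the abdomen",
    "abdomen scan shows liver lesion",
]
labels = ["aneurysm", "aneurysm", "abdomen", "abdomen"]

vocab, rows = term_frequency_matrix(docs)
best_f, best_t = best_threshold(rows, labels, [0.1, 0.5, 0.9])
print(best_f, best_t)
```

On this toy corpus a low threshold merges all four documents through the shared word "the", and a high threshold leaves every document isolated, so the intermediate threshold gives the best F-measure. A fuller version would add a second loop over the number of retained principal components and search both parameters jointly.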
Copyright information
© 2019 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Wilczek, S., Gawrysiak, K., Spinczyk, D. (2019). Similarity Search for the Content of Medical Records Using Unstructured Data. In: Pietka, E., Badura, P., Kawa, J., Wieclawek, W. (eds) Information Technology in Biomedicine. ITIB 2018. Advances in Intelligent Systems and Computing, vol 762. Springer, Cham. https://doi.org/10.1007/978-3-319-91211-0_44
DOI: https://doi.org/10.1007/978-3-319-91211-0_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91210-3
Online ISBN: 978-3-319-91211-0
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)