Abstract
In previous work, we showed that it is possible to discriminate automatically between legitimate software and spyware-associated software by performing supervised learning on end user license agreements (EULAs). However, the number of false positives (spyware classified as legitimate software) was too large for practical use. In this study, the false positives problem is addressed by removing noisy EULAs, which are identified by performing similarity analysis of the previously studied EULAs. Two candidate similarity analysis methods for this purpose are experimentally compared: cosine similarity assessment in conjunction with latent semantic analysis (LSA) and normalized compression distance (NCD). The results show that the number of false positives can be reduced significantly by removing noise identified by either method. However, the experimental results also indicate subtle performance differences between LSA and NCD. To improve performance further and to decrease the large number of attributes, the categorical proportional difference (CPD) feature selection algorithm was applied. CPD greatly reduced the number of attributes while at the same time increasing classification performance on the original data set, as well as on the LSA- and NCD-based data sets.
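To make the NCD measure named above concrete, the following minimal Python sketch computes it using the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C denotes compressed size. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which compressor was used, so zlib stands in here, and the function names and sample texts are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    # Assumption: zlib at maximum compression level stands in for the
    # compressor; the abstract does not specify which one was used.
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    # Normalized compression distance:
    # (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical stand-ins for EULA texts (invented for illustration only).
eula_a = "The software may collect usage information and display advertisements. " * 20
eula_b = "The software may collect usage information and show advertisements. " * 20
eula_c = "Photos are stored locally on your device and are never shared with third parties. " * 20

print("a vs a:", ncd(eula_a, eula_a))
print("a vs b:", ncd(eula_a, eula_b))
print("a vs c:", ncd(eula_a, eula_c))
```

Smaller values indicate more similar texts. With a real compressor the distance between a text and itself is close to, but usually not exactly, zero.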
Cite this article
Lavesson, N., Axelsson, S. Similarity assessment for removal of noisy end user license agreements. Knowl Inf Syst 32, 167–189 (2012). https://doi.org/10.1007/s10115-011-0438-9