Abstract
Given the evolution of the concept of text and the continuous growth of textual information of many kinds available online, a key issue for linguists and information analysts who build hypotheses and validate models is to have efficient tools for textual analysis that can adapt to large volumes of heterogeneous, often changing, and distributed data. In this communication we propose to examine new statistical methods that fit this framework but whose range of application also extends to the more general context of dynamic numerical data.
For that purpose, we have recently proposed an alternative metric based on feature maximization. The principle of this metric is to define a compromise between generality and discrimination, relying both on the properties of the data that are specific to each group of a partition and on those that are shared between groups. A key advantage of this method is that it operates in incremental mode both for clustering (i.e. unsupervised classification) and for traditional categorization. We have shown that it solves very efficiently complex multidimensional problems related to unsupervised analysis of textual or linguistic data, such as topic tracking over time-varying data or automatic classification in a natural language processing (NLP) context. It also adapts to traditional discriminant analysis, often used in text mining, and to automatic text indexing or summarization, with performance far superior to that of conventional methods. More generally, this parameter-free approach can serve as an accurate feature selection and data resampling method in any numerical or non-numerical context.
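To make the compromise described above concrete, the following is a minimal sketch, not the paper's exact formulation: for each feature and group, a "generality" term (the share of the feature's total weight that falls in the group) is combined with a "discrimination" term (the share of the group's total weight carried by that feature). The data layout, weights, and names are illustrative assumptions.

```python
# Illustrative sketch of a feature-maximization-style measure: for each
# feature f and group c, combine generality (recall of f's weight in c
# over f's weight in the whole partition) with discrimination (f's weight
# in c over the total weight of c). Names and layout are assumptions.
from collections import defaultdict

def feature_f_measure(groups):
    """groups: {group_id: {feature: weight}} -> {(group, feature): score}."""
    # Total weight of each feature across all groups (generality denominator).
    feat_total = defaultdict(float)
    for feats in groups.values():
        for f, w in feats.items():
            feat_total[f] += w

    scores = {}
    for c, feats in groups.items():
        group_total = sum(feats.values())  # discrimination denominator
        for f, w in feats.items():
            recall = w / feat_total[f]      # generality across the partition
            precision = w / group_total     # discrimination within the group
            scores[(c, f)] = recall * precision
    return scores

# Example: feature "a" dominates group c1, feature "b" dominates c2.
scores = feature_f_measure({"c1": {"a": 4.0, "b": 1.0},
                            "c2": {"a": 1.0, "b": 4.0}})
```

Under this sketch, a feature can then be retained for a group when its score there exceeds a reference level such as its average score over the groups in which it appears, which is the kind of selection behaviour the abstract alludes to.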
We present the general principles of feature maximization and return in particular to its successful applications in the supervised framework, comparing its performance with that of state-of-the-art methods on reference databases.
© 2015 Springer International Publishing Switzerland
Cite this paper
Lamirel, JC. (2015). New Metrics and Related Statistical Approaches for Efficient Mining in Very Large and Highly Multidimensional Databases. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. BDAS 2015. Communications in Computer and Information Science, vol 521. Springer, Cham. https://doi.org/10.1007/978-3-319-18422-7_1
Print ISBN: 978-3-319-18421-0
Online ISBN: 978-3-319-18422-7