Perspective on Data Mining from Statistical Viewpoints

Sato, Yoshiharu

doi:10.1007/3-540-45571-X_1

Yoshiharu Sato⁴

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 1805))

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

1724 Accesses
3 Citations

Abstract

The history of statistical data analysis is old, it goes back to the 1920’s. Many fundamental concepts of multivariate statistical data analysis, especially pure theoretical notions, have been accomplished by the 1950’s. After the 1960’s, the practical applications of multivariate statistical data analysis have been available, coupled with the progress of computers, and these have also been an affect on theoretical considerations.

The basic process of data analysis is given as follows:

p1).
An objective of data analysis is given.
p2).
The data which seems to be closely connected with the objective is observed. (sampling data)
p3).
Constructing a model (or a set of models) for explaining the variation of the data.
p4).
Preprocessing (or transforming) the original data in order to make consistency between input data and the model.
p5).
Identification of the model based on observed (input) data.
p6).
Evaluate a goodness of fit. If the goodness of fit is insufficient, then return to P2) or P3), else go to next process.
p7).
Interpretation of the result and investigate the validity.

The most different point on “data mining” and statistical data analysis seems to be the concept of “Data”. In data mining, the data is given as a database in advance. But, in statistical data analysis, the data is observed according to the objective of the analysis.

On the other hand, the object of “data mining” is to find the effective (or valuable) information in the data. From the framework of statistical data analysis above, the main processes of data mining are p3), p4) and p5). However, the concept of “efficient information” in data mining is different from the main part of the data variation in statistical data analysis. For instance, in principal component analysis, the main part of the data variation is obtained as the first principal component, which has the largest proportion. But in data mining, the major variation of the data is of no interest, because the knowledge obtained from it is trivial. Then, data mining seems to be interested in the principal components with small proportion in order to get unusual but valuable information. Hence, statistical data analysis for residual data which is removing the main part of the data variation from the original data, will be useful for data mining.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Author information

Authors and Affiliations

Division of Systems and Information Engineering, Hokkaido University, Sapporo, 060-8628, Japan
Yoshiharu Sato

Authors

Yoshiharu Sato
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Graduate School of Systems Management, Universiy of Tsukuba, 3-29-1 Otsuka, Bunkyo-ku, Tokyo, 112-0012, Japan
Takao Terano
Department of Computer Science and Engineering, Arizona State University, P.O. Box 875 406, Tempe, AZ, 85287-5406
Huan Liu
Department of Computer Science, National Tsing Hua University, Hsinchu, 300, Taiwan ROC
Arbee L. P. Chen

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Sato, Y. (2000). Perspective on Data Mining from Statistical Viewpoints. In: Terano, T., Liu, H., Chen, A.L.P. (eds) Knowledge Discovery and Data Mining. Current Issues and New Applications. PAKDD 2000. Lecture Notes in Computer Science(), vol 1805. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45571-X_1

Download citation

DOI: https://doi.org/10.1007/3-540-45571-X_1
Published: 24 March 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67382-8
Online ISBN: 978-3-540-45571-4
eBook Packages: Springer Book Archive

Publish with us

Policies and ethics