Abstract
Non-parametric smoothing of the location model is a potential basis for discriminating between groups of objects using mixtures of continuous and categorical variables simultaneously. However, the parameter estimates may become unreliable when too many variables are involved. This paper proposes a method for performing variable selection on the basis of the distance between groups, as measured by smoothed Kullback–Leibler divergence. Search strategies using forward, backward and stepwise selection are outlined, and corresponding stopping rules derived from asymptotic distributional results are proposed. Results from a Monte Carlo study demonstrate the feasibility of the method. Examples on real data show that the method is generally competitive with, and sometimes better than, other existing classification methods.
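To make the idea of divergence-driven forward selection concrete, the following is a minimal sketch under strong simplifying assumptions and is not the procedure of the paper. It assumes two groups, continuous variables only, and Gaussian groups with a common covariance matrix, in which case the symmetrised Kullback–Leibler divergence reduces to the squared Mahalanobis distance between the group means. The function names and the fixed threshold `min_gain` are hypothetical; the threshold merely stands in for the asymptotic stopping rules mentioned above, which are not reproduced here.

```python
import numpy as np


def symmetric_kl_gaussian(x1, x2, cols):
    """Symmetrised KL divergence between two groups on the columns `cols`.

    Assumption: both groups are Gaussian with a common covariance matrix,
    so the divergence equals the squared Mahalanobis distance between means.
    """
    a, b = x1[:, cols], x2[:, cols]
    n1, n2 = len(a), len(b)
    diff = a.mean(axis=0) - b.mean(axis=0)
    # Pooled within-group covariance; atleast_2d handles the one-variable case.
    pooled = ((n1 - 1) * np.cov(a, rowvar=False)
              + (n2 - 1) * np.cov(b, rowvar=False)) / (n1 + n2 - 2)
    pooled = np.atleast_2d(pooled)
    return float(diff @ np.linalg.solve(pooled, diff))


def forward_select(x1, x2, min_gain=0.5):
    """Greedy forward selection maximising the divergence between groups.

    `min_gain` is a hypothetical fixed threshold on the increase in
    divergence; it is a placeholder for a proper statistical stopping rule.
    """
    remaining = list(range(x1.shape[1]))
    selected, current = [], 0.0
    while remaining:
        # Divergence obtained by adding each candidate variable in turn.
        scores = {j: symmetric_kl_gaussian(x1, x2, selected + [j])
                  for j in remaining}
        best = max(scores, key=scores.get)
        if scores[best] - current < min_gain:
            break  # no candidate increases the divergence enough
        selected.append(best)
        remaining.remove(best)
        current = scores[best]
    return selected, current


if __name__ == "__main__":
    # Simulated data: columns 0 and 2 separate the groups, 1 and 3 are noise.
    rng = np.random.default_rng(0)
    g1 = rng.normal(size=(200, 4))
    g2 = rng.normal(size=(200, 4))
    g2[:, 0] += 3.0
    g2[:, 2] += 2.0
    print(forward_select(g1, g2))
```

Backward elimination would follow the same pattern in reverse, starting from all variables and removing the one whose deletion reduces the divergence least. Handling categorical variables would require the location model itself, which this sketch deliberately omits.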
About this article
Cite this article
Mahat, N.I., Krzanowski, W.J. & Hernandez, A. Variable selection in discriminant analysis based on the location model for mixed variables. ADAC 1, 105–122 (2007). https://doi.org/10.1007/s11634-007-0009-9
DOI: https://doi.org/10.1007/s11634-007-0009-9
Keywords
- Brier score
- Cross-validation
- Discriminant analysis
- Error rate
- Kullback–Leibler divergence
- Location model
- Non-parametric smoothing procedures
- Variable selection