RE-EM trees: a data mining approach for longitudinal and clustered data

Sela, Rebecca J.; Simonoff, Jeffrey S.

doi:10.1007/s10994-011-5258-3

Published: 13 July 2011

Machine Learning, Volume 86, pages 169–207 (2012)

Abstract

Longitudinal data refer to the situation where repeated observations are available for each sampled object. Clustered data, where observations are nested in a hierarchical structure within objects (without time necessarily being involved), represent a similar type of situation. Methodologies that take this structure into account allow for two possibilities: systematic differences between objects that are not related to attributes, and autocorrelation within objects across time periods. A standard methodology in the statistics literature for this type of data is the mixed effects model, where these differences between objects are represented by so-called “random effects” that are estimated from the data (population-level relationships are termed “fixed effects,” together resulting in a mixed effects model). This paper presents a methodology that combines the structure of mixed effects models for longitudinal and clustered data with the flexibility of tree-based estimation methods. We apply the resulting estimation method, called the RE-EM tree, to pricing in online transactions, showing that the RE-EM tree is less sensitive to parametric assumptions and provides improved predictive power compared to linear models with random effects and regression trees without random effects. Applied to a smaller data set examining accident fatalities, the RE-EM tree strongly outperforms a tree without random effects while performing comparably to a linear model with random effects. Finally, extensive simulation experiments show that the estimator improves predictive performance relative to regression trees without random effects and is comparable or superior to linear models with random effects in more general situations.
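The idea of combining random effects with a tree can be illustrated with a small sketch. The abstract does not spell out the estimation steps, so the code below is not the authors' implementation. It makes several simplifying assumptions: a random intercept is the only random effect, a one-split regression tree (a stump) on a single attribute stands in for a full tree, and the ratio of error variance to random-effect variance (`shrink_ratio`) is fixed by the user rather than estimated by a mixed effects fit. All function and parameter names are hypothetical.

```python
from statistics import mean


def fit_stump(x, y):
    """Least-squares fit of a one-split regression tree on a single attribute."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2  # candidate split between two observed values
        left = [yi for xi, yi in zip(x, y) if xi <= thr]
        right = [yi for xi, yi in zip(x, y) if xi > thr]
        ml, mr = mean(left), mean(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    if best is None:  # constant attribute: no split is possible
        m = mean(y)
        return (float("inf"), m, m)
    return best[1:]  # (threshold, left-leaf mean, right-leaf mean)


def predict_stump(stump, x):
    thr, ml, mr = stump
    return [ml if xi <= thr else mr for xi in x]


def fit_random_intercept_tree(groups, x, y, shrink_ratio=1.0, n_iter=20):
    """Alternate between a tree fit and random-intercept estimates.

    shrink_ratio is an ASSUMED, fixed value of (error variance) /
    (random-intercept variance); a mixed effects model would estimate it.
    """
    b = {g: 0.0 for g in set(groups)}  # start with all random effects at zero
    stump = None
    for _ in range(n_iter):
        # Step 1: fit the tree to the response with random effects removed.
        adjusted = [yi - b[g] for g, yi in zip(groups, y)]
        stump = fit_stump(x, adjusted)
        fitted = predict_stump(stump, x)
        # Step 2: re-estimate each intercept as a shrunken mean residual.
        resid = {}
        for g, yi, fi in zip(groups, y, fitted):
            resid.setdefault(g, []).append(yi - fi)
        b = {g: sum(r) / (len(r) + shrink_ratio) for g, r in resid.items()}
    return stump, b


def predict(stump, b, group, xi):
    """Tree prediction plus the object's intercept (zero for unseen objects)."""
    return predict_stump(stump, [xi])[0] + b.get(group, 0.0)


if __name__ == "__main__":
    # Toy data: three objects with different baselines and a step in x.
    groups = ["a"] * 4 + ["b"] * 4 + ["c"] * 4
    x = [0, 1, 2, 3] * 3
    offset = {"a": 2.0, "b": 0.0, "c": -2.0}
    y = [offset[g] + (10.0 if xi >= 2 else 0.0) for g, xi in zip(groups, x)]
    stump, b = fit_random_intercept_tree(groups, x, y)
    print(stump, b)
```

In this sketch the tree captures the population-level relationship, while the per-object intercepts absorb systematic differences between objects that the attribute does not explain. For an object not seen during fitting, `predict` falls back to the tree prediction alone.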

Author information

Authors and Affiliations

J.P. Morgan Chase & Co., Columbus, OH, USA
Rebecca J. Sela
Statistics Group, Information, Operations, and Management Sciences Department, Leonard N. Stern School of Business, New York University, New York, NY, USA
Jeffrey S. Simonoff

Corresponding author

Correspondence to Rebecca J. Sela.

Additional information

Editor: Johannes Fürnkranz.

About this article

Cite this article

Sela, R.J., Simonoff, J.S. RE-EM trees: a data mining approach for longitudinal and clustered data. Mach Learn 86, 169–207 (2012). https://doi.org/10.1007/s10994-011-5258-3

Received: 09 February 2010
Accepted: 14 June 2011
Published: 13 July 2011
Issue Date: February 2012
DOI: https://doi.org/10.1007/s10994-011-5258-3
