Introduction

Research on factors related to student retention has traditionally relied on surveying a student cohort and following them for a specified period of time to determine whether they ultimately dropped out or whether they continued their education. Using this design, researchers have worked to validate theoretical models of student retention including Tinto’s (1987, 1993) widely employed model of student integration (Mallette and Cabrera, 1991; Pascarella and Terenzini, 1980, 1983; Terenzini, Lorang, and Pascarella, 1981). Results of these efforts have led other researchers to elaborate Tinto’s theory by integrating components of other theoretical approaches such as Astin’s (1984) theory of involvement (Berger and Milem, 1999) and Bean’s (1985) student attrition model (Berger and Braxton, 1998; Cabrera, Nora, and Castañeda, 1993) using a variety of survey instruments comprised of various scales designed to measure the components under consideration.

Cabrera et al. (1993), who used this type of survey research, caution that their findings on which factors significantly contribute to student retention are not broadly generalizable to other institutions because patterns in persistence may vary according to institutional type, setting, and student characteristics; such findings should be used only as a starting point in studying retention at other institutions. However, individual institutions may not have the capacity to construct and administer a similar instrument to study their unique retention situation. While most institutions do administer surveys to their students at various times during their academic career, these instruments meet other institutional goals and are usually not specifically designed to support theoretically based retention research. Adding another survey to an institution’s research agenda is likely to be well beyond the available capacity of the institution. Even if an institution is capable of fielding a one-time retention survey, repeated administrations over time may be too burdensome to sustain. Thus, it is key for enrollment professionals and researchers to have an efficient means of evaluating the trends in the circumstances of student retention at their institution in order to develop or adjust support programs accordingly, a need that is not well served by the sort of robust survey research widely employed in the literature.

However, institutions do routinely collect a broad array of information on their students’ backgrounds, socioeconomic status, academic progress, and, in many cases, their academic goals and social involvement, factors that align well with the major theoretical models of student retention (e.g., Astin, 1984; Bean, 1985; Tinto, 1987, 1993). While some research has been conducted using only institutional data variables (e.g., Antley, 1999; Porter, 1999, 2002), the comparability of these findings to results of survey-based studies has not been empirically established. If found to be of comparable predictive ability to traditional survey-based retention studies, the use of these institutional variables as proxies for factors measured through survey scales would be of great benefit to institutions that do not have the resources to support an ongoing retention research program. Further, because these institutional databases of student information are often retained for each entering cohort, it would be possible for institutions to conduct longitudinal analyses on factors contributing to retention.

This research project¹ is designed to investigate the comparability of a traditional survey-based retention study and a retention study based on standard, widely available institutional student data of theoretical importance to retention. Thus, the predictive ability of independent variables drawn from a widely used retention survey is compared to the predictive ability of independent variables drawn from institutional databases of the same cohort in order to determine which approach is more robust. Since it is key for enrollment professionals to have an efficient means of evaluating the trends in student retention in order to adjust support programs accordingly, the findings of this project offer institutional researchers valuable guidance on the strengths and limitations of these methodological approaches.

Theoretical Framework

Tinto’s theory of student integration is perhaps the most widely cited in the literature on student retention and, as such, was used for this study. Tinto’s theory grows out of the work of Durkheim (1951) on suicidal behavior. Durkheim concluded that suicidal behavior was the result of the inability to integrate socially and intellectually into society. Tinto (1975) viewed withdrawal from postsecondary education as analogous to suicidal behavior and consequently postulated that student attrition was due to inadequate social and academic integration into the institutional culture. As such, Tinto argues that successful students enter college with background characteristics (e.g., family background, individual attributes, aptitude, and motivation) that are the basis for their initial contact with the institution. The students then interact with the institutional environment and these experiences influence their commitments and intentions. As students become more integrated into the culture of the institution, their goal commitment increases which fosters their continued enrollment and academic progress. If students are not able to successfully integrate into the academic and social communities at their institution, their goal and institutional commitments are diminished, resulting in a greater likelihood of departure.

Tinto’s theory has undergone numerous validation efforts (e.g., Knight, 2002; Pascarella and Chapman, 1983; Pascarella and Terenzini, 1980; Terenzini and Pascarella, 1980; Terenzini et al., 1981; Tinto, 1982) and has been shown to perform reasonably well in predicting student attrition. For this reason, both the survey and institutional variable components of this research project were guided by Tinto’s model.

Profile of the Study Institution

The study institution for this research project is North Carolina State University, a public, land grant, doctoral research institution situated in Raleigh, North Carolina. The institution focuses heavily on science, technology, and engineering with historical strengths in programs for which it is the unique provider in the state (e.g., agriculture, forestry, and textiles). In fall 2004, NC State enrolled a total of 29,957 students, 76% of whom were undergraduates (N = 22,754). The fall 2004 new freshman cohort of 3,958 students represented 13% of the total institutional enrollment. Freshman 1-year retention for the fall 2004 cohort was 89.0%.

Methodology

Research Design

The purpose of this study is to empirically determine the comparability of traditional survey-based retention research methodology with an alternative approach that relies on data commonly available in institutional student databases. In the first stage of this research, the fall 2004 new freshmen at the study institution were asked to complete a survey designed to predict their likelihood of persistence. Survey respondents’ subsequent enrollment status was determined based on the fall 2005 student data file. If the respondents were still enrolled in fall 2005, then they were identified as retained and if they were not enrolled for their third semester, then they were identified as withdrawn.

The survey phase of this project entailed the development of a web-based survey which was administered to all new freshmen in fall 2004. Access to the survey was limited to fall 2004 new freshmen by requiring each participant to provide their university user name and password. This strategy also worked to ensure only one response was received from each eligible new freshman and survey respondents’ identities were safeguarded by the use of secure computer firewall systems. Monetary incentives to participate were provided via a lottery of all respondents. The institutional student data used in this research project was obtained from annual data files generated by the University Planning and Analysis office for the purpose of institutional decision making and federal and state reporting requirements.

Utilizing an information-theoretic approach to model selection (Burnham and Anderson, 2002), an a priori set of candidate predictive models was developed using the institutional integration subscale variables and a variety of independent variables drawn from the study institution’s student databases. These variables were used to predict the likelihood of fall enrollment for the survey respondents through a multivariate logistic regression analysis. Logistic regression is an appropriate analytical tool because of this method’s ability to describe the relationship between a categorical dependent variable and a number of both interval and categorical independent variables (Hosmer and Lemeshow, 2000). As an application in higher education, logistic regression is considered superior to other methods (e.g., linear discriminant function analysis) when the dependent variable is categorical (Peng, So, Stage, and St. John, 2002).
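As a sketch of the model form implied by this description (the symbols below are illustrative notation and are not taken from the original analysis), a multivariate logistic regression expresses the log-odds of fall 2005 enrollment for student i as a linear function of k independent variables:

```latex
\[
\operatorname{logit}(p_i)
  = \ln\!\left(\frac{p_i}{1 - p_i}\right)
  = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik},
\qquad
p_i = \Pr(\text{retained}_i = 1 \mid x_{i1}, \ldots, x_{ik})
\]
```

The coefficients are estimated by maximum likelihood, and categorical independent variables would enter through indicator (dummy) variables, so that k counts the estimated slopes rather than the number of source variables.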

Variables

The dependent variable (first-year retention) was operationalized as enrollment during the fall semester of 2005. For this variable, each student in the study cohort was identified as either continuously enrolled in their third semester, or withdrawn from the institution. The enrollment status of each student was determined through analysis of the fall 2005 student data file (SDF).
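The operationalization above amounts to a membership check of each cohort member against the fall 2005 file. A minimal sketch, assuming each student carries a unique identifier; the function name and the identifiers are hypothetical and do not come from the study's data files:

```python
def classify_retention(cohort_ids, fall_2005_enrolled_ids):
    """Return 1 (retained) or 0 (withdrawn) for each cohort member.

    A student is treated as retained when their identifier appears in the
    fall 2005 enrollment file; everyone else is treated as withdrawn.
    """
    enrolled = set(fall_2005_enrolled_ids)  # set gives constant-time lookups
    return {sid: 1 if sid in enrolled else 0 for sid in cohort_ids}


# Hypothetical identifiers, for illustration only.
cohort = ["s001", "s002", "s003", "s004"]
fall_2005_file = ["s001", "s003", "s004", "s999"]  # s999 is not in the cohort

status = classify_retention(cohort, fall_2005_file)
retention_rate = sum(status.values()) / len(status)
```

Students who appear in the fall 2005 file but not in the cohort (such as transfers) are ignored, because only cohort identifiers are classified.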

The survey portion of this research project utilized the Institutional Integration Scale (IIS) developed by Pascarella and Terenzini (1980) to measure the various dimensions identified by Tinto as corresponding to the likelihood of persistence. This scale has been found to be generally supportive of the major dimensions of Tinto’s theory of student integration. The Institutional Integration Scale contains 30 items arrayed in five subscales covering peer group interactions, interactions with faculty, faculty concern for student development and teaching, academic and intellectual development, and institutional and goal commitments. Pascarella and Terenzini (1980) used principal component factor analysis, multivariate analysis of covariance, and discriminant analysis to verify the predictive validity of this instrument. Because of the unequal number of items in each subscale, mean scores for the instrument subscales were used as independent variables in the survey-based model of student retention (see Table 1).
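The reason for using mean rather than summed scores can be shown with a small sketch. The item names, the item-to-subscale assignment, and the responses below are hypothetical placeholders and are not the actual IIS items; the point is only that a mean keeps subscales with different numbers of items on the same response scale.

```python
from statistics import mean

# Hypothetical assignment of survey items to two subscales of unequal length.
SUBSCALE_ITEMS = {
    "peer_group_interactions": ["q1", "q2", "q3"],
    "interactions_with_faculty": ["q4", "q5"],
}


def subscale_means(responses):
    """Average a respondent's item scores within each subscale."""
    return {
        name: mean(responses[item] for item in items)
        for name, items in SUBSCALE_ITEMS.items()
    }


# One hypothetical respondent answering on a 1-5 scale.
respondent = {"q1": 4, "q2": 5, "q3": 3, "q4": 2, "q5": 4}
scores = subscale_means(respondent)

# Summed scores, by contrast, depend on how many items a subscale has.
sums = {
    name: sum(respondent[item] for item in items)
    for name, items in SUBSCALE_ITEMS.items()
}
```

With sums, the three-item subscale would dominate the two-item subscale simply because it has more items; the means remain comparable across subscales.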

Table 1. Institutional Integration Scale

For the institutional database aspect of this research project, independent variables having a bearing on the dimensions of Tinto’s retention theory were drawn from various institutional databases. Seven of these variables (certainty of major, intention to participate in co-curricular activities or programs, working during first semester, contact with someone from the study institution prior to enrollment, highest degree intended, and parents’ educational attainment, recorded separately as mother’s and father’s education) were obtained from the First Year Student Survey. The First Year Student Survey is mandated by the UNC System Office of the President and is administered each year to incoming students during summer orientation. The other independent variables (number of hours carried in first semester, weighted HS GPA, total SAT, and residency status) were obtained from the fall 2004 student data file (SDF).

Tinto (1987, 1993) demonstrated that academic integration is a key component in his theory of student integration. Weighted high school GPA and total SAT provide insight into the potential academic performance of the student and have been shown to have a strong, positive effect on persistence (Nelson, Scott, and Bryan, 1984; Porter, 1999, 2002; Terkla, 1985; Tinto, 1993). Similarly, first semester GPA has been shown in the literature on institutional departure to positively reflect the academic integration of the student into the institution’s intellectual community (Horn and Carroll, 1998; Mallette and Cabrera, 1991).

Another variable, parent’s educational background, has also been widely used in the research literature. Porter (1999) as well as Horn and Carroll (1998) demonstrate a significant positive effect of this variable on retention, hypothesizing that students with parents who had a collegiate experience were more likely to receive encouragement and support from their families which, in turn, increases the likelihood of persistence. For the purposes of this study, parents’ educational background was operationalized as two separate variables: mother’s education and father’s education.

Tinto (1987, 1993) argues that institutional and goal commitment are also significant factors in a student’s decision to persist in their academic career and this assertion has been supported by numerous studies (e.g., Cabrera et al., 1993; Pascarella and Terenzini, 1983; Terenzini et al., 1981). Undecided students may not have the same strength of goal commitment as students who are certain of their career path. Thus, certainty of college major, highest intended degree, and credit hours carried in the first semester are included in this analysis. In addition, a student’s intention to work in the first semester of college may influence their academic goal commitments and this variable has been shown to heavily influence student interactions with the social and academic environment of the institution (Tinto, 1987, 1993).

Tinto’s (1987, 1993) theory of student integration also highlights the importance of social integration for improving retention. Thus, the institutional database phase of this research project also included intended participation in co-curricular activities and programs. While contact with peers is important, research has shown that contact with faculty is also vital for improved retention (Pascarella and Terenzini, 1980; Terenzini and Pascarella, 1980). Thus, student contact with faculty after admission to the institution was included among the independent variables in this phase of the research.

Similarly, a student’s residency status (classified as either in-state or out-of-state) may reflect the degree to which the student feels integrated into the culture of the institution. Students who enroll from out-of-state are typically the only students from their high schools in attendance at the study institution and may not have the social support available to students from in-state high schools that send several students each year to the study institution. Being an out-of-state student also may result in decreased familial interactions during the semester and heighten feelings of isolation and homesickness. Table 2 summarizes the variables used in this study.

Table 2. Study Variables

Variable Limitations

While the Institutional Integration Scale has been shown to be a reasonable predictor of retention, the power of this scale is limited by the nature of the sample of respondents. Though an important first step, this project concentrates on one cohort at a single institution, and generalizing the findings will require additional study beyond the current project. The variables selected for use in institutional database candidate models were based upon their theoretical relationship to student retention as demonstrated in the literature. However, it is important to note that these variables are not based on survey items specifically designed to measure respondent perceptions, feelings, and motivations. In addition, some of these variables are obtained from existing survey activities of the study institution and may not be available at other institutions. However, most institutions do collect similar data on an ongoing basis and these institutions are encouraged to use this study as a reference in conducting their own institution-specific studies.

Data Analysis

Data analysis began by developing a set of candidate models based on Tinto’s (1987, 1993) theory of student integration. Thus, an a priori set of candidate models was generated from the subscale scores of the Institutional Integration Scale (Pascarella and Terenzini, 1980) as well as theoretically relevant independent variables drawn from the institutional databases. These candidate models are summarized in Table 3. The global model includes the broadest selection of independent variables including those from the Institutional Integration Scale, variables from the student data file, as well as from the institutional First Year Student Survey. Additional candidate models include one based exclusively on the Institutional Integration Scale, one where the independent variables are entirely drawn from the student data file, and one comprised of only variables from the institution’s First Year Student Survey. The fifth model combines institutional database variables from the student data file and the First Year Student Survey.

Table 3. A Priori Candidate Models

Following the guidance of Burnham and Anderson (2002), model selection techniques using an information-theoretic approach were employed to rank the candidate models relative to each other. The comparative evaluation of the candidate models was based on the Akaike Information Criterion (AIC) (Akaike, 1973), which uses the concept of Kullback–Leibler distance between models to determine the best approximate model (Burnham and Anderson, 2002). Typically, regression analysis based on maximum likelihood methods has relied on hypothesis testing to determine the best model for the given data; however, these approaches rely on arbitrary α levels (e.g., 0.05, 0.01, or 0.15) and model selection should not be based on such subjective measures (Akaike, 1974). Similarly, goodness-of-fit tests rely on subjective α levels to test whether the model is a good fit for the data, and the choice of this level can lead to overfitted models (Burnham and Anderson, 2002). The adjusted coefficient of multiple determination (adjusted R²) is commonly used in multiple linear regression for model selection (i.e., the best model is the one with the highest adjusted R²), and in multivariate logistic regression the Hosmer and Lemeshow (2000) goodness-of-fit statistic is widely used. Such measures are considered by Burnham and Anderson (2002) to be good descriptive statistics but poor measures for use in model selection. Thus, the Hosmer and Lemeshow goodness-of-fit statistic was used only to determine initially whether the global model was a good fit to the data. Once the global model was found to adequately fit the data, the Akaike Information Criterion (AIC) was used within an information-theoretic paradigm for model selection (Burnham and Anderson, 2002).

The AIC offers the researcher an estimate of the relative distance between the fitted model and the unknown “true” model that represents the phenomenon under study, which allows for the selection of the model that most closely approximates that phenomenon (Burnham and Anderson, 2002). Thus, the smaller the AIC, the better a model is considered to reflect the data, thereby allowing the researcher to empirically identify the best approximate model from the set of candidate models. In addition, model selection uncertainty was addressed through a consideration of Akaike weights, which correspond to the strength of evidence of one model versus another. Though commonly employed in epidemiological and biological studies, this information-theoretic approach has not been widely used in higher education research, thereby making this project innovative from not only a design perspective but also from an analysis perspective.
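For reference, the quantities described in this section can be written out using the standard definitions of the information-theoretic approach. The notation is an assumption introduced here for illustration: \(\hat{L}_i\) is the maximized likelihood of candidate model i, \(K_i\) is its number of estimated parameters, and R is the number of models in the candidate set.

```latex
\[
\mathrm{AIC}_i = -2 \ln \hat{L}_i + 2K_i,
\qquad
\Delta_i = \mathrm{AIC}_i - \min_{r} \mathrm{AIC}_r,
\qquad
w_i = \frac{\exp(-\Delta_i / 2)}{\sum_{r=1}^{R} \exp(-\Delta_r / 2)}
\]
```

The best approximating model has \(\Delta_i = 0\), and the Akaike weights \(w_i\) sum to one across the candidate set, so each weight can be read as the relative support for a model given the data and the set of models considered.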

In the information theory literature, it is common to use the bias-corrected AIC (AICc); however, due to the large number of observations in relation to the number of model parameters, calculation of the bias-corrected AIC was not necessary for this study. This decision was justified because as the ratio of observations to model parameters increases (i.e., n/K > 40, where n is the number of observations and K is the number of estimated parameters) the values of AIC and AICc converge per Burnham and Anderson (2002). A consideration of whether the best approximate model is of survey data or institutional database origin offers guidance to institutional researchers on the most effective approach to predictive model development in institution-specific retention research (i.e., traditional survey-based methods or the use of institutional datasets as sources of data). The SAS System, Version 9.1.3 was utilized for the data analysis in this study.
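The convergence argument can be checked with a short calculation. The small-sample correction added to AIC to obtain AICc is 2K(K + 1)/(n − K − 1). In the sketch below, n is taken as the number of survey respondents reported in the Results, and the parameter count K = 17 is a hypothetical value chosen only for illustration, since the number of estimated parameters in each candidate model is not stated in the text.

```python
def aicc_correction(n, k):
    """Small-sample term added to AIC to obtain AICc: 2K(K + 1) / (n - K - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > K + 1")
    return 2 * k * (k + 1) / (n - k - 1)


n_obs = 1513    # survey respondents reported in the Results section
k_params = 17   # hypothetical parameter count, assumed for illustration

ratio = n_obs / k_params                       # observations per parameter
correction = aicc_correction(n_obs, k_params)  # amount by which AICc exceeds AIC

print(f"n/K = {ratio:.1f}, AICc - AIC = {correction:.3f}")
```

Under these assumptions the ratio is well above 40 and the correction is a fraction of one AIC unit, which is small relative to the AIC differences between the candidate models reported in the Results.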

Results

Of the 3,958 new freshmen invited to participate in this survey, 1,513 responded to the survey for a 38.2% response rate. Of the survey respondents, 1,383 (91.4%) were retained for their second year (i.e., 130 respondents withdrew from the study institution) compared to the rate for the whole 2004 freshman cohort (89.0%). A chi-square analysis determined that there was a significant difference between the rate of withdrawal in the sample and the rate found in the study population (χ² = 6.75538, df = 1, p < 0.01). Although the dropout rate among the survey participants was found to be significantly different from the withdrawal rate in the overall 2004 new freshman population at this institution, this finding is not surprising. It is certainly reasonable that students in the process of making the decision to withdraw would not be inclined to participate in a survey regarding their thoughts and feelings about their educational experience. In addition, because the survey was administered in October of the fall semester, some students in the overall population could have already dropped out and thus, were not available to take the survey when it was fielded, thereby contributing to a significantly higher dropout rate in the general cohort, compared with the dropout rate among the survey participants. Table 4 presents descriptive statistics for the sample.

Table 4. Descriptive Statistics

Multivariate logistic regression was used to evaluate the predictive ability of the five candidate models. Following the guidance of Burnham and Anderson (2002), the global model was first evaluated for goodness-of-fit using Hosmer and Lemeshow’s (2000) goodness-of-fit statistic (HL). The global model was found to be a good fit for the data (χ² = 7.0021, df = 8, p = 0.5364), thereby establishing that the best model selected will also fit the data.
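The probability attached to the Hosmer and Lemeshow statistic can be checked from the reported chi-square value and degrees of freedom. The sketch below is not the procedure used in the study (which relied on SAS); it uses the closed-form upper-tail probability of the chi-square distribution that holds when the degrees of freedom are even, and the function name is hypothetical.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the tail has the closed form
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k   # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total


# Statistic and degrees of freedom as reported for the global model.
p_value = chi2_upper_tail_even_df(7.0021, 8)
print(f"p = {p_value:.4f}")
```

The reported statistic and degrees of freedom give a probability of approximately 0.536. Because a large p-value indicates no evidence of lack of fit, this supports treating the global model as an adequate fit.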

Table 5 presents a summary of the logistic predictive models and the associated model selection parameters. These results indicate that the global model of both survey and institutional database origin was the best model of the set; however, as this research is designed to compare the differences in predictive ability of these two sources of data, one must therefore look to the next best predictive model in the set (IDB) whose independent variables are drawn from institutional databases. This model contains variables from the student data file and from the First Year Student Survey, which is administered to all incoming freshmen during their summer orientation. The differential AIC score between the lowest AIC (i.e., the AIC of the global model) and the AIC value for this IDB model was 11.857, followed closely by the model containing only the first year student variables (FYS model) (AICΔ = 12.031). The model whose independent variables are drawn from only the Institutional Integration Scale ranked fourth in the set (AICΔ = 212.376), followed by the model drawn exclusively from the student data file (SDF) (AICΔ = 223.814).
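The AIC differences reported above can be converted into Akaike weights, w = exp(−Δ/2) normalized over the candidate set. The sketch below uses only the differences stated in the text. The labels "Global" and "IIS" are shorthand assumed here, and the resulting values may differ in rounding from any weights shown in Table 5.

```python
import math

# AIC differences relative to the best (global) model, as reported in the text.
DELTA_AIC = {
    "Global": 0.0,
    "IDB": 11.857,
    "FYS": 12.031,
    "IIS": 212.376,
    "SDF": 223.814,
}


def akaike_weights(deltas):
    """Normalize exp(-delta / 2) across the candidate set so the weights sum to 1."""
    relative_likelihood = {m: math.exp(-0.5 * d) for m, d in deltas.items()}
    total = sum(relative_likelihood.values())
    return {m: v / total for m, v in relative_likelihood.items()}


weights = akaike_weights(DELTA_AIC)

# Evidence ratio between the two closely ranked institutional models.
idb_vs_fys = weights["IDB"] / weights["FYS"]

for model, w in weights.items():
    print(f"{model:>6}: delta = {DELTA_AIC[model]:8.3f}  weight = {w:.3e}")
```

Within this candidate set the global model carries nearly all of the weight (above 0.99), while the IDB and FYS models are almost indistinguishable from each other, which is consistent with the small gap between their AIC differences.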

Table 5. Summary of Logistic Models and Model Selection Parameters

Discussion

Research on retention typically relies on surveys of student perceptions in relation to the factors believed to theoretically influence persistence decisions. However, this resource-intensive methodology is not always feasible for retention research at individual institutions. Yet, for institutional, state, and federal reporting purposes, all institutions do maintain student data files that address many of the factors identified as important determinants of persistence in retention theory. Porter (1999, 2002) and Antley (1999) have used this type of data to develop models designed to predict student attrition; however, the comparability of this approach and that of the traditional survey-based analyses has not been previously ascertained.

An information-theoretic approach to selecting the best (most parsimonious) model for predicting undergraduate retention revealed that the independent variables drawn from institutional databases out-perform variables drawn from the institutional integration survey scale developed by Pascarella and Terenzini (1980). While the model using only the database variables (i.e., high school GPA, SAT scores, in-state residency, and total hours carried in the first semester) collected by virtually every postsecondary institution proved to be inferior to Pascarella and Terenzini’s (1980) Institutional Integration Scale, the addition of other commonly available variables such as parents’ educational background, certainty of major, and intention to work in the first semester (among others) provides a powerful complement to the basic database variables and ultimately yields a better prediction in this study than the IIS.

This empirical support for the use of institutional database variables is valuable for institutional researchers and higher education professionals looking to conduct institution-specific retention research to inform intervention programs for students at risk for withdrawal. While these findings offer important validation for institutional researchers looking to utilize the considerable corpus of data which they routinely collect in investigating retention issues, this study does not directly address the usefulness of the Institutional Integration Scale for retention-related research problems or other associated topics. The IIS has been independently validated and widely employed to address a number of important topics in retention (e.g., Mallette and Cabrera, 1991; Pascarella and Terenzini, 1980; Pascarella and Terenzini, 1983; Terenzini et al., 1981). The findings of this study do not diminish the results of these or other works which have utilized the Institutional Integration Scale; rather, this study offers researchers evidence in support of an additional tool for use in understanding this complex phenomenon. This expanded toolkit for retention research offers the possibility for more research in diverse settings which, given resource constraints, would not have otherwise been possible. Armed with institutionally specific retention research, institutional researchers and higher education professionals can then begin to develop empirically based programs and policies designed to effect positive change in undergraduate student retention. This study lays the groundwork for this effort.

Endnote

1. The authors would like to gratefully acknowledge the support received from the AIR/NPEC Focused Grant Program (AIR Grant 04-482).