Introduction

Due to the hierarchical nature of science, general chemistry is a gateway course that many students must pass before they can enroll in more advanced science courses. This is especially problematic at the authors’ home institution where approximately thirteen degree programs/tracks depend upon successful completion of general chemistry. There are many treatments designed to improve student success and retain science majors, such as concurrent support courses, online tutorials, “just-in-time” instructional aides, study skills training, motivational interviewing, and/or peer mentoring. In order for these treatments to be successful, however, they must be provided to students at risk for failure as early in the semester as possible.

Many institutions of higher education experience failure rates of 25% or more in general chemistry (Campbell 2011). Identifying students at risk for failure in general chemistry at the beginning of the semester is a complex task. No single measure will predict success in general chemistry all of the time. Many studies, detailed later in the Related Studies section, have identified algebra skills as a significant variable for predicting student success in general chemistry. Anecdotally, the chemistry faculty at the authors’ home institution have found that students who fail chemistry typically lack the mathematical skills necessary to pass the course. The goal of this research is to create a predictive system which accurately identifies students at risk for failure in general chemistry due to mathematical deficiencies.

This study examines a diagnostic algebra test (the “DAT”) created by the authors for the purpose of identifying students at the beginning of the semester who are at risk for failure in general chemistry. Nine statistically significant variables for predicting student success or failure in chemistry were identified: whether a student's declared race/ethnicity is white, cumulative GPA, math ACT score, five individual questions from the DAT, and the total score on the DAT. Using neural networks, a function was constructed that gives the probability a student will pass general chemistry with a grade of C or better based on these nine predictor variables. The accuracy of the results was maximized by using an adaptive learning algorithm, called backpropagation, to find optimal weights for each edge of the neural network.

Using all nine predictors for success in general chemistry, the neural network correctly predicted 76% of the students whose course grade was a C- or worse as being at risk for failure at the beginning of the semester. When the total score on the DAT is eliminated as a predictor, the accuracy for identifying at-risk students drops to 60%, a drop as large as that from eliminating cumulative GPA and larger than that from eliminating the math ACT score. This suggests that the DAT is not only a valid but also a powerful predictor for identifying at-risk students. Cumulative GPA is historically one of the best single predictors of college success (Cooper 2007, 2009), and the DAT performs as well as cumulative GPA. One advantage of the DAT over cumulative GPA is that scores on the DAT are easy to obtain, while cumulative GPA may be difficult to obtain, especially for new or transfer students.

Background

The college at which this research was conducted is a comprehensive, 4-year public liberal arts college with approximately 3,700 students. The student population is multiculturally diverse, with approximately 20% Native American students, and is split nearly equally by gender. In a typical semester, approximately 190 students enroll in three sections of general chemistry, and approximately 25% of them do not pass.

At the authors’ home institution, the chemistry faculty is interested in identifying students at risk for failure due to mathematical deficiencies. General chemistry requires that students have taken or are concurrently enrolled in College Algebra or Algebra for Calculus. Due to time restrictions and curricular requirements, most chemistry professors do not have time to review presumed algebra skills during class. Instead, chemistry faculty would like to enroll these at-risk students in a concurrent one-credit chemistry problem solving course designed to help students pass general chemistry, and/or give them “think aheads” that will help them with the mathematical skills they need “just in time” before they use them in their chemistry class.

Related Studies

Over the past 50 years there has been increasing interest in diagnostic tests that predict student success and failure in general chemistry. Such tests include the Toledo Chemistry Placement Exam (Hovey and Krohn 1963; Niedzielski and Walmsley 1982), the Group Assessment of Logical Thinking (GALT) and the Test of Logical Thinking (TOLT) exams (Bunce and Hutchinson 1993; Burke et al. 1999; Bird 2010; Lewis et al. 2010), the California Chemistry Diagnostic Test (Russell 1994; Karpp 1995), the Fullerton Test (McFate and Olmsted 1999), the CU-Denver Placement Exam (Kimbrough and Weaver 1999), the Student Pre-Semester Assessment (SPSA) (Wagner et al. 2002), the University of Iowa Placement Exam (Pienta 2003), the University of Minnesota Mathematics Assessment for Second-Semester Chemistry (UnMASSC) test (Leopold and Edgar 2008), and the math SAT (Pickering 1975; Ozsogomonyan and Loftus 1979; Spencer 1996). The GALT and TOLT exams measure formal reasoning abilities, while the other exams measure ability in chemistry and/or mathematics. The studies that examine mathematical ability all offer evidence to support the general conclusion that success in general chemistry correlates positively with mathematical ability. In fact, one study found that mathematics background was the most strongly predictive of outcome in the course, followed by chemistry background (Wagner et al. 2002).

Predicting whether students will pass or fail general chemistry has been studied thoroughly using the Student Pre-Semester Assessment (SPSA), the math SAT, and the Toledo Exam, sometimes in combination with data about math background, chemistry background, and age (Wagner et al. 2002). Both the SPSA and Toledo Exams measure ability in mathematics in conjunction with chemistry. In contrast, the diagnostic algebra test (DAT) developed by the authors measures only algebra ability. Some algebra questions on the DAT are presented in the context of chemistry, but they do not require knowledge of chemistry. This study constructs a predictive system based on a neural network analysis of the results of the DAT, math ACT scores, cumulative GPA, and student demographics. A comparison of results for the neural network with other predictive systems from Wagner et al. (2002) is given in Table 1. Most significantly, the neural network correctly identified 75.6% of students who went on to fail general chemistry, far better than the next best result of 40.8%.

Table 1 Comparison of predictive systems

Methods

Neural networks have been found to be an effective means of predicting student success in a variety of settings. They have been shown to give accurate predictions within the fields of mathematics (Cooper 2007), actuarial science (Schumacher et al. 2010), first-year education (Cooper 2008), business administration (Naik and Ragothaman 2004), and graduate studies (Hardgrave and Wilson 1994). Given their successful application across such a wide variety of fields, neural networks appear to be a viable method for predicting success in general chemistry as well.

The predictive system developed in this research follows a genetically optimized version of the methodology the first author developed in his doctoral studies for creating a decision support system that predicts student success in a developmental mathematics course (Cooper 2007). This research demonstrates the applicability of that methodology to fields of study outside mathematics (i.e., general chemistry). The creation of the neural network-based predictive system included four stages: instrument creation and data collection, input analysis, neural network creation, and evaluation.

Instrument Creation and Data Collection

The diagnostic algebra test (DAT) utilized in this research was designed to assess basic math skills and to identify students at risk for failing general chemistry. The DAT consists of twenty multiple choice questions, each of which has four choices and one correct answer. Each question is worth one point, and there is no penalty for incorrect answers. The seven topics and twenty questions on the DAT are given in Table 2.

Table 2 The diagnostic algebra test

The DAT is a 20-min paper-and-pencil test given in class during the first week. Students are not allowed to use calculators or notes. The test was announced, but the material on the test was not discussed before it was administered. The tests are scored by machine so that professors can promptly advise students of their outcomes and at-risk classification.

The questions on the DAT were chosen by two mathematicians and one chemist. Some questions on the DAT are similar to questions on the Moorpark College Chemistry Placement Exam (Moorpark 2010). Three criteria were important for selecting questions. First, only those math skills used most often in general chemistry were tested to ensure the test has high validity. Second, a variety of basic algebra skills were tested to make the instrument general enough to identify students at risk for failure, and also to increase the spread of scores, which should also increase the reliability of the results. Third, each question should require only a small number of algebra skills so that the results for individual questions are specific enough to minimize the false identification of students who will pass the course despite poor performance on the DAT. Individual questions were designed so that the arithmetic would not require a calculator and so that the choices included distracters indicative of common mathematical errors. The multiple choice DAT was designed to have no easily discernible patterns in the answers, and to have no bias toward any particular letter for correct answers. Resultant reliability and validity are discussed later in the paper in the Results section.

An item analysis of the questions on the DAT is given in Fig. 1. The discrimination index is a number between −1 and 1 that measures success on an item for students who passed the DAT relative to those who did not pass. A discrimination index above 0.20 is considered adequate, while 0.30–0.50 is considered good (Wagner et al. 2002). The discrimination index for a question is defined to be \( \frac{c_p}{p} - \frac{c_f}{f} \), where \( c_p \) is the number of students who answered the question correctly and passed the test, \( c_f \) is the number of students who answered the question correctly and failed the test, and \( p \) and \( f \) are the number of students who passed and failed the test, respectively.

Fig. 1 Item analysis of DAT questions

The mean score on the DAT was 70.9% with standard deviation 13.7%. When the minimal passing score for the DAT is set at 60%, the mean discrimination index for all twenty questions is 0.418, which suggests that overall the DAT offers good discrimination. Individual discrimination index results are shown in Fig. 1.
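As a concrete illustration of this definition, the short Python sketch below computes the discrimination index for a single item from per-student results. The data structure and the sample values at the end are hypothetical, not the authors' data.

```python
def discrimination_index(correct, passed):
    """correct[i] = 1 if student i answered the item correctly,
    passed[i]  = 1 if student i passed the DAT (scored at least 60%)."""
    p = sum(passed)                    # students who passed the test
    f = len(passed) - p                # students who failed the test
    c_p = sum(c for c, s in zip(correct, passed) if s)        # correct answers among passers
    c_f = sum(c for c, s in zip(correct, passed) if not s)    # correct answers among failers
    return c_p / p - c_f / f           # D = c_p/p - c_f/f

# Hypothetical responses for one item from six students
correct = [1, 1, 1, 0, 0, 1]
passed  = [1, 1, 1, 1, 0, 0]
print(discrimination_index(correct, passed))   # 3/4 - 1/2 = 0.25
```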

Input Analysis

Predicting which students are at risk for failure is a complex task that should use measures of current knowledge, background knowledge, and demographics that correlate positively or negatively with academic success. Demographic and academic data from the institution’s student information database, specific questions on the DAT, and the total score on the DAT were considered as possible predictors. The non-parametric Spearman rho statistic, which captures only monotonic (rank-based) association, was chosen as a first measure of association between potential ordinal and categorical predictors and course success. Of the 32 possible inputs considered, nine were found to be statistically significant: four at the p ≤ 0.01 level (two-tailed) and five at the p ≤ 0.05 level (two-tailed). Collectively, these nine variables provided the baseline neural network input space. A summary of the inputs and their corresponding Spearman rho correlation coefficients is shown in Table 3.

Table 3 Statistically significant inputs
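The screening described above can be sketched in a few lines of Python. The snippet below is an illustrative sketch, assuming the candidate inputs and the pass/fail outcome are held in a pandas DataFrame; the column name `passed_chem` and the significance threshold argument are hypothetical, not the authors' variable names or code.

```python
import pandas as pd
from scipy.stats import spearmanr

def screen_inputs(df: pd.DataFrame, outcome: str = "passed_chem", alpha: float = 0.05):
    """Return candidate predictors whose Spearman rho with the outcome
    is statistically significant at the given (two-tailed) level."""
    significant = {}
    for col in df.columns:
        if col == outcome:
            continue
        rho, p = spearmanr(df[col], df[outcome])   # rank-based correlation and p value
        if p <= alpha:
            significant[col] = (rho, p)
    return significant
```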

Neural Network Creation and Evaluation

Using Neurosolutions, a comprehensive neural network software package, the authors created a multilayered, feedforward, backpropagation neural network, also referred to as a multilayered perceptron. Neurosolutions utilizes a breadboard approach to neural network creation, which stems from the company’s software engineers’ background in electrical engineering. Most of the network’s components can be added in a drag-and-drop fashion through a fairly easy-to-use interface. With a Microsoft Excel add-in for data entry and manipulation, researchers interested in utilizing and testing neural networks as a data mining technique can minimize the computational complexities associated with neural network creation and create an accurate predictive system relatively quickly. A screenshot of the Neurosolutions development platform is shown in Fig. 2.

Fig. 2 Neurosolutions user interface

Multilayered perceptrons are the most commonly used neural network architecture in data mining (Turban et al. 2006). The popularity of this neural network approach is due in large part to the mathematical proof that multilayer backpropagation networks are universal approximators (Hornik et al. 1989). A universal approximator is a numerical technique capable of finding a mapping between inputs and outputs if such a mapping exists.
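The network in this study was built in Neurosolutions; as a rough stand-in for readers without that software, a comparable multilayered perceptron can be sketched with scikit-learn. The two hidden layer sizes below follow the genetically optimized network reported later (eight and nine processing elements), while the remaining settings are illustrative assumptions rather than the authors' configuration.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: n-by-9 array of the significant inputs; y: 1 = passed with C or better, 0 = did not
model = make_pipeline(
    StandardScaler(),                              # scale inputs before training
    MLPClassifier(hidden_layer_sizes=(8, 9),       # two hidden layers of processing elements
                  activation="logistic",           # sigmoid threshold-style activation
                  solver="sgd", momentum=0.9,      # steepest descent with momentum
                  learning_rate_init=0.01,
                  max_iter=2000, random_state=0),
)
# model.fit(X_train, y_train)
# model.predict_proba(X_new)[:, 1]                 # probability of passing the course
```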

Neurosolutions also offers two algorithmic features that optimize the parameters necessary to create a neural network and maximize the use of all the data collected for both training and testing. The first is genetic algorithm optimization, which allows researchers to determine the best values of the numerous parameters/weights involved in creating a network. A processing element is a simple mathematical model of a biological neuron: each processing element forms a weighted sum of its inputs and becomes activated (or not) via a threshold function. A set of processing elements forms a layer within a neural network. The training process takes predetermined outcome data and algorithmically determines the weights via the backpropagation algorithm. Neural network creation began with an overabundance of processing elements in each layer (i.e., 15 in the first hidden layer and 12 in the second hidden layer), which allowed the genetic optimization to reduce the number of processing elements. The genetic optimization setup used to create the best network in this research is shown in Fig. 3.

Fig. 3 Genetic algorithm optimization
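The sketch below illustrates, in simplified form, how a genetic search over the number of processing elements per hidden layer might proceed. The population size, mutation scheme, and cross-validated MSE fitness are illustrative assumptions and do not reflect the internals of Neurosolutions.

```python
import random
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

def fitness(h1, h2, X, y):
    """Cross-validated mean squared error (lower is fitter) for given layer sizes."""
    net = MLPRegressor(hidden_layer_sizes=(h1, h2), max_iter=1000, random_state=0)
    return -cross_val_score(net, X, y, cv=5, scoring="neg_mean_squared_error").mean()

def evolve(X, y, generations=20, pop_size=10):
    # Start with an overabundance of processing elements, as described above
    pop = [(random.randint(2, 15), random.randint(2, 12)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda sizes: fitness(*sizes, X, y))   # rank candidates by MSE
        survivors = pop[: pop_size // 2]                    # selection
        children = [(max(1, h1 + random.randint(-2, 2)),    # mutation of layer sizes
                     max(1, h2 + random.randint(-2, 2)))
                    for h1, h2 in survivors]
        pop = survivors + children
    return min(pop, key=lambda sizes: fitness(*sizes, X, y))
```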

The mean squared error (MSE, the average fitness) of the neural network across generations is given in Fig. 4. This figure shows that the genetic algorithm helped optimize the network parameters, dramatically decreased the MSE, and improved overall predictive accuracy. A steepest descent algorithm with momentum was used to avoid local minima and saddle points that might prevent the minimization of error.

Fig. 4 Mean squared error (average fitness) versus generation
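For reference, steepest descent with momentum updates each weight as \( \Delta w_{ij}(t) = -\eta \frac{\partial E}{\partial w_{ij}} + \alpha\, \Delta w_{ij}(t-1) \) and \( w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t) \), where \( E \) is the mean squared error, \( \eta \) is the learning rate, and \( \alpha \) is the momentum coefficient; these symbols are generic, and the specific values used within Neurosolutions are not reported here.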

In addition to genetic optimization, the second algorithmic feature Neurosolutions offers is the Leave-N-Out algorithm. Leave-N-Out allows researchers to utilize all of their data for both training and testing via a sliding cross-validation window. This feature is especially useful when data is scarce, as in this research. For the predictive network created here, N = 1 was used, so that all of the data except a single observation can be used for training from the very beginning of the training cycle. Training determines the weights within each of the processing elements, and these weights give the network its predictive power. Using the neural network parameters found via the genetic algorithm optimization process (e.g., eight processing elements in the first hidden layer and nine in the second hidden layer), Leave-N-Out with N = 1 was applied. Figure 5 depicts the mean squared error versus the epoch number.

Fig. 5 Mean squared error versus epoch

From the learning curve depicted in Fig. 5, learning appears to have levelled off towards the end of the learning process. This suggests the network has fully learned from the data it was given, and has not over-modelled the specific data presented to it during training. This keeps the network generalizable and able to provide accurate predictions for out-of-sample data (i.e., data not used during training).
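A minimal sketch of the Leave-N-Out idea with N = 1 (i.e., leave-one-out cross validation) is shown below, using scikit-learn in place of Neurosolutions; `model` could be any classifier, such as the multilayered perceptron sketched earlier, and the function itself is an illustrative assumption rather than the authors' code.

```python
from sklearn.model_selection import LeaveOneOut

def leave_one_out_accuracy(model, X, y):
    """X: array of the nine inputs per student; y: 1 = passed, 0 = did not pass."""
    hits = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])       # train on every student but one
        hits += int(model.predict(X[test_idx])[0] == y[test_idx][0])
    return hits / len(y)                            # out-of-sample accuracy over all students
```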

An input space of nine variables is not unreasonably large for creating a generalizable neural network that can perform in a production setting. In practice, however, the number of inputs should be minimized in order to obtain the most generalizable results (i.e., accurate predictions when presented with new data). Hence, backwards elimination was chosen as a means to possibly prune the number of inputs to the neural network. Backwards elimination entails removing the least statistically significant variable, recreating the neural network, and determining the new accuracy. If the accuracy improves, the input is eliminated as a predictor; otherwise, the input is kept.
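A minimal sketch of this elimination loop is shown below; the `evaluate` callback stands for recreating and retraining the network on a subset of inputs and returning its predictive accuracy (for example, the leave-one-out accuracy sketched earlier), and is an assumption rather than the authors' code.

```python
def backwards_elimination(inputs, evaluate):
    """inputs: predictor names ordered from least to most statistically significant.
    evaluate(names) -> predictive accuracy of a network recreated with those inputs."""
    kept = list(inputs)
    accuracy = evaluate(kept)
    for name in list(kept):                      # try dropping, least significant first
        trial = [n for n in kept if n != name]
        trial_accuracy = evaluate(trial)
        if trial_accuracy > accuracy:            # accuracy improved: eliminate the input
            kept, accuracy = trial, trial_accuracy
    return kept, accuracy
```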

Results

Before applying backwards elimination, a baseline accuracy was established using all nine of the statistically significant variables from Table 3. Using all nine input variables, the system correctly predicted \( 76\% = \frac{34}{45} \) of the at-risk students who did not pass the course, \( 85\% = \frac{121}{142} \) of the not at-risk students who passed the course, and correctly categorized approximately \( 83\% = \frac{34 + 121}{187} \) of all chemistry students. These findings are summarized in Table 4, also referred to as a confusion matrix. In reference to sensitivity, the predictive system identified 45 students as being at risk, while 11 of these students were actually not at risk; in other words, a Type I error occurred \( 24\% = \frac{11}{45} \) of the time. In reference to specificity, 142 of the participants were categorized by the system as not at risk, while 21 were actually at risk; in other words, a Type II error occurred \( 15\% = \frac{21}{142} \) of the time (Table 4).

Table 4 Confusion matrix showing accuracy and error for the DAT
Table 5 Changes in predictive accuracy of the DAT when one predictor is eliminated during backwards elimination
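For readers who want to verify the rates quoted above, the short calculation below reproduces them from the confusion matrix counts behind Table 4.

```python
flagged_at_risk, flagged_at_risk_correct = 45, 34              # flagged at risk; of these, did not pass
flagged_not_at_risk, flagged_not_at_risk_correct = 142, 121    # flagged not at risk; of these, passed
total = flagged_at_risk + flagged_not_at_risk                  # 187 students in all

at_risk_accuracy = flagged_at_risk_correct / flagged_at_risk                        # 34/45   ~ 76%
not_at_risk_accuracy = flagged_not_at_risk_correct / flagged_not_at_risk            # 121/142 ~ 85%
overall_accuracy = (flagged_at_risk_correct + flagged_not_at_risk_correct) / total  # 155/187 ~ 83%
type_i_error = (flagged_at_risk - flagged_at_risk_correct) / flagged_at_risk        # 11/45   ~ 24%
type_ii_error = (flagged_not_at_risk - flagged_not_at_risk_correct) / flagged_not_at_risk  # 21/142 ~ 15%
print(round(at_risk_accuracy, 3), round(overall_accuracy, 3))                       # 0.756 0.829
```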

Discussion

The ability to correctly categorize approximately 83% of general chemistry students at the very beginning of the semester would be a great asset to chemistry professors. It is the authors’ opinion that all nine predictors should be kept as inputs to the neural network. The overall predictive accuracy remained near 80% regardless of which input was eliminated. However, the accuracy for identifying at-risk students dipped below 70% whenever an input was eliminated, thereby substantially increasing the chance of a Type I error, the error of greatest importance given this research's goal of identifying at-risk students and thereby increasing student success (Table 5). In fact, eliminating either the total score on the DAT or cumulative GPA reduced the predictive accuracy for at-risk students from 76 to 60%. This suggests that the total score on the DAT is as good a predictor as cumulative GPA, which is significant because cumulative GPA is historically one of the best predictors (Cooper 2007, 2009). An effective treatment for mitigating an at-risk classification could possibly be created via an item analysis of the nine predictive variables, as in Table 6.

Table 6 Possible at-risk mitigation approaches for a given predictor

The validity and reliability of the instrument and overall predictive system need to be addressed. First, in regards to validity, because the Leave-N-Out algorithm treats each classified case as out-of-sample data, the predictive accuracies found here are believed to be similar to those that would be obtained when classifying actual general chemistry students. This gives the research a high level of criterion-related validity. Second, in regards to reliability, using all nine statistically significant inputs, a Cronbach’s Alpha of 0.70 was computed using SPSS, which is an acceptable level of reliability. All nine inputs collectively appear to measure a similar characteristic: the at-risk nature of a student in general chemistry.
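The authors computed Cronbach's Alpha in SPSS; for illustration, the standard formula can be sketched in a few lines of Python, assuming the nine input variables for all students are held in a NumPy array with one column per input.

```python
import numpy as np

def cronbach_alpha(items):
    """items: n-students by k-inputs array of the predictor variables."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                                  # number of inputs (nine here)
    item_variances = items.var(axis=0, ddof=1).sum()    # sum of per-input variances
    total_variance = items.sum(axis=1).var(ddof=1)      # variance of the summed score
    return (k / (k - 1)) * (1 - item_variances / total_variance)
```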

Conclusions

The predictive system created in this research correctly identified 76% of at-risk students who did not pass general chemistry, a result approximately 35 percentage points better than other reported results (see Table 1). Given the behavioral nature of the data and context, the diagnostic algebra test created within this research, coupled with a neural-network-based predictive system, offers faculty and others interested in increasing student success rates in chemistry a viable means to identify at-risk students. The Type I and Type II errors, reliability, and validity all seem to be within acceptable ranges. The metric of greatest concern might be the 24% likelihood of a Type I error, in which students are identified as being at risk when they are actually not at risk. However, these misidentified students will not be harmed by participating in any academic success treatments deemed appropriate by their chemistry faculty members. In fact, these students are likely to be on the cusp of success, and their participation will only help them advance academically. Students who are identified as at risk might perceive themselves as negatively stereotyped and underperform as a result. Treatments for this could include exercises in self-affirmation of values and attitude-changing strategies that frame adversity as common and transient (Cohen et al. 2009; Miyake et al. 2010; Walton and Cohen 2011).

With reasonable Type I and Type II error rates, the only other concerns regarding use of this predictive system might be validity and reliability. In regards to validity, the relatively high accuracies seen via the Leave-N-Out algorithm give this system a high level of criterion-related validity: data held out by Leave-N-Out is considered out-of-sample, making the accuracies indicative of what would be found in a production situation. It is the authors’ opinion that, by using Leave-N-Out, a predictive system has been created that accurately identifies at-risk students. Thus, given the acceptability of the Type I error, Type II error, reliability, and validity, the predictive system presented in this research offers a viable and potentially worthwhile means to improve student success in chemistry.

The success of the predictive system developed in this study for the purpose of correctly identifying students at risk for failure in general chemistry is a result of several factors. First, the predictive system uses a genetically optimized neural network to determine the input weights that give the highest predictive accuracy, instead of leaving it to humans to make poorly informed guesses at appropriate weights and threshold values. Second, the DAT was created jointly by professors in math and chemistry, who drew on their extensive experience to choose questions that not only test skills used often in chemistry, but also target common mistakes they have seen made in the past by students who tended to fail. Third, the questions on the DAT that were most predictive were also among the most basic algebra questions, which might help explain why the neural network predictive system identified at-risk students who would eventually fail better than other predictive systems that do not include such basic algebra questions.

The authors believe this research lays the foundation for a potential pilot in a production setting. The predictive system developed here could still be improved by replacing some questions on the DAT with ones that are statistically significant, have high discrimination indices, and have high predictive accuracy as measured by a neural network analysis. With these improvements, it is possible that the number of inputs to the neural network could be reduced. Reducing the number of inputs would simplify data collection, reduce reliance on data that might not be easily obtained, and, most importantly, should increase the predictive power of the system. The logical outcome of this research is to create and implement treatments for students at risk for failing general chemistry.