Introduction

Private supplementary tutoring (PT) is widely referred to as shadow education in the educational research community because it mimics the regular school curriculum (Baker et al., 2001; Bray & Kwok, 2003). Recently, the PT industry has experienced rapid development across the world with the expansion of various forms of its service. The PISA 2012 data indicated that in Shanghai, over 70 percent of the students received after-school mathematics tutoring (Hu et al., 2015). According to the Quality Monitoring Report on China’s Compulsory Education released by the National Center of Education Quality (Zhang et al., 2021), 23.4 percent of eighth-graders participated in after-school mathematics classes. In the last two decades, PT has emerged in regions other than East Asia and has been increasingly discussed by researchers worldwide, including in Europe and the USA (Bray, 2014; Davies, 2004; Mischo & Haag, 2002; Zhang et al., 2021). This widespread social phenomenon impacts the daily lives of people involved in education (Liu & Bray, 2018) and opens new research areas (Bray, 2014; Davies, 2004) outside the focus on regular school or family education (He et al., 2021). PT is the basic strategy adopted by parents, students, teachers, principals, and school systems when students fail to meet the desired academic standards (Bray, 2014; Wang & Guo, 2017; Zeng & Zhou, 2012). Some Canadian coaching centers (also referred to as “cram” centers) claim, without supporting empirical evidence, that students can gain academic advantages by receiving long-term training on sufficiently complex and solid cognitive skills (Davies, 2004). The claim is still attractive in China today, even though PT institutions lack systematic evidence on the effectiveness of their services to support their aggressive advertising campaigns. The effectiveness of PT in mathematics in improving a student’s learning career is a core question examined in relevant studies (He et al., 2021; Zhang et al., 2021). 
After investing considerable financial resources and time in PT in mathematics, families are entitled to know whether their investments bring any benefits to their children’s academic achievement in mathematics (Bray, 2014).

Previous international academic achievement assessments in mathematics have shown a clear advantage for Chinese and other East Asian students over their Western peers (Leung, 2001; Mullis et al., 2004; Wang & Lin, 2005). This advantage is potentially attributable to Chinese students’ efforts in both in-school learning and PT outside of school (Wang & Guo, 2017; Bray et al., 2020). Therefore, the effect of PT on students’ mathematics achievement should be an important topic in studies of mathematics education in China (Wang & Guo, 2017; Zhang et al., 2021). Additionally, an evaluation of PT effects on students’ learning performance across China offers a basis for national educational governance and support related to PT. Such an evaluation could assist national educational authorities in developing and improving standards and systems for after-school education and training, and in taking relevant action to monitor the industry.

This paper presents an empirical study of PT in mathematics, a subject chosen for its importance relative to other subjects. Leveraging data from a large-scale survey and testing project in China, it profiled PT for middle school students and examined whether the tutoring improved students’ mathematics learning performance.

Literature Review

The Roles and Functions of PT in Mathematics Learning

Studies on various education-related factors affecting students’ (mathematics) academic performance focus on quantitative educational indicators of formal schooling, including those related to teachers (Konstantopoulos & Sun, 2012; Wang & Cao, 2014) and schools (for example, Coleman et al., 1966). Both formal and outside-school mathematics instruction should be included when discussing the potential factors in promoting students’ mathematics learning or in analyzing the variance in students’ mathematics achievement (Wang & Guo, 2017; Wang et al., 2019; Zhang et al., 2021).

PT serves two purposes: enhancing the results of formal schooling and expanding its curriculum (Zeng & Zhou, 2012). The first purpose requires students to learn ahead of schedule or review what they have learned in school. Students with high and low mathematical abilities are usually the targets of such PT. For the former, the tutoring helps them build on their advantages or at least maintain their performance levels, while for the latter, it offers the opportunity to catch up with their peers (Bray & Kobakhidze, 2014; Zhang, 2011). The second purpose often relates to subject competitions, for example, improving results in mathematical Olympiads in China (Wang & Guo, 2017), as well as independent enrollment courses for domestic universities and preparatory courses for overseas ones (such as advanced placement courses), which have become increasingly important in recent years.

Effects of PT on Students’ Mathematics Learning Performance

In the global PT research agenda, effectiveness has been the fundamental research issue. Effectiveness studies provide basic empirical results that help explain Asian students’ achievements in international assessment projects and give parents insights for informed educational decisions. To account for PT’s effectiveness theoretically, previous studies have typically discussed it under Carroll’s (1963) model of school learning (Guill et al., 2019; He et al., 2021; Zhang et al., 2021). This model posits that learning is a function of the ratio of the time spent on learning to the time required to learn, with the latter depending on cognitive factors and instruction quality. Specifically, PT potentially adds external time to students’ learning and helps them access further learning opportunities (Bäumer et al., 2011), but cognitive factors and instruction quality determine the effectiveness of the external time as well as the time required. Zhang et al. (2021) applied Dunkin and Biddle’s (1974) paradigm to determine the aspects of the instructional quality of PT that impact students’ mathematics achievement, as illustrated in Table 1. Several presage or process variables discussed in previous studies, as well as newly developed variables, were included in the framework to explain the effect of PT on the product variable (mathematics achievement in school). The effectiveness was also explained by the time spent on PT, according to Carroll’s (1963) model.
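Carroll’s (1963) model, as summarized above, can be written compactly as follows. The notation is illustrative only and is not taken from the cited works; the function \(f\) is left unspecified:

\(\text{degree of learning}=f\left(\dfrac{\text{time spent on learning}}{\text{time required to learn}}\right)\)

Read this way, PT can raise the numerator by adding learning time, while cognitive factors and instruction quality act on the denominator and on how effectively the added time is used.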

Table 1 Theoretical framework

Therefore, according to the comprehensive theoretical framework, the effectiveness of PT must be evaluated based on carefully designed data analysis processes and comprehensive representative samples in the future.

The overall effectiveness of PT should be tested and the instructional quality analyzed to explain the corresponding results. Rigorous empirical research is necessary to dispel exaggerated or even misleading advertisements and inform the policy (supervision) decisions of education authorities and education decisions of households.

Few studies have examined the effects of PT in mathematics on students’ academic achievement in mathematics using a sample from mainland China, and the existing evidence is insufficient to characterize the complex effectiveness of PT comprehensively. Simple correlation analysis may not assist in understanding the effects, particularly given the complex and diverse teaching practices in PT. Bray (2014) contended that robust data analyses pointing to causation (rather than simple correlation analysis) are more suitable for tracing data, especially when experimental intervention studies are lacking. Further, some prior studies (e.g., Fang et al., 2018) were not specifically designed for the analysis of PT in mathematics and mathematics performance. Several statistical methods have proved useful in the literature on the effectiveness of PT. Kuan (2011) used two sets of panel data and propensity score matching (PSM) to analyze the data collected by the Taiwan Education Panel Study in 2001 and 2003. The results indicated that, on average, the mathematics tutoring programs had a small effect on participants. Zhang (2011) used an instrumental variable approach to analyze the effects of PT on the mathematics scores of college entrance examinations in the megacities of China and found no significant results after controlling for the scores of high school entrance examinations. Sun et al. (2020) used a PSM model and found no significant impact of PT on normalized mathematics scores in a Chinese context. Furthermore, the heterogeneity of effectiveness is frequently addressed in the literature (Campbell et al., 2003; Choi & Park, 2016; Ömeroğulları et al., 2020; Zhang et al., 2021).
It should be noted that several types of variables in the theoretical framework (see Table 1) could be applied to analyze the heterogeneity of effectiveness, especially the context variables, such as students with different propensity scores (Choi & Park, 2016) or with different prior achievement (Zhang et al., 2021). Such analyses can provide an empirical basis for educational theories and practices (Wang & Cao, 2014; Zhang et al., 2021). Zhang et al. (2021) suggested that students with low achievement in mathematics might benefit from PT, while others might not.

There is no straightforward evidence suggesting that PT impacts learning performance in all cases. Results from different countries do not support the argument that PT improves students’ educational benefits (Choi & Park, 2016). Bray (2014) summarized several international studies and concluded that the results on the effectiveness of PT were inconclusive or even contradictory. He attributed the results partly to different operational definitions and study foci, considering the wide spectrum of forms, organizations, and intensity of PT. Moreover, differences in teaching characteristics (quality), such as curriculum and teaching objectives, teaching methods, and teaching organizational forms, have not been considered by extant studies. Admittedly, some of these variables are not easy to measure quantitatively because theoretical research is lacking. Zimmer et al. (2010) emphasized the importance of accurately evaluating the effectiveness of various programs and estimating the effects of PT on mathematics performance. PT is informal, less regulated, and voluntary, and PT institutions are less formal than mainstream schools (for example, many PT activities are conducted in students’ or teachers’ homes; Zhang et al., 2021). These characteristics make the research complex, especially in data collection. Additionally, although there is no chasm between PT instruction and formal school instruction, PT instruction has specific characteristics (Wang & Wang, 2021) that might not be captured by measurement instruments developed for formal school instruction.

Previous studies have focused on limited samples, such as a single city, and the results might not be generalizable across China. Although Sun et al. (2020) and Zheng et al. (2020) used the same nationally representative survey of China, the achievement data they used consisted of students’ midterm exam scores provided by schools. The scores were not comparable across schools and might not represent the students’ achievement appropriately, which might have biased the evaluation of effectiveness (Lockwood et al., 2007; Shavelson et al., 1986).

Research Questions

Overall, PT can provide students with additional learning time. The mathematics teacher in shadow education (Wang & Wang, 2021) can potentially assist students in optimizing such time. Specifically, this study asked, “What are the effects of PT in mathematics on students’ mathematics learning performance in China?”

Materials and Methods

Participants and Procedures

The database used in the study came from a large-scale, nationally representative survey and testing project. The sample comprised counties from 31 provinces (including autonomous regions and municipalities directly under the central government) and the Xinjiang Production and Construction Corps. The final data sample covered 120 districts and counties, of which 41 are in Eastern China, 35 in Central China, and 44 in Western China. Eight middle schools were sampled from each selected county, with the schools in each county stratified by location (i.e., city, town, and rural area). Thirty eighth-grade students were randomly selected from each school. Each participant completed a mathematics test and a self-report questionnaire. A total of 29,337 eighth-grade students were sampled, and 27,716 students were finally included in the data analysis. We used multiple imputation to handle missing data (Little & Rubin, 2002; Rubin, 1987), iterating 50 times to obtain the data set in SPSS 24.0.

Measure

In this study, the dependent variable is the students’ academic achievement in mathematics, measured by normalized z scores. These were converted from scale scores generated using Item Response Theory (IRT). The test was developed by a team of expert mathematics teachers and researchers in mathematics education and educational measurement; it was based on the national curriculum standard of China and followed a standardized process. The mathematics-testing instrument reflected the requirements of the national mathematics curriculum standards and had good measurement properties for evaluating students’ academic achievement in mathematics. Item analysis indicated that the test was of moderate difficulty, with good discrimination and high overall reliability and validity. Unidimensional IRT models were used to calibrate items and create the score scales for mathematics (see Jiang et al., 2018). In addition, this study adopted the Modified Angoff method, supplemented by Bookmark procedures, to classify students’ academic performance from low to high into several levels, with performance standards for each subject specified according to the national curriculum. There are four performance levels, I, II, III, and IV, with Level I the lowest and Level IV the highest (see Jiang et al., 2018). It should be noted that, although the database contained abundant variables and the sample was large, the current study is cross-sectional and may therefore not identify the causal impact of PT on students’ achievement well (Zhang et al., 2021).
To partly address this issue, the performance-level categories were used as a proxy for students’ pre-existing performance, since previous studies indicated that PT might not enhance students’ mathematics achievement (Zhang et al., 2021), or might at most enhance it slightly within a semester. Therefore, we could reasonably assume that students’ performance level did not change within one semester and that it partly reflected their academic level at the beginning of the semester. The independent variable of this study is whether students received PT in the current semester.
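The conversion from scale scores to normalized z scores mentioned above can be sketched in Python as follows. This is a minimal illustration, not the study's procedure: the function name and the example scores are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which reference distribution was used.

```python
from statistics import mean, pstdev


def to_z_scores(scale_scores):
    """Convert scale scores to z scores with mean 0 and standard deviation 1."""
    mu = mean(scale_scores)
    sigma = pstdev(scale_scores)  # population SD; an assumption, not from the source
    if sigma == 0:
        raise ValueError("z scores are undefined when all scores are equal")
    return [(score - mu) / sigma for score in scale_scores]


# Hypothetical scale scores, for illustration only.
hypothetical_scale_scores = [420.0, 500.0, 515.0, 560.0, 605.0]
z_scores = to_z_scores(hypothetical_scale_scores)
print([round(z, 2) for z in z_scores])
```

After this transformation, a coefficient on the PT indicator can be read in standard deviation units of the achievement score.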

The covariates in this study include those at the individual level of students (e.g., gender, self-academic expectation, and single-child family), at the household level (e.g., parents’ maximum education years, parents’ academic expectation, and household economic status), and at the school level (e.g., student–teacher ratio, class size, school nature, and location). These variables were discussed in previous studies (He et al., 2021; Zhang et al., 2020, 2021) as a means of addressing sample selection bias. The specific classification and assignment of variables are presented in Table 2 below.

Table 2 Assignment and definitions of the variables

Data Analysis

As discussed above, a hierarchical linear modeling (HLM) regression and propensity score matching (PSM) were applied to assess the effectiveness of PT on student achievement. These two statistical approaches were used together for triangulation. Stata 15.0 and SPSS 24.0 were used for the data analysis.

HLM

HLM can avoid the limitations of classical statistical techniques in analyzing hierarchical data and the potential misinterpretations of results. It is a suitable tool for in-depth analysis and interpretation of commonly occurring data with hierarchical levels (Raudenbush & Bryk, 2002).

Given the structural characteristics of testing data obtained by the random multistage stratified cluster sampling, and considering the significant differences in the educational quality between urban and rural areas and across regions, we assigned the student level characteristics and household background factors, school-level factors and province-level factors to the first, second, and third levels, respectively, for constructing a multi-level random intercept model. Then, the model was used to analyze the effects of private tutoring on students’ academic performance. Specifically:

  • Level 1: \(Y_{ijk}=\beta_{0jk}+\beta_{1jk}\chi_{1ijk}+\beta_{2jk}\chi_{2ijk}+\cdots+\beta_{njk}\chi_{nijk}+r_{ijk},\quad r_{ijk}\sim N\left(0,\sigma^{2}\right)\)

  • Level 2: \(\begin{aligned} & \beta_{0jk}=\gamma_{00k}+\gamma_{01k}s_{1jk}+\cdots+\gamma_{0mk}s_{mjk}+\mu_{0jk},\quad \mu_{0jk}\sim N\left(0,\tau^{2}\right) \\ & \beta_{1jk}=\gamma_{10k} \\ & \cdots \\ & \beta_{njk}=\gamma_{n0k}\end{aligned}\)

  • Level 3: \(\begin{aligned} & \gamma_{00k}=\pi_{000}+\varepsilon_{00k} \\ & \gamma_{01k}=\pi_{010} \\ & \cdots \\ & \gamma_{n0k}=\pi_{n00}\end{aligned}\)

In the above model, \(\pi_{000}\) was the average score of all students’ academic performance; \(\gamma_{00k}\) was the average score of academic performance for students in province \(k\); \(\beta_{0jk}\) was the average score of academic performance for students in school \(j\), located in province \(k\); \(\chi_{1ijk},\ldots,\chi_{nijk}\) denoted variables for individual- and household-level factors affecting students’ academic performance; \(\beta_{1jk},\ldots,\beta_{njk}\) denoted the regression parameters of these level-1 variables; \(s_{1jk},\ldots,s_{mjk}\) denoted variables for school-level factors affecting students’ academic performance; \(\gamma_{01k},\ldots,\gamma_{0mk}\) denoted the regression parameters of the school-level variables; and \(r_{ijk}\), \(\mu_{0jk}\), and \(\varepsilon_{00k}\) represented random errors at the individual, school, and province levels, respectively. Grand-mean centering was adopted for the variables, and the intraclass correlation coefficient (ICC) was 0.312, which indicated the hierarchical structure of the sample.
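The ICC reported above summarizes how much of the variance in scores lies between clusters rather than between individual students. The text does not state which levels the reported value refers to, nor does it give the variance components, so the following Python sketch only illustrates the general calculation for a three-level random intercept model. The function name, the symbol `omega2` for the province-level variance, and the numeric inputs are all hypothetical.

```python
def variance_shares(sigma2, tau2, omega2):
    """Split the total variance of a three-level random intercept model by level.

    sigma2: student-level residual variance (sigma^2 in Level 1)
    tau2:   school-level intercept variance (tau^2 in Level 2)
    omega2: province-level intercept variance (assumed symbol; not named in the text)
    """
    total = sigma2 + tau2 + omega2
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {
        "student": sigma2 / total,
        "school": tau2 / total,
        "province": omega2 / total,
        # Share of variance lying between clusters (schools and provinces).
        "clustered": (tau2 + omega2) / total,
    }


# Illustrative variance components only; these are NOT estimates from the study.
shares = variance_shares(sigma2=0.70, tau2=0.20, omega2=0.10)
for level, share in shares.items():
    print(f"{level}: {share:.3f}")
```

A sizeable clustered share is the usual argument for preferring a multilevel model over a single-level regression, which would treat students in the same school as independent observations.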

PSM

The factors that affect students’ academic performance are complex and diverse, and a typical regression model often cannot capture all the potential variables, especially latent variables that cannot be observed, such as students’ individual ability. In addition, because the samples were not completely homogeneous, self-selection bias may have occurred in the decision to receive tutoring. For example, poor performers and students with favorable household backgrounds might have been more likely to receive PT (Choi & Park, 2016), introducing bias into the estimated PT effects. To evaluate the causal effect of PT on students’ academic development accurately, the ideal approach would be to compare each student’s outcomes in the states of receiving and not receiving PT simultaneously, which is counterfactual and therefore impossible to observe. Therefore, based on a quasi-experimental research design, this study adopted the PSM method (Rosenbaum & Rubin, 1983) under the assumption of conditional independence. The purpose of this method was to match a student receiving PT with a student not receiving the service but having similar or even identical endowment characteristics. Statistically, this method can control the interference caused by non-random sample selection by ensuring the independence of the two groups of students, making possible a more robust and reliable estimate of the causal effect of private tutoring on middle school students’ development. We used Logit regression to fit the propensity score of each sample, which reflected the probability of that sample participating in PT. Then, according to the propensity score, the treatment and control groups were matched using one-to-many nearest-neighbor matching with replacement. We used Stata 15.0 for the PSM analysis and then checked the matching quality through balance and common support tests.
The results show that the matching quality was good (see the Results section for details).
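The matching steps described above (a Logit model for the propensity score, followed by one-to-many nearest-neighbor matching with replacement) can be sketched in Python as follows. This is a simplified illustration on simulated data, not the Stata procedure used in the study: all function names are hypothetical, the number of neighbors `k=3` is an assumption because the text does not report it, and no caliper or common support restriction is applied.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logit(X, t, lr=0.5, epochs=500):
    """Fit a logit model by gradient ascent; returns [intercept, coef_1, ...]."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, ti in zip(X, t):
            err = ti - sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w


def propensity_scores(X, w):
    """Predicted probability of receiving the treatment for each unit."""
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]


def att_nearest_neighbor(scores, t, y, k=3):
    """Average treatment effect on the treated via k-NN matching with replacement."""
    controls = [(s, yi) for s, ti, yi in zip(scores, t, y) if ti == 0]
    diffs = []
    for s, ti, yi in zip(scores, t, y):
        if ti == 1:
            # Controls may be reused across treated units (matching with replacement).
            nearest = sorted(controls, key=lambda c: abs(c[0] - s))[:k]
            diffs.append(yi - sum(c[1] for c in nearest) / len(nearest))
    return sum(diffs) / len(diffs)


def simulate(n=400, seed=0):
    """Simulated data: one covariate drives selection and outcome; true effect is 0."""
    rng = random.Random(seed)
    X, t, y = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        treated = 1 if rng.random() < sigmoid(0.8 * x) else 0
        X.append([x])
        t.append(treated)
        y.append(x + rng.gauss(0.0, 0.5))  # outcome does not depend on treatment
    return X, t, y


X, t, y = simulate()
weights = fit_logit(X, t)
scores = propensity_scores(X, weights)
treated_y = [yi for ti, yi in zip(t, y) if ti == 1]
control_y = [yi for ti, yi in zip(t, y) if ti == 0]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
att = att_nearest_neighbor(scores, t, y)
print(f"naive difference: {naive:.3f}, matched ATT: {att:.3f}")
```

In the simulation the treatment has no true effect, yet the unmatched comparison differs from zero because the covariate influences both selection and outcome. Matching on the propensity score is intended to remove that part of the difference, but it can only adjust for covariates that are observed and included in the Logit model.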

Results

HLM Analysis on the Effects of PT

For comparison, we assigned the student-level characteristics and household background factors, school-level factors, and province-level factors to the first, second, and third levels, respectively, for constructing a multi-level random intercept model. Then, the model was used to analyze the effects of PT on students’ academic performance.

Table 3 presents the results of the analysis. Several covariates were found to have a significant effect on students’ mathematics achievement with a considerable effect size, such as self-expectation, parents’ expectation, being raised by parents, and school nature. The other covariates should be interpreted with caution: their effect sizes were small, and with a sample as large as the current one, even small effects readily reach statistical significance.

Table 3 Analysis results of the three-level model for the effect of tutoring on academic performance

The results indicate that receiving PT in the current study did not improve students’ learning performance significantly. Instead, it had significant negative effects on learning performance.

PSM Analysis on the Effects of PT

Given the problems of endogeneity and self-selection bias that remain even in HLM regression, the result that PT negatively affected students’ academic performance cannot be deemed conclusive. To address these problems, we performed a further analysis with PSM for triangulation.

Based on a typical Logit regression, the propensity scores of students receiving PT in mathematics were derived. Figure 1 and Table 4 present the test results of the matching quality. They demonstrate that the matching quality was satisfactory: the distributions of the experimental group (with PT) and the control group (without PT) were consistent after matching, and there was no statistically significant difference in the covariates between the two groups.
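A common complementary way to check covariate balance after matching is the standardized mean difference between the two groups. The Python sketch below is an illustration only: it is not the test reported in Table 4, and the function name and data are hypothetical. Values close to zero suggest that a covariate is balanced across groups.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2.0)
    if pooled_sd == 0:
        return 0.0  # no spread in either group, so no scaled difference to report
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical covariate values (e.g., parents' years of education).
treated_before = [12.0, 14.0, 16.0, 16.0, 18.0]
control_before = [9.0, 10.0, 12.0, 12.0, 14.0]
control_after = [12.0, 14.0, 15.0, 16.0, 18.0]  # matched controls

smd_before = standardized_mean_difference(treated_before, control_before)
smd_after = standardized_mean_difference(treated_before, control_after)
print(f"before matching: {smd_before:.2f}, after matching: {smd_after:.2f}")
```

Unlike a significance test, this measure does not depend on sample size, which is useful when the sample is as large as the one analyzed here.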

Fig. 1 Tests on matching quality

Table 4 The comparison of the treated and control groups

Table 5 indicates that before matching, the average scores of students with PT were significantly lower than those of peers without PT. After matching, the difference in academic performance between the two groups was almost zero, although it remained statistically significant. Because the sample was very large, statistical significance at such a small effect size carries little practical weight. That is, receiving PT had almost no practical effect on students’ learning performance, which parallels the results of the HLM analysis. The combined results of the two models suggest that there is no evidence to support the argument that PT in mathematics could improve students’ mathematics academic performance.

Table 5 Effects of PT before and after matching

Discussion and Implications

This study aimed to investigate the contribution of PT to the academic performance of middle school students across China. In sum, the study did not identify any significant positive effect of receiving PT in mathematics (in the current semester) on academic achievement in the subject, contrary to parents’ expectations. This is consistent with the results of many previous empirical studies with limited samples in China (e.g., Zhang, 2011; Zhang et al., 2021) as well as large-scale surveys in other countries (e.g., Guill et al., 2019; Ömeroğulları et al., 2020). The result calls into question the effectiveness of PT in China. Based on Carroll’s model of school learning discussed above, although receiving PT implies extended learning time and the acquisition of more learning resources, the activity alone does not necessarily improve students’ academic achievement (He et al., 2021).

The learning time of PT might interfere with the time for formal school learning (Bray, 2009), such as time spent on homework. Moreover, the quality of PT instruction is a factor that impacts PT effectiveness (Guill et al., 2019; Zhang et al., 2021), as PT instruction might not be coherent with regular school instruction in content or learning objectives (He et al., 2021). Previous studies have tended to pay little attention to variables related to specific instructional characteristics of PT (Wang & Wang, 2021), such as learning processes and teaching behaviors, because these variables cannot be measured readily. The arrangements of PT and formal schooling might interfere with each other and make it harder for the PT teacher to maintain coherence within PT instruction. This is especially the case when a class includes more than one student and the students come from different schools or classes with different learning progressions. At the same time, students might face difficulties in dealing with two or more incoherent trajectories of mathematics learning. Thus, PT might not benefit students’ mathematical learning (He et al., 2021). Uniform professional standards and professional development programs for PT teachers are lacking (Wang & Wang, 2021). Instructional quality could explain the results of evaluations of PT effectiveness because, in practice, PT activities in mathematics are not identical and inevitably lead to distinctive learning outcomes. This issue needs to be addressed more adequately in future teaching activities.

It should be noted that, although several variables were controlled to address the problems of PT selection bias and endogeneity (Zhang, 2013), this study was still cross-sectional, and a longitudinal research design should be adopted in future studies. In addition, this study only explored PT within one semester; whether students would benefit from longer-term PT should be tested in future studies.

This study indicates that PT in mathematics may fail to improve students’ mathematics academic achievement. As PT could potentially overburden students, special attention must be paid to the results of the current study, which confirm an important educational paradigm: partnership with schools should be the priority for parents who care about their children’s learning performance. To improve students’ academic achievement, schooling may be a better option, especially considering schools’ potential educational and teaching resources and their roles as providers of mainstream education. Such a partnership promotes students’ participation in school activities to improve their learning performance. Seeking PT without providing further parental support would be ineffective or even harmful. Parents should motivate their children to take ownership of their learning careers. Educational authorities need to take measures to mitigate parents’ anxiety about students’ learning and to inform parents and the wider public about the empirical evidence regarding the effectiveness of PT, guiding them to seek private tutoring carefully.