
1 Introduction

Learning mathematics is hard. At the neuro-cognitive foundation of young children’s math development are core numerical processing competences, such as the ability to enumerate small sets of dots and to compare the relative magnitudes of sets [6]. These numerical competences are diagnostic markers of emerging math abilities from as early as preschool age [7], which makes them targets for conceptually motivated intervention programs [1].

Fig. 1.

Dot enumeration response time regressed on set size (number of dots) using segmented linear regression with one break point

Fig. 2.

An example of a dot enumeration item for set size 4; the correctly answered items are used in Fig. 1

We investigate characteristics that define exceptional patterns in young children’s enumeration ability. Generally, enumeration performance reflects two distinct processes: the subitizing system, in which small sets (1–4 dots) are recognized accurately and rapidly, and the counting system, in which larger sets are enumerated more slowly, perhaps by counting or other enumeration strategies [22]. Figure 1 gives an example: the enumeration response time for small sets is relatively flat, while the counting slope is steeper. The inflection point demarcates the subitizing range from the counting range.

Individual differences in subitizing range predict math ability [17]. An inability to subitize is associated with dyscalculia [12]. There is value in accurately and reliably estimating the parameters that define subitizing patterns (initial reaction time, range, slope). Common algorithms used for estimating the subitizing range can produce inconsistent results [15], especially among individuals with dot enumeration curves that deviate from the typical curve.

We develop a piecewise linear regression model class for Exceptional Model Mining (EMM) [4] to discover subgroups of children whose subitizing curves exhibit atypical patterns. EMM is a local pattern mining framework seeking coherent subgroups in a dataset that somehow behave exceptionally. We develop various quality measures based on log likelihood that allow us to discover atypical subitizing patterns such as deviating initial reaction times, subitizing ranges, counting slopes, or a combination of those.

We use data collected by the FUnctional Numerical Assessment (FUNA) study [21]. Numerical processing competences and math abilities are assessed using several computer-assisted tasks. Some of these tasks contain a fixed number of questions, or items; others are time-based and the number of answered items will vary per child. Items are taken from a larger set of items, and not necessarily answered in the same order. Consequently, the dataset does not follow the conventional data mining representation where each individual can be described with one tuple of attribute values, and where a column contains the same semantic information for each individual (Footnote 1). Hence, pre-processing is required to allow existing algorithms to search through the space of candidate subgroups. We propose an approach tailored to the item-performance data at hand.

The main contributions of this paper are: 1) an EMM model class and various quality measures for segmented linear regression; 2) a deeper understanding of how subitizing patterns relate to other numerical processing competences and emerging math abilities; 3) an effective pre-processing technique for handling repeatedly measured attributes in descriptive space.

2 The FUnctional Numerical Assessment Study

The FUnctional Numerical Assessment (FUNA) study [21] is a large-scale research program in Finland to develop digital assessment tools for detecting dyscalculia and dyslexia. Currently, several studies are being run to evaluate the validity and reliability evidence of the tasks [8]. The current version has been normed in Finnish and Finnish-Swedish languages for grade levels 3 to 9 (9 to 15 years old). In the FUNA-DB (Dyscalculia Battery) the children respond to six digital (CAI) tasks using a tablet or a computer: Number Comparison (NC), Dot Matching equivalence task (DM), Single Digit Addition (SA), Single Digit Subtraction (SS), Combination Addition (CA) and Number Series (NS). Every task consists of multiple questions, or items. The tasks SA, SS, CA, and NS measure arithmetic fluency; items considered easier are presented earlier than more difficult items, but the exact order is not the same between children (i.e., a quasi-random order). In the number processing tasks (NC, DM), a set of predefined items is presented in a fully random order. Figure 2 displays an example of a DM item. Children compare a symbolic number (1–9) to a non-symbolic representation of a number. The location of the dots is randomized as well. When the symbolic and non-symbolic representations are the same, and when the children answer correctly, the DM task can be considered a Dot Enumeration (DE) task: determining the number of dots in a visual array.

Table 1 displays a dataset slice. On the right side (to be used as target attributes in the EMM model class, see Sect. 5), we present information from the DE task. Attributes \(\ell _1\) and \(\ell _2\) represent the set size (1–9) and response time in milliseconds, respectively. These attributes are the independent and dependent variables in a segmented linear regression model class as visualized in Fig. 1. We indicate that we obtain data from multiple DE items per child by using tuples (e.g., for the first item of child 1, the set size was 5 and the response time was 1330 ms). For the SA, SS, and CA tasks, the number of items (tuple-length) differs per child; for the NC and DE tasks, the tuple-length is 52.

Apart from the set size and response time for each task, we may consider information such as whether the item was answered correctly, what the correct answer is, and what the numerical distance is between the two numbers shown in a certain item. All this information is represented as separate attributes (e.g., attribute \(a_3\) indicates whether the items on the NC task have been answered correctly: yes or no) and will be used to discover and describe exceptional subgroups of children. We have additional descriptive information such as a child’s sex (\(a_1\)), grade, and the language (Finnish or Swedish) in which they executed the tasks.

Table 1. Small slice of FUNA dataset. Some descriptors originate from the NC-task (\(a_2\), \(a_3\), \(a_4\)), others from the SA (\(a_5\)), SS, or CA task, or from the general background information (\(a_1\)). In our EMM instance, target attributes originate from the DE task (\(\ell _1\), \(\ell _2\)). All task-based attributes contain data from multiple items, resulting in tuples of values. The number of values per tuple may vary per child and per task.

The data format as used by most traditional data mining algorithms is also known as a propositional table; these are single-table representations where each individual can be described with one term. In the attribute-value case, this term is a tuple of attribute values [13]. For instance, a student could be represented by a three-tuple specifying age, grade and language. Generally in EMM, we let the subgroup description be a conjunction of selection conditions over the descriptors, where condition \(sel_j\) is a restriction on the domain \(\mathcal {A}_j\) of the respective attribute \(a_j\). For instance, a description \(\text {sex} = \text {girl} \wedge \text {language} = \text {Finnish}\) covers all girls who executed the FUNA tasks in Finnish.

However, for all attributes other than sex, grade, and language, our dataset does not follow this conventional data mining representation; a descriptive attribute is not associated with one value, but rather with a tuple of values. In this case, it is unclear what it means to apply a selector \(sel_j\) directly; a selector \(a_2 \le 3\) would select items rather than individuals, and a selector such as \(a_{2t} \le 3\), where t refers to the item indicator, would inflate the number of descriptors, which is detrimental to efficient traversal of the search space. In addition, such a selector has little conceptual meaning, again because the items are quasi-randomly ordered and item t is not the same across children. We provide a more satisfactory alternative in Sect. 4.

3 Background

Exceptional Model Mining (EMM) [4] is a local pattern mining framework seeking coherent subgroups in the dataset that somehow behave exceptionally. The observed attribute values are divided into descriptors \(a_1,\ldots ,a_k\) and targets \(\ell _1,\ldots ,\ell _m\). Dataset \(\varOmega \) is then a bag of n records \(r \in \varOmega \) of the form

$$\begin{aligned} r = (a_1,\ldots ,a_k,\ell _1,\ldots ,\ell _m). \end{aligned}$$
(1)

Subgroups are defined using descriptions: Boolean functions \(D : \mathcal {A} \rightarrow \{0,1\}\). A description D covers a record \(r^i\) if and only if \(D(a_1^i, \ldots , a_k^i) = 1\).

Definition 1

(Subgroup, cf. [4]). A subgroup corresponding to description D is the bag of records \({G}_D \subseteq \varOmega \) that D covers:

$$\begin{aligned} {G}_D = \{r^i \in \varOmega \ |\ D(a_1^i, a_2^i, \ldots , a_k^i) = 1\}. \end{aligned}$$

The complement contains all non-covered records: \(G_D^C = \varOmega \setminus G_D\) [4, p.53].

In EMM, the choice of description language \(\mathcal {D}\) is free, though generally we let the description be a conjunction of selection conditions over the descriptors, where condition \(sel_j\) is a restriction on the domain \(\mathcal {A}_j\) of the attribute \(a_j\). For discrete variables the selector may be an attribute-value pair (\(a_j = v\)); for continuous variables it could be a range of values (\(w_1 \le a_j \le w_2\)) [4].
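For illustration, a description and the subgroup it covers (Definition 1) amount to a Boolean predicate over records. The records below are hypothetical toy values, not FUNA data:

```python
# Hypothetical records: one dict per child, with time-invariant descriptors.
records = [
    {"sex": "girl", "language": "Finnish", "grade": 4},
    {"sex": "boy",  "language": "Finnish", "grade": 5},
    {"sex": "girl", "language": "Swedish", "grade": 3},
]

def D(r):
    """Description: sex = girl AND language = Finnish."""
    return r["sex"] == "girl" and r["language"] == "Finnish"

subgroup = [r for r in records if D(r)]        # G_D
complement = [r for r in records if not D(r)]  # G_D^C = Omega \ G_D
```

Only the first record is covered; the other two form the complement.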

The task of EMM is to discover the descriptions whose subgroups display exceptional behaviour on the target variables. The precise instantiation of “behaviour” depends on the application. A quality measure quantifies the exceptionality of the within-subgroup behaviour relative to some reference model.

Definition 2

(Quality Measure, cf. [4]). A quality measure (QM) is a function \(\varphi : \mathcal {D} \rightarrow \mathbb {R}\) that assigns a numerical value to a description D.

The challenge in EMM is to effectively search through the descriptive space to find the top-q best-scoring subgroups.

In traditional EMM, the combination of Eq. (1) and a description language based on conjunctions of selection conditions implicitly assumes the data to be in a flat-table format where every record is an individual that is described by a tuple of attribute values, and placed on a new row in the single flat-table. In contrast, in this paper, an attribute a or \(\ell \) may or may not be measured repeatedly per individual i. We focus our notation on the descriptive attributes, and write \(a_{jt}^i\) to denote the \(t^{\text {th}}\) measurement of the \(j^{\text {th}}\) descriptive attribute for the \(i^{\text {th}}\) individual. Compared to Eq. (1), the form of the descriptive part of a record (individual) \(r \in \varOmega \) changes to:

$$\begin{aligned} r = \left( (a_{11}, a_{12}, ..., a_{1t}, ..., a_{1{t_1}}), (a_{21}, ..., a_{2t_2}), ..., (a_{k1}, ..., a_{k{t_k}})\right) , \end{aligned}$$
(2)

where \(t_j^i\) refers to the number of repeated measures of attribute \(a_j\) for individual \(i \in \{1,2,...,n\}\), which may vary across individuals and attributes; we let \(t_j=\max _{i=1,2,...,n}t_j^i\). Some descriptors may be measured only once per individual (such as sex in Table 1); then, \(t_j^i = 1\) for all i.
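As a toy sketch (hypothetical attribute names and values, not FUNA records), one individual's record in the form of Eq. (2) can be seen as a mapping from attributes to tuples of varying length:

```python
# One individual's record in the form of Eq. (2): each descriptor maps to a
# tuple of repeated measurements; tuple lengths t_j^i may differ per attribute.
record = {
    "sex": ("girl",),        # measured once: t_j^i = 1
    "a2": (5, 2, 8, 1),      # four repeated item-level measurements
    "a3": (1, 0, 1, 1),      # correctness of the same four items
}

# The number of repeated measures t_j^i per attribute for this individual:
t = {attr: len(values) for attr, values in record.items()}
```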

3.1 Segmented Linear Regression

The goal of regression is to predict the value of an attribute y given a new value of \(\textbf{x}\), where \(\textbf{x}\) is a random draw from a vector of variables \(\textbf{X} = (X_1, \ldots , X_d)\) with state space \(\mathbf {\mathcal {X}}\). The simplest linear model for regression is one that involves a linear combination of the input variables and parameters \(\textbf{w}\): \(f(\textbf{x},\textbf{w}) = \textbf{w}^T \textbf{x}\). We additionally aim to model the uncertainty, modeling a predictive distribution \(p(y|\textbf{x})\) by assuming that the deterministic function \(f(\textbf{x},\textbf{w})\) has additive Gaussian noise with zero mean and precision \(\beta \) (inverse variance). We then obtain the likelihood function:

$$\begin{aligned} p(y|\textbf{x},\textbf{w},\beta ) = \mathcal {N}(y|f(\textbf{x},\textbf{w}),\beta ^{-1}) = \prod _{i=1}^n \mathcal {N}(y[i]\ |\ \textbf{w}^T \textbf{x}[i], \beta ^{-1}). \end{aligned}$$
(3)

Next, estimating \(\textbf{w}\) and \(\beta \) using Maximum Likelihood Estimation shows that the log likelihood of a regression model depends on the sum-of-squares error function (SSR) [2] (see [24, Section 1] for an elaboration):

$$ \ln p(y|\textbf{x},\textbf{w},\beta ) \propto -SSR(y,f(\textbf{x},\textbf{w})) = -\sum _{i=1}^n \left( f(\textbf{x}[i],\textbf{w}) - y[i]\right) ^2. $$

Segmented linear regression appears to require non-standard optimization techniques. However, one can parameterize the model such that it can be modeled using an iterative, linear approach [18]. We focus on modeling a segmented relationship with two line segments between response variable y and one explanatory variable \(x_h\) by fitting the terms:

$$\begin{aligned} y = g(x_h,\alpha ,\beta ,\psi ) = \alpha x_h + \beta (x_h - \psi )_{+} \end{aligned}$$
(4)

where \((x_h - \psi )_{+} = (x_h - \psi ) \cdot I(x_h > \psi )\), with \(I(\cdot )\) the indicator function equal to 1 if the statement is true and 0 otherwise. Consequently, \(\psi \) is the x-axis break point, \(\alpha \) is the slope of the line segment to the left of \(\psi \), and \(\beta \) is the difference in slopes between the line segments to the left and right of \(\psi \). [18] then iteratively fits linear models of the form \(\alpha x_h + \beta U^{(s)} + \gamma V^{(s)}\) with \(U^{(s)} = (x_h-\psi ^{(s)})_{+}\) and \(V^{(s)}=-I(x_h > \psi ^{(s)})\). In every iteration, the estimate is updated through \(\psi ^{(s+1)} = \psi ^{(s)} + \hat{\gamma }/\hat{\beta }\); when the algorithm stops and \(\hat{\gamma } \approx 0\), the \(s^\text {th}\) approximation is the Maximum Likelihood Estimate: \(\hat{\psi }^{(s)} \equiv \hat{\psi }\) [18].
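This iterative scheme can be sketched in plain Python. The data, helper names, and starting value below are illustrative (not FUNA data), and an intercept c is fitted alongside the terms of Eq. (4), as in the fitted curves of Fig. 1:

```python
# Illustrative sketch of the iterative break-point scheme of [18]
# for one break point; an intercept c is fitted alongside Eq. (4)'s terms.

def _solve(A, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w

def _ols(X, y):
    """Ordinary least squares via the normal equations."""
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    return _solve(XtX, Xty)

def fit_segmented(x, y, psi0, max_iter=50, tol=1e-4):
    """Fit y = c + alpha*x + beta*(x - psi)_+ with one break point psi."""
    psi, lo, hi = psi0, min(x) + 0.5, max(x) - 0.5
    for _ in range(max_iter):
        U = [max(xi - psi, 0.0) for xi in x]          # (x - psi)_+
        V = [-1.0 if xi > psi else 0.0 for xi in x]   # -I(x > psi)
        X = [[1.0, xi, ui, vi] for xi, ui, vi in zip(x, U, V)]
        c, alpha, beta, gamma = _ols(X, y)
        psi = min(max(psi + gamma / beta, lo), hi)    # psi^(s+1) = psi^(s) + gamma/beta
        if abs(gamma) < tol:                          # gamma ~ 0 => converged
            break
    return c, alpha, beta, psi

# Noiseless toy DE-style curve: slope 50 up to set size 4, then slope 500
x = [s for s in range(1, 10) for _ in range(3)]
y = [1000 + 50 * xi + 450 * max(xi - 4, 0) for xi in x]
c, alpha, beta, psi = fit_segmented(x, y, psi0=3.0)
```

On this noiseless toy curve the procedure recovers the break point \(\hat{\psi } \approx 4\), subitizing slope \(\hat{\alpha } \approx 50\), and slope difference \(\hat{\beta } \approx 450\); the experiments in Sect. 6 instead use the pwlf library.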

3.2 Connections to Existing SD/EMM Approaches

Linear target models for EMM are not a new concept [19]. Existing model classes use QMs that compare a regression parameter between the subgroup and a reference model. Instead, we follow the approach of [26] and [23], who build QMs on the log likelihood. These QMs do not directly compare parameter estimates but rather evaluate the overall fit of a model estimated on the subgroup. In addition, in this paper, we exploit the special situation that, when we assume Gaussian noise, maximizing the log likelihood is equivalent to minimizing the residual sum of squares. This characteristic simplifies the notation and calculation of our QMs.

Our dataset has a nested structure: we aim to create subgroups at the level of the individual, while having access to repeated measures per individual in both target and descriptive space. We are not the first to consider time-varying target attributes. For instance, [23] analyzed glucose fluctuations and [3] discovered funding applications with deviating temporal subprocesses. However, in descriptive space, these authors use attributes that are measured at the same level as the individual; their flattening approach can be categorized as a transformation to a wide flat-table data format. Alternatively, [16] transformed their data into a long, stacked flat-table format where each row contains a transition rather than an entire sequence. Consequently, [16] changed the notion of an individual: from sequence to transition.

Relational subgroup discovery (RSD) [27] takes a propositionalization-based approach, using Prolog queries consisting of structural predicates, and creates a binary table where each column represents a newly created feature that may or may not be present for a particular record. Our proposed method is best described as a simple aggregation approach to feature construction [11]. We do not apply automated feature construction methods; these typically assume that columns of the dataset have a coherent semantic meaning, which our data does not (cf. Footnote 1). We show that with domain-specific aggregation functions, subgroup interpretability blossoms.

4 Our Proposed Flattening Approach

An aggregated descriptor is a descriptive attribute constructed out of one or more original descriptors, where the original descriptors are defined as in Sect. 3 and may or may not contain repeated measures per individual. The goal is to describe each individual with one tuple of attribute values as in Eq. (1), rather than a tuple of tuples as in Eq. (2). This allows defining descriptions as conjunctions of selection conditions over the aggregated descriptors.

Denoting an original descriptor with \(a_j\), we construct an aggregated descriptor \(\tilde{a}_h\) by applying a function \(\xi : \mathbb {R}^* \rightarrow \mathbb {R}^1\) such that per individual, the number of observed values on attribute \(\tilde{a}_h\) is 1. A function \(\xi \) may be applied to one or more time-varying descriptors, possibly in combination with an invariant descriptor.

Definition 3 (Aggregated descriptor)

Given one or more descriptors \(a_* \subseteq \{a_1,a_2,...,a_k\}\), an aggregated descriptor \(\tilde{a}_h\) is an attribute constructed by applying a function \(\xi : \mathbb {R}^* \rightarrow \mathbb {R}^1\) such that per individual, the number of observed values on attribute \(\tilde{a}_h\) is 1, that is \( \tilde{a}_h = \xi (a_*). \)

Aggregated descriptors may arise from a function such as a summation or average, they may be non-linear (conditional) functions of one or more original descriptors, and/or they could be parameter estimates of a statistical model. Section 4.1 provides examples of all of these for the FUNA study.

The aggregated descriptors induce a tweak to the definition of a subgroup:

Definition 4 (Subgroup)

A subgroup corresponding to description D is the bag of records \(G_D \subseteq \varOmega \) that D covers:

$$\begin{aligned} G_D = \{r^i \in \varOmega | D(\tilde{a}_1^i, \tilde{a}_2^i, ..., \tilde{a}_s^i,a_{\dagger _1}^i, a_{\dagger _2}^i, ..., a_{\dagger _{k_\dagger }}^i) = 1\}. \end{aligned}$$
(5)

The domain \(\tilde{\mathcal {A}} \times \mathcal {A}_\dagger \) is the collective domain of all aggregated descriptors \(\tilde{a}_1,\ldots ,\tilde{a}_s\) and the time-invariant descriptors \(a_\dagger = \{a_j \in \{a_1,\ldots ,a_k\} | t_j = 1\}\).

4.1 Domain-Specific Aggregation Functions

Definition 3 allows for many variations. In the context of FUNA, a simple example is a function \(\xi _{\max }\) that counts the number of answered items per task. For instance, \(\tilde{a}_1^i = \xi _{\max } (a_{\text {NC}}^i) = t_{\text {NC}}^i\) is the number of NC items answered by individual i, where \(a_{\text {NC}}\) is the item-indicator of task NC. We may want to know how many items individual i answered correctly: \(\tilde{a}_2^i = \xi _{\text {sum}} (a_3^i) = \sum _t a_{3t}^i\), where \(a_3\) is a binary attribute as in Table 1. We could subsequently measure the proportion of correctly answered NC items: \(\tilde{a}_3^i = \xi _{\text {sum}} (a_3^i) / \xi _{\max } (a_{\text {NC}}^i)\).

Other aggregation functions that are interesting from a domain perspective are the mean and median response time of the correctly answered items. We write \(\tilde{a}_4^i = \xi _{\text {MeanTC}}(a_3^i,a_4^i) = (\xi _{\text {sum}} (a_3^i))^{-1}\cdot \sum _{t \in \{1,...,t_4\}\ \text {s.t.}\ a_{3t}^i = 1} a_{4t}^i\). For \(\xi _{\text {MedTC}}\) we would do something similar but take the median rather than the mean.

Table 2. An overview of the aggregation functions used in FUNA

In the domain of educational learning, the Inverse Efficiency Score (IES) [7] is a measure that combines the median response time and the accuracy (proportion of correctly answered items). The IES allows researchers to identify children with high response times or with a low proportion of correctly answered items, since the IES is high in both cases. For an individual:

$$\begin{aligned} \tilde{a}_6^i = \xi _{\text {IES}} (a_{\text {NC}}^i,a_3^i,a_4^i) = \frac{\xi _{\text {MedTC}}(a_3^i,a_4^i)}{\xi _{\text {PropAnsC}}(a_{\text {NC}}^i,a_3^i)}. \end{aligned}$$
(6)
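With toy item-level tuples for a single child (hypothetical values, not FUNA records), these aggregation functions reduce to a few lines of Python:

```python
from statistics import median

# a3: correctness per NC item (1 = correct), a4: response time per item (ms)
a3 = [1, 1, 0, 1, 1, 0, 1]
a4 = [820, 910, 1500, 760, 880, 1700, 930]

xi_max = len(a3)                    # number of answered items (xi_max)
xi_sum = sum(a3)                    # number of correctly answered items (xi_sum)
prop_correct = xi_sum / xi_max      # proportion correct (xi_PropAnsC)

# Mean and median response time over correctly answered items only
correct_rts = [rt for ok, rt in zip(a3, a4) if ok == 1]
mean_tc = sum(correct_rts) / len(correct_rts)   # xi_MeanTC
med_tc = median(correct_rts)                    # xi_MedTC

ies = med_tc / prop_correct         # Inverse Efficiency Score, Eq. (6)
```

Both a long median response time and a low accuracy drive the IES up, matching the interpretation above.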

For the Number Comparison (NC) task, it is interesting to analyze the numerical distance effect [9]. When children must decide which of two numbers is larger, the task is easier to perform when the numbers are far apart (NumD). If two number pairs have the same distance, the task is hypothesized [22] to be easier if the larger number is smaller. This is called the Number Ratio (NumR). We regress the response time of the NC items on NumD (and once more on NumR), and evaluate the intercept (Ic) and slope (Sl) of these models. Thus, we first create a time-variant descriptor \(a_{\text {NumD}} = |a_{\text {NC}\text {L}}-a_{\text {NC}\text {R}}|\) (where \(a_{\text {NC}\text {L}}\) and \(a_{\text {NC}\text {R}}\) are the numbers shown on the left and right in each NC item) and then fit a linear regression model per individual i: \(a_4^i = f(a_{\text {NumD}}^i,w_0^i,w_1^i)\). Parameter estimates \(w_0^i\) and \(w_1^i\) are the intercept and slope of the regression model, stored as aggregated descriptors \(\tilde{a}_7^i = w_0^i\) and \(\tilde{a}_8^i = w_1^i\). We take the same approach for NumR.
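The per-individual distance-effect regression can be sketched as follows; the NC items below are toy values for one hypothetical child, and a closed-form simple linear regression stands in for the model fit:

```python
# Hypothetical NC items for one child: numbers shown left/right and response time (ms).
nc_left  = [2, 7, 3, 9, 5, 8]
nc_right = [8, 5, 4, 1, 6, 2]
rt_ms    = [700, 950, 1100, 620, 1050, 680]

num_d = [abs(l - r) for l, r in zip(nc_left, nc_right)]  # numerical distance a_NumD

# Simple least squares: slope = cov(x, y) / var(x), intercept = mean(y) - slope*mean(x)
mx = sum(num_d) / len(num_d)
my = sum(rt_ms) / len(rt_ms)
w1 = sum((x - mx) * (y - my) for x, y in zip(num_d, rt_ms)) / \
     sum((x - mx) ** 2 for x in num_d)
w0 = my - w1 * mx
# w0 and w1 become the aggregated descriptors a~_7 and a~_8 for this child
```

A negative slope \(w_1\) reflects the distance effect: response times shrink as the numerical distance grows.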

An overview of these aggregated descriptors is given in Table 2.

5 Our Proposed Target Model

We seek subgroups of children with atypical dot enumeration curves. We use the segmented linear regression model as a target model (cf. Sect. 3.1) with response time \(\ell _2\) as output (y) and set size \(\ell _1\) as input (\(x_h\)) (cf. Table 1). We are interested in finding any kind of deviation from the typical DE curve; in a typical DE curve the subitizing slope is close to zero, the subitizing range is somewhere between 3 and 4, and the counting slope is relatively steep.

Following [26] and [23], we assume that the parameters of a linear model fitted on the subgroup will likely describe the subgroup better than the parameters estimated on the entire dataset. Then, in the presence of a subgroup, the log likelihood of dataset \(\varOmega \) will increase if the parameters of the subgroup are separately estimated. For any subgroup SG and its complement \(SG^C\),

$$ \ln p(SG|\theta ^{SG}) + \ln p(SG^C|\theta ^{\varOmega }) > \ln p(SG|\theta ^{\varOmega }) + \ln p(SG^C|\theta ^{\varOmega }), $$

where \(\ln p(SG|\theta ^{SG})\) is the log likelihood of the subgroup for a segmented linear regression model estimated on the SG with \(\theta ^{SG} = (\alpha ^{SG},\beta ^{SG},\psi ^{SG})\). We expect this term to be larger than the log likelihood of the subgroup for a segmented linear regression model estimated on the entire dataset \(\varOmega \): \(\ln p(SG|\theta ^{SG}) > \ln p(SG|\theta ^{\varOmega })\). Next, we use the characteristic of linear regression that maximizing the log likelihood is equivalent to minimizing the sum-of-squares error function (SSR) (see Sect. 3.1, and [24, Section 1]) and aim to find subgroups where \(\ln p(SG|\theta ^{SG}) > \ln p(SG|\theta ^{\varOmega })\) holds. Hence, we formulate our first QM as follows:

$$\begin{aligned} \varphi _{\text {ssr}} &= \frac{1}{\varphi _{\text {ef}}} \cdot \left( - \frac{A}{N^{SG}}\right) \nonumber \\ A &= SSR(\ell _2,g(\ell _1,\theta ^{SG})) = \sum _{i=1}^{n^{SG}} \sum _{t=1}^{t^i_{\ell _1}} \left( \ell _{2t}^i - \hat{\alpha }^{SG} \ell _{1t}^i - \hat{\beta }^{SG}(\ell _{1t}^i - \hat{\psi }^{SG})_{+}\right) ^2, \end{aligned}$$
(7)

where \(N^{SG} = \sum _{i=1}^{n^{SG}} t^i_{\ell _1}\) is the number of observations in the subgroup in target space and \(\varphi _{\text {ef}}\) is the entropy function [4] that discourages tiny subgroups. We take the SSR of \(\ell _2\) with respect to \(g(\ell _1,\theta ^{SG})\), which is defined in Eq. (4). If the sum-of-squares error decreases, \(\varphi _{\text {ssr}}\) increases.

Although both the regression parameters and precision depend on the sum-of-squares, they are statistically independent. This means that we could find subgroups with a small error where \(\ln p(SG|\theta ^{SG}) > \ln p(SG|\theta ^{\varOmega })\) does not hold; the log likelihood of the subgroup may be large, but it may not be larger than the log likelihood of the global model, for instance when the regression parameters \(\theta ^{SG}\) do not differ much from \(\theta ^{\varOmega }\). Therefore, we propose a QM that rewards not only small values of SSR for the subgroup, but also values of SSR for the subgroup that are smaller than the SSR of the subgroup evaluated on the global model:

$$\begin{aligned} \varphi _{\text {ssrb}} = \varphi _{\text {ef}} \cdot \frac{A(B-A)}{N^{SG}}, \end{aligned}$$
(8)

where A is as in Eq. (7) and B is similar but with \(\theta ^{SG}\) replaced by \(\theta ^{\varOmega }\).
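A direct transcription of both QMs: A and B are the two SSR terms defined above, and the size weight \(\varphi _{\text {ef}}\) is here implemented as the binary entropy of the subgroup proportion, a common choice in the EMM literature (an assumption of this sketch, not a prescription of the paper):

```python
from math import log

def entropy_ef(n_sg, n_total):
    """Entropy-based weight discouraging tiny subgroups (cf. [4])."""
    p = n_sg / n_total
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log(p, 2) - (1 - p) * log(1 - p, 2)

def phi_ssr(A, N_sg, n_sg, n_total):
    """Eq. (7): the quality is negative, so dividing by entropy_ef
    still favours larger subgroups."""
    return (1.0 / entropy_ef(n_sg, n_total)) * (-A / N_sg)

def phi_ssrb(A, B, N_sg, n_sg, n_total):
    """Eq. (8): positive only when the subgroup's own fit beats the
    global model's fit on the subgroup (A < B)."""
    return entropy_ef(n_sg, n_total) * A * (B - A) / N_sg
```

For example, for a subgroup of 50 out of 1000 children with \(A < B\), \(\varphi _{\text {ssrb}} > 0\), while \(\varphi _{\text {ssr}}\) is always negative and closer to zero for better fits.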

6 Experiments

We perform two experiments with the FUNA dataset. First, we randomly sample 5% of the children and experiment with both QMs \(\varphi _{\text {ssr}}\) and \(\varphi _{\text {ssrb}}\). We perform beam search [4, Algorithm 1] with \(b=4\), \(w=20\), and \(q=10\). Especially when working with domain-specific data, we aim for our resulting subgroup set to be a good balance between interpretability, variety, and quality. To further understand how a weighted coverage scheme (WCS) [14] can contribute to finding such a balanced subgroup set, and what its relation is to the search depth d, we vary \(d \in \{3,5\}\) and the multiplicative weighting parameter of the WCS \(\gamma \in \{0.1,0.5,0.9\}\). We evaluate our results by inspecting the average quality of the subgroup set, the average size of the subgroups, the number of subgroups (out of \(q=10\)) that validation with the Distribution of False Discoveries (DFD) [5] cannot distinguish from false discoveries over \(m=50\), the average run time, and two measures of subgroup set redundancy: Joint Entropy (JE) [14] and median Jaccard similarity (JSIM) [20] (see [24, Section 2] for precise definitions). We use the pwlf Python library to fit our segmented linear regression models [10].

Second, based on our findings in the first experiment, we choose the most appropriate QM, value for d and value for \(\gamma \), and repeat the experiment with the full FUNA dataset (\(n=15 486\)) and \(q=20\). All these children have at least 5% of their answers correct in each descriptor task (NC, SA, SS, CA) and the children have at least one observed answer for every possible set size in the DE task. The maximum number of observed items in the DE task is 18 per child. Our experimental code, all results, and a slice of the FUNA dataset are available at https://github.com/RianneSchouten/FUNA_EMM.

Extra experiments on the Curran dataset. We perform an additional set of experiments on a fully public dataset and find subgroups of children with exceptional relations between age and reading skills. Since our quality measures generalize to linear regression problems other than segmented linear regression, we perform these extra experiments with polynomial regression. More information and a short discussion of the results can be found in [24, Section 3].

6.1 Results Experiment 1

Figure 3 presents the standardized, average quality of a subgroup set (\(q=10\)) for various values of d, \(\gamma \), and both QMs. In essence, the results are as expected: the quality increases when either the description length d or the weight parameter \(\gamma \) increases, and the impact of varying \(\gamma \) is larger for smaller d (see Fig. 3; the absolute difference between the smallest and largest quality for varying \(\gamma \) is larger for \(d=3\) than for \(d=5\)). Table 4 reports the other evaluation metrics: the average subgroup size decreases when either d or \(\gamma \) increases, and in general, the subgroup set redundancy is larger when d decreases or \(\gamma \) increases (lower JE, higher JSIM). Except for 2 subgroups for \(\varphi _{\text {ssrb}}\) when \(d=3\) and \(\gamma =0.1\), all discovered subgroups can be considered valid discoveries.

Fig. 3.

The relation between the average quality of a subgroup set (\(q=10\), standardized per QM), search depth d, and WCS parameter \(\gamma \), for both QMs.

Fig. 4.

Experimental results for both QMs, \(d \in \{3,5\}\), \(\gamma \in \{0.1,0.5,0.9\}\).

For \(\varphi _{\text {ssr}}\), given d, the average subgroup size, JE, and JSIM are comparable for varying values of \(\gamma \). It seems that there is barely any effect of the WCS. When \(d=5\), the average quality is lower for \(\gamma =0.9\) than for \(\gamma =0.5\), and when \(d=3\), the average quality is lower for \(\gamma =0.5\) than for \(\gamma =0.1\). These results are unexpected, since a decreasing \(\gamma \) is supposed to increase the variety of the subgroup set at the cost of average quality. Inspecting the individual descriptions and qualities, we find that for \(\varphi _{\text {ssr}}\) the variety in the subgroup set is larger when \(\gamma =0.9\) than when \(\gamma \in \{0.1,0.5\}\). Most likely, the reason is the use of a square when calculating the quality. Even when we use a strict WCS (small \(\gamma \)), the same subgroup recurs, since the weighted quality of the other subgroups does not beat the non-weighted quality of the recurring subgroup. When the WCS is very strict, important precursors may be removed at lower search levels and thus be unavailable for refinement at higher levels. As a consequence, a search with a strict WCS considers fewer candidate subgroups, which in the end creates a relatively redundant subgroup set. It is unfortunate that JE and JSIM do not fully reveal these conclusions.

With \(\varphi _{\text {ssrb}}\), the subgroup sets are less redundant than with \(\varphi _{\text {ssr}}\), especially for small values of \(\gamma \). Clearly, JSIM increases and JE decreases when \(\gamma \) increases. Subgroups found with \(d=5\) are slightly smaller than those found with \(d=3\).

6.2 Results Experiment 2

Table 3. Subgroup proportion, description and estimated target models for subgroups 1, 5, 6, 7, 10, 17 and 18, discovered with \(\varphi _{\text {ssrb}}\). The global target model is \(1407 + 88\ell _1 + 463(\ell _1-3.3)_{+}\)
Fig. 5.

Estimated segmented linear regression models of subgroups 1, 5, 6, 7, 10, 17 and 18 discovered with \(\varphi _{\text {ssrb}}\). Target model equations can be found in Table 3.

We perform the experiment on the entire dataset with \(\varphi _{\text {ssrb}}\), since this QM turns out to be stable and produces small and interesting subgroups. We choose \(\gamma =0.5\) to balance between high quality and low redundancy. We choose \(d=3\) since Table 4 shows that these results do not differ much from \(d=5\), and a description with fewer literals is easier to interpret for domain experts. Descriptions and target models of all top-20 exceptional subgroups can be found in [24, Section 2]; we report a smaller selection in Table 3 and Fig. 5.

Although we allow descriptions to have up to \(d=3\) literals, many high-quality subgroups are described by a single attribute. There is variety in the descriptors used (multiple aggregation functions, multiple tasks), in subgroup size, and in the target models. Compared to the segmented linear regression parameters of the global model, 15 out of 20 exceptional subgroups have a lower-than-average subitizing range; the other 5 have a higher subitizing range.

Subgroups 1 and 2 have very similar subitizing curves: children in these subgroups are particularly slow to subitize, and these groups are the only ones that have an intercept over 2 s. The subgroups contain children with slow NC response times (either expressed in terms of IES or mean response time) and both are slow to solve addition problems (based on SA and CA tasks). The groups are small, and probably most typical of dyscalculia, or at the very least groups with children who are likely to have maths learning difficulties. Indeed, the subgroup sizes 0.05 and 0.06 are in accordance with the dyscalculia prevalence estimate of 3–6% [25].

Subgroup 5 is a more general version of subgroup 1; it covers 50% of the children, and its description contains the first literal of the description of subgroup 1. The subitizing curve shows the same trend as that of subgroup 1, but less extreme: the subitizing range is smaller than in the global model, but not as small as in subgroup 1, and the intercept, subitizing slope, and counting slope are larger than in the global model, but not as large as in subgroup 1. Domain experts suspect that these children have maths learning difficulties as well.

Subgroup 6 is the inverse of subgroup 5. This is clear not only from the description in Table 3, but also from the regression model in Fig. 5: the subitizing range is higher, and the intercept and subitizing slope are lower than in the global model. Subgroups 13, 15, 18, and 19 are the other four subgroups with above-average subitizing ranges; characteristically, their subitizing intercepts (baseline response time or speed of processing) are 300–350 ms faster than the average and at least 500 ms faster than any other group in the table. They also have counting slopes that are 150–200 ms per dot shallower (faster) than most other groups.

Subgroups 18 and 19 have target models that are very similar to that of subgroup 6, even though the descriptions of these subgroups differ. Subgroup 6 is described in terms of NC-IES, while subgroup 18 is described based on performance on the arithmetic addition task (SA). A similar effect occurs for subgroups 5 and 10: the target models are similar while the descriptions use aggregated descriptors from different tasks. These findings suggest relations between number processing skills and arithmetic skills. They additionally show that it may be possible to obtain diagnostic information by focusing on fewer tasks; that is, domain experts may be able to deduce information about the performance on one task given the performance on another task. This is a promising result that provides opportunities for further development of assessment tools and intervention programs.

The only subgroup that does not use an aggregated descriptor is subgroup 17, which selects children in the third grade. Interestingly, the estimated target model of subgroup 17 is similar to those of subgroups 5 and 10: like the global model, except for a larger counting slope. Compared to the other children in the FUNA dataset, the children in subgroup 17 are younger and hence slower on all tasks, including the NC (subgroup 5) and SA (subgroup 10) tasks.

7 Discussion and Conclusion

The FUnctional Numerical Assessment (FUNA) project [21] develops digital assessment tools for detecting dyscalculia and dyslexia in young children by evaluating numerical processing competences such as the ability to enumerate small sets of dots and to compare the relative magnitudes between sets. These numerical processing competences are diagnostic markers of children's emerging math abilities [7]. In this paper, we particularly focus on the characteristics that define children's enumeration ability, such as the threshold up to which children can determine the correct number of dots at a glance, known as the subitizing range, and other parameters of subitizing patterns such as the initial reaction time and counting slope. Common algorithms used for estimating the subitizing range can produce inconsistent results [15], especially among individuals with dot enumeration curves that deviate from the typical curve.

Therefore, we develop an EMM model class for segmented linear regression to discover subgroups of children whose subitizing curves exhibit atypical patterns. It could be argued that choosing segmented linear regression as a model class is a drawback, since the observations are not independently distributed (a model is estimated on \(n^{SG}\) independent children, who each contribute measurements for several items, resulting in \(N^{SG}\) observations in total). Despite that, we follow this approach, since segmented linear regression fits the neuro-cognitive concept of subitizing very well. Furthermore, the assumption of independent observations is required for most of the other algorithms as well; segmented linear regression has the least baggage built into it.
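For concreteness, once the break point is fixed, the one-break-point model is linear in its coefficients and can be fitted by ordinary least squares; the break point itself can then be found by a grid search. A minimal sketch (a hypothetical helper, not the implementation used in the paper):

```python
import numpy as np

def fit_one_breakpoint(x, y, candidates):
    """Fit RT = b0 + b1*x + b2*(x - c)_+ by least squares, grid-searching
    the break point c over `candidates`; returns (SSE, c, coefficients)."""
    best = None
    for c in candidates:
        # Design matrix: intercept, set size, and hinge term (x - c)_+
        X = np.column_stack([np.ones_like(x), x, np.maximum(x - c, 0.0)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((y - X @ beta) ** 2))
        if best is None or sse < best[0]:
            best = (sse, c, beta)
    return best
```

Under a Gaussian error model, minimising the sum of squared errors is equivalent to maximising the log likelihood, which is what the quality measures in this paper are built on.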

Our findings confirm the belief that numerical processing competences strongly correlate with arithmetic skills. We find several exceptional subgroups that confirm existing knowledge, including subgroups that are considered typical of dyscalculia; these children have slow NC response times and are slow to solve addition problems. We find subgroups with similar subitizing patterns but different descriptions. These findings demonstrate the strong relation between subitizing, counting, and arithmetic ability, and additionally provide promising opportunities for further development of assessment tools and intervention programs that focus on fewer tasks or a reduced number of items per task: it may become possible to know the results on a particular task given a child’s performance on another task.

Both quality measures in this paper assume that the overall population and subgroups are best modelled with the canonical subitizing range model: a piecewise linear regression model with precisely one break point. However, it is entirely possible that coherent subgroups of children do not follow this regime: some groups may display no substantial break point, while the behavior of others might be best modelled by multiple break points. The piecewise linear regression model class for EMM can accommodate this sort of behavior, but it requires development of a new QM: log likelihoods will necessarily increase when more break points are available to the model, so some penalty for model complexity must be incorporated.
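One natural way to build such a penalty is an information criterion: score each candidate model by its maximised Gaussian log likelihood minus a complexity term that grows with the number of parameters (each extra break point adds a knot location and a slope change). A hypothetical sketch of a BIC-style penalised score (illustrative only; not a QM from the paper):

```python
import numpy as np

def gaussian_loglik(y, yhat):
    """Maximised (profile) Gaussian log likelihood of a regression fit."""
    n = len(y)
    sigma2 = np.sum((np.asarray(y) - np.asarray(yhat)) ** 2) / n
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def bic(y, yhat, n_params):
    """BIC = k*ln(n) - 2*logL; lower is better. Models with more break
    points gain log likelihood but pay a larger complexity penalty."""
    return n_params * np.log(len(y)) - 2 * gaussian_loglik(y, yhat)
```

Comparing a one-break-point fit against a two-break-point fit by such a score would favour the extra break point only when its likelihood gain exceeds the added penalty.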