‘Data mining’ and the older term ‘fishing’ are pejorative labels for illusory or distorted statistical inference from an empirical regression model, where the distortion results from explorations of various models in a single sample of data. This process usually involves adding or dropping variables, but may involve exploring a variety of alternative nonlinear functional forms or data subsamples. Data mining properly applies as a derogatory term only when exploratory results are used for inference within the sample used in exploration. But the term is sometimes used to refer to the exploratory process itself, since economists emphasize inference over data exploration and even adopt the language of inference when discussing exploratory activities. Some take data mining to be a more serious offence when there is conscious effort to manipulate, although data mining will distort results regardless of intent.

Importance and History

Some economists consider data mining to be pervasive in applied work. But the proportion who subscribe to this view is unclear, since those who do understandably retreat from applied work into economic or econometric theory. Leamer and Leonard (1983, p. 306) give voice to the view that collective data mining renders standard inference meaningless, and hence in general ‘statistical analyses are either greatly discounted or completely ignored’. This stance may have reached a peak in the late 1970s, fuelled by an explosion in the volume of regression studies. But contemporary suspicion is still quite common. Kennedy (2003, pp. 82–3) characterizes the ‘average economic regression’ as perpetrating some of the worst data mining practices.

The issue was known to the originators of econometrics. Ragnar Frisch (1934) advocated methods to deal with the data mining problem; these methods were applied into the 1950s, then neglected for two decades before being revived in modern form by Leamer (1983). Because Frisch found that differing but reasonable specifications could yield disparate results, he came to believe attempts at formal inference were illegitimate. Malinvaud (1966, chs. 1 and 2) provides a wonderful exposition of Frisch’s methods and of why Frisch’s stance was replaced by contemporary textbook assumptions. Even Haavelmo’s (1944, ch. 7, sect. 17) founding statement of the contemporary inferential approach discusses data mining.

Econometrics textbooks quite properly warn against data mining, yet it is difficult to avoid and is pervasive in published work. This places the new practitioner in a difficult position, so it is helpful to understand both the consequences of data mining and why it is so hard to avoid. Econometrics in the contemporary sense began when we decided that economic data could be treated as equivalent to sampling from an uncontrolled experiment (Haavelmo 1944), borrowing from R.A. Fisher’s methods for experimental data. The following illustration clarifies these issues.

An Illustration

Suppose two students of the economy live in parallel universes. Both are interested in a variable y, believing its most important determinant to be another variable x1, but also supposing that variables x2 and x3 may be relevant. Their initial data-sets are identical, and they propose to model y via a linear regression model. Both start out assuming that the errors of the model (ε) are independent and normally distributed with constant variance. Thus they propose the model y = b1x1 + b2x2 + b3x3 + ε, where the coefficients ‘bi’ are to be estimated.
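
For concreteness, the students’ starting point can be written as a minimal sketch in Python using numpy and statsmodels. The data-generating process, the sample size and all coefficient values below are invented purely for illustration; they are not part of the original example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50

# A hypothetical data-generating process; the students observe only the data.
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

# The proposed model y = b1*x1 + b2*x2 + b3*x3 + e, estimated by least squares.
fit = sm.OLS(y, np.column_stack([x1, x2, x3])).fit()
print(fit.params)     # estimates of b1, b2, b3
print(fit.tvalues)    # the t-statistics the students will later inspect
```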

The first student lives in a universe in which he can generate more data via experiments. The second student must wait passively for the passage of time before she can see more data; data generated by events she does not control. Thus, the first student is confident of his science, while the second student is in the actual universe of economics.

Now suppose that in their initial regression results the sign of the coefficient on x1 is the opposite of what they expected. As is standard practice, they take this to imply that they have omitted an important variable. After fiddling with their specifications they find that adding a variable x4 yields a more sensible coefficient estimate for x1. Suppose also they find that, for the coefficients on x2 and x3, the null hypothesis of a zero coefficient would not be rejected individually (leaving the other coefficient unrestricted, as in a t-test), but that the joint hypothesis (b2 = b3 = 0) would be rejected. They find the fit of the regression is penalized least by dropping the variable x3 and do so. They have used a process of specification search to arrive at a model for y as a function of x1, x2 and x4.
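
The mechanics of this search can be sketched as follows. Everything here is invented for illustration: x4 is constructed to be correlated with x1 and to affect y, so that omitting it tends to flip the sign of the estimate of b1; with other random draws the particular signs and test outcomes will differ.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x4 = 0.8 * x1 + rng.normal(size=n)            # omitted variable, correlated with x1
x2, x3 = rng.normal(size=(2, n))
y = 1.0 * x1 - 3.0 * x4 + 0.4 * x2 + 0.4 * x3 + rng.normal(size=n)

# Step 1: the initial model omits x4; omitted-variable bias can flip the sign on x1.
m0 = sm.OLS(y, np.column_stack([x1, x2, x3])).fit()
print("b1 in the initial model:", m0.params[0])

# Step 2: after fiddling, x4 is added and the estimate of b1 looks sensible again.
m1 = sm.OLS(y, np.column_stack([x1, x2, x3, x4])).fit()
print("b1 with x4 included:   ", m1.params[0])

# Step 3: individual t-statistics for b2 and b3 versus the joint test b2 = b3 = 0.
print("t-statistics for b2, b3:", m1.tvalues[1], m1.tvalues[2])
R = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
print("joint F-test of b2 = b3 = 0:", m1.f_test(R))

# Step 4: drop whichever of x2, x3 penalizes the fit least and keep that model.
drop_x3 = sm.OLS(y, np.column_stack([x1, x2, x4])).fit()
drop_x2 = sm.OLS(y, np.column_stack([x1, x3, x4])).fit()
print("R^2 after dropping x3:", drop_x3.rsquared, "after dropping x2:", drop_x2.rsquared)
```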

The first student takes the results to his professor. The professor commends the effort to learn from the world, but corrects the student on one point. He notes that, although the estimated confidence interval for the coefficient on x3 included zero, it also included (we will suppose) five, and if this coefficient is truly so far from zero then (given expected variation in x3) the variable x3 would have appreciable effects. So the professor tells him to run another experiment designed so that the resulting data-set is large enough (and so standard errors of coefficient estimates are small enough) to usefully distinguish between large and small values of b3. The student does so, and publishes the results with the statistics and standard critical values treated as valid ‘tests’. This is not data mining.
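
The professor’s point is simply that standard errors shrink roughly with the square root of the sample size, so a large enough experiment can separate b3 = 0 from b3 = 5. The sketch below illustrates this with invented numbers (a deliberately noisy data-generating process in which the true b3 is zero): in the small sample the interval for b3 is typically wide enough to include values as large as five, while in the large sample it is not.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
for n in (30, 3000):
    x1, x2, x3 = rng.normal(size=(3, n))
    y = 2.0 * x1 + 0.5 * x2 + 0.0 * x3 + 15.0 * rng.normal(size=n)   # true b3 is zero
    fit = sm.OLS(y, np.column_stack([x1, x2, x3])).fit()
    lo, hi = fit.conf_int()[2]                                        # 95% interval for b3
    print(f"n = {n}: interval for b3 is roughly [{lo:.2f}, {hi:.2f}]")
```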

Now the second student takes her results to her professor. This professor says the first regression result (employing x1, x2 and x3) can be treated as possibly generating test statistics drawn from standard distributions. However, in the final model (x1, x2 and x4) some of the t-statistics were created by design. Since one ‘fished’ or fiddled with the variables included in the model until the coefficient on x1 had the correct sign, the t-statistic was drawn from a distribution such that there was 100 per cent probability it would have the ‘correct’ sign. Likewise the student explored specifications until the t-statistic for the coefficient on x2 appeared to be significant. This implies that, in the final specification, the probability of the t-statistic for b2 falling within the interval bounded by the standard critical values (approximately plus or minus 2) must actually be zero, which is hardly a standard t-distribution. This process of modifying the model and re-estimating it using the same sample used to suggest those modifications will also affect in an unknown manner the distribution of other test statistics, even those that were not direct objects of exploration and design. These are data-mined results.
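
A small Monte Carlo experiment makes the professor’s point vivid. The sketch below is a deliberately simplified variant of the search described above: instead of fishing over control variables, the researcher tries ten candidate regressors (all truly irrelevant, by construction) and reports the one with the largest absolute t-statistic. The distorting mechanism is the same in both cases: reporting the best of several looks at a single sample. All numbers are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, n_reps, n_candidates = 100, 500, 10
reported_t = []

for _ in range(n_reps):
    y = rng.normal(size=n)
    Z = rng.normal(size=(n_candidates, n))     # candidate regressors, all irrelevant
    t_stats = [sm.OLS(y, z).fit().tvalues[0] for z in Z]
    reported_t.append(max(t_stats, key=abs))   # keep the 'best' specification

reported_t = np.array(reported_t)
# Under an honest single regression about 5% of the |t| values would exceed 1.96;
# after the search the share is roughly 1 - 0.95**10, around 0.4.
print("share of reported |t| > 1.96:", np.mean(np.abs(reported_t) > 1.96))
```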

Note that the two professors agree that something was potentially learned in the exploratory stage. Both students could use data exploration to reveal aspects of the first sample, but the results of exploration over this same sample could not then provide a formal test. As in any legitimate science, the first professor views taking inspiration from observation as a process separate from confirmation or testing. The second student also hopes to have learned something from the sample, but her professor objects to treating the statistics resulting from this exploration as providing a test. The second student treated each regression as though it were a separate experiment, but regressions and their associated statistics are mere calculations that organize the data. Also note that, when these students took the initial estimate of b1 as having the ‘wrong sign’, they were applying strong prior beliefs which led them to place little weight upon this empirical result. Bayesian inference provides a formal treatment of such priors.

The second student continues the consultation with her professor. The professor says these first results are not publishable because economists are interested in inference, and all she has shown is that the first model did not make sense. The professor may advise that she should first have chosen a successful regression model from the empirical literature, modifying it only slightly if at all. If the student is alert, she will notice the data available to her is identical to that in the literature, except for a few more recent observations.

So this alert student will go back to her professor and tell him she already knows the regression results will be the same as those already published, except to the extent that the new observations have some effect when averaged with the old. The test statistics will not have the usual distributions; instead, the distributions are a function of the previous results and of the proportion of new observations relative to those used in the previously published results. The student has discovered that, to the extent data-sets overlap, taking guidance from the regressions of other researchers is collective data mining, even if one runs only one regression oneself. Thus collective data mining is pervasive, and the meaning of published test statistics is unclear. Only if each data-set is entirely distinct can one learn from the work of others while preserving the known distributions of the test statistics.

Contemporary Practice and Remedies

Three partial remedies for data mining are practised in the current literature. One is to insist upon seeing all the possible regression results a reasonable researcher might propose, supplementing imperfect ‘tests’ with a range of results. This is most associated with Leamer (1983), but we have already mentioned the earlier work of Frisch. Current practice is moving towards this approach, more often presenting multiple specifications.
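
The spirit of this first remedy can be illustrated with a simplified, extreme-bounds-style exercise: rather than a single ‘best’ specification, report the whole range of estimates obtained as doubtful control variables are included or excluded. The data, the three doubtful controls z1–z3 and all coefficients below are invented for illustration.

```python
import itertools
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
z = rng.normal(size=(3, n)) + 0.3 * x1         # doubtful controls, correlated with x1
y = 1.0 * x1 + 0.5 * z[0] + rng.normal(size=n)

# Fit every specification that includes x1 plus some subset of the doubtful controls.
estimates = []
for k in range(4):
    for subset in itertools.combinations(range(3), k):
        X = np.column_stack([x1] + [z[j] for j in subset])
        estimates.append(sm.OLS(y, X).fit().params[0])

print("range of b1 across specifications:", min(estimates), "to", max(estimates))
```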

A second remedy is inspired by noting that it is possible to calculate probabilities for statistics resulting from a specification search, provided the process begins with a model whose set of variables is large enough that the true model can reasonably be assumed to be nested within it, and respecification only deletes variables rather than adding them. An example is the general-to-specific approach. This approach is now common when specifying the lag lengths of time-series models, but in other contexts it is controversial. The statistical consequences of such an approach fall under the heading of ‘pretest’ estimators discussed in most econometrics textbooks, but the best introductory discussion is found in Campos et al. (2005, Introduction, sects. 3.3–3.4). Interestingly, Hoover and Perez (1999) show that, even when pretest distributions are not accounted for, this second remedy leads to an acceptable level of distortion.
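
A minimal sketch of a general-to-specific search follows, under the stated assumptions: the general model is large enough to nest the truth, and respecification only deletes variables. It is a bare-bones illustration with invented data, not the full methodology (no diagnostic or encompassing tests are run), and the 5 per cent deletion threshold is simply an assumed convention.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"x{i}" for i in range(1, 7)])
y = 2.0 * X["x1"] + 1.0 * X["x2"] + rng.normal(size=n)   # truth uses x1 and x2 only

# Start from the general model and repeatedly delete the least significant variable.
keep = list(X.columns)
while True:
    fit = sm.OLS(y, X[keep]).fit()
    worst = fit.pvalues.idxmax()              # least significant remaining variable
    if fit.pvalues[worst] < 0.05 or len(keep) == 1:
        break
    keep.remove(worst)                        # delete it and re-estimate

print("retained variables:", keep)
```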

A third remedy reserves some of the available data for ‘out-of-sample’ tests. Here one engages in specification search in one portion of the data and then tests in the reserved portion. We place ‘out-of-sample’ in quotes because this is not confirmation in a new sample. This response cannot avoid collective data mining because it is likely that among many projects the more satisfactory reserved-sample results will be selected for publication, if not by individual authors then through the collective filter of journal referees. But this remedy is useful to the individual researcher.
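 
The third remedy can be sketched as follows: explore on one portion of the data, then treat the reserved portion as the ‘out-of-sample’ check. The split point, the candidate specifications and the data below are all invented; a fuller treatment in this tradition would also use forecast or parameter-constancy tests rather than the simple re-estimation shown here.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 300
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.5 * x1 + 0.8 * x2 + rng.normal(size=n)

split = 200                                   # first 200 observations for exploration
explore, reserve = slice(0, split), slice(split, n)

# Exploration: pick the candidate specification that fits best in the first portion.
candidates = {"x1 only": [x1], "x1 and x2": [x1, x2], "x1, x2 and x3": [x1, x2, x3]}
fits = {name: sm.OLS(y[explore], np.column_stack([v[explore] for v in cols])).fit()
        for name, cols in candidates.items()}
best = max(fits, key=lambda name: fits[name].rsquared_adj)

# 'Test': re-estimate the chosen specification on the reserved observations only.
cols = candidates[best]
test_fit = sm.OLS(y[reserve], np.column_stack([v[reserve] for v in cols])).fit()
print("chosen in exploration:", best)
print("t-statistics in the reserved portion:", test_fit.tvalues)
```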

The first two remedies focus on data exploration, and only the third adds the key scientific step of confirmation in separate data. Followers of the second remedy, such as David Hendry and others of the ‘London School’, are often accused of data mining. Yet they have been the strongest proponents and practitioners of the third remedy, which provides the legitimate test in separate data, even inventing new out-of-sample tests such as tests for forecast encompassing. A good introduction to the second and third remedies is found in Charemza and Deadman (1997).

As noted in our discussion of the third remedy, universal adoption of these remedies cannot avoid collective data mining. Collective data mining would be avoided if, upon accepting a paper, the journal offered an explicit or implicit contract to accept a follow-up study. Formal and precise testing would be performed in the subsequent study, employing only data not available for the initial paper. This is yet to be practised by any journal, so the methodological issues remain troublesome, leaving room for vague and inconsistent norms across referees and journals. New practitioners must develop their own approaches to navigating these norms and practices, while deciding how to preserve their own sense of integrity.

See Also