1 Introduction

Understanding the spread of any disease is a highly complex and interdisciplinary exercise, as biological, social, geographic, economic, and medical factors may shape the way a disease moves through a population and the options for its eventual control or eradication (Moustakas and Evans 2016a; Oleś et al. 2012). Disease spread poses a serious threat to animal and plant health and has implications for ecosystem functioning and species extinctions (Fisher et al. 2012), as well as implications for society through food security and the potential spread of disease to humans (Graham et al. 2008; Tomley and Shirley 2009).

Space–time epidemiology (Knox and Bartlett 1964) is based on the concept that various characteristics of pathogenic agents and the environment interact to alter the probability of disease occurrence and form temporal or spatial patterns (Snow 1855; Ward and Carpenter 2000). Epidemiology aims to identify these patterns and factors, to assess the relevant sources of uncertainty, and to describe disease in the population. The study of disease spread at the population level thus differs from the approach traditionally taken by veterinary practitioners, who are principally concerned with the health status of the individual (Arah 2009). Patterns of disease occurrence (Markatou and Ball 2014) provide insights into which factors may be affecting the health of the population, by investigating which individuals are affected, where these individuals are located, and when they became infected.

2 Technological advancements

With the rapid development of smart sensors (Aanensen et al. 2009) and social networks, as well as digital maps and remotely-sensed imagery, spatio-temporal data are more ubiquitous and richer than ever before (Gange and Golub 2016), and epidemiology in the big data era needs to integrate novel methods (Mooney et al. 2015; Pfeiffer and Stevens 2015). The availability of such large datasets (big data) poses great challenges in data analysis (Fan et al. 2014; Najafabadi et al. 2015). In addition, the increased availability of computing power facilitates the use of computationally-intensive methods for the analysis of such data (Moustakas and Evans 2015). Data mining, a set of methods combining statistics and computer science, is increasingly employed (Lynch and Moore 2016) and may provide novel insights into epidemiological problems (McCormick et al. 2014; Nelson et al. 2014).

3 Let the data speak?

Can big data replace theory? It has been suggested that the availability of a large volume of data, the data deluge, will make the scientific method obsolete (Anderson 2008): hypothesis-driven or equation-driven research will become irrelevant and data mining will be used instead (Anderson 2008). This thesis has generated a large scientific discussion; for examples across scientific disciplines see Benson (2016), Chiolero (2013), Levallois et al. (2013) and Toh and Platt (2013), and for online discussions see: https://www.edge.org/discourse/the_end_of_theory.html. Adding to the discussion, it has been suggested that experts will decline in importance in the big data sector (Mayer-Schönberger and Cukier 2013). There are cases where model-free forecasting (using machine learning methods) outperforms the correct mechanistic model for simulated and experimental data (Perretti et al. 2013). However, if one relies solely on data-driven science, several components of the scientific method will be impoverished: thought experiments (McAllister 1996), stochastic reasoning (Christakos 2010; Pearl 1987), and theoretically-derived predictions that may open a new field and be proposed as testable hypotheses (Gorelick 2011); something feasible in the mathematical universe is something that may happen in the biological/physical universe, regardless of how likely it is to happen. A classic example derives from Einstein's general relativity theory. The theory was motivated by the observed difference between Newtonian theory and observation in Mercury's precession, i.e. the deviance between a model and observation. At the time it was developed the theory lacked supporting data, but it was later verified by data. A data-driven science is welcome, but we cannot afford to lose well-established scientific methods that have been tested through time.

4 Are more data always better?

While the answer may seem an obvious yes, and the only challenge how to handle, visualise, and analyse large datasets, this is not always the case. Big datasets bring many spurious correlations: apparent relationships between things that are in fact just random noise (Silver 2012). In addition, big datasets allow easier 'cherry-picking': people can choose which fractions of the data to use in order to support something they already believe, or simply to produce a novel result, when a larger dataset might have falsified the reported result (Silver 2012) or merely verified something already known (Donoho and Jin 2015) and thus not merited a groundbreaking result or publication (Silver 2012). In addition, factor analysis of time series in econometrics has shown that collating several datasets together may generate cross-correlated idiosyncratic errors, or a factor that is dominant in a smaller dataset may be dominated in a larger one (Boivin and Ng 2006). In such cases smaller datasets have yielded results at least as satisfactory as, or even better than, larger datasets (Boivin and Ng 2006; Caggiano et al. 2011). Methods accounting for the effects of cross-correlated errors have been proposed (Blair and Bar-Shalom 1996). While these examples are mentioned in order to highlight problematic issues related to big data, more often than not more data are certainly more desirable than fewer.
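
As an illustration of how spurious correlations arise, the following minimal Python sketch (the sample sizes and the correlation threshold are illustrative assumptions, not values from the studies cited above) screens many pure-noise predictors against a pure-noise outcome and typically finds several that appear 'correlated' by chance alone.

```python
import numpy as np

rng = np.random.default_rng(42)

n_obs = 100           # observations per variable
n_predictors = 1000   # candidate predictors, all pure noise

outcome = rng.normal(size=n_obs)
predictors = rng.normal(size=(n_predictors, n_obs))

# Pearson correlation of each noise predictor with the noise outcome
correlations = np.array([np.corrcoef(p, outcome)[0, 1] for p in predictors])

# With n = 100 observations, roughly 4-5% of pure-noise predictors
# exceed |r| = 0.2 by chance alone
spurious = np.sum(np.abs(correlations) > 0.2)
print(f"{spurious} of {n_predictors} noise predictors exceed |r| = 0.2")
```

Screening more predictors, a routine situation with big data, inflates the number of such chance findings, which is why out-of-sample validation or multiple-testing corrections are needed.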

5 Data availability and model complexity

A study in climate modelling has shown that as models become increasingly complex and realistic, they also become less accurate because of cumulative uncertainties (Maslin and Austin 2012). In the case of climate modelling, earlier models did not account for many important factors that are now being included (Maslin and Austin 2012). The simplicity of those models also prevented the uncertainties associated with these factors from being included in the modelling; the uncertainty remained hidden. More complex models that include more factors are also associated with higher uncertainties (Maslin and Austin 2012). There is thus a paradox: as models become more complex and more realistic (matching the real world better), they also become more uncertain. Ecological systems are quite complex, with many small tapering effects, large heterogeneity, and interactions that are generally unknown. From an information-theoretic perspective, 'information' about the biological system under study exists in the data, and the goal is to express this information in a compact way (Evans et al. 2014; Lonergan 2014); the more data available, the more information exists, i.e. a more complicated statistical model may approximate the data (Burnham and Anderson 2002) and more complex predictive models (process-based models such as individual-based models) may be calibrated (Evans and Moustakas 2016).
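
The trade-off between model complexity and the data available to support it can be sketched with a generic information-theoretic example; the polynomial models, sample size, and noise level below are illustrative assumptions, with AIC (Burnham and Anderson 2002) used to balance fit against the number of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def polynomial_aic(x, y, degree):
    """Fit a polynomial of the given degree by least squares and return its AIC."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    n = len(y)
    k = degree + 2                      # degree + 1 coefficients plus the error variance
    sigma2 = np.mean(residuals ** 2)
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * k - 2 * log_lik

# Underlying process: a quadratic signal plus noise
x = np.linspace(0, 10, 200)
y = 1.0 + 0.5 * x - 0.08 * x ** 2 + rng.normal(scale=0.5, size=x.size)

for degree in range(1, 7):
    print(degree, round(polynomial_aic(x, y, degree), 1))
# With few observations the parameter penalty favours simple models;
# larger datasets can support (i.e. justify) more complex ones.
```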

6 The importance of public data

While several new technologies providing large volumes of data exist (mentioned earlier in this paper), publicly available data from governmental organizations, as well as data shared among scientists through public repositories (Michener 2015), are easier to obtain than ever thanks to large computer storage capacities and fast network connections for downloading them. These public data promote transparency and accountability in the analysis, allow datasets to be expanded by merging several datasets together, and build up the impact of the work (Kenall et al. 2014; Piwowar and Vision 2013). In order to predict and mitigate disease spread, informed decisions are needed. Decisions often involve conflicts between several stakeholders (Krebs et al. 1998; Moustakas 2016). These decisions need to be based on data analysis and on predictive models calibrated with data. Making data publicly available will greatly facilitate their analysis and lead to informed decisions. For a review of publicly available veterinary epidemiological data with links to web sources see Pfeiffer and Stevens (2015).

7 Spatio-temporal data mining in veterinary and ecological epidemiology

There is thus a need for new methods as well as case studies to enhance our understanding of spatio-temporal data mining in veterinary and ecological epidemiology. A special issue in the journal Stochastic Environmental Research and Risk Assessment aimed to address this topic. Potential thematics included: spatiotemporal statistics (Biggeri et al. 2016; Picado et al. 2007), stochastic analysis (Heesterbeek 2000; Marx et al. 2015), Bayesian maximum entropy modelling (Biggeri et al. 2006; Juan et al. 2016), big data analytics (Andreu-Perez et al. 2015; Guernier et al. 2016), GIS and remote sensing (Ferrè et al. 2016; Norman et al. 2012), trajectories and GPS tracking (Demšar et al. 2015; Zhang et al. 2011), agent-based modelling calibrated with data (Dion et al. 2011; Moustakas and Evans 2015; Smith et al. 2016), decision making and risk assessment (Fei et al. 2016; Lowe et al. 2015), network and connectivity analysis (Nobert et al. 2016; Ortiz-Pelaez et al. 2006), and co-occurrence and moving objects (Miller 2012; Webb 2005). Nine contributions were finally accepted after peer review.

Bayesian analysis of spatial data often uses a conditionally autoregressive (CAR) prior, which expresses the spatial dependence commonly present in underlying risks or rates. These priors assume a normal density and uniform local smoothing of underlying risks, assumptions often violated by the heteroscedasticity or spatial outliers encountered in epidemiological data. Congdon (2016) proposes a spatial prior representing spatial heteroscedasticity within a model accommodating both spatial and non-spatial variation. The method is applied both to a simulation example based on US states and to a real data application considering tuberculosis incidence in England (Congdon 2016). The code used for generating the simulations is also provided in R (R Development Core Team 2016).
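
For readers unfamiliar with conditionally autoregressive priors, the sketch below builds a standard proper CAR precision matrix for a toy lattice and draws one sample from the prior. It is a generic illustration only, not the heteroscedastic prior proposed by Congdon (2016); the R code accompanying that paper implements the actual method.

```python
import numpy as np

rng = np.random.default_rng(1)

def car_sample(W, tau=1.0, rho=0.95):
    """Draw one sample from a proper CAR prior with adjacency matrix W.

    Precision matrix: Q = tau * (D - rho * W), where D is the diagonal
    matrix of neighbour counts.
    """
    D = np.diag(W.sum(axis=1))
    Q = tau * (D - rho * W)
    cov = np.linalg.inv(Q)              # acceptable for small toy examples
    L = np.linalg.cholesky(cov)
    return L @ rng.normal(size=W.shape[0])

# Toy example: 5 areas in a row, each adjacent to its immediate neighbours
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

print(car_sample(W))
```

Neighbouring areas share information through W, which produces the uniform local smoothing that Congdon (2016) relaxes for heteroscedastic risks.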

An understanding of the factors that affect the spread of endemic bovine tuberculosis is critical for the control of the disease. Analyses of such data need to account for spatial heterogeneity, otherwise spatial autocorrelation may inflate the significance of explanatory covariates. Brunton et al. (2016) used three methods, least-squares linear regression with a spatial autocorrelation term, geographically weighted regression, and boosted regression tree analysis, to identify the factors that influence the spread of endemic bovine tuberculosis at a local level in England and Wales. The methods identified factors related to flooding, disease history, and the presence of multiple genotypes of bovine tuberculosis, and these factors were consistent across two of the three methods (Brunton et al. 2016).
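
One of the three methods compared, geographically weighted regression, amounts to fitting a separate weighted least-squares regression at each location, with weights decaying with distance from that location. The sketch below shows the core calculation on synthetic data; the Gaussian kernel, bandwidth, and covariate are illustrative assumptions, not the specification used by Brunton et al. (2016).

```python
import numpy as np

rng = np.random.default_rng(2)

def gwr_coefficients(coords, x, y, target, bandwidth):
    """Local intercept and slope at one target location.

    Gaussian kernel weights: w_i = exp(-d_i^2 / (2 * bandwidth^2)).
    """
    d = np.linalg.norm(coords - target, axis=1)
    w = np.exp(-d ** 2 / (2 * bandwidth ** 2))
    X = np.column_stack([np.ones(len(y)), x])       # design matrix with intercept
    WX = X * w[:, None]
    beta, *_ = np.linalg.lstsq(WX.T @ X, WX.T @ y, rcond=None)
    return beta

# Synthetic data: the covariate effect drifts from west to east
coords = rng.uniform(0, 100, size=(300, 2))
x = rng.normal(size=300)
slope = 0.5 + 0.02 * coords[:, 0]                   # spatially varying effect
y = 1.0 + slope * x + rng.normal(scale=0.3, size=300)

print(gwr_coefficients(coords, x, y, target=np.array([10.0, 50.0]), bandwidth=15.0))
print(gwr_coefficients(coords, x, y, target=np.array([90.0, 50.0]), bandwidth=15.0))
# The fitted local slope is larger in the east, recovering the spatial heterogeneity.
```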

Early warning indicators are particularly useful for the monitoring and control of any disease. Malesios et al. (2016) provide an early warning method for sheep pox epidemics, applied to data from the Evros region, Greece. To provide inference on the mechanisms governing the progress of the sheep pox epidemic, Malesios et al. (2016) follow a two-stage procedure. At the first stage, a stochastic regression model is fitted to the complete epidemic data. The second stage uses an analogy between the fitted model and branching processes to obtain a system for estimating the probability of the epidemic going extinct at each of several time points during the epidemic. The end result is an evidence-based early warning system that could inform the authorities about the potential spread of the disease in real time.
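
The branching-process step can be illustrated generically: given an offspring distribution (the number of new cases each infected unit generates), the extinction probability is the smallest fixed point of its probability generating function. The sketch below assumes a Poisson offspring distribution with mean R; it is a textbook calculation, not the specific estimating system of Malesios et al. (2016).

```python
import numpy as np

def extinction_probability(R, tol=1e-12, max_iter=10_000):
    """Extinction probability of a branching process with Poisson(R) offspring.

    q solves q = G(q), where G(q) = exp(R * (q - 1)) is the probability
    generating function; iterating from q = 0 converges to the smallest root.
    """
    q = 0.0
    for _ in range(max_iter):
        q_new = np.exp(R * (q - 1.0))
        if abs(q_new - q) < tol:
            break
        q = q_new
    return q

for R in (0.8, 1.0, 1.5, 2.5):
    print(R, round(extinction_probability(R), 4))
# R <= 1 gives extinction probability 1; the larger R becomes, the less
# likely the epidemic is to die out on its own.
```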

Japanese encephalitis is a vector-borne disease transmitted by mosquitoes and maintained in birds and pigs. To examine the potential epidemiology of the disease in the USA, Riad et al. (2016) use an individual-level network model that explicitly considers the feral pig population and implicitly considers mosquitoes and birds in specific areas of Florida and Carolina. To model virus transmission among feral pigs, two network topologies are considered: fully connected, and random with a defined connection probability. Patterns of simulated outbreaks support the use of the random network, as it produces peak incidence similar to that of the closely related West Nile virus, another virus in the Japanese encephalitis group (Riad et al. 2016). The simulation analysis suggested two important mitigation strategies.
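
To make the contrast between the two topologies concrete, the sketch below runs a simple stochastic SIR process on either a fully connected network or an Erdős–Rényi random network. Population size, edge probability, and transmission/recovery probabilities are illustrative assumptions and do not come from Riad et al. (2016).

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_sir(adjacency, p_transmit=0.05, p_recover=0.2, steps=100):
    """Discrete-time stochastic SIR on a contact network; returns new cases per step."""
    n = adjacency.shape[0]
    state = np.zeros(n, dtype=int)              # 0 = S, 1 = I, 2 = R
    state[rng.integers(n)] = 1                  # one initial case
    incidence = []
    for _ in range(steps):
        infected = np.where(state == 1)[0]
        new_cases = set()
        for i in infected:
            for j in np.where(adjacency[i] == 1)[0]:
                if state[j] == 0 and rng.random() < p_transmit:
                    new_cases.add(j)
        state[infected[rng.random(len(infected)) < p_recover]] = 2   # recoveries
        state[list(new_cases)] = 1
        incidence.append(len(new_cases))
    return incidence

n = 200
fully_connected = np.ones((n, n), dtype=int) - np.eye(n, dtype=int)
random_net = np.triu((rng.random((n, n)) < 0.05).astype(int), 1)
random_net = random_net + random_net.T          # symmetric, no self-loops

print("peak incidence, fully connected:", max(simulate_sir(fully_connected)))
print("peak incidence, random network:", max(simulate_sir(random_net)))
```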

Disease outbreaks are often followed by a large volume of data, usually in the form of movements, locations, and tests. These data are a valuable resource from which data analysts and epidemiologists can reconstruct transmission pathways and parameters and thus devise control strategies. However, the spatiotemporal data gathered can be vast yet at the same time incomplete or error-prone. Enright and O’Hare (2016) provide a user-friendly introduction to the techniques used in dealing with the large datasets that exist in epidemiological and ecological science and the common pitfalls to be avoided, as well as an introduction to Bayesian inference techniques for estimating parameter values of mathematical models from spatiotemporal datasets. The analysis is showcased with a large dataset from Scotland, and the code and data used in the paper are also provided (Enright and O’Hare 2016).
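
As a minimal example of the kind of Bayesian parameter estimation introduced in that tutorial (the actual techniques, code, and Scottish dataset are in Enright and O’Hare (2016)), the sketch below uses a random-walk Metropolis sampler to estimate a transmission rate from synthetic daily case counts assumed to be Poisson-distributed; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic 'observed' data: daily case counts with a true rate of 0.3
true_beta = 0.3
exposure = rng.uniform(50, 150, size=30)        # e.g. daily susceptible-infectious contacts
cases = rng.poisson(true_beta * exposure)

def log_posterior(beta):
    if beta <= 0:
        return -np.inf
    log_prior = -beta                           # Exponential(1) prior on beta
    log_lik = np.sum(cases * np.log(beta * exposure) - beta * exposure)
    return log_prior + log_lik

# Random-walk Metropolis sampler
beta, log_p, samples = 0.1, log_posterior(0.1), []
for _ in range(20_000):
    proposal = beta + rng.normal(scale=0.02)
    log_p_new = log_posterior(proposal)
    if np.log(rng.random()) < log_p_new - log_p:
        beta, log_p = proposal, log_p_new
    samples.append(beta)

posterior = np.array(samples[5_000:])           # discard burn-in
print("posterior mean:", posterior.mean())
print("95% credible interval:", np.quantile(posterior, [0.025, 0.975]))
```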

Mechanistic epidemiological modelling has a role in predicting the spatial and temporal spread of emerging disease outbreaks and in the purposeful application of control treatments in animal populations. Lange and Thulke (2016) address the newly emerging epidemic of African swine fever spreading in Eurasian wild boar using an existing spatio-temporally explicit individual-based model of wild boar. They propose a mechanistic quantitative procedure to optimise the calibration of several uncertain parameters based on the spatio-temporal output of the simulation model and the spatio-temporal data of infectious disease notifications. The best agreement with the spatio-temporal spreading pattern was achieved by a parameterisation suggesting ubiquitous accessibility of carcasses but only a marginal chance of their being contacted by conspecifics, e.g. avoidance behaviour. The parameter estimation procedure is fully general and applicable to problems where spatio-temporally explicit data recording and spatially explicit dynamic modelling are performed.
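
The calibration idea, matching simulation output to observed notification patterns, can be sketched generically as rejection-based approximate Bayesian computation. The toy simulator, distance function, and acceptance rule below are illustrative assumptions and do not reproduce the wild boar model or the specific procedure of Lange and Thulke (2016).

```python
import numpy as np

rng = np.random.default_rng(5)

def toy_simulator(carcass_access, contact_prob, steps=52, n_pop=1000):
    """Stand-in for a spatially explicit model: returns weekly case counts."""
    s, i = n_pop - 5, 5.0
    cases = []
    for _ in range(steps):
        new = min(rng.poisson(2.0 * carcass_access * contact_prob * i * s / n_pop), s)
        s -= new
        i = 0.7 * i + new                       # 30% of infectious units removed per week
        cases.append(new)
    return np.array(cases)

def distance(sim, obs):
    return np.sqrt(np.mean((sim - obs) ** 2))   # root-mean-square error between curves

# 'Observed' notification data generated with known parameters
observed = toy_simulator(carcass_access=0.9, contact_prob=0.5)

# ABC rejection: draw parameters from their priors, keep the closest simulations
n_draws = 5_000
draws = np.column_stack([rng.uniform(0, 1, n_draws), rng.uniform(0, 1, n_draws)])
distances = np.array([distance(toy_simulator(a, c), observed) for a, c in draws])
kept = draws[np.argsort(distances)[: n_draws // 100]]   # closest 1% of draws

print("approximate posterior means (access, contact):", kept.mean(axis=0))
```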

In the last two decades, two important avian influenza viruses infecting humans emerged in China: the highly pathogenic avian influenza H5N1 virus and the low pathogenic avian influenza H7N9 virus. China is home to the largest population of chickens and ducks, with a significant part of poultry sold through live-poultry markets, potentially contributing to the spread of avian influenza viruses. Artois et al. (2016) compiled and reprocessed a new set of poultry census data and used these to analyse H5N1 and H7N9 distributions with boosted regression tree models. Artois et al. (2016) found a positive and previously unreported association between H5N1 outbreaks and the density of live-poultry markets.
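
Boosted regression trees are available in standard libraries; as a hedged workflow sketch (not the actual analysis or covariates of Artois et al. (2016)), the example below fits scikit-learn's GradientBoostingClassifier to synthetic outbreak presence/absence data with hypothetical covariate names.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)

# Synthetic covariates for 2000 grid cells (names are hypothetical)
n = 2000
poultry_density = rng.gamma(shape=2.0, scale=50.0, size=n)
market_density = rng.gamma(shape=1.5, scale=2.0, size=n)
human_density = rng.gamma(shape=2.0, scale=100.0, size=n)
X = np.column_stack([poultry_density, market_density, human_density])

# Synthetic outbreak presence/absence, more likely where markets and poultry are dense
logit = -4.0 + 0.01 * poultry_density + 0.4 * market_density
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
brt = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, max_depth=3)
brt.fit(X_train, y_train)

print("test accuracy:", brt.score(X_test, y_test))
print("relative influence of the three covariates:", brt.feature_importances_)
```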

Transmitted infectious diseases, aggregate regional chronic diseases, and seasonal or transitory acute diseases can cause extensive morbidity, mortality, and economic burden. Since the space–time distribution of a disease attribute is generally characterized by considerable uncertainty, the attribute distribution can be mathematically represented as a spatiotemporal random field model. Christakos et al. (2016) present a random field model of a disease attribute that transfers the study of the attribute distribution from the original spatiotemporal domain onto a lower-dimensionality travelling domain that moves along the direction of disease velocity. The partial differential equations connecting the disease attribute covariances in the original and the travelling domain are derived, with coefficients that are functions of the disease velocity. The theoretical model is illustrated, and additional insight is gained, by means of a numerical mortality simulation study, which shows that the proposed model is at least as accurate as, but computationally more efficient than, mainstream mapping techniques of higher dimensionality (Christakos et al. 2016).
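
The core idea of the travelling domain can be written down in simplified form. Assuming, for illustration only, a constant disease velocity v (the model of Christakos et al. (2016) treats the general case via partial differential equations for the covariances), the attribute is re-expressed in a coordinate frame moving with the disease:

```latex
% Simplified sketch, constant velocity v assumed for illustration
\tilde{X}(u) = X(s,t), \qquad u = s - v\,t ,
\qquad
c_{\tilde{X}}(\Delta u) = \operatorname{Cov}\!\bigl[\tilde{X}(u),\, \tilde{X}(u+\Delta u)\bigr].
```

In this simplified picture the covariance depends on the single travelling-domain lag Δu rather than on separate spatial and temporal lags, which is the kind of dimensionality reduction that makes the approach computationally efficient.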

Moustakas and Evans (2016b) use a very large dataset generated by a calibrated agent-based model to perform network, spatial, and temporal analyses of bovine tuberculosis in cattle on farms and in badgers. Infected network connectedness was lower in badgers than in cattle. The contribution of an infected individual to the mean distance of disease spread over time was considerably lower for badgers than for cattle. The majority of badger-induced infections occurred when individual badgers left their home sett, and this was positively correlated with badger population growth rates. The spatial aggregation pattern of the disease in cattle and badgers differs across scales: in badgers the disease occurs in clusters, whereas in cattle it is much more random and dispersed. There is little geographical overlap between farms with infected cattle and setts with infected badgers, and cycles of infections between the two species are not synchronised. The findings reflect the movements of the animals: cattle, for example, move greater distances within their grounds or can be sold to farms further afield, whereas badgers are social animals that live in groups and rarely leave their homes, meaning that the presence of TB is more clustered (Moustakas and Evans 2016b). The research suggests that an efficient way to vaccinate badgers might be to follow the spatial pattern of TB infections. This targeted approach would save labour and costs in controlling the spread of the disease.