1 Introduction

Environmental change, including climate change and biodiversity loss, is a determining factor in the emergence of diseases originating from wildlife [14]. It can also be a source of the selective forces behind new genetic variations that allow pathogens to cross biological barriers and that increase the potential for diseases to spread to humans. Although not considered appropriately in health surveillance policies, the situation is relevant, since the majority (60.3%) of infectious diseases circulate between humans and animals (zoonoses), of which 71.8% are caused by pathogens originating from wildlife [22]. Moreover, a recent study [52] shows that the number of pathogens infecting humans and animals is vast and, more worryingly, growing over time.

These emergences are widely associated with the areas most affected by natural and anthropogenic impacts. They also add to the set of factors that make social inequalities even more severe and unfair, with substantial repercussions and costs to health and quality of life [4, 43]. Over the past decade, several studies have shown that biodiversity can affect both the dilution and dispersion of pathogens, as well as modulate their transmission rate [23, 38, 54].

However, studies and actions in the last century, despite the expansion of epidemiological knowledge, mostly responded to specific disease emergence events in the human population, with some mitigation attempts. Considering the low ability to reverse climate change and the environmental impacts determined by human population growth and by the rate of production and consumption of natural resources, it seems reasonable to expect that the emergence of these diseases cannot be held back. This scenario is paradoxical in megadiverse countries, such as Brazil. While species richness results in a richness of the parasites associated with those species, and therefore in a potential risk, it is this complexity of species and their relationships that protects and stabilizes the dynamics of transmission, reducing disease outbreaks, which is one of the essential ecosystem services [2, 8, 24, 26, 32, 44, 50]. In this scenario, more than seeking effective responses to crises, there is reason to pursue actions that anticipate problems, so that they can be mitigated where possible and responded to quickly when prevention or mitigation fails.

This approach has been strengthened by international programs, such as “One world, one health” from the WHO/OIE and the 2011-2020 Strategic Plan of the Convention on Biological Diversity (CBD) [10], and, strategically, by governmental programs of developed countries. These countries already dedicate considerable resources and efforts to tracking pathogens, whether to prevent pandemics, such as the recent occurrences with influenza and Ebola viruses, to develop new drugs, or even because of biological warfare concerns. There are programs and systems for the surveillance of zoonoses in wild animals that act essentially to identify new and old diseases, especially those of economic and conservationist interest, within the One Health approach [12]. Most of these programs fall into one of three types: those structured and maintained by governmental services, with professional personnel, collection protocols, and standardized diagnostic capacity (e.g., US Wildlife Disease Surveillance and Emergency Response); those implemented by groups with scientific or conservation interest in one or a few species (e.g., World Conservation Society Health Program); and those based on the participation of trained farmers and hunters, who are the first to come into contact with slaughtered animals, such as in Europe [46]. Except for studies of scientific interest and conservation of species, these models are not applicable in Brazil. In Brazil, systematized strategies for monitoring and predicting occurrences of diseases resulting from biodiversity are incipient. They follow a notification model for diseases that have already occurred in humans or in a few species, which is insufficient for preventive action [6]. Firstly, there is no government system in place to monitor wildlife health consistently. Secondly, in Brazil, hunting is prohibited by law throughout the territory, except in particular places where vulnerable and traditional populations have the right to subsistence hunting.
It should also be emphasized that collecting biological samples for diagnosis poses risks to the health of the collector and therefore requires specific training and personal protective equipment. However, this is not consistent with the reality at the national level, nor with the vulnerability and low level of education of the majority of the population that lives in the forests of natural and anthropized environments.

The relationships that link biodiversity to health are complex because they are often indirect, scattered in space and time, and dependent on many forces [38]. The problem is not restricted to identifying species and their geographical distribution. In the context of the emergence of zoonoses, there are various species of pathogens, vectors, and hosts that evolutionarily modulate one another, as well as their population dynamics and composition, and that collectively undergo and react to environmental changes [23].

Therefore, a multi-dimensional challenge is faced:

  1. Sensitizing decision-makers about the need to monitor the movement of pathogens in wildlife before they impact humans, expanding health surveillance actions.

  2. Building a mechanism that is not limited by the territorial extension of Brazil, by the poorly integrated sectoral policies, or by other outbreaks or emergencies that absorb all the health staff.

  3. Integrating multiple skills, since this mechanism should contain specialists to handle data, species, and distinct social and environmental contexts.

  4. Obtaining, storing, and managing data effectively.

  5. Modeling the risks from data to identify and predict them, as well as extracting the relevant information and ultimately conveying it to society.

The first challenge is arguably the hardest one because it is mostly non-technical and involves dealing with politics. The ongoing strategy to sensitize decision-makers rests on two continuous actions: (i) getting in touch with decision-makers and, backed up by scientific studies, educating them about the benefits of preventive and predictive measures in terms of health, sustainability, economics, and politics; and (ii) regularly presenting to decision-makers how the SISS-Geo platform has been helping in disease prevention and how the monitoring can be made both effective and inexpensive thanks to the network of volunteers and machine-learning-based workflows. How the remaining challenges were dealt with in designing SISS-Geo is explained in the following sections.

As evidenced, data collection, monitoring, and extraction of knowledge and information about wildlife health and its relationship to human health are challenging tasks that involve several areas of knowledge. They are interdisciplinary activities aimed at modeling a dynamic and complex system. It is also clear that major areas of computing, such as computer modeling, machine learning, and parallel programming, are highly applicable in the context presented. However, their application is not straightforward given the need to integrate information in different ways, the complexity and dimensionality of the data to be manipulated, and the sensitivity involved in the use and dissemination of these data [40].

In this article, the Information System on Wildlife Health (SISS-Geo) is presented, a joint effort between the Oswaldo Cruz Foundation (Fiocruz) and the National Laboratory for Scientific Computing (LNCC), as an essential step for moving forward on the challenges posed. Its conception aimed at the integration and participation of various segments of society and encompasses: the registration of primary data by any person interested; the application of the concept of citizen science; the reliable diagnosis of pathogens circulating in wildlife that may potentially impact humans with the participation of laboratory and expert networks; the computational and mathematical challenges that include analytical and predictive systems, data mining, intensive processes, parallel programming, system integration, data (unstructured and heterogeneous) and information, geographic information systems (GIS), machine learning, meta-heuristics, and data visualization.

SISS-Geo is mainly characterized by managing its data in a spatially referenced environment. It aims to:

  • provide, quickly and efficiently, the flow of information between (i) the Information Center for Wildlife at Fiocruz and the national system of health surveillance, with special contribution to the Strategic Information Center on Health Surveillance (CIEVS, Ministry of Health); (ii) the participatory networks in wildlife health and laboratories; (iii) the general population that wants to participate in the process; and (iv) the different biodiversity monitoring centers, such as the MCTI (Ministry of Science, Technology and Innovation), ICMBio (Chico Mendes Institute for Biodiversity), MAPA (Ministry of Agriculture, Livestock and Supply), and Embrapa (Brazilian Agricultural Research Corporation).

  • create, from the data and georeferenced information, warning and forecasting models on human and wildlife diseases in order to act as a sentinel system for emerging and reemerging diseases, as well as provide the results of spatial modeling to the scientific community and decision-makers.

  • provide adequate means to integrate the georeferenced system with the spatial databases of governmental and non-governmental partners.

  • adapt to the metadata standard of the National Spatial Data Infrastructure (INDE) (http://www.inde.gov.br), aiming to provide, efficiently and with full compatibility, data related to wildlife health to the scientific community and the general population.

2 Design and Implementation of SISS-Geo

SISS-Geo is built upon four high-level modules, as illustrated in Fig. 1.

Fig. 1

Four modules of SISS-Geo, consisting of (1) data collection and storage, (2) alert prediction and confirmation, (3) forecast of ecological opportunities, and (4) model interpretation

The first module systematizes photographs and the capture of georeferenced field and observation records of animals, their physical conditions, and their surrounding environment, which are stored in a database (Sections 2.1 and 2.2). Collaborators compile these observations through mobile applications for Android (Fig. 2) and iOS, and through a Web interface (Fig. 3). The second module analyzes the data to generate automated alert models that take into account territorial distances, time interval, similarity between the taxonomic groups involved (notably primates, Chiroptera, rodents, and carnivores, but not limited to them), the observed physical conditions of the animals in the field according to pre-categorized clinical patterns, and the environmental characteristics of the site where the animal was observed (Section 2.3.1). A georeferenced data explorer is available as well, allowing multiple layers of information to be overlaid. Figure 4 illustrates a visualization where records (green), alerts (red), and biomes are overlaid on a map of Brazil.

Fig. 2

Screenshots of the SISS-Geo mobile application displaying the initial screen, main screen action buttons for taking photos and adding records, record description, and record map

Fig. 3

Screenshots of the SISS-Geo Web application. Record details in the map (left), corresponding photo with a dead marmoset (right)

Fig. 4

Screenshots of the SISS-Geo georeferenced data explorer with options for displaying records, environmental, and socioeconomic layers in the right panel

Based on the indication of importance and emergency generated by the alert model, the participatory and laboratory networks in wildlife and human health and the environmental services established in the country are requested to collaborate on collecting biological samples from animals in the field and on providing reliable diagnoses. The reliable diagnosis feeds and validates the alert models, which in turn, from the initial correlation of the environmental conditions of the occurrence, allow for the generation of forecast models of ecological opportunities for disease occurrence that may result from biodiversity loss, thus opening up a different research viewpoint. These actions comprise the third module (Section 2.3.2).

Fig. 5

Use cases of SISS-Geo displaying various possible interactions between users and functions of SISS-Geo

Finally, the fourth module approaches the challenge of understanding, from the trained models, the relationships that govern the phenomenon in question. In this context, model interpretation serves as the main mechanism for generating hypotheses for further investigation and validation by experts (Section 2.3.3). The main components of SISS-Geo, described in the following sections, can be categorized into four classes: wildlife health data management, GIS, machine learning, and wildlife health.

It should be clear by now that the platform is capable of covering the whole territorial extension of Brazil, for two reasons: (i) its primary source of data is citizens, so it does not necessarily rely on the typically overburdened health staff and is not affected by sectoral policies at different administrative levels; and (ii) many of its components are automated by smart workflows and machine learning. Therefore, it overcomes the second challenge referred to in Section 1. With respect to the third challenge, SISS-Geo has been designed from the beginning to accommodate and integrate multiple use cases according to the role of each collaborator class, such as citizen scientists, specialists, laboratories, and decision-makers (Figs. 1 and 5). The strength of SISS-Geo comes from the collaboration among these different users, with some providing the data (citizen scientists and specialists), others validating and processing it (specialists and laboratories), and a third group conveying the processed information to academia (specialists) and society (decision-makers).

The components that had their implementations concluded correspond to the functionality that allows for gathering occurrence data from volunteers through the mobile or the Web application, storing the occurrences in the database, allowing specialists and laboratories to manipulate the occurrences, and allowing for the data to be geographically explored. These components are fully functional and are deployed as mobile applications (for Android and iOS), Web applications (for manipulating occurrences and for geographic exploration), and a database. They correspond to the wildlife health data management and GIS classes and are described in Sections 2.1 and 2.2. In Section 2.3, a methodology is presented for generating alerts using machine learning techniques; this functionality is still under implementation.

2.1 Data Management in Wildlife Health

To monitor changes in biodiversity, one needs to collect, document, store, and analyze indicators of the spatial and temporal distribution of species, as well as information on how they interact with each other and with the environment they live in [28]. The development and implementation of mechanisms to produce these indicators [35] depend on access to reliable data from field surveys, automated sensors, biological collections, and from the academic literature. This data is usually available in various institutions that use different formats and identifiers, which makes it a challenging data integration task. The methods and techniques used to manage and analyze this data define a research area often called Biodiversity Informatics [19, 37]. Some initiatives for establishing metadata and data publishing standards, such as EML [15] and Darwin Core [53], were able to present standard vocabularies used to describe concepts of biodiversity. Although these vocabularies cover only a fraction of the possible concepts, they allow institutions to publish their data about biodiversity using the same format, and for their automatic collection and processing by aggregator systems.

Through the use of these standards, SISS-Geo can collect species occurrence data provided by various contributors and provide the data stored in its database to the community at large in an easy-to-use format. Darwin Core has been extended to include concepts on specific topics, such as information about interactions and pollinators (Darwin Core Extension for Interactions) and on species profiles (Plinian Core) [34]. It would be essential to evaluate and propose an extension of the standard to include information about wildlife health in species observation records, which is typically carried out in the context of the Biodiversity Information Standards (TDWG) organization (footnote 1).
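
As an illustration of how a field record could be published with these vocabularies, the following minimal sketch maps a record to Darwin Core terms. The keys `occurrenceID`, `basisOfRecord`, `scientificName`, `decimalLatitude`, `decimalLongitude`, `geodeticDatum`, `eventDate`, and `countryCode` are standard Darwin Core terms. The `FieldRecord` class and the `health:observedCondition` key are assumptions made only for this example; they are not part of Darwin Core nor of the SISS-Geo schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldRecord:
    """Hypothetical in-memory record; field names are illustrative only."""
    record_id: str
    species: str
    lat: float
    lon: float
    observed_on: str                 # ISO 8601 date, e.g. "2018-01-15"
    condition: Optional[str] = None  # free-text health note


def to_darwin_core(rec: FieldRecord) -> dict:
    """Map a field record to a flat dictionary keyed by Darwin Core terms."""
    dwc = {
        "occurrenceID": rec.record_id,
        "basisOfRecord": "HumanObservation",
        "scientificName": rec.species,
        "decimalLatitude": rec.lat,
        "decimalLongitude": rec.lon,
        "geodeticDatum": "EPSG:4326",
        "eventDate": rec.observed_on,
        "countryCode": "BR",
    }
    if rec.condition is not None:
        # Placeholder for a wildlife-health extension term (an assumption,
        # NOT an existing Darwin Core term).
        dwc["health:observedCondition"] = rec.condition
    return dwc
```

Keeping the health attribute under a separate prefix leaves the core record readable by aggregators that only understand the standard terms.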

SISS-Geo is a biodiversity informatics platform and, as such, it allows users to upload species occurrence records. In SISS-Geo, these records are enriched with additional attributes, provided by the user, that describe the health condition of the respective individuals. The term occurrence is used in this work to refer to the observation of an individual that apparently carries a disease, which is a particular case of a species occurrence as commonly defined in the biodiversity informatics literature. The geographical scope of SISS-Geo is limited to Brazil, and its users are citizen scientists and specialists. A relational database was conceptually modeled and implemented for SISS-Geo, comprising occurrences of organisms along with associated information about their health condition. Standard operations for creating, reading, updating, and deleting information are enabled by mobile and Web applications that allow both citizen scientists and system managers to interact with the system (Fig. 5 describes SISS-Geo use cases).

As can be observed in its database schema in Fig. 6, SISS-Geo stores information about wildlife health occurrences (Occurrence). These occurrences usually have an animal (Animal), a collaborator (Collaborator), and a location (Location) associated with them. Specialists can require samples (Sample) related to the occurrence to be collected, which are going to be analyzed (Analysis) in the laboratory (Laboratory) network. Data stored in this database is consumed by mathematical models that can produce and confirm wildlife health alerts (Alert).
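
To make the relationships among these entities concrete, the following sketch builds a heavily simplified relational schema in an in-memory SQLite database. All table names, column names, and sample values are assumptions made for illustration; the Analysis and Laboratory entities are collapsed into two columns of `sample`, and the actual SISS-Geo schema is richer than this.

```python
import sqlite3

# Simplified, hypothetical schema; not the actual SISS-Geo schema.
SCHEMA = """
CREATE TABLE collaborator (id INTEGER PRIMARY KEY, name TEXT NOT NULL, role TEXT NOT NULL);
CREATE TABLE animal (id INTEGER PRIMARY KEY, taxon TEXT NOT NULL);
CREATE TABLE location (id INTEGER PRIMARY KEY, lat REAL NOT NULL, lon REAL NOT NULL);
CREATE TABLE occurrence (
    id INTEGER PRIMARY KEY,
    animal_id INTEGER REFERENCES animal(id),
    collaborator_id INTEGER REFERENCES collaborator(id),
    location_id INTEGER REFERENCES location(id),
    observed_on TEXT NOT NULL,
    health_note TEXT
);
CREATE TABLE sample (
    id INTEGER PRIMARY KEY,
    occurrence_id INTEGER NOT NULL REFERENCES occurrence(id),
    laboratory TEXT NOT NULL,
    result TEXT              -- NULL until the analysis is finished
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the schema in memory and insert one made-up occurrence."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO collaborator VALUES (1, 'volunteer-1', 'citizen scientist')")
    conn.execute("INSERT INTO animal VALUES (1, 'Callithrix')")
    conn.execute("INSERT INTO location VALUES (1, -22.9, -43.2)")
    conn.execute("INSERT INTO occurrence VALUES (1, 1, 1, 1, '2018-01-15', 'dead')")
    conn.execute("INSERT INTO sample VALUES (1, 1, 'lab-A', NULL)")
    conn.commit()
    return conn


def pending_samples(conn: sqlite3.Connection) -> list:
    """List occurrences whose requested samples have no result yet."""
    query = """
        SELECT o.id, a.taxon, s.laboratory
        FROM sample AS s
        JOIN occurrence AS o ON o.id = s.occurrence_id
        JOIN animal AS a ON a.id = o.animal_id
        WHERE s.result IS NULL
        ORDER BY o.id
    """
    return conn.execute(query).fetchall()
```

A query such as `pending_samples` corresponds to the specialist-facing view of samples that still await laboratory analysis.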

Fig. 6

Overview of the various entities and relationships that comprise the database schema of SISS-Geo

The architecture of SISS-Geo is described in Fig. 7. It comprises the following components: a mobile application, a Web application server, a database server, and high-performance computing (HPC) resources. As described in the use case diagram in Fig. 5, citizen scientists use the mobile application to request, for instance, the upload of their observations or the execution of queries. These requests are forwarded to the Web application server, which connects to the database server to answer them. Administrative users and specialists can also access the Web application server directly to send requests to SISS-Geo. Finally, the Web application server can invoke the execution of computationally intensive analyses on high-performance computing resources. A complete list of use cases is described in Fig. 5.

Fig. 7

Architectural view of SISS-Geo displaying components of the system and their interactions with citizen scientists, specialists, and system administrators

The approach used to tackle the fourth challenge mentioned in Section 1, that of effectively obtaining, storing, and managing data, is based on following the best practices for scientific data management, especially those from the biodiversity informatics community. The conceptual model of SISS-Geo’s database follows established standards, such as Darwin Core [53] and the Ecological Metadata Language (EML) [15]. Following the example of citizen science initiatives, such as eBird [49], SISS-Geo can obtain large volumes of valuable data from volunteers who use its mobile application on the Android and iOS platforms. As described in the next subsection, this data is combined with other datasets and used in the alert prediction model proposed in Section 2.3.

2.2 Geoprocessing

Spatial and geographical visualization is a fundamental condition for the management of information today. It is often difficult to achieve due to the need for normalization, updating, and access to qualified data. In studies of infectious diseases, the spatialization of data additionally needs to consider population pulses and fluctuations determined by several factors, such as seasonality, reproductive periods, and migrations [33].

SISS-Geo aims to generate relevant and reliable information that can support the decision processes of the Brazilian Ministries of Health, of the Environment, and of Agriculture, Livestock and Supply, providing input for more agile and timely decision-making.

Because it is an innovative project, the functionality developed is not straightforward, and it was often not available in similar initiatives. The construction of new methodologies and the use of different types of geographic technologies that can meet the expectations and objectives of SISS-Geo is therefore necessary. The GIS Infrastructure (GI) of SISS-Geo has strategic importance in this process, in which there is a need to overcome challenges related to quality control of spatial data, modeling spatialization based on machine learning and the dissemination of models in the form of dynamic maps on the Internet.

The data-driven modeling of disease occurrence based on socio-environmental variables in SISS-Geo uses a broad diversity of spatial data, such as land use and vegetation cover (Mapbiomas collection 2.3, footnote 2); temperature and precipitation (Global Precipitation Mission - GPM, footnote 3, and Worldclim, footnote 4); geomorphology, soil types, climatic zones, degree of urbanization, highways, mineral exploration areas, biomes, and conservation units (Brazilian Institute of Geography and Statistics - IBGE, footnote 5); demographic density (NASA’s Socioeconomic Data and Applications Center, footnote 6); and altimetry (NASA’s Aster GDEM, footnote 7). Since these data come from different sources (Brazilian and other national and global sources), they have different scales, reference systems, and mapping methodologies. Therefore, they were pre-processed and structured for integration into a geographic database. This database is used both to supply data to the models and to store the modeling results in the form of geographically distributed models. The data used as input for modeling are obtained by overlapping wildlife occurrence records with environmental, social, and human impact databases. Depending on the location of the records, spatial relationships of the types intersect, within, close, crosses, and the like can be established.

All pre-processing tasks, performed on over one hundred gigabytes of data, were carried out in QGIS [39]. At this stage, it was necessary to standardize the cartographic characteristics of geographic data, correct topological errors, clean duplication of information, and standardize the structure of the attribute table. In general, the data were divided into two groups: vector data and raster data. All data in raster format was converted to vector format in order to be compatible with the internal software package which expects this format as input.

Because part of the thematic data used was produced at small and medium scales (1:1,000,000, 1:500,000, 1:250,000), which provide a limited level of detail and accuracy, the methodology for verifying spatial relations adopted areas of influence (buffers) around the occurrence points of the animal species. This brought flexibility to spatial queries, making it possible to identify the context of socio-environmental features in which the animal was observed.
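
The idea of a buffer query can be sketched as follows. This is a simplified illustration, not the SISS-Geo implementation: features are reduced to single points (real thematic layers are lines and polygons), distances use the haversine formula on a spherical Earth, and the names `haversine_km` and `features_within_buffer`, the feature tuples, and the radius are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def features_within_buffer(point, features, radius_km):
    """Return the names of features lying inside a circular buffer.

    point    -- (lat, lon) of the occurrence record
    features -- iterable of (name, lat, lon) tuples (point features only)
    """
    lat, lon = point
    return [
        name
        for name, flat, flon in features
        if haversine_km(lat, lon, flat, flon) <= radius_km
    ]
```

A larger radius compensates for coarser source scales, at the cost of attributing more distant features to the record.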

Other spatial and temporal information is requested from the user and added to the database as observation site (“local scale”) attributes to enhance the species observation records used for data-driven modeling.

The geoprocessing infrastructure also needs to make the results, alerts, and prediction models produced by SISS-Geo publicly available according to the Brazilian Information Access Act, except for sensitive information. Therefore, adapting the geographic information system to the Web environment, which provides SISS-Geo’s results in the form of dynamic/interactive maps and graphical statistics (footnote 8), is an ongoing development. An advantage of this technology is the ease of handling, analysis, and interpretation of models by the end user, as well as operating system independence and interaction with desktop systems and other Internet systems (interoperability).

2.3 Machine Learning

SISS-Geo embraces machine learning techniques to address the fifth challenge mentioned in Section 1, leading to risk mapping and to the understanding of factors related to the emergence of diseases. These products are vital to the purpose of SISS-Geo because they are the main avenue for conveying information to decision-makers and society. The first component (Section 2.3.1) deals with real-time alert prediction, which targets health authorities for further verification and diagnosis of alerts. The second and third components (Sections 2.3.2 and 2.3.3) aim, respectively, at building models and extracting knowledge from them in order to advance the understanding of associations between socio-environmental factors and suitability for disease occurrence, which are of vital importance for specialists, decision-makers, and society.

2.3.1 Grouping of Observation Records and Alert Prediction

When a wild animal is observed, its physical condition and surrounding environment are recorded in SISS-Geo, either by experts or volunteers. These records are grouped with other related records (previously reported) resulting in a collection of events characterizing a phenomenon. This is the grouping stage and, although it may sound trivial, it involves the challenge of conceiving/training models with the discriminative capacity to recognize similarities and dissimilarities between events, based on criteria such as spatial and temporal distance between records, the similarity between species and the reported physical conditions, among others. This flow of learning is summarized in Fig. 8.
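
For intuition, the grouping stage can be contrasted with a naive rule-based baseline, sketched below. This is not the learned similarity model discussed in this section: the thresholds (`max_km`, `max_days`), the requirement of identical taxa, the record fields, and the equirectangular distance approximation are all assumptions chosen for the example.

```python
import math
from datetime import date


def approx_km(a, b):
    """Equirectangular approximation of the distance (km) between (lat, lon) pairs."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6371.0 * math.hypot(x, y)


def related(r1, r2, max_km=50.0, max_days=30):
    """Hand-written similarity rule: same taxon, close in space and in time."""
    return (
        r1["taxon"] == r2["taxon"]
        and abs((r1["date"] - r2["date"]).days) <= max_days
        and approx_km(r1["pos"], r2["pos"]) <= max_km
    )


def assign_group(record, groups, **thresholds):
    """Append the record to the first related group, or open a new group.

    groups is a list of lists of records and is modified in place.
    Returns the index of the group that received the record.
    """
    for idx, members in enumerate(groups):
        if any(related(record, m, **thresholds) for m in members):
            members.append(record)
            return idx
    groups.append([record])
    return len(groups) - 1
```

Fixed thresholds like these are exactly what a trained similarity model would replace, since the appropriate distances plausibly differ between species and diseases.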

Fig. 8

Machine learning flow of SISS-Geo, starting with a new occurrence, going through decision and analysis steps, and finishing in either a database update to record the occurrence or in an alert confirmation followed by the creation of a new model

The second part consists of modeling the characteristics of observation records that make them more or less relevant, i.e., training the alert model. It means predicting the severity of records according to information brought by events and the geographic/environmental context. For example, a record involving an animal in isolation exhibiting symptoms is less severe, in general, than occurrences containing similar events but covering groups of animals. Of course, in real situations, the characterization of an alert situation is usually much less noticeable, commonly taking into consideration many factors for decision-making. In some cases, a single record is sufficient to generate an alert, such as the registration of a wild canid with symptoms of rabies and non-human primates with Yellow Fever symptoms.

It can be seen that the activities mentioned above correspond to the tasks of grouping and classifying data, which are typical of machine learning and well known for the wide variety of available approaches and methodologies. They are therefore complex tasks, both by nature and because of the large volume of data expected for the system (footnote 9).

However, the challenges of grouping and classification that are present in SISS-Geo go beyond the classic challenges of these tasks.

Phenomenon characterization

The characterization of what defines a group of events (phenomenon) lies in the problem of formulating non-conventional similarity measures (e.g., not necessarily Euclidean). Grouping rules based on expert experience are a reasonable alternative but have the shortcoming of limited formalization of knowledge and, consequently, the potential for introducing unwanted biases. Another approach is to treat this problem as a machine learning process, aiming at the training of similarity models: given a new record and the existing ones, determine to which group it belongs, or whether it characterizes a new group. The process is characterized as supervised learning, since it is possible to determine reliably, a priori or a posteriori, which records belong to which phenomenon, either by empirical tests or by expert confidence.

Feature extraction

Once the phenomena are constituted, it is necessary to evaluate them as to their potential threat to wildlife health and their possible outbreak in humans, as phenomena alone do not necessarily constitute alert situations. In this sense, information characterizing a group of events needs to be extracted and provided to the alert prediction model. The difficulty is thus to derive the statistics that best represent the phenomenon described by the group in order to maximize the performance of the prediction model; in other words, to surface the information needed to facilitate the learning process. Experts recommend the use of certain statistics, such as the type and quantity of affected animals and the number and frequency of occurrences, among others; however, the space of possible features goes well beyond that and could be used to improve predictive performance. Thus, an open question is how to exploit this vast space automatically. An interesting line of research and a potential solution to this challenge is the investigation of automatic feature extraction methods [17, 18]. In a nutshell, the task can be cast as a supervised machine-learning problem by taking as independent variables the union of the information of all events in a group and, as the corresponding dependent variable, whether or not an alert was issued at the time; of course, this requires the existence of pre-labeled alerts. Then, a machine-learning algorithm can be applied to learn a function (or set of functions) that maximizes the correlation between the groups’ content and alert prediction; this optimized function can be understood as the extracted feature.
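
The expert-recommended statistics mentioned above (type and quantity of affected animals, number and frequency of occurrences) can be computed with a few lines of code. The sketch below covers only such hand-crafted features, not the automatic feature extraction methods of [17, 18]; the event fields (`taxon`, `n_animals`, `date`, `condition`) and the feature names are assumptions.

```python
from datetime import date


def group_features(events):
    """Summarize a non-empty group of events into a flat feature dictionary.

    Each event is a dict with the hypothetical keys
    'taxon', 'n_animals', 'date' (datetime.date), and 'condition'.
    """
    dates = [e["date"] for e in events]
    # Inclusive span, so a single-day group has span_days == 1.
    span_days = (max(dates) - min(dates)).days + 1
    n_records = len(events)
    return {
        "n_records": n_records,
        "n_animals": sum(e["n_animals"] for e in events),
        "n_taxa": len({e["taxon"] for e in events}),
        "span_days": span_days,
        "records_per_day": n_records / span_days,
        "frac_dead": sum(e["condition"] == "dead" for e in events) / n_records,
    }
```

Each group then becomes one fixed-length row that can be paired with its alert label for training.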

Alert prediction model

Although its use in the system is similar to well-known methods described in the literature, the alert prediction model is probably the most strategic component of SISS-Geo’s intelligence. The viability of the system is fundamentally based on the accuracy of the prediction model, both in detecting true positives (alerts) and true negatives (non-alerts). The failure to detect an alert condition (false negative) can result in severe consequences for wildlife, environmental, and human health. On the other hand, false positives would overwhelm the relatively small network of laboratories and experts responsible for confirming or denying alerts (more details below). In this sense, methods that combine multiple models (ensemble methods) usually produce more accurate and robust solutions. Therefore, they are promising candidates as training algorithms for prediction models [42]. Still, since a large portion of the system’s data has no associated class, that is, it consists of phenomena whose alert predictions have not yet been confirmed, semi-supervised learning is an interesting approach due to its ability to also leverage unlabeled instances in the training process [7].
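
The trade-off between false negatives and false positives, and the basic mechanics of combining several models, can be illustrated with a toy majority-vote ensemble. The sketch below is only a schematic example: the member models are hand-set threshold rules rather than trained models, and the feature names and thresholds are assumptions.

```python
def make_stump(feature, threshold):
    """Return a one-feature rule that votes 'alert' when the value reaches the threshold."""
    return lambda x: x[feature] >= threshold


def majority_vote(models, x):
    """Predict an alert when strictly more than half of the models vote for it."""
    votes = sum(1 for model in models if model(x))
    return votes * 2 > len(models)


def confusion_counts(models, samples):
    """Count true/false positives/negatives over (features, is_alert) pairs."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for x, is_alert in samples:
        predicted = majority_vote(models, x)
        key = ("t" if predicted == is_alert else "f") + ("p" if predicted else "n")
        counts[key] += 1
    return counts
```

In such a tally, the `fn` count corresponds to missed alert conditions and the `fp` count to unnecessary load on the confirmation network, which is why both need to be monitored rather than accuracy alone.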

Alert confirmation

Another key component of SISS-Geo, on which all others depend, is the process of alert confirmation. This step is the second (and last) time a human takes part in the process, the other being the upload of the observation record. As expected, the need for direct human participation in the confirmation procedure, either in the field or in the laboratory, is a great challenge and a bottleneck; it is an expensive and slow process, even considering the extensive network of qualified collaborators linked to SISS-Geo. When the prediction model issues more alerts than the experts and the laboratory network are able to confirm, the phenomena need to be prioritized. In this situation, one can think of prioritizing the phenomena associated with alerts (1) by alert severity weighted by the confidence of the prediction, or (2) by relevance to regions of great interest, be it social, environmental, or economic. However, a strategy focused on the medium and long term is to prioritize the confirmation (or denial) of the alerts with the greatest potential for improving the accuracy of the prediction model. This recent line of research is called active learning [47]. The same method can also be used for possible false negatives, thus avoiding the degeneration of the prediction model (Footnote 10): phenomena predicted as non-alerts that are nonetheless promising from the learning point of view would be submitted to an expert for confirmation (of the non-alert condition).
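
The two kinds of prioritization discussed above can be contrasted in a short sketch. It assumes that each phenomenon carries a hypothetical severity score and a predicted alert probability `p_alert`; the active-learning strategy shown is plain uncertainty sampling, which is only one of several query strategies and is not necessarily the one adopted in SISS-Geo.

```python
def rank_by_expected_severity(phenomena):
    """Strategy (1): alert severity weighted by the confidence of the prediction."""
    return sorted(phenomena, key=lambda p: p["severity"] * p["p_alert"], reverse=True)


def rank_by_uncertainty(phenomena):
    """Active-learning strategy (uncertainty sampling): predictions closest to 0.5 first."""
    return sorted(phenomena, key=lambda p: abs(p["p_alert"] - 0.5))


def select_for_confirmation(phenomena, capacity, strategy):
    """Return the ids of the phenomena sent to experts, limited by their capacity."""
    return [p["id"] for p in strategy(phenomena)[:capacity]]


# Made-up phenomena; 'severity' and 'p_alert' are illustrative assumptions.
phenomena = [
    {"id": "A", "severity": 3, "p_alert": 0.95},
    {"id": "B", "severity": 5, "p_alert": 0.55},
    {"id": "C", "severity": 1, "p_alert": 0.48},
    {"id": "D", "severity": 4, "p_alert": 0.10},
]
urgent = select_for_confirmation(phenomena, 2, rank_by_expected_severity)
informative = select_for_confirmation(phenomena, 2, rank_by_uncertainty)
```

In this toy case the severity-based ranking selects A and B, whereas uncertainty sampling selects C and B. Phenomenon C is predicted as a non-alert, which illustrates how the same mechanism can submit potential false negatives to expert confirmation.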

2.3.2 Prediction of Ecological Opportunities for Disease Occurrence

Another line of fundamental importance in SISS-Geo is the prediction of scenarios and environments that favor ecological opportunities for the occurrence of diseases arising from wildlife or, put differently, the identification of scenarios conducive to a particular event, such as a disease outbreak.

In short, trained alert models can be used to evaluate different scenarios and characterize those potentially susceptible. From these, environmental, social, and human and animal health variables are taken (see Section 2.2), leading to a set of instances that share the status of “abnormality,” according to the alert model predictions. Then, data-driven models are built over this set in order to estimate a distribution of socio-environmental variables related to alerts. Finally, the resulting models can be applied for predictive or descriptive purposes. While in the former the goal is to assign a degree of suitability for disease occurrence in the geographic space (regions), the latter aims at the understanding of the factors associated with the disease occurrence, akin to the discussion in Section 2.3.3.

In order to construct these predictive/descriptive models, methods that link the mentioned variables can be applied, such as those used in ecological niche modeling [48] or, preferably, the less specific traditional machine-learning methods. Since the alert models are trained on confirmed/denied events (Fig. 8), what the disease occurrence models will reconstruct is not only the (realized) niche of the observed animals but, hopefully, the environmental and climatic parameters that favor the realized niche of the pathogen, which potentially includes portions of the niches of its components, including vectors and hosts (since non-human primates coexist with other species) [36]. Take, for instance, Sylvatic Yellow Fever (YF). When an alert is issued (due perhaps to an observed high number of non-human primate deaths), specialists will confirm or deny the alert. In this case, doing so is the same as confirming (or not) the circulation of the YF virus among non-human primates, which in turn is connected with the circulation of the mosquitoes that transmit YF.

It is worth observing that this kind of modeling outputs the suitability for disease occurrence, not the actual probability of occurrence. In other words, the model measures how close a given region’s socio-environmental variables are to the distribution of the corresponding variables of regions that had confirmed alerts [36].
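
The distinction between suitability and probability can be made concrete with a minimal sketch. It assumes a deliberately simple closeness measure: each variable of a candidate region is standardized against the mean and standard deviation observed in regions with confirmed alerts, and the mean squared deviation is mapped to the interval (0, 1]. The variable names, the toy values, and the measure itself are assumptions of this example, not the modeling approach adopted in SISS-Geo.

```python
import math
from statistics import mean, pstdev


def fit_envelope(confirmed_regions):
    """Per-variable mean and standard deviation over regions with confirmed alerts."""
    variables = list(confirmed_regions[0].keys())
    return {
        v: (mean([r[v] for r in confirmed_regions]),
            pstdev([r[v] for r in confirmed_regions]))
        for v in variables
    }


def suitability(region, envelope):
    """Score in (0, 1]: 1.0 at the centre of the confirmed-alert distribution.

    This is a relative closeness score, not a probability of occurrence.
    """
    squared_z = []
    for v, (mu, sigma) in envelope.items():
        scale = sigma if sigma > 0 else 1.0  # guard against constant variables
        squared_z.append(((region[v] - mu) / scale) ** 2)
    return math.exp(-0.5 * sum(squared_z) / len(squared_z))


# Made-up socio-environmental variables for regions with confirmed alerts.
confirmed = [
    {"temperature_c": 24.0, "forest_cover_pct": 60.0},
    {"temperature_c": 26.0, "forest_cover_pct": 70.0},
    {"temperature_c": 28.0, "forest_cover_pct": 80.0},
]
envelope = fit_envelope(confirmed)
centre = suitability({"temperature_c": 26.0, "forest_cover_pct": 70.0}, envelope)
near = suitability({"temperature_c": 27.0, "forest_cover_pct": 65.0}, envelope)
far = suitability({"temperature_c": 15.0, "forest_cover_pct": 10.0}, envelope)
```

The scores only rank regions by similarity to those with confirmed alerts; turning them into occurrence probabilities would additionally require information about how often alerts occur in comparable regions.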

2.3.3 Gaining Insights Through Model Interpretation

An essential feature of symbolic modeling methods, such as decision trees, rule extraction algorithms, and meta-heuristic genetic programming [25], is that they reveal in human-readable form the existing relationships between the input and output data.

This class of models has remarkable potential to aid experts in the analysis and understanding of the phenomenon investigated, leading to a human-machine interaction in which the model suggests the hypotheses that best fit the data while the expert validates them.

In order to gain meaningful insights from the model, it is necessary to accurately define its structure/language or, in other words, to incorporate expert knowledge properly. While doing that, care should be taken to find the ideal balance between bias, usually resulting from structural simplicity of the model, and variance, an issue usually associated with structurally more complex models.

3 Evaluation

As of February 2018, SISS-Geo had been downloaded more than a thousand times from the Google Play store and had an average rating of 4.8 out of 5 stars. Even though the number of observations related to wildlife health is usually only a fraction of the population of a species, SISS-Geo has 3014 records in its database, contributed by 1881 citizen scientists. Its Web interface has been accessed 4,463 times. These records correspond to 764 mammals, 815 birds, 383 reptiles, 227 amphibians, 47 fish, and 540 unidentified animals. Table 1 lists the ten most recorded taxonomic groups in SISS-Geo. It is important to emphasize that the records were uploaded by volunteer collaborators who often lack taxonomic knowledge, which can adversely affect data quality. To tackle this issue and improve wild animal monitoring, which can lead to more accurate models for the emergence of zoonoses, SISS-Geo has developed a tool for expert-supported record validation. Figure 9 shows the geographic distribution of the georeferenced observations recorded by SISS-Geo.

Table 1 Ten most recorded taxonomic groups in SISS-Geo (data until March 2018)
Fig. 9 Geographic distribution of records (red dots) in Brazil until March 2018

SISS-Geo integrates data-based computational modeling, development, and high-performance computing. It was selected in 2014 as the best project [5] in the “Health” category of the Grand Challenges of Computing event of the Brazilian Computer Society. In 2017, SISS-Geo received the National Biodiversity Prize from the Brazilian Ministry of the Environment (Footnote 11). It allows the monitoring of wildlife and can support the identification of zoonoses such as Yellow Fever, which in its sylvatic cycle circulates among non-human primates. Because monkeys become ill or die before human cases of Yellow Fever appear, the surveillance of outbreaks in these animals, such as the recent one [11, 31], is of vital importance for the control and prevention of the disease. The collaboration of the population is critical because it allows prevention actions to be improved and streamlined, to everyone's benefit. With the participation of ordinary people, the application makes occurrences of dead or diseased animals available in real time for public health and biodiversity conservation purposes. It assists the Epizootics Surveillance System in Nonhuman Primates (PNH) of the Brazilian Ministry of Health, and records of dead monkeys are reported to the bodies responsible for investigating the cases. The information recorded in SISS-Geo serves to generate computational models for predicting zoonoses and to support the adoption of preventive measures. Tables 2 and 3 list the recorded conditions and the most recorded abnormalities in SISS-Geo, respectively.

Table 2 Recorded conditions in SISS-Geo until March 2018
Table 3 Recorded abnormalities in SISS-Geo until March 2018

Some of the observations made with SISS-Geo triggered alerts and contributed to biodiversity conservation actions, such as the following: (i) 59 dead turtles were recorded in the south of the Brazilian state of Bahia in November 2017, generating a notification to the responsible environmental agency and a legal notice to those involved in predatory fishing in the area; (ii) observations of dead foxes with rabies in the Northeast supported decision-making by health surveillance agencies; (iii) 73 dead monkeys were recorded in 2016 during the recent Yellow Fever epizootic, which directed health surveillance actions in the field.

The outbreak of Yellow Fever, which occurred in 2016 and spread throughout southeastern Brazil [31, 51], was evidenced by SISS-Geo through the records of non-human primates in Minas Gerais and other states. Among the various prevention and control actions, the Health Surveillance Secretariat of the Ministry of Health of Brazil carried out five training courses with all the agents and stakeholders involved in the surveillance of Yellow Fever in all the states of the country. In these training sessions, SISS-Geo was presented and offered to agents and managers as a monitoring tool for epizootics of zoonoses [1]. Other training activities were carried out with community health agents, park guards, and civil defense agents directly involved in human vaccination and in the collection of biological samples from non-human primates to confirm cases, in addition to the capture of animals. Although its use is unofficial so far, since the national flow of information must first be restructured, SISS-Geo has been adopted as an additional surveillance tool, especially for its ability to provide georeferenced data, photographs, and real-time information. As a result of this work, working groups, municipal agents, and collaborators from 25 Brazilian states record animals, which has already helped to report about 200 deaths of non-human primates throughout the country.

SISS-Geo also contributed to the monitoring of species on the IUCN Red List of Threatened Species by making available the location and information of some species already registered, such as Panthera onca, Puma concolor, Tapirus terrestris, Myrmecophaga tridactyla, Bradypus torquatus, Chrysocyon brachyurus, Chelonia mydas, Leontopithecus chrysomelas, Alouatta guariba guariba, and Crax blumenbachii.

As seen, the SISS-Geo platform, including its data and methodologies, allows analyses that can go beyond the initial planning: from the monitoring of specific groups of animals to its complete adaptation to new contexts. Thus, examples of the expansion of SISS-Geo to new scenarios can be seen by the use of data already collected, its tools, or even all its computational framework.

In this sense, projects that rely on the records of the platform to support biodiversity monitoring, such as in Serra dos Órgãos National Park (PARNASO - Parque Nacional da Serra dos Órgãos in Rio de Janeiro State), are already in progress. Besides that, the collected records can also be used in many other scenarios, for instance: estimating species distribution along with their health status and training models of automatic species identification from SISS-Geo’s images.

Besides, the SISS-Geo computing framework, designed to integrate locality information, photographic and animal records, is easily replicated, taking advantage of all the tools and methodologies developed and applied in its design. An example of this reuse is the project under development as a partnership between Fiocruz, the National Center for Flora Conservation of the Rio de Janeiro Botanical Garden Research Institute (CNCFlora/JBRJ), and the Rio de Janeiro State Secretary of the Environment (SEA-RJ). It aims to adapt SISS-Geo to a citizen-scientist platform for searching for rare plants, within the context of the project “Campanha Procura-se” (Footnote 12). By simply adapting the information from the module “Animal” to a “Plant” one, it was possible to replicate most of the previously described concepts and flows.

4 Related Work

He et al. [20] present the eMammal framework for wildlife monitoring supported by citizen scientists. Animal images collected with camera traps are sent to its database, where visual animal recognition techniques are applied. The species identification recommendations generated are reviewed by citizen scientists and, subsequently, by experts. The resulting validated records are made available to wildlife and ecological researchers. eBird [49] also leverages the capability of citizen scientists to gather bird observation records. Automated data quality filters are used to support species identifications performed by citizen scientists. iNaturalist [21] is another biodiversity citizen-science initiative available as both a mobile and a Web application. Volunteers can record and identify species observations, which can be validated by other users and biologists. After an observation is validated, it is annotated as “research grade” and uploaded to GBIF.

The World Organization for Animal Health (OIE) maintains the World Animal Health Information Database (WAHIS) Interface (Footnote 13), which is accessible online and contains updated information on disease outbreaks. However, most of the OIE’s information relates to infectious agents that have an impact on livestock production and human health, when the two situations are interlinked. As an example, there was no notification of Yellow Fever epizootics in humans and non-human primates, and this information was only requested from Brazil in 2017, after thousands of deaths of people and non-human primates. In WAHIS, space for “other diseases” of little relevance to the economy was added only in recent years. It is important to note that the information provided by the OIE comes from member countries in half-yearly reports. Brazil has the National Animal Health Information System (SIZ) (Footnote 14); however, the database refers to the mandatory notification diseases described in Normative Instruction No. 50 of 23 September 2013. The national list does not include pathogens that are of no interest to animal production, although there is also a field where any notification can be made. Therefore, in Brazil, there is no system for collecting and systematizing wildlife diseases, which is one of the main reasons for the development of SISS-Geo, created by a project in partnership with government agencies of livestock production and the environment.

More general biodiversity databases exist at the global, national, and ecosystem levels. GBIF [13] gathers species observation data on a global scale. In February 2018, it had 54 national nodes. Along with other types of participants, GBIF gathers data from 1152 institutions, totaling approximately a billion records. SiBBr [16] is the Brazilian GBIF node, publishing species occurrence records and providing an ecological niche modeling portal [45]. BaMBa [27] is a biodiversity database that focuses on marine ecosystems that is also integrated with GBIF. These systems use the IPT tool [41] to extract observation records from local databases, export them to Darwin Core [53], and publish them on GBIF.

SISS-Geo is both a citizen science application and a biodiversity database. eBird, eMammal, and iNaturalist, while being citizen science applications as well, do not provide tools for data analysis, as SISS-Geo plans to do by applying machine-learning techniques to generate wildlife health alerts following the methodology proposed in Section 2.3. GBIF, SiBBr, and BaMBa focus on data mobilization and publication and do not directly provide tools for enabling the participation of citizen scientists.

5 Conclusion

The proposal was inspired by the desire to make public, and seek reinforcements for, a long journey that brings together researchers, experts from multiple areas, and society so that, through computing, information and disease prevention actions reach the most remote regions of the country. It emerges from many years of field research practice in the Brazilian semi-arid region, where relevant information on diseases in wild animals has been lost or dispersed, and where the lack of systematization made the necessary actions impossible, both for the containment of diseases in humans and for the conservation of species.

SISS-Geo was born out of efforts to create innovative and integrated actions for the mainstreaming of biodiversity in the sectors of the country. It is part of the actions of the Oswaldo Cruz Foundation (Fiocruz) in the “Public-Private Actions for Biodiversity Project” – PROBIO II (Footnote 15), coordinated by the Brazilian Ministry of the Environment and developed by FUNBIO, Embrapa, the Brazilian Ministries of Agriculture and Livestock, Health, and Science, Technology and Innovation, the Botanical Garden of Rio de Janeiro, ICMBio, and Fiocruz. The National Laboratory for Scientific Computing joined the Fiocruz project and ensured its execution in a long-term knowledge-building partnership.

By automating the search for occurrence patterns, SISS-Geo allows information to reach citizens nationwide more efficiently, from the general population to experts, and provides the opportunity to acquire knowledge about the possible patterns and parameters that contribute to the occurrence of diseases. In the medium and long term, it also builds the capacity of researchers to develop complex models in disease ecology that can exploit geographic information in order to improve accuracy. Moreover, occurrence patterns yield data that can assist national policy on health and biodiversity conservation.

It should be pointed out that all the described design decisions and techniques embodied in SISS-Geo could also be readily adapted to other similar settings. Besides the increasingly popular concept of leveraging citizen scientists to propel a collaborative monitoring tool, which would fit countless different scenarios provided they can either tolerate some inaccuracy or have a mechanism to validate user input, the proposed machine-learning flow is sufficiently generic to cover a wide range of related contexts. For instance, it is rather common for a phenomenon not to be recorded all at once but incrementally in time and space, possibly by different collaborators, as in a jigsaw puzzle; here, everything discussed in Section 2.3.1 would be potentially useful. Another potential contribution to other settings is the discussion about feature extraction from groups followed by alert prediction/confirmation, which could play a central role in situations where certain events should trigger alerts, especially those related to unattended monitoring.

In the context of SISS-Geo, the incorporation of provenance information is planned so that the alert generation process becomes traceable, meaning that one can recover the data, configuration parameters, people, and computational activities involved. This enables many applications, such as assessing the quality of the alerts generated, verifying compliance with governmental regulations, and ensuring the reproducibility [3, 9] of the alert generation process. Provenance information [29, 30] contains details about the planning and execution of computational processes, such as scientific workflows, and describes the processes and data involved in generating their results. It allows an accurate description of how a computational process was planned, which is called prospective provenance, and of what occurred during its execution, which is called retrospective provenance. Some applications of provenance include reproducibility of computational processes for validation, sharing and reuse of knowledge, data quality evaluation, and attribution of scientific results. One of the concepts commonly captured in provenance is causality, which is given by the existing dependency relationships between computational activities and data sets. From these direct dependencies one can derive, by transitivity, dependencies between data sets and between processes.
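
The derivation of dependencies by transitivity can be sketched as a traversal of a retrospective provenance graph. In the example below, the node names (such as `alert_42` and `model_v1`) and the dictionary-based representation are hypothetical; an actual deployment would more likely rely on a provenance model and store of its own.

```python
def lineage(node, depends_on):
    """Return every activity and data set that `node` depends on, directly or transitively."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in depends_on.get(current, ()):
            if parent not in seen:  # visit each upstream node once, even with shared ancestors
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical direct dependencies: each key depends on the listed items.
depends_on = {
    "alert_42": ["predict_alert"],                    # data generated by an activity
    "predict_alert": ["group_features", "model_v1"],  # activity used two data sets
    "group_features": ["extract_features"],
    "extract_features": ["record_a", "record_b"],
    "model_v1": ["train_model"],
    "train_model": ["confirmed_alerts"],
}
upstream = lineage("alert_42", depends_on)
```

Here the alert is traced back both to the observation records that formed the phenomenon and to the confirmed alerts used to train the model, which is the kind of traceability needed to assess the quality of an alert or to reproduce it.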

An application programming interface (API) is being implemented and will serve the data stored in SISS-Geo to other systems. The installation of an instance of IPT [41] is also planned and will allow this same data to be exposed in the Darwin Core [53] standard along with EML [15] metadata. This will enable global and national biodiversity databases, such as GBIF and SiBBr [16], to collect the species occurrence records stored in SISS-Geo. As a result of the need to integrate information on wildlife, human, and livestock health, several conversations have been held with the OIE’s Brazilian focal point in the Ministry of Agriculture, Livestock and Food Supply of Brazil. The plan is for information from SISS-Geo to feed the SIZ (Brazilian National Animal Health Information System) database, which will subsequently feed the WAHIS database. Since 2017, the transfer of information has been done informally, through the reporting of notifications of dead and sick animals. Systematic integration into the various national information systems, for both human health and livestock production, also depends on many political and legal advances and, in particular, on strengthening the structure that supports the laboratory diagnosis of pathogens that do not appear in the lists of notifiable diseases for either humans or livestock. The integration of data from additional providers that are relevant to the application area of SISS-Geo is planned, following the applicable rules and national legislation.

In one way or another, the authors believe that the SISS-Geo platform addresses the five challenges mentioned in Section 1:

  1. Decision-makers can be sensitized to the importance of wildlife monitoring through multiple avenues, such as (i) scientific communication (such as this document itself) and (ii) models of disease occurrence that are capable of anticipating outbreaks. SISS-Geo is being used in ongoing collaborative work with the Secretariat of Health Surveillance (Brazilian Ministry of Health) to generate data-driven Yellow Fever models.

  2. Regarding the second challenge, being an easy-to-use GIS-based tool that leverages citizen scientists as the primary source of data collection on wildlife health, SISS-Geo is not impaired by large territorial extension, nor is it dependent on sectoral policies and government health staff. The independent network of experts and laboratories takes care of alert confirmation; concomitantly, the active-learning approach proposed in Section 2.3.1 would hopefully minimize the human resources involved in this task.

  3. SISS-Geo has been designed from the beginning as a platform aimed at seamlessly integrating multiple agents and skills. Each agent, whether a citizen scientist, specialist, laboratory, or decision-maker, has a well-defined role in SISS-Geo. Citizen scientists and specialists can both upload records of observations to SISS-Geo; the difference is that records provided by experts tend to be more comprehensive and reliable. After that, the records are validated and, if alerts are predicted, the network of experts and laboratories is asked to confirm or deny them. From the confirmed/denied data, models that correlate factors and occurrence are built via machine-learning techniques with the aid of experts. Then, occurrence cases, alerts, and models are communicated to decision-makers, enabling them to make informed decisions on possibly imminent outbreaks.

  4. Obtaining, storing, and managing data properly is effectively accomplished in SISS-Geo thanks to its architecture, which includes a dedicated Web server, a database server, and HPC resources (Fig. 7). There are ongoing efforts to make the SISS-Geo mobile application capable of operating flawlessly in offline mode, since the lack of connectivity is very common in remote areas of the country.

  5. Finally, the fifth challenge, concerning the identification and prediction of risks from data as well as the extraction and communication of relevant information, is also part of the SISS-Geo workflow; its two aspects are tackled respectively by the tasks of modeling of disease occurrence (Section 2.3.2) and model interpretation (Section 2.3.3).