1 Introduction

An exponentially growing number of sensors, monitoring and data storage systems, multimedia sources and devices is generating what we identify as the Big Data phenomenon [1]. Big Data are large sets of data characterized not merely by their volume, but also by a higher degree of complexity and by the greater value that can be derived from them through innovative analysis techniques [2]. Beyond that, Big Data are characterized, in comparison to traditional Small Data, by higher velocity and variety [3]. This means that they are often generated in real time (velocity) and are composed of different kinds of data such as images, audio and text (variety) [3]. Because of their relevance for business, for example in forecasting statistical trends and deriving key indicators, companies currently invest more effort than ever in gathering, storing and evaluating them [4].

Machine Learning (ML) methods, in turn, can extract meaning from such Big Data [5]. Today they are already applied in very different fields by all kinds of private companies [6]. However, the establishment of free data stocks for ML analysis is failing because of the high costs and the comparatively low benefit for society. This complicates the role of university research and deprives universities of a strategic role in Big Data research [7].

For medical data, such a collection is not possible today, because the transfer of medical data is restricted by law and usually complicated. Without the explicit consent of the users, such a transfer would be highly unethical [8]. However, this view is not shared in all countries, and there are efforts around the world to loosen these regulations [9].

In order to unify the efforts of different actors and institutions in academic data gathering and annotation, we designed and prototyped a Web Application that makes use of two pieces of collaborative software. Collaborative software is defined as an application whose intent is to realize a shared purpose by dividing the effort among users [10]. This approach can generate solutions to common problems and thus create a solid base for innovation, resulting in a common innovation model [11]. Collaborative software that is not directed towards economic compensation and comes from unpaid development can be an essential part of a free innovation model, that is, a project in which the innovation designs are not protected by the developers [12]. Such contributions are motivated by self-reward, which is based on benefits other than compensated transactions, and sometimes by altruistic purposes [12]. Following this path, we were inspired by two medical projects created with the goal of freely sharing innovation designs and ideas: Patient-Innovation.com (Footnote 1) and Nightscout (Footnote 2).

To our knowledge, there is currently no platform for the extensive labeling of different medical images that keeps the resulting statistical analyses and results free and available for research purposes. Our paper presents an approach to developing such a platform and keeping the labeled images within the academic community. To this end, we propose requirements, a research design and prototypes for a crowdsourcing Web Application for labeling medical images, based on the free innovation and collaborative paradigm. We explore free cooperation among Web users, researchers and image donors as a gateway for enhancing the performance of ML algorithms on medical images. This will lead to better automatic segmentation and detection algorithms, improving clinical decision support systems and reducing human error in diagnostic and therapeutic evaluation [13].

In the following chapter of this paper, we first review the literature that helped us to explore the main scientific areas of the project: medical Big Data and their analysis, free innovation patterns and collaborative innovation, crowdsourcing, crowdsourcing for medical image segmentation, and its gamification.

In the third chapter, the core of our paper, we present our research design. We first outline, through a use case scenario, the interaction among actors and the requirements for our medical image segmentation application.

After that, we address the data gathering and storage requirements of the system and describe the functionalities of our distributed file system prototype. Finally, we briefly introduce the data analysis and algorithm evaluation step. In the fourth chapter we state the conclusions of the paper and the next steps for our project and research.

2 Literature

Table 1 summarizes the literature most relevant to our research design.

Table 1. Selection of literature, categorized by main topic of interest.

2.1 Medical Big Data

As a first step in our literature research, we deepened our knowledge of medical Big Data. We first focused on their potential for medical analysis, gaining an overview of their importance for clinical diagnosis and research [14]. In particular, we looked at the foundations for collecting medical images online, which is a sought-after topic in scientific research [15]. On the data collection side, the Internet of Things (IoT) and widespread sensors have been shown to be an important and pioneering step towards data gathering for multi-source analysis [16]. Regarding the more closely related topic of Big Data in the form of medical images, we evaluated papers that introduce their collection and analysis, which is the final purpose of the project [17, 18].

2.2 Free and Collaborative Innovation

On free innovation, the kind of innovation the project aims at, the research of Eric von Hippel provides a theoretical framework and pattern analysis [12]. This work analyzes the motivation of free innovators and, through a quantitative analysis of surveys on free innovation projects in Finland and Canada, provides insight into that motivation [12, 19]. Collaborative innovation, on the other hand, relates to cooperation in the prototyping and implementation of our project. It is a very effective method for obtaining better contributions to different tasks from other individuals, increasing the chances of excelling in a larger number of assignments [21]. At the same time, a collaborative innovation tends to diffuse more broadly, reaching a larger share of potential users [20]. As a result, collaborative work following the free innovation pattern can lead to an even better result, sharing the effort and the costs of design and development and introducing new and effective ideas [12].

2.3 Crowdsourcing Innovation

To broaden our perspective on how to perform the image segmentation, we looked into crowdsourcing innovation. For the development of free innovations, crowdsourcing makes it possible to bring additional expertise to the project, which could help to find better and more creative solutions to existing problems and features [22, 23]. The practical contribution of crowdsourcing to problem-solving tasks comes from the vast experience and the different backgrounds that a multitude of individuals bring along [22]. This mechanism explains the rise of crowdsourcing for solving specific tasks and long-term projects [24].

2.4 Crowdsourcing for Medical Images Segmentation

A general review of crowdsourcing for health-related tasks was useful for a first orientation within the technical aspects of this topic [26]. The first important foundation of our research is the effectiveness of crowdsourcing for medical image segmentation. Scientific work shows that a large crowd of non-experts can reach high accuracy in image labelling for particular medical tasks, comparable to the accuracy of expert medical doctors [27]. Crowdsourcing has shown its effectiveness for the segmentation of medical images in tasks that require the detection of particular entities in large numbers of images. In specific tasks such as mitosis detection in breast cancer, Convolutional Neural Networks trained on ground truth data generated by a crowd of non-experts have reached outstanding performance [25]. To collect a large amount of data and attract a large number of collaborators, a workflow based on a Web Application has previously been proposed; its online availability would increase the number of people contributing to the tasks [28].

2.5 Crowdsourcing, Gamification and Segmentation

To attract a crowd of users to accomplish medical tasks, gamification has proven to be a very effective strategy. This has been shown for medical students [29] as well as for crowds of non-experts [31]. The technique is a gateway to better user motivation, even for non-trivial tasks [30]. Regarding object segmentation, a statistical analysis of the crowd contributions and the filtering out of inconsistent segmentations dramatically improve the result of a non-expert crowd. As a consequence, the performance of non-experts comes very close to that of professionals [27]. Through semi-supervised object segmentation, it is possible to keep track of algorithm performance and to adapt the algorithms to the proposed task. Furthermore, a precise quantification of the time involved and of the difficulty level of the task helps to design a better gaming interface and to attract a larger crowd to the proposed online project [32].

3 Research Design

The research design is structured in four steps, corresponding to the main points of the research. To conduct the research in a structured way, we borrow concepts and guidelines from design science [33], in particular from design science in Information Systems research. The artifacts that we are going to develop present an innovative solution to an existing problem, and their utility will be evaluated in our specific domain from a technical and a business-related point of view [33].

3.1 Step I: Design and Planning

After a thorough literature review and the establishment of a structured theoretical background, the first step of the research is the design and planning of an evolutionary prototype for a Web Application [34]. The prototype serves the goal stated for the free innovation development: to create better automatic segmentation algorithms for medical images of organs. As the evolutionary prototype pattern implies, the prototype will be robust and will constitute a reliable basis for a final version of the application [34].

To explain the functionalities of the proposed software, we borrow the concept of use cases. Use cases are defined as a succession of events generated by actors, which reveal the dependencies and the functional structure of the software [35].

The three main actors identified in the Medical Monkeys Application are the Image Donor, the SolverGamer and the Researcher.

The Image Donor provides his or her own medical images to the Web Application. The donor donates the images for medical purposes and manages them, editing them or even excluding them from the game.

The SolverGamer participates in the segmentation game on the medical images. The system tests the player's performance and, if the accuracy is sufficient, the result is submitted and the data are collected and stored for further statistical evaluation.

The Researcher analyzes and interacts with the data submitted by Image Donors and SolverGamers. In the first case, the Researcher is in charge of evaluating the uploaded images to ensure that they are relevant for the research. In the second case, the Researcher collects and analyzes the results of the segmentation game.
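The accuracy check applied to a SolverGamer's result could, for instance, be sketched as follows. This is only a minimal illustration under our own assumptions: the paper does not specify the metric or the threshold, so the Dice coefficient, the reference mask (for example the output of an automatic algorithm or a known solution) and the names `dice`, `accept_submission` and `threshold` are hypothetical.

```python
from typing import Sequence

# 2D binary mask: 1 marks an organ pixel, 0 marks background.
Mask = Sequence[Sequence[int]]


def dice(a: Mask, b: Mask) -> float:
    """Dice coefficient 2|A∩B| / (|A| + |B|) of two equally sized binary masks."""
    overlap = size_a = size_b = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                overlap += 1
            size_a += 1 if pa else 0
            size_b += 1 if pb else 0
    if size_a + size_b == 0:
        return 1.0  # both masks are empty: treat as perfect agreement
    return 2 * overlap / (size_a + size_b)


def accept_submission(player: Mask, reference: Mask, threshold: float = 0.8) -> bool:
    """Accept a result only if it is close enough to the reference mask.

    The threshold of 0.8 is an assumed value, not one taken from the paper.
    """
    return dice(player, reference) >= threshold
```

A result that passes the gate would then be stored for the later statistical evaluation, while a rejected one would simply not be submitted.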

Given these actors, we can build a scenario for the use cases. A scenario is defined as an ordered set of interactions among partners, who are mostly represented by external actors and a given system [36]. The use case scenario lays out the succession of events (e.g. the work steps and interactions). Fig. 1 shows our use case scenario graphically.

Fig. 1. Use case scenario diagram.

The medical images received from the image donors are collected and categorized according to the organ they depict. These images will be inserted into the Web Application as a 3D model representation. The application will offer a game interface in which a crowd of users can take part in segmenting the donated organ images.

The gaming interface will be organized in levels and will motivate users with scoring and competition mechanisms. These mechanisms are meant to keep users focused on the task and to improve their performance. The levels will be structured by increasing difficulty, in order to stimulate solvers to become more accurate. The results of the labelling game will be collected and analyzed by the researchers. A growing number of gamers and donors will ensure a continuous improvement of the algorithms: the larger the number of images and of segmentation gamers, the more accurate the algorithms for organ image segmentation will be [27].

3.2 Step II: Web Application Implementation

As a second step, the tool will be implemented on a Web platform. From this point on, the Web Application will be available, and data from patients and game-solvers will be collected. During the initial part of the data collection, the gamer will play against automatic segmentation algorithms in order to obtain a score. The Web Application will gradually be implemented and deployed for mobile devices as well, which should increase participation by game-solvers. For each annotated layer the user earns points, which can then be posted on social media. A public high-score list should also allow the formation of groups, which in turn allows competition between institutions. Currently, the Web prototype can be found online, together with the informative webpage. Image Donors and Gamers can register and manage their own profile, images and account (Footnote 3). The segmentation game is still in development. We also maintain a public repository for our Web Application (Footnote 4).
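The planned high-score list with competition between institutions could be aggregated along the following lines. The sketch is purely illustrative: the point scale, the record layout and the names `POINTS_PER_LAYER` and `institution_ranking` are our assumptions, since the game itself is still in development.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical value; the paper does not specify how many points a layer is worth.
POINTS_PER_LAYER = 10


def institution_ranking(
    annotations: Iterable[Tuple[str, str, int]],
) -> List[Tuple[str, int]]:
    """Rank institutions by the total points of their members.

    Each record is (user, institution, number_of_annotated_layers).
    """
    totals: Dict[str, int] = defaultdict(int)
    for _user, institution, layers in annotations:
        totals[institution] += layers * POINTS_PER_LAYER
    # Highest score first; ties are broken alphabetically for a stable list.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


records = [("ann", "Uni A", 3), ("bob", "Uni B", 5), ("cy", "Uni A", 4)]
print(institution_ranking(records))
```

Grouping by institution rather than by user is what turns the individual score into a competition between groups; a per-user list would use the same aggregation with the user name as key.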

3.3 Step III: Data Collection and Storage

In the current step of our project, we ask patients for data donations after medical examinations. Image data are particularly suitable for this purpose, since they are often available to patients on a DVD. Patients send us their images together with an agreement certificate. Once the images are provided, we will store them in a private cloud service. Before dispatching the images, the donor attaches a TAN to the envelope; with this TAN the donor can later access the data online and revoke the usage authorization. Furthermore, the donor will get access to all research results obtained with the donated data. All scientific work based on these data will therefore be published as open access.

For a first implementation of the storage, we realized a fault-tolerant and scalable distributed file system [37] based on Hadoop and managed by a local application. It is used mainly for uploading and administering the medical images. Apache Hadoop enables Big Data to be stored, accessed and processed in a distributed way across clusters of commodity servers [38]. To provide an appropriate Hadoop-based architecture for storing and uploading the medical images, we first outlined the requirements for our architecture, as shown in Fig. 2.
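The TAN-based access and revocation could be handled roughly as in the sketch below. It rests on assumptions of ours: the paper does not describe how TANs are generated or stored, so the eight-digit format, the SHA-256 digest and the names `new_tan`, `tan_digest` and `DonationRegistry` are hypothetical.

```python
import hashlib
import secrets
from typing import Dict


def new_tan(digits: int = 8) -> str:
    """Return a random numeric TAN from a cryptographically secure source."""
    return "".join(secrets.choice("0123456789") for _ in range(digits))


def tan_digest(tan: str) -> str:
    """Digest under which a TAN is stored, so the TAN itself is not kept."""
    return hashlib.sha256(tan.encode("utf-8")).hexdigest()


class DonationRegistry:
    """In-memory stand-in for the storage of usage authorizations."""

    def __init__(self) -> None:
        # Maps a TAN digest to whether the usage authorization is still valid.
        self._authorized: Dict[str, bool] = {}

    def register(self, tan: str) -> None:
        self._authorized[tan_digest(tan)] = True

    def is_authorized(self, tan: str) -> bool:
        return self._authorized.get(tan_digest(tan), False)

    def revoke(self, tan: str) -> bool:
        """Revoke the authorization; return False for an unknown TAN."""
        digest = tan_digest(tan)
        if digest not in self._authorized:
            return False
        self._authorized[digest] = False
        return True
```

Because a short numeric TAN can be guessed by exhaustive search, a production system would likely combine this with rate limiting and a salted, deliberately slow hash; the plain digest here only keeps the sketch short.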

Fig. 2. Distributed file system requirements.

The local application we prototyped to fulfil these requirements has a lightweight frontend that interacts directly with the distributed file system. Overall, the system consists of three major components, which implement all necessary interactions with the data storage mechanism. Fig. 3 shows the components together with the interfaces they use.

Fig. 3. Local application connected to the distributed file system.

Both the frontend and the backend of the application interact with HDFS through WebHDFS [37]. The frontend uses a REST interface to initialize the backend processes. The architecture is implemented in JavaScript, and the following flowchart shows the steps necessary for storing and administering images in accordance with the given requirements (see Fig. 4). The implemented algorithm splits the images into tiles, which are compressed separately in order to speed up reading the images and displaying them online for gaming and administration purposes. The introduction of Hadoop allows data parallelism as well as algorithm parallelism in our artifact, which means faster write and read operations. For a deeper look into the implementation, see the repository (Footnote 5).
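To make the tiling idea concrete, the following sketch splits an image into tiles, compresses each tile separately and builds the WebHDFS URL under which a tile could be created. It is written in Python for brevity and is not the JavaScript implementation from the repository; the tile size, the use of zlib, the file layout and the names `split_into_tiles` and `webhdfs_create_url` are our assumptions. Only the URL scheme `/webhdfs/v1/<path>?op=CREATE` follows the WebHDFS REST API.

```python
import zlib
from typing import Dict, List, Tuple
from urllib.parse import quote, urlencode

# Grayscale image as rows of pixel values in the range 0..255.
Image = List[List[int]]


def split_into_tiles(image: Image, tile_size: int) -> Dict[Tuple[int, int], bytes]:
    """Cut the image into square tiles and compress every tile on its own.

    Keys are (tile_row, tile_col); tiles at the border may be smaller.
    """
    tiles: Dict[Tuple[int, int], bytes] = {}
    height = len(image)
    width = len(image[0]) if image else 0
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            rows = [
                image[y][left:left + tile_size]
                for y in range(top, min(top + tile_size, height))
            ]
            raw = bytes(pixel for row in rows for pixel in row)
            tiles[(top // tile_size, left // tile_size)] = zlib.compress(raw)
    return tiles


def webhdfs_create_url(host: str, port: int, path: str, overwrite: bool = False) -> str:
    """Build the WebHDFS URL for creating a file at an absolute HDFS path.

    A client sends an HTTP PUT to this URL, is redirected to a datanode,
    and then sends the file content in a second PUT to the redirect target.
    """
    query = urlencode({"op": "CREATE", "overwrite": str(overwrite).lower()})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
```

Since each tile is an independent compressed object, a viewer that needs only one region of an image can fetch and decompress that tile alone, and separate tiles can be written or read in parallel.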

Fig. 4. Local application flow chart: administering images.

3.4 Step IV: Data Analysis and Algorithm Evaluation

The last step focuses on the evaluation of the collected data and of the derived machine learning algorithms. Through statistical analysis, data coming from anomalous or unfocused players will be excluded from processing, which will improve data collection and analysis [27]. In this step, a ground truth for the segmentation algorithm will be formed from the valid values obtained from gamers. Given this ground truth, the accuracy of the ML algorithms will be measured and compared with state-of-the-art performance.
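One possible way of forming the ground truth and evaluating an algorithm against it is sketched below. The concrete statistics are our assumptions, as the paper leaves them open: players are dropped when their mask overlaps too little with the crowd consensus, the ground truth is a pixel-wise majority vote over the remaining masks, and intersection over union (IoU) serves as the accuracy measure. The names `majority_vote`, `iou`, `build_ground_truth` and `min_agreement` are hypothetical.

```python
from typing import Dict, List

# 2D binary mask: 1 marks an organ pixel, 0 marks background.
Mask = List[List[int]]


def majority_vote(masks: List[Mask]) -> Mask:
    """Mark a pixel as foreground if more than half of the masks mark it."""
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [1 if 2 * sum(m[y][x] for m in masks) > n else 0 for x in range(width)]
        for y in range(height)
    ]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union else 1.0


def build_ground_truth(submissions: Dict[str, Mask], min_agreement: float = 0.5) -> Mask:
    """Drop players who disagree with the crowd consensus, then vote again."""
    masks = list(submissions.values())
    consensus = majority_vote(masks)
    valid = [m for m in masks if iou(m, consensus) >= min_agreement]
    # Fall back to all masks if the filter would remove every player.
    return majority_vote(valid or masks)
```

An algorithm's output mask can then be scored with `iou(output, ground_truth)` and compared with the values reported for state-of-the-art methods on the same images.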

4 Conclusions and Future Work

ML algorithms for medical image segmentation are sensitive to the lack of large labelled training sets [39]. The reasons are mainly to be found in privacy policies [39] and in the missing labelling which, unlike for public internet images, cannot easily be performed by a non-expert crowd [40]. To solve this problem, we propose with Medical Monkeys a collaborative free innovation pattern that uses crowdsourcing to increase diffusion and, through extensive labeling, increases both the number of available medical images and the segmentation accuracy of the algorithms. This paper presented the research design, requirements and prototype of our crowdsourcing Web Application for medical image segmentation, developed following a free innovation pattern. We explored the possibility of free cooperation among Web users, researchers and image donors for improving the application, as a gateway for enhancing the performance of machine learning algorithms on medical images. This will lead to better automatic segmentation and detection algorithms, improving clinical decision support systems and reducing human error in the evaluation of anomalies [13]. The next steps of our research will explore effective ways of designing our segmentation game and of increasing the participation of the crowd. At the same time, we will continue the data collection from hospitals and the cooperation with institutions, in order to create a larger dataset of images.