1 Introduction

Artificial intelligence (AI) and machine learning (ML) have emerged as powerful tools to develop data-driven solutions for diverse real-world problems. Recent breakthroughs in machine learning have greatly inspired the surging adoption of AI capabilities for automation by embedding intelligence into modern software and services (Amershi et al., 2019). AI-based automated support now spans almost every sphere of human life: business, education, healthcare, research, communication, security, assistive technologies, and so on. With this diversity in application domains, the types of problems and the characteristics of the data vary greatly, and so do the ML algorithms. From an engineering perspective, once an algorithm is implemented, it requires a solid architecture, model/data validation, proper monitoring for changes, dedicated release engineering strategies, judicious adoption of design patterns and security checks, and thorough user experience evaluation and adjustment. A failure to properly address these challenges can lead to catastrophic consequences. Classically, we have constructed software systems in a deductive way, by writing down the rules that govern the system behaviors as program code. With machine learning techniques, we generate such rules in an inductive way from training data. This paradigm shift induces challenges that are unique to ML application development (Khomh & Antoniol, 2018; Khomh et al., 2018).

Recently, practitioners from leading software companies like Google (Sculley et al., 2015) and Microsoft (Amershi et al., 2019) have been reporting on their experience building ML-based applications and raising awareness of some of the challenges posed by ML application development. Sculley et al. (2015) outlined some challenges of ML application development by identifying harmful design patterns that may incur excessive maintenance costs. In addition to characterizing the challenges, they also made some suggestions on how to deal with them. Amershi et al. (2019) presented a survey conducted with developers from Microsoft, showing how AI application development aligns with a nine-stage development workflow. They outlined three fundamental differences between ML application development and traditional software development. They observed that data management for ML applications is quite complex compared to other types of software, and that model customization and reuse require specific skills. They also reported that AI modules are difficult to handle compared to traditional software components due to complex inter-component relationships and non-monotonic error behavior. Amershi et al. (2019) also suggested some best practices for software engineering of ML applications, focusing on data and model management and the interfaces between ML components and the overall system.

Fig. 1: Phases of ML workflow (adopted from Amershi et al. (2019))

Although these studies (i.e., Amershi et al. (2019), Sculley et al. (2015)) have provided valuable insights on the challenges of developing AI/ML applications at scale in the context of large companies, we still do not know how small and medium-sized enterprises (SMEs) handle ML application development. It is important to know the challenges and best practices of practitioners building ML applications across different domains and in diverse development settings. This paper aims to fill this gap by examining the experiences and collecting the insights of ML practitioners from across the globe with varying skills and experience and from diverse development domains. We present a survey of ML development practices and insights obtained from the feedback of 80 ML practitioners working in the software industry or in academia.

For the survey, we reached out to over 700 AI/ML practitioners by email. We communicated our request for participation in the survey using contacts from the professional network LinkedIn. We selected the participants based on their profile information indicating roles associated with AI/ML in academia or industry. We also collected participants’ emails from GitHub based on their contributions to ML projects. We retained responses from 80 participants with diverse technical and professional backgrounds. We analyze the survey data to derive insights and summarize them along the phases of the ML development workflow described in Amershi et al. (2019).

In this paper, we make the following contributions:

  • We conduct a comprehensive survey involving 80 ML practitioners from diverse backgrounds to identify the state of practices and challenges in ML application development.

  • Our survey covers four key phases of ML application development life cycle, namely (1) data collection and preprocessing, (2) feature engineering, (3) model building and testing, and (4) integration, deployment, and monitoring, to identify challenges and practices from practitioners’ perspective.

  • We synthesize 17 key findings and show how they can benefit researchers and practitioners in developing high-quality ML applications.

Practitioners embarking on new or ongoing efforts to develop ML-based applications can take advantage of the summarized best practices to improve the quality of these applications.

The remainder of the paper is organized as follows. Section 2 discusses some basic concepts of ML application development, common trends in ML applications, and their benefits and challenges. Section 3 presents the details of the survey, including its design, objectives, participants, and data collection and analysis methodologies. Section 4 presents the results of our survey. Section 5 discusses these results. In Section 6, we discuss potential threats to our methodology and findings. Section 7 presents some prior research related to our study, followed by the conclusions in Section 8.

2 Background

This section briefly presents some important concepts of ML application development. We also briefly compare and contrast traditional software systems and ML-based systems.

2.1 Machine learning applications

Traditional software systems are constructed based on a well-defined set of rules that govern the system’s behavior. In ML applications, however, the behavior is controlled by rules inferred from the data (Khomh et al., 2018). ML applications, as data-driven systems, have induced a paradigm shift in the software development process, making the development, testing, and verification of ML applications intrinsically harder. A defect in an ML application may come from the training data, program code, execution environment, or third-party frameworks. Given the increasing adoption of ML/AI, it is important to understand the challenges of ML application development and devise best practices. Since ML/AI is an emerging field, we believe that developers who are currently building ML applications are best positioned to reflect and report on the challenges and pitfalls of ML application development. Hence, in this paper, we conduct a survey of ML developers to document their experiences and formulate the best practices and challenges of ML application development.

2.2 ML application development life cycle

In our study, we consider the ML application development life cycle presented by Amershi et al. (2019) as shown in Fig. 1. We study practitioners’ perceptions of the challenges and common practices in ML application development. We briefly discuss the phases of the ML application development life cycle below. A more detailed discussion of the ML application development life cycle is available in Braiek and Khomh (2020).

2.2.1 Model requirements

In this phase, developers define the requirements for data and algorithms for the ML problem at hand. They need to identify relevant and representative data. This phase is very important since it has a significant impact on the success of the other phases of the ML workflow: selecting insufficient or biased data will likely lead to inadequate ML models. In this phase, developers also often have to mediate between conflicting goals, for example, ensuring high model performance while satisfying restrictions enforced by regulations governing the privacy and security of information (which often restrict access to some data). Regulations can also induce requirements on the models. For example, the General Data Protection Regulation (GDPR) enforces the right to explanation, which requires that ML models be explainable and interpretable.

2.2.2 Data collection and preprocessing

ML applications are data-driven and thus the collection and preprocessing of the data is important. In this phase, data is collected from internal or external sources (e.g., mainframe databases, sensors, IoT devices, and software systems) and comes in different formats (e.g., various media types). It can be structured (such as database records) or unstructured (such as raw text) and is delivered to ML models either in batch (e.g., discrete chunks from mainframe databases and file systems) and/or in real time (e.g., continuous flows from IoT devices or streaming REST APIs). Developers often have to leverage complementary automated tools that support batch and/or real-time data ingestion strategies to collect the data needed for training their ML models. Once data is collected, it often must be cleaned to ensure consistency and the absence of redundancies. Common data cleaning tasks include removing invalid or undefined values (i.e., Not-a-Number, Not-Available), duplicate rows, and outliers that seem too different from the mean value, as well as unifying the variables’ representations to avoid multiple data formats and mixed numerical scales. This preprocessing step is often done using data transformations such as normalization, min-max scaling, and data format conversion.
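For illustration, the sketch below shows what such a cleaning routine might look like with pandas and scikit-learn, assuming an all-numeric DataFrame; the function name, the z-score threshold, and the choice of min-max scaling are illustrative, not prescribed by any specific workflow.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def clean_numeric_frame(df: pd.DataFrame, z_threshold: float = 3.0) -> pd.DataFrame:
    """Drop missing values, duplicate rows, and extreme outliers, then rescale to [0, 1]."""
    df = df.dropna()            # remove rows with NaN / undefined values
    df = df.drop_duplicates()   # remove duplicate rows
    # keep only rows whose values lie within z_threshold standard deviations of the column mean
    z_scores = (df - df.mean()) / df.std(ddof=0)
    df = df[(z_scores.abs() <= z_threshold).all(axis=1)]
    # unify numerical scales with min-max scaling
    scaled = MinMaxScaler().fit_transform(df)
    return pd.DataFrame(scaled, columns=df.columns)
```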

2.2.3 Feature engineering

Feature engineering is the process of extracting informative features from the data that ML algorithms can learn from to build ML models. Features need to be able to represent the characteristics or patterns in the dataset. Once suitable features are extracted, it is also important to select the best subset of features for the models. This process is called feature selection. Extraction and selection of features comprise the feature engineering process. It is an essential step in the construction of conventional ML models. However, in the case of deep learning models, the features are inferred automatically. In fact, deep learning models build complex features automatically as a part of their statistical learning process from data. For example, conventional computer-vision models require image features, including edges, corners, and blobs that can be detected using low-level image processing operations, while convolutional neural networks process raw images directly.
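As an illustration of feature selection for a conventional (non-deep-learning) model, the sketch below uses scikit-learn's univariate SelectKBest on a bundled dataset; the dataset and the value of k are placeholders chosen only for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# a labelled tabular dataset (feature matrix X, labels y)
X, y = load_breast_cancer(return_X_y=True)

# keep the 10 features with the strongest univariate relation to the label
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # (569, 30) -> (569, 10)
```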

2.2.4 Model training and evaluation

During the training phase, a suitable machine learning algorithm is applied to the cleaned and prepared dataset. Different model parameters are tuned iteratively to learn the mapping between the features and the corresponding labels (in the case of supervised learning). Models are trained up to a desired level of accuracy. The trained model is then evaluated on a validation dataset to assess its performance, measured using a predefined set of performance metrics such as prediction or classification accuracy.
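A minimal sketch of this train-and-validate loop with scikit-learn is shown below; the bundled dataset, the classifier, and the split ratio are illustrative choices, and in practice hyperparameters would be tuned iteratively as described above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# hold out a validation set to measure generalization rather than training fit
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(max_iter=5000)  # a simple baseline classifier
model.fit(X_train, y_train)

val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {val_accuracy:.3f}")
```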

2.2.5 Integration, deployment, and monitoring

Once a trained and validated model is available, it is integrated into the target application to provide the desired functions. The application is deployed on suitable devices or platforms. Deployed ML models need to be monitored for performance and potential errors during real-world execution.

In case of errors or major shifts in the patterns in the data, the models may need to be retrained. Thus, the ML workflow is not as linear as Fig. 1 suggests; rather, the phases of the ML application development life cycle are iterative.

In our study, we focus on four phases of the ML workflow: (1) data collection and preprocessing, (2) feature engineering, (3) model training and evaluation, and (4) integration, deployment, and post-deployment monitoring (including model management). We do not cover the requirements engineering phase in this survey and plan to dedicate a future study to it, because requirements engineering for ML is quite complex (Belani et al., 2019; Vogelsang & Borg, 2019). ML engineering introduces a paradigm shift compared to conventional software engineering (Wan et al., 2019), and so does requirements engineering (Vogelsang & Borg, 2019). ML applications are likely to have both ML and non-ML requirements, and they are often developed as components interacting with other non-ML components to build large and complex systems. Functional and non-functional requirements, ML-specific quality trade-offs, and the interactions between ML and non-ML components all require different considerations, which makes requirements engineering for ML applications a challenging task. Ishikawa and Yoshioka (2019), in their recent study, listed requirements engineering as the most difficult activity in the development of ML systems. Our survey thus focuses on the four phases mentioned above and identifies the common practices and key challenges in each.

3 Study design

We conducted an online survey to understand the practitioners’ experiences in ML application development. We present the overall approach of the study in Fig. 2. We briefly discuss our study objectives and methodology as follows:

3.1 Objectives of the study

Our key objective in this research is to understand ML practitioners’ perceptions of the challenges and the state of practice in developing machine learning applications. Using an online survey, we ask developers questions on development activities encompassing different phases of the ML application development life cycle. Our key focus in this study is understanding the challenges and best practices in data collection and preprocessing, feature engineering, and ML model building, testing, and deployment. As ML applications are data-driven, we first focus on data processing and feature engineering. We aim to learn about the current practices in data processing and feature engineering, including the sources and types of data, data preprocessing activities, tools, and frameworks. We then focus on identifying the challenges and best practices in model building, testing, deployment, and post-deployment model maintenance.

Fig. 2: Schematic diagram of the study

Table 1 Research questions

3.2 Survey design

To conduct the survey, we designed an online questionnaire that ML practitioners could answer anonymously. The first and the third author prepared the initial design of the questionnaire based on the study of common practices and challenges reported in the existing literature (Amershi et al., 2019). The other authors then reviewed the survey questionnaire, which was updated based on the comments from all the collaborating authors. The questions in the questionnaire cover the development activities of different phases of the ML application development life cycle. In addition, we asked the participants to report their technical skills, experience in ML and software development, job roles, and the domains of their ML application development. The survey forms were made available to interested participants through a web page. As part of the survey design, we first conducted a pilot study to collect feedback on the survey questionnaire from ML practitioners. We shared our initial survey questionnaire with 10 randomly selected practitioners with at least five years of experience in ML application development, selected based on the experience shared on their LinkedIn profiles. We received anonymous feedback from three (3/10) participants on the questionnaire. All three hold a PhD and relatively senior positions (lead data scientist, senior ML engineer, ML research associate) in industry or in an academic ML research lab. We refined our questionnaire based on their feedback by adding/modifying questions and the types of questions (open/closed). The data from the pilot study was used only to improve and finalize the design of the questionnaire and is not included in the final survey data. We then communicated the updated survey questionnaire to the participants in the final study.

Table 2 Structure of the survey

The survey has three parts, as shown in Table 2. Part 1 collects demographic information about the participants, including the type of organization (e.g., industry or academia), job roles, skills, experience, and ML domains of expertise. Part 2 of the questionnaire focuses on challenges and practices in data collection, preprocessing, and feature engineering. Part 3 of the questionnaire asks the participants about their development practices, tools, technologies, and frameworks in ML model building, testing, and deployment. All sections contain both open-ended and closed-ended questions, as well as options for participants to add comments where applicable. All the questions collectively meet the data requirements necessary to answer the research questions we defined in Table 1 for this study. In addition, an informed consent form was available to the participants on the online survey page, outlining the detailed objectives and the privacy and data use policy of the study. All queries and concerns of the potential participants were clarified by email responses from the authors.

3.3 Data collection

To collect responses from machine learning practitioners, we emailed the online link of the survey to the prospective participants along with our research objectives and requested their participation. Interested participants submitted their responses anonymously using randomly generated participant identification numbers. After the survey deadline, we downloaded the responses of the participants. We used the participant IDs to track and analyze the anonymous survey data.

3.3.1 Selection of participants

We selected participants based on their self-declared profiles in the professional network LinkedIn. We also selected ML developers from the GitHub user community contributing to the development of ML applications. In both cases, we ensured that they are professionally attached to ML/AI application domains. For example, from LinkedIn, we selected users based on their employment in different roles related to ML/AI applications, such as AI/ML engineer/developer, data scientist, AI/ML researcher/scientist, software engineer, software architect, and PhD or Master’s student in ML or relevant areas. For GitHub users, on the other hand, we selected users from the lists of contributors to ML/AI projects. In either case, our focus was to reach out to potential participants with expertise and experience in developing ML applications. Once selected, we invited the potential participants by email to take part in the online survey. We gave the necessary details on the objectives, procedures, and policies of the study and asked for their consent to participate voluntarily.

We received responses from practitioners of diverse backgrounds. Of the roughly 700 potential participants we contacted, 81 respondents completed the survey, a response rate of about 11.57%. All 81 respondents answered Part 1 of the survey, 49 answered Part 2, and 44 answered Part 3. We excluded the response of one participant who only partially answered Part 1, so in the end we retained the responses of 80 participants for our analysis.

3.4 Data collection and analysis

Our survey was designed using Google Forms and was made available to the respondents through a provided web link. We collected the data once the survey period ended. We preprocessed the responses to remove formatting and minor linguistic differences before analysis. To answer the research questions, we analyzed the data to compute descriptive statistics. We then used visualization techniques to present the responses and gain better insights into the trends, similarities, and contrasts among different classes of responses. For qualitative analysis of the responses to open-ended questions, we applied grounded-theory-based coding (Stol et al., 2016; Charmaz, 2006) to categorize the challenges and practices in different phases of ML development. We assigned qualitative codes to segments of the participants’ responses, aiming to make analytic interpretations of their concrete statements in order to compare, categorize, and summarize the responses. We named (coded) each distinct segment of data to develop abstract concepts for interpreting that data segment; the coding links data to an emerging theory that aims to explain the data. We started with initial coding, open to all possible concepts, followed by more focused coding to organize and synthesize frequent initial codes. We performed theoretical integration during focused coding and in subsequent steps to pinpoint the most salient categories in the data. Two of the authors performed the classification independently with respect to the goals defined by the corresponding research questions. The authors resolved the disagreements observed in some cases by meeting in person to finalize the data classification. The classified data was further summarized through distribution analysis and visualization. Based on the analysis, we summarized the practices and challenges in ML application development as reported by the survey participants.

3.5 Privacy and anonymity

To ensure the privacy and anonymity of the participants, we did not collect any personal information. Each participant was assigned a randomly generated code to use as their user ID. We used cookies to track returning users so that the same user ID was assigned across the different parts of the online survey. Participants were able to access the privacy and data usage policy along with the consent form for voluntary participation. Participants’ data will be securely preserved for seven years. Participants were allowed to withdraw and request removal of their data at any stage of their participation.

4 Results

In this section, we present our results from the survey to answer the research questions. We also present our insights into the survey responses from the expert practitioners regarding the challenges and best practices in ML application development. We present our findings in the following subsections:

4.1 Demographic distributions

We summarize the demographic information of the participants as follows:

4.1.1 Background

Among the 80 respondents who completed the survey, 56 (70%) are from the software industry, 18 (22.5%) from academia or research, 1 (1.25%) reported both academic and industry affiliations, and 5 (6.25%) identified themselves with other affiliations (Fig. 3). The participants have diverse academic backgrounds (Fig. 4): 16 (20%) hold a PhD or above, 32 (40%) a Master’s, 31 (38.75%) a Bachelor’s, and 1 (1.25%) reported an “Other” level of educational qualification.

Fig. 3: Organization types of the participants

Fig. 4: Educational qualifications of the participants

Fig. 5: Job titles/roles of the participants

The participants hold diverse roles in their organizations (Fig. 5): 26 (32.5%) are AI/ML engineers, 18 (22.5%) data scientists, and 24 (30%) researchers, 10 (12.5%) of whom identified themselves as AI/ML research scientists. Besides, 9 (11.25%) of the participants are AI/ML developers/analysts, one (1.25%) is a software development intern, and 4 (5%) hold upper-level roles, including one chief AI officer, an ML software architect, a software team lead, and a deep learning manager. In addition, the participants include three (3.75%) PhD students, two (2.5%) Master’s students, and one other student. This diversity among the participants, comprising both researchers and practitioners, allows us to obtain a good representation of skills and experience at varying levels.

4.1.2 Professional experience

As shown in Fig. 6, the participants are highly experienced in software development, with 53.8% of them having a minimum of four years of experience. Among the participants, 35 (43.8%) have worked for five years or more in software development, 8 (10%) for four years, 9 (11.3%) for three years, and 19 (23.8%) for two years. Only 9 (11.3%) of the participants are relative novices with less than one year of experience. The participants have diverse levels of experience in machine learning (Fig. 7), with more than 80% of them having at least two years of experience in machine learning application development. Specifically, 13 (16.3%) participants have five years or more of experience in ML, 11 (13.8%) have four years, 12 (15%) have three years, 30 (37.5%) have two years, and 14 (17.5%) are relative novices with less than one year of experience in ML.

Fig. 6: Software development experience of the participants

Fig. 7: ML development experience of the participants

It is important to note that there is a drop in the percentage of participants in the higher experience categories. For example, participants with five years or more of experience drop from 35 (43.8%) for software development to 13 (16.3%) for ML application development. This could be explained by the migration of experienced developers from traditional software development to ML application development to adapt to the increasing AI/ML trends in the software industry. This is valuable to our study, as such participants have a wealth of knowledge and experience to compare and contrast traditional software development and ML application development, especially regarding the challenges and best practices.

4.1.3 Domains of expertise

The survey participants work on developing applications in diverse machine learning domains. Our survey data (Fig. 8) shows that image processing and natural language processing (NLP) are the two most common domains, with 45 (56.25%) and 44 (55%) participants respectively. Among the participants, 38 (47.5%) work in the area of predictive analytics and recommendation, while 31 (38.75%) reported working experience in clustering. Besides, 20 (25%) and 13 (16.25%) participants work on video processing and on speech and audio processing, respectively. Also, 3 (3.75%) of the participants use reinforcement learning (RL) in their ML applications, while other application domains of the participants include areas such as control and optimization, games, rendering and animation, security (anomaly detection), music generation, and biomedical engineering. Representation of participants from different application domains gives us the opportunity to obtain developers’ insights on the challenges and practices across diverse areas of machine learning and AI.

Fig. 8: ML domains of expertise of the practitioners

Participants have expertise in a diverse set of programming languages and technologies (Fig. 9). Among the participants, 77 (96.25%) are Python users, which shows that Python is a remarkably popular language among ML practitioners. Besides Python, we have 16 (20%) C++ users, 11 (13.75%) R users, 10 (12.5%) Java users, 8 (10%) MATLAB users, and 6 (7.5%) Scala users. In addition, a few participants claimed to use one or more of C#, CUDA, Stan, JavaScript, Node.js, and Clojure as their languages for ML application development.

Fig. 9: Programming languages for ML development

4.2 Trends in ML application development

Here, we report the current trends in developing ML applications in industry based on the practitioners’ responses. We focus on the types of ML applications software companies are developing, the software development methodologies they follow, and the ML frameworks and tools they use, to answer the following research question:

RQ1: What are the current industry trends in developing ML applications?

4.2.1 ML application types

Responses of the participants give an overview of the ongoing trend in the AI/ML industry regarding the types of applications developed (Fig. 10). We asked the participants to list the types of AI applications commonly developed in their companies. We observe that companies are developing diverse classes of AI-based solutions encompassing different aspects of daily life, business, education, health, communication, security, entertainment, research and innovation, social networking, and so on. Based on the survey, we observe that software companies are highly focused on developing AI-based solutions for business intelligence (29; 36.25%). This is reasonable given the ongoing trend in companies to leverage AI for improved products and services, customer clustering, product recommendations, and prediction and forecasting for business decision support. The practitioners are also involved in document processing (20; 25%), commonly based on the application of natural language processing. Companies are also developing solutions for entertainment (12; 15%), healthcare (9; 11.25%), education (7; 8.75%), security (7; 8.75%), and communication (6; 7.5%).

Fig. 10: ML application types

Besides, there has been a considerable focus on developing ML-based solutions for business, including e-commerce, finance, insurance, retail, and revenue management, as 10 (12.5%) of the participants reported these application types being developed by their companies. Another important application area the practitioners are working on is environmental data analysis and forecasting, as reported by 9 (11.25%) participants. Participants also reported building applications for social network analytics, control and automation (such as self-driving cars), and other areas of research and development in ML/AI, including computer vision, speech processing, and simulation. Our survey thus shows the diverse areas in which ML/AI is currently being applied.

4.2.2 Software development methodologies

As reported by the practitioners (Fig. 11), agile software development methodologies have been widely adopted in the software industry for ML application development. Among the participants, 52 (65%) report that they use an agile process for ML application development. Widely used agile process frameworks include Scrum (Schwaber, 1997), Kanban (Anderson, 2010), and Lean (Poppendieck & Poppendieck, 2003). Practitioners also reported the use of tools such as Jira and Zenhub for the management of the agile development process. Among the participating developers, 10 (12.5%) reported using other data- or test-driven development processes. A portion of the participants (18; 22.5%) reported that they do not use any specific development process for developing ML applications.

Fig. 11: Software development methodologies used for ML application development

As mentioned by the practitioners, although agile processes are the most commonly used, the development process is sometimes tailored to fit a specific application development context, e.g., “agile/scrum but tailored towards ML model development processes”. Some practitioners refer to their agile development process as “loosely organized agile” or “light agile” and “more explorative”. Depending on the context, some developers use either an ad hoc or an agile process for ML application development. They mentioned that “many smaller-scale models are prototyped on an ad-hoc basis with no formal project methodology. Medium and larger projects borrow agile techniques”. While many practitioners do not use a “specific development process”, some prefer data-driven, “feature-driven”, or “test-driven development” processes involving “unit testing, integration testing, devops (continuous integration and delivery)” for ML application development. Thus, we observe that practitioners mostly use agile methodologies for ML application development. However, the choice of development process may vary, and the process may need to be tailored to specific ML application development needs.

4.2.3 ML frameworks and tools

From the responses of the participants, we have a list of popular ML frameworks and tools widely used by ML practitioners (Fig. 12). Among the respondents, 58 (72.5%) use TensorFlow for application development, making it the most popular framework in AI/ML application development. Next, 53 (66.25%) of the participants reported using PyTorch, making it the second most popular ML framework, followed by Keras, a high-level ML framework built on TensorFlow, which is used by 44 (55%) participants. Among other ML frameworks, MXNet, scikit-learn, Caffe, and Deeplearning4j are reported to be used by 9 (11.25%), 5 (6.25%), 4 (5%), and 3 (3.75%) participants, respectively. Some participants also reported using frameworks such as Chainer, TensorFlow.js, caret, OpenCV, ML.NET, XGBoost, and MLlib for their ML application development. Note that each participant may use multiple frameworks for ML application development, so the counts of participants for different frameworks are not mutually exclusive.

Fig. 12: Commonly used ML frameworks for application development


4.3 ML data collection and pre-processing

Machine learning applications are data-driven, so it is intuitive that the quality of the input data is very important for the performance of ML models. Based on the responses from our survey participants, we compile the different processing tasks and common practices in ML data preparation, which reflect the state of practice adopted by ML practitioners. We summarize the common practices and challenges related to ML data processing as follows:

Fig. 13: ML data sources

4.3.1 ML data sources

Depending on the ML application domain, the types of data may vary widely, as well as their sources. Data can come in different forms such as text, images, videos, speech, business transactions, time-series data, and so on. Similarly, these data may come from different private or publicly available sources (Fig. 13). As mentioned by the survey participants, companies rely on one or more sources of ML datasets for their ML application development. One common source of ML data is the open-source datasets made publicly available by academic institutions, companies, and various tech and research communities (e.g., arXiv, Kaggle). As mentioned by the participants, companies also rely on internal company data for developing ML solutions, either for themselves or for others. Many software companies develop custom ML solutions for third-party clients based on client-supplied data about business transactions and users, or data collected from internal operations or even external environments using sensors over a certain period of time. ML data is also collected from online sources by web crawling and scraping.

To summarize, open-source datasets are the leading source of data for ML application development. Besides, private data and data from third-party clients are also common sources of ML data as reported by the practitioners.

4.3.2 RQ2: In practitioners’ perception, what are the important quality attributes of ML data?

ML models are data-driven, and so the quality of the data is important for the performance of the ML models and, consequently, of the applications containing them. We asked the practitioners about this important topic to learn which quality attributes ML developers focus on in practice when assessing data quality. We then compiled and classified the data quality attributes based on the responses of the survey participants. We list the key quality requirements of ML data, as pointed out by the practitioners, as follows:

Feature representativeness

In machine learning, the primary purpose of the data is to train the ML models. For this, the data must be representative of the necessary discriminative features to learn from. Thus, how well the data represent the characteristics capable of differentiating the hidden patterns in the data is very important. Practitioners thus emphasize “feature quality,” which requires “high discrimination between features.” This can be assessed by statistical measurements on the dataset such as “balanced distribution,” “high variance,” and “low correlation” among the features and with the “target” variable(s).
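As a rough illustration of such statistical checks, the sketch below computes per-feature variance and correlation with the target for a numeric pandas DataFrame; the function name and the assumption of a single target column are illustrative.

```python
import pandas as pd

def feature_quality_report(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Summarize per-feature variance and correlation with the target variable."""
    corr = df.corr(numeric_only=True)
    features = df.drop(columns=[target])
    return pd.DataFrame({
        "variance": features.var(),                     # very low variance -> weakly discriminative
        "corr_with_target": corr[target].drop(target),  # relation to the target variable
    }).sort_values("corr_with_target", ascending=False)
```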

Adequacy

ML models need an adequate amount of data samples for training. In practitioners’ words, ML models need “lots of samples with wide variation, equal(ly) distributed across fields/classes.” The adequacy of the data is hard to define and depends on different factors such as the data, the problem, the number of features, the number of distinct classes, and the ML algorithms.

Diversity

ML data need to have “diversity” regarding the coverage and distribution of data among the different classes present in the dataset. The practitioners emphasized diversity, mentioning that the data should contain “...samples with wide variation, equal(ly) distributed across fields/classes.” The practitioners also emphasized the “distribution of response variables, (and) distribution of each features.” They also mentioned “subject area coverage, sampling uniformity, sparsity, vocabulary coverage” as important characteristics that enhance the diversity of the dataset. Like other data quality characteristics, the different diversity factors and their importance may vary with the data, the problem, and the algorithms.

Labelling accuracy

Labelling accuracy is very important for the ML dataset. So, it is important to ensure that there are “no mislabelled data. The dataset should be treated with the utmost care, because a bad dataset means a bad model even if it’s trained well.” Practitioners thus emphasize data “quantity and correct labels.” The quality of the labels, i.e., the “reliability of (data) annotations,” is very important, and “the structure, accuracy and quality of information would play a large role in determining the importance of the ML data sets.” In practitioners’ words, “(labelling) consistency is very important; for a particular field I was working on a year ago, there were only two available data sets, but both of them had inconsistent labeling, which made them unusable.”

Completeness

Machine learning data need to be complete meaning that there should not be missing values in the data or at least there should be “enough data with minimum missing values.” Data samples with missing values are either dropped or some transformations are applied to fill in the missing values with the best approximate values.

Consistency

Like adequacy, consistency of the ML data is very important. The consistency of the data can be in terms of the correctness of values, data types, the format or structure of the data, or even the labelling. The practitioners thus focus on the “structure, accuracy and quality of information.” ML data need to be “consistent with inference data; be relevant for the model; be consistent with itself.” Consistency defines the suitability of the data for use in ML models.

Reliability

ML data need to be reliable, meaning not only the correctness and consistency of the data but also the reliability of the data source, data collection, and annotation procedure. The reliability of ML data can be validated by different cross-validation processes. The practitioners suggest checking whether “it (the data) has been verified by multiple sources.” This is especially crucial in health- and safety-critical domains, such as for “medical data.” For ML data to be reliable, practitioners expect that the “data is clean, well explained, come from good annotations. You (developers) also need to know how the data was generated.”

Noise level

ML data can contain noise due to missing or erroneous values and outliers. The data can be incorrect in terms of values or data types. Thus, ML data require different transformations and cleaning to remove noise and improve data quality.

Relevance

ML data need to be relevant for the problem, i.e., the data should represent the necessary characteristics, meaning the “existence of viable features” that ML models can learn from. Like other data quality requirements, the relevance of the data “depends on the problem.”

Class balance

For ML data, class balance is crucial for the accuracy of the ML model. For an unbalanced dataset, the model is likely to be biased toward the majority class, leading to poor accuracy, especially for the minority class. Practitioners recommend the data “samples (to be) well balanced across classes,” i.e., the data should be “equal(ly) distributed across fields/classes.”
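One simple way to inspect and mitigate class imbalance is sketched below using pandas and scikit-learn’s resample; random oversampling of minority classes is only one of several remedies practitioners mentioned (others include boosting and more sophisticated re-sampling schemes), and the function name is illustrative.

```python
import pandas as pd
from sklearn.utils import resample

def rebalance_by_oversampling(df: pd.DataFrame, label_col: str) -> pd.DataFrame:
    """Oversample every minority class up to the size of the majority class."""
    counts = df[label_col].value_counts()
    print("class distribution before:", counts.to_dict())
    majority_size = counts.max()
    balanced_parts = [
        resample(group, replace=True, n_samples=majority_size, random_state=0)
        for _, group in df.groupby(label_col)
    ]
    # concatenate the oversampled groups and shuffle the rows
    return pd.concat(balanced_parts).sample(frac=1, random_state=0)
```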

Distribution

ML data should have a balanced distribution across the classes and have “sampling uniformity.” Different statistical measures (i.e., descriptive statistics), variance, and correlation are commonly used by the practitioners to assess data relations and distribution.

Performance impact

One of the key concerns is how well the model performs based on the given training data. The quality of the data is thus also reflected in the performance of the model. For such quality assessment, practitioners often build a prototype model based on a subset of the data and measure model performance, e.g., “AUC ROC on test set.”
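A minimal sketch of this prototyping practice with scikit-learn is shown below: train a quick model on a small subset of the data and report AUC ROC on a held-out test split. The bundled dataset, classifier, and split sizes are placeholders for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# keep only a small subset of the data for a quick prototype
X_proto, _, y_proto, _ = train_test_split(X, y, train_size=0.3, random_state=0, stratify=y)
X_train, X_test, y_train, y_test = train_test_split(
    X_proto, y_proto, test_size=0.3, random_state=0, stratify=y_proto)

prototype = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
auc = roc_auc_score(y_test, prototype.predict_proba(X_test)[:, 1])
print(f"prototype AUC ROC on test set: {auc:.3f}")
```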

Low bias

There can be different sources of bias in an ML dataset. Biases can originate from human error or differences in perception, or the data may historically contain discrimination or biases. Such biases should be eliminated from the dataset as much as possible. Thus, the practitioners recommend that ML data be “diverse, not biased.”


4.3.3 RQ3: What is the state-of-the-practice regarding the data processing tasks, techniques and tools for quality assurance of ML data?

The quality of the dataset is one of the key factors that contribute to the performance of ML models. Here, we discuss the common data processing tasks, techniques, and tools used for quality assurance of ML data.

Data processing tasks

Practitioners may need to employ a series of preprocessing and transformation to ensure the desired quality of the data or ML models in turn. Based on the practices reported by our survey respondents, we can broadly group the data processing task into the following:

  • Data transformation: ML practitioners often need to apply different transformations to the dataset to prepare it for machine learning algorithms. These transformations may include simple corrective transformations, such as adjusting the data types or structure of the data. Data may also need more advanced transformations, such as reducing the dimensionality of the data while preserving its key characteristics or hidden patterns. ML data often require normalization and scaling to transform the values to a range suitable for ML algorithms. Another important quality attribute of ML data is class balance, which can affect model performance. In such cases, some practitioners reported that different boosting and re-sampling techniques are used to remove class imbalance problems in the ML dataset.

  • Data analysis: To analyze and assure the quality of ML data, practitioners employ different analysis techniques. The first step in quality assurance is to understand the dataset in terms of its distribution and basic trends. Practitioners commonly do a manual analysis to get a basic sense of the data characteristics. Another common approach mentioned by the practitioners is the visualization of the data. Common visualization techniques include the presentation of data using different charts and graphs. Some practitioners also use advanced visualization techniques such as t-SNE (van der Maaten & Hinton, 2008), which facilitates the visualization of multidimensional data in a more flexible and elegant way (a minimal sketch follows this list). In our survey, practitioners also reported that they use exploratory data analysis to evaluate data quality. This analysis helps to understand the common characteristics, categories, and trends in the dataset. Another common approach to data quality assurance is to perform statistical analysis or to cluster the data to understand the distributions and trends in the ML dataset. The analysis can be performed on randomly selected samples from the dataset or on the whole dataset. Another approach to assessing ML data quality is to build a prototype model based on a subset of the data and verify the model performance. The type and extent of analysis may depend on the problem, the data, and the specific objectives of the data analysis.
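As referenced in the list above, the following is a minimal t-SNE visualization sketch using scikit-learn and matplotlib; the bundled digits dataset and the perplexity value are placeholders for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64-dimensional image features with labels

# project the high-dimensional features to 2-D for visual inspection
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE projection of the dataset")
plt.show()
```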

Tools and techniques for ML data processing

The practitioners depend on different tools and techniques for ML data analysis. One common technique reported by our survey participants is manual inspection of the data. Manual inspection is a reliable technique, as the developers can take advantage of their domain knowledge to assess the quality of the ML data and perceive the common patterns in the dataset. ML data may also need to be annotated manually for categorization and labelling. However, manual analysis is likely to be costly and may suffer from scalability issues in the case of large datasets. Another approach commonly used by the practitioners is to visualize the dataset. As reported by the survey participants, the open-source tool Jupyter Notebook is widely used for data exploration and visualization. Practitioners also reported using other commercial data analysis tools (e.g., Kibana) for exploratory data analysis and visualization. Practitioners also reported using Apache Spark for ML data processing, especially in big data contexts.

Another technique used by the developers is to write custom scripts for data analysis and visualization using descriptive statistics, charts, and graphs. Custom scripts can also be used to fix missing and duplicate values, identify data type and value range inconsistencies, detect labelling errors, and check data structures or formats. One important point to note is that many practitioners reported that they do not use specific tools for data quality analysis and sometimes do not even check data quality, relying instead on the assumed quality of the data source. This reliance may fail to identify potential issues in ML data quality and may consequently lead to poor-quality ML models. Regardless of the tools and techniques used, domain knowledge plays an important role in their application and in the effectiveness of ML data quality assurance.
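A small sketch of what such a custom quality-check script might look like with pandas is shown below; the checks (missing values, duplicates, and mixed-type columns) follow the practices reported above, while the function name and return format are illustrative.

```python
import pandas as pd

def basic_quality_checks(df: pd.DataFrame) -> dict:
    """Report simple quality issues: missing values, duplicate rows, and mixed-type columns."""
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        # object columns holding more than one Python type often signal format inconsistencies
        "mixed_type_columns": [
            col for col in df.columns
            if df[col].dtype == object and df[col].dropna().map(type).nunique() > 1
        ],
    }
```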

Common practices in ML data processing

One of the key challenges of ML data collection and preprocessing is that the data and the necessary processing can be domain- and problem-specific. Thus, no single tool may fit all problems or data processing requirements. The responses from our survey participants also reflect the challenges of dealing with these variabilities. Overall, 76.6% of the participants mentioned that they do not use a very specific tool for ML data analysis. One of the key reasons is likely the above-mentioned fact that one specific tool is not capable of handling diverse data analysis requirements, and practitioners may use very domain- or problem-specific tools and techniques. It can also be explained by the limited availability of data analysis tools with comprehensive features covering data from diverse domains, as only 14.9% of the practitioners reported using specific data analysis tools. Besides, some of the participants reported that they rely on existing Python libraries and frameworks to develop their custom data analysis scripts. Thus, it is important to develop data analysis tools with comprehensive coverage of data analysis requirements in diverse problem settings.


4.3.4 RQ4: What are the challenges of ML data cleaning?

Cleaning ML data is an important data preprocessing step to remove noise from the ML dataset. Based on the practitioners’ responses we list the following challenges in ML data cleaning:

Generalization

Data cleaning, like most other tasks in the ML application development workflow, is hard to generalize, as it is usually “geared towards specific applications.” This is due to the inherent domain- and problem-specific variations in data, ML frameworks and algorithms, and even the target application platforms. This is also reflected in the practitioners’ responses; one respondent mentioned “There is no one-size-fits-all tool and probably will never be one.” Another respondent mentioned “...it is practically impossible to make a general tool, as it depends on the data and the problem at hand.” So, “they are not generalizable to different use cases, like text and images.” One common practice adopted by ML developers is to develop or customize their own data cleaning solutions, as mentioned by one respondent: “Sometimes they are not adaptive enough for my problems so I have to write my own.”

Scalability

Another key challenge in data cleaning reported by the survey practitioners is the “scalability to big data sets.” Most tools and techniques may suffer from scalability issues. This challenge is intuitively understandable, particularly because of the rapidly growing volume of ML data. The data volume may often exceed the processing memory (“scaling to multi terabytes”). Thus, practitioners need to devise custom techniques to process larger datasets on small-capacity machines under resource constraints; otherwise, data processing costs can grow due to large resource requirements.
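One common workaround, sketched below under the assumption of a CSV file and pandas, is to stream the data in fixed-size chunks so that a simple quality statistic can be computed without loading the whole dataset into memory; the file path and chunk size are placeholders.

```python
import pandas as pd

def summarize_large_csv(path: str, chunk_rows: int = 1_000_000) -> pd.Series:
    """Compute per-column missing-value counts without loading the whole file into memory."""
    missing_total = None
    # stream the file in fixed-size chunks so memory use stays bounded
    for chunk in pd.read_csv(path, chunksize=chunk_rows):
        counts = chunk.isna().sum()
        missing_total = counts if missing_total is None else missing_total.add(counts, fill_value=0)
    return missing_total
```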

Automation

Some practitioners feel the need for “automated analysis” in data cleaning and reported that current data cleaning techniques are “poorly automated”. However, practitioners are aware that “some tasks cannot be automated...” and recommend that “...rule-based and AI/ML techniques need to be applied to data cleaning itself.” This suggests that ML techniques can potentially be applied to automate data cleaning tasks, and data about the cleaning techniques applied in existing ML applications could likely be leveraged for this. Due to the diversity of data and problems, it is challenging to integrate data cleaning and processing tasks into the ML workflow, which further limits the automation of data cleaning and other preprocessing tasks.

Data quality

“Most data is noise (noisy)” and thus cleaning such data can be costly. Moreover, data can come from different sources and in different forms, and their levels of quality vary accordingly. For example, text data can come with different encoding schemes, while image data can be in different formats and of different quality. When the data is too noisy, the cleaning task becomes costlier and often impossible given the tools and techniques available. Data from companies are proprietary, and the structure of the data is likely to be driven by business or technical factors other than the application of ML.

Lack of standard

Another issue the practitioners commonly face is that there is no defined standard of “clean-data.” The cleanliness can be relative and may vary with data, problems, and algorithms. This makes it harder to devise robust techniques for data cleaning.

Efforts and costs

Data cleaning can be costly (“It’s very time and labour intensive”) and thus “requires a lot of efforts,” time, and computational resources. Also, data processing tasks can be highly iterative, and the continuous expansion of the data may trigger repetitive data processing, incurring high costs.

Lack of tools and features

As the data types and the required processing may differ widely from one problem to another, tools tend to offer domain- or problem-specific features. This limits the adaptability of tools to diverse ML data. The lack of features and the data dependencies limit the usability of data cleaning tools and techniques. The practitioners also mentioned “the high complexity of use” as a challenge to using data cleaning tools effectively. Also, “sometimes they (tools) are not adaptive enough for my (specific) problems...”, and this lack of flexibility further limits the use of tools for processing ML data of diverse characteristics.

Context and perception differences

From the responses of our survey participants, we observe a difference of perception on the challenges of data cleaning among the ML practitioners. This difference of perception is likely due to the different contexts in which they performed their data cleaning tasks. The response “Till now, from my usage experience, I didn’t find any limitations on the tools I have used during my projects.” is thus likely to represent a domain- and context-specific view of the respondent and may or may not apply to the development contexts of the other participating practitioners.

Requirement for domain expertise

ML data processing requires a clear understanding of the data, target problem, and algorithms. However, understanding the structure and semantics of the data from a particular domain may often require basic and sometimes advanced knowledge in the domain. For example, to process natural language texts in the mental health domain the ML practitioners are in “... need for expertise in linguistics and mental healthcare.” The requirement for domain expertise may vary across the domains and the type of the problem being addressed by the ML application.


4.3.5 RQ5: What are the challenges of data labelling faced by the ML application developers?

Data labelling is very important, as incorrect labelling affects model accuracy. However, labelling is a challenging task, especially when the data volume is large, and it is subject to different domain- and problem-specific requirements and constraints. Based on the responses from our survey participants, we identify several key challenges perceived by the ML practitioners as follows:

Data volume

One of the key challenges in data labelling reported by the practitioners is the large volume of data. Labelling commonly involves manual effort, and given the increasing volume of ML data, labelling can be very challenging due to the constraint of “large dataset vs limited human resources.” Because of the data volume, “the amount of work to be done can be overwhelming.”

Cost

“Labeling the data is quite hectic and time taking process” and thus may incur “high cash and labor costs.” “Also in some cases, labeling the data also requires a lot of knowledge of the field to which the data belongs.” However, “using experts for labeling is expensive” while “using non-expert for labeling results in low quality training set.” Again, “it is better to have opinions of several experts rather than a singleton labeling to avoid biased opinions and ensure (the) validity of the labeling.” This further increases the cost of the labelling of the ML dataset.

Required domain expertise

Data labelling often requires expert-level domain knowledge. For example, labelling data in medical imaging such as diabetic retinopathy requires “a super-skilled workforce, such as doctors to estimate the level of diabetic retinopathy from images.” However, “domain experts are hard to reach.” So, the requirements for domain expertise in ML data labelling not only make the labelling task challenging but also make labelling excessively expensive. However, the required level of domain expertise may vary across domains.

Automation

Labelling of ML data is mostly manual or “poorly automated.” “Manual labeling can be very frustrating and time taking” and thus “labeling is slow and expensive.” For automation, a standard procedure is necessary. However, due to the diverse variabilities involved, “coming up with a standard annotation procedure” is quite challenging. “It is very hard to make a criteria that can be validated by scripts or other automated tools.” Thus, although “human annotation is expensive,” there is a “lack of tools for annotation.” Also, automation requires ground truth for validation, yet there is a “lack of clear ground truth.”

Domain dependency

ML data and the labelling objective may vary widely across different application domains. As the practitioners claimed, data labelling “becomes more difficult is(as) the dataset is domain specific.” Thus, the data labelling criteria and the required expert-level knowledge are also very domain specific. For example, “for legal documents, lawyers are best suited to annotate.” This domain dependency limits the transferability of human labelling experience to other domains.

Biases

One of the key challenges in ML data labelling is potential biases or inconsistencies. There are different sources of possible biases or errors in data labelling. As data labelling is manual in most cases, “discrepancies among humans,” i.e., differences in knowledge and perception among labellers, can introduce labelling biases or inconsistencies due to “subjectivity,” as “everyone has their own point of view.” Again, the annotators may “have no understanding of the importance of the quality or lack proper training, so the labeling is inconsistent.”

Data quality

The quality of the data also has an impact on data labelling. Too much noise in the data and incompleteness due to missing values can affect labelling. There can also be multiple plausible labels for a single data point, and avoiding label confusion requires clear labelling guidelines.

Reliability

Assuring the reliability of ML data labelling is another challenging task. Due to the large volume of data, labelling is likely to require team effort, and differences in perception among team members may result in inconsistent labelling. Again, due to the overwhelming volume of the task, “ML researchers often rely on third-party annotations.” However, as mentioned, “third party annotators have no understanding of the importance of the quality or lack proper training, so the labeling is inconsistent.” The practitioners also claimed that often “workers are not trust-worthy,” so it is challenging to make feature labelling reliable. Practitioners recommend cross-validation by multiple labellers as a remedy for the reliability risk of feature labelling.
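To make the idea of cross-validating labels across multiple labellers concrete, the following sketch (our illustration, not a procedure reported by any participant) quantifies inter-annotator agreement with Cohen’s kappa using scikit-learn; the labels and the 0.6 threshold are hypothetical examples.

```python
# Illustrative sketch: quantifying agreement between two hypothetical annotators
# with Cohen's kappa before trusting a labelled dataset.
from sklearn.metrics import cohen_kappa_score

labels_annotator_a = ["spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham"]
labels_annotator_b = ["spam", "ham", "ham", "spam", "ham", "spam", "spam", "ham"]

kappa = cohen_kappa_score(labels_annotator_a, labels_annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values near 1.0 indicate strong agreement

# A low kappa (e.g., below 0.6) suggests revisiting the labelling guidelines or
# annotator training before using the labels downstream.
```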

Lack of guidelines

One of the important challenges the practitioners mentioned is that there are no comprehensive guidelines for feature labelling. Practitioners expressed the “...need to have a strict guideline for labelling.” However, “it is very hard to make a criteria that can be validated by scripts” given the variability involved in specific ML domains.


4.3.6 RQ6: What are the common approaches used by ML practitioners to validate data labelling?

Accuracy of ML data labelling is very important, as incorrect labelling can have adverse effects on ML model performance. It is thus important to know the current practices of ML developers for testing data labelling accuracy. Based on the responses of our survey participants, we see some common practices used by the ML practitioners. As there is a lack of automated tools for labelling validation and it may require manual checking by domain experts, manual investigation is mentioned by most practitioners (35 (71.42%)) as the approach for label validation. Besides, 22 (44.9%) practitioners reported that they use automated tools or scripts for testing data labelling accuracy. However, in many ML domains, explicit data labelling is not required; rather, the neural networks are capable of learning directly from the data. Some practitioners build a prototype model on a subset of the data and evaluate the data labelling accuracy based on the performance of the prototype model. In both automated and manual label validation, domain knowledge plays an important role.


4.4 Feature engineering

Feature engineering is one of the key steps in the ML workflow unless features are learned automatically from the data, as in deep learning. In our survey, we asked the practitioners how they do feature engineering to learn about the current state of practice. We report on the tools, techniques, and challenges of feature engineering in the following subsections:

4.4.1 RQ7: How do ML practitioners identify class-imbalance in ML data and how do they ensure class-balance?

In machine learning, especially in the supervised learning context, class imbalance is a crucial issue as it can severely affect the performance of ML models. Thus, it is important to identify class imbalance and to take the necessary steps to balance the ML dataset across its representative classes. The first step in resolving class imbalance is to detect it. ML practitioners have shared different practices they use to assess and identify class imbalance in the dataset.

Identification of class-imbalance

We list the common approaches for identifying class imbalance as follows:

  • Statistical analysis: One of the commonly used practices to identify class imbalance in an ML dataset is to statistically analyze the dataset. Descriptive statistics and the class distribution can indicate how well data is balanced across different classes (see the sketch after this list).

  • Data visualization: Visualizing data is another approach widely used by ML practitioners for testing class balance in the dataset. In a practitioner’s words, “testing (class balance) can be done by statistical approach or by visualizing the data on a graph which will show how balanced the data is.”

  • Sampling and analyzing data subset: Another approach to testing class balance is to analyze a randomly sampled subset of the data, as mentioned by a practitioner: “(we) randomly sample a portion and calculate the class distribution in it.”

  • Manual verification: Practitioners reported using manual analysis to check the balance or distribution of the ML data. However, manually checking data can be time-consuming and costly and may not scale when the data volume is large.

  • Model performance: Some practitioners mentioned a reactive approach to testing class balance. Here, instead of balancing data proactively before training, the approach is to first observe the impact of the imbalance on the model and then balance the data if necessary. One practitioner mentioned first building “baseline models like linear regression and logistic regression to see how good/bad the imbalance affects the predictions.” Similarly, another practitioner described the approach as to “evaluate sensitivity of the model at the end on generated data.” Some practitioners also reported measuring model performance based on k-fold cross-validation to assess the impact of class imbalance.
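As a concrete illustration of the statistical and visualization checks listed above, the following sketch (ours, with a hypothetical label column and an arbitrary threshold) inspects the class distribution of a dataset with pandas and flags a potential imbalance.

```python
# Illustrative sketch: checking class balance with descriptive statistics and a
# simple bar plot. The DataFrame and its "label" column are toy examples.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"label": ["pos"] * 900 + ["neg"] * 100})  # synthetic data

counts = df["label"].value_counts()
ratios = counts / counts.sum()
print(ratios)                        # e.g., pos 0.9 / neg 0.1 -> clearly imbalanced

if ratios.max() / ratios.min() > 3:  # arbitrary threshold, for illustration only
    print("Warning: dataset looks imbalanced")

ratios.plot(kind="bar")              # quick visual check, as practitioners suggest
plt.show()
```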

In some ML domains, no explicit data labelling is necessary, or the data is naturally imbalanced depending on the domain; as several practitioners also mentioned, the class imbalance issue may not apply in those contexts.

Techniques for class-balancing

Based on the responses from the practitioners, we list some commonly used techniques for balancing a dataset as follows:

  • Data resampling: Resampling is a commonly used technique to balance an ML dataset. To balance data, either the minority class can be up-sampled or the majority class down-sampled; the balanced data can then be used to train ML models. Practitioners reported the use of techniques like SMOTE (Chawla et al., 2002) or ADASYN (He et al., 2008) (see the sketch after this list).

  • Stratification: Practitioners also reported using stratification techniques to understand and balance the data. A practitioner stated: “we write our own stratification solutions, based on the descriptive analysis.” Practitioners also mentioned applying a stratified split of training and test data to reduce the impact of class imbalance. One practitioner additionally mentioned data augmentation, stating that s/he “...will utilize data augmentation methods if class balancing does not satisfy requirements,” without specifying a particular augmentation technique.
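The resampling and stratification approaches above can be illustrated with a short sketch using imbalanced-learn’s SMOTE and scikit-learn’s stratified split; the data is synthetic and the parameters are examples only.

```python
# Illustrative sketch: SMOTE over-sampling combined with a stratified
# train/test split. Synthetic data; not a recommendation for any problem.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))                      # roughly 900 vs 100 samples

# A stratified split keeps the class ratio identical in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Up-sample the minority class in the training set only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
print("after SMOTE:", Counter(y_res))             # minority class raised to parity
```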

The requirements and the techniques for balancing data are likely to depend on the particular ML context regarding the problem domain, the data, and the ML algorithms.


4.4.2 RQ8: What are the feature engineering techniques and tools commonly used by ML developers?

Based on the responses from the participants, we identify the following techniques and tools for feature engineering commonly used by the ML practitioners:

Manual analysis

Based on the data and the ML problem domain, features may need to be “hand crafted.” As one practitioner mentioned “I need to learn about how the data is generated and formulate the features that will be useful.”

Custom programming

Features for ML models are likely to be problem dependent; thus, ML practitioners decide on features based on their domain knowledge and devise custom techniques for feature extraction, as reflected in the response “Domain knowledge remains my favorite tool, I understand the problem and read the current research to devise the best features.” User-written custom scripts, in addition to existing libraries and tools, can also be used for feature extraction, as mentioned by practitioners: “we use custom scripts and functions in addition to NLP libraries like nltk, spacy and tensorflow.”

Using libraries and frameworks

One widely used practice among ML developers is to build feature extraction functionality on top of available libraries and frameworks. Based on the responses of the survey participants, scikit-learn is a very widely used ML library for feature extraction. Other frequently used Python libraries include pandas, numpy, scipy, and networkx, while for natural language processing (NLP) tasks nltk and spacy are among the widely used libraries. R- and Matlab-based libraries are also used by the practitioners for ML feature extraction. For the computer vision domain, practitioners reported using CNNs or OpenCV for image feature extraction. The practitioners also reported that they use frameworks like TensorFlow and PyTorch.
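As an example of library-based feature extraction in the NLP setting mentioned above, the following sketch uses scikit-learn’s TfidfVectorizer on a few toy documents; it is an illustration only, not a method reported by any specific participant.

```python
# Illustrative sketch: extracting TF-IDF features from toy documents with
# scikit-learn, one of the libraries practitioners reported using.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the model failed to converge",
    "training converged after ten epochs",
    "the dataset is imbalanced",
]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)             # sparse document-term matrix

print(X.shape)                                 # (3, number_of_features)
print(vectorizer.get_feature_names_out()[:5])  # first few extracted features
```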

Using feature engineering tools

Some practitioners have reported the use of available tools for feature extraction. For example, one respondent mentioned the use of Data-Miner and then Featuretools for feature extraction. Some other tools reported by the participating practitioners are DataVec, Weka, and XGBoost for feature importance, and PCA and Fourier transformation for dimension reduction. However, the use and suitability of tools vary with the problem domain and the data.

No feature engineering

Practitioners using deep learning–based techniques do not require a specific feature extraction technique, as the network is expected to learn features from the data.


4.4.3 RQ9: What are the common limitations of the existing feature engineering tools and techniques?

Although tools and methods for feature extraction are very useful for ML practitioners, only a limited number of tools are available, and existing tools do not cover the diverse requirements of feature engineering. The participating ML practitioners mentioned different limitations of the existing feature engineering tools and techniques as follows:

Generalization

Generalization is one of the key limitations of the existing feature extraction methods and tools since “every problem is different.” And “for most applications, the shape of the data that has to be input to the model is highly specific to the internal company data that we cannot use out of box tools to help us easily automate feature engineering.” So, due to the inherent variabilities in ML problems, data, and algorithms, it is very challenging for any feature engineering tool to generalize for diverse problem contexts.

Scalability

Another limitation of the feature extraction tools pointed out by the survey participants is that “they (tools) are not scalable” and thus “hard to visualize quickly and efficiently on large data sets.” One practitioner noted that the problem with “the current feature engineering tools is that they are not well parallelized, and when other parallelization libraries are introduced like multi-processing many problems arise causing me to edit the library and fundamentally having to change the code.” This affects performance and limits the applicability of the tools, especially for processing larger-scale data.

Automation

Practitioners also mentioned that feature engineering tools “are not automated enough” and thus there is a “need for supervision.” Moreover, “automated feature engineering often comes short to domain knowledge, automating the latter is a difficult problem.” Also, as feature engineering tasks are likely to be problem dependent and tools “are mainly limited to tasks and types of data sets,” they can be “very difficult to fine-tune manually” for diverse problems, data, and algorithms.

Domain knowledge requirements

Like other phases of the ML workflow, feature engineering requires “too much expert knowledge” in the associated problem domain. Also, “some problems require domain knowledge, that is difficult to translate in a general way to an open-source tool.” The requirement for domain knowledge limits the usability of the tools or methods to experts in a very specific domain.

Adaptability

As mentioned, feature engineering for ML applications is likely to be problem and data specific. “Sometimes (the tools are) not exactly what you (practitioners) look for.” The tools for feature engineering are thus “limited to tasks and types of data sets.” Tools designed for a specific problem, dataset, or type of data may not be easily adapted to other problems or data. Feature engineering tools are not flexible enough “to fine-tune” for a specific problem and dataset, or there are “difficulties with setting specific properties” to accommodate new datasets. Again, feature engineering tools are usually equipped with a static set of features, and “they do not learn, it’s a fixed set of algorithms,” so they struggle to remain robust to diverse data and problems. As the tools are tied too closely to the problem domain and data types, they may not “scale, (and) guarantee performance on different platforms.”

Usability

Another important issue with feature engineering tools or methods is the lack of “simplicity.” This may result in poor usability, a steep “learning curve,” and the need for expert knowledge in the domain. There is also a lack of “versatility and good documentation” to use the tools effectively.

Feature evaluation

It is also important but difficult to evaluate the quality or performance of the resulting features from a feature engineering tool. “There is no integrated solution to test how the features perform with a given set of model architectures. Most of the time, we still need to do this manually (with grid search or Bayesian optimization).”

The practitioners also mentioned diversity in data types, the tuning of too many tool or method parameters, and limitations of the implementation language as some of the challenges in feature engineering. One practitioner mentioned the encoding of user behaviour as an important challenge, arguing that “User behavior is inherently difficult to encode in any feature space, but it’s the domain of interest for much of anomaly detection in cyber security.” The same practitioner also expressed the expectation: “I’d like to see more industry-ready research in this field.”

It is also interesting that the perception of the limitations of feature processing tools varies widely among the practitioners. This variation is likely due to differences in the practitioners’ ML application development contexts, or it might be due to a lack of awareness of the needs and challenges in other ML development scenarios. On the one hand, many practitioners pointed out the limitations listed above; on the other hand, some practitioners do not see any limitations in the existing tools, mentioning “None (no limitation), they are great,” “I didn’t see real problems,” or even “no idea.” Thus, it is important to identify domain-specific challenges in feature processing to make practitioners aware of the challenges within and beyond their domain of ML expertise.


4.4.4 RQ10: What is the state-of-the-practice in feature quality assessment in ML application development?

Once features are extracted, it is important to validate their quality, since poor feature quality is likely to negatively affect the performance of the model. The survey participants shared their practices for assessing feature quality. Based on the responses, we observe the following common practices for feature quality assessment:

Statistical analysis and visualization

One common practice mentioned by the ML practitioners for feature validation is to apply different statistical analyses to the features. Statistical techniques include computing the correlation matrix of the feature columns, measuring mutual information and variance, and performing statistical tests to understand the distribution of and relationships among different features. ML practitioners also assess feature quality “through visualization” of the features. Practitioners also validate feature quality by “estimating similarities between feature vectors to make sure they stay consistent.”
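To illustrate the statistical checks mentioned above, the sketch below computes a feature correlation matrix with pandas and mutual information with scikit-learn on synthetic data; the column names are placeholders.

```python
# Illustrative sketch: correlation matrix and mutual information as quick
# statistical checks on feature quality. Data is synthetic.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
features = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

corr = features.corr()        # highly correlated feature pairs may be redundant
print(corr.round(2))

mi = mutual_info_classif(features, y, random_state=0)
print(dict(zip(features.columns, mi.round(3))))  # near-zero MI -> weakly informative
```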

Feature validation by model performance

Instead of proactively assessing feature quality before ML model training, many practitioners rely on the resulting model performance. Practitioners “train model(s) and test them,” as “mostly a good enough model justifies the data.” This is based on the strategy that “we (practitioners) don’t care if they (features) are representative as long as the downstream performance is good.” Commonly used model performance metrics are accuracy, precision, recall, and F-score, measured “by running accuracy tests on the model with respect to the validation set.” The validation can also be done through k-fold cross-validation of the model. Before training models on the whole dataset, one approach practitioners use is to build a prototype model “by running machine learning algorithms on subsets and comparing performance” to estimate model performance and feature quality. Another approach to feature evaluation is “running baseline algorithms such as Logistic Regression and see how they bare on each set of features.” Practitioners may also “compare performance of multiple models” to evaluate the corresponding feature sets.
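A minimal sketch of such performance-based validation, assuming two hypothetical candidate feature sets, could compare a baseline logistic regression on each set via cross-validation:

```python
# Illustrative sketch: judging two candidate feature sets by the cross-validated
# performance of a baseline logistic regression. The feature sets are hypothetical.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
feature_sets = {"first_five": X[:, :5], "last_five": X[:, 5:]}

for name, X_subset in feature_sets.items():
    scores = cross_val_score(LogisticRegression(max_iter=1000), X_subset, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")

# The feature set yielding better downstream performance is kept, following the
# "good enough model justifies the data" strategy quoted above.
```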

Feature validation by feature selection

Another approach the practitioners reported for feature validation is to select individual features incrementally (forward selection) and observe the model performance to decide on the inclusion or exclusion of each feature. Alternatively, modeling can begin with all the features and then gradually eliminate them (backward elimination) based on the resulting model performance, to find the best feature subset. Given that the number of features can be high and the training process can be costly, one practitioner pointed out the limitation that “since my own capacities are limited, I would use forward/backward selection if I can rapidly train a model.”
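scikit-learn’s SequentialFeatureSelector is one possible implementation of such forward or backward selection; the sketch below is an illustration with synthetic data, not the exact procedure any participant described.

```python
# Illustrative sketch: forward feature selection with scikit-learn's
# SequentialFeatureSelector on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="forward",   # "backward" performs backward elimination instead
    cv=5,
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```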

Domain knowledge–based feature validation

Practitioners also select and validate features based on their domain knowledge. One practitioner mentioned: “I would rely on my good judgment. Would that feature help me, as a human, make a prediction.” Similarly, another practitioner described the importance of domain knowledge in feature selection as follows: “Imagine that you are the model, and ask yourself ‘Am I able to predict the outcome given only these information only?’ If the answer is yes, the features represent the characteristics of the dataset.” Domain knowledge can also be useful in the manual inspection of the features and the model performance.

No feature validation

Depending on the domain and the type of problem and data, practitioners may not always need to validate features. For example, one practitioner from the NLP domain mentioned “I have used the traditional feature extraction methods and have never looked to validate them.” Also, in deep learning, explicit feature processing and validation may not always be required.

Besides, practitioners also adopt domain-specific techniques for validating feature quality. For example, in the case of generated (synthetic) features, one way to evaluate a generated feature is to measure “the distance of the artificially generated samples and the real distribution.” Some practitioners also reported using “model interpretability/explainability (such as LIME, SHAP)” to assess the model and thereby the quality of the features used in those models.


4.4.5 RQ11: What are the common practices for feature selection in ML application development?

Once features are extracted from the data, one key task is to select the subset of features that best represents the data characteristics. The practitioners were asked to share their adopted practices in feature selection. As in other phases of the ML application development workflow, domain knowledge plays an important role in feature selection. Around 75.5% (37/49) of the practitioners reported that their feature selection is based on domain knowledge. Besides, 63.26% (31/49) of the participating ML practitioners mentioned that they do feature selection based on statistical analysis and visualization of data and feature correlation. Also, 51.02% (25/49) of practitioners reported the use of automated feature selection tools and techniques. Among the practitioners, 40.8% (20/49) mentioned that they take an incremental approach, adding one feature at a time and evaluating the model performance to decide on that particular feature; alternatively, all features can be added and then some of them removed gradually to find the best subset. It is to be noted that the above counts are not mutually exclusive; rather, practitioners use feature selection approaches based on the specific context of the problem and the associated data.


4.5 Model building

Building ML models comprises the implementation and training of the ML models. After implementation, models are trained on the training data until a certain quality, measured against selected quality metrics, is achieved. We discuss the common practices in ML model implementation and testing as follows:

4.5.1 RQ12: What are the practices for ML model implementation commonly adopted by the practitioners?

There are a number of popular libraries and frameworks for model implementation in different programming languages, supporting different application domains and platforms. Based on our survey responses, we observe that the implementation of ML models is primarily based on existing ML libraries and frameworks. About 93.18% (41/44) of ML developers reported that they depend on ML libraries and frameworks for implementing ML models. About one-third (31.81% (14/44)) of ML practitioners reportedly implement their ML model training code from scratch rather than relying solely on ML libraries. Practitioners also mentioned using their own custom auto-ML systems for ML model training. It is to be noted that the above distributions are not mutually exclusive, and developers are likely to adopt the implementation strategies that best fit their ML development contexts.


4.5.2 RQ13: What is the state-of-the-practice for ML model implementation testing by ML practitioners?

As mentioned earlier, one of the important and challenging tasks in ML model implementation is testing the implementation. Given the expected challenges, we were particularly interested in knowing the state of practice of ML developers, i.e., how they validate their ML model implementations in real-world development scenarios. Based on the responses of the practitioners, we identify the following practices that ML developers adopt for testing ML model implementations:

Performance-based testing

One common practice that the practitioners follow to test ML model implementation is to evaluate the model performance. The performance is tested based on the known validation set. Practitioners test models “mainly through measuring the performance on the test dataset, or through cross-validation.” Based on the responses of the practitioners, we observe the following practices commonly used for performance-based testing of ML models:

  • Sanity checks: Developers do some sanity checks by running inference on random data samples, checking for corner cases, or overfitting the models on a small data subset. A quick visualization of outputs, i.e., “metrics, train/test loss curves, etc.,” can also be useful. The models can be tested “on simple crafted data” or using a “small dataset of known cases.”

  • Performance on benchmark datasets: One approach to testing an ML implementation is to measure performance on some well-known benchmark datasets. For example, one participant suggested that the testing of ML models be done “by running on classical data sets such Iris or Boston and infer correctness based on the results.” Thus, domain-specific datasets can be used for the evaluation of ML model implementations.

  • Performance compared to baseline models: A comparison of model performance with a known baseline model can also be used for model testing. One practitioner suggested to “compare its accuracy to the simple baseline model that you are sure you cannot mess up. If it’s better or the same then you are probably implementing it correctly...” (see the sketch after this list).

  • Cross-model testing: ML developers also compare model performance with other models of different configurations to identify possible issues. Practitioners do this “by monitoring the model parameters and also the model prediction errors.”

  • Cross-algorithm testing: Developers can also test the model by comparing its performance with models based on other algorithms. Practitioners “check results compared to other methods, observe the prediction” to verify the model implementation.

  • Cross-language testing: Practitioners also reported comparing models implemented in different languages for testing. The strategy is to use “established examples, area-specific toy examples, sometimes compare with implementation in another language” or to “compare with other libraries.”

  • Cross-platform testing: One of the strategies the ML developers take to test model implementation is to compare the model performance in different platforms or by comparing “against one or more other frameworks.”
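As one concrete form of the sanity checks and baseline comparisons above, the sketch below compares a model against scikit-learn’s DummyClassifier on synthetic data; it is an illustration, not a prescribed procedure.

```python
# Illustrative sketch: sanity-checking a model implementation by comparing it
# against a trivial baseline (DummyClassifier). Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, random_state=0)

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
model = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

print(f"baseline accuracy: {baseline.mean():.3f}")
print(f"model accuracy:    {model.mean():.3f}")
# If the model does not clearly beat the baseline, the implementation, the
# features, or the data pipeline deserve a closer look.
```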

Visualization

Another technique practitioners use to test models is visualization of model outputs (e.g., accuracy) and other model states (e.g., loss) or parameters. In one practitioner’s words, “I have found that visualizing either the output or the internal state of a neural network, greatly improves my bug-finding capacity.” Another practitioner mentioned: “I can test if the neural network architecture definition is correct generating a visualization of architecture using Tensorflow, and check if it’s logically correct.”

Use available tools and frameworks

Practitioners also use features available in existing tools and frameworks, i.e., “through test methods provided with the framework,” for debugging and testing ML model implementations. Commonly this is done by “debugging and checking results,” for example, using “unit tests in C#.” An example approach mentioned by one ML practitioner is: “in case of python, I will use PDB (python debugger). Then test it over the known outcomes and if they are not correct or efficient and I will check the algorithm I am using and the outputs at every step of that model.” Some practitioners write suitable unit tests for ML models or test “against existing frameworks.” Some practitioners use the interactive interface of Jupyter notebooks and examine “incremental results from the different code blocks.”
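As a sketch of such framework-supported testing, a small pytest module (ours, with hypothetical expectations) can assert basic properties of a model implementation, such as output shape and the ability to fit a tiny separable dataset:

```python
# Illustrative pytest sketch: basic unit tests for an ML model implementation.
# The expectations (output shape, near-perfect fit on a tiny set) are examples.
import numpy as np
from sklearn.linear_model import LogisticRegression


def test_predictions_have_expected_shape():
    X = np.random.RandomState(0).rand(20, 3)
    y = np.array([0, 1] * 10)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    assert model.predict(X).shape == (20,)


def test_model_can_fit_tiny_separable_dataset():
    # A correct implementation should (almost) memorize a trivially small set.
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])
    model = LogisticRegression(max_iter=1000).fit(X, y)
    assert model.score(X, y) >= 0.75
```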

Domain knowledge–based validation

Similar to other activities of ML application development, domain knowledge plays an important role in testing ML code. Based on the domain expertise, practitioners can perform visual or manual inspection of model behavior based on known cases or crafted test data.


4.5.3 RQ14: What are the common symptoms that practitioners use to detect defects in an ML implementation?

It is known that the identification of defects in ML code is harder because the outcomes of ML applications are generally stochastic in nature. Thus, we are interested in knowing how ML practitioners detect defects in their ML code. The shared knowledge of the practitioners will not only be useful to the ML community but also help in identifying the challenges practitioners face and the types of support they need for defect identification in ML code. The symptoms can range from intuitive ones like “fail to compile” to nontrivial defect symptoms. Based on the practitioners’ experience shared in this survey, we observe the following practices commonly used by ML developers to identify defect symptoms:

Performance-based symptoms

ML practitioners rely heavily on model behavior and performance as defect symptoms for ML models. ML developers frequently consider the following performance-based symptoms of defects in ML code.

  • Accuracy: In the practitioners’ view, model accuracy is a strong indicator of the correctness of the implementation. In a practitioner’s words, “Highly unlikely accuracy for the given task, and extremely low accuracy for the given task, often when encountering these results there would be defects in the implementation whether it is minor or major.” So, “weird results,” i.e., “extremes, too high accuracy, too low accuracy,” are considered an indicator of defects. An abrupt change in model accuracy, such as a “huge decrease in performance,” is also a sign of model defects.

  • Consistency: Inconsistency in model performance is likely to be a symptom of defects in the ML model. The “inconsistency of results, over-sensitivity” i.e., “non-deterministic results” indicate the likely presence of defects in the ML code.

  • Generalization: Another symptom of defects in an ML implementation is poor generalization, i.e., the model exhibits “decay in performance on unseen data.” Models may have “wrong output, (and) limited possibilities to generalize” and may also show “discrepancies between offline and online benchmarks” in the presence of defects.

  • Bias: ML models may also exhibit “high bias with respect to the labels” in the presence of defects. Such “high bias or high precision with low bias” might be an indication of defects in the model. This bias can originate from both the data and the code.

Training behaviors

Some symptoms of defects can surface during model training. We list some of the defect symptoms observed during training of ML models as follows:

  • Convergence: In the presence of defects, ML models may fail to converge. Training may lead to “overfitting, underfitting, and volatile performance.”

  • Speed: The models may be too slow in training and inference or can be too fast with high accuracies. These unexpected model behaviors can indicate potential defects.

  • Value: Models may have unexpected values (types or ranges of values) for inputs, weights, or parameters during training. For example, one indication can be the “appearance of NaNs during training.” This can indicate potential defects in the model implementation (see the sketch after this list).
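A minimal sketch of guarding against such value-related symptoms, assuming a toy PyTorch model and synthetic data, might check the loss and gradients for non-finite values at every step:

```python
# Illustrative sketch: guarding a PyTorch training loop against NaN/Inf values,
# one of the training-time defect symptoms practitioners mentioned.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                         # toy model (placeholder)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
X, y = torch.randn(64, 10), torch.randn(64, 1)   # synthetic data for illustration

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    if not torch.isfinite(loss):                 # NaN/Inf loss often signals a defect
        raise RuntimeError(f"Non-finite loss at step {step}: {loss.item()}")
    loss.backward()
    if any(not torch.isfinite(p.grad).all() for p in model.parameters()):
        raise RuntimeError(f"Non-finite gradient at step {step}")
    optimizer.step()
```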

Model output

Erroneous model output may also indicate presence of defects. Practitioners reported the following defect symptoms related to model output:

  • Value: Model can produce wrong output in terms of values and range of values indicating defects in the model.

  • Distribution: The distribution of model output can also be an indicator of the model defects. If the model output is skewed to some specific class or values, there might be defects or bias in the implemented model.

However, the symptoms of defects in ML code may also “depend on the problem” and can be hard to fit into specific categories. The intrinsic challenges in identifying defects in ML code are also reflected in some practitioners’ lack of awareness of such symptoms, mentioning that they have “no idea,” that they are “not sure,” or that defect symptoms are “unknown.” This shows the importance of providing ML practitioners with more (tool) support to help them detect defects in their code.

Practitioners use different tools and techniques for detecting bugs in ML code. Some practitioners reported that they use Pytest for testing ML code in Python. Depending on the ML framework, the test techniques vary widely. Some practitioners simply use logging or debugging to identify bugs. Interestingly, many practitioners do not use any particular tool for ML testing, while some are not even aware of such tools. Other tools and frameworks, such as the Python debugger available in IDEs like PyCharm, can also be handy for finding bugs.


4.5.4 RQ15: What are the practitioner-perceived challenges of testing ML applications?

Testing the correctness of models is a challenging task. The characteristics of the ML models (algorithms) and the different quality requirements of ML data make the testing of ML models more difficult. Based on the developers’ perceptions, we list the following challenges in ML model testing:

Black-box nature of ML models

ML models are “black-box in nature” (Pei et al., 2017), meaning that it is quite obscure how the model performs a particular task. For the same reason, it is also hard to detect or explain why a model is not performing as expected. As models are black boxes, “I (developer) can’t tweak the internals” to identify the issues in the model. Again, “since they are mostly black boxes, it’s hard to make guarantees on yet unseen data.” The opaque nature of the models usually does not allow the developer to observe the internal states of the models, such as through “gradient inspection” during training. The practitioners feel the need for techniques or tools to make the model more transparent. In a practitioner’s words, “... it would be awesome some kind of software or library that checks your gradients and tell if there is something strange.”

Model’s robustness to errors

Another key challenge in testing ML models is that they can exhibit robustness to errors by producing correct results in some cases, i.e., “the model is wrong, but the result is not.” Similarly, “a wrong implementation sometimes achieves similar performance (to the correct ones), and the bug cannot be found until we introduce new features.” Such a wrong implementation “without actual effect in results while convincing myself (developer) that it works” can be quite tricky for ML developers to identify and fix.

Data quality

Data plays an important role in model quality. Despite a correct implementation, a model may perform poorly due to issues in the dataset, and that can be hard to detect. Based on the practitioners’ responses, we observe the following challenges in testing ML models related to the quality of data:

  • Adequacy: ML models need to be trained on an adequate number of input data to achieve high accuracy. “Absence of sufficient correctly labeled data” can hinder the correctness of the model and consequently the performance.

  • Correctness: One key concern in ensuring the correctness of ML models is “having a bad model due to bad data: makes it very difficult to detect and expensive to solve.”

  • Data bias: Biases in the data can lead to poor-quality ML models. There might be intentional or unintentional or even domain-specific biases in the dataset. For example, “in the medical industry, data bias is hard to get over, and typically requires vast amounts of augmentation.” Unbalanced distribution of data samples in the training, testing, and validation dataset can embed bias in the model leading to biased models.

  • Labeling accuracy: Incorrectly labelled data is likely to result in poor quality models. Thus it is important to ensure “the fidelity of the labels” because the “lack of (a) good labelled data sets” may adversely affect the model accuracies.

  • Distribution: “Testing data sometimes does not belong to the distribution of the training data on which the model is trained so we might not get good score.” ML data also need “to represent the real distributions faithfully.” Otherwise, this can lead to incorrect models which can be hard to identify and fix.

  • Divergence: One important challenge for the ML model is that the characteristics of the data can evolve over time. Thus, the issues in the models may be simply “a data problem.” For the models to be correct, it is important “how well does the data available during training reflect true operational data, and how long until the nature of the operational data diverges from your original data and assumptions.” It is challenging to deal with this divergence in ML data.

Volatile performance

The performance of the ML models can be affected by diverse factors involving both data and the code. “Sometimes there are discrepancies between test and validation sets and unseen data.” Again “sometimes the models do not meet the predetermined correctness criteria or they perform very well in the train test split but when introduced to the validation set it under-performs.” The cause of low performance can be challenging to identify.

Domain expertise

Testing ML models may require adequate domain knowledge. Thus, a “lack of domain knowledge” can make ML testing a difficult task. Another challenge is the availability of domain experts and ensuring “access to tools a domain expert can use.”

Cost

Testing ML models for correctness and performance can be costly in terms of time and effort. Phases of ML application development are usually iterative and thus “reaching a working model is very time consuming” because “iteration takes too much time.”

Lack of concrete methodology

Practitioners also lack appropriate techniques and methodologies to test ML models. In a developer’s view, “... there is, to my knowledge, no decisive way to ensure correctness but to leverage more data for testing predictions.” “You (practitioners) only have accuracy and prediction plot to check the correctness. Usually, the data transformation pipeline is difficult to implement so the bug can be from data, even before the model. So you (practitioners) need to identify exactly where is the cause of low accuracy.”

Interpretability or explainability

Another key challenge in testing ML models is that we can hardly explain why and how a model is working or why it is not working. When testing ML models, it is hard “knowing what you (developers) are evaluating,” i.e., “understanding the exact mathematical process behind the algorithms.”


4.6 Model deployment and management

Once trained and tested ML models are available, models need to be integrated into the target application for deployment. Also, deployed models need to be monitored and managed to maintain the expected performance of the application, and adapted to changes over time.

4.6.1 RQ16: What are the developer-perceived challenges of testing ML model deployment?

It is important to test models after deployment to ensure that the model is integrated as expected with other components of the ML applications and the target software ecosystem. However, like the pre-deployment testing of models, there are some challenges to post-deployment testing. Based on the perceptions shared by the survey participants, we list the following challenges for testing ML model deployment:

Test data

The quality of test data is an important challenge in testing ML model deployment. Practitioners identified the following data-related challenges in testing model deployment:

  • Data availability: For post-deployment testing, adequate real data needs to be available covering possible usage scenarios including corner cases. However, there might be a “lack of ground truth for real life new data.” While in some cases it might be challenging to have adequate test data, “huge amount of (test) data” can also be a challenge to deal with.

  • Data labelling: As in the training phase, test data requires correct labelling for post-deployment testing. However, “obtaining correctly labelled data” can be challenging.

  • Data format: Besides the availability of correctly labelled data, another challenge is “making sure the data coming into the model is of the right format.”

Performance

One key requirement is that the deployed ML models must perform as expected in production. The practitioners pointed out the following challenges to ensure the performance of the deployed model:

  • Functional accuracy: One of the primary requirements that a deployed model must satisfy is the desired level of functional accuracy. The accuracy requirement may “depend(s) on applications, some applications require very high accuracy, while for others 60% is enough.”

  • Generalizability: Another challenge in post-deployment model testing is “making sure the model is generalizable and its behavior is in control.” The deployed model is expected not only to generalize to unseen data but also to exhibit “robustness to adversarial examples.”

  • Performance monitoring: The deployed models need to be evaluated over a reasonable period for post-deployment testing. However, “keeping track of predictions (model performance)” can have additional overhead. Also, “constantly having to have human oversight” can be challenging and costly.

  • Performance measures: Sometimes the interpretation of the performance metrics may differ and “the most appropriate result metrics may or may not be understood by all the teams.” For example, “accuracy they understand and F1-Score is what matters.”

Resource requirements

The deployed ML application should make optimal use of resources such as processing power and memory, and the models should be tested against these requirements. “Assessing the accuracy is not hard but It’s hard to measure resource usage in mobile phones. It’s difficult because in the phone we don’t have the same libraries and tools we can use to develop like on PC.”

System complexity

Complexities of the model, target application architecture, and the deployment environment can pose challenges to testing deployed models. For example, “due to the model complexity, it is hard to understand where the problem is from” and to devise test cases for all possible scenarios.

Platform diversity

Another important challenge in post-deployment testing of ML applications is that the deployment platform can differ from the model development or training platform. For example, the “deployment hardware is different from (the) hardware used for training.” Models trained and tested in one hardware and software environment may not work on a mobile platform because, as quoted above, the same libraries and tools available for development on a PC are often missing on the phone. “The big challenge is a method to deploy ML models through a single framework regardless of the used library (PyTorch, Tensorflow, etc.)” to overcome the complexities due to platform differences.

Adaptability

The target environment and data characteristics are likely to evolve over time. However, “integrating new cases of failure into the pipeline” to ensure the adaptability of the deployed ML application remains challenging.

User satisfaction

The success of a deployment includes not only satisfying the functional and performance requirements but also satisfying the target users. However, challenges remain in “determining the value of one set of (the) user over another. If 60% of users dislike the current implementation, should we change it to satisfy their needs but put the other 40% in a place of discomfort?” Deployment testing should thus consider user acceptance of the deployed application or model.

Besides, the practitioners claim that domain knowledge requirements, associated time and costs, complexities in writing suitable tests, and lack of interpretability or explainability can also pose challenges in post-deployment testing as in other phases of ML application development.


4.6.2 RQ17: What are the factors that ML developers commonly focus on during ML model management?

Post-deployment model maintenance is important in the ML application development life cycle. In this phase, models need to be monitored for different quality parameters, and model maintenance activities need to be initiated if the model deviates from the performance requirements. About 77.5% of practitioners mentioned that they frequently monitor ML models after deployment. ML developers employ different testing techniques depending on the context of the specific application. Common model maintenance activities include observing the performance, resource requirements, and robustness of the models in real-world usage scenarios. The models are deployed on different software and hardware platforms; for example, a model can be hosted on a local server or deployed in the cloud. These diversities in the deployment environment are likely to have an impact not only on the performance but also on the maintenance of the models.

Monitoring and maintenance of deployed models are essential in ML application maintenance. Since ML models are data-driven, deployed models need to be monitored because the model performance can be affected due to significant changes in the data over time. To have insights into the practices that ML developers commonly adopt in managing ML applications, we categorize the common parameters considered by the practitioners for post-deployment model management as follows:

Performance

Monitoring performance is a key task in ML model management. As model performance can drop significantly due to changes in data characteristics, models need to be monitored to detect performance deviation. As per the practitioners’ responses, the following performance factors are considered for post-deployment monitoring.

  • Model accuracy: One key performance measure is accuracy. In the maintenance phase, models need to be monitored for accuracy, which helps detect performance deterioration due to changes in data over time. Model accuracies are measured against predefined metrics, similar to the training and testing phases, and “changing in accuracy” should be addressed accordingly, e.g., by retraining the models. The metrics for accuracy may vary depending on the model type and the domain. Models are also evaluated with respect to ranking quality in recommendations, such as “NDCG (Normalized Discounted Cumulative Gain),” commonly used in information retrieval.

  • Resource consumption: Another performance factor that the practitioners reported to give importance to is the resource consumption by the deployed application such as CPU, GPU, memory, and power. The models may require optimization for resource consumption to ensure a cost-performance trade-off.

  • Inference latency: Response time for inference, i.e., inference latency, is an important parameter to watch. Models need to be evaluated for the “latency for predictions” to ensure that “speed is reasonable” (see the sketch after this list).

  • Robustness: The models need to be monitored for unseen data or corner cases to evaluate how robust the model is in dealing with new data in a real environment.
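A minimal monitoring sketch, assuming a hypothetical stand-in model, simulated production traffic, and an example alert threshold, might track rolling accuracy and per-request latency like this:

```python
# Illustrative sketch: tracking post-deployment accuracy and inference latency.
# The model, the data "stream," and the alert threshold are all hypothetical.
import time
from collections import deque

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X[:200], y[:200])  # stand-in model

recent_correct = deque(maxlen=100)   # rolling window of prediction outcomes
latencies = deque(maxlen=100)        # rolling window of per-request latencies

for features, true_label in zip(X[200:], y[200:]):   # simulated production traffic
    start = time.perf_counter()
    pred = model.predict(features.reshape(1, -1))[0]
    latencies.append(time.perf_counter() - start)
    recent_correct.append(int(pred == true_label))

    rolling_acc = sum(recent_correct) / len(recent_correct)
    if len(recent_correct) == recent_correct.maxlen and rolling_acc < 0.8:
        print("ALERT: rolling accuracy dropped to", rolling_acc)

print(f"mean latency: {1000 * sum(latencies) / len(latencies):.2f} ms")
```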

Business gains

In the post-deployment phase, ML models or applications also need to be evaluated regarding different “business metrics such as click-through rates, conversion rates, (and) revenues.” There are also other business factors such as “customer retention.” Corrective steps are essential if the ML model fails to meet the business goals.

User feedback

It is also important to evaluate ML applications based on the feedback of real users. “Response from users and their feedback in how it could be improved or changed” using the “end-user perspective” is important to improve the quality of ML applications.

However, the prioritization of factors to monitor during model maintenance can depend on the associated ML applications.


5 Analysis and discussion

In this paper, we have presented insights into the state-of-the-practice and key challenges in different phases of ML application development based on the shared experiences of ML practitioners. Our survey participants are from diverse backgrounds, with a wide array of skills and experience across different domains of ML application development, which helps ensure a comprehensive reflection of the practices and challenges of machine learning in practice. In the following subsections, we discuss our findings with respect to overall trends in ML application development and the practices and challenges in the four phases of the ML application development life cycle covered in our survey:

5.1 Trends in ML application development

We presented recent trends in ML application development in finding 1. As mentioned by the practitioners, recent ML application development is heavily focused on business intelligence (BI) applications. In addition to business and e-commerce, the application of AI/ML includes healthcare, security, document analysis, and entertainment, and is rapidly embracing other areas of human life. For ML application development, data plays a key role. Based on the practices reported by the practitioners, open-source is the leading source of ML data, while private company data and data from third-party clients are also major sources. Like open-source datasets, open-source ML libraries and frameworks (e.g., TensorFlow, PyTorch, Keras, scikit-learn) lead recent trends in ML application development. However, the choice of data, ML algorithms, and ML frameworks is mostly dependent upon the problem and application domain and requires the necessary domain knowledge for the successful development of ML applications.

5.2 Data collection and preprocessing

ML models are data-driven, and thus the quality and adequacy of data are important for developing ML applications. However, ensuring the availability of reliable data for ML can be challenging. We observe that open-source data is the most prevalent source of machine learning data, covering about 75% of the data. The quality of ML data is very important for the performance of the resulting ML model, and practitioners have pointed out several key quality characteristics of ML data (finding 2). The key attributes that practitioners emphasized include how well the data represents features for machine learning, as well as adequacy and diversity, meaning sufficient data volume and representation of all classes or categories. ML data also needs to be complete and accurately labeled for ML algorithms, and consistent regarding structure, accuracy, and quality of information. ML data needs to be reliable and verified in multiple phases because the consequences of data errors can be extremely adverse, for example in health- and safety-critical systems. Moreover, ML data should have, where possible, low noise and bias and a balanced distribution across data classes to achieve high-performance models.

Practitioners apply different data transformation operations such as noise removal, replacement of missing values, dimensionality reduction, class-balancing, and normalization (finding 3). As we observed from the practitioners’ responses, there is no one-size-fits-all solution for data processing. About three-quarters (76.6%) of the surveyed developers do not use specific data analysis tools; rather, they use diverse data- or problem-specific tools and techniques or develop their own customized solutions.

ML data can be noisy and may require preprocessing such as cleaning or transformations to be suitable for ML models. However, cleaning ML data is a challenging task (finding 4) and the data-cleaning approaches are likely to be data and problem-specific. Thus, the approaches are hard to generalize, scale up to accommodate a large volume of data, and apply to data automatically. Quality issues of the data and lack of necessary features in existing tools for ML data processing make the data cleaning task not only harder but also costly in terms of time and effort. The process requires adequate domain knowledge about the associated data and the ML use cases. Practitioners also pointed out the importance of having a common standard for data and data cleaning procedures. It is also important for practitioners to be aware of not only the tools and techniques for data cleaning but also the adverse impact of noisy data on resulting ML models.

Another key task is to correctly label the data or features for ML models. Practitioners have outlined several challenges in feature labelling (finding 5). One key challenge is the large volume of data, which makes labelling time and resource consuming, and thus costly, as it often involves manual processing. The feature extraction and labelling procedures are likely to be problem and domain-specific, and thus domain knowledge is important. Poor data quality and the lack of appropriate labelling guidelines add further challenges to ML feature labelling. As reported by the practitioners, manual investigation is still the most commonly used approach for validating features or ML data labelling, and domain knowledge plays an important role in it (finding 6). However, some practitioners use tools and automated scripts for feature validation.

5.3 Feature engineering

Feature engineering is another important phase of ML application development where ML data is processed to generate meaningful features. The key objective of the features is to best represent the data characteristics that can help ML models to learn and infer for a defined ML task. Feature engineering usually comprises two key functionalities: feature extraction and feature selection. The feature extraction process should take into account the quality characteristics while extracting features from the data.

One important requirement for ML data is that it needs to be balanced across different classes (finding 7). Practitioners commonly use different statistical analyses, visualization, and manual verification to check the class balance of the ML dataset. To improve model performance, practitioners fix class balancing issues in ML data using techniques such as re-sampling or stratification of the ML dataset. As the feature extraction procedure may be data and problem dependent, about three-quarters (76.6%) of practitioners mentioned that they do not use specific tools for ML data processing such as feature extraction. Practitioners usually depend on manual analysis and custom scripting for feature extraction, while some tools are available for feature engineering (finding 8). Based on practitioners’ perceptions, we observe several limitations of existing feature engineering tools and techniques (finding 9). Generalization is one of the key limitations of existing feature engineering tools, and domain knowledge is necessary to use them. They are also not easily adaptable to new data and problems. Practitioners also identified the lack of usability and simplicity of the tools, which imposes a steep learning curve on developers.

It is also important to validate the feature quality. Practitioners have reported some common practices in feature assessment and validation (finding 10). One common practice is to use statistical analysis and visualization of the feature to evaluate the quality characteristics. Practitioners also observe the resulting model performance to assess the feature quality. Like other ML development phases, feature validation requires necessary domain knowledge. However, the techniques are likely to be domain-specific.

Once features are extracted from ML data, selecting the optimal subset of features for ML models is an important but challenging task. As per the practitioners’ responses (finding 11), the feature selection process is mostly manual and based on domain knowledge while there are some automated tools. However, different statistical analyses and visualization are useful to gain insights into the features for incremental selection and to find the optimal subset of features that satisfy desired performance requirements.

5.4 Model building and testing

ML models are usually implemented on top of available libraries and frameworks (reported by 93.18% of practitioners), while about one-third of the developers write models from scratch (finding 12). The models are then trained on the training dataset and tested for accuracy and performance. One important and challenging task of the model-building phase is to ensure the accuracy of the models through testing (finding 13). A widely used practice for ML model testing is to evaluate selected performance metrics on a benchmark or validation dataset. Testing may also involve comparing models across diverse settings (language, algorithms, target platforms, etc.). ML model implementations are also tested by visualizing internal model states or external behaviors. Some tools and frameworks support ML implementation, and domain knowledge plays an important role in ML testing.
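
A minimal sketch of such metric-based testing is shown below; it assumes a scikit-learn-style model and uses an illustrative accuracy threshold standing in for project-specific performance requirements.

# Minimal sketch of metric-based model testing on a held-out set; the
# accuracy threshold is an assumed, project-specific pass/fail criterion.
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def test_model(model, X_test, y_test, min_accuracy=0.85):
    preds = model.predict(X_test)
    report = {
        "accuracy": accuracy_score(y_test, preds),
        "macro_f1": f1_score(y_test, preds, average="macro"),
        "confusion": confusion_matrix(y_test, preds),
    }
    # The threshold encodes the performance requirement for this release.
    report["passed"] = report["accuracy"] >= min_accuracy
    return report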

Practitioners also pointed out some defect symptoms commonly used to assess the quality of ML models (finding 14). One key indicator is model performance in terms of accuracy, consistency, bias, and generalizability. Training-time behaviors, such as convergence, training time and trend, as well as the values and distribution of model outputs, can also be useful indicators of defects in the model.
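
For illustration, the sketch below encodes a few of these training-time symptoms as simple checks over a recorded per-epoch loss history; the thresholds are arbitrary placeholders rather than values reported by practitioners.

# Minimal sketch of training-time defect checks over a per-epoch loss
# history; tolerance and plateau heuristics are illustrative assumptions.
import math

def loss_symptoms(losses, window=5, tol=1e-3):
    symptoms = []
    if any(math.isnan(l) or math.isinf(l) for l in losses):
        symptoms.append("nan_or_inf_loss")       # numerical instability
    if len(losses) >= 2 and losses[-1] > losses[0]:
        symptoms.append("diverging_loss")        # loss trending upward
    if len(losses) > window:
        recent = losses[-window:]
        if max(recent) - min(recent) < tol and losses[-1] > losses[0] * 0.5:
            symptoms.append("premature_plateau")  # stalled well above start
    return symptoms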

Testing ML models is known to be a challenging problem. Practitioners identified (finding 15) the “black-box” nature of ML models as a factor that makes them harder to test. ML models can also exhibit robustness to errors, meaning that they may produce correct results in some cases despite an incorrect implementation. In addition, it is challenging to ensure the adequacy and consistency of the data: possible bias, labeling errors, and divergence in the dataset also complicate ML testing. A lack of interpretability or explainability further hinders ML model testing. Practitioners are therefore in need of concrete methodologies for ML testing.

5.5 Model deployment and maintenance

Once trained and tested, the model needs to be integrated into the target application for deployment. Deployment of ML models includes various activities to integrate and test models in the target application environment. Practitioners adopt different tools and techniques for model deployment and post-deployment maintenance, but deployment involves several challenges (finding 16). The first challenge is to ensure the availability of test data of the desired quality and diversity to cover all use-case scenarios. Monitoring post-deployment model performance is another challenge, needed to ensure the functional accuracy and generalizability of models. Resource requirements, model complexity, platform diversity, adaptability, and overall user satisfaction are important attributes for the post-deployment evaluation of ML models and applications.

Practitioners focus on several important factors for model maintenance (finding 17). Over 77% of the practitioners reported that they frequently monitor deployed ML applications. In the post-deployment phase, practitioners primarily track model accuracy, resource consumption, inference latency (speed), and robustness to unseen data. In addition, business factors (e.g., user conversion rate, revenues) are important to monitor during model maintenance. Feedback from target users is also very important for measuring application performance from the end-user perspective, identifying defects, and improving the features of the ML application. During maintenance, practitioners may need to prioritize among these factors based on the specific context of the ML application.
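
As an illustration of such monitoring, the following sketch wraps a deployed model to track inference latency and a rolling accuracy computed from delayed ground-truth feedback; the metric names, window size, and feedback mechanism are assumptions made for the example, not a reported implementation.

# Minimal sketch of post-deployment monitoring of latency and rolling
# accuracy; window size and metric names are illustrative assumptions.
import time
from collections import deque

class ModelMonitor:
    def __init__(self, model, window=1000):
        self.model = model
        self.latencies = deque(maxlen=window)   # seconds per request
        self.correct = deque(maxlen=window)     # 1/0 once feedback arrives

    def predict(self, x):
        start = time.perf_counter()
        pred = self.model.predict(x)
        self.latencies.append(time.perf_counter() - start)
        return pred

    def record_feedback(self, pred, truth):
        # Ground truth (e.g., user feedback) often arrives after the fact.
        self.correct.append(int(pred == truth))

    def metrics(self):
        return {
            "median_latency_s": (sorted(self.latencies)[len(self.latencies) // 2]
                                 if self.latencies else None),
            "rolling_accuracy": (sum(self.correct) / len(self.correct)
                                 if self.correct else None),
        }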

To gain further insights, we also examined how the challenges and best practices in ML application development observed in our study reflect those reported in the existing literature. For example, Amershi et al. (2019) reported that practitioners from Microsoft emphasized the availability, quality, and management of data for ML application development. In particular, those practitioners identified “accessibility, accuracy, authoritativeness, freshness, latency, structuredness, ontological typing, connectedness, and semantic joinability” as important attributes of ML data. We observe a considerable overlap between the data characteristics reported by the participants in our study (finding 2) and those of the study by Amershi et al. The differences in the listed data attributes are likely due to differences in the ML development scenarios with which the practitioners of the two studies are associated. Practitioners who participated in our study identify data cleaning as a challenging task in which scalability is a key concern (finding 4). This is consistent with the findings of the study by Amershi et al. (2019). Practitioners in both studies emphasized the need for tool support to automate data processing and feature engineering tasks.

Existing studies report testing of ML applications as a very challenging task, as ML model testing differs from traditional software testing (Amershi et al., 2019; Marijan et al., 2019; Felderer & Ramler, 2021). The practitioners who participated in our study also report testing as a very challenging task in the ML workflow (finding 13, finding 14, finding 15). Marijan and Gotlieb (2020) presented different state-of-the-art approaches to test ML applications; however, these approaches might not be widely available to ML practitioners as automated tools. As reported in our study, practitioners test ML models based on observation of model performance, training-time behaviors, visual inspection, and the distribution of output values. Thus, practitioners are in need of tools and methodologies for ML testing (finding 15).

The findings from our study provide valuable insights into the practices and challenges of the different phases of the ML application life cycle. We expect these findings to make practitioners aware of the challenges in ML application development, and we hope that they will serve as a guide to help developers adopt best practices for building high-quality ML applications.

6 Threats to validity

In this section, we discuss some potential threats to the validity of the methodology and findings of our study.

Threats to construct validity: A survey is a well-known method for collecting information from relevant people on a specific topic, allowing us to summarize, compare, and explain the knowledge and perceptions of the respondents on the topic of interest (Fink, 2003). We therefore adopted a survey as our methodology to ask ML practitioners about their experiences in ML domains. We followed formal guidelines to design and conduct the survey and to analyze the responses for insights into the practices and challenges of developing ML applications.

Threats to internal validity: One important threat to internal validity is potential bias in the responses of the survey participants. We conducted a pilot study to gather feedback on the questionnaire from several survey participants, and we refined the questionnaire based on the recommendations received from the pilot study before designing the final version.

Threats to external validity: We aligned our questionnaire with the phases of the ML workflow presented in the existing literature. The questions were iteratively refined based on the existing literature, domain knowledge, and practitioners’ feedback so that the questionnaire covers diverse aspects of ML development. Moreover, we selected a large group of participants with diverse backgrounds and skills for this study. However, this group may not be representative of the general population of ML practitioners, which is a potential threat to the generalizability of our findings. To mitigate this threat, we carefully selected the participants based on their professional profiles on LinkedIn and their contributions to machine learning projects on GitHub. We also took care to include many open-ended questions in our questionnaire to allow participants to express their responses freely, and we ensured that every question includes a “Not Applicable” or “Other” option so that participants could respond appropriately if a question did not apply to them or if they were not comfortable answering it. The open-ended questions also allowed respondents to explicitly add any other practices and challenges that they are aware of. Nevertheless, it is desirable that future studies replicate this work with more ML professionals from diverse backgrounds.

Threats to conclusion validity: The results of our survey are unlikely to be affected by the choice of analysis methodology, as we used descriptive statistics, simple calculations, and comparisons that are largely independent of the analysis tools and techniques. However, a different set of respondents might lead to some variation in the results. We carefully selected the respondents based on their profiles and contributions, and we cross-validated our data analysis and reporting methodology with at least two members of the team conducting this study.

Threats to reliability: To ensure the reproducibility of our findings, our data and results are available in an online Appendix (2020). We described in detail our methodology for selecting the respondents, collecting the data, and analyzing the responses. As professional networks continuously evolve, the query results for specific keywords are likely to vary, and thus the list of participants is likely to differ in future replications of the study.

7 Related works

Recent advancements in machine learning are making ML increasingly popular for devising innovative solutions to diverse problems. However, the increasing adoption of machine learning into software applications poses additional challenges to the software development process (Zhang et al., 2019b). Challenges in the traditional software engineering process have been widely addressed by researchers (Sandberg & Crnkovic, 2017). However, there is a growing need for guidelines and best practices for developing ML applications (Zhang et al., 2019c).

Schelter et al. (2018) focused on ML model management use cases from conceptual, data management, and engineering perspectives. Amershi et al. (2019) highlighted challenges in AI application development at Microsoft and shared how the teams address those challenges. Zinkevich (2018) presented guidelines for best practices in ML engineering, and there are also guidelines for Responsible AI practices (2020) and by Kriens and Verbelen (2019). However, these guidelines do not focus on fitting ML into the traditional software engineering process. Much of the existing literature focuses on specific aspects of machine learning such as data acquisition, data preprocessing, feature extraction (Storcheus et al., 2015), model management (Schelter et al., 2018), testing (Pei et al., 2017; Grosse & Duvenaud, 2014; Ma et al., 2019, 2018a, b; Zhang et al., 2019a; Sun et al., 2018), and deployment (Schelter et al., 2018; Renggli et al., 2019; Guo et al., 2019) of ML applications.

Since testing is one of the most important phases of the development of machine learning applications, some studies have focused on the challenges of testing machine learning systems. Braiek and Khomh (2020) present challenges that should be addressed when testing ML programs; in this paper, we report on the techniques and tools currently used by practitioners to cope with these challenges. Huang et al. (2018) investigate the characteristics of Naive Bayes and DNN classifiers and analyze the testing challenges of machine learning applications, such as generating reliable test oracles, generating effective corner cases, improving test coverage, and testing ML applications with millions of parameters; they then suggest some initial techniques to mitigate these challenges for applications that use these classifiers. Another study, by Marijan et al. (2019), focuses on the most prominent challenges of testing ML-based systems (the absence of test oracles, the large input space, and the high test effort required by white-box testing) from a quality assurance perspective rather than a model performance perspective; existing approaches that alleviate these challenges are then reviewed and discussed with respect to their limitations.

Few studies have examined the difficulties faced by software developers while developing ML applications or using ML libraries. Considering 3243 highly rated Q&A posts related to ten ML libraries from Stack Overflow and classifying these questions into seven typical stages of an ML pipeline, Islam et al. (2019) analyzed the problems with ML library usage from four perspectives: identifying the most difficult ML stage, understanding the nature of the problems, understanding the nature of the libraries, and studying whether the difficulties stayed consistent over time. Bangash et al. (2019) studied 28,010 machine learning posts from Stack Overflow and employed topic modeling to identify key areas of interest to developers; they report that topics related to algorithms, classification, and training datasets are frequently discussed by developers. Nguyen-Duc et al. (2020) explored, through a survey, different contextual factors in ML application development for leveraging business opportunities. Washizaki et al. (2019) report on a systematic review of both academic and grey literature that aimed to collect good and bad software engineering design patterns for ML application systems and software; they provide a list of software design patterns and anti-patterns that practitioners can use to improve the quality of their ML applications.

Our study differs from existing studies in its focus on the different ML development tasks in the end-to-end ML workflow, whereas existing studies (Marijan & Gotlieb, 2020; Marijan et al., 2019) focus on specific aspects of ML development such as testing. Our study also focuses on insights from practitioners rather than a review of the state of the art (Marijan & Gotlieb, 2020). The existing literature has provided useful insights on the challenges of machine learning and of software engineering mostly separately. In this paper, we reconcile these two themes and report on the challenges and best practices of machine learning application development, using insights from experienced ML developers with diverse expertise and application domains.

8 Conclusion

In this paper, we presented the findings of a survey of 80 ML practitioners from diverse backgrounds. Our survey covers four key phases of the ML application development life cycle, i.e., (i) data collection and preprocessing, (ii) feature engineering, (iii) model building and testing, and (iv) integration, deployment, and maintenance, to identify challenges and practices from the practitioners’ perspective. We summarized the knowledge shared by these practitioners in 17 key findings. For each of the selected phases of ML application development, we analyzed the responses of the practitioners and synthesized their practices into actionable findings. We believe that our findings can make ML practitioners of all experience levels, in academia and industry, aware of the diverse challenges of ML application development. In addition, our findings provide practitioners with guidelines and examples of best practices to adopt in their ML workflow in a context-specific way.