1 Introduction

Deep learning (DL) has been widely applied to many cutting-edge areas, e.g., machine translation (Wang et al. 2020; Hoang et al. 2018), natural language processing (Deng and Liu 2018), image processing (Hemanth and Estrela 2017), cancer diagnosis (Fakoor et al. 2013), and self-driving cars (Badue et al. 2021). To meet the requirements of these wide-ranging applications, various DL models such as the convolutional neural network (CNN) (Lecun et al. 1998), recurrent neural network (RNN) (Rumelhart et al. 1986), and long short-term memory (LSTM) (Schmidhuber and Hochreiter 1997) have been proposed. With the rapidly growing complexity of DL models, it is crucial to reduce the effort of programming such DL models. Up to now, many high-performance DL frameworks such as TensorFlow (Abadi et al. 2016) and PyTorch (Paszke et al. 2019) have been proposed, allowing researchers to quickly implement and experiment with various DL models.

Nevertheless, due to their data-driven programming paradigm, DL applications often come with high computational complexity. Currently, most DL workloads run on general-purpose platforms, i.e., GPUs and CPUs. To further push the limits of performance and energy efficiency on DL workloads, enormous effort has been put into designing DL-specific hardware in both industry and academia, e.g., Google TPU (Jouppi et al. 2017), dedicated DL accelerators based on field programmable gate arrays (FPGAs) (Lacey et al. 2016), Apple Bionic (Kingsley-Hughes 2017), and Cambricon (Liu et al. 2016).

The diversity of DL hardware and DL frameworks reflects the prosperity of the DL community. However, actually deploying DL applications built upon different frameworks onto various hardware can be tedious for developers, especially considering that DL models and operations need to be optimized for each hardware platform to achieve optimal performance. Besides, deployment issues are even more difficult to solve than other aspects of a DL application (Chen et al. 2020). To mitigate this problem and alleviate the burden of optimizing DL models for various hardware, several DL compilers have been proposed, such as nGraph (Cyphers et al. 2018), TVM (Chen et al. 2018a), Tensor Comprehensions (Vasilache et al. 2018), XLA (Leary and Wang 2017), and Glow (Rotem et al. 2018). Given a DL model from one of the DL frameworks, a DL compiler parses the model definition and generates an optimized implementation for a target hardware. DL compilers thus seem to be a promising solution; however, a fundamental question remains unclear: what challenges do users face when using DL compilers, and what are the common challenges for developers when developing DL compilers?

To answer this question, this paper presents the first empirical study on identifying the challenges in both the usage and the development of a DL compiler. Given the popularity of DL compilers, this study can help DL compiler users avoid common pitfalls and help developers understand more clearly how to better support users. Several potential directions are also discussed for researchers toward building more robust DL compilers. Building on these considerations, we select TVM as the representative DL compiler since TVM has outstanding performance compared to other DL compilers (Li et al. 2021) and it has sufficient documentation and discussion data. We analyze relevant posts from the TVM Discuss Forum, which is the main communication channel for both TVM users and developers. We manually analyze 347 randomly sampled posts from the discussion forum, covering both usage and development topics. Based on these posts, we focus on the following research questions (RQs).

RQ1: What are the challenges users may have when using TVM? :

To figure out what challenges users may face when using TVM, we randomly sample and analyze 279 posts on the usage of TVM. We finally build a taxonomy of challenges consisting of 15 categories.

RQ2: What are common topics that TVM developers discuss? :

To have a better understanding of the challenges and inspire future research or tool support, we carefully analyze 22 posts on the development of TVM. We find TVM developers generally have seven types of posts.

RQ3: What are the impacts of user- and self-reported TVM bugs? :

We take an initial step to understand the impacts of TVM bugs by analyzing 44 bug reports identified from the discussion forum and 297 bug-relevant commits crawled from the official repository of TVM on GitHub. We finally summarize four types of impacts of TVM bugs.

To the best of our knowledge, this is the first paper to analyze the challenges in using/developing DL compilers through mining collective knowledge. Besides, we make all the materials used in this study public. The crawled Apache TVM Community posts and the manual inspection results are made publicly available.Footnote 1 Researchers interested in conducting analysis on DL compilers may utilize this dataset. The rest of this paper is organized as follows. Section 2 provides background knowledge about DL frameworks, DL hardware and DL compilers. Section 3 describes the methodology used to collect the posts and build the taxonomy. Section 4 presents the taxonomy of challenges in using TVM along with the description of these categories. Section 5 describes the taxonomy of challenges and common topics about developing TVM. Section 6 describes the characterization of TVM bugs. Section 7 contains a discussion of our findings and describes several implications. Section 8 reviews our threats to validity. Section 9 discusses related work and Section 10 finally concludes this paper.

2 Background

In this section, we describe background knowledge about deep learning (DL) frameworks, DL hardware and DL compilers, especially TVM, i.e., the study subject of this work. All three are essential to developing DL software: DL frameworks support training and executing DL models, DL hardware provides better hardware support to enable more efficient computation of DL models, and DL compilers support the deployment and optimization of DL models generated by DL frameworks on DL hardware.

2.1 Deep Learning Frameworks

Deep learning frameworks provide building blocks for designing, training and executing various DL models. In this section, we briefly introduce some popular DL frameworks to provide an overview. As shown in Fig. 1, DL frameworks are divided into recent popular frameworks, ONNX-supported DL frameworks and historical DL frameworks.

Fig. 1 Deep learning frameworks

Recent Popular DL Frameworks

Due to the prosperity of the DL community, various DL frameworks have been proposed by both industry and academia. TensorFlow (Abadi et al. 2016) and PyTorch (Paszke et al. 2019) are two representative DL frameworks. TensorFlow is famous for its static computation graphs, while PyTorch adopts dynamic computation graphs and defines a neural network on-the-fly (Zhang et al. 2019).

ONNX Supported DL Frameworks

Open Neural Network Exchange (ONNX) is an open format for representing deep learning models, which allows developers to train a DL model in one framework and then export and deploy the model into other frameworks for inference (ONNX 2020). ONNX provides a definition of an extensible computation graph model so that DL models from different DL frameworks can be transformed into the ONNX format. Most current DL compilers support the ONNX interchange format in their frontends, allowing them to parse models from different DL frameworks. As shown in Fig. 1, ONNX is supported in frameworks such as Caffe2, MXNet and so forth. Note that models from frameworks like Keras and TensorFlow can be converted into ONNX using the converters provided by ONNX. However, such conversion is not yet officially supported by the DL frameworks themselves (e.g., Keras and TensorFlow).
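As a small illustration of this workflow, the following sketch exports a PyTorch model to ONNX and then parses it with a DL compiler frontend (here TVM's Relay ONNX importer). The model choice, input name and shapes are illustrative assumptions, and the exact importer signature may vary across TVM releases.

```python
# Hedged sketch: obtain a model in one framework (PyTorch), export it to ONNX,
# then import it into a DL compiler frontend (TVM Relay). Names are illustrative.
import torch
import torchvision
import onnx
from tvm import relay

model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet18.onnx", input_names=["data"])

onnx_model = onnx.load("resnet18.onnx")
# The frontend turns the ONNX graph into Relay IR plus the model parameters.
mod, params = relay.frontend.from_onnx(onnx_model, shape={"data": (1, 3, 224, 224)})
```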

2.2 Deep Learning Hardware

DL hardware, which can enhance the performance of DL models, has drawn attention from Internet giants to new startups. As deep learning involves extensive matrix-based computations, it benefits from specialized hardware support. Generally, DL hardware can be categorized into two types: 1) general-purpose hardware, which can support DL workloads through adding specially designed components and providing highly optimized libraries; 2) DL-specific hardware, which is specially customized to achieve better performance on DL workloads (LeCun 2019).

General-Purpose Hardware

The current mainstream solution to accelerating DL workloads is to use the Graphics Processing Unit (GPU). The massive parallelism of GPUs allows them to speed up computations that involve matrix-based operations, which are at the heart of many DL implementations (Nguyen et al. 2019). CPUs are an alternative to GPUs as general-purpose hardware thanks to their flexibility. Besides, manufacturers may offer accelerated libraries to further improve the performance of their products on DL workloads. For example, NVIDIA provides the CUDA Deep Neural Network library (cuDNN), which includes highly optimized primitives for deep neural networks that leverage specialized hardware components of NVIDIA GPUs and thus improve performance (NVIDIA 2020). Intel offers the oneAPI Math Kernel Library (oneMKL) to increase application performance on Intel-based systems (Intel 2020). Besides libraries from hardware manufacturers, Open Computing Language (OpenCL) is a platform that provides heterogeneous parallel computing ability on cross-vendor and cross-platform hardware (Tompson and Schlachter 2012).

DL-Specific Hardware

DL-specific hardware is fully customized for DL workloads to further push the limit of performance and energy efficiency. Popular DL-specific hardware includes dedicated hardware based on Field Programmable Gate Array (FPGA) (Lacey et al. 2016) and Google Tensor Processing Unit (TPU) (Jouppi et al. 2017).

2.3 Deep Learning Compilers

DL compilers are proposed to alleviate the engineering efforts of developers when deploying or optimizing DL models on different hardware. Given a DL model from one of the DL frameworks, a DL compiler parses the model definition and generates an optimized code implementation (i.e., a deployable module) for the target DL hardware.

2.3.1 The Architecture of DL Compiler

In general, the compilation process of DL compilers that transforms a model definition to the highly optimized code implementation can be divided into four layers: 1) frontend, 2) intermediate representation (IR), 3) optimization, 4) backend, as shown in Fig. 2.

Fig. 2 Generic architecture of DL compiler

Frontend

The frontend takes a DL model from one DL framework as input and transforms it into a computation graph representation (high-level IR). Computation graph optimization techniques are then applied to this graph. Finally, the optimized computation graph is passed to the backend for further hardware-specific optimizations. TVM uses a frontend named Relay (Roesch et al. 2018), which supports parsing DL models from almost all popular DL frameworks and can perform various hardware-independent optimizations.

IR

There are two kinds of IR involved in the DL compilation process, namely the high-level IR (graph IR) and the low-level IR. As its high-level IR, TVM uses Relay IR (Roesch et al. 2018), a functional IR that adopts both a directed-acyclic graph (DAG)-based IR and a let-binding-based IR. The low-level IR of TVM is based on the well-known Halide IR (Ragan-Kelley et al. 2013), which TVM has further developed into an independent symbolic IR.

  • High-level IR is a high-level abstraction of DL models, expressed as a computation graph and is hardware independent (Xing et al. 2019). High-level IR enables DL compilers to perform graph-level optimizations.

  • Low-level IR resides in the backend of the DL compiler and represents the computation of the DL model in a more fine-grained view. It enables DL compilers to utilize hardware-specific optimizations and optimized libraries for a specific target platform.
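To make the two IR levels more concrete, the following sketch (assuming TVM's tensor-expression Python API; exact names may differ across versions) defines a tiny element-wise operator and prints its lowered, loop-level IR, i.e., the low-level form on which hardware-specific optimizations operate.

```python
# Hedged sketch of the low-level IR: a one-operator tensor expression is scheduled
# and lowered into loop-level IR (the form the backend optimizes and compiles).
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")                     # symbolic input tensor
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")   # element-wise operator
s = te.create_schedule(B.op)                           # default schedule, no optimizations yet
print(tvm.lower(s, [A, B], simple_mode=True))          # prints the low-level (loop) IR
```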

Optimization

Since optimization is associated with IR, there are also two kinds of optimization involved in the compilation process. For high-level optimization, TVM supports standard optimizations such as operator fusion and constant propagation (a small sketch is given after the list below). For low-level optimization, TVM supports traditional hardware-specific optimizations in the backend, such as hardware intrinsic mapping, memory allocation and fetching. Furthermore, TVM utilizes machine-learning-based auto-tuning for further optimization.

  • High-level optimizations are involved in the frontend of the DL compiler and are applied to the computation graph.

  • Low-level optimizations are performed in the backend of the DL compiler using hardware-specific optimizations, auto-tuning methods and optimized libraries.
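As a minimal illustration of the graph-level (high-level IR) optimizations mentioned above, the following sketch builds a tiny Relay function and runs two Relay passes on it. The pass selection and the Sequential/PassContext API follow the TVM Python API around v0.7 and may differ in other releases.

```python
# Hedged sketch of high-level (graph) optimizations: constant folding and operator
# fusion applied to a toy Relay module via a pass sequence.
import tvm
from tvm import relay

x = relay.var("x", shape=(1, 8))
y = relay.add(x, relay.const(1.0))
z = relay.multiply(y, relay.const(2.0))
mod = tvm.IRModule.from_expr(relay.Function([x], z))

seq = tvm.transform.Sequential([
    relay.transform.FoldConstant(),   # constant propagation on the graph
    relay.transform.FuseOps(),        # operator fusion
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
print(mod)                            # the optimized high-level IR
```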

Backend

The backend transforms the optimized computation graph into low-level IR, performs optimizations for the target hardware, and finally packs the generated code into a deployable module (Chen et al. 2018a). TVM defines the compiled object as a module, which can be deployed on the target device (TvmDeveloper 2020c).

Besides the aforementioned four common components, different DL compilers may have their own specific components in order to enhance their functionality. For example, TVM provides the Versatile Tensor Accelerator (VTA), an open and customizable deep learning accelerator with a TVM-based compiler stack. TVM regards it as an extension of the TVM framework to advance deep learning and hardware innovation (TVM 2020a; Moreau et al. 2018). Furthermore, TVM provides a remote procedure call (RPC) mechanism that is useful for cross-compilation and relieves users of the burden of remote testing.
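For illustration, the typical RPC workflow looks roughly like the sketch below: the host cross-compiles a module, pushes it to an RPC server running on the device, and loads it into the remote runtime. The IP address and port are placeholders, and the API names follow the TVM Python API around v0.7.

```python
# Hedged sketch of TVM's RPC workflow; assumes an RPC server has been started on
# the target device (e.g., via `python -m tvm.exec.rpc_server`) and that "net.tar"
# was cross-compiled on the host for that device.
from tvm import rpc

remote = rpc.connect("192.168.1.42", 9090)   # placeholder address of the device-side RPC server
remote.upload("net.tar")                     # push the cross-compiled module to the device
rlib = remote.load_module("net.tar")         # load it into the remote TVM runtime
dev = remote.cpu(0)                          # handle to the remote device context
```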

2.3.2 An Example of Using TVM to Deploy DL Models

The whole TVM stack can be divided into two components, namely the TVM compiler and the TVM runtime. The TVM compiler performs all the compilation and optimizations, while the TVM runtime runs on the target devices. Users do not need to build the whole TVM stack on the target device, especially when the target device has only limited computing resources. TVM allows users to cross-compile a DL model on a desktop or server and then deploy the compiled module on a target device on which only the very minimal TVM runtime is installed (TvmDeveloper 2020b).

Figure 3 shows an example of how to use TVM to compile a pre-trained ResNet18 model from MXNet (Foundation 2020) and deploy the compiled runtime module on a Raspberry Pi 3b+ that only has the TVM runtime installed. The code snippet in the upper-left corner downloads and compiles the ResNet18 model (i.e., through relay.build). Note that the variable target, which contains the target description and codegen options, is already simplified by TVM, i.e., TVM has the target parameters for the Raspberry Pi 3b+ built in (tvm.target.arm_cpu("rasp3b")). The complete form of this description is tvm.target.Target("llvm -device=arm_cpu -model=bcm2837 -mtriple=armv7l-linux-gnueabihf -mattr=+neon"). Users need to manually specify these target parameters if their device's parameters are not built into TVM.

Fig. 3 An example of how a TVM user can use TVM to deploy the ResNet-18 model on Raspberry Pi 3b+

The compiled module contains the optimized computation graph (graph), a module containing the necessary libraries (lib) and the parameters of the final graph (param), and the library is saved to “net.tar” for later deployment (i.e., through export_library). When deploying the compiled module on the Raspberry Pi, the first step is to load the compiled module (“net.tar”) and then create a runtime module (module). Finally, the compiled module can be run on the Raspberry Pi. For developers, there are various areas of the TVM stack they could contribute to. For example, developers could add new operators or a compiler pass to Relay, which is related to the frontend of TVM. As for the backend, developers can implement a new backend for a new hardware platform according to their needs (e.g., Hexagon,Footnote 2 TI DSPFootnote 3).
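The following sketch mirrors the flow described above and in Fig. 3. It is a minimal, hedged example: the API calls roughly follow the v0.6-era TVM Python API, in which relay.build returns the graph JSON, library and parameters separately; newer releases bundle these into a single factory module. The input data is a placeholder.

```python
# ----- Host side: import, compile and export the deployable module -----
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime
from mxnet.gluon.model_zoo import vision

block = vision.get_model("resnet18_v1", pretrained=True)        # pre-trained MXNet ResNet18
shape_dict = {"data": (1, 3, 224, 224)}
mod, params = relay.frontend.from_mxnet(block, shape_dict)       # frontend: MXNet -> Relay IR

target = tvm.target.arm_cpu("rasp3b")                            # built-in Raspberry Pi 3b+ target
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, params=params)
lib.export_library("net.tar")                                    # cross-compiled library for the device

# ----- Device side: Raspberry Pi with only the TVM runtime installed -----
# (In practice the graph JSON and params are serialized and copied to the device
# together with net.tar; they are reused directly here for brevity.)
loaded_lib = tvm.runtime.load_module("net.tar")
dev = tvm.cpu(0)
module = graph_runtime.create(graph, loaded_lib, dev)
module.set_input("data", tvm.nd.array(np.zeros((1, 3, 224, 224), dtype="float32")))
module.set_input(**params)
module.run()
out = module.get_output(0)
```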

3 Methodology

To better understand the challenges in using/developing TVM, we analyze relevant questions and answers posted on the Apache TVM Community,Footnote 4 which is the official discussion forum of TVM where users/developers seek technical advice on unsolved issues. In this section, we describe how we selected the study subject (i.e., TVM) and data source (i.e., the TVM discuss forum), how we collected the data for this study, and how we performed the study.

3.1 Data Collection

Among the well-known and widely used DL compilers, i.e., TVM, nGraph, Tensor Comprehensions (TC), Glow and XLA (Li et al. 2021), we focus our study on using and developing one DL compiler, i.e., TVM, for two main reasons. First, TVM supports more DL frameworks than other DL compilers and delivers outstanding performance (Li et al. 2021). Second, we do not find sufficient data to study the use and development of these DL compilers except for TVM. XLA, which is a DL compiler supported by Google as part of TensorFlow, does not have a discussion forum. Both Glow and Tensor Comprehensions are developed by Facebook; Glow is part of PyTorch and shares the discussion forum with PyTorch.Footnote 5 Tensor Comprehensions does not have a discussion forum and mostly uses a Slack channelFootnote 6 for discussion, and most of the messages were posted two years ago. nGraph is supported by Intel and is part of the Intel OPENVINO project. nGraph uses the OPENVINO discussion forumFootnote 7 and only 78 posts were found to be related to nGraph after manual inspection. Our search of Stack Overflow for questions on any of the abovementioned DL compilers also yields very few posts. Thus, we choose TVM as the representative DL compiler in this study.

To study TVM, we identified a list of resources we could utilize for understanding the challenges in developing and using TVM, namely Stack Overflow, the TVM Slack channel, the TVM discuss forum and the TVM GitHub repository. Upon further investigation, we found that there are few discussions about TVM on Stack Overflow. As shown in Table 1, we used search terms that cover all the questions related to TVM and the important components of TVM (e.g., relay, autotvm). The “answers:1” parameter is used to ensure that the returned results have at least one answer. We then manually inspected all the returned results and found only 11 questions related to TVM. This is not surprising, as 1) TVM is an emerging but relatively new topic (i.e., TVM released its first version on 25 October, 2017); 2) answering TVM questions requires non-trivial expertise; and 3) the TVM community encourages discussions on the official forum.

Table 1 Questions about TVM on Stack Overflow

In addition, for the TVM Slack channel, we find that developers are advised to only use Slack as a non-archival place for quick syncs and that discussions should still happen in the discussion forum or on GitHub.Footnote 8 In the end, we decided to choose the Apache TVM Community and the TVM GitHub repository as the only data sources of this study. Specifically, we utilize the Apache TVM Discuss Forum for studying the challenges TVM developers and users may face, and the TVM GitHub repository for studying TVM bugs. The Apache TVM Community (also referred to as the TVM Discuss Forum) was launched in April 2018 (around the same time that TVM released version v0.4) and has been the main communication channel for TVM users and developers.

Step 1: Crawling the TVM discussion forum. :

We collected the TVM dataset by crawling the official Apache TVM Community (i.e., the TVM Discuss Forum) on November 23, 2020. We collected a dataset of 3,727 posts from 4 April, 2018 to 23 November, 2020. The crawled data of each post contains all the metadata of the post, including the title of the post, all the replies, the link to the post, number of replies, number of views and the time of the latest activity.

Step 2: Identifying relevant topics. :

We performed an initial screening on the collected posts as some of the topics that the posts cover are not of interest to this study. In particular, we leverage the official categories associated with each post to identify relevant topics. Table 2 shows the 10 official categories for classifying topics (second column) and the number of posts under each category in the collected dataset. The official categories are recommended by the TVM discussion forum and are manually selected by the forum users when posting their questions, i.e., one post can only have one associated category.

However, after examining 50 randomly sampled posts, we found that users loosely follow the intention of the official categories, i.e., the standard of applying the categories is inconsistent across the posts, especially for usage-related categories such as troubleshooting. Nevertheless, we noticed that the official category development is consistently used by the posts that TVM developers publish for communicating on TVM development issues. Hence, to facilitate the subsequent stratified sampling process, we merge the five usage-related (i.e., non-development) official categories into one, producing two high-level categories, namely Usage and Development. We manually examined all the posts under the categories Announcement, Meetup, d2l, uTVM, and Site Feedback and concluded that these posts are indeed irrelevant to the usage and development of TVM. Hence, we excluded these categories from the remainder of the study.

Step 3: Crawling bug-fixing pull requests (PRs). :

We collected bug-fixing PRs of the official repository of TVM from July 30, 2019 to November 23, 2020 on GitHub using the GitHub search API (git 2021). Upon examining the sampled posts from the TVM discuss forum, we noticed that only a small portion of the posts are about TVM bugs, which leads to a small number of TVM bug posts in the sample. Hence we include bug-fixing PRs to enhance the dataset of TVM bugs. We followed previous work (Garcia et al. 2020) to collect the bug-fixing PRs that contain at least one bug-related keyword (i.e., fix, defect, error, bug, issue, mistake, incorrect, fault and flaw). The first two authors then manually inspected and classified these bug-relevant PRs independently. As a result, 297 TVM bugs are identified.
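For illustration, a query of the kind described above could be issued against the GitHub search API as in the sketch below. The exact qualifiers, pagination and deduplication used during collection are not specified in the text, so the query string and parameters here are assumptions.

```python
# Hedged sketch: collect merged TVM PRs in the study period whose titles contain
# a bug-related keyword, querying one keyword at a time via the GitHub search API.
import requests

KEYWORDS = ["fix", "defect", "error", "bug", "issue", "mistake", "incorrect", "fault", "flaw"]
candidates = {}
for kw in KEYWORDS:
    q = f"repo:apache/tvm is:pr is:merged created:2019-07-30..2020-11-23 {kw} in:title"
    resp = requests.get("https://api.github.com/search/issues",
                        params={"q": q, "per_page": 100})
    for item in resp.json().get("items", []):
        candidates[item["number"]] = item["title"]   # deduplicate by PR number

print(len(candidates), "candidate bug-fixing PRs for manual inspection")
```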

Table 2 Statistics of the TVM Discuss Forum

3.2 Manual Investigation

To categorize the challenges of using and developing TVM, we follow the open coding method in Berg et al. (2004).

3.2.1 Construction of Taxonomy of Challenges in Using/Developing TVM

Step 1: Initial category distillation. :

In this step, the first two authors jointly inspected 50 randomly sampled posts from the crawled dataset and constructed an initial set of categories. The detailed procedure is described as follows.

The first two authors thoroughly read all the posts to get familiar with them. All elements of the posts, including the title, main body, replies and code snippets, were carefully examined. URLs mentioned in the posts were also followed in order to get a precise understanding of the question.

Once the two coders were familiar with the samples, they started assigning short phrases as initial tags to describe the challenges behind these posts. To determine the category of a post, we follow the method adopted in Chen et al. (2020). Specifically, for posts raised without deep investigation (usually in the form of “how”, e.g., “How to schedule fused ops?”) or detailed information, the two coders can often summarize the challenges based on the post descriptions; for posts with detailed descriptions of faults or unexpected results, the coders identify the challenges based on their causes. For instance, if a developer files a post seeking help on an error he/she encountered when exporting DL models, and the coders can identify from the descriptions, comments and replies of other users that the cause is an incorrect build configuration of TVM, the coders consider build configuration as the challenge behind this post. The two coders then cluster similar tags into categories and create a hierarchical taxonomy of challenges. If there are conflicts between the proposed categories, an arbitrator is involved in the discussion and the conflicts are marked as resolved when all participants have reached a consensus. Finally, an initial set of categories is distilled.

Step 2: Independent labeling and constructing extended categorization. :

Upon constructing the initial set of categories, the two coders continued to analyze a statistically representative sample of posts independently. We adopted a stratified sampling strategy to randomly sample a total of 347 posts, ensuring a 95% confidence level and a 5% confidence interval for the population of 3606 posts (i.e., the posts in the categories of usage or development of TVM). The sample contains 312 posts under the category of usage and 35 posts under the category of development.
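For readers who want to reproduce the sample size, one common way to derive it is Cochran's formula with a finite-population correction, as sketched below; the rounding convention is an assumption.

```python
# Hedged sketch: sample size for a 95% confidence level (z = 1.96), 5% confidence
# interval (e = 0.05), worst-case proportion p = 0.5, and population N = 3606.
z, p, e, N = 1.96, 0.5, 0.05, 3606
n0 = z ** 2 * p * (1 - p) / e ** 2    # infinite-population sample size (~384.2)
n = n0 / (1 + (n0 - 1) / N)           # finite-population correction (~347.3)
print(round(n))                        # 347, matching the sample size used above
```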

Posts not related to the usage or development of TVM are marked as False Positives and excluded from the dataset of this study, i.e., False Positives are posts related to neither the Usage nor the Development of TVM. As shown in Table 3, upon manual inspection, we excluded 46 posts (two false positives and 44 bug reports) that belong to neither development nor usage, and corrected the category of some posts: four posts marked as usage in the sample are actually about development, one post is irrelevant to both the usage and development of TVM, and 11 posts marked as development are actually about usage. The posts with incorrect categories were corrected by the authors and are utilized to conduct RQ 1 or RQ 2 according to their corrected category. Note that the 2 False Positives are excluded from all the RQs in this paper. Hence we end up with 279 posts in the usage category and 22 posts in the development category, which were used to conduct RQ 1–2. The 44 bug reports were excluded from RQ 1–2 and were used to conduct RQ 3.

Table 3 Distribution of posts before/after manual inspection

During the labeling process, the two coders evolve the initial categories into the final taxonomy in an iterative manner, in which they continuously look at the existing categories and the post being inspected to refine the taxonomy. There are two kinds of changes that may be applied to the initial categories: 1) if any coder cannot fit a post into one of the initial categories, this post is jointly inspected by the two coders with an arbitrator to determine whether a new category should be added; 2) if any coder finds a category not representative, all the authors meet and discuss revising the corresponding category. If agreement is reached to change a category, the corresponding category is modified and all the posts in this category are inspected and labeled again to avoid misclassification.

Furthermore, the two coders kept a note of the resolution status of the examined posts. There are multiple ways that TVM users can mark the resolution status of a post: 1. users may add “SOLVED” to the title after getting a correct answer; 2. there may be an explicit reply to the post such as “Thanks, it really works.” even if the user does not add “SOLVED” to the title. The two coders took the aforementioned cases into consideration when marking the resolution status of the posts.

The questions on using and developing TVM are actively viewed and discussed by the TVM community. Figures 4 and 5 show the boxplots of the view counts and the number of replies for the posts under Usage (279 posts) and Development (22 posts), respectively. As the figures show, the posts under both categories have been receiving active discussion and attention, i.e., the median view count is 269 for Usage posts and 244 for Development posts, and the median number of replies received is 4 and 4.5, respectively.

Fig. 4 Boxplot of the view counts of Usage and Development

Fig. 5 Boxplot of the number of replies of Usage and Development

In summary, 347 posts are inspected during the manual inspection, and 39.7% of the inspected posts have an accepted answer (either marked as resolved or with an explicit acknowledgment in the replies). The inter-rater measurement of the independent labeling results is 0.72 using Cohen’s Kappa (Cohen 1960), which implies substantial agreement and demonstrates the reliability of our coding procedure. The manual inspection procedure takes about 400 man-hours. The final taxonomy is shown in Fig. 6.

Fig. 6 The Final Taxonomy of the TVM challenges

4 RQ1: What are the Challenges Users May Have When Using TVM?

Motivation

As discovered by previous studies (Zhang et al. 2019; Chen et al. 2020), developing and deploying machine-learning-backed software poses unique challenges for data scientists and software engineers. The extensive use of frameworks (e.g., DL frameworks, DL inference engines) makes development convenient and fast-evolving, but also introduces overhead for practitioners. Deep learning compilers represent an emerging line of techniques that aims to provide end-to-end model optimization and deployment. It is important and timely to study the challenges and problems that the users of DL compilers (i.e., developers of machine-learning software) may commonly encounter. Our findings identify future research ideas and tool support to better facilitate the adoption of DL compilers in the development of machine-learning software.

Method

We followed the steps described in Section 3.2 and identified a total of 279 posts (including both the questions and all the replies) on the usage of TVM. Note that we do find some posts (39 under the category of Usage) by TVM users that appear to be about troubleshooting at first, but later turn out to be caused by bugs in the current TVM implementation. Such bug reports require further analysis and are beyond the scope of using TVM. Hence we classify these 39 bug reports into an individual category (i.e., not included in the category of usage) and discuss the bug reports in RQ3. We categorized the posts based on the challenges and problems described.

Results

We describe the categories derived for the posts on the usage of TVM, present the distribution and discuss the main challenges we identify. At the high level, we identify three categories: troubleshooting (150/279, 53.8%), general questions (119/279, 42.6%), and feature request (10/279, 3.6%). Figure 7 shows the distribution of the three high-level categories and the distributions of the sub-categories under each of them.

Fig. 7 Pie chart of the distribution of categories of usage

Category 1: Troubleshooting

Troubleshooting covers the largest number of posts on the usage of TVM. TVM users encounter errors and issues (e.g., compilation errors, runtime errors) when using TVM and thus seek help in the discussion forum. We further derive the following six sub-categories, namely configuration, procedure, performance, selection/usage of API, limitation of frameworks/platforms, and lack of documentation/examples.

Configuration (58/279, 20.8%). Improper configurations of TVM may cause severe reliability issues in the client code (i.e., machine learning software), such as crashes and low performance. There are three types of configurations that TVM users need to set when using TVM.

  • Build configuration (24/279, 8.6%). This category is about the challenges of building TVM from source and issues caused by building TVM incorrectly. To use TVM, developers are required to build TVM from source. To do so, developers need to edit configuration files (i.e., config.cmake) in order to enable external libraries (e.g., BLAS, cuBLAS) or backends (e.g., LLVM, CUDA). A wrong configuration can result in a build failure or an application crash. For example, users may forget to build TVM with LLVM enabled. This may not have any effect during the build process, but may raise errors when using TVM to export the deployable module. However, the official guide for building TVM from sourceFootnote 9 only says that building TVM with LLVM is recommended to enable all the features, without further explanation.

  • Environment configuration (27/279, 9.7%). Using TVM requires correctly setting up the environment, which includes configuring complex software and hardware dependencies. Similar to the findings by Zhang et al. (2019) regarding DL frameworks, we notice that configuring the environment for TVM is prone to various types of challenges such as version incompatibility issues. For example, when running the remote procedure call (RPC) tutorial on a remote device (i.e., Jetson TX2), one of the settings a TVM user needs to configure is the version of CUDA on both the host and target devices. Due to a misconfiguration, a TVM user experienced runtime errors that were difficult to resolve (i.e., costing 12 days to resolve).Footnote 10 The misconfiguration was a simple CUDA version mismatch between the host and target device. To prevent similar incorrect configurations in the future, TVM developers wrote a complete RPC deployment tutorial on how to configure common boards and targets.

  • Compilation configuration (7/279, 2.5%). A deep learning compiler such as TVM provides a highly configurable compilation process. In fact, among the steps in the complicated compilation process (as shown in Fig. 2), TVM users can opt to configure many of them, e.g., quantization size, shape of the GEMM tensor and optimization level. Incorrect or conflicting compilation options can result in crashes, compilation failures, undesired behaviors and poor performance. For example, if the batch size is set too large or the optimization level is too high (opt_level ≥ 2), TVM may consume too much device memory and raise a CUDA_ERROR_LAUNCH_OUT_OF_THE_RESOURCE error (TvmUser 2019b); a minimal sketch of these compilation knobs follows this list.
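The sketch below illustrates the two knobs from the example above: the batch size is fixed by the input shape handed to the compiler, and the optimization level is chosen through the pass context. Lowering either is a common workaround for device-memory errors; the toy model, shapes and API details are assumptions and follow the TVM Python API around v0.7.

```python
# Hedged sketch of compilation configuration: batch size comes from the input
# shape, and the optimization level is chosen via PassContext.
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 3, 224, 224))       # batch size 1 instead of a large batch
weight = relay.var("weight", shape=(16, 3, 3, 3))
net = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], net))

with tvm.transform.PassContext(opt_level=1):            # conservative optimization level
    lib = relay.build(mod, target="cuda")                # requires a CUDA-enabled TVM build
```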

We also find that TVM users complain about the lack of sufficient examples or tutorials, which makes the configuration problem even harder to solve than it already is.

Procedure (52/279, 18.6%). We find that users often ask for help on how to perform a very specific task, typically when they have difficulties debugging the errors caused by their code. Despite the efforts of TVM developers to provide learning resources (e.g., tutorials (TVM 2020d), language reference (TVM 2020c), and a book under construction (TVM 2020b)), this indicates a gap between the available learning resources of TVM and TVM users’ needs.

This type of question is different from the more general how-to questions (explained later in “Category 2—general questions”), e.g., “Entire procedure of compilation”, and shows that the users have a certain level of knowledge of using TVM. Due to TVM’s highly diverse functionalities, there are seven sub-categories under Procedure.

  • Adding/registering new operators/targets (5/279, 1.8%). Due to the quickly evolving DL community, the operators/targets provided by TVM can be insufficient for developers. Developers may need to implement the operators (targets) themselves and add the customized operator to the runtime library (or a certain backend), which can be challenging for those who are unfamiliar with TVM.

  • Auto-tuning devices/workloads (16/279, 5.7%). TVM provides auto-tuning for developers to get the best performance for a specific device (workload), e.g., ARM CPU, x86 CPU. However, the auto-tuning process is highly sophisticated and may consist of many steps, e.g., installing additional dependencies, defining the workload, configuring the tuning settings and creating tasks; a sketch of a typical AutoTVM flow is given after this list. Using AutoTVM (the auto-tuning module of TVM) requires developers to write tuning templates for their workloads (devices). Incorrectly written templates or configurations can result in tuning failure (TvmUser 2019c) or performance degradation (TvmUser 2020a).

  • Parallel programming issues (2/279, 0.7%). Developers may have difficulty dealing with parallel programming, e.g., synchronization and thread scope. For example, a TVM developer sought help on how to do global synchronization in the IR builder on GPU (TvmUser 2018c).

  • Model quantization/dequantization (3/279, 1.1%). TVM utilizes quantization to enable high-performance inference on edge devices (TvmDeveloper 2018b) and to reduce power and compute requirements. For this technique, developers have difficulty quantizing a model that has an operator that is not quantized in TVM (TvmUser 2019f) or dequantizing the weights back when needed (TvmUser 2020h).

  • Pattern matching (5/279, 1.8%). TVM developers often need to identify pure data-flow sub-graphs of the Relay (frontend of TVM) program and transform them in passes such as fusion, external code generation and device-specific optimizations, which requires a lot of tedious boilerplate code (TvmDocs 2020). To relieve users of this burden, TVM provides a pattern language and APIs to enable pattern matching and pattern processing. TVM users often have trouble writing the correct pattern to match a specific operator (TvmUser 2020b, 2020g) or finding existing patterns that fit their purposes (TvmUser 2018f).

  • Exporting/converting models (13/279, 4.7%). These posts cover the challenges in exporting/converting models into the formats required by a specific target platform in order to deploy the models. Similar to the findings by Chen et al. (2020), we notice that exporting/converting models using TVM is prone to various types of challenges, such as confusion about a specific step in the exporting (converting) process. One example is “[SOLVED] How to export model library to so file instead of tar for armv7 on x86 box” (TvmUser 2018g). While how to export a model library as a tar file is known to the user, the user had issues debugging the compilation failures and therefore sought help. As TVM offers highly configurable functionalities and much flexibility, learning resources (i.e., tutorials, documentation) may not cover every aspect of such flexibility, even though a code snippet demonstrating a similar task may be covered in the tutorials.

  • Importing/loading models (8/279, 1.8%). To deploy compiled DL models, developers also need to tackle importing and loading models. For example, a user had issues loading the exported parameters on a big-endian system (TvmUser 2018d). In addition, users may also encounter unsupported-operator issues during the importing/loading process (TvmUser 2019d).
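For reference, a typical AutoTVM flow (extract tuning tasks, pick a tuner, tune and log the best schedules, then rebuild with the best configurations) roughly looks like the sketch below. The variables mod, params and target are assumed to come from a Relay frontend import as in the deployment example of Section 2.3.2, and the argument names follow the v0.6/v0.7-era tutorials.

```python
# Hedged sketch of auto-tuning with AutoTVM; trial counts and runner settings are
# illustrative, and `mod`, `params`, `target` come from an earlier frontend import.
import tvm
from tvm import autotvm, relay

tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=5, repeat=1, timeout=10),
)
for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)          # XGBoost-based cost model
    tuner.tune(
        n_trial=1000,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("tuning.log")],
    )

# Apply the best configurations found during tuning when building the model.
with autotvm.apply_history_best("tuning.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
```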

Limitations of the underlying frameworks/platforms (8/279, 2.9%). TVM is expected to support various hardware and DL frameworks on different platforms (e.g., Linux, Windows). However, due to the inherent differences among such variations, TVM users often have trouble finding solutions to the resulting problems, e.g., TVM code that works fine on one platform but not on another. For example, a TVM user sought help regarding difficulties running tutorial code on Windows (TvmUser 2018a) due to the failure of TVM’s autoTVM.LocalRunner. TVM developers suggested a workaround. However, the workaround does not provide a seamless solution, i.e., it requires extra setup steps on Windows compared to Linux. Another type of limitation is caused by defects, either in the underlying frameworks/libraries or caused by incompatibility issues between the TVM version and the frameworks/libraries. For example, a TVM user’s tuning process kept getting stuck, and the follow-up investigation showed that the problem was caused by an incompatible version of XGBoost (i.e., a widely used gradient boosting library, which is used as the cost model when auto-tuning with TVM); the user was suggested to downgrade to XGBoost 0.9.0 in order to avoid unexpected errors (TvmUser 2020e).

Performance (12/279, 4.3%). TVM users have performance concerns about the time spent on auto-tuning, the resource usage of TVM and the runtime performance of the models compiled by TVM. Many posts ask why the performance of the compiled model is slower than the original one. For example, a TVM user sought help about how to limit the CPU usage of TVM when performing model inference. A developer suggested that the user directly change the number of CPU threads by using a config_threadpool function, which was not documented in the tutorials of TVM (TvmUser 2019e). Other performance issues include: 1) the resource usage of TVM, which is usually about GPU memory and CPU usage; some of these issues are caused by external dependencies (e.g., TensorFlow);Footnote 11 2) the compiled model being slower than the original one. For example, a TVM user complained that the inference time of the compiled model was very slow, and further exploration indicated that the selection of API was the root cause of this issue (TvmUser 2019a).

Lack of documents and examples (7/279, 2.5%). TVM users are sometimes clueless about how to achieve a specific step when using TVM. Different from the Procedure category, where TVM users may have some clues (i.e., indicated by the code provided in the question), here TVM users may have no idea how to perform certain tasks due to the lack of documents and tutorials/examples.

Selection/usage of API (13/279, 4.7%). TVM provides a large number of APIs (more than 630 C++ APIsFootnote 12) as it needs to take models from different frameworks and output an executable target for various hardware/platforms. We find that users may be confused about which APIs to use to fulfill their needs. For example, a TVM developer was confused about the relay.build_module.build() and relay.build_module.create_executor() APIs. These two APIs can both be used to generate code and were used in two separate tutorials. It turned out that there is a subtle difference between these two APIs (TvmUser 2019i). Note that the user is actually the lead maintainer of XGBoost and has contributed to solving a TVM issue.Footnote 13 Moreover, we found other posts in which a user complained about a performance regression after getting the DL model compiled; that user finally found that there was a significant performance difference between these two APIs (TvmUser 2019a).
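To make this confusion more tangible, the following sketch contrasts the two APIs on a toy Relay function: relay.build compiles an ahead-of-time deployable module that is driven through the graph runtime, whereas create_executor builds an executor for quick experimentation, and their performance characteristics can differ. The snippet assumes the v0.6-era API in which relay.build returns the graph, library and parameters separately.

```python
# Hedged sketch contrasting relay.build and relay.build_module.create_executor.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

x = relay.var("x", shape=(1, 16))
mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))
inp = np.random.rand(1, 16).astype("float32")

# Option 1: compile ahead of time, then drive the graph runtime explicitly.
graph, lib, params = relay.build(mod, target="llvm")
gmod = graph_runtime.create(graph, lib, tvm.cpu(0))
gmod.set_input("x", inp)
gmod.run()
out1 = gmod.get_output(0)

# Option 2: build an executor and evaluate the function directly.
executor = relay.build_module.create_executor("graph", mod, tvm.cpu(0), "llvm")
out2 = executor.evaluate()(inp)
```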

Category 2: General Questions

This category covers relatively high-level questions that are not about a specific step in using TVM. We derive the following five sub-categories.

Entire procedure of compilation (53/279, 19.0%). This category refers to general questions about the whole procedure of compiling DL models using TVM, usually raised without any in-depth investigation.

Although TVM provides tutorials that may cover tasks similar to what TVM users ask about, the users’ use cases are usually more complicated than what the tutorials cover. Furthermore, some of the tutorials may lack background knowledge and are thus hard for TVM users to comprehend. For example, a TVM user wanted to know how to populate a tensor after reading the C++ deployment example.Footnote 14 It seems that the tutorials and examples are designed to give users a “feeling” and help them understand the basic concepts, but this becomes a problem when users actually start using/modifying TVM, since some background knowledge is missing and the examples are too simple.

Design/Implementation Details of TVM (44/279, 15.8%). As the TVM community continues to draw attention, more and more developers are joining TVM development. This category represents a strong desire of TVM users/developers to have an in-depth understanding of TVM. Questions are mainly about the implementation details or design philosophy of TVM and may vary from naive ones like “Where is DLDataType defined?” (TvmUser 2020i) to very profound questions like “TensorArray GlobalVar and GlobalTypeVar Confusion” (TvmUser 2019h) that need several proficient TVM developers to work out a solution.

Conceptual Questions (4/279, 1.4%). Questions in this category are raised to understand fundamental concepts or background knowledge about DL compilers, such as “What’s the Model bias in TVM paper” (TvmUser 2019j). This category of questions is also spotted in previous studies on developing machine learning software (Bagherzadeh and Khatchadourian 2019; Chen et al. 2020).

How to study TVM (3/279, 1.1%). TVM users ask questions for the purpose of learning TVM better, e.g., which steps to follow for a better understanding. Examples include “What should I do to understand tvm source code?”, “How can I understand IR?” and “Any material of Relay for beginners?”. These questions are not related to any specific step in using TVM and are not as specific as the posts in “Design/implementation details”.

Development Progress of TVM (15/279, 5.4%). TVM users may ask about the recent development progress of TVM, especially when they are waiting for new features to be released or bugs to be fixed. For example, a TVM user may file a post to ask whether dynamically shaped tensors are already supported in TVM or whether anyone is working on this feature.Footnote 15 These posts are different from the category Lack of Doc/Example, because these users only want to check whether TVM supports the functionality they want yet.

Category 3: Feature Request (TVM Users) (10/279, 3.6%)

We find that TVM users may request new features, e.g., support for new operations or specific hardware. TVM developers may follow up with the feature requests, e.g., some of the feature requests are later implemented and some may be divided into several follow-up posts (i.e., multiple feature requests) in the development category. For example, the post with the largest view count (3,534) is in this category, namely the “INT8 quantization proposal” (TvmUser 2018e). This post was further divided into two posts in the Development category.

Discussions

Among all the usage categories, Configuration, Procedure and Entire procedure of compilation are the top three most frequently asked types of questions. Apart from How to study TVM, Feature Request questions seem to be the most difficult to solve: only 10% of the Feature Request questions have been resolved, compared to 41.35% of all other questions. Limitations of underlying Frameworks/Platforms requires the longest time to get an answer. Figures 8 and 9 show the boxplots of the number of replies and the view counts for the sub-categories under Usage. As the figures show, the median number of replies of these categories is similar, and the median view count is also close, except for How to study TVM and Lack of Documents and Examples, which are 691 and 749.5, respectively. However, the boxplot of the response time in Fig. 10 shows that the median response time varies broadly (from 314 to 10,814 minutes).

Fig. 8 Boxplot of the replies of Usage Categories

Fig. 9 Boxplot of the view counts of Usage Categories

Fig. 10 Boxplot of the resolution time of Usage Categories

5 RQ2: What are the Common Topics that TVM Developers Discuss?

Motivation

As deep learning compilers draw more and more attention, the TVM community has grown significantly since 2018, i.e., the number of TVM contributors has grown by 70% to 295 contributors from both academia and industry (University of Washington, UCB, Cornell, Amazon, Huawei, etc.) according to the 2019 TVM conference keynote (TVMConf 2019). In fact, the current number of contributors to TVM on GitHub is 466.Footnote 16 In the process of onboarding newcomers and facilitating the fast development of new TVM features, TVM developers make great use of the TVM discussion forum for open communication. The development and quality of TVM are of utmost importance in the current development ecosystem of machine learning software. Analyzing such communication posts closely allows us to better understand the challenges and recurring topics in these discussions, and to inspire future research or tool support that can help.

Method

Similar to RQ1, we followed the steps described in Section 3.1 and examined a total of 22 posts on the development of TVM. Compared to GitHub issues and pull requests, the TVM discussion forum contains more detailed information. In fact, we notice that some PRs will not even be merged before the corresponding RFC (short for request for comments) has been discussed on the discussion forum.Footnote 17 Note that when we manually examined the development posts, we also checked the pull requests and GitHub issues referred to in the replies for a better understanding.

Results

We identify two main categories for the posts under Development, namely TVM code evolution and development process improvement. Below we describe the findings of each category in detail. Figure 11 shows the distribution of the two high-level categories and the distributions of the sub-categories under each of them.

Fig. 11 Pie chart of the common topics discussed by TVM developers

Category 1: TVM Code Evolution

This is a large category that includes the posts related to changes TVM developers make to the TVM code, and it is further divided into four sub-categories.

Enhancing Performance/compatibility of API (12/22, 54.5%). We find that TVM developers often seek advice or discussion before they change the implementation or design of an API in TVM. The motivation for such changes could be unsatisfactory performance or improving compatibility. For example, a TVM developer wanted to discuss how to improve quantization accuracy in the process of developing a feature named calibration, upon observing a large accuracy loss for one model (TVMDeveloper 2019). In the replies, fellow developers shared their relevant experiences and suggested adding extra models for evaluation.

Feature Request (TVM Developers) (3/22, 13.6%). Similar to TVM users, TVM developers may also request new features in the TVM discussion forum. Compared to the requests proposed by TVM users (in the Usage category in RQ1), we find that developers tend to provide more details when requesting new features. For example, when requesting to introduce a formal DataLayout into the node system of TVM (TvmDeveloper 2018a), a developer clearly documented the abstract interface of the layout and provided two options for concrete data types, while TVM users often only express a high-level idea and ask whether there is any interest in their proposal.

Code Refactoring (1/22, 4.6%). TVM developers constantly improve the quality and readability of the TVM implementation and use posts to seek comments and discussion. For example, a developer filed a post asking other developers to help determine the naming conventions in the TVM code (TvmDeveloper 2020a).

Others (3/22, 13.6%). There are some other posts about maintaining and improving the TVM code repository.

  • Merging Repository (1/22, 4.6%). These posts are about merging other repositories into the TVM codebase. For example, a TVM developer suggested incorporating an existing Ahead-of-Time (AoT) compiler into the TVM codebase, since the TVM community was aiming to bring an AoT compiler to TVM. Another developer commented on this topic and advised that the existing Relay AoT compiler is different from the solution they discussed before and that it would need extra effort if they decided to incorporate the current implementation.Footnote 18

  • Updating Documentation (2/22, 9%). These posts are about updating the documentation of TVM. For example, a TVM developer noticed that there was little documentation for the InferBound pass, and thus wrote a tutorial for it.Footnote 19 The tutorial was finally merged into the TVM documentation after discussion with other developers.

Category 2: Improvement of TVM Development Process

To improve the development efficiency and the quality of TVM development in general, TVM developers constantly discuss and adopt better software engineering practices. This category of questions differs from “Category 1: TVM Code Evolution” as it is not related to the TVM implementation. We find that the effort to improve the TVM development process comes from two aspects: tools and practices.

Tools (1/22, 4.6%). TVM developers propose new development tools (e.g., IDE plugins) to help the development of TVM and accommodate its peculiar characteristics. For example, a developer proposed to develop a language server tool that can better navigate across different programming languages (Chen 2020).

Better Practices (2/22, 9%). This category includes the posts raised to discuss how to improve the development process in general, e.g., how developers can collaborate in a more efficient way. For example, a TVM developer proposed a better format for bug-fixing pull requests (TvmUser 2020d), e.g., adding a [Bugfix] tag to the pull request.

Discussion

Based on our observations, developers mainly use the TVM forum as a place to discuss the development of TVM. Among the 355 posts in the category development, 60.2% of the posts are tagged as RFC (i.e., “[RFC]” in the title). Developers often file a post before diving into implementation details to hear other developers’ opinions on their proposal. Besides, due to the complexity of TVM, TVM developers want to do their best to avoid affecting other parts of the TVM implementation. Furthermore, according to the TVM contribution guide, when major changes are proposed, an RFC should be sent to allow discussion by the community.Footnote 20 The discussion forum is one of the advised channels for opening an RFC. This can also explain why 60.2% of the development posts are tagged as RFC. Figures 12 and 13 show the boxplots of the view counts and the number of replies for posts under Improvement of Development Process and TVM Code Evolution. As the figures show, the posts in Improvement of Development Process are more active than those in TVM Code Evolution, i.e., the median view count is 304 for Improvement of Development Process and 255 for TVM Code Evolution, and the median number of replies received is 9 and 4, respectively. As Fig. 14 shows, posts in Improvement of Development Process also need a longer time than those in TVM Code Evolution; the median response time is 4,170 and 3,768 minutes, respectively.

Fig. 12 Boxplot of the view counts of Development Categories

Fig. 13 Boxplot of the replies of Development Categories

Fig. 14 Boxplot of the resolution time of Development Categories

6 RQ3: What are the Impacts of Self- or User-Reported TVM Bugs?

Motivation

The quality of DL compilers has a significant impact on the correctness and efficiency of the deployable modules (i.e., deployed DL models). In addition, developing and testing DL compilers involve unique challenges compared to traditional software (e.g., language compilers). Hence, in this RQ, we set off to obtain an initial understanding of the defects reported by TVM users and developers, particularly their severe impacts on TVM users and developers.

Method

We include two data sources for studying TVM bugs. The first source is bug-report posts in the TVM discuss forum. The second source is bug-fixing pull requests from the TVM GitHub repository. Both users and developers of TVM may file a post in the TVM discuss forum or initiate a bug-fixing pull request to report or fix the bugs they find in TVM. For the bug-report posts in the TVM discuss forum, the main body of these posts usually contains preliminary analysis of the potential causes of the bug, which distinguishes these posts from the Troubleshooting category in Section 4. We notice that some posts may appear to be about troubleshooting in the beginning, but through rounds of investigations and discussions, TVM developers identified that the erroneous behaviours are caused by TVM bugs instead of incorrect usage of TVM. For analyzing bug-fixing PRs, we include all the information we can find, including their corresponding TVM posts and GitHub issues.

Similar to RQ1 and RQ2, we also mark the resolution status of each bug-report post, i.e., whether or not there is an explicit [Resolved] in the title or any indication in the replies, such as a link to a pull request or users acknowledging that the bug was resolved after checking out the latest commit of TVM. For bug-fixing PRs, as we only include the PRs that are already merged, such bugs are considered resolved. Furthermore, we analyzed the impact of the bugs regarding the usage and development of TVM. In addition, we analyzed the component of TVM that is the potential root cause of each bug based on the fix commits and discussions.

Results

There are a total of 341 bugs in RQ3, including 44 bug reports in the sampled posts from the TVM discuss forum and 297 bug-fixing PRs from the TVM GitHub repository. 22 (50.0%) of the 44 bug-report posts are resolved, and the remaining unresolved bug posts are still under investigation. All the bug-fixing PRs are merged into the repository.

Regarding the reproducibility of the studied bugs, we notice a high level of engagement of TVM developers on the 44 bug reports (i.e., at least one reply confirming the reproduction of the reported bug and kicking off follow-up investigations), and therefore we consider all 44 bug reports reproducible. Similarly, we consider all bugs fixed by the included bug-fixing PRs to be reproducible.

We identified a total of four types of impacts: compilation failure, non-compilation runtime error (including crashes and exceptions, similar to Islam et al. (2019)), poor efficiency and low effectiveness.

Impact 1: Compilation Failure. The TVM compiler may fail to compile the input DL model for various reasons such as unsupported operators and incompatibility issues. For example, a TVM user (TvmUser 2018b) complained that TVM failed to compile for the CUDA target due to unsupported operators (i.e., argmax and argmin from MXNet). A TVM developer later solved the issue and submitted a pull request.
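For illustration, the following minimal sketch (not taken from the original post; the model and input shape are hypothetical) shows how compiling an MXNet model for a CUDA target is typically attempted with TVM's Relay frontend. Unsupported operators generally surface as errors raised during the frontend conversion or the build step.

```python
# Minimal sketch (hypothetical model and shape): compiling an MXNet model for CUDA.
# An unsupported operator typically surfaces as an error raised by the Relay
# frontend or by relay.build.
import tvm
from tvm import relay
import mxnet as mx

# Example model only; the original post used a model containing argmax/argmin.
block = mx.gluon.model_zoo.vision.resnet18_v1(pretrained=True)
shape_dict = {"data": (1, 3, 224, 224)}

# Convert the MXNet model into Relay IR; unsupported operators raise errors here.
mod, params = relay.frontend.from_mxnet(block, shape_dict)

# Compile for the CUDA backend; backend-specific issues surface here.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda", params=params)
```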

Impact 2: Non-compilation Runtime Error. In addition to compilation failures, TVM also suffers from runtime errors that are not related to compilation, i.e., crashes or exceptions. Such errors prevent TVM users and developers from continuing to work with TVM. For example, a TVM user encountered a runtime error that caused the connection to be reset when creating large arrays using RPC (TvmUser 2020f). In particular, an out-of-memory error was thrown by the RPC server, which reset the connection. The developers fixed this issue by preventing a temporary buffer from growing too large in order to avoid out-of-memory errors.
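A minimal sketch of the kind of RPC usage involved is shown below (the server address, port, and array size are hypothetical); an out-of-memory error on the remote side would manifest as a runtime error rather than a compilation failure.

```python
# Minimal sketch (hypothetical host/port): allocating an array on a remote
# device through TVM RPC. For very large allocations, an out-of-memory error
# on the RPC server can reset the connection, as in the reported bug.
import numpy as np
import tvm
from tvm import rpc

remote = rpc.connect("192.168.0.10", 9090)   # assumed address of the RPC server
dev = remote.cpu(0)

# A sufficiently large allocation may trigger an out-of-memory error remotely.
data = np.zeros((1 << 20,), dtype="float32")
arr = tvm.nd.array(data, dev)
print(arr.shape)
```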

Impact 3: Poor Efficiency. TVM aims to optimize the performance of DL models. However, bugs in TVM can lead to poor performance of the compiled DL models, e.g., unexpectedly long inference time or excessive resource consumption. For example, a TVM user (TvmUser 2020c) found that after the latest pull request was merged, there had been a significant performance regression. A TVM developer later confirmed the regression and was able to fix it: it was caused by a refactoring commit that incorrectly modified an attribute.
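The following minimal sketch (assuming a compiled library `lib` such as the one produced in the earlier sketch) shows how the inference latency of a compiled module is typically measured with TVM's time evaluator, which is how regressions of this kind are usually observed.

```python
# Minimal sketch (assumes `lib` from relay.build as in the earlier sketch):
# measuring the inference latency of a compiled module.
import numpy as np
import tvm
from tvm.contrib import graph_executor

dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("data",
                 np.random.rand(1, 3, 224, 224).astype("float32"))

# Run the whole graph repeatedly and report the mean latency in milliseconds.
timer = module.module.time_evaluator("run", dev, number=10, repeat=3)
latency_ms = np.mean(np.array(timer().results)) * 1000
print(f"mean inference latency: {latency_ms:.2f} ms")
```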

Impact 4: Low Effectiveness. The DL models compiled by TVM may show poor accuracy and abnormal behaviours after compilation. For example, a TVM user reported that a compiled GluonCV SSD model kept outputting incorrect inference results on ARM Mali because one of the operators required by the model and supported by TVM (i.e., vision.get_valid_counts) did not work correctly on the Mali GPU (TvmUser 2019g). This issue was solved by two subsequent pull requests.
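A minimal sketch of how such accuracy problems are usually detected is given below; it assumes the hypothetical `block`, `shape_dict`, and `module` from the earlier sketches and simply cross-checks the original framework's output against the TVM-compiled module's output.

```python
# Minimal sketch (assumes `block`, `shape_dict`, `module` from earlier sketches):
# cross-checking the framework output against the TVM-compiled output.
import numpy as np
import mxnet as mx

x = np.random.rand(*shape_dict["data"]).astype("float32")

# Reference output from the original MXNet model.
ref_out = block(mx.nd.array(x)).asnumpy()

# Output from the TVM-compiled module.
module.set_input("data", x)
module.run()
tvm_out = module.get_output(0).numpy()

# Large deviations indicate a correctness (low-effectiveness) bug.
np.testing.assert_allclose(ref_out, tvm_out, rtol=1e-3, atol=1e-3)
```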

Discussion

Based on the analysis, Compilation Failure (101/341, 29.6%) is the most common impact of TVM bugs. Table 4 presents the distribution of the 341 bugs across TVM components (i.e., frontend, IR, backend, VTA and RPC) and impacts. Note that the component is marked as “Unclear” if there is no indication or discussion about the component of the bug. As explained in Section 2.3, frontend, IR and backend are common components of DL compilers and are not unique to TVM. Both VTA and RPC are unique components of TVM. VTA is a customizable deep learning accelerator with a TVM-based compiler stack. RPC is provided by TVM to support cross-compilation and ease remote testing on target devices. All of these components are important to the correct use of TVM, and bugs in any of them will prevent the compiled DL models from functioning properly. Among all the components, the frontend and backend contribute the largest numbers of bugs. This is not surprising since the frontend needs to deal with many DL frameworks and the backend needs to generate code for various hardware, which makes them subject to problems such as compatibility issues and unsupported operators.

Table 4 The distribution of TVM bugs based on impacts and TVM components. Note that the results are based on 44 bug-report posts on TVM discuss forum (i.e., the first number in brackets) and 297 bug-fixing PRs (i.e., the second number in brackets)

7 Discussion and Implication of Our Findings

Generality of Our Findings

Although our study is conducted on TVM, we believe our taxonomy of challenges in using/developing TVM is prevalent and most of our findings can be generalized to other DL compilers. The main reason is that most DL compilers share a common process flow (e.g., frontend, IR, high-level optimization, low-level optimization, backend) (Xing et al. 2019). As an example, Environment configuration under Usage-Troubleshooting-Configuration reveals environment-related challenges. It is well known that compiling DL software or DL libraries often suffers from challenges in setting up the environment. We believe these findings apply to other DL compilers since most DL compilers require users to prepare an environment with sophisticated dependencies (e.g., a compatible CUDA toolkit and cuDNN library) in order to generate optimized modules for a specific target platform. For instance, the installation guide of Tensor Comprehensions asks users to be careful about the CUDA toolkit dependency problem.Footnote 21 Furthermore, we collected 15 posts (all posts with “ngraph” in the title) from the official forum of nGraph, 30 randomly sampled posts from the official forum of Glow, and 30 bug-related PRs from their GitHub repositories. We then used the taxonomy summarized in Fig. 6 and the four impacts of TVM bugs to classify these posts/PRs. The results show that all these posts/PRs can be classified using the findings from our study, which further supports the generality of our findings. The labeling results are also made publicly available.Footnote 22

Lack of Benchmark for Performance Evaluation

During the labeling process, we notice that a non-trivial number (13/279, 4.7%) of the sampled posts discuss the performance of models compiled by TVM. Some users have no idea whether their models have been tuned to achieve optimal performance. There has been a study on the comparison of DL compilers (Xing et al. 2019), which provides a benchmark for evaluating the performance of seven DL compilers, including TVM, on six deep neural networks (DNNs). The TVM community has also released an official benchmark (TvmDeveloper 2018c). However, such efforts are far from covering the needs of users and preventing performance regressions as TVM evolves rapidly. For example, the official benchmark has not been updated since October 2018. Although it is impractical to cover all deep neural networks and hardware, developers and researchers could survey and provide a larger-scale benchmark of popular DNNs and hardware as needed, which would be a helpful reference for both users and developers.
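As one illustration of how such a benchmark could guard against regressions, the sketch below (the baseline file name and tolerance are hypothetical, not part of any existing TVM tooling) compares a measured latency, e.g., obtained with the time evaluator shown earlier, against a recorded baseline.

```python
# Minimal sketch (hypothetical baseline file and tolerance): guarding against
# performance regressions by comparing a measured latency against a recorded
# baseline, e.g., inside a CI job that runs a benchmark suite.
import json

BASELINE_FILE = "latency_baseline.json"   # hypothetical baseline, values in ms
TOLERANCE = 1.10                          # allow up to a 10% slowdown

def check_regression(model_name: str, measured_ms: float) -> None:
    with open(BASELINE_FILE) as f:
        baseline = json.load(f)
    expected_ms = baseline[model_name]
    if measured_ms > expected_ms * TOLERANCE:
        raise RuntimeError(
            f"{model_name}: latency regressed from "
            f"{expected_ms:.2f} ms to {measured_ms:.2f} ms")

# e.g., check_regression("resnet18_v1-cuda", latency_ms) with the latency
# measured by time_evaluator in the earlier sketch.
```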

Needs for Continuously Improving and Updating Documentation and Tutorials

Our study shows that many users and developers ask basic questions due to a lack of understanding of the basic concepts in TVM. We believe that providing better documentation, especially more comprehensive tutorial examples, will ease the onboarding process for newcomers and flatten the learning curve. In summary, we have the following advice for TVM developers as well as the research community. Future research can also utilize crowd sourcing to automatically collect more examples and update existing examples and documentation as TVM evolves.

  • Improve the quality of documents. We find that How to use TVM, Conceptual Questions and Selection/usage of APIs occupy 19.2% of the sampled posts. This suggests that many users and developers lack a basic understanding of TVM and that there are insufficient resources to cover this aspect. Furthermore, we notice that users often complain about not being able to find the examples they need, or about outdated tutorials. A continuous effort to improve the quality of such learning resources would alleviate these issues.

  • Enrich the contents of update announcements. TVM is under rapid evolution, with a large number of ongoing and upcoming API changes and corresponding compatibility changes. However, such changes are often not included in the current TVM changelog.Footnote 23 This may lead to API misuse and confuse TVM users. Thus, TVM developers should provide more detailed change logs (e.g., compatible library versions) to better help users become aware of these changes and avoid potential compatibility issues.

  • Add documentation/tutorials for more diverse use cases. We notice that TVM users or developers sometimes ask how to perform certain tasks with guidance from existing learning resources that describe similar but not identical tasks. Due to the differences between the desired task and the documented task (e.g., different devices, different compilation options, and different DL frameworks), TVM users and developers still have a difficult time figuring out how to complete their tasks. Providing more comprehensive learning resources that consider more diverse use cases will help resolve such challenges.

Needs for Automated Test Generation Support

From the previous discussion in RQ3, we have noticed that TVM not only suffers from various bugs like traditional software but also has unique bugs due to its functionality. Considering its importance as a foundation for deploying DL applications, it is critical and urgent to develop more efficient techniques that provide automated test generation support in order to make DL compilers more robust. On one hand, TVM could benefit from existing automated test generation techniques, such as fuzzing the input (i.e., DL models) to create more test cases. On the other hand, TVM’s unique characteristics could be exploited to improve its quality assurance practice. For example, TVM supports various DL frameworks and different devices, which could be utilized by differential testing techniques to test the frontend and backend of TVM.
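A minimal sketch of such backend-level differential testing is shown below; it assumes a Relay module `mod` and parameters `params` obtained from one of TVM's frontends (e.g., as in the earlier MXNet sketch) and compares the outputs of two targets on the same random input.

```python
# Minimal sketch (assumes `mod` and `params` from a Relay frontend):
# differential testing of TVM backends by compiling the same model for two
# targets and comparing their outputs on a random input.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def run_on_target(mod, params, target, dev, x):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input("data", x)
    m.run()
    return m.get_output(0).numpy()

x = np.random.rand(1, 3, 224, 224).astype("float32")
out_cpu = run_on_target(mod, params, "llvm", tvm.cpu(0), x)
out_gpu = run_on_target(mod, params, "cuda", tvm.cuda(0), x)

# Disagreement between backends points to a bug in at least one of them.
np.testing.assert_allclose(out_cpu, out_gpu, rtol=1e-3, atol=1e-3)
```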

Needs for Better Logging Practice and Debugging Support

After inspecting bug reports and troubleshooting posts in RQ1 and RQ3, we notice that the logging information of TVM is not always informative. While some TVM developers may have no difficulty in identifying the corresponding component to inspect, thanks to their familiarity with the TVM codebase, TVM users are often confused by the error messages and have no idea how to solve the problem. To better assist users in troubleshooting, we suggest that TVM developers improve the logging format of TVM and add more informative logs at critical code locations. Also, traditional debugging techniques may not work well for DL compilers due to the peculiarity of their input, i.e., DL models. Future research efforts could be devoted to specialized debugging techniques for DL compilers, in the spirit of BigDebug (Gulzar et al. 2016) for Spark.

Needs for More Descriptive Pull-Request Descriptions

During the inspection of bug-fix pull requests in RQ3, we notice that the description body of a TVM pull request is not always informative; 24.8% of the bug-fix commits do not have a description body. While the TVM community requires that a commit be reviewed by at least one active TVM contributor,Footnote 24 a bug-fix commit may still introduce new bugs to the codebase considering the complexity of TVM. To prevent long-term technical debt when fixing the involved bugs, we suggest that TVM developers maintain more descriptive pull-request descriptions, i.e., the root cause of the bug, how the bug is fixed, and which parts of TVM may be affected.

Needs for Integrated Bug Knowledge Database with Improved Traceability and Management

In the process of manually analyzing discussion forum posts, including reading relevant sources such as links to GitHub issues and pull requests, we notice several inefficiencies in the current management of bug knowledge. First, a bug often needs to be reported several times before eventually being resolved. This shows the lack of trackability and traceability within the same database, i.e., the problem of duplicate bug reports. Second, with multiple venues involved (i.e., GitHub issues, pull requests, and the discussion forum), the materials of one bug are often scattered across different knowledge databases. For example, developers often forget to post the solution to an issue even after they have fixed it on GitHub. Based on these considerations, we believe there should be an integrated bug knowledge base for better traceability.

8 Threats to Validity

In this section, we discuss the threats to the validity of our work.

Selection of Data Source

In this study, we utilize the official discussion forum of TVM as the only data source to investigate the challenges that users/developers encounter when using/developing TVM. As a consequence, we may neglect constructive insights from other sources. In the future, we plan to further validate our results by conducting in-depth interviews with researchers and developers. However, given that 1) the official discussion forum contains posts from both experts and novices; 2) most modern DL compilers share a common architecture (Xing et al. 2019); and 3) our experiments show that our findings generalize to nGraph and Glow, we believe our findings are still valid.

Selection Strategy of the Posts for Initial Taxonomy Distillation

When constructing the taxonomy, we chose to start with 50 randomly selected posts to distill an initial taxonomy and then iteratively refined it, which may introduce threats to the findings. First, although starting with N posts (N greater than 50) may yield a slightly different initial taxonomy, this is unlikely to have much impact on the final taxonomy because of the rounds of refinement. Second, there exists no gold standard regarding how many instances should be selected for an initial taxonomy. Third, differences between the initial taxonomy and the refined taxonomy are expected to some extent. For example, in Humbatova et al. (2020), the authors added 13 new leaf categories in their final round of refinement. Moreover, we would like to clarify that the differences in our study are not that significant: most of the differences involve replacing an old category name with a more representative one rather than changing the definition of a category, for example, Non-development to Usage, and Configuration-Installing/Building TVM framework to Configuration-Build Configuration. Last, we would like to emphasize that during the labeling process, if we decided to change a category, all the posts previously under that category were checked again to avoid possible misclassification.

Subjectivity of Inspection

Our study involves manual inspection of posts from the discussion forum. These subjective steps may introduce bias and present threats to the validity of our taxonomy. Furthermore, some of the sampled posts change their topics upon further investigation. For example, a TVM user complained about the performance of a compiled module (subject to Performance) and was then confused about two APIs that can both generate code yet show a significant performance difference (Selection/usage of APIs). To reduce this threat, two authors separately inspected the posts and inconsistent cases were resolved with the help of an arbitrator. Besides, the inter-rater reliability is relatively high, indicating the reliability of the coding procedure.

9 Related Work

In this section, we discuss the following lines of closely related research.

Deep Learning Compilers

The development and adoption of deep learning compilers have drawn much attention in both academia and industry due to the increasing interest in performing hardware-specific optimization for model deployment. Many deep learning compilers have been proposed and evaluated to this end, e.g., TVM (Chen et al. 2018a), nGraph (Cyphers et al. 2018), Tensor Comprehensions (Vasilache et al. 2018), Glow (Rotem et al. 2018), XLA (Leary and Wang 2017) and DLA (Abdelfattah et al. 2018). These deep learning compilers differ in their design architectures, IR designs and optimization methods, which are studied and elaborated in a recent survey paper on deep learning compilers (Li et al. 2021). Xing et al. (2019) summarized a general DL compiler flow and further performed an in-depth comparison among DL compilers regarding their internal components, e.g., optimization strategies and intermediate representations.

Follow-up research has been proposed to further improve the efficiency of deep learning compilers based on these common architectures. Chen et al. (2018b) present AutoTVM, which optimizes tensor programs based on workloads. Boemer et al. (2019) extended nGraph to perform computation on encrypted data so that data privacy can be preserved. These prior studies focus on the design and improvement of deep learning compilers, but do not study their usage and development; in particular, the challenges users and developers may encounter during the process have not yet been studied. Through this study, we summarize the common types of challenges users face when using one of the most popular deep learning compilers, i.e., TVM. Our research findings call for future research efforts on providing better tool support for TVM users.

Studies on Developing and Deploying ML/DL Applications

Due to the increasing popularity of machine learning applications, many empirical studies have been performed to obtain a deeper understanding of the new challenges posed during the development of machine learning applications. A recent study by Zhang et al. (2019) categorizes common challenges in developing deep learning applications through a manual investigation of 715 Stack Overflow questions. In particular, their study concluded that the top three challenges are related to program crashes, model deployment and implementation. Chen et al. (2020) investigate the challenges in deploying machine learning applications and present a taxonomy of such challenges through an analysis of 769 Stack Overflow posts. Different from previous work on the general development and deployment of machine learning software, this study focuses on the usage and development of a deep learning compiler and the common challenges shared by its users and developers. While a deep learning compiler like TVM is used for model optimization during deployment, general-purpose knowledge-sharing websites, such as Stack Overflow, may have very limited discussions on TVM usage (as shown in Section 3.1). Hence, our analysis of TVM complements these prior studies on the aspect of deep learning compilers, which has not yet been studied.

In addition, many studies have been performed to characterize the defects that occur in machine learning applications. Zhang et al. (2018) examined the root causes and symptoms of the bugs found in TensorFlow programs. Furthermore, Islam et al. (2019) present a comprehensive study of bugs (e.g., their root causes and impacts) from a larger set of deep learning libraries, i.e., Caffe, Keras, TensorFlow, Theano, and Torch. A follow-up work by Humbatova et al. (2020) presents a taxonomy of real faults in deep learning systems through a manual examination of 1,059 commits and bug issues. Zhang et al. (2020) analyzed the impact of adversarial defects on DL models and summarized the patterns of such defects. In this study, through an analysis of user-reported bugs, i.e., bugs in either TVM or TVM usage, we investigate the impacts that TVM bugs may have on DL software. Our preliminary investigation of TVM bugs calls for specialized testing techniques for DL compilers.

Studies on Bug Reports and Feature Requests

Issue-tracking systems such as Jira and Bugzilla provide rich information for many lines of software engineering research, such as fault localization and debugging (Wang and Lo 2016; Saha et al. 2013), automated program repair (Liu et al. 2013), and feature tracking (Fischer et al. 2003). The quality, components, and characteristics of issue reports, and how they impact software engineering tasks, have been well studied by previous work. Zimmermann et al. (2010) studied the essential criteria and elements of good reports and proposed a tool to measure the quality of bug reports. Chen and Chen (2021) studied the quality of logs in bug reports and how they may affect fault localization techniques. As TVM does not employ Apache Jira for development communication (e.g., reporting bugs and requesting features), in our study we use posts from the TVM discuss forum, which is an active platform used by both users and developers. Through our manual analysis, we find that TVM forum posts provide complementary knowledge on critical software development procedures and decisions, such as fixing bugs and requesting new features, in addition to GitHub issues. Future research should consider using forum posts as an alternative data source in addition to GitHub issues and Stack Overflow posts.

10 Conclusion

This paper takes the first step towards understanding the usage and development challenges of DL compilers. We manually inspect 347 posts from the official discussion forum of TVM and identify a taxonomy of challenges in the usage of TVM consisting of 15 categories, as well as seven common topics that TVM developers discuss. Among all the categories, procedure, configuration, and how-to questions are the top three most frequently asked. Furthermore, four kinds of bug impacts are summarized to better understand the common defects of TVM. Finally, we discuss the implications of our findings for developers and researchers in order to help them better understand the challenges of developing DL compilers.