1 Introduction

In the software industry, the need for data engineering has been rising lately due to the increased adoption of AI and data analytics applications.

Like any other software engineering application, a data engineering application requires a sound design, a correct implementation, and thorough testing to ensure the quality of the end product. In the presence of big data, modern data engineering applications often incur significant resource overhead. One of the cost centers is the development overhead caused by late-discovered or undiscovered errors. For instance, we would like to eliminate most run-time errors, in particular those arising in the later stages of data processing pipelines. Errors at these stages are extremely costly due to the higher computation overhead and the higher utilization of development resources such as manpower and infrastructure. Testing with sampled or scaled-down datasets might not be sufficient to detect all such errors. In this article, we propose to use static analysis techniques, with the assistance of type systems, to eliminate as many errors in data engineering applications as we can while retaining code readability, modularity, and re-usability.

This article is organized as follows. In Sect. 2, we walk through a simple batch data processing example in Spark and illustrate how to use a static type system to eliminate the need for extra run-time checks in the data processing stages while retaining code re-usability. In Sect. 3, we look into a different example in the streaming data context and study how to use phantom types and type constraints to eliminate possible run-time errors. In Sect. 4, we conclude this article with some further discussion.

2 Towards Type Safe and Reusable Spark Applications

As Apache Hadoop (Hadoop 2019) became the widely adopted system for big data infrastructure, Apache Spark (Spark 2019) became the most popular processing engine, lifting data parallelism to the next level. Spark offers a simple computation model which abstracts over a MapReduce-like batch data processing run-time. In many data engineering applications, it out-performs the "batteries-included" MapReduce library shipped with Apache Hadoop.

In this section, we highlight some of the common pitfalls of using Spark for data processing, together with their remedies. For convenience, we use Scala as the main language for the discussion, but the concepts are not language specific. We are aware of projects that attempt to enhance the type systems of dynamically typed languages such as Python (MyPY 2019); the main ideas can be applied to those system/language extensions as well.

2.1 Loosely Typed Data

Let us consider a simple example which constructs a language model from some textual data using the two-gram operation.
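
A minimal sketch of what twogram might look like, assuming the helper function rmpunc described below:

    def twogram(text: String): List[String] = {
      val words = rmpunc(text).split(" ").toList
      val pairs = words.flatMap(w => w.zip(w.drop(1)))
      pairs.map(p => s"${p._1}${p._2}")
    }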

Function twogram takes a paragraph of text, splits it into words, and stores them in a list, as stated in line 2. For simplicity, we assume there exists some helper function rmpunc that removes punctuation from the string. For each word, we construct the two-gram character lists by zipping the word with itself with the first character dropped, as stated in line 3. s1.zip(s2) forms a new sequence by pairing up elements from s1 with those from s2; for example, "abc".zip("123") yields the sequence (a,1), (b,2), (c,3). w.drop(1) removes the first character of w. Given a sequence l, the method invocation l.flatMap(f) applies the function f to each element in l; assuming the results of f are sequences, the inner sequences are concatenated (or flattened). In line 4, we use map to apply the anonymous function p => s"${p._1}${p._2}" to each pair in pairs. Given a pair p, p._1 extracts the first component and p._2 gives us the second. The interpolator s"...$x..." embeds the Scala value referenced by variable x in a string. For instance, twogram("hello world") yields the list List("he", "el", "ll", "lo", "wo", "or", "rl", "ld").

Next, we load the input data from a CSV file into a Spark dataframe. A dataframe offers a high-level abstraction and operations to manipulate data. It is particularly appealing to Python programmers due to its similarity to the Pandas dataframe.
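
A sketch of the loading step; the file name articles.csv and the application name are illustrative assumptions:

    import org.apache.spark.sql.SparkSession

    val csvFile = "articles.csv"
    val spark = SparkSession.builder().appName("twogram-batch").getOrCreate()
    val df = spark.read.option("header", "true").csv(csvFile)
    df.show()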

In the snippet above, we instantiate a Spark session and then read the CSV file into a dataframe. The last statement prints the first few rows, allowing us to take a look at the data.

Next, we would like to apply function twogram to the text field from the above dataframe.
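
A sketch of twogram_df under the assumptions above (Spark 2.x); note the Option in the result type, forced by the column check:

    import org.apache.spark.sql.{DataFrame, Dataset}

    def twogram_df(df: DataFrame): Option[Dataset[List[String]]] = {
      import df.sparkSession.implicits._
      if (df.columns.contains("Text"))
        Some(df.select("Text").as[String].map(twogram))
      else
        None
    }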

Function twogram_df applies function twogram to the "Text" column of the input dataframe. Note that we need to check for the existence of the "Text" column; in case of failure, a None value is returned. Because of this check, the returned value has to be of type Option[Dataset[List[String]]]. This has implications for the subsequent processing steps in the pipeline: every use of twogram_df has to check for a potential None among its incoming values. That said, in a strongly typed language such as Scala, the compiler enforces these checks at an earlier stage. In the next section, we pursue this direction.

2.2 Type-Setting the Data

An alternative approach is to use Dataset instead of Dataframe as the main collection data structure.
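
A sketch, continuing the previous snippets (spark, csvFile, and the imports are assumed to be in scope, and the CSV columns are assumed to be named Source, Date, and Text):

    case class Article(source: String, date: String, text: String)

    import spark.implicits._
    val mydata: Dataset[Article] =
      spark.read.option("header", "true").csv(csvFile).as[Article]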

In Spark, Dataset is a generic/polymorphic version of the dataframe. The Dataset type constructor expects a type argument which represents the type of the contained elements. In this running example, we make it explicit that the elements in the Dataset are of type Article. Dataset offers a set of data processing APIs subsuming those offered by the dataframe. In Spark, Dataframe is a type alias of Dataset[Row], where Row is a closed-world encoding of a universal data representation. A Row object can be seen as a non-generic Vector whose elements are of type Object. Using Row as the element type imposes certain limitations, as we have seen in Sect. 2.1, i.e., we need to keep checking for the existence of a given column in the rows.

At line 1 of the code snippet above, the case class keyword introduces an algebraic data type Article with a single constructor Article (sharing the type's name) that takes three arguments: source, date, and text. The use of an algebraic data type gives us multiple advantages over the Row representation.

  1.

    Firstly, it frees us from restricting the data to be structured data. Algebraic data types are a natural encoding of semi-structured data, as studied in previous works (Wallace and Runciman 1999; Sulzmann and Lu 2008).

  2.

    Secondly, it provides a type-level declaration of the incoming data's structure. It can be regarded as a contract enforcing the roles and responsibilities between the different processors in the data processing pipeline, i.e., the data must be cleansed and parsed before being used by the current processing step, and none of the columns may be null or absent.

  3.

    The third advantage is that the "down-stream" processors are greatly simplified. For example, we rewrite twogram_df into twogram_ds, as sketched after this list.

    In the body of twogram_ds, we simply apply twogram to the text fields of all articles in the dataset. This simplification applies to all subsequent steps which use twogram_ds since the Option type is no longer required.

    However, there remains a drawback in this approach. Unlike the dataframe approach, the current dataset approach demands a fixed schema for the input type of the function twogram_ds, which limits the re-usability of the function.
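
A sketch of twogram_ds, as referenced in the third item above:

    import org.apache.spark.sql.Dataset

    def twogram_ds(ds: Dataset[Article]): Dataset[List[String]] = {
      import ds.sparkSession.implicits._
      ds.map(article => twogram(article.text))
    }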

2.3 Sending Them for Classes

To resolve the lack of re-usability, we resort to a well-known programming language concept: type classes (Peyton Jones et al. 1997). Type classes were introduced in functional programming languages such as Haskell. They allow programmers to have a type-safe mixture of parametric polymorphism (such as generics) with ad-hoc polymorphism (such as function overloading). Type classes are not first-class citizens in Scala, but there exist some well-known tricks to encode them.
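
A sketch of one such encoding; the Serializable bound is an extra assumption of this sketch, so that instances can be shipped inside Spark closures:

    object HasTextTypeClass {
      trait HasText[A] extends Serializable { // the type class contract
        def getText(a: A): String
      }

      object HasTextOps { // helper for building instances
        def instance[A](fn: A => String): HasText[A] = new HasText[A] {
          override def getText(a: A): String = fn(a)
        }
      }
    }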

At line 1, we declare an object HasTextTypeClass which contains two declarations: a trait HasText and an accompanying object HasTextOps. A trait in Scala is similar to a Java interface; it defines an abstract contract, and all implementations of the trait need to fulfill the contract by providing concrete implementations of the trait's member functions. In our case, all implementations of HasText[A] are obliged to provide a concrete implementation of getText. The object HasTextOps at line 6 defines a helper function instance, which takes a function argument fn and builds an implementation of the HasText[A] trait by overriding getText with it.

Referencing HasTextTypeClass as a library, we can re-define a generic version of twogram_ds.
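
A sketch of the generic version, assuming the previous definitions are in scope:

    import org.apache.spark.sql.Dataset
    import HasTextTypeClass._

    def twogram_ds[A](ds: Dataset[A])(implicit ev: HasText[A]): Dataset[List[String]] = {
      import ds.sparkSession.implicits._
      ds.map(a => twogram(ev.getText(a)))
    }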

In the adjusted definition of twogram_ds above, we generalize the input argument type to Dataset[A], where A is a type variable (a.k.a. a generic type parameter). In addition, we include an implicit argument ev (short for "evidence"), which indicates that an implementation of HasText[A] must be provided or inferred in the current context. In the function body, we use the trait member function ev.getText to extract the text field from a, an element of the dataset. The above implementation of twogram_ds and HasTextTypeClass can be packaged as a library module for re-usability purposes.

Let us get back to our data processing application. To apply twogram_ds to our actual argument of type Dataset[Article], we must first provide an instance of HasText[Article].
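
A sketch, with the imports from the previous snippets assumed to be in scope:

    implicit val articleHasText: HasText[Article] =
      HasTextOps.instance((a: Article) => a.text)

    val twograms = twogram_ds(mydata) // articleHasText is picked up implicitly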

At line 1, we define an instance of HasText[Article] as an implicit value. In Scala, when a function with implicit arguments is invoked, the compiler searches the current context for matching implicit values based on their types. At line 4, we apply twogram_ds to the dataset mydata without explicitly passing articleHasText as an argument.

2.4 A Quick Summary

In this section, we illustrated how to make a simple data processing task type safe while retaining re-usability, by using algebraic data types and type classes. The same trick is applicable to other distributed data structures, such as resilient distributed datasets (RDDs), and to larger-scale tasks.

3 Sailing Safe Through the Storm

It is common that, at a certain point in time, a data engineering team has sorted out most of the historical data via batch processing and turns to developing a strategy for processing newly incoming data. Compared to the historical data, the new incoming data are much smaller in size and are refreshed every second. It is inefficient and uneconomical to apply the batch processing models to this small "delta."

Spark streaming (Spark 2019) was introduced to address this issue by providing simple wrappers around existing Spark batch-mode data processors. This greatly reduces the need for re-development due to the shift in data processing mode. However, Spark streaming supports data parallelism but not task parallelism. Data parallelism parallelizes computation by applying common instructions to different subsets of the data. Task parallelism parallelizes computation by identifying non-interfering sub-tasks (with different instructions) within a main task and executing those sub-tasks in parallel.

Apache Storm (Storm 2019) is a robust framework for handling real-time, infinite streams of data. It supports both data parallelism and task parallelism. In a nutshell, Apache Storm follows a flow-based programming paradigm in which the processes are represented as objects of the classes spout and bolt. A spout denotes a data generator and a bolt denotes a data processing/transforming process. Such an architecture allows flexible configuration and is designed for stream processing systems. However, in Apache Storm, data are propagated across different processors in an untyped manner, which leads to potential run-time errors.

3.1 An Untyped Storm Topology

To illustrate the idea, we adopt the "word-count topology" example from Apache Storm tutorials such as Gkatziouras (2017).

We present the word-count Storm topology in Scala, together with a directed-graph representation. In the graph representation, the double-line rectangle denotes a spout, which generates data; single-line rectangles denote bolts, which are data processors. The Scala implementation is given in the main function of object WordCountTopologyUntyped. We use a TopologyBuilder to "assemble" the three components via the setSpout and setBolt methods. The setSpout method takes a string as the name of the spout instance, the spout object, and a non-negative integer which denotes the number of worker threads that the spout will spawn. The setBolt method has a similar set of arguments and connects bolt objects to the topology. The methods shuffleGrouping and fieldsGrouping specify the name of the up-stream processor/generator; the difference between the two is discussed below.
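
A sketch of the main function, assuming the Storm 2.x API; the component names "sentences", "split", and "count" and the parallelism hints are illustrative:

    import org.apache.storm.topology.TopologyBuilder
    import org.apache.storm.tuple.Fields
    import org.apache.storm.{Config, LocalCluster}

    object WordCountTopologyUntyped {
      def main(args: Array[String]): Unit = {
        val builder = new TopologyBuilder()
        // name of the instance, the component object, number of worker threads
        builder.setSpout("sentences", new RandomSentence(), 1)
        builder.setBolt("split", new SplitSentence(), 2).shuffleGrouping("sentences")
        builder.setBolt("count", new WordCount(), 2).fieldsGrouping("split", new Fields("word"))
        val cluster = new LocalCluster()
        cluster.submitTopology("word-count", new Config(), builder.createTopology())
      }
    }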

In Fig. 1, we provide the complete definitions of the RandomSentence spout, the SplitSentence bolt, and the WordCount bolt. In the class RandomSentence, the open method defines the initialization routine. The main routine of the spout is defined in the method nextTuple, which randomly picks one sentence out of the predefined set and passes it to the next process in the pipeline using _collector.emit. The method declareOutputFields creates a simple label for each generated data field, which can be referenced by the downstream processors. The class SplitSentence takes the output from its up-stream processors in the pipeline and splits it by spaces; each word from the split is forwarded downstream, as defined in the execute method. The class WordCount maintains a mapping from words to their number of occurrences in a variable named counts. When execute is invoked, the incoming value is treated as a word: the word's count is increased by 1 if the word already exists in counts, otherwise the count is initialized to 1.

Fig. 1 A simple word count topology (spout and bolt definitions)
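
A sketch of what the definitions in Fig. 1 might look like, assuming the Storm 2.x API; the sample sentences and field names are illustrative:

    import java.util.{Map => JMap}
    import scala.util.Random
    import org.apache.storm.spout.SpoutOutputCollector
    import org.apache.storm.task.{OutputCollector, TopologyContext}
    import org.apache.storm.topology.OutputFieldsDeclarer
    import org.apache.storm.topology.base.{BaseRichBolt, BaseRichSpout}
    import org.apache.storm.tuple.{Fields, Tuple, Values}

    class RandomSentence extends BaseRichSpout {
      private var _collector: SpoutOutputCollector = _
      private val sentences = Array(
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away")

      override def open(conf: JMap[String, Object], context: TopologyContext,
                        collector: SpoutOutputCollector): Unit = { _collector = collector }

      // Randomly pick a sentence and pass it downstream.
      override def nextTuple(): Unit = {
        Thread.sleep(100)
        _collector.emit(new Values(sentences(Random.nextInt(sentences.length))))
      }

      override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
        declarer.declare(new Fields("sentence"))
    }

    class SplitSentence extends BaseRichBolt {
      private var _collector: OutputCollector = _

      override def prepare(conf: JMap[String, Object], context: TopologyContext,
                           collector: OutputCollector): Unit = { _collector = collector }

      // Split the incoming sentence by spaces and forward each word downstream.
      override def execute(tuple: Tuple): Unit =
        tuple.getString(0).split(" ").foreach(w => _collector.emit(new Values(w)))

      override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
        declarer.declare(new Fields("word"))
    }

    class WordCount extends BaseRichBolt {
      private var _collector: OutputCollector = _
      private val counts = scala.collection.mutable.Map[String, Int]()

      override def prepare(conf: JMap[String, Object], context: TopologyContext,
                           collector: OutputCollector): Unit = { _collector = collector }

      // Increase the count of the incoming word (initializing it to 1 if unseen).
      override def execute(tuple: Tuple): Unit = {
        val word = tuple.getString(0)
        val count = counts.getOrElse(word, 0) + 1
        counts(word) = count
        _collector.emit(new Values(word, Integer.valueOf(count)))
      }

      override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
        declarer.declare(new Fields("word", "count"))
    }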

Let us go back to the main topology construction. When a bolt is "connected" to its up-stream component, an input declaration has to be made. In the case of the SplitSentence bolt, the distribution of the inputs from RandomSentence does not matter, hence shuffleGrouping is used. In the case of the WordCount bolt, the input assignment is crucial: we would like to ensure that the same word always goes to the same bolt instance, otherwise the counting is meaningless. Thus, fieldsGrouping is used.

3.2 Storm Is Dangerous

Everything is neat and tidy except that we spot two problems.

  1.

    The connection between the spout and the bolts is dynamically typed, i.e., a type mismatch will only be discovered at run-time. For instance, let us consider a slight adjustment of the definition of SplitSentence, replacing the body of its execute method in Fig. 1 with the variant sketched after this list:

    As a result, the bolt emits List[Char] instead of String. The compiler happily accepts and type checks it; the type mismatch between the output of SplitSentence and the input of WordCount will only be discovered as a run-time error. This is potentially costly and dangerous: in production code, we deal with large sets of data and large-scale topologies built from reusable components, and there are cases where this kind of type-mismatch run-time error becomes hard to trace. Testing and debugging will drain the team's resources.

  2.

    The construction of a topology requires that there be at least one spout, followed by zero or more bolts, within the topology; it does not make sense to add a bolt to an empty topology. However, the TopologyBuilder does not enforce this constraint statically. For instance, if the setSpout call in the WordCountTopologyUntyped topology in Sect. 3.1 were omitted, the compiler would not report an error, but a run-time exception would be raised.
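
A sketch of the SplitSentence variant referred to in the first item above; each word is now emitted as a List[Char] (via w.toList) rather than as a String:

    // Hypothetical variant of SplitSentence's execute method:
    override def execute(tuple: Tuple): Unit =
      tuple.getString(0).split(" ").foreach(w => _collector.emit(new Values(w.toList)))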

3.3 Phantom Types to the Rescue

To solve these problems, we adopt a well-known technique called phantom types (Finne et al. 1999; Cheney and Hinze 2003), which is often used to provide type safety in domain-specific language embeddings. In a nutshell, phantom types are data types whose type variables are not (fully) mentioned in their data constructors.

We follow the style of phantom type encoding described in (Iry 2010). First of all, we introduce two phantom types to capture the input and output types of the spout and the bolt.
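
A sketch of the two traits; the member names spout and bolt are assumptions of this sketch:

    import org.apache.storm.topology.{IRichBolt, IRichSpout}

    // In and Out are phantom type parameters: they are not mentioned in the members.
    trait StormSpoutT[Out] {
      def spout: IRichSpout
    }

    trait StormBoltT[In, Out] {
      def bolt: IRichBolt
    }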

In the above, we define two traits which embed a spout (or a bolt, respectively). StormSpoutT expects a type parameter Out which describes the output type of the embedded spout, and StormBoltT expects two type parameters In and Out which describe the input and output types of the embedded bolt. Note that none of these type parameters are mentioned in the bodies of the traits; hence, we call them phantom types.

The next step is to provide some extra type annotations for the RandomSentence spout, the SplitSentence bolt, and the WordCount bolt. Note that the definitions of the spout and bolts remain unchanged. For brevity, we do not repeat the definitions and provide only the signatures.
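
A sketch of the typed signatures, wrapping the unchanged definitions from Fig. 1:

    case class RandomSentenceT(spout: RandomSentence) extends StormSpoutT[String]
    case class SplitSentenceT(bolt: SplitSentence) extends StormBoltT[String, String]
    case class WordCountBoltT(bolt: WordCount) extends StormBoltT[String, (String, Int)]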

Here we provide concrete implementations of the StormSpoutT and StormBoltT traits. RandomSentenceT extends the StormSpoutT trait by declaring the underlying spout to be RandomSentence and, at the same time, specifying that the output type of the spout is String. SplitSentenceT embeds SplitSentence as the underlying bolt and specifies that both the input and the output type of the bolt are String. Similarly, WordCountBoltT embeds WordCount, which has String as its input type and (String, Int) as its output type.

Next, we define three possible states of constructed topologies, an ordering among the three states, and a phantom type representing a state of a topology. Note that the ordering is implicitly enforced via a sub-typing relation TopWithBolt <: TopWithSpout <: TopEmpty.
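
A sketch of the states and of TopologyBuilderT with the two combinators. The exact argument lists of >> and >>>, and the treatment of the grouping (passed here as a function on BoltDeclarer), are assumptions of this sketch; the original listing can be found in (Lu 2019):

    import org.apache.storm.topology.{BoltDeclarer, TopologyBuilder}

    // Topology states, ordered by sub-typing: TopWithBolt <: TopWithSpout <: TopEmpty.
    trait TopEmpty
    trait TopWithSpout extends TopEmpty
    trait TopWithBolt extends TopWithSpout

    // State and Out are covariant phantom type parameters of the builder wrapper.
    case class TopologyBuilderT[+State, +Out](builder: TopologyBuilder) {

      // Add a spout: the implicit constraint bounds State by TopEmpty, and the
      // result is annotated with TopWithSpout and the spout's output type NextOut.
      def >>[NextOut](name: String, spoutT: StormSpoutT[NextOut], parallelism: Int)
                     (implicit ev: State <:< TopEmpty): TopologyBuilderT[TopWithSpout, NextOut] = {
        builder.setSpout(name, spoutT.spout, parallelism)
        TopologyBuilderT[TopWithSpout, NextOut](builder)
      }

      // Add a bolt: the bounds on S require State <: TopWithSpout, the bound on In
      // requires the bolt's input type to agree with the current output type Out,
      // and the result is annotated with TopWithBolt and the bolt's output type.
      def >>>[In >: Out, NextOut, S >: State <: TopWithSpout](
          name: String, boltT: StormBoltT[In, NextOut], parallelism: Int,
          grouping: BoltDeclarer => BoltDeclarer): TopologyBuilderT[TopWithBolt, NextOut] = {
        grouping(builder.setBolt(name, boltT.bolt, parallelism))
        TopologyBuilderT[TopWithBolt, NextOut](builder)
      }
    }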

TopologyBuilderT expects two type parameters, State and Out; the plus (+) sign indicates that both type parameters are covariant. This case class embeds the actual TopologyBuilder instance. In addition, we define two combinators, >> and >>>, to ease topology construction; they enforce the matching constraint between the output type of the current processor and the input type of the following processor. Furthermore, as the topology is being constructed, we also keep track of its state. For example, when a topology is initiated, its state is TopEmpty; once a spout is added, the state becomes TopWithSpout. Code that adds a bolt to a topology in the TopEmpty state is rejected by the type system.

>> adds a spout to a topology whose state is bounded by TopEmpty, which is enforced via the implicit <:< constraint on State. Used as a formal argument type annotation, <:< sets an upper bound on its left-hand side type variable. The combinator then updates the topology state by annotating the result with TopWithSpout, and it ensures that the resulting topology shares the output type of the spout via the type variable NextOut, which appears both in the spout argument's type and in the result type.

>>> adds a bolt to a topology whose state is bounded by TopWithSpout and returns a topology with a bolt. This is enforced by the bounds on the combinator's type variables; note that <: serves the same purpose as <:<, except that it is used at type variable declarations instead of in formal argument type signatures. The result must be a topology with a bolt added, which is enforced by the TopWithBolt annotation on the result. In addition, the bound on the bolt's input type ensures that it agrees with the output type Out of the topology being extended, and the output type of the resulting topology is NextOut, the output type of the bolt.

With these new phantom types and combinators, we can rewrite the topology construction from Sect. 3.1 as follows:
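
A sketch of the typed construction (component names are illustrative); the intermediate values make the state and output types explicit:

    import org.apache.storm.topology.TopologyBuilder
    import org.apache.storm.tuple.Fields
    import org.apache.storm.{Config, LocalCluster}

    val empty = TopologyBuilderT[TopEmpty, Nothing](new TopologyBuilder())
    val withSpout = empty >> ("sentences", RandomSentenceT(new RandomSentence()), 1) // state: TopWithSpout, out: String
    val withSplit = withSpout >>> ("split", SplitSentenceT(new SplitSentence()), 2, _.shuffleGrouping("sentences")) // state: TopWithBolt, out: String
    val topology = withSplit >>> ("count", WordCountBoltT(new WordCount()), 2, _.fieldsGrouping("split", new Fields("word"))) // state: TopWithBolt, out: (String, Int)

    // Adding a bolt to an empty topology, or a bolt whose input type does not match
    // the current output type, is now rejected by the compiler, e.g.:
    //   empty >>> ("count", WordCountBoltT(new WordCount()), 2, _.shuffleGrouping("sentences"))

    new LocalCluster().submitTopology("word-count-typed", new Config(), topology.builder.createTopology())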

where the two issues mentioned in Sect. 3.2 are fixed via the type system. The verbosity of the type annotations and constraints can be reduced by using techniques such as macros and meta-programming (Burmako 2017).

4 Conclusion

In this article, we walked through a few examples to illustrate how static typing tricks can be applied to reduce testing and run-time checking overhead during the development of data engineering projects. It is well known that

Program testing can be used very effectively to show the presence of bugs but never to show their absence. (Dijkstra 2017)

With type systems, static analysis, and logical proofs, we are able to verify whether given properties hold in program code. It follows that, as long as we verify these properties statically, a large subset of the test cases and run-time checks can be eliminated, streamlining data engineering projects. As a result, development resources and infrastructure can be reduced or re-allocated to other, more important tasks.

All the code examples in this article can be found in (Lu 2019).