1 Introduction

Current technological development enables financial institutions to collect enormous amounts of data every day. This situation creates new demands and requirements for sorting these bulks of information and extracting valuable context. The emerging links help corporations and advertisers to understand sequences of past actions, to learn from them and to extrapolate to future actions [1].

To fulfill the computational requirements of massive data analysis [2], an efficient framework is essential for designing, implementing and managing the required pipelines and algorithms. In this regard, Apache Spark has emerged as a unified engine for large-scale data analysis across a variety of workloads. Thanks to its advanced programming model, Apache Spark has been adopted as a fast and scalable framework in both academia and industry. It has become the most active big data open-source project and one of the most active projects in the Apache Software Foundation.

2 Overview of Apache Spark

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming [3]. The composition of the system layers is presented in Fig. 1.

Fig. 1. Apache Spark overview

The following bindings and components were used for the purposes of the experiment:

  1. Upper-Level Libraries
     • Spark SQL
     • Python programming language

  2. Spark Core
     • RDDs API
     • Transformations
     • Actions

  3. Cluster Manager
     • Hadoop YARN

  4. Storage
     • HDFS

Spark SQL was used as the main query language and access interface; it provides DataFrames [4], a data structure for structured (and semi-structured) data. DataFrames make it possible to embed SQL queries directly in Spark programs. Spark SQL also offers SQL language support with command-line interfaces and ODBC/JDBC drivers.
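
As a brief illustration, a Spark SQL query over a DataFrame can be written as follows. This is a minimal PySpark sketch; the file path, view name and column names are our own illustrative assumptions, not the actual data schema.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

    # Load a CSV export into a DataFrame (path and schema are illustrative).
    contracts = spark.read.csv("hdfs:///data/contr.csv", header=True, inferSchema=True)

    # Expose the DataFrame to SQL and query it like a relational table.
    contracts.createOrReplaceTempView("contracts")
    monthly_balance = spark.sql("""
        SELECT customer_id, month, AVG(balance) AS avg_balance
        FROM contracts
        GROUP BY customer_id, month
    """)
    monthly_balance.show(5)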

2.1 Cluster Manager

Spark uses a master/worker architecture. A driver talks to a single coordinator, called the master, which manages the workers on which the executors run. The driver and the executors run in their own Java processes. Physical machines are called hosts or nodes [5]. A detailed view of the cluster manager architecture is shown in Fig. 2.

Fig. 2. Cluster overview

3 Experiment Design

The data used for deeper exploration [1] described customer relations with a single bank and consisted of multiple data columns, each representing a separate attribute related to an individual customer ID (such as balance movement, number of ATMs used, etc.). The values of these attributes were recorded monthly for each ID. Because most data columns contained either missing or erroneous values, preprocessing of the data was required.

Data were provided by PricewaterhouseCoopers Česká republika, s.r.o.

3.1 Preprocessing Data

Due to the complexity and large volume of the obtained data, it was necessary to carry out a preprocessing phase. During this phase the data were cleaned and normalized with the goal of reducing the complexity of the data set.

The main operations of the preprocessing phase were as follows:

  • selection of the most influential attributes

  • selection of appropriate records

  • handling of incomplete data columns

Originally, the obtained data contained 3.3 billion atomic cells in total, each representing a specific value. After all operations were performed, the size of the data set was reduced to 2.1 billion cells, a 36.65% reduction caused mainly by the sparsity of some data columns.

3.2 Input Data Set

In our solution we received several types of documents. The files had to be processed, categorized and evaluated according to their semantic value. The original data were provided in the form of comma-separated values. In the first series of tasks it was necessary to select subsets that clearly represented the given data set. The documents were separated into the following groups, each containing a final set of files.

Fig. 3. Total count of files per group

  1. Overall group - Contains all received files

  2. Live data group - Includes files that do not represent dictionaries

  3. Candidate group - Includes non-empty files that can be used for further analysis

  4. Preprocessed group - Includes files that have been subjected to advanced analysis using Apache Spark

Fig. 3 shows the distribution of the total number of files per group. From the collected data we observed that 57% of the files belong to the “Live data” group; these data were produced during clients’ use of the system operated by the financial institution. From another point of view, 86% of the files contain values that can be reused for deeper analysis, and it follows from the measurements that 24% of the files require preprocessing in the cloud computing infrastructure.

3.3 Data Clustering

To obtain more detailed statistics, the source files were clustered according to their meaning and content value. The resulting distribution consists of the following sections (a simple prefix-based grouping of the files is sketched after the list):

  • Application - appl

  • Business plan - bplan

  • Customer lifetime value - clv

  • Contract - contr

  • Credit bureau - crb2

  • Campaign - crm

  • Dictionary - dict

  • Channel - chan

  • Churn - churn2

  • Scoring - score

  • Segment - seg

  • Grouped visualizations - visualization
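
Assuming that the section code appears as a prefix of each file name (our assumption, suggested by the abbreviations above), such a grouping could be expressed with a simple lookup:

    # Mapping of file-name prefixes to sections (prefix convention assumed).
    SECTION_PREFIXES = {
        "appl": "Application", "bplan": "Business plan",
        "clv": "Customer lifetime value", "contr": "Contract",
        "crb2": "Credit bureau", "crm": "Campaign",
        "dict": "Dictionary", "chan": "Channel", "churn2": "Churn",
        "score": "Scoring", "seg": "Segment",
        "visualization": "Grouped visualizations",
    }

    def section_of(file_name: str) -> str:
        """Return the section a source file belongs to, based on its name prefix."""
        for prefix, section in SECTION_PREFIXES.items():
            if file_name.startswith(prefix):
                return section
        return "unknown"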

After the individual sections were identified, we had to reduce the complexity of the data model by correctly selecting the appropriate features (e.g. customer attributes that are relevant to the amount of revenue generated by the given customer) in the following steps:

Fig. 4. Attributes per section

  • selection of suitable attributes

  • selection of appropriate records

In our example, attributes related to balance movement, offer success ratio and DTI (debt-to-income) were considered to be the most relevant ones.
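
A minimal PySpark sketch of this selection step might look as follows; the DataFrame df is assumed to be already loaded and the column names are hypothetical, since the real attribute names are not published.

    from pyspark.sql import functions as F

    # Keep only the attributes judged most relevant (hypothetical column names).
    selected = df.select("customer_id", "balance_movement", "offer_success_ratio", "dti")

    # Record selection: drop rows without a usable DTI value.
    selected = selected.where(F.col("dti").isNotNull())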

The outputs of the selection operations are presented in Fig. 4. It shows the original total number of attributes in the system before the preprocessing phase and the resulting total number of attributes after the preprocessing phase was performed. The results are also grouped by section.

On closer examination we concluded that the largest drop in attributes occurred in the application section, because it mostly contained data from the previously identified “Live data” group, which primarily includes records related to the users and their unambiguous identification. This loss of data is additionally caused by issues related to data confidentiality, personal data protection and processing by third parties.

3.4 Cloud Computing

The preprocessing step for handling incomplete data columns is introduced in this section. Partial algorithms and code snippets from the overall solution are presented below. The algorithm for determining the ratio of “NULL” records per column was designed as follows.

The algorithm described below calculates the ratio:

Figure a. Computation of the ratio of NULL records per column
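
A minimal PySpark sketch of such a NULL-ratio computation could look as follows; the DataFrame df, its source path and the handling of columns are our assumptions, not the authors’ original listing.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("null-ratio").getOrCreate()

    # Load one of the candidate files (path is illustrative).
    df = spark.read.csv("hdfs:///data/candidate/appl.csv", header=True)

    # For every column, express the share of NULL values as
    # sum(is_null) / row_count; this only builds a lazy query plan.
    null_ratios = df.agg(*[
        (F.sum(F.col(c).isNull().cast("int")) / F.count(F.lit(1))).alias(c)
        for c in df.columns
    ])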

After the previous step was completed, it was necessary to specify a ratio condition for selecting the sufficient features. For our purposes this threshold was set to 80%. The value defines the cut-off for dropping features and is applied in the next part of our algorithm.

Figure b. Applying the 80% threshold to drop sparse features
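
Continuing the same sketch, the 80% threshold could be applied as shown below. The per-column ratios form a single-row aggregate that is collected to drive the column selection, while the selection itself remains a lazy transformation.

    # Columns whose NULL ratio reaches the threshold are considered too sparse.
    NULL_RATIO_THRESHOLD = 0.80

    ratios = null_ratios.collect()[0].asDict()
    columns_to_keep = [c for c, ratio in ratios.items() if ratio < NULL_RATIO_THRESHOLD]

    # Lazy transformation: restrict the DataFrame to the sufficient features.
    reduced_df = df.select(columns_to_keep)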

All of the above-mentioned steps are transformations; the operation itself is executed on Apache Spark by the following code.

Figure c. Executing the transformations with a Spark action
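
In the sketch, the actual execution would be triggered by an action, for instance writing the reduced data back to HDFS or counting the remaining records (the output path is assumed).

    # Actions force the evaluation of the lazy transformations above.
    reduced_df.write.mode("overwrite").parquet("hdfs:///data/preprocessed/appl")
    print(reduced_df.count())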

One of the main advantages of using Apache Spark to cluster such big data is that larger volumes of workloads can be consumed smoothly. We can also reapply parts of the algorithms mentioned above to produce live outputs. These results are achieved through the use of in-memory computing instead of file storage, as is usually the case in other systems.
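
For example, keeping the reduced data set cached in memory lets later queries reuse it without re-reading the files. This continues the earlier sketch under the same assumptions; the segment column name is hypothetical.

    # Cache the reduced data set in memory; the first action materialises it,
    # and subsequent queries are served from the in-memory copy.
    reduced_df.cache()
    reduced_df.count()
    reduced_df.groupBy("seg").count().show()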

3.5 Report Outputs

Looking at the results of the preprocessing phase in terms of the overall ratios, data cleansing is a major factor here as well. The ratio of the filtered attributes is described in Table 1. The results compare the standard environment with the cloud environment.

  • Filtered attributes - the ratio of all removed attributes across the system

  • Filtered processing attributes - the ratio of all dropped attributes in files that have been preprocessed in Apache Spark

Table 1. Attributes filter ratio
Table 2. Rows filter ratio

The same steps that were applied to the attributes (vertical axis) were also performed on the individual records (horizontal axis); a minimal sketch of this row-level filtering is given after the list below. This resulted in a reduction of entities that would otherwise distort the calculated models. The ratio of filtered entities is described in Table 2.

  • Filtered rows - the ratio of all deleted records across the system

  • Filtered processing rows - the ratio of all deleted records in files that have been preprocessed in Apache Spark
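
Reusing the reduced_df DataFrame from the earlier sketch, the row-level filtering could rely on the DataFrame dropna operation; the 80% completeness threshold mirrors the column filter and is our assumption.

    # Drop records that have fewer than 80 % of their attributes populated.
    min_non_null = int(0.8 * len(reduced_df.columns))
    filtered_df = reduced_df.dropna(thresh=min_non_null)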

The resulting view of the entire dataset is presented in Table 3.

  • Total original cells - total number of cells in the system

  • Total cells - total number of cells in the system after the preprocessing phase

  • Cell loss ratio - ratio of all filtered cells

Table 3. Total cells ratio

In summary, we arrived at a data structure reduced by about 1.2 billion cells, while a significant part of it is still formed by the dictionaries.

3.6 Performance

During all operations, the duration of the individual operations and the system performance were recorded. The technical specification of the infrastructure used is as follows:

  • Apache Spark - version 2.1.0 (Linux)

  • Master - 2 nodes

  • Slave - 4 nodes

The master nodes are memory-optimized machines due to the high demands on the synchronization of the entire process. These individual machines correspond to the roles shown in Fig. 2. The detailed configuration of each machine was as follows (an illustrative Spark resource configuration for such machines is sketched after the list):

  • Master node - 4 cores, 28 GB RAM, 200 GB SSD

  • Slave node - 8 cores, 28 GB RAM, 400 GB SSD
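
For illustration, a Spark session matching such a cluster might be configured as follows; the exact values are our assumptions, not the authors’ configuration.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("bank-data-preprocessing")
        .master("yarn")
        .config("spark.executor.instances", "4")   # one executor per slave node
        .config("spark.executor.cores", "8")
        .config("spark.executor.memory", "24g")    # leave headroom for OS and YARN overhead
        .config("spark.driver.memory", "8g")
        .getOrCreate()
    )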

The overall robustness of the architecture is presented in Table 4.

Table 4. Infrastructure resources overview

The results of the measurements and the average values obtained are presented in Table 5. The table aggregates metrics related to attributes and records, as well as performance converted to 1 million records. The measurements were obtained as output from the Apache Spark platform.

Table 5. Performance metrics

4 Results

To uncover the nature of and patterns in the data, it was necessary to analyze the links between the structured files and to perform operations on the raw data structure based on the discovered relations.

Upon closer measurement, it was found that the dataset contains a total of approximately 3.3 billion atomic cells, each representing a specific value of an attribute in the system. From the above findings it became clear that a big data ecosystem was essential for further data processing. The Apache Spark technology stack was selected for our solution.

Using Apache Spark helped us to gain deeper insight into the patterns hidden in the data. It was necessary to bring the input values into a uniform form because, as expected, many source files contained incomplete data or data in the wrong format.

The use of cloud technology accelerated the process and helped resolve the inconsistencies in the data structures contained in the input files. Thanks to the Apache Spark environment, the whole process could be parallelized and applied to a solution of this scale, which would not have been feasible from a performance point of view using local resources.

Preprocessing the data was also necessary for the subsequent investigations. The preprocessed data served as the basis for developing a client revenue computation program in which MDP (Markov Decision Processes) are used to model the selected marketing process and maximize the return value of the customer [1].

5 Conclusion

In this paper, Apache Spark was used for data processing. From the above findings, we came to the conclusion that the main problem with the data was their confidentiality. This whole area therefore depends on a cloud data protection solution. One possible solution to this problem is homomorphic encryption. A fully homomorphic scheme can be used in many applications, one of which is, for example, the implementation of secure cloud computing [6].

This problem was addressed by Craig Gentry in his dissertation, which describes a fully homomorphic encryption (FHE) scheme. Thanks to Gentry’s technique, the analysis of encrypted information can produce results as detailed as if the original data were fully readable [7].

For further research, it is crucial to find a solution for applying FHE in the field of cloud computing. For continuous development and improvement of the results, we also recommend adding the ability to dynamically scale out the running environment.