1 Introduction

Data science is a relatively new, rapidly developing field that deals mostly with big data. The goal of data science applications is to analyse data and extract knowledge from it, which can then be used for simulation, decision making, process optimization, innovation support, etc. The process of knowledge discovery comprises data preparation techniques, statistical modeling, computational methods, visual analysis, and domain-driven problem solving. Data science relies on methodology developed in fields such as statistics, data mining, machine learning, big data, and human–computer interaction. Visualization and visual analytics form an essential part of data science.

Although both domains deal with visual representation, their scope and impact differ. Visualization provides techniques for presenting data or relationships for the purposes of explanation, interpretation, communication, etc. Visual analytics encompasses a process of knowledge discovery in which the analyst is supported in discovering patterns in data, building formal models that can be processed by machines, and developing new hypotheses. Visual analytics relies strongly on effective interaction between human and machine. The Human in the Loop (HiL) concept is characterized by the continuous support of machine processing through human feedback. In the context of visual analytics, HiL stands for providing continuous feedback, correcting algorithmic approaches, and selecting appropriate techniques during the analytical process.

In this work we present a survey of visual analytics in data science. The next section describes the main aspects of visualization. We then discuss the domain of visual analytics, take a closer look at the HiL concept, and present a comparison of visual analytics tools developed by experts and described in scientific publications in computer science journals.

2 Visualization and Visual Analytics

The design space of visualization is enormous, ranging from aesthetic aspects to structural organization that affects perception. Many proven techniques have been developed to present information. These include various types of diagrams for the graphical representation of proportions or numerical values, as well as structures for mapping relationships and set affiliations. Visual processing pursues several different objectives, such as recognizing contexts, patterns and trends, finding outliers and clusters, classifying, and examining structures. Data objects can be integrated into a visualization in different ways: as points, lines or areas, encoded through visual properties such as colour, shape, size and position, and through the layout of the presentation.

Visualization should always be done carefully. It is easy to misinterpret charts and graphs or to be misled by them. Misleading data visualizations often result from violating standard practices. People are used to pie charts representing parts of a whole and to timelines progressing from left to right. When such conventions are violated, viewers are very likely to misinterpret the visualization. Nonetheless, a non-standard technique can make a visual representation more expressive and make it stand out. Such visualizations draw our attention and make us aware of the subject they represent.

Visual analytics is an interdisciplinary approach that supports exploratory knowledge discovery, especially for large and complex data sets [1,2,3]. Its subject is the effective acquisition, extension and generation of knowledge. The central concept of visual analytics is to combine automated analysis techniques with interactive visualization in order to improve the overall analysis [2]. Combining these fields makes it possible to solve more challenging problems that cannot be addressed by purely analytical or purely visual approaches. On the one hand, the data is often too large for purely visual approaches and therefore requires an automated pre-processing step. On the other hand, many problems are too exploratory in nature for purely automated analysis, which calls for the integration of human cognition.
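
The automated pre-processing step mentioned above can be illustrated with a minimal sketch. It is not taken from any of the cited works: the helper name `bin_counts`, the equal-width binning strategy and the synthetic data are assumptions chosen only to show how a large data set might be reduced to a handful of aggregates before it is drawn.

```python
from typing import List, Sequence


def bin_counts(values: Sequence[float], n_bins: int) -> List[int]:
    """Aggregate raw values into n_bins equal-width bin counts.

    Hypothetical helper: it stands in for the automated pre-processing
    step that shrinks the data before it is handed to a visualization.
    """
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    if not values:
        return [0] * n_bins
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical: put everything into the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Synthetic data (assumption): 100,000 values are reduced to ten counts,
# which a simple bar chart can display without overplotting.
data = [i % 100 for i in range(100_000)]
summary = bin_counts(data, 10)
```

The analyst would then inspect the ten aggregated counts instead of the raw values; choosing the number of bins is one of the points where human judgement could enter the process.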

Visual analytics is primarily concerned with the needs of exploratory data analysis [1]. The problems addressed by visual analytics are initially unknown or vaguely formulated and presuppose specific background knowledge that the machine does not possess [4]. Integrating the user at an early stage of the analysis process puts the user in a position to learn the data interactively and thus to specify and set up hypotheses. The exploratory orientation of visual analytics assists analysts in unveiling unexpected connections and phenomena [3]. The analyst can pursue different intentions in different phases of the analysis, such as gaining an overview of the data, searching exploratively for new insights, or testing hypotheses. In addition, the analysis process ranges from high-level analytical tasks, which depend strongly on the background knowledge and expertise of the user, to low-level activities such as selecting the underlying data sets.

2.1 Human in the Loop

One of the main concepts regarding interactive analysis is the HiL concept. HiL emerged from the observation that some machine approaches require analytical judgement, while others can be significantly improved or accelerated by interacting with humans. Characteristic of HiL is the continuous support of machine processing by human feedback. In the context of visual analytics, this concept takes the form of providing continuous feedback and correcting algorithmic approaches within the analysis. There is no generally accepted definition of HiL, and it remains open which loop is ultimately meant. A frequent field of application for HiL in visual analytics is the optimization of learning behaviour in machine learning. Here the user is directly integrated into the train-tune phase and implicitly or explicitly alters the parametrization. For instance, the user extends the underlying training data, corrects it, or provides additional information.
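
The integration of the user into the train-tune phase can be sketched as follows. This is an illustrative toy under stated assumptions, not the method of any cited tool: the nearest-centroid model, the function names `train`, `predict` and `human_in_the_loop`, and the simulated analyst are all hypothetical. The machine flags training samples whose label contradicts the current model, the human corrects one of them, and the model is retrained.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, str]  # (feature value, label)


def train(samples: List[Sample]) -> Dict[str, float]:
    """Fit a nearest-centroid model: one mean feature value per label."""
    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model: Dict[str, float], x: float) -> str:
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


def human_in_the_loop(
    samples: List[Sample],
    ask_human: Callable[[float], str],
    rounds: int,
) -> Dict[str, float]:
    """Alternate between machine training and human label correction."""
    samples = list(samples)  # work on a copy of the training data
    model = train(samples)
    for _ in range(rounds):
        # Machine step: find samples whose stored label contradicts the model.
        suspicious = [
            i for i, (x, y) in enumerate(samples) if predict(model, x) != y
        ]
        if not suspicious:
            break
        # Human step: the analyst reviews one flagged sample and sets its label.
        i = suspicious[0]
        x, _ = samples[i]
        samples[i] = (x, ask_human(x))
        model = train(samples)
    return model


def simulated_analyst(x: float) -> str:
    """Stand-in for a real analyst (assumption): knows the true rule."""
    return "low" if x < 5 else "high"


# The third sample carries a wrong label that the loop should repair.
noisy = [(1.0, "low"), (2.0, "low"), (3.0, "high"), (8.0, "high"), (9.0, "high")]
model = human_in_the_loop(noisy, simulated_analyst, rounds=3)
```

In a real visual analytics tool the `ask_human` callback would be replaced by an interactive view in which the analyst inspects and relabels the flagged samples; the `rounds` limit bounds the loop in case the human confirms a label that the model keeps flagging.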

The combination of analytical reasoning and computational models can result in usability problems. Problems arise, for instance, if interactions are not intuitive. Users often have to express their expectations through a variety of parameters and configurations. They face the problem of communicating their knowledge and expertise to the machine, which affects the overall process negatively. Therefore, Endert et al. argue for a shift towards a Human is the Loop perspective that focuses more on a seamless integration of interactions [5]. Interactions should be oriented more towards analytical cognitive processes in order to reduce the cognitive burden.

Nonetheless, human reasoning, decision making and knowledge generation processes are essential for the effectiveness of HiL and should play a central role in the process. We consider the loop to be part of computational sense making by means of adjustable environments. In our opinion, HiL describes a machine environment that is managed by human knowledge in order to conduct continuous analytical discourses.

3 Comparison

As mentioned above, visualization and visual analytics represent two different approaches to gaining insights from data. Visualization is concerned with the depiction of data to assist in perceiving patterns, structures and relationships. The objective is to gain insights in order to comprehend data, make decisions and build trust in the underlying data. Data as well as models are depicted through visual artefacts or diagrams. To be visualized effectively, data has to be in a structured form. For this purpose, pre-processing steps such as data cleaning and wrangling are applied to the raw data. Visualizations are not only static; they also provide interactions for changing views.

The main focus of visual analytics is on solving problems. Visual analytics uses the techniques and methods provided by visualization to close the gap between algorithmic processes and human factors. Visualization is primarily concerned with the selection of optimal visual forms and interactions for a given problem, whereas visual analytics deals with the integration of analytical approaches into the entire knowledge generation process. The challenges of visual analytics are to find the best analytical approaches for a given problem, to automate them as much as possible, and finally to provide an integrated tool that takes human factors into account [2].

Since finding the right models and parameters for a given problem can be difficult, this task can also be shared with the user. In contrast to visualization, visual analytics includes a structured reasoning process for gaining knowledge from data [6]. The user plays an active role by steering the underlying processes and models via interactive interfaces. Visualization is used to continuously clarify the state of the analysis, to visualize data and to communicate new knowledge mutually. The combination of data-centric methods with user-centric interactions through visual interfaces is the subject of visual analytics, whereas visual representations without reference to analytical approaches are the subject of visualization.

Visual analytics focuses on exploratory analysis and is not limited to visualization and automated analysis. It also includes the entire infrastructure for creating visual analytics tools. By drawing on machine processing power and capacity, visual analytics is also applicable to huge amounts of heterogeneous and dynamic data, where visualization alone cannot be used. Several visual analytics tools have been developed in recent years. The purpose and effectiveness of these tools vary depending on utilization scenarios, provided visualization techniques and user knowledge. Table 1 presents our comparison of visual analytics tools developed by experts and described in scientific publications. Each row in the table corresponds to the main publication of one visual analytics tool. The references are listed below the table in chronological order of publication (top down). Each column of the table corresponds to a subsection of one of four interrogative questions: why, who, what and how. Each question serves as a category for classifying the tools:

  • Why - Why is it appropriate to use the particular tool, for what reason?

  • Who - Who should use, or is able to use, the tool?

  • What - What are the main features that can be analysed and visualized?

  • How - How can the visualization be done, and which techniques can be used? For this category we distinguish only between three general techniques.

Table 1. Comparison of visual analytics tools from (top down) [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]

4 Conclusion

This survey reviews the state of the art in visualization and visual analytics in data science. We presented both topics separately and compared well-known visual analytics tools based on the four categories presented in Sect. 3. Throughout, we have underlined the importance of visualization and visual analytics in data science. Data are more than just numbers and words. Analysing data is similar to storytelling. The stories in this process deal with the real world and can be simple and straightforward as well as complex and uncontrollable. We intend to focus our further research on the development of a platform for visual data science. Our vision is to allow a flexible and dynamic interconnection of analytical methods and visual techniques with automatic adaptation to data and possible problem statements.