Introduction

Data can often be interpreted better when presented as visualizations, of which charts are among the simplest and most ubiquitous forms. Charts are so widely used that a basic understanding of simple charts is included in the primary school curriculum. Simple charts, e.g., bar charts, scatter plots, etc., are commonly found in documents (textbooks, publications), print media (newspapers, magazines), and on the Internet, and are most prevalent in image format. Charts are redesigned and reconstructed, for instance, to obtain high-resolution images for applications such as generating accessible reading materials for differently abled students. Chart redesign also enables students with learning difficulties to understand data using alternative designs. These applications are hindered when the source data for a chart are not available alongside the chart image for ready consumption. Thus, data extraction in the form of semi-structured tables [1] from chart images is a relevant problem, specifically for improving assistive technologies.

Chart interpretation includes data extraction and textual summary generation. Redesigning multi-class charts is a motivating application of data extraction, as such charts are relatively difficult to interpret [2]. Redesign requires the source data used to generate the original plot as well as information about the multiple classes represented in the image. A motivating application of chart summarization is generating alternative text (alt text) for web and document accessibility. While there are applications where user intent can refine the chart summary [3], using the extracted data facilitates more generic applications where the goal is to inform the user of the source data of the chart. In accessibility applications, a chart summary describes the data more succinctly than the data table. One approach for generating the textual summary of a chart image generically, yet automatically, is to use its extracted data along with its extracted textual content.

Since the design space for charts is large, in terms of chart types and their formatting, we focus on a single chart type here. Amongst all statistical plots, the bar chart is the most commonly used for visual summarization. Bar charts have subtypes depending on the data type and user requirements, such as simple, stacked, and grouped bar charts, to name a few. Stacked and grouped bar charts help visualize multi-class or multi-series data: the grouped bar chart shows inter- and intra-class trends, and the stacked bar chart shows part-to-whole information for multiple classes. Overall, we focus on bar charts and their subtypes here.

The state-of-the-art reasoning over scientific plots includes bar charts as a chart type of interest [1, 4]. These methods use convolutional neural networks (CNNs) as the object detection component for extracting bar geometry. However, we find that CNNs for bar extraction may fail specifically for stacked bars. Hence, we use an image processing method that exploits spatial locality for object detection [5]. We, thus, propose a semi-automated workflow (Fig. 1), called BarChartAnalyzer [6], that takes an image as input, identifies the bar chart, sub-classifies it into a bar chart subtype, and then performs data extraction. The extracted data can be further used for reconstruction to obtain customized charts of high image quality, as well as for redesigning complex charts into simpler ones. Here, we extend our previous work on BarChartAnalyzer by improving the text summarization component using a novel design study. Our approach is inspired by the design study methodology (DSM) used for generating data visualizations and information graphics [7]. In summary, our contributions are:

  • an end-to-end semi-automated workflow (Fig. 1) for interpreting images of bar charts and their seven subtypes, with an improved text summary,

  • a novel design study-based approach for template-based chart summarization for bar chart subtypes,

  • a training dataset of bar chart images for bar chart subtype classification, along with text summaries.

Our most significant contribution here is the text summary generation component (C7), for which we adopt DSM, a methodology usually used for visualization design; we propose text summary generation from chart images as a novel application of DSM. We have designed and implemented a user study as a part of DSM. Our current work builds on our previous work on the use of tensor fields for object (i.e., bar) detection [5] and the proposed workflow (C1–C7) in BarChartAnalyzer [6]. In the latter, chart summarization relies entirely on a template, which we now improve using observations from a user study.

Fig. 1

Our proposed workflow for data extraction from a given chart image using our proposed semi-automated BarChartAnalyzer (BCA), with seven components (C1–C7), for applications including chart reconstruction and redesign. Significant components are C1 for classifying the chart image into bar charts and their subtypes, C2–C4 for feature extraction, C5 for text detection for data contextualization, and C6 for data table generation. Here, we use the design study methodology for improving the text summary in C7. This image is a modified version of the one in our previous work [6]

Related Work

Chart interpretation is generally divided into smaller tasks such as chart type classification, data extraction, and, optionally, reconstruction or redesign, and summarization. ReVision is a system that identifies the chart type and extracts visual elements and encoded data by creating feature vectors and identifying geometric structures in pixel space [8]. WebPlotDigitizer is another system that provides both automatic and manual procedures to extract data from given chart images [9]. However, the tool requires extensive user interaction for aligning axes to select data points. It works for simple bar charts but fails to recover class information for stacked and grouped bar charts.

Machine learning models have been effectively used for classification and/or object detection problems in chart analysis. Beagle is a web-based system for classifying charts in scalable vector graphics format [10]. Text type classification has been done using feature vectors generated from the geometric properties of text, along with mark type classification using a fine-tuned AlexNet [11]. FigureSeer uses a similar fine-tuning approach [12]. A convolutional neural network (CNN) model used for chart classification can also be used for object detection, e.g., for chart objects such as bars in the source image [4]. ChartSense uses GoogleNet to classify line, bar, pie, scatter charts, map, and table types [13]. ChartSense further uses the connected components method to extract bar objects, using the x-axis as a baseline in the image. While this method works for simple bar charts, charts with bars of multiple classes (e.g., grouped bars) get incorrectly identified as belonging to the same class. The existing object detection-based methods have not been shown to work for all subtypes of bar charts, e.g., stacked bars, for which training data is currently unavailable. Hence, we rely on conventional computer vision and image processing methods to identify bars across all bar chart subtypes, and bar segments in the case of stacked bars. We choose a corner detection method that exploits spatial locality using second-order tensor fields [5] in BarChartAnalyzer. These methods largely fall under the category of chart interpretation through visual structure extraction.

Apart from visual structure extraction, chart analysis can be done through automated chart question–answering (CQA) systems, and data parsing. PlotQA is an example of CQA that uses a more accurate neural network for object detection for visual elements, such as bars [1]. While PlotQA is an example of an approach where data are parsed from charts, and then used for QA, an alternative approach for CQA is using natural language understanding (NLU) methods. An example of such an approach is through the use of transformers for answering questions from charts directly, of which Structure-based Transformer using Localization, STL-CQA [14], is a recent solution.

Text detection is important for chart inference. Automated data extraction for bar charts has been done by identifying graphical components and text regions independently [15, 16]. Text detection and recognition are both required for extracting the text content from a chart image. Chart data are finally extracted using inference from both graphical and text content.

A textual summary of a chart is a relevant task for its interpretation. While this may appear counter-intuitive, since charts are themselves visual summaries of data, text summarization has been found useful for visually impaired (VI) users to read chart images, which are usually embedded in documents. Visualization of summaries in a graphical interface is an alternative to text summaries of charts, e.g., Chartseer [17], but does not improve accessibility for VI users. Hence, conventional textual summaries are of interest to us. The challenges involved in textual summarization are in the selection of relevant information, organization of the content into fluent text, use of appropriate comprehensible sentence structures, and choice of appropriate expressions in the concerned language, i.e., English in this case [18]. The purpose of the chart summary determines the relevant information that is added to the content [16]. Once the content is identified, the summary is generated using one of two natural language generation (NLG) approaches, namely, predefined templates and deep-learning methods. The predominantly used template-based approach produces more rigid content than the more recently used deep-learning methods. For generic applications or for creating master summaries, the former is simpler to implement and also suffices.

Using predefined templates, the iGRAPH-Lite system generates a short summary describing the visual appearance of the chart itself, but does not provide insight into the information conveyed by the visualization [19]. Linguistic constructs have been used to generate a chart summary with the help of a semantic graph representation [20]. Three types of features are selected from charts, namely, salience, trend, and rank, which encode details such as an increasing or decreasing trend, or a specific colored bar showing a highly prominent value. However, we have found that the summary output of this system has limited descriptive content. Summaries, especially for bar charts, have also been generated by calculating differences using existing attributes in chart images and providing the core message represented by the selected chart [21]. In an improved version, the level of importance of the different aspects of the content has been determined using a user study [18]. A centrality-based algorithm based on PageRank has been used for content selection prior to templatized sentence creation for bar (simple and grouped) and line charts, for improving the accessibility of information graphics to VI users [22]. Unlike these methods, we use the design study methodology used in human–computer interaction (HCI) [7] as an alternative for content selection. Our work is also different from users collaboratively designing chart summaries using a graphical interface, e.g., Chart Constellations [23].

Recently, deep-learning methods in NLG have been used for chart summarization. These include the 1D convolutional residual network in Autocaption for creating captions of visualizations [24], the long short-term memory (LSTM) network with user inputs in cyberphysical social systems [3], and the transformer-based encoder–decoder architecture in Chart-to-Text [25].

The 7-Component Workflow

We propose a workflow, BarChartAnalyzer, that interprets a given chart from its image. It has seven main components (Fig. 1), namely, chart subtype classification, chart image annotation, canvas extraction, tensor field computation, text recognition, data table extraction, and chart summarization.

Chart Subtype Classification (C1): Data are represented using different chart types based on the number of variables and user requirements. For example, scatter plots and bar chart representations visually encode the data differently, requiring different chart analysis approaches. In the case of bar charts, there are commonly used subtypes, namely, simple bars, grouped bars, stacked bars, and of different orientations, depending on design requirements. Since the data extraction process from a given chart image depends on its chart type, identifying the type/subtype is the first step of our workflow.

Image classification is a widely studied problem in computer vision, and it has been done using different CNN-based classification models, such as AlexNet and GoogleNet. These models were trained and tested for a set of natural images provided during the ImageNet challenge [26]. The natural images contain characteristics other than the shape of the object, like texture, finer edges, color gradients, etc. However, compared to natural images, the chart images are sparser and more structured with repeating patterns. Hence, the models that work for natural images do not work effectively for chart images.

The chart objects, such as bars, scatter points, and lines, are distinguishable based on their shape and geometry, unlike objects found in natural images. However, the bar chart subtypes all have similar geometry, i.e., bars. Hence, contour-based techniques are ineffective for chart subtype classification. A few pre-trained models have been used for chart type classification of images by imposing certain constraints, e.g., training with a small image corpus; however, the classification outcomes have low accuracy. The classifier in ChartSense uses GoogleNet [13], which has been trained on different chart types. This model also classifies subtypes, such as grouped and simple bars, since the features are similar in both subtypes. However, other subtypes of our interest, namely stacked bar charts, have not been explored. On the other hand, mark-based chart classification is an alternative approach [11], where the classifier is trained to recognize five mark types: bars, lines, areas, scatter plot symbols, and other types. This classifier is limited to the identification of the chart type without any scope of extension to subtypes. Thus, we explore all bar chart subtypes, including stacked bar charts and histograms, which have not been widely considered in the state-of-the-art.

Fig. 2

(Left) The architecture diagram of our CNN-based classifier for identifying bar chart subtypes. (Right) Human-guided annotation of the chart image, where the chart canvas is used for object detection. This image is a modified version of the one provided in our previous work [6], specifically for the CNN architecture diagram

Our proposed subtype classifier is inspired by the VGGNet (Visual Geometry Group Network) architecture [27], which is widely used for object detection and segmentation tasks on image databases, such as KITTI [28], an image benchmark dataset for road-area and ego-lane detection [29]. We choose the VGGNet architecture as it is efficient in feature extraction from images and addresses depth in convolutional networks. It is also simple to adapt for a new model, with the flexibility of adding more VGG blocks. The VGGNet architecture has a stack of convolutional layers (Fig. 2 (left)) that generalizes across deep-learning tasks. Our CNN model is a combination of convolutional, pooling, and fully connected layers, as specified in Table 1. The convolutional layers are responsible for extracting features by convolving images with kernels. Our classifier uses max-pooling to reduce computation by halving the spatial size. The trailing layers in our classifier are the fully connected layers that take the result of the pooling/convolutional layers and assign a label/class. Our classifier identifies the bar chart subtype of an input image. This classification also checks whether the given image is of bar chart type, as the workflow downstream accepts only bar charts, rejecting the others.

Training Dataset—The chart image dataset that we have curated consists of images of seven different subtypes of bar charts, namely, simple, grouped/clustered, and stacked bar charts of horizontal and vertical orientations, and histograms. The training dataset of our CNN model consists of images of these seven subtypes and, additionally, an “other” category. The “other” category includes commonly used scatter plot, line chart, and pie chart images, excluding bar chart subtypes. Histograms are included as one of the subtypes of bar charts, since some plotting tools, e.g., Google Sheets and Microsoft Excel®, use bar charts for histogram plots. Also, we observe that the geometry of bins in histograms is the same as that of columns/bars in bar charts. Our CNN-based classifier requires input images of fixed size for training; hence, we first resize the images in the dataset to \(200\times 200\) pixels. The image resizing and classifier implementation have been done using the Python Imaging Library (PIL) and Keras, respectively. Our CNN model for chart subtype classification is novel in its application for classifying bar chart subtypes. Our classifier assigns class labels to an input image specifying the bar chart subtype and its orientation, except in the case of histograms, e.g., “horizontal grouped bar,” “vertical stacked bar.”
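The exact layer configuration of our classifier is listed in Table 1. As an illustration of the overall structure only, a minimal Keras sketch of such a VGG-style classifier for \(200\times 200\) images and eight classes (seven bar chart subtypes plus “other”) is given below; the function name, filter counts, and block depths are placeholders of ours, not the exact values from Table 1.

```python
from tensorflow.keras import layers, models

def build_subtype_classifier(num_classes=8, input_shape=(200, 200, 3)):
    """VGG-style stack of convolution/pooling blocks followed by a fully
    connected tail that assigns a bar chart subtype label to the image.
    Filter counts and block depths here are illustrative placeholders."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for filters in (32, 64, 128):                 # one VGG block per iteration
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D((2, 2)))    # halves the spatial size
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```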

Table 1 The VGGNet [27]-inspired architecture of our proposed convolutional neural network for chart type classification, diagrammatically given in Fig. 2
Fig. 3

The illustration of the workflow for C2 and C3 demonstrates the chart image annotation and canvas extraction sub-components of the BarChartAnalyzer with their intermediate steps. Here, we follow the steps in C2-C3 as described in our previous work [6]

Image Annotation (C2) and Canvas Extraction (C3): Image annotation is usually performed to generate training datasets for computer vision-related problems like object detection, segmentation, etc. Image annotation entails labeling different regions of interest (ROIs) in the images. These predefined labels are used to detect and extract ROIs in test data using learning approaches. Since this requires contextual labels and appropriate associations between labels and ROIs, human-guided annotation is an optimal approach.

For chart images, manual marking and annotation of bounding boxes for ROIs have been widely used [1, 4]. Different labels are assigned to the components of chart images based on their role in the visualization, such as canvas, x-axis, y-axis, x-labels, y-labels, legend, title, x-title, and y-title, as shown for a sample chart image [Fig. 2 (right)]. We use LabelImg [30] to mark and annotate bounding boxes for ROIs of the above-mentioned labeled components of a chart image. LabelImg is a Python tool with a graphical user interface (GUI) for interactively selecting an image, drawing a bounding box for an ROI, annotating the ROI, and labeling it. We use the label Canvas for the ROI that contains the chart objects, such as bars, lines, or scatter points; this region is defined as the chart canvas, one of the chart image components [5]. The annotation is generated as an XML file that is processed to extract the canvas region as well as for text localization. The former is used for canvas extraction (C3), and the latter for text detection (C5).
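LabelImg saves the annotation in the Pascal VOC XML format. A minimal sketch of how such a file can be read and the annotated canvas cropped is given below; the file names and the helper function are ours, for illustration, and the label names follow the annotation scheme described above.

```python
import xml.etree.ElementTree as ET
from PIL import Image

def read_rois(xml_path):
    """Map each annotated label (e.g., 'Canvas', 'legend', 'x-labels') to its
    bounding box (xmin, ymin, xmax, ymax) in pixel coordinates."""
    rois = {}
    for obj in ET.parse(xml_path).getroot().findall("object"):
        box = obj.find("bndbox")
        rois[obj.find("name").text] = tuple(
            int(float(box.find(tag).text))
            for tag in ("xmin", "ymin", "xmax", "ymax"))
    return rois

# Crop the annotated chart canvas region for canvas extraction (C3).
canvas = Image.open("chart.png").crop(read_rois("chart.xml")["Canvas"])
```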

The canvas extraction step (C3) includes image preprocessing methods to remove the remaining elements other than chart objects, such as gridlines, overlaid legends, etc. The subsequent step (C4) on tensor field computation is sensitive to the presence of these extraneous elements, which leads to erroneous results. Image processing techniques, namely marker-based watershed segmentation and a contour detection algorithm, have been used to remove such components from the chart canvas effectively [5]. These steps also fill hollow bars, as required, since the tensor field is computed effectively for filled bars. Highly pixelated edges in aliased images lead to uneven edges in each bar object. This issue is addressed using the contour detection method to add a fixed-width border to the bars [5]. Overall, we perform these steps, as illustrated in Fig. 3 with an example, to extract a chart canvas containing the chart objects used in C4.
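As one fragment of this preprocessing, the filling of hollow bars and the fixed-width border can be done with OpenCV contours, as sketched below. The sketch assumes dark bars on a light background and omits the watershed-based removal of gridlines and legends; the function name and threshold choice are ours.

```python
import cv2
import numpy as np

def fill_bars(canvas_bgr, border_px=2):
    """Fill hollow bars and thicken their outlines so that C4 operates on
    solid, smooth-edged bar objects (illustrative sketch only)."""
    gray = cv2.cvtColor(canvas_bgr, cv2.COLOR_BGR2GRAY)
    # Bars are assumed darker than the background; invert and threshold (Otsu).
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    filled = np.zeros_like(mask)
    cv2.drawContours(filled, contours, -1, 255, thickness=-1)        # fill bars
    cv2.drawContours(filled, contours, -1, 255, thickness=border_px) # add border
    return filled
```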

Fig. 4

Illustration of the steps in C4 for corner point extraction from a bar chart image using tensor field computation: DBSCAN clustering of critical points obtained from anisotropically diffused tensor voting with saliency value \(C_l<0.4\). Here, we follow the steps in C4 as described in our previous work [6]

Tensor Field Computation (C4): Tensor fields have been widely used to exploit geometric properties of objects in natural images [31] using structure tensor and tensor voting. We use a local geometric descriptor as a second-order tensor for tensor vote computation [32] that further leads to corner detection in the case of bars for a given bar chart. The steps in C4 are illustrated in Fig. 4 with the help of an example.

The structure tensor \(T_s\) at a pixel provides the orientation of the color gradient computed from the local neighborhood. \(T_s\) is the gradient tensor \(T_g\) at the pixel with intensity \(I\), convolved (using the \(*\) operator) with a Gaussian function \(\mathcal {G}\) with zero mean and standard deviation \(\rho\). Then, \(T_s\) is given as follows:

$$\begin{aligned} T_g &= G^TG, \text { where the gradient vector } G=\begin{bmatrix} \frac{\partial I}{\partial x}&\frac{\partial I}{\partial y}\end{bmatrix}, \\ T_s &= \mathcal {G}_\rho *T_g. \end{aligned}$$
(1)
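Equation (1) can be computed per pixel with NumPy and SciPy as sketched below. The workflow computes the gradient from color in the CieLAB space, as noted later in this section; for brevity, this sketch, with a function name of ours, uses a single intensity channel.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor(intensity, rho=1.0):
    """Per-pixel structure tensor T_s (Eq. 1): the outer product of the image
    gradient, smoothed componentwise with a Gaussian of standard deviation rho."""
    Iy, Ix = np.gradient(intensity.astype(float))   # dI/dy, dI/dx
    Txx = gaussian_filter(Ix * Ix, rho)
    Txy = gaussian_filter(Ix * Iy, rho)
    Tyy = gaussian_filter(Iy * Iy, rho)
    return Txx, Txy, Tyy   # the three distinct entries of the symmetric 2x2 tensor
```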

The tensor vote cast at \(x_i\) by \(x_j\) using a second-order tensor \(K_j\) in d-dimensional space is \(S_{ij}\), computed using the closed-form equation [33]. If \(d_{ij}\) is the distance vector, i.e., \(d_{ij}=x_j-x_i\), then its unit vector is \(r_{ij}=\hat{d_{ij}}\). Let \(I_d\) be the d-dimensional identity matrix, and \(\sigma _d\) be the scale parameter used in the Gaussian inverse distance weight, \(c_{ij}=\text {exp}\big (-\frac{\Vert d_{ij}\Vert _2^2}{\sigma _d}\big )\). Thus, we compute \(S_{ij}\) as

$$\begin{aligned} S_{ij}&=c_{ij}R_{ij}K_jR_{ij}',\\ \text { where we compute } R_{ij}&=\left( I_d - 2r_{ij}r_{ij}^T\right) ,\\ R_{ij}'&=\left( I_d-\frac{1}{2}r_{ij}r_{ij}^T\right) R_{ij}. \end{aligned}$$
(2)

Here, we use the gradient tensor \(T_g\) as \(K_j\) [34]. If \(r_{ij}r_{ij}^T\) is the distance tensor, then \(R_{ij}\) is physically the tensor normal to the distance tensor, when the latter is treated as the tangent tensor.
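A direct NumPy transcription of Eq. (2) for a single vote is sketched below; the function name and the default value of \(\sigma_d\) are ours, for illustration.

```python
import numpy as np

def tensor_vote(x_i, x_j, K_j, sigma_d=10.0):
    """Closed-form second-order tensor vote S_ij cast at x_i by x_j (Eq. 2)."""
    d = np.asarray(x_j, float) - np.asarray(x_i, float)   # distance vector d_ij
    dist2 = float(d @ d)
    if dist2 == 0.0:
        return np.zeros_like(K_j)
    r = d / np.sqrt(dist2)                                # unit vector r_ij
    rrT = np.outer(r, r)                                  # distance tensor
    I_d = np.eye(len(d))
    c = np.exp(-dist2 / sigma_d)                          # Gaussian distance weight
    R = I_d - 2.0 * rrT
    R_prime = (I_d - 0.5 * rrT) @ R
    return c * (R @ K_j @ R_prime)
```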

We compute the color gradient using the color in the CieLAB space [5]. Owing to the use of color for corner detection, BarChartAnalyzer fails for bar charts that use texture for region-fill of the bars. This is observed especially in the case of grouped and stacked bar charts where texture is used to differentiate bars of different classes/series.

Anisotropic diffusion As the tensor votes \(T_v\) in the normal space have to encode object geometry in the tangential space, we perform anisotropic diffusion to transform \(T_v\) to the tangential space [5, 32]. The eigenvalue decomposition of the two-dimensional \(T_v\) yields ordered eigenvalues, \(\lambda _0 \ge \lambda _1\), and corresponding eigenvectors \(v_{0}\) and \(v_{1}\), respectively. Anisotropic diffusion requires a diffusion parameter \(\delta\), for which \(\delta =0.16\) is widely used [5, 35]. Anisotropic diffusion of \(T_v\) gives \(T_\text {v-ad}\), which is a positive semidefinite second-order tensor

$$\begin{aligned} T_\text {v-ad}=\sum \limits _{k=0}^1 \lambda _k'\,v_{k}v_{k}^T,\quad \text { where }\; \lambda _k'=\text {exp}\Big (-\frac{\lambda _k}{\delta }\Big ). \end{aligned}$$
(3)

Saliency computation The saliency of a pixel belonging to line-type or junction/point-type geometric features is determined by the eigenvalues of \(T_\text {v-ad}\) [5]. We obtain saliency maps giving, at each pixel of the image, its likelihood of being a line- or junction-type feature, \(C_l\) and \(C_p\), respectively,

$$\begin{aligned} C_l=\frac{\lambda _0-\lambda _1}{\lambda _0+\lambda _1}\;\text { and }\; C_p=\frac{2\lambda _1}{\lambda _0+\lambda _1}, \end{aligned}$$
(4)

using eigenvalues of \(T_\text {v-ad}\) of the pixel, such that \(\lambda _0\ge \lambda _1\). The pixel with \(C_p\approx 1.0\) is referred to as a critical point or degenerate point in the parlance of tensor fields. We detect all the critical points in the chart canvas during C4.
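For a single pixel, the diffusion of Eq. (3) and the saliencies of Eq. (4) can be computed from the aggregated vote as sketched below; the function name is ours.

```python
import numpy as np

def saliency_from_vote(T_v, delta=0.16):
    """Apply the anisotropic diffusion of Eq. 3 to a 2x2 tensor vote T_v and
    return the line- and junction-type saliencies C_l and C_p of Eq. 4."""
    eigvals, eigvecs = np.linalg.eigh(T_v)       # eigenvalues in ascending order
    lam_prime = np.exp(-eigvals / delta)         # lambda'_k = exp(-lambda_k / delta)
    T_ad = (eigvecs * lam_prime) @ eigvecs.T     # T_v-ad = sum_k lambda'_k v_k v_k^T
    mu1, mu0 = np.linalg.eigvalsh(T_ad)          # mu0 >= mu1 > 0
    C_l = (mu0 - mu1) / (mu0 + mu1)
    C_p = 2.0 * mu1 / (mu0 + mu1)
    return C_l, C_p

# Pixels with C_p close to 1 are the critical (degenerate) points, i.e., the
# bar-corner candidates clustered in the DBSCAN step described next.
```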

DBSCAN clustering The critical points of the chart image, computed from the tensor field, form sparse clusters at the corners of each bar [5]. These pixels are localized using density-based clustering, DBSCAN [36], and cluster centroids are computed by tuning the hyperparameters of DBSCAN to the specific chart subtype. These cluster centroids are treated as the corners of the bars. Using the positional layout or arrangement of these corner points based on the specific chart type and subtype, we heuristically compute the height of each bar in pixel space.
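A minimal sketch of this step with scikit-learn is given below; the function name and the default hyperparameter values are illustrative placeholders, since eps and min_samples are tuned per chart subtype.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def corner_points(critical_pixels, eps=5.0, min_samples=4):
    """Cluster critical-point pixels (rows of (x, y) coordinates) with DBSCAN
    and return one centroid per cluster as a bar-corner estimate."""
    pts = np.asarray(critical_pixels, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    centroids = [pts[labels == k].mean(axis=0)
                 for k in set(labels) if k != -1]   # label -1 marks noise
    return np.array(centroids)
```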

Text Recognition (C5) and Information Aggregation for Data Extraction (C6): At this juncture, the data we have extracted from the chart image, using tensor field computation, are in the image (or pixel) space. However, the extracted data must be in the data space for accurately summarizing and optionally reconstructing the chart. Hence, to transform the data from the pixel space to the data space, we now combine the data in pixel space with the text information in the image. We perform text detection to get x-axis and y-axis labels and compute the scale factor between the pixel and data spaces. The recognition of other textual elements, namely, plot title, legend, x-axis, and y-axis titles, also plays a crucial role in analyzing chart images, e.g., the information is used in summary (C7).

The deep-learning-based OCR, namely Character Region Awareness for Text Detection, CRAFT [37], is used for effective text area detection, including arbitrarily-oriented text. This approach is designed for relatively complex text in images, and it works by exploring each character region and considering the affinity between characters. A CNN trained in a weakly supervised manner predicts the character region score map and the affinity score map of the image. The character region score is used to localize individual characters, and the affinity score to group characters into a single text instance. Therefore, the detected text instance is not affected by its orientation or size. The text orientation is inferred from the detected text boxes, and the text is then rotated to horizontal orientation for proper extraction.

The CRAFT text detection model can be followed by a unified framework for scene text recognition, called the scene text recognition framework STR [38], which fits all variants of scenes. Being a four-stage framework consisting of transformation, feature extraction, sequence modeling, and prediction, STR resembles a combination of computer vision tasks, such as object detection, and a sequence prediction task and, hence, uses a convolutional recurrent neural network (CRNN) for text recognition. We find that the CRAFT model, along with the STR framework, retrieves labels and titles of the chart image better than the widely used Tesseract OCR [39]. This is because Tesseract OCR fails for commonly found characteristics of chart images, such as text content with different colors, sizes, orientations, curvy fonts, and different languages, along with interferences or issues in the text, such as low resolution, exposure, noise, motion blur, out-of-focus blur, varying illumination, etc. Thus, we use the CRAFT model with the STR framework here.
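Both CRAFT and the STR recognizer are available as open-source implementations. As one possible off-the-shelf combination (an assumption of ours, not necessarily the exact stack used in this work), the easyocr library bundles CRAFT-based detection with a CRNN-style recognizer:

```python
import easyocr

# Detect and recognize all text in the chart image; each result is a tuple of
# (bounding box corner points, recognized text, confidence score).
reader = easyocr.Reader(["en"], gpu=False)
for box, text, conf in reader.readtext("chart.png"):
    print(box, text, round(conf, 2))
```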

In C6, subsequent to text detection and recognition, we transform the data extracted in pixel space to data space and add appropriate textual information for the variable name and bar width. Once the information is in the data space, we extract the data table. For both stacked and grouped bar charts, we additionally identify class/series information using the legend. The class/series information is then added to the extracted data table.
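For a linear value axis, the pixel-to-data transformation reduces to a scale-and-offset mapping between two recognized tick labels and their pixel positions; a minimal sketch (the function name and example values are ours) is given below.

```python
def pixel_to_data(y_pixel, tick_pixels, tick_values):
    """Map a bar-top y-coordinate from pixel space to data space using two
    y-axis ticks recognized in C5; assumes a linear value axis."""
    (p0, p1), (v0, v1) = tick_pixels, tick_values
    scale = (v1 - v0) / (p1 - p0)   # data units per pixel (negative if pixel
                                    # rows grow downwards)
    return v0 + (y_pixel - p0) * scale

# Example: tick "0" at row 400, tick "50" at row 100, bar top at row 220.
print(pixel_to_data(220, (400, 100), (0, 50)))   # -> 30.0
```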

Chart summarization (C7) Our next step is to generate a summary of the chart image based on its retrieved data table and the perceivable visual elements of the chart image. While our approach is to use a predefined template for generating the textual summary, we propose a design study for generating it, as explained in section "Design Study for Chart Summarization". A design study is apt here, since this is a task involving human–computer interaction (HCI), and the output of the workflow, i.e., the summary, is to be designed for real users [7]. The design study is necessary to facilitate consideration of user inputs during summary generation. This is different from the conventional practice of using a user study for evaluating the chart summary [3, 25]. Thus, in our work, we utilize a user study for design iterations of the summary template.

Design Study for Chart Summarization

An effective summary should concisely represent the significant information of the chart image instead of being a detailed textual representation of the entire chart data table. The information perceived as significant varies from user to user based on aspects such as their education, work, and reasoning. Given our requirement of creating a generic summary without serving any specific intent, a predefined template suits our requirement better than deep-learning solutions based on user inputs [3].

We use the 9-stage design study methodology [7] to arrive at the predefined template to be used for the chart summary. Even though the design study methodology pertains to visualization as an output of human–computer interaction (HCI), it can also be extended to the design of chart summaries. This is because the chart summary is computer-generated but consumed by human users, just like visualizations themselves. The nine stages are split across three top-level phases in sequence, namely, precondition, core, and analysis. The precondition phase, which focuses on the design's preparation, includes the learn, winnow, and cast stages in sequence. The core phase, central to the actual design process, includes the discover, design, implement, and deploy stages in sequence. The analysis phase, which is done retrospectively, includes the reflect and write stages in sequence. Even though these nine stages form a linear process, the design study involves several feedback loops leading to a highly iterative, dynamic methodology.

For specifically C7, i.e., chart summarization, we loosely consider both the determination of the problem statement pertaining to automatically interpreting chart images and the design of components C1–C6 of our workflow as the precondition phase. Here, we elaborate on the core and the analysis phases.

Fig. 5

Our proposed flowchart of sentence structure formation in the chart summary (Source: [6]), that is implemented in the design stage of the core phase in the design study

Core phase Here, we implemented the following:

  • Discover Problem Characterization and Abstraction—this stage entails gathering requirements from experts and users. Thus, to understand how automated chart interpretation with a chart summary can improve accessibility for blind users, we listened to experts in education for the visually impaired (VI), one visually impaired person, and volunteers who create educational resources for the VI, such as haptic materials and braille textbooks. We selected this group of experts and users in the context of interpreting charts given in high school mathematics textbooks. We gathered that the manually created summary currently includes visual characteristics of color and geometry, and details of the chart title, axes, and ranges. The summary also optionally includes apparent characteristics of the data, and usually, the chart is accompanied by the data table, which can be read using screen readers. Since our goal is to improve accessibility to charts in print and mass media, where the data table is usually unavailable to the user, we added this as an additional requirement.

  • Design Encoding—To satisfy the requirements, we arrived at a sentence structure specifically for bar charts and their subtypes [6], using a flowchart (Fig. 5). This well-built sentence structure serves as the predefined template for the summary. It contains the visual summary of the appearance of the chart and the data table summary capturing the chart title, axes, and statistical descriptors. The descriptors include the data distribution, range, mean, median, mode, standard deviation, and correlation values of attributes, based on the bar chart subtype (a sketch of computing these descriptors is given after this list). The summary also includes the variable-based trend patterns in the chart image in the case of ordinal attributes.

  • Implement Prototype—We implemented the chart summarizer, i.e., C7 in the workflow (Fig. 1), as explained in section "The 7-Component Workflow". These system-generated summaries are programmatically generated for all the charts in our dataset.

  • Deploy Release and Gather Feedback—We then released a set of sample chart images and their corresponding system-generated summaries in a user study to gather feedback. Since the purpose of the user study is to evaluate the completeness of the system-generated summary, sighted participants are involved in the feedback process.
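The statistical descriptors mentioned in the design stage can be computed directly from the extracted data table; a minimal NumPy sketch (the function names are ours, for a single numeric column and a pair of series) is given below.

```python
import numpy as np

def data_descriptors(values):
    """Descriptors used in the data table part of the summary template."""
    values = np.asarray(values, dtype=float)
    uniq, counts = np.unique(values, return_counts=True)
    return {
        "range": (values.min(), values.max()),
        "mean": values.mean(),
        "median": np.median(values),
        "mode": uniq[np.argmax(counts)],
        "std": values.std(),
    }

def class_correlation(series_a, series_b):
    """Pearson correlation between two classes/series of a grouped or
    stacked bar chart, reported in the summary."""
    return np.corrcoef(series_a, series_b)[0, 1]
```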

User Study We conducted a user study to collect feedback on the initial system-generated summaries of chart images. This study is intended to refine the sentence structure in the predefined template for the chart summary. Our user study is similar to that conducted for identifying the intended message of a graphic [18], where the users evaluate our system-generated summary.

The core part of our user study has been designed as two different modules based on the objectives of the study. The first objective is to learn about the understanding and interpretation of bar chart images by the user. The second objective is to assess the completeness and effectiveness of the initial system-generated summary and help us further improve it. The second objective is partially similar to the user study for identifying the intended message of a simple bar chart [18], where participants are asked to rate specific propositions, i.e.,  pieces of information, as “essential,” “possible,” and “not important”.

Fig. 6

A snapshot of the user study displaying the web pages for a participant details collection, b survey instructions, c Module-1 for collecting creative summaries, and d Module-2 for comparative assessment of summaries

We designed an online user study that allows us to collect diverse responses from various sections of the population. The online study is also required owing to the social distancing during the pandemic. The participants are invited to the study through email, social media, etc., and are informed about the purpose of the user study. The participants’ responses are recorded using a web form on the study website. The web form has the following four parts in sequence, as shown in Fig. 6:

  1. Participants' details—This information includes the participant's name, email, age, gender, educational degree, profession (if they have worked or are working as an educator), vision deficiency, and a self-assessment of their familiarity with charts. This information helps us analyze how factors like age, gender, educational conditioning, and vision deficiencies impact the participants' chart understanding and interpretation.

  2. Instructions—A web page of detailed written instructions is included on the study website, including the expected time commitment for the study. In addition to these instructions, further queries have been answered through telephone calls, emails, social media correspondence, etc.

  3. Module-1—The first module of the core part of the user study satisfies the first objective of the study. Here, each participant is shown a set of three bar chart images, randomly selected from the 26 images in the test corpus. We ensure that the set presented to each participant includes a grouped bar chart, a stacked bar chart, and a simple bar chart/histogram. The participant sees an image and completes the assigned task for that image before being shown the next one. The task for each image entails writing one's own creative chart summary for the image. The participant is advised to prepare the summary in the way they would narrate it to a VI person.

  4. Module-2—The second module of the core part of the user study satisfies the second objective of the study. Here, we present the participant with the same set of chart images from the first module, the creative summaries they prepared in Module-1, and the corresponding system-generated summaries from the design stage. These three entities are juxtaposed for the participant to compare the two summaries. The task entails comparing both summaries and assessing them using Likert-like scales. We use three score components, i.e., compare-, sufficiency-, and grammar-scores, to rate our system-generated summaries qualitatively against the creative summaries. This comparison is done for one image at a time, as in Module-1.

The three components of the total score used in our study are given below, with the range of rating in parentheses and the interpretation of each rating.

Compare-Score (1–3):

  • 3: System-Summary is better than Creative-Summary

  • 2: Both summaries are equivalent or similar

  • 1: Creative-Summary is better than System-Summary.

Sufficiency-Score (1–5):

  • 5: System-Summary is sufficient but has extra information

  • 4: System-Summary is just sufficient

  • 3: Some minor details are missing in System-Summary

  • 2: Some critical information is missing in System-Summary

  • 1: System-Summary is not sufficient at all.

Grammar-Score (1–4):

  • 4: System-Summary is grammatically/structurally correct

  • 3: System-Summary needs minor modifications

  • 2: System-Summary needs major modifications

  • 1: System-Summary is grammatically/structurally incorrect.

As a final step of the user study, participants are provided a text box in the web form to optionally provide short feedback with suggestions for improving the system-generated summary.

The 30 participants of the study, primarily undergraduate and graduate college students in the age group 18–45, completed both Module-1 and Module-2. Each participant is given a set of three chart images comprising a grouped bar chart, a stacked bar chart, and a simple bar chart/histogram. These chart images are randomly selected from a generic test dataset of 26 images, obtained from sources such as the Internet and synthetic generation, that are simple and easily understood by most sections of the population. The outcome of the survey is that, in most cases, the participants were satisfied with the system-generated summary. They stated that the non-visual statistical description, such as the data distribution of a histogram, and the standard deviation and correlation values in stacked and grouped bar charts, improved their understanding of the data in the chart image. However, a few participants countered that the statistical description in the system-generated summary was extraneous. They also pointed out that the system-generated summaries had not included a few visually perceivable details, such as intra-class inferences, overall bar heights, and inter-class differences in heights in grouped and stacked bars. Some participants misinterpreted the histogram as a simple bar chart, without considering the semantics of the bars being counts or frequencies and of the chart being a visualization of the data distribution.

Apart from the suggestions on improving the content of the summary, there were suggestions on editorial corrections to the summaries, e.g., proper noun usage, capitalization of words, etc. These summaries also had instances of a few misspelled words due to improper text recognition in C5. Some participants suggested that a detailed data representation is desirable for charts with fewer bars, say \(<\,8\). Another piece of feedback was to use compound subjects in sentences to shorten the summary, e.g., "A is correlated to B, B is correlated to C, and C is correlated to A" is to be rephrased as "A, B, and C are correlated to each other."

Quantifying the survey outcomes, the average grammar-score is 3.33 out of 4.0, indicating that only minor grammatical modifications are required in the initial summaries. The average compare-score is 2.33 out of 3.0, indicating that our initial system-generated summaries cover details similar to the creative summaries by the participants. The participants reported that the system-generated summaries outperformed the creative summaries, especially for charts without any context, such as a missing chart title or axes titles. The average sufficiency-score is 3.97 out of 5.0, indicating that the system-generated summaries are sufficient. The outcomes of the user study lead to the analysis phase in the design study. The reflect stage of the analysis phase entails refining the summary using the feedback from the user study.

Fig. 7

Our proposed algorithm for generating the text summary in the BarChartAnalyzer, refined by the responses from the user study. This algorithm is implemented in the reflect stage of the analysis phase in the design study

Analysis Phase: We have implemented the following steps given in the design study methodology (DSM) [7]:

  • Reflect: Confirm and Refine—We confirmed our findings from the user study and then decided to change the text organization of the predefined summary template, moving away from the sentence flow format. Our refined text organization is implemented using a new algorithm, as shown in Fig. 7. This algorithm creates summaries specific to each chart subtype, thus exploiting the semantics of the subtype. Here, we also follow the programming best practice of reusing components: the grouped and stacked bar charts are reduced to simple bar charts, and the summarizer for the simple bar chart is invoked. We refined the visual summary to include intra- and inter-class trends in grouped and stacked bar charts. For histograms, we re-annotate the variables to include terms such as "frequencies" and "bins" to convey the chart semantics. The new text organization now has three parts in sequence [22]—the chart specifications, the visual summary, and the data table summary. The content from the visual and data table summaries is then de-duplicated to make the content crisp. We additionally use an open-source Python library, GingerIt, to correct spelling and grammar mistakes based on the construction of the complete sentences. This gives us the final chart summary (a sketch of this assembly and correction step is given after this list).

  • Write: Design Study Paper—Our elaborate documentation in this section serves as the written report of the entire design study.
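A minimal sketch of the final assembly and correction step is shown below; the function name and the treatment of the three parts as pre-built strings are ours, the de-duplication step is omitted, and the GingerIt call assumes the gingerit package's parse interface.

```python
from gingerit.gingerit import GingerIt

def assemble_summary(chart_spec, visual_summary, data_summary):
    """Concatenate the three parts of the refined text organization (chart
    specifications, visual summary, data table summary) and run a spelling/
    grammar correction pass over the draft."""
    draft = " ".join(part.strip() for part in
                     (chart_spec, visual_summary, data_summary) if part)
    return GingerIt().parse(draft)["result"]   # corrected summary text
```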

We conclude that our chart summarization provides a master summary. This summary can be further pruned based on user inputs, thus customizing this component for the intended purposes. The parts of the modified text organization can also be considered as levels of granularity of information, which can be used in an interactive framework for providing high-to-low level details of the graphic, e.g., Interactive SIGHT (Summarizing Information GrapHics Textually) [40].

Experiments and Results

This section discusses the overall implementation and performance of the BarChartAnalyzer.

Implementation: The workflow is implemented using PyCharm 2018.3 on an Intel i5 processor with 8 GB RAM running macOS 10.14.3. Manual intervention is required for image annotation in C2 and, optionally, for tuning the hyperparameters of the DBSCAN algorithm in C4. We manually annotate the chart image using the LabelImg tool, separately installed on our machine, which takes \(\sim \,2\) min. We then input the chart image along with the generated XML file to our BarChartAnalyzer. The execution time of the entire BarChartAnalyzer workflow is \(\sim \,3\) min for an image of 192 DPI (dots per inch), of which the tensor voting computation consumes two-thirds of the running time. Our current implementation of the tensor voting computation is serial, with scope for a parallel implementation in the future to reduce the execution time of BarChartAnalyzer.

In C1 of the BarChartAnalyzer, we have trained the CNN model for classification using 1000 images belonging to eight types of charts, namely, the seven subtypes of bar charts and a complement set, "others", consisting of line chart, scatter plot, and pie chart images. The training set excludes images of charts with textured, hollow, or hand-drawn bar objects. The training accuracy of our classifier is currently at 85%. For testing, we have used a dataset of 50 chart images from each of these eight types. For experiments, we generated a dataset that includes chart images of these eight types from two sources, namely, images downloaded from the Internet and synthetically generated images. The latter are bar charts generated from known data tables using the Python plotting library matplotlib.

Fig. 8

The key steps in our BarChartAnalyzer workflow of corner detection (C4), text detection (C5), data extraction (C6), chart reconstruction, and chart summary (C7) of the source input chart images. We observe that the ordering of the classes could be reversed in (ii) the grouped and (iii) stacked bar charts, even though the data are extracted accurately (Source: [6])

Fig. 9

Reconstruction of synthetically generated bar chart images with their error evaluation in normalized mean absolute error (nMAE) and mean absolute percentage error (MAPE) (Source: [6])

Experimental Results: We have tested our entire system on the 50 chart images for each of the eight chart types, apart from using them for testing the chart type classifier. The results from the BarChartAnalyzer for a subset of our experiments are shown in Fig. 8. The images are first classified, and only those of bar charts and their subtypes pass through the BarChartAnalyzer. The source images are given in Fig. 8a. The tensor field analysis on the extracted canvas detects the corners of the bars using critical points identified by the saliency value calculation. The results of pixels identified by corner detection are shown in Fig. 8b. The critical points are detected at the top and bottom corners of bars and at the bar segment junctions in the stacked bar chart. The histogram displays the distribution of such points at the junctions where the transitions between bins occur. The visualization of critical points at corners also guides us in tuning the hyperparameters for DBSCAN, e.g., distance (eps) and minPts.

The OCR-based text detection model [37] works with a 0.95 F1-score on the ICDAR 2013 dataset. The model fails to detect certain text components during testing, as shown in Fig. 8c, ii. Our workflow addresses this limitation while performing data extraction based on pixel scaling and the intervals retrieved from the detected text/values. Our reconstructed charts in Fig. 8d can be visually compared with the original images in Fig. 8a.

Color is an important property of the images of the multi-class bar charts like grouped and stacked bar, as color is a visual encoding of the metadata of the classes. In such cases, color represents the identity of the classes the data item belongs to. However, in the case of simple bar charts and histograms, the use of color is cosmetic. Our algorithm preserves the source color value only in the case of it being a visual encoding, where we use the color value identified in the legend for the bars during reconstruction. In the cases where color is not used as a visual encoding for the chart, we use a default color value, i.e., black, during reconstruction. Thus, color is preserved in reconstruction for charts in Fig. 8ii, iii, but not in Fig. 8i, iv. However, even where color is preserved, the order of rendering the classes is not guaranteed to be preserved, as shown in Fig. 8ii, iii, as the ordering of the classes is not an important property in the multi-class bar charts.

Evaluation: The premise of our work is to extract data from images of charts that do not have accompanying data tables, i.e., the ground truth. Hence, to compare the extracted data with source information, we run our algorithm on images of charts generated from known data tables using a plotting library, e.g., matplotlib. Since our algorithm works well with such synthetically generated images owing to their high resolution and fidelity, they are useful for computing exact numerical errors in the extracted data table.

The data extraction involves mapping the pixel locations of the cluster centers of degenerate points and the text locations extracted using OCR. Such a mapping causes the extracted values to have predominantly numerical precision errors. Hence, to compare the difference between the extracted and source values, we compute the normalized mean absolute error (nMAE) and the mean absolute percentage error (MAPE) for the synthetic images (Fig. 9), both bounded in [0,1]; MAPE is commonly reported in a percentage format. We observe that nMAE captures our performance better than MAPE, as it does not inflate numerical precision errors as much as MAPE does. MAPE is inflated in the case of missing extracted data in grouped bar charts (Fig. 9ii) and stacked bar charts (Fig. 9iii) owing to relatively short bars or bar segments. For N data items with source data value \(x_i\) and its corresponding extracted value \(x_i^{(e)}\), for \(i=1,2,\ldots ,N\), we compute

$$\begin{aligned} \mathrm{nMAE} = \frac{\sum \nolimits _{i=1}^N |x_i-x_i^{(e)}|}{\sum \nolimits _{i=1}^N x_i}. \end{aligned}$$
(5)

In our representative examples in Fig. 9, we observe relatively low nMAE values. Histograms are not included in this analysis, as the source and extracted data in their case form a frequency table, different from the data table in the case of the other bar chart subtypes.
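A minimal NumPy sketch of the two error measures (Eq. 5 for nMAE, and the usual percentage form of MAPE) used in this comparison is given below; the function names are ours.

```python
import numpy as np

def nmae(x_true, x_extracted):
    """Normalized mean absolute error (Eq. 5)."""
    x_true = np.asarray(x_true, dtype=float)
    x_extracted = np.asarray(x_extracted, dtype=float)
    return np.abs(x_true - x_extracted).sum() / x_true.sum()

def mape(x_true, x_extracted):
    """Mean absolute percentage error; inflates when short bars are missed."""
    x_true = np.asarray(x_true, dtype=float)
    x_extracted = np.asarray(x_extracted, dtype=float)
    return np.mean(np.abs((x_true - x_extracted) / x_true)) * 100.0
```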

Table 2 Comparison of accuracy of data extraction of BarChartAnalyzer with the state-of-the-art
Fig. 10

Comparison of the summaries generated during the core and analysis phases of our design study for chart summarization. The boldface formatted excerpts in row E2 are added during the reflect stage of the analysis phase in the design study

Table 2 gives the overall accuracy of the data extraction component of our BarChartAnalyzer, comparing it to the results reported by the state-of-the-art data extraction methods. Unlike the state-of-the-art methods, our work differentiates performance across the bar chart subtypes. Owing to the low count of images used for computing the scores, a few failures in data extraction tend to bring the scores down. Hence, increasing the test corpus size for validation would give us more realistic comparative results. Despite the caveat of the test corpus size, our proposed workflow performs comparably to the state-of-the-art data extraction methods. BarChartAnalyzer achieves near-perfect accuracy for high-resolution bar chart images created with the standard or minimal formatting commonly available across plotting libraries. The morphological methods for image preprocessing in C3 of BarChartAnalyzer improve data extraction accuracy from low-fidelity images. The aggregated accuracy of PlotQA [1] for CQA is 22%, and STL-CQA [14] achieves near-perfect accuracy, but with synthetic datasets. However, comparing our work with the CQA algorithms is not fair, as the goals are different, even though there are overlapping outcomes.

Figure 10 shows the sample results of our workflow for text summary generation for different bar chart types. We observe that the summaries are qualitatively complete after the design study. The summaries for grouped and stacked bar charts are longer and contain more details, serving well as the master summary.

Fig. 11

Examples of bar chart images that give erroneous results in BarChartAnalyzer (Source: [6]). The errors in the chart reconstruction are indicated using red translucent boxes in row D

Fig. 12

Bar charts generated in different design spaces, which are known to not work with our chart analysis workflow, BarChartAnalyzer (Source: [6])

Limitations: In a limited number of cases, our system suffers from errors in detection, specifically when DBSCAN clustering does not distinguish small/insignificant height differences between bars/bins [5]. We have identified two such cases. The first case is of false negatives when bars are close to the baseline, which is the x-axis and y-axis for column and bar orientations, respectively (Fig. 11d, ii). The second case is when adjacent bins in a histogram have relatively small height differences, and the extracted data do not capture the differences (Fig. 11d, iv). This error also manifests as missing values in grouped and stacked bar charts when the bars or bar segments are relatively short (Fig. 9ii, iii).

The text recognition model [38] identifies text with an F1-score of 0.93 on the ICDAR 2013 dataset. This recognition model confuses the letter 'O' or 'o', irrespective of case, with the numeral '0', and vice versa, in chart images. Also, the model has gaps in handling special characters, such as $, %, £, and the minus sign (−), and cannot handle superscript symbols, e.g., degree signs and exponents (Fig. 11c, iii). These shortcomings affect the accuracy of the extracted data scale (Fig. 11c, i). The inaccurate results in text recognition also manifest as errors in the textual summary of the source image. Some of these errors in text detection are shown in the reconstructed charts in Fig. 11d, i and d, iii. The accuracy of our chart summarization is limited by that of text recognition, as expected [16].

One of the critical drawbacks of our BarChartAnalyzer is the false positives in corner detection for relatively low-fidelity images, owing to aliasing and subsequent pixelation. Our method, in combination with more recent deep-learning methods for corner detection [3], may alleviate this issue. Also, our classification model cannot handle variants of bar charts with textured or hollow bars. Our workflow fails at chart canvas extraction for such exceptional test cases, e.g., the images shown in Fig. 12. Even though not widely practiced, bars can be annotated, e.g., by writing text or the bar value inside each bar (Fig. 12b), which BarChartAnalyzer cannot analyze. BarChartAnalyzer also fails for another test case where the data extraction process cannot interpret bar charts with divergent axes, e.g., in the presence of negative bars (Fig. 12c). Our text recognition model fails to identify the text written in the hand-drawn chart shown in Fig. 12d. With respect to corner detection, BarChartAnalyzer fails in a few cases owing to errors in human-guided annotation, critical point detection in tensor fields, and hyperparameter settings for DBSCAN clustering of corner points.

Improving chart summarization further requires an iterative process of feedback gathering through a user study in the deploy stage and refinement of the summary in the reflect stage of our proposed design study. Hence, a careful design of subsequent user studies is in the scope of our future work. Several of the state-of-the-art methods use the bilingual evaluation understudy (BLEU) [41] for validation of the chart summaries [3, 25]. In the absence of a ground truth or gold summary, we have not used BLEU evaluation in our work. Even though we applied BLEU with respect to the creative summaries of the users, BLEU underperformed owing to the variety of language styles in the creative summaries. An alternative that can be pursued in the future is to manually generate gold summaries for all chart images and to compute scores of the system-generated summaries with respect to the gold summaries based on content selection, relation generation, and content ordering metrics [42].

It is an interesting exercise to provide a confidence score upstream in our system on the success of automating data extraction for any input chart image. This requires running our system for a larger set of images from the different subtypes, and performing an in-depth analysis on the causality of the failures. This analysis is in the scope of future work of improving the usability of the BarChartAnalyzer system.

Conclusions

As a next step, subtype-based analysis can be extended to other chart types, such as scatter plots. Our workflow requires user interaction for tasks such as image annotation for canvas extraction and setting the hyperparameters of DBSCAN. In the scope of future work, the automation in our workflow can be increased by using additional CNN models for annotation. We currently use tensor field computation on the chart images, which can be made more robust to separate chart objects from the source image. As VGGNet has been widely used for object detection tasks, our goal is to improve our classifier to automate the canvas extraction step, thus making our end-to-end workflow completely automated. Super-resolution algorithms may be explored as an additional component in our algorithm to improve the accuracy of both OCR and object detection, especially for severely aliased images.

In summary, we propose a workflow, BarChartAnalyzer, that uses standard image processing techniques and deep-learning models to perform the critical task of chart image digitization and summarization for bar charts. BarChartAnalyzer is novel in handling seven different bar chart subtypes. Our contributions include the mapping between pixel space data and the data space using the text detection model. We propose a novel application of the design study methodology for chart summarization. The design study provides the predefined template with information from the chart type details, the perceivable visual features in the chart, and the data extracted from the chart image. The summarization achieved by our system has the potential to be used with a language processing module, such as gTTS in Python, to generate an audio summary of the given chart image for a visually impaired audience. Our text summary can be pruned to provide crisper summaries as per user inputs. As discussed, our workflow has limitations owing to its dependency on image fidelity, object size, the training dataset, the variety of chart images, etc. Overall, our work shows promising results in automatically interpreting chart images for bar charts and their subtypes, using mostly human-out-of-the-loop approaches.