1 Introduction

Data visualization is the effective presentation of information and draws on a multidisciplinary approach to communication. Its main goal is to communicate quantitative and qualitative information to a user clearly and effectively through graphical means, which can be static, animated, or interactive [1, 2].

Selecting a color scheme is also very important in the data visualization process. It allows the designer to set the tone of the visualizations and to keep the representation consistent [1].

Data visualization has been used to tackle challenges in many disciplines, such as economics, medicine, and education. As eHealth is a topical subject that is extremely important to all practitioners, we highlight in this paper the importance of data visualization in this area. Indeed, many approaches have received attention to date for helping people make sense of health documents and better understand their own illnesses and health conditions in order to manage them more effectively [3, 4].

Clinical researchers are confronted today with huge and complex collections of patient records, which they must study to ensure quality control and to discover new diseases [5].

The same applies to the general public, as people are becoming more aware of their own health. They need to understand their own diagnoses to improve and manage their health and to communicate better with their doctors.

The remainder of this paper is organized as follows. Section 2 reviews earlier work related to data visualization. Section 3 presents our proposed data visualization system based on LSM (Least Squares Method), a statistical method used in machine learning, which we apply to find semantic relationships among a set of terms and not only between a pair of terms. Section 4 gives the obtained results, followed by conclusions and future work.

2 Medical Visualization Systems

In this section, we review the state of the art of data visualization systems that help users (patients and clinical researchers, for example) understand personal health information. In eHealth, information visualization is part of a broader visualization field that incorporates both information visualization and scientific visualization, which are defined separately in the literature and considered distinct [3].

Scientific visualizations represent scientific concepts such as molecules, parts of the human body, or natural phenomena, mainly in 3D [6]. Their goal is essentially to confirm or reject a particular hypothesis. Information visualization, in contrast, represents abstract concepts or terms. The author of [7] classifies this kind of visualization as exploratory analysis, because its goal is to help users find a hypothesis. Its power derives from its ability to represent a large amount of information at once, including internal relations.

In our case, we focus on information visualization. The literature contains several major studies in the medical field that investigate this kind of visualization [3].

In [8], the authors proposed AsbruView, a visualization tool developed to assist in handling treatment plans in Asbru. AsbruView relies on a graphical metaphor in which plans are represented as a running track that the physician ‘runs’ along while treating the patient [3].

LifeLines, proposed in [9], was one of the first tools used to represent electronic health records. It was originally developed as a general-purpose visualization tool for personal histories and was then applied to the visualization of patient records. Our aim in this paper is to develop a visualization tool, integrated into a content based retrieval system, that presents the semantic relationships between medical terms appearing in patient records rather than only the patient record data. The goal is to help users make sense of the returned documents when they search with a content based retrieval system.

3 Proposed System

In this section, we present our proposed content based retrieval system, which comprises two main steps. The first step is a local automatic analysis of documents that defines semantic relationships between the query terms and the terms of the top m returned documents deemed relevant. The second step visualizes these relations in a graph to help users make sense of the returned documents. The process of our search system is illustrated in Fig. 1.

Fig. 1 Proposed content based retrieval system

3.1 Semantic Relationships

Measuring similarity and semantic relationships between terms in a set of documents has become a primary task in natural language processing (NLP), where it plays an important role in improving and interpreting search results [10]. It is the backbone of several applications, such as query expansion, disambiguation, and automatic thesaurus construction [11].

Previous approaches to this problem can be classified into three main categories [12,13,14]: those based on semantic knowledge (such as ontologies and data dictionaries), content-based approaches that analyze documents using general statistical methods [15, 16], and hybrid approaches that combine the first two categories.

In our case, we adopt statistical methods to define semantic relationships between document terms. This choice is justified, first, by the independence of this process from the language used and, second, by its ability to define relations among a set of terms and not only between a pair of terms. We apply the Least Squares Method (LSM) to text analysis [17, 18]; this method is often used to approximate the relationships that may exist between many variables [17, 19,20,21,22]. LSM underlies linear regression, one of the most widely used predictive models in machine learning, which is itself a particular approach to artificial intelligence [23].

Machine learning is a data analysis method that automates the building of analytical models. Its main idea is to create algorithms that receive input data and use statistical analysis to predict an output value.

It is an approach to artificial intelligence based on the idea that systems can learn from data and make decisions with little user intervention.

LSM tries to find the relation that may exist between an explained variable (y) and explanatory variables (x). In our case, we take a term j as the explained variable and the remaining terms in a set of documents as the explanatory variables (terms 1, ..., n, excluding j). The goal is to find the relation between these variables as follows:

$$\displaystyle \begin{aligned} term_{j} \approx \sum_{i=1}^{j-1}(\alpha_{i}term_{i})+ \sum_{i=j+1}^{n}(\alpha_{i}term_{i})+\epsilon \end{aligned} $$
(1)

where the \(\alpha_{i}\) are the real coefficients of the regression model, interpreted as the weights of the relationships between terms, and \(\epsilon\) is the associated error.

Explanatory variables are defined from the top m returned documents that meet the user’s needs. As a result, each variable, which is a term in our case, has m measures: its tf-idf weights in those documents. To reduce the computational complexity, we take as explained variables term j only the distinctive terms of the user’s query (this step is illustrated in Fig. 2).
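Under the usual least-squares reading of Eq. (1), the coefficients are chosen to minimize the sum of squared errors over the m documents. As a sketch, writing \(w_{i,d}\) for the tf-idf weight of term i in document d (a symbol introduced here for illustration; it is not part of the original notation), the criterion would be:

$$\displaystyle \begin{aligned} \min_{\alpha}\; \sum_{d=1}^{m}\Big(w_{j,d} - \sum_{i=1,\, i\neq j}^{n}\alpha_{i}\, w_{i,d}\Big)^{2} \end{aligned} $$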

Fig. 2 Process of defining semantic relationships

For example, when a user sends a query with three terms \((t_{i}, t_{k}, t_{l})\), our content based retrieval system retrieves the top m documents, which are then represented as a matrix. We obtain a (terms × documents) matrix of size (n × m), where n is the number of terms in the set of returned documents.
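The matrix representation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `tfidf_matrix`, whitespace tokenization, and the particular tf-idf variant (relative term frequency times the log of inverse document frequency) are all assumptions. Rows are documents and columns are terms, as in the definition of X given with Eq. (2); transposing the result yields the (terms × documents) layout.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, X) where X[d][t] is the tf-idf weight of term t in document d.

    Assumed weighting: (count / document length) * log(m / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]   # naive tokenization (assumption)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    m = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    X = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        X.append([
            (counts[t] / total) * math.log(m / df[t]) if total else 0.0
            for t in vocab
        ])
    return vocab, X

# Toy stand-in for the top m returned documents.
vocab, X = tfidf_matrix(["swollen feet", "swollen ankle", "peptic ulcer"])
```

In a real system the weights would more likely come from the retrieval engine's index than from raw strings.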

For each query term that occurs in the term set of the returned documents, we calculate its relationship weight vector \(A_{term_{j}}=(\alpha _{1},\cdots ,\alpha _{n})\) with the other terms. The Least Squares Method gives an approximate solution for this vector:

$$\displaystyle \begin{aligned} A_{term_{j}}={(X^{j T}\times X^{j})}^{-1}\times X^{T}[.,j]\times T_{j} \end{aligned} $$
(2)

where:

  • \(T_{j}\) is the TF-IDF weight vector of term j over the m returned documents.

  • X is the TF-IDF matrix whose columns represent the keyword set and whose rows represent the m returned documents.

  • \(X^{j}\) is obtained by removing the column of the term \(t_{j}\) from the matrix X.

  • \(X^{T}[j, .]\) represents the transpose of the weight vector of the term \(t_{j}\) in all documents.
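As a hedged sketch of this step, the snippet below computes the weight vector with the textbook least-squares solution \((X^{jT}X^{j})^{-1}X^{jT}T_{j}\). This is one plausible reading of Eq. (2), not the authors' code. The function name `relation_weights` is hypothetical, and NumPy's `numpy.linalg.lstsq` is used instead of an explicit matrix inverse because it also returns a minimum-norm solution when \(X^{jT}X^{j}\) is singular, which is likely whenever there are more terms than documents.

```python
import numpy as np

def relation_weights(X, j):
    """Least-squares weights relating term j (column j of X) to all other terms.

    X is an (m documents x n terms) tf-idf matrix; the result has n - 1 entries,
    ordered like the columns of X with column j removed.
    """
    X = np.asarray(X, dtype=float)
    T_j = X[:, j]                    # tf-idf vector of term j over the m documents
    X_j = np.delete(X, j, axis=1)    # X with the column of term j removed
    # Minimizes ||X_j @ a - T_j||^2; equals (X_j^T X_j)^{-1} X_j^T T_j when invertible.
    a, *_ = np.linalg.lstsq(X_j, T_j, rcond=None)
    return a

# Toy matrix in which column 2 is exactly 2 * column 0 + 3 * column 1.
toy = [[1, 0, 2], [0, 1, 3], [1, 1, 5], [2, 1, 7]]
weights = relation_weights(toy, 2)
```

On the toy matrix the recovered weights should be close to 2 and 3, since the third column was built as that linear combination of the first two.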

At the end of this process, we obtain a terms-by-terms matrix (see Fig. 2), with query terms as rows and the terms of the top m returned documents as columns, which contains the relation values found for each query term with the remaining terms.

Once the relations are defined, we study them in order to build the graph of semantic relations with the most related terms (for example, terms that have positive relations with the query terms).
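One simple way to go from relation values to graph edges is to keep, for each query term, only the positively weighted terms and rank them by weight. The sketch below assumes this rule; the function name `positive_edges` and the `top_k` cut-off are illustrative choices, not parameters given in the text.

```python
def positive_edges(query_term, terms, weights, top_k=5):
    """Return up to top_k (query_term, term, weight) edges with positive weight,
    strongest relation first."""
    pairs = [(t, w) for t, w in zip(terms, weights) if w > 0]
    pairs.sort(key=lambda p: p[1], reverse=True)
    return [(query_term, t, w) for t, w in pairs[:top_k]]

# Hypothetical relation values for one query term.
edges = positive_edges("swollen", ["podiatri", "garment", "car"], [0.3, 0.8, -0.2], top_k=2)
```

Here the negatively weighted term is dropped and the remaining edges come back ordered from strongest to weakest.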

3.2 Data Visualization

The semantic relationships defined in the previous section do not by themselves make the similarity between terms easy to interpret. It is thus preferable to have a comprehensive view of these relationships in order to assimilate them better.

As visualization plays a very important part in interpreting results, we propose to visualize the defined relations in a graph, which is generated after the search process for each user query.

The generated graph comprises a set of nodes representing terms and a set of edges representing the semantic relationships between them. We color the semantic relationships defined for each query term with a different color, because scanning colors visually takes less time and effort.

To emphasize the strength of the defined relations, we make the color intensity and the thickness of the arcs proportional to the similarity value: the stronger the relationship between two terms, the thicker the arc. To help users interpret results, we use the search option of the Prefuse toolkit, which lets users easily find a term anywhere in the graph (for example, the term coronari in Fig. 3).
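The proportional mapping from similarity value to arc appearance can be sketched independently of any drawing toolkit. In the snippet below, the function name `edge_style`, the width range, and the opacity range are assumptions chosen for illustration; the text only states that thickness and color intensity grow with the similarity value.

```python
def edge_style(weight, max_weight, min_width=1.0, max_width=6.0):
    """Map a relation weight to (line width, color opacity), both growing linearly
    with weight / max_weight, clamped to the range [0, 1]."""
    ratio = max(0.0, min(1.0, weight / max_weight)) if max_weight > 0 else 0.0
    width = min_width + ratio * (max_width - min_width)
    alpha = 0.2 + 0.8 * ratio   # keep weak edges faintly visible (assumption)
    return width, alpha

# The strongest edge gets the full width and opacity; weaker ones scale down.
strong = edge_style(0.8, max_weight=0.8)
weak = edge_style(0.2, max_weight=0.8)
```

The resulting values would then be passed to whatever renderer draws the graph, with one base color per query term.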

Fig. 3 Graph of Semantic Relationships: query 53 in 2015

4 Results

To assess the performance of our proposed method, we set up an experimental procedure. The evaluation is performed on a large collection of documents provided by the CLEF evaluation campaign for the two successive years 2014 and 2015 [24,25,26,27].

4.1 Document Collection

The document collection is composed of medical documents covering a wide range of medical topics. It contains around one million documents provided by the Khresmoi project [24,25,26,27], drawn from different online sources such as well-known databases and medical sites (e.g. ClinicalTrials.gov, Genetics Home Reference, and health-certified websites).

The 2014 test set comprises 50 professional medical queries provided by experts (clinical researchers, for example). These queries present different cases of patient diseases. The 2015 test set has 67 circumlocutory queries of the kind patients issue when they are faced with the symptoms and signs of a medical condition.

4.2 LSM Results

Table 1 shows the semantic relationships defined for some queries. Terms written in bold are the original query terms; the other terms are the terms most related to the original query, in their root form. We notice that the selected terms usually express the same context as the original query, which supports the effectiveness of our proposed method.

Table 1 Semantic relationships examples

For example, we take query number 53 in 2015 and explain some of its related terms:

  • podiatri: a podiatrist is a health professional who diagnoses and treats disorders of the feet.

  • garment: a pneumatic antishock garment is an inflatable garment used to combat shock, stabilize fractures, promote hemostasis, and increase peripheral vascular resistance.

  • suzi: Extra Roomy Shoes that are cleverly designed to look slim-line but have lots of room for swollen feet.

  • lipodermatosclerosi: lipodermatosclerosis is a skin and connective tissue disease.

As another example, for query 35 in 2014, the related terms are strongly related to the original query, which concerns peptic ulcer. It is a disease that has long been considered chronic, defined anatomically by a loss of parietal substance not exceeding the submucosa.

4.3 Interactive Data Visualization

Interactive data visualization plays an important role in the sense-making process. Its importance grows in the eHealth field, a current topic that draws the attention of people who are today more aware of, and take greater responsibility for, their health.

Figs. 3, 4 and 5 illustrate the semantic relationships defined by the statistical method LSM. As already mentioned, each illustration is a graph whose nodes represent terms and whose arcs represent the relations between these terms.

These graphs are generated automatically after each user query. They help users understand the general context of the returned documents and later expand their knowledge of the subject by studying the presented relationships. This study is easy for any user thanks to the colors used and the varying edge thickness: if the edge between two terms is thick, the user can infer that the relation between them is strong.

For example, in Fig. 3, the edge between the terms (swollen, garment) is very thick, so we conclude that there is a strong relationship between them.

In Fig. 4, we notice several relations with the term peptic that can expand the user’s knowledge.

Fig. 4 Graph of Semantic Relationships: query 35 in 2014

When the graph is rich, with many semantic links between terms, the search option provided by the Prefuse tool lets users easily find a specific term. In the example shown in Fig. 5, the user looks for the word coronari; our visualization tool searches for this term and, if it is found, highlights it with a pink oval.

Fig. 5 Graph of Semantic Relationships: query 1 in 2014

5 Conclusion and Future Works

The goal of this work is to help users make sense of results when searching medical documents. Our idea is to return, together with the search results, a graphical representation that illustrates the semantic relationships that may exist between the terms of the user’s query and the terms of the returned documents assumed to be relevant, instead of returning only those documents.

The relationships illustrated in the graphs are defined by the statistical method LSM, which underlies one of the most widely used predictive models in machine learning, itself a particular approach to artificial intelligence.

The semantic relationships and graphs obtained indicate the usefulness of LSM, which gives meaningful results that help users explore the medical field further and improve their queries by adding appropriate terms.

In future work, we will first ask users to express their needs (queries) based on their interpretation of the provided visualization graph. We will then study the impact of this query reformulation process on our content based retrieval system.