1 Introduction

Our era is characterized by continuous advances in the biomedical sciences and a correspondingly large number of scientific publications each year. Analysis of literature topics and trends is increasingly employed to provide insight into past and future research directions. Several statistical algorithms have been applied to model topics in the scientific literature [1,2,3,4,5]. Because such methods require considerable mathematical and programming background, recent research proposes user-friendly integrated tools that enable researchers of various backgrounds to explore topic analysis [6,7,8]. However, currently available tools do not cover the entire topic and trend analysis workflow and require custom set-up. In this paper, we propose an open-source, platform-independent service to support topic modeling and trends analysis for the biomedical expert. The service allows the creation and description of biomedical literature corpora, supports the entire workflow of topic modeling and trends analysis, and provides visual navigation of the results.

2 Topic Modeling

Topic modeling [9] refers to a suite of algorithms that aim to uncover the hidden thematic structure of a collection of documents. Each document is characterized by a mixture of topics, each topic consists of a collection of words, and each word has its own statistical weight. Several topic modeling approaches are available [3,4,5,10], Latent Dirichlet Allocation (LDA) being one of the most popular. The algorithm starts by randomly assigning each word of a document to one of K topics. It then calculates conditional probabilities for each topic in each document (P(t|d), where t denotes the topic and d the document) and for each word in every topic (P(w|t), where w denotes the word). Through an iterative process, it reassigns words to topics until the assignments reach a steady state. The algorithm requires setting the initial number K of assumed topics and the parameters that define the Dirichlet priors for the per-document topic distribution (parameter α) and the per-topic word distribution (parameter β).
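For illustration, the following Java sketch shows the collapsed Gibbs sampling update that many LDA implementations use for this iterative reassignment step. The count arrays and variable names are assumptions introduced here for clarity; they are not part of the service's code.

```java
import java.util.Random;

/** Minimal sketch of one collapsed Gibbs sampling step for LDA (illustrative only). */
public class LdaGibbsStep {

    /**
     * Resamples the topic of a single word occurrence.
     * docTopicCounts[d][t]  : words in document d currently assigned to topic t
     * topicWordCounts[t][w] : occurrences of word w currently assigned to topic t
     * topicTotals[t]        : total number of words currently assigned to topic t
     */
    static int sampleTopic(int[][] docTopicCounts, int[][] topicWordCounts, int[] topicTotals,
                           int d, int w, int oldTopic,
                           double alpha, double beta, int vocabSize, Random rng) {
        int numTopics = topicTotals.length;

        // Remove the word's current assignment from the counts.
        docTopicCounts[d][oldTopic]--;
        topicWordCounts[oldTopic][w]--;
        topicTotals[oldTopic]--;

        // p(t) is proportional to (n[d][t] + alpha) * (n[t][w] + beta) / (n[t] + V * beta)
        double[] p = new double[numTopics];
        double sum = 0.0;
        for (int t = 0; t < numTopics; t++) {
            p[t] = (docTopicCounts[d][t] + alpha)
                 * (topicWordCounts[t][w] + beta) / (topicTotals[t] + vocabSize * beta);
            sum += p[t];
        }

        // Draw the new topic from the unnormalized distribution.
        double u = rng.nextDouble() * sum;
        int newTopic = numTopics - 1;
        for (int t = 0; t < numTopics; t++) {
            u -= p[t];
            if (u <= 0) { newTopic = t; break; }
        }

        // Add the word back under its new topic.
        docTopicCounts[d][newTopic]++;
        topicWordCounts[newTopic][w]++;
        topicTotals[newTopic]++;
        return newTopic;
    }
}
```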

Topic modeling has been successfully applied in many other research areas, for example to analyze and classify genomic sequences [11], to classify images based on visual-word topic models [12], to detect discussion themes in social networks [13], and to analyze source code [14]. Additionally, there are several implementations of topic modeling (and especially of LDA) in different programming languages [15,16,17,18]. In this paper, we integrate some of the existing implementations into an online service which provides added-value functionalities, including a user-friendly interface to visualize and label topics and tools to support trends analysis. The service also allows the generation of rich metadata for each step of the workflow, to fully document the topic modeling experiments.

3 Topic Modeling and Trends Analysis Service

An overview of the literature topic modeling and trends analysis workflow is presented in Fig. 1. The process starts with the generation of the initial literature corpus as a collection of relevant published papers; most often the collection is limited to paper titles and abstracts due to access restrictions. Following rudimentary text preprocessing, the topic modeling algorithm is parametrized and applied to identify topics, which are essentially word collections. Human intervention is required to label topics so that they are meaningful for human interpretation. Finally, the popularity of topics over time is assessed for trends analysis.

Fig. 1. The basic workflow, inputs and outputs of the platform

We have developed a user-friendly web-based environment to encapsulate this entire workflow of topic modeling and trends analysis and provide it as a service for the non-expert biomedical scientist. The input of the service is a corpus of research abstracts retrieved from PubMed as the result of a specific query. The system allows the user to describe each corpus via relevant metadata, including corpus generation date, initial database query, study aim and user details. Basic corpus statistics are also calculated, e.g. the number of publications per year (shown as a graph), the total number of articles, the number of articles with an abstract, and the minimum and maximum publication year.
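As an illustration, the following Java sketch computes this kind of summary from a list of article records; the CorpusArticle type and its field names are assumptions introduced here, not the service's actual data model.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical article record: only the fields needed for the summary statistics. */
record CorpusArticle(String pmid, int year, String title, String abstractText) {}

final class CorpusStats {

    /** Publications per year, e.g. {2015=120, 2016=154, ...} (basis of the per-year graph). */
    static Map<Integer, Long> publicationsPerYear(List<CorpusArticle> corpus) {
        Map<Integer, Long> perYear = new TreeMap<>();
        for (CorpusArticle a : corpus) {
            perYear.merge(a.year(), 1L, Long::sum);
        }
        return perYear;
    }

    static void printSummary(List<CorpusArticle> corpus) {
        long withAbstract = corpus.stream()
                .filter(a -> a.abstractText() != null && !a.abstractText().isBlank())
                .count();
        int minYear = corpus.stream().mapToInt(CorpusArticle::year).min().orElse(0);
        int maxYear = corpus.stream().mapToInt(CorpusArticle::year).max().orElse(0);

        System.out.println("Total articles:         " + corpus.size());
        System.out.println("Articles with abstract: " + withAbstract);
        System.out.println("Year range:             " + minYear + "-" + maxYear);
        System.out.println("Per-year counts:        " + publicationsPerYear(corpus));
    }
}
```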

Text preprocessing is routinely used to clean the corpus via: (1) removal of all punctuation and escape codes; (2) exclusion of stop-words; (3) conversion of all words to their lemmas by applying a stemming procedure; and (4) exclusion of articles with no words in their abstracts or fewer than 3 letters in their titles. The current service implementation uses the widely used Krovetz stemmer [19] as the default option for the stemming step. However, the service allows importing additional stemming algorithms [10]. The service provides basic preprocessing statistics and allows the user to generate metadata that richly describe preprocessed corpora for future reference.
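A minimal sketch of this preprocessing chain is shown below. The tiny stop-word list and the injected stemmer function are assumptions (the service itself defaults to the Krovetz stemmer), so the snippet should be read as an illustration of the four steps rather than the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Illustrative preprocessing of a single abstract: punctuation removal,
 *  stop-word exclusion and stemming (steps 1-3 of the pipeline). */
final class Preprocessor {

    // Tiny example stop-word list; a real deployment would load a full list.
    private static final Set<String> STOP_WORDS = Set.of("the", "of", "and", "in", "a", "is", "to");

    private final UnaryOperator<String> stemmer; // e.g. a Krovetz stemmer implementation

    Preprocessor(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    List<String> process(String abstractText) {
        // (1) drop punctuation and escape codes, keeping only letters and whitespace
        String cleaned = abstractText.toLowerCase().replaceAll("[^\\p{L}\\s]", " ");
        List<String> tokens = new ArrayList<>();
        for (String token : cleaned.split("\\s+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue; // (2) stop-words
            tokens.add(stemmer.apply(token));                            // (3) stemming
        }
        return tokens;
    }

    /** (4) Articles with an empty processed abstract or a very short title are excluded. */
    static boolean keepArticle(List<String> processedAbstract, String title) {
        return !processedAbstract.isEmpty() && title.replaceAll("\\s", "").length() >= 3;
    }
}
```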

The processed corpus can be archived and used as input to the topic modeling procedure, along with the necessary execution parameters. Currently, we support two different implementations of LDA, based on the Java libraries Mallet ParallelTopicModel [17] and jLDADMM [18], with input/output performance enhancements. The service architecture supports easy integration of other LDA implementations based on predefined public interface descriptions.
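For reference, a minimal example of running Mallet's ParallelTopicModel on an array of preprocessed abstracts is sketched below, following Mallet's published API. The parameter values are placeholders, and the snippet omits the service's input/output enhancements.

```java
import java.util.ArrayList;
import java.util.regex.Pattern;

import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.StringArrayIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

public class LdaRunner {

    public static ParallelTopicModel run(String[] abstracts) throws Exception {
        // Build a minimal Mallet import pipeline: tokenize and map tokens to features.
        ArrayList<Pipe> pipes = new ArrayList<>();
        pipes.add(new CharSequenceLowercase());
        pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
        pipes.add(new TokenSequence2FeatureSequence());

        InstanceList instances = new InstanceList(new SerialPipes(pipes));
        instances.addThruPipe(new StringArrayIterator(abstracts));

        // K = 50 topics, alphaSum = 1.0, beta = 0.01 (placeholder parameters).
        ParallelTopicModel model = new ParallelTopicModel(50, 1.0, 0.01);
        model.addInstances(instances);
        model.setNumThreads(2);
        model.setNumIterations(1000);
        model.estimate();
        return model;
    }
}
```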

Topic modeling experiments are resource and time consuming, and they often need to be repeated with different initialization parameters. The proposed service displays the current status of scheduled topic modeling experiments and supports powerful experimental lab-bookkeeping. As shown in Fig. 2, the user is guided to insert relevant metadata that describe each topic modeling experiment in detail. Metadata can be edited and updated, and they automatically inform the saved experimental results as well as the trends analyses and visualizations produced in the following steps of the workflow.
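The kind of information captured for each experiment could be modelled as a simple record; the field names below are assumptions for illustration only and do not reproduce the service's actual schema.

```java
import java.time.Instant;

/** Hypothetical metadata record for one topic modeling experiment (field names assumed). */
public record ExperimentMetadata(
        String experimentId,
        String corpusId,        // which preprocessed corpus was used
        String implementation,  // e.g. "Mallet ParallelTopicModel" or "jLDADMM"
        int numTopics,          // K
        double alpha,           // Dirichlet prior for the per-document topic distribution
        double beta,            // Dirichlet prior for the per-topic word distribution
        int iterations,
        String createdBy,
        Instant createdAt,
        String notes) {}
```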

Fig. 2. Metadata table for topic modeling

Another important added-value feature of the service is the ability for the user to label each topic. The procedure of assigning labels to topics is shown in Fig. 3. For every topic generated by the execution of the algorithm, the top words that describe the topic (their number indicated by the user) are ranked by statistical weight and displayed in tabular form or as word clouds. The user can then assign a title to each topic and create nested categories to organize the topics.
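Extracting the ranked top words for display can be done directly from the trained model. The sketch below follows Mallet's documented API for a ParallelTopicModel (as produced in the previous example); the number of words to show is a user-supplied parameter.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.TreeSet;

import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.Alphabet;
import cc.mallet.types.IDSorter;
import cc.mallet.types.InstanceList;

final class TopicWords {

    /** Print the top `numWords` words of every topic, ranked by statistical weight. */
    static void printTopWords(ParallelTopicModel model, InstanceList instances, int numWords) {
        Alphabet alphabet = instances.getDataAlphabet();
        ArrayList<TreeSet<IDSorter>> sortedWords = model.getSortedWords();

        for (int topic = 0; topic < sortedWords.size(); topic++) {
            StringBuilder line = new StringBuilder("Topic " + topic + ": ");
            Iterator<IDSorter> it = sortedWords.get(topic).iterator();
            for (int rank = 0; rank < numWords && it.hasNext(); rank++) {
                IDSorter ws = it.next();
                line.append(alphabet.lookupObject(ws.getID()))
                    .append(" (").append((int) ws.getWeight()).append(") ");
            }
            System.out.println(line);
        }
    }
}
```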

Fig. 3. Assigning labels and categories to topics

The final step in the workflow involves trends analysis on the identified topics. The popularity P(t, y) of topic t in year y is calculated as the mean weight of this topic over all documents published in that year (Dy):

$$ P(t,y) = \frac{1}{|D_y|}\sum\nolimits_{d \in D_y} \frac{\left|\{\, w \in d : \operatorname{topic}(w) = t \,\}\right|}{|d|} $$
(1)

where t represents a topic and w is a word in document d of the document collection Dy for year y [20]. Calculated trends are then displayed as graphs. The service supports rich visualizations, which allow comparative displays of different topics, categories and corpora, while preserving the metadata describing the experiments whose results are compared. An example is shown in Fig. 4. The user can generate graphs on the fly for any group of selected topics or categories and compare trends for a chosen time range.
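A direct implementation of Eq. (1) is sketched below; it assumes that, for every document, the publication year and the per-word topic assignments are available from the trained model (the DocumentTopics type and variable names are placeholders).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-document view of a topic modeling result. */
record DocumentTopics(int year, int[] wordTopicAssignments) {}

final class TrendAnalysis {

    /** Popularity P(t, y) of topic `topic` for each year, following Eq. (1). */
    static Map<Integer, Double> popularityPerYear(List<DocumentTopics> docs, int topic) {
        Map<Integer, Double> sumPerYear = new TreeMap<>();
        Map<Integer, Integer> docsPerYear = new HashMap<>();

        for (DocumentTopics doc : docs) {
            int[] assignments = doc.wordTopicAssignments();
            if (assignments.length == 0) continue;
            long count = 0;
            for (int t : assignments) {
                if (t == topic) count++;                         // |{w in d : topic(w) = t}|
            }
            double share = (double) count / assignments.length;  // ... divided by |d|
            sumPerYear.merge(doc.year(), share, Double::sum);
            docsPerYear.merge(doc.year(), 1, Integer::sum);
        }

        // Divide each yearly sum by |D_y| to obtain the mean per year.
        sumPerYear.replaceAll((year, sum) -> sum / docsPerYear.get(year));
        return sumPerYear;
    }
}
```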

Fig. 4. Categories graph. The user selects the result and the categories to compare; the topics of each category appear after the user checks them

The system is implemented in NodeJS with the LoopBack framework (http://loopback.io) and is accessible at https://trends.duth.carre-project.eu/. Data storage is based on MongoDB (https://www.mongodb.org). The frontend is powered by the AngularJS framework (https://angularjs.org), and the graph visualizations are implemented using the Chart.js (http://www.chartjs.org/) and Vis.js (http://visjs.org) libraries. In the backend, we developed a mechanism for managing parallel processes that may be requested by the same user or by different users. For this purpose, we adopted a first-in, first-out (FIFO) execution policy and a limit on the number of processes (e.g. three) that may run simultaneously. This is required because the system has limited computing resources, while topic modeling algorithms are computationally expensive.
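Conceptually, this scheduling policy corresponds to a bounded worker pool over a FIFO queue. The Java sketch below illustrates the idea; the service itself implements this in NodeJS, so the snippet is an analogy rather than the actual backend code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative FIFO scheduler: at most three topic modeling jobs run at the same time;
 *  remaining requests wait in submission order. */
public class JobQueue {

    private static final int MAX_PARALLEL_JOBS = 3;

    // A fixed thread pool queues excess tasks in FIFO order until a slot is free.
    private final ExecutorService pool = Executors.newFixedThreadPool(MAX_PARALLEL_JOBS);

    /** Submit a topic modeling run; it starts as soon as one of the three slots is free. */
    public void submit(Runnable topicModelingJob) {
        pool.submit(topicModelingJob);
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```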

4 Discussion

This paper proposes a web-based service that allows biomedical researchers with no experience in data modelling or programming to execute topic modeling and trends analysis experiments on biomedical literature corpora, keep track of experimental details and visualize the results. Work in progress includes eliminating remaining bugs, supporting more topic modeling algorithms through an easy mechanism for adding new implementations, and developing a mechanism that runs a batch of processes with different parameters in order to select the most appropriate ones (e.g. the number of topics). Additionally, we plan to evaluate the system with respect to performance and user satisfaction.