
1 Introduction

Data is ubiquitous, and so are tools supporting data analysis. More often than not, different tools are used for different stages of the analysis. For example, the preprocessing and exploration of the data are handled in one tool and one programming language, while the analysis is done in a completely different environment, with scripts spread across files stored locally. This interferes with the transparency, shareability, and reproducibility of the research. SWISH DataLab is a Wiki-like collaborative web environment that combines processing, exploration, and analysis of data, and supports transparency of the choices made. It blends the clarity of Prolog with the statistical computing power of R. Data cleaning and the creation of concepts and abstractions over the data benefit from the elegance of Prolog's rule-based logic programming paradigm; statistical analysis and visualization are the strength of R. The combination of logic programming and R has been demonstrated in [1] and applied in the context of biomedical research, e.g., [2], and in sentiment analysis of social media, e.g., [3]. SWISH DataLab integrates SWI-Prolog and R into a single computational environment accessible through a shared web interface.

Using SWISH DataLab, it is easy to try out different data abstractions on a sample and to evaluate their impact on the results. SWISH DataLab is currently being developed as a collaborative environment for responsible data science using data from the National Library of the Netherlands. In this case study we combine six months of search logs with the content metadata of the historical newspaper collection. We explore and process the datasets using the Prolog programming language, and use R for the statistical analysis of the data.

2 SWISH

SWISH DataLab is an instantiation of SWISH (SWI-Prolog for SHaring), geared to data analysis. In SWISH, a web interface gives access to the computational environment. The development of SWISH [4, 5] is influenced by Jupyter and JSFiddle. With JSFiddle it shares the model of a server where people can save and share programs and documents. From Jupyter it takes the concept of notebooks: a mixture of text and program fragments that can be edited in the browser [6].

Fig. 1. The interface of SWISH DataLab, showing a Prolog notebook with some queries on the left and a query window on the right.

In SWISH, programs can be executed from the browser, and the results appear in the browser as plain answers or rendered as tables, graphs, or charts (Fig. 1). The core language of SWISH is SWI-Prolog. Unlike Jupyter, it does not support other languages directly; instead, it allows the Prolog core to be extended by binding it to other systems. On the backend, it can be connected to external data sources using, e.g., ODBC or SPARQL, and its computational capabilities can be extended using, e.g., R. The complementary power of combining logic programming and R has been demonstrated in [1]. The frontend can be enhanced with server-side rendering support from, e.g., R or Graphviz, as well as client-side visualization libraries such as D3.js or C3.js.
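As an illustration, a minimal sketch of a program pane that enables SWISH's table rendering plugin; the population/2 facts are purely illustrative:

% Enable the table rendering plugin for this program.
:- use_rendering(table).

% Purely illustrative facts.
population(amsterdam, 860000).
population(rotterdam, 650000).
population(utrecht,   360000).

% Collecting the solutions as terms of equal arity lets SWISH
% display the answer as a table rather than a plain list:
% ?- findall(row(City, Inhabitants), population(City, Inhabitants), Rows).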

Fig. 2. An R expression in a Prolog background program in SWISH DataLab.

2.1 SWISH and the R Programming Language

R is made available to Prolog by means of the Rserve package. This implies that every SWISH user has a private instance of R, providing both isolation and concurrency.

R can be accessed from Prolog using two distinct mechanisms. The predicate Result <- Expression uses Prolog syntax to represent R expressions (Fig. 2). The match is close, but not exact. For example, where ‘10.’ is a valid R floating point number, Prolog requires it to be written as ‘10.0’, and R identifiers that start with a capital letter need single quotes to avoid being misread as Prolog variables. Complex R objects such as functions cannot be expressed using the Prolog syntax. This problem is resolved using quasi quotations [7]. Quasi quotations allow for embedding external languages verbatim, while interpolating values from Prolog. The basic syntax is {|lang(param...)||code|}. For example, we can write ?- {|r||plot(c(1,2,3))|}. to produce a simple R plot. We can combine this with Prolog as shown below.

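As an illustration, a minimal sketch of such a combination (plot_squares/1 is a hypothetical predicate; an R variable is assigned with the Prolog syntax and then reused verbatim inside the quasi-quoted R code):

plot_squares(N) :-
    xs <- seq(1, N),                   % Prolog-syntax R call: assign the vector 1..N to xs
    {|r||plot(xs, xs^2, type = "b")|}. % quasi-quoted R code reusing xs verbatim

% ?- plot_squares(10).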

Quasi quotations allow for reusing long snippets of R code verbatim, while the Prolog syntax is more natural for relatively simple R calls and allows for building R calls dynamically.

3 Case Study: Analysis of Online User Search Behavior

The goal of this use case is to understand user search behavior. Under a strict confidentiality agreement, we have received six months of server logs from the Delpher full-text search platform, spanning the period from October 2015 to March 2016. This search platform provides access to combined collections from the National Library of the Netherlands and other national heritage and research institutes. These collections, as is the case for other digital libraries and archives, are characterized by bibliographic data describing the content (e.g., publication date, type of document, origin of document). These metadata values are reflected in the search interface as facets that help filter the results (Fig. 3). Over 90% of user requests access the historical newspaper collection. For this reason we focus on this collection, which contains over 100 million documents spanning four centuries. Using SWISH DataLab, we have linked the content metadata values to the clicked and downloaded documents, enabling a comparison between facet use in search and the metadata of clicks and downloads.
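A minimal sketch of the linking step; click_event/2 and newspaper_meta/3 are hypothetical predicates standing in for the parsed logs (a click or download of a document within a session) and the content metadata of a document:

% Join clicked/downloaded documents with their content metadata.
click_metadata(Session, DocId, Field, Value) :-
    click_event(Session, DocId),
    newspaper_meta(DocId, Field, Value).

% Facet use in queries can then be compared against the metadata of the
% documents that were actually clicked or downloaded, e.g.:
% ?- aggregate_all(count, click_metadata(_, _, document_type, advertisement), N).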

Fig. 3. Search interface of the Delpher platform, with facets on the left and results on the right.

3.1 Iterative Data Exploration Through Abstraction

Constructing a vocabulary. The goal of describing user search behavior has motivated an exploration of the server logs based on user interactions within sessions. Sessions are defined by IP address and a 30-minute timeout. To recognize usage patterns, we visualize these sessions as graphs, using Graphviz for the rendering (see Fig. 4). As an added benefit, the graphs help to conceptualize the data.
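A minimal sketch of this session split, assuming a hypothetical representation in which the request timestamps of a single IP address (in seconds) are already collected in a time-sorted list; the list is cut wherever the gap between consecutive requests exceeds the 30-minute timeout:

% split_sessions(+Timestamps, -Sessions): partition a sorted list of
% timestamps into sessions separated by gaps of more than 1800 seconds.
split_sessions([], []).
split_sessions([T|Ts], [[T|Same]|Sessions]) :-
    take_session(T, Ts, Same, Rest),
    split_sessions(Rest, Sessions).

take_session(Prev, [T|Ts], [T|Same], Rest) :-
    T - Prev =< 1800, !,            % within 30 minutes of the previous request
    take_session(T, Ts, Same, Rest).
take_session(_, Rest, [], Rest).

% ?- split_sessions([0, 600, 1200, 8000, 8100], S).
% S = [[0, 600, 1200], [8000, 8100]].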

Visualization of abstractions. The graphs make visible that users often revisit the same search engine results page within a session, which led to a Prolog rule that abstracts these revisits into a single node with multiple incoming arrows. This abstraction produces a different count of the number of search interactions a user has engaged in: revisiting the same results page within a session is no longer counted as a new search (Fig. 5).
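A minimal sketch of the counting side of this abstraction, assuming a hypothetical representation where a session is a list of page identifiers in visit order; results_page/1 is a hypothetical classifier that, on the real data, would inspect the request URL:

% A hypothetical classifier for search-engine result pages.
results_page(results(_Query)).

% distinct_searches(+Session, -Searches): keep only result pages and drop
% revisits of a page already seen earlier in the session.
distinct_searches(Session, Searches) :-
    include(results_page, Session, Pages),
    list_to_set(Pages, Searches).

% ?- distinct_searches([results(q1), doc(a), results(q1), results(q2)], S).
% S = [results(q1), results(q2)].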

Removing reloads. We also remove immediately repeated visits to the same web page, as these are most likely browser reloads rather than new user interactions. This results in a more realistic count of the number of clicks and a clearer definition of the dwell time on a document, which we measure from the first load of the page until the next interaction rather than as separate shorter dwell times.
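A minimal sketch of the reload removal, under the same hypothetical list-of-pages representation of a session; immediate repetitions of the same page are collapsed into a single visit:

% drop_reloads(+Visits, -Cleaned): collapse consecutive duplicate page visits.
drop_reloads([], []).
drop_reloads([P, P|Rest], Cleaned) :- !,
    drop_reloads([P|Rest], Cleaned).
drop_reloads([P|Rest], [P|Cleaned]) :-
    drop_reloads(Rest, Cleaned).

% ?- drop_reloads([results, doc1, doc1, doc2], V).
% V = [results, doc1, doc2].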

Fig. 4. Visualizing user sessions as graphs. Query code on the left, session graph on the right.

Fig. 5. Side-by-side analysis of facet use in search interactions, comparing a count of all search interactions with one that does not recount revisits within a session.

Return to any previous state. The incrementally written rules to clean the data and visualize the graphs can be fully traced in the web environment, where all intermediate versions of both data and code are saved and can be retrieved. Being able to return to an exact previous state improves the reproducibility of the research, making it possible to save the state of the project at the time of publication and to rerun the exact steps on the same (or possibly updated) data.

Evaluate concepts and abstractions. We can inspect the statistical effect of different symbolic definitions of the data, as shown in Fig. 5, where two abstractions over the data are set side by side in a single notebook. The environment supports this type of transparency in code and results, making it possible to measure the impact of these abstractions on the results.
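A minimal sketch of how such a comparison could be expressed, assuming two hypothetical predicates searches_raw/2 and searches_distinct/2 that count the search interactions per session under the two definitions, and the Prolog-to-R bridge described in Sect. 2.1:

% Compare the two abstractions by handing both per-session counts to R.
compare_abstractions :-
    findall(N, searches_raw(_, N), Raw),
    findall(N, searches_distinct(_, N), Distinct),
    raw <- Raw,                          % assign both count lists to R variables
    distinct <- Distinct,
    {|r||boxplot(list(raw = raw, distinct = distinct),
                 ylab = "search interactions per session")|}.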

4 Conclusions and Future Work

With SWISH DataLab we are moving closer to a single environment for responsible data science that is shared between researchers. The use of visualizations, in combination with concepts and abstractions defined as rules, makes data cleaning more transparent and more thorough, and makes patterns in user interactions easier to discern.

Future work on the use case will include a better definition of sessions based on the graphs, and the application of machine learning (using, for example, the machine learning algorithms available in R) to predict user behavior.

SWISH DataLab will be extended with facilities that improve collaboration, such as change notifications, shared editing, and a commenting/chat service. We also plan to support generating a permanent link that captures a result (e.g., a chart or table) together with all programs and data needed to reproduce that result reliably.