1 Introduction

Discovering knowledge through event databases is a challenge that has been pursued for decades. The first motivation seems to originate from politics: in the 1960s, McClelland (1961) tried to analyze political interactions at a macro-level. However, several years were needed for this initiative to reach a mature form, the codebook of WEIS (McClelland 1976). WEIS was perhaps the first systematic approach to structuring event data, allowing a departure from the idiographic study of political events towards the use of quantitative methods. The importance of having a systematic method to listen to and record political events was so great that different initiatives (including projects funded by the Defense Advanced Research Projects Agency and the National Science Foundation) produced systems like KEDS (Gerner et al. 1994), CAMEO (Gerner et al. 2002), ICEWS (O’Brien 2010), and, recently, GDELT (Leetaru and Schrodt 2013). The long-sought goal of all these efforts is the discovery of regularities in the occurrence of events.

The aim of this work is to enrich the toolbox of analysts who look for such regularities by enabling a process perspective on event data. Aiming at rich descriptions, we provide a guide on reviewing event datasets, empowering knowledge discovery techniques, and facilitating the analysis with relevant visualizations. We should note, of course, that by “process perspective” we refer to the structural aspects, since elements of the functional context of business process management (e.g., redesign, execution) are not applicable to event analytics (at least not to social/political event analytics).

Discovering knowledge through event databases is becoming ever more complex, and interesting, since the primary obstacle to acquiring such data (manual collection and coding procedures) has largely been removed. Currently, virtually every news article that can be spotted by web crawlers is machine-coded and registered. Therefore, the challenge has shifted to the analysis side. Lately, many event analytics techniques with appealing visualizations have been developed (see Sect. 2 for a brief overview). However, these (rich) techniques seem to focus uniformly on the pure sequentiality of event occurrences. The dominant ways to portray regularities are Frequent Sequence (or Pattern) Mining applied to sequences of temporally ordered events (Gotz et al. 2014), and flow diagrams (Liu et al. 2017), like Sankey plots, which in the case of small data (i.e., few events per case and a limited event alphabet) produce alluring illustrations.

Nevertheless, real-world news events often do not follow the linear movement implied by a sequence; they have varying structures, and these may reflect elements of a larger process with many descriptive components (Peuquet et al. 2015). In this work, the ambition is to enable the observation of events through a process perspective. We advocate that such an activation will allow analysts to consider questions common in process diagnostics (Bose and van der Aalst 2012) for event analysis. In particular, we expect to be able to get answers to questions like: (1) Are there any paths, of any size, that reveal some kind of frequent interaction patterns? (2) Can we manifest decision points, or deviating behaviors, by introducing gateway semantics (flow splitting and joining points) on events’ occurrences? (3) Does the thematic characterization of events alter the interaction patterns? (4) Do the involved actors interact according to any patterns? (5) Can we recognize any properties of the events (or actors), e.g., milestone events, or hub or authority events or actors?

The key components of our approach are quite straightforward: since event data standards prescribe similar formats (which include a source actor, a target actor, an event, and a timestamp), it is clear that any approach with the slightest aspiration to make an impact must exploit this standardized format. Therefore, we propose transforming the original format into various options, all of which enable event datasets to be loaded into Process Mining (van der Aalst 2016) tools. Process mining techniques will then allow process model discovery. These (automatically discovered) process models can be effectively illustrated and conceivably deliver the contributions we claimed in the previous paragraph. In addition, a plethora of decision support methods (already proposed for original process mining applications) can be leveraged for event analytics.

We should note that our approach does not intend to deliver a theory or a general model, but rather a methodology and a set of recommendations on how to get richer insights from event analysis. The epistemological considerations of such an approach are discussed in Sect. 3, in tandem with the methodological approach and the presentation of the case study that we use to illustrate our proposal. In Sect. 2 we provide a brief review of existing approaches to event analysis, as well as of relevant process mining methods. Our proposal on how a process orientation can be infused into raw event datasets is presented in Sect. 4, while a short discussion concludes the paper.

2 Background

The “Global Data on Events, Location, and Tone” (GDELT) project, supported by Google, consists of hundreds of millions of event records extracted from broadcast, print, and online news sources from all over the world (Leetaru and Schrodt 2013). GDELT provides a free, rich, comprehensive, daily updated, CAMEO-coded (Gerner et al. 2002) dataset, produced by applying the TABARI system (Best et al. 2013) to each article to extract all the events disclosed therein.

This valuable resource has been exploited in many research efforts (e.g., Keertipati et al. 2014; Kwak and An 2016; Ward et al. 2013; Phua et al. 2014); however, event data are treated either as input to ordinary analytics techniques or as time-series data (Jiang and Mai 2014). Considering event analysis, the most visible methods are survival analysis and event history analysis (Broström 2012). Classical survival analysis focuses on research questions involving the time elapsed from a start event to an event of interest. This is an interesting research problem because of censoring and truncation (deriving from incomplete observations of events). To calculate the survival function and the hazard rate (the most popular outcomes of such research questions), several methods have been developed, like the Kaplan–Meier and the Nelson–Aalen estimators, regression models (e.g., Cox and Poisson), parametric models (e.g., proportional hazards or accelerated failure time models), or even frailty models. However, it is often the case that events of interest occur more than once for an individual (recurrent events), or there are multiple types of events of interest. Event history models address this by following subjects over time and noting what happens and when (Aalen et al. 2008).

Indeed, treating event data as sequences can yield effective visualizations that support decision makers. Several methods for querying, filtering, and clustering multiple event sequences have been proposed, for example Fails et al. (2006), Vrotsou et al. (2009) and Wongsuphasawat et al. (2012), or the works of Gotz and Wongsuphasawat (2012) and Wongsuphasawat and Gotz (2012), which can handle much larger numbers of event sequences and provide effective visualizations of their aggregations. Moreover, when these methods are combined in a user-friendly dashboard, decision support can be further improved (Gotz and Stavropoulos 2014). For a concise description of how event information can be modeled, retrieved, and analyzed, the reader is directed to Gupta and Jain (2011).

Nevertheless, if we assume a process perspective (i.e., that events do not happen at random, but their occurrence is part of a larger, implicit process), the process mining paradigm is enabled, and it substantially augments the decision support potential. In particular, the following capabilities are facilitated:

  • Discovery of complex structures of events (splitting and merging points, long-distance and multi-step causalities, etc.), even when the process drifts over time (e.g., due to recurring seasonal effects) (Martjushev et al. 2015)

  • A family of process mining techniques aims at detecting and explaining differences between executions that lead to different outcomes. Under the general term deviance mining (Nguyen et al. 2014), we can see, for instance, approaches that return in natural language the significant differences between traces that lead to a special outcome and traces that do not (van Beest et al. 2015), or that point out the factors that differentiate the flows (De Leoni et al. 2014; Delias et al. 2015b). Similar behavior (which eventually leads to similar results) can also be identified through trace clustering (Song et al. 2008; Bose and van der Aalst 2009), where trace profiles are created based on control-flow, organizational, performance, or additional criteria, and traces are then grouped according to some similarity metric. Trace clustering techniques are particularly useful for uncluttering process models when a lot of variation exists.

  • Check deviations from expected pathways. Another type of process mining is conformance checking, where an existing (ideal) process model is compared to the event log of the same process. Conformance checking can be used to check whether reality, as recorded in the log, conforms to the model and vice versa (van der Aalst et al. 2012). Local or global diagnostics can be created to check deviations from the expected pathways, or even to check the effect that possible “history modifications” would have on the discovered model (van der Aalst et al. 2015).

  • Putting timestamped events into a process structure allows observing the temporal evolution (performance) of the process. In process mining, this family of methods is known as performance analysis (van der Aalst et al. 2011; Adriansyah and Buijs 2012), and it can respond to questions like: How does process performance evolve? Are some events delayed due to bottlenecks? Would the resolution of bottlenecks, or following some special paths, accelerate some events (Nguyen et al. 2016)?

To the best of our knowledge, this is the first time that a framework enabling a process perspective on event data has been proposed. This is a significant contribution to the field, since a whole novel family of methods becomes available.

3 Preamble of the approach

3.1 Case study

July 2015 was no ordinary month for Greece. Following a dubious negotiation strategy, Prime Minister Alexis Tsipras announced, on the 27th of June 2015, a referendum to decide whether Greece would accept the bailout conditions proposed by the European Commission, the European Central Bank, and the International Monetary Fund. Although the result was a clear “No” (61.3%), one week later Greece’s parliament signed on to harsher bailout terms, i.e., a new mid-term Memorandum of Understanding. News kept arriving rapidly and spectacularly, as the government soon called an early parliamentary election. The story was extensively covered by international media, making it eminently suitable as an illustrative example for the proposed approach. Tens of thousands of events were registered in GDELT, allowing for rich and informative event analytics.

Several techniques and visualizations are provided through the GDELT Analysis Service (http://analysis.gdeltproject.org). For example, by exploiting the geographical tagging of events, it is possible to plot a heat map that helps us understand the spatial patterns of the events. Moreover, GDELT constructs the so-called Global Knowledge Graph (GKG) by extracting information such as the connections of persons, organizations, locations, emotions, and themes identified in the source texts, making it possible to generate word clouds of the most popular themes identified in the events, or tone (emotional connotation of the words in an article) timelines. As Delias and Kazanidis (2017) illustrate, in that period’s broadcast news about Greece the term “tax” dominates, while other terms that stand out are “debt”, “bankruptcy”, “negotiation”, and “fragility”, all of them accurately reflecting the situation of that period. In addition, the announcements of the referendum and the early elections caused two sudden drops in the tone (the negative emotions became even more negative).

All these visualizations provide interesting insights, derived from various aggregation perspectives over the events. However, in all of them, events are treated as fragmentary, “rambling” elements of the story. Should anyone develop a hypothesis that events occur as part of a regularity, i.e., a process, these kinds of visualizations cannot support the relevant checking. In the following sections, we present our proposal for a methodology that responds to this challenge and allows analysts to exploit the richness and availability of raw event data. Ultimately, the proposed methodology enables a process perspective on the (otherwise disjointed) events.

3.2 Methodology and epistemological considerations

We should note that our approach makes the fundamental assumption that the notion of a process is relevant to the realizations of events, i.e., that there is some kind of rational structure over the events, and therefore the challenge is to unveil this structure. Our approach thus follows the prescriptive paradigm, which suggests discovering a model suitable for a given situation in a particular context, and does not intend to be general (unlike, for instance, a descriptive approach, which would aim at deriving global laws from the observed phenomena). Therefore, should we try to position our work epistemologically, we could say that it is located between the baseline that aims to produce a narrative based on empirical observations (namely rich descriptions of the phenomenon) and the ambition to develop a model with some level of abstraction. This area shares common ground with the grounded theory methodology, at least with its implementations in information systems research (Wiesche et al. 2017); yet we do not claim that our work can be classified as a grounded theory approach, mainly because of the substantial differences in the data collection and analysis procedures.

Considering the methodological aspects, we build on the decision support paradigm (Roy 1994), and we follow a more contextual and operational approach, inspired by Delias et al. (2015a). More concretely, in Delias et al. (2015a), the authors, following the process mining paradigm, propose a methodology based on a practical synthesis of the common methodological steps of existing approaches, meant to guide the implementation of process analytics projects. In particular, the original methodology suggests four basic phases:

  1. Business understanding, including actions like defining the project’s scope and uncovering facts, constraints, and assumptions, aiming at figuring out the business context and developing the shape of solutions

  2. Data collection and reviewing, where actions like filtering, imputing missing data, and data transformations are performed

  3. Discovery, which includes a series of techniques intended to extract knowledge relevant to the project’s objectives

  4. Decision aid, in terms of actions like providing recommendations, deploying the solutions, etc., meant to build a rapport between results and business goals

In this work, we endorse that methodology by presenting a refinement. More specifically, we exemplify actions of the first two phases by adjusting them to the event databases context. However, the prime focus of our refinement lies in the discovery phase, wherein we discuss the particularities of analyzing event databases with process mining techniques and reveal the benefits of such a resolution. From a procedural point of view, since the focus is on communicating insights from the data analysis, we could label our approach as “memoing” (Glaser 1978), conditional of course on the research goals, which are to deliver rich descriptions and not a theory. Finally, we leave the last phase, decision aid, out of the discussion, since the political nature of the events involved requires high-level governmental conduct, making the decision aid phase out of scope for this paper. The steps that this methodology comprises for the analysis of the case study are illustrated in Workflow 1. The next section instantiates this workflow and describes the details of its application.

[Workflow 1: the steps of the methodology applied to the case study]

4 Getting insights by applying a process perspective

4.1 Business understanding

Although the term “business” seems a bit unorthodox for describing political events about country interactions, we preserve it to address the general task of describing the context of the problem. In that sense, describing business objectives takes the shape of specifying the expected benefits in the context’s terms. Therefore, in this work, the objective is to reveal regularities in the occurrences of political events among countries, as an additional tool for performing more refined empirical analyses of political and/or social operations. The case study limits the scope to a specific time period (the period relevant to the Greek referendum of 2015) and to specific countries (the countries relevant to the involved institutions).

However, besides defining the project’s scope, an additional scoping activity, which we recommend to be part of the “understanding” phase and which is compatible with our process-oriented perspective, is defining the process scope. According to van der Heijden (2012), defining the process scope deals with acknowledging which parts of the process can be tracked through the logged data and what type of information is both available and useful. In this regard, it is of primary importance to consider the original event data format.

The standard format of raw event data (e.g., ICEWS, GDELT) comprises at least the following basic elements: a timestamp, two fields indicating the involved actors (one treated as the source, and the other as the target), and a code for the pertinent action that took place. For example, Table 1 shows a snippet of a GDELT record of an event that happened on the 2nd of July 2015 and involved Greece (actor code: GRC) and refugees from Syria (actor code: SYRREF). The event that took place is coded 043: “Host or receive a visitor at residence, office or home country”; it does not refer to the actual arrivals of refugees (it would be quite ironic to call that a “visit”), but to actors from the popular TV show “Game of Thrones” who visited Greece to call on EU leaders to help refugees stranded in Greece (the original news can be found at: http://people.com/tv/game-of-thrones-stars-call-on-eu-leaders-to-help-refugees-stranded-in-greece/). Many additional fields (e.g., the coordinates of the event’s location, its tone, the URL of the article) are recorded as well. However, this format cannot be directly matched to a format that enables a process perspective, such as the input requirements of Process Mining. To make it useful for our objective, the activities of the following phases are proposed. In any case, we recall from Sect. 3.2 that a fundamental assumption of this work is that there is some kind of rational structure over the events’ occurrences, and that this structure can be observed through the event datasets.
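To make this format concrete, the following minimal sketch shows how such a record could be represented; the field names are illustrative and do not reproduce GDELT’s actual column headers:

```python
# A minimal sketch of the basic fields shared by event data standards such as
# ICEWS and GDELT. Field names are illustrative, not GDELT's actual headers.
event = {
    "timestamp": "2015-07-02",   # when the event was registered
    "source_actor": "GRC",       # CAMEO code of the source actor (Greece)
    "target_actor": "SYRREF",    # CAMEO code of the target (Syrian refugees)
    "event_code": "043",         # CAMEO action: "Host or receive a visitor..."
}
```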

Lastly, a meaningful activity for the “understanding” phase is to suggest a set of performance indicators, which ideally match indicators to objectives and provide a basis of comparison for the delivered results. Since, by following this methodology, we expect to expose regularities of events’ occurrence, metrics such as support, confidence, language fit, determinism, and coverage, proposed by Tax et al. (2016a), might become relevant. However, under the “memoing” procedure that we suggested in Sect. 3.2, the need for quantitative indicators can be relaxed. We can therefore accept as a legitimate result whatever contributes to building a rich description of the phenomenon.

Table 1 A sample record of GDELT (only basic fields are shown)

4.2 Data collection and reviewing

To collect a relevant dataset, we queried the GDELT database. We performed three queries: the first returned all events that happened between the 20th of June 2015 and the 20th of July 2015 (the contested period of the referendum) and whose source actor’s country was Greece. The second query was similar, with the single difference that we asked for the target actor’s country to be Greece. The third query asked for all events that happened during the same period in Belgium (the host country of the E.U. institutions, where most official negotiations took place), without specifying any actors’ countries. These queries returned a dataset of 30,000 rows. Admittedly, this dataset is not complete (it would take far more countries and a longer time period to reach a more complete dataset), yet it is a sample that can sufficiently demonstrate our approach.

An integral activity of the data collection and reviewing phase is filtering. Although the filters to be applied to event datasets will always be case-dependent, we can recommend the following ones:

  • Since actors (either source or target) are an indispensable variable of the process perspective (see Sect. 4.2.1 below), records that do not include any information about the involved actors can be removed. In our case, because we care about interactions at the country level, we removed the records that were blank in the source and/or the target actor’s country field.

  • It is highly probable that different media captured the same event, so it is equally probable that the dataset contains many duplicates. Such duplicated rows were removed.

  • Embracing a process perspective signifies that we care about sequences of events. Therefore, if a case (see Sect. 4.2.1 on how a case can be defined) consisted of just one event, it was not helpful for our process-oriented analysis, and hence it was removed.

In addition, we manually browsed the URLs of the events to detect irrelevant ones. We indeed encountered several extraneous events, which were removed, yet some noise remained. Popular themes of the extraneous events were Brexit, refugee migration, terrorist attacks, or even recreational events like football players’ movements and concerts.

Finally, since the focus of this illustrative example is on interactions between countries, we chose to remove the events that included the same country as both source and target actor, a pattern that commonly indicates debates between opposition and government within that country and clutters the visualizations of country interactions. Applying all the above filters resulted in a dataset of 4476 events, which indicates the necessity of filtering as a data reviewing activity.
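A minimal sketch of the above filters, assuming the raw events sit in a pandas DataFrame (the column names are illustrative placeholders, not GDELT’s actual headers), could look as follows:

```python
# A sketch of the recommended filters; column names are illustrative
# placeholders, not GDELT's actual headers.
import pandas as pd

def review_events(df: pd.DataFrame) -> pd.DataFrame:
    # Remove records with no actor-country information on either side
    df = df.dropna(subset=["source_country", "target_country"])
    # Remove duplicates (the same event captured by different media)
    df = df.drop_duplicates()
    # Remove events with the same country as both source and target actor
    df = df[df["source_country"] != df["target_country"]]
    # Remove single-event cases; here a case is a pair of countries, as in
    # transformation template 1 (Sect. 4.2.1) -- adapt to the chosen template
    case = df["source_country"] + "_" + df["target_country"]
    return df[case.groupby(case).transform("size") > 1]
```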

4.2.1 Transformation of event data

As discussed in Sect. 4.1, the standard format of event data cannot be directly matched to a format that enables a process perspective. More specifically, Process Mining requires at least three basic fields for every record: a case ID (to correlate events with a particular case), a timestamp, and an activity (van der Aalst 2016). All applications of Process Mining on different types of data, from low-level machine or software logs (Günther et al. 2009; Mannhardt et al. 2016) to incidents’ status changes in CRM systems (van Dongen et al. 2013), assume this data format. This work proposes the following mappings to enable event data to be exploited by process mining tools:

As far as the timestamp field is concerned, the pairing is clear: it can be directly matched to the corresponding field of the raw event data. However, it is not clear at all what the case ID and/or the activity should be. In Table 2, we provide six alternative mappings that can be applied; each of them delivers a different view of the data, yet all of them afford a process-oriented view.

Table 2 Alternative transformations to match raw event data to process mining input requirements

The first alternative transformation uses combinations of the two actors (hence the underscore as the joining delimiter) to distinguish the case identifiers, and the event code field to create the alphabet of activities. This way, a case is a bilateral relationship between two actors, which leaves a trace of the events that happened and involved both of them. By transforming raw event data into this format, we can use the dataset as an input for automated process discovery.

The next two proposed alternative transformations are similar, in the sense that they still make use of the event code as the activity and an actor as the case ID, yet they use a single actor (i.e., not a combination). These alternatives are suitable for datasets where many countries participate and there is a limited thematic selection. In such situations, we expect process maps to reveal whether there are any behavioral patterns for single countries (e.g., whether countries that “criticize or denounce” are “using conventional military force” as a following action).

Transformations 4, 5, and 6 invert the logic. Traces are now joined by the event code (the case ID). The items that each trace comprises (the activities) are either single actors (source or target) or their combinations. This inverted logic allows observing the interactions of actors subject to a specific thematic category. In Sect. 4.3 we provide illustrative instantiations of the relevant templates; a code sketch of two of the mappings follows.
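To make the mappings concrete, the following sketch instantiates templates 1 and 6 of Table 2 over a raw event DataFrame; as before, the column names are illustrative:

```python
# A sketch of two of the six mappings of Table 2. Each returns the three
# fields that process mining tools require: case ID, activity, timestamp.
import pandas as pd

def template_1(df: pd.DataFrame) -> pd.DataFrame:
    """Case = pair of actors (underscore-joined), activity = event code."""
    return pd.DataFrame({
        "case_id": df["source_country"] + "_" + df["target_country"],
        "activity": df["event_code"],
        "timestamp": df["timestamp"],
    })

def template_6(df: pd.DataFrame) -> pd.DataFrame:
    """Inverted logic: case = event code, activity = pair of actors."""
    return pd.DataFrame({
        "case_id": df["event_code"],
        "activity": df["source_country"] + "-" + df["target_country"],
        "timestamp": df["timestamp"],
    })
```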

4.3 Discovery

Following the first transformation template of Table 2, and with the support of Celonis Academic Cloud (Celonis 2017), we created the illustration of Fig. 1. We note that only the most frequent activities and transitions are illustrated. Since, as the first template suggests, every case is a pair of countries, the process control-flow map actually plots regularities of events’ occurrences, responding this way to the first question we mentioned in Sect. 1 (“Are there any paths, of any size, that reveal some kind of frequent interaction patterns?”). Moreover, we have colored the activities according to their CAMEO code category (see the legend of Fig. 1).
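Fig. 1 itself was produced with Celonis; purely as an illustration of how the transformed log feeds automated discovery, the following sketch uses the open-source pm4py library (assuming its simplified 2.x interface, whose function names may vary across versions):

```python
# A sketch of automated discovery over the template-1 log with pm4py
# (assumed 2.x interface). This is an open-source alternative shown for
# illustration; Celonis Academic Cloud was used for Fig. 1.
import pandas as pd
import pm4py

log_df = template_1(events)            # events: the filtered DataFrame
log_df["timestamp"] = pd.to_datetime(log_df["timestamp"])
log_df = pm4py.format_dataframe(log_df, case_id="case_id",
                                activity_key="activity",
                                timestamp_key="timestamp")
# Discover the directly-follows graph and render it with arc frequencies
dfg, start_acts, end_acts = pm4py.discover_dfg(log_df)
pm4py.view_dfg(dfg, start_acts, end_acts)
```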

We point out two interesting patterns. First, the “Consult” activities are connected either to other “Consult” activities or they are interleaved with “Make public statement” activities. Since statements are typically subordinate events (events are assigned a code of that category only when they do not further imply appeals, agreements, support, apologies, demands, disapprovals, rejections, threats, etc.), and because “Consult [NS]” is assigned when the place of the meeting is not explicit in the lead (so no visit made or hosted can be extracted) and no negotiations are implied, we can consider this pattern as a preparatory phase of the institutional discussions.

Second, hosting a visit, when not leading to another “Consult” activity, leads to making a pessimistic comment, making an appeal or request, or expressing intent to meet or negotiate, a pattern that aptly reflects the turbulent political situation of that time, when hosting a visit did not immediately mitigate the political tension that existed in the days before and after the referendum.

Moreover, by visually examining Fig. 1, we can observe that the activity “Host a visit” acts like a split gateway, since the flow at that point diverges (there are six outgoing flows from it). The inverse (several incoming flows) can be observed at the “Express intent to meet or negotiate” activity, which acts like a join gateway. This kind of visual interpretation offers a quick response to the second question we described in Sect. 1 (“Can we manifest decision points, or deviating behaviors, by introducing gateway semantics (flow splitting and joining points) on events’ occurrences?”).

Fig. 1
figure 1

Process map following transformation template 1 of Table 2. The color of the event nodes follows their thematic characterization according to CAMEO. Only the top-12 most frequent events are shown. The visualization was created with Celonis Academic Cloud (Celonis 2017) (color figure online)

To capture the interactions of actors subject to a specific thematic category, we may use transformations 4, 5, and 6 of Table 2. The thematic characterization is the focus of the third question we mentioned in Sect. 1, so we exemplify the potential of our methodology with the sixth transformation, which is of particular interest for our illustrative example. We demonstrate this potential with Fig. 2, which presents the pairs of actors involved in “Make Public Statement” events (e.g., make pessimistic/optimistic comment, acknowledge or claim/deny responsibility, make empathetic comment). An interesting pattern that emerges is the path that connects the pair of the EU and Greece, followed by Greece and the USA, followed by Greece and the EU, and eventually followed by Germany and Greece. This path either acts like a closed loop or escapes to other active EU member states (e.g., France, Italy). It reflects a common situation of that time: European authorities making a statement about their opinions/decisions on Greece, then Greece making a comment about the IMF, and then the member states (the protagonists) making refined commentaries.

Fig. 2
figure 2

Process map following transformation template 6 of Table 2. The cases are filtered to the “MAKE PUBLIC STATEMENT” thematic category. Only the top-12 most frequent countries’ pairs are shown. The visualization was created with Celonis Academic Cloud (Celonis 2017)

Another advantage of plotting the events from a process-oriented perspective is that the representation via a network-like structure can lead us to recognize properties for them (as expressed in the fifth question in Sect. 1). More specifically, it is possible to calculate the hub or authority score (centrality) for any activity. Authority is expected to assess the importance of the activity, whereas hub is expected to reflect the value of its links to other activities. An activity has a high authority score if it is pointed to by many activities with high hub scores, whereas an activity has a high hub score if it points to many activities with high authority scores. In that respect, Tables 3 and 4 reveal additional insights. For example, in Table 4, we see that the node with the highest authority is “USA-GRC”, which stands for the interactions of the IMF with Greece. It is worth mentioning that while this node has the top authority for the “CONSULT” thematic, it has a lower authority when the thematic is “MAKE PUBLIC STATEMENT” (see Fig. 2), where the top authorities are the nodes “DEU-GRC” and “BEL-GRC”. This is an indication that for “consult” activities the IMF was the protagonist, but in making statements that role was played by the European partners. This is, of course, an additional hint of how thematic characterization influences the interaction patterns (recall question (3) in Sect. 1).

Table 3 Activities with high hub scores for a process map following transformation template 1 of Table 2
Table 4 Activities with high authority scores for a process map following transformation template 6 of Table 2, under the “CONSULT” thematic
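The hub and authority scores of Tables 3 and 4 follow the HITS logic; a minimal sketch of how they can be computed on a directly-follows network with networkx (the edges below are illustrative, not the actual case study data):

```python
# A sketch of hub/authority scoring on a directly-follows network, using
# the HITS algorithm as implemented in networkx. Edges are illustrative.
import networkx as nx

G = nx.DiGraph()
# Edge (a, b): activity b directly followed activity a in some trace
G.add_edges_from([
    ("EUR-GRC", "USA-GRC"), ("DEU-GRC", "USA-GRC"),
    ("USA-GRC", "DEU-GRC"), ("GRC-EUR", "USA-GRC"),
])
hubs, authorities = nx.hits(G)
print(max(authorities, key=authorities.get))   # top-authority node
```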

Control-flow maps like the ones illustrated in Figs. 1 and 2 can render events’ occurrences into meaningful structures and designate regularities; however, there are two issues pertaining to the process-like representation. The first issue is the expected variability of flows, and the second is the lack of definitive start and end points in intermediate relationships.

The former issue is rather common in many processes of “flexible” environments, like healthcare or customer service. A typical way to deal with it is trace clustering, namely to group cases and discover a distinct process model per group, thus delivering more comprehensible results. Nevertheless, several trace clustering techniques have been proposed, varying in the representation of traces, the distance/similarity measure employed, and the clustering approach (Thaler et al. 2015), making the selection of the appropriate technique a challenging task for analysts. A critique of the existing trace clustering techniques is out of the scope of this paper, yet we present a visual approach that can support analysts in comparing their output.

Assuming that a trace clustering technique provides us with a hierarchical clustering tree (and thus a dendrogram that captures the tree structure), we can calculate the cophenetic distance between two traces as the height of the dendrogram at which those two traces are first joined (Sokal and Rohlf 1962). In Fig. 3 we have computed the Spearman correlation between the cophenetic distance matrices of the trees that resulted from clustering traces using the following distance measures (Studer and Ritschard 2015), a computation sketched in code after the list:

  • om: Optimal matching edit distance with a substitution-cost matrix derived from the transition rates between events

  • omloc: Localized optimal matching edit distance

  • nms: Distance based on number of matching subsequences

  • dref1: Normalized longest common subsequence distance to the most frequent sequence

  • \({\hbox {Chi}}^2\): Chi-squared distance over the maximal length of traces

  • Euclid: Euclidean distance over the maximal length of traces
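A minimal sketch of this computation, assuming the per-measure trace distance matrices are already at hand (random placeholders stand in for them here; the measures listed above come from TraMineR-style implementations, which we do not reproduce):

```python
# A sketch of the comparison behind Fig. 3: one hierarchical tree per
# distance measure, cophenetic distances extracted from each tree, and a
# pairwise Spearman correlation. Placeholder matrices stand in for the
# actual trace distances.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.stats import spearmanr

def cophenetic_distances(condensed_dist):
    """Condensed trace-distance matrix -> condensed cophenetic distances."""
    return cophenet(linkage(condensed_dist, method="average"))

rng = np.random.default_rng(0)
n = 30                                    # number of traces (illustrative)
d_chi2 = rng.random(n * (n - 1) // 2)     # placeholder distance matrices
d_euclid = rng.random(n * (n - 1) // 2)
rho, _ = spearmanr(cophenetic_distances(d_chi2),
                   cophenetic_distances(d_euclid))
print(f"Spearman correlation of cophenetic distances: {rho:.2f}")
```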

Fig. 3
figure 3

Spearman correlation of cophenetic distances of the hierarchical clustering trees. The fuller the pie of a pair, the greater their correlation, while the color of the pie indicates whether the correlation is positive (blue) or negative (red). Illustration produced with Galili (2015) (color figure online)

Figure 3 confirms that, depending on the distance measure, different clustering results will be derived. However, the added value of such a plot is that the analyst can easily check which distance measures yield similar results and which generate contrasting clusters. This empirical control can be further supported by illustrations. In particular, to further explore the similarities and differences between the alternative clustering options, we can turn to the visual product of a “tanglegram” (see Fig. 4). A tanglegram is a pair of trees on the same set of leaves, with a bijection between the leaves in the two trees (Venkatachalam et al. 2010).

Fig. 4
figure 4

A tanglegram for the hierarchical trees resulting from clustering with the Chi-squared and the Euclidean distances. Bijections connect the same objects (traces for pairs of countries), and colored branches indicate cluster membership. Illustration produced with Galili (2015) (color figure online)

In a tanglegram, we expect to observe common sub-trees shared between the dendrograms produced by the two clustering options, as well as distinct edges. In Fig. 4, we have used two clustering options that yield rather similar results (an estimation based on the correlation plot of Fig. 3), and we asked for three clusters, which we colored differently. Although the produced illustration is rather dense, on close examination we can see that there is a group of countries’ pairs (“DEU-GRC”, “GRC-EUR”, and “EUR-GRC”) that are clustered together in both options, but there are also some “important” cases, like the pairs “GRC-BEL” and “EUR-BEL”, that are clustered differently by the two techniques. This is an example of how analysts can increase their confidence (or their doubt) about the groupings delivered by the various techniques and visually assess the sensitivity of the clustering results. We note that in Fig. 4 a trace represents a sequence of events between two countries; therefore, if two objects are grouped together, we can assume that the events happening in the two pairs of countries expose a regularity, an insight that would be inaccessible to an event-oriented (i.e., not process-oriented) analysis.

The second issue, the lack of definitive start and end points, appears because of the nature of the problem, and it is harder to address. We discuss two relevant approaches that appear to have the potential to deal with it: trace alignment, and describing patterns instead of models. Trace alignment for a set of traces \({{\mathcal {T}}}=\left\{ T_1, T_2, \dots , T_n\right\}\) is defined in Bose and van der Aalst (2012) as a mapping of the set of traces in \({{\mathcal {T}}}\) to another set of traces \(\bar{{{\mathcal {T}}}}=\left\{ \bar{T_1},\bar{T_2},\dots ,\bar{T_n}\right\}\), where every \(\bar{T_i}\) derives from \(T_i\) by the addition of one or many “gaps”, in such a way that \(|\bar{T_1}| = |\bar{T_2}|=\dots = |\bar{T_n}|=m\), where m is the length of the alignment. Trace alignment is generally expected to facilitate process diagnostics; however, because of the last requirement (equal length of the aligned traces), and because in our case study the original traces varied significantly in length (from 2 to 292 events per trace), we were not able to mine any interesting results.
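As a toy illustration of the definition (not drawn from the case study), aligning \(T_1=\langle a, b, c\rangle\) with \(T_2=\langle a, c\rangle\) could yield \(\bar{T_1}=\langle a, b, c\rangle\) and \(\bar{T_2}=\langle a, -, c\rangle\), where the gap symbol “\(-\)” pads \(T_2\) so that both aligned traces reach the common length \(m=3\) and the shared events a and c line up column-wise.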

A different approach is to look for interesting fragments of the process, i.e., not for an end-to-end process that would demand start and end points. One option is to take a declarative approach to modeling business processes. Declarative techniques, e.g., Maggi et al. (2011) and Pesic et al. (2007), introduce constraints in models as rules that have to be followed, but these constraints are often binary, or they involve a limited set of activities in pre-specified structures. To reach ampler patterns, Local Process Models have been proposed in Tax et al. (2016b). To generate local process models, a heuristic approach based on Markov clustering is proposed in Tax et al. (2016a).

In this work we propose a different tactic. We commend a network-based representation for portraying the flows of political event occurrences, with nodes representing the events (or whatever plays the role of the “activity” after the transformations of Table 2) and links between the events representing the strength of sequential occurrences in traces.

A common practice for constructing such networks is to aggregate the pairwise connections in traces and put them as edge weights. This practice assumes the Markov property (first-order dependency). The assumption implies that the occurrence of one event depends only on the previous event that has occurred, so it fails to capture important information, like, for instance, what has happened two or three steps before, because, by definition, the flow on the network is aggregated at every step (Xu et al. 2016). Therefore, we claim that higher-order networks yield more appropriate representations. In Fig. 5, we plot a network of order 3, as the traces’ set was fit to a Markov chain. To estimate the chain of order 3, we solved the linear programming problem of Eq. 1.

$$\begin{aligned} \begin{array}{ll} \text {minimize} &amp; \left\| {\mathcal {E}} - \sum \limits _{i=1}^{3} \lambda _i M_i {\mathcal {E}} \right\| \\ \text {subject to} &amp; \sum \limits _{i=1}^{3} \lambda _i = 1 \\ &amp; \lambda _i \ge 0 \end{array} \end{aligned}$$
(1)

where \({{\mathcal {E}}}\) is the distribution of events, \(M_i\) is a non-negative \(m \times m\) transition matrix (with m being the number of nodes) whose columns sum to one, and \(\lambda _i\) is the weight of each lag i. This optimization model is described in Ching et al. (2013), and in this work we used the implementation of Scholz (2016). We should stress that what we promote here is the visual aid of higher-order Markov chains, not their suitability for process discovery, which is left as future work.
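For illustration, the following sketch solves Eq. 1 under the \(L_1\) norm (so the problem stays a linear program) with scipy; the event distribution and the transition matrices below are toy placeholders, whereas Scholz (2016) was the implementation actually used:

```python
# A sketch of estimating the lag weights of Eq. 1 under the L1 norm, so the
# problem remains a linear program solvable by scipy.optimize.linprog.
# E and the M_i below are illustrative placeholders; in practice they are
# estimated from the traces' set.
import numpy as np
from scipy.optimize import linprog

def fit_lag_weights(E, Ms):
    """Minimize ||E - sum_i lambda_i M_i E||_1 s.t. sum lambda = 1, lambda >= 0."""
    m, k = len(E), len(Ms)
    V = np.column_stack([M @ E for M in Ms])       # column i holds M_i @ E
    # Decision vector x = [lambda_1..lambda_k, t_1..t_m]; minimize sum(t),
    # with t bounding the absolute residuals componentwise
    c = np.concatenate([np.zeros(k), np.ones(m)])
    A_ub = np.block([[-V, -np.eye(m)], [V, -np.eye(m)]])
    b_ub = np.concatenate([-E, E])
    A_eq = np.concatenate([np.ones(k), np.zeros(m)]).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (k + m))
    return res.x[:k]                               # the lag weights

E = np.array([0.6, 0.4])                           # toy event distribution
Ms = [np.array([[0.7, 0.5], [0.3, 0.5]]),          # toy column-stochastic
      np.array([[0.6, 0.4], [0.4, 0.6]]),          # transition matrices,
      np.array([[0.5, 0.5], [0.5, 0.5]])]          # one per lag
print(fit_lag_weights(E, Ms))
```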

Fig. 5
figure 5

Markov chain of order 3. The color of the event nodes follows their thematic characterization according to CAMEO. Only the top-12 most frequent events are shown. The visualization was created with Scholz (2016) (color figure online)

The fit of the higher-order Markov chain to the traces’ set yields two interesting remarks:

  • The “Reject [NS]” event appears to be a rather digressive, rambling event in Fig. 1, since it is connected with weak links to just two events: downstream to “Express intent to meet or negotiate” and upstream to “Make a visit”. However, in Fig. 5, we observe that the occurrence of a “Reject [NS]” is a rather probable event if we allow three steps (i.e., three interactions within a pair of countries) to happen.

  • “Consult [NS]” is a very probable event, signifying that two countries will eventually be involved in a “consult” event, if we allow some steps to pass.

These two insights reflect the actual situation of that time (empirically witnessed): although we observed countries exhibiting “political correctness” and discussing issues in an institutionalized context, eventually some actions of some country (commonly Greece) were rejected.

5 Discussion

In this work, we tried to exploit the richness and availability of raw event data by adding analytics capabilities. These capabilities are enabled by the activation of a process perspective on the (otherwise disjointed) events. Through a refinement of a process mining methodology, and following the decision support paradigm, we were able to provide analysts with a guide on how to get richer insights, and with several recommendations on how to exploit the relevant visualizations to facilitate the interpretation of the events’ occurrences.

There are, of course, several limitations to our approach. First, one must endorse our epistemological and methodological assumptions. Second, there is a plethora of process mining techniques that were not checked for their relevance to our approach. Indeed, our recommendations are relevant only in the context of the described case study, and although there is some potential for generalization, we did not evaluate their reliability and validity in more general situations. We cannot stress enough that techniques originally developed for a business context (the archetypal motivation of Business Process Management) are not straightforwardly applicable to social/political event analysis. Nevertheless, this paper puts forward a big promise for event analytics, and many challenges may appear; given the efforts that have already been devoted to data collection issues, the focus needs to shift towards the analysis side.