
1 Introduction

Decision support systems (DSS) comprise a set of computational tools whose purpose is to support better decision making processes within organizations.

DSS allow decision makers to visualize, analyze, process and mine data of various types. Data is collected from a diversity of sources and integrated to create a knowledge base repository. The knowledge base is then used by decision makers to help them in reasoning about some possible scenarios.

DSS are increasingly used in a variety of fields such as business intelligence and criminal intelligence. In business intelligence, DSS help companies to assess their competitors’ position in the market, determine market trends, and plan future investments.

In criminal intelligence, DSS help intelligence agencies to tackle organized crime, detect crime patterns and analyze the structure of criminal organizations.

DSS have become more sophisticated in the internet era. The internet has eased the communication and sharing of information by people, organizations and the IT systems of companies. Moreover, social media is increasingly used by hundreds of millions of people, including government, media and business organizations, but also by organized crime.

The internet is used by traditional mass media for distributing news and information in digital format. For instance, some of the most common sources are RSS feeds, blogs, and specialized web portals that include video, audio and text about events or persons.

The internet has also eased the distribution of specialized technologies. Computer code can be found on the internet that may be used to exploit the weaknesses of IT systems and/or to perform malicious or criminal activities. The easy access to specialized code has allowed hackers and organized crime to carry out cyber-attacks against specific servers of government institutions or against the whole computer network of a country. These attacks are carried out using a botnet of “zombie” computers distributed around the world [1].

In the case of intelligence applications, traditionally, the data sources used to track down organized crime were classified. Classified, secret data about individuals or organizations is kept in secure data repositories isolated from the internet.

However, intelligence agencies have recognized the value that the information publicly available on the web has in their investigations. In many criminal cases the usage of social media by organized crime and its associates may leave intended or unintended traces that can be collected for analysis [2].

Information collected from open sources on the web is being used not only in criminal investigations but also in a variety of areas such as situation monitoring and assessment, and the production of early warnings of possible crises.

The collection of methods used in collecting, managing and analyzing publicly available data is called Open Source Intelligence (OSINT).

The data sources employed in OSINT are varied. Some of these data sources are electronic media such as newspapers and magazines, web-based social media such as social networks, or specialized web portals and blogs, public data from government sources, professional and academic literature, geospatial data, scanned documents, video and data streams, among the most common sources.

OSINT creates important technical, legal and ethical challenges. One of the main technical challenges is to collect relevant, meaningful information from reliable sources among the huge number of data sources available on the web. This is a critical issue since information may be of low reliability or bogus, obsolete, duplicated and/or available only in certain languages. Information may also be of different types and available in different formats.

Once reliable data sources are found, the relevant information contained in them must be identified and extracted. The extraction of relevant information from a corpus of documents is also technically challenging, because entities expressed in natural language, such as events, individuals, organizations and places, must be recognized and disambiguated.

After the extraction process, information should be stored in knowledge bases, a step that converts raw information into knowledge. The knowledge stored in knowledge bases is commonly represented in the form of ontologies and semantic networks. These ontologies contain specialized and general domain knowledge about entities such as types of crimes, individuals, organizations and events.

The knowledge bases allow analysts to reason about entities and their relationships.

One important feature of DSS is their visualization tools. These tools allow the analyst to look at summaries of data from different perspectives, using for instance dashboards, graph viewers, maps, and plots of different types.

All the technical aspects that have been briefly described in previous paragraphs are very important in DSSs for OSINT applications, but they are not the only challenge that must be addressed when such systems are developed.

Other non-technical challenges in OSINT relate to legal and ethical aspects. In OSINT, relevant information must be collected in a way that respects the privacy of individuals and, at the same time, does not violate existing national and international laws in this regard.

The collection of personal information about individuals and organizations from the numerous open (and proprietary) sources on the internet has opened up the possibility of using such data not only to target specific organizations but also for mass surveillance purposes.

This issue has created a debate on the ethical and legal aspects involved in the use of OSINT-based technologies.

In the case of the VIRTUOSO project, these issues were addressed from the inception of the project. One of the tasks performed continuously during the whole development process of VIRTUOSO was to make sure that privacy was respected and that no laws (for instance, copyright) were violated when collecting and storing data.

The VIRTUOSO platform was developed and implemented, addressing all the technical and ethical challenges described above.

As is described later in this chapter, VIRTUOSO consists of several software components. The DSS is one of its key components. The DSS employs computational intelligence techniques that allow analysts to reason under uncertainty and to represent and fuse knowledge, among other tasks.

This chapter describes the overall architecture of the VIRTUOSO system and the main components that comprise VIRTUOSO’s DSS.

This chapter is organized as follows. Section 2 describes the architecture of the VIRTUOSO platform. Section 3 describes some representative components of the decision support system in VIRTUOSO. Section 4 describes some possible applications of decision support systems in cyberwarfare. Finally, Sect. 5 presents some conclusions.

2 VIRTUOSO’s Architecture

In summary, the goal of the VIRTUOSO project is to retrieve unstructured data from open sources available on the web and convert it automatically into structured, actionable knowledge. To achieve this goal, a flexible architecture for the whole system was designed.

The architecture of VIRTUOSO is based on a Service Oriented Architecture (SOA). SOA is a recommendation for how to structure component systems based on web services. SOA was proposed to ease communicating, synchronizing and integrating diverse software components that implement these services. This was extremely important in VIRTUOSO, because software components could be developed by different partners participating in the project, using different languages and technologies.

In VIRTUOSO the SOA model was implemented using the Weblab platform. Weblab is a platform whose main purpose is to build software systems specifically for OSINT applications based on the SOA specification.

The architecture of VIRTUOSO consists of three main processing stages: (a) data acquisition, (b) data processing, and (c) decision support. A special portal allows users to configure and monitor the different processing stages.

In the data acquisition stage, data from unstructured open sources is retrieved using web crawling techniques. Web crawlers acquire different types of data from a wide diversity of sites on the web, e.g. electronic-text data, multimedia content, and even scanned paper documents. These multiple types of data come from web sites, blogs, tweets, RSS feeds, trends, video streaming sites, and paper documents.

Regarding electronic-text data, at the current state of the project more than 500,000 documents are processed every day, written in 39 different languages from 188 countries. These documents are retrieved from 28,000 open sources.

The data acquisition stage is continuously connected to the Internet to retrieve all relevant data. Additionally, some pre-processing is performed at this stage, for instance normalization, object recognition, entity naming, event extraction, image and video classification, source assessment, and speech recognition.

Normalization of different types of media and documents is performed by representing them in a single XML based format that contains pointers to the real location of data. The source assessment stage attempts to evaluate the reliability of a data source.

The number of pre-processing steps performed by the VIRTUOSO platform can be configured.

After the pre-processing stage, a special data repository is created, containing all the results of all pre-processing steps that may have been performed.

Both the data acquisition and pre-processing stages were implemented by integrating all their components on the SOA model. Figure 1 shows the SOA platform with the three main processing stages of VIRTUOSO together with the crawlers required to download data from open sources on the web.

Fig. 1 Data acquisition, preprocessing and data repository

In contrast, the data processing stage of VIRTUOSO is not connected directly to the internet, mainly for security reasons.

The data processing stage contains several components, among which are: a full text and multimedia search engine, a summarization component, automatic translation of documents, determination of document similarity, and query translation.

The knowledge base is one of the key components in VIRTUOSO. The knowledge base is created a priori with general domain knowledge and is updated with knowledge extracted in the data pre-processing stage.

To be able to use the data repository created in the pre-processing stage, an import/export component is available at the processing stage. During importing, data may be manually or semi-manually validated to ensure that no irrelevant or dubious data is introduced into the knowledge base.

3 The Decision Support System

The decision support system of VIRTUOSO is one of its key components. The purpose of the DSS is to provide intelligence analysts with a set of software components that can be used to extract and store knowledge and to visualize, analyze, process and mine data of various types.

One of the main benefits of using DSS in applications such as VIRTUOSO is that they help analysts make more informed and effective decisions. For instance, DSS provide analysts with a priori knowledge about certain types of crimes and organizations. Using this knowledge, and the data available for a current situation or event, analysts can look at different views of the data and look for patterns and trends that may help them to assess the importance of the situation or event. Analysts can also look at how participants in an event are related to each other (i.e. their social network) to determine how important these individuals are within the social network.

However, given that domain scenarios for criminal investigations are different, expert analysts must decide which components of a DSS should be applied in each particular case.

VIRTUOSO’s decision support system consists of a variety of software components. Figure 2 shows a high level view of the data processing and decision support components of VIRTUOSO on the SOA platform.

Fig. 2 Data processing, knowledge base and decision support system

The portal allows users to interact with the DSS. The knowledge base stores a priori knowledge in the form of ontologies and knowledge extracted from open source documents. The knowledge base is also part of the decision support system contained in VIRTUOSO. The data processing stage and the decision support system share the same SOA-based Weblab infrastructure.

Some of the components that are part of the processing stage in the DSS can be applied to documents, directly to data, or to both data and documents. For instance, the components that can be applied to documents are: metadata viewer, source assessment, geographical search, multimedia and text search, trend analysis, social media topic, sentiment monitor and semantic search.

The components that can be applied to data are: graph viewer, tabular viewer, graphical SPARQL querying, rule editor, entity editor, similarity of entities, similarity of strings, semantic analysis, and social network analysis component.

Finally, the components that can be applied to both data and documents are: dashboard visualization, knowledge browser, centipede and traceability component.

All these components allow the analyst to perform a wide variety of tasks, from querying semantic knowledge bases, to the visualization and processing of different types of graphs, data and documents. Some of the software components available in VIRTUOSO’s decision support system are not used directly by the analyst, but instead, they provide services to other components in the system.

All the components work seamlessly together on Weblab’s SOA platform.

Due to the wide variety of components that are available in the DSS of VIRTUOSO, it is not possible to describe all of them in detail. Thus, in the rest of the chapter we will describe a few of the most representative components in the DSS. The documentation of the VIRTUOSO project contains a detailed description of the rest of the components [3] that were briefly mentioned in previous paragraphs.

3.1 Knowledge Base, Fusion and Uncertainty Management

One of the key components in VIRTUOSO’s DSS is its knowledge base. The knowledge base comprises ontological knowledge (conceptual and geographic) and operational knowledge (factual).

The ontological knowledge consists of the known existing relationships among the concepts employed in the intelligence domain. The knowledge base contains several ontologies about general knowledge and specialized knowledge about the domains of criminal and intelligence analysis.

Ontological knowledge is represented as triples in the form (predicate, subject, object) or (p,s,o) for short, as specified in the resource description framework (RDF) schema. Internally the knowledge base employs a slightly different format based on RDF.

Semantic knowledge in VIRTUOSO may be introduced in the knowledge base either manually for highly specialized domains or in an automatic or semiautomatic way for other types of domains. For instance, part of the ontological knowledge included in VIRTUOSO is imported from existing ontology resources. However, to use these or other existing ontology resources a process of semantic disambiguation and fusion of information is performed.

The fusion component in VIRTUOSO merges two graph structures, using an operation called “maximal join”. This method was originally proposed to fuse conceptual graphs in [4]. However, in VIRTUOSO the maximal join heuristic was applied to semantic graphs.

The join operation was divided into two parts. First, the compatibility of the two elements to fuse is evaluated. Two entity nodes are considered compatible if the type of entity is the same (e.g. person, location) and if a high proportion of the entities’ properties are similar. The similarity measure that is applied depends on the type of properties that the entities have.

In VIRTUOSO the nodes in the graph structures correspond to entities that have properties defined as strings of characters or numbers.

The similarity of string properties, such as names, was evaluated using the Levenshtein string edit distance. This distance counts how many insertions, deletions or substitutions of characters are needed to convert one string into the other.
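The edit distance can be illustrated with a minimal sketch. The function `levenshtein` below is the standard dynamic-programming formulation; the helper `string_similarity`, which rescales the distance to a value in [0, 1], is a hypothetical normalization introduced only for illustration and is not taken from the VIRTUOSO documentation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def string_similarity(a: str, b: str) -> float:
    """Hypothetical rescaling of the distance to [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

A similarity of this kind could then be compared against a threshold when deciding whether two name properties refer to the same entity.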

For numerical properties, the similarity was calculated using several techniques. For instance, in the case of date properties the number of days between the two dates was used as the distance. For other types of numerical properties the similarity was evaluated using the following equation:

$$\begin{aligned} sim_{num}(\beta ,x,y)=e^{\frac{\beta (x-y)^2}{\beta -1}} \end{aligned}$$
(1)

where \(\beta \) represents the sensitivity of the measure to the distance between two similar numerical values \(x\) and \(y\).

Two entities were considered compatible if the similarity value of the numerical and string properties was above a certain threshold value.
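As a small illustration, Eq. (1) can be evaluated directly. The sketch below assumes \(0<\beta <1\), which is not stated in the text but makes the exponent non-positive, so that the similarity lies in (0, 1] and equals 1 only when \(x=y\).

```python
import math


def sim_num(beta: float, x: float, y: float) -> float:
    """Numerical similarity of Eq. (1): exp(beta * (x - y)^2 / (beta - 1))."""
    # Assumption of this sketch: beta lies strictly between 0 and 1.
    if not 0.0 < beta < 1.0:
        raise ValueError("this sketch assumes 0 < beta < 1")
    return math.exp(beta * (x - y) ** 2 / (beta - 1.0))
```

Under this assumption, values of \(\beta \) closer to 1 make the similarity drop faster as \(x\) and \(y\) move apart.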

Once the similarity between entities was determined, the maximal join operation was used to fuse the sub-graphs of two distinct but compatible graphs using the following method. Nodes determined as being compatible were fused, creating an extended graph that included the sub-graphs of the two compatible nodes. This procedure was repeated recursively in each node of the sub-graphs until incompatibilities were found.
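The recursive fusion procedure can be sketched as follows. This is a deliberately simplified illustration, not the project's implementation: it assumes tree-shaped (acyclic) graphs, the `Node` class is hypothetical, and `compatible` is a stand-in that compares only the entity type and a case-insensitive name instead of the full property similarity described above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str                                  # entity type, e.g. "person"
    links: list = field(default_factory=list)  # (relation label, Node) pairs


def compatible(a: Node, b: Node) -> bool:
    # Stand-in for the type and property-similarity test (assumption).
    return a.kind == b.kind and a.name.lower() == b.name.lower()


def join(a: Node, b: Node) -> Node:
    """Fuse two compatible nodes and, recursively, their compatible children."""
    fused = Node(a.name, a.kind)
    unmatched = list(b.links)
    for label, child in a.links:
        # Look for a child of b under the same relation that can be fused.
        idx = next(
            (k for k, (other_label, other) in enumerate(unmatched)
             if other_label == label and compatible(child, other)),
            None,
        )
        if idx is None:
            fused.links.append((label, child))  # incompatibility: stop here
        else:
            _, other = unmatched.pop(idx)
            fused.links.append((label, join(child, other)))
    fused.links.extend(unmatched)               # keep what only b knows
    return fused
```

In this sketch the fused graph keeps every sub-graph from both inputs, and recursion stops along a branch as soon as no compatible counterpart is found.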

The operational knowledge contained in the knowledge base consists of information extracted from the open sources. The basic entities available in the knowledge base are physical entities (e.g., persons, vehicles), legal entities (e.g., organizations), non-physical entities (e.g., phone number), and event entities (meeting, travel).

The knowledge base also contains various types of metadata, such as time dependencies, validity (or certainty), sensitivity, confidentiality, and provenance information (to be able to trace back to the sources).

To manage uncertainty, the RDF triples in the knowledge base were extended with an extra parameter \(\beta \) that, depending on the entity and type of uncertainty, may represent a probability distribution or a possibility distribution. Using that information, RDF triples stored in the knowledge base were represented as {(predicate, subject, object),\(\beta \)}.
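A minimal sketch of such an extended triple is given below. The class name `UncertainTriple`, the helper `confident` and the example identifiers are hypothetical, and \(\beta \) is reduced here to a single certainty degree in [0, 1], which is a simplification of the probability or possibility distributions mentioned in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainTriple:
    predicate: str
    subject: str
    obj: str
    beta: float  # assumption: a single certainty degree in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.beta <= 1.0:
            raise ValueError("beta must lie in [0, 1]")


def confident(triples, threshold):
    """Keep only the triples whose certainty reaches the threshold."""
    return [t for t in triples if t.beta >= threshold]


# Illustrative facts; the identifiers are invented for this example.
facts = [
    UncertainTriple("memberOf", "person:42", "org:7", 0.9),
    UncertainTriple("attended", "person:42", "event:3", 0.4),
]
reliable = confident(facts, 0.5)
```

Keeping the certainty next to each triple allows queries to filter or rank facts by how well supported they are.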

3.2 Social Network Analysis Component

Social Network Analysis (SNA) comprises the study of relations, ties, patterns of communication and behavioral performance within social groups. In SNA, a social network is commonly modeled by a graph composed of nodes and edges. The nodes in the graph represent social actors and the links represent the relationships or ties between them [5].

Since criminal organizations are also a form of social network, they can be represented as graphs in the user interface of VIRTUOSO. The nodes in the graph are the individual members of an organization and the links represent their known relationships. In general, the relationships may be the known connections between individuals (e.g. friendship) or they may represent the structure of command within an organization. These connections may be manually introduced in the network or extracted automatically from a document collection and stored in the knowledge base.

In SNA, multiple metrics have been proposed that aim at evaluating the importance of each of the nodes within a social network. One of the most important metrics in SNA is centrality [6, 7]. Centrality describes a member’s relative position or importance within the context of his/her social network.

One of the applications of the centrality measures that are commonly used in SNA is to discover key players [8]. Key players are those nodes in the network that are considered “important” with regard to some criterion, such as the number of their connections (i.e. the degree of a node), their importance regarding the diffusion of information, their influence on the network, etc.

To process social networks, the SNA component in VIRTUOSO employs some of the most popular centrality measures used in SNA such as degree, betweenness, closeness, and eigenvector centrality [9].
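Two of these measures can be sketched in a few lines. The code below is an illustrative, pure-Python version of degree and closeness centrality, not the SNA component's own code. It assumes an undirected, connected graph with at least two nodes, stored as a dictionary that maps each node to the set of its neighbours.

```python
from collections import deque


def degree_centrality(graph: dict) -> dict:
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {v: len(neighbours) / (n - 1) for v, neighbours in graph.items()}


def closeness_centrality(graph: dict) -> dict:
    """Inverse of the average shortest-path distance to the reachable nodes."""
    result = {}
    for source in graph:
        # Breadth-first search gives shortest distances in unweighted graphs.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# A star-shaped network: "A" is connected to everyone else.
star = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}
```

In the star network the hub "A" obtains the highest value under both measures, which matches the intuition of a key player.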

In each particular case, the expert analyst must decide which centrality measure should be applied. However, it is also possible that the analyst may be interested in evaluating the overall importance of a group of nodes according to several centrality criteria. In this case, VIRTUOSO will be able to calculate an aggregated centrality value using all or some of the centrality measures available in the SNA component.

The aggregation of centrality measures may also be useful when the analyst is not sure about which centrality measure should be used.

To perform the aggregation, the SNA component in VIRTUOSO employs an ordered weighted averaging (OWA) operator [10]. The OWA operator is defined as:

$$\begin{aligned} h_w(a_1,a_2,\ldots ,a_n)=w_1b_1+w_2b_2+\cdots +w_nb_n \end{aligned}$$
(2)

where \(w_i\in [0,1]\) are the weights, \(\sum _{i=1}^{n} w_i=1\), and \((b_1,b_2,\ldots ,b_n)\) is a permutation of the vector \((a_1,a_2,\ldots ,a_n)\) in which the elements are ordered so that \(b_i\ge b_j\) if \(i<j\) for any \(i,j\).

The weights used in the OWA operator are normally calculated by first stating the andness value that is expected from the operator. The andness value of the OWA operator is defined in terms of the weights of the operator as:
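Eq. (2) can be illustrated with a short sketch. The function name `owa` and the input checks are choices made for this example; the computation itself follows the definition, sorting the inputs in descending order before applying the weights.

```python
def owa(weights, values):
    """Ordered weighted aggregation of Eq. (2)."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)  # b_1 >= b_2 >= ... >= b_n
    return sum(w * b for w, b in zip(weights, ordered))
```

Because the weights are attached to positions in the sorted vector and not to particular inputs, placing all the weight on the first position returns the maximum and placing it on the last position returns the minimum.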

$$\begin{aligned} Andness(\mathbf w )=1-\alpha =1-\frac{1}{n-1}\sum \limits _{i=1}^{n}w_i(n-i), \alpha \in [0,1] \end{aligned}$$
(3)

where \(\alpha \) is the orness value. The andness value represents how closely the OWA operator behaves like a fuzzy AND operator, i.e. how close the resulting aggregation value is to the fuzzy AND (minimum) of all the centrality measures. For instance, with an andness value of 1 the weights of the OWA will be (0, ... ,0,1), which, given the descending order of the \(b_i\), will produce the minimum value in the aggregation. With equal weights \(1/n\), which give an andness value of 0.5, the OWA will calculate the average value of all its inputs.

When OWA operators are used, expert knowledge about a problem domain is used to decide the most appropriate andness value. In the prototype of the SNA component, weight values of (0.1, 0.15, 0.25, 0.5) were assigned by default to the OWA operator. These weights produced an andness value of 0.71. Hence, this default centrality measure produced by the OWA operator will be a value that lies between the average centrality of all the measures and the minimum value produced by all of them.
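The andness of a weight vector can be checked numerically with Eq. (3). In the sketch below the function names `orness` and `andness` are chosen for this example; applied to the default weights (0.1, 0.15, 0.25, 0.5), the computation gives approximately 0.717, consistent with the value of 0.71 reported above.

```python
def orness(weights):
    """Orness (alpha) of Eq. (3): weighted position sum divided by n - 1."""
    n = len(weights)
    return sum(w * (n - i) for i, w in enumerate(weights, start=1)) / (n - 1)


def andness(weights):
    """Andness of Eq. (3), defined as 1 - orness."""
    return 1.0 - orness(weights)


default_weights = (0.1, 0.15, 0.25, 0.5)
default_andness = andness(default_weights)
```

The same functions confirm the limiting cases: equal weights give an andness of 0.5, and placing all the weight on the last (smallest) input gives an andness of 1.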

One of the issues of OWA operators is that very different weight values may produce the same andness value. This is an important issue that must be considered when OWA operators are used in decision making systems and other applications. In decision making problems we normally want to aggregate all the input values in such a way that all of them contribute to the final decision, and not just a few of them or, in extreme cases, only one. Therefore, one desirable feature of an OWA operator is maximum dispersion in the weight values. The weights’ dispersion measures the degree to which every input contributes to the output of an OWA operator, and is defined as:

$$\begin{aligned} disp(\mathbf w )=-\sum \limits _{i=1}^{n}w_i \ln (w_i) \end{aligned}$$
(4)

In the case of the SNA component, the weight dispersion obtained by using the default weight values was \(1.20\). In general, finding the weight values for an OWA operator is treated as an optimization problem, in which we want to obtain the maximum weight dispersion for a specific andness value of interest in \([0,1]\), subject to the restriction that the sum of all weights should be 1. This optimization problem has been addressed by other aggregation operators. An example is the maximum entropy OWA (MEOWA) operator [11], which employs Lagrange multipliers to solve the constrained optimization problem.

Other operators, like the andness-directed multiplicative or implicative weighted aggregation (AMWA or AIWA) operators [12], attempt to aggregate the importance that each input has. All these operators have also been implemented in the SNA component in VIRTUOSO. Thus, the analyst can experiment by aggregating the different centrality measures with different operators to see their effect.
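The dispersion of Eq. (4) is the entropy of the weight vector and can be computed directly. In this sketch the function name `dispersion` is a choice made for the example, and zero weights are skipped, following the usual convention that \(0\ln 0=0\). For the default weights (0.1, 0.15, 0.25, 0.5) the computation gives about 1.21, in line with the reported value of 1.20.

```python
import math


def dispersion(weights):
    """Weight dispersion (entropy) of Eq. (4); terms with w = 0 contribute 0."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


default_dispersion = dispersion((0.1, 0.15, 0.25, 0.5))
uniform_dispersion = dispersion((0.25, 0.25, 0.25, 0.25))  # equals ln(4)
```

For \(n\) weights the dispersion is bounded above by \(\ln n\), which is reached by equal weights, and it is 0 when a single input receives all the weight.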

It must be noted that for some specific constant values of andness, it is possible to use some of the analytical expressions described in [13] to calculate the weights used in the OWA operator. As is described in [13], these expressions provide good dispersion values in the weights’ distribution. For instance, for an andness of \(2/3\approx 0.67\) we could use the following equation to calculate each of the \(n\) weights used in the OWA operator:

$$\begin{aligned} w_i=\frac{2i}{n(n+1)} \end{aligned}$$
(5)

Using this expression with \(n=4\), the OWA weights will be \((2\cdot 1/(4\cdot 5), 2\cdot 2/20, 2\cdot 3/20, 2\cdot 4/20)=(0.1,0.2,0.3,0.4)\), which have a weight dispersion of \(1.28\) and are close to the default weights of the SNA component, whose andness is 0.71.
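The closed-form weights of Eq. (5) are easy to verify numerically. The sketch below generates them for several values of \(n\) and checks, using the andness definition of Eq. (3), that they sum to 1 and yield an andness of 2/3. The function names are chosen for this example only.

```python
def linear_owa_weights(n: int) -> list:
    """Weights w_i = 2i / (n(n + 1)) of Eq. (5), for i = 1..n."""
    return [2 * i / (n * (n + 1)) for i in range(1, n + 1)]


def andness(weights) -> float:
    """Andness of Eq. (3): 1 minus the orness of the weight vector."""
    n = len(weights)
    orness = sum(w * (n - i) for i, w in enumerate(weights, start=1)) / (n - 1)
    return 1.0 - orness


weights_4 = linear_owa_weights(4)  # the four-measure case used in the text
checks = {n: (sum(linear_owa_weights(n)), andness(linear_owa_weights(n)))
          for n in (2, 4, 10)}
```

The andness of 2/3 holds for every \(n\ge 2\), because \(\sum _{i=1}^{n} i(n-i)=n(n+1)(n-1)/6\), which makes the orness exactly 1/3.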

The application of the OWA, MEOWA or AMWA operators allows the analyst to use all the centrality measures available at once, in such a way that each one contributes partially to the overall calculation of the centrality of every node in the network.

The SNA component can be used in two different ways in VIRTUOSO. One way is as a REST web service that receives HTTP POST requests containing the description of the social network to be analyzed. This description is provided in a standard format such as GraphML. The output of the service is the calculation of the desired centralities for each node in the network. These values are returned in JSON format.

It is also possible to use the SNA component functionality integrated as a portlet within the Weblab platform.
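The interaction with such a service might look like the sketch below. The endpoint URL, the content type and the shape of the JSON response are all hypothetical, since the chapter does not specify them; only the use of an HTTP POST request carrying a GraphML document and returning JSON comes from the text. The request is built but deliberately not sent.

```python
import json
import urllib.request

# A minimal GraphML document describing a two-node network.
GRAPHML = """<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="a"/>
    <node id="b"/>
    <edge source="a" target="b"/>
  </graph>
</graphml>
"""


def build_request(url: str, graphml: str) -> urllib.request.Request:
    """Prepare an HTTP POST request whose body is the GraphML document."""
    return urllib.request.Request(
        url,
        data=graphml.encode("utf-8"),
        headers={"Content-Type": "application/xml"},  # assumed content type
        method="POST",
    )


def parse_centralities(body: bytes) -> dict:
    """Decode a JSON response body; its structure is an assumption."""
    return json.loads(body.decode("utf-8"))


# Hypothetical endpoint, used only to show how the request is assembled.
request = build_request("http://localhost:8080/sna/centrality", GRAPHML)
# Sending is left commented out because the endpoint is hypothetical:
# with urllib.request.urlopen(request) as response:
#     scores = parse_centralities(response.read())
```

Separating request construction from response parsing keeps the sketch testable without a running service.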

4 Decision Support Systems and Cyber-Warfare

The service-oriented architecture (SOA) of VIRTUOSO makes it possible to reuse some of its software components to create decision support systems that could be applied in other domains such as cyber-warfare.

Recent studies have found that some cyber attacks have been performed after attackers correctly guessed the computer passwords of certain employees working in companies or government agencies. To do this, attackers analyze personal data posted by these employees in social media to learn about the employee’s social network connections [14].

The employee’s social network is then used to gain access to some of the acquaintances’ computers that are less secure. Once this is done, attackers may use a spear-phishing attack, which consists of sending emails to the employee from one of his/her colleagues’ or friends’ email accounts. The email sent may include computer code hidden in apparently harmless attachments that is used to guess the employee’s password in other computer systems or to determine the answer to certain security questions asked by systems with restricted access.

In this scenario, it may be possible to use VIRTUOSO’s DSS to gather data from social media about an employee and his/her social network. The social network obtained in this way can be analyzed to determine how much public data may be available about an employee and his/her acquaintances. Such data may be used to determine how vulnerable certain employees may be to spear-phishing attacks.

Another area where some of VIRTUOSO’s DSS components could be applied is that of cognitive models of decision making for cyber defense, specifically in designing and applying cyber security ontologies and scenario ontologies, in a way similar to that described in [14]. The knowledge base in VIRTUOSO may be used to store these ontologies, together with knowledge extracted from closed or open sources. This knowledge could help analysts to reason about possible cyber attacks.

5 Conclusion

VIRTUOSO provides a large collection of software components that help analysts to process and visualize a large collection of open source data of various types.

These components, together with the decision support system and its knowledge base, will help analysts to reason more easily about a particular scenario.

VIRTUOSO is a complex system consisting of many software components. We have described just a few of them and their features.

At the current stage, the decision support system and all the tools developed in VIRTUOSO have been tested on a few scenarios, and a final presentation on the results of the project has been given to the reviewers of the European Commission in charge of assessing the final results of the project.

At the time of writing this chapter, most of the software components available in VIRTUOSO are in a pre-release state.

One of the challenges of complex software systems like VIRTUOSO is to use them in the most effective way. VIRTUOSO requires a group of IT specialists, administrators and analysts who can manage the data sources, maintain the knowledge base, define the most relevant scenarios, and analyze the results provided by the system.

The final report on the VIRTUOSO project included some recommendations in this regard. However, the procedures employed and the type of organization that may be needed to use the whole system effectively must be tailored to fit each specific case.

As part of the project, it is planned that a demonstrator of the VIRTUOSO platform will be installed at the Aalborg University campus in Esbjerg, Denmark. When this happens, interested parties within the European Union will be allowed to use the demonstrator and experiment with the system to assess its functionality.