1 Introduction

Big data and big data processing are currently of concern to both science and business. Big data processing is focused on transaction-oriented, multimedia-intensive, and other tasks [1]. Data volumes are continuously increasing, which requires new or improved software applications to handle them. In this context, electronic information about objects and related processes constitutes the data gathered by management information systems [2, 3]. To achieve sustainable development of information technologies, a thorough study of innovations and their rational and phased application in various fields, including big data management, is needed [34]. The challenge of big data management originates from the increase of information and the subsequent rise in data processing requirements [4]. For example, many enterprise resource planning (ERP) systems are focused on providing information about all of a company's business processes to improve its overall efficiency at a lower cost, but these systems do not take the technical specifications of the product into account [5]. Normally, this type of data is managed by special management systems designed for specific product criteria and regulated by a general management system [6]. Besides, these systems carry a number of other functions that allow companies to properly manage and make use of product data. For instance, a web application for storing and managing product data in a decentralized repository allows easily distributing and updating product information that is reflected in a web catalog [7]. Apart from the increasing data volume, ERP systems face the challenge of complex information systems, whose multiple components make data analysis cost-intensive. An essential function of an organizational information system is to create and visualize reports, but converting data from several sources and different formats is a challenging task. The number of customer requests, wrong orders, and wrong deliveries may also increase [8]. Therefore, there is a need to implement syntactic and semantic constraints or defaults to avoid redundant statements or misinterpretations in big data.

In the landscape of database management systems, data analysis systems and transaction processing systems are managed separately, as they have different functionalities, characteristics, and requirements [9]. The variation in data complexity gives rise to another problem: the collection and storage of large volumes of product data. In order not to miss important information about the object, it is necessary to gather, process, and present big data in a form that is clear and easy to understand. Developing new methods for effective processing of big data is relevant to science and commerce, where data are integrated with business processes for user-centric management.

The purpose of this study was to develop a semantic approach towards big data processing. The approach in question is based on semantic methods selected with the help of mathematical statistics.

2 Background & Related Work

2.1 Big data challenges

In big data management, there are four challenges: volume, variety, velocity, and reliability [10]. Volume refers to the amount of data that needs to be processed. Variety covers different types of data, such as tabular data (databases), hierarchical data, documents, e-mail, metering data, video, images, audio, stock ticker data, financial transactions and more.

Velocity means how fast data is being produced and how fast it must be processed to meet stakeholder demand. Reliability measures the accuracy and consistency of data. It is important because data sets come from different sources and thus may not fully meet the required standards of integrity [11]. The challenges of big data can be simplified with an ontological approach.

2.2 Semantic and statistical technologies

Nowadays, approaches that combine semantic and statistical data processing are the most attractive ones [12]. These are hybrid computations that involve various algorithms handling the same data: numerical-statistical (for example, deep learning) and logical-structural (including semantic) algorithms. The need for hybrid computations has already been acknowledged: relying purely on statistical processing or purely on logical inference is unlikely to yield further progress in artificial intelligence [13]. Thus, both approaches should be applied. However, there are only a few examples of killer applications. Hybrid computations are most likely to be used in text processing and language comprehension, where statistical machine learning can be combined with the precision of rule handling for crucial nuances of meaning. An example is the polyglot persistence architecture, in which an application handles both schema-based and schema-less databases, applying transactional and relational approaches to a narrow range of problems [14]. Thus, the purpose of this research is to analyze and develop an approach to the mathematical modeling of ontology management.

2.3 Ontologies

Ontology, as a declarative model of a certain problem domain, is a central component of semantic-oriented intelligence. Problem domain complexity depends on the complexity of the corresponding ontology. Thus, the known top-level ontologies reflect a significant number of concepts: CYC about two million and WordNet about 207 thousand. Complexity entails significant ontology management problems. This problem class involves ontology creation, update, modification, visualization, and validation, as well as documentation of the origin of components.

Ontology management problems lead to deterioration in ontology quality. In Bassaler et al. [15], ontology quality is assessed by the fulfillment of requirements regarding its completeness, correctness, and stability. Mistakes made by an expert during the elaboration of a complex ontology lead to non-recognition of essential concepts and links in the problem domain, resulting in an incomplete and incorrect ontology.

Complex ontology management problems are studied in several directions. In particular, metrics and methods are being developed to measure ontology composition. In Hurwitz et al. [16], as with software complexity, ontology complexity is defined through the difficulty of performing such tasks as ontology development, reuse, and modification. Azarmi [17] proposes a meta-ontology called O2, which treats an ontology as a semiotic object. Based on this ontology, three groups of ontology complexity metrics are developed: structural metrics, functional metrics, and usability metrics. Gandomi and Haider [18] introduce metrics for a pre-normalized ontology. Ontology normalization includes such steps as class (fact) naming, inheritance hierarchy materialization, name unification, and attribute normalization. Such normalization has the purpose of converting various ontologies into a semantically equivalent form to create semantic complexity metrics.

Ontology visualization tools are being developed to increase the efficiency of ontology management experts. They are based on combinations of text, tabular, diagram, and graph data mapping [19]. An important ontology management problem is to track the origin of ontology components and facts. Its solution is required to validate and ensure the correctness of the ontology, since the problem domain is changing. Thus, one has to track the dependencies between ontology components, facts available from the information base, and the corresponding domain objects. Currently, four levels of origin are established [20]: static (constant data), dynamic (variable data), fuzzy (the origin of these data is by nature very fuzzy and unclear), and expert (expert analysis is required). The author of [21] puts forward the idea of tracking the origin of facts by recording the history of their changes and describing the events that caused them. Historically, problem ontologies were introduced as a result of the development of problem analysis. Problem analysis methods are used to determine and formalize all factors used by an expert solving a problem. Such methods are widely used to design computer program interfaces, in expert systems, and in decision support systems [22]. In this case, the major purpose is to analyze and specify problem components and determine their structure and limitations.

Unlike other types of ontologies (general and domain ontologies), problem ontologies are created separately for similar problem classes, together with a formalized concept of the related goal. Problem ontology research is closely related to conceptual modeling, as a formalized conceptual model of the problem ontology is designed during its creation [23]. Both conceptual and ontological problem modeling share one important aspect: the interaction with a domain expert who creates and validates the ontology. Ontology research has produced modeling environments that allow creating and implementing ontological models for individual problems. Currently, the major research in the field of ontological modeling is devoted to declarative ontologies, i.e., general and domain ontologies [24]. The problem ontology direction is not sufficiently developed. On the other hand, existing research in the field of problem ontologies considers ontology creation for individual problem classes. This approach restricts ontology transferability and reuse for solving problems in other problem domains, since the same entities will be interpreted differently by different problem ontologies. We will refer to problem ontologies based on a particular general ontology as ontological models, in order to show this dependence and avoid ambiguities. Ontological models make it possible to simplify the solution of complex ontology management problems. The purpose of this research is to find ways for ontological models to simplify complex ontology management and improve ontology quality.

3 Big data modelling techniques

This section introduces a conceptual mathematical framework for ETL (Extract-transform-load) processes that is built upon semantic technologies.

3.1 Mathematical representation

Here is one of the possible ways to formalize ETL in terms of applied mathematics. Initially, let us consider the widely used type of functional dependency:

$$ y\ (t)=f\ \left(x(t),t\right), $$
(1)

where x is an n×1 vector, y and f are m×1 vectors, f is a known vector function, and t is time. The variable x represents the recoverable time-dependent input data, f is the transformation process function, and y is the loaded output data.
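To make this functional view of ETL concrete, the following minimal Python sketch treats extraction, transformation, and loading as the composition y(t) = f(x(t), t). The `extract` and `transform` functions are illustrative placeholders, not part of the formalism itself.

```python
import math
from typing import List

def extract(t: float) -> List[float]:
    """Toy extractor: recovers the input vector x(t) for a given moment t
    (a synthetic 2-component signal stands in for a real data source)."""
    return [math.sin(t), math.cos(t)]

def transform(x: List[float], t: float) -> List[float]:
    """Transformation f(x(t), t): maps the n x 1 input vector to the m x 1 output."""
    return [x[0] + x[1], t * x[0]]

def load(t: float) -> List[float]:
    """Load step: y(t) = f(x(t), t), computed on demand for any t."""
    return transform(extract(t), t)

print(load(0.5))   # output vector y(0.5)
```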

Next, let us consider the system of differential equations.

$$ \dot{x}=f\left(x(t),u\left(x,t\right),t\right) $$
(2)

where the initial condition x(t0) = x0 holds on the time interval [t0, t1]; x(t0) is the input data and x is the output data. In this system, u(x, t) is controlled by a computer or by a control unit included in the original system.
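As an illustration of system (2), the sketch below integrates dx/dt = f(x, u(x, t), t) on [t0, t1] with an explicit Euler step; the dynamics f and the control law u are hypothetical placeholders chosen only to show the structure.

```python
from typing import Callable, List

def simulate(f: Callable[[List[float], List[float], float], List[float]],
             u: Callable[[List[float], float], List[float]],
             x0: List[float], t0: float, t1: float, dt: float = 0.01) -> List[float]:
    """Explicit Euler integration of dx/dt = f(x, u(x, t), t) with x(t0) = x0."""
    x, t = list(x0), t0
    while t < t1:
        dx = f(x, u(x, t), t)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
        t += dt
    return x

# Placeholder dynamics and a constant set-point control (illustrative only).
f = lambda x, u, t: [u[0] - x[0]]
u = lambda x, t: [1.0]
print(simulate(f, u, [0.0], 0.0, 5.0))   # x(t1) approaches 1.0
```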

The system equilibrium can be described by linear and nonlinear algebraic equations

$$ f\ \left(x(t),t\right)=0. $$
(3)

In other words, these are the operating modes of the controlled objects (system). An arbitrary functional series can be expressed as:

$$ y(t)={\sum}_{i=0}^{\infty }{x}_i(t) $$
(4)

It is known that many continuous functions are described by such functional series. For example, the sine and cosine are expressed through power series. In turn, the components of a power series can be found using the interpolation formulas introduced by Lagrange, Newton, and others.

Thus, both input and output data can be calculated for a given moment of time and there is no need to store them digitally.
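A small sketch of this idea, assuming the data are reconstructed on demand from a few stored nodes by Lagrange interpolation rather than kept as a full digital record:

```python
from typing import List, Tuple

def lagrange(points: List[Tuple[float, float]], t: float) -> float:
    """Evaluate the Lagrange interpolation polynomial through the given
    (t_i, y_i) nodes at time t, so y(t) is computed instead of stored."""
    total = 0.0
    for i, (ti, yi) in enumerate(points):
        li = 1.0
        for j, (tj, _) in enumerate(points):
            if j != i:
                li *= (t - tj) / (ti - tj)
        total += yi * li
    return total

# A few stored nodes of y(t) = t**2; every other value is reconstructed on demand.
nodes = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]
print(lagrange(nodes, 1.5))   # 2.25, recovered without storing y(1.5)
```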

The basic operation intended for text data is the extraction of useful information, without converting the original data, by index terms, template, or mask.

3.2 Information modelling

Any computer system transforms (converts) the information. Such a system has an input through which it receives information to be processed, and an output, which provides the output information generated by the computer system in response to the relevant input information.

In a functional sense, human intelligence works similarly to a digital computing system. Both systems work with a finite set of multi-dimensional information.

Information modelling can be carried out by means of classic alphabetical operators when the following two features are not important: 1) their infinite domain; 2) the restriction that the input and output languages of a classic alphabetical operator may only contain words of equal length. If we introduce finite dimensionality into the definition of an alphabetical operator, we obtain the concept of a finite alphabetical operator. At this point, the input and output languages may include words of different lengths, thereby complicating the mathematical language for recording such operators.

Formal description of natural and artificial intelligence systems requires mathematical tools that provide a convenient record for any finite alphabetical operator. Based on these considerations, the algebra of finite predicates has been developed [25]. The definition of a finite predicate is as follows [25]:

Let us assume that A is a finite alphabet containing k letters a1, a2, …, ak, and Σ is a set consisting of two elements, designated by the symbols 0 and 1 and called false and true, respectively. A variable over the set A is a literal variable, while a variable over the set Σ is a logical variable. A finite predicate over the alphabet A is any function f(x1, x2, …, xn) = t with n literal arguments x1, x2, …, xn over the set A, which takes logical values t ∈ Σ.

As can be seen from this definition, the values of finite predicate variables are letters, unlike the values of variables of finite alphabetical operators, which are words. The switch to alphabetical variables makes it possible to develop a convenient mathematical language for describing various intelligent systems.
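For illustration only, the following toy example encodes a two-place finite predicate over a small alphabet A; the alphabet and the predicate itself are hypothetical.

```python
# A minimal illustration of a finite predicate over the alphabet A:
# a two-place predicate f(x1, x2) over A that takes logical values in {0, 1}.
A = {"a1", "a2", "a3"}

def f(x1: str, x2: str) -> int:
    """Finite 2-place predicate: true (1) exactly when both letters coincide."""
    assert x1 in A and x2 in A, "arguments must be letters of the alphabet A"
    return int(x1 == x2)

print(f("a1", "a1"), f("a1", "a2"))   # 1 0
```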

3.3 Data mining

The basic concept of text data mining methods centers around the similarity of objects and its quantitative measure. The key techniques are the following.

The first method is term-based. It is computationally efficient and involves seeking words in a document that carry semantic meaning. This technique, however, suffers from polysemy and synonymy [36], where polysemy means a single word having multiple meanings and synonymy refers to multiple words having the same meaning. The next popular method is the phrase-based technique. Since a phrase carries more meaning and is less ambiguous, this method performs better than the former one. However, it also has disadvantages, such as statistical properties inferior to those of terms, low frequency of occurrence, and the presence of excess information unrelated to queries. The more sophisticated methods are the concept-based and pattern taxonomy methods. The former is based on sentence- and document-level analysis [37] and rests on three components: semantic analysis of a sentence, building of a conceptual ontological graph, and concept extraction. The concept-based method allows differentiating between important and unimportant words and is widely used in natural language processing. The pattern taxonomy method involves patterns with 'is-a' relations between them [38]. It can be effective and accurate if patterns, such as signal images, are correctly selected. A signal image is a set of primary features, i.e., the results of direct measurements or observations. The signal image, or the secondary characteristics derived from it, serves as the initial data used to take one of the possible decisions regarding the object, for example, regarding its membership in one of the specified classes. There are logical recognition methods, in which information is processed according to a well-defined algorithm to extract valuable information, and intuitive recognition methods, in which valuable information is generated.
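As a minimal illustration of the term-based technique and its synonymy weakness, the sketch below compares two short news fragments by cosine similarity over raw term counts; the documents and the tokenization are illustrative.

```python
from collections import Counter
import math

def term_vector(text: str) -> Counter:
    """Term-based representation: a bag of lower-cased word tokens."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

d1 = term_vector("oil prices rise on strong demand")
d2 = term_vector("crude costs climb as demand grows")
# Synonymy ("oil"/"crude", "rise"/"climb") keeps the score low even though
# both fragments describe the same event; concept-based methods address this.
print(round(cosine(d1, d2), 3))
```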

Semantic analysis plays an important role in the logical recognition methods [26]: it is a set of operations that support the comprehension of a natural sign system (pictures, phenomena, or texts), presented as a record, by means of some formalized semantic language. This approach makes it possible to define a new problem: studying the impact of external factors on the price strategy of an enterprise. Its structure includes two sub-problems: determining the factors (market events) and obtaining association rules for a specified sector within a specified time limit. Association rules describe the relationships among factors occurring in a specified segment at a certain moment or period.

The first problem can be formulated as building a syntactic model of Internet news analysis and identifying a unique market event by clustering, based on the metric proximity of two news blocks.

The second sub-problem implies obtaining association rules. This new approach is based on the idea that online news can be viewed as a container of marketing data, which includes various external factors. Based on these factors, combined with the traditionally collected internal data of the enterprise, one can create a set of rules that specify, for example, the predicted values of indicators. In this case, the first problem is thus to identify market events that are significant for decision-making.

Thus, based on morphological and syntactic analysis [27], we have formulated an original approach to identifying such market events. This approach is applied through a plurality of syntactic patterns obtained from the domain ontology. These models take into account the categories of market events (external factors): consumption and demand, competitor's profile, inflation, international prices, R&D, consumer profile, consumer psychology, etc.

4 The process of model development

Building the ontology is a mandatory step; it allows generating a plurality of semantic fields, syntactic patterns, and tokens in accordance with the subject area (a certain specified market). For example, the ontology of events that have occurred in a raw material market with elastic demand (Fig. 1) allows building a syntactic model that describes the competitor's profile category. The model includes a set of syntactic patterns made for phrases, divided into verbal and noun phrases, as well as many tokens based on morphological analysis.

Fig. 1 News ontology fragment

Phrasal text elements are handled with regard to their grammar. The initial elements can be identified within a sentence. Suffixes and tokens are the central units of the analysis. In order to extract data from all news flows, similar models (grammar rules) should be formed for each news category.

We suggest a two-step processing of online news for identifying random external factors. The first step is classification, performed through syntactic and morphological analysis with regard to the M, E, and G sets. As a result, we obtain the values of the event category vector \( \overrightarrow{c} \). Each element ck, k = 1, …, K of this vector takes the value 1 if the news item refers to the k-th category and 0 otherwise.
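A simplified sketch of this first step, assuming the syntactic patterns of each category are approximated by keyword regular expressions; the categories and patterns below are illustrative stand-ins for those generated from the ontology of Fig. 1.

```python
import re
from typing import Dict, List

# Illustrative keyword patterns per event category (stand-ins for the
# syntactic patterns derived from the domain ontology).
CATEGORY_PATTERNS: Dict[str, List[str]] = {
    "competitor_profile": [r"\bcompetitor\b", r"\brival\b"],
    "inflation": [r"\binflation\b", r"\bconsumer price\b"],
    "consumption_demand": [r"\bdemand\b", r"\bconsumption\b"],
}

def category_vector(news_text: str) -> List[int]:
    """Return the event category vector c: c_k = 1 if the news matches
    the k-th category's patterns, otherwise 0."""
    text = news_text.lower()
    return [int(any(re.search(p, text) for p in patterns))
            for patterns in CATEGORY_PATTERNS.values()]

print(category_vector("Rival producer cuts prices amid weak demand"))  # [1, 0, 1]
```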

The second phase implies the allocation of similar event clusters, which allows avoiding duplicates and story chains, thus obtaining a stream of high-quality events. One can determine whether two duplicate news items should be unified into one event by tracking the coincidence of the coordinates \( {\overrightarrow{c}}^{\prime } \) of one news item with the coordinates \( {\overrightarrow{c}}^{{\prime\prime} } \) of the other. Their release dates should differ by no more than the threshold value dt that is sufficient for reflecting market dynamics. In other words, the inequality |d′ − d″| ≤ dt should hold, where d′ and d″ are the dates of the first and second news releases. The proximity of two news items by the tokens taken from M, E, and G is assessed by the following formula:

$$ {F}_p=\left[1+{\sum}_{i=1}^I{\left({m_i}^{\prime }-{m_i}^{\prime \prime}\right)}^2\right]\left[1+{\sum}_{j=1}^J{\left({e_j}^{\prime }-{e_j}^{\prime \prime}\right)}^2\right]\left[1+{\sum}_{h=1}^H{\left({g_h}^{\prime }-{g_h}^{\prime \prime}\right)}^2\right] $$
(5)

where mi′ and mi″ are the coordinates of the vectors \( {\overrightarrow{m}}^{\prime } \) and \( {\overrightarrow{m}}^{{\prime\prime} } \) formed in relation to the linearly ordered token set \( \overset{\sim }{M}={M}^1\cup {M}^2 \) of both releases, with dimensionality \( l=\mid \overset{\sim }{M}\mid \); M1 and M2 are the unordered token sets of each release. For the first news item, the coordinates of \( {\overrightarrow{m}}^{\prime } \) take the following values:

$$ {m}_i^{\prime }=\left\{\begin{array}{c}1,\kern0.75em {l}_i^M\in {M}^1,\\ {}0,\kern0.75em {l}_i^M\notin {M}^1\end{array}\right\}, $$
(6)

where \( {l}_i^M \) is a token from the set M of all possible market-related tokens formed during the study of the news domain. In this regard, ∣M∣ > l.

The vector \( {\overrightarrow{m}}^{{\prime\prime} } \) for the second news release is formed in a similar way. The coordinate vectors \( {\overrightarrow{e}}^{\prime } \) and \( {\overrightarrow{e}}^{{\prime\prime} } \) with coordinates ej′ and ej″ are formed in relation to the tokens \( {l}_j^E \) taken from the set E, containing all possible counter-agent tokens, pursuant to the linearly ordered set \( \overset{\sim }{E}={E}^1\cup {E}^2 \), where E1 and E2 are the unordered sets of counter-agent tokens, with dimensionality \( j=\mid \overset{\sim }{E}\mid \). In this regard, ∣E∣ > j. The news market geography vectors \( {\overrightarrow{g}}^{\prime },{\overrightarrow{g}}^{{\prime\prime} } \) are formed with the coordinates gh′ and gh″ in relation to the tokens \( {l}_h^G \) taken from the set G, containing all possible event geography tokens, pursuant to the linearly ordered set \( \overset{\sim }{G}={G}^1\cup {G}^2 \), where G1 and G2 are the unordered sets of event geography tokens, with dimensionality \( H=\mid \overset{\sim }{G}\mid \). In this regard, ∣G∣ > H.

In most cases, the token sets N1 related to the first news release, \( {N}_1^1={M}^1\cup {G}^1\cup {E}^1 \), and to the second one, \( {N}_1^2={M}^2\cup {G}^2\cup {E}^2 \), will be different. Therefore, there will be no full coincidence (when formula (5) equals 1). This problem can be solved either by an expert assessment of news unification thresholds or by calculating a limit value analytically.
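The following sketch shows how formulas (5) and (6) can be computed for two releases, assuming the market (M), counter-agent (E), and geography (G) tokens of each release are already extracted; the token sets are illustrative.

```python
from typing import List, Set

def indicator(ordered_tokens: List[str], release_tokens: Set[str]) -> List[int]:
    """Coordinates per Eq. (6): 1 if a token of the linearly ordered union
    belongs to the release's token set, 0 otherwise."""
    return [int(t in release_tokens) for t in ordered_tokens]

def proximity(sets1: List[Set[str]], sets2: List[Set[str]]) -> float:
    """F_p per Eq. (5): product of (1 + squared distance) over the market (M),
    counter-agent (E) and geography (G) token groups."""
    fp = 1.0
    for s1, s2 in zip(sets1, sets2):
        ordered = sorted(s1 | s2)                     # linearly ordered union
        v1, v2 = indicator(ordered, s1), indicator(ordered, s2)
        fp *= 1 + sum((a - b) ** 2 for a, b in zip(v1, v2))
    return fp

m1, e1, g1 = {"oil", "price"}, {"acme"}, {"eu"}
m2, e2, g2 = {"oil", "price"}, {"acme"}, {"eu", "asia"}
print(proximity([m1, e1, g1], [m2, e2, g2]))   # 2.0: identical except one geography token
```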

We suggest calculating the permissible error analytically, as this allows assessing the news similarity by the initial predicate. Such a calculation is based on a set of additional tokens: tokens that are included in the news but do not describe the event itself; they only specify its features, in particular its time-related aspects, character, impact, etc. The further calculation introduces an intermediate proximity assessment:

$$ {F}_{\alpha }=\left\{\begin{array}{c}\mid \left({N}_3^1\cup {N}_3^2\right)\backslash \left({N}_3^1\cap {N}_3^2\right)\mid, \kern0.75em \mid \left({N}_3^1\cup {N}_3^2\right)\backslash \left({N}_3^1\cap {N}_3^2\right)\mid \ne 0;\\ {}1,\kern0.75em \mid \left({N}_3^1\cup {N}_3^2\right)\backslash \left({N}_3^1\cap {N}_3^2\right)\mid =0,\end{array}\right\}, $$
(7)

where \( {N}_3^1 \) and \( {N}_3^2 \) are the sets of additional tokens of the first and second news releases, respectively. The coefficient α is obtained from (7):

$$ \alpha =\mid {N}_3^1\cup {N}_3^2\mid /{F}_{\alpha }. $$
(8)

The coefficient indicates the news proximity by non-essential tokens: α increases as the news proximity does. The semantic meaning of coefficient (8) is that its increase points to a higher probability that the released news items describe the same event. This probability arises from the links between words in natural languages. Therefore, the first and second predicates impose weaker requirements on proximity.

The predicate Fp ≤ α makes it impossible to compensate for the divergence of Fp through the growth of α, as the growth rate of Fp is much higher than that of α; besides, α is a finite quantity and takes values \( \alpha =\left[1,|{N}_3^1\cup {N}_3^2|\right] \). The higher growth rate of Fp indicates the higher specific weight of Fp over α. Therefore, the N1 tokens have a higher specific weight than the N3 tokens.

Table 1 shows the behavior of the components of Fp ≤ α for the case when α = [1, 10], where np is the number of differences in word tokens, namely \( \mid \left({N}_1^1\cup {N}_1^2\right)\backslash \left({N}_1^1\cap {N}_1^2\right)\mid \), and nα is the number of coincidences, namely \( \mid {N}_3^1\cap {N}_3^2\mid \).

Table 1 The values of Fp and α

The precise assessment of the news proximity degree requires the introduction of the sets N2 with restored tokens. The sets \( {N}_2^1 \) and \( {N}_2^2 \) (for the first and second news releases, respectively) are formed with regard to \( {N}_1^1 \) and \( {N}_1^2 \), but with newly collected domain information added. For example, one can consider adding data on the products and geographic markets of agents. Thus, a secondary condition for assessing the news proximity degree is introduced upon the reconstructed token vectors, namely Fs ≤ αFp; its left part is calculated according to a formula similar to (5), but based on the sets \( {N}_2^1 \) and \( {N}_2^2 \):

$$ {F}_s=\left[1+{\sum}_{i=1}^I{\left({m_i}^{\prime }-{m_i}^{\prime \prime}\right)}^2\right]\left[1+{\sum}_{j=1}^J{\left({e_j}^{\prime }-{e_j}^{\prime \prime}\right)}^2\right]\left[1+{\sum}_{h=1}^H{\left({g_h}^{\prime }-{g_h}^{\prime \prime}\right)}^2\right] $$
(9)

Formulas (5), (8), and (9) are united into a single complex formula, based on the prerequisites related to the coincidence of event categories and news proximity within dt:

$$ F=\left\{\begin{array}{c}{\overrightarrow{c}}^{\prime }={\overrightarrow{c}}^{{\prime\prime} };\\ {}\mid {d}^{\prime }-{d}^{{\prime\prime}}\mid \le {d}_t;\\ {}{F}_p\le \alpha; \\ {}{F}_s\le \alpha {F}_p\end{array}\right\} $$
(10)

The predicate (10) is interpreted in the following way: Fp ≤ α means that the news proximity is first evaluated by the N1 token set, and by the other token sets only when the first evaluation has indicated high proximity. Fs ≤ αFp indicates that the difference between the proximity estimates obtained by the N1 and N2 token sets should be within the error limit α.

Although the growth rates of Fs and Fp are similar, the inequality Fs ≤ αFp indicates a higher specific weight of Fp over Fs, namely, a higher specific weight of the N1 token sets than of the N2 sets, when combining two news items into a cluster.
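A compact sketch of evaluating predicate (10), assuming Fp and Fs have already been computed (for example, with the proximity sketch above) and that the additional token sets N3 of the two releases are given; all concrete values are illustrative.

```python
from typing import List, Set

def alpha(extra1: Set[str], extra2: Set[str]) -> float:
    """Permissible error per Eqs. (7)-(8), computed from the additional
    (non-essential) token sets N3 of the two releases."""
    union, common = extra1 | extra2, extra1 & extra2
    f_alpha = len(union - common) or 1          # F_alpha = 1 when the sets coincide
    return len(union) / f_alpha

def same_event(c1: List[int], c2: List[int], d1: int, d2: int, dt: int,
               fp: float, fs: float, extra1: Set[str], extra2: Set[str]) -> bool:
    """Predicate (10): equal category vectors, release dates within dt,
    Fp <= alpha and Fs <= alpha * Fp."""
    a = alpha(extra1, extra2)
    return c1 == c2 and abs(d1 - d2) <= dt and fp <= a and fs <= a * fp

# Two releases in the same category, one day apart, close by tokens.
print(same_event([1, 0, 1], [1, 0, 1], d1=10, d2=11, dt=2,
                 fp=2.0, fs=3.0, extra1={"today", "sharp"}, extra2={"today"}))
```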

The news flow, obtained after classification and clustering, will have almost one-to-one correspondence with the real events that gave rise to relevant news.

In order to forecast, one has to develop a set of association rules for the received flow of events. At this point, let us introduce additional notation. Let us consider that a market event \( {Y}_i^{\tau } \), i = 1, 2, …, happened during the time segment τ. This event caused changes, for example, in price. We denote an additional event \( {Y}_0^{\tau } \), which reflects the direction of the change:

$$ {Y}_0^{\tau }=\left\{\begin{array}{c}+1,{p}^{\tau +1}-{p}^{\tau -1}>\tilde{p}_{m};\\ {}-1,{p}^{\tau -1}-{p}^{\tau +1}>\tilde{p}_{m};\\ {}0,\mid {p}^{\tau +1}-{p}^{\tau -1}\mid \le \tilde{p}_{m};\end{array}\right\} $$
(11)

where \( \tilde{p}_{m} \) is the minimum fluctuation threshold of the projected indicator for a specific market, \( \tilde{p}_{m}>0 \).
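Formula (11) amounts to a thresholded sign of the price change, as in the following sketch (the threshold value is illustrative):

```python
def change_direction(p_next: float, p_prev: float, p_min: float) -> int:
    """Y_0 per Eq. (11): +1 for a rise above the threshold p_min,
    -1 for a fall below it, 0 when the fluctuation is within the threshold."""
    diff = p_next - p_prev
    if diff > p_min:
        return 1
    if -diff > p_min:
        return -1
    return 0

print(change_direction(105.0, 100.0, p_min=2.0))   # +1: price rose by more than 2
```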

In this case, the problem of change forecasting is interpreted as a problem of finding the sequence of specific market events:

$$ {Y}_i^{\tau }{Y}_j^{\tau}\to {Y}_k^{\tau}\to {Y}_0^{\tau }. $$
(12)

The sequence (12) is called a rule. Rule (12) shows that after the simultaneous occurrence of events \( {Y}_i^{\tau } \) and \( {Y}_j^{\tau } \), event \( {Y}_k^{\tau } \) occurs, which leads to event \( {Y}_0^{\tau } \) according to formula (11), where i, j, k ∈ Z. Rules of this type can be built by means of the SPADE algorithm. Then, one can obtain the final value of a projected indicator according to the association rules, based on the online news analysis, which is carried out by identifying the current market conditions and the rules relevant for this (current) situation.
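In this study the rules (12) are mined with the SPADE algorithm; the sketch below is not SPADE itself but a naive illustration of the underlying idea: counting how often a pair of market events in a segment is accompanied by a particular direction event Y0.

```python
from collections import Counter
from typing import List

def count_rules(event_log: List[List[str]]) -> Counter:
    """Naive illustration of rule extraction: for each time segment, count how
    often an unordered event pair co-occurs with the subsequent direction event
    Y0 (encoded here as the last element of the segment)."""
    counts: Counter = Counter()
    for segment in event_log:
        *events, y0 = segment
        for i in range(len(events)):
            for j in range(i + 1, len(events)):
                counts[(events[i], events[j], y0)] += 1
    return counts

log = [["competitor_cut", "demand_up", "+1"],
       ["competitor_cut", "demand_up", "+1"],
       ["inflation_up", "demand_up", "-1"]]
print(count_rules(log).most_common(1))  # [(('competitor_cut', 'demand_up', '+1'), 2)]
```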

5 Results and discussion

Thus, the problems of big data processing can be solved by an approach based on problem decomposition. We suggest distinguishing two problems: status identification and the search for association rules. They were solved to illustrate how market analysis is performed by processing an array of online news related to a specific topic. The prospects of this approach are the following. First, status identification remains relevant regardless of the problem domain; this problem can be solved using artificial intelligence tools. Second, the search for regularities in large arrays of accumulated data allows collecting additional information for decision-making.

Let us denote the set of rules of type (12) as R. Each rule in R corresponds to a sequence (or set) of market events that occur before a change in the controlled parameter (price). Based on these rules, the identified factors may affect the final value of a projected indicator within a specified market segment. In this case, we can identify a market situation leading to a predetermined value with the corresponding rule. In this context, two attributes correspond to each obtained rule: s – supportability, characterizing the absolute frequency of the rule in the original sample; and c – accuracy, namely the probability of a change in the value against the background of the emerging set of events described by the rule.

Supportability and accuracy are two important measures with the following definitions. Supportability is the number or percentage of transactions containing a specific set of items; it reflects the frequency of item combinations. Practical problems, especially those related to customer data processing, are solved by identifying a minimum support for the association rules. Thus, a set is of interest if its supportability exceeds the user-defined minimum. The accuracy of an association rule defines the probability that a chain of certain events will occur: it reflects the percentage of transactions containing the rule's antecedent that also contain its consequent.
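A minimal sketch of both measures over a toy transaction list (the item names are illustrative):

```python
from typing import List, Set

def support(transactions: List[Set[str]], itemset: Set[str]) -> float:
    """Supportability: share of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions: List[Set[str]], antecedent: Set[str], consequent: Set[str]) -> float:
    """Accuracy of the rule antecedent -> consequent:
    P(consequent | antecedent) estimated over the transactions."""
    s_ant = support(transactions, antecedent)
    return support(transactions, antecedent | consequent) / s_ant if s_ant else 0.0

txns = [{"demand_up", "price_up"}, {"demand_up", "price_up"},
        {"demand_up"}, {"inflation_up", "price_up"}]
print(support(txns, {"demand_up", "price_up"}))        # 0.5
print(confidence(txns, {"demand_up"}, {"price_up"}))   # ~0.667
```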

However, this approach brings up an uncertainty problem at the stage of identifying the current situation and choosing rules, since association rules are characterized by different degrees of accuracy and supportability. A rule can have very high supportability (an obvious rule) or, in contrast, very low supportability (a non-obvious rule). Consequently, the forecast quality depends on the identification method.

The introduced technology has been applied in the context of price strategy development. The sample of price values amounted to 800 items and the sample of Internet news to 2700 items; the first 600 price values and the corresponding 2100 news items were taken as the training set. Two test samples were formed from the remaining values. The experiments allowed evaluating the effectiveness of the introduced technology. The forecasting quality was assessed using the built models with a minimized value of the likelihood function. The association rules were obtained through the SPADE algorithm. In order to assess the accuracy of the introduced forecasting method, we compared it with relevant methods (Table 2).

Table 2 Experimental error values

Data analysis shows that forecasts made using association rules are 6% more accurate than forecasts made using conventional methods. The greater accuracy is achieved because forecasts based on association rules account directly for the events that affect prices in the predicted value, whereas regression methods account for them only indirectly.

A number of related works showed that semantic technologies are effective in big data processing. It was shown that eClass ontologies could be integrated if special dictionaries were used [28]. Some authors showed how a big data management system can be improved with a semantic web technology [29]. This technology, however, did not allow an efficient and scalable data system. A semantic approach was applied to create the architecture of the Aletheia system, which enabled the integration of structured and unstructured information [30]. A semantic model ensures data exchange and conversion through an Internet service hub. Using an integration-oriented approach and GoodRelations ontologies [31] makes it possible to convert data from the BMEcat format. This allows improving the quality of data and restoring some information. These methods are also in good agreement with the multistage mining approach applied in this study.

The systematic analysis was applied to various tools necessary for integrating data from different sources [32, 33]. It was found that the most effective solution is to use similarity tools, which can be aimed at matching terms, graphs, etc. These technologies allow an automatic control over big data processing and integration. The relationship between big data trends and environmental issues was addressed in another review paper [35], which threw light onto an innovative approach to big data management in the field of energy-efficient green technologies. The approach in point may provide promising results. Unlike the association rule-based method used in this study, the above approach may be less effective if applied to complex systems.

6 Conclusions

Research results indicate that the future of data science and big data analytics lies in the latest achievements of applied mathematics applied to data processing modelling, regardless of data volume. In other words, semantic methods, mathematical statistics, and vectorization, if applied together under a fact-oriented approach to data, produce information of practical value. The introduced approach is successful because of the well-developed methods of mathematical statistics. The representation-changing technique is useful for rapid data processing; under a semantic approach, it can also be applied to non-homogeneous and unstructured data to generate information suitable for handling such data. The fact-oriented approach and the methods of hybrid data processing are at the starting point of their history. The main data mining models considered in this article show that the problem can be solved in several stages, the most important of which include data identification and the search for regularities (rules). Text data, such as online news, can be identified by means of semantic models that support text proximity assessment not only by the coincidence of certain words but also by their semantics. Therefore, duplicates are avoided and the data are clean for further processing. In this research, the second stage of data processing involves building association rules that identify the chains of events that significantly affect the analyzed indicator. Based on the analysis and forecast of market prices, this research shows that the proposed approach improves the forecast accuracy by 6%. We hope that our approach will be useful in this field.