1 Relevance of the Study

Changes in the opinions and moods of social network users are a good indicator of real processes taking place in society. Such changes lead to shifts in trends, and in social processes these shifts are latent. Moreover, they begin to appear in social networks much earlier than they become noticeable in non-network structures. To detect such phenomena, one can analyze, in real time, the large volumes of textual information generated daily by social network users. This is a complex scientific and technical problem.

Social processes can be investigated using macrostatistics and sociological surveys. However, this approach is very laborious (preparing special questionnaires with a large number of questions, forming a representative sample, conducting the survey itself, processing the data, etc.), which significantly reduces the reliability of the data obtained.

One way to monitor and predict the development of public sentiment is to study the activity of users who leave comments on blogs and news resources on various socially significant topics.

Using text analytics tools that cluster texts on selected topics, together with tools for collecting open data from social networks and online news resources, one can determine the mood of users and build graphs of their connections within selected thematic groups. Each such graph has its own set of characteristics (network density, average mediation coefficient, average clustering coefficient, elasticity coefficient, and others), which can change from day to day, thereby forming a multidimensional time series.

2 Setting a Study Objective

In this article, we propose a vector representation for describing the network of comments. The elements of the vectors are the values of the network parameters (density, the average value of the mediation coefficient, the average value of the clustering coefficient, elasticity, and others), as well as characteristics such as the share of users who can be attributed, based on text analysis, to one of four groups:

  • loyalist (unconditionally supports the actions of the government and authorities);

  • oppositionist;

  • troll (a user who uses a resource to provoke a scandal, anger others, and enjoy it);

  • undecided or neutral user.

Desired or undesired states of the social network as a whole can be specified by means of base vectors (we discuss this later in the article).

The time variation of the distance between the base vector and the current state vector can be considered as a “wandering point” on the interval \([L_{min}, L_{max}]\) or as a random (or almost random) time series. A given value of this distance (the state in which management decisions should be made) can be considered as a trap, or as an acceptable threshold, into which the “wandering point” can eventually fall. This makes it possible to build probabilistic sociodynamic models to predict the dynamics of public sentiment.

The behavior of a “wandering point” is traditionally described with the diffusion model. However, in this case it cannot be considered reliable. As a rule, time series describing processes in complex systems (for example, financial indicators of stock and commodity exchanges) are not stationary, for various reasons, including the presence of a human factor. Their sample distribution functions have a time-dependent mathematical expectation, which contradicts the simple diffusion model and indicates that the time series is non-stationary.

In this regard, we propose to consider more complex models of the behavior of the “wandering point”, for example, models based on the Fokker-Planck equation.

3 Data Collection and Processing

As an example of analyzing the structure of comment graphs, we chose a news item on the Echo-Moscow portal and collected all user comments on it together with the available user data. The comments (633 in total) were then processed and, based on text analysis, labeled as belonging to one of four types: loyalist, oppositionist, troll, or undefined.

The “undefined” group was singled out because, when the text of a comment is very short, nothing can be said with confidence about the user’s affiliation with one of the other three groups. Some comments (77) had been deleted by the site moderators for rule violations by the time of data collection, but because each child comment retained a reference to its parent comment, it was possible to restore information about their existence (though not their texts). Therefore, the total number of nodes in the graph is 710. The distribution of nodes by state is as follows: loyalists 10.28% (73 nodes); oppositionists 43.80% (311 nodes); trolls 30.42% (216 nodes); undefined 4.65% (33 nodes); deleted 10.85% (77 nodes).
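The shares quoted above follow directly from the node counts. A minimal Python sketch of that arithmetic is given below; the dictionary keys and variable names are illustrative and are not taken from the study's own tooling.

```python
# Node counts by state, as reported in the text (deleted comments are kept as nodes).
counts = {
    "loyalist": 73,
    "oppositionist": 311,
    "troll": 216,
    "undefined": 33,
    "deleted": 77,
}

total_nodes = sum(counts.values())

# Share of each state among all nodes of the comment graph.
shares = {state: n / total_nodes for state, n in counts.items()}

for state, share in shares.items():
    print(f"{state:14s} {100 * share:6.2f}%")
```

The share of one chosen state (for example, oppositionists) is the kind of quantity that enters the network state vector as its first element.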

Figure 1 shows the structure of the graph obtained by processing the comments on the news item. In a color visualization, the nodes of the graph can be marked with different colors depending on their state (assignment to one of the four types).

The links in Fig. 1 show users commenting on one another. Thus, the color of a node indicates its state, and the edges of the graph indicate interactions.

Figure 1 shows that there are many isolated vertices in this structure. However, there is also a connected component of the graph, which is shown separately in Fig. 2. The closed oval lines (loops) in Fig. 2 indicate users commenting on themselves.

Let’s consider the elements of the network state vector that we will use in our model:

  • The proportion of nodes that have a certain state (for example, those who are negative towards some event in public life).

  • Clustering coefficient: a measure of the density of the connections of a given vertex with its neighbors. The node clustering coefficient is the ratio of the actual number of links connecting the nearest neighbors of a given node i to the maximum possible number (the number there would be if all nearest neighbors of the node were directly connected to each other); its value lies in the interval [0, 1]. The larger its value, the more significant this node is in the exchange of information.

  • Mediation coefficient (betweenness): the ratio of the number of shortest paths between all pairs of network nodes that pass through a given node to the total number of shortest paths in the network; its value lies in the interval [0, 1]. The larger its value, the more significant the role of this vertex in the exchange of information.
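The two graph coefficients above can be sketched in plain Python as follows. This is an illustration only, not the authors' implementation: the toy graph, the function names, and the use of the standard normalized betweenness centrality (Brandes' algorithm, which sums the fractions of shortest paths through a node over all pairs) in place of the exact ratio described in the text are all assumptions. Self-loops (users commenting on themselves) are left out of the adjacency sets.

```python
from collections import deque

def local_clustering(adj, v):
    """Share of pairs of neighbours of v that are directly linked to each other."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))

def betweenness(adj):
    """Normalised betweenness centrality of an undirected graph (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        n_paths = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        n_paths[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                          # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:    # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dep = {v: 0.0 for v in adj}
        while order:                          # accumulate dependencies backwards
            w = order.pop()
            for v in preds[w]:
                dep[v] += n_paths[v] / n_paths[w] * (1.0 + dep[w])
            if w != s:
                score[w] += dep[w]
    n = len(adj)
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 0.0
    return {v: score[v] * scale for v in adj}

# Hypothetical toy graph: a triangle a-b-c with one extra user d replying only to c.
adjacency = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

avg_clustering = sum(local_clustering(adjacency, v) for v in adjacency) / len(adjacency)
avg_betweenness = sum(betweenness(adjacency).values()) / len(adjacency)
print(avg_clustering, avg_betweenness)
```

Averaging the per-node values over all nodes gives the network-level quantities used as elements of the state vector.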

Fig. 1. The structure of the comment graph for the news item in question.

Fig. 2. Graph of user relationships by comments.

We now determine the values of the elements of the base network state vector (denoted \(\theta\)). They set thresholds whose crossing is undesirable from the point of view of state management. Given that in any community a share of 0.10 to 0.15 of participants always disagrees on any issue, we take the share of those who are negative towards the event in question to be 0.12.

The desired average value (over all nodes) of the clustering coefficient of such a network is also assumed to be small, for example, equal to 0.05, and the average mediation coefficient of the nodes in such a network is likewise taken to be 0.05. Thus, the base vector is \(\theta = \left( {0.12;0.05;0.05} \right)\).

Note that the state of the network can be described with a larger number of parameters; we have chosen only those that are, in our opinion, the most significant. In addition, the selected parameters are normalized (they lie in the interval [0, 1]) and therefore contribute equally to the calculation of the distance metric.

In our proposed approach, the comment graphs of different news items on certain topics, published on a selected resource during one day, can be combined into a single structure through the nodes that belong to the same users. Thus, we obtain a large graph that describes the activity of the users of this online information resource during the day. Next, the elements of the current state vector describing its characteristics can be determined.

Changes in the components of this vector for each day for a certain time will form a multidimensional time series.

4 Theoretical Part

The resulting multidimensional time series can be used to describe the dynamics of changes in the public sentiment of Internet users.

The literature reviewed in the course of this study indicates that social network analysis methods are a useful tool for building a complete picture of public sentiment at a time when events of a certain nature occur in a country or in the world. Below we consider a number of works close to the topic of our study that describe processes in complex social network structures.

Since finding similarities between nodes in a network is a time-consuming process as the network grows, researchers have used swarm algorithms to optimize the process of solving link prediction and community discovery problems [1]. Swarm-based optimization techniques used in social network analysis are compared in this article with community and link analysis based on traditionally used approaches.

Works [2,3,4] propose the KroMFac technique, which performs community detection using regularized non-negative matrix factorization (NMF) based on the Kronecker graph model. KroMFac combines network analysis and community discovery techniques in a single, unified framework. This technique links four areas of research, namely the detection of communities on graphs, the detection of overlapping communities, and the detection of communities in incomplete networks with missing edges and in complete networks.

Work [5] proposes a new weighted summary measure for detecting influential users in social networks. This method combines the influence of several structural features of the network, as well as local and global information, to obtain an estimate of weighted total centrality.

The authors of [6] propose a new index for analyzing the distribution of messages in social networks, based on the topological nature of networks and the strength of messages’ influence. This indicator characterizes the strength of each node as a means of launching a message, dividing nodes into starters and non-starters.

Works [7,8,9] on the analysis of random networks present physically justified models and effective algorithms for determining the hierarchical ranks of nodes in directed networks.

The dynamics of changes in the public mood of Internet users can be attributed to stochastic processes. The presence of the human factor (many people with different opinions, preferences and behaviors) on the one hand creates a randomness of changes (due to the wide variety of user behavioral models), and on the other hand introduces elements of purposefulness into the dynamics of changes. A detailed description of the use of stochastic methods for modeling the dynamics of social processes can be found in [10].

In our opinion, the most promising models of the dynamics of public mood are those derived from the Fokker-Planck equation, which takes into account both ordered and random changes.

The Fokker-Planck equation is widely used to analyze and model the behavior of time series when describing processes in complex systems [11,12,13,14].

It should be noted that in addition to the Fokker-Planck equation, other approaches are used for modeling based on differential equations, for example, the Liouville equations [14, 15], the diffusion equations [13, 16] and several others.

Social processes are simulated not only with models based on partial differential equations but also, for example, with models based on game-theoretic approaches and the methods for making managerial decisions built on them [17].

The Fokker-Planck equation is widely used to analyze and model transients observed in various complex systems and provides good agreement with predicted behavior and observed data. Therefore, as a hypothesis, we will assume that the Fokker-Planck equation can be used to analyze and model the appearance of comments on news and blogs. The Fokker-Planck equation has the form:

$$ \frac{{\partial \rho \left( {x,t} \right)}}{\partial t} = - \frac{\partial }{\partial x}\left[ {\mu \left( x \right) \cdot \rho \left( {x,t} \right)} \right] + \frac{1}{2}\frac{\partial^2 }{{\partial x^2 }}\left[ {D\left( x \right) \cdot \rho \left( {x,t} \right)} \right] $$
(1)

where \(\rho \left( {x,t} \right)\) is the time-dependent probability density of the distribution over states \(x\) (in our case, the state \(x\) is the number of comments observed at time \(t\)), \(D\left( x \right)\) is a state-dependent coefficient that determines random changes of the state \(x\), and \(\mu \left( x \right)\) is a state-dependent coefficient that determines targeted changes of the state \(x\).

In our model, \(D\left( x \right)\) can be interpreted as user actions caused by a spontaneous impulse that arises when reading the news or other users’ comments on it: the event described in the news or blog is not particularly important to the user, but the user is ready to spend time commenting or responding to another commentator (the user had a spontaneous desire to respond to this news). In turn, \(\mu \left( x \right)\) can be interpreted as targeted actions caused by the desire to respond to a news item or blog that is significant for the user, or to comment on another user's comment that touches on a topic important to this user (the user is constantly interested in this topic).

Next, to build the model, we need to make assumptions about the dependence of \(D\left( x \right)\) and \(\mu \left( x \right)\) on the state \(x\), subject to two conditions. First, we take into account the dimensions of the terms in Eq. (1). Second, we assume that as the state \(x\) increases (an increase in the number of possible comments, that is, in the significance of the news or blog), the magnitudes of \(D\left( x \right)\) and \(\mu \left( x \right)\) should also increase.

All terms of Eq. (1) must have the same dimension (that of \(\partial \rho /\partial t\)). Both conditions are met if the dependencies of \(D\left( x \right)\) and \(\mu \left( x \right)\) on the state \(x\) have the form \(\mu \left( x \right) = \mu_0 \cdot x\) and \(D\left( x \right) = D_0 \cdot x^2\). This form ensures, on the one hand, that \(D\left( x \right)\) and \(\mu \left( x \right)\) grow with increasing \(x\) and, on the other hand, that the dimensions are consistent.

The solution of the stationary Fokker-Planck equation

$$ - \frac{d}{dx}\left[ {\mu \left( x \right) \cdot \rho \left( x \right)} \right] + \frac{1}{2}\frac{d^2 }{{dx^2 }}\left[ {D\left( x \right) \cdot \rho \left( x \right)} \right] = 0 $$
(2)

under the assumptions made has the form:

$$ \rho \left( x \right) = \left[ {\gamma - 1} \right]x^{ - \gamma } $$
(3)

This is the power-law distribution of commentators by number of comments that is observed in practice, which suggests that the Fokker-Planck equation can be used to describe social processes.
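As a sketch of how Eq. (3) can be obtained from Eq. (2), one may integrate Eq. (2) once under two additional assumptions that are not stated in the text: the probability flux is zero, and the states are normalized on \(x \ge 1\) (at least one comment). With \(\mu \left( x \right) = \mu_0 \cdot x\) and \(D\left( x \right) = D_0 \cdot x^2\) this gives

$$ - \mu_0 x\rho \left( x \right) + \frac{D_0 }{2}\frac{d}{dx}\left[ {x^2 \rho \left( x \right)} \right] = 0\;\; \Rightarrow \;\;\rho \left( x \right) = C \cdot x^{ - \gamma } ,\quad \gamma = 2 - \frac{2\mu_0 }{D_0 } $$

$$ \int_1^\infty {C \cdot x^{ - \gamma } dx} = \frac{C}{\gamma - 1} = 1\;\; \Rightarrow \;\;C = \gamma - 1,\quad \gamma > 1 $$

Under these assumptions the exponent is fixed by the ratio \(\mu_0 /D_0\), and the normalization requires \(\mu_0 < D_0 /2\).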

To describe how the distance between the current state vector and the given base vector changes over time, consider the solution of the non-stationary Fokker-Planck equation, which may allow the construction of probabilistic sociodynamic models to predict the dynamics of public sentiment.

Let us formulate a boundary value problem, the solution of which will describe the process of changing the value of the distance between the value of the current state vector of the comment network graph and the given base vector in time.

The first boundary condition:

When selecting the first boundary condition, we proceed from the following considerations. The point \(x = L_{min}\) (the left boundary of the segment of possible states) defines the state whose crossing must be avoided (the region of the segment to the left of this state is undesirable for us). The probability of detecting such a state of the system may be non-zero. However, the probability density, which determines the flux at \(x = L_{min}\), must be set to 0, since the states should not go beyond this boundary (the reflection condition is implemented here). Thus:

$$ \left. {\rho \left( {x,t} \right)} \right|_{x = {\text{L}}_{min} } = 0 $$
(4)

Second boundary condition:

We limit the region of possible states on the right by some value \(x = L_{max}\) (the metric used in the calculations cannot exceed the magnitude of the vector whose elements take their maximum values in the space of the selected coordinates). The probability of detecting such a state over time is non-zero. However, the probability density, which determines the flux at \(x = L_{max}\), must be set to zero, since the distance between the current and base state vectors is limited by the maximum possible coordinate values in the vector space used (the condition of reflection from the boundary is implemented):

$$ \left. {\rho \left( {x,t} \right)} \right|_{x = {\text{L}}_{max} } = 0 $$
(5)

To formulate the boundary value problem, it is also necessary to specify the initial condition. Since at time \(t = 0\) the system state (the distance between the base vector and the current state vector) is equal to some value \(x_0\), the initial condition can be set as:

$$ \rho \left( {x,t = 0} \right) = \left\{ {\begin{array}{*{20}l} {\smallint \delta \left( {x - x_0 } \right)dx = 1,} \hfill & {x = x_0 } \hfill \\ {0,} \hfill & {x \ne x_0 } \hfill \\ \end{array} } \right. $$
(6)

Because of the delta function in the initial condition, the solution of Eq. (1) under the given boundary conditions and the assumptions made about \(D\left( x \right)\) and \(\mu \left( x \right)\), that is, the time-dependent probability density of finding the system in a state \(x\), takes the following form.

For \(L_{min} \le x \le x_0\):

$$ \rho_1 \left( {x,t} \right) = - \varphi \left( {x,t} \right)\mathop \sum \limits_{n = 1}^M \frac{{\sin \left\{ {\pi n\frac{{\ln \left( {\frac{{L_{\max } }}{x_0 }} \right)}}{{\ln \left( {\frac{{L_{\max } }}{{L_{\min } }}} \right)}}} \right\}\sin \left\{ {\pi n\frac{{\ln \left( {\frac{x}{{L_{\min } }}} \right)}}{{\ln \left( {\frac{{L_{\max } }}{{L_{\min } }}} \right)}}} \right\}}}{{\cos \left( {\pi n} \right)}}e^{ - \omega \left( {n, t} \right)} $$
(7)

For \(x_0 \le x \le L_{max}\):

$$ \rho_2 \left( {x,t} \right) = \varphi \left( {x,t} \right)\mathop \sum \limits_{n = 1}^M \frac{{\sin \left\{ {\pi n\frac{{\ln \left( {\frac{x_0 }{{L_{\min } }}} \right)}}{{\ln \left( {\frac{{L_{\max } }}{{L_{\min } }}} \right)}}} \right\}\sin \left\{ {\pi n\frac{{\ln \left( {\frac{x}{{L_{\max } }}} \right)}}{{\ln \left( {\frac{{L_{\max } }}{{L_{\min } }}} \right)}}} \right\}}}{{\cos \left( {\pi n} \right)}}e^{ - \omega \left( {n, t} \right)} $$
(8)

where \(\alpha = \frac{1}{2} - \frac{\mu_0 }{{D_0 }},\)\(\varphi \left( {x,t} \right) = { }2\frac{{x_0^\alpha \cdot x^{ - \left[ {1 + \alpha } \right]} \cdot e^{ - \frac{D_0 \alpha^2 }{2}t} }}{{\ln \left( {\frac{{L_{max} }}{{L_{min} }}} \right)}},\)\(\omega \left( {n, t} \right) = \frac{{\pi^2 { }n^2 D_0 t}}{{2\left[ {\ln \left( {\frac{{L_{max} }}{{L_{min} }}} \right)} \right]^2 }}\)
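As a consistency check on Eqs. (7) and (8), it may help to introduce the logarithmic coordinates \(y\) and \(y_0\) (notation introduced here for illustration only, not used elsewhere in the article):

$$ y = \frac{{\ln \left( {x/L_{min} } \right)}}{{\ln \left( {L_{max} /L_{min} } \right)}},\quad y_0 = \frac{{\ln \left( {x_0 /L_{min} } \right)}}{{\ln \left( {L_{max} /L_{min} } \right)}},\quad \frac{{\ln \left( {L_{max} /x_0 } \right)}}{{\ln \left( {L_{max} /L_{min} } \right)}} = 1 - y_0 ,\quad \frac{{\ln \left( {x/L_{max} } \right)}}{{\ln \left( {L_{max} /L_{min} } \right)}} = y - 1 $$

For integer \(n\), \(\sin \left( {\pi n\left[ {1 - y_0 } \right]} \right) = - \cos \left( {\pi n} \right)\sin \left( {\pi ny_0 } \right)\) and \(\sin \left( {\pi n\left[ {y - 1} \right]} \right) = \cos \left( {\pi n} \right)\sin \left( {\pi ny} \right)\), so both expressions reduce to the same series:

$$ \rho_1 \left( {x,t} \right) = \rho_2 \left( {x,t} \right) = \varphi \left( {x,t} \right)\mathop \sum \limits_{n = 1}^M \sin \left( {\pi ny_0 } \right)\sin \left( {\pi ny} \right)e^{ - \omega \left( {n,t} \right)} $$

In this form it is easy to see that the density vanishes at \(y = 0\) and \(y = 1\), in agreement with conditions (4) and (5).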

The probability that by time \(t\) the state of the system remains within the interval from \(L_{min}\) to \(L_{max}\), that is, that the threshold state \((\theta )\) has not been reached, can be calculated as follows:

$$ P\left( {\theta ,t} \right) = \mathop \smallint \limits_{L_{min} }^{x_0 } \rho_1 \left( {x,t} \right)dx + \mathop \smallint \limits_{x_0 }^{L_{max} } \rho_2 \left( {x,t} \right)dx $$
(9)

The probability \(Q(\theta ,t)\) that the threshold state \(\theta\) will be reached or exceeded by time \(t\) is calculated by the formula:

$$ Q\left( {\theta ,t} \right) = 1 - P\left( {\theta ,t} \right) $$
(10)

The choice of the boundaries of the interval of possible states, from \(L_{min}\) to \(L_{max}\), is discussed in the section on the analysis of the resulting model.

5 Prediction Algorithm for Achieving a Given State of the Network Comment Graph of Mass Media News Users

Predicting the dynamics of the moods of Internet media users based on the Fokker-Planck equation and changing the parameters of their commentary networks can be carried out according to the following algorithm:

  • Collect the text comments and user metadata on a specific topic from online news media resources, with date and time stamps.

  • Process the data using text analytics and sentiment analysis, build the graph of user comments on the topic, and calculate its characteristics (network density, average mediation coefficient, average clustering coefficient, elasticity, share of users with one or another mood).

  • Set the values of the elements of the base vector, which define the desired or undesired state \((\theta )\), and form, from the processed data and the given vector, a time series of changes in the graph of user comments on the topic over time.

  • Set the duration of the step \(\tau\) (hour, day, week, etc.) and, from the time series values over several steps of length \(\tau\), determine the model parameters \(\mu_0\) and \(D_0\) numerically using the observed data and Eqs. (9) and (10).

  • Take the last average value of the distance metric between the base vector and the current state vector of the network as the initial state \(x_0\) and, using the obtained values of \(\mu_0\) and \(D_0\) together with Eqs. (9) and (10), calculate the time dependence of the probability of reaching the desired or undesired state. Then set a probability value (for example, 0.95) and estimate the time needed to reach this probability level (that is, make a forecast in time).

Analysis of the Obtained Model

For the graph shown in Fig. 2, we can determine its characteristics and the elements of the current state vector at a given time \(t\) (taken as \(t = 0\)): \(X\left( t \right) = \left( {0.44;0.11;0.15} \right)\). The distance between the specified base vector of the desired state \(\theta = \left( {0.12; \, 0.05; \, 0.05} \right)\) and the current state vector \(X\left( t \right)\) at time \(t = 0\) is \(x_0 = 0.34\). By analyzing the dynamics of the time series of network state changes over the previous few days and using the model equations, we can solve the inverse problem and determine the values of the model parameters \(\mu_0\) and \(D_0\). In our case, \(\mu_0 = 0.0003\) and \(D_0 = 0.007\).

The right boundary of the interval of possible states, \(L_{max}\), can be specified as the distance between the base vector \((\theta )\) and the vector of maximum possible values of the network parameters, \(X = \left( {1; 1; 1} \right)\); in the case under consideration, \(L_{max} = 1.61\). The left boundary can be defined with a safety margin, for example, as half the length of the given base vector (in this case \(|\theta | = 0.14\)), so that \(L_{min} = 0.07\).
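The values of \(x_0\), \(L_{max}\), and \(L_{min}\) quoted above are reproduced, after rounding to two decimals, by the ordinary Euclidean distance. The text does not name the metric, so the Euclidean choice in the following Python sketch is an assumption, and the function and variable names are illustrative.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two state vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

theta = (0.12, 0.05, 0.05)      # base vector of the desired state
current = (0.44, 0.11, 0.15)    # current state vector X(t) at t = 0
upper = (1.0, 1.0, 1.0)         # vector of maximum possible parameter values
origin = (0.0, 0.0, 0.0)

x0 = euclidean_distance(current, theta)        # initial distance
l_max = euclidean_distance(upper, theta)       # right boundary of possible states
theta_norm = euclidean_distance(theta, origin) # length of the base vector
l_min = theta_norm / 2.0                       # left boundary: half that length

print(round(x0, 2), round(l_max, 2), round(theta_norm, 2), round(l_min, 2))
```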

The results show that, if the network is not subjected to external influence, then under current conditions the required state can be reached with a probability of 0.8 within 375 days and with a probability of 0.9 within 525 days. The result obtained is plausible, but assessing its accuracy requires additional research.
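Equations (7)-(10) can be evaluated numerically along the following lines. This Python sketch is not the authors' code. It assumes that time is measured in days, uses the fact that for integer \(n\) both Eq. (7) and Eq. (8) reduce to \(\varphi \left( {x,t} \right)\sum \sin \left( {\pi ny_0 } \right)\sin \left( {\pi ny} \right)e^{ - \omega \left( {n,t} \right)}\) with \(y = \ln \left( {x/L_{min} } \right)/\ln \left( {L_{max} /L_{min} } \right)\), truncates the series at a hypothetical `n_terms`, and integrates Eq. (9) with Simpson's rule. Because the parameters quoted in the text are rounded, the output should not be expected to match the reported probabilities exactly.

```python
import math

def survival_probability(t, x0, l_min, l_max, mu0, d0, n_terms=200, n_steps=1000):
    """P(theta, t) of Eq. (9): probability that x is still inside (l_min, l_max) at time t."""
    span = math.log(l_max / l_min)             # ln(L_max / L_min)
    alpha = 0.5 - mu0 / d0
    y0 = math.log(x0 / l_min) / span           # starting point in log coordinates
    # Weight of each series term: sin(pi n y0) * exp(-omega(n, t)).
    weights = [
        math.sin(math.pi * n * y0)
        * math.exp(-((math.pi * n) ** 2) * d0 * t / (2.0 * span ** 2))
        for n in range(1, n_terms + 1)
    ]
    decay = math.exp(-d0 * alpha ** 2 * t / 2.0)

    def density_times_x(y):
        # rho(x, t) * x at x = l_min * exp(y * span); the factor x comes from dx = x d(ln x).
        x = l_min * math.exp(y * span)
        series = sum(w * math.sin(math.pi * (n + 1) * y) for n, w in enumerate(weights))
        phi = 2.0 * x0 ** alpha * x ** (-(1.0 + alpha)) * decay / span
        return phi * series * x

    # Composite Simpson rule over y in [0, 1] (n_steps must be even); dx = x * span * dy.
    h = 1.0 / n_steps
    total = density_times_x(0.0) + density_times_x(1.0)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * density_times_x(i * h)
    return span * total * h / 3.0

def threshold_probability(t, *params):
    """Q(theta, t) of Eq. (10): probability that the threshold has been reached by time t."""
    return 1.0 - survival_probability(t, *params)

# Parameter values quoted in the text: x0, L_min, L_max, mu0, D0.
PARAMS = (0.34, 0.07, 1.61, 0.0003, 0.007)

for days in (375, 525):
    print(days, round(threshold_probability(days, *PARAMS), 3))
```

Scanning `threshold_probability` over a grid of times and picking the first time at which it exceeds a chosen level (for example, 0.95) gives the kind of time forecast described in the prediction algorithm.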

Figure 3 shows the simulation results: the probability of reaching a given threshold network state as a function of time.

Fig. 3. The probability of reaching a given network threshold state versus time for the example discussed.

6 Conclusion

In conclusion, we note that the complex dynamics of processes in complex social systems can be described not only by models built on the Fokker-Planck equation. For example, [18,19,20] present models for describing the stochastic dynamics of state changes in complex social systems that take into account self-organization processes and the presence of memory.

This makes it possible to account for memory and to describe not only Markov but also non-Markov processes. In these studies, a second-order non-linear differential equation was derived. It allows boundary value problems to be set and solved in order to determine the probability density function of the amplitude of deviations of the parameters describing the observed processes of non-stationary time series, depending on the length of the time interval over which it is determined and on the depth of memory, which significantly distinguishes it from the Fokker-Planck equation.