
1 Introduction

From industry to consumers, banking to retail, and medical experts to patients, many sectors have already embraced Big Data, regardless of whether the information comes from private or public sources. In terms of volume, petabytes of data flow into big data reservoirs daily from web services, social media, astronomy, and the biological sciences. Big data can thus be defined as a collection of datasets too large to be processed with traditional database management tools [1]. Problems of data storage, information manipulation, and especially the search for key information in big data have become front-line research areas pursued by a sizable number of researchers [2]. Data flows into big data from two kinds of sources: collective gathering and individual generation. Collectively gathered big data include smart city data, national geographic condition monitoring data, and earth observation data [3]. Such data are usually obtained with statistical sampling techniques, which leads to high-quality data. In contrast, most big data are generated by individuals, for example through social media on the Internet. Individual data generation is far less constrained, which lowers reliability and usability [4].

Collecting and analyzing data are traditionally the concern of statistics, which enables judgments to be made on the basis of hypotheses. Big data technology, however, is far more advanced at the stages of data sampling, storage management, data computation, and data communication. In the traditional scientific paradigm, theory is validated by experiments. In the current paradigm, scientific findings are often obtained by computer simulation and are mainly explored from multi-source observations drawn from big data. In summary, big data are characterized by the four V's: volume, variety, velocity, and veracity. The first V stands for the large volume of data, and the second V represents variety, that is, multi-type and multi-source data. The third V denotes the velocity at which data are generated and processed, and the fourth V characterizes veracity, or value, which refers to the quality and value of the captured and analyzed data. Data quality is comprehensively measured by the inherent information content and by how well it satisfies user demands [5].

In this paper we propose a new concept and approach to big data by drawing an analogy between big data and the reservoir theory of stochastic water storage processes. We then analyze the data inflows into, and outflows from, buffers to investigate the underlying patterns of big data and to extract key information. The rest of the paper is organized as follows: Sect. 2 reviews related work, Sect. 3 gives the overview and problem formulation, Sect. 4 presents illustrative simulation results, and Sect. 5 concludes.

2 Some Related Works

In this section we present research by others that is related to this paper. Although a tremendous amount of work on Big Data analysis and on the theory of storage, the so-called stochastic reservoir theory, has appeared in the literature, we describe only the works most in line with ours. The theory of storage in probabilistic terms was first introduced by P.A.P. Moran in 1954. Since then, many researchers have examined and extended Moran's work in both theoretical and applied directions. Among them, we refer to the work of Phatarfod [5] on stochastic reservoir theory and its extensions [6,7,8,9]. The common setting is a reservoir built for flood prevention or irrigation: water flows into the reservoir and a certain amount is released for optimal regulation of the system, with the inflows and outflows forming sequences of random variables governed by laws of probability, hence the name stochastic reservoir theory. Even so, many parts of the theory call for statistical techniques, such as time series analysis of the inflows to estimate the optimal size of a reservoir. Using statistical regression analysis, the probability distributions of the inflows can be found, so that an input-output analysis yields various important quantities, including the storage size, optimal regulation policies, the probabilities of overflow and emptiness, and the corresponding passage times. These quantities are very useful for drawing analogies between stochastic reservoir theory and buffer data storage systems in Big Data analysis.

In the context of Big Data analysis, it is well recognized that sources such as social networks generate a huge amount of data that is big in volume, fast in velocity, wide in variety, and high in value. It is worth noting that the data are generated by users as well as providers, so most of the data are unstructured, which makes the analysis challenging when it comes to extracting reliable and useful information. In order to extract key information, we process big data by establishing buffer storage systems into which multiple streams of data flow. This is the analogy we draw between stochastic reservoir theory and big data analysis, illustrated in Fig. 1.

Fig. 1. Stochastic Reservoir and Big Data Models

There has been much innovation in data platforms to enable the efficient processing of new types of analytics through novel data structures, termed data reservoirs, which need a constant flow of new data, whether from existing or new sources, so that the usefulness and reliability of the information can be investigated. The data in a data reservoir are registered in the reservoir's index or catalog, which records the origin, owner, and characteristics of the data. As with a water reservoir, a data reservoir can regulate the inflows and outflows of data for optimal usage, and it is designed to offer access to the data for analytics [8]. In other words, big data reservoirs are large catchment areas where all kinds of data can be stored and analyzed, and they form the foundation of natural information systems for future developments. Since the data volumes are large, the velocity high, and the variety huge, it is essential that data reservoirs be cost effective and flexible compared with other methods.

Searching data usually provides an automatic and effective way to evaluate a reservoir. In [10], the authors considered a data searching technique for logging reservoir evaluation and verified its performance. Compared with traditional evaluation methods, it is efficient and less dependent on expertise. In the following section we propose a framework capable of such an analysis.

3 Proposed Stochastic Model for Big Data Analysis

In order to operate data reservoir networks optimally for big data, we propose a new stochastic model, adapted from water storage systems, for analyzing big data reservoir network systems. In typical models, the inputs to a reservoir are data released from upstream reservoirs together with stochastic inflows from external sources (such as social media and smart phones), while the outputs correspond to the amount of information to be transmitted during a given time period (e.g., an hour, a day, or a month). The dynamics of a single reservoir can be modeled by a state equation in which the amount of information at the beginning of period t + 1 reflects the flow balance between the information that enters (upstream releases and stochastic inflows) and the information that is released or transmitted during period t. The amount of information to be released from each reservoir during a given period is chosen to minimize some possibly nonlinear cost (or maximize a benefit) function of the releases. Optimal management of the reservoir network can thus be formulated as an optimization problem whose aim is to determine the information releases that minimize a total cost over a given horizon of T time periods (e.g., a year).
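One natural formalization of this management problem, stated here only as an illustration since the specific cost function is not fixed in this paper, is:

$$ \min_{m_{0}, \ldots, m_{T-1}} \; E\left[ \sum_{t=0}^{T-1} c_{t}(m_{t}) \right] \quad \text{subject to} \quad Z_{t+1} = Z_{t} + X_{t} - m_{t}, \;\; 0 \le Z_{t} \le K $$

where \( c_{t}(\cdot) \) is the period-t release cost, \( X_{t} \) the stochastic inflow, \( m_{t} \) the release decision, and K the reservoir capacity; the truncation of the flow balance at the barriers 0 and K is made explicit in Eq. (1) below.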

Let \( X_{t} \), for t = 0, 1, 2, …, be the amount of data flowing into the data reservoir during the time interval (t, t + 1), and suppose that at the end of interval t + 1 a fixed amount of data, say m, is taken out. The data content left in the reservoir after the release, \( Z_{t+1} \), is then given by the following stochastic state equation:

$$ Z_{t + 1} = \min \left( K, Z_{t} + X_{t} \right) - \min \left( m, Z_{t} + X_{t} \right) $$
(1)
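As a concrete illustration, the following short Python sketch simulates the state equation (1) for i.i.d. inflows. The Poisson inflow model and the numerical values of K, m, and the mean inflow are illustrative assumptions, not values prescribed by the model.

```python
import numpy as np

def simulate_reservoir(K, m, mean_inflow, u0, T, seed=0):
    """Simulate the data-reservoir state equation (1):
    Z_{t+1} = min(K, Z_t + X_t) - min(m, Z_t + X_t),
    with i.i.d. Poisson(mean_inflow) inputs X_t (an illustrative choice)."""
    rng = np.random.default_rng(seed)
    Z = np.empty(T + 1)
    Z[0] = u0                                  # initial data content u
    for t in range(T):
        X = rng.poisson(mean_inflow)           # stochastic inflow during (t, t+1)
        Z[t + 1] = min(K, Z[t] + X) - min(m, Z[t] + X)
    return Z

# Illustrative run: capacity 1000 units, release 100 units per period, mean inflow 95.
Z = simulate_reservoir(K=1000, m=100, mean_inflow=95, u0=500, T=10_000)
print("fraction of empty periods:", np.mean(Z == 0))
print("fraction of full-after-release periods:", np.mean(Z == 1000 - 100))
```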

Equation (1) can be solved when the probability distribution of \( \{X_{t}\} \) and the constant m are known. In this paper we consider the case in which the \( X_{t} \) are independent and identically distributed. To proceed, we first state Wald's fundamental identity of sequential analysis.

Let us define \( Y_{t} = X_{t} - m \). Then \( \left\{ {Y_{t} } \right\} \) is also a sequence of independent and identically distributed random variables, and from Eq. (1) we have \( Z_{N} = \sum\limits_{t = 1}^{N} {Y_{t} } \). The process is a random walk starting at the origin with two absorbing barriers, at \( -(u - m) \) and at \( K - m - u \; (> 0) \), where u is the initial data content, with \( 0 < u < K - m \).

Let n be the smallest positive integer such that \( Z_{n} \ge K - m - u \) or \( Z_{n} \le - u + m \). Then, if \( G(\theta ) = E[e^{ - \theta Y} ] \) is the generating function of the distribution of Y, Wald's identity gives:

$$ E\left[ e^{ - \theta Z_{n} } \left\{ G(\theta ) \right\}^{ - n} \right] = 1 $$
(2)

for all θ such that \( \left| {{\text{G}}\left( \theta \right)} \right| \ge 1 \).

It is known from Wald's theory that there is a single dominant nonzero root \( \theta_{0} \) of \( G(\theta ) = 1 \). Substituting \( \theta_{0} \) into Eq. (2), we obtain the probability of absorption at the lower barrier, i.e., of the reservoir emptying before it overflows, which is approximately equal to:

$$ P_{u} = \frac{{1 - e^{{ - (K - m + 1 - u)\theta_{0} }} }}{{e^{{(u + 1 - m)\theta_{0} }} - e^{{ - (K - m + 1 - u)\theta_{0} }} }} $$
(3)

Taking the capacity of the data reservoir to be K and the initial data level to be u, with unit release m = 1, we obtain the probabilities of data emptiness and of data overflow as given in Eqs. (4) and (5).

$$ P_{u} = \frac{{1 - e^{{ - (K - u)\theta_{0} }} }}{{e^{{u\theta_{0} }} - e^{{ - (K - u)\theta_{0} }} }} $$
(4)
$$ Q_{u} = \frac{{1 - e^{{u\theta_{0} }} }}{{e^{{ - (K - u)\theta_{0} }} - e^{{u\theta_{0} }} }} $$
(5)
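For a concrete inflow model, \( \theta_{0} \) can be found numerically. The sketch below assumes Poisson inflows with mean slightly above the unit release (so that the reservoir tends to fill), takes \( G(\theta) = E[e^{-\theta Y}] \) as above, solves \( G(\theta) = 1 \) for its nonzero root, and evaluates Eqs. (4) and (5); all parameter values are illustrative.

```python
import numpy as np
from scipy.optimize import brentq

lam = 1.05        # mean Poisson inflow per period (illustrative; drift E[Y] = lam - 1 > 0)
K, u = 1000, 500  # illustrative capacity and initial data level

def log_G(theta):
    """log G(theta), where G(theta) = E[exp(-theta * Y)], Y = X - 1, X ~ Poisson(lam)."""
    return theta + lam * (np.exp(-theta) - 1.0)

# Dominant nonzero root theta_0 of G(theta) = 1, bracketed away from the trivial root 0.
theta0 = brentq(log_G, 1e-9, 5.0)

# Emptiness and overflow probabilities from Eqs. (4) and (5), with unit release m = 1.
P_u = (1 - np.exp(-(K - u) * theta0)) / (np.exp(u * theta0) - np.exp(-(K - u) * theta0))
Q_u = (1 - np.exp(u * theta0)) / (np.exp(-(K - u) * theta0) - np.exp(u * theta0))
print(f"theta_0 = {theta0:.4f}, P_u = {P_u:.3e}, Q_u = {Q_u:.3e}")
```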

From the duality theorem for random walks, namely \( P_{u} = F(K - u) \), we obtain the stationary distribution of the data reservoir content as shown in Eq. (6).

$$ F(x) = {\text{Prob (data content}} \le x) = \frac{e^{x\theta_{0} } - 1}{e^{K\theta_{0} } - 1} $$
(6)
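The quality of the approximation in Eq. (6) can be checked directly against a simulation of the state equation (1). The sketch below uses unit release and Poisson inflows with the same illustrative mean as before; the capacity and the comparison levels are likewise arbitrary choices.

```python
import numpy as np
from scipy.optimize import brentq

K, lam = 50, 1.05                      # illustrative capacity and mean Poisson inflow
theta0 = brentq(lambda th: th + lam * (np.exp(-th) - 1.0), 1e-9, 5.0)

# Simulate the state equation (1) with unit release m = 1.
rng = np.random.default_rng(1)
Z, content = K // 2, []
for _ in range(200_000):
    X = rng.poisson(lam)
    Z = min(K, Z + X) - min(1, Z + X)
    content.append(Z)
content = np.asarray(content)

# Compare the approximation F(x) of Eq. (6) with the empirical content distribution.
for x in (10, 20, 30, 40):
    F_approx = (np.exp(x * theta0) - 1) / (np.exp(K * theta0) - 1)
    print(f"x = {x:2d}: F(x) = {F_approx:.3f}, empirical = {np.mean(content <= x):.3f}")
```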

The problem now becomes one of evaluating K, the size of the data buffer, such that the content falls below a specified critical level lK (0 < l < 1) with a given probability:

$$ F(lK) = {\text{Prob (reservoir content}} \le lK) = P{\text{ where }}P{\text{ is to be specified}} $$
(7)

4 Experimental Simulation Results

First, we tested whether the approximate distribution given by Eq. (6) is acceptable by comparing it with the exact results obtained by Prabhu [11] for the water content in the theory of storage. The comparison results are presented in Table 1.

Table 1. Distribution function of reservoir content for independent gamma inputs with parameter α and 100 units' release; size of the reservoir K = 1000

From Table 1, it can be seen that the approximation is good when the content lies between 200 and 600, that is, roughly in the range (K/5, 3K/5). On the other hand, the results are less accurate when the content is near zero. This has little practical impact for big data, because a big data reservoir will rarely run short of data. We therefore regard the approximate results for the data reservoir content as adequate for determining the reservoir size optimally.

To that end, we note that the reservoir size K satisfies the equation F(lK) = P, where P is the specified probability and F is given by Eq. (6). Putting \( e^{lK\theta_{0} } = z_{0} \) into Eq. (6), we obtain:

$$ \frac{e^{lK\theta_{0} } - 1}{e^{K\theta_{0} } - 1} = \frac{z_{0} - 1}{z_{0}^{1/l} - 1} = P $$
(8)

Equation (8) is a polynomial equation in \( z_{0} \), and it has a unique admissible root apart from the trivial root \( z_{0} = 1 \). The optimal reservoir size is then:

$$ K = \frac{\log z_{0} }{l\,\theta_{0} } $$
(9)

From the simulation results we note that, for reservoirs with the same values of l and P, \( z_{0} \) is a constant. Thus, to compare the sizes of reservoirs with the same critical level and critical probability, we need only compare the values of \( \theta_{0} \). By varying the size of the data reservoirs and comparing the corresponding values of \( \theta_{0} \), we can obtain the optimal size of the data reservoir, i.e., the computing buffer size, in the big data setting. This makes the computation efficient when investigating insight information in big data. In this paper, however, we give only the outline of the procedure.
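Putting the pieces together, the buffer size can be obtained numerically from Eqs. (8) and (9). The sketch below again assumes Poisson inflows with unit release; the critical fraction l and the target probability P are illustrative design choices, not values taken from the paper.

```python
import numpy as np
from scipy.optimize import brentq

lam = 1.05          # mean Poisson inflow per unit-release period (illustrative)
l, P = 0.25, 0.05   # critical level lK and target probability P (illustrative)

# Dominant nonzero root theta_0 of G(theta) = E[exp(-theta*(X - 1))] = 1, X ~ Poisson(lam).
theta0 = brentq(lambda th: th + lam * (np.exp(-th) - 1.0), 1e-9, 5.0)

# Nontrivial root z_0 of Eq. (8): (z - 1) / (z^(1/l) - 1) = P.
z0 = brentq(lambda z: (z - 1.0) - P * (z ** (1.0 / l) - 1.0), 1.0 + 1e-6, 1e6)

# Optimal buffer size from Eq. (9): K = log(z_0) / (l * theta_0).
K = np.log(z0) / (l * theta0)
print(f"theta_0 = {theta0:.4f}, z_0 = {z0:.4f}, optimal buffer size K = {K:.1f} release units")
```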

5 Conclusion

In this paper we have proposed a new concept of a big data reservoir, using the theory of stochastic water storage to investigate insight information in big data. We used Wald's fundamental identity of sequential analysis, which is widely applied in fields such as queueing theory, dam theory, and other operations research problems. So far we have presented only simulations; more work remains to be done in the future.