1 Introduction

For the past 20 years, video coding standards such as MPEG-1, MPEG-2/4, H.263, and H.264 [6] have been playing a significant role in the digital media revolution. Recent advances in H.264 significantly increase coding efficiency, along with the demand on processor computation power, through variable-block-size motion estimation and many other coding features [23]; this is especially true for HDTV sequences, which therefore need to be processed efficiently. Recently, the computation-aware (CA) concept has been attracting the attention of video processing researchers. In software implementations, processors may have to support video coding at different frame rates, frame sizes, and search ranges. In hardware implementations, even if the frame rate, frame size, and search range have been clearly determined, the computation resource (e.g., operating frequency) may still be adjusted according to the battery power of portable devices.

The ability to enable and disable components, as well as to tune their performance to the workload (e.g., users' requests), is important for achieving energy-efficient utilization. In this work, we present new approaches for lowering energy consumption in both system design and utilization. The policy presented in this paper has been evaluated using an event-driven simulator, which demonstrated its effectiveness in power savings with little impact on performance or reliability [8].

The fundamental premise for the applicability of power management schemes is that systems, or system components, experience non-uniform workloads during normal operation [4, 19]. Non-uniform workloads are common in communication networks and in almost any interactive system.

Dynamic power management (DPM) techniques achieve energy-efficient utilization of systems by selectively placing system components into low-power states when they are idle [1], as illustrated in Fig. 1. The goal of computation-aware block matching algorithms (CA BMAs) is to find the best block-matching results in a computation-limited and computation-variant environment. The authors of [21] pioneered CA BMAs with a novel scheme, but it was intended for software and is not feasible in hardware [3, 22].

Fig. 1 An interactive system is busy or idle depending on the user requests

With M2M, machines can not only retrieve information from other devices but also, to some extent, take action based on that information. Sensors that gather the information that some M2M systems transmit are becoming more widely used and are thus driving demand for the technology, especially in the power management area. The vision of M2M and an Internet of Things built from smart devices raises several important research questions in terms of system architecture, design and development, and human involvement [7]. The lower cost of sensors and initiatives for integrating them into larger systems are also increasing the approach's popularity. The biggest new trend is that vendors are extending M2M to wireless technology, using radio chips or modules that can be attached to almost any device or machine. M2M is thus gearing up for exponential growth.

Collocated smart objects with local rule databases (1–5) form an ad hoc reasoning system in which logical queries are sent from object to object and results are passed back along the inference chain (see Fig. 2). In this example, smart object 1 contains one rule, A if B; to evaluate B, it sends a query to objects 2 and 3, which in turn query object 4. Reasoning chains can be limited to physical areas around the originating object. In this example, smart object 5 is beyond the distance limit set by smart object 1 (the circle indicates the maximum distance from object 1).

Fig. 2 Peer-to-peer reasoning with smart objects [7]

2 Mobile multimedia from M2M context

In order to evaluate the performance of video compression coding, it is necessary to define a measure for comparing the original video with the compressed video. Most video compression systems are designed to minimize the mean square error (MSE) between two video sequences Ψ1 and Ψ2, which is defined as

$$ MSE={\sigma}_e^2=\frac{1}{N}\sum_t\sum_{x,y}{\left[{\varPsi}_1\left(x,y,t\right)-{\varPsi}_2\left(x,y,t\right)\right]}^2 $$
(1)

where N is the total number of frames in either video sequence.

Instead of the MSE, the peak signal-to-noise ratio (PSNR) in decibels (dB) is more often used as a quality measure in video coding; it is defined as

$$ PSNR=20\ {\log}_{10}\frac{255}{\sqrt{MSE}} $$
(2)

It is worth noting that one should compute the MSE between corresponding frames, average the resulting MSE values over all frames, and finally convert the MSE value to PSNR.
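
For concreteness, the following is a minimal NumPy sketch of this procedure (per-frame MSE averaged over the sequence, then converted to PSNR); the frame-array layout and the 8-bit peak value of 255 are assumptions:

```python
import numpy as np

def sequence_psnr(video1, video2):
    """Average the per-frame MSE over all frames, then convert to PSNR (dB).

    video1, video2: arrays of shape (N, height, width) with 8-bit samples.
    """
    v1 = video1.astype(np.float64)
    v2 = video2.astype(np.float64)
    # Per-frame MSE (Eq. 1), averaged over the N frames.
    mse = np.mean((v1 - v2) ** 2, axis=(1, 2)).mean()
    # Convert the averaged MSE to PSNR (Eq. 2).
    return 20.0 * np.log10(255.0 / np.sqrt(mse))
```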

If you are expected to be late, your mobile device automatically informs your audience, telling them approximately how long they will have to wait. Moreover, the planner can look up traffic conditions in advance and suggest what time you should leave. Sensors can monitor the traffic conditions along the routes to your destination so that you can select the best route and reach the venue on time [2].

2.1 Predictive coding

Simple predictive coding, however, is not efficient for compressing fast-changing video, so other methods that remove spatio-temporal redundancy have been proposed for video compression. The most famous is the block matching algorithm, i.e., the motion estimation and motion compensation method. The block matching algorithm divides the current frame and the previous frame into macroblocks, compares the blocks of the two frames, and searches for the best-matching pair for each block [12].

The dissimilarity D(Vx, Vy) (sometimes referred to as error, distortion, or distance) between two images Ψn and Ψn−1, for a candidate displacement (Vx, Vy), is defined as follows

$$ D\left({V}_x,{V}_y\right)=\sum_{y=1}^p\sum_{x=1}^q M\left[{\varPsi}_n\left(x,y\right),{\varPsi}_{n-1}\left(x+{V}_x,y+{V}_y\right)\right] $$
(3)

where M(u,v) is a metric that measures the dissimilarity between the two arguments u and v.

There are several types of matching criteria; the two most frequently used are the MSE and the MAD, defined as follows:

  • Mean square error (MSE):

    $$ M\left(u,v\right)={\left(u-v\right)}^2 $$
    (4)
  • Mean absolute difference (MAD):

    $$ M\left(u,v\right)=\left|u-v\right| $$
    (5)

A study based on experimental work reported that the matching criterion does not significantly affect the search. Hence, the MAD is preferred due to its simplicity of implementation [12].

These two measures (MSE/MAD) characterize the activity of a video frame: the MSE/MAD of the previous frame indicates the complexity of the current frame. If the current frame of a video sequence is fast moving (complex), more processor cores are allotted to process it; otherwise, only one processor core is assigned.
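
The paper does not fix a threshold or a core count for this decision, so the following Python sketch is only illustrative: the mad_threshold value, the linear scaling, and the max_cores cap are all assumptions.

```python
import numpy as np

def cores_for_frame(prev_frame, curr_frame, max_cores=4, mad_threshold=8.0):
    """Allot more cores to a fast-moving (complex) frame, one core otherwise."""
    # Frame-difference MAD (Eq. 5 applied pixel-wise) as a cheap complexity cue.
    mad = np.mean(np.abs(curr_frame.astype(np.float64)
                         - prev_frame.astype(np.float64)))
    if mad <= mad_threshold:
        return 1  # slow-moving frame: a single core suffices
    # Scale the core count with the measured complexity, capped at max_cores.
    return min(max_cores, 1 + int(mad // mad_threshold))
```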

2.2 Background

Our previous work showed that although the Deterministic Markov Non-Stationary Policy (DMNSP) gives the best power optimization overall, other DPM policies can give better power optimization for particular request-arrival patterns in the embedded system model [17, 18]. For example, when requests arrive at long intervals, the greedy policy can give better power optimization than the other policies, including DMNSP; when requests arrive continuously with no inter-arrival time, the worst-case policy (always on) gives the best result. To further improve the battery life of portable devices, a new energy-reduction scheme is needed that predicts, online, the best-suited policy among the existing ones. This calls for an intelligent controller that learns to predict the policy that best balances workload against power.

The number of levels in a DAG is defined as its "depth". Maximizing the speed of a DAG is equivalent to minimizing its depth. Thus, partitioning and optimization methods for the fine-grained DRL/M hardware architecture are considered. For this DRL architecture, accounting for parallel processing during temporal partitioning may improve execution time. Many designers are interested in exploiting the inherent parallelism often present in large signal processing applications, so the trade-off between speed and silicon area at various levels of parallelism becomes an important design issue. Hence, a partitioning technique is developed with this parallel-processing consideration in mind. The ability to find a minimum-depth solution with duplication is necessary. Thus, a graph is partitioned into sub-graphs with a minimum-depth solution; each sub-graph consists of a set of clustering sub-graphs and is selected by the greedy method, one at a time. This method will be shown to find the minimum-depth partitioning of an application.
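
As a sketch of the depth notion used here, the following computes the number of levels of a DAG by longest-path labeling over a topological order (Kahn's algorithm); the node/edge-list input format is an assumption:

```python
from collections import defaultdict, deque

def dag_depth(nodes, edges):
    """Return the depth (number of levels) of a DAG given as an edge list."""
    succ = defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    level = {v: 1 for v in nodes}  # source nodes sit at level 1
    queue = deque(v for v in nodes if indeg[v] == 0)
    while queue:  # Kahn-style topological sweep
        u = queue.popleft()
        for v in succ[u]:
            level[v] = max(level[v], level[u] + 1)  # longest-path labeling
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return max(level.values())
```

Minimizing this quantity over candidate partitionings is what the greedy sub-graph selection described above aims at.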

2.3 Complexity analysis

The whole partitioning procedure has a low time complexity. Before the MFCs are found, all nodes are processed by topological sorting; hence, during partitioning, the nodes are visited in topological order. The time complexity of visiting the nodes is O(V log V), where V is the number of nodes in the given graph. Finding each MFC involves calculating areas, sorting areas, and labeling nodes. Each floor-cone area is calculated by depth-first search (DFS), so the time complexity of calculating the areas is O(V log V). In sorting the areas, the number of floor cones is no more than the number of nodes V, so the time complexity is O(V). In labeling, each node is visited once, so the complexity of labeling nodes is O(V). In the MFC processing, the number of MFCs to which FFD is applied is no more than the number of nodes V, giving a time complexity of O(V log V). The minimum-depth partitioning solution selects k-MMGs in each greedy-method iteration; therefore, depth determination takes O(V² log V) time. In conclusion, the total time complexity is bounded by O(V² log V).

The reference pixel block is generated by a displacement from the current block's location in the reference frame. The displacement is given by the motion vector (MV), a pair (x, y) of horizontal and vertical displacement values. Various criteria are available for block matching; the popular sum of absolute differences (SAD) criterion is given below

$$ SAD\left( dx, dy\right)=\sum_{i=0}^{15}\sum_{j=0}^{15}\left|{c}_{i,j}^{x,y}-{r}_{i,j}^{x+ dx,y+ dy}\right| $$
(6)

where c_{i,j} is a sample of the template macroblock and r_{i,j} is a sample of the candidate block. The search range (SR) is 16 × 16, and the macroblock size is 16 × 16. Figure 3 illustrates the SAD calculation; the current block and the reference block candidate are denoted c and r, respectively.

Fig. 3 The data flow graph of SAD calculation

The whole search range must be searched to find the correct MV. Suppose the search starts in the x direction; Fig. 3 shows the DFG for one macroblock (MB). In this graph, there are 256 (16 × 16) inputs and one output. In HDTV, however, there are 1920 × 1080/(16 × 16) = 8,100 MBs per frame.
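
A minimal NumPy sketch of full-search block matching under this SAD criterion follows; the frame layout and the ±16-pixel search window are assumptions:

```python
import numpy as np

def full_search(curr, ref, x, y, search=16, block=16):
    """Return the MV (dx, dy) minimizing SAD (Eq. 6) for the block at (x, y)."""
    c = curr[y:y + block, x:x + block].astype(np.int32)
    h, w = ref.shape
    best_mv, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = y + dy, x + dx
            if ry < 0 or rx < 0 or ry + block > h or rx + block > w:
                continue  # candidate block falls outside the reference frame
            r = ref[ry:ry + block, rx:rx + block].astype(np.int32)
            sad = int(np.abs(c - r).sum())  # Eq. 6
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad
```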

3 Mobile multimedia for intelligent M2M

There are also computing requirements to take into account for SDTV and HDTV. Assuming we want to encode 720p at 30 frames per second, we must handle 1,280 × 720 pixels per frame, which gives 27,648,000 pixels to process each second. That is over 27 million pixels to encode and another 27 million pixels to decode. High definition is usually done with the H.264 standard (to lower the required bandwidth while keeping quality high), and H.264 is quite complex to compute [13].

3.1 Slice partition for HDTV

Advanced video encoding techniques take each frame, split it into smaller sections called slices, and then analyze and encode each macroblock [15]. In other words, multi-core processors can be utilized in a fairly simple way: we split the pictures into regions and encode the regions in parallel. This is not trivial, because the processing of a macroblock sometimes depends on previous images or on neighboring macroblocks (especially for motion estimation algorithms), but otherwise it is quite straightforward. Each core processes one slice, and a scheduling algorithm switches between cores according to whether the video is fast moving or smooth [11, 14]. For illustration we use Fig. 4 (the Irene sign-language sequence), though slices are not exact rectangular splits.
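
As a toy illustration of this slice-parallel arrangement (not the scheduling algorithm of [11, 14]), the sketch below splits a frame into horizontal slices and processes them on a pool of worker processes; the encode_slice stand-in and the core count are assumptions, and on platforms that spawn workers it must run under an if __name__ == "__main__": guard.

```python
from multiprocessing import Pool
import numpy as np

def encode_slice(slice_rows):
    """Stand-in for a real slice encoder; here it just measures row activity."""
    rows = slice_rows.astype(np.float64)
    return float(np.mean(np.abs(np.diff(rows, axis=0))))

def encode_frame_parallel(frame, num_cores=4):
    """Split the frame into one horizontal slice per core; encode in parallel."""
    slices = np.array_split(frame, num_cores, axis=0)
    with Pool(num_cores) as pool:
        return pool.map(encode_slice, slices)
```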

Fig. 4 Multiple cores share the complexity of processing a frame

3.2 Requester model

A special entity called a "requester" generates workloads, including IO requests and computation needs. Requester modeling is an essential part of power management because policies predict future workloads based on their requester models. We consider two requester models when designing policies: a single requester and multiple requesters. These models are increasingly complex and progressively closer to the programs running on realistic interactive systems such as a laptop computer.

Figure 5 depicts the concept of the single-requester model. The requester generates requests for the device; meanwhile, the power manager observes the requests. Based on these observations, the power manager issues commands to change the power states of the device. Some policies explicitly use this model in determining their rules for changing power states [5, 20]; other policies implicitly assume a single requester [10].
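
As one concrete instance of a policy operating on this model, the sketch below simulates a simple fixed-timeout policy (not the policy proposed in this paper); the power values, the timeout, and the omission of service times and transition costs are all simplifying assumptions:

```python
def simulate_timeout_policy(request_times, horizon, timeout=2.0,
                            p_on=2.5, p_sleep=0.1):
    """Energy consumed when the device sleeps after `timeout` seconds idle.

    request_times: sorted arrival times from the single requester.
    horizon: total simulated time; p_on/p_sleep are illustrative watts.
    """
    energy, prev = 0.0, 0.0
    for t in list(request_times) + [horizon]:
        idle = t - prev
        on_time = min(idle, timeout)   # device stays ON during the timeout
        energy += on_time * p_on + (idle - on_time) * p_sleep
        prev = t
    return energy
```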

Fig. 5 Requester model

3.3 Reinforcement learning (RL) for M2M systems

Figure 6 shows how the RL agent interacts with the process. Nearly all RL methods currently in use are based on the temporal-difference (TD) technique [16]. The fundamental idea behind it is prediction learning: when the agent receives a reinforcement, it must somehow propagate it backwards in time so that the formerly visited states that led to this condition become associated with a prediction of future consequences. This relies on an important assumption about the process dynamics, called the Markov condition [9]: the present observation must depend, as a conditional probability, only on the immediately preceding observation and input action. In practical terms, this means that the agent's sensors must be good enough to produce correct and unambiguous observations of the process states.
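
The backward propagation of reinforcement that the TD technique performs can be written as a one-line value update; a schematic TD(0) step (the state encoding, reward, and learning-rate values are illustrative) is:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference step: move V(s) toward r + gamma * V(s_next).

    V: dict mapping observed states to value estimates.
    """
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error  # propagate reinforcement backwards
    return V
```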

Fig. 6 RL scheduler and decision maker

The power consumption of the IBM hard disk drive, the Fujitsu hard disk drive, and the HP SmartBadge under different DPM policies is shown in Figs. 7, 8, and 9, respectively. In Tables 1, 2, and 3, columns 2 to 4 show the power consumption under the various DPM schemes when implemented on a general-purpose processor (GPP) and on parallel processing elements without local memory.

Fig. 7 Power consumption of the IBM hard disk drive under different DPM policies

Fig. 8 Power consumption of the Fujitsu hard disk drive under different DPM policies

Fig. 9 Power consumption of the HP SmartBadge under different DPM policies

Table 1 Power consumption of the IBM hard disk drive under different DPM policies
Table 2 Power consumption of the Fujitsu hard disk drive under different DPM policies
Table 3 Power consumption of the HP SmartBadge under different DPM policies

4 Experimental results

Our proposed algorithm can handle a theoretically unlimited number of parallel FPGA modules (DRPPUs), but, for simplicity of comparing core-to-core communication, we perform simulations for parallel arrays of one to five DRPPUs and for DRPPU areas (A_DRPPU) of 1,536, 2,304, 2,688, 4,992, 6,144, 6,656, and 9,280 CLBs. We thus demonstrate the ability of our algorithm to support the design of massively parallel architectures. FPGAs are intrinsically capable of such parallelism, but their application to date has been largely linear, owing to a lack of design tools and the habitual persistence of traditional thinking.

As regards dynamic power management policies, on the other hand, all the policies suggested so far [9] either under-predict or over-predict, and thereby pay a performance or power penalty. Our policy ensures that the server is on whenever there is an event in the service requester or the service queue, which means that neither under-prediction nor over-prediction ever occurs, and no performance penalty is incurred by the proposed scheme.

5 Conclusion

The addition of parallel processing techniques to reconfigurable computing has the potential to improve DSP applications. Therefore, multiple processing units arranged according to traditional parallel processing techniques are being applied to computation- and data-intensive applications such as HMMs. This paper has presented a minimum-depth partitioning algorithm for parallel reconfigurable computing. It was shown that application speed can be improved by increasing the parallelism of the parallel DRC units. The resulting high-parallelism design has a somewhat higher total chip area because of redundancy between parallel units. The proposed algorithm can accept an arbitrary chip-area constraint or a maximum-parallelism constraint and then optimize for speed.

Good predictors should minimize the number of mispredictions. We call an over-prediction (under-prediction) a predicted idle period longer (shorter) than the actual one. Over-predictions give rise to a performance penalty, while under-predictions imply power waste but no performance penalty. To represent the quality of a predictor, we define two figures of merit: safety, the complement of the risk of making over-predictions, and efficiency, the complement of the risk of making under-predictions.
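
Given a trace of predicted versus actual idle periods, these two figures can be estimated directly; a sketch (the paired-list trace format is an assumption):

```python
def predictor_quality(predicted, actual):
    """Safety = 1 - over-prediction rate; efficiency = 1 - under-prediction rate."""
    n = len(actual)
    over = sum(1 for p, a in zip(predicted, actual) if p > a)
    under = sum(1 for p, a in zip(predicted, actual) if p < a)
    return 1.0 - over / n, 1.0 - under / n
```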

In summary, the previously proposed policies suffer from either under-prediction or over-prediction and therefore pay a performance or power penalty. Our policy keeps the server on whenever there is an event in the service requester or the service queue, so neither under-prediction nor over-prediction ever occurs, and no performance penalty is incurred.