1 Introduction

As embedded software grows rapidly in size and complexity to satisfy diverse market demands, efficient use of limited storage resources is becoming ever more important for compilers targeting embedded processors. To attain a desired performance goal with such limited storage, these processors are designed on the assumption that the software running on them makes heavy use of their special hardware features. One potentially profitable route to efficient use of memory bandwidth is the DMA engine with burst transfer.

A large number of techniques to reduce memory access delay and increase memory bandwidth have been proposed, since the performance of the memory subsystem usually dominates overall system performance [6]. Most modern DRAMs used in embedded system design support efficient memory access modes; in particular, the burst access mode is extensively supported [5, 10]. In general, the latency of a random access is much higher than that of a burst access. Furthermore, memory operates at lower power per bit in burst mode than in random-access mode. For example, IBM's Cu-11 Embedded DRAM macro [10] and the ARM macro [1] support a random access of 10 ns and a burst-mode access of 5 ns, with an active current of roughly 60 mA/Mb in random-access cycles and 13 mA/Mb in burst mode. Consequently, full exploitation of the burst memory-access mode is essential to alleviate the performance bottleneck caused by memory access delay.

Panda et al. [14] proposed an algorithm for assigning local variables to memory and organizing array variables via loop-transformation techniques. To do this, they modeled a realistic burst-access mode in DRAMs. Their formulation of the variable-assignment problem is analogous to the one in [15], which was originally used to reduce the number of cache misses. Although the formulation looks reasonable, it is not exact in a strict sense. Khare et al. [12] extended the work in [14] to support the burst mode in a dual-memory architecture, focusing mainly on efficient interleaving of memory accesses.

Johnson et al. [11] designed an algorithm that identifies profitable memory-access sequences for combining into multiple-memory-access instructions as a means of compacting code. Grun et al. [7] proposed an approach that allows the compiler to exploit detailed timing information of the memories; they showed that further optimization (using a burst mode) is possible when accurate timing of memory accesses to multiple memories is extracted. They also presented an approach [8] for extracting, analyzing, and clustering the most active memory-access patterns in an application and customizing the memory architecture. Ayukawa et al. [2] proposed an access-sequence control scheme for relieving the page-miss penalty in random-access mode, introducing an embedded DRAM macro with special hardware logic for access-sequence control. McKee et al. [13] described a stream memory controller (SMC) whose access-ordering circuitry attempts to maximize memory-system performance based on device characteristics; their compiler detects streams, while access ordering and issue are determined by hardware. Chame et al. [4] manually optimized an application for a PIM-based (embedded-DRAM) system by applying loop unrolling and memory-access reordering to increase the number of burst-mode accesses. Shin et al. [16] designed and implemented a compilation technique that exploits the burst mode automatically; their technique unrolls loops and reorders memory accesses by hoisting code in the loop body. Hettiaratchi et al. [9] proposed an address-assignment technique for array variables that reduces the energy consumed by memory accesses by maximizing burst-mode accesses; they formulated the problem as multiway graph partitioning and solved it with a Kernighan–Lin-based partitioning algorithm.

All of the above approaches are related to our work in that they aim at finding an optimal memory (offset) assignment for a particular hardware feature such as the burst transfer. However, they differ from ours in that none of them addresses the size of variables: they all solve specific problems restricted to word-sized variables. The work presented here solves a generalized problem that maximally utilizes burst transfers, including for subword-sized variables.

In this paper, we propose a new general approach to the problem of relocating non-array variables in code to DRAM so that the efficient burst mode is maximally utilized. The first contribution of this work is to show that the problem is NP-hard; consequently, finding optimal solutions to relatively large problem instances within a reasonable amount of time looks infeasible, which validates our strategy of designing an efficient heuristic. The second contribution is that, unlike previous approaches, we solve the original problem based on a formal formulation.

The rest of this paper is structured as follows. Section 2 describes our generalized problem formulation. Section 3 presents the overall structure of the proposed approach. Section 4 shows the experimental results, and Section 5 concludes the paper.

2 Problem formulation

We first define a set of constraints that must be satisfied when generating burst accesses. Then, we formally define a variable placement problem (VPP) for maximizing burst accesses. The optimization problem is, given a set of read/write operations, to find a placement of stack variables that maximizes the number of burst accesses (e.g., on ARM chips [1]).

Definition 1

(Burst order constraints)

In a burst access, the locations of the m variables must be contiguous, starting from the address specified by the offset, i.e., Var[offset], Var[offset + Type], …, Var[offset + Type*(m-1)], where Type represents the size of the variables. For example, a four-beat burst accesses Var[offset], Var[offset + Type*1], Var[offset + Type*2], Var[offset + Type*3], i.e., sequentially beginning at the first access. Note that when a burst access is to be formed from reads or writes, the task of satisfying the burst order constraint is closely tied to the task of assigning variables to memory (i.e., variable placement).
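
To make the constraint concrete, the following minimal sketch prints the contiguous address sequence that a four-beat burst must follow; the values of offset, Type, and m are invented for the example and are not taken from the paper.

    #include <cstdio>

    // Minimal illustration of the burst order constraint: the m variables of a
    // burst occupy the contiguous addresses Var[offset + Type*i].
    int main() {
        const unsigned offset = 0x100;  // assumed starting address of the burst
        const unsigned type   = 4;      // assumed variable size (one word)
        const unsigned m      = 4;      // assumed burst length
        for (unsigned i = 0; i < m; ++i)
            std::printf("access %u -> Var at address 0x%X\n", i, offset + type * i);
        return 0;
    }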

Problem 1

(VPP) For a set RWtot of read/write operations in assembly code C, the VPP is to generate a set RWburst of burst accesses from RWtot while satisfying the burst order constraints as well as the data dependency constraints of C. The goal is to group as many read/write operations as possible into burst accesses, thereby reducing the number of remaining normal accesses (denoted |RWnorm|); the objective function to be minimized is the total number of accesses, |RWburst| + |RWnorm|, where |RWburst| is the number of burst accesses generated.

The objective function reflects memory access latency, which depends on hardware features. For the sake of generality, |RWburst| counts each burst transfer as a single access, regardless of how many normal R/W operations it serves, while |RWnorm| counts the remaining normal accesses. For example, if 4 words are transferred in one burst, 4 read operations are involved, but |RWburst| increases by only one: the first access accounts for most of the burst transfer time since it initiates the transfer, while the other 3 accesses of the burst hit the cache and take relatively little time compared with normal accesses, so we do not count them toward |RWburst|.
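
As a small worked example (with an invented operation count and grouping, not taken from the benchmarks), the sketch below evaluates the objective: each burst contributes one to |RWburst| no matter how many operations it serves, and every ungrouped operation contributes one to |RWnorm|.

    #include <cstdio>
    #include <vector>

    int main() {
        // Hypothetical grouping of 13 R/W operations: three bursts serving
        // 4, 2, and 2 operations, plus 5 operations left as normal accesses.
        std::vector<int> burst_sizes = {4, 2, 2};
        int normal_ops = 5;

        int rw_burst = static_cast<int>(burst_sizes.size());  // each burst counts once
        int rw_norm  = normal_ops;
        std::printf("|RWburst| = %d, |RWnorm| = %d, objective = %d\n",
                    rw_burst, rw_norm, rw_burst + rw_norm);
        return 0;
    }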

We describe the VPP and our proposed solution using an example code fragment from a multimedia application, where the original ARM code is translated into 3-address form for readability. R-regions (read regions) and W-regions (write regions), marked with vertical bars on the right side of Fig. 1, represent the maximum ranges of cycle steps over which the corresponding read and write operations can be executed without violating the data dependencies specified in the code. For example, the read of variable d in basic block B1 has an R-region stretching from cycle 0 to cycle 6 in B1, because there is no write before the read in B1 and the value read into r2 is first used at cycle 6. Similarly, the W-region of the write to variable a in block B1 stretches from cycle 7 to cycle 8, since the written value is defined at cycle 6 and there is no read of a in B1 after the write.
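
As an illustration only (not the compiler's actual analysis), the following sketch computes R-regions for the reads of a single basic block from a simplified operation list; the Op structure, the cycle numbers, and the first-use field are assumptions standing in for a real def-use analysis.

    #include <cstdio>
    #include <string>
    #include <vector>

    // Simplified view of one memory operation in a basic block.
    struct Op {
        int cycle;          // cycle at which the operation appears in the block
        bool is_read;       // true: load of `var`; false: store to `var`
        std::string var;    // memory variable accessed
        int first_use;      // for reads: cycle where the loaded value is first used
    };

    // R-region of a read: from the cycle just after the last earlier write to the
    // same variable in the block (or cycle 0 if none) up to the first use of the
    // loaded value, mirroring the informal definition in the text.
    void print_r_regions(const std::vector<Op>& block) {
        for (size_t i = 0; i < block.size(); ++i) {
            if (!block[i].is_read) continue;
            int lo = 0;
            for (size_t j = 0; j < i; ++j)
                if (!block[j].is_read && block[j].var == block[i].var)
                    lo = block[j].cycle + 1;
            std::printf("read of %s: R-region [%d, %d]\n",
                        block[i].var.c_str(), lo, block[i].first_use);
        }
    }

    int main() {
        // Invented block loosely resembling B1 in Fig. 1: the read of d at cycle 0
        // has its value first used at cycle 6, giving R-region [0, 6].
        std::vector<Op> b1 = { {0, true, "d", 6} };
        print_r_regions(b1);
        return 0;
    }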

The problem is then to generate burst accesses by grouping read/write operations whose R/W-regions overlap, so that the objective function is minimized while the burst order constraints are satisfied. A simple check shows that the variable placement critically affects whether the burst order constraint of each R/W operation can be satisfied. Before describing our approach to the VPP, we show that the problem is intractable.

Problem 2

(decision-VPP) For a VPP instance with (RWtot, C) and a value k, is there a variable placement that can generate RWburst such that |RWburst| + |RWnorm| ≤ k?

Problem 3

(decision-MWPC (maximum-weighted path cover)) For an edge-weighted complete graph G(V,W) and a value k’, is there a path cover P such that the total weight of the edges on P is greater than or equal to k’?

Theorem 1

decision-VPP is NP-complete.

Proof

Consider decision-VPP under the restrictions that (a) RWtot = {O1, O2, O3, O4, …, O2m-1, O2m}, (b) all operations in RWtot are reads, and (c) a burst access can be formed only from a pair O2j-1 and O2j, j = 1, 2, …, m, and from nothing else. We show that this restricted decision-VPP is NP-complete. (Note that the problem is in NP, since we can generate burst accesses and check the value of |RWburst| + |RWnorm| in polynomial time.) Let var(Oj) denote the variable read by Oj and let V be the set of all variables read by the operations in RWtot. Then, such an instance corresponds to a complete graph G(V’,W) where V’ = V and the weight of each edge (v1, v2) is the number of pairs O2j-1, O2j, j = 1, 2, …, m, in RWtot such that {var(O2j-1), var(O2j)} = {v1, v2}. We now show that decision-MWPC is polynomial-time reducible to decision-VPP.

Let G(V,W) be a graph in decision-MWPC and let P be a path cover on G. We construct a problem instance RWtot and a solution RWburst of decision-VPP from the instance G(V,W) and P of decision-MWPC. Initially, RWtot = {}. We assign a unique variable to each node of V in G. For an edge ei of positive weight (say wi), we create two unique read operations and add them to RWtot, and then decrement the edge weight by one; this repeats until the edge weight becomes zero, and the process iterates over the other edges of positive weight in G until no edge of positive weight remains. Note that if the total edge weight of G is L, then |RWtot| = 2L. We place the variables in memory in the order of the corresponding node sequence of P. Then, we create an RWburst solution from RWtot as follows:

For each edge ei in P of G, we extract the corresponding wi pairs of operations from RWtot and group each pair, forming wi burst accesses. Thus, if the total edge weight of P is k, there are k burst accesses (i.e., |RWburst| = k) and 2L - 2k normal accesses (i.e., |RWnorm| = 2L - 2k), so |RWburst| + |RWnorm| = 2L - k. Consequently, the constructed decision-VPP instance has a solution with |RWburst| + |RWnorm| ≤ 2L - k’ if and only if the decision-MWPC instance has a path cover P of total edge weight at least k’; setting the decision-VPP bound k = 2L - k’ completes the reduction.
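
The instance construction used in the proof can be sketched as follows; the example graph, its weights, and the data layout are invented for illustration. Every unit of edge weight becomes one pair of read operations on the edge's two variables, and only the two operations of a pair may be combined into a burst.

    #include <cstdio>
    #include <utility>
    #include <vector>

    struct Edge { int u, v, w; };

    int main() {
        // Hypothetical MWPC instance on three variables 0, 1, 2.
        std::vector<Edge> edges = { {0, 1, 2}, {0, 2, 1}, {1, 2, 3} };

        // Each entry of `pairs` stands for one pair (O_{2j-1}, O_{2j}) reading the
        // two variables of an edge; an edge of weight w contributes w pairs.
        std::vector<std::pair<int, int>> pairs;
        for (const Edge& e : edges)
            for (int i = 0; i < e.w; ++i)
                pairs.push_back(std::make_pair(e.u, e.v));

        // Total edge weight L = 6, so |RWtot| = 2L = 12 read operations.
        std::printf("L = %zu pairs, |RWtot| = %zu operations\n",
                    pairs.size(), 2 * pairs.size());
        return 0;
    }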

3 The proposed approach

Given the intractability established by Theorem 1, we propose an efficient heuristic approach, called VPP Solver. The solver consists of two steps:

(Step 1) From RWtot, we construct a graph, called the variable placement graph Gvpp, which describes the relations between variables for forming burst accesses; (Step 2) we apply a burst-access forming technique to the Gvpp obtained in Step 1 to generate burst accesses that minimize |RWburst| + |RWnorm|.

Step 1.1 (Construction of Gvpp): The nodes of Gvpp represent the variables used in the read/write operations of RWtot. Gvpp is a multigraph; there is an edge between two nodes ni1 and ni2 for each pair of read operations or write operations Oj1 and Oj2 in RWtot that satisfies two conditions: (condition 1) {var(ni1), var(ni2)} = {var(Oj1), var(Oj2)}, where var(x) denotes the variable transferred by x if x is a read or write operation, or the corresponding variable if x is a node of Gvpp; (condition 2) the lifetimes of Oj1 and Oj2 overlap. Figure 2 shows the Gvpp of the code in Fig. 1. The graph consists of nodes, each representing a distinct variable used in the reads/writes, and each edge of Gvpp is labeled with a pair of operations: a pair of reads or writes of the variables corresponding to the end nodes of the edge. For example, “O3/O2” labels the edge (c, b) because O3 and O2 respectively read c and b, satisfying condition 1, and the lifetimes of O3 and O2 overlap, as shown by the R-regions of O3 and O2 in Fig. 1, satisfying condition 2. Figure 3 shows the algorithm build_Gvpp; a simplified sketch of the construction is given below.
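
The sketch below illustrates Step 1.1 under simplifying assumptions: each operation carries a precomputed R/W-region, lifetime overlap is taken as region overlap, and the data structures and tiny example are ours, not the build_Gvpp algorithm of Fig. 3 itself.

    #include <cstdio>
    #include <string>
    #include <vector>

    struct AccessOp {                 // simplified view of a read/write in RWtot
        int id;                       // operation index, e.g., 3 for O3
        bool is_read;
        std::string var;              // variable accessed
        int lo, hi;                   // R- or W-region [lo, hi] in cycles
    };

    struct GvppEdge { std::string v1, v2; int op1, op2; };

    // One multigraph edge per pair of same-kind operations on distinct variables
    // whose regions overlap (conditions 1 and 2 in the text).
    std::vector<GvppEdge> build_gvpp(const std::vector<AccessOp>& ops) {
        std::vector<GvppEdge> edges;
        for (size_t i = 0; i < ops.size(); ++i)
            for (size_t j = i + 1; j < ops.size(); ++j) {
                const AccessOp &a = ops[i], &b = ops[j];
                bool same_kind = (a.is_read == b.is_read);
                bool overlap   = (a.lo <= b.hi) && (b.lo <= a.hi);
                if (same_kind && a.var != b.var && overlap)
                    edges.push_back({a.var, b.var, a.id, b.id});
            }
        return edges;
    }

    int main() {
        // Invented regions loosely following Fig. 1: O2 reads b, O3 reads c.
        std::vector<AccessOp> ops = { {2, true, "b", 1, 6}, {3, true, "c", 2, 6} };
        for (const GvppEdge& e : build_gvpp(ops))
            std::printf("edge (%s, %s) labeled O%d/O%d\n",
                        e.v1.c_str(), e.v2.c_str(), e.op1, e.op2);
        return 0;
    }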

Fig. 1
figure 1

3 basic blocks (B1,B2,B3) of the 3-address code and the R/W-regions for reads/writes of each block

Fig. 2
figure 2

Gvpp derived from the code in Figure 1

Fig. 3
figure 3

Algorithm build_Gvpp

Step 1.2 (Adjustment of Gvpp): The previous step does not consider variable sizes. A node may correspond to a variable of any size, such as a one-byte character, a two-byte short integer, or an eight-byte double.

We observe that many media and network processing applications make extensive use of subword data. For such applications, packing multiple subword variables into a single word yields storage layouts that further reduce the cost of memory accesses.

Thus, the nodes of Gvpp should be adjusted so that each node occupies four bytes, since the basic unit of a burst transfer is a word (four bytes).

To fit each node into four bytes, we define an adjustment operation, called node coalescing, for subword variables. Double-type variables are normally manipulated as two word-sized variables, so no additional operation is needed to split them.

In the simplest case only two nodes are coalesced, for example two two-byte variables; in general, more than two variables may be coalesced into a single location, provided their combined byte widths do not exceed 4 bytes. While a single coalescing operation involves exactly two nodes, performing coalescing iteratively lets more than two variables share the same location.

Definition 2

(Node Coalescing)

Given an access graph G(V,E), nodes u, v ∈ V can be coalesced into a node uv if the following condition holds:

Bitwidth(u) + Bitwidth(v) ≤ 32 bits

The transformed graph G’(V’, E’) obtained after coalescing u and v is defined as follows:

 V’ = (V – {u, v}) ∪ {uv}

 Bitwidth(uv) = Bitwidth(u) + Bitwidth(v)

 E’:

     for each node x ∈ V – {u, v} do
        if u-x ∈ E and v-x ∈ E then
           uv-x ∈ E’, weight(uv-x) = weight(u-x) + weight(v-x)
        elseif u-x ∈ E and v-x ∉ E then
           uv-x ∈ E’, weight(uv-x) = weight(u-x)
        elseif v-x ∈ E and u-x ∉ E then
           uv-x ∈ E’, weight(uv-x) = weight(v-x)
        endif
     endfor
     for each edge x-y ∈ E with {x, y} ∩ {u, v} = φ do
        x-y ∈ E’, weight(x-y) unchanged
     endfor

The node coalescing operation defined above specifies the precondition under which nodes u and v can be coalesced and replaced by the node uv, and how the edges and weights of the transformed graph are obtained. The edge weight weight(x-y) is the number of times variables x and y are accessed one after another.
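
For concreteness, a minimal C++ sketch of node coalescing is given below; the graph representation, the node names, and the example bit widths are our own assumptions rather than the paper's implementation.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <utility>

    // Undirected weighted access graph with per-node bit widths.
    struct Graph {
        std::map<std::string, int> bitwidth;
        std::map<std::pair<std::string, std::string>, int> weight;  // edge key kept sorted

        static std::pair<std::string, std::string> key(std::string a, std::string b) {
            return a < b ? std::make_pair(a, b) : std::make_pair(b, a);
        }

        // Coalesce u and v into a node named u+v if their widths fit one 32-bit word.
        bool coalesce(const std::string& u, const std::string& v) {
            if (bitwidth[u] + bitwidth[v] > 32) return false;
            std::string uv = u + v;
            bitwidth[uv] = bitwidth[u] + bitwidth[v];

            std::map<std::pair<std::string, std::string>, int> new_weight;
            for (const auto& e : weight) {
                const std::string &a = e.first.first, &b = e.first.second;
                if (a == u || a == v || b == u || b == v) {
                    std::string x = (a == u || a == v) ? b : a;  // the other endpoint
                    if (x == u || x == v) continue;              // drop the u-v edge itself
                    new_weight[key(uv, x)] += e.second;          // parallel edges merge, weights add
                } else {
                    new_weight[e.first] += e.second;             // edge untouched by the coalescing
                }
            }
            weight = std::move(new_weight);
            bitwidth.erase(u);
            bitwidth.erase(v);
            return true;
        }
    };

    int main() {
        Graph g;
        g.bitwidth = { {"s1", 16}, {"s2", 16}, {"c", 8} };  // two shorts and a char (assumed)
        g.weight[Graph::key("s1", "c")]  = 2;
        g.weight[Graph::key("s2", "c")]  = 1;
        g.weight[Graph::key("s1", "s2")] = 3;
        if (g.coalesce("s1", "s2"))
            std::printf("weight(s1s2 - c) = %d\n", g.weight[Graph::key("s1s2", "c")]);  // 2 + 1 = 3
        return 0;
    }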

Step 2 (Generation of burst accesses): This step groups read/write operations into burst accesses using the Gvpp graph obtained in Step 1. Since the formation of burst accesses must satisfy the burst order constraints, and these constraints are tightly coupled to where the variables are placed in memory, the key task of this step is to find a placement of the variables that leads to the formation of burst accesses. We formulate the problem of finding such a placement as finding a path cover in Gvpp, and design the proposed algorithm as an iterative method.

Initially, we have an empty path P and the graph Gvpp. At each iteration, we select one of the “candidate” edges in Gvpp, expand P by including that edge, and update Gvpp by merging the edge’s two end nodes into a new node. More precisely, in the ith iteration, for every edge (vi, vj) in Gvpp whose inclusion in P does not create a cycle in the initial Gvpp, the operations corresponding to vi and vj would be merged into burst accesses, and the cost of the edge is computed. The cost is the number of burst R/W operations that would be involved in P; the first R/W operation of a burst is not counted, since it takes the full memory access time to initiate the burst transfer. After the cost calculation, we select the candidate edge (vi, vj) with the largest cost, merge the operations of vi and vj into burst accesses, add (vi, vj) to P, and update Gvpp by merging vi and vj and adjusting the connected edges accordingly. The process repeats until P becomes a path that covers all the nodes of the initial Gvpp. Figure 4 shows the step-by-step generation of burst accesses by the proposed algorithm from the Gvpp in Fig. 2. The table in Fig. 4a summarizes the edge costs in the initial Gvpp. Edge (a, d) is selected because its cost is maximal; consequently, two new burst accesses, O(1,6) and O(11,13), resulting from merging operations O1 with O6 and O11 with O13 respectively, are produced. The left side of Fig. 4b shows the updated Gvpp, where the thick edge indicates the (partial) path cover P. Edge (a, b) is then selected, and two more burst accesses are generated accordingly. Figure 4c shows the updated Gvpp and the cost computations in the third and fourth iterations of Step 2 of VPP Solver, together with the final path cover d-a-b-c, which is exactly the memory layout of the variables. From these results, we can predict a total reduction in memory latency of about 12 cycles if the ARM ISA (instruction set architecture) [1] takes three cycles for a normal load and one cycle per access in burst mode, which is 45 % of the total memory latency of the code in Fig. 1. A simplified sketch of this step follows Fig. 4.

Fig. 4
figure 4

Example for illustrating the detailed procedure of step 2: generation of burst accesses from Gvpp
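
The sketch below captures the spirit of Step 2 as a greedy maximum-weight path cover: repeatedly pick the heaviest candidate edge whose inclusion keeps P a set of simple paths (node degree at most 2, no cycle). It is a simplification; the actual VPP Solver recomputes edge costs after each node merge, whereas this sketch ranks the edges once, and the example costs are invented, only loosely following Fig. 4.

    #include <algorithm>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct Edge { std::string u, v; int cost; };

    // Tiny union-find over node names, used to detect cycles.
    struct DSU {
        std::map<std::string, std::string> parent;
        std::string find(std::string x) {
            if (parent.find(x) == parent.end()) parent[x] = x;
            while (parent[x] != x) x = parent[x];
            return x;
        }
        bool unite(const std::string& a, const std::string& b) {
            std::string ra = find(a), rb = find(b);
            if (ra == rb) return false;
            parent[ra] = rb;
            return true;
        }
    };

    int main() {
        // Invented edge costs for the variables a, b, c, d of Fig. 2.
        std::vector<Edge> edges = { {"a", "d", 2}, {"a", "b", 2}, {"b", "c", 1}, {"c", "d", 1} };
        std::sort(edges.begin(), edges.end(),
                  [](const Edge& x, const Edge& y) { return x.cost > y.cost; });

        DSU dsu;
        std::map<std::string, int> degree;
        std::vector<Edge> path_cover;
        for (const Edge& e : edges) {
            if (degree[e.u] >= 2 || degree[e.v] >= 2) continue;   // keep P a set of simple paths
            if (dsu.find(e.u) == dsu.find(e.v)) continue;         // adding e would close a cycle
            dsu.unite(e.u, e.v);
            ++degree[e.u]; ++degree[e.v];
            path_cover.push_back(e);                              // operations on e.u/e.v become a burst
        }
        for (const Edge& e : path_cover)
            std::printf("place %s adjacent to %s (cost %d)\n", e.u.c_str(), e.v.c_str(), e.cost);
        return 0;
    }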

Finally, each variable is assigned to memory following the obtained solution. At this point, variables can be placed in sequential, circular, or interleaved order, depending on the memory-subsystem design. Normally, sequential order is good enough; however, some hardware features, such as rotating register files or memory banks supported by a DMA engine, effectively support interleaved and circular orders. The two non-sequential orders are described below, and a small sketch of the resulting address sequences follows the list.

  • Circular placement: Var[offset], Var[(offset + Type*1)%4], Var[(offset + Type*2)%4], Var[(offset + Type*3)%4], wrapping at the end of the aligned block, e.g., 0-1-2-3, 1-2-3-0, 2-3-0-1, 3-0-1-2.

  • Interleaved placement: The interleaved burst mode computes the address using an exclusive-or operation between the burst counter and the address. With a starting address of 5, a 4-word burst returns words in the order 5-4-7-6, and an 8-word burst returns 5-4-7-6-1-0-3-2. This ordering is preferred by Intel chips.
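
The following sketch prints the word order produced by the three placements within an aligned block; the starting index 5 is taken from the interleaved example above, and everything else is illustrative.

    #include <cstdio>

    int main() {
        const int start = 5;

        std::printf("sequential  (4 words): ");
        for (int i = 0; i < 4; ++i) std::printf("%d ", start + i);

        std::printf("\ncircular    (4 words): ");
        for (int i = 0; i < 4; ++i)
            std::printf("%d ", (start & ~3) | ((start + i) & 3));   // wraps inside block 4..7 -> 5 6 7 4

        std::printf("\ninterleaved (4 words): ");
        for (int i = 0; i < 4; ++i)
            std::printf("%d ", (start & ~3) | ((start ^ i) & 3));   // XOR of counter and address -> 5 4 7 6

        std::printf("\ninterleaved (8 words): ");
        for (int i = 0; i < 8; ++i)
            std::printf("%d ", (start & ~7) | ((start ^ i) & 7));   // -> 5 4 7 6 1 0 3 2

        std::printf("\n");
        return 0;
    }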

4 Experimental results

We conducted a set of experiments to evaluate the effectiveness of the proposed variable-location optimization algorithm, VPP Solver. The algorithm was implemented in C++ and evaluated on a set of benchmark programs from DSPStone [17] on an ARM7 processor. VPP Solver is applied to the ‘unoptimized’ (initial) ARM assembly code of each benchmark program to see how much reduction in execution cycles VPP Solver alone can achieve.

  • Assessing the effectiveness of VPP Solver on initial codes:

    Two versions of ARM assembly code are generated for the benchmarks listed in Table 1. The first version uses normal data transfers, except at procedure boundaries where burst transfers are used in a limited way to save a few status registers; the second version is generated by applying VPP_Solver to the first. The first version places variables on the stack in alphabetically sorted order, since traditional commercial compilers (such as ARM ADS and TI’s Code Composer) arrange non-array variables in that order [1]; the second version places variables in the order produced by VPP_Solver.

    Table 1 Description of benchmark programs

    Table 2 summarizes the two code versions. Init_code and Init + VPP_Solver denote the first and second versions, respectively. R/W operations and #burst accesses indicate the number of memory accesses and the number of burst accesses among them in the corresponding code version, and total cycles indicates the total number of execution cycles. The results in Table 2 show that VPP Solver reduces execution time by 8.4 % on average by increasing burst accesses on a normal ARM processor. In addition, our hand analysis found that traditional register-allocation algorithms [3] generate spills without considering their effect on burst transfers. This observation suggests that a register allocator could reduce spill costs if it identified sets of registers that are likely to be spilled and reloaded together in a single burst transfer. As future work, we are currently extending the register allocator in our compiler framework along this direction.

    Table 2 Comparisons of performances before and after the application of VPP Solver to initial ARM assembly codes of DSPStone benchmarks

    Note that the exploitation of burst transfers by VPP_Solver does not lead to a dramatic decrease in total execution time. However, considering that this reduction comes on top of code on which the compiler has already made every effort at optimization (although this is not shown explicitly here), we believe the results are meaningful. Moreover, the proposed technique performs roughly 1 to 9 times more burst accesses, which brings a tangible reduction in energy consumption, as shown in Fig. 5; energy is an equally important performance metric in embedded processors.

    Fig. 5
    figure 5

    Energy consumption of the benchmark programs

  • Checking the reduction in total energy consumption:

    We used the access figures of the IBM Cu-11 Embedded DRAM macro [10] as a reference in the experiments: a random (i.e., normal) access cycle of 12 ns at best and a burst access of 4 ns at best. In addition, under a supply voltage of 1.5 V, the active current is 59.19 mA on average for random accesses and 11.21 mA for burst accesses. The results show that careful layout of variables to maximize burst accesses contributes considerably to reducing total memory-access energy. Note that the access sequence of a program on a specific real processor can be extracted only after code selection and register allocation, the machine-dependent parts of compilation, have been performed.

    Figure 5 shows how much total energy consumption is saved by our VPP_Solver over the conventional method. The energy consumption of each program under the conventional method is taken as the baseline (100 %), and each bar represents the corresponding relative energy consumption. The savings come from variable-location decisions that enable more burst accesses than the alphabetical-ordering policy does, resulting in fewer accesses to main memory. As a result, the proposed technique consumes 9 to 47 % less energy, as shown in Fig. 5.

5 Conclusions

In this paper, we proposed a comprehensive solution to the problem of variable assignment for fully utilizing an efficient DRAM access mode, the burst transfer. Since memory access is one of the major bottlenecks in embedded-system performance, the proposed technique can be effectively used in DRAM-access-intensive embedded applications. Our contributions are a proof that the problem is NP-hard, a formal definition of the problem, and an efficient variable-layout algorithm, VPP_Solver, for generating burst-mode accesses. A set of experiments with benchmark programs showed that the proposed technique uses, on average, 7.2 times more burst accesses than the code generated by the commercial ARM compiler.