Introduction

A system on chip (SoC) is a chip that integrates multiple different components on the same piece of silicon. These components can be processing units, memories or other functions. SoCs are integrated circuits (ICs) at a larger scale: an SoC is usually considered to provide broad functionality, whereas an IC is usually considered to serve one specialised function (Chusov et al. 2021). The key is the combination of software and hardware. Hardware is very fast and has low power consumption, but it is not flexible or adaptable and is hard to design and test. Software is very flexible, adaptable and easy to write and test, but it is also slow and has high power consumption. In other words, programmability costs performance and performance costs adaptability. SoCs are a good middle ground when both programmability and performance are wanted (Ahmed et al. 2021).

SoCs communicate internally over interconnects and externally via communication protocols. For example, an SoC for a smartphone needs a communication protocol to interact with peripheral devices such as cameras, screens or other chips. Communication protocols can also be used internally between the different components of the SoC. Figure 1 shows a high-level representation of an SoC: the modules it may contain and how they are connected by an interconnect (Rumyantsev et al. 2020).

Fig. 1 Block diagram for a high-level representation of an SoC

SoCs are becoming more complex and contain more functionality. Greater design complexity means more transistors, but short time to market forces designers to use those transistors less efficiently (Frolova et al. 2020): efficient use of transistors comes from optimization, for which designers do not have enough time. One way to counter this is design reuse, where the same design is used for multiple applications; such designs usually leave room for some customization of architectural parameters.

Energy optimization

The following section provides a more in-depth discussion of an optimization concept for memory subsystems composed of STT-RAM blocks. The optimization potential stems in part from two characteristics of this memory technology. Most importantly, the trade-off between the energy consumption of an STT-RAM write operation and its latency must be taken into account. In addition, as has been demonstrated for static random access memory (SRAM), the required memory access logic causes dynamic energy consumption to increase with memory size (Rudolf et al. 2019), and it has been shown how to exploit this fact by optimising SRAM memory subsystems in the same way as DRAM memory subsystems. However, while selecting an energy-efficient set of STT-RAM memories is conceptually straightforward, it involves aspects that can be classified as allocation and binding problems, which are discussed in greater detail below (Strobel et al. 2019). As already discussed, allocating memory blocks of different sizes can make a significant difference in write and read energy consumption, particularly when combined with binding of application segments and when different memory access frequencies are taken into account.

The system’s overall performance is therefore influenced by several factors, including the memory subsystem’s operating frequency, the memory blocks’ operating voltage levels and the processor’s operating frequency. Assigning different voltage levels to different memory blocks is referred to as operating-voltage-level assignment (Strobel and Radetzki 2019a). When integrated in a 45 nm node, a 4 MiB STT-RAM memory can operate at a maximum frequency ranging from 48 to 57 MHz, depending on the selected operating mode; a trade-off diagram for the 45 nm node illustrates this in more detail (Strobel and Radetzki 2019b). Examining the write operation for other memory sizes reveals a much broader range of design possibilities: write frequencies as low as roughly 40 MHz for megabyte-scale memories, and write frequencies of more than 100 MHz for smaller memories with capacities of a few thousand bytes or less. Because energy consumption scales in a similar way, it is not difficult to identify a good or even optimal solution within this design space (Fan et al. 2019). The following section discusses a memory optimization method for STT-RAM memories that accounts for the impact of memory size and the effects of different operating voltage levels at the same time, which saves both time and effort by eliminating the need for multiple separate calculations (Strobel and Radetzki 2019c).
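
To make the allocation-and-binding idea concrete, the following C sketch assigns a few hypothetical application segments to candidate STT-RAM block configurations so that frequency-weighted read and write energy is minimised. All block sizes, per-access energy values and segment profiles are invented placeholders for illustration; they are not figures from the cited works.

/* Toy illustration of allocation and binding: for each application segment,
 * pick one of a few candidate memory-block configurations so that total
 * dynamic energy, weighted by the segment's access frequencies, is minimal.
 * All numbers below are hypothetical. */
#include <stdio.h>
#include <float.h>

typedef struct {
    const char *name;
    double size_kib;          /* capacity of the candidate block           */
    double read_energy_pj;    /* assumed dynamic energy per read access    */
    double write_energy_pj;   /* assumed dynamic energy per write access   */
} mem_block_t;

typedef struct {
    const char *name;
    double reads;             /* profiled read count of the segment        */
    double writes;            /* profiled write count of the segment       */
    double size_kib;          /* footprint that must fit in the block      */
} segment_t;

int main(void) {
    /* Larger blocks pay more energy per access because of the access logic. */
    mem_block_t blocks[] = {
        { "4KiB",   4.0,    1.0,  3.0 },
        { "64KiB",  64.0,   2.5,  6.0 },
        { "1MiB",   1024.0, 6.0, 14.0 },
    };
    segment_t segs[] = {
        { "stack",   4.0,   5e6, 4e6 },
        { "heap",    512.0, 2e6, 1e6 },
        { "buffers", 48.0,  8e6, 5e5 },
    };
    double total = 0.0;
    for (int s = 0; s < 3; s++) {
        double best = DBL_MAX; int best_b = -1;
        for (int b = 0; b < 3; b++) {
            if (blocks[b].size_kib < segs[s].size_kib) continue;   /* must fit */
            double e = segs[s].reads  * blocks[b].read_energy_pj
                     + segs[s].writes * blocks[b].write_energy_pj;
            if (e < best) { best = e; best_b = b; }
        }
        if (best_b < 0) continue;                                  /* no block fits */
        printf("%-8s -> %-6s  %.3g pJ\n", segs[s].name, blocks[best_b].name, best);
        total += best;
    }
    printf("total dynamic energy: %.3g pJ\n", total);
    return 0;
}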

Efficient load aware memory management (ELMM) algorithm

Variables and data structures in shared memory are declared in such a way that the compiler does not perform memory optimizations on them and does not hold their values in processor registers, since the available local storage is insufficient to hold them; this ensures that the processor registers that are available are used as efficiently as possible. As a result of this choice, other forms of local storage, such as processor caches, are bypassed when the shared memory pool is accessed (Sergey et al. 2019). Data is read from shared memory whenever software requests it, and results are immediately written back to shared memory to complete the transaction. Because of its simplicity, this scheme can be implemented with little difficulty on any processor.
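
A minimal C sketch of this declaration style is shown below, assuming the shared pool is mapped at a hypothetical fixed address; the volatile qualifier is what keeps the compiler from holding the shared fields in registers.

/* Minimal sketch: the shared structure is qualified `volatile`, so the
 * compiler will not cache its fields in registers, and every access goes
 * straight to the shared pool. SHARED_POOL_BASE is a hypothetical,
 * platform-specific address. */
#include <stdint.h>

#define SHARED_POOL_BASE 0x40000000u   /* assumption: where the pool is mapped */

typedef struct {
    uint32_t command;      /* written by software, read by hardware */
    uint32_t status;       /* written by hardware, read by software */
    uint32_t payload[16];
} shared_region_t;

static volatile shared_region_t *const shared =
    (volatile shared_region_t *)SHARED_POOL_BASE;

uint32_t read_status(void) {
    /* Each read is re-issued to shared memory; nothing is held in a register. */
    return shared->status;
}

void write_command(uint32_t cmd) {
    /* The store goes back to shared memory immediately. */
    shared->command = cmd;
}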

Although this scheme is feasible, it can only be implemented if the processor supports explicit instructions for flushing all local storage into memory, which is not always the case because of limitations of some CPUs. The most significant advantage of both schemes is that there are no costs associated with transferring data between software and hardware (Casini et al. 2018). Beyond the fact that data is not buffered within the processor's local memory, the advantage of shared memory over a local storage scheme is that data consistency is maintained throughout the transaction. However, while the software portion of the application is active, traffic on the processor bus increases dramatically. Because mapped-memory access times are significantly longer than those of unmapped memory, using mapped memory in place of unmapped memory can result in a significant reduction in performance (Kuan and Adegbija 2018). To ensure that these schemes perform as efficiently as possible, it is critical to take into account factors such as the processor bus speed, the speed of the mapped memory, and the speed of the access logic associated with it (Figs. 2, 3).
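
Where the processor does expose such a flush operation, the hand-off might look like the following sketch. flush_dcache_range() is a hypothetical stand-in for the architecture-specific cache-maintenance instruction and is stubbed out here; __sync_synchronize() is the GCC full memory barrier.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholder for the architecture-specific cache-flush
 * instruction; the real body depends on the CPU. */
static void flush_dcache_range(void *addr, size_t len) {
    (void)addr; (void)len;
}

static uint32_t fake_doorbell;                       /* stands in for an MMIO flag */
static volatile uint32_t *const doorbell = &fake_doorbell;

void hand_off_to_hardware(void *buf, size_t len) {
    flush_dcache_range(buf, len);   /* push the buffer out of local storage     */
    __sync_synchronize();           /* GCC full barrier: order flush before flag */
    *doorbell = 1u;                 /* notify the hardware that data is ready    */
}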

Fig. 2 System architecture

Fig. 3 Block diagram of the efficient load aware memory management

The vast majority of microprocessors currently available on the market are, in their most basic configuration, Von Neumann machines (Sayed et al. 2018). The most notable exception is the memory architecture shown in Fig. 4, which stores instructions and data in separate memory spaces and provides separate data, address and control buses for each memory space, in contrast to the conventional architecture with a single set of data, address and control buses. This approach has several advantages: instruction and data fetches can occur at the same time, and the size of an instruction is not limited by the size of a standard data unit (word).

Fig. 4 Memory architecture

The memory system architecture consists of a cluster of high-speed processors, each with its own cache or local memory, together with access to a large, shared global memory pool (Fig. 5). Data and programs to be executed are stored in the global memory before execution. This memory also holds a table of processes that are currently awaiting execution, referred to as the “waiting list”, which records the status of each process (or sub-program) while it awaits execution (Xu et al. 2018). To run semi-independently of the other processors in the system, each processor loads the processes and their associated data into its own local memory or cache (Chen et al. 2018). Processes can also communicate with one another through the global memory system.
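
The waiting list can be pictured as a simple table in global memory; the C sketch below shows one hypothetical layout and how a processor might claim the next pending process. The field names and the claim routine are illustrative assumptions, not the structure used in the cited works, and no locking is shown.

#include <stdint.h>

typedef enum { PROC_WAITING, PROC_LOADED, PROC_RUNNING, PROC_DONE } proc_state_t;

typedef struct {
    char          name[16];     /* process / sub-program identifier              */
    proc_state_t  state;        /* status while awaiting execution               */
    uint32_t      image_addr;   /* where its code and data sit in global memory  */
    uint32_t      image_size;
    int           owner_cpu;    /* processor that loaded it locally, -1 if none  */
} waiting_entry_t;

#define MAX_WAITING 64

typedef struct {
    uint32_t        count;
    waiting_entry_t entry[MAX_WAITING];
} waiting_list_t;

/* A processor claims the next waiting process and marks it as loaded into
 * its local memory or cache. A real system would need synchronisation here. */
int claim_next(volatile waiting_list_t *wl, int cpu) {
    for (uint32_t i = 0; i < wl->count; i++) {
        if (wl->entry[i].state == PROC_WAITING) {
            wl->entry[i].state = PROC_LOADED;
            wl->entry[i].owner_cpu = cpu;
            return (int)i;
        }
    }
    return -1;   /* nothing waiting */
}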

Fig. 5 ELMM shared-memory

Xtra storage mapping algorithm

Memory mapping is the process of associating the contents of a file with a portion of the virtual address space of a running process (Calinescu et al. 2018). The system keeps track of this association by creating a file-mapping object. The portion of the virtual address space that a process uses to access the file's contents is called the file view (Long et al. 2018). File mapping covers both reading from and writing to the file, and it allows a process to perform both random and sequential input and output efficiently. Large data files can be processed without mapping the entire file into memory, which is particularly advantageous when dealing with large files. Memory-mapped files can also be used to share data among multiple processes running at the same time.

It is therefore possible to support multiprocessing with memory-mapped files. As with dynamically allocated memory, processes read from and write to the file view through pointers rather than through explicit file operations. To improve performance, file mapping keeps the file on disk while the file view is held in memory (Zhao et al. 2018), which makes the system more responsive. The VirtualProtect function can be used by a process to change the protection of pages within the file view. Figure 6 depicts the relationship between a file view, a file-mapping object and a file on disk (Chang et al. 2017).
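
The description above uses Windows terminology (file-mapping object, file view, VirtualProtect); the POSIX sketch below illustrates the same idea with mmap and mprotect. The line-counting workload is only a placeholder to show pointer-based access to the view.

/* POSIX sketch of file mapping: the file stays on disk, a view of it is
 * mapped into the process's virtual address space, and ordinary pointers
 * read it. mprotect() plays the role VirtualProtect plays on Windows. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

    /* Map a view of the file; a large file need not fit in memory at once. */
    size_t view_len = (size_t)st.st_size;
    char *view = mmap(NULL, view_len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (view == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Random or sequential access through plain pointers, no read() calls. */
    size_t newlines = 0;
    for (size_t i = 0; i < view_len; i++)
        if (view[i] == '\n') newlines++;
    printf("%zu lines\n", newlines);

    /* Tighten the protection of the view, analogous to VirtualProtect. */
    mprotect(view, view_len, PROT_NONE);

    munmap(view, view_len);
    close(fd);
    return 0;
}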

Fig. 6 Xtra storage mapping algorithm

Storage system performance has not improved at the same rate as storage capacity, even though processor performance has increased dramatically. While CPUs can process data quickly, system bottlenecks still occur because of the latency introduced by the data storage device (Park et al. 2017). Data-intensive workloads carry a significant I/O overhead, which accounts for a large portion of the time lost to inefficiency. With the recent growth in the amount of data processed by applications, a substantial improvement in overall system performance is urgently needed. In recent years, in-memory data processing has received a great deal of attention because it speeds up workloads by eliminating I/O. Such issues can also be mitigated in the long run by including memory-mapped file I/O functionality in the application.

Memory access latency

The actual data-transmission latency calculation starts by dividing the “chunks” parameter by the mem_t data structure’s bus_width field to obtain the total number of data columns to be transmitted. Once that number is known, the latency function divides it by two and stores the result in a temporary variable called access_transfer (DDR3 memory can transmit twice per memory clock, one data column per transmission).

At this point the latency is still expressed in memory clock cycles (Lapshev and Hasan 2016). The latency function converts it to processor clock cycles by multiplying both values (“latency” and access_transfer) by the mem_t field mem_to_cpu_clk. Once the values have been converted to processor clock cycles, the latency function stores access_transfer in the channel's prev_burst variable for use in subsequent latency calculations.

Because the data bus will be busy until the current memory access completes, the latency function also stores the value of start_time + “latency” + access_transfer in the channel's bus_timer variable (Li et al. 2016). Finally, the latency function calculates the memory access latency relative to the “now” input parameter by returning the value of bus_timer minus “now”.
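
Putting the three paragraphs above together, a hedged C reconstruction of the latency calculation might look as follows. The field names (bus_width, mem_to_cpu_clk, prev_burst, bus_timer) follow the text, but the structure definitions, types and function signature are assumptions made only so that the sketch is self-contained.

#include <stdint.h>

typedef struct {
    uint32_t bus_width;       /* chunk units carried per data column (assumed)    */
    uint32_t mem_to_cpu_clk;  /* processor clock cycles per memory clock (assumed) */
} mem_t;

typedef struct {
    uint64_t prev_burst;      /* length of the previous burst, in CPU clocks */
    uint64_t bus_timer;       /* time at which the data bus becomes free     */
} channel_t;

uint64_t memory_access_latency(mem_t *mem, channel_t *ch,
                               uint64_t now, uint64_t start_time,
                               uint64_t latency, uint32_t chunks) {
    /* 1. Total number of data columns to transmit. */
    uint32_t columns = chunks / mem->bus_width;

    /* 2. DDR3 transmits two columns per memory clock (one per transmission),
     *    so the burst occupies columns / 2 memory clock cycles. */
    uint64_t access_transfer = columns / 2;

    /* 3. Convert both values from memory clock cycles to processor clock cycles. */
    latency         *= mem->mem_to_cpu_clk;
    access_transfer *= mem->mem_to_cpu_clk;

    /* 4. Remember this burst and mark the bus busy until the access completes. */
    ch->prev_burst = access_transfer;
    ch->bus_timer  = start_time + latency + access_transfer;

    /* 5. Return the access latency relative to "now". */
    return ch->bus_timer - now;
}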

Latency optimisation

The latency pragma is employed because it ensures that the implemented design executes all of the operations in a function within a specified range of clock cycles. The pragma can be declared inside a loop, in which case the specified latency applies to each iteration of the loop (Kulkarni et al. 2016); if the total latency of the loop is to be constrained instead, the pragma must be declared outside the loop (Fig. 7).

Fig. 7 Memory_access_latency function work flow

The following declares the latency separately for each iteration:

figure a
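
The original listing (figure a) is not reproduced here; as a hedged illustration, declaring the latency pragma inside the loop body in Vivado HLS C applies the bound to each iteration separately. The function name, array bounds and cycle range below are invented for the example.

void copy_block(int in[128], int out[128]) {
    loop_iter: for (int i = 0; i < 128; i++) {
#pragma HLS latency min=1 max=4   /* constrains each iteration, not the whole loop */
        out[i] = in[i];
    }
}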

In addition to optimizing sequential loops individually, latency can be reduced further by merging them. Flattening a nested loop removes additional latency (Balasa et al. 2017). To flatten a nested loop, the directive should be applied to the innermost loop of its parent loop body; it can be defined as: set_directive_loop_flatten top/inner. A source-level sketch of the same directive is shown below.
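
Expressed directly in the C source instead of as a Tcl directive, the same flattening can be requested with the loop_flatten pragma placed in the innermost loop; the function and loop labels below are illustrative only.

void scale(int a[16][16], int k) {
    top: for (int i = 0; i < 16; i++) {
        inner: for (int j = 0; j < 16; j++) {
#pragma HLS loop_flatten   /* collapses the top/inner nest into a single loop */
            a[i][j] *= k;
        }
    }
}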

Implementation and results

The memory controller implementation was evaluated using trace-based simulation. A VHDL testbench reads requests from a text file and issues them to the controller via the host input interface. The testbench records a request as complete when it is signalled on the host output interface, thereby measuring the overall latency of each memory request as seen by a real host, as shown in Fig. 8.

Fig. 8 Simulation method

The testbench issues requests as fast as possible, without regard for the timing information in the trace, because that timing is itself dependent on the average memory access time, and using it would not give an accurate picture of how the design implemented in this work actually affects the average memory latency. It would have been preferable to perform the trace-based simulation with realistic timing, but the interdependence between the rate at which memory requests are issued and their average response time makes synthesising such timing difficult. Therefore, the traces were fed into the simulation as quickly as possible. Accelerated traces allow the controller to achieve its maximum possible performance because the maximum amount of concurrency is available, although they can also hide some of the shortcomings of certain controller implementations.
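
The testbench itself is written in VHDL; purely as a conceptual illustration, the C sketch below mimics the accelerated replay: requests are read from a text trace (assumed here to contain one hexadecimal address per line) and handed to a modelled controller back to back, so the measured cycles per request reflect the controller's peak throughput rather than the host's original timing. Both the trace format and the fixed service time are assumptions.

#include <stdio.h>

#define SERVICE_CYCLES 10ULL   /* hypothetical cycles the modelled controller needs per request */

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <trace-file>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "r");
    if (!f) { perror("fopen"); return 1; }

    unsigned long long addr, requests = 0, finish_time = 0;
    while (fscanf(f, "%llx", &addr) == 1) {
        /* Accelerated replay: the next request is always ready,
         * so the controller is never left idle waiting for work. */
        finish_time += SERVICE_CYCLES;
        requests++;
    }
    fclose(f);
    if (requests)
        printf("%llu requests completed in %llu cycles (%.2f cycles/request)\n",
               requests, finish_time, (double)finish_time / (double)requests);
    return 0;
}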

However, because of intellectual property protection, obtaining memory figures suitable for industrial applications is difficult. One way to deal with this lack of published numbers is nonlinear regression modelling of memory properties combined with a black-box approach.

Figures 9 and 10 show the simulation results of the proposed method: the technology view and the RTL schematic of its cache memory block. The memory characteristics used in the experimental evaluation are the same as those presented earlier.

Fig. 9 Technology view of cache memory block

Fig. 10 RTL schematic view of memory block

As shown in Fig. 11, dynamic energy minimization experiments were conducted on 16 benchmark applications, and the results, expressed as normalised average power consumption, were compared with the theoretical results alongside a power analysis of the proposed algorithm. This analysis reports total power, supply power and other information (Fig. 12).

Fig. 11 Power consumption based on power analysis

Fig. 12 Device utilisation

Conclusion

When developing high-performance microprocessors, it is essential to design an effective cache memory algorithm that yields an effective system. Memory-intensive applications benefit from the first method, which combines optimization of memory allocation with application binding to provide a more efficient solution; with this method, application-specific dynamic energy minimization can be achieved in the memory subsystem when running memory-intensive applications. Compared with using a single memory block, the determined results are optimal, with average reductions of more than 80%. ELMM memory subsystem configurations with split memory blocks have proven highly efficient for designing low-power System-on-Chip devices. In this setting, the greatest improvement is achieved by using two memory blocks instead of one, while allocating additional memories yields only a marginal further increase in energy savings.