Q-NOTE QN-7000HX Technical Information download pdf (Page 20)

Boost NEON Performance by Improving Memory Access Efficiency

XAPP1206 v1.1 June 12, 2014 www.xilinx.com 20

Therefore, when writing assembly code, it is best to interleave memory access instructions with data processing instructions.

Boost NEON Performance by Improving Memory Access Efficiency

When talking about processor core performance, developers typically make some assumptions that are not always accurate in real-world applications, specifically:

• The processor pipeline is optimal and there is no interlock.

• The memory subsystem is ideal (zero wait state), that is, the processor does not need to wait for memory to return data or instructions.

On-chip static RAM is very fast but costly, so large amounts of it cannot be integrated into the SoC. To lower the bill of materials (BOM) cost, dynamic RAM is often used as the main memory. In the past decade, processor clock frequencies have increased at a much faster pace than dynamic RAM speeds. The slower dynamic RAM can make the high clock frequency of a processor less meaningful. Another issue with dynamic RAM is that it normally has a long latency to return data. For a CPU, this is a serious issue because a CPU is more sensitive to latency than to throughput.

When the memory subsystem cannot immediately return the data or instructions that the processor cores need, the processors have nothing to do but wait. The out-of-order execution of an ARM Cortex-A9 core can alleviate this issue, but does not eliminate it.

To reduce the gap between processor and memory subsystems, engineers introduced cache into modern SoCs. Cache is made up of fast on-chip static RAM and a cache controller that determines when data is moved in or out. But cache is not a panacea for embedded systems. To use it efficiently, you need to know some details about the underlying hardware system.

Cache improves system performance by exploiting temporal and spatial locality. Temporal locality means that a resource accessed now is likely to be accessed again in the near future. Spatial locality means that the likelihood of accessing a resource is higher if a resource near it was just referenced.
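
As a rough illustration of the two kinds of locality, the following C sketch sums the same matrix in two traversal orders. The function names and the flat row-major layout are assumptions made for this example only. Both functions return the same result; they differ only in the order in which memory is touched.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major walk over a matrix stored row by row in one flat array.
   Consecutive iterations touch adjacent addresses, so every cache line
   that is fetched gets fully used (spatial locality). The accumulator
   is reused on every iteration (temporal locality). */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Column-major walk over the same array. Consecutive iterations are
   cols * sizeof(int) bytes apart, so for a wide matrix each access can
   land in a different cache line. */
long sum_col_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}
```

For matrices much larger than the cache, the row-major version would normally be expected to see a higher hit rate, although the actual difference depends on the cache geometry and on any hardware prefetching.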

When the data needed by the CPU can be found in cache, it is called a cache hit. Cache can return the data with low latency. When the data is not in cache, it is called a cache miss, and the cache controller must fetch the data into cache first and then return it to the CPU. This results in much longer latency, and thus lower actual CPU performance. Preloading data into cache before actually using it can improve the cache hit rate, thus improving system performance. The ARM Cortex-A9 implements a preload engine and provides instructions to do this.
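
A minimal sketch of software preloading from C, assuming a GCC or Clang toolchain: `__builtin_prefetch` is a compiler builtin that, on ARM targets, is typically lowered to a preload instruction such as PLD. The function name and the preload distance below are hypothetical; a suitable distance has to be found by measurement on the target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning value: how many elements ahead to preload. */
#define PRELOAD_DISTANCE 16

/* Sums an array while hinting that data a few iterations ahead will be
   needed soon. The hint never changes the result; it only gives the
   memory system a chance to fetch the line before it is used. */
long sum_with_preload(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PRELOAD_DISTANCE < n) {
            /* Arguments: address, 0 = read access, 3 = keep in cache. */
            __builtin_prefetch(&data[i + PRELOAD_DISTANCE], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

Issuing one hint per element is more than is needed, since a single preload covers a whole cache line; a tuned version would issue one hint per line.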

In a real SoC, cache is implemented as a trade-off between performance and complexity. Direct mapped cache is simple but inefficient. Fully associative cache has the highest efficiency, at the cost of very complicated hardware design. In practice, N-way set associative cache is frequently used. There is one potential issue with this. If the code is not well written, cache thrashing can occur, and that can lower system performance significantly. In this case, you must access data in tiles to improve the cache hit rate.
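
The idea of accessing data in tiles can be sketched with a matrix transpose in C. This is an illustration under stated assumptions rather than code from the application note: the function names and the tile edge `TILE` are hypothetical, and the best tile size depends on the cache line length and associativity of the target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge, in elements. */
#define TILE 8

/* Plain transpose of an n x n matrix. Writes to dst stride through
   memory n elements at a time, which for large n can map many
   accesses onto the same cache sets. */
void transpose_naive(const int *src, int *dst, size_t n)
{
    assert(src != dst);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Tiled transpose: the matrix is processed in TILE x TILE blocks so
   that the lines touched by one block can stay in cache until the
   block is finished. */
void transpose_tiled(const int *src, int *dst, size_t n)
{
    assert(src != dst);
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Clamp the block at the matrix edge. */
            size_t i_end = (ii + TILE < n) ? ii + TILE : n;
            size_t j_end = (jj + TILE < n) ? jj + TILE : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

Both functions produce the same output; only the order of memory accesses differs.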

The following sections introduce some techniques for improving memory efficiency:

• Loading and Storing Multiple Data in a Burst

• Using the Preload Engine to Improve the Cache Hit Rate

• Using Tiles to Prevent Cache Thrashing

Loading and Storing Multiple Data in a Burst

Load multiple and store multiple instructions (LDM and STM) allow successive words to be read from or written to memory. They are extremely useful for stack push/pop and for memory copying. Only word values can be transferred in this way, and the address must be word aligned. The operands are a base register and a list of registers between braces.
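
The access pattern that suits these instructions can be sketched in C as a word-aligned copy that moves four words per iteration. The function name `copy_words` is hypothetical, and whether a given compiler actually emits load and store multiple instructions for this loop is an assumption that depends on the toolchain and optimization flags; the sketch only shows the shape of the data movement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies n_words 32-bit words from src to dst. The main loop reads
   four successive words and then writes four successive words, which
   mirrors a load multiple followed by a store multiple. */
void copy_words(uint32_t *dst, const uint32_t *src, size_t n_words)
{
    /* Word transfers require word-aligned addresses. */
    assert(((uintptr_t)dst % sizeof(uint32_t)) == 0);
    assert(((uintptr_t)src % sizeof(uint32_t)) == 0);

    size_t i = 0;
    for (; i + 4 <= n_words; i += 4) {
        uint32_t w0 = src[i];
        uint32_t w1 = src[i + 1];
        uint32_t w2 = src[i + 2];
        uint32_t w3 = src[i + 3];
        dst[i] = w0;
        dst[i + 1] = w1;
        dst[i + 2] = w2;
        dst[i + 3] = w3;
    }
    /* Copy any remaining words one at a time. */
    for (; i < n_words; i++)
        dst[i] = src[i];
}
```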
