Q-NOTE QN-7000HX Technical Information Page 20

Boost NEON Performance by Improving Memory Access Efficiency
XAPP1206 v1.1 June 12, 2014 www.xilinx.com 20
Therefore, when writing assembly instructions, it is best to distribute memory access
instructions with data processing instructions.
Boost NEON Performance by Improving Memory Access Efficiency
When talking about processor core performance, developers typically make some assumptions
that are not always accurate in real-world applications, specifically:
  • The processor pipeline is optimal and there is no interlock.
  • The memory subsystem is ideal (zero wait state), that is, the processor does not need to
    wait for memory to return data or instructions.
On-chip static RAM is very fast, but too costly, and you cannot integrate large RAM sizes into
the SoC. To lower the bill of materials (BOM) cost, dynamic RAM is often used as the main
memory. In the past decade, processor clock frequencies have evolved at a much faster pace
than dynamic RAM. The slower dynamic RAM can make the high clock frequency of a
processor less meaningful. Another issue with dynamic RAM is that it normally has a long
latency to return data. For a CPU, this is a serious issue because a CPU is more sensitive to
latency than throughput.
When the memory subsystem cannot return the data or instructions needed by processor
cores immediately, processors have nothing to do but wait. The out-of-order execution of an
ARM Cortex-A9 core can alleviate this issue, but it still exists.
To reduce the gap between processor and memory subsystems, engineers introduced cache
into modern SoCs. Cache is made up of fast on-chip static RAM and a cache controller to
determine when data can be moved in or out. But cache is not a panacea for embedded
systems. To use it efficiently, you need to know some details about the underlying hardware
system.
Cache improves system performance by exploiting temporal and spatial locality. Temporal locality means
that a resource accessed now will be accessed again in the near future. Spatial locality means
that the likelihood of accessing a resource is higher if a resource near it was just referenced.
When the data needed by the CPU can be found in cache, it is called a cache hit. Cache can
return the data with low latency. When the data is not in cache, it is called a cache miss, and the
cache controller must fetch the data into cache first and then return it to the CPU. This results
in much longer latency, and thus lower actual CPU performance. Preloading data into cache
before actually using it can improve the cache hit rate, thus improving system performance. The
ARM Cortex-A9 implements a preload engine and provides instructions to do this.
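As a rough illustration of preloading, the sketch below requests data a few cache lines ahead of the element being summed. It relies on the GCC/Clang builtin __builtin_prefetch, which those compilers typically lower to a PLD instruction on ARM targets; it does not program the preload engine itself. The function name sum_with_preload, the 32-byte line size, and the prefetch distance are assumptions chosen for illustration, not values taken from this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration: 32-byte cache lines, preload 4 lines ahead. */
enum { WORDS_PER_LINE = 32 / sizeof(uint32_t), PRELOAD_AHEAD = 4 * WORDS_PER_LINE };

/* Hypothetical helper: sums an array while hinting upcoming lines into cache. */
uint32_t sum_with_preload(const uint32_t *data, size_t count)
{
    uint32_t sum = 0;

    assert(data != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        /* Issue one hint per cache line, and only for addresses inside the array. */
        if (i % WORDS_PER_LINE == 0 && i + PRELOAD_AHEAD < count)
            __builtin_prefetch(&data[i + PRELOAD_AHEAD], 0 /* read */, 3 /* keep in cache */);
        sum += data[i];
    }
    return sum;
}
```

The hint only affects timing, never the result, so the function returns the same sum with or without it. A suitable prefetch distance depends on memory latency and on the work done per element, so it would need to be tuned on the target.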
In a real SoC, cache is implemented as a trade-off between performance and complexity. A
direct-mapped cache is simple but inefficient. A fully associative cache is the most efficient, at
the cost of very complicated hardware design. In practice, an N-way set associative cache is
frequently used. This design has one potential issue: if the code accesses many addresses that
map to the same set, cache thrashing can occur, which can lower system performance
significantly. In this case, you must access data as tiles to improve the cache hit rate.
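The thrashing effect can be sketched with a small software model of a set associative cache. The geometry (32 KB, 4 ways, 32-byte lines), the LRU replacement policy, and all names below are assumptions made for illustration; a real Cortex-A9 cache may use a different replacement policy. The model walks a 64-row array whose row stride is 8 KB, so every row maps to the same set, and counts the misses for a given tile height.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed geometry: 32 KB, 4-way set associative, 32-byte lines. */
enum { LINE_BYTES = 32, NUM_WAYS = 4, NUM_SETS = 256 };
/* An 8 KB row stride makes every row start in the same cache set. */
enum { ROW_STRIDE = LINE_BYTES * NUM_SETS, NUM_ROWS = 64 };

typedef struct {
    uint32_t tag[NUM_SETS][NUM_WAYS];
    uint32_t last_use[NUM_SETS][NUM_WAYS];
    uint8_t  valid[NUM_SETS][NUM_WAYS];
    uint32_t now;
    uint32_t misses;
} cache_model;

/* Simulate one byte access with LRU replacement (an assumed policy). */
static void cache_access(cache_model *c, uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t set = line % NUM_SETS;
    uint32_t tag = line / NUM_SETS;
    int victim = 0;

    c->now++;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            c->last_use[set][w] = c->now;   /* hit */
            return;
        }
    }
    c->misses++;
    for (int w = 0; w < NUM_WAYS; w++) {    /* prefer an empty way, else the oldest */
        if (!c->valid[set][w]) { victim = w; break; }
        if (c->last_use[set][w] < c->last_use[set][victim]) victim = w;
    }
    c->valid[set][victim] = 1;
    c->tag[set][victim] = tag;
    c->last_use[set][victim] = c->now;
}

/* Read the first 32 bytes of every row, column by column, in tiles of tile_rows rows. */
uint32_t count_misses(int tile_rows)
{
    static cache_model c;

    assert(tile_rows > 0 && NUM_ROWS % tile_rows == 0);
    memset(&c, 0, sizeof c);
    for (int base = 0; base < NUM_ROWS; base += tile_rows)
        for (int col = 0; col < LINE_BYTES; col++)
            for (int row = base; row < base + tile_rows; row++)
                cache_access(&c, (uint32_t)(row * ROW_STRIDE + col));
    return c.misses;
}
```

In this model, count_misses(64) walks all rows before moving to the next column, and every one of the 2048 accesses misses because 64 lines compete for 4 ways. count_misses(4) keeps each tile within the 4 available ways and misses only 64 times, once per line.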
The following sections introduce some techniques for improving memory efficiency:
  • Loading and Storing Multiple Data in a Burst
  • Using the Preload Engine to Improve the Cache Hit Rate
  • Using Tiles to Prevent Cache Thrashing
Loading and Storing Multiple Data in a Burst
Load and store multiple instructions allow successive words to be read from or written to
memory. They are extremely useful for stack push/pop operations and for memory copying. Only
word values can be transferred this way, and the address must be word aligned. The operands
are a base register (with an optional ! denoting write-back of the base register) followed by a
list of registers between braces.
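A minimal C sketch of a word copy written so that each iteration moves four consecutive words is shown below. The function name copy_words_burst and the group size of four are assumptions for illustration. Whether a compiler actually emits load/store multiple instructions for this loop depends on the compiler and its options; the comment shows the kind of ARM instruction pair (LDMIA/STMIA with base write-back) that the loop body corresponds to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies 'words' 32-bit words from src to dst.
 * Both pointers are expected to be word aligned and non-overlapping. */
void copy_words_burst(uint32_t *dst, const uint32_t *src, size_t words)
{
    size_t i = 0;

    assert((dst != NULL && src != NULL) || words == 0);

    /* Four loads followed by four stores, roughly equivalent to:
     *     LDMIA r1!, {r4-r7}   ; load 4 words, write back the source pointer
     *     STMIA r0!, {r4-r7}   ; store 4 words, write back the destination pointer
     */
    for (; i + 4 <= words; i += 4) {
        uint32_t w0 = src[i];
        uint32_t w1 = src[i + 1];
        uint32_t w2 = src[i + 2];
        uint32_t w3 = src[i + 3];
        dst[i] = w0;
        dst[i + 1] = w1;
        dst[i + 2] = w2;
        dst[i + 3] = w3;
    }

    /* Copy any remaining words one at a time. */
    for (; i < words; i++)
        dst[i] = src[i];
}
```

Grouping the loads ahead of the stores gives the compiler and the memory system the chance to treat each group as one burst rather than as eight separate single-word accesses.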