Software Performance Optimization Methods
The disadvantage is obvious. First, it is difficult to maintain assembler code. Even though all
Cortex-A series processors support NEON instructions, the hardware implementations are
different, so instruction timing and movement in the pipeline are different. This means that
NEON optimization is processor-dependent. Code running faster on one Cortex-A series
processor might not work as well on another Cortex-A series processor. Second, it is difficult to
write assembler code. To be successful, you must know the details of the underlying hardware
features, such as pipelining, scheduling issues, memory access behavior, and scheduling
hazards. These factors are briefly described below.
Memory Access Optimizations
Typically, NEON is used to process large amounts of data. One crucial optimization is to ensure
that the algorithm uses cache in the most efficient way possible. It is also important to consider
the number of active memory locations. A typical optimization is one in which you design the
algorithm to process small memory regions called tiles, one by one, to maximize the cache and
translation lookaside buffer (TLB) hit rate, and to minimize accesses to external dynamic RAM (DRAM).
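A minimal sketch of this tiling idea in C (the buffer layout, tile size, and per-pixel operation are illustrative, not taken from the application note):

    #define TILE 64   /* tile edge chosen so one tile fits comfortably in the L1 cache */

    void process_tiled(unsigned char *img, int width, int height)
    {
        for (int ty = 0; ty < height; ty += TILE) {
            for (int tx = 0; tx < width; tx += TILE) {
                /* touch only the TILE x TILE region starting at (tx, ty) */
                for (int y = ty; y < ty + TILE && y < height; y++)
                    for (int x = tx; x < tx + TILE && x < width; x++)
                        img[y * width + x] = 255 - img[y * width + x];  /* example operation */
            }
        }
    }

Because each tile is revisited while it is still resident in cache, the number of external DRAM accesses and TLB misses drops compared with streaming over the whole image for every processing pass.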
NEON includes instructions that support interleaving and de-interleaving, and these can provide significant performance improvements in some scenarios if used properly. VLD1/VST1 load/store multiple registers to/from memory with no de-interleaving. The other VLDn/VSTn instructions interleave and de-interleave structures containing two, three, or four equally sized elements.
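As a hedged illustration using the C intrinsics that map to these instructions (the packed-RGB layout, the function name, and the requirement that n be a multiple of 16 are assumptions made for the example): vld3q_u8 compiles to VLD3 and de-interleaves packed three-element structures into separate registers, and vst3q_u8 compiles to VST3 and re-interleaves them on the store.

    #include <arm_neon.h>

    /* Swap the R and B channels of packed RGB data, 16 pixels per iteration. */
    void swap_red_blue(uint8_t *rgb, int n)
    {
        for (int i = 0; i < n; i += 16) {
            uint8x16x3_t px = vld3q_u8(rgb + 3 * i);   /* VLD3: de-interleave into R, G, B planes */
            uint8x16_t tmp = px.val[0];                /* swap the R and B registers */
            px.val[0] = px.val[2];
            px.val[2] = tmp;
            vst3q_u8(rgb + 3 * i, px);                 /* VST3: re-interleave and store */
        }
    }

Without VLD3/VST3, the same operation would require loading the packed bytes with VLD1 and shuffling them with several permute instructions before any useful work could begin.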
Alignment
Even though the NEON architecture provides full unaligned support for NEON data access, the instruction opcode contains an alignment hint that permits implementations to be faster when the address is aligned and the hint is specified.
The base address is specified as [<Rn>:<align>].
In practice, it is also useful to arrange the data so that it is cache-line aligned. Otherwise, when the data crosses a cache line boundary, additional cache line fills are incurred and overall system performance drops.
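A brief sketch of how this can be arranged from C (the buffer name and size are illustrative; the 32-byte alignment matches the Cortex-A9 cache line length, and with provably aligned addresses the compiler/assembler can emit the [<Rn>:<align>] hint):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Cache-line-aligned buffer: no NEON access straddles a cache line boundary. */
    static uint8_t buffer[1024] __attribute__((aligned(32)));

    void clear_buffer(void)
    {
        uint8x16_t zero = vdupq_n_u8(0);
        for (int i = 0; i < 1024; i += 16)
            vst1q_u8(buffer + i, zero);   /* every store starts on a 16-byte boundary */
    }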
Instruction Scheduling
To write faster code for NEON, you must be aware of how to schedule code for the specific ARM
processor. For the Zynq-7000 AP SoC, this would be the Cortex-A9.
Result-use scheduling is the main performance optimization when writing NEON code. NEON
instructions typically issue in one or two cycles, but the result is not always ready in the next
cycle (except when the simplest NEON instructions are issued, such as VADD and VMOV).
Some instructions have considerable latency, for example the VMLA multiply-accumulate instruction (five cycles for integer operands; seven cycles for floating-point). To prevent a stall, take into account the number of cycles between an instruction and the next instruction that uses its result. Despite their result latency of several cycles, these instructions are fully pipelined, so several operations can be in flight at once.
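A common way to exploit this pipelining from C is to keep several independent accumulators, so that consecutive multiply-accumulates never wait on each other's results. The sketch below is illustrative (the function name is an assumption, and n is assumed to be a multiple of 16):

    #include <arm_neon.h>

    float dot_product(const float *a, const float *b, int n)
    {
        /* Four independent accumulator chains hide the VMLA result latency. */
        float32x4_t acc0 = vdupq_n_f32(0.0f), acc1 = vdupq_n_f32(0.0f);
        float32x4_t acc2 = vdupq_n_f32(0.0f), acc3 = vdupq_n_f32(0.0f);

        for (int i = 0; i < n; i += 16) {
            acc0 = vmlaq_f32(acc0, vld1q_f32(a + i),      vld1q_f32(b + i));
            acc1 = vmlaq_f32(acc1, vld1q_f32(a + i + 4),  vld1q_f32(b + i + 4));
            acc2 = vmlaq_f32(acc2, vld1q_f32(a + i + 8),  vld1q_f32(b + i + 8));
            acc3 = vmlaq_f32(acc3, vld1q_f32(a + i + 12), vld1q_f32(b + i + 12));
        }

        /* Combine the partial sums and reduce to a scalar. */
        float32x4_t acc = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
        float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
        s = vpadd_f32(s, s);
        return vget_lane_f32(s, 0);
    }

With a single accumulator, each VMLA would have to wait for the previous one to complete; with four chains the multiply-accumulate pipeline stays busy.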
Another typical scheduling issue is interlock. Without adequate hardware knowledge, it is tempting to load data from memory into registers and then process it immediately. If the memory access hits in the cache, there is no problem. However, if it misses, the CPU must wait tens of cycles for the data to be fetched from external memory into the cache before proceeding. Thus, you usually need to place instructions that do not depend on the VLD instruction between the VLD and the instruction that uses its result. Using the Cortex-A9 preload engine can improve the cache hit rate. This is discussed later.
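A hedged sketch of both points from C (the function name and the prefetch distance are illustrative; on ARM, GCC's __builtin_prefetch is one way to emit a PLD well ahead of the data actually needed):

    #include <arm_neon.h>

    void scale(const float *src, float *dst, int n, float k)
    {
        float32x4_t vk = vdupq_n_f32(k);
        for (int i = 0; i < n; i += 4) {            /* n assumed to be a multiple of 4 */
            __builtin_prefetch(src + i + 64);       /* warm the cache ahead of use */
            float32x4_t v = vld1q_f32(src + i);     /* VLD: load the current vector */
            /* independent work could be scheduled here, between the load and its use */
            vst1q_f32(dst + i, vmulq_f32(v, vk));   /* first use of the loaded value */
        }
    }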
Also be aware that external memory is slow and has a long latency compared to on-chip memory. The CPU uses caches and write buffers to alleviate this. However, during long bursts of memory writes, the write buffer can fill up, and the next VST instruction stalls.