Boost NEON Performance by Improving Memory Access Efficiency
XAPP1206 v1.1 June 12, 2014 www.xilinx.com 21
Generally, load-multiple and store-multiple instructions can yield better performance than the
equivalent sequence of single load and store instructions, especially when the cache is not
enabled or a memory region is marked as non-cacheable in the translation table. To understand
why, refer to the AMBA® specification. Each memory access incurs overhead on the AXI bus. To
improve bus efficiency, use AXI burst transfers; that is, group N consecutive accesses together
so that the overhead is paid only once. If you access N words as single-beat transfers, the
overhead is paid N times. This not only degrades internal bus throughput, but also increases
latency.
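The overhead argument above can be illustrated with a rough cost model. This is a minimal sketch only: the structure bus_model_t, both function names, and the cycle counts are hypothetical and are not taken from the AMBA specification or from any measured device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cost model: every bus transaction pays a fixed overhead,
   then a fixed cost per data beat. The values are illustrative assumptions. */
typedef struct {
    uint32_t overhead_cycles;  /* assumed fixed cost per transaction */
    uint32_t cycles_per_beat;  /* assumed cost of each data beat */
} bus_model_t;

/* N words transferred as N independent single-beat transactions. */
static uint32_t single_beat_cycles(const bus_model_t *m, uint32_t n_words)
{
    assert(m != NULL);
    return n_words * (m->overhead_cycles + m->cycles_per_beat);
}

/* N words transferred as one burst of N beats: the overhead is paid once. */
static uint32_t burst_cycles(const bus_model_t *m, uint32_t n_words)
{
    assert(m != NULL);
    if (n_words == 0) {
        return 0;
    }
    return m->overhead_cycles + n_words * m->cycles_per_beat;
}
```

With an assumed overhead of 10 cycles and 1 cycle per beat, eight words cost 88 cycles as single beats and 18 cycles as one burst in this model. Real figures depend on the interconnect and the memory controller.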
Normally, the compiler only uses load-multiple and store-multiple instructions for stack
operations. When the routine is memory-access intensive, such as a memory copy, you might need
to code LDM/STM by hand.
For example:
LDMIA R10!, { R0-R3, R12 }
This instruction loads five registers (R0 to R3 and R12) from consecutive words starting at the
address held in R10. Because of the write-back specifier (!), it increases R10 by 20
(5 × 4 bytes) at the end.
The register list is comma separated, with hyphens indicating ranges. The order specified in
this list is not important. The processor always transfers the registers in a fixed order, with
the lowest-numbered register mapped to the lowest address.
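The register-to-address mapping and the write-back behavior described above can be modeled in portable C. This is a sketch of the semantics, not the instruction itself: the function name ldmia_writeback, the 16-bit register mask, and the word-array memory that starts at byte address 0 are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "LDMIA Rn!, {reglist}".
   regs:     the 16 core registers R0..R15
   base_reg: index of Rn
   reg_mask: bit i set means Ri is in the register list
   mem:      word memory whose element 0 sits at byte address 0 (assumption) */
static void ldmia_writeback(uint32_t regs[16], unsigned base_reg,
                            uint16_t reg_mask, const uint32_t *mem)
{
    uint32_t addr = regs[base_reg];

    assert(base_reg < 16);
    assert((addr & 3u) == 0);                  /* word-aligned base assumed */
    assert((reg_mask & (1u << base_reg)) == 0); /* base not in list with '!' */

    /* Walk registers from lowest to highest number, whatever order the
       list was written in: lowest register gets the lowest address. */
    for (unsigned r = 0; r < 16; ++r) {
        if (reg_mask & (1u << r)) {
            regs[r] = mem[addr / 4];
            addr += 4;
        }
    }
    regs[base_reg] = addr;                     /* write-back: Rn += 4 * count */
}
```

For the list { R0-R3, R12 } the mask is 0x100F, five words are read, and the base register ends up 20 bytes higher.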
The instruction must also specify how to proceed from the base register, using one of four
modes: IA (increment after), IB (increment before), DA (decrement after), and DB (decrement
before). These specifiers also have four aliases (FD, FA, ED and EA) that work from a stack
perspective. They specify whether the stack pointer points to a full or empty top of the stack,
and whether the stack ascends or descends in memory.
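The four addressing modes and their stack aliases can be summarized as address arithmetic. The following is a minimal sketch for 32-bit core registers; the type and function names are hypothetical, and the code models only which addresses are touched, not the transfers themselves.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { MODE_IA, MODE_IB, MODE_DA, MODE_DB } addr_mode_t;
typedef enum { STACK_FD, STACK_FA, STACK_ED, STACK_EA } stack_alias_t;

/* Lowest address accessed when n registers are transferred from base. */
static uint32_t first_address(addr_mode_t mode, uint32_t base, uint32_t n)
{
    assert(n >= 1 && n <= 16);
    switch (mode) {
    case MODE_IA: return base;               /* increment after  */
    case MODE_IB: return base + 4;           /* increment before */
    case MODE_DA: return base - 4 * n + 4;   /* decrement after  */
    case MODE_DB: return base - 4 * n;       /* decrement before */
    }
    return base;
}

/* Value written back to the base register when '!' is specified. */
static uint32_t writeback_value(addr_mode_t mode, uint32_t base, uint32_t n)
{
    assert(n >= 1 && n <= 16);
    return (mode == MODE_IA || mode == MODE_IB) ? base + 4 * n : base - 4 * n;
}

/* Stack alias to addressing mode; the mapping differs for loads and stores. */
static addr_mode_t alias_to_mode(stack_alias_t alias, int is_load)
{
    switch (alias) {
    case STACK_FD: return is_load ? MODE_IA : MODE_DB;  /* full descending  */
    case STACK_FA: return is_load ? MODE_DA : MODE_IB;  /* full ascending   */
    case STACK_ED: return is_load ? MODE_IB : MODE_DA;  /* empty descending */
    case STACK_EA: return is_load ? MODE_DB : MODE_IA;  /* empty ascending  */
    }
    return MODE_IA;
}
```

In every mode the lowest-numbered register still maps to the lowest address; the mode only changes where the block of n words lies relative to the base register.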
NEON supports load/store multiple in a similar way. The syntax is:
VLDMmode{cond} Rn{!}, Registers
VSTMmode{cond} Rn{!}, Registers
The mode must be one of the following:
• IA - Increment address after each transfer. This is the default, and can be omitted.
• DB - Decrement address before each transfer.
• EA - Empty ascending stack operation. This is the same as DB for loads and IA for stores.
• FD - Full descending stack operation. This is the same as IA for loads and DB for stores.
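The FD pairing in the list above (DB for stores, IA for loads) behaves like a push and pop of NEON D registers on a full descending stack. The sketch below models that in portable C with 64-bit values standing in for D registers. The function names are hypothetical, and the pointer return value stands in for the write-back to Rn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of "VSTMDB Rn!, {Dk ...}": decrement first, then store n D registers.
   Returns the updated base pointer (the write-back value). */
static uint64_t *vstmdb_writeback(uint64_t *base, const uint64_t *d, size_t n)
{
    assert(base != NULL && d != NULL);
    base -= n;                         /* decrement before the transfers */
    for (size_t i = 0; i < n; ++i) {
        base[i] = d[i];                /* lowest register -> lowest address */
    }
    return base;
}

/* Model of "VLDMIA Rn!, {Dk ...}": load n D registers, then increment. */
static uint64_t *vldmia_writeback(uint64_t *base, uint64_t *d, size_t n)
{
    assert(base != NULL && d != NULL);
    for (size_t i = 0; i < n; ++i) {
        d[i] = base[i];
    }
    return base + n;                   /* increment after the transfers */
}
```

Storing with DB and then loading the same number of registers with IA restores both the register contents and the original base address.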
Note that NEON has some special instructions for interleaving and de-interleaving:
• VLDn (Vector load multiple n-element structures) loads multiple n-element structures from
memory into one or more NEON registers, with de-interleaving (unless n == 1). Every
element of each register is loaded.
• VSTn (Vector store multiple n-element structures) writes multiple n-element structures to
memory from one or more NEON registers, with interleaving (unless n == 1). Every
element of each register is stored.
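The de-interleaving and interleaving performed by VLDn and VSTn can be modeled for n == 2 with 8-bit elements. This is a portable C sketch of the data movement only; the function names and the LANES constant are hypothetical, and the arrays d0 and d1 stand in for two 64-bit NEON registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* a 64-bit D register holds eight 8-bit elements */

/* Model of a two-register VLD2 with 8-bit elements: memory holds the
   2-element structures {a0,b0},{a1,b1},... and the load separates them
   so that d0 receives all a values and d1 receives all b values. */
static void vld2_u8_model(const uint8_t *src,
                          uint8_t d0[LANES], uint8_t d1[LANES])
{
    assert(src != NULL);
    for (size_t i = 0; i < LANES; ++i) {
        d0[i] = src[2 * i];        /* first member of structure i  */
        d1[i] = src[2 * i + 1];    /* second member of structure i */
    }
}

/* Model of the matching VST2: interleave d0 and d1 back into memory. */
static void vst2_u8_model(uint8_t *dst,
                          const uint8_t d0[LANES], const uint8_t d1[LANES])
{
    assert(dst != NULL);
    for (size_t i = 0; i < LANES; ++i) {
        dst[2 * i]     = d0[i];
        dst[2 * i + 1] = d1[i];
    }
}
```

A typical use is separating two interleaved channels, processing each register independently, and then writing the interleaved layout back with the matching store.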