Software Performance Optimization Methods
The disadvantage is obvious. First, it is difficult to maintain assembler code. Even though all
Cortex-A series processors support NEON instructions, the hardware implementations are
different, so instruction timing and movement in the pipeline are different. This means that
NEON optimization is processor-dependent. Code running faster on one Cortex-A series
processor might not work as well on another Cortex-A series processor. Second, it is difficult to
write assembler code. To be successful, you must know the details of the underlying hardware
features, such as pipelining, scheduling issues, memory access behavior, and scheduling
hazards. These factors are briefly described below.
Memory Access Optimizations
Typically, NEON is used to process large amounts of data. One crucial optimization is to ensure
that the algorithm uses cache in the most efficient way possible. It is also important to consider
the number of active memory locations. A typical optimization is one in which you design the
algorithm to process small memory regions called tiles, one by one, to maximize the cache and
translation lookaside buffer (TLB) hit rate, and to minimize accesses to external dynamic RAM (DRAM).
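A minimal sketch of this tiling idea in C (the buffer layout, tile size, and per-pixel operation are illustrative, not taken from the application note):

    #define TILE 64   /* tile edge chosen so one tile fits comfortably in the L1 cache */

    void process_tiled(unsigned char *img, int width, int height)
    {
        for (int ty = 0; ty < height; ty += TILE) {
            for (int tx = 0; tx < width; tx += TILE) {
                /* touch only the TILE x TILE region starting at (tx, ty) */
                for (int y = ty; y < ty + TILE && y < height; y++)
                    for (int x = tx; x < tx + TILE && x < width; x++)
                        img[y * width + x] = 255 - img[y * width + x];  /* example operation */
            }
        }
    }

Because each tile is revisited while it is still resident in cache, the number of external DRAM accesses and TLB misses drops compared with streaming over the whole image for every processing pass.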
NEON includes instructions that support interleaving and de-interleaving, and these can provide significant performance improvements in some scenarios if used properly. VLD1/VST1 load/store multiple registers to/from memory with no de-interleaving. The other VLDn/VSTn instructions interleave and de-interleave structures containing two, three, or four equally sized elements.
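As a hedged illustration using the C intrinsics that map to these instructions (the packed-RGB layout, the function name, and the requirement that n be a multiple of 16 are assumptions made for the example): vld3q_u8 compiles to VLD3 and de-interleaves packed three-element structures into separate registers, and vst3q_u8 compiles to VST3 and re-interleaves them on the store.

    #include <arm_neon.h>

    /* Swap the R and B channels of packed RGB data, 16 pixels per iteration. */
    void swap_red_blue(uint8_t *rgb, int n)
    {
        for (int i = 0; i < n; i += 16) {
            uint8x16x3_t px = vld3q_u8(rgb + 3 * i);   /* VLD3: de-interleave into R, G, B planes */
            uint8x16_t tmp = px.val[0];                /* swap the R and B registers */
            px.val[0] = px.val[2];
            px.val[2] = tmp;
            vst3q_u8(rgb + 3 * i, px);                 /* VST3: re-interleave and store */
        }
    }

Without VLD3/VST3, the same operation would require loading the packed bytes with VLD1 and shuffling them with several permute instructions before any useful work could begin.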
Alignment
Even though the NEON architecture provides full unaligned support for NEON data access, the instruction opcode contains an alignment hint that permits implementations to be faster when the address is aligned and the hint is specified.
The base address is specified as [<Rn>:<align>].
In practice, it is also useful to arrange the data so that it is cache-line aligned. Otherwise, when the data crosses a cache line boundary, additional cache line fills are incurred and overall system performance drops.
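A brief sketch of how this can be arranged from C (the buffer name and size are illustrative; the 32-byte alignment matches the Cortex-A9 cache line length, and with provably aligned addresses the compiler/assembler can emit the [<Rn>:<align>] hint):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Cache-line-aligned buffer: no NEON access straddles a cache line boundary. */
    static uint8_t buffer[1024] __attribute__((aligned(32)));

    void clear_buffer(void)
    {
        uint8x16_t zero = vdupq_n_u8(0);
        for (int i = 0; i < 1024; i += 16)
            vst1q_u8(buffer + i, zero);   /* every store starts on a 16-byte boundary */
    }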
Instruction Scheduling
To write faster code for NEON, you must be aware of how to schedule code for the specific ARM
processor. For the Zynq-7000 AP SoC, this would be the Cortex-A9.
Result-use scheduling is the main performance optimization when writing NEON code. NEON
instructions typically issue in one or two cycles, but the result is not always ready in the next
cycle (except when the simplest NEON instructions are issued, such as VADD and VMOV).
Some instructions have considerable latency, for example the VMLA multiply-accumulate instruction (five cycles for integer operands; seven cycles for floating-point). To prevent a stall, take into account the number of cycles between an instruction and the next instruction that uses its result. Despite their result latency of several cycles, these instructions are fully pipelined, so several operations can be in flight at once.
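A common way to exploit this pipelining from C is to keep several independent accumulators, so that consecutive multiply-accumulates never wait on each other's results. The sketch below is illustrative (the function name is an assumption, and n is assumed to be a multiple of 16):

    #include <arm_neon.h>

    float dot_product(const float *a, const float *b, int n)
    {
        /* Four independent accumulator chains hide the VMLA result latency. */
        float32x4_t acc0 = vdupq_n_f32(0.0f), acc1 = vdupq_n_f32(0.0f);
        float32x4_t acc2 = vdupq_n_f32(0.0f), acc3 = vdupq_n_f32(0.0f);

        for (int i = 0; i < n; i += 16) {
            acc0 = vmlaq_f32(acc0, vld1q_f32(a + i),      vld1q_f32(b + i));
            acc1 = vmlaq_f32(acc1, vld1q_f32(a + i + 4),  vld1q_f32(b + i + 4));
            acc2 = vmlaq_f32(acc2, vld1q_f32(a + i + 8),  vld1q_f32(b + i + 8));
            acc3 = vmlaq_f32(acc3, vld1q_f32(a + i + 12), vld1q_f32(b + i + 12));
        }

        /* Combine the partial sums and reduce to a scalar. */
        float32x4_t acc = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
        float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
        s = vpadd_f32(s, s);
        return vget_lane_f32(s, 0);
    }

With a single accumulator, each VMLA would have to wait for the previous one to complete; with four chains the multiply-accumulate pipeline stays busy.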
Another typical scheduling issue is interlock. Without adequate hardware knowledge, it is tempting to load data from memory into registers and then process it immediately. If the memory access hits in the cache, there is no problem. However, if it misses, the CPU must wait tens of cycles for the data to be fetched from external memory into the cache before proceeding. Thus, you usually need to place instructions that do not depend on the VLD instruction between the VLD and the instruction that uses its result. Using the Cortex-A9 preload engine can improve the cache hit rate. This is discussed later.
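A hedged sketch of both points from C (the function name and the prefetch distance are illustrative; on ARM, GCC's __builtin_prefetch is one way to emit a PLD well ahead of the data actually needed):

    #include <arm_neon.h>

    void scale(const float *src, float *dst, int n, float k)
    {
        float32x4_t vk = vdupq_n_f32(k);
        for (int i = 0; i < n; i += 4) {            /* n assumed to be a multiple of 4 */
            __builtin_prefetch(src + i + 64);       /* warm the cache ahead of use */
            float32x4_t v = vld1q_f32(src + i);     /* VLD: load the current vector */
            /* independent work could be scheduled here, between the load and its use */
            vst1q_f32(dst + i, vmulq_f32(v, vk));   /* first use of the loaded value */
        }
    }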
Also be aware that external memory is slow and has a long latency compared to on-chip memory. The CPU uses caches and write buffers to alleviate this. However, during long bursts of memory writes, the write buffer can fill up, and the next VST instruction stalls.