
Local data share (LDS) was introduced exactly for that reason: to allow efficient communication and data sharing between threads in the same compute unit. LDS is a low-latency RAM physically located on chip in each compute unit (CU). Still, most actual compute instructions operate on data in registers.

Now, let's look at the peak-performance numbers. The memory bandwidth of AMD's Radeon R9 Fury X is an amazing 512 GB/s. Its LDS implementation has a total bandwidth of (1,050 MHz) * (64 CUs) * (32 LDS banks) * (4 bytes per bank per cycle) = 8.6 TB/s. Just imagine reading all the content of a high-capacity 8 TB HDD in one second! Moreover, the LDS latency is an order of magnitude lower than that of global memory, helping feed all 4,096 insatiable ALUs.

At the same time, the register bandwidth is (1,050 MHz) * (64 CUs) * (64 lanes) * (12 bytes per lane per cycle) = 51.6 TB/s. That's another order of magnitude, so communication between threads through LDS is still much slower than just crunching data in the thread registers.
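
If you want to play with these numbers, here is a minimal sketch that redoes the arithmetic above. Everything is hard-coded from the Fury X figures quoted in the text (nothing is queried from real hardware), and the 12-bytes-per-lane figure is read as three 32-bit source operands per cycle.

    #include <cstdio>

    int main() {
        // Fury X figures as quoted above (assumptions, not device queries).
        const double clock_hz   = 1.05e9; // 1,050 MHz engine clock
        const double num_cus    = 64.0;   // compute units
        const double lds_banks  = 32.0;   // LDS banks per CU, 4 bytes wide each
        const double lanes      = 64.0;   // lanes per CU
        const double lane_bytes = 12.0;   // e.g. three 32-bit operands per lane per cycle

        const double global_bw = 512e9;                                   // HBM, 512 GB/s
        const double lds_bw    = clock_hz * num_cus * lds_banks * 4.0;    // ~8.6 TB/s
        const double reg_bw    = clock_hz * num_cus * lanes * lane_bytes; // ~51.6 TB/s

        std::printf("LDS peak bandwidth:      %.1f TB/s (%.0fx global memory)\n",
                    lds_bw / 1e12, lds_bw / global_bw);
        std::printf("Register peak bandwidth: %.1f TB/s (%.0fx LDS)\n",
                    reg_bw / 1e12, reg_bw / lds_bw);
        return 0;
    }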


LDS is only available at the workgroup level. But can we do better by sharing? The answer is yes, if we further reduce our scope from a workgroup to a single wavefront. Cross-lane operations are an efficient way to share data between wavefront lanes.

The actual GCN hardware implements 16-wide SIMD, so wavefronts decompose into groups of 16 lanes, called wavefront rows, that are executed on 4 consecutive cycles. This hardware organization affects cross-lane operations: some of them work at the wavefront level and some only at the row level.

DS-Permute Instructions

As a previous post briefly described, GCN3 includes two new instructions: ds_permute_b32 and ds_bpermute_b32.
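
Their exact behavior is the subject of the rest of this section; as a first taste, here is a rough sketch of how a backward permute can be reached from HIP-style C++ via Clang's __builtin_amdgcn_ds_bpermute builtin, which maps onto this instruction and takes the source lane as a byte address. The kernel and the lane-reversal pattern are purely illustrative.

    #include <hip/hip_runtime.h>

    // Reverse the values held by the 64 lanes of a wavefront.
    // ds_bpermute_b32 is a "backward" (gather) permute: each lane supplies the
    // address of the lane it wants to read from. The address is in bytes, so
    // the source lane index is shifted left by 2 (one 32-bit value per lane).
    __global__ void reverse_wavefront(const int* in, int* out) {
        const int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        const int lane = threadIdx.x & 63;        // position inside the wavefront
        const int addr = (63 - lane) << 2;        // byte address of the source lane
        out[tid] = __builtin_amdgcn_ds_bpermute(addr, in[tid]);
    }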
