
Local data share (LDS) was introduced exactly for that reason: to allow efficient communication and data sharing between threads in the same compute unit. LDS is a low-latency RAM physically located on chip in each compute unit (CU). Still, most actual compute instructions operate on data in registers.

Now, let's look at the peak-performance numbers. The memory bandwidth of AMD's Radeon R9 Fury X is an amazing 512 GB/s. Its LDS implementation has a total bandwidth of (1,050 MHz) * (64 CUs) * (32 LDS banks) * (4 bytes per bank per cycle) = 8.6 TB/s. Just imagine reading all the content of a high-capacity 8 TB HDD in one second! Moreover, the LDS latency is an order of magnitude lower than that of global memory, helping feed all 4,096 insatiable ALUs.

At the same time, the register bandwidth is (1,050 MHz) * (64 CUs) * (64 lanes) * (12 bytes per lane per cycle) = 51.6 TB/s. That's another order of magnitude, so communication between threads through LDS is still much slower than just crunching data in the thread registers.
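
If you want to play with these numbers, here is a minimal sketch that redoes the arithmetic above. Everything is hard-coded from the Fury X figures quoted in the text (nothing is queried from real hardware), and the 12-bytes-per-lane figure is read as three 32-bit source operands per cycle.

    #include <cstdio>

    int main() {
        // Fury X figures as quoted above (assumptions, not device queries).
        const double clock_hz   = 1.05e9; // 1,050 MHz engine clock
        const double num_cus    = 64.0;   // compute units
        const double lds_banks  = 32.0;   // LDS banks per CU, 4 bytes wide each
        const double lanes      = 64.0;   // lanes per CU
        const double lane_bytes = 12.0;   // e.g. three 32-bit operands per lane per cycle

        const double global_bw = 512e9;                                   // HBM, 512 GB/s
        const double lds_bw    = clock_hz * num_cus * lds_banks * 4.0;    // ~8.6 TB/s
        const double reg_bw    = clock_hz * num_cus * lanes * lane_bytes; // ~51.6 TB/s

        std::printf("LDS peak bandwidth:      %.1f TB/s (%.0fx global memory)\n",
                    lds_bw / 1e12, lds_bw / global_bw);
        std::printf("Register peak bandwidth: %.1f TB/s (%.0fx LDS)\n",
                    reg_bw / 1e12, reg_bw / lds_bw);
        return 0;
    }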


LDS is only available at the workgroup level. But can we do better by sharing? The answer is yes, if we further reduce our scope from a workgroup to a single wavefront. Cross-lane operations are an efficient way to share data between wavefront lanes.

The actual GCN hardware implements 16-wide SIMD, so wavefronts decompose into groups of 16 lanes, called wavefront rows, that are executed on 4 consecutive cycles. This hardware organization affects cross-lane operations: some of them work at the wavefront level and some only at the row level.

DS-Permute Instructions

As a previous post briefly described, GCN3 includes two new instructions: ds_permute_b32 and ds_bpermute_b32.
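
Their exact behavior is the subject of the rest of this section; as a first taste, here is a rough sketch of how a backward permute can be reached from HIP-style C++ via Clang's __builtin_amdgcn_ds_bpermute builtin, which maps onto this instruction and takes the source lane as a byte address. The kernel and the lane-reversal pattern are purely illustrative.

    #include <hip/hip_runtime.h>

    // Reverse the values held by the 64 lanes of a wavefront.
    // ds_bpermute_b32 is a "backward" (gather) permute: each lane supplies the
    // address of the lane it wants to read from. The address is in bytes, so
    // the source lane index is shifted left by 2 (one 32-bit value per lane).
    __global__ void reverse_wavefront(const int* in, int* out) {
        const int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        const int lane = threadIdx.x & 63;        // position inside the wavefront
        const int addr = (63 - lane) << 2;        // byte address of the source lane
        out[tid] = __builtin_amdgcn_ds_bpermute(addr, in[tid]);
    }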
