# GPU Reduce, Scan, and Sort 

## CS 223 Guest Lecture

John Owens
Electrical and Computer Engineering, UC Davis

## Credits

- Thanks to Andrew Davidson (reduce), Michael Garland (Satish's sort), and especially Duane Merrill (Merrill's scan and sort) for slides and discussions.


## Programming Model (SPMD + SIMD): Thread Batching

- A kernel is executed as a grid of thread blocks
- A thread block is a batch of threads that can cooperate with each other by:
- Efficiently sharing data through shared memory
- Synchronizing their execution
- For hazard-free shared memory accesses
- Two threads from two different blocks cannot cooperate
- Blocks are independent



## Basic Efficiency Rules

- Develop algorithms with a data parallel mindset
- Minimize divergence of execution within blocks
- Maximize locality of global memory accesses
- Exploit per-block shared memory as scratchpad (registers > shared memory > global memory)
- Expose enough parallelism


## Expanding Manycore Territory

## Algorithm Examples



- Sort, computational geometry, finance
- Modest control flow
- Sparse/Irregular data structures
- Irregular communication between elements
- CPU Territory
- General purpose features vital for software efficiency
- Latency sensitive applications


## Today’s Big Picture

Rest of class

$\bullet$ Complexity $=\underset{\mu}{k(O(f(n)))}$
Today

Reduction

## Tree-Based Parallel Reductions

Commonly done in traditional GPGPU

Ping-pong between render targets, reduce by $1 / 2$ at a time
Completely bandwidth bound using graphics API
Memory writes and reads are off-chip, no reuse of intermediate sumsCUDA solves this by exposing on-chip shared memory
Reduce blocks of data in shared memory to save bandwidth

## CUDA Bank Conflicts

Threads:

Left: Linear addressing with a stride of one 32-bit word (no bank conflict).
Middle: Linear addressing with a stride of two 32-bit words (2-way bank conflicts). Right: Linear addressing with a stride of three 32-bit words (no bank conflict)

Theads:
Threads:

Left: Conflict-free access via random permutation.
Middle: Conflict-free access since threads 3, 4, 6, 7, and 9 access the same word within bank 5.
Riaht: Conflict-free broadcast access (all threads access the same word).

## Parallel Reduction: Interleaved Addressing

- Arbitrarily bad bank conflicts
- Requires barriers if $\mathrm{N}>$ warpsize
- Supports non-commutative operators



## Parallel Reduction: Sequential Addressing



## Reduction

- Only two-way bank conflicts
- Requires barriers if $\mathrm{N}>$ warpsize
- Requires $O(2 N-2)$ storage
- Supports non-commutative operators



## Reduction memory traffic

- Ideal: $n$ reads, 1 write.
- Block size 256 threads. Thus:
- Read $n$ items, write back $n / 256$ items.
- Read $n / 256$ items, write back 1 item.
- Total: $n+n / 128+1$. Not bad!


## Reduction optimization

- Ideal: $n$ reads, 1 write.
- Block size 256 threads. Thus:
- Read $n$ items, write back $n / 256$ items.
- Read $n / 256$ items, write back 1 item.
- Total: $n+n / 128+1$. Not bad!
- What if we had more than one item (say, 4) per thread?
- "Loop raking" is an optimization for all the algorithms I talk about today.
- Tradeoff: Storage for efficiency


## Persistent Threads

- GPU programming model suggests one thread per item
- What if you filled the machine with just enough threads to keep all processors busy, then asked each thread to stay alive until the input was complete?
- Minus: More overhead per thread (register pressure)
- Minus: Violent anger of vendors


## (Serial) Raking Reduction Phase

- No bank conflicts, only one barrier to after insertion into smem
- Supports non-commutative operators
- Requires subsequent warpscan to reduce accumulated partials

- Less memory bandwidth overall
- Exploits locality between items within a thread's registers


VS.


## Reduction

N Threads Perforning A Reduction


- Many-To-One
- Parameter to Tune => Thread Width (total number of threads)


## Parameter Selection Comparison GTX280




Parameter selection comparison between the static SDK and our tuned (thread cap) algorithm

We see some of the problems with having static thread parameters, for different machines.


GTX280 Tuned Performance Analysis



- Auto-tuned performance always exceeded SDK performance
- Up to a $70 \%$ performance gain for certain cards and workloads


## Reduction papers

- Mark Harris, Mapping Computational Concepts to GPUs, GPU Gems 2, Chapter 31, pp. 495-508, March 2005.
- Andrew Davidson and John Owens. Toward Techniques for Auto-tuning GPU Algorithms. In Kristján Jónasson, editor, Applied Parallel and Scientific Computing, volume 7134 of Lecture Notes in Computer Science, pages 110-119. Springer Berlin / Heidelberg, February 2012.
- NVIDIA SDK (reduction example)


## Scan (within a block)

## Parallel Prefix Sum (Scan)

- Given an array $A=\left[a_{0}, a_{1}, \ldots, a_{n-1}\right]$ and a binary associative operator $\oplus$ with identity I,
- $\operatorname{scan}(A)=\left[l, a_{0},\left(a_{0} \oplus a_{1}\right), \ldots,\left(a_{0} \oplus a_{1} \oplus \ldots \oplus a_{n-2}\right)\right]$
- Example: if $\oplus$ is addition, then scan on the set
- [31704163]
- returns the set
- [0 3411111516 22]


## Kogge-Stone Scan

Circuit family



## $O(n \log n)$ Scan



- Step efficient $(\log n$ steps)
- Not work efficient ( $n \log n$ work)
- Requires barriers at each step (WAR dependencies)


## Alt. Hillis-Steele Scan Implementation

No WAR conflicts, $O(2 N)$ storage


## Alt. Hillis-Steele Scan

Warp-synchronous: SIMD without divergence or barriers


- What if we truly had a SIMD machine?
- Recall CUDA warps (32 threads) are strictly SIMD
- "Warp-synchronous"


## Brent Kung Scan

Circuit family


## $O(n)$ Scan [Blelloch]



- Not step efficient $\left(2 \log n\right.$ steps) ${ }_{d=1}$
- Work efficient ( $O(n)$ work)
- Bank conflicts, and lots of 'em



## Hybrid methods

$$
\bullet \bullet \bullet \bullet \bullet \bullet \bullet \bullet
$$

$$
\bullet \bullet \bullet \bullet \bullet
$$

$$
\bullet \bullet \bullet \bullet
$$

$$
\bullet \bullet \bullet \bullet
$$

$$
\bullet \bullet \bullet \bullet \bullet
$$

$$
\bullet \bullet \bullet \bullet \bullet \bullet \bullet \bullet
$$

$$
\begin{aligned}
& \text { - • • • • • • • • • • • • • } \\
& \text { - • • • •• • • • • } \\
& \text { - •••••• } \\
& \text { - • • • ••• } \\
& \text { - - - - - - }
\end{aligned}
$$

## Scan papers

- Daniel Horn, Stream Reduction Operations for GPGPU Applications, GPU Gems 2, Chapter 36, pp. 573-589, March 2005.
- Shubhabrata Sengupta, Aaron E. Lefohn, and John D. Owens. A Work-Efficient Step-Efficient Prefix Sum Algorithm. In Proceedings of the 2006 Workshop on Edge Computing Using New Commodity Architectures, pages D-26-27, May 2006
- Mark Harris, Shubhabrata Sengupta, and John D. Owens.Parallel Prefix Sum (Scan) with CUDA. In Hubert Nguyen, editor, GPU Gems 3, chapter 39, pages 851-876. Addison Wesley, August 2007.
- Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. Scan Primitives for GPU Computing. In Graphics Hardware 2007, pages 97-106, August 2007.
- Y. Dotsenko, N. K. Govindaraju, P. Sloan, C. Boyd, and J. Manferdelli, "Fast scan algorithms on graphics processors," in ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing, 2008, pp. 205-213.
- Shubhabrata Sengupta, Mark Harris, Michael Garland, and John D. Owens. Efficient Parallel Scan Algorithms for many-core GPUs. In Jakub Kurzak, David A. Bader, and Jack Dongarra, editors, Scientific Computing with Multicore and Accelerators, Chapman \& Hall/CRC Computational Science, chapter 19, pages 413-442. Taylor \& Francis, January 2011.
- D. Merrill and A. Grimshaw, Parallel Scan for Stream Architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia, 2009, 54pp.
- Shengen Yan, Guoping Long, and Yunquan Zhang. 2013. StreamScan: fast scan algorithms for GPUs without global barrier synchronization. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '13). ACM, New York, NY, USA, 229-238.


## Scan (across blocks)

## Scan-then-propagate (4x)



## Scan-then-propagate (4x)

## $\log _{b}(N)$-level upsweep/downsweep

(scan-and-add strategy: CudPP)


## Reduce-then-scan (3x)

## $\log _{b}(N)$-level upsweep/downsweep <br> (reduce-then-scan strategy: Matrix-scan)



## Merrill's 2-level upsweep/downsweep

 (reduce-then-scan)

- Persistent threads
- Requires only 3 kernel launches vs. $\log n$
- Fewer global memory reads in intermediate step (constant vs. O(n))


## StreamScan ( $2 x$ )



- Serialize reduce chain
- Atomic counter assigns blocks in proper order (no deadlocks)
- Global array, one element per block: "ready" flag + reduce value

Radix Sort

## Radix Sort Fundamentals


scatter-not efficient!


Goals: (1) minimize number of scatters;

## (2) maximize coherence of scatters

## Radix Sort Memory Cost

| Step | Kernel | Purpose | Read Workload | Write Workload |
| :---: | :--- | :--- | :--- | :--- |
| $\mathbf{1}$ | binning | Create flags | $n$ keys | $n r$ flags |
| $\mathbf{2}$ | bottom-level reduce | Compact flags | $n r$ flags | (insignificant constant) |
|  | (insignificant constant) |  |  |  |
| $\mathbf{4}$ | top-level scan | bottom-level scan |  |  |
|  | (scan primitive) |  | $n r$ flags + (insignificant constant) | $n r$ offsets |

Total Memory Workload: $(k / d)(n)(r+4)$ keys only
$(k / d)(n)(r+6)$ with values

- $d$-bit radix digits
- radix $r=2^{\wedge} d$
- $n$-element input sequence of $k$-bit keys


## Parallel Radix Sort

- Assign tile of data to each block
(1024 elements)
Satish uses 256-thread blocks and 4 elements per thread
- Build per-block histograms of current digit
(4 bit)
this is a reduction
- Combine per-block histograms

$$
(P \times 16)
$$

this is a scan

- Scatter


## Per-Block Histograms

- Perform b parallel splits for b-bit digit
- Each split is just a prefix sum of bits
- each thread counts 1 bits to its left
- Write bucket counts \& partially sorted tile
- sorting tile improves scatter coherence later


## Combining Histograms

- Write per-block counts in column major order \& scan
radix 4-16 elements



## Satish's Radix Sort Memory Cost

| Step | Kernel | Purpose | Read Workload | Write Workload |
| :---: | :--- | :--- | :--- | :--- |
| $\mathbf{1}$ | local digit-sort | Maximize coherence | $n$ keys (+ $n$ values) | $n$ keys (+ $n$ values) |
| $\mathbf{2}$ | histogram | Create histograms | $n$ keys | $n r / b$ counts |
| $\mathbf{3}$ | bottom-level reduce | Scan histograms | $n r / b$ counts | (insignificant constant) |
| $\mathbf{4}$ | top-level scan |  | $n r / b$ counts + (insignificant constant) | $n r / b$ offsets |
| $\mathbf{5}$ | bottom-level scan |  | $n$ (insignificant constant) |  |
| $\mathbf{6}$ | scatter | Distribute keys | $n r / b$ offsets + $n$ keys (+ $n$ values) | $n$ keys (+ $n$ values) |

Total Memory Workload: $(k / d)(n)(5 r / b+7)$ keys only
$(k / d)(n)(5 r / b+9)$ with values

- d-bit radix digits
- radix $r=2^{\wedge} d$
- $n$-element input sequence of $k$-bit keys
- b bits per step


## Merrill’s 3-step sort



## Merrill's sort, costs

| Step | Kernel | Purpose | Read Workload | Write Workload |
| :---: | :---: | :---: | :---: | :---: |
| 1 | bottom-level reduce | Create flags, compact flags, scatter keys | $n$ keys | (insignificant constant) |
| 2 | top-level scan |  | (insignificant constant) | (insignificant constant) |
| 3 | bottom-level scan |  | $n$ keys (+ $n$ values) + (insignificant constant) | $n$ keys (+ $n$ values) |

Total Memory Workload: $(k / d)(3 n)$ keys only $(k / d)(5 n)$ with values

- $d$-bit radix digits
- radix $r=2^{\wedge} d$
- $n$-element input sequence of $k$-bit keys
- Current GPUs use $d=4$ (higher values exhaust local storage)


## Results (NVIDIA GTX 285)



## Merge Sort

## Merge Sort

- Divide input array into 256 -element tiles
- Sort each tile independently

| sort | sort | sort | sort | sort | sort | sort | sort |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |

- Produce sorted output with tree of merges



## Sorting a Tile

- Tiles are sized so that:
- a single thread block can sort them efficiently
- they fit comfortably in on-chip memory
- Sorting networks are most efficient in this regime
- we use odd-even merge sort
- about 5-10\% faster than comparable bitonic sort
- Caveat: sorting networks may reorder equal keys


## Merging Pairs of Sorted Tiles



- Launch 1 thread block to process each pair of tiles
- Load tiles into on-chip memory
- Perform counting merge
- Stored merged result to global memory


## My grad-student-days merge



Flip B, pairwise compare to $A$


Smallest element in each comparison yields smallest $p$ elements overall in a bitonic sequence

## Counting Merge

```
upper_bound(A[i], B) = count( j where A[i] \leq B[j] )
```



## Use binary search since A \& B are sorted

## Counting Merge

```
upper_bound(A[i], B) = count( j where A[i] \leq B[j] )
```



```
scatter( A[i] -> C[i + upper_bound(A[i], B)] )
scatter( B[j] -> C[lower_bound(B[j], A) + j] )
```


## Merging Larger Subsequences



- Partition larger sequences into collections of tiles
- Apply counting merge to each pair of tiles


## Two-way Partitioning Merge

- Pick a splitting element from either A or B
$\square$
- Divide A and B into elements below/above splitter

$$
A[j] \leq A[i] A[i] A[j]>A[i]
$$

$$
B[j] \leq A[i] B[j]>A[i]
$$

## found by binary search

- Recurse

$$
\begin{aligned}
& \text { merge }: \frac{A[j] \leq A[i]}{B[j] \leq A[i]} \\
& \text { merge }: A[i] \\
& A[j]>A[i] \\
& B[i]
\end{aligned}
$$

## Multi-way Partitioning Merge

- Pick every $256^{\text {th }}$ element of A \& B as splitter

| 256 | 256 | 256 | 256 |
| :--- | :--- | :--- | :--- |

- Apply merge recursively to merge splitter sets
- recursively apply merge procedure
- Split A \& B with merged splitters

| A0 | A1 | A2 |  |
| :---: | :---: | :---: | :---: |
| B0 | B1 | B2 | $\cdots \cdot$ |

- Merge resulting pairs of tiles (at most 256 elements)


## Sort papers

- Mark Harris, Shubhabrata Sengupta, and John D. Owens.Parallel Prefix Sum (Scan) with CUDA. In Hubert Nguyen, editor, GPU Gems 3, chapter 39, pages 851876. Addison Wesley, August 2007.
- N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore GPUs," IPDPS 2009: IEEE International Symposium on Parallel \& Distributed Processing, May 2009.
- D. Merrill and A. Grimshaw, Revisiting Sorting for GPGPU Stream Architectures. Technical Report CS2010-03, Department of Computer Science, University of Virginia, 2010, 17pp.

