Data parallelism

We'll look at two examples of using data parallelism to speed up computation across multiple CPU cores.
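
The parallel code below relies on Julia's task-based multithreading, so the speedups only materialize when Julia is started with more than one thread (for example with julia --threads=auto). A quick way to check the current session:

Threads.nthreads()  # should be greater than 1 to benefit from the examples below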

Example 1: reducing a list

using PureFun, FLoops, BenchmarkTools
using PureFun.Chunky

We are tasked with optimizing the following function:

sequential(xs) = mapreduce(round ∘ sin, +, xs);
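
In words: apply round ∘ sin to each element and sum the results. The restatement below (a sketch for clarity only, not used in what follows) spells that out:

sequential_alt(xs) = sum(round(sin(x)) for x in xs);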

For applications that require reducing over an entire collection, it makes sense to reach for a PureFun.Chunky.@list:

Chunky.@list(ChunkyRandomAccessList,
             list  = PureFun.RandomAccess.List,
             chunk = PureFun.Contiguous.StaticChunk{8})
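
As a quick sanity check that the new list type behaves like an ordinary collection (a minimal sketch, assuming the constructor accepts any vector and the result supports iteration):

small = ChunkyRandomAccessList(collect(1:10))
@assert sum(small) == 55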

We start out by benchmarking the current solution:

julia> @btime(sequential(xs),
              setup = xs = ChunkyRandomAccessList(rand(Int, 5_000)),
              evals = 10, samples=300);
  107.854 μs (1256 allocations: 58.81 KiB)

The lists in PureFun.jl implement the SplittablesBase interface, so they can be used with a variety of parallel algorithms.
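
For example, SplittablesBase.halve splits a list into two halves that can be reduced independently. A rough sketch (it assumes SplittablesBase itself is installed in the current environment):

using SplittablesBase: halve
left, right = halve(ChunkyRandomAccessList(collect(1:100)))
@assert sum(left) + sum(right) == sum(1:100)

Here we use FLoops.jl to parallelize our reduction: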

function parallel(xs)
    @floop for x in xs
        @reduce s += round(sin(x))
    end
    return s
end;
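
FLoops also accepts an explicit executor, which is useful for tuning or for forcing sequential execution while debugging. A sketch of the same reduction with the threaded executor (the basesize of 512 is an arbitrary choice, not a recommendation):

function parallel_tuned(xs)
    # ThreadedEx comes from FLoops (loaded above); basesize controls how
    # much of the input each task handles before splitting further
    @floop ThreadedEx(basesize = 512) for x in xs
        @reduce s += round(sin(x))
    end
    return s
end;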

A quick test to make sure parallel agrees with sequential:

test = ChunkyRandomAccessList(rand(Int, 100_000))
@assert sequential(test) == parallel(test)

And indeed, we see a decent speedup:

julia> @btime(parallel(xs),
              setup = xs = ChunkyRandomAccessList(rand(Int, 5_000)),
              evals = 10, samples = 300);
  53.725 μs (793 allocations: 28.46 KiB)

Example 2: parallel heap sort

using PureFun, Folds, BenchmarkTools
using PureFun.Pairing

If comparisons are expensive, then constructing a heap can get pretty slow:

xs = rand(Int, 1_000);
julia> @btime Pairing.Heap($xs, Base.Order.By(sin));
  45.333 μs (2998 allocations: 93.69 KiB)

In "How to Think about Parallel Programming: Not!", Guy Steele describes a useful strategy for parallel programming:

  • from each input element, construct a singleton solution
  • merge solutions using an associative merge operation
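
This pattern is not specific to heaps. As a toy sketch, merging singleton Sets with union (an associative operation) deduplicates a collection:

to_set(x) = Set((x,))
@assert Folds.mapreduce(to_set, union, [1, 2, 2, 3]) == Set([1, 2, 3])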

We can use that strategy to construct a heap:

const ∅ = Pairing.Heap{Int}(Base.Order.By(sin))
singleton(x) = push(∅, x)

test_seq = Pairing.Heap(xs, Base.Order.By(sin))
test_par = Folds.mapreduce(singleton, merge, xs, init=∅)
@assert all(test_seq .== test_par)

And once again, we see a decent speedup:

julia> @btime Folds.mapreduce(singleton, merge, $xs, init=∅);
  18.625 μs (7016 allocations: 221.66 KiB)

This page was generated using Literate.jl.