This tutorial is inspired by Tim Holy's introduction to GPU computing in Julia. Here we implement a custom kernel to compute a simple value function iteration (VFI) algorithm. That part of the tutorial is based on Fernandez-Villaverde and Zarruk's parallel computing survey for economists. For a good introduction to the computational side of value function iteration, I recommend the corresponding QuantEcon tutorial.
We create a function that builds our collection of model parameters as a named tuple.
function param(;nx=1500, ne=15, T=10)
    p = (nx = nx,
         xmax = 4.0,
         xmin = 0.1,
         ne = ne,
         sigma_eps = 0.02058,
         lambda_eps = 0.99,
         m = 1.5,
         sigma = 2,
         beta = 0.97,
         T = T,
         r = 0.07,
         w = 5,
         xgrid = zeros(Float32, nx),
         egrid = zeros(Float32, ne),
         P = zeros(Float32, ne, ne))

    # Grid for capital (x)
    p.xgrid[:] = collect(range(p.xmin, stop = p.xmax, length = p.nx))

    # Grid for productivity (e) with Tauchen (1986)
    sigma_y = sqrt((p.sigma_eps^2) / (1 - (p.lambda_eps^2)))
    estep = 2 * sigma_y * p.m / (p.ne - 1)
    for i = 1:p.ne
        p.egrid[i] = -p.m * sigma_y + (i - 1) * estep
    end

    # Transition probability matrix (P) with Tauchen (1986)
    mm = p.egrid[2] - p.egrid[1]
    for j = 1:p.ne
        for k = 1:p.ne
            if k == 1
                p.P[j, k] = cdf(Normal(), (p.egrid[k] - p.lambda_eps*p.egrid[j] + (mm/2)) / p.sigma_eps)
            elseif k == p.ne
                p.P[j, k] = 1 - cdf(Normal(), (p.egrid[k] - p.lambda_eps*p.egrid[j] - (mm/2)) / p.sigma_eps)
            else
                p.P[j, k] = cdf(Normal(), (p.egrid[k] - p.lambda_eps*p.egrid[j] + (mm/2)) / p.sigma_eps) -
                            cdf(Normal(), (p.egrid[k] - p.lambda_eps*p.egrid[j] - (mm/2)) / p.sigma_eps)
            end
        end
    end

    # exponentiate egrid to get shock levels
    p.egrid[:] = exp.(p.egrid)
    return p
end
param (generic function with 1 method)
Let's get an instance of it:
using Distributions
p = param()
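As a quick sanity check (my addition, not in the original), each row of the Tauchen transition matrix should be a probability distribution over next period's shock:

# each row of P should sum to one (up to Float32 rounding)
@assert all(isapprox.(sum(p.P, dims=2), 1.0f0, atol=1e-4))
extrema(p.egrid)   # the shock grid in levels, after exponentiation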
A state in our setup is a triple (ix,ie,age), i.e. an index into the capital grid, an index into the endowment shock grid, and the current age. Here is how we could compute the value at one such state:
function v(ix::Int, ie::Int, age::Int, Vnext::Matrix{Float64}, p::NamedTuple)
    out = typemin(eltype(Vnext))
    ixpopt = 0   # optimal policy
    expected = 0.0
    utility = out
    for jx in 1:p.nx
        if age < p.T
            for je in 1:p.ne
                expected += p.P[ie,je] * Vnext[jx,je]
            end
        end
        cons = (1+p.r)*p.xgrid[ix] + p.egrid[ie]*p.w - p.xgrid[jx]
        utility = (cons^(1-p.sigma))/(1-p.sigma) + p.beta * expected
        if cons <= 0
            utility = typemin(eltype(Vnext))
        end
        if utility >= out
            out = utility
            ixpopt = jx
        end
    end
    return out
end
v (generic function with 1 method)
And finally, a function that cycles backwards over ages and visits each state:
function cpu_lifecycle(p::NamedTuple)
    V = zeros(p.nx,p.ne,p.T)
    Vnext = zeros(p.nx,p.ne)
    for age in p.T:-1:1
        for ix in 1:p.nx
            for ie in 1:p.ne
                value = v(ix,ie,age,Vnext,p)
                V[ix,ie,age] = value
            end
        end
        Vnext[:,:] = V[:,:,age]
    end
    println("first three shock states at age 1: $(V[1,1:3,1])")
    return V
end
vcpu = cpu_lifecycle(p);
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
Let's make a plot of the resulting value function:
using Plots
CList = reshape(range(colorant"yellow", stop=colorant"red", length=p.ne), 1, p.ne)
p1 = plot(p.xgrid, vcpu[:,:,1], linecolor=CList, title="period 1", leg=false,
          label=["e$i" for i in 1:p.ne], xlabel="x", ylabel="v(x)")
p2 = plot(p.xgrid, vcpu[:,:,p.T-1], linecolor=CList, title="period T-1", leg=:bottomright,
          label=["e$i" for i in 1:p.ne], xlabel="x", ylabel="v(x)")
plot(p1, p2, layout=2, link=:y)
Multicore computation distributes the work across multiple worker processes. A slight complication with the distributed model is that the worker processes use separate memory - even if they reside on the same machine! I find it easiest to imagine that you start several separate julia processes in your terminal, each in its own tab. Clearly, creating x = 2 in the first process doesn't make the object x available in the other processes. This is as true for data (like x) as it is for code: any function you want to call on one of the worker processes must be defined on that process. Again, visualize the separate terminal windows, and imagine defining function f() on process 1. Calling f() on process 2 will return an error.
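This can be illustrated with a minimal sketch (run in a fresh julia session, so it doesn't interfere with the worker setup below; f and g are made-up example functions):

using Distributed
addprocs(1)                          # start one extra worker; its id will be 2

f() = "hello from process $(myid())" # defined on the master process only

f()                                  # works: "hello from process 1"
# remotecall_fetch(f, 2)             # errors: f is not defined on worker 2

@everywhere g() = "hello from process $(myid())"  # defined on all processes
remotecall_fetch(g, 2)               # works: "hello from process 2"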
The Distributed package provides functionality to efficiently connect those separate processes, and the SharedArrays package gives us an Array type which may be shared by several processes. The function workers() shows the ids of the currently active workers. By default, this is just a single worker, which we also often call the master node:
using SharedArrays
using Distributed
workers()
addprocs(5)   # add 5 workers
5-element Array{Int64,1}:
 2
 3
 4
 5
 6
Let's define a function that distributes the computation of each state (ix,ie) over our set of workers. The main thing to look out for is that we have to make the function v available on all workers. We could save it to a file script.jl, for example, and launch julia with the command line option --load script.jl. Alternatively, we redefine it here by prefixing it with the macro @everywhere:
@everywhere function v(ix::Int, ie::Int, age::Int, Vnext::Matrix{Float64}, p::NamedTuple)
    out = typemin(eltype(Vnext))
    ixpopt = 0   # optimal policy
    expected = 0.0
    utility = out
    for jx in 1:p.nx
        if age < p.T
            for je in 1:p.ne
                expected += p.P[ie,je] * Vnext[jx,je]
            end
        end
        cons = (1+p.r)*p.xgrid[ix] + p.egrid[ie]*p.w - p.xgrid[jx]
        utility = (cons^(1-p.sigma))/(1-p.sigma) + p.beta * expected
        if cons <= 0
            utility = typemin(eltype(Vnext))
        end
        if utility >= out
            out = utility
            ixpopt = jx
        end
    end
    return out
end
Now that this is defined on all workers, we can move on to define the calling function on the master node. Notice that we have to use a SharedArray for the worker processes to save their results in - remember that by default they would not have access to a standard Array created on the master process.
function cpu_lifecycle_multicore(p::NamedTuple)
    println("julia running with $(length(workers())) workers")
    V = zeros(p.nx,p.ne,p.T)
    Vnext = zeros(p.nx,p.ne)
    Vshared = SharedArray{Float64}(p.nx,p.ne)   # note!
    for age in p.T:-1:1
        @sync @distributed for i in 1:(p.nx*p.ne)   # distribute over a single index
            ix,ie = Tuple(CartesianIndices((p.nx,p.ne))[i])
            value = v(ix,ie,age,Vnext,p)
            Vshared[ix,ie] = value
        end
        V[:,:,age] = Vshared
        Vnext[:,:] = Vshared
    end
    println("first three shock states at age 1: $(V[1,1:3,1])")
    return V
end
cpu_lifecycle_multicore (generic function with 1 method)
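The CartesianIndices trick above converts a single linear loop index back into a pair (ix,ie); here is a tiny sketch of how it behaves (the grid sizes are made up for illustration):

idx = CartesianIndices((3, 2))   # pretend nx = 3 and ne = 2
Tuple(idx[1])                    # (1, 1)
Tuple(idx[4])                    # (1, 2): linear indices run column-major
[Tuple(idx[i]) for i in 1:6]     # visits every (ix, ie) pair exactly once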
Important: Notice that the worker processes do not have to reside on the same machine (in this example they do, but that is not required). The function addprocs is very powerful, so check out ?addprocs. In particular, one can simply supply a list of machine identifiers (like for instance IP addresses) to the function, as sketched below. It is also interesting to look at the output of htop (or Activity Monitor on macOS) and see several different julia processes doing work when we call this function.
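For instance, here is a hedged sketch of what adding workers on remote machines over SSH could look like (the user name and IP addresses are made up):

using Distributed
# each entry is (machine_spec, worker_count); this assumes passwordless ssh
# access and a working julia installation on every remote machine
addprocs([("user@10.0.0.2", 2), ("user@10.0.0.3", 2)]; tunnel = true)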
Related to this is the multithreaded model of parallelization. The good thing here is that we operate in shared memory, i.e. we don't have to worry about moving data or code around - there is just a single julia session running in your imaginary terminal from above. What happens instead is that julia runs several threads on our CPU whenever we tell it to do so with the macro Threads.@threads. One needs to be very careful to write such code in a thread-safe way: for example, no two threads should write to the same memory location, as sketched below the launch command. Similarly to having to add workers via addprocs as above, here we have to start julia with multiple threads. You can achieve that by starting julia via
JULIA_NUM_THREADS=4 julia
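To make the thread-safety point concrete, here is a minimal sketch (the function names and the threadid-slot pattern are my additions; the pattern was the standard recipe in the Julia 1.0 era this tutorial targets):

# NOT thread-safe: every thread updates the same accumulator s,
# so increments can be lost (a data race)
function unsafe_sum(x)
    s = 0.0
    Threads.@threads for i in eachindex(x)
        s += x[i]
    end
    return s
end

# thread-safe: each thread writes only to its own slot, just like each
# (ix,ie) owns one slot of V in the lifecycle loop below
function safe_sum(x)
    parts = zeros(Threads.nthreads())
    Threads.@threads for i in eachindex(x)
        parts[Threads.threadid()] += x[i]
    end
    return sum(parts)
end

x = rand(1_000_000)
abs(unsafe_sum(x) - sum(x))   # typically far from zero with several threads
abs(safe_sum(x) - sum(x))     # ≈ 0 up to floating-point reordering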
Here is our lifecycle problem with a threaded loop:
function cpu_lifecycle_thread(p::NamedTuple)
    println("julia running with $(Threads.nthreads()) threads")
    V = zeros(p.nx,p.ne,p.T)
    Vnext = zeros(p.nx,p.ne)
    for age in p.T:-1:1
        Threads.@threads for i in 1:(p.nx*p.ne)
            ix,ie = Tuple(CartesianIndices((p.nx,p.ne))[i])
            value = v(ix,ie,age,Vnext,p)
            V[ix,ie,age] = value
        end
        Vnext[:,:] = V[:,:,age]
    end
    println("first three shock states at age 1: $(V[1,1:3,1])")
    return V
end
cpu_lifecycle_thread (generic function with 1 method)
And, again, it's instructive to look at the system monitor via htop or similar. Now we see that a single process (the process that builds this page, executing julia make.jl) runs at almost 400% CPU utilization, while the other worker processes are idle.
Before worrying about speed, we always want to verify that our computations are accurate. Taking the content of vcpu as the truth, let's see how close we get with the threaded computation:
vthread = cpu_lifecycle_thread(p);
julia running with 4 threads
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
using Test
@test vthread ≈ vcpu
Test Passed
You can see that I used the symbol ≈ here, which means approximately equal; in the julia REPL you obtain it by typing \approx<TAB>. Now let's worry about speed. Benchmarking is subject to machine noise - it is very similar to any statistical sampling procedure. Consider these timings of the same call, obtained naively with @time:
@time cpu_lifecycle(p);
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
  7.887026 seconds (73 allocations: 3.797 MiB)
@time cpu_lifecycle(p);
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
  7.866052 seconds (73 allocations: 3.797 MiB)
@time cpu_lifecycle(p);
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
  7.868888 seconds (73 allocations: 3.797 MiB)
You can see that they all differ slightly. If we were benchmarking a function that runs quickly, we would use the BenchmarkTools package: it runs the benchmarked function many times and reports summary statistics. For long-running functions like ours, it would only take a single sample anyway, so the standard @time will do for us.
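For reference, here is what minimal BenchmarkTools usage looks like on a fast toy function (this snippet is my own illustration, not part of the original benchmark):

using BenchmarkTools
x = rand(1000)
@btime sum($x)       # many samples; prints the fastest time and allocations
@benchmark sum($x)   # full summary statistics (min, median, mean, ...)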
@time cpu_lifecycle(p);
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
  7.858191 seconds (73 allocations: 3.797 MiB)
@time cpu_lifecycle_thread(p);
julia running with 4 threads
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
  2.126724 seconds (377 allocations: 3.803 MiB, 0.19% gc time)
@time cpu_lifecycle_multicore(p)

julia running with 5 workers
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
julia running with 5 workers
  2.521773 seconds (10.19 k allocations: 2.753 MiB)
Process(`/home/floswald/apps/julia-1.0.0/bin/julia -Cnative -J/home/floswald/apps/julia-1.0.0/lib/julia/sys.so --compile=yes --depwarn=yes /home/floswald/git/CuArrays.jl/tutorials/src/addprocs-vfi.jl`, ProcessExited(0))
Looking at those results, it's important to realise that we didn't cut the runtime by exactly the number of processes we ran - quite far from it, actually. It is very rare that a parallelization achieves this so-called linear speedup. Overhead from communication between the various processes (threads or workers) is a crucial reason why speedups are sub-linear. A simple rule of thumb is to avoid data movement between processes as much as possible; unfortunately, optimizing parallel solutions beyond that simple rule requires knowledge that goes beyond this tutorial (and its author's knowledge as well).
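A useful rough model for this (my addition for intuition) is Amdahl's law: if a fraction s of the runtime is inherently serial (communication, setup, ...), then N workers can speed us up by at most 1/(s + (1-s)/N). A quick sketch:

# Amdahl's law: upper bound on speedup with N workers when a fraction s
# of the runtime cannot be parallelized
amdahl(s, N) = 1 / (s + (1 - s) / N)

amdahl(0.05, 4)    # ≈ 3.48: 5% serial work already caps 4 workers below 4x
amdahl(0.05, 64)   # ≈ 15.4: diminishing returns as N grows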
Now to the GPU. We write a kernel v_gpu! in which each GPU thread handles a subset of the nx*ne states via a strided loop: a thread starts at its global index and jumps ahead by the total number of launched threads until all states are covered. Since age enters as an argument, one kernel launch computes one age of the value function.

using CuArrays
using CUDAnative

function v_gpu!(V,nx,ne,T,xgrid,egrid,r,w,sigma,beta,P,age)
    badval = -1_000_000.0f0
    vtmp = badval      # initialize V at lowest value
    ixpopt = 0         # optimal policy
    expected = 0.0f0
    utility = badval
    # strided loop over all nx*ne states
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i = index:stride:(nx*ne)
        ix = ceil(Int,i/ne)              # capital index
        ie = ceil(Int,mod(i-0.05,ne))    # shock index
        # @cuprintf("threadIdx %ld, blockDim %ld, ix %ld, ie %ld\n", index, stride, ix, ie)
        j = ix + nx*(ie-1 + ne*(age-1))  # linear index into V at this age
        vtmp = badval                    # initialize V at lowest value
        for jx in 1:nx
            if age < T
                for je in 1:ne
                    expected = expected + P[ie,je] * V[jx + nx*(je-1 + ne*(age))]
                end
            end
            cons = convert(Float32,(1+r)*xgrid[ix] + egrid[ie]*w - xgrid[jx])
            utility = CUDAnative.pow(cons,convert(Float32,(1-sigma)))/(1-sigma) + beta * expected
            if cons <= 0
                utility = badval
            end
            if utility >= vtmp
                vtmp = utility
            end
        end
        V[j] = vtmp
    end
    return nothing
end

V_d = CuArray(zeros(Float32,p.nx,p.ne,p.T));
xgrid_d = CuArray(p.xgrid);
egrid_d = CuArray(p.egrid);
P_d = CuArray(p.P);
block_size = 256
@cuda threads=ceil(Int,(p.nx*p.ne)/block_size) blocks=block_size v_gpu!(V_d,p.nx,p.ne,p.T,xgrid_d,egrid_d,p.r,p.w,p.sigma,p.beta,P_d,p.T)

function bench_v_gpu!(V_d,nx,ne,T,xgrid_d,egrid_d,r,w,sigma,beta,P_d,age,block_size)
    CuArrays.@sync begin   # synchronize, so we time the kernel, not just its launch
        @cuda threads=ceil(Int,(nx*ne)/block_size) blocks=block_size v_gpu!(V_d,nx,ne,T,xgrid_d,egrid_d,r,w,sigma,beta,P_d,age)
    end
end
bench_v_gpu!(V_d,p.nx,p.ne,p.T,xgrid_d,egrid_d,p.r,p.w,p.sigma,p.beta,P_d,2,block_size);
using BenchmarkTools
@btime bench_v_gpu!(V_d,p.nx,p.ne,p.T,xgrid_d,egrid_d,p.r,p.w,p.sigma,p.beta,P_d,2,block_size)
143.363 ms (42 allocations: 1.47 KiB)
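The index arithmetic in the kernel is easy to get wrong, so here is a small CPU-side check of the decoding logic (my own sketch, with made-up grid sizes):

# decode a linear state index i into (ix, ie) exactly as the kernel does
decode(i, ne) = (ceil(Int, i / ne), ceil(Int, mod(i - 0.05, ne)))

decode(3, 3)                         # (1, 3): the shock index moves fastest
decode(4, 3)                         # (2, 1): then capital increments
length(unique(decode.(1:12, 3)))     # 12: all (ix,ie) pairs for nx=4, ne=3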
# using CUDAdrv
# CUDAdrv.@profile bench_v_gpu!(V_d,p.nx,p.ne,p.T,xgrid_d,egrid_d,p.r,p.w,p.sigma,p.beta,P_d,2,block_size)
# exit()
Now let's run this kernel over the entire lifecycle:
function gpu_lifecycle(p::NamedTuple)
    V_d = CuArray(zeros(Float32,p.nx,p.ne,p.T));
    xgrid_d = CuArray(p.xgrid);
    egrid_d = CuArray(p.egrid);
    P_d = CuArray(p.P);
    block_size = 256
    for it in p.T:-1:1
        @cuda threads=ceil(Int,(p.nx*p.ne)/block_size) blocks=block_size v_gpu!(V_d,p.nx,p.ne,p.T,xgrid_d,egrid_d,p.r,p.w,p.sigma,p.beta,P_d,it)
    end
    out = Array(V_d)   # download data from GPU
    println("first three shock states at age 1: $(out[1,1:3,1])")
    return out
end
gpu_lifecycle(p);   # compile
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
@btime gpu_lifecycle(p);
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
  1.817 s (481 allocations: 2.60 MiB)
@btime cpu_lifecycle(p);
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
  7.921 s (113 allocations: 3.61 MiB)
How costly is it to create the CuArrays? Let's create them beforehand and pass them in:
function gpu_lifecycle2(p::NamedTuple,V_d,xgrid_d,egrid_d,P_d)
    block_size = 256
    for it in p.T:-1:1
        @cuda threads=ceil(Int,(p.nx*p.ne)/block_size) blocks=block_size v_gpu!(V_d,p.nx,p.ne,p.T,xgrid_d,egrid_d,p.r,p.w,p.sigma,p.beta,P_d,it)
    end
    out = Array(V_d)   # download data from GPU
    println("first three shock states at age 1: $(out[1,1:3,1])")
    return out
end
V_d = CuArray(zeros(Float32,p.nx,p.ne,p.T));
xgrid_d = CuArray(p.xgrid);
egrid_d = CuArray(p.egrid);
P_d = CuArray(p.P);
gpu_lifecycle2(p,V_d,xgrid_d,egrid_d,P_d);
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
@btime gpu_lifecycle2(p,V_d,xgrid_d,egrid_d,P_d);
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
first three shock states at age 1: Float32[-2.12087, -2.08143, -2.02839]
  1.805 s (419 allocations: 893.27 KiB)
@btime cpu_lifecycle(p);
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
first three shock states at age 1: [-2.12087, -2.08143, -2.02839]
  7.910 s (112 allocations: 3.61 MiB)
Not overly costly, it turns out.
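If you want to isolate the pure transfer cost, here is a quick sketch (my own, using the arrays defined above):

using BenchmarkTools
A = zeros(Float32, p.nx, p.ne, p.T)
@btime CuArray($A);   # host -> device upload
@btime Array($V_d);   # device -> host download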
We could again benchmark this with nvprof:
using CUDAdrv
CUDAdrv.@profile gpu_lifecycle(p);
==14101== NVPROF is profiling process 14101, command: julia test2.jl
==14101== Profiling application: julia test2.jl
==14101== Profiling result:
Start Duration Grid Size Block Size Regs* ...
14.9753s 70.178us - - - ...
14.9754s 1.0240us - - - ...
14.9754s 512ns - - - ...
14.9754s 608ns - - - ...
15.1679s 1.75753s (256 1 1) (88 1 1) 206 ...
16.9255s 69.122us - - - ...
Regs: Number of registers used per CUDA thread. This number includes ...
For the current problem size, the GPU solution is about as fast as the multithreaded one with 4 threads. How does this vary with problem size?
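Before running the timings, a rough back-of-the-envelope (my own, not from the original): the function v does on the order of nx*ne work per state, and there are nx*ne states per age, so total work scales roughly like nx²·ne²·T.

# rough count of inner-loop iterations
ops(nx, ne, T) = nx^2 * ne^2 * T

ops(150, 5, 10) / ops(1500, 15, 10)   # ≈ 0.001: psmall is tiny relative to p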
psmall = param(nx=150,ne=5)
pbig = param(nx=2000,ne=25,T=100)
@time cpu_lifecycle(psmall);
first three shock states at age 1: [-2.16987, -1.94911, -1.7484]
  0.016509 seconds (64 allocations: 320.609 KiB)
@time cpu_lifecycle(pbig);
first three shock states at age 1: [-2.12024, -2.10028, -2.07342]
 32.092092 seconds (122 allocations: 8.204 MiB, 0.01% gc time)
@time cpu_lifecycle_thread(psmall);
julia running with 4 threads
first three shock states at age 1: [-2.16987, -1.94911, -1.7484]
  0.007863 seconds (371 allocations: 326.969 KiB)
@time cpu_lifecycle_thread(pbig);
julia running with 4 threads
first three shock states at age 1: [-2.12024, -2.10028, -2.07342]
  8.500026 seconds (428 allocations: 8.210 MiB, 0.02% gc time)
@time gpu_lifecycle(psmall);
first three shock states at age 1: Float32[-2.16987, -1.94911, -1.7484]
  0.071450 seconds (502 allocations: 300.016 KiB)
@time gpu_lifecycle(pbig);
first three shock states at age 1: Float32[-2.12024, -2.10028, -2.07342]
  7.181435 seconds (605 allocations: 5.941 MiB, 0.03% gc time)
Here is the corresponding nvprof trace for the big problem:
==16667== NVPROF is profiling process 16667, command: julia test.jl
WARNING: using CUDAdrv.CuArray in module Main conflicts with an existing identifier.
==16667== Profiling application: julia test.jl
==16667== Profiling result:
Start Duration Grid Size Block Size Regs* ... Device
19.9770s 163.30us - - - ... GeForce GTX 105
19.9772s 1.2160us - - - ... GeForce GTX 105
19.9772s 544ns - - - ... GeForce GTX 105
19.9772s 736ns - - - ... GeForce GTX 105
20.1390s 123.01ms (256 1 1) (196 1 1) 182 ... GeForce GTX 105
20.2621s 765.93ms (256 1 1) (196 1 1) 182 ... GeForce GTX 105
21.0280s 766.03ms (256 1 1) (196 1 1) 182 ... GeForce GTX 105
21.7940s 765.93ms (256 1 1) (196 1 1) 182 ... GeForce GTX 105
22.5600s 765.92ms (256 1 1) (196 1 1) 182 ... GeForce GTX 105
23.3259s 765.94ms (256 1 1) (196 1 1) 182 ... GeForce GTX 105
24.0918s 765.78ms (256 1 1) (196 1 1) 182 ... GeForce GTX 105
24.8576s 765.84ms (256 1 1) (196 1 1) 182 ... GeForce GTX 105
25.6234s 765.91ms (256 1 1) (196 1 1) 182 ... GeForce GTX 105
26.3893s 765.95ms (256 1 1) (196 1 1) 182 ... GeForce GTX 105
27.1553s 153.96us - - - ... GeForce GTX 105