TLDR: The two fundamental principles for writing fast Julia code:
The compiler's job is to optimize and translate Julia code it into runnable machine code. If a variable's type cannot be deduced before the code is run, then the compiler won't generate efficient code to handle that variable. We call this phenomenon "type instability". Enabling type inference means making sure that every variable's type in every function can be deduced from the types of the function inputs alone.
A "heap allocation" (or simply "allocation") occurs when we create a new variable without knowing how much space it will require (like a Vector
with flexible length).
Julia has a mark-and-sweep garbage collector (GC), which runs periodically during code execution to free up space on the heap.
Execution of code is stopped while the garbage collector runs, so minimising its usage is important.
The vast majority of performance tips come down to these two fundamental ideas. Typically, the most common beginner pitfall is the use of untyped global variables without passing them as arguments. Why is it bad? Because the type of a global variable can change outside of the body of a function, so it causes type instabilities wherever it is used. Those type instabilities in turn lead to more heap allocations.
With this in mind, after you're done with the current page, you should read the official performance tips: they contain some useful advice which is not repeated here for space reasons.
TLDR: Use BenchmarkTools.jl's @benchmark
with a setup phase to get the most accurate idea of your code's performance. Use Chairmarks.jl as a faster alternative.
The simplest way to measure how fast a piece of code runs is to use the @time
macro, which returns the result of the code and prints the measured runtime and allocations.
Because code needs to be compiled before it can be run, you should first run a function without timing it so it can be compiled, and then time it:
julia> sum_abs(vec) = sum(abs(x) for x in vec);
julia> v = rand(100);
julia> using BenchmarkTools
julia> @time sum_abs(v); # Inaccurate, note the >99% compilation time
0.028753 seconds (48.50 k allocations: 3.230 MiB, 99.96% compilation time)
julia> @time sum_abs(v); # Accurate
0.000005 seconds (1 allocation: 16 bytes)
Using @time
is quick but it has flaws, because your function is only measured once.
That measurement might have been influenced by other things going on in your computer at the same time.
In general, running the same block of code multiple times is a safer measurement method, because it diminishes the probability of only observing an outlier.
BenchmarkTools.jl is the most popular package for repeated measurements on function executions.
Similarly to @time
, BenchmarkTools offers @btime
which can be used in exactly the same way but will run the code multiple times and provide an average.
Additionally, by using $
to interpolate external values, you remove the overhead caused by global variables.
julia> using BenchmarkTools
julia> @btime sum_abs(v);
95.032 ns (1 allocation: 16 bytes)
julia> @btime sum_abs($v);
60.460 ns (0 allocations: 0 bytes)
In more complex settings, you might need to construct variables in a setup phase that is run before each sample. This can be useful to generate a new random input every time, instead of always using the same input.
julia> my_matmul(A, b) = A * b;
julia> @btime my_matmul(A, b) setup=(
A = rand(1000, 1000); # use semi-colons between setup lines
b = rand(1000)
);
141.156 μs (1 allocation: 7.94 KiB)
For better visualization, the @benchmark
macro shows performance histograms:
Advanced: Certain computations may be optimized away by the compiler before the benchmark takes place. If you observe suspiciously fast performance, especially below the nanosecond scale, this is very likely to have happened.
Chairmarks.jl offers an alternative to BenchmarkTools.jl, promising faster benchmarking while attempting to maintain high accuracy and using an alternative syntax based on pipelines.
While we previously discussed the importance of documenting breaking changes in packages using semantic versioning, regressions in performance can also be vital to track. Several packages exist for this purpose:
BenchmarkTools.jl works fine for relatively short and simple blocks of code (microbenchmarking). To find bottlenecks in a larger program, you should rather use a profiler or the package TimerOutputs.jl. It allows you to label different sections of your code, then time them and display a table of grouped by label.
Finally, if you know a loop is slow and you'll need to wait for it to be done, you can use ProgressMeter.jl or ProgressLogging.jl to track its progress.
TLDR: Profiling can identify performance bottlenecks at function level, and graphical tools such as ProfileView.jl are the best way to use it.
Whereas a benchmark measures the overall performance of some code, a profiler breaks it down function by function to identify bottlenecks.
Sampling-based profilers periodically ask the program which line it is currently executing, and aggregate results by line or by function.
Julia offers two kinds: one for runtime (in the module Profile
) and one for memory (in the submodule Profile.Allocs
).
These built-in profilers print textual outputs, but the result of profiling is best visualized as a flame graph. In a flame graph, each horizontal layer corresponds to a specific level in the call stack, and the width of a tile shows how much time was spent in the corresponding function. Here's an example:
The packages ProfileView.jl and PProf.jl both allow users to record and interact with flame graphs. ProfileView.jl is simpler to use, but PProf is more featureful and is based on pprof, an external tool maintained by Google which applies to more than just Julia code. Here we only demonstrate the former:
using ProfileView
@profview do_work(some_input)
VSCode: Calling @profview do_work(some_input)
in the integrated Julia REPL will open an interactive flame graph, similar to ProfileView.jl but without requiring a separate package.
To integrate profile visualisations into environments like Jupyter and Pluto, use ProfileSVG.jl or ProfileCanvas.jl, whose outputs can be embedded into a notebook.
No matter which tool you use, if your code is too fast to collect samples, you may need to run it multiple times in a loop.
Advanced: To visualize memory allocation profiles, use PProf.jl or VSCode's @profview_allocs
.
A known issue with the allocation profiler is that it is not able to determine the type of every object allocated, instead Profile.Allocs.UnknownType
is shown instead.
Inspecting the call graph can help identify which types are responsible for the allocations.
Apart from the built-in Profile
standard library, there are a few external profilers that you can use including Intel VTune (in combination with IntelITT.jl), NVIDIA Nsight Systems (in combination with NVTX.jl), and Tracy.
TLDR: Use JET.jl to automatically detect type instabilities in your code, and @code_warntype
or Cthulhu.jl to do so manually. DispatchDoctor.jl can help prevent them altogether.
For a section of code to be considered type stable, the type inferred by the compiler must be "concrete", which means that the size of memory that needs to be allocated to store its value is known at compile time.
Types declared abstract with abstract type
are not concrete and neither are parametric types whose parameters are not specified:
julia> isconcretetype(Any)
false
julia> isconcretetype(AbstractVector)
false
julia> isconcretetype(Vector) # Shorthand for `Vector{T} where T`
false
julia> isconcretetype(Vector{Real})
true
julia> isconcretetype(eltype(Vector{Real}))
false
julia> isconcretetype(Vector{Int64})
true
Advanced: Vector{Real}
is concrete despite Real
being abstract for subtle typing reasons but it will still be slow in practice because the type of its elements is abstract.
While type-stable function calls compile down to fast GOTO
statements, type-unstable function calls generate code that must read the list of all methods for a given operation and find the one that matches.
This phenomenon called "dynamic dispatch" prevents further optimizations.
Type-stability is a fragile thing: if a variable's type cannot be inferred, then the types of variables that depend on it may not be inferrable either. As a first approximation, most code should be type-stable unless it has a good reason not to be.
The simplest way to detect an instability is with the builtin macro @code_warntype
:
The output of @code_warntype
is difficult to parse, but the key takeaway is the return type of the function's Body
: if it is an abstract type, like Any
, something is wrong.
In a normal Julia REPL, such cases would show up colored in red as a warning.
julia> function put_in_vec_and_sum(x)
v = []
push!(v, x)
return sum(v)
end;
julia> @code_warntype put_in_vec_and_sum(1)
MethodInstance for Main.__FRANKLIN_1579807.put_in_vec_and_sum(::Int64)
from put_in_vec_and_sum(x) @ Main.__FRANKLIN_1579807 string:1
Arguments
#self#::Core.Const(Main.__FRANKLIN_1579807.put_in_vec_and_sum)
x::Int64
Locals
v::Vector{Any}
Body::Any
1 ─ (v = Base.vect())
│ Main.__FRANKLIN_1579807.push!(v, x)
│ %3 = Main.__FRANKLIN_1579807.sum(v)::Any
└── return %3
Unfortunately, @code_warntype
is limited to one function body: calls to other functions are not expanded, which makes deeper type instabilities easy to miss.
That is where JET.jl can help: it provides optimization analysis aimed primarily at finding type instabilities.
While test integrations are also provided, the interactive entry point of JET is the @report_opt
macro.
julia> using JET
julia> @report_opt put_in_vec_and_sum(1)
═════ 6 possible errors found ═════
┌ put_in_vec_and_sum(x::Int64) @ Main.__FRANKLIN_1579807 ./string:4
│┌ sum(a::Vector{Any}) @ Base ./reducedim.jl:1010
││┌ sum(a::Vector{Any}; dims::Colon, kw::@Kwargs{}) @ Base ./reducedim.jl:1010
│││┌ _sum(a::Vector{Any}, ::Colon) @ Base ./reducedim.jl:1014
││││┌ _sum(a::Vector{Any}, ::Colon; kw::@Kwargs{}) @ Base ./reducedim.jl:1014
│││││┌ _sum(f::typeof(identity), a::Vector{Any}, ::Colon) @ Base ./reducedim.jl:1015
││││││┌ _sum(f::typeof(identity), a::Vector{Any}, ::Colon; kw::@Kwargs{}) @ Base ./reducedim.jl:1015
│││││││┌ mapreduce(f::typeof(identity), op::typeof(Base.add_sum), A::Vector{Any}) @ Base ./reducedim.jl:357
││││││││┌ mapreduce(f::typeof(identity), op::typeof(Base.add_sum), A::Vector{Any}; dims::Colon, init::Base._InitialValue) @ Base ./reducedim.jl:357
│││││││││┌ _mapreduce_dim(f::typeof(identity), op::typeof(Base.add_sum), ::Base._InitialValue, A::Vector{Any}, ::Colon) @ Base ./reducedim.jl:365
││││││││││┌ _mapreduce(f::typeof(identity), op::typeof(Base.add_sum), ::IndexLinear, A::Vector{Any}) @ Base ./reduce.jl:447
│││││││││││┌ mapreduce_impl(f::typeof(identity), op::typeof(Base.add_sum), A::Vector{Any}, ifirst::Int64, ilast::Int64) @ Base ./reduce.jl:277
││││││││││││┌ mapreduce_impl(f::typeof(identity), op::typeof(Base.add_sum), A::Vector{…}, ifirst::Int64, ilast::Int64, blksize::Int64) @ Base ./reduce.jl:257
│││││││││││││ runtime dispatch detected: Base.mapreduce_first(f::typeof(identity), op::typeof(Base.add_sum), %3::Any)::Any
││││││││││││└────────────────────
││││││││││││┌ mapreduce_impl(f::typeof(identity), op::typeof(Base.add_sum), A::Vector{…}, ifirst::Int64, ilast::Int64, blksize::Int64) @ Base ./reduce.jl:262
│││││││││││││ runtime dispatch detected: op::typeof(Base.add_sum)(%9::Any, %11::Any)::Any
││││││││││││└────────────────────
││││││││││││┌ mapreduce_impl(f::typeof(identity), op::typeof(Base.add_sum), A::Vector{…}, ifirst::Int64, ilast::Int64, blksize::Int64) @ Base ./reduce.jl:273
│││││││││││││ runtime dispatch detected: op::typeof(Base.add_sum)(%69::Any, %71::Any)::Any
││││││││││││└────────────────────
││││││││││┌ _mapreduce(f::typeof(identity), op::typeof(Base.add_sum), ::IndexLinear, A::Vector{Any}) @ Base ./reduce.jl:435
│││││││││││ runtime dispatch detected: Base.mapreduce_first(f::typeof(identity), op::typeof(Base.add_sum), %10::Any)::Any
││││││││││└────────────────────
││││││││││┌ _mapreduce(f::typeof(identity), op::typeof(Base.add_sum), ::IndexLinear, A::Vector{Any}) @ Base ./reduce.jl:440
│││││││││││ runtime dispatch detected: op::typeof(Base.add_sum)(%15::Any, %16::Any)::Any
││││││││││└────────────────────
││││││││││┌ _mapreduce(f::typeof(identity), op::typeof(Base.add_sum), ::IndexLinear, A::Vector{Any}) @ Base ./reduce.jl:443
│││││││││││ runtime dispatch detected: op::typeof(Base.add_sum)(%18::Any, %25::Any)::Any
││││││││││└────────────────────
VSCode: The Julia extension features a static linter, and runtime diagnostics with JET can be automated to run periodically on your codebase and show any problems detected.
Cthulhu.jl exposes the @descend
macro which can be used to interactively "step through" lines of the corresponding typed code, and "descend" into a particular line if needed.
This is akin to repeatedly calling @code_warntype
deeper and deeper into your functions, slowly succumbing to the madness...
We cannot demonstrate it on a static website, but the video example is a good starting point.
The Julia manual has a collection of tips to improve type inference.
A more direct approach is to error whenever a type instability occurs: the macro @stable
from DispatchDoctor.jl allows exactly that.
TLDR: You can reduce allocations with careful array management.
After ensuring type stability, one should try to reduce the number of heap allocations a program makes. Again, the Julia manual has a series of tricks related to arrays and allocations which you should take a look at. In particular, try to modify existing arrays instead of allocating new objects (caution with array slices) and try to access arrays in the right order (column major order).
And again, you can also choose to error whenever an allocation occurs, with the help of AllocCheck.jl.
By annotating a function with @check_allocs
, if the function is run and the compiler detects that it might allocate, it will throw an error.
Alternatively, to ensure that non-allocating functions never regress in future versions of your code, you can write a test set to check allocations by providing the function and a concrete type-signature.
@testset "non-allocating" begin
@test isempty(AllocCheck.check_allocs(my_func, (Float64, Float64)))
end
TLDR: If you can anticipate which functions or packages you will need, loading time can be greatly reduced with PrecompileTools.jl or PackageCompiler.jl.
A number of tools allow you to reduce Julia's latency, also referred to as TTFX (time to first X, where X was historically plotting a graph).
PrecompileTools.jl reduces the amount of time taken to run functions loaded from a package or local module that you wrote. It allows module authors to give a "list" of methods to precompile when a module is loaded for the first time. These methods then have the same latency as if they had already been run by the end user.
Here's an example of precompilation, adapted from the package's documentation:
module MyPackage
using PrecompileTools: @compile_workload
struct MyType
x::Int
end
myfunction(a::Vector) = a[1].x
@compile_workload begin
a = [MyType(1)]
myfunction(a)
end
end
Note that every method that is called will be compiled, no matter how far down the call stack or which module it comes from. To see if the intended calls were compiled correctly or diagnose other problems related to precompilation, use SnoopCompile.jl. This is especially important for writers of registered Julia packages, as it allows you to diagnose recompilation that happens due to invalidation.
To reduce the time that packages take to load, you can use PackageCompiler.jl to generate a custom version of Julia, called a sysimage, with its own standard library.
As packages in the standard library are already compiled, any using
or import
statement involving them is almost instant.
Once PackageCompiler.jl is added to your global environment, activate a local environment for which you want to generate a sysimage, ensure all of the packages you want to compile are in its Project.toml
, and run create_sysimage
as in the example below.
The filetype of sysimage_path
differs by operating system: Linux has .so
, MacOS has .dylib
, and Windows has .dll
.
packages_to_compile = ["Makie", "DifferentialEquations"]
create_sysimage(packages_to_compile; sysimage_path="MySysimage.so")
Once a sysimage is generated, it can be used with the command line flag: julia --sysimage=path/to/sysimage
.
VSCode: The generation and loading of sysimages can be streamlined with VSCode.
By default, the command sequence Task: Run Build Task
followed by Julia: Build custom sysimage for current environment
will compile a sysimage containing all packages in the current environment, but additional details can be specified in a /.vscode/JuliaSysimage.toml
file.
To automatically detect and use a custom sysimage, set useCustomSysimage
to true
in the application settings.
PackageCompiler.jl also facilitates the creation of apps and libraries that can be shared to and run on machines that don't have Julia installed.
At a basic level, all that's required to turn a Julia module MyModule
into an app is a function julia_main()::Cint
that returns 0
upon successful completion.
Then, with PackageCompiler.jl loaded, run create_app("MyModule", "MyAppCompiled")
.
Command line arguments to the resulting app are assigned to the global variable ARGS::Array{ASCIIString}
, the handling of which can be made easier by ArgParse.jl.
In Julia, a library is just a sysimage with some extras that enable external programs to interact with it.
Any functions in a module marked with Base.@ccallable
, and whose type signature involves C-conforming types e.g. Cint
, Cstring
, and Cvoid
, can be compiled into an externally callable library with create_library
, similarly to create_app
.
Unfortunately, the process of compiling and sharing a standalone executable or callable library must take relocability into account, which is beyond the scope of this blog.
Advanced: An alternative way to compile a shareable app or library that doesn't need to compile a sysimage, and therefore results in smaller binaries, is to use StaticCompiler.jl and its sister package StaticTools.jl.
The biggest tradeoff of not compiling a sysimage, is that Julia's garbage collector is no longer included, so all heap allocations must be managed manually, and all code compiled must be type-stable.
To get around this limitation, you can use static equivalents of dynamic types, such as a StaticArray
(StaticArrays.jl) instead of an Array
or a StaticString
(StaticTools.jl), use malloc
and free
from StaticTools.jl directly, or use arena allocators with Bumper.jl.
The README of StaticCompiler.jl contains a more detailed guide on how to prepare code to be compiled.
TLDR: Use Threads
or OhMyThreads.jl on a single machine, Distributed
or MPI.jl on a computing cluster. GPU-compatible code is easy to write and run.
Code can be made to run faster through parallel execution with multithreading (shared-memory parallelism) or multiprocessing / distributed computing.
Many common operations such as maps and reductions can be trivially parallelised through either method by using their respective Julia packages (e.g pmap
from Distributed.jl and tmap
from OhMyThreads.jl).
Multithreading is available on almost all modern hardware, whereas distributed computing is most useful to users of high-performance computing clusters.
To enable multithreading with the built-in Threads
library, use one of the following equivalent command line flags, and give either an integer or auto
:
julia --threads 4
julia -t auto
Once Julia is running, you can check if this was successful by calling Threads.nthreads()
.
VSCode: The default number of threads can be edited by adding "julia.NumThreads": 4,
to your settings. This will be applied to the integrated terminal.
Advanced: Linear algebra code calls the low-level libraries BLAS and LAPACK.
These libraries manage their own pool of threads, so single-threaded Julia processes can still make use of multiple threads, and multi-threaded Julia processes that call these libraries may run into performance issues due to the limited number of threads available in a single core.
In this case, once LinearAlgebra
is loaded, BLAS can be set to use only one thread by calling BLAS.set_num_threads(1)
.
For more information see the docs on multithreading and linear algebra.
Regardless of the number of threads, you can parallelise a for loop with the macro Threads.@threads
.
The macros @spawn
and @async
function similarly, but require more manual management of tasks and their results. For this reason @threads
is recommended for those who do not wish to use third-party packages.
When designing multithreaded code, you should generally try to write to shared memory as rarely as possible. Where it cannot be avoided, you need to be careful to avoid "race conditions", i.e. situations when competing threads try to write different things to the same memory location. It is usually a good idea to separate memory accesses with loop indices, as in the example below:
results = zeros(Int, 4)
Threads.@threads for i in 1:4
results[i] = i^2
end
Almost always, it is not a good idea to use threadid()
.
Even if you manage to avoid any race conditions in your multithreaded code, it is very easy to run into subtle performance issues (like false sharing). For these reasons, you might want to consider using a high-level package like OhMyThreads.jl, which provides a user-friendly alternative to Threads
and makes managing threads and their memory use much easier.
The helpful translation guide will get you started in a jiffy.
If the latency of spinning up new threads becomes a bottleneck, check out Polyester.jl for very lightweight threads that are quicker to start.
If you're on Linux, you should consider using ThreadPinning.jl to pin your Julia threads to CPU cores to obtain stable and optimal performance. The package can also be used to visualize where the Julia threads are running on your system (see threadinfo()
).
Advanced: Some widely used parallel programming packages like LoopVectorization.jl (which also powers Octavian.jl) or ThreadsX.jl are no longer maintained.
Julia's multiprocessing and distributed relies on the standard library Distributed
.
The main difference with multi-threading is that data isn't shared between worker processes.
Once Julia is started, processes can be added with addprocs
, and their can be queried with nworkers
.
The macro Distributed.@distributed
is a syntactic equivalent for Threads.@threads
.
Hence, we can use @distributed
to parallelise a for loop as before, but we have to additionally deal with sharing and recombining the results
array.
We can delegate this responsibility to the standard library SharedArrays
.
However, in order for all workers to know about a function or module, we have to load it @everywhere
:
using Distributed
# Add additional workers then load code on the workers
addprocs(3)
@everywhere using SharedArrays
@everywhere f(x) = 3x^2
results = SharedArray{Int}(4)
@sync @distributed for i in 1:4
results[i] = f(i)
end
Note that @distributed
does not force the main process to wait for other workers, so we must use @sync
to block execution until all computations are done.
One feature @distributed
has over @threads
is the possibility to specify a reduction function (an associative binary operator) which combines the results of each worker.
In this case @sync
is implied, as the reduction cannot happen unless all of the workers have finished.
@distributed (+) for i in 1:4
i^2
end
Alternately, the convenience function pmap
can be used to easily parallelise a map
, both in a distributed and multi-threaded way.
results = pmap(f, 1:100; distributed=true, batch_size=25, on_error=ex->0)
For more functionalities related to higher-order functions, Transducers.jl and Folds.jl are the way to go.
Advanced: MPI.jl implements the Message Passing Interface standard, which is heavily used in high-performance computing beyond Julia.
The C library that MPI.jl wraps is highly optimized, so Julia code that needs to be scaled up to a large number of cores, such as an HPC cluster, will typically run faster with MPI than with plain Distributed
.
Elemental.jl is a package for distributed dense and sparse linear algebra which wraps the Elemental library written in C++, itself using MPI under the hood.
GPUs are specialised in executing instructions in parallel over a large number of threads. While they were originally designed for accelerating graphics rendering, more recently they have been used to train and evaluate machine learning models.
Julia's GPU ecosystem is managed by the JuliaGPU organisation, which provides individual packages for directly working with each GPU vendor's instruction set. The most popular one is CUDA.jl, which also simplifies installation of CUDA drivers for NVIDIA GPUs. Through KernelAbstractions.jl, you can easily write code that is agnostic to the type of GPU where it will run.
In the Single Instruction, Multiple Data paradigm, several processing units perform the same instruction at the same time, differing only in their inputs. The range of operations that can be parallelised (or "vectorized") like this is more limited than in the previous sections, and slightly harder to control. Julia can automatically vectorize repeated numerical operations (such as those found in loops) provided a few conditions are met:
While this may seem straightforward, there are a number of important caveats which prevent code from being vectorized.
Performance annotations like @simd
or @inbounds
help enable vectorization in some cases, as does replacing control flow with ifelse
.
If this isn't enough, SIMD.jl allows users to force the use of SIMD instructions and bypass the check for whether this is possible.
One particular use-case for this is for vectorising non-contiguous memory reads and writes through SIMD.vgather
and SIMD.vscatter
respectively.
Advanced: You can detect whether the optimizations have occurred by inspecting the output of @code_llvm
or @code_native
and looking for vectorised registers, types, instructions.
Note that the exact things you're looking for will vary between code and CPU instruction set, an example of what to look for can be seen in this blog post by Kristoffer Carlsson.
TLDR: Be aware that StaticArrays.jl exist and learn how they work.
Using an efficient data structure is a tried and true way of improving the performance. While users can write their own efficient implementations through officially documented interfaces, a number of packages containing common use cases are more tightly integrated into the Julia ecosystem.
Using StaticArrays.jl, you can construct arrays which contain size information in their type.
Through multiple dispatch, statically sized arrays give rise to specialised, efficient methods for certain algorithms like linear algebra.
In addition, the SArray
, SMatrix
and SVector
types are immutable, so the array does not need to be garbage collected as it can be stack-allocated.
Creating a new SArray
s comes at almost no extra cost, compared to directly editing the data of a mutable object.
With MArray
, MMatrix
, and MVector
, data remains mutable as in normal arrays.
To handle mutable and immutable data structures with the same syntax, you can use Accessors.jl:
using StaticArrays, Accessors
sx = SA[1, 2, 3] # SA constructs an SArray
@set sx[1] = 3 # Returns a copy, does not update the variable
@reset sx[1] = 4 # Replaces the original
All but the most obscure data structures can be found in the packages from the JuliaCollections organization, especially DataStructures.jl which has all the standards from the computer science courses (stacks, queues, heaps, trees and the like). Iteration and memoization utilities are also provided by packages like IterTools.jl and Memoize.jl.