Code, Community, Future — A year in review for Numba (2021)

esc · Feb 18, 2022 · 10 min read

Looking back at 2021, the Numba project continued to deliver! We published a significant number of releases, again delivering numerous features, performance improvements and bug-fixes. In this post I would like to highlight some of the most important achievements, both code- and community-wise, and also take a brief look at what 2022 may bring.

Code

First, let's look at the Numba releases of 2021:

- 0.53.0 (11th of March) and the bug-fix release 0.53.1 that quickly followed
- 0.54.0 (19th of August) and the bug-fix release 0.54.1 (7th of October)
- 0.55.0 RC1 (21st of December)

Let's take a quick look at the same code metrics that were also calculated as part of the 2020 review: the total count of commits, the total count of merge commits and the total number of unique contributors:

# Total commits in 2021
zsh» git log --all --after="01/01/2021 00:00" --before="31/12/2021 23:59" --oneline | wc -l
2671
# Total merge commits in 2021
zsh» git log --merges --after="01/01/2021 00:00" --before="31/12/2021 23:59" --oneline | wc -l
482
# Total number of unique contributors in 2021
zsh» git shortlog --after="01/01/2021 00:00" --before="31/12/2021 23:59" -sn | wc -l
66

Comparing this to the numbers of 2020 (4905 total commits, 678 merge commits and 105 unique contributors) makes it obvious that 2021 was, at least according to these metrics, less busy than 2020. Since these are only one aspect of Numba's health, and everything depends on how things are measured, we resist the temptation to claim that 2021 was less successful for Numba than 2020; it was simply different!

2021 began with version 0.53.0 on the 11th of March. It included support for Python 3.9, which itself had been released on the 5th of October 2020; this means that it took around six months to land this support. Other notable features of this release include an implementation of function sub-typing, initial support for dynamic gufuncs and support for Fortran-ordered arrays in the parallel accelerator. Additionally, two new features were sponsored by Intel: a method to expose LLVM compilation pass timings and an event system for broadcasting compiler events.

Furthermore, the CUDA target saw a significant set of changes, including CUDA 11.2 support, support for version 3 of the CUDA array interface and initial support for Cooperative Groups, along with implementations of a fast cube root and of the functions xor, math.log2 and math.remainder. Also, tuples can now be passed to GPU kernels. From a structural point of view, progress has been made on unifying the dispatcher architecture across targets: the CUDA dispatcher now shares infrastructure with the CPU dispatcher, improving launch times for lazily compiled kernels. If you would like to discover more details, we encourage you to examine the multiple demo notebooks that accompany this release: one for the general changes, one for the CUDA target improvements and one for the new profiling features. Links to the MyBinder service for all release demo notebooks are included above. Lastly, version 0.53.0 was quickly followed by 0.53.1, which mainly fixed two critical regressions.
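To give a flavour of one of these features, below is a minimal sketch of a dynamic gufunc (the row_sum function is illustrative, not taken from the release notes): as of 0.53 the @guvectorize decorator can be used with only a symbolic layout string, deferring compilation until argument types are seen.

import numpy as np
from numba import guvectorize

# A dynamic gufunc: only the layout is declared, no type signatures;
# a concrete version is compiled lazily for the types of the first call.
@guvectorize("(n)->()")
def row_sum(row, out):
    acc = 0.0
    for x in row:
        acc += x
    out[0] = acc

a = np.arange(12.0).reshape(3, 4)
print(row_sum(a))  # one sum per row: [ 6. 22. 38.]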

The second big release of the year was 0.54.0, which landed on the 19th of August. It included a whole host of new features and improvements across the various parts of Numba. Among the Python language features, Numba now supports f-strings, dict comprehensions and, finally, the sum built-in function! On the NumPy front, we now support a number of new functions such as np.clip, np.real and np.swapaxes. Additionally, np.argmax has gained support for the axis keyword. Internally, debugging support through DWARF has been fixed and enhanced, and Numba now optimises the way in which locals are emitted to help reduce time spent in the LLVM SROA pass. Intel kindly sponsored the following new features to enable additional targets: dispatcher re-targeting via a user-defined context manager, support for custom NumPy array subclasses and an inheritance-based model for targets that permits sharing of @overload implementations. Additionally, per-function compiler flags were implemented and the extension API now has support for overloading class methods via the @overload_classmethod decorator.

The AMD ROC GPU target (ROCm) has been set to "unmaintained" and a repository stub was created to hold the code until it is resurrected and becomes maintained again. Furthermore, there were some significant version changes: we now support LLVM 11 on all platforms via llvmlite, the minimum supported Python version was raised to 3.7, the minimum supported NumPy version is now 1.17, we vendored cloudpickle at version 1.6.0 and the TBB requirement was changed to >=2021. Lastly, this version is the first to support a richer changelog: all listed pull-requests and authors are now hyperlinks pointing to the respective URIs on GitHub. Again, for more details and examples, please do consult the release demo notebooks. For 0.54.0, Intel kindly contributed a notebook to illustrate their DPPY library, a Numba extension that enables writing data-parallel programs which can be offloaded to various Intel architectures. Finally, we released 0.54.1 on the 7th of October, which fixed some regressions introduced with 0.54.0.
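To illustrate a few of these additions, here is a minimal sketch (the demo function is hypothetical, not from the release notes): dict comprehensions, the sum built-in and np.clip all compile under @njit as of 0.54.

import numpy as np
from numba import njit

@njit
def demo(n, arr):
    squares = {i: i * i for i in range(n)}  # dict comprehension
    total = sum(arr)                        # the sum built-in
    clipped = np.clip(arr, 0.0, 1.0)        # np.clip
    return squares, total, clipped

print(demo(4, np.array([0.25, 1.5, -0.5])))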

Before we continue, I would like to mention 0.55.0 RC1, which was released on the 21st of December 2021. It was the first release candidate to support Python 3.10, which itself had been released on the 4th of October 2021. As with Python 3.9, there was a delay in supporting 3.10. In fact, the initial plan was to release 0.55 by the end of 2021, but there were so many unexpected issues that this original deadline was no longer feasible. The Numba team hoped to get 0.55.0 released in early 2022 (which did indeed happen). In any case, Python 3.9 had delayed the 0.53 release and 3.10 had delayed the 0.55 release, so it is safe to assume that these two new Python versions (more precisely, the work required to support them) did indeed limit the number of Numba releases in 2021.

Why do new Python versions cause trouble for Numba?

As outlined above, both Python 3.9 and 3.10 caused significant friction for the Numba project and did require a substantial effort and non-trivial changes to support. Why is that?

In order to answer this question, a basic understanding of how Numba interfaces with Python and the interpreter is needed. Numba, as a compiler, must begin with some representation of the original, interpreted Python program and then transform this through a series of steps into a binary program. There are a few potential representations that a compiler could interface with: the source code itself, the abstract syntax tree (AST) representation or the resulting Python bytecode are all valid options. For historical and convenience reasons, Numba interfaces through the bytecode (for example, the source code may not even be available anymore in certain environments). The astute reader will immediately spot the issue with this: the Python bytecode is not considered a stable interface by the CPython community and is instead treated as an implementation detail! New bytecodes are introduced with every minor version upgrade, and the way bytecode is handled internally by the interpreter may change too. So, for example, a function may generate a very different bytecode sequence on Python 3.9 and 3.10, while of course that function's correctness remains intact, with a hopefully improved runtime. This means that implementing support for a new minor Python version can incur a difficult-to-predict number of required changes, each of difficult-to-predict complexity. The only thing to rely on here is expert knowledge. In the case of 3.9, Numba support landed about six months after the Python release. For 3.10 we need to look ahead into 2022: 0.55.0 was in fact released in mid-January, which takes the total delay to less than 3.5 months. We do have some ideas on how to tackle this better, which are outlined in the section "Future".
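You can observe this instability directly with the standard library's dis module; any simple function will do:

import dis
import sys

def add_one(x):
    return x + 1

# The bytecode emitted for identical source changes between CPython
# versions: 3.9 and 3.10 compile the addition to a BINARY_ADD
# instruction, whereas 3.11 replaces it with the generic BINARY_OP.
print(sys.version)
dis.dis(add_one)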

Community

In 2021 the Numba Twitter account reached the significant milestone of 5000 followers. We celebrated this success on Twitter by asking the community to reach out and present their Numba-"infused" projects. The tweet was quite successful and a number of projects and maintainers answered. There were many great libraries and projects mentioned, some old and rather well-known, some bleeding-edge and brand-new. The tweet itself is at:

In the following, we will highlight some of the team's favorites among the responses and showcase some of the amazing work being done with Numba.

Someone from the pytreegrav project responded at:

It is a package for computing gravitational potentials. Its README has the tagline: "For the Barnes-Hut method we implement an oct-tree as a Numba jitclass to achieve much higher performance than the equivalent pure Python implementation, without writing a single line of C or Cython". Effectively, this is an approximation algorithm for the n-body problem. Even the Wikipedia article states that "Some of the most demanding high-performance computing projects do computational astrophysics using the Barnes–Hut treecode algorithm". So, we are excited to see a fast Numba implementation of this important computational approach. They also shared a stunning visualization of what may very well be a globular cluster, which was rendered with a Numba-accelerated back-end (presumably datashader?):
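For readers unfamiliar with jitclass: the decorator compiles a class's methods and stores its attributes as native data. Below is a minimal sketch of the idea using a hypothetical point-mass container (not pytreegrav's actual octree):

import numpy as np
from numba import float64
from numba.experimental import jitclass

# Attribute types are declared up front so Numba can use a native layout.
spec = [
    ("pos", float64[:]),  # 3-component position
    ("mass", float64),
]

@jitclass(spec)
class PointMass:
    def __init__(self, pos, mass):
        self.pos = pos
        self.mass = mass

    def potential_at(self, point):
        # Newtonian potential -m/r (with G folded into the mass for brevity)
        r = np.sqrt(np.sum((self.pos - point) ** 2))
        return -self.mass / r

p = PointMass(np.zeros(3), 5.0)
print(p.potential_at(np.array([3.0, 4.0, 0.0])))  # -> -1.0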

A member of the Stumpy project also responded:

Stumpy is a time series analysis library that makes extensive use of the numba.cuda package, implementing novel time-series analysis algorithms (matrix profiles). Additionally, they are happy with how Numba keeps dependencies to a minimum. One thing that excites us about Stumpy is their extensive benchmarking suite and the comprehensiveness of the reported results.
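As a minimal sketch of what a matrix profile computation looks like (using stumpy.stump, the library's documented entry point, on synthetic data):

import numpy as np
import stumpy

# The matrix profile: for every subsequence of length m, the distance to
# its nearest neighbour elsewhere in the series. Low values flag motifs.
rng = np.random.default_rng(42)
ts = rng.random(1000)
m = 50                    # subsequence window size
mp = stumpy.stump(ts, m)  # column 0 holds the profile distances
print(mp[:5, 0])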

Lastly, thanks to the following tweet:

… we became aware of the nodevectors package. There is a section in the README that answers the question: "Why is it so fast?" The answer: "We leverage CSRGraphs for most algorithms. This uses CSR graph representations and a lot of Numba JIT'ed procedures." The package implements a variety of fast/scalable node embedding algorithms. Furthermore, their README cites a nice example of embedding a really large graph, in this case the full English Wikipedia link graph, which has approximately 6 million nodes.
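Here is a minimal sketch of the package's sklearn-style usage, assuming the Node2Vec class and fit/predict methods shown in its README (the n_components keyword is our assumption for the embedding-dimension parameter):

import networkx as nx
from nodevectors import Node2Vec

# Embed a small random graph; nodevectors accepts networkx graphs directly.
G = nx.fast_gnp_random_graph(1000, 0.01, seed=0)
model = Node2Vec(n_components=32)  # embedding dimensionality (assumed name)
model.fit(G)
print(model.predict(0))  # the learned vector for node 0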

NVIDIA continues to make greater use of Numba in cuDF (a GPU-accelerated pandas equivalent, part of the RAPIDS AI and data science stack) to support user-defined functions applied to its GPU dataframes; support for masked/nullable data was added in 2021, on top of additions to Numba's CUDA extension APIs.

Also on the commercial front, bodo.ai are pushing the limits of parallel compute at scale with their HPC-style framework. They use Numba under the hood to build something a little like a pandas compiler, i.e. a unique and modern approach that uses compiler technology to accelerate existing ETL, data analytics and business intelligence workloads. They recently provided a comprehensive performance benchmark comparing the Bodo platform to well-established players such as Spark and Dask. And, even more excitingly, they recently announced a partnership with the venerable Snowflake, bringing ultra-fast pandas-style querying to the Snowflake platform and thus enabling luxuriously comfortable analysis of data stored in Snowflake at unprecedented speed and scale.

Future

One of the most pressing issues for Numba in 2022 will be support for Python 3.11. Yes, PEP 602 introduced an annual release cycle for Python. Also, from out-of-band conversations with some of the CPython developers, we have anecdotal information that the bytecode will be overhauled and changed even more significantly. Luckily, many alpha and beta releases will become available before the final release on the 3rd of October 2022; for example, alpha 4 was released on the 14th of January 2022. So, the motto for this will be: "start early". With around seven months left, there is still a good chance to decrease the delay even more. An artifact from the work on 3.10 is a small framework known internally as "numba-hatchery", which enables a full-stack Numba build (LLVM, llvmlite, Numba) in a PyPA-provided Docker container. This will serve as the basis for developing the 3.11 support. Hopefully, we can also integrate this with some sort of CI system to track the alphas and betas throughout the year, so we can at least gauge how many tests are broken. That should ease some of the pain.

In addition to that, there are plans to release llvmlite and Numba with support for the Apple M1 architecture; many of our users have been asking about this for some time. Since at least some of our developers now have the appropriate hardware, this is now in scope. Also, since LLVM 12 and 13 have been released, this year will probably see an upgrade there, potentially adding code to enable multiple LLVM versions to be used with less trouble. Additionally, the long-awaited switch of the JIT engine from MCJIT to OrcJIT v2 may very well materialize this year. MCJIT has been deprecated for some time, and even OrcJIT v1 has been deprecated, so the upgrade is much needed.

Plans for the CUDA target are to make it much easier to extend, with support for the high-level extension API — this will enable a lot more core Numba functionality to be used with the CUDA target, including some NumPy functions. Greater support for the float16 data type is under way, and on-disk caching of @cuda.jit functions is also targeted for this year.

A big thank you to my esteemed colleagues: Siu Kwan Lam for critical feedback on this article, and Graham Markall for providing feedback and for contributing the CUDA target plans!


esc

Senior Software Engineer @ Anaconda Inc., working on Numba. Views expressed here are my own and not necessarily those of my employer.