Introducing nanoarrow: the world's tiniest Arrow Implementation (SciPy 2024)

2024-07-11 15:50–16:20, Ballroom

nanoarrow, a newly developed subproject of Apache Arrow, is squarely focused on unlocking connectivity among Python packages and the libraries they wrap using the features and rich type support of the Arrow columnar format. The vision of nanoarrow is that it should be trivial for a library to implement an Arrow-based interface: nanoarrow and its bindings provide tools to produce, consume, and transport tabular data between processes using the Arrow IPC format or between libraries using the Arrow C ABI. For Python maintainers, this means less glue code, and glue code that runs faster, so that developers can focus on feature development.

Since its creation in 2016, Apache Arrow (https://arrow.apache.org/) has grown to include implementations in many languages and has become an important enabler for low-overhead connectivity among Python packages and the tools they bind in other languages like Rust, Go, C, and C++. Whereas many Arrow implementations expose a broad user-facing toolkit, nanoarrow (https://arrow.apache.org/nanoarrow) focuses on tools for developers, so that libraries face little to no barrier to implementing an Arrow-based interface that integrates seamlessly across the wider Python data ecosystem (e.g., via the Arrow PyCapsule Interface, https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html).
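
As a rough illustration of why the PyCapsule Interface lowers the barrier, a consumer can discover Arrow-capable objects by duck typing alone, without importing the producer's package. The sketch below uses only the standard library. The dunder method names come from the PyCapsule Interface specification, while the helper arrow_export_kinds and the FakeTable class are hypothetical names invented for this example. A real producer would return PyCapsule objects from these methods.

```python
def arrow_export_kinds(obj):
    """Return which Arrow PyCapsule Interface exports an object advertises.

    Hypothetical helper: it only checks for the protocol's dunder methods
    and never calls them.
    """
    protocol_methods = {
        "schema": "__arrow_c_schema__",
        "array": "__arrow_c_array__",
        "stream": "__arrow_c_stream__",
    }
    return [
        kind
        for kind, name in protocol_methods.items()
        if callable(getattr(obj, name, None))
    ]


class FakeTable:
    """Hypothetical producer used only to exercise the helper above."""

    def __arrow_c_schema__(self):
        # A real implementation returns a PyCapsule named "arrow_schema".
        raise NotImplementedError

    def __arrow_c_stream__(self, requested_schema=None):
        # A real implementation returns a PyCapsule named "arrow_array_stream".
        raise NotImplementedError


kinds = arrow_export_kinds(FakeTable())
not_arrow = arrow_export_kinds([1, 2, 3])
```

Here kinds lists the schema and stream exports, and a plain list advertises none. A consumer that finds one of these methods can call it and hand the resulting capsule to whatever Arrow implementation it uses.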

nanoarrow was built to be:

Small: nanoarrow’s C runtime compiles into a few hundred kilobytes and its R and Python bindings both have zero dependencies and an installed size of ~1 MB.
Easy to depend on: nanoarrow’s C library is distributed as two files (nanoarrow.c and nanoarrow.h) and its R and Python bindings have zero dependencies.
Useful: The Arrow Columnar Format includes a wide range of data type and data encoding options. To the greatest extent practicable, nanoarrow strives to support the entire specification to help producers and consumers take advantage of recent memory layout additions like run-end encoding (REE), StringView, and ListView.
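
To make one of these layouts concrete, here is a minimal pure-Python sketch of run-end encoding. It does not use nanoarrow; the function names ree_decode and ree_get are hypothetical. It assumes the layout described in the Arrow columnar format: a run_ends child holding the cumulative (exclusive) logical end of each run, and a values child holding one value per run.

```python
from bisect import bisect_right


def ree_decode(run_ends, values):
    """Expand a run-end encoded pair into the full logical sequence."""
    out = []
    start = 0
    for end, value in zip(run_ends, values):
        # Each run covers logical positions [start, end).
        out.extend([value] * (end - start))
        start = end
    return out


def ree_get(run_ends, values, i):
    """Random access: locate the first run whose end is greater than i."""
    length = run_ends[-1] if run_ends else 0
    if not 0 <= i < length:
        raise IndexError(i)
    return values[bisect_right(run_ends, i)]


# Three runs: "a" x 3, null x 2, "c" x 1.
# Nulls live in the values child; the run-end encoded parent has no validity bitmap.
run_ends = [3, 5, 6]
values = ["a", None, "c"]
decoded = ree_decode(run_ends, values)
```

Because run_ends is sorted, a single element can be looked up with a binary search instead of decoding the whole array, which is part of what makes the encoding attractive for long runs.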

The C library component of nanoarrow is already helping integrate multi-language tools in the wider SciPy ecosystem:

Arrow Database Connectivity (ADBC) https://arrow.apache.org/adbc drivers use nanoarrow to expose C-based driver libraries provided by vendors like SQLite and PostgreSQL as streams of Arrow data with minimal packaging overhead.
Snowflake’s Python connector https://docs.snowflake.com/en/developer-guide/python-connector/python-connector uses nanoarrow to provide accelerated access to database results with a package size amenable to deployment on AWS Lambda.
The nascent GeoArrow https://geoarrow.org/ ecosystem includes a C implementation of several high-performance geospatial array layouts that helps integrate tools written in C++, Rust, and Python.

Whereas the nanoarrow C library is relatively straightforward to integrate into a Python package that already contains compiled code, the recently released Python bindings to nanoarrow target Python developers who wish to avoid compiled code entirely. The bindings integrate with NumPy, the Python buffer protocol, the Python DataFrame protocol, and the Arrow PyCapsule Interface. Some opportunities for future use include:

Implementing high-performance data API servers or clients using the Arrow IPC streaming format
Facilitating connectivity among Python packages where nested data or data with null values are important concepts
Building file readers or writers with the ability to re-use glue code for bindings in multiple languages
Testing Arrow-based interfaces as the number of packages that provide them increases
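
To show what "data with null values" looks like at the buffer level, the following standard-library sketch packs a nullable int32 column into the two buffers the Arrow columnar format uses for primitive arrays: a validity bitmap (one bit per element, least-significant bit first, 1 meaning valid) and a little-endian data buffer. It is an illustration under simplifying assumptions, not nanoarrow's implementation: the function names are hypothetical and the buffer padding recommended by the format is omitted.

```python
import struct


def pack_int32_array(items):
    """Return (validity_bitmap, data_buffer, null_count) for ints and None."""
    validity = bytearray((len(items) + 7) // 8)
    data = bytearray()
    null_count = 0
    for i, item in enumerate(items):
        if item is None:
            null_count += 1
            # The value in a null slot is unspecified; 0 is used here.
            data += struct.pack("<i", 0)
        else:
            # Set bit i, counting from the least-significant bit of each byte.
            validity[i // 8] |= 1 << (i % 8)
            data += struct.pack("<i", item)
    return bytes(validity), bytes(data), null_count


def unpack_int32_array(validity, data, length):
    """Rebuild the Python list from the two buffers."""
    out = []
    for i in range(length):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        value = struct.unpack_from("<i", data, 4 * i)[0]
        out.append(value if is_valid else None)
    return out


items = [1, None, 3, None, None, 6, 7, 8, 9]
validity, data, null_count = pack_int32_array(items)
round_trip = unpack_int32_array(validity, data, len(items))
```

Since nulls are tracked in a separate bitmap, the data buffer stays a plain, fixed-width run of int32 values that other libraries can read without a sentinel value.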

Dewey Dunnington

Dewey Dunnington (Ph.D., P.Geo.) is a software engineer and geoscientist based in Nova Scotia, Canada. As a software engineer he works on all things Apache Arrow at Voltron Data, Inc., including standards for geospatial data connectivity, R bindings for Apache Arrow, and Arrow Database Connectivity (ADBC). As a geoscientist, he has worked in contaminated site remediation, taught Applied Geomorphology at Acadia University, and has authored more than a dozen articles on lake water and sediment geochemistry. Dewey is an Apache Arrow Project Management Committee member, an RStudio-certified tidyverse instructor, an NSERC Postgraduate Scholarship (Doctoral) recipient, and maintainer of dozens of R, Python, C, and C++ libraries at the intersection of geoscience, geospatial data, and enterprise data connectivity.
