8 minute read

The Arrow C Data Interface is an amazing tool, and while it documents its own potential use cases I wanted to dedicate a blog post to my personal experience using it.

Problem Statement

Transferring data across systems and libraries is difficult and time-consuming. This statement applies not only to compute time but perhaps more importantly to developer time as well.

I first ran into this issue over 5 years ago when I started a library called pantab. At the time, I had just become a core developer of pandas, and through consulting work had been dealing a lot with Tableau. Tableau had just released their Hyper API, which is a way to exchange data to/from their proprietary Hyper database.

Great…, I said to myself, I know a lot of pandas internals and I think writing a DataFrame to a Hyper database will be easier than any other option. Hence, pantab was created.

As you may or may not already be aware, most high-performance Python libraries in the analytics space get their performance from implementing parts of their code base in lower-level languages like C/C++/Rust. So with pantab I set out to do the same thing.

The problem, however, is that pandas did NOT expose any of its internal data structures to other libraries. pantab was forced to hack a lot of things to make this integration “work”, but in a way that was very fragile across pandas releases.

Late in 2023 I decided that pantab was due for a rewrite. Hacking into the pandas internals was not going to work any more, especially as the number of data types that pandas supported started to grow. What pantab needed was an agreement with a library like pandas as to how to exchange low-level data at an extremely high level of performance.

Fortunately, I wasn’t the only person with that idea. Data interchange libraries that weren’t even a thought when pantab started were now a reality, so it was time to test those out.

Status Quo

pantab initially used pandas.DataFrame.itertuples to loop over every row and every element within a DataFrame before writing it out to a Hyper file. While this worked and was faster than what most users would write by hand, it still really wasn’t that fast.

Here is a high level overview of that process, with heavy Python runtime interactions highlighted in red:

G rawdata df.itertuples() forloop Python for loop rawdata->forloop df df df->rawdata convert PyObject -> primitive forloop->convert write Database write convert->write

A later version of pantab which required a minimum of pandas 1.3 ended up hacking into the internals of pandas, calling something like df._mgr.column_arrays to get a NumPy array for each column in the DataFrame. Combined with the NumPy Array Iterator API, pantab could iterate over raw NumPy arrays instead of doing a loop in Python.

G rawdata df._mgr.column_arrays forloop NumPy Array Iterator API rawdata->forloop df df df->rawdata string Is string? forloop->string convert PyObject -> primitive string->convert yes write Database write string->write no convert->write

This helped a lot with performance, and while the NumPy Array Iterator API was solid, the pandas internals would change across releases, so it took a lot of developer time to maintain.

The images and comments above assume we are writing a DataFrame to a Hyper file. Going the other way around, pantab would create a Python list of PyObjects and convert to more appropriate data types after everything was read. If we were to graph that process, it would be even more red - not good!

Initial Redesign Attempt - Python DataFrame Interchange Protocol

Before I ever considered the Arrow C Data Interface, my first try at getting high performance and easy data exchange from pandas to Hyper was through the Python DataFrame interchange protocol. While initially promising, this soon became problematic.

For starters, Memory ownership and lifetime is listed as something in scope of the protocol, but is not actually defined. Implementers are free to choose how long a particular buffer should last, and it is up the client to just know this. After many unexpected segfaults, I started to grow weary of this solution.

Another major issue for the interchange protocol is that Non-Python API standardization (e.g., C/C++ APIs) is explicitly a non-goal. With pantab being a consumer of raw data, this meant I had to know how to manage those raw buffers for every type I wished to consume. While that may not be a huge deal for simple primitive types like sized integers, it leaves much to be desired when you try to work with more complex types like decimals.

Next topic - nullability! Here is the enumeration the protocol specified:

class ColumnNullType(enum.IntEnum):
    Integer enum for null type representation.

    NON_NULLABLE : int
        Non-nullable column.
    USE_NAN : int
        Use explicit float NaN value.
    USE_SENTINEL : int
        Sentinel value besides NaN.
    USE_BITMASK : int
        The bit is set/unset representing a null on a certain position.
    USE_BYTEMASK : int
        The byte is set/unset representing a null on a certain position.

    USE_NAN = 1

The way the DataFrame Interchange Protocol decided to handle nullability is an area where trying to be inclusive of many different strategies ended up as a detriment to all. Requiring developers to integrate all of these methods across any type they may consume is a lot of effort (particularly for USE_SENTINEL).

Another limitation with the DataFrame Interchange Protocol is the fact that it only talks about how to consume data, but offers no guidance on how to produce it. If starting from your extension, you have no tools or library to manually build buffers. Much like the status quo, this meant reading from a Hyper database to a pandas DataFrame would likely be going through Python objects.

Finally, and related to all of the issues above, the pandas implementation of the DataFrame Interchange Protocol left a lot to be desired. While started with good intentions, it never got the attention needed to make it really effective. I already mentioned the lifetime issues across various data types, but nullability handling was all over the place across types. Metadata was often passed along incorrectly from pandas down through the interface…essentially making it a very high effort for consumers to try and use it.

Arrow C Data Interface to the Rescue

After stumbling around the DataFrame Protocol Interface for a few weeks, Joris Van den Bossche asked me why I didn’t look at the Arrow C Data Interface. The answer of course was that I was just not very familiar with it. Joris knows a ton about pandas and Arrow, so I figured it best to take his word for it and try it out.

Almost immediately my issues went away. To wit:

  1. Memory ownership and lifetime - well defined at low levels
  2. Non-Python API - for this there is nanoarrow
  3. Nullability handling - uses Arrow bitmasks
  4. Producing buffers - can create (not just read) data
  5. pandas implementation - it just works via PyCapsules

With well defined memory semantics, a low-level API and clean nullability handling, the amount of extension code I had to write was drastically reduced. I felt more confident in the implementation and had to deal with less memory corruption / crashes than before. And, perhaps most importantly, I saved a lot of time.

See the image below for a high level overview of the process. Note the lack of any red compared to the status quo - this has a very limited interaction with the Python runtime:

G rawdata df.__arrow_c_stream__() forloop Arrow C API / nanoarrow rawdata->forloop df df df->rawdata write Database write forloop->write

Without going too deep in the benchmarks game, the Arrow C Data Interface implementation yielded a 25% performance improvement for me when writing strings. When reading data, it was more like a 500% improvement than what had been previously implemented. Not bad…

My code is no longer tied to the potentially fragile internals of pandas, and with the stability of the Arrow C Data Interface things are far less likely to break when new versions are released.

Bonus Feature - Bring Your Own Library

While it wasn’t my goal at the outset, implementing the Arrow C Data Interface had the benefit of decoupling a dependency on pandas. pandas was the de facto library when pantab was first written, but since then many high quality Arrow-based libraries have popped up.

With the Arrow C Data Interface, pantab now has a bring your own DataFrame library mentality.

>>> import pantab as pt
>>> import pantab as pd
>>> df = pd.DataFrame({"col": [1, 2, 3]})
>>> pt.frame_to_hyper(df, "example.hyper", table="test")

>>> import polars as pl
>>> df = pl.DataFrame({"col": [1, 2, 3]})
>>> pt.frame_to_hyper(df, "example.hyper", table="test")

>>> import pyarrow as pa
>>> tbl = pa.Table.from_pydict({"col": [1, 2, 3]})
>>> pt.frame_to_hyper(tbl, "example.hyper", table="test")

These all produce the same results, and as the author of pantab I did not have to do anything extra to accommodate the various libraries - everything just works.

Closing Thoughts

The Arrow specification is simply put…awesome. While initiatives like the Python DataFrame Protocol have tried to solve the issue of interchange, I don’t believe that goal was ever achieved…until now. The Arrow C Data Interface is the tool developers have always needed to make analytics integrations easy.

pantab is not the first library to take advantage of these features. The Arrow ADBC drivers I previously blogged about are also huge users of nanoarrow / the Arrow C Data Interface, and heavily influenced the design of pantab. The Powered By Apache Arrow project page is the best resource to find others as they get developed in the future.

I, for one, am excited to see Arrow-based tooling grow and make open-source data integrations more powerful than ever before.