Pandas, HDF5, and large data sets

Posted on 19th May 2017


I have finally gotten around to playing with the HDF5 driver for pandas (which uses PyTables under the hood). I'm only scratching the surface, but it's easy to do what I want:

  • Create a huge data frame storing an entire data set
  • Efficiently query subsections of the frame

Create the dataframe

We obviously cannot do this in memory. But if we have some way of generating one row at a time, or a small "chunk" of rows at a time, then we can append these iteratively to an HDF5 store:

import pandas as pd

store = pd.HDFStore("test.hd5", "w", complevel=9, complib="bzip2", fletcher32=True)
# Generate a data frame as `frame`
store.append("main", frame, data_columns=True)
# Repeat as necessary
store.close()

This creates a new HDF5 file, and then creates a table in it named "main". We can call store.append() repeatedly to add lots of rows. Passing data_columns=True is necessary if we wish to query by column (which we do).
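To make this concrete, here is a minimal sketch of the whole write loop. The column names ("one" and "two"), the chunk size, and the random data are all invented for illustration; in practice each chunk would come from a file, a database cursor, or a generator:

import numpy as np
import pandas as pd

store = pd.HDFStore("test.hd5", "w", complevel=9, complib="bzip2", fletcher32=True)
for _ in range(100):
    # In reality, read or generate the next chunk of rows here
    frame = pd.DataFrame(np.random.randn(10000, 2), columns=["one", "two"])
    store.append("main", frame, data_columns=True)
print(store.get_storer("main").nrows)  # 1000000 rows in total
store.close()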

Read back the data

We can then iterate over the whole dataframe in "chunks" of rows:

store = pd.HDFStore("test.hd5", "r")
for df in store.select("main", chunksize=1000):
    # Do something with `df`, which contains the next 1000 rows
    print(df.shape)
store.close()

Alternatively, we can use the more powerful querying ability. Suppose we have a column named "one" in the large dataframe, and we just want the rows where the value of "one" is less than 100. Then we can use:

store = pd.HDFStore("test.hd5", "r")
df = store.select("main", where="one < 100")
store.close()

This seems to be wonderfully fast: as I understand it, the query is evaluated inside the file by PyTables, so only the matching rows are ever read into memory.
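Two variations worth noting (again a sketch, using the invented "one" and "two" columns from above): the where string can combine conditions, and select() takes a columns argument to restrict which columns are read back:

store = pd.HDFStore("test.hd5", "r")
# Compound condition; both columns must have been stored as data columns
df = store.select("main", where="one < 100 & two > 0")
# Fetch only the "one" column for the matching rows
ones = store.select("main", where="one < 100", columns=["one"])
store.close()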

Downsides

You cannot store arbitrary Python "objects" in a table, so e.g. storing a GeoPandas data frame (whose geometry column contains Shapely objects) is impossible (or extremely hard).
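For illustration, here is a small sketch (the frame here is invented) of what goes wrong: appending a column of non-string Python objects fails, as far as I can tell with a TypeError complaining that the column cannot be serialised. One possible workaround for geometries would be to serialise them to WKT strings before storing, and parse them again on the way out:

import pandas as pd

store = pd.HDFStore("objects.hd5", "w")
frame = pd.DataFrame({"geom": [[0, 0], [1, 1]]})  # arbitrary Python objects
try:
    store.append("bad", frame, data_columns=True)
except TypeError as exc:
    # pandas refuses to serialise non-string object columns into a table
    print(exc)
store.close()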
