# Pandas, HD5, and large data sets

I have finally gotten around to playing with the HD5 driver for pandas (which uses, I believe, pytables under the hood). I’m only scratching the surface, but it’s easy to do what I want:

• Create a huge data frame storing an entire data set
• Efficiently query subsections of the frame

### Create the dataframe

We obviously cannot do this in memory. But if we have some way of generating one row at a time, or a small “chunk” of rows at a time, then we can “append” these iteratively to a HD5 store:

 store = pd.HDFStore("test.hd5", "w", complevel=9, complib="bzip2", fletcher32=True)
# Generate a data frame as frame
store.append("main", frame, data_columns=True)
# Repeat as necessary
store.close()


This creates a new HD5 file, and then creates a table in it named “main”. We can call store.append() repeatedly to add lots of rows. The data_columns=True is necessary if we wish to query by column (which we do).

We can then iterate over the whole dataframe in “chunks” of rows:

store = pd.HDFStore("test.hd5", "r")
for df in store.select("main", chunksize = 1000):
# Do something with df which contains the next 1000 rows


Alternatively, we can use the power querying ability. Suppose we have a column named “one” in the large dataframe, and we just want the rows where the value of “one” is less then 100. Then we can use:

store = pd.HDFStore("test.hd5", "r")
df = store.select("main", where="one < 100")


This seems to be wonderfully fast.

### Downsides

You cannot store “objects” in a table, so e.g. storing a GeoPandas data frame is impossible (or extremely hard).