Posted on 19th May 2017
I have finally gotten around to playing with the HD5 driver for pandas
(which uses, I believe, pytables
under
the hood). I'm only scratching the surface, but it's easy to do what I want:
We obviously cannot do this in memory. But if we have some way of generating one row at a time, or a small "chunk" of rows at a time, then we can "append" these iteratively to a HD5 store:
store = pd.HDFStore("test.hd5", "w", complevel=9, complib="bzip2", fletcher32=True)
# Generate a data frame as `frame`
store.append("main", frame, data_columns=True)
# Repeat as necessary
store.close()
This creates a new HD5 file, and then creates a table in it named "main". We can call store.append()
repeatedly to add lots of rows. The data_columns=True
is necessary if we wish to query by column (which we do).
We can then iterate over the whole dataframe in "chunks" of rows:
store = pd.HDFStore("test.hd5", "r")
for df in store.select("main", chunksize = 1000):
# Do something with `df` which contains the next 1000 rows
Alternatively, we can use the power querying ability. Suppose we have a column named "one" in the large dataframe, and we just want the rows where the value of "one" is less then 100. Then we can use:
store = pd.HDFStore("test.hd5", "r")
df = store.select("main", where="one < 100")
This seems to be wonderfully fast.
You cannot store "objects" in a table, so e.g. storing a GeoPandas
data frame is impossible (or extremely hard).