I am working with 4-50 GB HDF5 files. One file has an HDF5 group ("outer_group") with several nested subgroups. The innermost subgroup ("sub_sub_group") holds a dataset ("dataset1"), which I want to read and convert to a Pandas data frame. "dataset1" has four columns ("A", "B", "C", and "D") with mixed data types (string, numeric, byte, etc.).
- test_file.h5
  - outer_group
    - sub_group
      - sub_sub_group
        - dataset1
This is how I am currently loading the data into a Pandas data frame:

    import h5py
    import pandas as pd

    with h5py.File("test_file.h5", "r") as f:
        # Read the whole compound dataset into memory, then build the frame
        values_df = pd.DataFrame.from_records(
            f["outer_group"]["sub_group"]["sub_sub_group"]["dataset1"][()],
            columns=["A", "B", "C", "D"],
        )
However, this takes several minutes to load. Does anyone know of a faster way to load the "dataset1" HDF5 dataset into a Pandas data frame when working with large files? I have looked into read_hdf and HDFStore, but I am not sure how to point them at a dataset nested this deeply (my attempt is sketched below). I also read here that my current approach (combining h5py with Pandas) might lead to issues.
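For reference, this is the kind of call I expected to work (a rough sketch; the key path is my guess, and from what I can tell read_hdf only handles files written by Pandas/PyTables in the first place, which may be why I could not get it working):

    import pandas as pd

    # My guess at addressing the nested dataset via its full path as the key.
    # NOTE: read_hdf expects a file written in the pandas/PyTables format,
    # so this may simply raise an error on a file produced by other tools.
    values_df = pd.read_hdf(
        "test_file.h5",
        key="/outer_group/sub_group/sub_sub_group/dataset1",
    )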
Any insight is greatly appreciated.
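Edit: in case it helps frame an answer, one idea I have been toying with is building the data frame column by column (a rough sketch, assuming "dataset1" has a compound dtype, so h5py's field-name indexing can read one field at a time):

    import h5py
    import pandas as pd

    with h5py.File("test_file.h5", "r") as f:
        ds = f["outer_group/sub_group/sub_sub_group/dataset1"]
        # Indexing a compound dataset with a field name reads just that
        # field, so each column arrives as a plain NumPy array instead of
        # going through the record-by-record path of from_records.
        values_df = pd.DataFrame({name: ds[name] for name in ("A", "B", "C", "D")})

I have not benchmarked this properly yet, so I do not know whether it actually helps at these file sizes.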