H5py - A Pythonic interface to access HDF5 files

Recently, I updated our meshfree partitioner to support writing files in the Hierarchical Data Format version 5 (HDF5). At the time of writing this post, our meshfree partitioner was handling more than six different file formats.

With this increase in complexity, we needed a standard file format that many solvers could use.

The solution? Standardize the output format using HDF5.

We chose HDF5 since a significant portion of the high-performance computing (HPC) and computational fluid dynamics (CFD) communities prefer it.

While Kumar has already implemented it in the Fortran 90 solver, I’ll be talking about the Pythonic version of the same.

Requirements

  1. Python 3.x (I’m using 3.8.5)
  2. h5py (the Python wrapper library for HDF5)

The File Structure

The example below is from a grid file containing 6415 points. You may download the h5 file from here.

To view the file, you can use the h5dump command-line tool to print its contents to the console.

HDF5 "point/point.h5" {
GROUP "/" {
   GROUP "1" {
      ATTRIBUTE "ghost" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): 69
         }
      }
      ATTRIBUTE "local" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): 3212
         }
      }
      ATTRIBUTE "total" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): 3281
         }
      }
      DATASET "ghost" {
         DATATYPE  H5T_IEEE_F64LE
         DATASPACE  SIMPLE { ( 69, 4 ) / ( 69, 4 ) }
         DATA {
         (0,0): 3271, 0.558617, -0.058334, 0.0308573,
         (1,0): 3272, 0.676293, 0.0569743, 0.020337
         ...
         (67,0): 6412, 1.81739, -0.694766, 0.0865301,
         (68,0): 6415, 1.76524, -0.77766, 0.0885843
        }
      }
      DATASET "local" {
         DATATYPE  H5T_IEEE_F64LE
         DATASPACE  SIMPLE { ( 3212, 30 ) / ( 3212, 30 ) }
         DATA {
         (0,0): 60, 0.524524, -0.0621129, 0, 1, 0.0343024, 3213, 2, 0, 0, 1,
         (0,11): 4, 3215, 196, 3213, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         (0,25): -1, -1, -1, -1, -1,
         (1,0): 61, 0.486533, -0.0657348, 0, 1, 0.0258145, 1, 3, 0, 0, 1, 4,
         (1,12): 190, 196, 1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         (1,27): -1, -1, -1,
         ...
         (3210,0): 6414, 14.1302, 0.00245159, 0, 1, 1.36979, 0, 0, 0, 1, 0,
         (3210,11): 6, 123, 3203, 3204, 3205, 168, 169, -1, -1, -1, -1, -1,
         (3210,23): -1, -1, -1, -1, -1, -1, -1,
         (3211,0): 6415, 13.1104, -5.90557, 0, 1, 1.17818, 0, 0, 0, 1, 0, 5,
         (3211,12): 3206, 165, 166, 3207, 3210, -1, -1, -1, -1, -1, -1, -1,
         (3211,24): -1, -1, -1, -1, -1, -1
         }
      }
   }
   GROUP "2" {
       ...
   }
}
}

How is the file structured?

An HDF5 file consists of groups, attributes, and datasets. Groups contain datasets and other groups, much like folders contain files. Similarly, attributes can be compared to the attributes that files and folders carry in a filesystem.

In our case, since we use METIS to decompose our point distribution, we have several partitioned grids. Each partitioned grid contains a set of points, and each point has a set of neighbor points known as its connectivity set. Points are identified by their index in the partitioned grid.

These points are further classified into local and ghost points. local points are those whose neighbor points all exist in the same partition, while ghost points have at least one neighbor point in another partition.

For any HDF5 file, there exists a / group, also known as the root group. This is similar to the file system structure on Linux-based systems. For example, if we have two groups group1 and group2, they will be accessed as /group1 and /group2.

Alternatively, if group2 is a subgroup of group1 we can access it as /group1/group2.
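This path convention can be sketched with h5py itself. The file and group names below are made up for illustration:

```python
import h5py

# Create a throwaway file with a nested group layout (hypothetical names)
with h5py.File("groups_demo.h5", "w") as f:
    f.create_group("group1/group2")  # intermediate groups are created as needed

# Groups can then be addressed by their absolute path from the root group /
with h5py.File("groups_demo.h5", "r") as f:
    group1 = f["/group1"]
    group2 = f["/group1/group2"]
    print(group2.name)  # /group1/group2
```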

Now, for every partition, we create a group whose name corresponds to its index (one-based) in the partitioned grid. We assign the attributes local, ghost, and total, which hold the number of local, ghost, and local+ghost points in that partition, respectively.

We then create a dataset called local, an array of shape (number of local points, 30). This array contains our list of local points and the corresponding data for each of those points.

Likewise, we create another dataset called ghost, an array of shape (number of ghost points, 4).
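As a rough sketch of how this layout could be written with h5py (the point counts and data below are made up; the real file uses METIS-partitioned grid data):

```python
import h5py
import numpy as np

# Made-up point counts, for illustration only
n_local, n_ghost = 5, 2

with h5py.File("point_demo.h5", "w") as f:
    grp = f.create_group("1")  # one group per partition, named by one-based index

    # Attributes: number of local, ghost, and local+ghost points in this partition
    grp.attrs["local"] = n_local
    grp.attrs["ghost"] = n_ghost
    grp.attrs["total"] = n_local + n_ghost

    # Datasets: local has shape (n_local, 30), ghost has shape (n_ghost, 4)
    grp.create_dataset("local", data=np.random.rand(n_local, 30))
    grp.create_dataset("ghost", data=np.random.rand(n_ghost, 4))
```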

The Code

We first import the h5py module.

import h5py

Now, we open the HDF5 file using the File constructor, in read mode (r).

h5file = h5py.File("point.hdf5", "r")

We read the number of partitions in the file using the keys() method, since each top-level group corresponds to one partition.

partitions = len(h5file.keys())

We now use range() to loop from 1 to partitions (inclusive), matching the one-based group names.

for i in range(1, partitions + 1):

Using the get() method, we can access a dataset by providing its path.

Our path for accessing local points would be as follows.

localPath = "{}/{}".format(str(i), "local")
and we access the data using
localData = h5file.get(localPath)

Use the shape attribute to get the shape of the dataset.

localData.shape
(3212, 30)

Similarly, we can replace the text local with ghost to access the ghost dataset.

Finally, we close the HDF5 file handle with the close() method.

h5file.close()
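Incidentally, the attributes from the dump (local, ghost, total) are not touched by the snippets above; they can be read through a group's attrs mapping. A minimal sketch, using a small throwaway file rather than the original grid:

```python
import h5py

# Build a throwaway file with one partition group and its attributes
with h5py.File("attrs_demo.h5", "w") as f:
    grp = f.create_group("1")
    grp.attrs["local"] = 3212
    grp.attrs["ghost"] = 69
    grp.attrs["total"] = 3281

# Attributes are read through the attrs mapping on the group
h5file = h5py.File("attrs_demo.h5", "r")
nLocal = h5file["1"].attrs["local"]
nTotal = h5file["1"].attrs["total"]
h5file.close()
```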

The entire code can be summed up as follows.

import h5py

h5file = h5py.File("point.hdf5", "r")
partitions = len(h5file.keys())
for i in range(1, partitions + 1):

    localPath = "{}/{}".format(str(i), "local")
    localData = h5file.get(localPath)
    for point in localData:
        ...  # do something with each local point

    ghostPath = "{}/{}".format(str(i), "ghost")
    ghostData = h5file.get(ghostPath)
    for point in ghostData:
        ...  # do something with each ghost point

h5file.close()
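As a stylistic variant (not from the original post), the same loop can be wrapped in a function that uses a with block, so the file is closed automatically even if an error occurs mid-read:

```python
import h5py

def read_partitions(path):
    """Iterate over the local and ghost points of every partition in an HDF5
    grid file laid out as described above. Returns the number of partitions."""
    # The with statement closes the file automatically, even on error
    with h5py.File(path, "r") as h5file:
        partitions = len(h5file.keys())
        for i in range(1, partitions + 1):
            localData = h5file["{}/local".format(i)]
            ghostData = h5file["{}/ghost".format(i)]
            for point in localData:
                ...  # do something with each local point
            for point in ghostData:
                ...  # do something with each ghost point
        return partitions
```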

Summary

It’s quite simple to read HDF5 files in Python. Furthermore, HDF5 offers some nifty features like chunking, data compression, and a simple-to-understand hierarchical structure.

Additionally, an HDF5 file created in Python can be easily read by a program written in another language such as C++ or Fortran. This makes HDF5 extremely flexible when it comes to working across different languages.

References

  1. HDF5 Wikipedia article
  2. Official website for HDF5
  3. h5py, the library providing a Pythonic interface to HDF5 files