Reading Subsets of Datasets
Read in subsets of datasets and scales from standard PSI data files.
This example demonstrates how to use read_hdf_by_index(), read_hdf_by_value(), and read_hdf_by_ivalue().
These functions allow users to read in subsets of datasets based on either index positions or scale values. This is particularly useful for large datasets where only a portion of the data is needed for analysis.
Note
To reiterate, these functions do not read in the entire dataset; instead, they
extract only the specified slices based on the provided indices or values. The one
caveat is that read_hdf_by_value() must first read the scales in full in order to
determine the indices that correspond to the requested values.
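As a rough illustration of that caveat, the sketch below shows how a requested value could be mapped to a pair of bracketing indices with NumPy. This is only a conceptual sketch using a toy scale array; it is not the actual lookup logic used inside read_hdf_by_value().
import numpy as np

def bracketing_indices(scale, value):
    # Return (lo, hi), the indices of the scale entries that bracket `value`.
    # Illustrative only: assumes `scale` is 1D and monotonically increasing.
    hi = int(np.searchsorted(scale, value))
    hi = min(max(hi, 1), len(scale) - 1)  # keep both indices in bounds
    return hi - 1, hi

toy_radius = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
print(bracketing_indices(toy_radius, 1.7))  # (1, 2)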
from pathlib import Path
from psi_io import data, read_hdf_meta, read_hdf_by_index, read_hdf_by_value, read_hdf_by_ivalue
Read in the metadata for a 3D data file (the radial magnetic field data).
br_data_filepath = data.get_3d_data()
metadata = read_hdf_meta(br_data_filepath)
Datasets for file: br.h5
+-----------------+-----------------+-----------------+
| Name | Shape | Type |
+-----------------+-----------------+-----------------+
| Data | (181, 100, 151) | float32 |
+-----------------+-----------------+-----------------+
Scales for file: br.h5
+------------+------------+------------+------------+------------+------------+------------+
| Dataset | Dim | Name | Shape | Type | Min | Max |
+------------+------------+------------+------------+------------+------------+------------+
| Data | 0 | dim1 | (151,) | float32 | 0.99956024 | 30.511646 |
| Data | 1 | dim2 | (100,) | float32 | 0.0 | 3.1415927 |
| Data | 2 | dim3 | (181,) | float32 | 0.0 | 6.2831855 |
+------------+------------+------------+------------+------------+------------+------------+
Using this metadata, we can see that the dataset is a standard PSI 3D data file with scales:
- radius (1st dimension), in units of solar radii (R☉),
- co-latitude (2nd dimension), in radians,
- longitude (3rd dimension), in radians.
Note that the in-memory NumPy array is ordered in reverse, (longitude, co-latitude, radius), which is why the dataset shape is reported as (181, 100, 151).
To read in a subset of this data based on index positions, we can use
read_hdf_by_index(), which returns the sliced data array followed by the sliced
scale arrays (radius, co-latitude, longitude). For example, to read in only
the first position along the radius dimension (index 0) and all
positions along the co-latitude and longitude dimensions, we can do:
subset_at_r0 = read_hdf_by_index(br_data_filepath, 0, None, None)
for r in subset_at_r0:
print(r.shape)
(181, 100, 1)
(1,)
(100,)
(181,)
Or to read in the first five positions along the phi dimension (indices 0 to 4):
subset_phi0_4 = read_hdf_by_index(br_data_filepath, None, None, (0, 5))
for r in subset_phi0_4:
print(r.shape)
(5, 100, 151)
(151,)
(100,)
(5,)
Or to read a single position at the midpoint of each dimension:
mid_indices = (dim.shape[0]//2 for dim in metadata[0].scales)
subset_midpoint = read_hdf_by_index(br_data_filepath, *mid_indices)
for r in subset_midpoint:
print(r.shape)
(1, 1, 1)
(1,)
(1,)
(1,)
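Before moving on, the interpretation of the dimensions given above can be sanity-checked by reading everything back and inspecting the scale ranges. The sketch below assumes that passing None for every dimension returns the full data cube and all three scales (extrapolating from the partial reads above); it is not taken from the original example.
import numpy as np

# Assumption: None for every dimension returns the full dataset and scales.
full_data, r_full, t_full, p_full = read_hdf_by_index(br_data_filepath, None, None, None)
print(full_data.shape)             # (181, 100, 151)
print(r_full.min(), r_full.max())  # ~1.0 to ~30.5 (solar radii)
print(np.degrees(t_full).max())    # ~180 (co-latitude in degrees)
print(np.degrees(p_full).max())    # ~360 (longitude in degrees)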
Alternatively, to read in subsets based on scale values, we can use
read_hdf_by_value(). For example, to read in data
at a radius of 2 R☉ and all co-latitude and longitude values, we can do:
subset_at_r2 = read_hdf_by_value(br_data_filepath, 2.0, None, None)
for r in subset_at_r2:
print(r.shape)
(181, 100, 2)
(2,)
(100,)
(181,)
Note
When using read_hdf_by_value(), the minimum number
of elements returned along each dimension is two, viz. the values that
bracket the requested value. Therefore, in the above example, the radius
scale returned will contain two values: one just below 2.0 R☉ and one just above.
This allows for interpolation between scale values if desired (using the
minimum amount of data necessary).
_, rscale, *_ = subset_at_r2
print(rscale)
[1.9383442 2.004063 ]
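Those two bracketing slices are all that is needed for a simple linear interpolation onto the requested radius. The sketch below is a minimal example of that, assuming (as the shapes above indicate) that the radial dimension is the last axis of the returned array; it is not a psi_io routine.
bdata, rvals, _, _ = subset_at_r2
r_target = 2.0
# Linear weight between the lower (rvals[0]) and upper (rvals[1]) radial slices.
w = (r_target - rvals[0]) / (rvals[1] - rvals[0])
br_at_r2 = (1.0 - w) * bdata[..., 0] + w * bdata[..., 1]
print(br_at_r2.shape)  # (181, 100)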
Suppose we haven’t read in the metadata and don’t know the exact scale values
(or, alternatively, the desired scale values don’t exist in the dataset). In that case,
we can use read_hdf_by_ivalue(), which reads in data based on
the nearest integer value of the requested scale values.
Note
While this function performs a seemingly trivial task (one that is replicable through
read_hdf_by_index()), it is provided for user convenience
and for interpolation routines.
subset_ivalue = read_hdf_by_ivalue(br_data_filepath, 0.5, None, None)
for r in subset_ivalue:
print(r.shape)
_, rscale_ivalue, *_ = subset_ivalue
(181, 100, 2)
(2,)
(100,)
(181,)
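To illustrate the note above, the same slices can be recovered with read_hdf_by_index() once the bracketing indices are known. The sketch below locates the radius values returned by read_hdf_by_ivalue() in the full radius scale and repeats the read by index; it assumes the all-None call returns the full scales and that np.searchsorted finds exact matches.
import numpy as np

# Read the full radius scale (assumption: None for every dimension returns everything).
_, r_full, _, _ = read_hdf_by_index(br_data_filepath, None, None, None)
# Locate the two radius values returned by read_hdf_by_ivalue() in the full scale.
i0, i1 = np.searchsorted(r_full, rscale_ivalue)
# Repeat the read by index; a (start, stop) tuple selects the same bracketing pair.
subset_by_index = read_hdf_by_index(br_data_filepath, (int(i0), int(i1) + 1), None, None)
print(np.array_equal(subset_by_index[0], subset_ivalue[0]))  # expected: True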
Total running time of the script: (0 minutes 0.023 seconds)