Data format¶
The HDF5 file format is used as a container for all data in INTEGRATE. Each HDF5 file can contain multiple datasets (typically arranged in matrix format) with associated attributes that describe the data. HDFView is useful for inspecting the contents of HDF5 files.
The following HDF5 files are used for any INTEGRATE project:
DATA.h5 : Stores observed data, and its associated geometry.
PRIOR.h5: Stores realizations of the prior model, and corresponding forward response
FORWARD.h5: Stores information needed to solve the forward problem, and/or needed to describe the observed data in DATA.h5
POST.h5: Stores the index of posterior realizations, as well as posterior statistics
DATA.h5¶
DATA.h5 contains observed data and its associated geometry. The observed data can be of many types, such as TEM data and well-log data.
By default, observed data in /D1 is compared with prior data in /D1, /D2 with /D2, etc.
To compare observed data with a different prior dataset, set the /D1/id_prior field.
For example, setting /D1/id_prior=2 will compare observed /D1 with prior /D2.
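As a sketch, id_prior can be set with plain h5py. Whether id_prior is stored as a scalar dataset or as an HDF5 attribute may depend on the INTEGRATE version; here it is written as a scalar dataset, matching the table below:

```python
import h5py

# Hypothetical sketch: make observed /D1 compare against prior /D2.
with h5py.File("DATA.h5", "a") as f:
    grp = f.require_group("D1")
    if "id_prior" in grp:
        del grp["id_prior"]          # overwrite any previous value
    grp.create_dataset("id_prior", data=2)
```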
Np: Number of data locations (typically one data set per unique X-Y location)
Ndi: Number of data points Nd per data type i per location
Nclass: Number of classes
The datasets UTMX, UTMY, ELEVATION, and LINE are mandatory for most plotting routines in INTEGRATE,
but are not used in the inversion itself.
The attribute D1/noise_model is mandatory for all data types, and describes the noise model used for the data.
| Dataset | Format | Attribute | Mandatory | Description |
|---|---|---|---|---|
| /UTMX | [Np,1] | | (*) | X-location of data points |
| /UTMY | [Np,1] | | (*) | Y-location of data points |
| /ELEVATION | [Np,1] | | (*) | Elevation at data points |
| /LINE | [Np,1] | | (*) | Line number at data points |
| /D1/noise_model | [string] | | (*) | A string describing the noise model used for the data |
| /D1/id_prior | [integer] | | | The prior dataset ID to compare against. Observed data in /D1 will be compared with prior data in /D{id_prior} during inversion. If not set, defaults to the same ID as the observed data (D1→D1, D2→D2, etc.) |
| /D1/i_use | [Np,1] int [0/1] | | | Determines whether a data point should be used or not. All data are used by default |
The format of the observed data, and the associated uncertainty, depends on the type of data and the choice of noise model.
See the function integrate.load_data() for an example of how to read DATA.h5 files.
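As a sketch of this layout, the following writes a minimal DATA.h5 with plain h5py and reads it back. All shapes and values are synthetic; note that noise_model is written here as a string dataset, though it may instead be an HDF5 attribute depending on the INTEGRATE version:

```python
import h5py
import numpy as np

Np, Nd1 = 5, 3
rng = np.random.default_rng(0)

# Write the geometry datasets and one data group /D1
with h5py.File("DATA.h5", "w") as f:
    f["UTMX"] = np.linspace(500000.0, 500400.0, Np).reshape(Np, 1)
    f["UTMY"] = np.full((Np, 1), 6200000.0)
    f["ELEVATION"] = np.full((Np, 1), 42.0)
    f["LINE"] = np.ones((Np, 1), dtype=int)
    f["D1/d_obs"] = rng.normal(size=(Np, Nd1))
    f["D1/noise_model"] = "gaussian"

# Read it back, roughly what integrate.load_data() does internally
with h5py.File("DATA.h5", "r") as f:
    d_obs = f["D1/d_obs"][:]
    noise_model = f["D1/noise_model"][()].decode()
print(d_obs.shape, noise_model)  # (5, 3) gaussian
```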
Gaussian noise - continuous data¶
For continuous data, a multivariate Gaussian noise model can be chosen by setting the attribute D1/noise_model=gaussian
| Dataset | Format | Attribute | Mandatory | Description |
|---|---|---|---|---|
| /D1/noise_model | [string]='gaussian' | | | A string describing the noise model used for the data; here set to 'gaussian' |
| /D1/d_obs | [Np,Nd1] | | | Observed data (#1) |
| /D1/d_std | [Np,Nd1] | | | Standard deviation of observed data (dB/dt). If the size is [1,Nd1], the same standard deviation is used at all data locations |
| /D1/Cd | [Nd1,Nd1] | | | Correlated noise matrix |
| /D1/Cd | [Np,Nd1,Nd1] | | | Correlated noise matrix; each data observation has its own correlated noise matrix |
| /gatetimes | [Ndata,1] | | | Gate times (in seconds) for each data point |
| /i_lm | [Nlm,1] | | | Index (relative to /gatetimes) of the Nlm gates for the low moment |
| /i_hm | [Nhm,1] | | | Index (relative to /gatetimes) of the Nhm gates for the high moment |
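A minimal Gaussian-noise data file can be sketched as follows. The field names follow the table above; the filename, values, and the 5% relative noise level are illustrative only:

```python
import h5py
import numpy as np

Np, Nd1 = 4, 6
rng = np.random.default_rng(1)
d_true = np.abs(rng.normal(size=(Np, Nd1)))   # synthetic noise-free data

with h5py.File("DATA_gaussian.h5", "w") as f:
    f["D1/noise_model"] = "gaussian"
    f["D1/d_std"] = 0.05 * d_true             # 5% relative standard deviation
    f["D1/d_obs"] = d_true + rng.normal(size=(Np, Nd1)) * (0.05 * d_true)
    f["gatetimes"] = np.logspace(-5, -3, Nd1).reshape(Nd1, 1)  # seconds
```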
Multinomial noise - discrete data¶
For discrete data, the multinomial distribution can be used as the likelihood by setting the attribute Dx/noise_model=multinomial (shown below for /D2)
| Dataset | Format | Attribute | Mandatory | Description |
|---|---|---|---|---|
| /D2/noise_model | [string]='multinomial' | | | The multinomial distribution is used as the likelihood model for the data |
| /D2/d_obs | [Np,Nclass,Nm] | | | Observed data (class probabilities) |
| /D2/i_use | [Np,1] | | | Binary indicator of whether a data point should be used or not |
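Since /D2/d_obs holds class probabilities, each slice over the Nclass axis should sum to one. A small synthetic check (the names Np, Nclass, Nm are as defined above):

```python
import numpy as np

Np, Nclass, Nm = 4, 3, 10
rng = np.random.default_rng(0)
raw = rng.random((Np, Nclass, Nm))
d_obs = raw / raw.sum(axis=1, keepdims=True)   # normalise over the class axis

# Every (location, depth) pair now carries a valid probability distribution
assert np.allclose(d_obs.sum(axis=1), 1.0)
```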
PRIOR.h5¶
PRIOR.h5 contains N realizations of a prior model (represented by potentially multiple types of model parameters, such as resistivity, lithology, grain size, …) and the corresponding data (consisting of potentially multiple types of data, such as tTEM, SkyTEM, well data, …)
N: Number of realizations of the prior model
Nm1: Number of model parameters of type 1
Nm2: Number of model parameters of type 2
NmX: Number of model parameters of type X
| Dataset | Format | Attribute | Mandatory | Description |
|---|---|---|---|---|
| /M1 | [N,Nm1] | | | N realizations of model parameter 1, each consisting of Nm1 model parameters |
| /M1/x | [nm] | | | Array of values describing each value in /M1 (e.g. depth to top of layer) |
| /M1/name | [string] | | | Name of model parameter /M1 |
| /M1/is_discrete | [nm] | | | [0/1] describes whether /M1 is a discrete or continuous parameter |
| /M1/class_id | [1,n_class] | | | A list of class IDs, one per discrete class |
| /M1/class_name | [1,n_class] | | | A list of class names, one per discrete class |
| /M1/clim | [1,2] | | | Minimum and maximum value for the colorbar |
| /M1/cmap | [3,nlev] | | | Colormap with nlev levels |
| /M2 | [N,Nm2] | | | N realizations of model parameter 2, each consisting of Nm2 model parameters |
| /Mx | [N,NmX] | | | N realizations of model parameter X, each consisting of NmX model parameters |
| Dataset | Format | Attribute | Mandatory | Description |
|---|---|---|---|---|
| /D1 | [N,Nd1] | | | N realizations of data number 1, each consisting of Nd1 data values |
| /D1/f5_forward | [string] | | | HDF5 file describing the forward model used to compute prior data |
| /D1/with_noise | [1] | | | Indicates whether noise was added to the data [1] or not [0] |
| /D2 | [N,Nd2] | | | N realizations of data number 2, each consisting of Nd2 data values |
/D1 is only mandatory when PRIOR.h5 is used for inversion.
All the mandatory attributes specified for /M1 are also mandatory for the other model parameters, i.e. /M2, /M3, … .
f_forward_h5 [string]: Defines the name of the HDF5 file that contains the information needed to solve the forward problem…
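A minimal PRIOR.h5 with one continuous model parameter can be sketched as below. The descriptive fields (x, name, is_discrete) are written here as HDF5 attributes on the /M1 dataset, which matches the table's Attribute column, but they may be stored differently depending on the INTEGRATE version; all values are synthetic:

```python
import h5py
import numpy as np

N, Nm1 = 100, 8
rng = np.random.default_rng(2)

with h5py.File("PRIOR.h5", "w") as f:
    # N prior realizations of a layered resistivity model (log-normal draw)
    m1 = f.create_dataset("M1", data=10 ** rng.normal(1.5, 0.5, size=(N, Nm1)))
    m1.attrs["x"] = np.linspace(0.0, 70.0, Nm1)   # depth to top of layer [m]
    m1.attrs["name"] = "Resistivity (ohm-m)"
    m1.attrs["is_discrete"] = 0                   # continuous parameter
```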
FORWARD.h5¶
The FORWARD.h5 file needs to hold all the information required to define the use of a specific forward model.
The attribute /method refers to a specific choice of forward method.
| Dataset | Format | Attribute | Mandatory | Description |
|---|---|---|---|---|
| /method | [string] | | | Defines the type of forward model. Default: 'TDEM' |
| /type | [string] | | | Defines the algorithm used to solve the forward model. Default: 'GA-AEM' |
/method can, for example, be TDEM for Time Domain EM (the default in INTEGRATE),
or identity for an identity mapping (useful to represent log data).
/method='TDEM' makes use of time-domain EM forward modeling.
The following three types of forward models will (eventually) be available:
/type='GA-AEM' [DEFAULT]. [GA-AEM]. Available for both Linux and Windows, Matlab and Python.
/type='AarhusInv'. [AarhusInv]. Windows only. Not yet implemented.
/type='SimPEG'. [SimPEG]. Python only.
/method='identity' maps attributes of a specific model (realizations of the prior) directly into data.
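A minimal FORWARD.h5 selecting the default TDEM/GA-AEM combination can be sketched as follows. Whether /method and /type are datasets or HDF5 attributes may depend on the INTEGRATE version; here they are written as string datasets:

```python
import h5py

with h5py.File("FORWARD.h5", "w") as f:
    f["method"] = "TDEM"     # time-domain EM forward (the default)
    f["type"] = "GA-AEM"     # algorithm used to solve the forward model
```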
POST.h5¶
At the very minimum, POST.h5 needs to contain the indices (in PRIOR.h5) of realizations from the posterior.
Statistics are written by integrate.integrate_posterior_stats().
Np: Number of data locations.
Nr: Number of posterior realizations per location.
Nm: Number of model parameters (e.g. depth layers) for a given model type.
Nclass: Number of discrete classes for a given model type.
| Dataset | Format | Attribute | Mandatory | Description |
|---|---|---|---|---|
| /i_use | [Np,Nr] | | (*) | Indices (into PRIOR.h5) of posterior realizations for each data location |
| /T | [Np,1] | | | Annealing temperature used during inversion |
| /EV | [Np,1] | | | Log-evidence at each data location |
| /f5_data | [string] | | | Filename of the DATA HDF5 file |
| /f5_prior | [string] | | | Filename of the PRIOR HDF5 file |
| /CHI2 | [Np,Nd] float | | | Reduced chi-squared (χ²/ν) goodness-of-fit metric per data type. Values near 1 indicate a good fit; >1 underfit; <1 overfit |
| /N_UNIQUE | [Np] | | | Number of unique prior realizations used at each data location. Written by integrate.integrate_posterior_stats() |
| /UTMX | [Np] | | | X-coordinate, copied from DATA.h5 |
| /UTMY | [Np] | | | Y-coordinate, copied from DATA.h5 |
| /ELEVATION | [Np] | | | Elevation, copied from DATA.h5 |
| /LINE | [Np] | | | Line number, copied from DATA.h5 |
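Because /i_use holds, for each of the Np data locations, the row indices of the Nr prior realizations that survived into the posterior, extracting a posterior ensemble is plain fancy indexing into the prior model matrix. A synthetic sketch (arrays stand in for /M1 from PRIOR.h5 and /i_use from POST.h5):

```python
import numpy as np

N, Nm, Np, Nr = 1000, 8, 3, 50
rng = np.random.default_rng(3)
M1 = rng.normal(size=(N, Nm))               # prior realizations [N, Nm]
i_use = rng.integers(0, N, size=(Np, Nr))   # posterior indices per location

post_loc0 = M1[i_use[0]]                    # posterior ensemble at location 0
print(post_loc0.shape)  # (50, 8)
```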
Written for each continuous model parameter /Mx (is_discrete=0) by
integrate.integrate_posterior_stats().
| Dataset | Format | Written | Description |
|---|---|---|---|
| /Mx/Mean | [Np,Nm] | always | Arithmetic mean of the posterior realizations at each location and depth |
| /Mx/LogMean | [Np,Nm] | always | Geometric mean: exp(mean(log(values))). Appropriate for log-normally distributed parameters such as resistivity |
| /Mx/Median | [Np,Nm] | always | Median of the posterior realizations |
| /Mx/Std | [Np,Nm] | always | Standard deviation of log10(posterior). Measures spread on the logarithmic scale |
| /Mx/KL | [Np,Nm] | | KL divergence D_KL(posterior ∥ prior) in bits (log base 2), estimated from log10-space histograms (50 bins). No fixed upper bound for continuous parameters |
Written for each discrete model parameter /Mx (is_discrete=1) by
integrate.integrate_posterior_stats().
| Dataset | Format | Written | Description |
|---|---|---|---|
| /Mx/Mode | [Np,Nm] | always | Most probable class (class_id value) at each location and depth |
| /Mx/Entropy | [Np,Nm] | always | Shannon entropy of the posterior class probabilities, normalised by log(Nclass). Range [0, 1]: 0 = certain, 1 = maximally uncertain |
| /Mx/P | [Np,Nclass,Nm] | always | Posterior probability of each class at each location and depth |
| /Mx/KL | [Np,Nm] | | KL divergence D_KL(posterior ∥ prior) normalised to [0, 1] using log_base = Nclass (read from …) |
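The normalised entropy written to /Mx/Entropy can be reproduced from /Mx/P. The following is a sketch of the definition above (H = -Σ P log P / log Nclass), not INTEGRATE's exact implementation:

```python
import numpy as np

def normalised_entropy(P):
    """Normalised Shannon entropy of class probabilities P [Np, Nclass, Nm]."""
    Nclass = P.shape[1]
    with np.errstate(divide="ignore", invalid="ignore"):
        t = np.where(P > 0, P * np.log(P), 0.0)  # treat 0*log(0) as 0
    return -t.sum(axis=1) / np.log(Nclass)       # normalise to [0, 1]

P = np.zeros((1, 4, 2))
P[0, :, 0] = [1.0, 0.0, 0.0, 0.0]   # certain class  -> entropy 0
P[0, :, 1] = 0.25                   # uniform        -> entropy 1
print(normalised_entropy(P))        # [[0. 1.]]
```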
Compression in HDF5 files¶
All HDF5 files created by INTEGRATE use compression by default to reduce file sizes while maintaining reasonable I/O performance. The default compression settings are optimized based on extensive benchmarking:
Default: gzip level 1 (provides 3.5× file size reduction, 78% faster than level 9)
Performance: Write overhead of ~3× compared to no compression, but results in significantly smaller files
Customizable: Compression can be configured per function call or globally
Example usage:

```python
import integrate as ig

# Use default compression (gzip level 1)
ig.prior_model_layered(N=50000)

# Custom compression level
ig.prior_model_layered(N=50000, compression='gzip', compression_opts=4)

# Fast LZF compression
ig.prior_model_layered(N=50000, compression='lzf')

# Disable compression for temporary files
ig.prior_model_layered(N=50000, compression=False)

# Same parameters work for data functions
ig.save_data_gaussian(D_obs, compression='gzip', compression_opts=9)
```
You can modify the module-wide compression defaults in integrate_io.py:

```python
import integrate as ig
import integrate.integrate_io as io

# Change global defaults (affects all subsequent saves)
io.DEFAULT_COMPRESSION = 'lzf'       # Options: 'gzip', 'lzf', or None
io.DEFAULT_COMPRESSION_OPTS = 4      # For gzip: 1-9 (1=fastest, 9=smallest)

# Now all functions use the new defaults
ig.prior_model_layered(N=50000)      # Will use lzf compression
```
Based on benchmarks with N=50,000 models:

| Setting | Write Speed | File Size | Best For |
|---|---|---|---|
| No compression | Fastest | 19.7 MB | Temporary files |
| gzip level 1 | Fast | 5.6 MB (3.5×) | Default (best balance) |
| lzf | Very Fast | ~6-7 MB (2-3×) | Speed-critical workflows |
| gzip level 4 | Medium | 5.5 MB (3.6×) | Good alternative |
| gzip level 9 | Slowest (1.51 s) | 5.5 MB (3.6×) | Long-term archival (diminishing returns) |
Note: The difference between gzip levels 1 and 9 is only ~2% in file size but 4.6× difference in write time. Level 1 is recommended for most use cases.
All HDF5 write functions accept compression and compression_opts parameters:
save_prior_model() - Model parameter arrays
save_prior_data() - Forward-modeled data
save_data_gaussian() - Observed data with Gaussian noise model
save_data_multinomial() - Observed data with multinomial noise model
prior_model_layered() - Passes compression settings to internal saves
A typical workflow¶
1. Setup DATA.h5
Store the observed data and its associated uncertainty in DATA.h5
2. Setup FORWARD.h5
Define the forward problem for data type A in FORWARD_A.h5.
Define the forward problem for data type B in FORWARD_B.h5.
3. Setup PRIOR.h5
Generate prior model realizations of model parameter 1 in /M1
Generate prior model realizations of model parameter 2 in /M2
Use FORWARD_A.h5 to compute prior data of the prior realizations for data type A
Use FORWARD_B.h5 to compute prior data of the prior realizations for data type B
4. Sample the posterior and output POST.h5
5. Update POST.h5 with statistics computed from the posterior.
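The workflow above can be sketched end to end with plain h5py, using the documented identity forward (/method='identity', so prior data equals the prior model) and a simple smallest-misfit selection in place of INTEGRATE's actual sampler. File and dataset names follow this document; everything else is illustrative:

```python
import h5py
import numpy as np

rng = np.random.default_rng(4)
N, Nm, Np, Nr = 500, 4, 2, 20

# 1) DATA.h5: observed data with a Gaussian noise description
d_obs = rng.normal(size=(Np, Nm))
with h5py.File("DATA.h5", "w") as f:
    f["D1/noise_model"] = "gaussian"
    f["D1/d_obs"] = d_obs
    f["D1/d_std"] = np.full((Np, Nm), 0.1)

# 2) FORWARD.h5: identity mapping
with h5py.File("FORWARD.h5", "w") as f:
    f["method"] = "identity"

# 3) PRIOR.h5: prior realizations and their (identity) forward responses
M1 = rng.normal(size=(N, Nm))
with h5py.File("PRIOR.h5", "w") as f:
    f["M1"] = M1
    f["D1"] = M1                     # identity forward: data = model

# 4) POST.h5: keep the Nr prior realizations with smallest misfit per location
misfit = ((M1[None, :, :] - d_obs[:, None, :]) ** 2).sum(axis=2)  # [Np, N]
i_use = np.argsort(misfit, axis=1)[:, :Nr]
with h5py.File("POST.h5", "w") as f:
    f["i_use"] = i_use
print(i_use.shape)  # (2, 20)
```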