integrate_io

Functions related to IO in the INTEGRATE module.

INTEGRATE I/O Module - Data Input/Output and File Management

This module provides comprehensive input/output functionality for the INTEGRATE geophysical data integration package. It handles reading and writing of HDF5 files, data format conversions, and management of prior/posterior data structures.

Key Features:
  • HDF5 file I/O for prior models, data, and posterior results

  • Support for multiple geophysical data formats (GEX, STM, USF)

  • Automatic data validation and format checking

  • File conversion utilities between different formats

  • Data merging and aggregation functions

  • Checksum verification and file integrity checks

Main Functions:
  • load_*(): Functions for loading prior models, data, and results

  • save_*(): Functions for saving prior models and data arrays

  • read_*(): File format readers (GEX, USF, etc.)

  • write_*(): File format writers and converters

  • merge_*(): Data and posterior merging utilities

File Format Support:
  • HDF5: Primary data storage format

  • GEX: Geometry and survey configuration files

  • STM: System transfer function files

  • USF: Field measurement files

  • CSV: Export format for GIS integration

Author: Thomas Mejer Hansen

Email: tmeha@geo.au.dk

integrate.integrate_io.check_data(f_data_h5='data.h5', **kwargs)

Validate and complete INTEGRATE data file structure.

Ensures HDF5 data files contain required geometry datasets (UTMX, UTMY, LINE, ELEVATION) for electromagnetic surveys. Creates missing datasets using provided values or sensible defaults based on existing data dimensions.

Parameters:
  • f_data_h5 (str, optional) – Path to the HDF5 data file to validate and update (default is ‘data.h5’).

  • **kwargs (dict) – Dataset values and configuration options:

    UTMX : array-like, UTM X coordinates

    UTMY : array-like, UTM Y coordinates

    LINE : array-like, survey line identifiers

    ELEVATION : array-like, ground elevation values

    showInfo : int, verbosity level (0=silent, >0=verbose)

Returns:

Function modifies the HDF5 file in place, adding missing datasets.

Return type:

None

Raises:
  • KeyError – If ‘D1/d_obs’ dataset is missing and geometry dimensions cannot be determined.

  • FileNotFoundError – If the specified HDF5 file does not exist.

Notes

The function ensures INTEGRATE data files have complete geometry information:

  • UTMX, UTMY: Spatial coordinates (required for mapping and modeling)

  • LINE: Survey line identifiers (required for data organization)

  • ELEVATION: Ground surface elevation (required for depth calculations)

Behavior:

  • If coordinate parameters are provided (UTMX, UTMY, LINE, ELEVATION), existing datasets are updated with the new values and datasets that don’t exist are created.

  • If coordinate parameters are not provided, existing datasets are left unchanged and missing datasets are created with default values.

Default value generation when datasets are missing and no values are provided:

  • UTMX: Sequential values 0, 1, 2, … (placeholder coordinates)

  • UTMY: Zeros array with same length as UTMX

  • LINE: All values set to 1 (single survey line)

  • ELEVATION: All values set to 0 (sea level reference)

Dataset dimensions are inferred from existing ‘D1/d_obs’ observations when no coordinate data is provided.
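The default values listed above can be sketched with NumPy. This is only an illustration of the documented defaults, not the library's implementation; the helper name default_geometry and the chosen dtypes are assumptions.

```python
import numpy as np

def default_geometry(n_points):
    """Placeholder geometry following the documented defaults (hypothetical helper)."""
    return {
        "UTMX": np.arange(n_points, dtype=float),  # 0, 1, 2, ...
        "UTMY": np.zeros(n_points),                # zeros, same length as UTMX
        "LINE": np.ones(n_points, dtype=int),      # a single survey line
        "ELEVATION": np.zeros(n_points),           # sea level reference
    }

# The number of points would come from the first dimension of 'D1/d_obs';
# here a made-up array stands in for the observations.
d_obs = np.full((5, 12), np.nan)
geometry = default_geometry(d_obs.shape[0])
print(geometry["UTMX"], geometry["LINE"])
```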

integrate.integrate_io.copy_hdf5_file(input_filename, output_filename, N=None, loadToMemory=True, compress=True, **kwargs)

Copy the contents of an HDF5 file to another HDF5 file.

Parameters:
  • input_filename (str) – The path to the input HDF5 file.

  • output_filename (str) – The path to the output HDF5 file.

  • N (int, optional) – The number of elements to copy from each dataset. If not specified, all elements will be copied.

  • loadToMemory (bool, optional) – Whether to load the entire dataset to memory before slicing. Default is True.

  • compress (bool, optional) – Whether to compress the output dataset. Default is True.

Returns:

output_filename

integrate.integrate_io.copy_prior(input_filename, output_filename, idx=None, N_use=None, loadtomem=False, **kwargs)

Copy a PRIOR file, optionally subsetting the data.

This function copies an HDF5 PRIOR file, which may contain model parameters (M1, M2, …) and forward-modeled data (D1, D2, …). It allows for copying only a specific subset of samples using either an index array (idx) or a specified number of random samples (N_use).

Parameters:
  • input_filename (str) – Path to the input PRIOR HDF5 file.

  • output_filename (str) – Path to the output PRIOR HDF5 file.

  • idx (array-like, optional) – An array of indices to copy. If provided, only the data corresponding to these indices will be included in the new file. This takes precedence over N_use. Default is None (copy all data).

  • N_use (int, optional) – The number of random samples to select and copy. This is ignored if idx is provided. Default is None.

  • loadtomem (bool, optional) – If True, datasets are loaded entirely into memory before slicing. This can significantly speed up copying large subsets of data but increases memory consumption. Default is False.

  • **kwargs (dict) – Additional keyword arguments (e.g., showInfo, compress).

Returns:

The path to the output HDF5 file (output_filename).

Return type:

str

Raises:

ValueError – If N_use is greater than the total number of samples in the file, or if no datasets are found to determine the size for random sampling.
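The subsetting rules above (idx takes precedence over N_use, and an oversized N_use raises ValueError) can be sketched as follows. The helper choose_samples, its seed argument, and sampling without replacement are assumptions made for illustration; the actual selection code in copy_prior may differ.

```python
import numpy as np

def choose_samples(n_total, idx=None, N_use=None, seed=None):
    """Return the sample indices to copy (hypothetical helper, not library code)."""
    if idx is not None:
        return np.asarray(idx)          # idx takes precedence over N_use
    if N_use is None:
        return np.arange(n_total)       # copy all samples
    if N_use > n_total:
        raise ValueError(f"N_use={N_use} exceeds the {n_total} available samples")
    rng = np.random.default_rng(seed)
    # Assumption: draw without replacement and sort the result
    return np.sort(rng.choice(n_total, size=N_use, replace=False))

print(choose_samples(10, idx=[2, 5], N_use=4))  # N_use is ignored here
print(choose_samples(10, N_use=4, seed=0))
```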

integrate.integrate_io.download_file(url, download_dir, use_checksum=False, **kwargs)

Download a file from a URL to a specified directory.

Parameters:
  • url (str) – The URL of the file to download.

  • download_dir (str) – The directory to save the downloaded file.

  • use_checksum (bool) – Whether to verify the file checksum after download.

  • kwargs – Additional keyword arguments.

Returns:

None

integrate.integrate_io.download_file_old(url, download_dir, **kwargs)

Download a file from a URL to a specified directory (old version).

Parameters:
  • url (str) – The URL of the file to download.

  • download_dir (str) – The directory to save the downloaded file.

  • kwargs – Additional keyword arguments.

Returns:

None

integrate.integrate_io.extract_feature_at_elevation(f_post_h5, elevation, im=1, key='', iz=None, ic=None, iclass=None)

Extract model parameter feature values at a specific elevation for all data points.

This function extracts values from a posterior model parameter at a specified elevation (e.g., 40m above sea level) across all data points. The function performs linear interpolation between model layers to obtain values at the exact requested elevation. For each data point, it uses the ELEVATION from the data file and the depth discretization from the prior model to compute the interpolated value.

Parameters:
  • f_post_h5 (str) – Path to the HDF5 file containing posterior sampling results.

  • elevation (float) – Elevation in meters at which to extract the feature values. This is an absolute elevation value (e.g., 40 means 40m above sea level).

  • im (int, optional) – Model index to extract from (e.g., 1 for M1, 2 for M2, default is 1).

  • key (str, optional) –

    Dataset key within the model group to extract. If empty string, automatically selects appropriate statistic based on parameter type:

    Continuous parameters: ‘Mean’, ‘Median’, ‘Std’ - Default: ‘Median’

    Discrete parameters: ‘Mode’, ‘Entropy’, ‘P’ (probability) - Default: ‘Mode’ - For ‘P’: requires ic/iclass parameter to specify which class

  • iz (int or None, optional) – Specific layer/feature index to extract. If None, attempts to find the appropriate depth layer automatically based on the elevation and model discretization (default is None). This parameter is primarily for advanced use when you want to extract a specific indexed feature rather than interpolating at an elevation.

  • ic (int or None, optional) – Class index for probability extraction when key=‘P’. Specifies which class probability to extract. If None and key=‘P’, defaults to 0 (first class). Alias for iclass parameter (default is None).

  • iclass (int or None, optional) – Alternative name for ic parameter. Class index for probability extraction when key=‘P’ (default is None).

Returns:

Array of feature values at the specified elevation for all data points. Shape is (N_points,) where N_points is the number of data locations. Values are interpolated from the model layers surrounding the requested elevation. Returns NaN for data points where the requested elevation is outside the model domain (above surface or below maximum depth).

Return type:

numpy.ndarray

Raises:
  • FileNotFoundError – If the specified HDF5 file does not exist.

  • KeyError – If the requested model index (im) or key is not found in the file.

  • ValueError – If the elevation is invalid or cannot be interpolated from the model.

Notes

Elevation and Depth Calculation:

The function uses the following coordinate system:

  • ELEVATION: Ground surface elevation for each data point (from data file)

  • z: Depth below surface from the prior model (e.g., 0, 1, 2, … meters)

  • Absolute elevation = ELEVATION - z

For example, if a data point has ELEVATION=50m and the model has z=[0,10,20,30]:

  • At z=0: absolute elevation = 50m (surface)

  • At z=10: absolute elevation = 40m (10m below surface)

  • At z=20: absolute elevation = 30m (20m below surface)

To extract a value at elevation=40m, the function:

  1. Computes depth below surface: depth = ELEVATION - elevation = 50 - 40 = 10m

  2. Interpolates the feature value at depth=10m from the model

Interpolation:

Linear interpolation is used between model layers. If the requested elevation falls exactly on a model layer boundary, that layer’s value is returned. If the elevation is between two layers, values are linearly interpolated.
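The depth conversion and linear interpolation described in these notes can be sketched with numpy.interp. This is a minimal illustration rather than the library's code: the helper value_at_elevation and the layer values are made up, and z is assumed to be sorted in increasing depth.

```python
import numpy as np

def value_at_elevation(values, z, surface_elevation, elevation):
    """Interpolate one vertical profile at an absolute elevation (illustrative)."""
    depth = surface_elevation - elevation  # depth below the surface
    # Outside the model domain (above surface or below max depth) gives NaN
    return np.interp(depth, z, values, left=np.nan, right=np.nan)

z = np.array([0.0, 10.0, 20.0, 30.0])         # depths from the prior model
values = np.array([100.0, 80.0, 40.0, 20.0])  # made-up layer values

# Three data points with different surface elevations, same profile
ELEVATION = np.array([50.0, 45.0, 35.0])
out = np.array([value_at_elevation(values, z, e, 40.0) for e in ELEVATION])
print(out)  # depths are 10 m, 5 m and -5 m (the last is above the surface)
```

The first point hits a layer exactly, the second falls between two layers, and the third lies above the ground surface and is therefore NaN.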

Automatic Key Selection:

When key=‘’, the function automatically selects an appropriate statistic:

  • Discrete parameters: defaults to ‘Mode’ (most probable class)

  • Continuous parameters: defaults to ‘Median’ (robust central estimate)

Valid Keys by Parameter Type:

Continuous parameters:

  • ‘Mean’: Average value

  • ‘Median’: Median value (default)

  • ‘Std’: Standard deviation

Discrete parameters:

  • ‘Mode’: Most probable class (default)

  • ‘Entropy’: Uncertainty measure

  • ‘P’: Probability for a specific class (requires ic/iclass parameter)

Probability Extraction:

When extracting probabilities (key=‘P’), the function requires a class index specified by ic or iclass. The P array has shape (nd, n_classes, nz) where:

  • nd = number of data points

  • n_classes = number of discrete classes

  • nz = number of depth layers

The ic/iclass parameter selects which class probability to extract.
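Selecting one class from an array with the documented (nd, n_classes, nz) layout amounts to indexing the second axis. The sketch below uses random numbers in place of real posterior probabilities; the array sizes are arbitrary.

```python
import numpy as np

nd, n_classes, nz = 4, 3, 5
rng = np.random.default_rng(0)
P = rng.random((nd, n_classes, nz))
P /= P.sum(axis=1, keepdims=True)  # normalize so the classes sum to 1

ic = 0                  # class index, as chosen through ic/iclass
P_class = P[:, ic, :]   # probability of class ic, shape (nd, nz)
print(P_class.shape)
```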

Examples

Extract median resistivity at 40m elevation (continuous):

>>> values = extract_feature_at_elevation('post.h5', elevation=40, im=1, key='Median')
>>> print(values.shape)  # (N_points,)

Extract mean and standard deviation (continuous):

>>> mean_vals = extract_feature_at_elevation('post.h5', elevation=40, im=1, key='Mean')
>>> std_vals = extract_feature_at_elevation('post.h5', elevation=40, im=1, key='Std')

Extract mode (most probable class) at 25m elevation (discrete):

>>> classes = extract_feature_at_elevation('post.h5', elevation=25, im=2, key='Mode')

Extract entropy (uncertainty) for discrete parameter:

>>> entropy = extract_feature_at_elevation('post.h5', elevation=25, im=2, key='Entropy')

Extract probability for first class (discrete):

>>> prob_class0 = extract_feature_at_elevation('post.h5', elevation=25, im=2, key='P', ic=0)

Extract probability for second class using iclass parameter:

>>> prob_class1 = extract_feature_at_elevation('post.h5', elevation=25, im=2, key='P', iclass=1)

Use automatic key selection (Mode for discrete, Median for continuous):

>>> values = extract_feature_at_elevation('post.h5', elevation=30, im=1)

Extract mean values at sea level (elevation=0):

>>> values = extract_feature_at_elevation('post.h5', elevation=0, im=1, key='Mean')

integrate.integrate_io.file_checksum(file_path)

Calculate the MD5 checksum of a file.

Parameters:

file_path (str) – The path to the file.

Returns:

The MD5 checksum of the file.

Return type:

str
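For reference, an MD5 checksum of a file can be computed with the standard library as sketched below. The helper md5_of_file and its chunk size are illustrative choices and are not taken from the INTEGRATE source.

```python
import hashlib
import os
import tempfile

def md5_of_file(file_path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(file_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstration on a small temporary file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
checksum = md5_of_file(tmp.name)
os.remove(tmp.name)
print(checksum)
```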

integrate.integrate_io.get_case_data(case='DAUGAARD', loadAll=False, loadType='', filelist=None, **kwargs)

Get case data for a specific case.

Parameters:
  • case (str) – The case name. Default is ‘DAUGAARD’. Options are ‘DAUGAARD’, ‘GRUSGRAV’, ‘FANGEL’, ‘HALD’, ‘ESBJERG’, and ‘OERUM’.

  • loadAll (bool) – Whether to load all files for the case. Default is False.

  • loadType (str) – The type of files to load. Options are ‘’, ‘prior’, ‘prior_data’, ‘post’, and ‘inout’.

  • filelist (list or None) – A list of files to load. Default is None (creates new empty list).

  • kwargs – Additional keyword arguments.

Returns:

A list of file names for the case.

Return type:

list

integrate.integrate_io.get_discrete_classes(f_h5, im=1)

Get class IDs and class names for a discrete model parameter.

Retrieves the classification information (class IDs and class names) for a discrete model parameter from either a prior or posterior HDF5 file. This function is useful for understanding the categorical classes used in discrete parameter inversion (e.g., geological units, lithology types).

Parameters:
  • f_h5 (str) – Path to the HDF5 file. Can be a prior file (reads classes directly) or a posterior file (extracts the prior file reference first, then reads classes from the prior).

  • im (int, optional) – Model index to get classes for (e.g., 1 for M1, 2 for M2, default is 1).

Returns:

  • class_id (numpy.ndarray or list) – Array of class IDs. Empty list if the model parameter is not discrete or if class_id attribute is not set.

  • class_name (numpy.ndarray or list) – Array of class names corresponding to the class IDs. Empty list if the model parameter is not discrete or if class_name attribute is not set.

Examples

Get classes from a prior file:

>>> class_id, class_name = get_discrete_classes('PRIOR.h5', im=2)
>>> for cid, cname in zip(class_id, class_name):
...     print(f"Class {cid}: {cname}")

Get classes from a posterior file (automatically finds prior):

>>> class_id, class_name = get_discrete_classes('POST.h5', im=2)
>>> if len(class_id) > 0:
...     print(f"Found {len(class_id)} classes")

Check if parameter is discrete:

>>> class_id, class_name = get_discrete_classes('POST.h5', im=1)
>>> if len(class_id) == 0:
...     print("Model parameter M1 is continuous")
... else:
...     print(f"Model parameter M1 is discrete with {len(class_id)} classes")

Notes

The function automatically determines whether the input file is a prior or posterior file. For posterior files, it extracts the prior file reference from the file attributes and reads the class information from the prior.

Class information is stored in the prior file attributes:

  • ‘class_id’: Numeric identifiers for each class (e.g., [0, 1, 2, 3])

  • ‘class_name’: Text labels for each class (e.g., [‘Clay’, ‘Sand’, ‘Gravel’, ‘Bedrock’])

  • ‘is_discrete’: Boolean flag indicating if the parameter is discrete

If the model parameter is continuous (is_discrete=False) or if the class attributes are not set, the function returns empty lists.
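Once class_id and class_name are available, class indices (for example values extracted with key ‘Mode’) can be translated to labels with a plain lookup table. The arrays below reuse the example values from these notes; the mode array is made up.

```python
import numpy as np

class_id = np.array([0, 1, 2, 3])
class_name = np.array(["Clay", "Sand", "Gravel", "Bedrock"])

# Lookup table from numeric class ID to text label
id_to_name = dict(zip(class_id.tolist(), class_name.tolist()))

mode = np.array([1, 1, 3, 0])  # made-up 'Mode' values for four data points
labels = [id_to_name[int(c)] for c in mode]
print(labels)
```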

integrate.integrate_io.get_geometry(f_data_h5)

Extract survey geometry data from HDF5 file.

Retrieves spatial coordinates, survey line identifiers, and elevation data from an INTEGRATE data file. Automatically handles both direct data files and posterior files that reference data files.

Parameters:

f_data_h5 (str) – Path to the HDF5 file containing geometry data. Can be either a data file or posterior file (function automatically detects and uses correct file).

Returns:

  • X (numpy.ndarray) – UTM X coordinates in meters, shape (N_points,).

  • Y (numpy.ndarray) – UTM Y coordinates in meters, shape (N_points,).

  • LINE (numpy.ndarray) – Survey line identifiers, shape (N_points,).

  • ELEVATION (numpy.ndarray) – Ground surface elevation in meters, shape (N_points,).

Raises:

IOError – If the HDF5 file cannot be opened or required datasets are missing.

Examples

>>> X, Y, LINE, ELEVATION = get_geometry('data.h5')
>>> print(f"Survey covers {X.max()-X.min():.0f}m x {Y.max()-Y.min():.0f}m")

Notes

The function expects geometry data to be stored in standard INTEGRATE format:

  • ‘/UTMX’: UTM X coordinates

  • ‘/UTMY’: UTM Y coordinates

  • ‘/LINE’: Survey line numbers

  • ‘/ELEVATION’: Ground elevation

When passed a posterior file, automatically extracts the reference to the original data file from the ‘f5_data’ attribute.

integrate.integrate_io.get_gex_file_from_data(f_data_h5, id=1)

Retrieves the ‘gex’ attribute from the specified HDF5 file.

Parameters:
  • f_data_h5 (str) – The path to the HDF5 file.

  • id (int) – The ID of the dataset within the HDF5 file. Defaults to 1.

Returns:

The value of the ‘gex’ attribute if found, otherwise an empty string.

Return type:

str

integrate.integrate_io.get_number_of_data(f_data_h5, id=None, count_nan=False)

Get the number of data per location for datasets in an INTEGRATE data HDF5 file.

Returns a 2D numpy array of size (Ndataset, Ndatapoints) containing the number of valid (non-NaN) or total data values at each measurement location for each dataset.

Parameters:
  • f_data_h5 (str) – Path to the HDF5 file containing INTEGRATE data with dataset groups.

  • id (int or list of int, optional) – Dataset identifier(s) to query (e.g., 1 for D1, [1,2] for D1 and D2). If None, returns data for all datasets found in the file.

  • count_nan (bool, optional) – If False (default), counts only non-NaN values at each location. If True, counts total number of data channels regardless of NaN values.

Returns:

2D array of shape (Ndataset, Ndatapoints) where:

  • Ndataset: number of datasets

  • Ndatapoints: maximum number of data locations across all datasets

  • Values: number of valid data channels per location (or total if count_nan=True)

Return type:

numpy.ndarray

Raises:
  • FileNotFoundError – If the specified HDF5 file does not exist.

  • IOError – If the HDF5 file cannot be opened or read.

  • KeyError – If the specified dataset ID does not exist in the file.

Examples

>>> # Get non-NaN data counts for all datasets
>>> data_counts = get_number_of_data('data.h5')
>>> print(f"Shape: {data_counts.shape}")  # (3, 4000) for 3 datasets, 4000 locations
Shape: (3, 4000)
>>> # Get total data counts (including NaN) for specific dataset
>>> counts_d1 = get_number_of_data('data.h5', id=1, count_nan=True)
>>> print(f"Shape: {counts_d1.shape}")  # (1, 4000) for 1 dataset, 4000 locations
Shape: (1, 4000)

Notes

This function analyzes d_obs arrays in each dataset:

  • d_obs shape: (N_locations, N_data_per_location)

  • By default, counts non-NaN values: np.sum(~np.isnan(d_obs), axis=1)

  • With count_nan=True, returns total data channels: d_obs.shape[1] for each location

The returned 2D array allows easy comparison across datasets and locations. Missing datasets are filled with zeros in the output array.
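The two counting modes can be reproduced on a small made-up d_obs array. This only illustrates the NumPy expressions quoted in these notes for a single dataset.

```python
import numpy as np

# Three locations with three data channels each (made-up values)
d_obs = np.array([[1.0, 2.0, np.nan],
                  [np.nan, np.nan, np.nan],
                  [4.0, 5.0, 6.0]])

n_valid = np.sum(~np.isnan(d_obs), axis=1)         # count_nan=False
n_total = np.full(d_obs.shape[0], d_obs.shape[1])  # count_nan=True
print(n_valid, n_total)
```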

integrate.integrate_io.get_number_of_datasets(f_data_h5, return_ids=False)

Get the number of datasets (D1, D2, D3, etc.) in an INTEGRATE data HDF5 file.

Counts the number of dataset groups with names following the pattern ‘D1’, ‘D2’, ‘D3’, etc. in an INTEGRATE HDF5 data file. This function is useful for determining how many different data types or measurement systems are stored in a single file.

Parameters:
  • f_data_h5 (str) – Path to the HDF5 file containing INTEGRATE data with dataset groups.

  • return_ids (bool, optional) – If True, returns the list of dataset IDs instead of just the count (default is False).

Returns:

  • If return_ids=False: Number of datasets found in the file. Returns 0 if no datasets are found.

  • If return_ids=True: List of dataset IDs (e.g., [1, 2, 3] for D1, D2, D3). Returns empty list if none found.

Return type:

int or list

Raises:
  • FileNotFoundError – If the specified HDF5 file does not exist.

  • IOError – If the HDF5 file cannot be opened or read.

Examples

>>> # Get number of datasets
>>> n_datasets = get_number_of_datasets('data.h5')
>>> print(f"File contains {n_datasets} datasets")
File contains 3 datasets
>>> # Get dataset IDs
>>> dataset_ids = get_number_of_datasets('data.h5', return_ids=True)
>>> print(f"Dataset IDs: {dataset_ids}")
Dataset IDs: [1, 2, 3]

Notes

This function looks for HDF5 groups with names starting with ‘D’ followed by digits. The typical INTEGRATE data file structure includes:

  • ‘/D1/’: First dataset (e.g., high moment data)

  • ‘/D2/’: Second dataset (e.g., low moment data)

  • ‘/D3/’: Third dataset (e.g., processed data)

  • And so on…

The function only counts top-level groups that match the ‘D{number}’ pattern, ignoring other groups like geometry data (UTMX, UTMY, etc.).
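The ‘D{number}’ name matching can be sketched with a regular expression applied to a list of top-level names. The helper dataset_ids and the name list are hypothetical; in practice the names would come from the keys of the opened HDF5 file.

```python
import re

def dataset_ids(group_names):
    """Return sorted integer IDs of names matching 'D<digits>' (illustrative)."""
    ids = []
    for name in group_names:
        match = re.fullmatch(r"D(\d+)", name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)

names = ["UTMX", "UTMY", "LINE", "ELEVATION", "D1", "D2", "D10", "Data"]
ids = dataset_ids(names)
print(len(ids), ids)
```

Using fullmatch keeps names such as ‘Data’ from being counted, and converting to integers sorts D10 after D2.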

integrate.integrate_io.gex_to_stm(file_gex, **kwargs)

Convert GEX system configuration to STM files for electromagnetic modeling.

Convenience function that combines GEX file reading and STM file generation into a single operation. Handles both file paths and pre-loaded GEX dictionaries to create system transfer matrix files required for GA-AEM forward modeling.

Parameters:
  • file_gex (str or dict) – GEX system configuration. Pass a file path (str) to read and process a GEX file, or pass a pre-loaded GEX dictionary from a previous read_gex() call.

  • Nhank (int, optional) – Number of Hankel transform coefficients.

  • Nfreq (int, optional) – Number of frequencies for transform.

  • showInfo (int, optional) – Verbosity level.

Returns:

  • stm_files (list of str) – Paths to the generated STM files.

  • GEX (dict) – Processed GEX dictionary used for STM generation.

Raises:
  • TypeError – If file_gex is neither a string nor a dictionary.

  • FileNotFoundError – If file_gex is a string pointing to a non-existent file.

Notes

This function provides a streamlined workflow for electromagnetic system setup by automating the GEX→STM conversion process. The generated STM files contain system transfer functions needed for accurate forward modeling with GA-AEM.

When file_gex is a string, the function first tries read_gex() for legacy format compatibility. If that fails (e.g., Workbench format), it automatically falls back to read_gex_workbench().

When file_gex is a dictionary, it is assumed to be a valid GEX structure from a previous read_gex() or read_gex_workbench() call.

The write_stm_files() function handles the actual STM file generation with the provided or default parameters.

Examples

>>> # Direct file path (automatically detects format)
>>> stm_files, GEX = gex_to_stm('TX08_20201112.gex')
>>> # Pre-loaded GEX dictionary
>>> GEX = read_gex_workbench('TX08_20201112.gex')
>>> stm_files, _ = gex_to_stm(GEX)

integrate.integrate_io.hdf5_info(f_h5, verbose=True, load_data=False)

Get and print comprehensive information about an HDF5 file.

This function reads an HDF5 file (DATA, PRIOR, POST, or FORWARD) and prints detailed information about its contents, including datasets, dimensions, attributes, and file-type-specific metadata.

By default, only metadata (shapes, dtypes, attributes) is read for fast analysis. Set load_data=True to also compute data ranges and statistics.

Parameters:
  • f_h5 (str) – Path to the HDF5 file to analyze.

  • verbose (bool, optional) – If True, prints detailed information. If False, returns dictionary only (default is True).

  • load_data (bool, optional) – If True, loads actual data to compute ranges and statistics. If False, only reads metadata (much faster, default is False).

Returns:

info – Dictionary containing file information with keys:

  • ‘file_type’: Detected file type (‘DATA’, ‘PRIOR’, ‘POST’, ‘FORWARD’, or ‘UNKNOWN’)

  • ‘datasets’: List of dataset paths

  • ‘attributes’: Dictionary of root-level attributes

  • ‘structure’: Nested dictionary of file structure

Return type:

dict

Examples

>>> hdf5_info('PRIOR.h5')
>>> info = hdf5_info('DATA.h5', verbose=False)
>>> info = hdf5_info('POST.h5', load_data=True)  # Include data ranges

Notes

The function determines file type based on the presence of characteristic datasets:

  • DATA files: contain /UTMX, /UTMY, /ELEVATION, /LINE and /D1/, /D2/, etc.

  • PRIOR files: contain /M1, /M2, /D1, /D2 arrays

  • POST files: contain /i_use, /T, /EV attributes

  • FORWARD files: contain /method attribute

Performance: - With load_data=False (default): Very fast, only reads file metadata - With load_data=True: Slower, reads all data to compute ranges/statistics
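A rough sketch of name-based file type detection along the lines of the rules above follows. The helper detect_file_type, the order of the checks, and the choice to look for names among both datasets and attributes are assumptions; the real detection logic in hdf5_info may differ.

```python
def detect_file_type(keys, attrs):
    """Guess the file type from top-level names (hypothetical helper)."""
    names = set(keys) | set(attrs)  # assumption: check datasets and attributes alike
    if {"i_use", "T", "EV"} <= names:
        return "POST"
    if {"UTMX", "UTMY", "ELEVATION", "LINE"} <= names:
        return "DATA"
    if any(n.startswith("M") and n[1:].isdigit() for n in names):
        return "PRIOR"
    if "method" in names:
        return "FORWARD"
    return "UNKNOWN"

print(detect_file_type(["UTMX", "UTMY", "ELEVATION", "LINE", "D1"], []))
print(detect_file_type(["M1", "M2", "D1"], []))
```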

See also

load_prior

Load prior model and data

load_data

Load observational data

load_posterior

Load posterior results

integrate.integrate_io.hdf5_scan(file_path)

Scans an HDF5 file and prints information about datasets (including their size) and attributes.

Parameters:

file_path (str) – The path to the HDF5 file.

integrate.integrate_io.load_data(f_data_h5, id_arr=[], ii=None, **kwargs)

Load observational electromagnetic data from HDF5 file.

Loads observed electromagnetic measurements, uncertainties, covariance matrices, and associated metadata from structured HDF5 files. Handles multiple data types and noise models with automatic fallback for missing data components.

Parameters:
  • f_data_h5 (str) – Path to the HDF5 file containing observational electromagnetic data.

  • id_arr (list of int, optional) – Dataset identifiers to load (e.g., [1, 2] for D1 and D2). Each ID corresponds to a different measurement system or processing stage. If empty, all datasets in the file are loaded (default is []).

  • ii (array-like, optional) – Array of indices specifying which data points to load from each dataset. If provided, only len(ii) data points will be loaded from each dataset using these indices (default is None).

  • **kwargs (dict) – Additional arguments:

    showInfo : int, verbosity level (0=silent, 1=normal, >1=verbose)

Returns:

Dictionary containing loaded observational data with keys:

  • ‘noise_model’ (list of str) – Noise model type for each dataset (‘gaussian’, ‘multinomial’, etc.).

  • ‘d_obs’ (list of numpy.ndarray) – Observed data measurements, shape (N_stations, N_channels) per dataset.

  • ‘d_std’ (list of numpy.ndarray or None) – Standard deviations of observations, same shape as d_obs.

  • ‘Cd’ (list of numpy.ndarray or None) – Full covariance matrices for each dataset.

  • ‘id_arr’ (list of int) – Dataset identifiers that were successfully loaded. If the id_arr argument is empty, all datasets in the file are loaded.

  • ‘i_use’ (list of numpy.ndarray) – Data point usage indicators (1=use, 0=ignore).

  • ‘id_prior’ (list of int or numpy.ndarray) – Index of prior data type to compare against, used for cross-referencing. If ‘id_prior’ is not present in the file, it defaults to the dataset id_arr.

Return type:

dict

Notes

The function gracefully handles missing data components:

  • Missing ‘id_prior’ defaults to sequential dataset IDs (1, 2, 3, …)

  • Missing ‘i_use’ defaults to ones array (use all data points)

  • Missing ‘d_std’ and ‘Cd’ remain as None (diagonal noise assumed)

Data structure follows INTEGRATE standard format:

  • ‘/D{id}/d_obs’: observed measurements

  • ‘/D{id}/d_std’: measurement uncertainties

  • ‘/D{id}/Cd’: full covariance matrix (optional)

  • ‘/D{id}/i_use’: data usage flags (optional)

  • ‘/D{id}/id_prior’: prior dataset cross-reference IDs (optional)

Each dataset can have a different noise model specified in the ‘noise_model’ attribute, enabling mixed data types in the same file.
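The fallback behaviour for missing components can be sketched with a plain dictionary standing in for one ‘/D{id}’ group. The helper fill_defaults and the shape of the default i_use array are assumptions made for illustration, not the library's implementation.

```python
import numpy as np

def fill_defaults(group, dataset_id):
    """Apply the documented fallbacks to one dataset (hypothetical helper).

    group is a dict standing in for the HDF5 group '/D{id}'.
    """
    d_obs = np.asarray(group["d_obs"])
    return {
        "d_obs": d_obs,
        "d_std": group.get("d_std"),  # stays None when missing
        "Cd": group.get("Cd"),        # stays None when missing
        # Assumption: one flag per data location, all set to 1 (use everything)
        "i_use": group.get("i_use", np.ones(d_obs.shape[0])),
        "id_prior": group.get("id_prior", dataset_id),
    }

loaded = fill_defaults({"d_obs": np.zeros((3, 4))}, dataset_id=2)
print(loaded["d_std"], loaded["i_use"], loaded["id_prior"])
```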

integrate.integrate_io.load_prior(f_prior_h5, N_use=0, idx=[], Randomize=False, ii=None)

Load prior model parameters and data from HDF5 file.

Loads both model parameters and forward-modeled data from a prior HDF5 file, with options for sample selection, indexing, and randomization. This is a convenience function that combines model and data loading operations.

Parameters:
  • f_prior_h5 (str) – Path to the HDF5 file containing prior model realizations and data.

  • N_use (int, optional) – Number of samples to load. If 0, loads all available samples (default is 0).

  • idx (list, optional) – Specific indices to load. If empty, uses N_use or loads all samples (default is []).

  • Randomize (bool, optional) – Whether to randomize the order of loaded samples (default is False).

  • ii (array-like, optional) – Array of indices specifying which models and data to load. If provided, only len(ii) models and data will be loaded from ‘M1’, ‘M2’, … and ‘D1’, ‘D2’, … datasets using these indices (default is None).

Returns:

  • D (dict) – Dictionary containing forward-modeled data arrays, with keys corresponding to data types (e.g., ‘D1’, ‘D2’).

  • M (dict) – Dictionary containing model parameter arrays, with keys corresponding to model types (e.g., ‘M1’, ‘M2’).

  • idx (numpy.ndarray) – Array of indices corresponding to the loaded samples.

Notes

This function internally calls load_prior_data() and load_prior_model() with consistent indexing to ensure data and model correspondence. Sample selection priority: ii > explicit idx > N_use > all samples.

integrate.integrate_io.load_prior_data(f_prior_h5, id_use=[], idx=[], N_use=0, Randomize=False, **kwargs)

Load forward-modeled data arrays from prior HDF5 file.

Loads electromagnetic or other geophysical data predictions from forward modeling runs stored in the prior file. Supports selective loading by data type, sample indices, and randomization for sampling purposes.

Parameters:
  • f_prior_h5 (str) – Path to the HDF5 file containing forward-modeled data arrays.

  • id_use (list of int, optional) – Data type identifiers to load (e.g., [1, 2] for D1 and D2). If empty, loads all available data types (default is []).

  • idx (list or array-like, optional) – Specific sample indices to load. If empty, uses N_use and Randomize to determine samples (default is []).

  • N_use (int, optional) – Number of samples to load. If 0, loads all available samples. Automatically limited to available data size (default is 0).

  • Randomize (bool, optional) – Whether to randomly select samples when idx is empty. If False, uses sequential selection (default is False).

Returns:

  • D (list of numpy.ndarray) – List of forward-modeled data arrays, one for each requested data type. Each array has shape (N_samples, N_data_points).

  • idx (numpy.ndarray) – Array of sample indices that were loaded, useful for consistent indexing with corresponding model parameters.

Notes

Data arrays are stored as HDF5 datasets with keys ‘/D1’, ‘/D2’, etc., representing different data types (e.g., different measurement systems, frequencies, or processing stages). The function automatically detects available data types and loads the requested subset.

Sample selection follows the same priority as load_prior_model(): explicit idx > N_use random/sequential > all samples.

integrate.integrate_io.load_prior_model(f_prior_h5, im_use=[], idx=[], N_use=0, Randomize=False)

Load model parameter arrays from prior HDF5 file.

Loads model parameter arrays (e.g., resistivity, layer thickness, geological units) from a prior HDF5 file with flexible model selection and sample indexing options. Supports loading specific model types and sample subsets.

Parameters:
  • f_prior_h5 (str) – Path to the HDF5 file containing prior model parameter realizations.

  • im_use (list of int, optional) – Model parameter indices to load (e.g., [1, 2] for M1 and M2). If empty, loads all available model parameters (default is []).

  • idx (list or array-like, optional) – Specific sample indices to load. If empty, uses N_use and Randomize to determine samples (default is []).

  • N_use (int, optional) – Number of samples to load. If 0, loads all available samples. Ignored if idx is provided (default is 0).

  • Randomize (bool, optional) – Whether to randomly select samples when idx is empty. If False, uses sequential selection (default is False).

Returns:

  • M (list of numpy.ndarray) – List of model parameter arrays, one for each requested model type. Each array has shape (N_samples, N_model_parameters).

  • idx (numpy.ndarray) – Array of sample indices that were loaded, useful for consistent indexing across related datasets.

Notes

The function automatically detects available model parameters (M1, M2, …) and loads the requested subset. Sample selection priority follows: explicit idx > N_use random/sequential > all samples.

When idx length differs from N_use, the function uses len(idx) and issues a warning message.
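
The selection priority described above can be sketched as follows. This is a minimal illustration in plain Python, not the package's implementation: the helper name select_sample_indices, the seed argument, the sorted random subset, and the reading of "sequential" as "the first N_use samples" are all assumptions.

```python
import random


def select_sample_indices(n_total, idx=(), n_use=0, randomize=False, seed=None):
    """Hypothetical helper: explicit idx > N_use random/sequential > all."""
    idx = list(idx)
    if idx:
        # 1) Explicit indices win; N_use is ignored in this case.
        return idx
    if 0 < n_use < n_total:
        if randomize:
            # 2a) Random subset without replacement (sorting is an assumption).
            return sorted(random.Random(seed).sample(range(n_total), n_use))
        # 2b) Sequential subset, read here as the first n_use samples.
        return list(range(n_use))
    # 3) Fall back to every available sample.
    return list(range(n_total))


print(select_sample_indices(10, idx=[7, 2]))  # [7, 2]
print(select_sample_indices(10, n_use=3))     # [0, 1, 2]
```

Returning the chosen indices alongside the arrays is what lets the same idx be passed to both load_prior_model() and load_prior_data(), so that models and data stay aligned.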

integrate.integrate_io.merge_data(f_data, f_gex='', delta_line=0, f_data_merged_h5='', **kwargs)

Merge multiple data files into a single HDF5 file.

Parameters:
  • f_data (list) – List of input data files to merge.

  • f_gex (str, optional) – Path to geometry exchange file, by default ‘’.

  • delta_line (int, optional) – Line number increment for each merged file, by default 0.

  • f_data_merged_h5 (str, optional) – Output merged HDF5 file path, by default derived from f_gex.

  • kwargs – Additional keyword arguments.

Returns:

Filename of the merged HDF5 file.

Return type:

str

Raises:

ValueError – If f_data is not a list.

integrate.integrate_io.merge_posterior(f_post_h5_files, f_data_h5_files, f_post_merged_h5='', showInfo=0)

Merge multiple posterior sampling results into unified datasets.

Combines posterior results from separate electromagnetic survey areas or time periods into single merged files for comprehensive regional analysis. Handles both model parameter statistics and observational data consolidation.

Parameters:
  • f_post_h5_files (list of str) – List of paths to posterior HDF5 files containing sampling results from different survey areas or processing runs.

  • f_data_h5_files (list of str) – List of paths to corresponding observational data HDF5 files. Must have same length as f_post_h5_files with matching order.

  • f_post_merged_h5 (str, optional) – Output path for merged posterior file. If empty, generates default name based on input files (default is ‘’).

  • showInfo (int, optional) – Verbosity level for progress information (default is 0).

Returns:

Tuple (merged_posterior_path, merged_data_path), where:

  • merged_posterior_path (str) – Path to the merged posterior HDF5 file.

  • merged_data_path (str) – Path to the merged observational data HDF5 file.

Return type:

tuple

Raises:
  • ValueError – If f_data_h5_files and f_post_h5_files have different lengths.

  • FileNotFoundError – If any input files do not exist or cannot be accessed.

Notes

The merging process combines:

  • Model parameter statistics (Mean, Median, Mode, Std, Entropy)

  • Temperature and evidence fields from sampling

  • Geometry and observational data from all survey areas

  • Metadata and file references for traceability

Spatial coordinates are preserved to maintain geographic relationships between different survey areas. The merged files retain full compatibility with INTEGRATE analysis and visualization functions.

If f_post_merged_h5 is not provided, the output uses the format 'POST_merged_N{N}.h5' and the data file uses 'DATA_merged_N{N}.h5'. Posterior files must have compatible structure for merging.

integrate.integrate_io.merge_prior(f_prior_h5_files, f_prior_merged_h5='', shuffle=True, showInfo=0)

Merge multiple prior model files into a single combined HDF5 file.

Combines prior model parameters and forward-modeled data from multiple HDF5 files into a unified dataset. Creates a new model parameter (MX where X is the next available number) that tracks the source file index for each sample, enabling traceability of merged data origins.

Parameters:
  • f_prior_h5_files (list of str) – List of paths to prior HDF5 files to merge. Each file must contain compatible model parameters (M1, M2, M3, …) and data arrays (D1, D2, …).

  • f_prior_merged_h5 (str, optional) – Output path for the merged prior file. If empty, generates default name ‘PRIOR_merged_N{number_of_files}.h5’ (default is ‘’).

  • shuffle (bool, optional) – If True (default), randomly shuffle the order of realizations in the merged output. The same permutation is applied to all datasets (M1, M2, D1, D2, etc.) to maintain consistency. This is useful for ensuring realizations from different source files are well-mixed. If False, realizations are concatenated in order.

  • showInfo (int, optional) – Verbosity level for progress information. Higher values provide more detailed output (default is 0).

Returns:

Path to the merged prior HDF5 file.

Return type:

str

Raises:
  • ValueError – If f_prior_h5_files is not a list or is empty.

  • FileNotFoundError – If any input files do not exist or cannot be accessed.

Notes

The merging process:

  • Concatenates all model parameters (M1, M2, M3, …) across files

  • Concatenates all data arrays (D1, D2, D3, …) across files

  • Creates a new MX parameter (where X is the next available number) containing source file indices (1-based)

  • Optionally shuffles realizations using a consistent permutation across all arrays

  • Preserves HDF5 attributes that are identical across all input files

  • Updates metadata to reflect merged status

Shuffling Behavior: When shuffle=True (default), a random permutation is applied to all realizations:

  • A single permutation is generated and applied to ALL datasets (M1, M2, D1, D2, etc.)

  • This ensures realizations remain synchronized across all parameters

  • Uses a fixed random seed (42) for reproducibility

  • Useful for mixing realizations from different source files

  • The source file tracking parameter (MX) is also shuffled to maintain traceability

Attribute Preservation: The function copies dataset attributes from the input files to the merged file:

  • Only attributes that are identical across all input files are copied

  • This includes attributes such as class_name, class_id, is_discrete, clim, and cmap

  • Attributes for data arrays (D1, D2, …) such as method, type, and Nfreq are preserved

  • The x and z attributes receive special handling so that they match potentially padded dimensions

Source File Tracking: The new MX parameter is a DISCRETE integer array with shape (Ntotal, 1) where each value indicates which input file the corresponding sample originated from:

  • 1: samples from the first file in f_prior_h5_files

  • 2: samples from the second file in f_prior_h5_files

  • etc.

The MX parameter is marked with:

  • is_discrete = 1 (discrete parameter type)

  • shape = (Ntotal, 1) (consistent with other model parameters)

  • class_name = names derived from the input filenames

  • class_id = [1, 2, 3, …] (class identifiers)

File Compatibility: Input files can have different model parameter dimensions (e.g., different numbers of layers). Arrays with fewer parameters will be padded with NaN values to match the maximum dimensions. Data arrays should ideally have the same dimensions, but padding is applied if they differ.
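
The padding, source tracking, and shuffling described above can be sketched with NumPy on in-memory arrays. This is an illustration under stated assumptions, not the code used by merge_prior(): the helper pad_and_stack and the toy arrays are hypothetical, and the choice of np.random.default_rng(42) is an assumption (the notes only state that a fixed seed of 42 is used).

```python
import numpy as np


def pad_and_stack(arrays):
    """Stack 2D arrays row-wise, NaN-padding those with fewer columns."""
    n_cols = max(a.shape[1] for a in arrays)
    padded = []
    for a in arrays:
        out = np.full((a.shape[0], n_cols), np.nan)
        out[:, : a.shape[1]] = a
        padded.append(out)
    return np.vstack(padded)


# Two toy "files": different model widths, identical data widths.
M1_files = [np.ones((3, 2)), 2 * np.ones((2, 4))]
D1_files = [np.full((3, 5), 10.0), np.full((2, 5), 20.0)]

M1 = pad_and_stack(M1_files)
D1 = pad_and_stack(D1_files)

# 1-based source-file index per realization, shape (Ntotal, 1).
source = np.concatenate(
    [np.full((a.shape[0], 1), i + 1) for i, a in enumerate(M1_files)]
)

# One permutation, reused for every array so that rows stay synchronized.
perm = np.random.default_rng(42).permutation(M1.shape[0])
M1, D1, source = M1[perm], D1[perm], source[perm]

print(M1.shape, D1.shape, source.shape)  # (5, 4) (5, 5) (5, 1)
```

Because the same perm indexes M1, D1, and source, row k of every array still refers to the same realization after shuffling.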

Examples

>>> # Default: merge with shuffling
>>> f_files = ['prior1.h5', 'prior2.h5', 'prior3.h5']
>>> merged_file = merge_prior(f_files, 'combined_prior.h5')
>>> print(f"Merged {len(f_files)} files into {merged_file}")
>>> # Merge without shuffling (preserves original order)
>>> merged_file = merge_prior(f_files, 'combined_prior.h5', shuffle=False)
>>> # Merge with verbose output
>>> merged_file = merge_prior(f_files, 'combined_prior.h5', showInfo=1)
integrate.integrate_io.post_to_csv(f_post_h5='', Mstr='/M1')

Export posterior results to CSV format for GIS integration.

Converts posterior sampling results to CSV files containing spatial coordinates and model parameter statistics. Creates files suitable for import into GIS software or other analysis tools.

Parameters:
  • f_post_h5 (str, optional) – Path to the HDF5 file containing posterior results. If empty string, uses a default example file (default is ‘’).

  • Mstr (str, optional) – Model parameter dataset path within the HDF5 file (e.g., ‘/M1’, ‘/M2’). Specifies which model parameter to export (default is ‘/M1’).

Returns:

Path to the generated CSV file.

Return type:

str

Raises:
  • KeyError – If the specified model parameter dataset does not exist in the HDF5 file.

  • FileNotFoundError – If the specified HDF5 file does not exist or cannot be accessed.

Notes

The exported CSV file contains:

  • X, Y: UTM coordinates

  • ELEVATION: Ground surface elevation

  • Model statistics: Mean, Median, Mode, Standard deviation

  • For discrete models: probability distributions across classes

  • For continuous models: quantile values and uncertainty measures

The function automatically handles both discrete and continuous model types based on the ‘is_discrete’ attribute in the prior file. Output format is optimized for GIS applications with appropriate coordinate reference systems.

TODO: Future enhancements planned for LINE number export and separate functions for grid vs. point data export.

integrate.integrate_io.read_borehole(filename, **kwargs)

Read one or more borehole dictionaries from a JSON file.

The file must have been written by write_borehole(). If the file contains a single borehole (JSON object) a single dict is returned. If it contains many boreholes (JSON array) a list of dicts is returned.

The returned dict(s) preserve all fields that were written, including the optional elevation key (ground-surface elevation in m a.s.l.) used by plot_boreholes() for elevation-axis plotting.

Parameters:
  • filename (str) – JSON file path to read.

  • **kwargs –

    • showInfo (int, optional) – Verbosity level. Default is 0.

Returns:

Single borehole dict, or list of borehole dicts.

Return type:

dict or list of dict

Examples

>>> W = ig.read_borehole('borehole.json')
>>> print(W['name'], W['depth_top'])
>>> WELLS = ig.read_borehole('all_boreholes.json')
>>> for W in WELLS:
...     print(W['name'])
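
The single-object versus array behaviour can be illustrated with the standard json module alone. This is a sketch of the round trip, not the implementation of read_borehole() or write_borehole(); the helpers write_json and read_json and the sample values are hypothetical.

```python
import json
import os
import tempfile


def write_json(obj, filename):
    """Write a dict (one borehole) or a list of dicts (many) as JSON."""
    with open(filename, "w") as fh:
        json.dump(obj, fh, indent=2)
    return filename


def read_json(filename):
    """A JSON object comes back as a dict, a JSON array as a list."""
    with open(filename) as fh:
        return json.load(fh)


borehole = {"name": "BH1", "X": 498832.5, "Y": 6250843.1,
            "depth_top": [0, 10], "depth_bottom": [10, 20],
            "class_obs": [1, 2], "class_prob": [0.9, 0.9],
            "elevation": 42.0}

with tempfile.TemporaryDirectory() as tmp:
    single = read_json(write_json(borehole, os.path.join(tmp, "one.json")))
    many = read_json(write_json([borehole, borehole], os.path.join(tmp, "many.json")))

print(type(single).__name__, type(many).__name__)  # dict list
```

Optional keys such as elevation survive the round trip unchanged because the whole dict is serialized as-is.
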
integrate.integrate_io.read_gex(file_gex, **kwargs)

Parse GEX (Geometry Exchange) file into structured dictionary.

Reads and parses electromagnetic system configuration files in GEX format, which contain survey geometry, system parameters, waveforms, and timing information required for electromagnetic forward modeling.

Parameters:
  • file_gex (str) – Path to the GEX file containing electromagnetic system configuration.

  • Nhank (int, optional) – Number of Hankel transform abscissae for both frequency windows.

  • Nfreq (int, optional) – Number of frequencies per decade for both frequency windows.

  • Ndig (int, optional) – Number of digits for waveform digitizing frequency.

  • showInfo (int, optional) – Verbosity level (0=silent, >0=verbose, default 0).

Returns:

Dictionary containing parsed GEX file contents. Keys include 'filename' (str), 'General' (dict), section-specific parameter dicts, 'WaveformLM' (ndarray), 'WaveformHM' (ndarray), and 'GateArray' (ndarray).

Return type:

dict

Raises:

FileNotFoundError – If the specified GEX file does not exist or cannot be accessed.

Notes

GEX files use a section-based format with key=value pairs:

  • [Section] headers define parameter groups

  • Numeric values are automatically converted to numpy arrays

  • String values are preserved as text

  • Waveform and gate timing data are consolidated into arrays

The parser automatically handles:

  • Multi-point waveform definitions (WaveformLMPoint*, WaveformHMPoint*)

  • Gate timing arrays (GateTime*)

  • Numeric array conversion with space-separated values

  • Comments and formatting variations

Output dictionary structure matches INTEGRATE conventions for electromagnetic system configuration and GA-AEM compatibility.
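
The section-based parsing described above can be sketched in a few lines of plain Python. This is a simplified, hypothetical parser and not the code behind read_gex(): it stores numeric values as lists of floats rather than numpy arrays, skips any line that is neither a header nor a key=value pair, and the sample text is invented for illustration.

```python
import re


def parse_gex_text(text):
    """Parse '[Section]' headers and 'key=value' lines into nested dicts."""
    gex = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"\[(.+)\]", line)
        if header:
            section = header.group(1)
            gex[section] = {}
            continue
        if section is None or "=" not in line:
            continue  # blank lines, comments, stray text
        key, value = (part.strip() for part in line.split("=", 1))
        tokens = value.split()
        try:
            # Space-separated numbers become a list of floats.
            gex[section][key] = [float(tok) for tok in tokens] if tokens else value
        except ValueError:
            gex[section][key] = value  # non-numeric values stay as text
    return gex


def consolidate(section, prefix):
    """Collect numbered entries (prefix01, prefix02, ...) into ordered rows."""
    keys = [k for k in section if re.fullmatch(re.escape(prefix) + r"\d+", k)]
    keys.sort(key=lambda k: int(k[len(prefix):]))
    return [section[k] for k in keys]


sample = """
[General]
Description=Example system
NoGates=3
WaveformLMPoint01=-1.0e-3 0.0
WaveformLMPoint02=0.0 1.0
"""

gex = parse_gex_text(sample)
waveform_lm = consolidate(gex["General"], "WaveformLMPoint")
print(waveform_lm)  # [[-0.001, 0.0], [0.0, 1.0]]
```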

integrate.integrate_io.read_gex_workbench(file_gex, **kwargs)

Parse Seequent Workbench GEX file into structured dictionary.

Reads and parses electromagnetic system configuration files in the newer Seequent Workbench GEX format, which supports both dual-moment (LM/HM) and single-moment configurations. This function handles:

  • Dual-moment systems with GateTimeLM## and GateTimeHM## entries

  • Single-moment systems with GateTime## entries (e.g., Diamond SkyTEM)

  • WaveformLMPoint## and WaveformHMPoint## entries

Parameters:
  • file_gex (str) – Path to the GEX file containing electromagnetic system configuration.

  • **kwargs (dict) – Additional parsing parameters: - showInfo : int, verbosity level (0=silent, >0=verbose, default 0)

Returns:

Dictionary containing parsed GEX file contents with structure:

  • ‘filename’ : str, original file path

  • ‘General’ : dict, system description and general parameters

  • ‘WaveformLM’ : numpy.ndarray, low-moment waveform points

  • ‘WaveformHM’ : numpy.ndarray, high-moment waveform points (if present)

  • ‘GateArrayLM’ : numpy.ndarray, low-moment gate timing (if dual-moment)

  • ‘GateArrayHM’ : numpy.ndarray, high-moment gate timing (if dual-moment)

  • ‘GateArray’ : numpy.ndarray, gate timing (if single-moment)

Return type:

dict

Raises:

FileNotFoundError – If the specified GEX file does not exist or cannot be accessed.

Notes

This function supersedes read_gex() for newer Workbench-exported files.

Format detection:

  • Dual-moment: keys start with ‘GateTimeLM’ or ‘GateTimeHM’, followed by the gate number

  • Single-moment: keys are ‘GateTime##’ without a moment identifier
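
The format detection rule can be sketched as a small key-matching function. The name detect_moment_layout and the "unknown" fallback are assumptions made for illustration; they are not part of the INTEGRATE API.

```python
import re


def detect_moment_layout(keys):
    """Classify a GEX key set as 'dual', 'single', or 'unknown'."""
    keys = list(keys)
    if any(re.fullmatch(r"GateTime(LM|HM)\d+", k) for k in keys):
        return "dual"    # GateTimeLM## / GateTimeHM## entries present
    if any(re.fullmatch(r"GateTime\d+", k) for k in keys):
        return "single"  # plain GateTime## entries only
    return "unknown"


print(detect_moment_layout(["GateTimeLM01", "GateTimeHM01"]))  # dual
print(detect_moment_layout(["GateTime01", "GateTime02"]))      # single
```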

Examples

>>> GEX = read_gex_workbench('TX08_20201112.gex')
>>> print(GEX['General']['GateArrayLM'].shape)  # Dual-moment
(30, 3)
>>> print(GEX['General']['GateArrayHM'].shape)
(30, 3)
>>> GEX = read_gex_workbench('diamond_system.gex')
>>> print(GEX['General']['GateArray'].shape)  # Single-moment
(25, 3)
integrate.integrate_io.read_usf(file_path: str) → Dict[str, Any]

Parse Universal Sounding Format (USF) electromagnetic data file.

Reads and parses USF files containing electromagnetic survey data including measurement sweeps, timing information, and system parameters. USF is a standard format for time-domain electromagnetic data exchange.

Parameters:

file_path (str) – Path to the USF file to be parsed.

Returns:

Dictionary containing parsed USF file contents with keys:

  • ‘sweeps’ : list of dict, measurement sweep data with timing and values

  • ‘header’ : dict, file header information and metadata

  • ‘parameters’ : dict, system and acquisition parameters

  • ‘dummy_value’ : float, placeholder value for missing data

  • Additional keys for file-specific parameters and settings

Return type:

Dict[str, Any]

Notes

USF files contain structured electromagnetic data with sections for:

  • Header information (file version, date, system type)

  • Acquisition parameters (timing, frequencies, coordinates)

  • Measurement sweeps with data points and uncertainties

  • System configuration and processing parameters

The parser handles various USF format variations and automatically converts numeric data while preserving text metadata. Sweep data includes timing gates, measured values, and quality indicators.

This function is compatible with USF files from various electromagnetic systems and processing software, following standard format specifications for time-domain electromagnetic data exchange.

integrate.integrate_io.read_usf_mul(directory: str = '.', ext: str = '.usf') → List[Dict[str, Any]]

Read all USF files in a specified directory and return their observed data, relative errors, and parsed USF data structures.

Parameters:
  • directory – Path to the directory containing USF files (default: current directory)

  • ext – File extension to look for (default: “.usf”)

Returns:

Tuple containing:

  • np.ndarray – Array of observed data (d_obs) from all USF files

  • np.ndarray – Array of relative errors (d_rel_err) from all USF files

  • List[Dict[str, Any]] – List of USF data structures, each representing a single USF file

Return type:

tuple

integrate.integrate_io.save_data_gaussian(D_obs, D_std=[], d_std=[], Cd=[], id=1, id_prior=None, i_use=None, is_log=0, f_data_h5='data.h5', UTMX=None, UTMY=None, LINE=None, ELEVATION=None, delete_if_exist=False, name=None, compression=None, compression_opts=None, **kwargs)

Save observational data with Gaussian noise model to HDF5 file.

Creates HDF5 datasets for electromagnetic or other geophysical measurements assuming Gaussian-distributed uncertainties. Handles both diagonal and full covariance representations of measurement errors.

Parameters:
  • D_obs (numpy.ndarray) – Observed data measurements with shape (N_stations, N_channels). Each row represents a measurement location, each column a data channel.

  • D_std (list, optional) – Standard deviations of observed data, same shape as D_obs. If empty, computed from d_std parameter (default is []).

  • d_std (list, optional) – Default standard deviation values or multipliers for uncertainty calculation when D_std is not provided (default is []).

  • Cd (list, optional) – Full covariance matrices for measurement uncertainties. If provided, takes precedence over D_std (default is []).

  • id (int, optional) – Dataset identifier for HDF5 group naming (‘/D{id}’, default is 1).

  • id_prior (int, optional) – Prior dataset identifier to compare against during inversion. If specified, observed data in /D{id} will be compared with prior data in /D{id_prior}. If None, defaults to same ID (D1 compares with D1, D2 with D2, etc.) (default is None).

  • i_use (numpy.ndarray, optional) – Binary mask indicating which data points to use in inversion, shape (N_stations,) or (N_stations,1). Values of 1 indicate data should be used, 0 indicates data should be excluded. If None, creates array of ones (all data used by default, default is None).

  • is_log (int, optional) – Flag indicating logarithmic data scaling (0=linear, 1=log, default is 0).

  • f_data_h5 (str, optional) – Path to output HDF5 file (default is ‘data.h5’).

  • UTMX (numpy.ndarray, optional) – UTM X coordinates in meters, shape (N_stations,) or (N_stations,1). If None, creates sequential integers (default is None).

  • UTMY (numpy.ndarray, optional) – UTM Y coordinates in meters, shape (N_stations,) or (N_stations,1). If None, creates zeros array (default is None).

  • LINE (numpy.ndarray, optional) – Survey line identifiers, shape (N_stations,) or (N_stations,1). If None, creates array filled with 1s (default is None).

  • ELEVATION (numpy.ndarray, optional) – Ground surface elevation in meters, shape (N_stations,) or (N_stations,1). If None, creates zeros array (default is None).

  • delete_if_exist (bool, optional) – Whether to delete the entire HDF5 file if it exists before creating new data. Use with caution as this removes all existing data (default is False).

  • name (str, optional) – Optional name attribute to be written to the data group. If provided, this string will be stored as an attribute alongside ‘noise_model’ (default is None).

  • compression (str or None, optional) – Compression filter to use. Options: ‘gzip’, ‘lzf’, or None. If None (default), uses global DEFAULT_COMPRESSION setting. Set to False to explicitly disable compression.

  • compression_opts (int, optional) – Compression level (0-9 for gzip). If None (default), uses global DEFAULT_COMPRESSION_OPTS setting. Level 1 provides 78% faster writes than level 9 with only 2% larger files.

  • **kwargs (dict) – Additional metadata parameters: - showInfo : int, verbosity level - Other dataset attributes for electromagnetic processing

Returns:

Path to the HDF5 file where data was written.

Return type:

str

Notes

The function creates an HDF5 structure following INTEGRATE conventions:

  • ‘/D{id}/d_obs’: observed measurements

  • ‘/D{id}/d_std’: measurement standard deviations (if available)

  • ‘/D{id}/Cd’: full covariance matrix (if provided)

  • Dataset attributes include ‘noise_model’ = ‘gaussian’

Uncertainty handling priority: Cd > D_std > computed from d_std. The Gaussian noise model assumes independent, normally distributed measurement errors with specified standard deviations or covariances.

Compression settings default to module-wide DEFAULT_COMPRESSION and DEFAULT_COMPRESSION_OPTS values (gzip level 1 by default), providing 3.5x file size reduction with good performance.

Note

Additional Parameters (kwargs):

  • showInfo (int): Level of verbosity for printing information. Default is 0.

  • f_gex (str): Name of the GEX file associated with the data. Default is empty string.

Behavior:

  • If D_std is not provided, it is calculated as d_std * D_obs

  • If coordinate parameters (UTMX, UTMY, LINE, ELEVATION) are provided, uses check_data() to create/update geometry datasets

  • If coordinate parameters are not provided, creates default geometry datasets if they don’t exist

  • If a group with name ‘D{id}’ exists, it is removed before adding new data

  • Writes attributes ‘noise_model’ and ‘is_log’ to the dataset group
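
The uncertainty priority (Cd > D_std > d_std * D_obs) can be sketched with NumPy. The helper resolve_uncertainty is hypothetical, and raising ValueError when nothing is supplied is an assumption; only the ordering and the d_std * D_obs rule come from the description above.

```python
import numpy as np


def resolve_uncertainty(D_obs, D_std=(), d_std=(), Cd=()):
    """Return (dataset_name, values) following Cd > D_std > d_std * D_obs."""
    D_obs = np.asarray(D_obs, dtype=float)
    if np.size(Cd) > 0:
        # A full covariance takes precedence over everything else.
        return "Cd", np.asarray(Cd, dtype=float)
    if np.size(D_std) > 0:
        # Explicit per-datum standard deviations.
        return "d_std", np.asarray(D_std, dtype=float)
    if np.size(d_std) > 0:
        # Relative uncertainty scaled by the observations.
        return "d_std", np.asarray(d_std, dtype=float) * D_obs
    raise ValueError("provide one of Cd, D_std or d_std")  # assumed behaviour


D_obs = np.array([[1.0, 2.0], [3.0, 4.0]])
name, values = resolve_uncertainty(D_obs, d_std=0.1)
print(name, values.shape)  # d_std (2, 2)
```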

integrate.integrate_io.save_data_multinomial(D_obs, i_use=None, id=[], id_prior=None, f_data_h5='data.h5', compression=None, compression_opts=None, **kwargs)

Save observed data to an HDF5 file in a specified group with a multinomial noise model.

Parameters:
  • D_obs (numpy.ndarray) – The observed data array to be written to the file.

  • id (list, optional) – The ID of the group to write the data to. If not provided, the function will find the next available ID.

  • id_prior (int, optional) – The ID of the prior data to compare against this data. If not set, id_prior is set equal to id.

  • f_data_h5 (str, optional) – The path to the HDF5 file where the data will be written. Default is ‘data.h5’.

  • kwargs – Additional keyword arguments.

Returns:

The path to the HDF5 file where the data was written.

Return type:

str

integrate.integrate_io.save_prior_data(f_prior_h5, D_new, id=None, force_delete=False, compression='gzip', compression_opts=1, **kwargs)

Save forward-modeled data arrays to prior HDF5 file.

Saves electromagnetic or other geophysical data predictions from forward modeling to an HDF5 file with automatic data identifier assignment and data type optimization. Supports overwriting existing data arrays.

Parameters:
  • f_prior_h5 (str) – Path to the HDF5 file where forward-modeled data will be saved.

  • D_new (numpy.ndarray) – Forward-modeled data array to save. Should have shape (N_samples, N_data_points) for consistency.

  • id (int, optional) – Data identifier for the dataset key (creates ‘/D{id}’). If None, automatically assigns the next available ID (default is None).

  • force_delete (bool, optional) – Whether to delete existing data with the same identifier before saving. If False, raises error when key exists (default is False).

  • compression (str or None, optional) – Compression filter to use. Options: ‘gzip’, ‘lzf’, or None. Default is ‘gzip’ for good compression with reasonable speed. Set to None to disable compression (fastest I/O, largest files).

  • compression_opts (int, optional) – Compression level (0-9 for gzip). Default is 1 (optimal balance). Level 1 provides 78% faster writes than level 9 with only 2% larger files. Only used when compression=‘gzip’. Ignored if compression is None.

  • **kwargs (dict) – Additional arguments: - showInfo : int, verbosity level (0=silent, >0=verbose)

Returns:

id – The data identifier used for saving the data.

Return type:

int

Notes

Forward-modeled data is stored as HDF5 datasets with keys ‘/D1’, ‘/D2’, etc., representing different data types (e.g., electromagnetic frequencies, measurement systems, or processing variants).

Data type optimization is performed automatically:

  • Floating-point arrays are converted to float32 for memory efficiency

  • Integer arrays are preserved as appropriate integer types

Compression settings (default: gzip level 1):

  • Provides 3.5x file size reduction vs no compression

  • 78% faster write than gzip level 9 (old default)

  • Only 2% larger files than maximum compression

The function ensures 2D array format with shape (N_samples, N_data_points).
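
Automatic identifier assignment can be sketched as a scan over the existing dataset keys. The helper next_available_id is hypothetical, and interpreting "next available" as "largest existing number plus one" is an assumption; the package may instead fill gaps.

```python
import re


def next_available_id(keys, prefix="D"):
    """Return the next free integer suffix for keys like 'D1', '/D2', ..."""
    pattern = re.compile(rf"/?{re.escape(prefix)}(\d+)")
    used = [int(m.group(1)) for k in keys if (m := pattern.fullmatch(k))]
    return max(used, default=0) + 1


existing = ["M1", "M2", "D1"]
print(next_available_id(existing, "D"))  # 2
print(next_available_id(existing, "M"))  # 3
```

The same scan with prefix "M" covers the im argument of save_prior_model().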

integrate.integrate_io.save_prior_model(f_prior_h5, M_new, im=None, force_replace=False, delete_if_exist=False, compression='gzip', compression_opts=1, **kwargs)

Save model parameter arrays to prior HDF5 file.

Saves model parameter realizations (e.g., resistivity, layer thickness) to an HDF5 file with automatic model identifier assignment and data type optimization. Supports overwriting existing models and file management options.

Parameters:
  • f_prior_h5 (str) – Path to the HDF5 file where model data will be saved.

  • M_new (numpy.ndarray) – Model parameter array to save. Can be 1D or 2D; 1D arrays are automatically converted to column vectors.

  • im (int, optional) – Model identifier for the dataset key (creates ‘/M{im}’). If None, automatically assigns the next available ID (default is None).

  • force_replace (bool, optional) – Whether to overwrite existing model data with the same identifier. If False, raises error when key exists (default is False).

  • delete_if_exist (bool, optional) – Whether to delete the entire HDF5 file before saving. Use with caution as this removes all existing data (default is False).

  • compression (str or None, optional) – Compression filter to use. Options: ‘gzip’, ‘lzf’, or None. Default is ‘gzip’. Set to None for temporary files or fast iteration.

    • ‘gzip’: Good compression ratio, moderate speed (default)

    • ‘lzf’: Faster but lower compression ratio

    • None: No compression, fastest read/write

  • compression_opts (int, optional) – Compression level for gzip (1-9). Higher = better compression but slower. Only used when compression=‘gzip’. Default is 1.

    • 1: Fast compression, excellent balance (NEW DEFAULT, changed from 9)

    • 4: Good compression, moderate speed

    • 9: Maximum compression, very slow (OLD DEFAULT)

  • **kwargs (dict) – Additional arguments: - showInfo : int, verbosity level (0=silent, >0=verbose)

Returns:

im – The model identifier used for saving the data.

Return type:

int

Notes

Model data is stored as HDF5 datasets with keys ‘/M1’, ‘/M2’, etc. Data type optimization is performed automatically: - Floating-point arrays are converted to float32 for memory efficiency - Integer arrays are preserved as appropriate integer types

Compression settings (based on performance tests with N=50000):

  • compression=None: Fastest (baseline), but 3.6x larger files

  • compression=‘gzip’, compression_opts=1: OPTIMAL, 78% faster than level 9, only 2% larger (NEW DEFAULT)

  • compression=‘gzip’, compression_opts=4: 71% faster than level 9, only 0.5% larger

  • compression=‘gzip’, compression_opts=9: Maximum compression, very slow (diminishing returns)

Recommendation: The new default (gzip level 1) provides the best balance:

  • 3.5x file size reduction vs no compression

  • 78% faster write than the old default (level 9)

  • Only 2% larger files than maximum compression

For temporary files or rapid iteration, use compression=None. For maximum compression (archival), use compression_opts=9.

The function ensures 2D array format with shape (N_samples, N_parameters) where 1D arrays are converted to column vectors.
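
The shape and dtype handling described in these notes can be sketched with NumPy. The helper normalize_for_storage is hypothetical and only mirrors the documented rules (1D to column vector, floats to float32, integers kept); it does not write any file.

```python
import numpy as np


def normalize_for_storage(M_new):
    """Return a 2D array, with floating-point data cast to float32."""
    M = np.asarray(M_new)
    if M.ndim == 1:
        M = M[:, np.newaxis]          # 1D -> column vector (N_samples, 1)
    if np.issubdtype(M.dtype, np.floating):
        M = M.astype(np.float32)      # halve storage relative to float64
    return M                          # integer arrays keep their dtype


M = normalize_for_storage(np.linspace(1.0, 100.0, 5))
print(M.shape, M.dtype)  # (5, 1) float32

# With h5py, the array could then be written with the documented defaults:
#   f.create_dataset('/M1', data=M, compression='gzip', compression_opts=1)
```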

integrate.integrate_io.test_read_usf(file_path: str) → None

Test function to read a USF file and print some key values.

Parameters:

file_path – Path to the USF file

integrate.integrate_io.write_borehole(W, filename, **kwargs)

Write one or more borehole dictionaries to a JSON file.

Parameters:
  • W (dict or list of dict) –

    A single borehole dict, or a list of borehole dicts. Each dict may contain any combination of the standard borehole fields:

    • name (str) – identifier

    • X, Y (float) – UTM coordinates

    • depth_top, depth_bottom (list of float) – interval boundaries (m)

    • class_obs (list of int) – observed lithology class per interval

    • class_prob (list of float) – confidence per interval (0–1)

    • method (str, optional) – likelihood method (default 'mode_probability')

    • elevation (float, optional) – ground-surface elevation (m a.s.l.). Used only by plot_boreholes() to place the well on a shared elevation axis. Has no effect on inversion.

    numpy arrays and scalars are automatically converted to plain Python lists/numbers so the file is human-readable JSON.

  • filename (str) – Output JSON file path (e.g. 'borehole.json').

  • **kwargs –

    • showInfo (int, optional) – Verbosity level. Default is 0.

Returns:

The filename written.

Return type:

str

Examples

>>> W = {'name': 'BH1', 'X': 498832.5, 'Y': 6250843.1,
...      'depth_top': [0, 10], 'depth_bottom': [10, 20],
...      'class_obs': [1, 2], 'class_prob': [0.9, 0.9]}
>>> ig.write_borehole(W, 'borehole.json')
'borehole.json'
>>> WELLS = [W1, W2, W3]
>>> ig.write_borehole(WELLS, 'all_boreholes.json')
'all_boreholes.json'
integrate.integrate_io.write_data_gaussian(*args, **kwargs)

[DEPRECATED] Use save_data_gaussian() instead.

This function has been renamed to save_data_gaussian() to maintain consistency with the HDF5 I/O naming convention (load_* / save_* for HDF5 operations).

The write_data_gaussian() function will be removed in a future version. Please update your code to use save_data_gaussian() instead.

See also

save_data_gaussian

The new function name for this functionality

integrate.integrate_io.write_data_multinomial(*args, **kwargs)

[DEPRECATED] Use save_data_multinomial() instead.

This function has been renamed to save_data_multinomial() to maintain consistency with the HDF5 I/O naming convention (load_* / save_* for HDF5 operations).

The write_data_multinomial() function will be removed in a future version. Please update your code to use save_data_multinomial() instead.

See also

save_data_multinomial

The new function name for this functionality

integrate.integrate_io.write_stm_files(GEX, **kwargs)

Generate STM (System Transfer Matrix) files from GEX system configuration.

Creates system transfer matrix files required for electromagnetic forward modeling using GA-AEM. Processes both high-moment (HM) and low-moment (LM) configurations with customizable frequency content and Hankel transform parameters.

Parameters:
  • GEX (dict) – Dictionary containing GEX system configuration data with keys: - ‘General’: System description and waveform information - Waveform and timing parameters for electromagnetic modeling

  • **kwargs (dict) – Additional configuration parameters:

    • Nhank (int) – number of Hankel transform coefficients (default 280)

    • Nfreq (int) – number of frequencies for transform (default 12)

    • Ndig (int) – number of digital filters (default 7)

    • showInfo (int) – verbosity level (0=silent, >0=verbose)

    • WindowWeightingScheme (str) – weighting scheme (‘AreaUnderCurve’, ‘BoxCar’)

    • NumAbsHM (int) – number of abscissae for high moment (default Nhank)

    • NumAbsLM (int) – number of abscissae for low moment (default Nhank)

    • NumFreqHM (int) – number of frequencies for high moment (default Nfreq)

    • NumFreqLM (int) – number of frequencies for low moment (default Nfreq)

Returns:

List of file paths for the generated STM files (typically HM and LM variants).

Return type:

list of str

Notes

STM files contain system transfer functions that describe the electromagnetic system response characteristics needed for accurate forward modeling. The function generates separate files for high-moment and low-moment configurations when applicable.

The generated STM files follow GA-AEM format specifications and include:

  • Frequency domain transfer functions

  • Hankel transform coefficients

  • Digital filter parameters

  • System timing and waveform information

File naming convention follows: {system_description}_{moment_type}.stm
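
The way the per-moment options fall back to Nhank and Nfreq can be sketched as a small keyword resolver. The function resolve_stm_options is hypothetical; only the default values and fallback rules are taken from the parameter list above, and WindowWeightingScheme is left out because its default is not stated.

```python
def resolve_stm_options(**kwargs):
    """Resolve documented defaults; per-moment values fall back to Nhank/Nfreq."""
    nhank = kwargs.get("Nhank", 280)
    nfreq = kwargs.get("Nfreq", 12)
    return {
        "Nhank": nhank,
        "Nfreq": nfreq,
        "Ndig": kwargs.get("Ndig", 7),
        "NumAbsHM": kwargs.get("NumAbsHM", nhank),
        "NumAbsLM": kwargs.get("NumAbsLM", nhank),
        "NumFreqHM": kwargs.get("NumFreqHM", nfreq),
        "NumFreqLM": kwargs.get("NumFreqLM", nfreq),
    }


opts = resolve_stm_options(Nhank=140, NumFreqLM=6)
print(opts["NumAbsHM"], opts["NumFreqHM"], opts["NumFreqLM"])  # 140 12 6
```

Setting Nhank or Nfreq therefore changes both moments at once, while the NumAbs*/NumFreq* keys override a single moment.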

integrate.integrate_io.xyz_to_h5(file_xyz, file_gex, f_data_h5=None, i_lm_skip=None, i_hm_skip=None, nan_value=None, showInfo=0, disregardFullNan=True)

Convert Aarhus Workbench XYZ export file(s) to an INTEGRATE HDF5 data file.

Reads one or more tTEM/SkyTEM XYZ files exported from Aarhus Workbench and writes a Gaussian-noise HDF5 data file suitable for use with integrate_rejection(). The GEX file is used to determine which initial gates to skip per channel (RemoveInitialGates) and the total gate count (NoGates).

Parameters:
  • file_xyz (str or list of str) – Path(s) to Aarhus Workbench XYZ export file(s). Multiple files are concatenated in order (e.g. several flight days sharing one GEX).

  • file_gex (str) – Path to the GEX file describing the EM system configuration. Gate selection and channel count are read from this file.

  • f_data_h5 (str, optional) – Output HDF5 file path. If None, derived by joining the XYZ basename(s) with '_' and appending '.h5'.

  • i_lm_skip (list of int, optional) – Workbench LM gate numbers to exclude from inversion (1-indexed, same numbering as in the XYZ file header, e.g. DBDT_Ch1GT3 = gate 3). Gates already removed by RemoveInitialGates in the GEX are silently ignored. Excluded gates have their d_obs set to NaN and d_std set to 100.

  • i_hm_skip (list of int, optional) – Same as i_lm_skip but for HM (channel 2) Workbench gate numbers.

  • nan_value (float or None, optional) – Value used as a missing-data sentinel in the XYZ file. If None (default), the value is read from the XYZ file header (/DUMMY field, via model_info['dummy']); falls back to 9999 if the header field is absent. Pass an explicit value to override the header (e.g. nan_value=9999).

  • showInfo (int, optional) – Verbosity level (default 0). -1: suppress all output. 0: minimal, print only the output file summary (“Adding group …”). >=1: verbose, also print each XYZ file as it is read.

  • disregardFullNan (bool, optional) – If True (default), soundings where all gates are NaN are excluded from the output HDF5 file.

Returns:

Path to the written HDF5 file.

Return type:

str

Notes

  • d_std in the XYZ file is a relative (fractional) uncertainty; absolute d_std is computed as relative_std * d_obs.

  • Geometry (UTMX, UTMY, LINE, ELEVATION) is taken from channel-1 rows.

  • The /D1/i_lm and /D1/i_hm datasets store the 0-indexed gate arrays that were used (mirrors the MATLAB output).

  • The GEX file path is stored as the gex attribute on /D1.

  • Requires libaarhusxyz (pip install libaarhusxyz).
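The noise conversion in the first note and the gate-skipping behaviour described for i_lm_skip and i_hm_skip can be illustrated with a minimal sketch. It uses plain Python lists for a single sounding; to_absolute_std and skip_gates are hypothetical helpers, the numbers are made up, and the actual implementation may differ.

```python
import math


def to_absolute_std(d_obs, rel_std):
    """Absolute std = relative_std * d_obs, following the documented formula."""
    return [r * d for r, d in zip(rel_std, d_obs)]


def skip_gates(d_obs, d_std, gate_numbers, skip):
    """Return copies with skipped gates set to d_obs = NaN and d_std = 100.

    gate_numbers holds the 1-indexed Workbench gate number of each column.
    Gate numbers in `skip` that are not present (for example because they
    were already removed by RemoveInitialGates) are silently ignored.
    """
    skip = set(skip or [])
    d_obs_out, d_std_out = list(d_obs), list(d_std)
    for j, gate in enumerate(gate_numbers):
        if gate in skip:
            d_obs_out[j] = math.nan
            d_std_out[j] = 100.0
    return d_obs_out, d_std_out


# Assumed example: gates 1 and 2 were already dropped via the GEX,
# so the remaining columns are Workbench gates 3 to 6.
gate_numbers = [3, 4, 5, 6]
d_obs = [1.0e-9, 5.0e-10, 2.0e-10, 1.0e-10]
rel_std = [0.03, 0.03, 0.05, 0.05]

d_std = to_absolute_std(d_obs, rel_std)
# Gate 2 is no longer present and is ignored; gate 3 is masked.
d_obs_m, d_std_m = skip_gates(d_obs, d_std, gate_numbers, skip=[2, 3])
```

Masked gates stay in the arrays, so the number of columns is unchanged and only the data values and uncertainties mark them as excluded.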

Examples

Single file:

>>> f = xyz_to_h5('tTEM_20230727_AVG_export.xyz',
...               'TX07_20230731_2x4_RC20-33.gex')

Multiple files merged into one HDF5:

>>> f = xyz_to_h5(
...     ['tTEM_20230727_AVG_export.xyz', 'tTEM_20230814_AVG_export.xyz'],
...     'TX07_20230731_2x4_RC20-33.gex'
... )

Skip Workbench LM gates 2, 3 and HM gates 27-30:

>>> f = xyz_to_h5('data.xyz', 'system.gex',
...               i_lm_skip=[2, 3], i_hm_skip=[27, 28, 29, 30])
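
How f_data_h5 and nan_value resolve when left as None can be sketched as below. This is a hedged illustration of the documented behaviour: default_h5_name and resolve_nan_value are hypothetical helpers, the sketch assumes the file extension is stripped from each basename before joining, and it returns a bare file name because the output directory is not specified here.

```python
import os


def default_h5_name(file_xyz):
    """Hypothetical helper: join the XYZ basenames with '_' and append '.h5'."""
    files = [file_xyz] if isinstance(file_xyz, str) else list(file_xyz)
    # Assumption: the '.xyz' extension is dropped before joining.
    stems = [os.path.splitext(os.path.basename(f))[0] for f in files]
    return "_".join(stems) + ".h5"


def resolve_nan_value(nan_value, model_info):
    """Hypothetical helper: explicit value, else header 'dummy', else 9999."""
    if nan_value is not None:
        return nan_value
    return model_info.get("dummy", 9999)


single = default_h5_name("tTEM_20230727_AVG_export.xyz")
merged = default_h5_name(
    ["tTEM_20230727_AVG_export.xyz", "tTEM_20230814_AVG_export.xyz"]
)

from_header = resolve_nan_value(None, {"dummy": -9999.0})
fallback = resolve_nan_value(None, {})
explicit = resolve_nan_value(9999, {"dummy": -9999.0})
```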