integrate_rejection¶
Functions for rejection sampling and Bayesian inversion in the INTEGRATE module.
Rejection Sampling Module for INTEGRATE
This module contains functions for Bayesian inversion using rejection sampling methodology. It includes the main rejection sampling algorithm, likelihood calculations, and parallel processing support for efficient posterior sampling.
Key Functions:
- integrate_rejection(): Main rejection sampling function
- integrate_rejection_range(): Core rejection sampling for data point ranges
- likelihood_*(): Various likelihood calculation functions
- Shared memory functions for parallel processing
- integrate.integrate_rejection.cleanup_shared_memory(shm_objects)¶
Clean up shared memory segments.
This function properly closes and unlinks shared memory objects created during parallel processing to prevent memory leaks and system resource exhaustion. It handles cleanup gracefully by catching and ignoring errors for objects that may have already been cleaned up.
- Parameters:
shm_objects (list) – List of shared memory objects to clean up.
- Return type:
None
Notes
This function should always be called in a finally block or similar error-safe context to ensure cleanup occurs even if exceptions are raised.
Each shared memory object is both closed (to release the local reference) and unlinked (to remove it from the system). Errors during cleanup are silently ignored to prevent cascading failures.
Examples
>>> shared_memories, shm_objects = create_shared_memory(arrays)
>>> try:
...     # Use shared memory
...     pass
... finally:
...     cleanup_shared_memory(shm_objects)
- integrate.integrate_rejection.compute_hypothesis_probability(f_post_h5_list, **kwargs)¶
Compute hypothesis probabilities from evidence values in posterior files.
This function reads evidence (EV) values from multiple posterior HDF5 files, each representing a different hypothesis/prior model, and computes the probability of each hypothesis at each data point using Bayesian model averaging.
The probability is computed using Bayes’ theorem for model selection with the assumption of equal prior probabilities for all hypotheses.
- Parameters:
f_post_h5_list (list of str) – List of paths to posterior HDF5 files, one for each hypothesis. Each file must contain an ‘/EV’ dataset with log-evidence values (natural log).
showInfo (int, optional) – Level of verbosity for output. Default is 0.
- Returns:
P (ndarray, shape (n_data_points, n_hypotheses)) – Probability of each hypothesis at each data point. P[i, j] is the probability of hypothesis j at data point i. Each row sums to 1.0 (within numerical precision).
mode (ndarray, shape (n_data_points,)) – Index of the most probable hypothesis for each data point. Values are 0-based indices in range [0, n_hypotheses-1].
entropy_values (ndarray, shape (n_data_points,)) – Entropy (uncertainty measure) for each data point, computed with logarithm base n_hypotheses so that values lie in [0, 1]. 0 = certain (one hypothesis has probability 1); higher values = more uncertain.
Notes
The probability is computed using Bayes’ theorem for model selection:
P(hypothesis_i | data) = P(data | hypothesis_i) * P(hypothesis_i) / P(data)
Where P(data | hypothesis_i) is the marginal likelihood (evidence), stored as log-evidence (EV, natural log) in the HDF5 files. Assuming equal prior probabilities for all hypotheses:
P(hypothesis_i | data) = exp(EV_i) / sum_j(exp(EV_j))
For numerical stability, the log-sum-exp trick is used:
P(hypothesis_i | data) = exp(EV_i - log_sum_exp(all EVs))
where log_sum_exp is computed using np.logaddexp.reduce() for arbitrary number of hypotheses.
The evidence (EV) values are stored as natural logarithms (ln, not log10) as computed by integrate_rejection_range().
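The log-sum-exp normalization described above can be sketched as follows. This is a minimal, self-contained illustration of the computation (not the module's exact code); the function name `hypothesis_probs_from_log_evidence` is hypothetical.

```python
import numpy as np

def hypothesis_probs_from_log_evidence(EV):
    """Convert per-hypothesis log-evidence values (natural log) into
    normalized probabilities using the log-sum-exp trick, assuming
    equal prior probabilities for all hypotheses."""
    EV = np.asarray(EV, dtype=float)          # shape (n_data_points, n_hypotheses)
    lse = np.logaddexp.reduce(EV, axis=1, keepdims=True)  # log sum_j exp(EV_j)
    P = np.exp(EV - lse)                      # stable: all exponents are <= 0
    mode = np.argmax(P, axis=1)               # most probable hypothesis per point
    n_hyp = EV.shape[1]
    # entropy in base n_hypotheses, so values lie in [0, 1]
    H = -np.sum(P * np.log(np.where(P > 0, P, 1.0)), axis=1) / np.log(n_hyp)
    return P, mode, H
```

Subtracting the log-sum-exp before exponentiating avoids overflow when evidence values are large in magnitude, which is why `np.logaddexp.reduce` is used rather than summing `exp(EV)` directly.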
Examples
>>> import integrate as ig
>>> # Create three posterior files from different prior models
>>> f_post_list = ['post_valley.h5', 'post_standard.h5', 'post_merged.h5']
>>> P, mode, entropy = ig.compute_hypothesis_probability(f_post_list)
>>> print(P.shape)  # (n_data_points, 3)
>>> print(P[0])  # Probabilities for first data point: [0.3, 0.5, 0.2]
>>> print(np.sum(P[0]))  # Should be 1.0
>>> print(mode[0])  # Most probable hypothesis index: 1 (0-based)
>>> print(entropy[0])  # Uncertainty measure: ~0.96
>>> # For two hypotheses (e.g., valley vs standard lithology)
>>> f_post_list = ['post_valley.h5', 'post_standard.h5']
>>> P, mode, entropy = ig.compute_hypothesis_probability(f_post_list, showInfo=1)
>>> P_valley = P[:, 0]  # Probability of valley hypothesis
>>> P_standard = P[:, 1]  # Probability of standard hypothesis
>>> most_probable_hypothesis = mode  # Index of most probable hypothesis per data point
>>> uncertainty = entropy  # Entropy values indicating uncertainty
See also
integrate_rejection : Main rejection sampling function that creates posterior files
integrate_rejection_range : Core rejection sampling that computes EV values
- integrate.integrate_rejection.create_shared_memory(arrays)¶
Create shared memory segments for arrays.
This function creates shared memory segments for a list of numpy arrays, allowing them to be accessed efficiently across multiple processes without copying data. Returns both memory references and objects for cleanup.
- Parameters:
arrays (list) – List of numpy arrays to place in shared memory.
- Returns:
shared_memories (list) – List of (name, shape, dtype) tuples identifying shared memory segments.
shm_objects (list) – List of SharedMemory objects for cleanup.
Notes
The returned shm_objects must be cleaned up using cleanup_shared_memory() to prevent memory leaks. This should be done in a finally block.
If an error occurs during creation, any successfully created memory segments are automatically cleaned up before raising the exception.
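The create/cleanup pattern can be sketched with the standard-library `multiprocessing.shared_memory` module. This is an illustrative sketch of the pattern described above, not the module's exact implementation; the function name `create_shared_memory_sketch` is hypothetical.

```python
import numpy as np
from multiprocessing import shared_memory

def create_shared_memory_sketch(arrays):
    """Place each numpy array in a shared memory segment and return
    (name, shape, dtype) references plus the SharedMemory objects
    that must later be closed and unlinked."""
    shared_memories, shm_objects = [], []
    try:
        for a in arrays:
            shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
            view = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)
            view[:] = a                                   # copy data into the segment
            shared_memories.append((shm.name, a.shape, a.dtype))
            shm_objects.append(shm)
    except Exception:
        for shm in shm_objects:                           # roll back partial state
            shm.close()
            shm.unlink()
        raise
    return shared_memories, shm_objects
```

Workers can then attach to a segment by name without copying the data, which is the point of passing `(name, shape, dtype)` tuples rather than the arrays themselves.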
- integrate.integrate_rejection.integrate_posterior_chunk(args)¶
Process a single chunk of data points for parallel rejection sampling.
This function is called by each worker process in the parallel processing pool. It reconstructs shared data arrays, processes the assigned chunk of data points using rejection sampling, and returns results for aggregation by the main process.
- Parameters:
args (tuple) – Packed arguments: (i_chunk, ip_chunks, DATA, idx, N_use, id_use, shared_memory_refs, autoT, T_base, nr, use_N_best, T_N_above, T_P_acc_level). See integrate_rejection_range for descriptions of individual fields.
- Returns:
i_use (ndarray, shape (nump, nr)) – Indices of accepted posterior samples for the chunk.
T (ndarray, shape (nump,)) – Temperature values used for the chunk.
EV (ndarray, shape (nump,)) – Evidence values for the chunk.
EV_post (ndarray, shape (nump,)) – Posterior evidence values for the chunk.
N_UNIQUE (ndarray, shape (nump,)) – Number of unique samples for the chunk.
ip_range (ndarray) – Range of data point indices that were processed.
Notes
This function runs in a separate process and communicates with the main process through shared memory for data arrays and return values through the process pool.
The function first reconstructs the shared data arrays from memory references, then calls integrate_rejection_range to perform the actual rejection sampling on the assigned chunk of data points.
- integrate.integrate_rejection.integrate_posterior_main(ip_chunks, D, DATA, idx, N_use, id_use, autoT, T_base, nr, Ncpu, use_N_best, T_N_above=10, T_P_acc_level=0.2)¶
Coordinate parallel processing of posterior sampling across multiple chunks.
This function manages the parallel execution of rejection sampling by distributing data point chunks across multiple CPU cores. It handles shared memory management for efficient data transfer between processes and aggregates results from all chunks.
- Parameters:
ip_chunks (list) – List of data point index chunks for parallel processing.
D (list) – List of forward modeled data arrays shared across processes.
DATA (dict) – Dictionary containing observed data structures.
idx (list) – Indices of prior samples to use for inversion.
N_use (int) – Maximum number of prior samples per chunk.
id_use (list) – List of data identifiers for likelihood calculation.
autoT (int) – Automatic temperature estimation flag.
T_base (float) – Base temperature for rejection sampling.
nr (int) – Number of posterior samples to retain per data point.
Ncpu (int) – Number of CPU cores to use for parallel processing.
use_N_best (int) – Flag to use only the N best-fitting samples.
T_N_above (int, optional) – Passed through to integrate_rejection_range / logl_T_est. Default is 10.
T_P_acc_level (float, optional) – Passed through to integrate_rejection_range / logl_T_est. Default is 0.2.
- Returns:
i_use_all (ndarray, shape (Ndp, nr)) – Indices of accepted posterior samples for all data points.
T_all (ndarray, shape (Ndp,)) – Temperature values used for all data points.
EV_all (ndarray, shape (Ndp,)) – Evidence values for all data points.
EV_post_all (ndarray, shape (Ndp,)) – Posterior evidence values for all data points.
N_UNIQUE_all (ndarray, shape (Ndp,)) – Number of unique samples for all data points.
Notes
This function uses shared memory to minimize data copying overhead during parallel processing. Shared memory is automatically cleaned up after processing completion.
The function creates a process pool with the specified number of CPUs and distributes the work chunks across them. Each worker process operates on a subset of data points and returns its results, which are then aggregated by the main process.
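The scatter/gather pattern described above (split indices into chunks, process each chunk, place results back by index) can be sketched as follows. The names `square_chunk` and `run_chunks` are hypothetical stand-ins; the real worker is `integrate_posterior_chunk` operating on shared-memory data.

```python
import numpy as np

def square_chunk(ip_range):
    """Stand-in worker: return (result, indices) for one chunk of data points."""
    ip = np.asarray(ip_range)
    return ip ** 2, ip

def run_chunks(n_points, n_chunks, mapper=map):
    """Sketch of the chunk/scatter/gather pattern: pass, e.g.,
    multiprocessing.Pool(Ncpu).map as `mapper` to run chunks in parallel."""
    ip_chunks = np.array_split(np.arange(n_points), n_chunks)  # split indices
    out = np.empty(n_points, dtype=np.int64)
    for res, ip in mapper(square_chunk, ip_chunks):            # gather results
        out[ip] = res                                          # place by original index
    return out
```

Because each worker returns the index range it processed, the main process can write results into the full-size output arrays regardless of the order in which chunks complete.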
- integrate.integrate_rejection.integrate_rejection(f_prior_h5='prior.h5', f_data_h5='DAUGAAD_AVG_inout.h5', f_post_h5='', N_use=100000000000, id_use=[], ip_range=[], nr=400, autoT=1, T_base=1, Nchunks=0, Ncpu=0, parallel=True, use_N_best=0, T_N_above=None, T_P_acc_level=None, progress_callback=None, console_progress=None, **kwargs)¶
Perform probabilistic inversion using rejection sampling.
This is the main function for Bayesian inversion using rejection sampling methodology. It samples the posterior distribution by rejecting prior samples that are inconsistent with observed data within a temperature-controlled tolerance. Supports parallel processing and automatic temperature estimation for efficient sampling.
- Parameters:
f_prior_h5 (str, optional) – Path to HDF5 file containing prior model and data samples. Default is ‘prior.h5’.
f_data_h5 (str, optional) – Path to HDF5 file containing observed data for inversion. Default is ‘DAUGAAD_AVG_inout.h5’.
f_post_h5 (str, optional) – Output path for posterior samples. If empty, auto-generated from prior filename. Default is empty string.
N_use (int, optional) – Maximum number of prior samples to use for inversion. Default is 100000000000.
id_use (list, optional) – List of data identifiers to use for inversion. If empty, uses all available data. Default is empty list.
ip_range (list, optional) – List of data point indices to invert. If empty, inverts all data points. Default is empty list.
nr (int, optional) – Number of posterior samples to retain per data point. Default is 400.
autoT (int, optional) – Automatic temperature estimation method (1=enabled, 0=disabled). Default is 1.
T_base (float, optional) – Base temperature for rejection sampling when autoT=0. Default is 1.
Nchunks (int, optional) – Number of chunks for parallel processing. If 0, auto-determined. Default is 0.
Ncpu (int, optional) – Number of CPU cores to use. If 0, auto-determined from system. Default is 0.
parallel (bool, optional) – Enable parallel processing if environment supports it. Default is True.
use_N_best (int, optional) – Use only the N best-fitting samples (0=disabled). Default is 0.
T_N_above (int, optional) – Number of top samples used by logl_T_est to estimate the annealing temperature. Passed as N_above to logl_T_est(). Default is None (uses the logl_T_est default of 10).
T_P_acc_level (float, optional) – Target acceptance probability level used by logl_T_est to estimate the annealing temperature. Passed as P_acc_lev to logl_T_est(). Default is None (uses the logl_T_est default of 0.2).
progress_callback (callable, optional) – Callback function for progress updates. Called as progress_callback(current, total, info_dict). Default is None (no callback).
console_progress (bool, optional) – Whether to show console TQDM progress bar. If None, auto-detects based on progress_callback. Default is None.
**kwargs (dict) – Additional keyword arguments including showInfo, updatePostStat, post_dir.
- Returns:
Path to the output HDF5 file containing posterior samples and statistics.
- Return type:
str
Notes
The function automatically determines optimal processing parameters based on data size and system capabilities. Temperature annealing is used to improve sampling efficiency.
Large datasets may require significant memory and processing time. Monitor system resources during execution.
Examples
>>> import integrate as ig
>>> f_post = ig.integrate_rejection('prior.h5', 'data.h5', N_use=10000)
>>> print(f"Results saved to: {f_post}")
- integrate.integrate_rejection.integrate_rejection_range(D, DATA, idx=[], N_use=None, id_use=[], ip_range=[], nr=400, autoT=1, T_base=1, T_N_above=10, T_P_acc_level=0.2, progress_callback=None, **kwargs)¶
Perform rejection sampling for a specific range of data points.
This function implements the core rejection sampling algorithm for a subset of data points. It evaluates likelihood for each data point in the range and accepts/rejects prior samples based on temperature-controlled criteria. Used internally by integrate_rejection for both serial and parallel processing.
- Parameters:
D (list) – List of forward modeled data arrays for each data type.
DATA (dict) – Dictionary containing observed data including ‘d_obs’, ‘d_std’, and other data arrays.
idx (list, optional) – Indices of prior samples to use. If empty, uses sequential indexing. Default is empty list.
N_use (int, optional) – Maximum number of prior samples to evaluate. Default is None.
id_use (list, optional) – List of data identifiers to use for likelihood calculation. Default is [], which uses all available data types.
ip_range (list, optional) – Range of data point indices to process. If empty, processes all data points. Default is empty list.
nr (int, optional) – Number of posterior samples to retain per data point. Default is 400.
autoT (int, optional) – Automatic temperature estimation method (1=enabled, 0=disabled). Default is 1.
T_base (float, optional) – Base temperature for rejection sampling when autoT=0. Default is 1.
T_N_above (int, optional) – Number of top samples used by logl_T_est to estimate the annealing temperature. Passed as N_above to logl_T_est(). Default is 10.
T_P_acc_level (float, optional) – Target acceptance probability level used by logl_T_est to estimate the annealing temperature. Passed as P_acc_lev to logl_T_est(). Default is 0.2.
progress_callback (callable, optional) – Optional callback function for progress updates. Called with (current, total). Default is None (no callbacks).
**kwargs (dict) – Additional arguments including useRandomData, showInfo, use_N_best.
- Returns:
i_use_all (ndarray, shape (nump, nr)) – Indices of accepted posterior samples for each data point.
T_all (ndarray, shape (nump,)) – Temperature values used for each data point.
EV_all (ndarray, shape (nump,)) – Evidence values for each data point.
EV_post_all (ndarray, shape (nump,)) – Posterior evidence values for each data point.
N_UNIQUE_all (ndarray, shape (nump,)) – Number of unique samples for each data point.
ip_range (ndarray) – Range of data point indices that were processed.
Notes
This function is the computational core of the rejection sampling algorithm. It handles temperature annealing and likelihood evaluation for efficient sampling.
The algorithm evaluates the likelihood of observed data given forward modeled data for each prior sample, then uses temperature-controlled acceptance criteria to select posterior samples that are consistent with observations.
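The temperature-controlled acceptance rule can be sketched as below. This is a minimal illustration of the accept/reject step only; the actual function additionally estimates the temperature (via logl_T_est), computes evidence values, and tracks unique samples. The function name `rejection_sample_indices` is hypothetical.

```python
import numpy as np

def rejection_sample_indices(logL, T, nr, rng):
    """Accept prior sample i with probability exp((logL[i] - max(logL)) / T);
    draw repeatedly until nr samples are accepted. Higher T flattens the
    likelihood and accepts more diverse samples."""
    P_acc = np.exp((logL - logL.max()) / T)   # tempered acceptance probability, <= 1
    i_use = np.empty(nr, dtype=int)
    n_acc = 0
    while n_acc < nr:
        i = rng.integers(len(logL))           # propose a random prior sample
        if rng.random() < P_acc[i]:           # accept with tempered probability
            i_use[n_acc] = i
            n_acc += 1
    return i_use
```

Normalizing by `max(logL)` guarantees the best-fitting sample is always accepted, so the loop terminates even when most likelihoods underflow to zero acceptance probability.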
- integrate.integrate_rejection.likelihood_gaussian_diagonal(D, d_obs, d_std, N_app=0)¶
Compute the Gaussian likelihood for a diagonal covariance matrix.
This function calculates the likelihood of observed data given a set of predicted data and standard deviations, assuming a Gaussian distribution with a diagonal covariance matrix.
- Parameters:
D (ndarray, shape (n_samples, n_features)) – Predicted data array containing forward model predictions.
d_obs (ndarray, shape (n_features,)) – Observed data array containing measured values.
d_std (ndarray, shape (n_features,)) – Standard deviation array containing measurement uncertainties.
N_app (int, optional) – Number of data points to use for approximation. If 0, uses all data. Default is 0.
- Returns:
Log-likelihood values for each sample, computed as: L[i] = -0.5 * sum((D[i] - d_obs)**2 / d_std**2)
- Return type:
ndarray, shape (n_samples,)
Notes
The function assumes independent Gaussian errors with diagonal covariance matrix. The log-likelihood is computed using vectorized operations for efficiency.
When N_app > 0, only the N_app samples with smallest residuals are evaluated, and the remaining samples are assigned a very low likelihood (-1e15).
This implementation is already well-optimized. Micro-optimizations like pre-computing inverse variance do not improve performance with modern NumPy.
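The formula above is a one-liner in vectorized NumPy. The sketch below (hypothetical name `likelihood_gaussian_diagonal_sketch`) shows the core computation without the N_app approximation path:

```python
import numpy as np

def likelihood_gaussian_diagonal_sketch(D, d_obs, d_std):
    """Log-likelihood under independent Gaussian errors, matching
    L[i] = -0.5 * sum((D[i] - d_obs)**2 / d_std**2)."""
    r = (D - d_obs) / d_std              # broadcast residuals, shape (n_samples, n_features)
    return -0.5 * np.sum(r * r, axis=1)  # one log-likelihood per sample
```

Broadcasting `d_obs` and `d_std` against the (n_samples, n_features) prediction array keeps the whole computation in a single pass over the data.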
- integrate.integrate_rejection.likelihood_gaussian_diagonal_old(D, d_obs, d_std, N_app=0)¶
Compute the Gaussian likelihood for a diagonal covariance matrix (original version).
This is the original implementation kept for reference and backwards compatibility. For better performance, use likelihood_gaussian_diagonal() instead.
- Parameters:
D (ndarray, shape (n_samples, n_features)) – Predicted data array containing forward model predictions.
d_obs (ndarray, shape (n_features,)) – Observed data array containing measured values.
d_std (ndarray, shape (n_features,)) – Standard deviation array containing measurement uncertainties.
N_app (int, optional) – Number of data points to use for approximation. If 0, uses all data. Default is 0.
- Returns:
Log-likelihood values for each sample, computed as: L[i] = -0.5 * sum((D[i] - d_obs)**2 / d_std**2)
- Return type:
ndarray, shape (n_samples,)
Notes
This is the original implementation. It has been replaced by an optimized version that is ~15-25% faster. This function is kept for reference and validation.
- integrate.integrate_rejection.likelihood_gaussian_full(D, d_obs, Cd, N_app=0, checkNaN=True, useVectorized=True)¶
Calculate the Gaussian likelihood with full covariance matrix.
This function computes likelihood values for model predictions given observed data and a full covariance matrix, handling NaN values appropriately.
- Parameters:
D (ndarray, shape (n_samples, n_features)) – Model predictions containing forward model results.
d_obs (ndarray, shape (n_features,)) – Observed data containing measured values.
Cd (ndarray, shape (n_features, n_features)) – Full covariance matrix of observed data uncertainties.
N_app (int, optional) – Number of data points to use for approximation. If 0, uses all data. Default is 0.
checkNaN (bool, optional) – If True, handles NaN values in d_obs by ignoring them in calculations. Default is True.
useVectorized (bool, optional) – If True, uses vectorized computation for better performance. Default is True.
- Returns:
Log-likelihood values for each sample, computed as: L[i] = -0.5 * (D[i] - d_obs)^T * Cd^(-1) * (D[i] - d_obs)
- Return type:
ndarray, shape (n_samples,)
Notes
The function handles full covariance matrices accounting for correlated errors. When checkNaN=True, only non-NaN data points are used in the likelihood calculation.
The vectorized implementation uses einsum for efficient matrix operations. When N_app > 0, only the N_app samples with smallest residuals are evaluated.
TODO: Check that this works when D has NaN values and determine why they occur.
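The einsum-based quadratic form mentioned in the notes can be sketched as follows (hypothetical name `likelihood_gaussian_full_sketch`; NaN handling and the N_app path are omitted):

```python
import numpy as np

def likelihood_gaussian_full_sketch(D, d_obs, Cd):
    """Log-likelihood with a full covariance matrix, vectorized via einsum:
    L[i] = -0.5 * r_i^T Cd^{-1} r_i  with  r_i = D[i] - d_obs."""
    r = D - d_obs                              # residuals, shape (n_samples, n_features)
    Cd_inv = np.linalg.inv(Cd)                 # inverse covariance
    # sum_j sum_k r[i,j] * Cd_inv[j,k] * r[i,k] for every sample i at once
    return -0.5 * np.einsum('ij,jk,ik->i', r, Cd_inv, r)
```

For ill-conditioned covariance matrices, solving `Cd @ x = r.T` with `np.linalg.solve` instead of forming the explicit inverse is numerically preferable; the explicit inverse is used here only to mirror the formula in the Returns section.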
- integrate.integrate_rejection.likelihood_multinomial(D, P_obs, class_id=None, class_is_idx=False, entropyFilter=False, entropyThreshold=0.99)¶
Calculate log-likelihood of multinomial distribution for discrete data.
This function computes the log-likelihood of multinomial distribution for discrete data using fully vectorized array operations for efficient computation.
- Parameters:
D (ndarray, shape (N, n_features)) – Matrix of observed discrete data, where each element represents a class ID.
P_obs (ndarray, shape (n_classes, n_features)) – Matrix of probabilities, where each column represents probability distribution over classes.
class_id (ndarray, optional) – Array of unique class IDs corresponding to rows in P_obs. If None, extracted from unique values in D. Default is None.
class_is_idx (bool, optional) – If True, class_id is already an index. If False, computes index from class_id array. Default is False.
entropyFilter (bool, optional) – If True, applies entropy filtering to select features. Default is False.
entropyThreshold (float, optional) – Threshold for entropy filtering. Features with entropy below this value are selected. Default is 0.99.
- Returns:
Log-likelihood values for each sample, computed using natural logarithm. For each sample i: logL[i] = sum(log(p[i,j])) over all features j.
- Return type:
ndarray, shape (N,)
Notes
This vectorized implementation eliminates Python loops for significant performance gains. The log-likelihood is calculated as the sum of natural logarithms of probabilities: logL[i] = sum(log(p[i,j])) for all features j
This means exp(logL[i]) equals the product of probabilities across features. For single-feature cases, exp(logL) directly equals the observed probability.
When entropyFilter is True, only features with entropy below the threshold are used in the likelihood calculation, which can improve computational efficiency for datasets with many uninformative features.
Performance: This vectorized version is approximately 5-10x faster than the loop-based implementation for large datasets (N > 10,000).
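The vectorized indexing that replaces the Python loops can be sketched as below (hypothetical name `likelihood_multinomial_sketch`; the entropy-filter option is omitted). It assumes `class_id` is sorted, which holds when it comes from `np.unique`:

```python
import numpy as np

def likelihood_multinomial_sketch(D, P_obs, class_id=None):
    """Vectorized multinomial log-likelihood: for sample i,
    logL[i] = sum_j log P_obs[class_index(D[i, j]), j]."""
    if class_id is None:
        class_id = np.unique(D)              # sorted unique class IDs
    idx = np.searchsorted(class_id, D)       # map class IDs to row indices, shape (N, n_features)
    cols = np.arange(D.shape[1])             # one column index per feature
    p = P_obs[idx, cols]                     # probability of each observed class
    return np.sum(np.log(p), axis=1)
```

Fancy indexing `P_obs[idx, cols]` looks up the probability of every observation in one operation, which is where the quoted 5-10x speedup over the loop-based version comes from.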
Examples
>>> D = np.array([[1, 2], [2, 1]])  # Sample data with class IDs
>>> P_obs = np.array([[0.3, 0.7], [0.7, 0.3]])  # Class probabilities
>>> logL = likelihood_multinomial(D, P_obs)
- integrate.integrate_rejection.likelihood_multinomial_old(D, P_obs, class_id=None, class_is_idx=False, entropyFilter=False, entropyThreshold=0.99)¶
Calculate log-likelihood of multinomial distribution for discrete data (old loop-based version).
This is the original loop-based implementation kept for reference and backwards compatibility. For better performance, use likelihood_multinomial() instead.
- Parameters:
D (ndarray, shape (N, n_features)) – Matrix of observed discrete data, where each element represents a class ID.
P_obs (ndarray, shape (n_classes, n_features)) – Matrix of probabilities, where each column represents probability distribution over classes.
class_id (ndarray, optional) – Array of unique class IDs corresponding to rows in P_obs. If None, extracted from unique values in D. Default is None.
class_is_idx (bool, optional) – If True, class_id is already an index. If False, computes index from class_id array. Default is False.
entropyFilter (bool, optional) – If True, applies entropy filtering to select features. Default is False.
entropyThreshold (float, optional) – Threshold for entropy filtering. Features with entropy below this value are selected. Default is 0.99.
- Returns:
Log-likelihood values for each sample, computed using natural logarithm. For each sample i: logL[i] = sum(log(p[i,j])) over all features j.
- Return type:
ndarray, shape (N,)
Notes
This is the original loop-based implementation. It has been replaced by a vectorized version that is 5-10x faster. This function is kept for reference and validation.
Examples
>>> D = np.array([[1, 2], [2, 1]])  # Sample data with class IDs
>>> P_obs = np.array([[0.3, 0.7], [0.7, 0.3]])  # Class probabilities
>>> logL = likelihood_multinomial_old(D, P_obs)
Reconstruct arrays from shared memory references.
This function takes shared memory references (created by create_shared_memory) and reconstructs the original numpy arrays by accessing the shared memory segments. Used by worker processes to access shared data.
- Parameters:
shared_memory_refs (list) – List of (name, shape, dtype) tuples identifying shared memory segments.
- Returns:
reconstructed_arrays (list) – List of numpy arrays reconstructed from shared memory.
shm_objects (list) – List of shared memory objects that must be closed after use.
Warning
The reconstructed arrays are views into shared memory. Modifications will affect the shared data across all processes. Do NOT modify these arrays. The shared memory objects must be closed after use to prevent leaks.
Notes
If an error occurs during reconstruction, any successfully opened shared memory objects are automatically closed before raising the exception.
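The reconstruction pattern can be sketched with the standard-library `multiprocessing.shared_memory` module (hypothetical name `reconstruct_from_refs`; not the module's exact code):

```python
import numpy as np
from multiprocessing import shared_memory

def reconstruct_from_refs(shared_memory_refs):
    """Rebuild arrays from (name, shape, dtype) references. The returned
    arrays are views into shared memory: treat them as read-only, and
    close the SharedMemory objects after use (do not unlink from workers)."""
    arrays, shm_objects = [], []
    try:
        for name, shape, dtype in shared_memory_refs:
            shm = shared_memory.SharedMemory(name=name)   # attach to existing segment
            arrays.append(np.ndarray(shape, dtype=dtype, buffer=shm.buf))
            shm_objects.append(shm)
    except Exception:
        for shm in shm_objects:                           # close partially opened segments
            shm.close()
        raise
    return arrays, shm_objects
```

Note the asymmetry with creation: workers only `close()` their handles, while the creating process is responsible for the final `unlink()` that removes the segment from the system.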
- integrate.integrate_rejection.select_subset_for_inversion(dd, N_app)¶
Select a subset of indices for inversion based on the sum of squared residuals.
This function computes the sum of squared values along the feature axis for each row of the input array dd, then selects the indices of the N_app rows with the smallest sums (i.e., the best-fitting samples).
- Parameters:
dd (numpy.ndarray) – A 2D array of data from which to select the subset.
N_app (int) – The number of indices to select based on the smallest sums.
- Returns:
idx – An array of indices corresponding to the N_app smallest L2 norms.
- Return type:
numpy.ndarray
Notes
This function uses squared residuals (L2 norm) for optimal performance, avoiding expensive absolute value operations. Uses np.argpartition for efficient selection of the smallest sums.