Part I: Known Number of Mixture Components
Here, we assume that the number of Gaussians in the mixture model is known. For a trans-dimensional example, where both the Gaussians’ characteristics and their number are treated as unknown and inferred from the data, please refer to Part II: Trans-dimensional GMM.
Import libraries and define constants
import bayesbay as bb
from math import sqrt, pi
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(30)
MEANS = [140, 162, 177] # Means of the Gaussians
STDS = [12, 5, 6] # Standard deviations of the Gaussians
WEIGHTS = [0.4, 0.3, 0.3] # Weights of each Gaussian in the mixture
N_SAMPLES = 10_000 # Number of samples to generate
def gaussian(x, mu, sigma):
    return 1 / (sigma * sqrt(2*pi)) * np.exp(-0.5 * ((x - mu) / sigma)**2)
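As a quick, optional sanity check (not part of the original workflow), the helper defined above should integrate to roughly one over a sufficiently wide range of heights:
# Optional check: the Gaussian PDF should integrate to ~1 (simple Riemann sum)
x_check = np.linspace(100, 260, 10_000)
dx = x_check[1] - x_check[0]
print(np.sum(gaussian(x_check, 177, 6)) * dx)  # ~1.0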
Observed data
In the following block, we first generate a set of random samples using the probability density function (PDF) defined by the mixture components. These samples represent height measurements from the three considered subgroups (i.e., Children, Adult Women, and Adult Men). We then use the collected samples to create a histogram — representing the observed data in this tutorial — which will be used later to infer the characteristics of the Gaussian mixture.
assert np.isclose(sum(WEIGHTS), 1), "The weights must sum to 1."
def generate_random_samples(means, stds, weights):
    """Generate random samples for each component of the Gaussian mixture"""
    samples = []
    for mean, std, weight in zip(means, stds, weights):
        n_samples = int(N_SAMPLES * weight)  # number of samples for each component
        samples.append(np.random.normal(mean, std, n_samples))
    samples = np.concatenate(samples)
    np.random.shuffle(samples)
    return samples
samples = generate_random_samples(MEANS, STDS, WEIGHTS)
data_obs, bins = np.histogram(samples, bins=50, density=True)
data_x = (bins[:-1] + bins[1:]) / 2 # Height associated with each data point
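Since density=True is passed to np.histogram, the bin heights stored in data_obs approximate a probability density rather than raw counts; as a quick, optional check, they integrate to one over the bins:
# Optional check: with density=True, bin heights times bin widths sum to ~1
print(np.sum(data_obs * np.diff(bins)))  # ~1.0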
For illustration purposes, we compute the true mixture PDF in the block below and plot it alongside the histogram of samples.
# Define the range of x-values (i.e., heights) over which to evaluate the PDF
x_min, x_max = min(MEANS) - 3 * max(STDS), max(MEANS) + 3 * max(STDS)
xs = np.linspace(x_min, x_max, 1000)
pdf_true = np.zeros_like(xs)
for mean, std, weight in zip(MEANS, STDS, WEIGHTS):
    pdf_true += weight * gaussian(xs, mean, std)
fig, ax = plt.subplots()
ax.set_title('Histogram of measured heights')
ax.hist(samples, bins, density=True, ec='w')
ax.plot(xs, pdf_true, label='True mixture PDF', color='r')
ax.plot(data_x, data_obs, 'ko', label='Observed data', markerfacecolor='None')
ax.set_xlabel('Height')
ax.set_ylabel('Density')
ax.grid()
ax.legend()
plt.show()

Setting up the Bayesian sampling
To solve the inverse problem inherent in inferring the mean \(\boldsymbol{\mu}\), standard deviation \(\boldsymbol{\sigma}\), and weight \(\boldsymbol{\omega}\) of each Gaussian in the mixture, we need to define:
A Parameterization, encapsulating the unknown model parameters \(\mathbf{m} = [ \boldsymbol{\mu}, \boldsymbol{\sigma}, \boldsymbol{\omega}]\) and their prior probability
A forward function, enabling the prediction of data points \(\mathbf{d}_{pred}\) from the model \(\mathbf{m}\)
What we refer to as a Target, that is, a Python object designed to store all information about a given data set
Prior probability
Here, we assume that the number of Gaussians in the mixture is known and
\(100 \leq \mu_i \leq 200\) cm
\(1 \leq \sigma_i \leq 20\) cm
\(0 \leq \omega_i \leq 1\).
In BayesBay, the above prior information can be implemented through UniformPrior, that is, a free parameter characterized by a uniform prior probability. (Different options for the prior of a free parameter can be implemented via GaussianPrior or CustomPrior.)
Finally, note the argument perturb_std in the block below, taken by UniformPrior. The numerical value assigned to this argument represents the standard deviation of the Gaussian used to propose a new model by perturbing the considered parameter at a given Markov-chain iteration.
mean = bb.prior.UniformPrior(name="mean", vmin=100, vmax=200, perturb_std=2)
std = bb.prior.UniformPrior(name="std", vmin=1, vmax=20, perturb_std=0.4)
weight = bb.prior.UniformPrior(name="weight", vmin=0, vmax=1, perturb_std=0.02)
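To make the role of perturb_std concrete, the sketch below is purely illustrative (it is not BayesBay's internal implementation, and the variable names are made up): it mimics how a new value of a parameter, here a component mean, could be proposed from the current one at a given iteration.
# Illustrative sketch only (not the actual BayesBay implementation): a new
# value is proposed by adding a Gaussian deviate whose standard deviation
# equals perturb_std; proposals falling outside the prior bounds would be
# handled by the sampler
current_mean = 150.0  # hypothetical current value of one 'mean' component
proposed_mean = current_mean + np.random.normal(0, 2)  # perturb_std=2, as above
print(proposed_mean)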
Parameter space and parameterization
The above free parameters should be used in BayesBay to create what we call a ParameterSpace. Mathematically, ParameterSpace can be thought of as an \(n\)-dimensional vector space. Each point (vector) within this space corresponds to a free parameter in the inference problem, and the position of each point defines the model parameters \(\mathbf{m}_i \in \mathbb{R}^n\), with \(\mathbf{m}_i \subseteq \mathbf{m}\). Importantly, in our implementation \(n\) can be treated as a variable to enable trans-dimensional Bayesian sampling.
From a computer-programming perspective, ParameterSpace can be seen as a specialized container that not only groups a number of free parameters but also determines their dimensionality. For instance, in this tutorial’s case, ParameterSpace will contain three distinct free parameters (namely, \(\boldsymbol{\omega}\), \(\boldsymbol{\mu}\), and \(\boldsymbol{\sigma}\)); since each of these parameters is a three-dimensional vector (because there are three Gaussians in the mixture), ParameterSpace’s dimensionality will remain fixed and equal to three throughout the inversion. Finally, under the hood ParameterSpace also defines the perturbation functions used to propose new model parameters from the current ones (for more information, see the module bayesbay.perturbations).
Having defined one or more instances of ParameterSpace, these should be used to define a Parameterization. Parameterization is a relatively simple object compared to ParameterSpace, and its main purpose is to aggregate all model parameters (from all specified instances of ParameterSpace) to define \(\mathbf{m} = \lbrace \mathbf{m}_1, \mathbf{m}_2, \mathbf{m}_3, \dots \rbrace\).
param_space = bb.parameterization.ParameterSpace(
name="my_param_space",
n_dimensions=3,
parameters=[mean, std, weight],
)
parameterization = bb.parameterization.Parameterization(param_space)
Forward problem
Another fundamental ingredient in BayesBay is the definition of the forward function. This function is required to return the predicted data in the form of a NumPy array, which is then used to calculate the likelihood \(p({\bf d}_{obs} \mid {\bf m})\). The forward function should take a bayesbay.State as input, from which all model parameters can be accessed. The forward function passed to BayesBay may, of course, call external functions, as illustrated in the code block below.
def _forward(means, stds, weights):
    weights /= np.sum(weights)
    data_pred = np.zeros_like(data_x)
    for i in range(len(means)):
        data_pred += weights[i] * gaussian(data_x, means[i], stds[i])
    return data_pred

def fwd_function(state: bb.State) -> np.ndarray:
    means = state['my_param_space']['mean']
    stds = state['my_param_space']['std']
    weights = state['my_param_space']['weight']
    return _forward(means, stds, weights)
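As an optional check of the forward operator (using only the functions and constants defined above), feeding the true mixture parameters to _forward should reproduce the observed histogram up to its sampling noise:
# Optional check: with the true parameters, the forward prediction should be
# close to the observed (noisy) histogram heights
pdf_check = _forward(np.array(MEANS, dtype=float),
                     np.array(STDS, dtype=float),
                     np.array(WEIGHTS, dtype=float))
print(np.max(np.abs(pdf_check - data_obs)))  # small compared to data_obs.max()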
Observed data: the Target
In BayesBay, the observed data should be used to define what we call a data Target, which allows for treating the data noise as an unknown. This is accomplished by specifying minimum and maximum bounds for the standard deviation of the data and, optionally, for the data correlation. In this instance, we assume that there is no correlation between adjacent data points.
target = bb.Target('my_data',
data_obs,
std_min=0,
std_max=0.01,
std_perturb_std=0.001,
noise_is_correlated=False)
Log likelihood
Having defined the forward function and the target, these can be used to initialize an instance of the LogLikelihood class. (Alternatively, such an instance can be initialized through a function that calculates either \(\log(p({\bf d}_{obs} \mid {\bf m}))\) or \(\log \left( \frac{p(\mathbf{d}_{obs} \mid \mathbf{m}')}{p(\mathbf{d}_{obs} \mid \mathbf{m})}\right)\), where \(\mathbf{m}'\) denotes the proposed model.) The LogLikelihood instance represents the last ingredient needed to run the Bayesian sampling.
log_likelihood = bb.LogLikelihood(targets=target, fwd_functions=fwd_function)
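For reference, below is a minimal sketch of the alternative mentioned above: a function that computes \(\log(p({\bf d}_{obs} \mid {\bf m}))\) directly, assuming (for illustration only) a fixed, uncorrelated data-noise standard deviation SIGMA_D instead of treating the noise as an unknown; SIGMA_D and my_log_likelihood are hypothetical names, and the exact keyword under which such a function is passed to LogLikelihood (e.g., log_like_func) should be checked in the BayesBay documentation.
# Minimal sketch under illustrative assumptions: Gaussian log-likelihood with a
# fixed, hypothetical noise standard deviation SIGMA_D, ignoring constant terms
SIGMA_D = 0.001

def my_log_likelihood(state: bb.State) -> float:
    residuals = data_obs - fwd_function(state)
    return -0.5 * np.sum((residuals / SIGMA_D)**2)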
Run the Bayesian sampling
We can now use all the defined variables to solve the inference problem. The code provided below is self-explanatory. We sample the posterior using 20 Markov chains, each run on a separate CPU for 250,000 iterations. (To modify the number of CPUs employed in the sampling, see the argument parallel_config in bayesbay.BayesianInversion.run().) The first 50,000 iterations are discarded as the burn-in phase. After the burn-in, we save one model every 100 iterations.
Notice the arguments targets and fwd_functions used above to initialize bayesbay.LogLikelihood: rather than accepting only one data target and forward function, as defined earlier, these can accept a list of targets and a list of functions. This flexibility makes it straightforward to carry out joint inversions of multiple datasets, each potentially associated with a different forward function.
inversion = bb.BayesianInversion(
log_likelihood=log_likelihood,
parameterization=parameterization,
n_chains=20
)
inversion.run(
n_iterations=250_000,
burnin_iterations=50_000,
save_every=100,
verbose=False,
)
for chain in inversion.chains:
    chain.print_statistics()
Chain ID: 0
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 18304/250000 (7.32 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8458/62664 (13.50%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 9846/187336 (5.26%)
Chain ID: 1
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 18134/250000 (7.25 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8563/62691 (13.66%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 9571/187309 (5.11%)
Chain ID: 2
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 19670/250000 (7.87 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8519/62566 (13.62%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 11151/187434 (5.95%)
Chain ID: 3
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 17840/250000 (7.14 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8356/62211 (13.43%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 9484/187789 (5.05%)
Chain ID: 4
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 26442/250000 (10.58 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 9283/61974 (14.98%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 17159/188026 (9.13%)
Chain ID: 5
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 18332/250000 (7.33 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8382/62468 (13.42%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 9950/187532 (5.31%)
Chain ID: 6
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 22618/250000 (9.05 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 9000/62595 (14.38%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 13618/187405 (7.27%)
Chain ID: 7
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 18197/250000 (7.28 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8449/62452 (13.53%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 9748/187548 (5.20%)
Chain ID: 8
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 27298/250000 (10.92 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 9432/62418 (15.11%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 17866/187582 (9.52%)
Chain ID: 9
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 17873/250000 (7.15 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8514/63057 (13.50%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 9359/186943 (5.01%)
Chain ID: 10
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 25869/250000 (10.35 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 9438/62554 (15.09%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 16431/187446 (8.77%)
Chain ID: 11
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 18617/250000 (7.45 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8559/62502 (13.69%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 10058/187498 (5.36%)
Chain ID: 12
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 20913/250000 (8.37 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8776/62495 (14.04%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 12137/187505 (6.47%)
Chain ID: 13
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 22503/250000 (9.00 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 9112/62432 (14.60%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 13391/187568 (7.14%)
Chain ID: 14
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 18038/250000 (7.22 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8529/62509 (13.64%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 9509/187491 (5.07%)
Chain ID: 15
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 18544/250000 (7.42 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8670/62479 (13.88%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 9874/187521 (5.27%)
Chain ID: 16
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 18536/250000 (7.41 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8440/62301 (13.55%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 10096/187699 (5.38%)
Chain ID: 17
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 22007/250000 (8.80 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8815/62157 (14.18%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 13192/187843 (7.02%)
Chain ID: 18
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 18898/250000 (7.56 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8462/62486 (13.54%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 10436/187514 (5.57%)
Chain ID: 19
TEMPERATURE: 1
EXPLORED MODELS: 250000
ACCEPTANCE RATE: 21420/250000 (8.57 %)
PARTIAL ACCEPTANCE RATES:
NoisePerturbation(my_data): 8839/62255 (14.20%)
ParamPerturbation(['my_param_space.mean', 'my_param_space.std', 'my_param_space.weight']): 12581/187745 (6.70%)
Retrieve the results and plot
In the following block, we first retrieve the samples collected during the inference by calling inversion.get_results(). We then sort each retrieved model (i.e. each saved \(\mathbf{m}_i = [ \boldsymbol{\mu}_i, \boldsymbol{\sigma}_i, \boldsymbol{\omega}_i]\), where the subscript \(i\) denotes the Markov chain iteration) so as to arrange the mean values \(\boldsymbol{\mu}_i\) in increasing order. For example, a model \(\mathbf{m} = [(\mu_2, \mu_1, \mu_3), (\sigma_2, \sigma_1, \sigma_3), (\omega_2, \omega_1, \omega_3)]\), with \(\mu_1 < \mu_2 < \mu_3\), would be sorted into \(\mathbf{m}' = [(\mu_1, \mu_2, \mu_3), (\sigma_1, \sigma_2, \sigma_3), (\omega_1, \omega_2, \omega_3)]\). This allows us to identify the first Gaussian in the mixture as the one with the lowest mean, the second Gaussian as the one with the next lowest mean, and so on.
Note
Without sorting the saved models as explained, the retrieved statistics for each Gaussian (e.g., the median height) would be meaningless. This is because two models whose entries are identical except for their order, such as \(\mathbf{m}\) and \(\mathbf{m}'\) above, yield the same predicted PDF when passed to the forward function (i.e., \(\mathbf{G}(\mathbf{m}) = \mathbf{G}(\mathbf{m}')\), where \(\mathbf{G}(\mathbf{m}) = \sum_{i=1}^3 \omega_i \cdot \mathcal{N}(x; \mu_i, \sigma_i)\)); as a result, the labels of the mixture components can swap freely from one saved model to the next.
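A quick demonstration of this permutation invariance, using the _forward function defined earlier and the true mixture parameters:
# Permuting the component order leaves the predicted PDF unchanged, which is
# why the saved models must be sorted before computing per-component statistics
m = np.array(MEANS, dtype=float)
s = np.array(STDS, dtype=float)
w = np.array(WEIGHTS, dtype=float)
perm = [1, 0, 2]  # swap the first two components
print(np.allclose(_forward(m, s, w), _forward(m[perm], s[perm], w[perm])))  # True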
Having sorted all saved models, we use their median to retrieve a PDF. This PDF will then be compared with the true PDF of the mixture in a subsequent block.
def sort_mixture(means, stds, weights):
    indexes = [np.argsort(row) for row in means]
    for i, idx in enumerate(indexes):
        means[i] = means[i][idx]
        stds[i] = stds[i][idx]
        weights[i] = weights[i][idx]
    return means, stds, weights
results = inversion.get_results()
means, stds, weights = sort_mixture(np.array(results['my_param_space.mean']),
np.array(results['my_param_space.std']),
np.array(results['my_param_space.weight']))
pdf_pred = _forward(np.median(means, axis=0),
np.median(stds, axis=0),
np.median(weights, axis=0)
)
Before we plot the results, we conduct a quick simulation to estimate the uncertainty in the observed data. We achieve this by first drawing samples from the true PDF to build 10,000 sets of observed data. We use these datasets to calculate 10,000 histograms, from which we then compute the standard deviation of each bin height. Finally, we take the median of the standard deviations derived in this way. This median will later be compared with the noise standard deviation inferred through our hierarchical Bayesian sampling.
datasets = [generate_random_samples(MEANS, STDS, WEIGHTS) for _ in range(10000)]
histograms = np.array([np.histogram(dataset, bins=50, density=True)[0] for dataset in datasets])
true_std = np.median(np.std(histograms, axis=0))
We now plot the inferred PDF and data noise standard deviation along with the true ones.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
ax1.hist(samples, bins=50, density=True, ec='w', fc='gray', alpha=0.8)
ax1.plot(xs, pdf_true, label='True mixture PDF', color='r')
ax1.plot(data_x, pdf_pred, label='Inferred', color='b')
ax1.set_xlabel('Height [cm]')
ax1.set_ylabel('Density')
ax1.legend()
ax2.axvline(x=true_std, color='r', lw=3, alpha=0.3, label='"True" (i.e., estimated) noise')
pdf, bins, _ = ax2.hist(results['my_data.std'], density=True, bins=4, ec='w', zorder=100, label='Posterior')
ax2.fill_between([target.std_min, target.std_max], 1 / (target.std_max - target.std_min), alpha=0.2, label='Prior')
ax2.set_xlabel('Noise standard deviation')
ax2.ticklabel_format(style='sci', axis='x', scilimits=(0,0))
ax2.legend(framealpha=0.9)
plt.show()

Finally, we create a figure that displays the inferred mean, standard deviation, and weight for each component of the mixture, with yellow dots indicating the true values. This is achieved using the ArviZ library, which provides useful functionalities for plotting inference results.
import arviz as az
for key, inferred_value, true_value in zip(
    ['mean', 'std', 'weight'],
    [means, stds, weights],
    [MEANS, STDS, WEIGHTS]
):
    fig, axes = plt.subplots(3, 3, figsize=(10, 8))
    _ = az.plot_pair(
        {f'{key}_1': inferred_value[:, 0],
         f'{key}_2': inferred_value[:, 1],
         f'{key}_3': inferred_value[:, 2]},
        marginals=True,
        reference_values={f'{key}_1': true_value[0],
                          f'{key}_2': true_value[1],
                          f'{key}_3': true_value[2]},
        reference_values_kwargs={'color': 'yellow', 'ms': 10},
        kind='kde',
        kde_kwargs={
            'hdi_probs': [0.3, 0.6, 0.9],  # Plot 30%, 60% and 90% HDI contours
            'contourf_kwargs': {'cmap': 'Blues'},
        },
        ax=axes,
        textsize=10
    )
    fig.suptitle(key.upper())
    plt.tight_layout()
    plt.show()