Inference and model evaluation
In this section, we go through the process of estimating the parameters of the model and evaluating the fitted model.
Simulating synthetic datasets
The method proposed by Ringbauer et al. (2017) makes some assumptions to approximate the evolution of a population in continuous space. We refer to the Theory overview for an introduction to the underlying theory. In any case, we advise researchers to simulate synthetic data that resembles their empirical data, drawing on their own expertise. This is a crucial step to determine the expected performance of the method and to guide the pre-processing strategy (such as which software to use to detect identity-by-descent blocks or how to bin the observed data).
For this purpose, we provide a SLiM model to simulate an individual-based population in continuous space, as well as two scripts that fit the model with either error-free identity-by-descent blocks or blocks detected using state-of-the-art software and compare the estimates with the ground-truth parameters. These models can be found in the simulations directory and can be adapted to your specific needs.
Processing pipeline
Inferring the effective density and dispersal rate from empirical datasets requires running identity-by-descent detection software across different phased VCF files, post-processing the results, and aggregating them appropriately. This procedure is error-prone and time-consuming.
To facilitate the analysis, we provide a Snakemake pipeline that performs a complete analysis: detecting IBD blocks using HapIBD, post-processing them with Refined IBD, producing a CSV directly compatible with this package, and, optionally, computing a preliminary maximum-likelihood estimate of the effective density and effective dispersal rate. Snakemake is a popular tool in bioinformatics: it allows easy parallelization across multiple cores, or across jobs when submitting to a cluster via SLURM, and it automatically manages dependencies via Conda or Apptainer containers. Most of the time, running the pipeline requires a single command such as:
snakemake --configfile config.yaml --cores 8 --sdm conda
The pipeline is configured via a configuration YAML file. An example of such a configuration file can be found here. We refer to the extensive documentation of Snakemake for more information.
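For cluster execution via SLURM, the invocation changes only slightly. The line below is a minimal sketch, assuming Snakemake ≥ 8 with the snakemake-executor-plugin-slurm plugin installed; adjust the flags to your scheduler setup:
snakemake --configfile config.yaml --executor slurm --jobs 50 --sdm conda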
Model inference
Next, we walk through the common steps of an analysis using IdentityByDescentDispersal. First, we load a synthetic dataset for which we know the ground truth. This is the dataset we simulated when showcasing how to simulate synthetic datasets using SLiM.
using JLD2, DataFrames, IdentityByDescentDispersal, Turing
data = load("../data/constant_density.jld2")
ground_truth = (data["local_density"], data["dispersal_rate"])
(253.30778276266545, 0.07088723439378913)
The IdentityByDescentDispersal package is designed to be compatible with existing statistical software. Here, we use the Turing package, the most popular Bayesian modelling framework in Julia. Let's consider the following model:
@model function constant_density(df, contig_lengths)
    # Priors on the effective density De and the dispersal rate σ
    De ~ Truncated(Normal(1000, 100), 0, Inf)
    σ ~ Truncated(Normal(1, 0.1), 0, Inf)
    # Add the composite log-likelihood of the observed IBD blocks to the joint log-density
    Turing.@addlogprob! composite_loglikelihood_constant_density(De, σ, df, contig_lengths)
end
m = constant_density(data["df"], data["contig_lengths"])
DynamicPPL.Model{typeof(Main.constant_density), (:df, :contig_lengths), (), (), Tuple{DataFrames.DataFrame, Vector{Float64}}, Tuple{}, DynamicPPL.DefaultContext}(Main.constant_density, (df = 9440×6 DataFrame
Row │ DISTANCE_INDEX IBD_LEFT IBD_RIGHT NR_PAIRS COUNT DISTANCE
│ Float64 Float64 Float64 Int64 Int64 Float64
──────┼────────────────────────────────────────────────────────────────
1 │ 0.21 0.04 0.041 176 20 0.214839
2 │ 0.21 0.041 0.042 176 16 0.214839
3 │ 0.21 0.042 0.043 176 11 0.214839
4 │ 0.21 0.043 0.044 176 16 0.214839
5 │ 0.21 0.044 0.045 176 10 0.214839
6 │ 0.21 0.045 0.046 176 11 0.214839
7 │ 0.21 0.046 0.047 176 16 0.214839
8 │ 0.21 0.047 0.048 176 5 0.214839
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
9434 │ 0.55 0.193 0.194 1 0 0.556462
9435 │ 0.55 0.194 0.195 1 0 0.556462
9436 │ 0.55 0.195 0.196 1 0 0.556462
9437 │ 0.55 0.196 0.197 1 0 0.556462
9438 │ 0.55 0.197 0.198 1 0 0.556462
9439 │ 0.55 0.198 0.199 1 0 0.556462
9440 │ 0.55 0.199 0.2 1 0 0.556462
9425 rows omitted, contig_lengths = [1.0]), NamedTuple(), DynamicPPL.DefaultContext())
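Before sampling, it can be useful to check that the composite log-likelihood evaluates to a finite value at plausible parameter values. A minimal sketch, calling the same function the model wraps (here evaluated at the ground-truth parameters, which of course would not be known for an empirical dataset):
composite_loglikelihood_constant_density(
    ground_truth[1],  # effective density De
    ground_truth[2],  # dispersal rate σ
    data["df"],
    data["contig_lengths"],
)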
The IdentityByDescentDispersal package is compatible with automatic differentiation. Therefore, we can use standard Hamiltonian Monte Carlo algorithms such as NUTS() to estimate the posterior distribution.
chains = sample(m, NUTS(), MCMCThreads(), 1000, 4);
┌ Warning: Only a single thread available: MCMC chains are not sampled in parallel
└ @ AbstractMCMC ~/.julia/packages/AbstractMCMC/F50rb/src/sample.jl:384
┌ Info: Found initial step size
└ ϵ = 0.003125
┌ Info: Found initial step size
└ ϵ = 0.00625
┌ Info: Found initial step size
└ ϵ = 0.00625
┌ Info: Found initial step size
└ ϵ = 0.0125
Sampling (1 thread) 100%|███████████████████████████████| Time: 0:16:57
When fitting models with power-densities (i.e. using composite_loglikelihood_power_density), be aware that the $\beta$ parameter controls the mean-square differentiability of the composite likelihood (through the modified Bessel function of the second kind). If you run into issues, we suggest that you:
- reparametrize the model (perhaps using a custom function via the composite_loglikelihood_custom interface);
- tighten the priors to constrain the parameter space to a region where gradients are defined;
- use gradient-free inference methods, such as Nelder-Mead when computing maximum likelihood estimates or Metropolis-Hastings when doing Bayesian inference, as sketched below.
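A minimal sketch of these gradient-free options for the model m defined above. It assumes Optim.jl is available in the environment and a recent Turing release in which maximum_likelihood accepts an optimizer as a second argument:
using Optim  # assumed available; provides NelderMead()

# Gradient-free maximum-likelihood estimate
mle_nm = maximum_likelihood(m, NelderMead())

# Gradient-free Bayesian inference with a random-walk Metropolis-Hastings sampler
chains_mh = sample(m, MH(), 10_000)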
Model diagnosis
Diagnosing the posterior samples obtained via Markov chain Monte Carlo is a crucial step in any Bayesian inference. We refer to existing resources such as this lecture on Bayesian modelling in Biostatistics. The most popular approaches involve calculating quantities such as the effective sample size (ESS) and $\hat{R}$, which can be computed directly from Turing's output.
As a rule of thumb, we aim to run the chain long enough to obtain an ESS greater than 100.
ess(chains) |> DataFrame
| Row | parameters | ess | ess_per_sec |
| --- | --- | --- | --- |
|  | Symbol | Float64 | Float64 |
| 1 | De | 820.772 | 0.811968 |
| 2 | σ | 794.417 | 0.785896 |
An $\hat{R}$ greater than 1.05 indicates that the chains have not mixed well.
rhat(chains) |> DataFrame
| Row | parameters | rhat |
| --- | --- | --- |
|  | Symbol | Float64 |
| 1 | De | 1.00531 |
| 2 | σ | 1.00401 |
Convergence issues can also be inspected through a traceplot:
using Plots, StatsPlots
traceplot(chains)
Visualizing the posterior
After finishing the sampling process and assessing convergence, we can visualize the estimated posterior distribution of the parameters and compare it with the prior distribution and the ground-truth:
using Distributions
p1 = plot(
Truncated(Normal(1000, 100), 0, Inf),
label = "Prior",
xlab = "Effective population density",
ylab = "Density",
)
density!(p1, chains[:De][:], label = "Posterior")
vline!(p1, [ground_truth[1]], label = "Ground-truth")
p2 = plot(
Truncated(Normal(1, 0.1), 0, Inf),
label = "Prior",
xlab = "Effective dispersal rate",
ylab = "Density",
)
density!(p2, chains[:σ][:], label = "Posterior")
vline!(p2, [ground_truth[2]], label = "Ground-truth")
plot(p1, p2, layout = (2, 1), size = (600, 800))
Notice that, although the inference is accurate, we do not expect the posterior estimates to have nominal coverage (e.g., that a 90% credible interval contains the true parameter 90% of the time). This is because we assume that every pairwise observation is independent when constructing the composite likelihood, and we are therefore overconfident in our estimates. In their publication, Ringbauer et al. (2017) computed a bootstrap confidence interval for the maximum-likelihood estimate, but it still did not achieve nominal coverage.
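As an illustration, equal-tailed credible intervals can be read directly off the chains. A short sketch using the quantile summary of MCMCChains (re-exported by Turing); the q keyword below selects a 90% interval around the posterior median:
# 90% equal-tailed credible intervals for De and σ
quantile(chains; q = [0.05, 0.5, 0.95]) |> DataFrame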
A bootstrap confidence interval can easily be computed with the maximum_likelihood function by resampling the rows of the DataFrame with replacement. Note, however, that other resampling schemes are possible (one alternative is sketched at the end of this section). We advise you to experiment with different approaches and evaluate their performance on simulated datasets.
nboots = 100
boots = zeros(2, nboots)
df = data["df"]
for i = 1:nboots
    # Resample rows with replacement and refit by maximum likelihood
    df_resampled = df[sample(1:nrow(df), nrow(df), replace = true), :]
    mle = maximum_likelihood(constant_density(df_resampled, data["contig_lengths"]))
    boots[:, i] = mle.values
end
conf_De = quantile(boots[1, :], [0.025, 0.975])
conf_σ = quantile(boots[2, :], [0.025, 0.975])
DataFrame(
parameter = ["De", "σ"],
confidence_interval95 = [conf_De, conf_σ],
ground_truth = [ground_truth[1], ground_truth[2]],
)
| Row | parameter | confidence_interval95 | ground_truth |
| --- | --- | --- | --- |
|  | String | Array… | Float64 |
| 1 | De | [259.639, 302.688] | 253.308 |
| 2 | σ | [0.0649086, 0.0720443] | 0.0708872 |
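As an example of an alternative resampling scheme, the sketch below resamples whole distance bins (identified by the DISTANCE_INDEX column) instead of individual rows, so that all sub-blocks of a pairwise-distance class stay together. This is an illustration under our own assumptions, not the package's recommended procedure, and it reuses only functions already employed above:
# Bin-level bootstrap over DISTANCE_INDEX classes
bins = unique(df.DISTANCE_INDEX)
boots_bins = zeros(2, nboots)
for i = 1:nboots
    # Resample whole distance bins with replacement and concatenate their rows
    resampled = sample(bins, length(bins), replace = true)
    df_resampled = reduce(vcat, [df[df.DISTANCE_INDEX .== b, :] for b in resampled])
    mle = maximum_likelihood(constant_density(df_resampled, data["contig_lengths"]))
    boots_bins[:, i] = mle.values
end
quantile(boots_bins[1, :], [0.025, 0.975])  # 95% interval for De under bin resampling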
This page was generated using Literate.jl.