Inference and model evaluation
In this section, we go through the process of estimating the parameters of the model and evaluating the fitted model.
Simulating synthetic datasets
The method proposed by Ringbauer et al. (2017) makes some assumptions to approximate the evolution of a population in a continuous space. We refer to the Theory overview
for an introduction to the underlying theory. In any case, we advise researchers to simulate synthetic data that resembles the empirical data according to their expertise. This is a crucial step to determine the expected performance of the method and to guide different steps of the pre-processing strategy (such as which software to use to detect identity-by-descent blocks or how to bin the observed data).
For this purpose, we provide a SLiM model to simulate an individual-based population in continuous space, as well as two scripts that fit the model with either error-free identity-by-descent blocks or blocks detected using state-of-the-art software and compare the estimates with the ground-truth parameters. These models can be found in the simulations directory and can be adapted to your specific needs.
Processing pipeline
Inferring the effective density and dispersal rate from empirical datasets requires running an identity-by-descent detection software across different phased VCF files, post-processing them, and aggregating them appropriately. This procedure is error-prone and time-consuming.
To facilitate the analysis, we provide a Snakemake pipeline that performs a complete analysis: detecting IBD blocks with HapIBD, post-processing them with Refined IBD, producing a CSV file directly compatible with this package, and, optionally, computing a preliminary maximum likelihood estimate of the effective density and the effective dispersal rate. Snakemake is a popular tool in bioinformatics; it allows easy parallelization across multiple cores, or across jobs when submitting to a cluster via SLURM, and it automatically manages dependencies via Conda or Apptainer containers. Most of the time, running the pipeline requires a single command such as:
snakemake --configfile config.yaml --cores 8 --sdm conda
The pipeline is configured via a configuration YAML file. An example of such a configuration file can be found here. We refer to the extensive documentation of Snakemake for more information.
Model inference
Next, we exemplify the common steps of an analysis using IdentityByDescentDispersal. First, we load a synthetic dataset for which we know the ground truth. This is the dataset we simulated when showcasing how to simulate synthetic datasets using SLiM.
using JLD2, DataFrames, IdentityByDescentDispersal, Turing
data = load("../data/constant_density.jld2")
ground_truth = (data["local_density"], data["dispersal_rate"])
(253.30778276266545, 0.07088723439378913)
The IdentityByDescentDispersal package is designed to be compatible with existing statistical software. Here, we use the Turing package, which is the most popular Bayesian modelling framework in Julia. Let's consider the following model:
@model function constant_density(df, contig_lengths)
    # Priors on the effective density and the effective dispersal rate
    De ~ Truncated(Normal(1000, 100), 0, Inf)
    σ ~ Truncated(Normal(1, 0.1), 0, Inf)
    # Composite log-likelihood of the observed IBD blocks under a constant density
    Turing.@addlogprob! composite_loglikelihood_constant_density(De, σ, df, contig_lengths)
end
m = constant_density(data["df"], data["contig_lengths"])
DynamicPPL.Model{typeof(Main.constant_density), (:df, :contig_lengths), (), (), Tuple{DataFrames.DataFrame, Vector{Float64}}, Tuple{}, DynamicPPL.DefaultContext}(Main.constant_density, (df = 9440×6 DataFrame
Row │ DISTANCE_INDEX IBD_LEFT IBD_RIGHT NR_PAIRS COUNT DISTANCE
│ Float64 Float64 Float64 Int64 Int64 Float64
──────┼────────────────────────────────────────────────────────────────
1 │ 0.21 0.04 0.041 176 20 0.214839
2 │ 0.21 0.041 0.042 176 16 0.214839
3 │ 0.21 0.042 0.043 176 11 0.214839
4 │ 0.21 0.043 0.044 176 16 0.214839
5 │ 0.21 0.044 0.045 176 10 0.214839
6 │ 0.21 0.045 0.046 176 11 0.214839
7 │ 0.21 0.046 0.047 176 16 0.214839
8 │ 0.21 0.047 0.048 176 5 0.214839
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
9434 │ 0.55 0.193 0.194 1 0 0.556462
9435 │ 0.55 0.194 0.195 1 0 0.556462
9436 │ 0.55 0.195 0.196 1 0 0.556462
9437 │ 0.55 0.196 0.197 1 0 0.556462
9438 │ 0.55 0.197 0.198 1 0 0.556462
9439 │ 0.55 0.198 0.199 1 0 0.556462
9440 │ 0.55 0.199 0.2 1 0 0.556462
9425 rows omitted, contig_lengths = [1.0]), NamedTuple(), DynamicPPL.DefaultContext())
The IdentityByDescentDispersal package is compatible with automatic differentiation. Therefore, we can use standard Hamiltonian Monte Carlo algorithms such as NUTS() to estimate the posterior distribution.
chains = sample(m, NUTS(), MCMCThreads(), 1000, 4);
┌ Warning: Only a single thread available: MCMC chains are not sampled in parallel
└ @ AbstractMCMC ~/.julia/packages/AbstractMCMC/F50rb/src/sample.jl:384
Sampling (1 thread) 0%| | ETA: N/A
┌ Info: Found initial step size
└ ϵ = 0.00078125
┌ Info: Found initial step size
└ ϵ = 0.00625
Sampling (1 thread) 25%|███████▊ | ETA: 0:13:57
Sampling (1 thread) 50%|███████████████▌ | ETA: 0:08:59
┌ Info: Found initial step size
└ ϵ = 0.0015625
┌ Info: Found initial step size
└ ϵ = 0.00625
Sampling (1 thread) 75%|███████████████████████▎ | ETA: 0:04:24
Sampling (1 thread) 100%|███████████████████████████████| Time: 0:17:39
Sampling (1 thread) 100%|███████████████████████████████| Time: 0:17:40
When fitting models with power densities (i.e. using composite_loglikelihood_power_density), be aware that the $\beta$ parameter controls the mean-square differentiability of the composite likelihood (through a modified Bessel function of the second kind). If you run into issues, we suggest that you:
- reparametrize the model (perhaps using a custom function via the composite_loglikelihood_custom interface),
- tighten the priors to constrain the parameter space to a region where gradients are defined, or
- use gradient-free inference methods, such as Nelder-Mead when computing maximum likelihood estimates or Metropolis-Hastings when doing Bayesian inference (see the sketch below).
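For example, a gradient-free fit could look like the following sketch. It reuses the constant-density model m from above purely for illustration; NelderMead() comes from Optim.jl and MH() is Turing's Metropolis-Hastings sampler.

using Optim  # provides the NelderMead() optimiser

# Gradient-free maximum-likelihood estimate
mle_gradfree = maximum_likelihood(m, NelderMead())

# Gradient-free Bayesian inference via Metropolis-Hastings
chains_mh = sample(m, MH(), 10_000)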
Model diagnosis
Diagnosing the posterior samples obtained via Markov chain Monte Carlo is a crucial step in any Bayesian inference. We refer to existing resources such as this lecture on Bayesian modelling in Biostatistics. The most popular approaches involve calculating quantities such as the effective sample size (ESS) and $\hat{R}$, which can be computed directly from the Turing output.
As a rule of thumb, we aim to run the chain long enough to obtain an ESS greater than 100.
ess(chains) |> DataFrame
Row | parameters | ess | ess_per_sec |
---|---|---|---|
 | Symbol | Float64 | Float64 |
1 | De | 647.027 | 0.614003 |
2 | σ | 665.676 | 0.6317 |
An $\hat{R}$ greater than 1.05 indicates that the chains have not mixed well.
rhat(chains) |> DataFrame
Row | parameters | rhat |
---|---|---|
 | Symbol | Float64 |
1 | De | 1.00634 |
2 | σ | 1.00589 |
Convergence issues can also be inspected through a traceplot:
using Plots, StatsPlots
traceplot(chains)
Visualizing the posterior
After finishing the sampling process and assessing convergence, we can visualize the estimated posterior distribution of the parameters and compare it with the prior distribution and the ground-truth:
using Distributions
p1 = plot(
Truncated(Normal(1000, 100), 0, Inf),
label = "Prior",
xlab = "Effective population density",
ylab = "Density",
)
density!(p1, chains[:De][:], label = "Posterior")
vline!(p1, [ground_truth[1]], label = "Ground-truth")
p2 = plot(
Truncated(Normal(1, 0.1), 0, Inf),
label = "Prior",
xlab = "Effective dispersal rate",
ylab = "Density",
)
density!(p2, chains[:σ][:], label = "Posterior")
vline!(p2, [ground_truth[2]], label = "Ground-truth")
plot(p1, p2, layout = (2, 1), size = (600, 800))
Notice that, although the inference is accurate, we do not expect the posterior estimates to have nominal coverage (e.g., that a 90% credibility interval contains the true parameter 90% of the time). This is because we assume that every pairwise observation is independent when constructing the composite likelihood, and therefore we are overconfident in our estimates. In their publication, Ringbauer et al. (2017) computed a bootstrap confidence interval of the maximum-likelihood estimate, but it still did not achieve nominal coverage.
A bootstrap confidence interval can easily be computed with the maximum_likelihood function by resampling the rows of the DataFrame with replacement. Notice, however, that other resampling schemes are possible. We advise you to experiment with different approaches and to evaluate their performance on simulated datasets.
nboots = 100
boots = zeros(2, nboots)  # one column of (De, σ) estimates per bootstrap replicate
df = data["df"]
for i = 1:nboots
    # Resample the rows of the data frame with replacement
    df_resampled = df[sample(1:nrow(df), nrow(df), replace = true), :]
    # Maximum-likelihood estimate on the resampled data
    mle = maximum_likelihood(constant_density(df_resampled, data["contig_lengths"]))
    boots[:, i] = mle.values
end
conf_De = quantile(boots[1, :], [0.025, 0.975])
conf_σ = quantile(boots[2, :], [0.025, 0.975])
DataFrame(
parameter = ["De", "σ"],
confidence_interval95 = [conf_De, conf_σ],
ground_truth = [ground_truth[1], ground_truth[2]],
)
Row | parameter | confidence_interval95 | ground_truth |
---|---|---|---|
 | String | Array… | Float64 |
1 | De | [260.678, 299.6] | 253.308 |
2 | σ | [0.0648972, 0.0716243] | 0.0708872 |
Interpreting the output
Finally, we provide some references on how to interpret the results and relate them to the biological context.
Effective population density
The effective population density (De) is simply the inverse of the coalescent rate of lineages that come very close to each other. It is equivalent to the effective population size of every deme in a stepping-stone model where demes are separated by one distance unit. It is most informative about recent demographic events, as it is estimated from identity-by-descent blocks.
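Note that De is expressed per unit area in the same distance units as the input data. If, for instance, pairwise distances were given in kilometres, a back-of-the-envelope conversion of the posterior mean to other units could look like the following sketch (hypothetical units, for illustration only).

# Posterior mean density, in individuals per km² (assuming pairwise distances were in km)
De_per_km2 = mean(chains[:De])
# Hypothetical conversion to individuals per hectare (1 km² = 100 ha)
De_per_ha = De_per_km2 / 100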
Effective dispersal rate
Perhaps unintuitively, the mean per-generation dispersal distance when we average with respect to both parents is not necessarily equivalent to the mean per-generation dispersal distance when we average across single lineages. From population genetic data, we can estimate the latter (refer to the Theory overview section for more details). Therefore, interpreting the estimated effective dispersal rate in the context of the ecology of a population has to be done carefully.
For example, consider a population with two separate sexes for which there are differences in the process of dispersal. Such differences arise naturally when considering the effect of mating systems on dispersal patterns. For simplicity, let's consider a 1-dimensional space where individuals can only move left or right as shown in the figure below.
Let's assume the displacement between the mother and the offspring, $d_{\text{mother-child}}$, is distributed according to a normal distribution with a mean of zero and a variance of $\sigma_D^2$, and that the displacement between the mother and the father, $d_{\text{mother-father}}$, is also distributed according to a normal distribution with a mean of zero and a variance of $\sigma_M^2$. Because the mating process (which determines the displacement from the mother to the father) is independent of the dispersal process of the offspring (which determines the displacement from the mother to the offspring), the displacement between the father and the offspring is normally distributed with a mean of zero and a variance of $\sigma_D^2 + \sigma_M^2$.
As mentioned earlier, what we can estimate is the mean per-generation dispersal of the "lineages". Because a lineage is inherited from either the mother or the father, its displacement is distributed as a mixture of two normal distributions; if the sex ratio is 0.5, the mixture weights are $w = [0.5, 0.5]$. This distribution has a mean of zero and a variance of $\sigma_D^2 + 0.5\sigma_M^2$, but it is not normally distributed. Therefore, under these assumptions, what we can estimate from genetic data is actually $\sigma = \sqrt{0.5\sigma_D^2 + 0.5(\sigma_D^2+\sigma_M^2)} = \sqrt{\sigma_D^2 + 0.5\sigma_M^2}$, and it is not possible to separate the effects of $\sigma_D$ and $\sigma_M$ without additional information. If a priori information is available, it is possible to obtain estimates of both $\sigma_D$ and $\sigma_M$ that are consistent with the observed data. However, this requires additional care, as the posterior distribution will be multimodal. For example:
@model function constant_density2(df, contig_lengths)
    De ~ Truncated(Normal(1000, 100), 0, Inf)
    # Informative priors on the two dispersal components
    σ_D ~ Truncated(Normal(1, 0.1), 0, Inf)
    σ_M ~ Truncated(Normal(0.1, 0.01), 0, Inf)
    # The composite likelihood only depends on the combined dispersal rate
    σ := sqrt(σ_D^2 + 0.5 * σ_M^2)
    Turing.@addlogprob! composite_loglikelihood_constant_density(De, σ, df, contig_lengths)
end
constant_density2 (generic function with 2 methods)
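As a quick numerical sanity check of the mixture variance derived above, we can verify the identity with Distributions.jl using arbitrary illustrative values for $\sigma_D$ and $\sigma_M$ (a minimal sketch, not part of the analysis):

using Distributions, Statistics

σ_D, σ_M = 1.0, 0.5  # arbitrary illustrative values
# A lineage follows the mother with probability 0.5 (variance σ_D^2)
# and the father with probability 0.5 (variance σ_D^2 + σ_M^2)
lineage = MixtureModel([Normal(0, σ_D), Normal(0, sqrt(σ_D^2 + σ_M^2))], [0.5, 0.5])
var(lineage) ≈ σ_D^2 + 0.5 * σ_M^2  # true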
As mentioned earlier, the per-generation dispersal distance averaged across lineages is not necessarily equal to the mean per-generation dispersal distance when we average with respect to both mother and father. Notice that the average displacement is defined as $\hat d = 0.5\, d_{\text{mother-child}} + 0.5\, (d_{\text{mother-child}} - d_{\text{mother-father}})$, which is normally distributed with mean zero and variance $\sigma_D^2 + 0.25\, \sigma_M^2$.
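To see this, note that the linear combination simplifies to $\hat d = d_{\text{mother-child}} - 0.5\, d_{\text{mother-father}}$, so $\operatorname{Var}(\hat d) = \sigma_D^2 + 0.25\, \sigma_M^2$, which is smaller than the lineage-averaged variance $\sigma_D^2 + 0.5\, \sigma_M^2$ derived above.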
This page was generated using Literate.jl.