Inference and model evaluation

In this section, we go through the process of estimating the parameters of the model and evaluating the fitted model.

Simulating synthetic datasets

The method proposed by Ringbauer et al. (2017) makes some assumptions to approximate the evolution of a population in a continuous space. We refer to the Theory overview for an introduction to the underlying theory. In any case, we advise researchers to use their domain expertise to simulate synthetic data that resemble their empirical data. This is a crucial step to assess the expected performance of the method and to guide the pre-processing strategy (such as which software to use to detect identity-by-descent blocks or how to bin the observed data).

For this purpose, we provide a SLiM model that simulates an individual-based population in continuous space, as well as two scripts that fit the model using either error-free identity-by-descent blocks or blocks detected with state-of-the-art software and compare the estimates with the ground-truth parameters. These models can be found in the simulations directory and can be adapted to your specific needs.

Processing pipeline

Inferring the effective density and dispersal rate from empirical datasets requires running an identity-by-descent detection software across different phased VCF files, post-processing them, and aggregating them appropriately. This procedure is error-prone and time-consuming.

To facilitate the analysis, we provide a Snakemake pipeline that performs a complete analysis: detecting IBD blocks using HapIBD, post-processing them with Refined IBD, producing a CSV directly compatible with this package, and, optionally, computing a preliminary maximum likelihood estimate of the effective density and effective dispersal rate. Snakemake is a popular tool in bioinformatics: it allows easy parallelization across multiple cores or jobs when submitting to a cluster via SLURM, and it automatically manages dependencies via Conda or Apptainer containers. Most of the time, running such a pipeline requires a single command such as:

snakemake --configfile config.yaml --cores 8 --sdm conda

The pipeline is configured via a configuration YAML file. An example of such a configuration file can be found here. We refer to the extensive documentation of Snakemake for more information.
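
Once the pipeline finishes, its output CSV can be loaded directly into Julia. The following is a minimal sketch; the file path is hypothetical and depends on the output location defined in your configuration file:

using CSV, DataFrames
# Load the aggregated IBD summary produced by the pipeline
# (hypothetical path; the actual location is set in config.yaml).
df = CSV.read("results/ibd_blocks.csv", DataFrame)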

Model inference

Next, we exemplify the common steps of an analysis using IdentityByDescentDispersal. First, we load a synthetic dataset for which we know the ground truth. This is the dataset we simulated when showcasing how to simulate synthetic datasets using SLiM.

using JLD2, DataFrames, IdentityByDescentDispersal, Turing
data = load("../data/constant_density.jld2")
ground_truth = (data["local_density"], data["dispersal_rate"])
(253.30778276266545, 0.07088723439378913)

The IdentityByDescentDispersal package is designed to be compatible with existing statistical software. Here, we use the Turing package, the most popular Bayesian modelling framework in Julia. Let's consider the following model:

@model function constant_density(df, contig_lengths)
    De ~ Truncated(Normal(1000, 100), 0, Inf)
    σ ~ Truncated(Normal(1, 0.1), 0, Inf)
    Turing.@addlogprob! composite_loglikelihood_constant_density(De, σ, df, contig_lengths)
end
m = constant_density(data["df"], data["contig_lengths"])
DynamicPPL.Model{typeof(Main.constant_density), (:df, :contig_lengths), (), (), Tuple{DataFrames.DataFrame, Vector{Float64}}, Tuple{}, DynamicPPL.DefaultContext}(Main.constant_density, (df = 9440×6 DataFrame
  Row │ DISTANCE_INDEX  IBD_LEFT  IBD_RIGHT  NR_PAIRS  COUNT  DISTANCE
      │ Float64         Float64   Float64    Int64     Int64  Float64
──────┼────────────────────────────────────────────────────────────────
    1 │           0.21     0.04       0.041       176     20  0.214839
    2 │           0.21     0.041      0.042       176     16  0.214839
    3 │           0.21     0.042      0.043       176     11  0.214839
    4 │           0.21     0.043      0.044       176     16  0.214839
    5 │           0.21     0.044      0.045       176     10  0.214839
    6 │           0.21     0.045      0.046       176     11  0.214839
    7 │           0.21     0.046      0.047       176     16  0.214839
    8 │           0.21     0.047      0.048       176      5  0.214839
  ⋮   │       ⋮            ⋮          ⋮         ⋮        ⋮       ⋮
 9434 │           0.55     0.193      0.194         1      0  0.556462
 9435 │           0.55     0.194      0.195         1      0  0.556462
 9436 │           0.55     0.195      0.196         1      0  0.556462
 9437 │           0.55     0.196      0.197         1      0  0.556462
 9438 │           0.55     0.197      0.198         1      0  0.556462
 9439 │           0.55     0.198      0.199         1      0  0.556462
 9440 │           0.55     0.199      0.2           1      0  0.556462
                                                      9425 rows omitted, contig_lengths = [1.0]), NamedTuple(), DynamicPPL.DefaultContext())

The IdentityByDescentDispersal package is compatible with automatic differentiation. Therefore, we can use standard Hamiltonian Monte Carlo algorithms such as NUTS() to estimate the posterior distribution.

chains = sample(m, NUTS(), MCMCThreads(), 1000, 4);
┌ Warning: Only a single thread available: MCMC chains are not sampled in parallel
└ @ AbstractMCMC ~/.julia/packages/AbstractMCMC/F50rb/src/sample.jl:384
Sampling (1 thread)   0%|                               |  ETA: N/A
┌ Info: Found initial step size
└   ϵ = 0.00078125
┌ Info: Found initial step size
└   ϵ = 0.00625
Sampling (1 thread)  25%|███████▊                       |  ETA: 0:13:57
Sampling (1 thread)  50%|███████████████▌               |  ETA: 0:08:59
┌ Info: Found initial step size
└   ϵ = 0.0015625
┌ Info: Found initial step size
└   ϵ = 0.00625
Sampling (1 thread)  75%|███████████████████████▎       |  ETA: 0:04:24
Sampling (1 thread) 100%|███████████████████████████████| Time: 0:17:39
Sampling (1 thread) 100%|███████████████████████████████| Time: 0:17:40

When fitting models with power-densities (i.e. using composite_loglikelihood_power_density), be aware that the $\beta$ parameter controls the mean-square differentiability of the composite likelihood (through the modified Bessel function of the second kind). If you run into issues, we suggest that you:

  • reparametrize the model (perhaps using a custom function via the composite_loglikelihood_custom interface).
  • tighten the priors to constrain the space of parameters to a region where gradients are defined.
  • use gradient-free inference methods such as Nelder-Mead when computing maximum likelihood estimates or Metropolis-Hastings when doing Bayesian inference (a sketch follows this list).
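
As a minimal sketch of the last point, and assuming a recent Turing release in which maximum_likelihood accepts an Optim.jl optimiser as its second argument, gradient-free inference on the model m defined above could look as follows (the same calls apply to a power-density model):

using Optim  # provides the gradient-free NelderMead() optimiser
# Gradient-free maximum likelihood estimate (sketch; the optimiser choice is an assumption).
mle_nm = maximum_likelihood(m, NelderMead())
# Gradient-free Bayesian inference via random-walk Metropolis-Hastings.
chains_mh = sample(m, MH(), 10_000)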

Model diagnosis

Diagnosing the posterior samples obtained via Markov Chain Monte Carlo is a crucial step of any Bayesian analysis. We refer to existing resources such as this lecture on Bayesian modelling in Biostatistics. The most popular approaches involve calculating quantities such as the effective sample size (ESS) and $\hat{R}$, which can be computed directly from the Turing output.

As a rule of thumb, we aim to run the chain long enough to obtain an ESS greater than 100.

ess(chains) |> DataFrame
2×3 DataFrame
 Row │ parameters  ess      ess_per_sec
     │ Symbol      Float64  Float64
─────┼───────────────────────────────────
   1 │ De          647.027     0.614003
   2 │ σ           665.676     0.6317

An $\hat{R}$ greater than 1.05 indicates that the chains have not mixed well.

rhat(chains) |> DataFrame
2×2 DataFrame
 Row │ parameters  rhat
     │ Symbol      Float64
─────┼─────────────────────
   1 │ De          1.00634
   2 │ σ           1.00589

Convergence issues can also be inspected through a traceplot:

using Plots, StatsPlots
traceplot(chains)
Example block output

Visualizing the posterior

After finishing the sampling process and assessing convergence, we can visualize the estimated posterior distribution of the parameters and compare it with the prior distribution and the ground-truth:

using Distributions
p1 = plot(
    Truncated(Normal(1000, 100), 0, Inf),
    label = "Prior",
    xlab = "Effective population density",
    ylab = "Density",
)
density!(p1, chains[:De][:], label = "Posterior")
vline!(p1, [ground_truth[1]], label = "Ground-truth")

p2 = plot(
    Truncated(Normal(1, 0.1), 0, Inf),
    label = "Prior",
    xlab = "Effective dispersal rate",
    ylab = "Density",
)
density!(p2, chains[:σ][:], label = "Posterior")
vline!(p2, [ground_truth[2]], label = "Ground-truth")
plot(p1, p2, layout = (2, 1), size = (600, 800))
Example block output

Notice that, although the inference is accurate, we do not expect the posterior estimates to have nominal coverage (e.g., that a 90% credible interval contains the true parameter 90% of the time). This is because we assume every pairwise observation is independent when constructing the composite likelihood, and therefore we are overconfident in our estimates. In their publication, Ringbauer et al. (2017) computed a bootstrap confidence interval around the maximum-likelihood estimate, but it still did not achieve nominal coverage.

A bootstrap confidence interval can easily be computed with the maximum_likelihood function by resampling the rows of the DataFrame with replacement. Notice, however, that other resampling schemes are possible (one alternative is sketched after the output below). We advise you to experiment with different approaches and to evaluate their performance with simulated datasets.

nboots = 100
boots = zeros(2, nboots)
df = data["df"]
for i = 1:nboots
    df_resampled = df[sample(1:nrow(df), nrow(df), replace = true), :]
    mle = maximum_likelihood(constant_density(df_resampled, data["contig_lengths"]))
    boots[:, i] = mle.values
end
conf_De = quantile(boots[1, :], [0.025, 0.975])
conf_σ = quantile(boots[2, :], [0.025, 0.975])
DataFrame(
    parameter = ["De", "σ"],
    confidence_interval95 = [conf_De, conf_σ],
    ground_truth = [ground_truth[1], ground_truth[2]],
)
2×3 DataFrame
 Row │ parameter  confidence_interval95    ground_truth
     │ String     Array…                   Float64
─────┼──────────────────────────────────────────────────
   1 │ De         [260.678, 299.6]            253.308
   2 │ σ          [0.0648972, 0.0716243]        0.0708872
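
One such alternative, sketched below under the assumption that rows sharing a DISTANCE_INDEX describe the same set of sample pairs, is to resample whole distance classes rather than individual rows, so that each bootstrap replicate keeps complete IBD-length histograms:

# Alternative resampling scheme (illustrative, not a recommendation of the package):
# resample entire distance classes, i.e. groups of rows sharing a DISTANCE_INDEX.
groups = groupby(df, :DISTANCE_INDEX)
boots_groups = zeros(2, nboots)
for i = 1:nboots
    idx = sample(1:length(groups), length(groups), replace = true)
    df_resampled = reduce(vcat, [DataFrame(groups[j]) for j in idx])
    mle = maximum_likelihood(constant_density(df_resampled, data["contig_lengths"]))
    boots_groups[:, i] = mle.values
end
conf_De_classes = quantile(boots_groups[1, :], [0.025, 0.975])
conf_σ_classes = quantile(boots_groups[2, :], [0.025, 0.975])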

Interpreting the output

Finally, we provide some references on how to interpret the results and relate them to the biological context.

Effective population density

The effective population density (De) is simply the inverse of the rate at which lineages coalesce once they come very close to each other. It is equivalent to the effective population size of every deme in a stepping stone model where demes are separated by one distance unit. Because it is estimated from identity-by-descent blocks, it is most informative about recent demographic events.

Effective dispersal rate

Perhaps unintuitively, the mean per-generation dispersal distance when we average with respect to both parents is not necessarily equivalent to the mean per-generation dispersal distance when we average across genetic lineages. From population genetic data, we can estimate the latter (refer to the Theory overview section for more details). Therefore, interpreting the estimated effective dispersal rate in the context of the ecology of a population has to be done carefully.

For example, consider a population with two separate sexes for which there are differences in the process of dispersal. Such differences arise naturally when considering the effect of mating systems on dispersal patterns. For simplicity, let's consider a 1-dimensional space where individuals can only move left or right as shown in the figure below.

Illustration of dispersal in a 1-dimensional space with separate sexes

Let's assume the displacement between the mother and the offspring, $d_{\text{mother-child}}$, follows a normal distribution with mean zero and variance $\sigma_D^2$, and the displacement between the mother and the father, $d_{\text{mother-father}}$, follows a normal distribution with mean zero and variance $\sigma_M^2$. Because the mating process (which determines the displacement from the mother to the father) is independent of the dispersal process of the offspring (which determines the displacement from the mother to the offspring), the displacement between the father and the offspring follows a normal distribution with mean zero and variance $\sigma_D^2 + \sigma_M^2$.

As mentioned earlier, what we can estimate is the mean per-generation dispersal of the lineages. Because a lineage is inherited either from the mother or from the father, the displacement is distributed as a mixture of two normal distributions. If the sex ratio is 0.5, the mixture has weights $w = [0.5, 0.5]$. This distribution has a mean of zero and a variance of $\sigma_D^2 + 0.5\sigma_M^2$, but it is not normally distributed. Therefore, under these assumptions, what we can estimate from genetic data is actually $\sigma = \sqrt{0.5\sigma_D^2 + 0.5(\sigma_D^2+\sigma_M^2)} = \sqrt{\sigma_D^2 + 0.5\sigma_M^2}$, and it is not possible to separate the effects of $\sigma_D$ and $\sigma_M$ without additional information. If a priori information is available, it is possible to obtain estimates of both $\sigma_D$ and $\sigma_M$ that are consistent with the observed data. However, this requires additional care as the posterior distribution will be multimodal.

@model function constant_density2(df, contig_lengths)
    De ~ Truncated(Normal(1000, 100), 0, Inf)
    σ_D ~ Truncated(Normal(1, 0.1), 0, Inf)
    σ_M ~ Truncated(Normal(0.1, 0.01), 0, Inf)
    σ := sqrt(σ_D^2 + 0.5 * σ_M^2)
    Turing.@addlogprob! composite_loglikelihood_constant_density(De, σ, df, contig_lengths)
end
constant_density2 (generic function with 2 methods)

As mentioned earlier, the mean per-generation distance when averaging across lineages is not necessarily equal to the mean per-generation distance when we average with respect to both mother and father. Notice that the average displacement is defined as $\hat d = 0.5\, d_{\text{mother-child}} + 0.5\, (d_{\text{mother-child}} - d_{\text{mother-father}})$, which follows a normal distribution with mean zero and variance $\sigma_{D}^{2} + 0.25\, \sigma_{M}^{2}$.
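
This distinction can be checked numerically. Below is a small self-contained Monte Carlo sketch, using assumed illustrative values for $\sigma_D$ and $\sigma_M$ (they are not related to the dataset analysed above), that compares the variance of lineage displacements with the variance of mid-parent displacements:

using Distributions, Statistics, Random

Random.seed!(1)
σ_D, σ_M = 1.0, 0.5                          # illustrative values (assumed)
n = 1_000_000
d_mother_child = rand(Normal(0, σ_D), n)     # mother → offspring displacement
d_mother_father = rand(Normal(0, σ_M), n)    # mother → father displacement
d_father_child = d_mother_child .- d_mother_father

# A genetic lineage is inherited from the mother or the father with equal probability.
from_mother = rand(Bool, n)
d_lineage = ifelse.(from_mother, d_mother_child, d_father_child)

# Mid-parent displacement: the average with respect to both parents.
d_midparent = 0.5 .* d_mother_child .+ 0.5 .* d_father_child

println("lineage variance:    ", var(d_lineage), "  (theory: ", σ_D^2 + 0.5 * σ_M^2, ")")
println("mid-parent variance: ", var(d_midparent), "  (theory: ", σ_D^2 + 0.25 * σ_M^2, ")")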


This page was generated using Literate.jl.