Assessing reliability of learning Gaussian graphical models from microbial abundances

Project in Bioinformatics (10 ECTS)

Curro Campuzano Jiménez

Bioinformatics Research Centre, Aarhus University

January 18, 2024

Microbial networks

From “who is there” to “who co-occurs with whom, and why?”

  • Amplicon and metagenomic sequencing has generated lot’s of co-occurrence data in the past few years.
  • A microbial co-occurrence networks is a representation of the conditional dependence structure between microbial abundances1
    • Nodes are taxonomic units
    • Edges are significant associations

Gaussian graphical models in a nutshell

A cartoon version

\[X \sim \mathcal N_p(0, \Omega^{-1})\]

Figure 1: Explanatory diagram

Graphical lasso

Estimating an sparse precision matrix in the likelihood framework (graphical lasso)

For a given \(\lambda\), just solve

\[ \hat \Omega(\lambda) = \arg \min_{\Omega\in M^+} (-L(\Omega) + \lambda ||\Omega||_1) \]

But, how do you choose \(\lambda\)?

SpiecEASI (CLR + StARS)

SpiecEASI selects \(\lambda\) using the StARS after applying CLR

  1. First, transform data using the centered log-ratio

  2. Compute the stability of each edge across subsamples without replacement using Meinshausen and Bühlmann method

  3. Choose \(\lambda\) so there is a small but acceptable amount of variability (\(\beta=0.05\))

Bayesian Gaussian graphical model

Estimating an sparse precision matrix using BDgraph

  • We do not use a lasso prior, but a discrete and continuous mixture prior distribution called G-Wishart

  • Estimate the joint posterior distribution of the graph and the precision matrix using a trans dimensional MCMC algorithm

\[ P(G, \Omega|Z) \propto P(Z|G, \Omega) P(\Omega|G)P(G) \]

  • An average edge inclusion probability can be computed as the frequency of that edge when sampling from the chain

Motivation

How reliable their results are in the context of typical microbial ecology studies?

Gaussian graphical models have received lots of attention since the publication of SpiecEASI (\(>1000\) citations), which popularized using StARS

I compared SpiecEASI with a novel Bayesian approach with a main focus in measuring the effect of the sample size

Results

Simulation of synthetic datasets

No standard simulation protocol or curated benchmark dataset exists

We simulated multivariate normal data and microbial counts for 53 graphs of different characteristics.

Figure 2: Simulated Random network

Recovery of true graph (F1 score)

Recovering the whole microbial network is an unrealistic goal

Figure 3: F1 distribution across 1590 models trained with microbial counts

What do you prefer, spurious or missing edges?

SpiecEASI seeks to minimize the number of missing edges

Figure 4: Recall and precision distribution across 1590 models trained with microbial counts

Predicting network properties (modularity)

Network properties have large associated errors

Figure 5: Modularity error across 1590 models trained with microbial counts

Module identification clusters (niche preference)

SpiecEASI outperforms the Bayesian method when identifying clusters

Figure 6: Comparison of the true and inferred clusters for 60 models trained with microbial counts

Credibility and confidence intervals (hub-score)

Credibility and confidence intervals can hardly improve the analysis robustness

(a)

(b)

Figure 7: Credibility and confidence intervals for 60 models trained with microbial counts

Future work

  1. Write agnostic simulation software to assess uncertainty
  2. What happens when we aggregate nodes?
  3. And what happens with known confounding variables (environmental factors)?

Conclussion

  1. Researchers may have overestimated the reliability of network inference methods

  2. They should justify their choices and link them to biology

  3. They should consider the associated error of their estimands via extensive simulation of synthetic datasets

Thanks!

Extra

Methods

Our code is available in GitHub.

Gaussian graphical models in a nutshell

Math definition

Let us consider a multivariate Gaussian random variable \(X \sim \mathcal N_p(0, \Omega^{-1})\).1 Let us consider an undirected graph \(G(V, E)\).

  1. If \(\Omega_{ij} = 0\), then \(X_i\) and \(X_j\) are independent when conditioning on the rest of the dimensions.
  2. If edge \((i, j) \in E\), then node \(i\) and node \(j\) are dependent when conditioning on the rest of the dimensions.
  3. Then, after estimating \(\hat \Omega^{-1}\) from data, we can determine the graph from the pattern of zero entries in \(\hat \Omega^{-1}\).

Data transformation

The centered log-ratio transformation

  • Microbial abundance data is not normal and it is highly compositional
    • Apply \(\text{clr}(\mathbf x)_i= \log\frac{x_i}{g(\mathbf x)}\) with \(g(\mathbf x) = \left(\prod_{i=1}^Dx_i\right)^{1/D}\)

(a)

(b)

Figure 8: Simulated microbial abundances before and after transformation

StARS

Stability Approach to Regularization Selection (StARS)

  1. Draw \(m\) subsamples without replacement
  2. Infer \(m\) graphs for each \(\lambda_i\) using the Meinshausen and Bühlmann method (slightly different approach)
  3. Measure the edge instability through the \(m\) samples for a fixed \(\lambda\)
  4. Choose sparsest graph such that the average total instability is equal to or more than \(\beta=0.05\)

\(k\) top-ranked edges

We can increase the accuracy, at the cost of having a new parameter \(k\)

Figure 9: Precision of the top \(k\) edges ranked according to probability of inclussion or stability score

Hu et al. 

Hu et al. used 328 taxa and 9 samples, and 152 taxa and 9 samples