Assessing reliability of learning Gaussian graphical models from microbial abundances
Project in Bioinformatics (10 ECTS)
Curro Campuzano Jiménez
Bioinformatics Research Centre, Aarhus University
January 18, 2024
Microbial networks
From “who is there” to “who co-occurs with whom, and why?”
- Amplicon and metagenomic sequencing has generated lot’s of co-occurrence data in the past few years.
- A microbial co-occurrence networks is a representation of the conditional dependence structure between microbial abundances1
- Nodes are taxonomic units
- Edges are significant associations
Gaussian graphical models in a nutshell
A cartoon version
\[X \sim \mathcal N_p(0, \Omega^{-1})\]
Figure 1: Explanatory diagram
Graphical lasso
Estimating an sparse precision matrix in the likelihood framework (graphical lasso)
For a given \(\lambda\), just solve
\[
\hat \Omega(\lambda) = \arg \min_{\Omega\in M^+} (-L(\Omega) + \lambda ||\Omega||_1)
\]
But, how do you choose \(\lambda\)?
SpiecEASI (CLR + StARS)
SpiecEASI selects \(\lambda\) using the StARS after applying CLR
First, transform data using the centered log-ratio
Compute the stability of each edge across subsamples without replacement using Meinshausen and Bühlmann method
Choose \(\lambda\) so there is a small but acceptable amount of variability (\(\beta=0.05\))
Bayesian Gaussian graphical model
Estimating an sparse precision matrix using BDgraph
We do not use a lasso prior, but a discrete and continuous mixture prior distribution called G-Wishart
Estimate the joint posterior distribution of the graph and the precision matrix using a trans dimensional MCMC algorithm
\[
P(G, \Omega|Z) \propto P(Z|G, \Omega) P(\Omega|G)P(G)
\]
- An average edge inclusion probability can be computed as the frequency of that edge when sampling from the chain
Motivation
How reliable their results are in the context of typical microbial ecology studies?
Gaussian graphical models have received lots of attention since the publication of SpiecEASI (\(>1000\) citations), which popularized using StARS
I compared SpiecEASI with a novel Bayesian approach with a main focus in measuring the effect of the sample size
Simulation of synthetic datasets
No standard simulation protocol or curated benchmark dataset exists
We simulated multivariate normal data and microbial counts for 53 graphs of different characteristics.
Figure 2: Simulated Random network
Recovery of true graph (F1 score)
Recovering the whole microbial network is an unrealistic goal
Figure 3: F1 distribution across 1590 models trained with microbial counts
What do you prefer, spurious or missing edges?
SpiecEASI seeks to minimize the number of missing edges
Figure 4: Recall and precision distribution across 1590 models trained with microbial counts
Predicting network properties (modularity)
Network properties have large associated errors
Figure 5: Modularity error across 1590 models trained with microbial counts
Module identification clusters (niche preference)
Figure 6: Comparison of the true and inferred clusters for 60 models trained with microbial counts
Credibility and confidence intervals (hub-score)
Credibility and confidence intervals can hardly improve the analysis robustness
Future work
- Write agnostic simulation software to assess uncertainty
- What happens when we aggregate nodes?
- And what happens with known confounding variables (environmental factors)?
Conclussion
Researchers may have overestimated the reliability of network inference methods
They should justify their choices and link them to biology
They should consider the associated error of their estimands via extensive simulation of synthetic datasets
Methods
Our code is available in GitHub.
Gaussian graphical models in a nutshell
Math definition
Let us consider a multivariate Gaussian random variable \(X \sim \mathcal N_p(0, \Omega^{-1})\).1 Let us consider an undirected graph \(G(V, E)\).
- If \(\Omega_{ij} = 0\), then \(X_i\) and \(X_j\) are independent when conditioning on the rest of the dimensions.
- If edge \((i, j) \in E\), then node \(i\) and node \(j\) are dependent when conditioning on the rest of the dimensions.
- Then, after estimating \(\hat \Omega^{-1}\) from data, we can determine the graph from the pattern of zero entries in \(\hat \Omega^{-1}\).
StARS
Stability Approach to Regularization Selection (StARS)
- Draw \(m\) subsamples without replacement
- Infer \(m\) graphs for each \(\lambda_i\) using the Meinshausen and Bühlmann method (slightly different approach)
- Measure the edge instability through the \(m\) samples for a fixed \(\lambda\)
- Choose sparsest graph such that the average total instability is equal to or more than \(\beta=0.05\)
\(k\) top-ranked edges
We can increase the accuracy, at the cost of having a new parameter \(k\)
Figure 9: Precision of the top \(k\) edges ranked according to probability of inclussion or stability score
Hu et al.
Hu et al. used 328 taxa and 9 samples, and 152 taxa and 9 samples