Neighbor-joining in Rust

Published

December 11, 2023

As part of my Master’s program at BiRC, I made a Rust crate (jargon for package/library) for building phylogenetic trees from large Phylip distance matrices using the Neighbor-Joining algorithm.

Our goal was to implement an efficient command-line application that implemented the canonical algorithm (with similar optimization tricks as in QuickTree) and the RapidNJ heuristic with a twist: using B-Trees rather than vecs. B-trees are (unsurprisingly) tree data structures that allow maintaining sorted data and perform insertions and deletions in \(\mathcal O(\log n)\), with a larger constant factor than binary search trees, but minimizing cache-misses which makes them faster in modern computers.

I was please to find that my canonical implementation was significantly faster than QuickTree (kudos to Rust for that!) and that I was able to obtain some performance gains for very large matrices (\(n>20000\)) with my heuristic. Still, it’s better to use the original implementation, although some work could be done to further improve it! To keep the tradition, I named my own program speedytree.

To be sure my implementation was correct, I elaborated an extensive suite of tests and explored the idea of property testing. In this testing technique, you simulate some random input to your function and assert that some property of the output is true. In my case, I simulated distance matrices from an additive binary tree and asserted that the topology (and the branch length) of the inferred tree was maintained using appropriate distance measures.

A few months later, I was asked to publish it as a separate package so the Rust ecosystem could benefit from it (so I have at least one user!). As a result, I published it as speedytree on crates.io.

use speedytree::DistanceMatrix;
use speedytree::{NeighborJoiningSolver, Canonical};
// Raw Phylip format
let input = "5
   a    0    5    9    9    8
   b    5    0    10    10    9
   c    9    10    0    8    7
   d    9    10    8    0    3
   e    8    9    7    3    0
".as_bytes();
let d = DistanceMatrix::read_from_phylip(input).expect("Invalid Phylip format");
// Canonical Neighbor Joining. Optimal for small problems.
let tree = NeighborJoiningSolver::<Canonical>::default(d)
  .solve()
  .expect("Failed to solve");

You can take a look at the documentation of the crate or my slides, which are publicly available on GitHub if you want to know more!