In our previous post, we overviewed how foundation models generate embeddings that can then be used for specific tasks such as image classification or segmentation. Here, we'll dive deeper into embedding concepts and explore how embeddings serve not just as backbones for downstream tasks, but also as a game-changing tool for fine-tuning models.
A foundation model’s intelligence lies in how it processes and represents visual information. Our foundation model, PLUTO, generates embeddings on image tiles. These tiles can be formed either by exhaustively gridding entire whole-slide images or by selectively placing them over particular areas within slides.
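For readers curious what exhaustive gridding looks like in practice, here is a minimal sketch; the tile size and slide dimensions are illustrative, and a production pipeline would also filter out background regions:

```python
# A minimal sketch of exhaustive grid tiling over a whole-slide image.
# Dimensions and tile size are illustrative, not PLUTO's actual settings.
def grid_tile_coords(slide_w, slide_h, tile_size):
    """Yield top-left (x, y) coordinates of non-overlapping tiles."""
    for y in range(0, slide_h - tile_size + 1, tile_size):
        for x in range(0, slide_w - tile_size + 1, tile_size):
            yield x, y

coords = list(grid_tile_coords(slide_w=100_000, slide_h=80_000, tile_size=1024))
print(len(coords))  # 7566 tiles for this hypothetical slide
```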
Because PLUTO is based on the DINOv2 architecture, it uses a Vision Transformer (ViT), which fundamentally changes how an image is processed compared to older convolutional networks.
Instead of processing an image all at once, the ViT architecture breaks the input image tile into a sequence of smaller, non-overlapping sub-patches called tokens. Two types of embeddings are then produced for each tile: a single CLS token embedding that summarizes the tile as a whole, and patch-token embeddings, one for each sub-patch.
Figure 2. (A) Whole-slide image; (B) a single tile from the WSI; (C) tile sub-patch tokens, along with the learned CLS token embedding and patch-token embeddings.
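To make the tokenization concrete, the sketch below splits a tile into non-overlapping sub-patches with plain PyTorch; the 224-pixel tile and 16-pixel patch size are common ViT defaults, not necessarily PLUTO's configuration:

```python
import torch

# Hypothetical 3-channel, 224x224 image tile (shapes are illustrative).
tile = torch.rand(1, 3, 224, 224)

patch_size = 16  # ViTs commonly use 14- or 16-pixel patches
# Unfold the tile into non-overlapping sub-patches ("tokens"):
patches = tile.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size**2)
print(patches.shape)  # torch.Size([1, 196, 768]) -- a 14 x 14 grid of tokens

# A ViT then linearly projects each token, prepends a learned CLS token,
# and after the transformer layers yields one CLS embedding per tile plus
# one embedding per patch token.
```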
The goal during foundation model training is to yield embeddings that preserve and compress biological information while distilling away noise. Image tiles that capture similar histology should have similar embeddings. Achieving this similarity is what allows foundation models to serve as backbones for many types of downstream tasks.
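As a toy illustration of this property, cosine similarity between embedding vectors is one standard way to quantify "similar embeddings"; the vectors below are random stand-ins, not real PLUTO outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two nearly identical "tumor" tiles versus an unrelated "stroma" tile.
emb_tumor_1 = np.random.randn(768)
emb_tumor_2 = emb_tumor_1 + 0.1 * np.random.randn(768)
emb_stroma = np.random.randn(768)

print(cosine_similarity(emb_tumor_1, emb_tumor_2))  # close to 1.0
print(cosine_similarity(emb_tumor_1, emb_stroma))   # close to 0.0
```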
We've successfully built various downstream models on top of PLUTO's representations, achieving impressive results on tasks like instance segmentation (identifying and delineating individual objects) and semantic segmentation (classifying every pixel into a category). These models are trained on annotations from expert pathologists, who draw tissue regions or individually label cells; this produces high-quality training data but is tough to scale for data-hungry models.
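For a sense of how a downstream head can sit on top of frozen embeddings, here is a minimal semantic-segmentation sketch: a linear classifier over patch tokens whose coarse predictions are upsampled to pixel resolution. The dimensions and class count are illustrative, not the architecture of our production models:

```python
import torch
import torch.nn as nn

class LinearSegHead(nn.Module):
    """Toy segmentation head over frozen patch-token embeddings."""
    def __init__(self, embed_dim=768, n_classes=5, grid=14, tile_px=224):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, n_classes)
        self.grid, self.tile_px = grid, tile_px

    def forward(self, patch_tokens):  # (B, grid*grid, embed_dim)
        logits = self.classifier(patch_tokens)  # per-token class logits
        b = logits.shape[0]
        logits = logits.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        # Upsample coarse token-level predictions to pixel resolution
        return nn.functional.interpolate(
            logits, size=(self.tile_px, self.tile_px),
            mode="bilinear", align_corners=False,
        )

head = LinearSegHead()
tokens = torch.rand(2, 196, 768)  # stand-in for PLUTO patch embeddings
print(head(tokens).shape)         # torch.Size([2, 5, 224, 224])
```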
Strategically curating training data to increase diversity, especially the representation of tough-to-classify histologies, can help models perform better, but finding examples of these tricky patterns among thousands of gigapixel images can be like searching for a needle in a haystack.
Our solution is to leverage PLUTO embeddings to mine for rare failure modes. Using an unsupervised approach, model developers can search for histologically similar areas within and across slides, based on the similarity of their PLUTO embeddings. Annotations can then be collected on these targeted areas to augment training data. Repeating this process iteratively improves model performance.
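In its simplest form, this mining step is a nearest-neighbor search over embedding vectors. The brute-force sketch below illustrates the idea; at the scale of millions of tiles, an approximate nearest-neighbor index would typically stand in for the full scan:

```python
import numpy as np

# Toy database of tile embeddings (random stand-ins for PLUTO outputs),
# L2-normalized so that a dot product equals cosine similarity.
rng = np.random.default_rng(0)
db = rng.standard_normal((100_000, 768)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

query = rng.standard_normal(768).astype(np.float32)
query /= np.linalg.norm(query)

scores = db @ query               # cosine similarity to every tile
top_k = np.argsort(-scores)[:20]  # indices of the 20 most similar tiles
# These tiles can then be reviewed and sent for pathologist annotation.
```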
Figure 3. Similarity search enables iteratively improving models based on failure mode mining.
This strategy stemmed from a recent company hackathon project that brought together PathAI’s ML scientists and software engineers. As a proof of concept, the hackathon team built a custom database to store and rapidly query millions of tile-level embeddings across thousands of slides, as well as a web interface that lets users click on whole-slide images and immediately view the most similar tiles across the database. This interface enables users to quickly assess the search results and decide whether the tiles should be sent to expert pathologists for annotation, or whether the search should be further refined. The developers also added options to return tiles only from unique slides or cases, to increase the diversity of the results.
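The "unique slides" option can be implemented as a simple filter over ranked results, along these lines (a sketch; `slide_ids` is a hypothetical lookup from tile index to source slide):

```python
import numpy as np

def top_k_unique_slides(scores, slide_ids, k):
    """Return indices of the k best hits, keeping at most one tile per slide."""
    seen, hits = set(), []
    for idx in np.argsort(-scores):  # scan in descending similarity
        if slide_ids[idx] not in seen:
            seen.add(slide_ids[idx])
            hits.append(int(idx))
            if len(hits) == k:
                break
    return hits

scores = np.array([0.9, 0.85, 0.8, 0.7])  # toy similarity scores
slide_ids = np.array([7, 7, 3, 3])        # two tiles each from slides 7 and 3
print(top_k_unique_slides(scores, slide_ids, k=2))  # [0, 2]
```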
Figure 4. Similarity search user interface.
In Figure 4 above, the query tile contains a particular histological morphology - densely inflamed cancer stroma. In these areas, immune cells may be so tightly packed that they resemble cancer tissue to a model. The search retrieved additional tiles with densely inflamed stroma, and these tiles can be annotated to fuel further model fine-tuning efforts. Additional examples are shown in Figure 5 below.
As pathology becomes increasingly digital, we foresee many potential extensions of similarity search, both within model development and beyond it.
We’ve already used the approach described above to improve performance in our pan-indication tissue model as well as more specialized cell models. Building on this success, we are actively exploring several extensions to our similarity search capabilities. Our primary focus is on improving search results by further increasing the diversity of the top hits, which leads to better training data quality. We are also developing ways to steer results more precisely by using multiple queries with both positive and negative examples. A major leap forward will be the integration of our latest foundation model, PLUTO-4, to generate significantly more robust embeddings.
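One plausible way to combine positive and negative examples is a Rocchio-style query, averaging the positives and subtracting a weighted average of the negatives; this sketch is an assumption about the approach, not our exact formulation:

```python
import numpy as np

def combined_query(positives, negatives, beta=0.5):
    """Blend positive and negative example embeddings into one query vector."""
    q = positives.mean(axis=0) - beta * negatives.mean(axis=0)
    return q / np.linalg.norm(q)

rng = np.random.default_rng(1)
pos = rng.standard_normal((3, 768))  # tiles the user wants more of
neg = rng.standard_normal((2, 768))  # tiles the user wants to avoid
query = combined_query(pos, neg)     # use in place of a single-tile query
```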
We are further refining the user interface and integrating similarity search results into core workflows, automating the collection of targeted training data from our network of expert pathologists. We are also exploring how to integrate language and vision models to allow users to search for histologies of interest from text descriptions.
In future posts, we’ll showcase different use cases for integrating PLUTO embeddings directly into models.