In our previous post, we overviewed how foundation models generate embeddings that can then be used for specific tasks such as image classification or segmentation. Here, we'll dive deeper into embedding concepts and explore how embeddings serve not just as backbones for downstream tasks, but also as a game-changing tool for fine-tuning models.
A foundation model’s intelligence lies in how it processes and represents visual information. Our foundation model, PLUTO, generates embeddings on image tiles. These tiles can be formed either by exhaustively gridding entire whole-slide images or by selectively placing them over particular areas within slides.
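For readers curious what exhaustive gridding looks like in practice, here is a minimal sketch; the tile size and slide dimensions are illustrative, and a production pipeline would also filter out background regions:

```python
# A minimal sketch of exhaustive grid tiling over a whole-slide image.
# Dimensions and tile size are illustrative, not PLUTO's actual settings.
def grid_tile_coords(slide_w, slide_h, tile_size):
    """Yield top-left (x, y) coordinates of non-overlapping tiles."""
    for y in range(0, slide_h - tile_size + 1, tile_size):
        for x in range(0, slide_w - tile_size + 1, tile_size):
            yield x, y

coords = list(grid_tile_coords(slide_w=100_000, slide_h=80_000, tile_size=1024))
print(len(coords))  # 7566 tiles for this hypothetical slide
```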
Because PLUTO is based on the DINOv2 architecture, it uses a Vision Transformer (ViT), which fundamentally changes how an image is processed compared to older convolutional networks.
Instead of processing an image all at once, the ViT architecture breaks the input image tile into a sequence of smaller, non-overlapping sub-patches called tokens. Two types of embeddings are then produced for each tile: a single CLS token embedding that summarizes the tile as a whole, and patch-token embeddings, one for each sub-patch.
Figure 2. (A) Whole-slide image; (B) a single tile from the WSI; (C) tile sub-patch tokens, along with the learned CLS token embedding and patch-token embeddings.
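To make the tokenization concrete, the sketch below splits a tile into non-overlapping sub-patches with plain PyTorch; the 224-pixel tile and 16-pixel patch size are common ViT defaults, not necessarily PLUTO's configuration:

```python
import torch

# Hypothetical 3-channel, 224x224 image tile (shapes are illustrative).
tile = torch.rand(1, 3, 224, 224)

patch_size = 16  # ViTs commonly use 14- or 16-pixel patches
# Unfold the tile into non-overlapping sub-patches ("tokens"):
patches = tile.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size**2)
print(patches.shape)  # torch.Size([1, 196, 768]) -- a 14 x 14 grid of tokens

# A ViT then linearly projects each token, prepends a learned CLS token,
# and after the transformer layers yields one CLS embedding per tile plus
# one embedding per patch token.
```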
The goal during foundation model training is to yield embeddings that preserve and compress biological information while distilling away noise. Image tiles that capture similar histology should have similar embeddings. Achieving this similarity is what allows foundation models to serve as backbones for many types of downstream tasks.
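As a toy illustration of this property, cosine similarity between embedding vectors is one standard way to quantify "similar embeddings"; the vectors below are random stand-ins, not real PLUTO outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two nearly identical "tumor" tiles versus an unrelated "stroma" tile.
emb_tumor_1 = np.random.randn(768)
emb_tumor_2 = emb_tumor_1 + 0.1 * np.random.randn(768)
emb_stroma = np.random.randn(768)

print(cosine_similarity(emb_tumor_1, emb_tumor_2))  # close to 1.0
print(cosine_similarity(emb_tumor_1, emb_stroma))   # close to 0.0
```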
We've successfully built various downstream models on top of PLUTO's representations, achieving impressive results on tasks like instance segmentation (identifying and delineating individual objects) and semantic segmentation (classifying every pixel into a category). These models are trained on annotations from expert pathologists, who draw tissue regions or individually label cells; this produces high-quality training data but is tough to scale for data-hungry models.
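For a sense of how a downstream head can sit on top of frozen embeddings, here is a minimal semantic-segmentation sketch: a linear classifier over patch tokens whose coarse predictions are upsampled to pixel resolution. The dimensions and class count are illustrative, not the architecture of our production models:

```python
import torch
import torch.nn as nn

class LinearSegHead(nn.Module):
    """Toy segmentation head over frozen patch-token embeddings."""
    def __init__(self, embed_dim=768, n_classes=5, grid=14, tile_px=224):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, n_classes)
        self.grid, self.tile_px = grid, tile_px

    def forward(self, patch_tokens):  # (B, grid*grid, embed_dim)
        logits = self.classifier(patch_tokens)  # per-token class logits
        b = logits.shape[0]
        logits = logits.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        # Upsample coarse token-level predictions to pixel resolution
        return nn.functional.interpolate(
            logits, size=(self.tile_px, self.tile_px),
            mode="bilinear", align_corners=False,
        )

head = LinearSegHead()
tokens = torch.rand(2, 196, 768)  # stand-in for PLUTO patch embeddings
print(head(tokens).shape)         # torch.Size([2, 5, 224, 224])
```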
Strategically curating training data to increase diversity, especially the representation of tough-to-classify histologies, can help models perform better, but finding examples of these tricky patterns among thousands of gigapixel images can be like searching for a needle in a haystack.
Our solution is to leverage PLUTO embeddings to mine for rare failure modes. Using an unsupervised approach, model developers can search for histologically similar areas within and across slides, based on the similarity of their PLUTO embeddings. Annotations can then be collected on these targeted areas to augment training data. Repeating this process iteratively improves model performance.
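In its simplest form, this mining step is a nearest-neighbor search over embedding vectors. The brute-force sketch below illustrates the idea; at the scale of millions of tiles, an approximate nearest-neighbor index would typically stand in for the full scan:

```python
import numpy as np

# Toy database of tile embeddings (random stand-ins for PLUTO outputs),
# L2-normalized so that a dot product equals cosine similarity.
rng = np.random.default_rng(0)
db = rng.standard_normal((100_000, 768)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

query = rng.standard_normal(768).astype(np.float32)
query /= np.linalg.norm(query)

scores = db @ query               # cosine similarity to every tile
top_k = np.argsort(-scores)[:20]  # indices of the 20 most similar tiles
# These tiles can then be reviewed and sent for pathologist annotation.
```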
Figure 3. Similarity search enables iteratively improving models based on failure mode mining.
This strategy stemmed from a recent company hackathon project that brought together PathAI’s ML scientists and software engineers. As a proof of concept, the hackathon team built a custom database to store and rapidly query millions of tile-level embeddings across thousands of slides, as well as a web interface that lets users click on whole-slide images and immediately view the most similar tiles across the database. This interface enables users to quickly assess the search results and decide whether the tiles should be sent to expert pathologists for annotation, or whether the search should be further refined. The developers also added options to return tiles only from unique slides or cases, to increase the diversity of the results.
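The "unique slides" option can be implemented as a simple filter over ranked results, along these lines (a sketch; `slide_ids` is a hypothetical lookup from tile index to source slide):

```python
import numpy as np

def top_k_unique_slides(scores, slide_ids, k):
    """Return indices of the k best hits, keeping at most one tile per slide."""
    seen, hits = set(), []
    for idx in np.argsort(-scores):  # scan in descending similarity
        if slide_ids[idx] not in seen:
            seen.add(slide_ids[idx])
            hits.append(int(idx))
            if len(hits) == k:
                break
    return hits

scores = np.array([0.9, 0.85, 0.8, 0.7])  # toy similarity scores
slide_ids = np.array([7, 7, 3, 3])        # two tiles each from slides 7 and 3
print(top_k_unique_slides(scores, slide_ids, k=2))  # [0, 2]
```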
Figure 4. Similarity search user interface.
In Figure 4 above, the query tile contains a particular histological morphology - densely inflamed cancer stroma. In these areas, immune cells may be so tightly packed that they resemble cancer tissue to a model. The search retrieved additional tiles with densely inflamed stroma, and these tiles can be annotated to fuel further model fine-tuning efforts. Additional examples are shown in Figure 5 below.
As pathology becomes increasingly digital, we foresee many potential extensions of similarity search, both within model development and beyond it.
We’ve already used the approach described above to improve performance in our pan-indication tissue model as well as more specialized cell models. Building on this success, we are actively exploring several extensions to our similarity search capabilities. Our primary focus is on improving search results by further increasing the diversity of the top hits, which leads to better training data quality. We are also developing ways to steer results more precisely by using multiple queries with both positive and negative examples. A major leap forward will be the integration of our latest foundation model, PLUTO-4, to generate significantly more robust embeddings.
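One plausible way to combine positive and negative examples is a Rocchio-style query, averaging the positives and subtracting a weighted average of the negatives; this sketch is an assumption about the approach, not our exact formulation:

```python
import numpy as np

def combined_query(positives, negatives, beta=0.5):
    """Blend positive and negative example embeddings into one query vector."""
    q = positives.mean(axis=0) - beta * negatives.mean(axis=0)
    return q / np.linalg.norm(q)

rng = np.random.default_rng(1)
pos = rng.standard_normal((3, 768))  # tiles the user wants more of
neg = rng.standard_normal((2, 768))  # tiles the user wants to avoid
query = combined_query(pos, neg)     # use in place of a single-tile query
```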
We are further refining the user interface and integrating similarity search results into core workflows, automating the collection of targeted training data from our network of expert pathologists. We are also exploring how to integrate language and vision models to allow users to search for histologies of interest from text descriptions.
In future posts, we’ll showcase different use cases for integrating PLUTO embeddings directly into models.