In our previous post introducing PLUTO-4¹, we described how foundation models transform complex input data like pathology images into embeddings—compact numerical representations that capture the key features of the data.
These embeddings are powerful because they distill gigapixel slides into compact representations that downstream models can learn from efficiently.
One important application of this approach is slide-level prediction, where embeddings from image tiles are aggregated and passed to a classifier. Our additive multiple instance learning (aMIL) method² has shown strong performance across many use cases.³⁴
Building on this strong visual foundation, we are now exploring how additional context from language can further enrich these representations.
Pathologists interpret histologic specimens using rich, structured morphological descriptions of the tissue, such as:
“Irregular nests of atypical melanocytes with pagetoid scatter and dermal invasion…”
These descriptions capture nuanced patterns and relationships that are not captured by discrete categorical class labels.
By incorporating language alongside images, we can bring this descriptive layer directly into AI models, giving them access to the nuance that discrete class labels leave out.
Rather than replacing existing approaches, language can enhance and complement visual understanding—helping models form a richer and more structured view of pathology.
To combine these modalities, we train models to learn a joint embedding space between whole-slide images of histopathology specimens and disease descriptions.
Specifically, we generate whole-slide-level embeddings by first extracting tile embeddings using our PLUTO vision foundation model and then aggregating them through a learned attention pooling network, allowing the model to focus on the most relevant regions of the slide rather than treating all areas equally.
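The attention pooling step described above can be sketched in a few lines. This is a minimal NumPy illustration, not the actual PLUTO aggregation network: the embedding dimensions and the gated-free tanh scoring form are assumptions, and the parameters are randomly initialized rather than learned.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_pool(tile_embeddings, w, v):
    """Aggregate per-tile embeddings into a single slide embedding.

    tile_embeddings: (n_tiles, dim) array of tile embeddings.
    w: (dim, hidden) and v: (hidden,) stand in for learned attention
    parameters (randomly initialized here for illustration).
    """
    # Score each tile, then normalize scores across tiles.
    scores = np.tanh(tile_embeddings @ w) @ v   # (n_tiles,)
    attn = softmax(scores)                      # weights sum to 1
    # Weighted sum: high-attention tiles dominate the slide embedding.
    return attn @ tile_embeddings               # (dim,)

rng = np.random.default_rng(0)
tiles = rng.normal(size=(100, 64))   # 100 tiles, 64-dim embeddings (toy sizes)
w = rng.normal(size=(64, 32))
v = rng.normal(size=(32,))
slide_emb = attention_pool(tiles, w, v)
```

Because the weights are normalized across tiles, a slide with one small diagnostic focus can still produce a representative embedding: the relevant tiles receive most of the attention mass.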
In parallel, we extract rich descriptions for every slide from pathologists’ diagnostic reports. These are augmented by textbook-style morphological descriptions written by a large language model (LLM) in the role of an expert pathologist, instructed to capture the defining visual features of each diagnosis. The descriptions are then embedded using a prominent general-purpose language foundation model, followed by a projection layer to produce text embeddings in the same space.
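The projection layer that maps text-encoder features into the shared space amounts to a small learned transform. A minimal sketch, assuming a 768-dim text encoder output and a 64-dim shared space (both hypothetical sizes), with randomly initialized parameters standing in for the learned ones:

```python
import numpy as np

def project_text(text_features, proj_w, proj_b):
    """Map language-model features into the shared slide/text space.

    proj_w: (768, 64) and proj_b: (64,) stand in for the learned
    projection parameters; the dimensions are illustrative only.
    """
    z = text_features @ proj_w + proj_b
    # L2-normalize so slides and descriptions compare by cosine similarity.
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

rng = np.random.default_rng(4)
features = rng.normal(size=(5, 768))   # 5 descriptions from a text encoder
proj_w = rng.normal(size=(768, 64))
proj_b = np.zeros(64)
text_embs = project_text(features, proj_w, proj_b)
```

Normalizing the outputs keeps both modalities on the unit sphere, which is the usual convention when the alignment objective is cosine-similarity based.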
We then align slide and text embeddings using a contrastive learning objective, illustrated in the figure below: matching slide-text pairs are pulled together in the shared space, while mismatched pairs are pushed apart.
Unlike traditional classification objectives, which learn a fixed mapping from inputs to a predefined set of labels, this objective learns a shared embedding space in which relationships between inputs and label descriptions are explicitly encoded. This structure allows predictions to be made in the shared space through direct comparison between slides and descriptions.
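A common way to realize such a contrastive objective is a symmetric InfoNCE loss over a batch, where row i of the slide batch and row i of the text batch form the matching pair and all other rows act as negatives. The sketch below is a generic CLIP-style formulation in NumPy, not the post's exact training code; the temperature value is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def contrastive_loss(slide_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired slide/text embeddings."""
    # L2-normalize so the dot product is cosine similarity.
    s = slide_emb / np.linalg.norm(slide_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature          # (batch, batch) similarities
    n = len(logits)
    idx = np.arange(n)
    # Cross-entropy toward the diagonal, in both directions.
    loss_s2t = -np.log(softmax(logits)[idx, idx]).mean()
    loss_t2s = -np.log(softmax(logits.T)[idx, idx]).mean()
    return (loss_s2t + loss_t2s) / 2

rng = np.random.default_rng(1)
slides = rng.normal(size=(8, 64))
texts_aligned = slides + 0.01 * rng.normal(size=(8, 64))  # nearly matching pairs
loss_aligned = contrastive_loss(slides, texts_aligned)
loss_random = contrastive_loss(slides, rng.normal(size=(8, 64)))
```

When the paired embeddings are already close, the loss is near zero; for unrelated text the loss sits near log(batch size), which is the gradient signal that pulls matching pairs together.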
In practice, this reframes the task from:
“Which label is this slide?”
to:
“Which description best matches this slide?”
This shift enables a more flexible and expressive way to model diagnosis, as predictions are grounded in explicit disease descriptions rather than opaque discrete labels.
We applied this approach to whole-slide prediction of dermatopathology (Derm) and gastrointestinal (GI) disease labels. Prediction in these domains is inherently challenging: potential diagnoses span a large number of fine-grained subtypes, many of which are visually similar and differ only by subtle morphological patterns. Compounding this, rare conditions create a long tail of underrepresented classes with limited training examples.
|  | Dermatopathology | Gastrointestinal pathology |
| --- | --- | --- |
| Classes (N) | 17 | 39 |
| Subclasses (N) | 52 | 102 |
| Training set slides (N) | 29,655 | 59,035 |
| Evaluation set slides (N) | 3,388 | 7,857 |
The addition of language changes how the model organizes information internally. To better understand this, we examine the structure of both text embeddings and slide embeddings in the shared space using UMAP projections as illustrated below for the dermatopathology application.
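The inspection described above reduces to projecting high-dimensional embeddings to 2-D and looking for cluster structure. The post uses UMAP for its figures; as a dependency-free stand-in, the sketch below uses a plain PCA projection via NumPy's SVD to show the same kind of workflow on toy data.

```python
import numpy as np

def project_2d(embeddings):
    """Project high-dimensional embeddings to 2-D via PCA.

    PCA is used here only as a dependency-free stand-in for UMAP
    (which would come from the umap-learn package).
    """
    centered = embeddings - embeddings.mean(axis=0)
    # Top-2 principal directions from the SVD of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T   # (n, 2) coordinates for plotting

rng = np.random.default_rng(2)
# Two toy "diagnostic clusters" of 64-dim embeddings.
cluster_a = rng.normal(loc=0.0, size=(50, 64))
cluster_b = rng.normal(loc=3.0, size=(50, 64))
coords = project_2d(np.vstack([cluster_a, cluster_b]))
```

In the real analysis each point would be a slide or label-description embedding, colored by diagnostic category, so that grouping by higher-level category becomes visible in the 2-D plot.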
Text embeddings reflect diagnostic structure.
When we visualize the text embeddings for the 52 subclasses alone, we see that subclasses naturally group by higher-level diagnostic categories: for example, subclasses of basal cell carcinoma, inflammatory dermatoses, and squamous lesions each form their own clusters.
This result indicates that the aligned language embeddings can capture clinically meaningful relationships between dermatopathology diagnoses.
Slide embeddings align with this structure.
When we project the slide embeddings of the held-out evaluation set into the same space, we observe that slides cluster alongside the text embeddings of their corresponding diagnoses, reproducing the same diagnostic structure.
For example, the organization of subclasses within basal cell carcinoma, inflammatory dermatoses, and squamous lesions is reflected in both the text and slide embedding spaces, suggesting that the model captures similar morphological relationships across modalities.
We evaluated the classification performance of the multimodal approach using nearest-neighbor matching in the shared embedding space. Specifically, for each slide, we compute its embedding and assign a label based on the closest matching text embedding of the label descriptions.
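The nearest-neighbor matching step is simple to state in code: compute cosine similarity between each slide embedding and every label-description embedding, and take the argmax. A minimal sketch, where the label names and embedding dimension are hypothetical and the "embeddings" are synthetic:

```python
import numpy as np

def classify_by_description(slide_embs, text_embs, labels):
    """Assign each slide the label whose description embedding is
    closest (highest cosine similarity) in the shared space."""
    s = slide_embs / np.linalg.norm(slide_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = s @ t.T                        # (n_slides, n_labels)
    return [labels[i] for i in sims.argmax(axis=1)]

rng = np.random.default_rng(3)
label_texts = rng.normal(size=(3, 64))      # stand-ins for 3 description embeddings
labels = ["melanoma", "bcc", "nevus"]       # hypothetical label names
# Synthetic slides placed near their "correct" description embeddings.
slides = label_texts[[2, 0, 1]] + 0.05 * rng.normal(size=(3, 64))
preds = classify_by_description(slides, label_texts, labels)
```

Note that the label-description embeddings are fixed once training is done, so this comparison is a single small matrix product per slide.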
We compare this approach with our image-only aMIL classifiers which are trained using classification loss on fixed label sets.
Across both domains, the multimodal approach outperformed the image-only aMIL classifiers.
These gains highlight how language provides complementary signals that could strengthen the prediction of possible diagnoses.
It's important to note that inference remains efficient with the multimodal approach even though two modalities are involved: the text embeddings of the label descriptions are fixed after training and can be computed once, so at inference time only the slide embedding must be computed and compared against this small set of vectors.
As a result, inference complexity is comparable to standard MIL models.
Together, these results show that the shared embedding space is not only more structured and consistent with how pathologists organize and relate diseases but also more accurate.
In pathology, understanding a case involves interpreting patterns, context, and relationships across the slide. By incorporating language, we can move from models that solely reason over visual patterns to ones that can reason over richer, more descriptive representations of disease.
This shift enables more adaptable systems that can better handle the complexity and variability of real-world pathology - particularly in settings where distinctions are subtle and categories continue to evolve.
Looking ahead, aligning images with language creates a path toward more flexible and expressive models.
By combining the strong visual foundation of PLUTO with ongoing advances in language models, multi-modal approaches represent a promising direction for the future of digital pathology.
1. PLUTO-4: Frontier Pathology Foundation Models. https://arxiv.org/abs/2511.02826
2. Additive MIL: Intrinsically Interpretable Multiple Instance Learning for Pathology. https://proceedings.neurips.cc/paper_files/paper/2022/file/82764461a05e933cc2fd9d312e107d12-Paper-Conference.pdf
3. AIM-HER2 Breast Cancer. https://www.pathai.com/aim-her2-breast-cancer
4. Artificial intelligence enables prediction of MET amplification and associated morphologic features from H&E-stained NSCLC specimens. Cancer Res (2025) 85 (8_Supplement_1): 2430. https://aacrjournals.org/cancerres/article/85/8_Supplement_1/2430/756341/Abstract-2430-Artificial-intelligence-enables