
Advancing Multimodal Pathology: Learning from Images and Language for Disease Classification

Background: From images to embeddings

In our previous post introducing PLUTO-4 [1], we described how foundation models transform complex input data like pathology images into embeddings—compact numerical representations that capture the key features of the data.

These embeddings are powerful because they:

  • Distill complex visual information into a versatile format
  • Enable efficient training of downstream models
  • Generalize across a wide range of pathology tasks

One important application of this approach is slide-level prediction, where embeddings from image tiles are aggregated and passed to a classifier. Our additive multiple instance learning (aMIL) method [2] has shown strong performance across many use cases [3,4].

Building on this strong visual foundation, we are now exploring how additional context from language can further enrich these representations.

 

Expanding beyond vision with language context

Pathologists interpret histologic specimens using rich, structured morphological descriptions of the tissue, such as:

Irregular nests of atypical melanocytes with pagetoid scatter and dermal invasion…

These descriptions capture nuanced patterns and relationships that are not captured by discrete categorical class labels.

By incorporating language alongside images, we can bring this descriptive layer directly into AI models. This creates an opportunity to:

  • Capture finer-grained distinctions between related conditions
  • Reflect the semantic relationships between diseases
  • Build models that better align with how pathologists relate morphological details to diagnoses

Rather than replacing existing approaches, language can enhance and complement visual understanding—helping models form a richer and more structured view of pathology.

 

Our approach: Learning a shared vision-language space

To combine these modalities, we train models to learn a joint embedding space between whole-slide images of histopathology specimens and disease descriptions.

Specifically, we generate whole-slide-level embeddings by first extracting tile embeddings with our PLUTO vision foundation model and then aggregating them through a learned attention pooling network, which allows the model to focus on the most relevant regions of the slide rather than treating all areas equally.
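The attention pooling step can be sketched in a few lines of PyTorch. This is a minimal illustration of attention-based aggregation, not the actual PLUTO/aMIL implementation: the module name, layer sizes, and embedding dimensions are all assumptions chosen for readability.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Aggregate tile embeddings into a single slide embedding via
    learned attention weights. Dimensions are illustrative only."""

    def __init__(self, embed_dim: int = 512, hidden_dim: int = 128):
        super().__init__()
        # A small MLP scores each tile; softmax turns scores into weights.
        self.score = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (num_tiles, embed_dim) for one slide
        weights = torch.softmax(self.score(tiles), dim=0)  # (num_tiles, 1)
        # Weighted sum: high-attention tiles dominate the slide embedding.
        return (weights * tiles).sum(dim=0)                # (embed_dim,)

# Pool 100 tile embeddings into one 512-d slide embedding.
pool = AttentionPooling()
slide_emb = pool(torch.randn(100, 512))
```

Because the weights are learned end to end, tiles that carry diagnostic signal receive higher attention than, say, background or normal tissue.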

In parallel, we extract rich descriptions for every slide from pathologists’ diagnostic reports. These are augmented with textbook-style morphological descriptions written by a large language model (LLM) acting as an expert pathologist, instructed to capture the defining visual features of each diagnosis. The descriptions are then embedded with a general-purpose language foundation model, followed by a projection layer that maps the text embeddings into the same space as the slide embeddings.
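The projection step amounts to a small trainable head on top of the frozen text encoder. The sketch below assumes hypothetical dimensions (a 768-d text-model output and a 512-d shared space) and a two-layer MLP; the real architecture and sizes are not specified in this post.

```python
import torch
import torch.nn as nn

# Assumed dimensions for illustration only.
TEXT_DIM, SHARED_DIM = 768, 512

# Trainable projection head mapping text-encoder outputs into the shared space.
projection = nn.Sequential(
    nn.Linear(TEXT_DIM, SHARED_DIM),
    nn.GELU(),
    nn.Linear(SHARED_DIM, SHARED_DIM),
)

def embed_description(text_model_output: torch.Tensor) -> torch.Tensor:
    """Project a pooled text embedding into the shared space, L2-normalized
    so that similarity to slide embeddings reduces to a dot product."""
    z = projection(text_model_output)
    return z / z.norm(dim=-1, keepdim=True)

desc_emb = embed_description(torch.randn(TEXT_DIM))
```

Normalizing the output means cosine similarity between a slide and a description is just their inner product, which simplifies both the training objective and inference.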

We then align slide and text embeddings using a contrastive learning objective, illustrated in the figure below, where:

  • Matching slide–description pairs are pulled closer together
  • Non-matching pairs are pushed farther apart

Unlike traditional classification objectives, which learn a fixed mapping from inputs to a predefined set of labels, this objective learns a shared embedding space in which the relationships between inputs and label descriptions are explicitly encoded. Predictions can then be made in this space by directly comparing slide embeddings with description embeddings.
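A standard way to realize this pull-together/push-apart behavior is a symmetric InfoNCE (CLIP-style) loss over a batch of matched slide–description pairs. The version below is a generic sketch, not the exact objective used here; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(slide_embs: torch.Tensor,
                     text_embs: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss. Row i of each tensor is a matched
    slide-description pair; off-diagonal entries are negatives."""
    slide = F.normalize(slide_embs, dim=-1)
    text = F.normalize(text_embs, dim=-1)
    logits = slide @ text.T / temperature      # (batch, batch) similarities
    targets = torch.arange(len(logits))        # matched pairs on the diagonal
    # Cross-entropy in both directions pulls matched pairs together
    # and pushes non-matching pairs apart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Minimizing this loss drives each slide embedding toward its own description and away from the other descriptions in the batch, which is exactly the geometry the nearest-neighbor inference step relies on.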

In practice, this reframes the task from:

Which label is this slide?

to:

Which description best matches this slide?

This shift enables a more flexible and expressive way to model diagnosis, as predictions are grounded in explicit disease descriptions rather than opaque discrete labels.

[Figure: contrastive alignment of slide and text embeddings in the shared space]

 

Proof-of-concept results

We applied this approach to whole-slide prediction of dermatopathology (Derm) and gastrointestinal (GI) disease labels. Prediction in these domains is inherently challenging: potential diagnoses span a large number of fine-grained subtypes, many of which are visually similar and differ only by subtle morphological patterns. Compounding this, rare conditions create a long tail of underrepresented classes with limited training examples.

 

|                           | Dermatopathology | Gastrointestinal pathology |
|---------------------------|------------------|----------------------------|
| Classes (N)               | 17               | 39                         |
| Subclasses (N)            | 52               | 102                        |
| Training set slides (N)   | 29,655           | 59,035                     |
| Evaluation set slides (N) | 3,388            | 7,857                      |

 

1. Clinically relevant shared embedding structure

The addition of language changes how the model organizes information internally. To better understand this, we examine the structure of both the text embeddings and the slide embeddings in the shared space using UMAP projections, as illustrated below for the dermatopathology application.

[Figure: UMAP projections of text and slide embeddings for the dermatopathology application]

Text embeddings reflect diagnostic structure.
When we visualize the text embeddings for the 52 subclasses alone, we see that subclasses naturally group by higher-level diagnostic categories. For example:

  • Different subtypes of actinic keratosis (lime green) cluster together
  • Inflammatory dermatoses (maroon) group separately from melanocytic lesions and epidermal lesions

This result indicates that the aligned language embeddings can capture clinically meaningful relationships between dermatopathology diagnoses.

Slide embeddings align with this structure.
When we project the slide embeddings of the held-out evaluation set into the same space, we observe that:

  • Slides with the same class label cluster closely together, forming coherent groups
  • Related classes occupy nearby regions, reflecting shared morphology. For instance, actinic keratosis subtypes remain proximate to one another, and actinic keratosis as a group sits closer to seborrheic keratosis than to melanocytic lesions.
  • Slide embeddings occupy regions that mirror the structure seen in the text embeddings, indicating alignment between image features and clinical descriptions

For example, the organization of subclasses within basal cell carcinoma, inflammatory dermatoses, and squamous lesions is reflected in both the text and slide embedding spaces, suggesting that the model captures similar morphological relationships across modalities.

2. Improved quantitative performance

We evaluated the classification performance of the multimodal approach using nearest-neighbor matching in the shared embedding space: for each slide, we compute its embedding and assign the label whose description embedding is closest.
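This nearest-neighbor step is simple to express with NumPy. The sketch below uses cosine similarity and toy orthogonal "description" embeddings; the function name, abbreviated labels, and dimensions are illustrative, and the description embeddings can be precomputed once and reused for every slide.

```python
import numpy as np

def classify_by_nearest_description(slide_emb: np.ndarray,
                                    label_text_embs: np.ndarray,
                                    labels: list[str]) -> str:
    """Assign the label whose description embedding has the highest
    cosine similarity to the slide embedding."""
    slide = slide_emb / np.linalg.norm(slide_emb)
    texts = label_text_embs / np.linalg.norm(label_text_embs,
                                             axis=1, keepdims=True)
    sims = texts @ slide  # cosine similarity to each label description
    return labels[int(np.argmax(sims))]

# Toy example: three orthogonal description embeddings in a 4-d space.
rng = np.random.default_rng(0)
text_embs = np.eye(3, 4)
# A slide embedding near description 1, plus a little noise.
slide = text_embs[1] + 0.05 * rng.normal(size=4)
pred = classify_by_nearest_description(slide, text_embs, ["AK", "SK", "BCC"])
```

Because only the slide embedding is computed at inference time and the description embeddings are cached, the comparison itself is a single matrix-vector product.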

We compare this approach with our image-only aMIL classifiers, which are trained with a classification loss on fixed label sets.

Across both domains, the multimodal approach outperformed the image-only aMIL classifiers:

  • Dermatopathology: ~4–6% relative improvement
  • Gastrointestinal pathology: ~8–10% relative improvement

These gains highlight how language provides complementary signals that could strengthen the prediction of possible diagnoses.

[Figure: classification performance of the multimodal approach versus image-only aMIL classifiers]

It's important to note that inference remains efficient with the multimodal approach even though we have two modalities:

  • Text embeddings for label descriptions are precomputed and cached
  • At inference, we only compute the slide embedding and perform a similarity search over text embeddings of label descriptions

As a result, inference complexity is comparable to standard MIL models.

Together, these results show that the shared embedding space is not only more structured and consistent with how pathologists organize and relate diseases but also more accurate.

 

The Next Frontier: Bridging Visual Patterns and Language

In pathology, understanding a case involves interpreting patterns, context, and relationships across the slide. By incorporating language, we can move from models that solely reason over visual patterns to ones that can reason over richer, more descriptive representations of disease.

This shift enables more adaptable systems that can better handle the complexity and variability of real-world pathology, particularly in settings where distinctions are subtle and categories continue to evolve.

Looking ahead, aligning images with language creates a path toward more flexible and expressive models, including:

  • Open-vocabulary outputs: Models are no longer limited to a fixed set of labels for classification, but can make predictions from images to new or previously unseen text descriptions based on their proximity in the shared embedding space. This enables extending to new disease indications without retraining models.
  • Case retrieval using text descriptions: Allowing clinicians to search large slide datasets using clinically meaningful language (e.g., “slides showing interface dermatitis with basal vacuolization”), and retrieve corresponding case images
  • Enrichment with clinical context: Potential for further improvements using additional context, such as case-specific clinical information like patient history, prior diagnoses, clinical impression, etc., which can further refine predictions and improve performance in complex or ambiguous cases

By combining the strong visual foundation of PLUTO with ongoing advances in language models, multi-modal approaches represent a promising direction for the future of digital pathology.


References

1. PLUTO-4: Frontier Pathology Foundation Models. https://arxiv.org/abs/2511.02826

2. Additive MIL: Intrinsically Interpretable Multiple Instance Learning for Pathology. NeurIPS 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/82764461a05e933cc2fd9d312e107d12-Paper-Conference.pdf

3. AIM-HER2 Breast Cancer. https://www.pathai.com/aim-her2-breast-cancer

4. Artificial intelligence enables prediction of MET amplification & associated morphologic features from H&E-stained NSCLC specimens. Cancer Res (2025) 85 (8_Supplement_1): 2430. https://aacrjournals.org/cancerres/article/85/8_Supplement_1/2430/756341/Abstract-2430-Artificial-intelligence-enables



 

Get In Touch


Please contact DigitalDx@pathai.com if you would like to learn more about PathAI's laboratory solutions and AI applications.