Abstract
Historically, eligibility for human epidermal growth factor receptor 2 (HER2)-targeted therapies was limited to HER2-positive tumors (immunohistochemistry 3+ or in situ hybridization amplified), but recent advances in antibody-drug conjugates have expanded eligibility to tumors with HER2-low and HER2-ultralow expression. This evolving therapeutic landscape underscores the need for precise and reproducible HER2 assessment. Digital and computational pathology tools may help address this need, but their measurement variability must be characterized to inform research and clinical use. We evaluated HER2 scoring variability across 10 independently developed computational pathology artificial intelligence (AI) models applied to 1124 whole-slide images from 733 patients with breast cancer. Analyses included American Society of Clinical Oncology-College of American Pathologists categorical scores (0, 1+, 2+, and 3+), H-scores, tumor cell staining percentages, and counts of total and stained invasive carcinoma cells. Agreement among models and 3 pathologists was assessed using pairwise overall percent agreement (OPA), Cohen kappa, and hierarchical clustering. The median pairwise OPA among models for categorical HER2 scores was 65.1% (kappa, 0.51). Agreement was highest for HER2 3+ vs not 3+ (OPA, 97.3%; kappa, 0.86) and lowest for HER2-low cases, reflecting known measurement challenges. For HER2 0 (negative) vs not 0 (positive) scoring, the average negative agreement was 65.3%, compared with an average positive agreement of 91.3%, indicating greater agreement on non-HER2 0 calls than on HER2 0 calls. H-score and cell count analyses indicated that scoring differences stemmed more from staining interpretation than from tumor cell detection. Pathologists showed numerically higher concordance than the models, but interobserver variability persisted. In exploratory analyses, sample type, staining artifacts, and heterogeneous HER2 expression appeared to be associated with discrepancies. AI-based HER2 scoring demonstrated high agreement in identifying HER2 3+ cases. Variability was most pronounced in borderline HER2 categories, particularly HER2-low, underscoring the need for continued refinement of tools to handle low-intensity staining. Standardized training data sets, validation frameworks, and regulatory alignment are important to improve reproducibility. Developing reference standards and benchmarking data sets is critical to evaluate performance, support regulatory decision-making, and ensure real-world applicability.
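As context for the agreement statistics reported above, the following is a minimal Python sketch of how pairwise OPA, Cohen kappa, and average positive/negative agreement can be computed between scorers. It is not the study's actual analysis pipeline; the scorer names and HER2 scores below are illustrative assumptions only.

```python
from itertools import combinations

# Hypothetical categorical HER2 scores (0, 1+, 2+, 3+) for the same slides,
# one list per scorer (AI model or pathologist). Values are illustrative only.
scores = {
    "model_A": ["0", "1+", "2+", "3+", "1+", "0"],
    "model_B": ["0", "1+", "1+", "3+", "2+", "0"],
    "model_C": ["1+", "1+", "2+", "3+", "1+", "0"],
}

def overall_percent_agreement(a, b):
    """Fraction of slides on which two scorers assign the same category."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen kappa: chance-corrected agreement between two scorers."""
    n = len(a)
    categories = set(a) | set(b)
    p_o = overall_percent_agreement(a, b)  # observed agreement
    # Expected agreement if the two scorers' marginals were independent.
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

def positive_negative_agreement(a, b, positive):
    """Average positive/negative agreement for a binary split
    (e.g., HER2 0 vs not 0), computed without a reference standard."""
    both_pos = sum(x in positive and y in positive for x, y in zip(a, b))
    both_neg = sum(x not in positive and y not in positive for x, y in zip(a, b))
    discordant = len(a) - both_pos - both_neg
    apa = 2 * both_pos / (2 * both_pos + discordant)
    ana = 2 * both_neg / (2 * both_neg + discordant)
    return apa, ana

# Pairwise comparison across all scorers, as in a pairwise OPA/kappa analysis.
for (name1, s1), (name2, s2) in combinations(scores.items(), 2):
    opa = overall_percent_agreement(s1, s2)
    kappa = cohen_kappa(s1, s2)
    apa, ana = positive_negative_agreement(s1, s2, positive={"1+", "2+", "3+"})
    print(f"{name1} vs {name2}: OPA={opa:.1%}, kappa={kappa:.2f}, "
          f"APA={apa:.1%}, ANA={ana:.1%}")
```

Here the binary split treats HER2 0 as negative and 1+/2+/3+ as positive, mirroring the HER2 0 vs not 0 comparison described in the abstract.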
Authors
- McKelvey et al.
