Clinical-grade Detection of Microsatellite Instability in Colorectal Tumors by Deep Learning

Background and Aims: Microsatellite instability (MSI) and mismatch-repair deficiency (dMMR) in colorectal tumors are used to select treatment for patients. Deep learning can detect MSI and dMMR in tumor samples on routine histology slides faster and cheaper than molecular assays. But clinical application of this technology requires high performance and multisite validation, which have not yet been performed. Methods: We collected hematoxylin and eosin-stained slides, and findings from molecular analyses for MSI and dMMR, from 8836 colorectal tumors (of all stages) included in the MSIDETECT consortium study, from Germany, the Netherlands, the United Kingdom, and the United States. Specimens with dMMR were identified by immunohistochemistry analyses of tissue microarrays for loss of MLH1, MSH2, MSH6, and/or PMS2. Specimens with MSI were identified by genetic analyses. We trained a deep-learning detector to identify samples with MSI from these slides; performance was assessed by cross-validation (n=6406 specimens) and validated in an external cohort (n=771 specimens). Prespecified endpoints were area under the receiver operating characteristic (AUROC) curve and area under the precision-recall curve (AUPRC). Results: The deep-learning detector identified specimens with dMMR or MSI with a mean AUROC curve of 0.92 (lower bound 0.91, upper bound 0.93) and an AUPRC of 0.63 (range, 0.59–0.65), or 67% specificity and 95% sensitivity, in the cross-validation development cohort. In the validation cohort, the classifier identified samples with dMMR with an AUROC curve of 0.95 (range, 0.92–0.96) without image-preprocessing and an AUROC curve of 0.96 (range, 0.93–0.98) after color normalization. Conclusions: We developed a deep-learning system that detects colorectal cancer specimens with dMMR or MSI using hematoxylin and eosin-stained slides; it detected tissues with dMMR with an AUROC of 0.96 in a large, international validation cohort.
This system might be used for high-throughput, low-cost evaluation of colorectal tissue specimens. Many patients with bowel cancer are not tested for genetic changes. This study showed that an artificial intelligence system can complement existing histologic analyses of tissue specimens to detect colorectal cancer, increasing the speed and reducing the costs of testing.


Introduction
Mismatch repair deficiency (dMMR) is observed in 10% to 20% of colorectal cancer (CRC) patients and indicates a biologically distinct type of CRC with broad prognostic, predictive and therapeutic relevance. 1 In CRC and other cancer types, dMMR causes microsatellite instability (MSI), a specific DNA damage pattern. MSI and dMMR are associated with lack of chemotherapy response in intermediate-stage CRC (pT3-4 N0-2), a reduced incidence of locoregional metastases (and hence the opportunity of cure by local excision in early-stage disease), and a reduced requirement for adjuvant chemotherapy in stage II disease. In late-stage disease, MSI and dMMR are predictive of response to immune checkpoint inhibition and constitute the only clinically approved pan-cancer biomarker for checkpoint inhibition in the United States. 2 Furthermore, MSI and dMMR are the genetic mechanism driving carcinogenesis in Lynch Syndrome (LS), the most common hereditary condition leading to colorectal cancer. 3 Because of this broad clinical importance, MSI or dMMR testing is recommended for all colorectal cancer patients by national and international guidelines such as the British National Institute for Health and Care Excellence (NICE) guideline 4 and the European Society for Medical Oncology (ESMO) guidelines. 5 However, in clinical practice, only a subset of CRC patients is investigated for presence of MSI or dMMR because of the high costs associated with universal testing. This lack of testing potentially leads to overtreatment with adjuvant chemotherapy, underdiagnosis of LS, reduced opportunities to consider local excision instead of extensive surgery with its related risks and morbidity, and failure to identify candidates for cancer immunotherapy.
Current laboratory assays for MSI and dMMR testing involve a multiplex PCR assay or a multiplex immunohistochemistry (IHC) panel. Specifically, MSI can be tested by the "Bethesda panel" PCR 6 whereas a four-plex IHC can demonstrate absence of one of four mismatch-repair (MMR) enzymes (MLH1, MSH2, MSH6, and PMS2) 7 . However, both assays for MSI or dMMR incur cost 8 , require additional sections of tumor tissue in addition to routine hematoxylin and eosin (H&E) histology 9 and yield imperfect results. Sensitivity and specificity of these tests have been evaluated in numerous population-based studies which are summarized in current clinical guidelines. 10 In these reference studies, test performance of molecular assays is reported with a sensitivity of 100% and specificity of 61.1% 11 or a higher specificity of 92.5% with a lower sensitivity of 66.7% 12 for MSI testing. Similarly, for IHC based tests, sensitivity is reported as 85.7% with a 91.9% specificity in a key study 13 while other international guidelines estimate that IHC testing has a sensitivity of 94% and a specificity of 88% 5 . This variable performance of clinical gold standard tests indicates that there is need for improvement. In addition, all available tests incur a substantial cost and require specialized molecular pathology laboratories. This highlights the need for new robust, low-cost and ubiquitously applicable diagnostic assays for MSI or dMMR detection in CRC patients.
In routine hematoxylin and eosin (H&E) histological images, MSI and dMMR tumors are characterized by distinct morphological patterns such as tumor-infiltrating lymphocytes, mucinous differentiation, heterogeneous morphology and poor differentiation. 14 Although these patterns are well known to pathologists, manual quantification of these features by experts is not reliable enough for clinical diagnosis and therefore is not feasible in routine clinical practice. 15 In contrast, computer-based image analysis by deep learning has enabled robust detection of MSI and dMMR status directly from routine H&E histology: we have recently presented 16 and later refined 17 such a deep learning assay, which was independently validated by two other groups 18,19 . However, all of these studies have used a few hundred CRC patients at most, while clinical implementation of a deep learning-based diagnostic assay requires higher sensitivity and specificity than those previously reported, as well as large-scale validation across multiple populations in different countries.
To address this, we formed the MSIDETECT consortium: a group of multiple academic medical centers across and beyond Europe (http://www.msidetect.eu). In this not-for-profit consortium, we collected tumor samples from more than 8000 patients with molecular annotation. The pre-specified intent was to train and externally validate a deep learning system for MSI and dMMR detection in CRC. The primary endpoint was diagnostic accuracy measured by area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC) and, correspondingly, specificity at multiple sensitivity levels (99%, 98%, 95%).

Ethics statement and patient cohorts
We retrospectively collected anonymized H&E stained tissue slides of colorectal adenocarcinoma patients from multiple previous studies and population registers. For each patient, at least one histological slide was available and MSI status or MMR status was known. We included patients from the following four previous studies with the intent of retraining a previously described deep learning system. 16,17 First, we used the publicly available Cancer Genome Atlas (TCGA, n=616 patients, Suppl. Figure 1), a multicenter study with Stage I to IV patients mainly from the United States of America. 20 All images and data from the TCGA study are publicly available at https://portal.gdc.cancer.gov. Second, we used "Darmkrebs: Chancen der Verhütung durch Screening" (DACHS, n=2292, Suppl. Figure 2), a population-based study of CRC Stage I to IV patients from south western Germany 21 . Tissue samples from the DACHS study were provided by the Tissue Bank of the National Center for Tumor Diseases (NCT) Heidelberg, Germany in accordance with the regulations of the tissue bank and the approval of the ethics committee of Heidelberg University. 21,22 Third, we used samples from the "Quick and Simple and Reliable" trial (QUASAR, n=2206, Suppl. Figure 3), which originally aimed to determine survival benefit from adjuvant chemotherapy in patients from the United Kingdom with mainly Stage II tumors. 23 Lastly, the Netherlands Cohort Study (NLCS, N=2197, Suppl. Figure 4) 24,25 collected tissue samples as part of the Rainbow-TMA consortium, and like DACHS, this study included patients with any tumor stage. All studies were cleared by the institutional ethics board of the respective institutions as described before (for QUASAR 23 , DACHS 22 and NLCS 25 ).
With the intent of external validation of the deep learning system, we collected H&E slides from the population-based Yorkshire Cancer Research Bowel Cancer Improvement Programme (YCR-BCIP) 26 cohort, where routine National Health Service diagnosis of dMMR was undertaken with further BRAF mutation and/or hMLH1 methylation screening to identify patients at high risk of having LS. The primary validation cohort from YCR-BCIP contained n=771 patients with standard histology after surgical resection (YCR-BCIP-RESECT, Suppl. Figure 5). For an additional exploratory analysis, we also acquired a non-overlapping set of n=1531 patients from YCR-BCIP with endoscopic biopsy samples (YCR-BCIP-BIOPSY, Suppl. Figure 6). A set of N=128 polypectomy samples from the YCR-BCIP study (YCR-BCIP-BIOPSY) contained only N=4 MSI or dMMR patients and was not used for further analyses, as AUROC and AUPRC values are not meaningful for such low-prevalence features. For all patient samples in YCR-BCIP 26 , a fully anonymized single scanned image of a representative H&E slide for each patient was utilized as a service evaluation study with no access to tissue or patient data aside from mismatch repair status.
Available clinico-pathological characteristics of all cases in each cohort are summarized in Table S1. MSI status in the TCGA study was determined genetically as described before. 20 MSI status in the DACHS study was determined genetically with a three-plex panel as described before. 27 In the QUASAR, NLCS and YCR-BCIP cohorts, mismatch-repair deficiency (dMMR) or proficiency (pMMR) was determined with standard immunohistochemistry assays on tissue microarrays as described before (two-plex for MLH1 and MSH2 in NLCS and QUASAR, four-plex for MLH1, MSH2, MSH6 and PMS2 for YCR-BCIP). 23 This study complies with the "Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis" (TRIPOD) statement as shown in Table S2.

Image preprocessing and deep learning
All slides were individually, manually reviewed by trained observers supervised by expert pathologists to ensure that tumor tissue was present on the slide and that the slide had diagnostic quality. Observers and supervisors were blinded regarding MSI status and any other clinical information. Tumor tissue was manually outlined in each slide. A small number of cases were excluded due to insufficient quality, technical issues, absence of tumor tissue on the observed slide or lack of molecular information (Suppl. Figure 1-6). Tumor regions were tessellated into square tiles of 256 μm edge length and saved at a resolution of 0.5 μm per pixel using QuPath v0.1.2 28 . Initially, the method pipeline was kept as simple as possible and color normalization was not used to preprocess the images. In a slight variation of the initial experiments, all image tiles were color-normalized with the Macenko method 29 as described previously 30 . A modified shufflenet deep learning system with a 512×512×3 input layer was trained on these image tiles in Matlab R2019a (Mathworks, Natick, MA, USA) with the hyperparameters listed in Table S3, as described before 17 . Tile-level predictions were pooled on a patient level, with the proportion of predicted MSI or dMMR tiles (positive threshold) being the free parameter for the Receiver Operating Characteristic (ROC) analysis. All confidence intervals were obtained by 10-fold bootstrapping. No image tiles or slides from the same patient were ever part of both the training set and the test set. All trained deep learning classifiers were assigned a unique identifier as listed in Table S4. All classifiers can be downloaded at https://dx.doi.org/10.5281/zenodo.3627523. Source codes are publicly available at https://github.com/jnkather/DeepHistology.
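The patient-level pooling and ROC analysis described above can be illustrated with a minimal Python sketch. It is not the released Matlab implementation; the function names (`patient_scores`, `roc_points`, `auroc`) and the toy input are hypothetical and chosen only for illustration. The sketch assumes that each tile carries a binary MSI/dMMR call and that the patient score is the fraction of positive tiles, over which the threshold is swept.

```python
from collections import defaultdict

def patient_scores(tile_predictions):
    """Pool tile-level calls into one score per patient.

    tile_predictions: iterable of (patient_id, predicted_msi) pairs, where
    predicted_msi is True if the tile was classified as MSI/dMMR.
    Returns {patient_id: fraction of tiles predicted MSI/dMMR}.
    """
    positive = defaultdict(int)
    total = defaultdict(int)
    for patient_id, predicted_msi in tile_predictions:
        total[patient_id] += 1
        positive[patient_id] += 1 if predicted_msi else 0
    return {pid: positive[pid] / total[pid] for pid in total}

def roc_points(scores, labels):
    """Sweep the positive threshold over all observed patient scores.

    scores: {patient_id: fraction}, labels: {patient_id: True for MSI/dMMR}.
    Returns a list of (false_positive_rate, true_positive_rate) points.
    """
    n_pos = sum(1 for pid in scores if labels[pid])
    n_neg = len(scores) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores.values()), reverse=True):
        tp = sum(1 for pid, s in scores.items() if s >= threshold and labels[pid])
        fp = sum(1 for pid, s in scores.items() if s >= threshold and not labels[pid])
        points.append((fp / n_neg, tp / n_pos))
    return points

def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy example with four hypothetical patients.
tiles = [("A", True), ("A", True), ("A", True), ("A", False),
         ("B", True), ("B", False), ("B", False), ("B", False),
         ("C", False), ("C", False),
         ("D", True), ("D", True)]
truth = {"A": True, "B": False, "C": False, "D": True}
scores = patient_scores(tiles)
print(scores, auroc(roc_points(scores, truth)))
```

Because the score is a simple fraction, a single threshold on it fixes the operating point, so sensitivity and specificity can be traded off after training without retraining the network.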

Experimental design
All deep learning experiments (training and test runs) were pre-specified and are listed in Table S5. All patients from TCGA, DACHS, QUASAR and NLCS were combined and served as the training set ("international cohort"). To assess the magnitude of batch effects, we trained a deep learning system on each sub-cohort in this international training cohort, assessing inter-cohort and intra-cohort performance, the latter being estimated by three-fold cross-validation (experiment #1). In addition, we performed a three-fold cross-validation on the full international cohort without (experiment #2) and with color normalization (experiment #2N), which was used for a detailed subgroup analysis according to predefined clinico-pathological and molecular subgroups. To identify the optimal number of patients needed for training, we used the international cohort, randomly set aside n=906 patients for testing, and trained on increasing proportions of the remaining n=5500 patients (experiment #3). To evaluate the deep learning system in an independent, external, population-based cohort, we trained on the international cohort and tested on YCR-BCIP-RESECT (experiment #4, this was the primary objective of our study). This experiment was repeated with color-normalized image tiles (experiment #4N). YCR-BCIP-RESECT was regarded as the "holy" test set and was not used for any other purpose than to evaluate the final classifier. Exploratively, we also evaluated the final classifier on YCR-BCIP-BIOPSY (experiment #5). Furthermore, to investigate the performance of "train-on-biopsy, test-on-biopsy", we exploratively trained a three-fold cross-validated classifier on YCR-BCIP-BIOPSY (experiment #6).
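The three-fold cross-validation used in several of these experiments is split at the patient level, so that tiles from one patient never appear in both training and test sets. A minimal Python sketch of such a split follows. The helper names (`patient_level_folds`, `train_test_splits`), the seed and the patient identifiers are hypothetical, and the random assignment shown here is an assumption rather than the procedure of the released code.

```python
import random

def patient_level_folds(patient_ids, n_folds=3, seed=0):
    """Assign each unique patient to exactly one of n_folds test folds."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    # Striding over the shuffled list gives folds of near-equal size.
    return [unique_ids[i::n_folds] for i in range(n_folds)]

def train_test_splits(patient_ids, n_folds=3, seed=0):
    """Yield (train_ids, test_ids) pairs that never share a patient."""
    folds = patient_level_folds(patient_ids, n_folds, seed)
    for i, test_ids in enumerate(folds):
        train_ids = [pid for j, fold in enumerate(folds) if j != i for pid in fold]
        yield train_ids, test_ids

# Hypothetical patient identifiers; duplicates mimic several slides per patient.
ids = [f"P{i}" for i in range(10)] + ["P0", "P3"]
splits = list(train_test_splits(ids))
for train_ids, test_ids in splits:
    print(len(train_ids), len(test_ids))
```

Every tile then inherits the fold of its patient, which is what keeps tile-level training from leaking information about test patients.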

Deep learning consistently predicts MSI in multiple patient cohorts
In the MSIDETECT consortium, a deep learning system was trained to predict MSI or dMMR status from digitized routine H&E whole slide images alone, with ground truth labels according to local standard procedures (PCR testing for MSI or IHC testing for dMMR). First, we investigated deep learning classifier performance in patients of the TCGA, DACHS, QUASAR and NLCS cohorts alone. We found that training the deep learning system on individual cohorts yielded an intra-cohort AUROC of 0.  (Table S6). This high intra-cohort performance dropped in some inter-cohort experiments ( Table 1, experiment #1 in Table S5). Together, these data show that deep learning systems attain high diagnostic accuracy in single-center cohorts but do not necessarily generalize to other patient cohorts.

Increasing patient number compensates for batch effects and improves performance
In the intra-cohort experiments (Table 1), training on larger cohorts generally yielded higher performance, corroborating the theoretical assumption that training on larger data sets yields more robust classifiers. To quantify this effect, we merged all patients from TCGA, DACHS, QUASAR and NLCS in a large "international cohort" (n=6406 patients) (Figure 1a). From these digitized whole slide histology images, we created a library of image tiles for training deep learning classifiers (Figure 1b). Thus, we increased the patient number as well as the data heterogeneity due to different pre-analytic pipelines in the respective medical centers. We set aside a randomly chosen proportion of n=906 of these patients and re-trained deep learning classifiers on 500, 1000, 1500 up to 5500 patients of the international cohort. In this experiment, we found that AUROC (Figure 1c) and AUPRC (Suppl. Figure 7) on the test set initially increased as the number of patients in the training set increased. However, each increase in patient number yielded diminishing performance returns and AUROC and AUPRC plateaued at approximately 5000 patients (Figure 1d). The top performance was achieved by training on 5500 patients and testing on the fixed test set of n=906 patients, with an AUROC of 0.92 [0.90, 0.93] (compared to a baseline of 0.5 by a random model, Figure 1c), an AUPRC of 0.59 [0.49, 0.63] (compared to a baseline of 0.12 in a random model, Suppl. Figure 7, experiment #3 in Table S5), translating to a specificity of 52% at a sensitivity of 98%. To ensure that this performance was not due to the random selection of the internal test set, we performed a patient-level three-fold cross-validation on the full international cohort (n=6406), reaching a similar mean AUROC of 0.92 [0.91, 0.93] (Figure 1d, experiment #2 in Table S5). Together, these data show that approximately 5000 patients are necessary and sufficient to train a high-quality deep learning detector of MSI and dMMR.
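The bracketed intervals reported here are bootstrapped bounds around the mean AUROC. How such bounds can be obtained is sketched below in Python under explicit assumptions: patients are resampled with replacement ten times, and the minimum and maximum of the resampled AUROC values are reported as lower and upper bounds. The exact bootstrap procedure of the study is not specified beyond "10-fold bootstrapping", so this interpretation, the function names and the toy data are all hypothetical.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc(scores, labels, n_resamples=10, seed=0):
    """Return (lower, mean, upper) AUROC over patient-level bootstrap resamples."""
    if all(labels) or not any(labels):
        raise ValueError("both classes are needed to compute an AUROC")
    rng = random.Random(seed)
    n = len(scores)
    values = []
    while len(values) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if all(sample_labels) or not any(sample_labels):
            continue  # skip resamples that contain only one class
        values.append(auroc([scores[i] for i in idx], sample_labels))
    return min(values), sum(values) / len(values), max(values)

# Toy patient scores (hypothetical): positives mostly score higher than negatives.
toy_scores = [0.9, 0.8, 0.75, 0.4, 0.6, 0.3, 0.2, 0.1, 0.35, 0.05]
toy_labels = [True, True, True, True, False, False, False, False, False, False]
print(bootstrap_auroc(toy_scores, toy_labels))
```

With only ten resamples the bounds are coarse; a larger number of resamples with percentile bounds would be a common alternative, though it is not what the text describes.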

Clinical-grade performance in an external test cohort
Deep learning systems are prone to overfit to the dataset they were trained on and thus must be validated in external test sets. Correspondingly, the pre-specified primary endpoint of this study was the test performance in a completely independent set of patients. This set of patients was intended to be population-based, i.e. to mirror the clinico-pathological characteristics of a real-world screening population. It was used for no other purpose than to validate the final classifier, which was previously trained on the international cohort. The test set comprised routine H&E slides from the population-based YCR-BCIP study (YCR-BCIP-RESECT, n=771 patients, one slide per patient). In this population, we found a high classification performance with a mean AUROC of 0.95 and [0.92, 0.96] lower and upper bootstrapped confidence bounds, respectively (Figure 1e, Table S6, experiment #4). Because the target features MSI and dMMR are unbalanced in real-world populations such as YCR-BCIP-RESECT, we also assessed the precision-recall characteristics of this test, demonstrating a very high AUPRC of 0.79 [0.74, 0.86], compared to the baseline AUPRC of 0.14 of the null model in this cohort. These data show that a deep learning system trained on a large and heterogeneous international training cohort generalizes well beyond the training set, and thus constitutes a tool of potential clinical applicability.
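The AUPRC baseline deserves a brief illustration: for a model without skill, expected precision equals the prevalence of the positive class at every recall level, which is why the null-model AUPRC depends on the cohort. The Python sketch below computes a step-wise average precision as one common estimate of AUPRC. The estimator used in the study is not stated, so this choice, the function names and the toy data are assumptions.

```python
def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve (average precision).

    Patients are ranked by decreasing score; precision is accumulated at
    each rank where a true positive is retrieved. Tied scores are ranked
    in input order, which is a simplification.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y)
    tp = 0
    ap = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            tp += 1
            ap += tp / rank  # precision at this recall step
    return ap / n_pos

def baseline_auprc(labels):
    """Expected AUPRC of a no-skill model: the prevalence of positives."""
    return sum(1 for y in labels if y) / len(labels)

# Toy example (hypothetical scores): two positives among four patients.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [True, False, True, False]
print(average_precision(toy_scores, toy_labels), baseline_auprc(toy_labels))
```

Unlike AUROC, whose no-skill value is always 0.5, an AUPRC can only be interpreted against the prevalence of MSI or dMMR in the cohort at hand.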

Prediction performance is robust in clinico-pathological and molecular subgroups
Colorectal cancer comprises a number of anatomically and biologically distinct molecular sub-groups, including right- and left-sided colon cancer, rectal cancer, BRAF-driven and RAS-driven tumors, among others. This is especially relevant because these features are partially dependent on each other, e.g. BRAF mutations and right-sidedness are associated with MSI status 31,32 . To assess if image-based MSI prediction is robust across these heterogeneous subgroups, we used the cross-validated deep learning system (experiment #2 in Table S5) and compared AUROC and AUPRC across subgroups (Figure 2 and 9). We found some variation in classifier performance regarding anatomical location: the AUROC was 0.89 for right-sided cancer (n=2371 patients), 0.88 for left-sided cancer (n=3846), 0.91 for colon cancer overall (n=4408) and 0.83 for rectal cancer (n=1938). Little variation was observed in classifier performance according to molecular features: AUROC was 0.86 in BRAF mutants (N=298) and 0.91 in BRAF wild type (N=3226); also, AUROC was 0.90 in KRAS mutants (N=1263) and 0.93 in KRAS wild type tumors (N=2248). Finally, we analyzed the robustness of MSI predictions for different "Union for International Cancer Control" (UICC) stages, showing stable performance with an AUROC of 0.93 in Stage I (N=871), 0.92 in Stage II (N=3261) and 0.91 in Stage III (N=1554) tumors and a minor reduction of performance in Stage IV patients (N=636) reaching an AUROC of 0.83. In addition, histological grading (Suppl. Figure 8) did not influence classification performance. Next, we asked if this robust performance across subgroups was maintained in the external test cohort (YCR-BCIP-RESECT, N=771 patients). Again, in this cohort, we did not find any relevant loss in performance with regard to the following subgroups: tumor stage, organ, anatomical location and sex (Suppl. Figure 10 and 11).
In summary, this analysis quantifies variations in performance across CRC subgroups but demonstrates that, overall, MSI and dMMR detection performance is robust.

Application of the deep learning system to biopsy samples
As additional exploratory endpoints, we tested if a deep learning system trained on histological images from surgical resections can predict MSI and dMMR status of images from endoscopic biopsy tissue. Biopsy samples include technical artifacts (fragmented tissue and small tissue area, Suppl. Figure 12a) as well as biological artifacts (they are sampled from luminal portions of the tumor only). We acquired endoscopic biopsy samples from n=1557 patients in the YCR-BCIP-BIOPSY study and tested the resection-trained classifier (experiment #5 in Table S6). We found that AUROC was reduced to 0.78 [0.75, 0.81] (Suppl. Figure 12b) in this experiment. In a three-fold cross-validated experiment on all n=1531 patients in the YCR-BCIP-BIOPSY cohort, MSI and dMMR detection performance was restored to an AUROC of 0.89 [0.88, 0.91] (experiment #6 in Table S5). These data suggest that MSI and dMMR testing on biopsies requires a classifier trained on biopsies.

Color normalization improves external test performance
As previous studies have pointed to a benefit of color-normalizing histology images before quantitative analysis 29 , the main experiments in this study were repeated on color-normalized image tiles. Native (non-normalized) image tiles (Figure 4A) were subjectively more diverse in terms of staining hue and intensity than normalized tiles (Figure 4B).
Repeating MSI and dMMR prediction by three-fold cross-validation on the full international cohort with color-normalized tiles (experiment #2N in Table S5), we found that color normalization modestly improves specificity at pre-defined sensitivity levels: Specificity was 57% at 99% sensitivity in experiment #2N, as opposed to specificity of 38% at 99% sensitivity in the corresponding non-normalized experiment (#2). However, this increase in specificity did not result in a higher AUROC overall (Table S5). To test if color normalization improves external test performance of MSI and dMMR predictors, we repeated experiment #4 (train on full international cohort, external test on YCR-BCIP-RESECT) after color normalization (experiment #4N). In this case, AUROC did improve (no normalization in #4: AUROC 0.95 [0.92, 0.96], color normalization in #4N: AUROC 0.96 [0.93, 0.98]). This slight increase in AUROC translated into a higher specificity at predefined sensitivity levels, reaching 58% specificity at 99% sensitivity (Table S5). These data show that color normalization can further improve classifier performance and improves generalizability of deep learning-based inference of MSI and dMMR status.
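Specificity at a pre-defined sensitivity level is read off the ROC curve by choosing the highest score threshold that still classifies the required share of true MSI or dMMR patients as positive. A minimal Python sketch of that lookup follows. The function name, the tolerance constant and the toy scores are hypothetical, and patient scores are assumed to be the pooled fractions of positive tiles.

```python
import math

def specificity_at_sensitivity(scores, labels, target_sensitivity):
    """Return (threshold, sensitivity, specificity) at a target sensitivity.

    The threshold is the highest patient score at which at least
    target_sensitivity of the true positives are called positive
    (score >= threshold).
    """
    pos = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    neg = [s for s, y in zip(scores, labels) if not y]
    # Number of positives that must be detected; the small tolerance guards
    # against floating-point products such as 57.00000000000001.
    needed = max(1, math.ceil(target_sensitivity * len(pos) - 1e-9))
    threshold = pos[needed - 1]
    sensitivity = sum(1 for s in pos if s >= threshold) / len(pos)
    specificity = sum(1 for s in neg if s < threshold) / len(neg)
    return threshold, sensitivity, specificity

# Toy example (hypothetical scores): four positives, four negatives.
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.5, 0.6]
toy_labels = [True, True, True, True, False, False, False, False]
print(specificity_at_sensitivity(toy_scores, toy_labels, 0.75))
print(specificity_at_sensitivity(toy_scores, toy_labels, 1.0))
```

In the toy data, a single low-scoring positive forces the threshold down when full sensitivity is required, and specificity drops accordingly. The same mechanism explains why specificity at 99% sensitivity is sensitive to a handful of difficult cases.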

A clinical-grade deep learning-based molecular biomarker in cancer
Analysing more than 8000 CRC patients in an international consortium, we demonstrate that deep learning can reliably detect MSI and dMMR tumors based on routine H&E histology alone. In an external validation cohort, the deep learning MSI and dMMR detector performed with similar characteristics to gold standard tests 12 , reaching clinical-grade performance. As shown in previous studies 16 it can be assumed that this deep learning-based method can be cheaper and faster than routine laboratory assays and therefore has the potential to improve clinical diagnostic workflows. Our data show that classifier performance in surgical specimens remains robust even when the classifier is applied to external cohorts, but performance is lower in biopsy samples where tissue areas are much smaller than those of surgically resected specimens. This highlights the need to perform thorough large-scale evaluation of deep learning-based biomarkers in each intended use case. Deep learning histology biomarkers such as the MSI and dMMR detection system can be made understandable by visualization of prediction maps (Figure 3a-i) or by visualizing highly scoring image tiles (Suppl. Figure 13a-b). Together, these approaches show that the deep learning system yielded plausible predictions. For example, high MSI or dMMR scores were assigned to poorly differentiated tumor tissue (Suppl. Figure 13a) while high MSS or pMMR scores were assigned to well-differentiated areas. Interestingly, the spatial patterns of tile-level predictions showed varying degrees of heterogeneity: In all analyzed true positive MSI and dMMR cases in the YCR-BCIP-RESECT validation cohort, we found a homogeneously strong prediction of MSI and dMMR as shown in Figure 3a and d. In contrast, predictions in true MSS and pMMR cases were more heterogeneous. Necrotic, poorly differentiated or immune-infiltrated areas tended to be falsely predicted to be MSI or dMMR (Figure 3c and f). 
However, as patient-level predictions reflected overall scores in the full tumor area, most true MSS and pMMR patients were correctly predicted after pooling tile-level predictions, despite some degree of tile-level heterogeneity.

Clinical application: pre-screening or definitive testing
In this study, diagnostic performance was stable across multiple clinically relevant subgroups, except for lower-than-average performance in rectal cancer patients, possibly due to neoadjuvant pre-treatment of some of these patients. In summary, this study defines a thoroughly validated deep learning system for genotyping CRC based on histology images alone, which could be used in clinical settings after regulatory approval. By varying the operating threshold, sensitivity and specificity of this test can be changed according to the clinical workflow this test is embedded in: High-sensitivity deep learning assays could be used to pre-screen patients and could trigger additional genetic testing in case of positive predictions. Even with imperfect specificity, such classifiers could speed up the diagnostic workflow and provide immediate cost-savings, especially in the context of universal MSI and dMMR testing as recommended by clinical guidelines. Recent discussions and calculations on cost-effectiveness of systematic MSI or dMMR testing in CRC patients 33 should incorporate deep-learning-based assays among the other strategies in the future. Alternatively, deep learning biomarkers such as the method presented in this study could be used for definitive testing in the clinic, especially in healthcare settings in which limited resources are currently prohibitive for universal molecular biology tests. Further studies are needed to determine optimal operating thresholds for specific patient populations and clinical settings. In addition, clinical deployment will require prospective validation and regulatory approval. Ultimately, this method should rapidly identify MSS and pMMR cases with high certainty and identify high risk MSI, dMMR and possible LS cases for confirmation by other tests. This could substantially reduce molecular testing load in clinical workflows and enable direct, universal low-cost MSI and dMMR testing from ubiquitously available routine material. 
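The effect of a high-sensitivity pre-screening workflow on molecular testing load can be estimated with simple arithmetic. The Python sketch below is purely illustrative: it combines the cross-validated operating point reported in the abstract (95% sensitivity, 67% specificity) with an assumed MSI/dMMR prevalence of 15%, a value chosen from within the 10% to 20% range cited in the Introduction. The function name and the returned keys are hypothetical.

```python
def prescreen_summary(prevalence, sensitivity, specificity):
    """Expected outcome fractions when a classifier pre-screens all patients."""
    true_pos = prevalence * sensitivity
    false_neg = prevalence * (1 - sensitivity)
    false_pos = (1 - prevalence) * (1 - specificity)
    true_neg = (1 - prevalence) * specificity
    referred = true_pos + false_pos      # sent on to molecular testing
    ruled_out = true_neg + false_neg     # no further testing
    return {
        "referred_fraction": referred,
        "ruled_out_fraction": ruled_out,
        "ppv": true_pos / referred,
        "npv": true_neg / ruled_out,
    }

# Assumed prevalence of 15%; operating point taken from the abstract.
summary = prescreen_summary(prevalence=0.15, sensitivity=0.95, specificity=0.67)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

Under these assumptions, roughly 42% of patients would be referred for confirmatory testing and the negative predictive value would be about 98.7%. Actual figures depend on the prevalence in the screened population and on the operating threshold chosen.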
Technical improvements could conceivably further improve performance and open up new clinical applications. In this study, we explored color normalization as a way of reducing heterogeneity in staining intensity and hue between patient cohorts. This intervention (experiment #4N in Table S5) modestly improved performance, increasing specificity from 51% to 58% at 99% sensitivity in an external validation cohort. The deep learning system and the source codes used in this study have been publicly released, enabling other researchers to independently validate and, potentially, further improve its performance.

Limitations
A limitation to our experimental workflow is that the ground truth labels used to train the deep learning system are imperfect. In the MSIDETECT group, clinical routine assays were used to assess MSI or dMMR status and these assays have a non-zero error rate. Correspondingly, classifier performance could suffer from noisy labels in the training data. On the other hand, test cases flagged as "false positive" could be true MSI or dMMR cases that were missed by the clinical gold standard test. Ultimately, it is conceivable that deep learning assays can outperform classical genetic or molecular tests in terms of predictive and prognostic performance, but testing this hypothesis would require large cohorts with clinical end point data and/or deep genetic characterization. In particular, the deep learning classifier could potentially detect rare genetic aberrations with MSI-like morphology, but again, lack of large training cohorts for these rare features currently precludes deeper investigation of this aspect. Another potential limitation of this study is the performance in patient groups of potential clinical interest that were not analyzed in the subgroup analysis, such as hereditary versus sporadic MSI and dMMR cases or different ethnic backgrounds. This is due to the lack of available clinical data in the utilized patient cohorts and future studies are needed to investigate the stability of deep learning-based prediction in these and further subpopulations.
Interestingly, when we analyzed the per-patient predictions of MSI status in the external test set (YCR-BCIP-RESECT), we found an outlier among the "false negative" predictions: patient #441999 had a very low "predicted MSI probability" of less than 15%, while all other "true MSI" patients had MSI probability scores of more than 40%. We went back to the original histology slide of patient #441999 and noticed that a technical artifact had resulted in a blurred image, which was only visible at high magnification and had thus gone undetected in the manual quality check. This shows that improved quality control at multiple magnification levels could increase the sensitivity of the deep learning assay while maintaining a high specificity.
Finally, a possible practical challenge in further validation and future integration of the DL methods in a clinical workflow is the current lack of regular installation of slide scanners in hospitals. However, in the United Kingdom and other countries, large academic consortia are currently implementing nation-wide digital pathology workflows. This trend can be expected to further accelerate and will be supported by clinically useful applications of deep learning technology, especially after regulatory approval of such tools 34 . Still, initially it is probably more realistic to establish central testing facilities that are equipped with slide scanners and further hardware needed for deep learning applications. In this setting smaller hospitals and medical centers would not be confronted with high fixed costs but only with expenses and work that come with the distribution of H&E glass slides to central testing facilities.

Context: multicenter validation of deep learning biomarkers
Recent years have seen a surge of deep learning methods in digital pathology, but previous large-scale studies are limited to simple image analysis tasks such as tumor detection 35 and do not extend to scenarios of molecular biomarker detection. Smaller proof-of-concept studies have shown that deep learning can detect a range of molecular biomarkers directly from routine histology, including multiple clinically relevant oncogenes [17][18][19] . However, these classifiers were not validated in large multicenter cohorts and cannot be readily generalized beyond the training set. The present study is the first international collaborative effort to validate such a deep learning-based molecular biomarker. It identifies the need for very large series and for training on a variety of sample types (e.g., resection and biopsy) and different populations. The high performance in this particular use case yields a tool of immediate clinical applicability and provides a blueprint for the emerging class of deep-learning-based molecular tests in oncology, with the potential to broadly improve workflows in precision oncology worldwide.

Background and context:
Microsatellite instability (MSI) and mismatch-repair deficiency (dMMR) in colorectal tumors are used to select treatment for patients. Deep learning can detect MSI and dMMR in tumor samples on routine histology slides faster and cheaper than molecular assays.

New findings:
We developed a deep-learning system that detects colorectal tumor specimens with MSI using hematoxylin and eosin-stained slides; it detected tissues with MSI with an area under the receiver operating characteristic curve of 0.95 in a large, international validation cohort.

Limitations:
This system requires further validation before it can be used routinely in the clinic.

Impact:
This system might be used for high-throughput, low-cost evaluation of colorectal tissue specimens.

(A) Histological routine images were collected from four large patient cohorts. All slides were manually quality-checked to ensure presence of tumor tissue (circled in black). (B) Tumor regions were automatically tessellated and a library of millions of non-normalized (native) image tiles was created. (C) The deep learning system was trained on increasing numbers of patients and evaluated on a random subset (n=906 patients). Performance initially increased by adding more patients to the training set, but reached a plateau at approximately 5000 patients. (D) Cross-validated experiment on the full international cohort (comprising TCGA, DACHS, QUASAR and NLCS). Receiver operating characteristic (ROC) with true positive rate (TPR) shown against false positive rate (FPR), area under the ROC curve (AUROC) is shown on top. (E) ROC curve (left) and precision-recall curve (right) of the same classifier applied to a large external dataset. High test performance was maintained in this dataset and thus, the classifier generalized well beyond the training cohorts. Black line = average performance, shaded area = bootstrapped confidence interval, red line = random model (no skill).

(A-C) Representative images from the YCR-BCIP-RESECT test cohort labeled with immunohistochemically defined mismatch repair (MMR) status. (D-F) Corresponding deep learning prediction maps. The edge length of each prediction tile is 256 μm. (G-I) Higher magnification of regions highlighted in A-E. True MSI or dMMR patients were strongly and homogeneously predicted to be MSI or dMMR (such as the patient shown in A). True MSS or pMMR patients were overall predicted to be MSS or pMMR (such as the patients in B and C), but a pronounced heterogeneity was observed in necrotic areas, poorly differentiated areas and immune-infiltrated tumor areas at the invasive edge.
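The tessellation step mentioned in the legends can be illustrated with a minimal sketch. It is not the pipeline used in the study: the function name `tile_grid`, the microns-per-pixel value `mpp`, and the `in_tumor` predicate standing in for the manual tumor annotation are all hypothetical, and only the 256 μm tile edge length is taken from the text.

```python
def tile_grid(width_px, height_px, mpp, tile_um=256.0, in_tumor=None):
    """Return (x, y, tile_px) for non-overlapping square tiles.

    width_px, height_px: size of the scanned region in pixels.
    mpp: microns per pixel of the scan (assumed value, scanner dependent).
    tile_um: tile edge length in microns (256 um as stated in the legend).
    in_tumor: optional predicate on the tile centre (cx, cy); a hypothetical
        stand-in for the manually outlined tumor region.
    """
    tile_px = round(tile_um / mpp)
    if tile_px <= 0:
        raise ValueError("tile size must be at least one pixel")
    tiles = []
    # Walk the region row by row, keeping only tiles that fit completely.
    for y in range(0, height_px - tile_px + 1, tile_px):
        for x in range(0, width_px - tile_px + 1, tile_px):
            cx, cy = x + tile_px / 2, y + tile_px / 2
            if in_tumor is None or in_tumor(cx, cy):
                tiles.append((x, y, tile_px))
    return tiles


if __name__ == "__main__":
    # Example: a 2048 x 1024 pixel region scanned at an assumed 0.5 um/pixel,
    # with the "tumor" taken to be the left half of the region.
    kept = tile_grid(2048, 1024, mpp=0.5, in_tumor=lambda cx, cy: cx < 1024)
    print(len(kept), "tiles of", kept[0][2], "pixels edge length")
```

Testing the tile centre against the annotation is only one possible inclusion rule; a real pipeline might instead require a minimum fraction of tumor pixels per tile.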
The main performance measure was the area under the receiver operating characteristic curve, shown as mean with lower and upper bounds in a 10-fold bootstrapped experiment. Intra-cohort performance was estimated by three-fold cross-validation. US = United States, UK = United Kingdom, DE = Germany, NL = Netherlands, MSI = microsatellite instability, dMMR = mismatch repair deficiency.
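As an illustration of how an AUROC with bootstrapped bounds can be computed, the following sketch resamples patients with replacement ten times and reports the mean, minimum and maximum AUROC. The function names are hypothetical, and taking the bounds as the minimum and maximum across resamples is an assumption made here; the text does not specify how the lower and upper bounds were derived.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive case scores higher than a
    negative case (rank-sum formulation); ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auroc(labels, scores, n_boot=10, seed=0):
    """Return (mean, lower, upper) AUROC over n_boot bootstrap resamples.

    Assumption: lower and upper are the minimum and maximum across resamples.
    """
    auroc(labels, scores)  # fails early if only one class is present
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains a single class; draw again
        values.append(auroc(ys, [scores[i] for i in idx]))
    return sum(values) / len(values), min(values), max(values)


if __name__ == "__main__":
    # Toy data: 1 = MSI/dMMR, 0 = MSS/pMMR, with made-up prediction scores.
    y = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.5, 0.7, 0.15]
    mean, lower, upper = bootstrap_auroc(y, s)
    print(f"AUROC {mean:.2f} (lower bound {lower:.2f}, upper bound {upper:.2f})")
```

The pairwise comparison is quadratic in the number of cases and is written for clarity; for cohorts of several thousand patients a rank-based implementation would be preferable.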