The data and methodological approach used in this study are summarized in Fig. 1 and Fig. 2. The first step classifies pathologist-defined global scores as A versus non-A. The second step refines the classification of the predicted non-A cases into B and D scores using subcluster annotations.
Images of Brain Tumor Samples, Annotations and Segmentation
Brain tumor samples were obtained from the Onconeurotek collection of the Pitié-Salpêtrière Hospital. The protocol was approved by the IRB (Onconeurotek 2.0). Tumor specimens and clinicopathological information were collected with informed consent and approval by an Institutional Review Board (N° IDRCB: 2023-A02763-42, NCT06314607), in accordance with national laws and the Declaration of Helsinki. The dataset includes diverse glioma types as defined by the 2021 World Health Organization (WHO) reference classification: oligodendroglioma IDH-mutant and 1p/19q-codeleted, astrocytoma IDH-mutant, glioblastoma IDH-wildtype, pleomorphic xanthoastrocytoma, ganglioglioma, and other rare variants. Tumor tissue sections were immunostained for the T cell marker CD3 with an automated stainer (Ultra, Roche Ventana) and scanned with a Zeiss Axioscan Z1 at 0.22 microns, generating whole-slide images (WSIs) in CZI format (2–3 GB per image). The total dataset comprised 214 images.
Neuropathologists annotated all images for tumor infiltration and peritumor zones and categorized them into four groups (A, B, C, D) based on the amount and pattern of T cell infiltration, referred to as the global score of the image. The groups were defined as follows (Fig. 1): global score A: absence of CD3-positive cells within the tumor, or presence of only a few, as assessed by neuropathologists; global score B: CD3-positive cells primarily surrounding blood vessels inside the tumor; global score C: CD3-positive cells primarily outside the tumor infiltration zone; global score D: diffuse infiltration of CD3-positive cells within the tumor. Cases with global score C were excluded from the dataset because their CD3-positive cells lay mainly outside the tumor zone. The final dataset consisted of 134 WSIs with global score A, 60 with global score B and 20 with global score D. In the two-step strategy, the first goal was to distinguish global score A from non-A (B and D), given the imbalance in the dataset. The split into training, validation and test sets was performed once and was identical for both models (XGBoost and 2D CNN). Although the dataset was initially split once, it was subsequently partitioned with a 10-fold scheme, yielding ten subsets, each containing its own training, validation, and test sets for cross-validation analysis.
Visiopharm® was used to manually segment tumor zones and to automatically segment nuclei and CD3-positive cells within tumor areas. The output XML files contained several object types: ROI (Type 1), CD3-positive cells (Type 4), CD3-positive cell nuclei (Type 5), and negative zones (Type 0). Types 0, 1, and 4 were parsed using 'xmltodict' (v0.13.0). Lymphocyte coordinates were extracted and their centers computed. Using Matplotlib (v3.9.0), we visualized CD3-positive cell centers and tumor zones, excluding CD3-positive cells located in non-tumor regions. To normalize tumor sizes, we defined consistent global boundaries across all cases based on coordinate ranges. Each full mask measured 7200×7200 pixels and was produced in two versions: with or without the tumor area (Type 1).
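For illustration, a minimal sketch of the parsing step is given below. The exact Visiopharm XML schema is not described in the text, so the element and attribute names used here ("Annotations", "Annotation", "Vertices", "@X", "@Y", "@Type") are assumptions; only the use of xmltodict and the object types follow the description above.

```python
# Hypothetical sketch: schema names are assumed, not taken from the Visiopharm export.
import numpy as np
import xmltodict

def load_object_centers(xml_path, keep_types=(0, 1, 4)):
    """Parse an annotation file and return one (x, y) center per object of each kept type."""
    with open(xml_path) as fh:
        doc = xmltodict.parse(fh.read(), force_list=("Annotation", "Vertex"))

    centers = {t: [] for t in keep_types}
    for obj in doc["Annotations"]["Annotation"]:          # assumed layout
        obj_type = int(obj["@Type"])
        if obj_type not in keep_types:
            continue
        vertices = obj["Vertices"]["Vertex"]
        xy = np.array([[float(v["@X"]), float(v["@Y"])] for v in vertices])
        centers[obj_type].append(xy.mean(axis=0))         # object center = mean of its vertices
    return centers
```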
Patch-based analysis
To capture the fine-grained lymphocyte infiltration patterns of the images, full masks were split into 50x50-pixel tiles with the Python package patchify. Only patches with less than 25% white pixels (i.e. pixels without lymphocytes) were retained. In total, 93,214 patches were obtained from the 214 full masks.
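A minimal sketch of the tiling and filtering step is shown below, assuming the full mask is a 2D grayscale array in which white (255) encodes lymphocyte-free pixels; the 50-pixel tile size and the 25% threshold follow the text.

```python
# Sketch of the patch extraction step; the white_value convention is an assumption.
import numpy as np
from patchify import patchify

def tile_mask(mask, patch_size=50, white_value=255, max_white_frac=0.25):
    """Split a full mask into non-overlapping 50x50 tiles and keep the informative ones."""
    tiles = patchify(mask, (patch_size, patch_size), step=patch_size)
    kept = []
    for row in tiles:
        for patch in row:
            if np.mean(patch == white_value) < max_white_frac:   # less than 25% white pixels
                kept.append(patch)
    return kept
```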
VGG16 (Keras v2.11.0) was used for feature extraction (Fig. 2). This pretrained neural network outputs 512 features per patch, which were then used as input to the K-means clustering algorithm (scikit-learn v1.3.0) to group the patches into 75 clusters. The method clustered visually similar image patches and integrated these clusters into a classification model. VGG16 was selected for feature extraction because of evidence supporting its accuracy in brain tumor classification (Srinivas et al., 2022). Chen and coauthors used a fusion of ResNet101, DenseNet121, and EfficientNetB0, achieving 0.9918 accuracy (W. Chen et al., 2024), while Saeedi et al. applied deep learning for early tumor detection (Saeedi et al., 2023). Beyond brain tumors, VGG16 has been used in breast cancer detection with principal component analysis (PCA) for dimensionality reduction (Alrubaie et al., 2023), and in pancreatic cancer detection combined with XGBoost (Bakasa & Viriri, 2023), highlighting its effectiveness in diverse medical imaging tasks.
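The feature-extraction and clustering step can be sketched as follows. The conversion of single-channel mask patches to three-channel inputs and the use of global average pooling to obtain the 512-dimensional descriptor are assumptions; the 512 features, K-means and the 75 clusters follow the text.

```python
# Sketch of VGG16 feature extraction followed by K-means clustering of the patches.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from sklearn.cluster import KMeans

def extract_features(patches):
    """patches: array of shape (n, 50, 50) with values in [0, 255]."""
    rgb = np.repeat(patches[..., None], 3, axis=-1).astype("float32")     # 1 -> 3 channels
    backbone = VGG16(weights="imagenet", include_top=False,
                     pooling="avg", input_shape=patches.shape[1:] + (3,))
    return backbone.predict(preprocess_input(rgb))                        # (n, 512)

def cluster_patches(features, n_clusters=75, seed=0):
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return kmeans, kmeans.fit_predict(features)                           # one cluster id per patch
```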
In our study, after obtaining the 75 clusters, we counted the number of patches assigned to each cluster within a given mask image. Next, we calculated the proportion of patches in each cluster relative to the total number of patches in that mask. In this way, each mask image was represented by a percentage distribution across clusters (see Table 1 in the supplementary material). This compact representation was paired with the corresponding global score and effectively preserved biologically relevant spatial information related to the diffusion patterns of CD3-positive cells.
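A sketch of this per-mask representation is given below (function name illustrative):

```python
# Sketch: percentage of a mask's patches assigned to each of the 75 clusters.
import numpy as np

def mask_profile(cluster_ids, n_clusters=75):
    """cluster_ids: cluster assignments of all patches belonging to one mask image."""
    counts = np.bincount(np.asarray(cluster_ids), minlength=n_clusters)
    return 100.0 * counts / counts.sum()        # length-75 feature vector for the classifier
```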
To search for the best method, H2O AutoML (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html) was run for 24 hours using 5-fold cross-validation, class balancing, and accuracy as the main evaluation measure. The best model (ranked by AUC), an XGBoost model, was selected from more than 1000 candidates. XGBoost constructs an ensemble of decision trees sequentially, where each new tree corrects the errors of the previous ones using gradient descent on a loss function. It incorporates regularization, shrinkage, and column subsampling to improve generalization and prevent overfitting.
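The AutoML search can be sketched as follows; the file name and column names are hypothetical placeholders, and only the 24-hour budget, 5-fold cross-validation, class balancing and AUC ranking follow the text.

```python
# Sketch of the H2O AutoML search over the per-WSI cluster-proportion table.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
frame = h2o.import_file("cluster_proportions.csv")      # hypothetical file: 75 columns + "score"
frame["score"] = frame["score"].asfactor()              # binary target: A vs non-A
predictors = [c for c in frame.columns if c != "score"]

aml = H2OAutoML(max_runtime_secs=24 * 3600,             # 24-hour search budget
                nfolds=5,                               # 5-fold cross-validation
                balance_classes=True,                   # class balancing
                sort_metric="AUC",
                seed=1)
aml.train(x=predictors, y="score", training_frame=frame)
print(aml.leaderboard.head())                           # leading model was an XGBoost model
```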
To estimate the global score, a patch-based method was developed using features extracted from VGG16, followed by clustering and classification with this XGBoost model. The model was trained to distinguish WSIs classified as Group A from non-A (Groups B and D combined). The selected model used XGBoost with the "Dropouts meet Multiple Additive Regression Trees" (DART) booster for binary classification, with 37 trees (depth = 15) and 5-fold cross-validation.
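For reference, an equivalent configuration expressed with the standalone xgboost package is sketched below; only the DART booster, the 37 trees and the depth of 15 come from the text, and all other hyperparameters are left at library defaults.

```python
# Sketch of the selected model's configuration with the xgboost scikit-learn API.
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

clf = XGBClassifier(booster="dart",     # Dropouts meet Multiple Additive Regression Trees
                    n_estimators=37,    # 37 trees
                    max_depth=15,
                    eval_metric="auc")
# X: (n_wsi, 75) cluster-proportion vectors; y: 0 for score A, 1 for non-A
# scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")   # 5-fold cross-validation
```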
Density Map Analysis
A second approach relied on the density of T cells. Grayscale density maps were created to capture T cell density. CD3-positive cell full masks (without the border line of the tumor zone) were divided into overlapping 100x100-pixel windows (50-pixel stride). The mean lymphocyte density was computed in each window and rendered as a grayscale map. These maps were then used as input to a classifier (Fig. 2).
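A minimal sketch of the density-map construction is shown below, assuming the mask is a 2D binary array in which 1 marks a pixel belonging to a CD3-positive cell; the window size and stride follow the text, while the normalization to 8-bit grayscale is an assumption.

```python
# Sketch of the sliding-window density map (100x100 windows, 50-pixel stride).
import numpy as np

def density_map(mask, window=100, stride=50):
    rows = (mask.shape[0] - window) // stride + 1
    cols = (mask.shape[1] - window) // stride + 1
    dmap = np.zeros((rows, cols), dtype="float32")
    for i in range(rows):
        for j in range(cols):
            win = mask[i * stride:i * stride + window, j * stride:j * stride + window]
            dmap[i, j] = win.mean()                              # mean lymphocyte density
    return (255 * dmap / max(dmap.max(), 1e-8)).astype("uint8")  # 8-bit grayscale map
```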
To estimate the global score in a different manner, a density-map-based approach was implemented using a custom 2D CNN architecture. This model also performed Group A vs non-A classification. The 2D CNN (implemented with TensorFlow) had 4 Conv2D layers (3x3, ReLU), MaxPooling2D (2x2), Flatten, Dense (256, ReLU), and a final Dense (1, sigmoid) layer. It was trained for 3000 epochs with a batch size of 15 and a learning rate decayed from 10⁻⁴ to 10⁻⁶ over 3000 steps. In total, the model had 20.18M trainable parameters.
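A minimal Keras sketch of this architecture is given below. The layer sequence, number of epochs, batch size and learning-rate endpoints follow the text; the input size, filter counts, pooling placement and the linear decay schedule are assumptions, so the parameter count of this sketch will not necessarily match the reported 20.18M.

```python
# Sketch of the density-map CNN; filter counts and input shape are assumed.
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(input_shape=(143, 143, 1), filters=(32, 64, 128, 256)):
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=input_shape))
    for f in filters:                                    # 4 Conv2D (3x3, ReLU) blocks
        model.add(layers.Conv2D(f, (3, 3), activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))     # probability of non-A
    return model

schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1e-4, decay_steps=3000, end_learning_rate=1e-6)
model = build_cnn()
model.compile(optimizer=tf.keras.optimizers.Adam(schedule),
              loss="binary_crossentropy", metrics=[tf.keras.metrics.AUC()])
# model.fit(train_maps, train_labels, epochs=3000, batch_size=15,
#           validation_data=(val_maps, val_labels))
```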
Grad-CAM visualizations were generated using the final convolutional layer of the CNN model to highlight regions contributing to the classification (Chattopadhyay et al., 2018; Selvaraju et al., 2020).
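The Grad-CAM computation on the final convolutional layer can be sketched as follows; the layer name is a placeholder, and the single sigmoid output follows the architecture described above.

```python
# Sketch of Grad-CAM on the last convolutional layer of the trained CNN.
import tensorflow as tf

def grad_cam(model, image, conv_layer_name):
    """image: (H, W, 1) density map; returns a normalized heatmap over the feature map."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, prediction = grad_model(image[None, ...])
        score = prediction[:, 0]                       # sigmoid score for "non-A"
    grads = tape.gradient(score, conv_out)             # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))       # global-average-pooled gradients
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)[0]
    cam = tf.nn.relu(cam)                              # keep positively contributing regions
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```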
Spatial Analysis based on Labeled Patterns
To differentiate between the types of lymphocyte aggregation corresponding to global scores B and D, expert pathologists annotated the patch clusters (Fig. 3). The 75 clusters were classified into six subclusters (1, 2, 3, 4, E, and O) based on tissue morphology and CD3-positive lymphocyte density. For each cluster, 25 random patches were selected to determine its label, amounting to approximately 2% of the entire dataset (75 x 25 / 93,214). This constituted our weak labelling of patches. Subclusters 1 to 4 reflected increasing levels of T cell infiltration, with subcluster 4 indicating the highest diffusion within the tumor tissue. Subcluster E (for Exclusion) corresponded to patches where T cells were densely packed near blood vessels without evidence of infiltration. Subcluster O included background patches with no relevant tissue or cellular information.
A pattern was defined as a 3x3 array of patches, i.e. each pattern comprised 9 patches, each carrying its subcluster label. This array inherently contains the spatial organization of the patches. However, for our analysis it is irrelevant whether a specific subcluster label appears before or after another patch in the mask image; we therefore rearranged the labels in alphabetical order (see Fig. 4). We observed unique patterns of subcluster groups per global score group. We thus hypothesized that the presence or absence of these distinct and unique patterns could distinguish global score groups B and D, once the A vs non-A classification was achieved. Following this, we analyzed the top 50 most frequent patterns per group to determine whether specific subcluster patterns characterized global score groups B and D. Patterns were eliminated if background (subcluster O) or absence of T cells in the tumor zone (subcluster 1) accounted for 5 or more of the 9 patches in the array. For each global score group, the top 50 most frequent patterns were selected as a reference, resulting in three reference groups (A, B, D). To classify a query case, we compared its patterns to the reference patterns of each group and counted the common patterns. Although patterns describing global score A were also identified, they were ignored at this stage because the A vs non-A classification had already been achieved. The R package ggseqlogo (https://cran.r-project.org/web/packages/ggseqlogo/index.html) was used to visualize the top 50 most frequent patterns of the reference groups.
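A sketch of the pattern extraction and matching is given below, assuming each mask's patch labels are arranged in a 2D grid of subcluster symbols ("1", "2", "3", "4", "E", "O"); the elimination rule is interpreted here as the combined count of O and 1 patches reaching 5 out of 9, and the tie-breaking when a query shares equally many patterns with both references is arbitrary.

```python
# Sketch of the 3x3 pattern analysis; function and variable names are illustrative.
from collections import Counter

def extract_patterns(label_grid):
    """Order-invariant 3x3 patterns of one mask, filtered as described in the text."""
    rows, cols = len(label_grid), len(label_grid[0])
    patterns = []
    for i in range(rows - 2):
        for j in range(cols - 2):
            window = [label_grid[i + di][j + dj] for di in range(3) for dj in range(3)]
            # discard windows dominated by background (O) or lymphocyte-free tissue (1)
            if window.count("O") + window.count("1") >= 5:
                continue
            patterns.append("".join(sorted(window)))   # alphabetical order discards spatial order
    return patterns

def reference_patterns(group_masks, top_k=50):
    """Top-k most frequent patterns over all masks of one global-score group."""
    counts = Counter(p for grid in group_masks for p in extract_patterns(grid))
    return {p for p, _ in counts.most_common(top_k)}

def classify_query(label_grid, ref_b, ref_d):
    """Assign B or D by counting patterns shared with each reference group."""
    query = set(extract_patterns(label_grid))
    return "B" if len(query & ref_b) >= len(query & ref_d) else "D"
```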