Considering the Dice similarity coefficient (DSC) and the agreement of shape features between reader pairs across sequences, segmentation contours differed among readers; the difference was smallest in T2WA (DSC: 0.8) and largest in DWI (DSC: 0.73). Except for DWI/ADC, no major differences were observed across sequences in the mean DSC, which reflects the overlap between segmentations. We attribute the relatively lower DSC in the diffusion-weighted segmentations to the fact that most of the images in our dataset were acquired on 1.5 Tesla MRI scanners, which typically yield diffusion-weighted images with lower signal-to-noise ratios, making segmentation more challenging. Since our study aimed to reflect a real-world radiomic modeling scenario, we chose to include cases with low DSC across sequences rather than exclude them.
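As a reminder of what the DSC measures, the following is a minimal sketch and not the pipeline used in this study. It assumes each segmentation is represented as a set of voxel coordinates; the function name `dice_coefficient` and the two toy masks are hypothetical.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two segmentations.

    Each mask is an iterable of voxel coordinates (an assumed
    representation chosen only for this illustration).
    """
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        # Convention assumed here: two empty masks agree perfectly.
        return 1.0
    # DSC = 2 * |A intersect B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))


# Toy example: two 10 x 10 contours, the second shifted by two columns.
reader_1 = {(x, y) for x in range(10) for y in range(10)}
reader_2 = {(x, y) for x in range(2, 12) for y in range(10)}

print(dice_coefficient(reader_1, reader_2))  # 0.8
```

In this toy case a shift of only two voxel columns on a 10 x 10 contour already lowers the DSC to 0.8, which illustrates how small boundary disagreements affect the overlap measure.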
Our findings suggest that inter-reader segmentation differences have a significant impact on radiomic features. When evaluating the agreement of radiomic features calculated across sequences, we observed discrepancies that could not be fully explained by the mean or standard deviation of the DSC. Reijd et al. had three readers segment 30 colorectal liver metastases to evaluate the impact of manual contour differences on the agreement of radiomic features across MRI sequences, and suggested that the variation in segmentation contours between readers had only a minor effect on the consistency of MRI-based radiomic features [11]. However, in lesions such as EC, where segmentation is more challenging than in lesions such as lung nodules or liver metastases, contour differences appear to have a much greater impact on the agreement of radiomic features. Despite the contouring difficulty, particularly due to myometrial invasion, some radiomic features still demonstrate high reproducibility. In a 2021 analysis by Xue et al., which reviewed over 100 radiomic studies across various organs, it was suggested that inter-observer segmentation variability has a rather limited impact on the reproducibility of radiomic features [12]. In the same analysis, among MRI studies that used ICC to assess agreement under segmentation differences, on average 80% (range, 20% to 100%) of features were deemed suitable for use after agreement analysis, a proportion referred to as the satisfactory feature rate. However, the satisfactory feature rate in this analysis was evaluated by imaging modality and not on a lesion-specific basis. Moreover, only one EC radiomics study was included in this part of the analysis.
Therefore, in studies developing MRI-based radiomic models for EC [13–15], the impact of segmentation variability on feature agreement remains a critical area of investigation. In EC radiomic modeling studies that use ICC for agreement assessment, radiomic features are filtered using different reproducibility thresholds, such as 0.75 [14] or 0.8 [16], so the choice of threshold is of critical importance. Since there is no standard ICC threshold for defining agreement, reporting the proportion of radiomic features that are consistent at different ICC levels, and explicitly stating the ICC values of the features used in the constructed radiomic signature, can help create more reproducible models in such studies.
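The kind of reporting suggested above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code of this study: it assumes the ICC(2,1) form of Shrout and Fleiss (two-way random effects, absolute agreement, single rater), which may differ from the ICC form used in a given study, and the function names, feature names, and ICC values in the example are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per lesion and one column per reader.

    Assumed form: two-way random effects, absolute agreement,
    single rater (Shrout and Fleiss).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for lesions (rows), readers (columns), and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    # Undefined when there is no variance at all in the table.
    return (ms_rows - ms_error) / denom if denom else float("nan")


def proportion_above(icc_by_feature, thresholds=(0.5, 0.75, 0.8, 0.9)):
    """Fraction of features whose ICC meets or exceeds each threshold."""
    total = len(icc_by_feature)
    return {t: sum(v >= t for v in icc_by_feature.values()) / total
            for t in thresholds}


# Hypothetical ICC values for four features, for illustration only.
example_iccs = {"feature_a": 0.90, "feature_b": 0.78,
                "feature_c": 0.70, "feature_d": 0.40}
print(proportion_above(example_iccs, thresholds=(0.5, 0.75, 0.8)))
```

With these hypothetical values, moving the cut-off from 0.75 to 0.8 halves the number of retained features, which is why reporting the retained proportion at several thresholds is more informative than reporting a single cut-off.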
When comparing the radiomic features derived from the specified segmentations across different sequences, features extracted from T1CE and T2WS images showed the highest reproducibility (57.9% and 59.9%, respectively), while those from T2WA and ADC images showed the lowest (44.9% and 45.8%, respectively). This suggests that the inter-reader reproducibility of radiomic features in EC differs between sequences.
Based on the agreement of shape features and the DSC of segmentations performed on T2-weighted images, the three radiologists showed comparable results in terms of the similarity of segmented regions across all planes. However, despite this similarity in the delineated areas, the agreement of radiomic features derived from T2WA lagged behind that of features obtained from T2WS across all feature classes. This suggests that, when developing MRI-based radiomic models for EC, not only the sequence but also the imaging plane should be taken into consideration. Our results therefore do not align with the suggestion of Reijd et al. that radiomic feature reproducibility is independent of sequence plane orientation [11]. Although most studies use the sagittal plane of T2-weighted sequences for segmentation, some have used axial [17] or axial-oblique [18] planes. Such variations in plane selection may influence the reproducibility of the extracted features.
In our study, the segmentation mask was drawn on the DWI images, and the ADC features were extracted using the same mask. However, the agreement of radiomic features derived from the ADC maps, particularly first-order features, was lower than that of features extracted from DWI. This is most likely due to differences in the MRI scanners and the software used to generate the ADC images [19–21]. To better reflect routine clinical practice, we deliberately used the original ADC maps in their native form, without applying any standardized reconstruction methods. Had we applied different normalization and quantization approaches prior to feature extraction, we might have observed improved agreement between radiomic features obtained from different readers [22]. Likewise, had the segmentation been carried out directly on the ADC maps rather than on the DWI images, reproducibility might have increased.
Shape features demonstrated the highest level of agreement among all radiomic feature groups. In both MRI- and CT-based studies, shape features are consistently identified as one of the classes most robust to test-retest variability and inter-reader differences, and our findings are in line with the existing literature [22–24]. Among the shape features, “Elongation”, “Sphericity”, and “Flatness” showed low reproducibility across all sequences, whereas “Least Axis Length”, “Major Axis Length”, “Maximum 2D Diameter Slice”, “Maximum 3D Diameter Slice”, and “Minor Axis Length” demonstrated high reproducibility in all sequences. Since the first three features, referred to as compactness descriptors, primarily reflect the roundness of the lesion, we suggest that they are strongly influenced by factors such as slice thickness, resolution, and segmentation variability. On the other hand, the consistent agreement of features measuring the lesion’s widest and narrowest dimensions indicates that, despite uncertainties at lesion borders, the readers shared a similar understanding of lesion size.
When comparing radiomic feature subclasses, the NGTDM class showed the lowest reproducibility across all sequences. NGTDM is especially sensitive to contouring differences because its features are calculated from each pixel and its neighboring pixels. Since the set of neighboring pixels changes substantially when the contour changes, NGTDM values become sensitive to segmentation differences. The low repeatability of NGTDM features and their high sensitivity to segmentation differences among readers have been demonstrated in various disease contexts, and our findings are consistent with the existing literature [11, 25, 26]. GLCM, the second least reproducible radiomics subclass, quantifies how often specific combinations of gray-level intensities occur between pairs of pixels or voxels at a defined spatial relationship and orientation. We hypothesize that, since the pixel pairs are affected by even the slightest change in contouring, this subclass proved to be less reproducible than the other subclasses.
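The sensitivity of NGTDM to contour changes can be illustrated with a simplified sketch. It is not the feature extraction software used in this study, and it makes several assumptions: a 2D image, an 8-connected neighborhood, neighbors restricted to pixels inside the mask, and already discretized gray levels. The function name `ngtdm_differences` and the toy image are hypothetical.

```python
def ngtdm_differences(image, mask):
    """Per-gray-level sum of |pixel - mean of its masked neighbors|.

    Simplified 2D version of the quantity on which NGTDM features are
    built. `image` and `mask` are lists of rows of equal shape.
    """
    rows, cols = len(image), len(image[0])
    s = {}
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            # 8-connected neighbors that also lie inside the contour.
            neighbors = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c) and mask[rr][cc]
            ]
            if not neighbors:
                continue  # isolated pixel: no neighborhood to compare with
            mean_nb = sum(neighbors) / len(neighbors)
            level = image[r][c]
            s[level] = s.get(level, 0.0) + abs(level - mean_nb)
    return s


image = [[1, 2, 4]]
wide_contour = [[True, True, True]]
narrow_contour = [[True, True, False]]  # one boundary pixel excluded

print(ngtdm_differences(image, wide_contour))    # {1: 1.0, 2: 0.5, 4: 2.0}
print(ngtdm_differences(image, narrow_contour))  # {1: 1.0, 2: 1.0}
```

Excluding a single boundary pixel removes one gray level from the result entirely and doubles the value for gray level 2, because the neighborhood of the remaining boundary pixel has changed. GLCM features are affected in an analogous way, since every pixel pair that crosses the moved boundary is added to or removed from the co-occurrence counts.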
“HighGray” features were clearly more reproducible in the sequences in which cancer tissue is hyperintense relative to surrounding healthy tissue, namely T2W and DWI. In these segmentations, regions with low signal intensities consist of only a small number of voxels, rendering them more susceptible to variability. However, in T1CE, where cancer tissue is expected to have lower intensity than the highly enhancing myometrium, only a relatively small number of “LowGray” features showed excellent reproducibility, which suggests that tumor-to-healthy-tissue contrast is more pronounced in T2W images than in T1CE.
In the future, the use of fully automated segmentation techniques is anticipated to minimize inter-reader variability. In one study, when radiomic features derived from the segmentations of a deep learning model trained to segment EC on MRI were compared with those obtained from manual segmentations by an experienced radiologist, first-order and shape feature classes showed higher agreement than texture-based feature classes [27]. The same study reported that the segmentation model achieved a mean Dice Similarity Coefficient (DSC) of 0.806 when compared with the radiologist who segmented the training data and was considered the gold standard. This DSC is comparable to the average inter-reader DSC across all sequences observed in our study. Although radiomic features derived from both manual and automated segmentations were found to be consistent in that study, it should be noted that the segmentation area produced by the model inherently reflects the approach of the specific gold standard reader used during training. The lower agreements observed between different readers in our study suggest that, if such an automated segmentation model is to be developed, it should ideally be trained using segmentations performed by multiple experienced radiologists, preferably at different times. Otherwise, radiomic features extracted from automated segmentations may reflect the individual biases of the original reader.
Our study has several limitations. First, it was designed retrospectively. Second, only 59 patients were included, and the reliability of our results could be improved by increasing the sample size. Third, only original radiomic features were analyzed; wavelet-derived features were not included. Fourth, 58 out of the 59 scans were acquired using MRI machines with a magnetic field strength of 1.5 Tesla, which may have lowered segmentation quality, particularly in diffusion-weighted images, due to reduced signal-to-noise ratio. Lastly, external scans accounted for approximately 14% of the total, and the vast majority of all scans (57 out of 59) were acquired using Siemens MRI systems. Including a more diverse set of scanner brands and models could have yielded a more heterogeneous dataset and potentially more generalizable results.
Our study included 59 pre-treatment EC patients whose images were acquired from 9 different MRI scanners. We believe that this heterogeneous population enhances the generalizability of our findings. To develop better radiomic models, training on images from multiple centers may help compensate for acquisition differences, thus improving model generalizability [28].
In conclusion, with this study, we present the first MRI-based reproducibility analysis for EC and encourage other researchers to compare their results against ours. In EC, the sagittal scan orientation yielded the best reproducibility among the orientations evaluated. T2-weighted and contrast-enhanced T1-weighted images proved to be comparably reproducible, while ADC features showed poor reproducibility. Shape and GLDM features were the most reproducible radiomic feature subclasses. “HighGray” features were highly reproducible in T2-weighted and diffusion-weighted images. We hope that our findings will contribute to the generalizability and reproducibility of future radiomic models for EC.