imaged as part of follow-up care, whereas all luminal A cancers were biopsy proven. Imaging was performed at 1.5 and 3.0 T using Philips scanners.
The images were segmented using an automated fuzzy C-means method requiring only the manual indication of a seed point within a lesion (19). Thirty-eight features describing each lesion were extracted from each image, in categories of size, shape, morphology, enhancement texture, kinetics, and enhancement-variance kinetics (7,19-22) (Appendix). The extracted features used in this work were previously used as part of an investigation of deep learning methodologies across multiple modalities in the task of classifying benign lesions and malignant cancers (23). The luminal A cancer features used here are part of a larger dataset of features extracted from breast cancers of all molecular subtypes in that previous study, and the benign features were used in the previous study as well. We note that the work here differs from the previous work in that this investigation focuses on classification of a single molecular subtype of breast cancer, uses features extracted from images from only one modality, and does not implement deep learning methodologies.
Figure 2. Box plots for three lesion features from the dataset: size (maximum linear size), shape (irregularity), and enhancement texture (maximum correlation coefficient). The width of the notches in a given box is proportional to the interquartile range of the data and provides a visual indication of possible difference in groups. The horizontal lines within the boxes indicate the median of the set, whereas the edges of the boxes indicate the 25th and the 75th percentiles. The crosses indicate outliers. (Color version of figure is available online.)
The Pearson correlation coefficient was determined for each feature against all other features, with particular attention given to the correlation of size features with morphology features, because of the nature of the proposed protocols. Linear discriminant analysis (LDA) was used as the classifier, and we performed 10-fold cross validation in classifier training and testing. Lesions were partitioned into training and testing sets by case, so that all lesions from a given case fell in the same set.
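The correlation and partitioning steps can be illustrated with a minimal sketch. The function names, the round-robin fold assignment, and the random seed are assumptions made for illustration; the text does not state how cases were distributed across folds or whether folds were stratified by lesion status.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def folds_by_case(case_ids, n_folds=10, seed=0):
    """Return one fold index per lesion; lesions sharing a case share a fold.

    Assumption: cases are shuffled and dealt round-robin into folds.
    """
    cases = sorted(set(case_ids))
    random.Random(seed).shuffle(cases)
    fold_of_case = {case: i % n_folds for i, case in enumerate(cases)}
    return [fold_of_case[c] for c in case_ids]
```

For each fold index k, lesions with that index would form the testing set and all remaining lesions the training set, so no case contributes lesions to both.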
We investigated three classification protocols. First, classification was performed using the maximum linear size alone. Second, classification was performed concurrently with stepwise feature selection on all features. Third, classification was performed concurrently with stepwise feature selection on all features except those related to size. In each of the latter two protocols, feature selection was performed for each training fold. We tabulated which sets of features were selected in each fold in the second and third classification protocols. Posterior probabilities of malignancy for each lesion in each testing set were scaled to the prevalence of cancer in the entire dataset (approximately 60%) (24). The scaled posterior probabilities of malignancy were then averaged by case.
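The scaling and averaging steps might look like the following sketch. The exact scaling formula of reference (24) is not reproduced in the text, so the standard Bayes prior-shift adjustment used here is an assumption, as are the function names and the idea that the training-fold prevalence is the starting point.

```python
from collections import defaultdict


def scale_to_prevalence(p, train_prevalence, target_prevalence):
    """Rescale a posterior probability from one prevalence to another.

    Assumption: standard prior-shift adjustment, which multiplies the
    posterior odds by the ratio of target prior odds to training prior odds.
    """
    num = p * target_prevalence / train_prevalence
    den = num + (1.0 - p) * (1.0 - target_prevalence) / (1.0 - train_prevalence)
    return num / den


def average_by_case(case_ids, probabilities):
    """Average lesion-level probabilities into one value per case."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for case, p in zip(case_ids, probabilities):
        sums[case] += p
        counts[case] += 1
    return {case: sums[case] / counts[case] for case in sums}
```

When the training and target prevalences are equal, the adjustment leaves the probability unchanged, which is a quick sanity check on the formula.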
We used the scaled posterior probabilities of malignancy by case to compare the classification performance using the area under the receiver operating characteristic curve (AUC) (25) as the figure of merit in assessing the ability to distinguish between benign and luminal A lesions. We used the conventional binormal ROC model (26). The software package ROCkit (27) was used to statistically compare the obtained AUC values using the two-tailed P values for differences in AUCs and the 95% confidence intervals of the difference in AUC for each comparison pair of protocols. The P values were corrected for three comparisons using the Holm-Bonferroni method (28). Classification performance was considered significantly different when the corrected two-tailed P value from the comparison of AUCs for two protocols was less than .05.
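As an illustration of the multiple-comparison step, the Holm-Bonferroni adjustment can be sketched as below. The function name is hypothetical, and this is a generic implementation of the method, not the software used in the study.

```python
def holm_bonferroni(p_values):
    """Return Holm-Bonferroni adjusted P values in the original order."""
    m = len(p_values)
    # Visit P values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest P value (rank k-1) is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

With three comparisons, the smallest raw P value is multiplied by 3, the next by 2, and the largest by 1, and each adjusted value is then compared with .05.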
In instances when we failed to reject the null hypothesis (that performances were equal) in superiority testing, equivalence testing was performed based on the same 95% confidence intervals. Equivalence is defined as having been demonstrated when the difference in AUC and the associated confidence interval of this difference fall within ±Δ, where Δ is the equivalence margin (29). The determination of Δ is not well established in medical imaging, but seven studies summarized by Ahn et al. (29) used equivalence margins between 1.5% and 15.0%. In this work, we did not declare an equivalence margin ab initio, but rather observed whether the calculated equivalence margin fell within the range of equivalence margins seen in the selected literature reviewed by Ahn et al.
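The equivalence criterion can be expressed as a short sketch. The function names are hypothetical, and reading the "calculated equivalence margin" as the smallest margin that contains the whole confidence interval is an assumption; the 1.5% to 15.0% range is taken from the text.

```python
def is_equivalent(ci_lower, ci_upper, margin):
    """True if the CI of the AUC difference lies entirely within +/- margin."""
    return -margin < ci_lower and ci_upper < margin


def smallest_containing_margin(ci_lower, ci_upper):
    """Smallest margin whose +/- band contains the CI (assumed interpretation)."""
    return max(abs(ci_lower), abs(ci_upper))


def within_literature_range(margin, low=0.015, high=0.15):
    """True if the margin falls in the 1.5% to 15.0% range cited in the text."""
    return low <= margin <= high
```

For a hypothetical confidence interval of (-0.04, 0.07), the smallest containing margin is 0.07, which would fall within the cited range.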
Box plots of selected features by lesion status show a separation in the medians between benign lesions and luminal A cancers (Fig 2). Of note is the substantial number of outliers for the size feature shown here (maximum linear size) as calculated for this dataset, compared with the more consistent values of the irregularity feature within each biopsy-proven group (benign or luminal A).