Abstract
Breast cancer is the most common cancer type affecting women worldwide. It has been phenotypically classified into five subtypes, each with unique characteristics that demonstrate the heterogeneity present within breast cancer tumours. In 2012, the American Association for Cancer Research provided population-based molecular integrative clusters (IntClust) for the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) dataset, resulting in ten subtypes. Previous work on the METABRIC dataset used only gene expression data to identify the effective genes for each subtype, without integrating the other available data sources. The objective of this paper is to present a breast cancer subtype classification model that applies feature fusion to the METABRIC datasets, namely clinical, gene expression, Copy Number Aberrations (CNA), Copy Number Variations (CNV), and histopathological images. State-of-the-art machine learning classifiers, including Linear-SVM, Radial-SVM, Random Forests (RF), Ensemble SVM (E-SVM), and Boosting, were applied to different data profiles. The highest accuracy achieved for IntClust subtyping was 88.36%, using Linear-SVM on the data profile with features fused from the clinical, gene expression, CNA, and CNV datasets, with Jaccard and Dice scores of 0.802 and 0.8835, respectively. For Pam50 subtyping, an accuracy of 97.1% was achieved, with Jaccard scores ranging from 0.9439 to 0.9472 and a Dice score of 0.971, using the Linear-SVM and E-SVM classifiers on several data profiles that include features from histopathological images. In conclusion, our study validates that fusing features from the various METABRIC datasets improves breast cancer subtype classification performance.
Moreover, histopathological images give promising results for Pam50 subtyping and are expected to improve the accuracy of IntClust subtyping when applied to a larger population.
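As a rough illustration of the two ideas this abstract relies on, the following minimal Python sketch shows feature fusion by concatenation and the computation of Jaccard and Dice scores for multi-class predictions. It is not the authors' implementation. The function names `fuse_features` and `micro_jaccard_dice`, the sample IDs, and all toy values are hypothetical, and micro-averaging is an assumption, since the abstract does not state which averaging scheme was used. No classifier is trained here; the predicted labels are made up.

```python
def fuse_features(*profiles):
    """Early (feature-level) fusion by concatenation.

    Each profile maps a sample ID to a list of numeric features.
    Only samples present in every profile are kept.
    """
    shared_ids = set(profiles[0])
    for profile in profiles[1:]:
        shared_ids &= set(profile)
    return {
        sample_id: [value for profile in profiles for value in profile[sample_id]]
        for sample_id in sorted(shared_ids)
    }


def micro_jaccard_dice(y_true, y_pred):
    """Micro-averaged Jaccard and Dice for single-label multi-class data.

    Every misclassified sample counts once as a false positive (for the
    predicted class) and once as a false negative (for the true class).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    fp = fn = len(y_true) - tp
    jaccard = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return jaccard, dice


# Hypothetical toy profiles (values are invented for illustration only).
clinical = {"S1": [61.0, 1.0], "S2": [48.0, 0.0], "S3": [55.0, 1.0]}
expression = {"S1": [0.42, -1.30], "S2": [1.10, 0.25]}
cna = {"S1": [2.0], "S2": [-1.0], "S3": [0.0]}

# S3 is dropped because it has no gene expression features.
fused = fuse_features(clinical, expression, cna)

# Hypothetical true and predicted subtype labels for four samples.
y_true = ["LumA", "LumA", "Basal", "Her2"]
y_pred = ["LumA", "LumB", "Basal", "Her2"]
jaccard, dice = micro_jaccard_dice(y_true, y_pred)
```

Under this micro-averaging assumption, the Dice score of a single-label classifier equals its accuracy, which would be consistent with the reported Dice scores (0.8835 and 0.971) closely matching the reported accuracies (88.36% and 97.1%). Other averaging schemes, such as macro or weighted averaging, would generally give different values.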