人工智能
计算机科学
随机森林
试验装置
逻辑回归
机器学习
乳腺癌
精确性和召回率
交叉验证
集合(抽象数据类型)
相关性
样本量测定
癌症
统计
数据挖掘
数学
医学
几何学
内科学
程序设计语言
作者
Hua Chen,Nan Wang,Xueping Du,Kehui Mei,Yuan Zhou,Guangxing Cai
摘要
Breast cancer is the most common and deadly type of cancer in the world. Based on machine learning algorithms such as XGBoost, random forest, logistic regression, and K-nearest neighbor, this paper establishes different models to classify and predict breast cancer, so as to provide a reference for the early diagnosis of breast cancer. Recall indicates the probability of detecting malignant cancer cells in medical diagnosis, which is of great significance for the classification of breast cancer, so this article takes recall as the primary evaluation index and considers the precision, accuracy, and F1-score evaluation indicators to evaluate and compare the prediction effect of each model. In order to eliminate the influence of different dimensional concepts on the effect of the model, the data are standardized. In order to find the optimal subset and improve the accuracy of the model, 15 features were screened out as input to the model through the Pearson correlation test. The K-nearest neighbor model uses the cross-validation method to select the optimal k value by using recall as an evaluation index. For the problem of positive and negative sample imbalance, the hierarchical sampling method is used to extract the training set and test set proportionally according to different categories. The experimental results show that under different dataset division (8 : 2 and 7 : 3), the prediction effect of the same model will have different changes. Comparative analysis shows that the XGBoost model established in this paper (which divides the training set and test set by 8 : 2) has better effects, and its recall, precision, accuracy, and F1-score are 1.00, 0.960, 0.974, and 0.980, respectively.
科研通智能强力驱动
Strongly Powered by AbleSci AI