This paper proposes a statistical framework through which artificial intelligence can assist human decision making. We first benchmark the performance of each human decision maker against machine predictions; we then replace the decisions of decision makers who perform poorly relative to that benchmark with the recommendations of the proposed artificial intelligence algorithm. Our framework draws on both Bayesian principles and the frequentist principles of hypothesis testing and confidence set formation. We illustrate our methods with an application to birth defect detection, using a large dataset of pregnancy outcomes and doctor diagnoses from the prepregnancy checkups of reproductive-age couples; the data are provided by the Chinese National Health Commission. On a test dataset, our algorithm achieves a higher true positive rate and a lower false positive rate than diagnoses made by doctors alone. We also find that the diagnoses of doctors from rural areas are more likely to be replaceable by machine learning predictions; this finding suggests that decision making assisted by artificial intelligence is more beneficial to poorer areas than to more developed regions.
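The benchmark-then-replace idea can be illustrated with a minimal Python sketch. It is not the procedure developed in the paper: in place of the Bayesian and frequentist tests, it uses a plain point-estimate comparison, treating a doctor as replaceable when the machine has a true positive rate at least as high and a false positive rate at least as low on that doctor's own cases. The function names, the record layout, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def rates(decisions, outcomes):
    """Return (true positive rate, false positive rate) for binary decisions."""
    tp = sum(1 for d, y in zip(decisions, outcomes) if d == 1 and y == 1)
    fp = sum(1 for d, y in zip(decisions, outcomes) if d == 1 and y == 0)
    pos = sum(1 for y in outcomes if y == 1)
    neg = len(outcomes) - pos
    # A rate is undefined when its class is empty; report 0.0 in that case.
    tpr = tp / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    return tpr, fpr


def replaceable_doctors(records):
    """Find doctors whose decisions the machine dominates.

    records: iterable of (doctor_id, doctor_decision, machine_decision, outcome),
    a hypothetical layout with binary decisions and outcomes.
    """
    by_doctor = defaultdict(list)
    for doc, d, m, y in records:
        by_doctor[doc].append((d, m, y))

    replace = set()
    for doc, rows in by_doctor.items():
        ds, ms, ys = zip(*rows)
        d_tpr, d_fpr = rates(ds, ys)
        m_tpr, m_fpr = rates(ms, ys)
        # Assumption: simple dominance of point estimates stands in for the
        # formal hypothesis tests; it ignores sampling uncertainty entirely.
        dominates = m_tpr >= d_tpr and m_fpr <= d_fpr
        if dominates and (m_tpr, m_fpr) != (d_tpr, d_fpr):
            replace.add(doc)
    return replace


def combined_decisions(records, replace):
    """Use the machine's decision for replaceable doctors, the doctor's otherwise."""
    return [m if doc in replace else d for doc, d, m, _ in records]


# Toy data (invented): doctor "A" is dominated by the machine, doctor "B" is not.
records = [
    ("A", 1, 1, 1), ("A", 0, 1, 1), ("A", 1, 0, 0), ("A", 0, 0, 0),
    ("B", 1, 1, 1), ("B", 0, 0, 0), ("B", 1, 0, 1), ("B", 0, 1, 0),
]
replace = replaceable_doctors(records)
final = combined_decisions(records, replace)
outcomes = [y for _, _, _, y in records]
print(replace, rates(final, outcomes))
```

For brevity the sketch benchmarks and evaluates on the same records. A real analysis would benchmark on one sample and evaluate on a held-out test set, and would account for the uncertainty in each doctor's estimated rates.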