Improving classroom learning efficiency is a central goal of smart education worldwide. This study uses computer vision and machine learning to analyze student learning engagement. Using the XGBoost model, the system captures 468 facial and bodily features in real time to characterize student behavior. A ResNet-based neural network and the Perspective-n-Point (PnP) algorithm identify facial expressions, body movements, and head rotation angles. A fuzzy comprehensive evaluation method then determines the weight of each of these factors so that engagement can be assessed objectively. For empirical validation, the engagement results are compared with the final exam scores of 234 college students. The comparison shows a statistically significant correlation between classroom participation and academic performance (p = 0.034). The results indicate that real-time engagement assessment is relevant to learning efficiency and that multimodal fusion gives a more comprehensive measure of engagement than any single cue. Practical application supports the effectiveness of the system and offers guidance for educational stakeholders working on smart education.
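To make the fuzzy comprehensive evaluation step concrete, the following is a minimal Python sketch of how three factors could be fused into one engagement grade. It is an illustration under stated assumptions, not the study's implementation: the factor weights, the grade labels, the membership values, and the names `fuzzy_evaluate` and `engagement_grade` are all hypothetical, and the weighted-average composition operator is assumed because the text does not specify one.

```python
# Hypothetical factor weights (NOT taken from the study); they must sum to 1.
WEIGHTS = {"expression": 0.5, "body_movement": 0.3, "head_rotation": 0.2}

# Hypothetical engagement grades, ordered from lowest to highest.
GRADES = ("low", "medium", "high")


def fuzzy_evaluate(membership, weights=WEIGHTS):
    """Fuse per-factor membership vectors into one vector over GRADES.

    Uses the weighted-average operator: b_j = sum_i w_i * r_ij,
    where r_ij is the membership of factor i in grade j.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    combined = [0.0] * len(GRADES)
    for factor, weight in weights.items():
        row = membership[factor]
        if len(row) != len(GRADES):
            raise ValueError(f"membership for {factor!r} has the wrong length")
        for j, degree in enumerate(row):
            combined[j] += weight * degree
    return combined


def engagement_grade(combined):
    """Pick the grade with the largest combined membership."""
    best = max(range(len(combined)), key=combined.__getitem__)
    return GRADES[best]


# Made-up membership degrees for one student at one moment in time.
membership = {
    "expression": [0.1, 0.3, 0.6],
    "body_movement": [0.2, 0.5, 0.3],
    "head_rotation": [0.0, 0.4, 0.6],
}

combined = fuzzy_evaluate(membership)
grade = engagement_grade(combined)
print(combined, grade)
```

With these made-up inputs, the "high" grade receives the largest combined membership, so the student would be labeled as highly engaged at that moment. Choosing the maximum membership is only one possible decision rule; a weighted score over the grades would be an equally reasonable alternative.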