Video classification is a complex task that involves analyzing audio and video signals with deep neural models. To classify these signals reliably, researchers have developed multimodal fusion techniques that combine audio and video data into compact representations that can be processed efficiently. However, previous approaches to multimodal fusion have relied heavily on manually designed attention mechanisms. To address this limitation, we propose the Multimodal Auto Attention Fusion (MAF) model, which uses Neural Architecture Search (NAS) to automatically identify effective attention representations for a wide range of tasks. Our approach includes a custom-designed search space that enables the automatic generation of attention representations. By automating the design of the Key, Query, and Value representations, the MAF model strengthens its self-attention mechanism and yields highly effective attention designs. Compared with other multimodal fusion methods, our approach achieves competitive performance in capturing modality interactions. Experiments on three large-scale datasets (UCF101, ActivityNet, and YouTube-8M) confirm the effectiveness of our approach and demonstrate its superior performance over other popular models. Furthermore, our approach generalizes robustly across diverse datasets.
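To make the idea of searching over Key, Query, and Value designs concrete, the following is a minimal sketch, not the actual MAF implementation: it assumes a small hypothetical pool of candidate projection ops (`CandidateProjection`) that are softly mixed by learnable architecture weights, and a `SearchableAttentionFusion` module that applies self-attention over concatenated audio and video tokens. The candidate set, the soft-mixing search strategy, and all names are illustrative assumptions rather than the paper's exact search space.

```python
# Hypothetical sketch of a searchable Q/K/V attention-fusion cell (names are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CandidateProjection(nn.Module):
    """Small pool of candidate ops for producing a Query/Key/Value projection.

    The search softly mixes the candidates via learnable weights; the candidate
    set here is an assumption, not the exact MAF search space.
    """

    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Linear(dim, dim),                                                  # plain linear projection
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)),   # small MLP projection
            nn.Identity(),                                                        # skip / identity
        ])
        # Architecture weights optimized during the search (e.g., gradient-based NAS).
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))


class SearchableAttentionFusion(nn.Module):
    """Fuses audio and video token sequences with self-attention whose
    Q/K/V projections come from the searchable candidate pool above."""

    def __init__(self, dim):
        super().__init__()
        self.q_proj = CandidateProjection(dim)
        self.k_proj = CandidateProjection(dim)
        self.v_proj = CandidateProjection(dim)
        self.scale = dim ** -0.5

    def forward(self, audio_tokens, video_tokens):
        # Concatenate modalities so attention can model cross-modal interactions.
        tokens = torch.cat([audio_tokens, video_tokens], dim=1)   # (B, Na+Nv, D)
        q, k, v = self.q_proj(tokens), self.k_proj(tokens), self.v_proj(tokens)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        fused = attn @ v                                          # (B, Na+Nv, D)
        return fused.mean(dim=1)                                  # pooled multimodal representation


# Usage: fuse 4 audio tokens and 8 video tokens of width 64 for a batch of 2 clips.
fusion = SearchableAttentionFusion(dim=64)
audio = torch.randn(2, 4, 64)
video = torch.randn(2, 8, 64)
print(fusion(audio, video).shape)  # torch.Size([2, 64])
```

Once the architecture weights have been optimized, a discrete design can be obtained by keeping only the highest-weighted candidate per projection; this retraining step is likewise an assumption about how such a search would typically be finalized.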