Computer science
Embedding
Convolutional neural network
Artificial intelligence
Transformer
Pattern recognition (psychology)
Architecture
Scalability
Computer engineering
Machine learning
Data mining
Database
Art
Physics
Quantum mechanics
Voltage
Visual arts
Authors
Hubert Truchan,Evgenii Naumov,Rezaul Abedin,Gregory Palmer,Zahra Ahmadi
Identifier
DOI:10.1007/978-981-99-8079-6_14
Abstract
Patch embedding has been a significant advancement in Transformer-based models, particularly the Vision Transformer (ViT), as it enables handling larger image sizes and mitigates the quadratic runtime of self-attention layers in Transformers. Moreover, it captures global dependencies and relationships between patches, enhancing effective image understanding and analysis. However, Convolutional Neural Networks (CNNs) continue to excel in scenarios with limited data availability, and their efficiency in terms of memory usage and latency makes them particularly suitable for deployment on edge devices. Building on this, we propose Minape, a novel multimodal isotropic convolutional neural architecture that applies patch embedding to both time-series and image data for classification purposes. By employing isotropic models, Minape addresses the challenges posed by varying sizes and complexities of the data. It groups samples by modality type, creating two-dimensional representations that undergo linear embedding before being processed by a scalable isotropic convolutional network architecture. The outputs of these pathways are merged and fed to a temporal classifier. Experimental results demonstrate that Minape significantly outperforms existing approaches in terms of accuracy while requiring fewer than 1M parameters and occupying less than 12 MB. This performance was observed on multimodal benchmark datasets and on the authors' newly collected multi-dimensional multimodal dataset, Mudestreda, obtained from real industrial processing devices (link to code and dataset: https://github.com/hubtru/Minape).
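The patch-embedding step the abstract builds on can be illustrated with a minimal NumPy sketch: an image is split into non-overlapping patches, each patch is flattened, and a linear projection maps it to an embedding vector. This is a generic ViT-style sketch under assumed patch size and embedding dimension, not the authors' Minape implementation (where the projection matrix would be learned).

```python
import numpy as np

def patch_embed(image, patch_size=4, embed_dim=32, rng=np.random.default_rng(0)):
    """Split an (H, W, C) image into non-overlapping patches and linearly
    embed each patch. Returns a (num_patches, embed_dim) token array."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, "image must tile evenly"
    p = patch_size
    # Rearrange into a grid of patches: (H//p, W//p, p, p, C)
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    # Flatten each patch into a vector of length p*p*C
    patches = patches.reshape(-1, p * p * C)
    # Random projection stands in for the learned linear embedding
    W_proj = rng.standard_normal((p * p * C, embed_dim))
    return patches @ W_proj

tokens = patch_embed(np.zeros((16, 16, 3)), patch_size=4, embed_dim=32)
print(tokens.shape)  # (16, 32): a 4x4 grid of patches, each embedded in 32 dims
```

The same idea extends to one-dimensional data: a time series can be cut into fixed-length windows and embedded analogously, which is how patch embedding can serve both modalities.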