Sequential and causal relationships among actions are critical for accurate video interpretation. Therefore, capturing both short-term and long-term temporal information is essential for effective action recognition. Current research, however, focuses primarily on fusing spatial features from diverse modalities for short-term action recognition and inadequately models the complex temporal dependencies in videos, leading to suboptimal performance. To address this limitation, we propose a skeleton-weighted and multi-scale temporal-driven action recognition network that integrates the RGB and skeleton modalities to effectively capture both short-term and long-term temporal information. First, we propose a temporal-enhanced adaptive graph convolutional network that derives motion attention masks from the skeletal joints and transfers them to the RGB videos to generate visually salient regions, yielding a concise and effective input representation. Subsequently, we develop a multi-scale local–global temporal modeling network driven by a self-attention mechanism, which captures both the fine-grained local details of individual actions and the global temporal relationships among actions across multiple temporal resolutions. Moreover, we design a multi-level adaptive temporal scale mixer module that efficiently integrates the multi-scale features into a unified temporal feature representation, ensuring temporal consistency. Finally, extensive experiments on the NTU-RGBD-60, NTU-RGBD-120, NW-UCLA, and Kinetics datasets validate the effectiveness of the proposed method.
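To make the multi-scale local–global temporal modeling idea concrete, the sketch below shows one way self-attention could be applied over several temporal resolutions and the results mixed back into a unified representation. It is a minimal illustration under assumed settings, not the paper's implementation: the class name `MultiScaleTemporalAttention`, the scale set (1, 2, 4), and the feature dimensions are all hypothetical choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleTemporalAttention(nn.Module):
    """Illustrative sketch (assumed design): frame features are pooled to
    several temporal resolutions, self-attention models dependencies at each
    resolution, and a simple learnable mixer fuses the scales back into one
    temporally consistent representation."""

    def __init__(self, dim=256, heads=4, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in scales
        )
        # Stand-in for the adaptive temporal scale mixer: concatenate scales, project back.
        self.mix = nn.Linear(dim * len(scales), dim)

    def forward(self, x):  # x: (batch, frames, dim)
        b, t, d = x.shape
        outputs = []
        for scale, attn in zip(self.scales, self.attn):
            # Coarser scales (larger pooling) expose longer-range, global context.
            xs = F.avg_pool1d(x.transpose(1, 2), scale, scale).transpose(1, 2)
            ys, _ = attn(xs, xs, xs)  # self-attention at this temporal resolution
            # Restore the original frame count so the scales can be fused frame-wise.
            ys = F.interpolate(
                ys.transpose(1, 2), size=t, mode="linear", align_corners=False
            ).transpose(1, 2)
            outputs.append(ys)
        return self.mix(torch.cat(outputs, dim=-1))  # unified temporal features


if __name__ == "__main__":
    feats = torch.randn(2, 32, 256)  # 2 clips, 32 frames, 256-dim per-frame features
    print(MultiScaleTemporalAttention()(feats).shape)  # torch.Size([2, 32, 256])
```

In this sketch the finest scale preserves local, per-action detail while the pooled scales attend over longer horizons; the concatenate-and-project mixer is only a placeholder for the adaptive multi-level fusion described above.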