Electroencephalogram (EEG)-based brain-computer interfaces enable humans to communicate with or control the external environment via brain signals. Spatial, spectral, and temporal features from multiple dimensions have proven effective in modeling human intention events. However, current research either focuses on a single dimension, with or without spatial information, or depends heavily on hand-crafted features. To address this issue, a multi-dimensional deep-learning approach is proposed to jointly learn spatial–spectral–temporal representations from EEG. To begin, we generate two-dimensional EEG topographic maps from both the time and frequency domains to preserve spatial information. Next, a dual-stream neural network (DSNN), comprising a spectral stream and a temporal stream, is designed to learn spatial–spectral–temporal representations from these topographic maps. Specifically, each stream of the DSNN consists of convolutional and recurrent neural networks that learn spatial and dynamic spectral/temporal representations, respectively. An attentive neural network is placed at the end of the two streams to fuse features from the different dimensions and to emphasize the most discriminative periods. Empirical evaluation of DSNN on the benchmark EEG motor imagery dataset demonstrates significant improvement over current methods in the subject-independent scenario. Furthermore, this work reveals the learning process of the different neural network components and demonstrates the potential advantages of DSNN in modeling human intention events from EEG.
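To make the described architecture concrete, the following is a minimal sketch of the dual-stream idea in PyTorch: a small CNN encodes each per-step topographic map, an LSTM models the dynamics within each stream, and a simple attention layer fuses the two streams and weights time steps. All layer sizes, the 32x32 map resolution, the number of classes, and the attention formulation are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn


class StreamEncoder(nn.Module):
    """CNN over per-step topographic maps followed by an LSTM over the sequence."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                  # -> (32, 1, 1) per map
        )
        self.rnn = nn.LSTM(32, hidden, batch_first=True)

    def forward(self, maps):                          # maps: (B, T, 1, 32, 32)
        b, t = maps.shape[:2]
        feats = self.cnn(maps.flatten(0, 1)).flatten(1)   # (B*T, 32)
        out, _ = self.rnn(feats.view(b, t, -1))           # (B, T, hidden)
        return out


class DualStreamNet(nn.Module):
    """Spectral and temporal streams fused by attention over the time axis."""

    def __init__(self, hidden: int = 64, n_classes: int = 4):
        super().__init__()
        self.spectral = StreamEncoder(hidden)
        self.temporal = StreamEncoder(hidden)
        self.attn = nn.Linear(2 * hidden, 1)          # scores each fused step
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec_maps, temp_maps):
        fused = torch.cat([self.spectral(spec_maps),
                           self.temporal(temp_maps)], dim=-1)  # (B, T, 2*hidden)
        weights = torch.softmax(self.attn(fused), dim=1)       # (B, T, 1)
        pooled = (weights * fused).sum(dim=1)                  # (B, 2*hidden)
        return self.head(pooled)


# Example: 8 trials, 10 map frames per trial, one 32x32 map per domain per frame.
spec = torch.randn(8, 10, 1, 32, 32)
temp = torch.randn(8, 10, 1, 32, 32)
logits = DualStreamNet()(spec, temp)                  # shape (8, 4)
```

The attention weights here play the role of emphasizing the most discriminative periods: steps with higher scores contribute more to the pooled representation fed to the classifier.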