Owing to advances in speech synthesis and voice conversion technology, artificial speech has become so close to natural speech that the two are perceptually indistinguishable. This poses a serious challenge to the security of voice-based biometric authentication systems. In this work, we propose an end-to-end spoofing detection method that first augments the raw audio waveform with random channel masking, then feeds it into a lightweight spectral-temporal attention module for cross-dimensional interaction, and finally selects an appropriate attention fusion method to maximise its capacity to capture interactive cues in both the spectral and temporal domains. Experimental results show that the proposed method effectively improves the accuracy of spoofed speech detection.
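As one possible illustration of the attention-and-fusion idea sketched above, the snippet below assumes a PyTorch setting in which a hypothetical SpectralTemporalAttention module operates on a (batch, channels, freq, time) feature map produced by an upstream encoder; the pooled 1-D descriptors, convolutional attention maps, and averaging-based fusion are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class SpectralTemporalAttention(nn.Module):
    """Minimal sketch: separate spectral and temporal attention maps, fused by averaging.

    Assumes input features of shape (batch, channels, freq, time).
    """

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        pad = kernel_size // 2
        # 1-D convolutions turn pooled descriptors into attention maps.
        self.spec_conv = nn.Conv1d(1, 1, kernel_size, padding=pad)
        self.temp_conv = nn.Conv1d(1, 1, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, f, t = x.shape
        # Spectral descriptor: average over channels and time -> (b, 1, freq).
        spec_desc = x.mean(dim=(1, 3)).unsqueeze(1)
        # Temporal descriptor: average over channels and frequency -> (b, 1, time).
        temp_desc = x.mean(dim=(1, 2)).unsqueeze(1)
        spec_att = torch.sigmoid(self.spec_conv(spec_desc)).view(b, 1, f, 1)
        temp_att = torch.sigmoid(self.temp_conv(temp_desc)).view(b, 1, 1, t)
        # One possible fusion choice: element-wise average of the two maps,
        # broadcast over (freq, time) to reweight the input features.
        fused = 0.5 * (spec_att + temp_att)
        return x * fused


# Usage example on a dummy encoder output.
x = torch.randn(2, 64, 60, 400)        # (batch, channels, freq, time)
attention = SpectralTemporalAttention()
y = attention(x)                        # same shape, attention-reweighted
```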