This paper applies deep learning to the estimation of human joint points in complex multi-person videos. Although there are many excellent results on human pose estimation from single-frame images, a video is a sequence of consecutive frames and therefore carries additional, more complex temporal information, which makes human pose estimation in videos more challenging. To exploit the temporal and spatial continuity of a video sequence, this paper uses both temporal attention and spatial attention. To account for the amount of information in each frame and to aggregate information across multiple features, temporal attention fusion is performed by computing the element-wise correlation between the features of the current frame and those of adjacent frames. To take advantage of the spatial relationships within a frame, spatial attention is used to achieve efficient fusion. Experimental evaluation shows that the proposed method recovers human pose skeletons with high accuracy.
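The two attention steps described above can be illustrated with a minimal NumPy sketch. This is not the network used in this paper; it rests on several assumptions made only for illustration: features are stored as an array of shape (T, C, H, W), the "element correlation" is taken to be a per-pixel dot product over channels between each frame and the current frame, the correlation is squashed with a sigmoid, and the spatial mask is built from channel-wise mean and max statistics. The function names `temporal_attention_fusion` and `spatial_attention` are hypothetical, and a real model would use learned convolutions in place of these fixed operations.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def temporal_attention_fusion(feats, center):
    """Fuse per-frame features, weighted by similarity to the current frame.

    feats:  array of shape (T, C, H, W), one feature map per frame (assumed layout).
    center: index of the current frame within the T frames.
    Returns the fused features (C, H, W) and the attention weights (T, H, W).
    """
    ref = feats[center]                            # (C, H, W)
    # Assumed correlation: per-pixel dot product over the channel axis.
    corr = (feats * ref[None]).sum(axis=1)         # (T, H, W)
    weights = sigmoid(corr)                        # values in (0, 1)
    # Weighted average over time; the current frame's weight is >= 0.5,
    # so the denominator is never zero.
    fused = (feats * weights[:, None]).sum(axis=0) / weights.sum(axis=0)[None]
    return fused, weights


def spatial_attention(fused):
    """Reweight each spatial location with a mask built from channel statistics.

    This mask (sigmoid of channel mean plus channel max) is an illustrative
    stand-in for a learned spatial attention module.
    """
    mask = sigmoid(fused.mean(axis=0) + fused.max(axis=0))   # (H, W)
    return fused * mask[None], mask


# Example: three frames, eight channels, a 4x4 feature map.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 8, 4, 4))
fused, weights = temporal_attention_fusion(feats, center=1)
refined, mask = spatial_attention(fused)
```

In this sketch, adjacent frames whose features agree with the current frame at a given pixel receive larger weights there, so less informative frames contribute less to the fused feature at that location.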