Indoor gesture recognition aims to let a machine accurately recognize dynamic gestures within a given range. Most existing systems rely on passive recognition methods, largely because high cost is a crucial obstacle for active recognition methods, and ignoring cost widens the gap between research prototypes and real-world deployment. To alleviate this problem, we use the fine-grained channel state information (CSI) available in Wi-Fi to build a dynamic CNN-GRU-Attention (CGA) model and implement a gesture recognition system. First, we study the influence of gestures on the amplitude and phase difference in CSI, and demonstrate the feasibility of the proposed method by analyzing how amplitude and phase difference fluctuate under different conditions. Then, we apply data processing methods such as phase correction and unwrapping, together with a newly proposed adaptive gesture truncation algorithm, to extract the phase difference and remove redundant information, thus ensuring the validity of the data. Finally, we propose segmenting each gesture fragment into 3-channel CSI images that serve as the input to the model. Extensive comparison experiments are conducted across different people, different indoor environments, and different sampling rates. The results show that the system achieves high recognition accuracy under these conditions.
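The phase-difference extraction and unwrapping step mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline: the function name `phase_difference`, the array layout (packets by subcarriers), and the synthetic data are all assumptions made for illustration. The sketch relies on one commonly cited property, namely that antennas on the same receiver share a per-packet phase offset, so that offset cancels when one antenna's phase is subtracted from the other's.

```python
import numpy as np

def phase_difference(csi_a, csi_b):
    """Unwrapped phase difference between two receive antennas.

    csi_a, csi_b: complex arrays of shape (packets, subcarriers).
    This layout is an assumption, not taken from the paper.
    """
    # Multiplying by the conjugate subtracts the two phases; the result
    # of np.angle is wrapped into (-pi, pi].
    wrapped = np.angle(csi_a * np.conj(csi_b))
    # Unwrap along the subcarrier axis to remove the 2*pi jumps.
    return np.unwrap(wrapped, axis=1)

# Synthetic data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
packets, subcarriers = 4, 30
k = np.arange(subcarriers)
true_diff = 0.4 * k  # exceeds pi, so the raw difference wraps
# Random per-packet offset shared by both antennas.
common = rng.uniform(-np.pi, np.pi, size=(packets, 1))
csi_b = np.exp(1j * common) * np.ones((packets, subcarriers))
csi_a = np.exp(1j * (common + true_diff))

recovered = phase_difference(csi_a, csi_b)
print(np.allclose(recovered, true_diff))
```

In this synthetic case the shared offset cancels and unwrapping recovers the linear phase ramp for every packet. Real CSI would additionally carry noise and hardware effects that this sketch does not model.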