Human sentiment is usually expressed through multiple modalities such as natural language, facial expressions, and voice intonation. However, previous research methods treated the time-series alignment of different modalities uniformly and ignored missing fragments of modal information; the main challenge is therefore the partial absence of multimodal information. In this work, we first propose the Integrating Consistency and Difference Networks (ICDN), which models interactions between modalities through mapping and generalization learning and includes a specially designed cross-modal Transformer that maps the other modalities onto the target modality. Unimodal sentiment labels are then obtained through self-supervision to guide the final sentiment analysis. Compared with other popular multimodal sentiment analysis methods, ICDN achieves better sentiment classification results on the CMU-MOSI and CMU-MOSEI benchmark datasets.
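To make the idea of "mapping other modalities to the target modality" concrete, the sketch below shows one cross-modal attention layer in PyTorch, where the target modality supplies the queries and a source modality supplies the keys and values. This is a minimal illustration, not the authors' implementation; the module name, projection layers, and feature dimensions (300-dim text, 74-dim audio) are assumptions chosen only for the example.

```python
# Minimal sketch of a cross-modal attention layer (assumed design, not the
# original ICDN code): the target modality attends over a source modality,
# so source features are mapped onto the target modality's time steps.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim_target, dim_source, dim_model=64, num_heads=4):
        super().__init__()
        # Project both modalities into a shared model dimension (illustrative choice).
        self.proj_target = nn.Linear(dim_target, dim_model)
        self.proj_source = nn.Linear(dim_source, dim_model)
        self.attn = nn.MultiheadAttention(dim_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim_model)

    def forward(self, target_seq, source_seq):
        # target_seq: (batch, T_target, dim_target), e.g. text features
        # source_seq: (batch, T_source, dim_source), e.g. audio or visual features
        q = self.proj_target(target_seq)
        kv = self.proj_source(source_seq)
        fused, _ = self.attn(q, kv, kv)   # source mapped onto target time steps
        return self.norm(q + fused)       # residual connection + layer norm

# Example: map a 74-dim audio stream onto a 300-dim text stream.
layer = CrossModalAttention(dim_target=300, dim_source=74)
text = torch.randn(8, 50, 300)
audio = torch.randn(8, 120, 74)
out = layer(text, audio)   # shape: (8, 50, 64)
```

Because the queries come from the target modality, the output keeps the target's sequence length even when the source stream has a different length or missing segments, which is the intuition behind using cross-modal attention for unaligned or partially absent modalities.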