In this paper, we propose a two-stage network with modality-specific and modality-shared contrastive learning (MMCL) for multimodal sentiment analysis. MMCL comprises a category-aware modality-specific contrastive (CMC) module and a self-decoupled modality-shared contrastive (SMC) module. In the first stage, the CMC module guides the encoders to extract modality-specific representations by constructing positive and negative pairs according to sample categories. In the second stage, the SMC module guides the encoders to extract modality-shared representations by constructing positive and negative pairs based on modalities while decoupling the self-contrast of all modalities. In both modules, we leverage self-modulation factors to emphasize hard positive pairs by assigning each positive pair a loss weight according to its distance. Moreover, we introduce a dynamic routing algorithm that clusters the inputs of the contrastive modules during training, where a gradient-stopping strategy is employed to isolate the backpropagation of the CMC and SMC modules. Extensive experiments on the CMU-MOSI and CMU-MOSEI datasets show that MMCL achieves state-of-the-art performance.
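To make the self-modulation idea concrete, the following is a minimal PyTorch sketch of a category-aware contrastive loss in which positive pairs are weighted by a distance-dependent factor, so that harder (more distant) positives contribute more to the loss. The specific factor (1 − cos)/2, the temperature value, and the function name `cmc_loss` are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
# Sketch of a category-aware contrastive loss with self-modulation weights.
# The weighting form and hyperparameters below are assumptions, not MMCL's exact definitions.
import torch
import torch.nn.functional as F


def cmc_loss(features, labels, temperature=0.1):
    """features: (N, D) modality-specific representations; labels: (N,) sample categories."""
    z = F.normalize(features, dim=1)                      # unit-norm embeddings
    sim = z @ z.t() / temperature                         # pairwise similarity logits
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)

    # Positive pairs: same category, excluding each sample paired with itself.
    pos_mask = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye).float()

    # Assumed self-modulation factor: larger weight for positives that are far apart.
    with torch.no_grad():
        modulation = (1.0 - (z @ z.t()).clamp(-1.0, 1.0)) / 2.0   # in [0, 1]

    # Softmax denominator over all non-self pairs, as in standard contrastive losses.
    neg_inf = torch.finfo(sim.dtype).min
    denom_logits = sim.masked_fill(eye, neg_inf)
    log_prob = sim - torch.logsumexp(denom_logits, dim=1, keepdim=True)

    weighted = (modulation * pos_mask * (-log_prob)).sum(dim=1)
    norm = (modulation * pos_mask).sum(dim=1).clamp(min=1e-8)
    return (weighted / norm).mean()
```

In a PyTorch implementation, the gradient-stopping strategy mentioned above could be realized by calling `.detach()` on the clustered inputs before they are passed to the other contrastive module, although the exact placement of the stop-gradient in MMCL may differ.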