Multimodal sentiment analysis is a challenging research problem. First, current multimodal approaches fail to adequately consider the intricate multi-level correspondence between modalities and the unique contextual information within each modality; second, cross-modal fusion methods tend to weaken modality-specific internal features, a limitation of traditional single-branch models. To this end, we propose a dual-branch enhanced multi-task learning network (DBEM), a new architecture that jointly models the multiple dependencies of sequences and the heterogeneity of multimodal data for better multimodal sentiment analysis. The global-local branch captures intra-modal dependencies over time subsequences of different lengths and aggregates global and local features to enrich feature diversity. The cross-refine branch accounts for the differing information densities of the modalities and adopts coarse-to-fine fusion learning to model inter-modal dependencies. Coarse-grained fusion reinforces the low-level features of the audio and visual modalities, while fine-grained fusion improves the integration of complementary information across modalities at different levels. Finally, multi-task learning is performed on the enhanced fusion features obtained from the dual-branch network to improve the generalization and performance of the model. Experimental results on the CH-SIMS and CMU-MOSEI datasets validate the effectiveness of DBEM compared with its single-branch variant (SBEM) and state-of-the-art methods.
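To make the dual-branch idea concrete, below is a minimal, hypothetical PyTorch sketch of the architecture described above. All module choices (LSTM/convolution encoders, attention-based fine fusion), dimensions, and task names are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch of a dual-branch, multi-task fusion network (illustrative only;
# layer choices and feature dimensions are assumptions, not the paper's implementation).
import torch
import torch.nn as nn


class GlobalLocalBranch(nn.Module):
    """Aggregates a global sequence summary with local subsequence features (intra-modal)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.global_enc = nn.LSTM(dim, hidden, batch_first=True)           # long-range dependencies
        self.local_enc = nn.Conv1d(dim, hidden, kernel_size=3, padding=1)  # short subsequences

    def forward(self, x):                                  # x: (batch, time, dim)
        g, _ = self.global_enc(x)                          # (batch, time, hidden)
        l = self.local_enc(x.transpose(1, 2)).transpose(1, 2)
        return (g + l).mean(dim=1)                         # pool over time -> (batch, hidden)


class CrossRefineBranch(nn.Module):
    """Coarse-to-fine fusion: reinforce audio/visual features, then refine them against text."""
    def __init__(self, hidden, heads=4):
        super().__init__()
        self.coarse = nn.Linear(2 * hidden, hidden)                        # coarse audio+visual fusion
        self.fine = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, t, a, v):                            # each: (batch, time, hidden)
        av = self.coarse(torch.cat([a, v], dim=-1))        # low-level feature reinforcement
        refined, _ = self.fine(query=t, key=av, value=av)  # fine-grained complement to text
        return refined.mean(dim=1)                         # (batch, hidden)


class DualBranchModel(nn.Module):
    """Combines both branches and predicts one output per task (multi-task learning)."""
    def __init__(self, dims=(768, 74, 35), hidden=128, tasks=("m", "t", "a", "v")):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.gl = nn.ModuleList([GlobalLocalBranch(hidden, hidden) for _ in dims])
        self.cr = CrossRefineBranch(hidden)
        self.heads = nn.ModuleDict({k: nn.Linear(2 * hidden, 1) for k in tasks})

    def forward(self, text, audio, vision):
        t, a, v = (p(x) for p, x in zip(self.proj, (text, audio, vision)))
        intra = torch.stack([b(x) for b, x in zip(self.gl, (t, a, v))]).mean(dim=0)
        inter = self.cr(t, a, v)
        fused = torch.cat([intra, inter], dim=-1)                          # enhanced fusion features
        return {k: head(fused) for k, head in self.heads.items()}          # per-task sentiment scores
```

In such a setup, the model would take pre-extracted text, audio, and vision feature sequences (the example dimensions 768/74/35 are placeholders) and each task head would be trained against its corresponding label, mirroring the multi-task learning on enhanced fusion features described in the abstract.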