Recently developed single-cell multiomics technologies are enhancing our understanding of cellular heterogeneity by providing multiple views of a biological system. CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing) is one such multiomics assay, with the ability to connect cell states to functions by simultaneously profiling RNA and surface proteins from the same cell. However, the distinct technical characteristics of these data modalities pose significant challenges to their integration into a cohesive representation of cellular identity. Here we present multiHIVE, a hierarchical multimodal deep generative model for inferring cellular embeddings by integrating CITE-seq data modalities. multiHIVE employs hierarchically stacked latent variables as well as modality-specific latent variables to capture shared and private information from the modalities respectively, facilitating integration, denoising and imputation tasks. Extensive benchmarking using gold-standard real and simulated datasets demonstrates multiHIVE's superiority in integrating CITE-seq datasets. Moreover, multiHIVE outperformed the state-of-the-art methods in imputing missing protein measurements and integration of CITE-seq dataset with unimodal dataset. Using a thymocyte development dataset, we showed that multiHIVE's cellular embeddings can lead to improved trajectory inference and gene trend identification. Finally, using datasets across development and disease, we demonstrated that factorization of multiHIVE-inferred denoised expression into gene expression programs aids in identifying biological processes at multiple levels of cellular hierarchy. multiHIVE is implemented in Python and is publicly available at https://github.com/Zafar-Lab/multiHIVE.