In this paper, we propose a Hierarchical Multimodal Variational Encoder-Decoder (HMMVED) to predict the popularity of micro-videos by comprehensively leveraging user information and micro-video content in a hierarchical fashion. In particular, the multimodal variational encoder-decoder framework encodes the input modalities into a lower-dimensional stochastic embedding, from which the popularity of micro-videos can be decoded. Considering the leading role of a user's social influence in information dissemination on social media, a user encoder-decoder is designed to learn the prior Gaussian embedding of the micro-video from the user information, which is informative about the coarse-grained popularity. To capture the fluctuation around this coarse-grained popularity caused by the diverse multimodal content, the micro-video encoder-decoder encodes a refined posterior distribution of the micro-video embedding from the content features, while encouraging it to stay close to the learned prior distribution. The fine-grained popularity of each micro-video is then decoded from this posterior embedding. Based on a multimodal extension of the variational information bottleneck theory, we show that the learned latent embeddings of micro-videos are maximally expressive about popularity while maximally compressing the information from the input modalities. Extensive experiments conducted on two real-world datasets demonstrate the effectiveness of the proposed method. Code and datasets are available at: https://github.com/JennyXieJiayi/HMMVED .
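For concreteness, a minimal sketch of a variational-information-bottleneck-style training objective consistent with the description above is given below; the notation is assumed for illustration and is not taken from the paper: $\mathbf{x}$ denotes the multimodal content features, $\mathbf{u}$ the user information, $\mathbf{z}$ the latent micro-video embedding, $y$ the popularity, and $\beta$ a trade-off weight.

% Illustrative sketch only: assumed notation, not necessarily the authors' exact objective.
% q_phi(z | x): posterior encoder over the micro-video embedding, from content features
% p_theta(z | u): prior encoder over the same embedding, from user information
% p_psi(y | z): popularity decoder
\begin{equation}
\mathcal{L}(\theta,\phi,\psi)
  = \mathbb{E}_{q_{\phi}(\mathbf{z}\mid\mathbf{x})}\!\left[-\log p_{\psi}(y\mid\mathbf{z})\right]
  + \beta\,\mathrm{KL}\!\left(q_{\phi}(\mathbf{z}\mid\mathbf{x}) \,\middle\|\, p_{\theta}(\mathbf{z}\mid\mathbf{u})\right)
\end{equation}

Under this reading, the first term makes the latent embedding predictive of popularity, while the KL term compresses the content-conditioned posterior toward the user-conditioned prior, reflecting the coarse-to-fine hierarchy described in the abstract.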