Predicting a 3D pose directly from a monocular image is a challenging problem. Most pose estimation methods proposed in recent years report ‘quantitatively’ good results (errors below $\sim$50 mm). However, these methods remain ‘perceptually’ flawed because their performance is measured only via a simple distance metric. Although this fact is well understood, the reliance on ‘quantitative’ information alone has slowed the development of 3D pose estimation methods. To address this issue, we first propose a perceptual Pose SIMilarity (PSIM) metric, based on the assumption that human perception (HP) is highly adapted to extracting structural information from a given signal. Second, we present a perceptually robust 3D pose estimation framework: Temporal Propagating Long Short-Term Memory networks (TP-LSTMs). To this end, we analyze spatio-temporal posture correlations from an information-theoretic perspective, including joint interdependency, temporal consistency, and HP. The experimental results show that the proposed PSIM metric correlates more strongly with users’ subjective opinions than conventional pose metrics do. Furthermore, we demonstrate significant quantitative and perceptual performance improvements of TP-LSTMs over existing state-of-the-art methods.