Accurate perception of the ocean environment is crucial for weather and climate prediction. Real-time satellite and buoy observations are constrained by environmental limitations and deployment costs, which leaves the available data sparse. This paper proposes a novel approach, multimodal fusion-based spatiotemporal incremental learning, to enhance ocean environment perception under sparse observations. The method uses sparse real-time observations to comprehend, reconstruct, and predict the full environment. First, spatiotemporal disentanglement decouples intrinsic features by integrating physical principles with data-driven learning. Next, incremental extension captures the dynamic environment through stable representation updating and dynamic behavior learning. Then, multimodal information fusion combines intrinsic features from multiple sources, enabling full perception of the ocean environment. Finally, the methodology is supported by convergence analysis and an evaluation of error bounds. Validation on global sea surface temperature and western Pacific Ocean high-dimensional temperature datasets demonstrates its potential for advancing ocean research and applications that rely on sparse real-time observations.