Multimodal networks that jointly model visual and linguistic modalities are widely adopted for solving vision-and-language tasks. They perform well on simple and intuitive tasks, but are prone to mistakes on tasks involving latent or implicit details, because crucial yet imperceptible visual signals in the real world are difficult to capture. Such perception errors lead to nonsensical results, but they can be corrected by commonsense knowledge. To this end, we combine visual perception and linguistic commonsense to solve the challenging task of daily-event causality reasoning. We propose a novel Object-Aware Reasoning Network that focuses on object interactions while ignoring distracting information to refine visual perception. Furthermore, a language branch with an independent prediction head is supervised to learn causality commonsense, helping to correct obvious perception errors and yielding more plausible conclusions. Extensive experiments demonstrate that our method achieves new state-of-the-art results on the Vis-Causal dataset.
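
The abstract describes a two-branch design: an object-aware visual reasoning branch and a language branch with its own prediction head whose commonsense output can override implausible perception-driven answers. The following is a minimal sketch of that idea, not the authors' released code; all module names, feature dimensions, and the late-fusion scheme are illustrative assumptions.

```python
# Hypothetical sketch of the two-branch architecture described in the abstract.
# Names, dimensions, and the fusion rule are assumptions for illustration only.
import torch
import torch.nn as nn


class ObjectAwareCausalityModel(nn.Module):
    def __init__(self, obj_dim=2048, txt_dim=768, hidden=512, num_classes=2):
        super().__init__()
        # Visual branch: attend over detected object features so that
        # object interactions, rather than background clutter, drive reasoning.
        self.obj_proj = nn.Linear(obj_dim, hidden)
        self.obj_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.visual_head = nn.Linear(hidden, num_classes)

        # Language branch: text embeddings of the candidate cause/effect,
        # with an independent head supervised to learn causality commonsense.
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.language_head = nn.Linear(hidden, num_classes)

    def forward(self, obj_feats, txt_feats):
        # obj_feats: (B, num_objects, obj_dim); txt_feats: (B, txt_dim)
        v = self.obj_proj(obj_feats)
        v, _ = self.obj_attn(v, v, v)        # model object-object interactions
        v = v.mean(dim=1)                    # pool over objects
        t = self.txt_proj(txt_feats)

        vis_logits = self.visual_head(v + t)   # perception-driven prediction
        lang_logits = self.language_head(t)    # commonsense-only prediction
        # Both heads receive supervision; averaging at inference lets the
        # commonsense branch correct obvious perception errors.
        return vis_logits, lang_logits, (vis_logits + lang_logits) / 2
```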