We present a 3D densely connected convolutional network (DenseNet3D) for automatic 2D-to-3D video conversion. In contrast to previous automatic 2D-to-3D conversion algorithms, which consist of separate stages and require ground-truth depth as supervision, our method uses spatial transformer networks to combine the stages, so that the model can be trained end-to-end and supervised directly by stereo image pairs. DenseNet3D replaces the 2D convolution layers of densely connected convolutional networks with 3D convolutions to capture the spatiotemporal characteristics of video. Experiments show that our network achieves better results and faster speed than existing state-of-the-art methods.
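To make the architectural idea concrete, the following is a minimal NumPy sketch (not the authors' implementation) of a DenseNet-style block built on 3D convolutions: each layer receives the channel-wise concatenation of all earlier feature maps, and the convolution kernels span the temporal axis as well as the spatial ones. The naive `conv3d`, the channels-first layout, and the two-layer configuration are illustrative assumptions.

```python
import numpy as np

def conv3d(x, w):
    """Naive 'same'-padded 3D convolution (illustrative, not optimized).
    x: (C_in, T, H, W); w: (C_out, C_in, k, k, k) with odd kernel size k."""
    c_out, c_in, k, _, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p), (p, p)))
    T, H, W = x.shape[1:]
    out = np.zeros((c_out, T, H, W))
    for t in range(T):
        for i in range(H):
            for j in range(W):
                patch = xp[:, t:t + k, i:i + k, j:j + k]
                # Contract kernel against the local spatiotemporal patch.
                out[:, t, i, j] = np.tensordot(
                    w, patch, axes=([1, 2, 3, 4], [0, 1, 2, 3]))
    return out

def dense_block3d(x, weights):
    """DenseNet-style block with 3D convolutions: each layer sees the
    concatenation of the input and all previous layers' outputs along
    the channel axis, so features are reused across depth and time."""
    feats = [x]
    for w in weights:
        h = np.concatenate(feats, axis=0)   # dense connectivity
        feats.append(np.maximum(conv3d(h, w), 0))  # conv + ReLU
    return np.concatenate(feats, axis=0)

# Hypothetical sizes: 2 input channels, growth rate 3, two layers.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 4))            # (C, T, H, W)
w1 = 0.1 * rng.standard_normal((3, 2, 3, 3, 3))  # layer 1 sees 2 channels
w2 = 0.1 * rng.standard_normal((3, 5, 3, 3, 3))  # layer 2 sees 2+3 channels
y = dense_block3d(x, [w1, w2])
print(y.shape)  # (8, 3, 4, 4): 2 input + 3 + 3 grown channels
```

Note how the output channel count grows additively (input channels plus the growth rate per layer), which is the dense-connectivity pattern the abstract extends from 2D to 3D.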