Most existing multi-person tracking approaches rely on appearance-based re-identification (re-ID) to resolve fragmented tracklets. However, appearance information alone can be insufficient for videos with severe pose changes, such as sports or dance footage. With the goal of learning pose-invariant representations, we propose an end-to-end deep learning framework, the Sparse-Temporal ReID Network. The proposed network not only disentangles human pose through image recovery, but also efficiently links the same subjects across time steps via a sparse temporal identity sampling technique. Experimental results demonstrate the effectiveness of our method on both multi-view re-ID benchmarks and our newly collected dance video dataset, DanceReID.
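The abstract only names the sampling scheme without specifying it; as a rough illustrative sketch (not the paper's actual procedure), one plausible reading is to draw same-identity frames that are spaced apart in time, so that each training batch covers diverse poses of a subject rather than near-duplicate neighboring frames. The function name and parameters below (`sparse_temporal_samples`, `num_samples`, `min_gap`) are hypothetical.

```python
import random

def sparse_temporal_samples(track_length, num_samples=4, min_gap=8, rng=random):
    """Illustrative sketch: pick `num_samples` frame indices from a tracklet
    of `track_length` frames, enforcing at least `min_gap` frames between any
    two picks so sampled crops of one identity span diverse poses."""
    if track_length < (num_samples - 1) * min_gap + 1:
        # Tracklet too short for the requested spread: fall back to
        # uniformly spaced indices over the whole tracklet.
        step = max(1, track_length // num_samples)
        return list(range(0, track_length, step))[:num_samples]
    while True:
        # Rejection sampling: redraw until all gaps satisfy the constraint.
        picks = sorted(rng.sample(range(track_length), num_samples))
        if all(b - a >= min_gap for a, b in zip(picks, picks[1:])):
            return picks

# Example: 4 temporally sparse frame indices from a 120-frame tracklet.
print(sparse_temporal_samples(120))
```

In this reading, the sparsity constraint is what supplies pose variety to the re-ID objective: adjacent frames of the same subject are nearly identical, while temporally distant crops force the network to match identities across pose changes.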