We propose an effective video denoising method based on a highly sparse signal representation in a local 3D transform domain. A noisy video is processed in a blockwise manner and, for each processed block, we form a 3D data array, which we call a “group”, by stacking together blocks found to be similar to the currently processed one. This grouping is realized by spatio-temporal predictive-search block-matching, similar to techniques used for motion estimation. Each formed 3D group is filtered by shrinkage in a 3D transform domain (hard-thresholding and Wiener filtering), which produces estimates of all grouped blocks. This filtering, which we term “collaborative filtering”, exploits the correlation between grouped blocks and the corresponding highly sparse representation of the true signal in the transform domain. Since the obtained block estimates generally overlap, we aggregate them by weighted averaging to form a non-redundant estimate of the video. A significant improvement of this approach is achieved by a two-step algorithm in which an intermediate estimate, produced by grouping and collaborative hard-thresholding, is used both for improving the grouping and for collaborative empirical Wiener filtering. We develop an efficient realization of this video denoising algorithm. The experimental results show that, at a reasonable computational cost, it achieves state-of-the-art denoising performance in terms of both peak signal-to-noise ratio and subjective visual quality.
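To make the collaborative hard-thresholding step concrete, the following minimal Python sketch processes a single 3D group: a stack of similar blocks is decorrelated by a separable 3D DCT, small coefficients are zeroed by hard-thresholding, and the inverse transform yields one estimate per grouped block. The threshold multiplier, the aggregation weight, and all names (collaborative_hard_threshold, lambda_thr) are illustrative assumptions for this sketch, not the exact settings or implementation of the proposed algorithm.

```python
# Minimal sketch of collaborative hard-thresholding on one 3D group,
# assuming a separable 3D DCT as the transform (an assumption for
# illustration, not necessarily the transform used in the paper).
import numpy as np
from scipy.fft import dctn, idctn


def collaborative_hard_threshold(group, sigma, lambda_thr=2.7):
    """group: (K, N, N) array of K similar N x N noisy blocks."""
    # Separable 3D transform of the whole group.
    coeffs = dctn(group, norm="ortho")
    # Hard-thresholding: discard coefficients attributed to noise.
    mask = np.abs(coeffs) > lambda_thr * sigma
    coeffs_ht = coeffs * mask
    # Inverse transform gives estimates of all K grouped blocks at once.
    estimates = idctn(coeffs_ht, norm="ortho")
    # Illustrative aggregation weight: groups with sparser representations
    # (fewer retained coefficients) are trusted more in the weighted average.
    n_retained = max(int(mask.sum()), 1)
    weight = 1.0 / (sigma ** 2 * n_retained)
    return estimates, weight


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy group: four copies of the same smooth 8x8 block plus noise.
    clean = np.tile(np.outer(np.hanning(8), np.hanning(8)), (4, 1, 1))
    sigma = 0.1
    noisy_group = clean + sigma * rng.standard_normal(clean.shape)
    est, w = collaborative_hard_threshold(noisy_group, sigma)
    print("RMSE before:", np.sqrt(np.mean((noisy_group - clean) ** 2)))
    print("RMSE after: ", np.sqrt(np.mean((est - clean) ** 2)))
```

In a full pipeline this routine would be called for every reference block's group found by predictive-search block-matching, and the returned block estimates would be accumulated into the output frames with the returned weights before a final normalization.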