Leisure tourism is an indispensable activity in urban people's life. Due to the popularity of intelligent mobile devices, a large number of photos and videos are recorded during a trip. Therefore, the ability to vividly and interestingly display these media data is a useful technique. In this paper, we propose SnapVideo, a new method that intelligently converts a personal album describing of a trip into a comprehensive, aesthetically pleasing, and coherent video clip. The proposed framework contains three main components. The scenic spot identification model first personalizes the video clips based on multiple prespecified audience classes. We then search for some auxiliary related videos from YouTube1 according to the selected photos. To comprehensively describe a scenery, the view generation module clusters the crawled video frames into a number of views. Finally, a probabilistic model is developed to fit the frames from multiple views into an aesthetically pleasing and coherent video clip, which optimally captures the semantics of a sightseeing trip. Extensive user studies demonstrated the competitiveness of our method from an aesthetic point of view. Moreover, quantitative analysis reflects that semantically important spots are well preserved in the final video clip.1https://www.youtube.com/.