People increasingly consume video media over the internet, where variable connectivity can introduce a mismatch between the auditory and visual streams. For instance, in a video of a news anchor reporting a story, there can be a time lag between the spoken words and the corresponding movements of the anchor’s mouth and lips (as well as body gestures). Asynchrony between the auditory and visual streams can also arise from various physical, biophysical, and neural mechanisms, but people are often not aware of these differences. There is accumulating evidence that people adapt to auditory-visual asynchrony at different time scales and for different stimulus categories. However, previous studies often used very simple auditory-visual stimuli (e.g., flashing lights paired with brief tones) or short videos lasting only a few seconds. In the current study, we investigated temporal adaptation to continuous speech presented in longer videos. Speech is one of the strongest cases for auditory-visual integration, as demonstrated by multi-sensory illusions such as the McGurk-MacDonald and ventriloquist effects. To measure temporal adaptation to speech videos, we developed a continuous-judgment paradigm in which participants judged continuously, over several tens of seconds, whether an auditory-visual speech stimulus was synchronous. The stimuli consisted of 40 videos (duration: M = 63.3 s, SD = 10.3 s). For each video, we filmed a close-up (upper body) of one male and one female speaker reporting a news story transcribed from a real news clip (e.g., about the Brexit vote outcome or Boris Johnson’s resignation). Each speaker reported 20 news stories. We then created seven versions of each video by shifting the stimulus onset asynchrony (SOA) between the auditory and visual streams from -240 ms (auditory stream leading) to +240 ms (visual stream leading) in 80 ms steps, including SOA = 0 ms (i.e., the original synchronous video).
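The SOA manipulation described above can be sketched as a simple shift of the audio track against a fixed video timeline. The function below is a minimal illustration, not the authors' actual stimulus-generation pipeline; the sample rate, padding-with-silence choice, and function name are assumptions for the sketch.

```python
import numpy as np

def shift_audio(audio: np.ndarray, soa_ms: float, sr: int = 44100) -> np.ndarray:
    """Shift an audio track relative to the (fixed) video timeline.

    Sign convention matching the abstract: negative SOA = auditory stream
    leads the visual stream; positive SOA = visual stream leads. Silence
    pads the gap so the output keeps the original length.
    """
    n = int(round(abs(soa_ms) / 1000.0 * sr))  # shift expressed in samples
    if n == 0:
        return audio.copy()
    if soa_ms < 0:
        # Audio leads: drop its first n samples, pad silence at the end.
        return np.concatenate([audio[n:], np.zeros(n, dtype=audio.dtype)])
    # Audio lags (visual leads): pad silence at the start, drop its tail.
    return np.concatenate([np.zeros(n, dtype=audio.dtype), audio[:-n]])

# The seven SOA levels used in the study, in 80 ms steps:
soas = list(range(-240, 241, 80))  # [-240, -160, -80, 0, 80, 160, 240]
```

Applying `shift_audio` at each of the seven SOA levels to a video's soundtrack would yield the seven stimulus versions, with SOA = 0 ms returning the original audio unchanged.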
The first 5-10 s of all videos were synchronous. For each participant in the continuous-judgment task, we randomly selected 10 videos at each SOA (70 videos in total). Participants continuously judged the synchrony of each video by pressing or releasing the spacebar throughout its duration (responses sampled every 33 ms). The mean proportion of perceived synchrony across the duration of the videos was calculated from participants’ continuous responses after the initial synchronous period. For the visual-leading videos (SOAs > 0 ms), participants initially showed a drop in the proportion of perceived synchrony, but this proportion increased over time, suggesting that they were adapting to the asynchrony. The magnitude of temporal adaptation depended on the SOA, with the largest SOA producing the largest adaptation. Consistent with previous studies, our findings suggest that temporal adaptation occurs for long, continuous speech videos, but only when the visual stream leads the auditory stream.
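The dependent measure above (mean proportion of perceived synchrony over time, after discarding the initial synchronous period) could be computed along the following lines. This is an illustrative sketch, not the authors' analysis code; the matrix layout, the 5 s lead-in value (the abstract reports 5-10 s), and the function name are assumptions.

```python
import numpy as np

def synchrony_time_course(responses: np.ndarray,
                          dt_ms: float = 33.0,
                          lead_in_ms: float = 5000.0) -> np.ndarray:
    """Mean proportion of 'synchronous' judgments at each time point.

    responses: participants x time binary matrix (1 = spacebar held down =
    perceived synchronous), sampled every dt_ms. Samples falling in the
    initial synchronous lead-in are discarded before averaging across
    participants, so each value is the proportion of participants judging
    the video synchronous at that time point.
    """
    start = int(round(lead_in_ms / dt_ms))  # first sample after the lead-in
    return responses[:, start:].mean(axis=0)
```

Plotting the resulting time course per SOA level would reveal the reported pattern: an initial drop in perceived synchrony at video onset of the asynchronous portion, followed by a gradual rise as participants adapt.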