I use ffmpeg to concatenate a lot of video files, using filter_complex. However, the audio in the resulting file gradually drifts out of sync.
I use mediainfo --Inform='Video;%Duration%' filename.ext
and mediainfo --Inform='Audio;%Duration%' filename.ext
to get the durations (in ms) quoted below.
Here's how to reproduce my problem, given an original source file:
Stream #0:0(eng): Video: wmv3 (Main) (WMV3 / 0x33564D57), yuv420p, 1920x1080, 6000 kb/s, 29.97 fps, 29.97 tbr, 1k tbn, 1k tbc
Stream #0:1(eng): Audio: wmav2 (a[1][0][0] / 0x0161), 48000 Hz, stereo, fltp, 128 kb/s
The file is very large, but its video and audio tracks have exactly the same duration (XXXXXXX ms) according to mediainfo.
For testing, I take only its first 5 seconds, with "-t 5" given twice (once as an input option, once as an output option):
ffmpeg -t 5 -i input.wmv -map 0:v:0 -map 0:a:0 -map_chapters -1 \
-vcodec copy -acodec copy -t 5 source_v5a5.mkv
Resulting durations (ms):
5004.000000 video of source_v5a5.mkv
5119.000000 audio of source_v5a5.mkv
The difference is 5119 - 5004 = 115 ms, and plain mediainfo filename.ext
reports nothing about a delay.
At this point the snippet plays fine when I watch it. It may contain a 115 ms delay (at the head?), which is not that noticeable, like
[vvvvvvvvv………………v]
[-aaaaaaaaa………………a]
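The gap quoted above is simply the audio duration minus the video duration. A minimal shell sketch of that arithmetic follows; av_gap_ms is a made-up helper name, and in real use the two arguments would come from the two mediainfo --Inform calls shown at the top.

```shell
# av_gap_ms VIDEO_MS AUDIO_MS
# Prints how much longer the audio is than the video, in whole ms.
# (Hypothetical helper; awk copes with the decimal strings mediainfo prints.)
av_gap_ms() {
  awk -v v="$1" -v a="$2" 'BEGIN { printf "%.0f\n", a - v }'
}

# Durations of source_v5a5.mkv from above; prints 115
av_gap_ms 5004.000000 5119.000000
```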
Now duplicate this file so there are 4 identical parts (source_v5a5_p1.mkv to source_v5a5_p4.mkv), pretending they are different snippets, then encode the video and audio tracks separately:
ffmpeg -i source_v5a5_p1.mkv -i source_v5a5_p2.mkv -i source_v5a5_p3.mkv -i source_v5a5_p4.mkv \
-filter_complex " \
[0:v:0]setpts=PTS-STARTPTS[v0];[0:a:0]asetpts=PTS-STARTPTS[a0]; \
[1:v:0]setpts=PTS-STARTPTS[v1];[1:a:0]asetpts=PTS-STARTPTS[a1]; \
[2:v:0]setpts=PTS-STARTPTS[v2];[2:a:0]asetpts=PTS-STARTPTS[a2]; \
[3:v:0]setpts=PTS-STARTPTS[v3];[3:a:0]asetpts=PTS-STARTPTS[a3]; \
[v0][a0][v1][a1][v2][a2][v3][a3] concat=n=4:v=1:a=1 [out]" \
-map "[out]" \
-vsync vfr -vcodec libx264 -preset veryfast -tune film -crf 23 \
-acodec pcm_s16le -f tee "[select=v:f=mp4]output_video_track.mp4"
Yes, I set -acodec here but output only the video stream. Now encode the audio, piping ffmpeg's output to neroAacEnc:
ffmpeg -i source_v5a5_p1.mkv -i source_v5a5_p2.mkv -i source_v5a5_p3.mkv -i source_v5a5_p4.mkv \
-filter_complex " \
[0:v:0]setpts=PTS-STARTPTS[v0];[0:a:0]asetpts=PTS-STARTPTS[a0]; \
[1:v:0]setpts=PTS-STARTPTS[v1];[1:a:0]asetpts=PTS-STARTPTS[a1]; \
[2:v:0]setpts=PTS-STARTPTS[v2];[2:a:0]asetpts=PTS-STARTPTS[a2]; \
[3:v:0]setpts=PTS-STARTPTS[v3];[3:a:0]asetpts=PTS-STARTPTS[a3]; \
[v0][a0][v1][a1][v2][a2][v3][a3] concat=n=4:v=1:a=1 [out]" \
-map "[out]" \
-vcodec rawvideo \
-acodec pcm_f32le -f tee "[select=a:f=wav]pipe\:"|neroAacEnc -ignorelength \
-q 0.2 -if - -of "output_audio_track.m4a"
Yes, I set -vcodec here but output only the audio stream.
Resulting durations (ms):
20020 output_video_track.mp4
20309 output_audio_track.m4a
20069.000000 video stream of output_MkvMergeMuxed.mkv
20310.000000 audio stream of output_MkvMergeMuxed.mkv
The difference is over 200 ms, so it seems the delay of each segment got included during the concat? When playing the muxed file, it is fine at first, but in the last part I can notice the delay.
Assuming the delay is at the head, it looks like:
[v111111v222222v333333v444444]
[-a111111-a222222-a333333-a444444]
As the documentation says (https://ffmpeg.org/ffmpeg-filters.html#concat):
The concat filter will use the duration of the longest stream in each segment (except the last one), and if necessary pad shorter audio streams with silence.
Suspecting that my test was not enough, I repeated the whole process with source_v5a2.mkv, and again with source_v5a10.mkv.
Durations (ms):
5004.000000 video of source_v5a2.mkv
2279.000000 audio of source_v5a2.mkv
5004.000000 video of source_v5a10.mkv
10281.000000 audio of source_v5a10.mkv
ffmpeg did as the documentation says (silence padded as if apad was applied / last frame frozen), but the result remains about the same: a noticeable delay at the beginning of the last segment
[v111111v222222v333333v444444]
[-a111___-a222___-a333___-a444]
and
[v111___v222___v333___v444___]
[-a111111-a222222-a333333-a444444]
The test above concatenates only 4 files. When concatenating 50+ files, the desync becomes too large to ignore.
Question:
Given a bunch of video files to concatenate (50+; same resolution, codec, number of tracks, etc.; mostly the same duration, some not), how can I reduce or avoid the delay so that the result stays in sync, without padding the video with black frames? Like
[v111111v222222v333333v444444]
[-a111111a222222a333333a444444]
or even better with the delay cropped (maybe mkvmerge can handle this with some calculation afterwards):
[v111111v222222v333333v444444]
[a111111a222222a333333a444444]
It would be better to have no intermediate files created; piping is okay.
Update:
Perhaps I got it all wrong. Maybe it's not a delay but a "stretch/squeeze" instead. I ran a long test, concatenating 30 wmv files with commands like the ones above, and got File A, with over 1 s of desync:
Stream #0:0: Video: h264 (High), yuv420p(progressive), 640x480 [SAR 4:3 DAR 4:3], 29.97 fps, 29.97 tbr, 1k tbn, 59.94 tbc (default)
Metadata:
DURATION-eng : 05:32:10.544000000
NUMBER_OF_FRAMES-eng: 597298
Stream #0:1: Audio: aac (HE-AAC), 48000 Hz, stereo, fltp (default)
Metadata:
DURATION-eng : 05:32:11.861000000
NUMBER_OF_FRAMES-eng: 467153
After that, I added aresample=async=1
to the filter before asetpts, and re-encoded into File B:
Stream #0:0: Video: h264 (High), yuv420p(progressive), 640x480 [SAR 4:3 DAR 4:3], 29.97 fps, 29.97 tbr, 1k tbn, 59.94 tbc (default)
Metadata:
DURATION-eng : 05:32:11.727000000
NUMBER_OF_FRAMES-eng: 597298
Stream #0:1: Audio: aac (HE-AAC), 48000 Hz, stereo, fltp (default)
Metadata:
DURATION-eng : 05:32:11.862000000
NUMBER_OF_FRAMES-eng: 467153
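Writing the per-input chains by hand does not scale to 30 or 50 inputs. Below is a minimal sketch that prints the same graph as the commands above for N inputs, with aresample=async=1 added to each audio chain. build_concat_filter is a hypothetical helper name, and putting aresample directly in front of asetpts is an assumption about how File B was produced.

```shell
# build_concat_filter N
# Prints a filter_complex graph for N inputs, mirroring the graph above.
# (Hypothetical helper; aresample=async=1 directly before asetpts is an
# assumption, not a confirmed copy of the File B command.)
build_concat_filter() {
  n=$1
  chains=""
  pads=""
  i=0
  while [ "$i" -lt "$n" ]; do
    chains="${chains}[${i}:v:0]setpts=PTS-STARTPTS[v${i}];"
    chains="${chains}[${i}:a:0]aresample=async=1,asetpts=PTS-STARTPTS[a${i}];"
    pads="${pads}[v${i}][a${i}]"
    i=$((i + 1))
  done
  printf '%s%s concat=n=%s:v=1:a=1 [out]\n' "$chains" "$pads" "$n"
}

# Example: graph for two inputs
build_concat_filter 2
```

The printed string could then be passed as -filter_complex "$(build_concat_filter 30)", with the -i inputs listed in the same order.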
File A has the sync problem, but File B is in sync! So aresample=async=1,
which is applied to the audio, changed nothing about the audio duration but changed the video duration instead! I think it has something to do with the PTS. After some Googling, I did the following Exp A:
- convert the video durations 05:32:10.544000000 (File A) and 05:32:11.727000000 (File B) into milliseconds: 19930544 and 19931727
- using mkvmerge, drag in File A, put 19931727/19930544 into the "Stretch By" box of the video track, Start Muxing
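The conversion and the "Stretch By" ratio in Exp A can be reproduced in the shell. A minimal sketch, where ts_to_ms is a made-up helper name and the ratio is simply the File B video duration divided by the File A video duration:

```shell
# ts_to_ms HH:MM:SS.fraction
# Converts a duration such as 05:32:10.544000000 to whole milliseconds.
# (Hypothetical helper.)
ts_to_ms() {
  echo "$1" | awk -F: '{ printf "%.0f\n", ($1 * 3600 + $2 * 60 + $3) * 1000 }'
}

a=$(ts_to_ms 05:32:10.544000000)   # File A video duration
b=$(ts_to_ms 05:32:11.727000000)   # File B video duration

# The fraction to type into the "Stretch By" box; prints 19931727/19930544
echo "$b/$a"

# The same ratio as a decimal factor
awk -v a="$a" -v b="$b" 'BEGIN { printf "%.9f\n", b / a }'
```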
The resulting file is in sync (or at least the desync is not noticeable), so it seems the sync problem DOES have something to do with PTS? Researching further, and supposing that the correctly synced file has the longer duration while the desynced one has the shorter duration, I did the following Exp B:
- use
mediainfo --Inform='General;%Duration%' filename.ext
to get the duration of each file
- add all the durations up
The total duration is 05:32:10.438, almost the same as the shorter duration.
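Exp B amounts to summing one number per file. A minimal sketch of that sum follows; sum_ms is a made-up helper name, the part*.wmv glob in the comment is a placeholder for the real source files, and the numbers in the last line are made up purely to show the call.

```shell
# sum_ms: reads one duration in ms per line on stdin and prints the total.
# (Hypothetical helper.)
sum_ms() {
  awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# Intended use with the real sources (placeholder file names):
#   for f in part*.wmv; do
#     mediainfo --Inform='General;%Duration%' "$f"
#   done | sum_ms

# Self-contained call with made-up durations; prints 6500
printf '1000\n2000\n3500\n' | sum_ms
```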
New Questions:
- My initial commands, did they produce "Correct PTS, Longer Audio" or "Squeezed PTS, Correct Audio"?
- If it's "Correct PTS, Longer Audio", how do I make the audio correct?
- If it's "Squeezed PTS, Correct Audio", is using
aresample=async=1
the right way to fix PTS while concatenating videos from scratch?
- If it's "Squeezed PTS, Correct Audio", why did my Exp B show a total duration very close to the shorter (squeezed) one?
- If Exp B is wrong, how should I predict/calculate the correct total duration before encoding process?
- Given a "Squeezed PTS, Correct Audio" file, without the source file, can I fix the sync problem by stretching/squeezing the PTS just simply using the number "AudioDuration/VideoDuration"?
- When not concatenating files, just encoding one single file, is it necessary to add
aresample=async=1
when NO vf or af is used? Is it necessary if vf or af is used? Any downside?
That was a long text. Even if you can't answer, thank you for reading to the end. :)