
I use ffmpeg to concatenate a lot of video files with filter_complex. However, the resulting file's audio gradually drifts out of sync.

Throughout the process below, I use mediainfo --Inform='Video;%Duration%' filename.ext and mediainfo --Inform='Audio;%Duration%' filename.ext to display the duration numbers.

Here's how to reproduce my problem, given an original source file:

Stream #0:0(eng): Video: wmv3 (Main) (WMV3 / 0x33564D57), yuv420p, 1920x1080, 6000 kb/s, 29.97 fps, 29.97 tbr, 1k tbn, 1k tbc
Stream #0:1(eng): Audio: wmav2 (a[1][0][0] / 0x0161), 48000 Hz, stereo, fltp, 128 kb/s

The file size is too big for testing, but its video and audio tracks share the exact same duration of XXXXXXX ms as reported by mediainfo.

For test purposes, I use only its first 5 seconds, with "-t 5" given twice:

ffmpeg -t 5 -i input.wmv -map 0:v:0 -map 0:a:0 -map_chapters -1 \
    -vcodec copy -acodec copy -t 5 source_v5a5.mkv

Resulting durations (ms):

5004.000000     video of source_v5a5.mkv
5119.000000     audio of source_v5a5.mkv

The difference is 119-4=115 ms. mediainfo filename.ext reports nothing about a delay at this point, and the snippet plays fine when I watch it, so maybe it contains a 115 ms delay (at the head?) that is just not noticeable, like:

[vvvvvvvvv………………v]
[-aaaaaaaaa………………a]
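One way to check where those extra 115 ms actually sit (a sketch using ffprobe's per-stream fields; I'm not sure how precisely Matroska reports these):

ffprobe -v error \
    -show_entries stream=index,codec_type,start_time,duration \
    -of default=noprint_wrappers=1 source_v5a5.mkv

If the two streams report different start_time values, the difference is at the head; if both start at 0 and only the durations differ, the extra audio is at the tail.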

Now copy this file 3 times (giving 4 identical snippets p1–p4), pretending we have a lot of different snippets, then encode the video and audio tracks separately:

ffmpeg -i source_v5a5_p1.mkv -i source_v5a5_p2.mkv -i source_v5a5_p3.mkv -i source_v5a5_p4.mkv \
    -filter_complex " \
    [0:v:0]setpts=PTS-STARTPTS[v0];[0:a:0]asetpts=PTS-STARTPTS[a0]; \
    [1:v:0]setpts=PTS-STARTPTS[v1];[1:a:0]asetpts=PTS-STARTPTS[a1]; \
    [2:v:0]setpts=PTS-STARTPTS[v2];[2:a:0]asetpts=PTS-STARTPTS[a2]; \
    [3:v:0]setpts=PTS-STARTPTS[v3];[3:a:0]asetpts=PTS-STARTPTS[a3]; \
    [v0][a0][v1][a1][v2][a2][v3][a3] concat=n=4:v=1:a=1 [out]" \
    -map "[out]" \
    -vsync vfr -vcodec libx264 -preset veryfast -tune film -crf 23 \
    -acodec pcm_s16le -f tee "[select=v:f=mp4]output_video_track.mp4"

Yes, I specify an audio codec here even though only the video stream is output. Now encode the audio, piping ffmpeg's output to NeroAAC:

ffmpeg -i source_v5a5_p1.mkv -i source_v5a5_p2.mkv -i source_v5a5_p3.mkv -i source_v5a5_p4.mkv \
    -filter_complex " \
    [0:v:0]setpts=PTS-STARTPTS[v0];[0:a:0]asetpts=PTS-STARTPTS[a0]; \
    [1:v:0]setpts=PTS-STARTPTS[v1];[1:a:0]asetpts=PTS-STARTPTS[a1]; \
    [2:v:0]setpts=PTS-STARTPTS[v2];[2:a:0]asetpts=PTS-STARTPTS[a2]; \
    [3:v:0]setpts=PTS-STARTPTS[v3];[3:a:0]asetpts=PTS-STARTPTS[a3]; \
    [v0][a0][v1][a1][v2][a2][v3][a3] concat=n=4:v=1:a=1 [out]" \
    -map "[out]" \
    -vcodec rawvideo \
    -acodec pcm_f32le -f tee "[select=a:f=wav]pipe\:"|neroAacEnc -ignorelength \
    -q 0.2 -if - -of "output_audio_track.m4a"

Yes, I specify a video codec here even though only the audio stream is output.
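Finally, mux the two tracks together with mkvmerge; the command was basically just this (a sketch, exact options omitted):

mkvmerge -o output_MkvMergeMuxed.mkv output_video_track.mp4 output_audio_track.m4a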

Resulting durations (ms):

20020           output_video_track.mp4
20309           output_audio_track.m4a
20069.000000    video stream of output_MkvMergeMuxed.mkv
20310.000000    audio stream of output_MkvMergeMuxed.mkv

The difference is now over 200 ms; it seems the delay got included during the concat? When playing the muxed file, the beginning is okay, but in the last part I can feel the delay.

Assuming the delay is at the head of each segment, it looks like:

[v111111v222222v333333v444444]
[-a111111-a222222-a333333-a444444]

As is written in the documentation: https://ffmpeg.org/ffmpeg-filters.html#concat

The concat filter will use the duration of the longest stream in each segment (except the last one), and if necessary pad shorter audio streams with silence.
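Just to illustrate what that padding implies: if each snippet's audio were trimmed to its video's length before the concat, the filter would have nothing to pad. A rough, unverified sketch (only two inputs shown for brevity; the end=5.004 value is hypothetical and would have to be measured per snippet):

# trim each audio chain to the (assumed) video length before concat
ffmpeg -i source_v5a5_p1.mkv -i source_v5a5_p2.mkv \
    -filter_complex " \
    [0:v:0]setpts=PTS-STARTPTS[v0];[0:a:0]atrim=end=5.004,asetpts=PTS-STARTPTS[a0]; \
    [1:v:0]setpts=PTS-STARTPTS[v1];[1:a:0]atrim=end=5.004,asetpts=PTS-STARTPTS[a1]; \
    [v0][a0][v1][a1] concat=n=2:v=1:a=1 [vout][aout]" \
    -map "[vout]" -map "[aout]" \
    -vcodec libx264 -crf 23 -acodec pcm_s16le trim_test.mkv

(This is not what I actually ran, just a way to picture the padding behaviour.)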

Suspecting that my test was not thorough enough, I did the whole process again with source_v5a2.mkv, and again with source_v5a10.mkv.

Durations (ms):

5004.000000         video of source_v5a2.mkv
2279.000000         audio of source_v5a2.mkv
5004.000000         video of source_v5a10.mkv
10281.000000        audio of source_v5a10.mkv

ffmpeg did as the documentation says (silence padded as if apad was applied / last frame frozen), but the result remains about the same: a noticeable delay at the beginning of the last segment.

[v111111v222222v333333v444444]
[-a111___-a222___-a333___-a444]

and

[v111___v222___v333___v444___]
[-a111111-a222222-a333333-a444444]

The test above concatenates only 4 files. When concatenating 50+ files, the desync is so significant that you cannot ignore it.


Question:

Given a bunch of video files to concatenate (50+, all with the same resolution/codec/track layout, mostly the same duration but not always), how do I reduce or avoid the delay so everything stays in sync, without padding the video with a black screen? Like:

[v111111v222222v333333v444444]
[-a111111a222222a333333a444444]

or, even better, with the delay cropped (maybe mkvmerge can handle this with some calculation afterwards):

[v111111v222222v333333v444444]
[a111111a222222a333333a444444]

It would be better to have no intermediate files created; piping is okay.


Update:

Perhaps I got it all wrong. Maybe it's not a delay, but a "stretch/squeeze" instead. I ran a long test, concatenating 30 wmv files with a command like the one above, and got result File A, with over 1 s of desync:

Stream #0:0: Video: h264 (High), yuv420p(progressive), 640x480 [SAR 4:3 DAR 4:3], 29.97 fps, 29.97 tbr, 1k tbn, 59.94 tbc (default)
Metadata:
  DURATION-eng    : 05:32:10.544000000
  NUMBER_OF_FRAMES-eng: 597298
Stream #0:1: Audio: aac (HE-AAC), 48000 Hz, stereo, fltp (default)
Metadata:
  DURATION-eng    : 05:32:11.861000000
  NUMBER_OF_FRAMES-eng: 467153

After that, I added aresample=async=1 to the filter chain before asetpts and re-encoded into File B:

Stream #0:0: Video: h264 (High), yuv420p(progressive), 640x480 [SAR 4:3 DAR 4:3], 29.97 fps, 29.97 tbr, 1k tbn, 59.94 tbc (default)
Metadata:
  DURATION-eng    : 05:32:11.727000000
  NUMBER_OF_FRAMES-eng: 597298
Stream #0:1: Audio: aac (HE-AAC), 48000 Hz, stereo, fltp (default)
Metadata:
  DURATION-eng    : 05:32:11.862000000
  NUMBER_OF_FRAMES-eng: 467153
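For reference, the per-input chains in File B's command looked roughly like this (a sketch with only two placeholder inputs and generic output options; the only change from the earlier commands is aresample=async=1 inserted before each asetpts):

ffmpeg -i part01.wmv -i part02.wmv \
    -filter_complex " \
    [0:v:0]setpts=PTS-STARTPTS[v0];[0:a:0]aresample=async=1,asetpts=PTS-STARTPTS[a0]; \
    [1:v:0]setpts=PTS-STARTPTS[v1];[1:a:0]aresample=async=1,asetpts=PTS-STARTPTS[a1]; \
    [v0][a0][v1][a1] concat=n=2:v=1:a=1 [vout][aout]" \
    -map "[vout]" -map "[aout]" \
    -vcodec libx264 -preset veryfast -crf 23 -acodec aac fileB_test.mkv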

File A has the sync problem as before, but File B syncs fine! So aresample=async=1, which is applied to the audio, apparently changed nothing in the audio but changed the video instead! I think it has something to do with the PTS. After some Googling, I did the following Exp A:

  1. convert 05:32:10.544000000 and 05:32:11.727000000 into 19930544 and 19931727 (milliseconds)
  2. using the mkvmerge GUI, drag in File A, put 19931727/19930544 into the "Stretch By" box of the video track, and start muxing
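If I read the mkvmerge docs correctly, the command-line equivalent of that "Stretch By" box is the stretch ratio of --sync (a sketch; it assumes File A's video track has track ID 0, which mkvmerge --identify would confirm, and FileA.mkv is just a placeholder name):

mkvmerge --identify FileA.mkv
mkvmerge -o FileA_stretched.mkv --sync 0:0,19931727/19930544 FileA.mkv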

The resulting file syncs fine (or at least has no noticeable desync), so it seems the sync problem DOES have something to do with the PTS? Researching further, and noting that the correctly synced file has the longer duration while the desynced one has the shorter duration, I did the following Exp B:

  1. use mediainfo --Inform='General;%Duration%' filename.ext to get the duration of each source file
  2. add all the duration numbers up
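The summing itself was just a small shell loop along these lines (a sketch; the *.wmv glob is a placeholder for my 30 source files, and it assumes mediainfo prints the General duration in plain milliseconds):

total=0
for f in *.wmv; do
    d=$(mediainfo --Inform='General;%Duration%' "$f")
    total=$((total + ${d%.*}))    # drop any fractional part before adding
done
echo "total: ${total} ms"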

The total duration is 05:32:10.438, almost exactly the shorter of the two durations.

New Questions:

  1. Did my initial commands produce "Correct PTS, Longer Audio" or "Squeezed PTS, Correct Audio"?
  2. If it's "Correct PTS, Longer Audio", how do I make the audio correct?
  3. If it's "Squeezed PTS, Correct Audio", is using aresample=async=1 the right way to fix the PTS when concatenating videos from scratch?
  4. If it's "Squeezed PTS, Correct Audio", why does my Exp B show a total duration very close to the shorter (squeezed) one?
  5. If Exp B is wrong, how should I predict/calculate the correct total duration before the encoding process?
  6. Given a "Squeezed PTS, Correct Audio" file, without the source file, can I fix the sync problem simply by stretching/squeezing the PTS using the ratio "AudioDuration/VideoDuration"?
  7. When not concatenating files, just encoding a single file, is it necessary to add aresample=async=1 when NO -vf or -af is used? Is it necessary if -vf or -af is used? Any downside?

It's a long text above; even if you can't answer, thank you for reading to the end. :)
