
I'm using yt-dlp to download some videos, and I've told it to embed the subtitles. This works, except that the embedded subtitles duplicate the text, which makes them horrible to read.

For example, if the audio says "but even that system isn't fast enough to get you to other galaxies. Remarkably, there is a trick using proven physics" then YouTube shows the first half of the text, erases it, shows the second half, and then moves on. Nothing gets duplicated.

What yt-dlp (or ffmpeg?) does instead is show the first half and the second half of the text on two separate lines. It then moves the second line up to replace the first, and the new second line becomes whatever comes next. The result is that I'm constantly reading every line twice! It would work perfectly if it only showed one line at a time. I don't know if there's a name for this behavior, or whether it's intentional (some flag being set?) or a bug. How do I make it generate subtitles that are displayed on the video the same way YouTube displays them?

edit:

This is the command used to generate the video file: yt-dlp.exe -k --write-auto-sub --embed-subs --merge-output-format mp4 https://www.youtube.com/watch?v=b3D7QlMVa5s

And this is an image showing the text duplication

  • 1
    You could start by editing your question with the exact command you've used. Commented Jul 20, 2022 at 21:12
  • 1
    See this issue. There may be a solution using the ttml format, or failing that, some workarounds.
    – Alain1A45
    Commented Aug 19, 2022 at 10:52

2 Answers

1

The issue Alain1A45 refers to is "Timelines in fetched Subtitles are overlapping each other!" (Issue #9038, ytdl-org/youtube-dl), posted on Mar 31, 2016. It first suggests using --sub-format ttml --convert-subs vtt to get a correct vtt file, but several later posts say that this no longer works. A comment by nickaein from Oct 20, 2017 says, "Downloading subtitle with vtt format fixed it for me."

I tried this just now and can confirm that it does work (2024/03/07 08:41:56).

I used --sub-format vtt --convert-subs vtt and got perfect subtitle formatting. --sub-format vtt is not really necessary, but I included it anyway in case that format is already available.
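
Combined with the command from the question, the invocation might look like the sketch below. It only adds the two subtitle flags mentioned above to the original flags. The wrapper name download_with_clean_subs and the YTDLP variable are placeholders invented for this example (set YTDLP to yt-dlp.exe on Windows), and whether the result is free of duplication rests on the reports above rather than on anything guaranteed by yt-dlp.

```shell
# Hypothetical wrapper: the function name and the YTDLP variable are
# placeholders for this example, not part of yt-dlp itself.
download_with_clean_subs() {
    # Flags from the question's command, plus the two subtitle flags
    # suggested in this answer (--sub-format and --convert-subs).
    "${YTDLP:-yt-dlp}" -k \
        --write-auto-sub \
        --sub-format vtt \
        --convert-subs vtt \
        --embed-subs \
        --merge-output-format mp4 \
        "$1"
}

# Example call with the video from the question:
# download_with_clean_subs 'https://www.youtube.com/watch?v=b3D7QlMVa5s'
```

Wrapping the command in a function keeps the flag list in one place, so a single flag can be dropped or swapped while testing which one actually changes the subtitle output.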

0

I've written a PHP script that processes the downloaded .vtt captions and handles the duplication issue by dropping any text line that repeats the previous one.

function cleanVttFile($fileName, $outputName) {

    $lines = file($fileName);
    $headers = ['WEBVTT', 'Kind: captions', 'Language: en'];
    $modified_lines = [];
    $prev_line = "";

    foreach ($lines as $line) {
        // Keep header lines unchanged
        if (in_array(trim($line), $headers)) {
            $modified_lines[] = $line;
            continue;
        }

        // Keep timestamp lines and blank lines unchanged
        if (preg_match('/\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}.*/', $line) || trim($line) == "") {
            $modified_lines[] = $line;
            continue;
        }

        // Strip inline tags (anything in <...>, such as time tags) before comparing
        $stripped_line = preg_replace('/<[^>]*>/', '', $line);

        // Compare with previous line
        if ($stripped_line != $prev_line || $prev_line == "") {
            $modified_lines[] = $line;
        }

        // Update previous line
        $prev_line = $stripped_line;
    }

    file_put_contents($outputName, $modified_lines);
}
  • 1
    Code without any explanation is useless. Can you elaborate on this a little more?
    – Toto
    Commented Jul 26, 2023 at 15:38
  • The comments provided should be clear enough Commented Jul 27, 2023 at 21:52
