ffprobe OCR of a subtitle stream

Question

I have some video files containing HDMV PGS subtitles, and I need to convert them to SubRip (or another text subtitle format). I know I can demux the video file with ffmpeg to extract the .sup, convert it to VobSub on the fly, and then run Subtitle Edit's /convert on that to get SubRip using its own Tesseract.

However, I'd like to use only ffprobe/ffmpeg, which I have already compiled with libtesseract and all. I don't mind parsing raw Tesseract output into SubRip myself; I just need a way to get that output.

I've tried e.g.:

ffprobe -show_entries frame_tags=lavfi.ocr.text -f lavfi -i "movie=pgs.mkv,ocr"

Naturally, it tries to read the video stream instead of one of the subtitle streams. Pointing it at a .sup file, a multi-sub .mks, or .sub/.idx files gets me a "No video stream with index '-1' found" error, which is technically true, but...

Is there a way to make ffprobe/ffmpeg OCR the actual subtitles instead of the video?

Gyan · Accepted Answer · 2022-04-30 04:41:43Z

4

Image-based subtitles are a hybrid media type: they contain video data but are designated as subtitles. Almost all ffmpeg code expects stream data to actually be of the type it is labelled as, so ffmpeg (and ffmpeg only) has bespoke routines to ingest such subtitles.

Use

ffmpeg -f lavfi -i color=black:s=hd720 -i pgs.mkv -filter_complex "[0][1:s:0]overlay=format=yuv444:shortest=1,ocr,metadata=print:key=lavfi.ocr.text:file=subs.txt" -an -f null -
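
The subs.txt written by metadata=print still has to be turned into timed entries before it can become SubRip. Below is a minimal Python sketch of that step. It assumes the log has one `frame:N pts:N pts_time:SECONDS` header line per frame, followed by a `lavfi.ocr.text=...` line, with multi-line OCR text continuing on the lines after it. Check that against your own subs.txt first. `parse_ocr_log` is a hypothetical helper name, not something ffmpeg provides.

```python
import re

# Assumed header format written by metadata=print before each frame's tags,
# e.g. "frame:12   pts:12000   pts_time:12.5".
HEADER = re.compile(r"^frame:\s*(\d+)\s+pts:\s*(\S+)\s+pts_time:\s*(\S+)\s*$")
KEY = "lavfi.ocr.text="


def parse_ocr_log(text):
    """Return a list of (pts_time_seconds, ocr_text) tuples, one per logged frame."""
    entries = []
    current_time = None
    lines = None

    def flush():
        # Store the frame collected so far, if it had a timestamp and an OCR tag.
        if current_time is not None and lines is not None:
            entries.append((current_time, "\n".join(lines).strip()))

    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            flush()
            try:
                current_time = float(m.group(3))
            except ValueError:
                current_time = None  # e.g. "NOPTS": skip frames without a timestamp
            lines = None
        elif line.startswith(KEY):
            lines = [line[len(KEY):]]
        elif lines is not None:
            lines.append(line)  # continuation of multi-line OCR text
    flush()
    return entries
```

Typical use would be `entries = parse_ocr_log(open("subs.txt", encoding="utf-8").read())`. Frames where nothing was recognised come back with an empty string, which is useful later for finding where a subtitle ends.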


As I understand it, this renders the subtitles on top of a black background of size hd720 (I had to use hd1080, since my source is Subtitle: hdmv_pgs_subtitle, 1920x1080), and then OCRs the entire thing frame by frame, as I'm getting multiple reads of the same text. That's... pretty horrible. Beyond slow. But it works, so thanks for showing me the way and an interesting trick! Since this appears to be the only way to do it with ffmpeg, I guess I'll stick to Subtitle Edit and maybe raw Tesseract at some point. PS I love your work.
– Minty
Commented Apr 30, 2022 at 10:24

Add mpdecimate after the ocr filter to strip duplicates.
– Gyan
Commented Apr 30, 2022 at 14:45
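
Whether or not duplicates are dropped inside ffmpeg, repeated reads of the same text can also be collapsed when converting to SubRip. Below is a rough Python sketch. It assumes you already have a time-sorted list of (seconds, text) pairs, one per OCR'd frame, where an empty text means nothing was on screen. `to_srt`, `fmt` and the 2-second fallback duration for the final cue are illustrative choices, not anything ffmpeg provides.

```python
def fmt(seconds):
    """Format seconds as a SubRip timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(entries, last_duration=2.0):
    """Merge consecutive identical OCR reads into cues and return SubRip text."""
    cues = []  # each cue is [start, end, text]; end is None while the cue is open
    for t, text in entries:
        text = text.strip()
        if cues and cues[-1][1] is None:
            if text == cues[-1][2]:
                continue  # same subtitle still on screen
            cues[-1][1] = t  # text changed or vanished: close the open cue here
        if text:
            cues.append([t, None, text])
    if cues and cues[-1][1] is None:
        # No later frame to mark the end, so fall back to an assumed duration.
        cues[-1][1] = cues[-1][0] + last_duration
    blocks = [
        f"{i}\n{fmt(start)} --> {fmt(end)}\n{text}\n"
        for i, (start, end, text) in enumerate(cues, 1)
    ]
    return "\n".join(blocks)
```

This only merges reads that are exactly equal. OCR output that flickers between frames (say, "l" versus "I") would still be split into separate cues, so a fuzzy comparison may be needed in practice.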


softworkz · Accepted Answer · 2022-06-07 07:55:02Z

3

Yes, there is a new way.

If you like, you can try out an upcoming addition to ffmpeg that adds support for processing subtitles in filter graphs. It is currently available here:

https://github.com/ffstaging/FFmpeg/pull/18

It also includes a new graphicsub2text filter for subtitle OCR including text size, style and position, colors, outlines and alignment.
