"fixing" RTL texts from logical to visual, before embedding in video as subtitles with ffmpeg

Question

I'm looking for the correct way to pre-process my subtitle files before hard-coding them into video clips.

Currently, ffmpeg does not process RTL (right-to-left) languages properly; I have detailed the problem here: https://superuser.com/questions/1679536/how-to-embed-rtl-subtitles-in-a-video-hebrew-arabic-with-the-correct-lan

However, I can see two possible programmatic solutions:

1. Adding certain Unicode control characters can fix (or partially fix) the text, which is then fed into ffmpeg and gives good results:

   - character 0x200F (RIGHT-TO-LEFT MARK) at the end of a Hebrew clause, after the punctuation
   - character 0x202B (RIGHT-TO-LEFT EMBEDDING); I haven't yet learned its usage.

2. Editing the text myself so that it produces the correct result in ffmpeg. But that requires a smart BiDi algorithm.
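For illustration, the first option (0x200F after the punctuation) might look like the sketch below. The helper names `has_rtl` and `append_rlm` and the punctuation set are assumptions made for this example, not an established recipe. Trailing punctuation has no direction of its own, so a mark after it lets it resolve as part of the RTL run. Whether this alone is enough for a given file has to be checked in the rendered output.

```python
import re
import unicodedata as ud

RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Trailing run of common punctuation at the end of a line.
# The exact set is an assumption; extend it as needed.
_TRAILING_PUNCT = re.compile(r"[.,!?:;)\]-]+$")

def has_rtl(text):
    """Return True if the text contains any strong right-to-left character."""
    return any(ud.bidirectional(c) in ("R", "AL") for c in text)

def append_rlm(line):
    """Append RLM to a line that contains RTL text and ends with punctuation."""
    if has_rtl(line) and _TRAILING_PUNCT.search(line):
        return line + RLM
    return line
```

Lines without RTL characters, or without trailing punctuation, are returned unchanged.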

Do you know how to preprocess such text?

(This is NOT an encoding question; it is about which RTL/LTR algorithm to use.)

Thank you

@ImanMohammadi BiDi algorithm is super complex. I'm not going to implement it myself. — Berry Tsakala, Commented May 12 at 6:49

Berry Tsakala · Accepted Answer · 2024-05-13 08:43:18Z

Such preprocessing for FFmpeg's subtitles is possible. Here is an example Python script that parses an existing .srt file, determines for each subtitle section whether it is predominantly RTL, and if so adds the Right-to-Left Embedding character (U+202B) at the start of the section and the Pop Directional Formatting character (U+202C) at its end. When the subtitles are burned in by FFmpeg, this fixes both the broken sentences and the punctuation placement.


    import pysrt
    from collections import Counter
    import unicodedata as ud

    # Detect the dominant writing direction of a string by counting
    # characters in the RTL and LTR bidirectional classes.
    # https://stackoverflow.com/a/75739782/10327858
    def dominant_strong_direction(s):
        count = Counter(ud.bidirectional(c) for c in s)
        rtl_count = count['R'] + count['AL'] + count['RLE'] + count['RLI']
        ltr_count = count['L'] + count['LRE'] + count['LRI']
        return "rtl" if rtl_count > ltr_count else "ltr"

    filename = "file.srt"
    subs = pysrt.open(filename)
    for sub in subs:
        if dominant_strong_direction(sub.text) == "rtl":
            # Wrap the section in RLE (U+202B) ... PDF (U+202C)
            sub.text = "\u202b" + sub.text + "\u202c"

    # Note: this overwrites the input file
    subs.save(filename, encoding='utf-8')

The script was tested on an M1 Mac.
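One caveat: because the script saves back to the same file, a second run would presumably wrap RTL sections again, since the embedding character itself counts toward the RTL total. A minimal sketch of a guard follows. The helper names `wrap_rtl` and `unwrap_rtl` are hypothetical and not part of pysrt.

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def wrap_rtl(text):
    """Wrap text in RLE ... PDF unless it already starts and ends with them."""
    if text.startswith(RLE) and text.endswith(PDF):
        return text
    return RLE + text + PDF

def unwrap_rtl(text):
    """Remove one outer RLE ... PDF pair, if present."""
    if text.startswith(RLE) and text.endswith(PDF):
        return text[len(RLE):-len(PDF)]
    return text
```

In the loop above, `sub.text = wrap_rtl(sub.text)` could then replace the string concatenation. The check only looks at the first and last character, so it does not verify that the two marks actually form a matching pair.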

The idea is 100% correct. The specific implementation may vary depending on the subtitle format. See my answer for an alternative. — Berry Tsakala, Commented May 13 at 11:02

Berry Tsakala · Accepted Answer · 2024-05-13 11:08:16Z


An alternative solution based on @ronRegev's answer, which processes the text line by line without pysrt:


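    from collections import Counter
    import unicodedata as ud

    def dominant_strong_direction(line):
        # Same function as in RonRegev's answer
        count = Counter(ud.bidirectional(c) for c in line)
        rtl_count = count['R'] + count['AL'] + count['RLE'] + count['RLI']
        ltr_count = count['L'] + count['LRE'] + count['LRI']
        return "rtl" if rtl_count > ltr_count else "ltr"

    # s: the subtitle text as a str; out: path of the output file
    with open(out, 'w', encoding='utf-8') as fp:
        for line in s.splitlines():
            if dominant_strong_direction(line) == "rtl":
                line = "\u202b" + line + "\u202c"
            fp.write(line + '\n')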
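To show how the line-by-line approach behaves, here is a self-contained sketch run on a small, made-up SRT fragment. The function `wrap_rtl_lines` and the sample text are assumptions for this example. Index and timestamp lines contain no strong RTL characters, so they are left untouched, and only the Hebrew line is wrapped.

```python
from collections import Counter
import unicodedata as ud

def dominant_strong_direction(line):
    """Return "rtl" if RTL-class characters outnumber LTR-class ones."""
    count = Counter(ud.bidirectional(c) for c in line)
    rtl_count = count["R"] + count["AL"] + count["RLE"] + count["RLI"]
    ltr_count = count["L"] + count["LRE"] + count["LRI"]
    return "rtl" if rtl_count > ltr_count else "ltr"

def wrap_rtl_lines(text):
    """Wrap every predominantly RTL line in RLE (U+202B) ... PDF (U+202C)."""
    out = []
    for line in text.splitlines():
        if dominant_strong_direction(line) == "rtl":
            line = "\u202b" + line + "\u202c"
        out.append(line)
    return "\n".join(out) + "\n"

# Made-up sample: one Hebrew cue and one English cue.
sample = (
    "1\n"
    "00:00:01,000 --> 00:00:03,000\n"
    "\u05e9\u05dc\u05d5\u05dd, \u05e2\u05d5\u05dc\u05dd!\n"  # Hebrew "Hello, world!"
    "\n"
    "2\n"
    "00:00:04,000 --> 00:00:06,000\n"
    "Hello, world!\n"
)

result = wrap_rtl_lines(sample)
```

Because each line is wrapped separately, a multi-line RTL cue gets one embedding pair per line rather than one pair around the whole cue.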
