10
\$\begingroup\$

I use the following utility method to convert Persian and Arabic digits to English using regex:

convertNumbers2English: function (string) {
    return string.replace(/[٠١٢٣٤٥٦٧٨٩]/g, function (c) {
        return c.charCodeAt(0) - 1632;
    }).replace(/[۰۱۲۳۴۵۶۷۸۹]/g, function (c) {
       return c.charCodeAt(0) - 1776;
   });
}
\$\endgroup\$
1
  • \$\begingroup\$ See this post. \$\endgroup\$
    – Mahozad
    Commented Dec 24, 2021 at 15:04

3 Answers

29
\$\begingroup\$

Be nice to the maintenance programmer, even (especially?) if you expect it to be you. When the code mixes characters that are visually indistinguishable, and they don't need to appear as literal self-representations, you can use Unicode escapes and hexadecimal offsets like so:

convertNumbers2English: function (string) {
    return string.replace(/[\u0660-\u0669]/g, function (c) {
        return c.charCodeAt(0) - 0x0660;
    }).replace(/[\u06f0-\u06f9]/g, function (c) {
        return c.charCodeAt(0) - 0x06f0;
    });
}

Just that small change accomplishes the following:

  1. I can easily see that I haven't missed any digits without having to count;
  2. I can easily see that I haven't accidentally mixed digits from the two styles;
  3. I can easily see that the offset subtracted is correct in each case;
  4. I can easily see that the values returned by the anonymous functions are integers from 0 to 9 and not strings or codepoints corresponding to '0' to '9', which is useful if I'm not primarily a JS developer;
  5. If I care about squeezing every last byte out of my JS, I can see a way to combine the two into one:

    convertNumbers2English: function (string) {
        return string.replace(/[\u0660-\u0669\u06f0-\u06f9]/g, function (c) {
            return c.charCodeAt(0) & 0xf;
        });
    }
    

    The minimiser should take care of unescaping the Unicode escapes.

  6. It might be slightly easier for me to find which characters they are, because I can look up the hex values in a Unicode character table.
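
To make points 4 and 5 concrete, here is a minimal standalone sketch of the combined version. The function name `toAsciiDigits` and the sample string are hypothetical; the original is an object method. It relies on both digit blocks starting at a multiple of 16, and on `replace` converting the replacer's numeric return value to a string.

```javascript
// Hypothetical standalone form of the combined replacement.
function toAsciiDigits(input) {
    // U+0660..U+0669 are the Arabic-Indic digits and U+06F0..U+06F9 are the
    // extended (Persian) forms. Both blocks start at a multiple of 16, so the
    // low four bits of the code point equal the digit's value.
    return input.replace(/[\u0660-\u0669\u06f0-\u06f9]/g, function (c) {
        // The replacer returns a number; replace() converts it to a string.
        return c.charCodeAt(0) & 0xf;
    });
}

const sample = "\u0661\u0662\u0663 and \u06f4\u06f5\u06f6";
console.log(toAsciiDigits(sample));        // "123 and 456"
console.log(typeof toAsciiDigits(sample)); // "string"
```

The final result is always a string, even though each replacer call returns an integer.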
\$\endgroup\$
2
  • \$\begingroup\$ Just out of curiosity, why do you say that the characters don't need to be literal self-representations? Wouldn't it be more meaningful to use the self-representation? \$\endgroup\$ Commented Jun 27, 2017 at 19:26
  • 3
    \$\begingroup\$ @KodosJohnson, is ٩ the Persian one or the Arabic one? Is \u06f5 the Persian one or the Arabic one? I hope that answers your question. \$\endgroup\$ Commented Jun 27, 2017 at 20:07
6
\$\begingroup\$

You can use capture groups:

return string.replace(/([٠١٢٣٤٥٦٧٨٩])|([۰۱۲۳۴۵۶۷۸۹])/g, function(m, $1, $2) {
    return m.charCodeAt(0) - ($1 ? 1632 : 1776);
});

$1 is the character matched by [٠١٢٣٤٥٦٧٨٩] and $2 is the character matched by [۰۱۲۳۴۵۶۷۸۹]; only one of them is defined for any given match. The ternary operator then picks the correct offset to subtract from the char code.

If the target environments support arrow functions, the code can be shortened to

convertNumbers2English: str => str.replace(/([٠١٢٣٤٥٦٧٨٩])|([۰۱۲۳۴۵۶۷۸۹])/g, (m, $1, $2) => m.charCodeAt(0) - ($1 ? 1632 : 1776))
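
As a rough standalone sketch of the same capture-group idea, the version below swaps the literal digits for Unicode escape ranges so the two sets cannot be confused on screen. The function name `convertWithGroups` and the parameter names `arabic` and `persian` are hypothetical, chosen only for illustration.

```javascript
// Hypothetical standalone form of the capture-group approach.
function convertWithGroups(input) {
    // Group 1 matches Arabic-Indic digits, group 2 the Persian forms.
    return input.replace(/([\u0660-\u0669])|([\u06f0-\u06f9])/g, function (m, arabic, persian) {
        // Only one group participates in a match; the other is undefined,
        // so testing the first group is enough to pick the offset.
        return m.charCodeAt(0) - (arabic ? 0x0660 : 0x06f0);
    });
}

console.log(convertWithGroups("\u0667\u06f7")); // "77"
```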
\$\endgroup\$
1
\$\begingroup\$

If the string may contain both "Arabic" and "Persian" digits, a single chained "replace" expression can do the job, as follows.

The Arabic and Persian digits are converted to their English equivalents. All other characters remain unchanged.

let Num = "۳٣۶٦۵any٥۵٤۶32٠۰";     // Output should be "33665any55463200"

Num = Num.replace(/[٠-٩]/g, d => "٠١٢٣٤٥٦٧٨٩".indexOf(d)).replace(/[۰-۹]/g, d => "۰۱۲۳۴۵۶۷۸۹".indexOf(d));

console.log(Num);
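
A possible single-pass variant of the same lookup idea, sketched under the assumption that one combined lookup string is acceptable: both digit sets go into one string, and `% 10` folds the Persian half back onto 0 to 9. The names `DIGITS` and `convertWithLookup` are hypothetical, and the digits are written as Unicode escapes to keep the two sets distinguishable.

```javascript
// Indexes 0-9 are the Arabic-Indic digits, indexes 10-19 the Persian forms.
const DIGITS = "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669" +
               "\u06f0\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9";

// Hypothetical helper: one regex pass instead of two chained replaces.
function convertWithLookup(input) {
    return input.replace(/[\u0660-\u0669\u06f0-\u06f9]/g, d => DIGITS.indexOf(d) % 10);
}

console.log(convertWithLookup("\u06f3\u0663any32\u0660\u06f0")); // "33any3200"
```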

\$\endgroup\$
