If I interpret the documentation correctly, perl -CSD is supposed to ensure that input and output, whether data being processed or commands, use UTF-8 encoding.

But if I replace two hyphens -- with an em-dash — (U+2014), the result is not rendered as an em-dash in a UTF-8 locale on macOS 12.1 (I don't have any other OS to try things on).

To avoid the chance of further encoding/rendering issues between upload, server, and client-side rendering, I'm showing a screen shot rather than pasting text:

[screenshot not preserved]

If I open the file in an editor that assumes UTF-8 input, it displays the same. If I use that editor to add in another em-dash, the second one renders correctly, and is definitely encoded differently:

WGroleau@MBP ~ % od -xc /tmp/demo.txt 
0000000      2049    6177    746e    6120    206e    6d65    642d    7361
           I       w   a   n   t       a   n       e   m   -   d   a   s
0000020      2068    6562    7774    6565    206e    6874    7365    3a65
           h       b   e   t   w   e   e   n       t   h   e   s   e   :
0000040      4a20    656f    a2c3    80c2    94c2    6f54    0a6d    2049
               J   o   e   â  ** 302 200 302 224   T   o   m  \n   I    
0000060      6177    746e    6120    206e    6d65    642d    7361    2068
           w   a   n   t       a   n       e   m   -   d   a   s   h    
0000100      6562    7774    6565    206e    6874    7365    3a65    4a20
           b   e   t   w   e   e   n       t   h   e   s   e   :       J
0000120      656f    80e2    5494    6d6f    0a0a                        
           o   e   —  **  **   T   o   m  \n  \n                        

Is there a bug, or am I doing something wrong? I need to automate several replacements in many files, and they contain multiple languages, so non-ASCII characters may be on the search side as well as the replacement side.

UPDATE: I do have access to a Debian system, but it's through ssh. I see the same thing with "perl 5, version 28, subversion 1 (v5.28.1) built for x86_64-linux-gnu-thread-multi (with 65 registered patches …" but since I'm connected remotely, it's still being rendered by my system.

My perl is "perl 5, version 34, subversion 0 (v5.34.0) built for darwin-thread-multi-2level" with no mention of patches.

I'm open to another tool instead of perl, if it doesn't require a much bigger script or hours of learning a new language. There are already several languages I could do this in, but none of them are particularly convenient for this purpose.

  • Not sure if it's a bug, but it seems that the O flag of -C causes the "input" (I'm unsure about the exact definition/scope of "input" I'm talking about here) to be treated as one Unicode code point for every single byte: ix.io/3MDY
    – Tom Yan
    Commented Jan 18, 2022 at 7:36

2 Answers


The command line is not the same thing as standard input and doesn't go through PerlIO – it is a flat string array (@ARGV in Perl) which is handled by -CA rather than -CS. You need -CSDA to cover everything.

(Alternatively, call utf8::decode($_) for @ARGV near the beginning of your script.)

  • Unfortunately, both of those got the same wrong result. Does @ARGV really matter when using -i (inplace) on a filename?
    – WGroleau
    Commented Jan 18, 2022 at 7:09
  • Your regex is being provided through @ARGV, technically. Commented Jan 18, 2022 at 7:13
  • Ah, of course. My bad. So wouldn't the decode option need to be invoked before the script starts instead of within the script? What if I put the script in an actual file with a perl bang-line. Might that work differently? (I'm not sure how to do in-place in that case, but I could < old > temp and then copy the temp over the old.)
    – WGroleau
    Commented Jan 18, 2022 at 7:19
  • When the docs say an option makes perl handle UTF-8, but using it does the opposite and omitting it handles UTF-8 correctly, it seems like a bug to me. But I don't know enough about perl to say that confidently.
    – WGroleau
    Commented Jan 18, 2022 at 20:44

From Tom Yan’s comment, it appears that -CSD is actually messing it up somehow.  Leave it out, and usually¹ I get what I want (at least with my locale):

WGroleau@MBP ~ % echo "Let’s try again for an em-dash" > /tmp/tmp
WGroleau@MBP ~ % cat /tmp/tmp
Let’s try again for an em-dash
WGroleau@MBP ~ % perl -p -i -e 's:em-dash:—:g;' !$
perl -p -i -e 's:em-dash:—:g;' /tmp/tmp
WGroleau@MBP ~ % cat !$
cat /tmp/tmp
Let’s try again for an —
WGroleau@MBP ~ % perl -p -i -e 's:—:--:g;' /tmp/tmp # change it to ASCII
WGroleau@MBP ~ % cat /tmp/tmp
Let’s try again for an --

Seems like a bug to me, but I don’t really know. As I mentioned, there are other non-ASCII substitutions I need to make (testing first, of course).

¹Except that if I try to replace ’ (U+2019) with ASCII ', zsh complains of an open quote! Escaping it as \' doesn't help.

  • Not using any -C is like "passthrough" mode: it doesn't try to handle any encoding at all, neither encoding nor decoding; it just handles the input "—" as 3 separate bytes. That's fine in exact-match situations, as the bytes are preserved exactly from input to output, but let's say you have "a—b" and try to s/a.//: this will end up cutting the UTF-8 character in half (the . would match one byte rather than one character). That's what -C is for. Commented Jan 18, 2022 at 20:48
  • So, in that situation (which fortunately is not currently a problem for me), it looks like it won't work with or without -C. Still seems to be a bug, in two different versions of perl 5. I'll have to put the quoting issue in a different question.
    – WGroleau
    Commented Jan 18, 2022 at 21:19
