diff while ignoring Unicode whitespace characters (CJK full-width space)



I'm trying to diff two files. One of these files contains extra full-width spaces (U+3000):

Text 1
Text 2
Text 3

Text 1
 　 Text 2
Text 3

diff -w A.txt B.txt reports

2c2
< Text 2
---
>  　 Text 2

Is there an option or workaround that makes diff ignore all whitespace characters, including Unicode ones such as U+3000 (Ideographic Space)?

Files processed are UTF-8 (with BOM) with CRLF line breaks.

Other tools or workarounds are fine if this is not possible with diff.

asked Aug 18, 2021 at 3:24

tsh


Diff against the output of sed, using process substitution <()?
– Tom Yan
Commented Aug 18, 2021 at 3:44
@TomYan Seems diff <(cat A.txt | sed 's/\s//g' | sed 's/　//g') <(cat B.txt | sed 's/\s//g' | sed 's/　//g') works in my case. Though quite ugly...
– tsh
Commented Aug 18, 2021 at 4:04
First of all, you don't need cat: sed can take a file as input, just don't use -i. Besides, sed 's/\(\s\|　\)//g' should probably work (not sure about portability, though).
– Tom Yan
Commented Aug 18, 2021 at 4:37
What are settings for locale?
– Romeo Ninov
Commented Aug 18, 2021 at 5:33
@RomeoNinov locale command says LANG=C.UTF-8, LC_CTYPE="C.UTF-8", ... I hadn't touched that setting before. Should I set locale to something else?
– tsh
Commented Aug 18, 2021 at 9:54
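
The sed-based workaround discussed in the comments above could be wrapped in a small helper along these lines. This is only a sketch, not a tested recipe from the thread: the function name strip_ws is hypothetical, and it assumes the files are valid UTF-8 as described in the question. It matches the BOM and U+3000 by their UTF-8 byte sequences under LC_ALL=C, so it does not depend on whether the current locale classifies U+3000 as whitespace, and it avoids the GNU-only \s and \| extensions.

```shell
# strip_ws FILE  (hypothetical helper name)
# Prints FILE with the UTF-8 BOM, every ASCII whitespace character
# (including the CR of a CRLF line ending), and every U+3000 removed.
strip_ws() {
  # UTF-8 byte sequences, written as octal escapes for portability
  bom=$(printf '\357\273\277')                # U+FEFF, byte order mark
  ideographic_space=$(printf '\343\200\200')  # U+3000, ideographic space

  # LC_ALL=C makes sed work on raw bytes, so the multi-byte
  # sequences above are matched literally in any environment.
  LC_ALL=C sed \
    -e "1s/^$bom//" \
    -e "s/$ideographic_space//g" \
    -e 's/[[:space:]]//g' \
    "$1"
}
```

In bash, it would then be used as diff <(strip_ws A.txt) <(strip_ws B.txt). Like diff -w, this treats "a b" and "ab" as equal, and any differences that remain are shown with whitespace already stripped rather than as the original lines.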
