Determine and change filename encoding on Windows

Question

I have files on a Windows server that have certain accented characters in their names. In Windows Explorer the files are displayed normally, but running 'dir' at the command prompt with default settings displays substituted characters.

For example, the character ö is displayed as o" in the listing. This causes problems when accessing these files from other platforms over SMB, presumably because of conflicting encoding/code pages. The problem is not present with all files and I don't know where the problem files came from.

Example:

E:\folder\files>dir
 Volume in drive E is data
 Volume Serial Number is 5841-C30E

 Directory of E:\folder\files  

07/05/2016  07:46 PM    <DIR>          .
07/05/2016  07:46 PM    <DIR>          ..
12/01/2015  11:12 AM            14,105 file with o" character.xlsx
01/22/2015  05:30 PM            11,598 file with correct ö character.xlsx
               2 File(s)         25,703 bytes
               2 Dir(s)  2,727,491,600,384 bytes free

I've changed file and directory names, but you'll get the idea.

Any ideas how the names could have gotten this way? Perhaps they were copied or created using another platform or tool?

How could I batch find and rename all the problem files? I looked at a couple of GUI renaming utilities but they don't see the problem and only work with the name shown in Windows Explorer.

The filesystem on the drive is ReFS; could that have something to do with it?

Edit: ran PowerShell command

Y:\test>powershell -c Get-ChildItem ^|ForEach-Object {$x=$_.Name; For ($i=0;$i -lt $x.Length; $i++) {\"{0} {1} {2}\" -f $x,$x[$i],[int]$x[$i]}}
file with o¨ character.xlsx o 111
file with o¨ character.xlsx ¨ 776

Cleaned up to show only relevant part.

So it looks like it really is a combining diaeresis (code 776, i.e. U+0308) and not a quotation mark. As I understand it, that is what you would expect when Unicode normalization (decomposition) is involved.
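The two code points reported above can be reproduced in isolation. The following is a minimal sketch, assuming only the .NET String.Normalize and String.IsNormalized methods that PowerShell exposes; the variable names are illustrative and not part of the original command.

```powershell
# 'o' (U+006F, decimal 111) followed by a combining diaeresis (U+0308, decimal 776)
$decomposed = [string][char]0x006F + [char]0x0308
# The same letter as one precomposed code point (U+00F6, decimal 246)
$composed   = [string][char]0x00F6

# Normalization Form C composes the pair into the single code point.
$recomposed = $decomposed.Normalize('FormC')

# Lengths of the decomposed and the composed form, in UTF-16 code units
"{0} {1}" -f $decomposed.Length, $composed.Length
# Is the decomposed string already in Form C?
$decomposed.IsNormalized('FormC')
# Ordinal comparison looks at the raw UTF-16 code units only.
[string]::Equals($recomposed, $composed, [System.StringComparison]::Ordinal)
```

This should print 2 1, then False, then True: the decomposed name is one character longer, is not in Form C, and becomes identical to the precomposed name once it is normalized.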

Use chcp in the cmd shell to set an appropriate code page. See chcp - Change the active console Code Page. The default code page is determined by the Windows Locale. — DavidPostill, Commented Jul 5, 2016 at 17:33
nixer, please edit your question and add a real example of such a dir listing (copy and paste from the cmd window). @DavidPostill chcp would not suffice; it looks like what is displayed is a canonical or compatibility decomposition o ̈ (U+006F Latin Small Letter O followed by U+0308 Combining Diaeresis) instead of the ö character (U+00F6 Latin Small Letter O With Diaeresis). — JosefZ, Commented Jul 5, 2016 at 20:07
@DavidPostill @JosefZ I played around with chcp but couldn't get the name to show up correctly. It just changes the " to some other character like ?. So it seems to have been originally saved with decomposition and command prompt shows the actual name, Windows Explorer combines it back on the fly. — nixer, Commented Jul 6, 2016 at 8:52
I can't believe that there is a " (Quotation Mark) in a file name, as this character is reserved (disallowed in file names) according to the Naming Files, Paths, and Namespaces article. That should apply to both the NTFS and ReFS file systems. Please run the one-liner powershell -c Get-ChildItem ^|ForEach-Object {$x=$_.Name; For ($i=0;$i -lt $x.Length; $i++) {\"{0} {1} {2}\" -f $x,$x[$i],[int]$x[$i]}} instead of dir, then edit the question again and copy and paste only the relevant output lines (the numbers should suffice). FYI, the code for " is 34. — JosefZ, Commented Jul 7, 2016 at 20:39

JosefZ · Accepted Answer · 2016-07-08 16:15:33Z

I can reproduce your problem using the following simple PowerShell script:

$RatedName = "šöü"                            # set sample string
$FormDName = $RatedName.Normalize("FormD")    # its Canonical Decomposition
$FormCName = $FormDName.Normalize("FormC")    #     followed by Canonical Composition
                                              # list each string character by character
($RatedName,$FormDName,$FormCName) | ForEach-Object {
    $charArr = [char[]]$_ 
    "$_"      # display string in new line for better readability
              # display each character together with its Unicode codepoint
    For( $i=0; $i -lt $charArr.Count; $i++ ) { 
        $charInt = [int]$charArr[$i]
        # next "Try-Catch-Finally" code snippet adopted from my "Alt KeyCode Finder"
        #                                       http://superuser.com/a/1047961/376602
        Try {    
            # Get-CharInfo module downloadable from http://poshcode.org/5234
            #        to add it into the current session: use Import-Module cmdlet
            $charInt | Get-CharInfo |% {
                $ChUCode = $_.CodePoint
                $ChCtgry = $_.Category
                $ChDescr = $_.Description
            }
        }
        Catch {
            $ChUCode = "U+{0:x4}" -f $charInt
            if ( $charInt -le 0x1F -or ($charInt -ge 0x7F -and $charInt -le 0x9F)) 
                 { $ChCtgry = "Control" } else { $ChCtgry = "" }
            $ChDescr = ""
        }
        Finally { $ChOut = $charArr[$i] }
        "{0} {1,-2} {2} {3,5} {4}" -f $i, $charArr[$i], $ChUCode, $charInt, $ChDescr
    }
}
# create sample files
$RatedName | Out-File "D:\test\1097217Rated$RatedName.txt" -Encoding utf8
$FormDName | Out-File "D:\test\1097217FormD$FormDName.txt" -Encoding utf8
$FormCName | Out-File "D:\test\1097217FormC$FormCName.txt" -Encoding utf8


""                                 # very artless draft of possible solution
Get-ChildItem "D:\test\1097217*" | ForEach-Object {
    $y = $_.Name.Normalize("FormC")
    if ( $y.Length -ne $_.Name.Length ) {
        Rename-Item -NewName $y -LiteralPath $_.FullName -WhatIf
    } else {
        "       : file name is already normalized $_"
    }
}

The script above has been updated in two ways: first, it shows more information on composed and decomposed Unicode characters, i.e. their Unicode names (see the Get-CharInfo module); second, it embeds a very artless draft of a possible solution.
Output from cmd prompt:

==> powershell -c D:\PShell\SU\1097217.ps1
šöü
0 š  U+0161   353 Latin Small Letter S With Caron
1 ö  U+00F6   246 Latin Small Letter O With Diaeresis
2 ü  U+00FC   252 Latin Small Letter U With Diaeresis
šöü
0 s  U+0073   115 Latin Small Letter S
1 ̌  U+030C   780 Combining Caron
2 o  U+006F   111 Latin Small Letter O
3 ̈  U+0308   776 Combining Diaeresis
4 u  U+0075   117 Latin Small Letter U
5 ̈  U+0308   776 Combining Diaeresis
šöü
0 š  U+0161   353 Latin Small Letter S With Caron
1 ö  U+00F6   246 Latin Small Letter O With Diaeresis
2 ü  U+00FC   252 Latin Small Letter U With Diaeresis

       : file name is already normalized D:\test\1097217FormCšöü.txt
What if: Performing the operation "Rename File" on target "Item: D:\test\1097217
FormDšöü.txt Destination: D:\test\1097217FormDšöü.txt".
       : file name is already normalized D:\test\1097217Ratedšöü.txt

==> dir /b D:\test\1097217*
1097217FormCšöü.txt
1097217FormDšöü.txt
1097217Ratedšöü.txt

In fact, the dir output above looks like 1097217FormDsˇo¨u¨.txt in the cmd window. My Unicode-aware browser composes the strings as listed above, but a Unicode analyzer shows the characters as they really are, as does the latest image:

However, the next example shows the full extent of the problem: a for loop changes the combining accents to ordinary (spacing) ones, so the resulting name no longer matches the file:

==> for /F "delims=" %G in ('dir /b /S D:\test\1097217*') do @echo %~nxG & dir /B %~fG
1097217FormCšöü.txt
1097217FormCšöü.txt
1097217FormDsˇo¨u¨.txt
File Not Found
1097217Ratedšöü.txt
1097217Ratedšöü.txt

==>

Here is a very artless draft of a possible solution (see the output above):

""                                 # very artless draft of possible solution
Get-ChildItem "D:\test\1097217*" | ForEach-Object {
    $y = $_.Name.Normalize("FormC")
    if ( $y.Length -ne $_.Name.Length ) {
        Rename-Item -NewName $y -LiteralPath $_.FullName -WhatIf
    } else {
        "       : file name is already normalized $_"
    }
}
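The length comparison works because composition merges a base letter and its combining mark into a single character. As a hedged alternative, the sketch below asks .NET directly whether a name is already in Form C and only lists the candidates without renaming anything. The function name Get-DecomposedName is hypothetical, and D:\test is simply the sample folder used above.

```powershell
# Hypothetical helper: list items whose names are not in Normalization Form C.
function Get-DecomposedName {
    param(
        [Parameter(Mandatory)]
        [string]$Path          # folder to inspect (not recursive)
    )
    Get-ChildItem -LiteralPath $Path |
        Where-Object { -not $_.Name.IsNormalized('FormC') } |
        ForEach-Object {
            [pscustomobject]@{
                FullName = $_.FullName
                NewName  = $_.Name.Normalize('FormC')   # what the name would become
            }
        }
}

# Usage with the sample folder from the script above:
# Get-DecomposedName -Path 'D:\test' | Format-List
```

Reviewing this list before running any Rename-Item call is a cheap way to confirm that only the intended files are affected.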

~~(ToDo: invoke Rename-Item merely if necessary):~~

~~Get-ChildItem "D:\test\1097217*" | ForEach-Object { $y = $_.Name.Normalize("FormC") if ($true) { ### ToDo Rename-Item -NewName $y -LiteralPath $_ -WhatIf } }~~

~~and its output~~ (again, the strings are rendered composed here; the image below shows how the cmd window really looks):

What if: Performing the operation "Rename File" on target "Item: D:\test\1097217FormCšöü.txt Destination: D:\test\1097217FormCšöü.txt".
What if: Performing the operation "Rename File" on target "Item: D:\test\1097217FormDšöü.txt Destination: D:\test\1097217FormDšöü.txt".
What if: Performing the operation "Rename File" on target "Item: D:\test\1097217Ratedšöü.txt Destination: D:\test\1097217Ratedšöü.txt".

(Image: updated cmd output)

Very nice detective work! At the moment a PowerShell script seems like the best option for correcting the issue. I haven't found a file renaming utility that understands decomposed unicode. — nixer, Commented Jul 8, 2016 at 13:22
@nixer please note updated answer: renaming part could help! — JosefZ, Commented Jul 8, 2016 at 16:18
The draft script works wonderfully in the current directory. I tried to modify it to do renaming recursively but due to my poor PowerShell skills, I haven't been able to yet. — nixer, Commented Jul 22, 2016 at 8:45
@nixer please search stackoverflow for your additional request. — JosefZ, Commented Jul 23, 2016 at 7:34

nixer · Accepted Answer · 2016-08-09 12:47:55Z


Based on JosefZ's script, here is a modified version that works recursively:

Get-ChildItem "X:\" -Recurse | ForEach-Object {
    $y = $_.Name.Normalize("FormC")
    $file = $_.Fullname
    if ( $y.Length -ne $_.Name.Length ) {
        Rename-Item -LiteralPath "$file" -NewName "$y" -WhatIf
        Write-Host "renamed file $file"
    }
}

Remove -WhatIf after testing. I had problems with paths that were too long, but that's a topic for another post.
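One caveat with a recursive run: if a folder name is itself decomposed, renaming the folder first changes the path of everything beneath it. The following is a minimal sketch that sidesteps this by handling the deepest paths first. That ordering and the check for an existing target are my additions, not part of the script above, and the function name Rename-ToFormC is hypothetical. Because the function declares SupportsShouldProcess, calling it with -WhatIf previews the renames without performing them.

```powershell
# Hypothetical helper: rename decomposed names to Form C, deepest items first.
# A child's full path is always longer than its parent's, so sorting by
# path length (descending) renames children before the folders above them.
function Rename-ToFormC {
    [CmdletBinding(SupportsShouldProcess)]
    param(
        [Parameter(Mandatory)]
        [string]$Root          # top of the tree to process
    )
    Get-ChildItem -LiteralPath $Root -Recurse |
        Where-Object { -not $_.Name.IsNormalized('FormC') } |
        Sort-Object -Property { $_.FullName.Length } -Descending |
        ForEach-Object {
            $newName = $_.Name.Normalize('FormC')
            $parent  = [System.IO.Path]::GetDirectoryName($_.FullName)
            $target  = [System.IO.Path]::Combine($parent, $newName)
            if (Test-Path -LiteralPath $target) {
                # A composed twin already exists; leave both files untouched.
                Write-Warning "Skipped, target already exists: $target"
            } else {
                # Honors -WhatIf / -Confirm passed to this function.
                Rename-Item -LiteralPath $_.FullName -NewName $newName
            }
        }
}

# Preview first, then run for real:
# Rename-ToFormC -Root 'X:\' -WhatIf
# Rename-ToFormC -Root 'X:\'
```

This sketch does nothing about the long-path problem mentioned above; items that Get-ChildItem cannot enumerate are simply not processed.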


miroxlav · Accepted Answer · 2016-07-06 09:20:11Z

The problem originates in this tab of the Region control panel:

This affects not only screen fonts but also the file system (basically in the way that you describe).

The screenshot is from my machine. If I changed the locale to English, all special Slovak national characters like ľôščťž in file names would turn into garbage, and some of them would even completely prevent opening the file (tested...), with no workaround until the code page is reverted. However, this problem does not appear with more common national characters like áíé, which occur in many languages.

This also affects some offline media, e.g. on attempt to open a backup made with different locale.

The easiest solution is to keep the same locale on all machines accessing the resource.

The workaround is to determine which machine has a different locale and, from that machine, perform a mass replacement of all national characters (e.g. č->c, ž->z) in all file names. Total Commander (a file manager) can replace each such pair in an entire directory tree at once. Then you can return that machine to English (beware, it might not be able to read its own backups), or keep it as it is and ask users not to use national characters in file names.

(Before that, you can try one thing: run chcp on the machine with the different locale to learn which code page is in use (e.g. 852), and then try chcp 852 on the other machines. I am not sure whether this will satisfactorily fix the problem.)

Thanks for the tip. I tried several locales, but none of them affected the decomposition and I wasn't able to replicate the issue. I also tried several file renaming utilities, but none of them could handle decomposition. This leads me to believe that the files were transferred from another machine or platform using some tool that mangled the names. I'm still searching for a bulk renaming tool that could find and fix all the files that have this issue. — nixer, Commented Jul 7, 2016 at 12:15
@nixer – regarding bulk renaming, I already wrote how it can be done. More details: inside TCMD, use Search & Replace in the Multi-rename tool (accessible from the main menu). Be careful, though, and create a backup first; you can get yourself into a logical trap by using an incorrect renaming order. I think the best option (if viable) would be to use the files to determine who uploaded them and focus on that user's machine. — miroxlav, Commented Jul 7, 2016 at 12:44
