We process files that our clients generate on their local Windows machines, which use the CP-1252 character set. Occasionally, when our backend (Java, running on CentOS) processes one of these files, it fails with a RuntimeException. If we remote in to the server, rename the file (using UTF-8) and re-run it, the file processes without any problem.

Is there any way to "add" CP-1252 to CentOS's available character sets so that this stops happening?

  • Can you post the Java run-time exception that you receive? And call stack? Is the issue that there is a CP-1252 character in the file name that is being processed by a Java program? Commented Aug 20, 2012 at 21:40
  • @HeatfanJohn - I will need a few hours before I can get access to the appropriate logs to get the exact stacktrace, but yes, you nailed it. It happens when there is a CP-1252 character in the file name and the system chokes. Simply SSHing in to the server, renaming it and re-processing the file fixes it, but is a sub-optimal (manual!) solution.
    – pnongrata
    Commented Aug 20, 2012 at 21:42
  • Do you have any control over the code that creates that file that is processed by your Java back-end or over the source code to the Java application that processes the file? Commented Aug 20, 2012 at 21:56
  • Only the backend but not the (client-side) file generator. But the Java backend is 100% under our control.
    – pnongrata
    Commented Aug 20, 2012 at 22:01
  • How come you can't fix the Java program to read the data as bytes and then pass it through a decoder? Commented Aug 21, 2012 at 5:20
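
The suggestion in the last comment, reading the data as bytes and decoding it explicitly, could be sketched roughly as follows for file contents. This is only an illustration under stated assumptions: the class name `Cp1252Reader` is hypothetical, and it assumes the JVM provides the `windows-1252` charset (Java's name for CP-1252). It covers file contents only; file names are decoded by the JVM according to the locale, which is what the answer below addresses.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not taken from the original backend.
class Cp1252Reader {
    // Assumption: this JVM ships the "windows-1252" charset.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Decodes raw bytes as CP-1252, regardless of the platform default charset. */
    static String decode(byte[] raw) {
        return new String(raw, CP1252);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Cp1252Reader <file>");
            return;
        }
        // Read the file as bytes first, then decode with an explicit charset.
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(decode(raw));
    }
}
```

Because the charset is named explicitly, the result does not depend on `LC_ALL` or on the JVM's default encoding.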

1 Answer

See Oracle's bug report bug_id=4733494, which describes how Java behaves under the "default locale". Sun/Oracle does not consider this behavior a bug; it is how Java was designed. From the report:

In versions of the JDK prior to 1.4, we always forced the "C" locale to the ISO8859-1 character set. In releases 1.4 and later, we support the "C" locale which requires restriction to 7-bit ASCII.

The recommendation is to set the environment variable LC_ALL to en_US.ISO8859-1, or to whatever locale is appropriate for the system (es_ES.ISO-8859-1, etc.).

Adding the following line to the command file that runs your Java back-end should resolve the problem:

export LC_ALL="en_US.ISO-8859-1"

This is also discussed in the following SO question: https://stackoverflow.com/questions/5663709/how-to-fix-java-when-if-refused-to-open-a-file-with-special-charater-in-filename
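
To check whether the locale setting actually reaches the JVM, a small diagnostic along these lines may help. It is a minimal sketch, not part of the bug report or of the original backend: the class name `EncodingCheck` and the sample file name are made up for illustration, and `sun.jnu.encoding` is an internal, undocumented JVM property that is assumed here to reflect the encoding used for file names (it may be missing on some JVMs).

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class for illustration only.
class EncodingCheck {

    /** Returns true if every character of fileName can be encoded in charset. */
    static boolean canEncode(String fileName, Charset charset) {
        return charset.newEncoder().canEncode(fileName);
    }

    public static void main(String[] args) {
        // Locale as seen by the JVM process (null if the variable is unset).
        System.out.println("LC_ALL           = " + System.getenv("LC_ALL"));
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));

        // Assumption: internal property describing the file-name encoding.
        String jnu = System.getProperty("sun.jnu.encoding");
        System.out.println("sun.jnu.encoding = " + jnu);

        // Made-up sample name containing a non-ASCII character (e with acute accent).
        String sample = "r\u00e9sum\u00e9.txt";
        if (jnu != null && Charset.isSupported(jnu)) {
            System.out.println("can encode sample name: "
                    + canEncode(sample, Charset.forName(jnu)));
        }
    }
}
```

Running it once from the same command file with and without the `export LC_ALL=...` line shows whether the reported encodings change.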

  • Thanks @HeatfanJohn (+1) - quick followup: what's this "C" locale? I've never heard of it before or seen it referenced anywhere. What purpose does it serve? Thanks again!
    – pnongrata
    Commented Aug 21, 2012 at 1:59
  • @zharvey I didn't know what that was either. From this chemie.fu-berlin.de/chemnet/use/info/libc/libc_19.html#SEC324 web page it appears to be a legacy GNU C locale. Commented Aug 21, 2012 at 2:42
  • @zharvey if you run locale from a command prompt on your Linux system, what is output? Commented Aug 21, 2012 at 3:07
  • The "C" locale does no collation (i.e. it compares bitwise), no number or currency formatting, and only primitive date and time formatting, and it does not translate the native strings used in an application. Commented Aug 21, 2012 at 5:17
