59

The Oracle page Java HotSpot VM Options lists -XX:+UseCompressedStrings as available and on by default. However, in Java 6 update 29 it is off by default, and in Java 7 update 2 it reports a warning:

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option UseCompressedStrings; support was removed in 7.0

Does anyone know the thinking behind removing this option?


sorting lines of an enormous file.txt in java

With -mx2g, this example took 4.541 seconds with the option on and 5.206 seconds with it off in Java 6 update 29. It is hard to see that it impacts performance.

Note: Java 7 update 2 requires 2.0 GB, whereas Java 6 update 29 requires 1.8 GB without compressed strings and only 1.0 GB with them.

2
  • 11
    not related exactly but for future ref: -XX:+PrintFlagsFinal lists all the flags available and their values.
    – bestsss
    Commented Apr 27, 2012 at 9:10
  • 2
    Looking forward to this feature making a comeback under JEP 254 in JDK 9. I still keep JDK6-32 around for a small but string-heavy app (100 MB total RAM, vs. 150 MB on JDK8-32, vs. 250 MB on JDK8-64) and 30% faster regex searches. Commented Dec 2, 2015 at 4:32

5 Answers 5

42

Originally, this option was added to improve SPECjBB performance. The gains are due to reduced memory bandwidth requirements between the processor and DRAM. Loading and storing bytes in the byte[] consumes 1/2 the bandwidth versus chars in the char[].

However, this comes at a price. The code has to determine if the internal array is a byte[] or char[]. This takes CPU time and if the workload is not memory bandwidth constrained, it can cause a performance regression. There is also a code maintenance price due to the added complexity.
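To make that trade-off concrete, here is a minimal sketch of a string with two internal representations. It is not the HotSpot implementation: `DualString` and its methods are hypothetical names, and the 7-bit ASCII cutoff is an assumption made for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: stores one byte per character when the text is 7-bit ASCII, chars otherwise. */
final class DualString {
    private final byte[] bytes; // non-null only when every character fits in 7 bits
    private final char[] chars; // non-null otherwise

    DualString(String s) {
        boolean ascii = true;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                ascii = false;
                break;
            }
        }
        if (ascii) {
            bytes = s.getBytes(StandardCharsets.US_ASCII);
            chars = null;
        } else {
            bytes = null;
            chars = s.toCharArray();
        }
    }

    boolean isCompressed() {
        return bytes != null;
    }

    int length() {
        return bytes != null ? bytes.length : chars.length;
    }

    char charAt(int index) {
        // This branch runs on every access; it is the CPU cost described above.
        return bytes != null ? (char) bytes[index] : chars[index];
    }

    /** Size of the character payload only, ignoring object and array headers. */
    int payloadBytes() {
        return bytes != null ? bytes.length : 2 * chars.length;
    }
}
```

The byte-backed form halves the payload that has to move between the processor and DRAM, while every `charAt` call pays for the representation check whether or not the workload is bandwidth constrained.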

Because there weren't enough production-like workloads that showed significant gains (except perhaps SPECjBB), the option was removed.

There is another angle to this. The option reduces heap usage. For applicable Strings, it reduces the memory usage of those Strings by 1/2. This angle wasn't considered at the time of option removal. For workloads that are memory capacity constrained (i.e. have to run with limited heap space and GC takes a lot of time), this option can prove useful.

If enough memory capacity constrained production-like workloads can be found to justify the option's inclusion, then maybe the option will be brought back.

Edit 3/20/2013: An average server heap dump spends about 25% of its space on Strings. Most Strings are compressible. If the option is reintroduced, it could save half of this space (i.e. ~12% of the heap)!

Edit 3/10/2016: A feature similar to compressed strings is coming back in JDK 9 JEP 254.

10
  • I assume that large JEE based systems will store most of their data in a database; JSE systems do this too, but to a lesser degree. Being able to store less data in memory reduces the size of the cache you can have, but that is less critical (i.e. you won't get a failure as such). I am assuming SPECjBB doesn't take into account the cost of being able to cache less data. For my applications, I store most of my data in Memory Mapped Files with byte based encoding for strings and use re-usable StringBuilder rather than String to limit GC impact, so it may not help me as much as it did. Commented Apr 24, 2012 at 7:10
  • 5
    It shouldn't have to come at a price. Java should be able to provide apis that could be used to build a string from a source that is known to only contain bytes. Instead it chooses to distrust the programmer and verify everything itself. Similarly, java could provide an api allowing a String to be instantiated out of an existing array; instead it completely distrusts all programmers and forces always copying the array.
    – srparish
    Commented Jul 11, 2012 at 13:08
  • 2
    I hope this is added back in. It actually is very useful for speeding up text parsing applications that handle minimal character sets and definitely reduces heap usage if you're keeping your dataset in memory. Commented Jul 25, 2012 at 16:28
  • 3
    @srparish: I'm pretty sure that allowing what you described would undermine JVM security. Accepting your array as a String component makes the String mutable, and with String used for class names you could probably do whatever you want in the JVM. So a non-copying public constructor String(char[]) would have to be guarded by a SecurityManager, which would probably make it slower than the copying version.
    – maaartinus
    Commented Jul 27, 2012 at 21:43
  • @maaartinus It is not a problem if we can make immutable arrays. By the way, will escape analysis optimize away the copy?
    – ntysdd
    Commented Sep 15, 2017 at 2:53
16

Just to add, for those interested...

The java.lang.CharSequence interface (which java.lang.String implements), allows more compact representations of Strings than UTF-16.

Apps which manipulate a lot of strings should probably be written to accept CharSequence, so that they work with java.lang.String or with more compact representations.

8-bit (UTF-8), or even 5, 6, or 7-bit encoded, or even compressed strings can be represented as CharSequence.

CharSequences can also be a lot more efficient to manipulate - subsequences can be defined as views (pointers) onto the original content for example, instead of copying.

For example, in concurrent-trees, a suffix tree of ten of Shakespeare's plays requires 2GB of RAM using CharSequence-based nodes, and would require 249GB of RAM if using char[] or String-based nodes.
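As a rough illustration of both points (a compact encoding and subsequences as views), here is a minimal sketch of a byte-backed CharSequence. `ByteCharSequence` is a hypothetical class written for this answer, not something taken from concurrent-trees, and it assumes the text is Latin-1; characters outside that range are replaced with '?' by the encoder.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: an immutable CharSequence that stores one byte per character. */
final class ByteCharSequence implements CharSequence {
    private final byte[] data; // shared between a sequence and its views, never exposed
    private final int offset;
    private final int length;

    private ByteCharSequence(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    /** Assumes Latin-1 text; unmappable characters become '?'. */
    static ByteCharSequence of(String text) {
        byte[] encoded = text.getBytes(StandardCharsets.ISO_8859_1);
        return new ByteCharSequence(encoded, 0, encoded.length);
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        // Mask to undo sign extension so bytes 0x80..0xFF map to chars 0x80..0xFF.
        return (char) (data[offset + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("start: " + start + ", end: " + end);
        }
        // A view: same backing array, different window, no copying.
        return new ByteCharSequence(data, offset + start, end - start);
    }

    @Override
    public String toString() {
        return new String(data, offset, length, StandardCharsets.ISO_8859_1);
    }
}
```

One design consequence worth keeping in mind: because views share the backing array, a small subsequence keeps the whole original array reachable, so this layout suits structures that retain the full text anyway.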

4
  • 1
    CharSequence seems interesting, but I see no means by which an implementation can indicate whether it should be considered immutable (i.e. whether persisting a reference is sufficient to persist the sequence of characters therein). Obviously it's possible for any interface to be implemented in broken fashion, but the interface would seem most useful if it had IsImmutable and AsImmutable methods.
    – supercat
    Commented Sep 24, 2013 at 22:33
  • Yes the immutability of a CharSequence depends on the immutability of all CharSequences it references transitively. I suppose if you implement an ImmutableCharSequence which can only reference other ImmutableCharSequences, then you could do instanceof checks, to detect immutability transitively.
    – npgall
    Commented Sep 25, 2013 at 15:31
  • 1
    While it would be helpful to have an interface ImmutableCharSequence which inherited CharSequence but didn't add any new members--just an expectation that IsImmutable would return true and AsImmutable would return this, and methods which need immutable strings could accept that type without having to call IsImmutable or AsImmutable, there's no way one could restrict what types of objects could be encapsulated by an ImmutableCharSequence, since what would matter would not be whether any encapsulated instance was of a mutable type, but rather whether it would ever be...
    – supercat
    Commented Sep 25, 2013 at 16:21
  • 1
    ...exposed to anything that would mutate it. The vast majority of immutable objects encapsulate instances of mutable classes either directly or indirectly, but are immutable despite that because those instances are never freely exposed to the outside world.
    – supercat
    Commented Sep 25, 2013 at 16:23
14

Since there were up votes, I figure I wasn't missing something obvious so I have logged it as a bug (at the very least an omission in the documentation)

https://bugs.java.com/bugdatabase/view_bug?bug_id=7129417

(Should be visible in a couple of days)

22
  • 1
    Filing the bug was the right thing to do, yet overall SO has no known JVM engineers among its participants.
    – bestsss
    Commented Apr 27, 2012 at 9:13
  • 1
    True, but SO is much more responsive. ;) I wanted to check I wasn't missing something obvious. @Nathan's explanation is as good as any. Commented Apr 27, 2012 at 9:16
  • 2
    @bestsss: I wonder what difficulties there would have been with having, instead of having an array type for each numerical primitive, having a unified primitive-array instance type which could be cast to any other, along with a JVM-defined static final variable indicating whether such arrays will behave as big-endian or little-endian? There are many kinds of operation which could be accelerated by grabbing two or four things at once.
    – supercat
    Commented Feb 27, 2014 at 18:37
  • 1
    @bestsss: The greater the performance advantages of compressed strings, the more likely those advantages would be to overcome any overhead imposed by having to conditionally or virtually dispatch members. If I were designing a String, I'd probably have three main storage formats: an array of bytes for ASCII strings, an array of characters for non-ASCII strings, and an array of Object which would hold a list of String along with a list of offsets (concatenating two list-style strings should produce a new string with a combined list whose items, other than the first and last, were...
    – supercat
    Commented May 5, 2014 at 18:44
  • 2
    @supercat, keep in mind that no one stops the JIT from optimizing the generated code and reducing loads (even though they are very cheap when hitting L1). Even now the JIT can use intrinsics (and SSE on x86, which is more effective than a 64-bit long) to work with the String; it's just not visible at the Java level. Admittedly I have not been following the JIT for quite some time, though. Actually, as of Java 1.7 hashCode primarily uses murmur32 as the hash function and the original hashCode is not much used. The code uses 32-bit ops everywhere and the JIT should be able to optimize the fetch too.
    – bestsss
    Commented May 6, 2014 at 5:12
6

Java 9 runs the "sorting lines of an enormous file.txt in java" example twice as fast on my machine as Java 6, and needs only 1 GB of memory because -XX:+CompactStrings is enabled by default. Also, in Java 6 compressed strings only worked for 7-bit ASCII characters, whereas Java 9 supports Latin-1 (ISO-8859-1). Some operations such as charAt(idx) might be slightly slower, though. With the new design, other encodings could also be supported in the future.
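The difference between the two cutoffs can be sketched with a pair of range checks. This only illustrates which character ranges qualify; it does not inspect how the JVM actually stores a given String, and `CompactCheck` with its two methods is a hypothetical helper.

```java
/** Hypothetical helper: shows which strings fit the 7-bit ASCII and Latin-1 ranges. */
final class CompactCheck {
    private CompactCheck() {
    }

    /** True if every char is in 0x00..0x7F, the range the Java 6 option handled. */
    static boolean fitsAscii(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    /** True if every char is in 0x00..0xFF, i.e. representable in ISO-8859-1. */
    static boolean fitsLatin1(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }
}
```

For example, "caf\u00e9" (with an e-acute) fails `fitsAscii` but passes `fitsLatin1`, while a string containing "\u20ac" (the euro sign) fails both and would need two bytes per character.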

I wrote a newsletter about this on The Java Specialists' Newsletter.

1
  • Welcome to Stack Overflow, Heinz. Commented May 2, 2016 at 7:09
4

In OpenJDK 7 (1.7.0_147-icedtea, Ubuntu 11.10), the JVM simply fails with an

Unrecognized VM option 'UseCompressedStrings'

when JAVA_OPTS (or command line) contains -XX:+UseCompressedStrings.

It seems Oracle really removed the option.
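If you would rather probe for a flag from inside a running JVM than have startup fail, something along these lines might work. It is a sketch that assumes a HotSpot-based JVM exposing `com.sun.management.HotSpotDiagnosticMXBean`; `VmFlagProbe` and `flagValue` are hypothetical names.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: looks up a -XX flag on the running VM without failing startup. */
final class VmFlagProbe {
    private VmFlagProbe() {
    }

    /** Returns the flag's current value, or null if this VM does not know the flag. */
    static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return null; // assumption: not a HotSpot-style VM
        }
        try {
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException unknownFlag) {
            // getVMOption throws this when no option with the given name exists.
            return null;
        }
    }
}
```

Calling `VmFlagProbe.flagValue("UseCompressedStrings")` then returns null on a VM that no longer has the option, instead of the VM refusing to start.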

1
  • 2
    Well that sucks. I just learned about this option, and wanted to try it on our testing environments. We handle a lot of Strings, and this could have potentially reduced our memory usage.
    – mjuarez
    Commented Feb 10, 2013 at 9:05
