JVM goes BigData srisatish.ambati AT gmail.com DataStax/OpenJDK 2/28/2011 @srisatish
Motivation A compendium of recent JVM scale issues encountered while working with big data. This talk will not go into the details of big data itself. Thanks Sid!
Trail Ahead synchronized Non-blocking Hashmap      - A state transition view Collections Serialization UUID Garbage Collection      - The free parameters!      - Generations, Promotion, Fragmentation      - Offheap Questions & asynchronous IO
tools of trade What the JVM is doing: dtrace, hprof, introscope, jconsole, visualvm, yourkit, gchisto, zvision Invasive JVM observation tools: bci, jvmti, jvmdi/pi agents, logging What the OS is doing: dtrace, oprofile, vtune, perf What the network/disk is doing: ganglia, iostat, lsof, nagios, netstat, tcpdump
 
synchronized under the hood Fast path for the no-contention case: thin locks Bias threads to the lock, or bulk-revoke the bias Store-free biasing
JMM:  happens-before, causality Partial order volatile Piggybacking FutureTask BlockingQueue jsr133
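A minimal sketch of the “piggybacking” idiom mentioned above: an ordinary field is published through a volatile write, so the JMM happens-before edge carries it along (class and field names are illustrative):

class Box {
    private int payload;              // plain field, made visible by the volatile write below
    private volatile boolean ready;   // volatile write/read pair creates the happens-before edge

    void produce(int v) {
        payload = v;                  // (1) ordinary write
        ready = true;                 // (2) volatile write: happens-before any read that sees true
    }

    Integer consume() {
        return ready ? payload : null;  // (3) volatile read; if it sees true, (1) is visible too
    }
}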
java.util.concurrent also holds locks!
Tomcat under concurrent load!
Non-blocking collections: Amdahl's > Moore's! State, Actions – key/value pairs! get, put, delete, _resize ByteArray to hold Data Concurrent writes: using CAS No locks, no volatile Much faster than locking under heavy load Directly reach main data array in 1 step Resize as needed: copy the array to a larger array on demand, post updates
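A usage sketch, assuming Cliff Click's high-scale-lib (linked in the references) is on the classpath; NonBlockingHashMap is a drop-in ConcurrentMap:

import java.util.concurrent.ConcurrentMap;
import org.cliffc.high_scale_lib.NonBlockingHashMap;

ConcurrentMap<String, Long> counts = new NonBlockingHashMap<String, Long>();
counts.putIfAbsent("rows", 0L);            // CAS-based insert: no lock, no volatile field
counts.replace("rows", 0L, 1L);            // CAS-based update; losers simply retry
Long v = counts.get("rows");               // one hop straight into the main data array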
Death & Taxes: Java Overheads! Cost of an 8-char String? Cost of 100-entry TreeMap<Double,Double> ? 8b hdr 12b fields 4b ptr 4b pad 8b hdr 4b len 16b data A: 56 bytes, or a 7x blowup 48b TreeMap 40b TreeMap$Entry 16b Double 16b Double A: 7248 bytes or a ~5x blowup
yourkit: memory profile
Which collection: Mozart or Bach? Concurrency:   Non-blocking HashMap Google Collections Overheads Watch out for per-element costs! Primitives can be hard to manage! Sparse collections Average collection size in enterprise is ~3
java.io.Serializable is S..L..O..W True to platform Use “transient” and ObjectStreamField[] (serialPersistentFields) Avro Google Protocol Buffers Externalizable + byte[] Roll your own serialization
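A minimal “roll your own” sketch using Externalizable (the Point class and its fields are illustrative):

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Point implements Externalizable {
    private int x, y;
    private transient String cachedLabel;   // transient: never written, rebuilt on demand

    public Point() { }                      // Externalizable requires a public no-arg constructor

    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(x);                    // write exactly the bytes you need, nothing more
        out.writeInt(y);
    }

    public void readExternal(ObjectInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }
}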
ser+deser  smaller is better https://github.com/eishay/jvm-serializers.git
avro Schema No per-datum overheads Optional code gen Types are runtime Untagged data No manually-assigned field IDs Cons: schema mismatches, runtime-only checks
google-proto-buffer Define message format in .proto file All data in key/value pairs Generate sources .builder for each class with getter/setter
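A hedged sketch of the generated-code side; “Person” is a hypothetical message compiled from a .proto file with name and id fields, and the methods below are the ones protoc generates for such a message:

Person p = Person.newBuilder()          // a Builder is generated per message type
                 .setName("ada")
                 .setId(1)
                 .build();
byte[] wire = p.toByteArray();          // compact, tagged key/value wire format
Person back = Person.parseFrom(wire);   // throws InvalidProtocolBufferException on bad input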
thrift Type, Transport, Protocol, Version, Processors Separation of structure from protocol & transport TCompactProtocol, etc. – tag/data, compression TSocket, TFileTransport, etc. Co-located clients & servers
UUID java.util.UUID is slow, dominated by sha_transform costs Leach-Salz (128-bit) Turns out that the default PRNG (via SecureRandom) uses /dev/urandom for seed initialization -Djava.security.egd=file:/dev/urandom A PRNG without the file access is at least 20%-40% better. Use TimeUUIDs where possible – much faster Alternatives: JUG – Java UUID Generator, com.eaio.uuid ~10x faster http://github.com/cowtowncoder/java-uuid-generator http://jug.safehaus.org/ http://johannburkard.de/blog/programming/java/Java-UUID-generators-compared.htm
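A rough (and unscientific) timing sketch to reproduce the effect: generate random UUIDs in a loop and watch sha_transform dominate a kernel profile, then compare against a time-based generator:

import java.util.UUID;

public class UuidBench {
    public static void main(String[] args) {
        long t0 = System.nanoTime();
        for (int i = 0; i < 100000; i++) {
            UUID.randomUUID();           // type-4 UUID backed by SecureRandom (/dev/urandom seed)
        }
        System.out.printf("randomUUID x100k: %.1f ms%n", (System.nanoTime() - t0) / 1e6);
    }
}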
Leach-Salz UUID (java.util.UUID.toString()):

/**
 * Returns a {@code String} object representing this {@code UUID}.
 *
 * <p> The UUID string representation is as described by this BNF:
 * <blockquote><pre>
 * {@code
 * UUID                   = <time_low> "-" <time_mid> "-"
 *                          <time_high_and_version> "-"
 *                          <variant_and_sequence> "-"
 *                          <node>
 * time_low               = 4*<hexOctet>
 * time_mid               = 2*<hexOctet>
 * time_high_and_version  = 2*<hexOctet>
 * variant_and_sequence   = 2*<hexOctet>
 * node                   = 6*<hexOctet>
 * hexOctet               = <hexDigit><hexDigit>
 * hexDigit               =
 *       "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
 *     | "a" | "b" | "c" | "d" | "e" | "f"
 *     | "A" | "B" | "C" | "D" | "E" | "F"
 * }</pre></blockquote>
 *
 * @return  A string representation of this {@code UUID}
 */
public String toString() {
    return (digits(mostSigBits >> 32, 8) + "-" +
            digits(mostSigBits >> 16, 4) + "-" +
            digits(mostSigBits, 4) + "-" +
            digits(leastSigBits >> 48, 4) + "-" +
            digits(leastSigBits, 12));
}
--------------------------------------------------------------------------------
PerfTop:  1485 irqs/sec  kernel:18.6%  exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
--------------------------------------------------------------------------------
 samples  pcnt  function                                                   DSO
 1882.00  26.3% intel_idle                                                 [kernel.kallsyms]
 1678.00  23.5% os::javaTimeMillis()                                       libjvm.so
  382.00   5.3% SpinPause                                                  libjvm.so
  335.00   4.7% Timer::ImplTimerCallbackProc()                             libvcllx.so
  291.00   4.1% gettimeofday                                               /lib/libc-2.12.1.so
  268.00   3.7% hpet_next_event                                            [kernel.kallsyms]
  254.00   3.6% ParallelTaskTerminator::offer_termination(TerminatorTerminator*)  libjvm.so

--------------------------------------------------------------------------------
PerfTop:  1656 irqs/sec  kernel:59.5%  exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
--------------------------------------------------------------------------------
 samples  pcnt  function                                                   DSO
 6980.00  38.5% sha_transform                                              [kernel.kallsyms]
 2119.00  11.7% intel_idle                                                 [kernel.kallsyms]
 1382.00   7.6% mix_pool_bytes_extract                                     [kernel.kallsyms]
  437.00   2.4% i8042_interrupt                                            [kernel.kallsyms]
  416.00   2.3% hpet_next_event                                            [kernel.kallsyms]
  390.00   2.2% extract_buf                                                [kernel.kallsyms]
  376.00   2.1% ThreadInVMfromNative::~ThreadInVMfromNative()              libjvm.so
  321.00   1.8% T.3542                                                     libjvm.so
  298.00   1.6% __ticket_spin_lock                                         [kernel.kallsyms]
  296.00   1.6% Timer::ImplTimerCallbackProc()                             libvcllx.so
  255.00   1.4% Unsafe_GetInt                                              libjvm.so
summary Time-based UUIDs use ~4x less kernel time on creation than random UUIDs! No SHA library calls! Optimized toString() Much faster than standard java.util.UUID – better instructions per clock as well. If on EC2: watch out for non-cacheable file access to /dev/urandom!
String theory of Java! byte[] vs. char[] If on jdk16u21 or later, try -XX:+UseCompressedStrings Append performance (gc) differs: String vs. StringBuffer com.google.common.base.Joiner joins text for cheap; skipNulls() or useForNull()
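Joiner in practice (Guava): skipNulls() drops nulls, useForNull() substitutes a placeholder:

import com.google.common.base.Joiner;

String a = Joiner.on(", ").skipNulls().join("x", null, "y");        // "x, y"
String b = Joiner.on(", ").useForNull("n/a").join("x", null, "y");  // "x, n/a, y"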
“Null References: A billion dollar mistake” - C.A.R. Hoare “I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.” - qconlondon, '09
Best Practices: Garbage Collection
verbose:gc GC logs are cheap even in production: -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution Somewhat more expensive/obscure ones: -XX:PrintFLSStatistics=2 -XX:PrintCMSStatistics=1 -XX:+PrintCMSInitiationStatistics -XX:+PrintFLSCensus
Three free parameters Allocation Rate: your workload! Size: defines runway! Live Set, memory Pause times:  Stoppages!
Four free parameters Allocation Rate: your application load! Size: defines runway! Live Set, system memory Pause times:  Stoppages! (fourth: Overheads of GC – Space & CPU.)
Part I: Sizing To be (-Xmx == -Xms) or not? Young generation: use -Xmn for predictable performance. new Object() is allocated in eden; survivors copy between the survivor spaces (sized by the survivor ratio) until the tenuring threshold is reached, then objects are promoted to the old gen.
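An illustrative (not prescriptive) sizing command line; the heap numbers and the MyServer class are placeholders, only the flags themselves are real:

java -Xms6g -Xmx6g -Xmn1g \
     -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=4 \
     MyServer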
Part II: Pick a collector! Serial GC – Serial New + Serial Old Parallel GC (default): Parallel Scavenge + Serial Old -XX:+UseParallelOldGC: Parallel Scavenge + Parallel Old -XX:+UseConcMarkSweepGC: ParNew, CMS Old, Serial Old G1 (experimental)
Reading GC logs – a topic/tool Full GC is STW Initial Mark, Rescan/WeakRef/Remark  are STW Look for promotion failures Look for concurrent mode failures
...
995.330: [CMS-concurrent-mark: 0.952/1.102 secs] [Times: user=3.69 sys=0.54, real=1.10 secs]
995.330: [CMS-concurrent-preclean-start]
995.618: [CMS-concurrent-preclean: 0.279/0.287 secs] [Times: user=0.90 sys=0.20, real=0.29 secs]
995.618: [CMS-concurrent-abortable-preclean-start]
995.695: [GC 995.695: [ParNew (promotion failed)
Desired survivor size 41943040 bytes, new threshold 1 (max 1)
- age   1:   29826872 bytes,   29826872 total
: 720596K->703760K(737280K), 0.4710410 secs]996.166: [CMS996.317: [CMS-concurrent-abortable-preclean: 0.218/0.699 secs] [Times: user=1.39 sys=0.10, real=0.70 secs]
 (concurrent mode failure): 4100132K->784070K(5341184K), 4.7478300 secs] 4780154K->784070K(6078464K), [CMS Perm : 17033K->17014K(28400K)], 5.2191410 secs] [Times: user=5.70 sys=0.01, real=5.22 secs]
...
Tuning CMS Don't promote too often! Frequent promotion causes fragmentation (but avoid never tenuring) TenuringThreshold Size the generations Min GC times are a function of the live set Old gen should host the steady state comfortably Avoid the CMS initiation heuristic: -XX:+UseCMSInitiatingOccupancyOnly Use concurrent collection for System.gc(): -XX:+ExplicitGCInvokesConcurrent
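Pulling the CMS knobs above together into one illustrative flag set (the numeric values and MyServer are placeholders, not recommendations):

java ... -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:+ExplicitGCInvokesConcurrent \
     -XX:MaxTenuringThreshold=4 MyServer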
GC Threads Parallelize on multicores -XX:ParallelGCThreads=4 (default derived from the # of cpus on the system: 8 + (ncpus - 8) * 5/8 on large machines) -XX:ParallelCMSThreads=4 (default derived from ParallelGCThreads) Strategy A: tune minor GCs & keep application data in eden
Did someone ask about defaults?

if (FLAG_IS_DEFAULT(ParallelGCThreads)) {
  assert(ParallelGCThreads == 0, "Default ParallelGCThreads is not 0");
  // For very large machines, there are diminishing returns
  // for large numbers of worker threads.  Instead of
  // hogging the whole system, use a fraction of the workers for every
  // processor after the first 8.  For example, on a 72 cpu machine
  // and a chosen fraction of 5/8
  // use 8 + (72 - 8) * (5/8) == 48 worker threads.
  unsigned int ncpus = (unsigned int) os::active_processor_count();
  return (ncpus <= switch_pt) ?
         ncpus :
         (switch_pt + ((ncpus - switch_pt) * num) / den);
} else {
  return ParallelGCThreads;
}
Fragmentation Performance degrades over time Inducing a “Full GC” makes the problem go away Free memory that cannot be used Round-off errors Reduce occurrence: use a compacting collector, promote less often, use uniform-sized objects
Not enough large contiguous space for promotion, though small objects can still fit in the holes! Compaction is stop-the-world and unsolved on Oracle/Sun HotSpot; cf. Azul Systems' Pauseless JVM.
JRockit Mission Control
Example Application suddenly transitions to back-to-back full gcs. Cannot use free mem – too many holes!
Tools GCHisto jconsole VisualVM/VisualGC Logs Thread dumps yourkit memory profile, snapshots
GCSpy
Gone 0xff the heap !! ByteBuffer.allocateDirect(16 * 1024 * 1024) Can also be mapped memory of a file region Store long-lived objects outside the JVM heap, managed by native I/O ops. JNA: dynamically load & call native libraries without the compile-time declarations JNI requires Works for limited use cases in the lab. Ex: Terracotta, HBase, Cassandra
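A minimal sketch of the direct-buffer route (16 MB, as on the slide); the payload lives outside the Java heap and the collector never scans it:

import java.nio.ByteBuffer;

ByteBuffer offHeap = ByteBuffer.allocateDirect(16 * 1024 * 1024);
offHeap.putLong(0, 42L);        // absolute write: no on-heap object is created for the payload
long v = offHeap.getLong(0);    // read it back; only the small ByteBuffer wrapper lives on-heap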
Gone 0xff the heap ? Issues to consider: No clear API to deallocate from this region (see the jbellis patch to JNA-179 for FreeableBuffer) Object cleanup is relegated to finalization: single finalizer thread, Bug ID: 4469299, and it runs behind WeakReference processing in jdk16u21 Workaround: -XX:MaxDirectMemorySize=<size> and manually trigger System.gc() to avoid the “leak”
Virtually there! Ballooning driver for memory: disable it! Time (TSC) issue: it's relative! Scheduling when # of threads > # of vcpus... Tickless (NO_HZ) kernel GC thread starvation = STW pauses Large EC2 instances are not all equal... DirectPath I/O & VT-d, RVI – watch out for sockets! Tools: performance counters are still not virtualized!
summary The JVM is still the most popular deployment platform for the new languages! JVM heartburn around scale: Serialization UUID Object overhead Garbage Collection Hypervisor
References Chris Wimmer, http://wikis.sun.com/display/HotSpotInternals/Synchronization Russell & Detlefs, http://www.oracle.com/technetwork/java/biasedlocking-oopsla2006-wp-149958.pdf Google Protocol Buffers, http://code.google.com/p/protobuf Thrift, http://incubator.apache.org/thrift/static/thrift-20070401.pdf Leach-Salz Variant of UUID, http://www.upnp.org/resources/draft-leach-uuids-guids-00.txt Hans Boehm, http://www.hpl.hp.com/personal/Hans_Boehm/gc/complexity.html Brian Goetz, JSR-133, http://www.ibm.com/developerworks/java/library/j-jtp03304/ GCSpy, http://www.cs.kent.ac.uk/projects/gc/gcspy/ Understanding GC logs, http://blogs.sun.com/poonam/entry/understanding_cms_gc_logs Cliff Click's high-scale-lib, http://sourceforge.net/projects/high-scale-lib/
