Invited Netflix talk: JVM issues in the age of scale! We take an under-the-hood look at Java locking, the memory model, overheads, serialization, UUID, GC tuning, CMS, and ParallelGC.
jvm goes to big data
1. JVM goes BigData srisatish.ambati AT gmail.com DataStax/OpenJDK 2/28/2011 @srisatish
2. Motivation A compendium of recent jvm scale issues while working with big data. This talk will not have details on big data. Thanks Sid!
3. Trail Ahead synchronized Non-blocking Hashmap - A state transition view Collections Serialization UUID Garbage Collection - The free parameters! - Generations, Promotion, Fragmentation - Offheap Questions & asynchronous IO
4. tools of trade
- What the JVM is doing: dtrace, hprof, introscope, jconsole, visualvm, yourkit, gchisto, zvision
- Invasive JVM observation tools: bci, jvmti, jvmdi/pi agents, logging
- What the OS is doing: dtrace, oprofile, vtune, perf
- What the network/disk is doing: ganglia, iostat, lsof, nagios, netstat, tcpdump
6. synchronized under the hood
- Fast path for the no-contention case: thin lock
- Bias threads to lock, or bulk revoke bias
- Store-free biasing
10. Non-blocking collections: Amdahl's > Moore's!
- State, Actions: key/value pairs! get, put, delete, _resize
- ByteArray to hold Data
- Concurrent writes: using CAS
- No locks, no volatile
- Much faster than locking under heavy load
- Directly reach main data array in 1 step
- Resize as needed: copy Array to a larger Array on demand, post updates
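The CAS-based write path on this slide can be sketched roughly as follows. This is a minimal illustration and not the high-scale-lib implementation: the class name `CasTable` and its methods are hypothetical, the table has a fixed capacity (no `_resize`, no delete), and it relies on `AtomicReferenceArray`, which does carry volatile semantics, unlike the "no volatile" design described above.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical fixed-capacity, open-addressing table (illustration only).
// Keys and values share one array: slot 2*i holds the key, slot 2*i+1 the value.
final class CasTable {
    private final AtomicReferenceArray<Object> slots;
    private final int capacity; // must be a power of two

    CasTable(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        this.capacity = capacity;
        this.slots = new AtomicReferenceArray<>(2 * capacity);
    }

    /** Stores value under key; returns the previous value, or null if absent. */
    Object put(Object key, Object value) {
        int idx = index(key);
        for (int probes = 0; probes < capacity; probes++) {
            Object k = slots.get(2 * idx);
            if (k == null) {
                // Empty slot: try to claim it for this key with a single CAS.
                if (slots.compareAndSet(2 * idx, null, key)) {
                    k = key;
                } else {
                    k = slots.get(2 * idx); // lost the race; see which key won
                }
            }
            if (k.equals(key)) {
                // Key slot is ours (or already held this key): swap in the value atomically.
                return slots.getAndSet(2 * idx + 1, value);
            }
            idx = (idx + 1) & (capacity - 1); // linear probe to the next slot
        }
        throw new IllegalStateException("table full (this sketch does not resize)");
    }

    /** Returns the value for key, or null if the key is absent. */
    Object get(Object key) {
        int idx = index(key);
        for (int probes = 0; probes < capacity; probes++) {
            Object k = slots.get(2 * idx);
            if (k == null) return null;
            if (k.equals(key)) return slots.get(2 * idx + 1);
            idx = (idx + 1) & (capacity - 1);
        }
        return null;
    }

    private int index(Object key) {
        int h = key.hashCode();
        h ^= (h >>> 16); // spread high bits into the low bits used for indexing
        return h & (capacity - 1);
    }
}
```

Because keys are never removed in this sketch, a claimed key slot never changes again, which is what keeps the probe loop lock-free.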
11. Death & Taxes: Java Overheads!
- Cost of an 8-char String?
  String object: 8b hdr + 12b fields + 4b ptr + 4b pad
  char[]: 8b hdr + 4b len + 16b data
  A: 56 bytes, or a 7x blowup
- Cost of a 100-entry TreeMap<Double,Double>?
  48b TreeMap + 100 x (40b TreeMap$Entry + 16b Double + 16b Double)
  A: 7248 bytes, or a ~5x blowup
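The arithmetic behind those two answers can be checked with a short sketch. The per-object sizes are the figures quoted on the slide, not values measured on a running JVM; the class and method names are hypothetical.

```java
// Worked arithmetic for the slide's numbers (sizes are the slide's figures, not measurements).
final class OverheadMath {
    /** Bytes for one 8-char String: the String object plus its backing char[]. */
    static int stringBytes() {
        int stringObject = 8 + 12 + 4 + 4; // header + fields + pointer to char[] + padding
        int charArray = 8 + 4 + 16;        // header + length + 8 chars * 2 bytes each
        return stringObject + charArray;
    }

    /** Bytes for a TreeMap<Double,Double> with the given number of entries. */
    static int treeMapBytes(int entries) {
        int perEntry = 40 + 16 + 16;       // TreeMap$Entry + boxed key + boxed value
        return 48 + entries * perEntry;    // 48 bytes for the TreeMap object itself
    }

    public static void main(String[] args) {
        int payloadString = 8;             // 8 one-byte characters of actual data
        int payloadMap = 100 * 2 * 8;      // 100 pairs of 8-byte doubles
        System.out.println(stringBytes() + " bytes, blowup " + (double) stringBytes() / payloadString);
        System.out.println(treeMapBytes(100) + " bytes, blowup " + (double) treeMapBytes(100) / payloadMap);
    }
}
```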
13. Which collection: Mozart or Bach? Concurrency: Non-blocking HashMap Google Collections Overheads Watch out for per-element costs! Primitives can be hard to manage! Sparse collections Average collection size in enterprise is ~3
14. java.io.Serializable is S.L.O.W
- True to platform
- Use "transient"
- ObjectStreamField[]
- Avro
- Google Protocol Buffers
- Externalizable + byte[]
- Roll your own serializable
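As one illustration of the "Externalizable + byte[]" option, the sketch below writes only the two field values instead of letting default serialization describe every field. The `CompactEvent` class and its fields are hypothetical; checked exceptions are wrapped so the helper is easy to call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical value class that controls its own wire format.
final class CompactEvent implements Externalizable {
    private long id;
    private String name;

    public CompactEvent() { }              // public no-arg constructor required by Externalizable

    CompactEvent(long id, String name) {
        this.id = id;
        this.name = name;
    }

    long id() { return id; }
    String name() { return name; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(id);                 // 8 bytes
        out.writeUTF(name);                // 2-byte length + modified UTF-8 bytes
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        id = in.readLong();                // must read in the same order as written
        name = in.readUTF();
    }

    /** Serializes to a byte[] and back, returning the reconstructed copy. */
    static CompactEvent roundTrip(CompactEvent event) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(event);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CompactEvent) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }
}
```

The trade-off is that field order and versioning become the author's responsibility, which is part of what schema-based formats such as Avro or Protocol Buffers take over.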
15. ser+deser smaller is better https://github.com/eishay/jvm-serializers.git
16. avro Schema No per datum overheads Optional code gen Types are runtime Untagged data No manually-assigned field Ids Cons: Schema mismatches Runtime only checks
17. google-proto-buffer Define message format in .proto file All data in key/value pairs Generate sources .builder for each class with getter/setter
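To make "all data in key/value pairs" concrete, here is a hand-rolled sketch of how a single integer field is laid out on the wire: a key built from the field number and wire type, followed by a varint value. It does not use the protobuf library or generated code; `WireSketch` and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;

// Hand-rolled illustration of the key/value wire layout (not the protobuf library).
final class WireSketch {
    /** Writes value as a base-128 varint: 7 bits per byte, high bit set on all but the last byte. */
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** Encodes one varint-typed field: key = (fieldNumber << 3) | wireType, then the value. */
    static byte[] encodeVarintField(int fieldNumber, long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int wireTypeVarint = 0;
        writeVarint(out, ((long) fieldNumber << 3) | wireTypeVarint);
        writeVarint(out, value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Field number 1 with value 150 encodes as the three bytes 08 96 01.
        for (byte b : encodeVarintField(1, 150)) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```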
18. thrift Type, Transport, Protocol, Version, Processors Separation of structure from protocol & transport TCompactProtocol, etc tag/data, compression TSocket, TFileTransport, etc colocated clients & servers
19. UUID
- java.util.UUID is slow: dominated by sha_transform costs
- Leach-Salz variant (128-bit)
- Turns out that the default PRNG (via SecureRandom) uses /dev/urandom for seed initialization: -Djava.security.egd=file:/dev/urandom
- PRNG without file is at least 20%-40% better.
- Use TimeUUIDs where possible: much faster
- Alternatives: JUG – java.uuid.generator, com.eaio.uuid ~10x faster
- http://github.com/cowtowncoder/java-uuid-generator
- http://jug.safehaus.org/
- http://johannburkard.de/blog/programming/java/Java-UUID-generators-compared.htm
22. summary Time-based UUIDs vs. UUIDs use ~4 times less kernel time on creation! No SHA library calls! Optimized toString(). Much faster than standard java.util.UUID, with better instructions per clock as well. If on EC2: watch out for non-cacheable file access to /dev/urandom!
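A time-based UUID needs no random source or hashing at creation time, which is where the savings come from. The sketch below packs a version 1 layout into a `java.util.UUID` by hand; it is not JUG or com.eaio.uuid, and the class name, the caller-supplied clock sequence and node, and the simple monotonic counter are all assumptions made for illustration.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a version 1 (time-based) UUID layout; not a replacement for JUG.
final class TimeUuidSketch {
    // 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01).
    private static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;
    private static final AtomicLong LAST_TIMESTAMP = new AtomicLong();

    /** Packs a 60-bit timestamp, 14-bit clock sequence and 48-bit node into a version 1 UUID. */
    static UUID fromParts(long timestamp100ns, int clockSeq, long node) {
        long timeLow = timestamp100ns & 0xFFFFFFFFL;
        long timeMid = (timestamp100ns >>> 32) & 0xFFFFL;
        long timeHi = (timestamp100ns >>> 48) & 0x0FFFL;
        long msb = (timeLow << 32) | (timeMid << 16) | 0x1000L | timeHi; // 0x1000 = version 1
        long lsb = 0x8000000000000000L                                   // Leach-Salz variant bits
                 | ((clockSeq & 0x3FFFL) << 48)
                 | (node & 0xFFFFFFFFFFFFL);
        return new UUID(msb, lsb);
    }

    /** Generates the next UUID; timestamps are forced to increase so calls never collide. */
    static UUID next(int clockSeq, long node) {
        long now = System.currentTimeMillis() * 10_000L + UUID_EPOCH_OFFSET;
        long ts = LAST_TIMESTAMP.updateAndGet(prev -> Math.max(prev + 1, now));
        return fromParts(ts, clockSeq, node);
    }
}
```

A production generator would also have to pick the node and clock sequence safely across processes and restarts, which this sketch leaves to the caller.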
23. String theory of Java! byte[] vs. char[] If ver > jdk16u21 try -XX:+UseCompressedStrings Append performance (gc) differs: Strings vs. StringBuffers com.google.common.base.Joiner Join text for cheap, skipNulls or useForNulls()
24. “Null References: A billion dollar mistake” - C.A.R. Hoare “I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.” - qconlondon, '09
26. verbose:gc
- GC logs are cheap even in production:
  -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
- A bit expensive/obscure ones:
  -XX:PrintFLSStatistics=2 -XX:PrintCMSStatistics=1 -XX:+PrintCMSInitiationStatistics -XX:PrintFLSCensus=1
27. Three free parameters Allocation Rate: your workload! Size: defines runway! Live Set, memory Pause times: Stoppages!
28. Four free parameters Allocation Rate: your application load! Size: defines runway! Live Set, system memory Pause times: Stoppages! (fourth: Overheads of GC – Space & CPU.)
29. Part I: Sizing to be -Xmx == -Xms or not? Young generation: Use -Xmn for predictable performance eden survivor spaces new Object() survivor ratio jvm allocates Tenuring Threshold promotion old gen
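To see how a given set of sizing flags actually carved up the heap, the standard management beans can be queried from inside the JVM. The sketch below is one way to do that; pool names (eden, survivor, old gen) vary by collector and JVM version, and `HeapPools` is a hypothetical helper name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Lists the memory pools the running JVM reports (names depend on the collector in use).
final class HeapPools {
    static String describe() {
        StringBuilder sb = new StringBuilder();
        long maxHeap = Runtime.getRuntime().maxMemory();   // roughly what -Xmx allows
        sb.append("max heap: ").append(maxHeap >> 20).append(" MB\n");
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) continue;                   // pool no longer valid
            sb.append(pool.getName())
              .append(" (").append(pool.getType()).append("): used=")
              .append(usage.getUsed() >> 10).append(" KB, max=")
              .append(usage.getMax() < 0 ? "undefined" : (usage.getMax() >> 10) + " KB")
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```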
30. Part II: Pick a collector!
- Serial GC: Serial New + Serial Old
- Parallel GC (default): Parallel Scavenge + Serial Old
- UseParallelOldGC: Parallel Scavenge + Parallel Old
- UseConcMarkSweepGC: ParNew, CMS Old, Serial Old
- G1/Experimental
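Which young and old collectors a JVM actually ended up with can be confirmed at runtime, for example with the sketch below. The bean names returned differ between collectors and JVM versions, so the code only lists them; `ActiveCollectors` is a hypothetical name.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Reports the collectors in use, with their cumulative collection counts and times.
final class ActiveCollectors {
    static List<String> names() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount/getCollectionTime return -1 when the value is undefined.
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", total ms=" + gc.getCollectionTime());
        }
    }
}
```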
31. Reading GC logs – a topic/tool Full GC is STW Initial Mark, Rescan/WeakRef/Remark are STW Look for promotion failures Look for concurrent mode failures
33. Tuning CMS
- Don't promote too often! Frequent promotion causes fragmentation (avoid never tenure)
- TenuringThreshold
- Size the generations: minor GC times are a function of the live set; old gen should host steady state comfortably
- Avoid the CMS initiating heuristic: -XX:+UseCMSInitiatingOccupancyOnly
- Use concurrent collection for System.gc(): -XX:+ExplicitGCInvokesConcurrent
34. GC Threads
- Parallelize on multicores
- -XX:ParallelGCThreads=4 (default derived from the # of CPUs n on the system: n if n <= 8, else 8 + (n - 8) * 5/8)
- -XX:ParallelCMSThreads=4 (default derived from ParallelGCThreads)
- Strategy A: Tune minor GCs & keep application data in eden
35. Did someone ask about defaults?
  if (FLAG_IS_DEFAULT(ParallelGCThreads)) {
    assert(ParallelGCThreads == 0, "Default ParallelGCThreads is not 0");
    // For very large machines, there are diminishing returns
    // for large numbers of worker threads. Instead of
    // hogging the whole system, use a fraction of the workers for every
    // processor after the first 8. For example, on a 72 cpu machine
    // and a chosen fraction of 5/8
    // use 8 + (72 - 8) * (5/8) == 48 worker threads.
    unsigned int ncpus = (unsigned int) os::active_processor_count();
    return (ncpus <= switch_pt) ? ncpus : (switch_pt + ((ncpus - switch_pt) * num) / den);
  } else {
    return ParallelGCThreads;
  }
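The same calculation, ported to Java, makes it easy to check what a given machine would get. The values `num = 5`, `den = 8` and `switchPt = 8` are taken from the comment in the snippet above; actual HotSpot builds may use different values per platform and version, so treat them as assumptions.

```java
// Java port of the default worker-thread calculation quoted above (parameter values are assumptions).
final class GcThreadDefaults {
    /** All CPUs up to switchPt, then only num/den of each additional CPU. */
    static int parallelWorkerThreads(int ncpus, int num, int den, int switchPt) {
        return (ncpus <= switchPt)
                ? ncpus
                : switchPt + ((ncpus - switchPt) * num) / den; // integer division, as in the original
    }

    /** Uses the 5/8 fraction and switch point of 8 from the quoted comment. */
    static int defaultParallelGcThreads(int ncpus) {
        return parallelWorkerThreads(ncpus, 5, 8, 8);
    }

    public static void main(String[] args) {
        int ncpus = Runtime.getRuntime().availableProcessors();
        System.out.println(ncpus + " cpus -> " + defaultParallelGcThreads(ncpus) + " GC threads");
    }
}
```

With these values a 72-CPU machine gets 8 + 64 * 5 / 8 = 48 threads, matching the example in the comment, and a 16-CPU machine gets 13.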
36. Fragmentation Performance degrades over time Inducing “Full GC” makes problem go away Free memory that cannot be used Round off errors Reduce occurrence Use a compacting collector Promote less often Use uniform sized objects
37. Not enough large contiguous space for promotion Small objects still can fit in the holes! Compaction – stop the world. Unsolved on Oracle/Sun Hotspot Azul Systems Pauseless JVM.
42. Gone 0xff the heap !! ByteBuffer.allocateDirect(16 * 1024 * 1024) Also can be mapped memory of a file region Store long-lived objects outside jvm Managed by native i/o ops. JNA: dynamically load & call native libraries without compile time decl like JNI Works for limited use cases in the lab. Ex: Terracotta, Hbase, Cassandra
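The two off-heap options on this slide, a direct buffer and a mapped file region, look roughly like this with the standard NIO API. The class name, temp-file prefix and 4 KB mapping size are arbitrary choices for the sketch, and it deliberately ignores the de-allocation problem.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch of storing data outside the Java heap (names and sizes are illustrative).
final class OffHeapSketch {
    /** Writes and reads a long through a 16 MB direct (off-heap) buffer. */
    static long directRoundTrip(long value) {
        ByteBuffer buf = ByteBuffer.allocateDirect(16 * 1024 * 1024);
        buf.putLong(0, value);            // absolute put: does not move the buffer position
        return buf.getLong(0);
    }

    /** Writes and reads a long through a memory-mapped region of a temporary file. */
    static long mappedRoundTrip(long value) {
        try {
            Path file = Files.createTempFile("offheap-sketch", ".bin");
            file.toFile().deleteOnExit();
            try (FileChannel channel = FileChannel.open(
                    file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Mapping READ_WRITE beyond the current size grows the file to fit.
                MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
                map.putLong(0, value);
                map.force();              // flush the mapped pages to the file
                return map.getLong(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Neither buffer is released when it goes out of scope; the native memory is only reclaimed once the buffer object itself is garbage collected.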
43. Gone 0xff the heap ? Issues to consider: No clear api to de-allocate from this region See jbellis patch to JNA-179 for FreeableBuffer Object cleanup relegated to finalization Single finalizer thread, Bug ID: 4469299 Behind WeakReference processing in jdk16u21 Workaround: -XX:MaxDirectMemorySize=<size> Manually Trigger System.gc() to avoid “leak”
44. Virtually there! Ballooning driver for Memory: Disable it! Time (TSC) issue! It's relative! Scheduling when # of threads > # of vcpus.. Tickless _nohz kernel GC Thread starvation = STW pauses large ec2 instances are not all equal.. DirectPathIO & vt-d, rvi – Watch out for Sockets! Tools: Performance counters still not virtualized!
45. summary JVM is still the most popular platform for deployment for the new languages! JVM heartburn around scale! Serialization UUID Object overhead Garbage Collection Hypervisor
46. References
- Chris Wimmer, http://wikis.sun.com/display/HotSpotInternals/Synchronization
- Russell & Detlefs, http://www.oracle.com/technetwork/java/biasedlocking-oopsla2006-wp-149958.pdf
- Google Protocol Buffers, http://code.google.com/p/protobuf
- Thrift, http://incubator.apache.org/thrift/static/thrift-20070401.pdf
- Leach-Salz Variant of UUID, http://www.upnp.org/resources/draft-leach-uuids-guids-00.txt
- Hans Boehm, http://www.hpl.hp.com/personal/Hans_Boehm/gc/complexity.html
- Brian Goetz, JSR-133, http://www.ibm.com/developerworks/java/library/j-jtp03304/
- GCSpy, http://www.cs.kent.ac.uk/projects/gc/gcspy/
- Understanding GC logs, http://blogs.sun.com/poonam/entry/understanding_cms_gc_logs
- Cliff Click's high-scale-lib, http://sourceforge.net/projects/high-scale-lib/