JVM goes BigData srisatish.ambati AT gmail.com DataStax/OpenJDK 2/28/2011 @srisatish
Motivation A compendium of recent JVM scale issues encountered while working with big data. This talk will not go into the details of big data itself. Thanks Sid!
Trail Ahead synchronized Non-blocking Hashmap      - A state transition view Collections Serialization UUID Garbage Collection      - The free parameters!      - Generations, Promotion, Fragmentation      - Offheap Questions & asynchronous IO
tools of trade What the JVM is doing: dtrace, hprof, introscope, jconsole, visualvm, yourkit, gchisto, zvision Invasive JVM observation tools: bci, jvmti, jvmdi/pi agents, logging What the OS is doing: dtrace, oprofile, vtune, perf What the network/disk is doing: ganglia, iostat, lsof, nagios, netstat, tcpdump
 
synchronized under the hood Fast path for the no-contention case: thin locks Bias threads to the lock, or bulk-revoke the bias Store-free biasing
JMM:  happens-before, causality Partial order volatile Piggybacking FutureTask BlockingQueue jsr133
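A minimal sketch of the “piggybacking” idiom mentioned above: an ordinary field is published through a volatile write, so the JMM happens-before edge carries it along (class and field names are illustrative):

class Box {
    private int payload;              // plain field, made visible by the volatile write below
    private volatile boolean ready;   // volatile write/read pair creates the happens-before edge

    void produce(int v) {
        payload = v;                  // (1) ordinary write
        ready = true;                 // (2) volatile write: happens-before any read that sees true
    }

    Integer consume() {
        return ready ? payload : null;  // (3) volatile read; if it sees true, (1) is visible too
    }
}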
java.util.concurrent also holds locks!
Tomcat under concurrent load!
Non-blocking collections: Amdahl's > Moore's! State, Actions – key/value pairs! get, put, delete, _resize ByteArray to hold Data Concurrent writes: using CAS No locks, no volatile Much faster than locking under heavy load Directly reach main data array in 1 step Resize as needed: copy the array to a larger array on demand, post updates
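A usage sketch, assuming Cliff Click's high-scale-lib (linked in the references) is on the classpath; NonBlockingHashMap is a drop-in ConcurrentMap:

import java.util.concurrent.ConcurrentMap;
import org.cliffc.high_scale_lib.NonBlockingHashMap;

ConcurrentMap<String, Long> counts = new NonBlockingHashMap<String, Long>();
counts.putIfAbsent("rows", 0L);            // CAS-based insert: no lock, no volatile field
counts.replace("rows", 0L, 1L);            // CAS-based update; losers simply retry
Long v = counts.get("rows");               // one hop straight into the main data array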
Death & Taxes: Java Overheads! Cost of an 8-char String? Cost of 100-entry TreeMap<Double,Double> ? 8b hdr 12b fields 4b ptr 4b pad 8b hdr 4b len 16b data A: 56 bytes, or a 7x blowup 48b TreeMap 40b TreeMap$Entry 16b Double 16b Double A: 7248 bytes or a ~5x blowup
yourkit: memory profile
Which collection: Mozart or Bach? Concurrency:   Non-blocking HashMap Google Collections Overheads Watch out for per-element costs! Primitives can be hard to manage! Sparse collections Average collection size in enterprise is ~3
java.io.Serializable is S..L..O..W True to platform Use “transient” and ObjectStreamField[] (serialPersistentFields) Avro Google Protocol Buffers Externalizable + byte[] Roll your own serialization
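A minimal “roll your own” sketch using Externalizable (the Point class and its fields are illustrative):

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Point implements Externalizable {
    private int x, y;
    private transient String cachedLabel;   // transient: never written, rebuilt on demand

    public Point() { }                      // Externalizable requires a public no-arg constructor

    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(x);                    // write exactly the bytes you need, nothing more
        out.writeInt(y);
    }

    public void readExternal(ObjectInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }
}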
ser+deser  smaller is better https://github.com/eishay/jvm-serializers.git
avro Schema No per-datum overheads Optional code gen Types are runtime Untagged data No manually-assigned field IDs Cons: schema mismatches, runtime-only checks
google-proto-buffer Define message format in .proto file All data in key/value pairs Generate sources .builder for each class with getter/setter
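A hedged sketch of the generated-code side; “Person” is a hypothetical message compiled from a .proto file with name and id fields, and the methods below are the ones protoc generates for such a message:

Person p = Person.newBuilder()          // a Builder is generated per message type
                 .setName("ada")
                 .setId(1)
                 .build();
byte[] wire = p.toByteArray();          // compact, tagged key/value wire format
Person back = Person.parseFrom(wire);   // throws InvalidProtocolBufferException on bad input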
thrift Type, Transport, Protocol, Version, Processors Separation of structure from protocol & transport TCompactProtocol, etc. – tag/data, compression TSocket, TFileTransport, etc. Co-located clients & servers
UUID java.util.UUID is slow, dominated by sha_transform costs Leach-Salz (128-bit) Turns out that the default PRNG (via SecureRandom) uses /dev/urandom for seed initialization -Djava.security.egd=file:/dev/urandom A PRNG without the file access is at least 20%-40% better. Use TimeUUIDs where possible – much faster Alternatives: JUG – Java UUID Generator, com.eaio.uuid ~10x faster http://github.com/cowtowncoder/java-uuid-generator http://jug.safehaus.org/ http://johannburkard.de/blog/programming/java/Java-UUID-generators-compared.htm
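A rough (and unscientific) timing sketch to reproduce the effect: generate random UUIDs in a loop and watch sha_transform dominate a kernel profile, then compare against a time-based generator:

import java.util.UUID;

public class UuidBench {
    public static void main(String[] args) {
        long t0 = System.nanoTime();
        for (int i = 0; i < 100000; i++) {
            UUID.randomUUID();           // type-4 UUID backed by SecureRandom (/dev/urandom seed)
        }
        System.out.printf("randomUUID x100k: %.1f ms%n", (System.nanoTime() - t0) / 1e6);
    }
}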
Leach-Salz UUID (java.util.UUID.toString()):

/**
 * Returns a {@code String} object representing this {@code UUID}.
 *
 * <p> The UUID string representation is as described by this BNF:
 * <blockquote><pre>
 * {@code
 * UUID                   = <time_low> "-" <time_mid> "-"
 *                          <time_high_and_version> "-"
 *                          <variant_and_sequence> "-"
 *                          <node>
 * time_low               = 4*<hexOctet>
 * time_mid               = 2*<hexOctet>
 * time_high_and_version  = 2*<hexOctet>
 * variant_and_sequence   = 2*<hexOctet>
 * node                   = 6*<hexOctet>
 * hexOctet               = <hexDigit><hexDigit>
 * hexDigit               =
 *       "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
 *     | "a" | "b" | "c" | "d" | "e" | "f"
 *     | "A" | "B" | "C" | "D" | "E" | "F"
 * }</pre></blockquote>
 *
 * @return  A string representation of this {@code UUID}
 */
public String toString() {
    return (digits(mostSigBits >> 32, 8) + "-" +
            digits(mostSigBits >> 16, 4) + "-" +
            digits(mostSigBits, 4) + "-" +
            digits(leastSigBits >> 48, 4) + "-" +
            digits(leastSigBits, 12));
}
--------------------------------------------------------------------------------
PerfTop:  1485 irqs/sec  kernel:18.6%  exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
--------------------------------------------------------------------------------
 samples  pcnt  function                                                   DSO
 1882.00  26.3% intel_idle                                                 [kernel.kallsyms]
 1678.00  23.5% os::javaTimeMillis()                                       libjvm.so
  382.00   5.3% SpinPause                                                  libjvm.so
  335.00   4.7% Timer::ImplTimerCallbackProc()                             libvcllx.so
  291.00   4.1% gettimeofday                                               /lib/libc-2.12.1.so
  268.00   3.7% hpet_next_event                                            [kernel.kallsyms]
  254.00   3.6% ParallelTaskTerminator::offer_termination(TerminatorTerminator*)  libjvm.so

--------------------------------------------------------------------------------
PerfTop:  1656 irqs/sec  kernel:59.5%  exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
--------------------------------------------------------------------------------
 samples  pcnt  function                                                   DSO
 6980.00  38.5% sha_transform                                              [kernel.kallsyms]
 2119.00  11.7% intel_idle                                                 [kernel.kallsyms]
 1382.00   7.6% mix_pool_bytes_extract                                     [kernel.kallsyms]
  437.00   2.4% i8042_interrupt                                            [kernel.kallsyms]
  416.00   2.3% hpet_next_event                                            [kernel.kallsyms]
  390.00   2.2% extract_buf                                                [kernel.kallsyms]
  376.00   2.1% ThreadInVMfromNative::~ThreadInVMfromNative()              libjvm.so
  321.00   1.8% T.3542                                                     libjvm.so
  298.00   1.6% __ticket_spin_lock                                         [kernel.kallsyms]
  296.00   1.6% Timer::ImplTimerCallbackProc()                             libvcllx.so
  255.00   1.4% Unsafe_GetInt                                              libjvm.so
summary Time-based UUIDs use ~4x less kernel time on creation than random UUIDs! No SHA library calls! Optimized toString() Much faster than standard java.util.UUID – better instructions per clock as well. If on EC2: watch out for non-cacheable file access to /dev/urandom!
String theory of Java! byte[] vs. char[] If on jdk16u21 or later, try -XX:+UseCompressedStrings Append performance (gc) differs: String vs. StringBuffer com.google.common.base.Joiner joins text for cheap; skipNulls() or useForNull()
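Joiner in practice (Guava): skipNulls() drops nulls, useForNull() substitutes a placeholder:

import com.google.common.base.Joiner;

String a = Joiner.on(", ").skipNulls().join("x", null, "y");        // "x, y"
String b = Joiner.on(", ").useForNull("n/a").join("x", null, "y");  // "x, n/a, y"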
“Null References: A billion dollar mistake” - C.A.R. Hoare “I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.” - qconlondon, '09
Best Practices: Garbage Collection
verbose:gc GC logs are cheap even in production: -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution Somewhat more expensive/obscure ones: -XX:PrintFLSStatistics=2 -XX:PrintCMSStatistics=1 -XX:+PrintCMSInitiationStatistics -XX:+PrintFLSCensus
Three free parameters Allocation Rate: your workload! Size: defines runway! Live Set, memory Pause times:  Stoppages!
Four free parameters Allocation Rate: your application load! Size: defines runway! Live Set, system memory Pause times:  Stoppages! (fourth: Overheads of GC – Space & CPU.)
Part I: Sizing To be (-Xmx == -Xms) or not? Young generation: use -Xmn for predictable performance. new Object() is allocated in eden; survivors copy between the survivor spaces (sized by the survivor ratio) until the tenuring threshold is reached, then objects are promoted to the old gen.
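An illustrative (not prescriptive) sizing command line; the heap numbers and the MyServer class are placeholders, only the flags themselves are real:

java -Xms6g -Xmx6g -Xmn1g \
     -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=4 \
     MyServer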
Part II: Pick a collector! Serial GC – Serial New + Serial Old Parallel GC (default): Parallel Scavenge + Serial Old -XX:+UseParallelOldGC: Parallel Scavenge + Parallel Old -XX:+UseConcMarkSweepGC: ParNew, CMS Old, Serial Old G1 (experimental)
Reading GC logs – a topic/tool Full GC is STW Initial Mark, Rescan/WeakRef/Remark  are STW Look for promotion failures Look for concurrent mode failures
...
995.330: [CMS-concurrent-mark: 0.952/1.102 secs] [Times: user=3.69 sys=0.54, real=1.10 secs]
995.330: [CMS-concurrent-preclean-start]
995.618: [CMS-concurrent-preclean: 0.279/0.287 secs] [Times: user=0.90 sys=0.20, real=0.29 secs]
995.618: [CMS-concurrent-abortable-preclean-start]
995.695: [GC 995.695: [ParNew (promotion failed)
Desired survivor size 41943040 bytes, new threshold 1 (max 1)
- age   1:   29826872 bytes,   29826872 total
: 720596K->703760K(737280K), 0.4710410 secs]996.166: [CMS996.317: [CMS-concurrent-abortable-preclean: 0.218/0.699 secs] [Times: user=1.39 sys=0.10, real=0.70 secs]
 (concurrent mode failure): 4100132K->784070K(5341184K), 4.7478300 secs] 4780154K->784070K(6078464K), [CMS Perm : 17033K->17014K(28400K)], 5.2191410 secs] [Times: user=5.70 sys=0.01, real=5.22 secs]
...
Tuning CMS Don't promote too often! Frequent promotion causes fragmentation (but avoid never tenuring) TenuringThreshold Size the generations Min GC times are a function of the live set Old gen should host the steady state comfortably Avoid the CMS initiation heuristic: -XX:+UseCMSInitiatingOccupancyOnly Use concurrent collection for System.gc(): -XX:+ExplicitGCInvokesConcurrent
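Pulling the CMS knobs above together into one illustrative flag set (the numeric values and MyServer are placeholders, not recommendations):

java ... -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:+ExplicitGCInvokesConcurrent \
     -XX:MaxTenuringThreshold=4 MyServer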
GC Threads Parallelize on multicores -XX:ParallelGCThreads=4 (default derived from the # of cpus on the system: 8 + (ncpus - 8) * 5/8 on large machines) -XX:ParallelCMSThreads=4 (default derived from ParallelGCThreads) Strategy A: tune minor GCs & keep application data in eden
Did someone ask about defaults?

if (FLAG_IS_DEFAULT(ParallelGCThreads)) {
  assert(ParallelGCThreads == 0, "Default ParallelGCThreads is not 0");
  // For very large machines, there are diminishing returns
  // for large numbers of worker threads.  Instead of
  // hogging the whole system, use a fraction of the workers for every
  // processor after the first 8.  For example, on a 72 cpu machine
  // and a chosen fraction of 5/8
  // use 8 + (72 - 8) * (5/8) == 48 worker threads.
  unsigned int ncpus = (unsigned int) os::active_processor_count();
  return (ncpus <= switch_pt) ?
         ncpus :
         (switch_pt + ((ncpus - switch_pt) * num) / den);
} else {
  return ParallelGCThreads;
}
Fragmentation Performance degrades over time Inducing a “Full GC” makes the problem go away Free memory that cannot be used Round-off errors Reduce occurrence: use a compacting collector, promote less often, use uniform-sized objects
Not enough large contiguous space for promotion, though small objects can still fit in the holes! Compaction is stop-the-world and unsolved on Oracle/Sun HotSpot; cf. Azul Systems' Pauseless JVM.
JRockit Mission Control
Example Application suddenly transitions to back-to-back full gcs. Cannot use free mem – too many holes!
Tools GCHisto jconsole VisualVM/VisualGC Logs Thread dumps yourkit memory profile, snapshots
GCSpy
Gone 0xff the heap !! ByteBuffer.allocateDirect(16 * 1024 * 1024) Can also be mapped memory of a file region Store long-lived objects outside the JVM heap, managed by native I/O ops. JNA: dynamically load & call native libraries without the compile-time declarations JNI requires Works for limited use cases in the lab. Ex: Terracotta, HBase, Cassandra
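A minimal sketch of the direct-buffer route (16 MB, as on the slide); the payload lives outside the Java heap and the collector never scans it:

import java.nio.ByteBuffer;

ByteBuffer offHeap = ByteBuffer.allocateDirect(16 * 1024 * 1024);
offHeap.putLong(0, 42L);        // absolute write: no on-heap object is created for the payload
long v = offHeap.getLong(0);    // read it back; only the small ByteBuffer wrapper lives on-heap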
Gone 0xff the heap ? Issues to consider: No clear API to deallocate from this region (see the jbellis patch to JNA-179 for FreeableBuffer) Object cleanup is relegated to finalization: single finalizer thread, Bug ID: 4469299, and it runs behind WeakReference processing in jdk16u21 Workaround: -XX:MaxDirectMemorySize=<size> and manually trigger System.gc() to avoid the “leak”
Virtually there! Ballooning driver for memory: disable it! Time (TSC) issue: it's relative! Scheduling when # of threads > # of vcpus... Tickless (NO_HZ) kernel GC thread starvation = STW pauses Large EC2 instances are not all equal... DirectPath I/O & VT-d, RVI – watch out for sockets! Tools: performance counters are still not virtualized!
summary The JVM is still the most popular deployment platform for the new languages! JVM heartburn around scale: Serialization UUID Object overhead Garbage Collection Hypervisor
References Chris Wimmer, http://wikis.sun.com/display/HotSpotInternals/Synchronization Russell & Detlefs, http://www.oracle.com/technetwork/java/biasedlocking-oopsla2006-wp-149958.pdf Google Protocol Buffers, http://code.google.com/p/protobuf Thrift, http://incubator.apache.org/thrift/static/thrift-20070401.pdf Leach-Salz Variant of UUID, http://www.upnp.org/resources/draft-leach-uuids-guids-00.txt Hans Boehm, http://www.hpl.hp.com/personal/Hans_Boehm/gc/complexity.html Brian Goetz, JSR-133, http://www.ibm.com/developerworks/java/library/j-jtp03304/ GCSpy, http://www.cs.kent.ac.uk/projects/gc/gcspy/ Understanding GC logs, http://blogs.sun.com/poonam/entry/understanding_cms_gc_logs Cliff Click's high-scale-lib, http://sourceforge.net/projects/high-scale-lib/
