Session Presented at 5th IndicThreads.com Conference On Java held on 10-11 December 2010 in Pune, India
WEB: http://J10.IndicThreads.com
------------
Enterprise applications typically comprise multi-layered stacks, including the application modules, application servers, the Java Virtual Machine (JVM) and the underlying operating system. Consequently, the performance of these applications depends on each of these layers. When a performance problem occurs, it is often difficult to determine the starting point for diagnosis. The JVM is the ‘engine’ for most of these applications: broadly, it is responsible for efficient execution and memory management. Because the JVM is usually viewed as a ‘black box’, end users have difficulty attributing performance effects to it.
This talk provides an insight into the key subsystems of the JVM by looking under the hood of a high performance JVM. It goes on to discuss approaches and techniques for analyzing performance issues. It concludes by introducing the audience to a tool called the “Health Center”, which is useful for evaluating and understanding the JVM behavior of a running application in an unobtrusive, lightweight manner.
Takeaways for the Audience: a better understanding of key JVM components, approaches and techniques for diagnosing performance issues, and performance evaluation using the Health Center.
Best Practices for performance evaluation and diagnosis of Java Applications [5th IndicThreads.com Conference On Java]
1. Best Practices for Performance Evaluation and Diagnosis of Java Applications. Prashanth K Nageshappa, Venkataraghavan Lakshminarayanachar (IBM).
2. Agenda: Inside a High Performance Java Virtual Machine (JVM); Performance Issues: Diagnosis Techniques; The Healthcenter.
4. Lifting the Hood: Overall Architecture. Layers: User Code; VM Extensions; Core VM; Portability Layer; Operating systems. Components: Debugger, Profilers, Java Application Code, JVMTI, SE5 Classes, SE6 Classes, Harmony Classes, User Natives, GC, JIT, Class Library Natives, Pluggable VM Interfaces, Java Native Interface (JNI), Core VM (Interpreter, Verifier, Stack Walker), Trace & Dump Engines, Port Library (Files, Sockets, Memory), Thread Library. Platforms: AIX (PPC-32, PPC-64); Linux (x86-32, x86-64, PPC-32, PPC-64, 390-31, 390-64); Windows (x86-32, x86-64); z/OS (390-31, 390-64). Diagram legend: User Code, Java Platform API, VM-aware, Core VM.
5. Java: Adaptive Compilation in J9/TR. Methods start out being interpreted. After N invocations (or via interpreter sampling), methods get compiled at ‘cold’ or ‘warm’ level. A low-overhead sampling thread is used to identify hot methods. Methods may get recompiled at ‘hot’ or ‘scorching’ levels (for more optimizations). The transition to ‘scorching’ goes through a temporary profiling step. Diagram states: interpreter, cold, warm, hot, profiling, scorching.
6. Code Example
public static int total = 55;

public static int dummy(int i, int j, int N, int[] a) {
    int k = 0;
    for (i = 0; i < N; i++) {
        k = k + j + a[i] + (total + foo());
    }
    return k;
}

public static int foo() {
    return 75;
}
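The slide that followed this example is not in the transcript, so which optimizations the talk walked through is an assumption here. As a hand-written illustration only (not actual J9/TR compiler output), the sketch below shows two transformations a JIT compiler may apply to `dummy()`: inlining `foo()` and hoisting the loop-invariant part of the expression out of the loop. The class name `JitSketch` and the method `dummyOptimized` are hypothetical.

```java
// Hand-written illustration of what a JIT might do; not real compiler output.
final class JitSketch {
    static int total = 55;

    static int foo() {
        return 75;
    }

    // The method as written on the slide (the incoming value of i is overwritten).
    static int dummy(int i, int j, int n, int[] a) {
        int k = 0;
        for (i = 0; i < n; i++) {
            k = k + j + a[i] + (total + foo());
        }
        return k;
    }

    // Equivalent form after inlining foo() and hoisting j + total + 75,
    // which does not change inside the loop.
    static int dummyOptimized(int j, int n, int[] a) {
        int invariant = j + total + 75; // foo() inlined as the constant 75
        int k = 0;
        for (int i = 0; i < n; i++) {
            k = k + invariant + a[i];
        }
        return k;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        System.out.println(dummy(0, 10, a.length, a));        // 570
        System.out.println(dummyOptimized(10, a.length, a));  // 570
    }
}
```

Both versions return the same value because `int` addition wraps consistently regardless of the order in which terms are added; the optimized form simply does less work per iteration.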
8. Garbage Collection - Goals. Tidying up… Fast allocation path: a large contributor to overall JVM performance. Low pause times and concurrent operation: fit for purpose, with different algorithms offering different tradeoffs. Hardware exploitation: multiple CPUs and varying memory architectures; algorithmic and processor parallelism. Accurate garbage collection: earlier IBM JVMs did a ‘partially conservative’ GC, which was suboptimal.
9. Compressed References. Object sizes in the example: 32-bit object (24 bytes, 100%); 64-bit object (48 bytes, 50%); 64-bit compressed object (24 bytes, 100%). Use 32-bit values (offsets) to represent object fields. With scaling, between 4 GB and 32 GB can be addressed. To enable the feature: -Xcompressedrefs. Object layouts in the diagram: 32-bit (clazz, flags, monitor, int field, object field, object field); 64-bit (clazz, flags, pad, monitor, int field, pad, object field, object field); 64-bit compressed (clazz, flags, monitor, int field, object field, object field).
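The "between 4 GB and 32 GB" range follows from simple arithmetic: a 32-bit offset can name 2^32 distinct positions, and if every object is aligned to 2^s bytes the offset can be shifted left by s bits. A minimal sketch of that idea is below; it assumes 8-byte alignment (shift of 3) for the 32 GB case and is not the J9 implementation. The class and method names are hypothetical.

```java
// Sketch of scaled 32-bit references; the shift value is an assumption.
final class CompressedRefSketch {
    // Assumed 8-byte object alignment: the low 3 bits of any offset are zero.
    static final int SHIFT = 3;

    // Compress a byte offset from the heap base into a 32-bit value.
    static int compress(long offset) {
        if ((offset & ((1L << SHIFT) - 1)) != 0) {
            throw new IllegalArgumentException("offset is not aligned");
        }
        long scaled = offset >>> SHIFT;
        if (offset < 0 || scaled > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("offset out of range");
        }
        return (int) scaled; // bit pattern is kept; the sign is irrelevant
    }

    // Expand a 32-bit value back into a byte offset from the heap base.
    static long decompress(int ref) {
        return (ref & 0xFFFFFFFFL) << SHIFT;
    }

    // Number of bytes addressable with a 32-bit value and the given shift.
    static long addressableBytes(int shift) {
        return (1L << 32) << shift;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        System.out.println(addressableBytes(0) / gib + " GiB without scaling");
        System.out.println(addressableBytes(3) / gib + " GiB with a shift of 3");
        long offset = 20 * gib; // beyond 4 GiB, still representable
        System.out.println(decompress(compress(offset)) == offset);
    }
}
```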
10. Threading and Monitors. Java uses monitors everywhere. Good: easy to use, safety built-in for many cases! Bad: there’s a tax, even when there’s no contention. Central to performance in JVMs. Avoid it? Escape analysis (but remember JSR 133!). Make it cheaper? Tasuki locks; lock reservations.
11. Tasuki locks. Bimodal lock: ‘thin’ or ‘inflated’. Single atomic operation (on enter). References: A Study of Locking Objects with Bimodal Fields (Tamiya Onodera & Kiyokuni Kawachiya, IBM Research, OOPSLA 1999); Lock Reservation: Java Locks Can Mostly Do Without Atomic Operations (Kiyokuni Kawachiya, Akira Koseki, Tamiya Onodera, IBM Research, OOPSLA 2002). Lock word states in the diagram: unowned (0, flag 0); thin owned (thread ID, flag 0); inflated owned (inflated monitor, flag 1).
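To make the "single atomic operation on enter" point concrete, here is a deliberately simplified sketch of the thin mode only. It assumes a lock word that holds 0 when unowned and the owner's thread ID otherwise; recursion counts, the inflated mode and blocking are all left out, so this is not the Tasuki algorithm from the cited papers. `ThinLockSketch` is a hypothetical name.

```java
import java.util.concurrent.atomic.AtomicLong;

// Thin-mode sketch only: no recursion count, no inflation, no blocking.
final class ThinLockSketch {
    private static final long UNOWNED = 0L;
    private final AtomicLong lockWord = new AtomicLong(UNOWNED);

    // Enter: one compare-and-swap from "unowned" to the caller's thread ID.
    // A real JVM would inflate the lock on failure instead of returning false.
    boolean tryEnter() {
        long self = Thread.currentThread().getId();
        return lockWord.compareAndSet(UNOWNED, self);
    }

    // Exit: the owner stores "unowned" with a volatile write, not a CAS.
    void exit() {
        if (!isOwnedByCurrentThread()) {
            throw new IllegalMonitorStateException("not the owner");
        }
        lockWord.set(UNOWNED);
    }

    boolean isOwnedByCurrentThread() {
        return lockWord.get() == Thread.currentThread().getId();
    }

    public static void main(String[] args) {
        ThinLockSketch lock = new ThinLockSketch();
        System.out.println("entered: " + lock.tryEnter());
        System.out.println("owned:   " + lock.isOwnedByCurrentThread());
        lock.exit();
        System.out.println("owned after exit: " + lock.isOwnedByCurrentThread());
    }
}
```

Because thread IDs are always positive, 0 is safe to use as the "unowned" value in this sketch.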
12. Historical Perspective. Is it just the hardware? We’ve come a long, long way… Why? Processors: better control and understanding of the memory hierarchy. Language understanding (idiom recognition). Processing budget (new instructions, more cores).
13. SPECjbb Trademarks and Results. SPEC and SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation. Results referenced are current as of June, 2009. The SPECjbb2005 results are posted at www.spec.org, which contains a complete list of published SPECjbb2005 results.

Data Pt | Leap vs prev | Accum Leap vs base | JVM | Date | Hardware | Xeon | GHz | Chips | Core(s) | SPECjbb2005 bops | SPECjbb2005 bops/jvm | math
16 | 9% | 2.58 | J9 6sr5 | Mar 09 | IBM BladeCenter HS22 | 5570 | 2.93 | 2 | 8 | 604,417 | 151,104 | 604417/556822=1.09; 2.38*1.09=2.58
15 | 0.005% | 2.38 | HotSpot 6u14p | Mar 09 | Sun Fire X4270 | 5570 | 2.93 | 2 | 8 | 556,822 | 278,411 | 556822/556792=1.00005
14b | | | JRockit P28.0 | Mar 09 | Cisco UCS B200-M1 | 5570 | 2.93 | 2 | 8 | 556,792 | 278,396 |
14a | 7% | 2.38 | JRockit P28.0 | Mar 09 | FSC PRIMERGY RX200 S4 | 5470 | 3.33 | 2 | 8 | 368,034 | 92,009 | 368034/344436=1.07; 2.23*1.07=2.38
13 | 4% | 2.23 | J9 6sr3 | Oct 08 | IBM System x3650 | 5470 | 3.33 | 2 | 8 | 344,436 | 86,109 | 344436/330605=1.04; 2.14*1.04=2.23
12b | | | J9 6sr1 | Sep 08 | IBM System x3650 | 5470 | 3.33 | 2 | 8 | 330,605 | 82,651 |
12a | 7% | 2.14 | J9 6sr1 | Mar 08 | IBM System X 3650 | 5460 | 3.16 | 2 | 8 | 323,172 | 80,793 | 323172/303297=1.07; 2.01*1.07=2.14
11 | 0.06% | 2.01 | HotSpot 6u5p | Feb 08 | Sun Fire X4150 | 5460 | 3.16 | 2 | 8 | 303,297 | 75,824 | 303297/303130=1.0006
10b | | | JRockit P27.4 | Nov 07 | Dell PowerEdge 2950 III | 5460 | 3.16 | 2 | 8 | 303,130 | 75,783 |
10a | 6% | 2.01 | JRockit P27.4 | Nov 07 | Dell PowerEdge 2950 | 5365 | 3.0 | 2 | 8 | 252,403 | 63,101 | 252403/238472=1.06; 1.90*1.06=2.01
9b | | | JRockit P27.2 | Aug 07 | Dell PowerEdge 2950 | 5365 | 3.0 | 2 | 8 | 238,472 | 59,618 |
9a | 1% | 1.90 | JRockit P27.2 | Mar 07 | Dell PowerEdge 2950 | 5355 | 2.66 | 2 | 8 | 220,648 | 110,324 | 220648/218032=1.01; 1.88*1.01=1.90
8 | 4% | 1.88 | J9 5.0sr5 | Feb 07 | IBM System X 3650 | 5355 | 2.66 | 2 | 8 | 218,032 | 109,016 | 218032/210065=1.04; 1.81*1.04=1.88
7b | | | JRockit P27.1 | Nov 06 | Dell PowerEdge 2950 | 5355 | 2.66 | 2 | 8 | 210,065 | 105,033 |
7a | 14% | 1.81 | JRockit P27.1 | Nov 06 | Dell PowerEdge 2950 | 5160 | 3.0 | 2 | 4 | 130,589 | 130,589 | 130589/114941=1.14; 1.59*1.14=1.81
6 | 14% | 1.59 | J9 5.0sr2 | July 06 | IBM System X 3650 | 5160 | 3.0 | 2 | 4 | 114,941 | 114,941 | 114941/100407=1.14; 1.39*1.14=1.59
5b | | | JRockit P26.4 | Jun 06 | FSC PRIMERGY BX620 S3 | 5160 | 3.0 | 2 | 4 | 100,407 | 100,407 |
5a | 7% | 1.39 | JRockit P26.4 | Jun 06 | Dell PowerEdge 1850 | DP | 3.8 | 2 | 2 | 35,503 | 35,503 | 35503/28314=1.25; 1.25*1.11=1.39
4 | 17% | 1.3 | JRockit P26.0 | Mar 06 | FSC PRIMERGY TX300 S2 | dual | 2.8 | 2 | 4 | 49,233 | 49,233 | 49233/41987=1.17; 1.11*1.17=1.30
3b | | | HotSpot 5u5 | Dec 05 | FSC PRIMERGY RX300 S2 | dual | 2.8 | 2 | 4 | 41,986 | 41,986 | 41986/39585=1.06
3a | 6% | 1.11 | HotSpot 5u5 | Dec 05 | FSC PRIMERGY TX300 S2 | DP | 3.8 | 2 | 2 | 28,314 | 28,314 | 28314/24208*3.6/3.8=1.11
2 | 5% | 1.05 | J9 5.0GA | Oct 05 | IBM eServer xSeries 346 | dual | 2.8 | 2 | 4 | 39,585 | 39,585 | 1.11 – 1.06 = 1.05
1 | base | 1.0 | JRockit R25.2 | Jun 05 | Dell PowerEdge SC1425 | DP | 3.6 | 2 | 2 | 24,208 | 24,208 |
15. Debugging Performance Problems. Four layers of deployment: Operating System / Infrastructure; Java Runtime / Garbage Collection; Application Code; External Delays. A simple process is to start at the bottom and eliminate layers.
19. “MustGather” Diagnostics. Set of data requested by IBM Support for initial problem diagnosis. Specified on a per-scenario basis: requests only the data relevant to the scenario. Specified on a per-platform basis: leverages OS-specific tools and capabilities. Split into two parts: Setup, to be done before starting the Java application; Gather, to be done when the problem has occurred. Linked to from product support pages. Java: http://www.ibm.com/software/webservers/appserv/runtimes/support/ WAS: http://www.ibm.com/software/webservers/appserv/was/support/
21. Resource Contention: Physical Memory. Lack of physical memory will cause paging/swapping of memory. Swapping is very costly for a Java process and particularly affects garbage collection performance. Garbage collection touches every point of memory in the process, so all memory would need to be paged back in. This leads to long “mark” and “sweep” phases of GC.
22. Resource Contention: CPU. Insufficient CPU time availability will reduce performance. This normally surfaces when something periodically takes CPU time on the box, e.g. cron jobs running batch applications, or database backups.
23. System Resource Contention: Solutions. Ensure there are enough resources! Where resource contention can occur, it is important to ensure the Java application has its own pool of resources. Isolation can be achieved on some platforms using LPARs/WPARs/Zones. Otherwise, move other applications onto separate machines.
25. Garbage Collection Performance. GC performance issues can take many forms. The definition of a performance problem is very user-centric. The user requirement may be for: very short GC “pause” times; maximum throughput; a balance of both. The first step is to ensure that the correct GC policy has been selected for the workload type; it is helpful to have an understanding of GC mechanisms. The second step is to look for specific performance issues.
26. Object Allocation. Requires a contiguous area of Java heap. Driven by requests from: the Java application; JNI code. Most allocations take place in Thread Local Heaps (TLHs): threads reserve a chunk of free heap to allocate from; this reduces contention on the allocation lock; keeps code running in a straight line (fewer failures); meant to be fast; available for objects < 512 bytes in size. Larger allocations take place under a global “heap lock”: these allocations are one-time costs (out-of-line allocate); multiple threads allocating larger objects at the same time will contend.
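The TLH scheme described on this slide can be sketched as a toy allocator: small requests bump a pointer inside a per-thread chunk without taking any lock, and only chunk refills and large requests go through the global heap lock. The 512-byte limit comes from the slide; the 4096-byte chunk size, the class name `TlhSketch` and the use of plain offsets instead of real memory are assumptions made for illustration.

```java
// Toy model of thread-local heaps; offsets stand in for real addresses.
final class TlhSketch {
    static final int SMALL_LIMIT = 512;  // from the slide: TLH path for objects < 512 bytes
    static final int TLH_SIZE = 4096;    // hypothetical chunk size

    private final Object heapLock = new Object();
    private final long heapEnd;
    private long heapTop = 0;            // next free offset, guarded by heapLock
    private int lockedAllocations = 0;   // how often the heap lock was taken

    // Per-thread chunk as {cursor, limit}; starts empty so the first use refills.
    private final ThreadLocal<long[]> tlh = ThreadLocal.withInitial(() -> new long[] {0, 0});

    TlhSketch(long heapSize) {
        this.heapEnd = heapSize;
    }

    long allocate(int size) {
        if (size < SMALL_LIMIT) {
            long[] chunk = tlh.get();
            if (chunk[0] + size > chunk[1]) {
                long base = allocateLocked(TLH_SIZE); // refill under the heap lock
                chunk[0] = base;
                chunk[1] = base + TLH_SIZE;
            }
            long address = chunk[0];
            chunk[0] += size;                         // lock-free bump allocation
            return address;
        }
        return allocateLocked(size);                  // large object: out-of-line path
    }

    private long allocateLocked(int size) {
        synchronized (heapLock) {
            if (heapTop + size > heapEnd) {
                // A real JVM would treat this as an allocation failure and run a GC.
                throw new OutOfMemoryError("sketch heap exhausted");
            }
            long address = heapTop;
            heapTop += size;
            lockedAllocations++;
            return address;
        }
    }

    int lockedAllocations() {
        synchronized (heapLock) {
            return lockedAllocations;
        }
    }

    public static void main(String[] args) {
        TlhSketch heap = new TlhSketch(1 << 20);
        for (int i = 0; i < 10; i++) {
            heap.allocate(64);
        }
        heap.allocate(1024);
        System.out.println("heap lock taken " + heap.lockedAllocations() + " times for 11 allocations");
    }
}
```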
27. Object Reclamation (Garbage Collection). Occurs under two scenarios. (1) An “allocation failure”: an object allocation is requested and not enough contiguous memory is available. (2) A programmatically requested garbage collection cycle: a call is made to System.gc() or Runtime.gc(); the Distributed Garbage Collector is running; a call to JVMPI/JVMTI is made. Two main technologies are used to remove the garbage: mark-sweep collector; copy collector. IBM uses a mark-sweep collector, or a combination of the two for generational collection.
28. Global Collection Policies. Garbage collection can be broken down into 2 (3) steps. Mark: find all live objects in the system. Sweep: reclaim unused heap memory to the free list. Compact: reduce fragmentation within the free list. All steps are in a single stop-the-world (STW) phase: the application “pauses” whilst garbage collection is done; each step is performed as a parallel task within itself. Four GC “policies”, optimized for different scenarios: -Xgcpolicy:optthruput, optimized for “batch” type applications; -Xgcpolicy:optavgpause, optimized for applications with responsiveness criteria; -Xgcpolicy:gencon, optimized for highly transactional workloads; -Xgcpolicy:subpools, optimized for large systems with allocation contention.
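The mark and sweep steps named above can be illustrated on a toy object graph. The sketch below is single-threaded, models the heap as a plain list, and omits the compact step entirely, so it shows the idea rather than how the IBM collector is built. All class and field names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy mark-sweep over an explicit object graph; no compaction, no parallelism.
final class MarkSweepSketch {
    static final class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();
        boolean marked;

        Obj(String name) {
            this.name = name;
        }
    }

    // Mark: trace everything reachable from the roots and flag it as live.
    static void mark(List<Obj> roots) {
        Deque<Obj> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            Obj current = work.pop();
            if (current.marked) {
                continue; // already visited; this also terminates cycles
            }
            current.marked = true;
            work.addAll(current.refs);
        }
    }

    // Sweep: drop unmarked objects from the heap and reset marks on survivors.
    static List<String> sweep(List<Obj> heap) {
        List<String> reclaimed = new ArrayList<>();
        heap.removeIf(o -> {
            if (o.marked) {
                o.marked = false;
                return false;
            }
            reclaimed.add(o.name);
            return true;
        });
        return reclaimed;
    }

    public static void main(String[] args) {
        Obj a = new Obj("a");
        Obj b = new Obj("b");
        Obj c = new Obj("c");
        a.refs.add(b);
        List<Obj> heap = new ArrayList<>(List.of(a, b, c));
        mark(List.of(a));
        System.out.println("reclaimed: " + sweep(heap)); // reclaimed: [c]
    }
}
```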
29. Introduction to GCMV (Garbage Collection and Memory Visualizer). Verbose GC data visualizer. Eclipse-based tool available as a plugin in ISA and as a standalone tool. Parses and plots all verbose GC logs. Extensible to parse and plot other forms of input. Provides graphical display of a wide range of verbose GC data values. Handles optthruput, optavgpause, and gencon GC modes. Has raw log, tabulated data and graph views, and can save data to .jpeg or .csv files (for export to spreadsheets).
30. GCMV usage scenarios. Investigate performance problems: long periods of pausing or unresponsiveness. Evaluate your heap size: check heap occupancy and adjust heap size if needed. Garbage collection policy tuning: examine GC characteristics, compare different policies. Look for memory growth: heap consumption slowly increasing over time. Evaluate the general health of an application.
33. Evaluating Your Application through the Healthcenter. Answers to: What is my Java application doing? Why is it doing that? Why is my application going so slowly? Is my application scaling well? Do we need to tune the JVM? Am I using the right options? Available from, or as a part of: https://www.ibm.com/developerworks/java/jdk/tools/ http://www.ibm.com/software/support/isa
35. Environment Subsystem. Shows: version information for the JVM; operating system and architecture information for the monitored system; process ID; all system properties; all environment variables.
36. Classes Subsystem. Shows all loaded classes. Shows the time at which each class was loaded. Visualizes classloading activity. Identifies shared classes. Makes recommendations.
37. GC Subsystem - Shows Used Heap (after collection) & GC pause times - Identify memory leaks - Provides tuning recommendations and analysis of GC data
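Health Center collects this data through its own agent, whose API is not shown in the talk. As a rough, tool-independent way to see the same kind of numbers from inside an application, the sketch below reads current heap usage and cumulative GC counts and times from the standard `java.lang.management` MXBeans. Note that it reports heap in use at the moment of the call, not occupancy after a collection, so it is only an approximation of what the slide describes. The class name `GcStatsSketch` is hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Reads heap and GC figures from standard platform MXBeans (not Health Center).
final class GcStatsSketch {
    // Heap bytes in use right now (not "after collection").
    static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    // Total accumulated collection time in milliseconds across all collectors.
    // Collectors that report -1 (undefined) are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("used heap bytes: " + usedHeapBytes());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", time(ms)=" + gc.getCollectionTime());
        }
        System.out.println("total GC time (ms): " + totalGcMillis());
    }
}
```

Sampling `usedHeapBytes()` periodically and watching for a steadily rising floor is one simple way to spot the slow memory growth mentioned in the GCMV usage scenarios.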
38. Locking Subsystem - Always-on lock monitoring - All lock usage is profiled, such as lock request totals, blocking requests and hold times - Helps to identify points of contention that prevent the application from scaling
39. Profiling Subsystem - Sampling based profiler - Instantly identifies hottest methods in an application - See full call stacks to identify where methods are being called from and what methods they call
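The principle behind a sampling profiler can be shown in a few lines: periodically capture a thread's stack and count which method is on top. The sketch below does this with `Thread.getStackTrace()`, which is far more intrusive than a JVM-level sampler and is not how Health Center is implemented; it is only meant to illustrate the idea. Class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Naive sampling profiler: counts the top stack frame of one target thread.
final class SamplingProfilerSketch {
    static volatile long sink; // keeps the busy loop's work observable

    // Take up to `samples` samples, `intervalMillis` apart; stops early if interrupted.
    static Map<String, Integer> sample(Thread target, int samples, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int s = 0; s < samples; s++) {
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) { // empty when the thread is not running
                String method = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(method, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return hits;
    }

    // Example workload: spins until interrupted.
    static void busyLoop() {
        long x = 0;
        while (!Thread.currentThread().isInterrupted()) {
            x += System.nanoTime() & 1;
        }
        sink = x;
    }

    public static void main(String[] args) {
        Thread worker = new Thread(SamplingProfilerSketch::busyLoop, "worker");
        worker.setDaemon(true);
        worker.start();
        Map<String, Integer> hits = sample(worker, 50, 5);
        worker.interrupt();
        hits.forEach((method, count) -> System.out.println(count + "  " + method));
    }
}
```

The methods with the most hits are the "hottest"; keeping the whole stack per sample instead of only the top frame would give the caller and callee views mentioned on the slide.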
40. Features (New). I/O: provides file open events; provides file close events; provides details of files that are currently open. Native Memory: provides native memory usage of the process and system monitored; does not provide a native memory perspective view for the z/OS® 31-bit or z/OS 64-bit platforms.
41. Thank You (Merci, Grazie, Gracias, Obrigado, Danke, Teşekkürler)
Notes: The table gives the background for the 16 data points used in the SPECjbb2005 JVM leapfrog chart. The 2nd column, “Leap vs prev”, lists how much improvement went into the JVM relative to the previous data point using a comparable Xeon chip. The 3rd column, “Accumulated Leap vs base”, lists the accumulated leaps. For example, the 4th data point is a 17% improvement, or leap, compared to the 3rd data point, since the result measured in SPECjbb2005 bops improved by 17% on a comparable Xeon chip. The accumulated improvement for the 3rd data point is listed as 1.11. The accumulated improvement for the 4th data point is therefore 1.3x, since 1.11*1.17=~1.3. An ‘a’ and a ‘b’ result, for example 3a and 3b, are two results using the same JVM on different Intel chips. This provides 2 comparison points. The actual calculation used to get the leap data is in the ‘math’ column. The “Leap vs prev” number is generally a simple comparison against the previous bops score on a comparable Xeon chip. The 2nd and 3rd data point calculations are a little more complicated. The 3rd data point has an associated score in row 3a, which was compared against the first data point; however, since the clock speeds are different, the leap was scaled by the clock speed difference (this is the only direct comparison that required clock speed scaling). This calculation yields a leap of 11%. The 3rd data point also has a score in the row labeled 3b, which is 6% faster than the 2nd data point. Since data point 3 is 11% ahead of data point 1 and 6% ahead of data point 2, we’ve extrapolated that data point 2 is 5% ahead of data point 1. The column labeled SPECjbb2005 bops is used for all the calculated ‘leaps’. The bops/jvm data is there for SPEC result reporting compliance.
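The leap arithmetic described in these notes can be reproduced directly from the scores quoted in the ‘math’ column. The sketch below recomputes the clock-scaled leap for data point 3 and the accumulated leap for data point 4; the helper names are hypothetical, and the inputs are exactly the figures the table's math column uses.

```java
import java.util.Locale;

// Recomputes two entries of the 'math' column from the quoted bops scores.
final class LeapMath {
    // Plain leap: ratio of the new score to the previous score.
    static double leap(double bops, double prevBops) {
        return bops / prevBops;
    }

    // Leap scaled by clock speed, used when the two chips run at different GHz.
    static double leapClockScaled(double bops, double ghz, double prevBops, double prevGhz) {
        return (bops / prevBops) * (prevGhz / ghz);
    }

    public static void main(String[] args) {
        // Data point 3 (row 3a) vs data point 1: 28314/24208*3.6/3.8
        double accum3 = leapClockScaled(28314, 3.8, 24208, 3.6);
        // Data point 4 vs data point 3 (row 3b): 49233/41987
        double leap4 = leap(49233, 41987);
        double accum4 = accum3 * leap4;
        System.out.println(String.format(Locale.ROOT, "accum 3 = %.2f", accum3)); // 1.11
        System.out.println(String.format(Locale.ROOT, "leap 4  = %.2f", leap4));  // 1.17
        System.out.println(String.format(Locale.ROOT, "accum 4 = %.2f", accum4)); // 1.30
    }
}
```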