Member Comments on our presentation Tuning the JVM

I have a few observations and heuristics from my experience tuning the Java GC that I would like to share while the topic is fresh in the minds of the JUG members after this week's great presentation by Mike Richardson. It would be great if you could post this note.

I did some GC data collection, analysis, and tuning on a project with a typical J2EE architecture: EJBs in the middle tier and Struts in the web tier.

My findings and heuristics are as follows:

Objectives of GC analysis and Tuning:

  1. Try not to have a major GC at all, or only one major GC every one or two hours. That way you either don't have "stop the world" at all, or have it very infrequently.
  2. Decide what your tolerance for a minor GC time is. Can you tolerate 10 seconds, 20, or 30? Less or more? That's one of your NFRs that you have to decide on.

From there you need heuristics and measurements of actual behavior under load. Either do a real load test under realistic conditions, or write a multi-threaded simulator (good job for Groovy!) that simulates the major use cases. You would need to know the behavior of your major use cases well enough to simulate their memory footprint without going through the complexities of the application. Either approach would do, but you do need to determine the application behavior metrics.
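As a rough illustration of the simulator idea, here is a minimal sketch in Java rather than Groovy. Everything in it is hypothetical: the class name, the parameters, and the workload numbers are placeholders, and a real simulator would model the allocation profile of your own use cases. Each worker thread allocates one buffer per simulated request and keeps every Nth one alive to stand in for session or cache state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical load simulator: every name and number here is illustrative.
final class AllocationSimulator {

    /** Runs the simulated use case on several threads and returns total bytes allocated. */
    static long run(int threads, int requestsPerThread, int bytesPerRequest, int retainEveryNth) {
        if (retainEveryNth <= 0) {
            throw new IllegalArgumentException("retainEveryNth must be positive");
        }
        AtomicLong allocated = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                List<byte[]> retained = new ArrayList<>(); // stands in for session or cache state
                for (int i = 0; i < requestsPerThread; i++) {
                    byte[] payload = new byte[bytesPerRequest]; // one request's worth of memory
                    if (i % retainEveryNth == 0) {
                        retained.add(payload); // stays reachable until this worker finishes
                    }
                    allocated.addAndGet(payload.length);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(5, TimeUnit.MINUTES)) {
                throw new IllegalStateException("simulation did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return allocated.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long bytes = run(4, 10_000, 16 * 1024, 100);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("allocated %d MB in %.2f s%n", bytes >> 20, seconds);
    }
}
```

Running it with GC logging turned on (for example `java -verbose:gc AllocationSimulator`) gives you GC output to analyze without the real application in the way.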

The two vital numbers you need are:

  1. Object creation rate - to size Eden area
  2. Object survival rate  - to size Tenured area
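One way to get these two numbers is to compare consecutive minor GCs in your GC log. The sketch below assumes you have already pulled the timestamp and the before/after occupancies out of the log yourself; the `MinorGc` class and its field names are hypothetical, and the sizes are treated as megabytes.

```java
// Hypothetical helper: derives the two rates from two consecutive minor GC entries.
final class GcRates {

    /** One minor GC as read from a GC log; all field names are illustrative. */
    static final class MinorGc {
        final double timeSec;       // when the collection happened
        final double youngBeforeMb; // young generation occupancy just before it
        final double youngAfterMb;  // young generation occupancy just after it
        final double oldAfterMb;    // tenured occupancy just after it

        MinorGc(double timeSec, double youngBeforeMb, double youngAfterMb, double oldAfterMb) {
            this.timeSec = timeSec;
            this.youngBeforeMb = youngBeforeMb;
            this.youngAfterMb = youngAfterMb;
            this.oldAfterMb = oldAfterMb;
        }
    }

    /** Memory allocated between two collections, divided by the time between them. */
    static double creationRateMbPerSec(MinorGc prev, MinorGc cur) {
        return (cur.youngBeforeMb - prev.youngAfterMb) / (cur.timeSec - prev.timeSec);
    }

    /** Growth of the tenured area between two collections, divided by the time between them. */
    static double tenuringRateMbPerSec(MinorGc prev, MinorGc cur) {
        return (cur.oldAfterMb - prev.oldAfterMb) / (cur.timeSec - prev.timeSec);
    }

    public static void main(String[] args) {
        MinorGc prev = new MinorGc(0.0, 620.0, 20.0, 100.0);
        MinorGc cur = new MinorGc(20.0, 620.0, 20.0, 100.5);
        System.out.printf("creation: %.1f MB/s, tenuring: %.2f MB/min%n",
                creationRateMbPerSec(prev, cur), tenuringRateMbPerSec(prev, cur) * 60);
    }
}
```

Averaging over many pairs of collections taken under steady load will give more trustworthy rates than any single pair.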

The Invariants

Given an object creation rate of M objects per second, and an acceptable tolerance time for a minor GC of T seconds, then

     Eden >= Minor GC time X Object Creation Rate = MT

Note: Object creation rates are typically different in different tiers: the presentation tier has a very different profile from a business tier or a rule engine.

Say you observe a 30 M per second object creation rate, and you can tolerate a 20 second minor GC, then you can size Eden at 30 X 20 = 600 M.

Survivor area >= Major GC time X object survival rate (tenuring rate)

Say you want Major GCs every two hours, and you observe an object survival rate of 1 M per minute, then:

    Survivor area = 2 hours X 60 minutes X 1 M per minute = 120 M.

The third invariant is:

     New area = Eden + 2 Survivor Areas

     New = 600 + 120 + 120 = 840 M

     Survivor ratio = 5 (600/120)
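The arithmetic above can be sketched in a few lines of Java. The class and method names are hypothetical and the sizes are treated as megabytes; `-Xmn` (young generation size) and `-XX:SurvivorRatio` (Eden divided by one survivor space) are standard HotSpot options, but check them against the JVM version you actually run.

```java
// Hypothetical sizing helper that turns the measured rates into HotSpot flags.
final class NewGenSizing {

    /** Eden >= creation rate X tolerated minor GC window. */
    static long edenMb(long creationRateMbPerSec, long minorGcWindowSec) {
        return creationRateMbPerSec * minorGcWindowSec;
    }

    /** Survivor >= tenuring rate X time between major GCs. */
    static long survivorMb(long tenuringRateMbPerMin, long majorGcIntervalMin) {
        return tenuringRateMbPerMin * majorGcIntervalMin;
    }

    /** New area = Eden + 2 survivor areas; survivor ratio = Eden / one survivor area. */
    static String flags(long edenMb, long survivorMb) {
        long newMb = edenMb + 2 * survivorMb;
        long ratio = edenMb / survivorMb;
        return "-Xmn" + newMb + "m -XX:SurvivorRatio=" + ratio;
    }

    public static void main(String[] args) {
        long eden = edenMb(30, 20);        // 30 M per second, 20 second window
        long survivor = survivorMb(1, 120); // 1 M per minute, major GC every 2 hours
        System.out.println(flags(eden, survivor)); // prints "-Xmn840m -XX:SurvivorRatio=5"
    }
}
```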

4. Tenured area

     That's the easiest to size: it is the steady state lowest observed heap value.

You keep measuring and tuning till you get a nice seesaw pattern of minor GCs every 20 secs and a major GC every two hours. The subtle point that can trip you up is that these invariants are minimum values: the actual sizes must be greater than or equal to them. So Eden must be big enough to allow the minor GC to reclaim all the dead infants in the window of time allowed, or tolerated (terrible analogy!). If Eden is not big enough, then you will have more frequent minor GCs. The survivor area must be big enough to hold all the survivors of successive minor GCs (candidates for tenure).
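To see why an undersized Eden means more frequent minor GCs, you can turn the first invariant around: for a fixed creation rate, the time it takes to fill Eden is Eden divided by that rate. A minimal sketch, with hypothetical names and sizes treated as megabytes:

```java
// Hypothetical helper: how often Eden fills up at a given creation rate.
final class MinorGcInterval {

    static double secondsBetweenMinorGcs(double edenMb, double creationRateMbPerSec) {
        if (creationRateMbPerSec <= 0) {
            throw new IllegalArgumentException("creation rate must be positive");
        }
        return edenMb / creationRateMbPerSec;
    }

    public static void main(String[] args) {
        // Eden sized per the invariant: the 20 second target is met.
        System.out.println(secondsBetweenMinorGcs(600, 30)); // prints 20.0
        // Eden at half that size: minor GCs come twice as often.
        System.out.println(secondsBetweenMinorGcs(300, 30)); // prints 10.0
    }
}
```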

The point of these observations is that you can get pretty confused with all the possible parameter tweaks. You must have practical heuristics to tell you how to use these observed numbers, to control what you can and to accept what you can't.

The situation I was in was a very large web app (2 million lines of code): 15 subsystems, with 15 app archs, 15 lead developers, about 100 developers, 20 functional testers, 3 performance testers, 15 project managers, and 15 build masters. You really can't go around telling programmers best practices. There aren't enough hours in the day, and the process did not allow it. You can suggest a best practice, but it would take a year for them to start using it. So the tuning has to happen outside of the development area, done by the app archs together with the infrastructure person.

At the end of the day, memory is cheap, so you should not get OutOfMemoryError. The only reason for getting OutOfMemoryError (other than a bad leak that consumes the tenured space in the 2 hours allotted) would be the size of perm. You should find out early what your classes and methods need and just provide that. A simple experiment with one user could determine perm size.
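For the one-user experiment, one option is to ask the JVM itself how much non-heap memory is in use once the use cases have been exercised. The sketch below uses the standard `java.lang.management` API; the class name is hypothetical. The pool names vary by JVM and version (a perm pool on older HotSpot releases, Metaspace on newer ones), and the non-heap total also includes things like the code cache, so read the per-pool lines rather than just the sum.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical probe: reports non-heap memory pools after a single-user run.
final class ClassMetadataUsage {

    /** Sum of used bytes across all non-heap pools reported by this JVM. */
    static long nonHeapUsedBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null if the pool is no longer valid
            if (pool.getType() == MemoryType.NON_HEAP && usage != null) {
                total += usage.getUsed();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (pool.getType() == MemoryType.NON_HEAP && usage != null) {
                System.out.printf("%s: %d KB used%n", pool.getName(), usage.getUsed() / 1024);
            }
        }
        System.out.printf("non-heap total: %d KB%n", nonHeapUsedBytes() / 1024);
    }
}
```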

If you have a leak, you'll see a pattern of the valleys of the seesaw trending upward, sort of heading North East! The seesaw pattern should be between two parallel lines and the two lines should be as close to horizontal as possible. Any North East slope is the rate of your leak.
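If you want a number rather than an eyeball estimate of that slope, one option is a least-squares line through the valleys, that is, the heap occupancy right after each major GC. This is just one way to fit the line, and the class name and sample values below are hypothetical.

```java
// Hypothetical helper: least-squares slope through the post-GC heap valleys.
final class LeakSlope {

    /** Returns the fitted slope in MB per hour; close to zero means no visible leak. */
    static double slopeMbPerHour(double[] hours, double[] valleyMb) {
        int n = hours.length;
        if (n < 2 || valleyMb.length != n) {
            throw new IllegalArgumentException("need at least two matching samples");
        }
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += hours[i];
            meanY += valleyMb[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            double dx = hours[i] - meanX;
            sxy += dx * (valleyMb[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0) {
            throw new IllegalArgumentException("samples must span more than one point in time");
        }
        return sxy / sxx;
    }

    public static void main(String[] args) {
        double[] hours = {0, 2, 4, 6};           // one valley per major GC, two hours apart
        double[] valleys = {200, 210, 220, 230}; // heap right after each major GC, in MB
        System.out.println(slopeMbPerHour(hours, valleys) + " MB/hour"); // prints 5.0 MB/hour
    }
}
```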

If you don't have GC tuned, you'll experience long pauses or poor throughput. GC follows Amdahl's law (http://en.wikipedia.org/wiki/Amdahl%27s_law): you can't solve it by throwing more processors or memory at it.
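To put a number on the Amdahl's law point, here is a small sketch of the formula, speedup = 1 / (s + (1 - s) / n), where s is the serial fraction and n the number of processors. Treating the share of time spent in "stop the world" GC as the serial fraction is my simplifying assumption for illustration, and the names are hypothetical.

```java
// Hypothetical illustration of Amdahl's law with GC pause time as the serial part.
final class Amdahl {

    /** Speedup on n processors when a fraction s of the work cannot be parallelized. */
    static double speedup(double serialFraction, int processors) {
        if (serialFraction < 0 || serialFraction > 1 || processors < 1) {
            throw new IllegalArgumentException("need 0 <= s <= 1 and n >= 1");
        }
        return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
    }

    public static void main(String[] args) {
        // With 10% of the time in stop-the-world pauses, speedup stays below 1 / 0.1 = 10
        // no matter how many processors are added.
        for (int n : new int[] {1, 2, 8, 64, 1024}) {
            System.out.printf("%4d processors: %.2fx%n", n, speedup(0.10, n));
        }
    }
}
```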

Hope that helps.