To understand the process, let us consider method A, which calls methods B and C; the metric we will observe is the number of instructions completed. We will attempt to calibrate the number of instructions completed in both B and C, as well as the instructions completed between those calls.
Our instrumentation of Java methods is achieved by capturing the method entry and method exit events. We use a device driver to capture the current total number of instructions completed on that thread and compute deltas of that metric. To maximize the accuracy of our calibration, we capture the number of instructions completed a fixed number of instructions after we receive the event and a fixed number of instructions before we return from the event. By doing this, we can ignore any processing we do between the two times we observe the metric.
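To make this concrete, here is a minimal sketch, in Java, of the per-thread delta computation. The names are hypothetical, and a real agent would read the counter through the native device driver rather than take it as an argument:

    // Hypothetical sketch of per-thread delta capture. The counter values
    // would really come from a device driver reading a hardware counter.
    public final class ThreadDeltas {
        private long lastCount;   // counter value captured at the previous event

        // Called a fixed number of instructions after an entry/exit event
        // arrives; the delta is attributed to the interrupted code section.
        public long deltaOnEvent(long currentCount) {
            return currentCount - lastCount;
        }

        // Called a fixed number of instructions before returning from the
        // event, so all JPROF processing between the two reads is excluded.
        public void markEventEnd(long currentCount) {
            lastCount = currentCount;
        }
    }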
As an example, let us consider a Main program that has 4 subroutines: A, B, C, and D. Let A call B and C and let C call D, as shown below:
Main--+                                          +->Main
      |  e           e                       x  |
      +->A--+     +->A--+                 +->A--+
            |  x  |     |  e           x  |
            +->B--+     +->C--+     +->C--+
                              |  x  |
                              +->D--+
      J     J     J     J     J     J     J     J
In the example above, JPROF processing occurs at each transition marked with a J. Every "->X--" section represents time attributed to routine X. The letter above each section represents the amount of instrumentation overhead associated with each section: e (entry overhead) and x (exit overhead).
The original calibration algorithm computed these overheads by observing all entry and exit events and saving the minimum observed values. It assumed that the minimum value was the result of a single call or return between invocations of JPROF. Thus, calibration was accomplished by subtracting e-1 or x-1 from each observed value, respectively.
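A minimal sketch of that original scheme, with hypothetical names, might look like this; the subtraction leaves one instruction, the single call or return assumed to separate consecutive JPROF invocations:

    // Sketch of the original calibration: one minimum per event type.
    public final class OriginalCalibration {
        private long minEntry = Long.MAX_VALUE;  // smallest delta ending in an entry
        private long minExit  = Long.MAX_VALUE;  // smallest delta ending in an exit

        public void observe(boolean endsInEntry, long delta) {
            if (endsInEntry) minEntry = Math.min(minEntry, delta);
            else             minExit  = Math.min(minExit,  delta);
        }

        // Applied at flush time: subtract e-1 or x-1, keeping the one
        // real call or return instruction.
        public long calibrate(boolean endsInEntry, long delta) {
            return delta - ((endsInEntry ? minEntry : minExit) - 1);
        }
    }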
Our next version took this one step further and computed 32 different minimum overheads: one for each combination of entry and exit events with Interpreted, Jitted, Native, and Other methods calling Interpreted, Jitted, Native, and Other methods. Although that helped, this more complicated view of the example was needed:
Main--+                                          +->Main
      |e2+e1       x2+e1                   x2+x1|
      +->A--+     +->A--+                 +->A--+
            |e2+x1|     |e2+e1       x2+x1|
            +->B--+     +->C--+     +->C--+
                              |e2+x1|
                              +->D--+
      J     J     J     J     J     J     J     J
In the example above, the single letters above each section have been replaced with sums representing the detailed amount of instrumentation overhead associated with each section. The terms in these sums are: e1 (pre-JPROF entry overhead), e2 (post-JPROF entry overhead), x1 (pre-JPROF exit overhead), and x2 (post-JPROF exit overhead). This means that we now compute different minimum overheads for all combinations of entry and exit events followed by entry and exit events, for all combinations of identifiable method types. At the moment, these are Interpreted, Jitted-Dynamic, Jitted-Static, Native, BuiltIn, Compiling, Thread, and Other methods.
The total number of categories is not quite as bad as it may seem, because some combinations of categories do not actually occur. For example, an entry followed by an exit means that the third method type is the same as the first method type, because we returned to the caller. Thus, Interp-En-Jitted-Ex-Jitted should never occur. However, the number of valid categories is still quite large.
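A sketch of how such a category key might be represented (the type names follow the list above, but the structure itself is our illustration, not JPROF's actual data layout):

    // Illustrative category key: the method types and event kinds that
    // bracket a code section, e.g. Interp-En-Jitted-Ex-Interp.
    enum MethodType { INTERPRETED, JITTED_DYNAMIC, JITTED_STATIC, NATIVE,
                      BUILTIN, COMPILING, THREAD, OTHER }
    enum EventKind  { ENTRY, EXIT }

    record CategoryKey(MethodType first, EventKind event1,
                       MethodType second, EventKind event2,
                       MethodType third) {
        // An entry followed by an exit returns to the original caller, so
        // combinations like Interp-En-Jitted-Ex-Jitted are invalid.
        boolean isValid() {
            return !(event1 == EventKind.ENTRY && event2 == EventKind.EXIT)
                   || first == third;
        }
    }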
The solution to all of these problems is runtime calibration. Rather than waiting until a flush is performed to apply the calibrations, they are applied as each event is encountered. This both eliminates the uncertainty about which sequence of events is occurring and reduces the size of each node, since only the total adjustment must now be saved.
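A sketch of this runtime approach, again with invented names and a simple string category standing in for the combinations described above:

    // Sketch of runtime calibration: the current minimum for a category is
    // applied as soon as the event completing that code section arrives.
    import java.util.HashMap;
    import java.util.Map;

    final class RuntimeCalibration {
        private final Map<String, Long> minimums = new HashMap<>();

        // Returns the adjusted delta; only this total needs to reach the node.
        long onEvent(String category, long rawDelta) {
            minimums.merge(category, rawDelta, Math::min);
            long overhead = minimums.get(category) - 1;  // keep one real instruction
            return rawDelta - overhead;
        }
    }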
The only shortcoming of runtime calibration is that the calibrations applied may change over time. That is, the calibration applied at time X might be greater than the calibration applied at time Y, because a new minimum for that category might be observed between time X and time Y. This becomes even more of a factor when we use the DELAY_START parameter to suppress instrumentation until a START command is received. To minimize this effect, we can either warm up the calibration or initialize our calibration statistics from the results of previous executions.
Runtime calibration is always performed, unless DELAY_CALIBRATION is specified. Unless START is specified, a warmup will be performed to gather calibration data for a short period of time during startup. This is controlled by the WARMUPDELAY=n and WARMUPCOUNT=n parameters. WARMUPDELAY defines the number of JIT compilations that must occur before warmup begins; the default is to wait until the first compilation is complete. WARMUPCOUNT defines the number of Entry/Exit events that will be processed before stopping the warmup; the default is one million events. The WARMUPDELAY requirement guarantees that there is a mixture of both compiled and interpreted methods during warmup, rather than just the interpreted methods that would be seen if warmup started immediately. Warmup can be turned off by specifying WARMUPCOUNT=0.
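The gating implied by these parameters might look like the following sketch; the parameter names match the documentation, but the surrounding structure is invented for illustration:

    // Sketch of warmup gating: begin after WARMUPDELAY JIT compilations,
    // stop after WARMUPCOUNT entry/exit events. Defaults: 1 and 1,000,000.
    final class Warmup {
        private final int  warmupDelay;
        private final long warmupCount;   // 0 disables warmup entirely
        private int  compilations;
        private long events;

        Warmup(int warmupDelay, long warmupCount) {
            this.warmupDelay = warmupDelay;
            this.warmupCount = warmupCount;
        }

        void onJitCompilation() { compilations++; }

        // True while calibration data should still be gathered.
        boolean active() {
            return warmupCount > 0
                && compilations >= warmupDelay
                && events < warmupCount;
        }

        void onEvent() { if (active()) events++; }
    }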
If you find that this warmup capability still does not give you sufficient accuracy, you can specify the CALSTATS=filename parameter to identify a file to be used to save and restore calibration data from one execution to another. If the specified file exists, the calibration statistics will be initialized from this file. When JPROF terminates, the current minimum values will be written to the file.
We provide a testcase, trainer.java, which repeatedly calls minimum-length routines of all categories to maximize the chances that good minimum values will be found. If we detect differences in the JVM, JPROF, or the metrics being used, the file of calibration statistics will be rejected, and you will have to either delete the old file or specify a different file.
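The save/restore cycle could be pictured like this sketch; the file format, header contents, and class names are all invented here, but the header check mirrors the rejection behavior just described:

    // Sketch of persisting calibration minimums across executions. A header
    // identifying the JVM, JPROF build, and metrics lets stale files be
    // rejected on restore.
    import java.io.*;
    import java.util.*;

    final class CalStats {
        static Map<String, Long> load(File file, String expectedHeader) throws IOException {
            Map<String, Long> minimums = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                if (!expectedHeader.equals(in.readLine())) {
                    throw new IOException("calibration file rejected: environment changed");
                }
                for (String line; (line = in.readLine()) != null; ) {
                    String[] parts = line.split("=", 2);  // category=minimum
                    minimums.put(parts[0], Long.parseLong(parts[1]));
                }
            }
            return minimums;
        }

        static void save(File file, String header, Map<String, Long> minimums) throws IOException {
            try (PrintWriter out = new PrintWriter(new FileWriter(file))) {
                out.println(header);
                minimums.forEach((category, min) -> out.println(category + "=" + min));
            }
        }
    }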
It is recommended that you use the same options when running the trainer testcase that you use when you run your application. Using different options, such as disabling function inlining, can produce different results on some JVMs.
When we examine an instruction trace of code that has not been instrumented, we discover that some of the kinds of code blocks we identified above have a minimum size greater than one. For example, x-En-y-En-z code blocks, where x, y, and z are any type of method, actually include method prologue code that x-Ex-y-En-z code blocks do not. The instructions necessary to set up the stack frame at the start of a method are a typical example of this kind of prologue code. By parsing the output of an instruction trace, we can determine the associated language overhead for each type of code block. We have computed this overhead for each supported combination of JVM and operating system, and we apply it automatically.
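One plausible way to picture the use of these per-block minimums, sketched with placeholder values rather than real measurements, is as a floor below which a calibrated delta is never allowed to fall:

    // Sketch of applying language overhead: each code-block type has a
    // known minimum size (e.g. prologue instructions). Values shown are
    // placeholders, not measured numbers.
    import java.util.Map;

    final class LanguageOverhead {
        private final Map<String, Long> floor = Map.of(
            "En-En", 6L,   // includes method prologue (stack-frame setup)
            "Ex-En", 2L,   // no prologue between an exit and the next entry
            "Ex-Ex", 1L);

        long apply(String blockType, long calibratedDelta) {
            return Math.max(calibratedDelta, floor.getOrDefault(blockType, 1L));
        }
    }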
When this language overhead is used as the minimum value after calibration, we are able to achieve CALLFLOW results that are almost identical to those achieved with ITRACE, but without the tremendous execution-time overhead of ITRACE.