Demystifying the JVM: JVM Variants, Cppinterpreter and TemplateInterpreter

January 27, 2017

Introduction

Maybe after reading my previous post, or probably before that, you have wondered how JVM actually interprets the Java bytecodes. This post is about the Java Interpreter implementations in the OpenJDK. The reason I say implementations (plural) and not implementation (singular) will be clear after you finish reading the post. Also, before starting with the interpreters, I will tell you about the JVM variants.

Please be aware that I am not a member/contributor of OpenJDK or any other core Java, JVM, JDK projects. What I am saying here is purely my understanding of various public information, and mostly of the source code of the OpenJDK project.

OpenJDK Source Code

It is easy to get the OpenJDK source code. Since the project uses Mercurial, you need a proper Mercurial client on your system. Then:

hg clone http://hg.openjdk.java.net/jdk9/jdk9 openjdk9-src

jdk9 repo is the root repo of OpenJDK and provides a utility to fetch all repos used in the OpenJDK. To fetch them all, just do this:

cd openjdk9-src
sh ./get_source.sh

These steps are documented in the README file in the jdk9 repo. This will fetch the following repos:

  • hotspot: the Hotspot Java Virtual Machine
  • langtools: javac etc. language tools
  • jdk: JDK runtime libraries
  • jaxp, jax-ws, corba: code for the named funtionalities
  • nashorn: OpenJDK Javascript implementation

Because this series will be about the JVM, I will probably only look at the hotspot repo, so any source file I point out should be there under the src directory.

OpenJDK Hotspot Virtual Machine Variants

A few different variants of JVM can be built with the OpenJDK. The best place to understand this is to look at Makefiles. This is from the Spec.gmk which is automatically generated when running bash ./configure under OpenJDK source directory:

JVM_FEATURES_server := all-gcs aot cds compiler1 compiler2 fprof graal jni-check jvmci jvmti management nmt services vm-structs
JVM_FEATURES_client := all-gcs cds compiler1 fprof jni-check jvmci jvmti management nmt services vm-structs
JVM_FEATURES_core := all-gcs cds fprof jni-check jvmti management nmt services vm-structs
JVM_FEATURES_minimal := compiler1 minimal
JVM_FEATURES_zero := all-gcs cds fprof jni-check jvmti management nmt services vm-structs zero
JVM_FEATURES_zeroshark := all-gcs cds fprof jni-check jvmti management nmt services shark vm-structs zero

These (server, client, core, minimal, zero, zeroshark) are JVM variants and the supported features are listed for each variant in the same line. The feature names mean:

  • all-gcs: all garbage collectors
  • aot: Ahead-Of-Time AOT support
  • cds: not sure what this is yet
  • compiler 1: JIT Compiler 1 (as you see it is used both in client and server)
  • compiler 2: JIT Compiler 2 (as you see it is used only in server)
  • fprof: Flat Profiler, I believe, to support the -Xprof option
  • graal: exposes JVM functionality as API
  • jni-check: I believe it enables additional JNI checks to support -Xcheck:jni option
  • jvmci: JVM Compiler Interface (used by dynamic compilers like graal)
  • jvmti: JVM Tool Interface
  • management: I believe it enables the Java Management features
  • minimal: I believe it is just for making a minimal build and sets some compiler/linker flags to optimize the binaries
  • nmt:native memory tracking
  • services: not sure what this is yet
  • shark: LLVM based JIT Compiler for zero
  • vm-structs: not sure what this is yet
  • zero: Zero-Assembler port of JDK

In addition to the ones above, there are these features as well:

  • dtrace: dtrace support
  • static-build: makes a static build
  • link-time-opt: link time optimizations

As you guess correctly, normally, everybody uses either server or client variant. The zero variant is interesting, because it is based on no assembly code, that means it can be ported more easily than others. Also, because compiler 1 and 2 uses assembly codes, zero should be used with shark if one wants JIT compilation.

I am not sure if there is a better way to see this yet, but if you run java -version the output will tell you the variant. On my JDK9 built, it outputs:

java 9-ea
Java(TM) SE Runtime Environment (build 9-ea+153)
Java HotSpot(TM) 64-Bit Server VM (build 9-ea+153, mixed mode)

So a “Server” variant, in mixed mode (interpret + JIT compile).

Java Interpreters

As I said in my previous post, first of all, JVM needs to interpret the Java bytecodes (of course besides class loading, garbage collection etc.). After spending a few days, thinking I have found “the” interpreter in the JVM, and I have actually started writing this post based on that one, I realized there are two interpreters in the JVM source code. To my understanding, only one of them is/can be used in a JVM build, so it is a compile time (compile time of the JVM) decision. So you cannot switch the interpreters when you run the Java code.

The two interpreters are called CppInterpreter and TemplateInterpreter. Both of their source code are in share/vm/interpreter directory. A short history about them can be found here.

The reason I actually mentioned above about JVM variants and features is to explain which Interpreter is used in what variant.

There is an interesting typedef in share/vm/interpreter/interpreter.hpp:

typedef CC_INTERP_ONLY(CppInterpreter) NOT_CC_INTERP(TemplateInterpreter) Interpreter;

This is for using Interpreter type to refer both CppInterpreter and TemplateInterpreter. As you might have guess, CC_INTERP_ONLY and NOT_CC_INTERP is defined as:

##ifdef CC_INTERP
  #define CC_INTERP_ONLY(code) code
  #define NOT_CC_INTERP(code)
##else
  #define CC_INTERP_ONLY(code)
  #define NOT_CC_INTERP(code) code
##endif // CC_INTERP

So only one of them is possible, and this depends on the CC_INTERP . If I search the source files, the only place I find the definition of CC_INTERP is at one of the Makefiles:

ifeq ($(call check-jvm-feature, zero), true)
  JVM_CFLAGS_FEATURES += -DZERO -DCC_INTERP -DZERO_LIBARCH='"$(OPENJDK_TARGET_CPU_LEGACY_LIB)"' $(LIBFFI_CFLAGS)
  JVM_LIBS_FEATURES += $(LIBFFI_LIBS)
endif

So this means, if jvm feature “zero” is to be built, CppInterpreter is built. All other times, TemplateInterpreter is built.

To my understanding, CppInterpreter functionality is implemented together in CppInterpreter and BytecodeInterpreter classes, so when I refer BytecodeInterpreter, CppInterpreter should be understood.

So to sum up, JVM can be built based on platform dependent source code, e.g. platform specific assembly code, or it can also be built with platform independent (so pure C/C++) code. If you choose the former, you have server and client variants and you will use the TemplateInterpreter. If you choose the latter, then you have only zero variant, and you will use the CppInterpreter. As a side note, server and client uses compiler 1 and compiler 2 JIT compilers which also contain platform dependent codes, so assembly; whereas the zero variant supports shark JIT compiler which is based on LLVM JIT compiler and due to that it is platform independent.

Now lets look at both of the interpreters.

CppInterpreter

If we look at the BytecodeInterpreter class, which is defined/implemented in share/vm/interpreter/bytecodeInterpreter.hpp and cpp, we see a run method and BytecodeInterpreter::run method effectively has an infinite loop — really a while(1) — which runs until all the bytecodes are consumed in a Java method. Inside the while(1) loop, there is a big switch for every Java bytecode instruction. Basically it looks like this:

void
BytecodeInterpreter::run(interpreterState istate)
{

  // things to do before
  while (1)
  {
    opcode = *pc;
    switch (opcode)
    {
      // case for each opcode
    }
  }
  // things to do after
}

As you guess, pc is the program counter, pointing to next opcode to be interpreted.

Now, I would like to write a Java class and trace the interpretation of its bytecode. Here is my Java class:

public class Calculator {

  public int one() {
    return 1;
  }

  public int add(int x, int y) {
    return x+y;
  }

}

If we compile this and disassemble its byte codes with:

javac Calculator.java

javap -c Calculator

The output is:

Compiled from "Calculator.java"
public class Calculator {
  public Calculator();
    Code:
       0: aload_0
       1: invokespecial #1. // Method java/lang/Object."<init>":()V
       4: return
public int one();
    Code:
       0: iconst_1
       1: ireturn
public int add(int, int);
    Code:
       0: iload_1
       1: iload_2
       2: iadd
       3: ireturn
}

We see the methods one and add, and also Calculator contstructor, Calculator(), which we actually did not implement but as you know default constructor is added automatically by Java. In this post, I will only focus on the methods one and add.

As you see there are only 5 different opcodes in these methods:

iconst_1
iload_1
iload_2
iadd
ireturn

I am not going to tell more about this but remember JVM is a stack machine, which means it only has a stack to operate with, so no registers as in all mainstream CPUs (think about an Intel processor with all registers eax etc.). The opcodes above briefly means:

  • iconst_1, push integer constant/number 1 into the stack.
  • iload_1 and iload_2, pushes the integer value of local variables 1 and 2 (from the local variable array, see below) into the stack.
  • iadd, two integer values are popped from the stack, added, and the result is pushed back to the stack.
  • ireturn, an integer value is popped from the stack, and pushed back to the stack of the invoker of this method.

So, for these methods, it is very easy to follow how they map to bytecodes. I have hyperlinked each opcode to its official documentation, but here is also the link for the official Java Virtual Machine Instruction Set documentation.

Before continuing, I think it is good to mention what a frame is. Every time a method is invoked by JVM, a new frame is created and a frame consists of a stack (operand stack in JVM terms), local variable array (and a reference to runtime constant pool which we will not use in this post). It is important to know that local variable array implicitly contains the method parameters starting from array index zero. So the value of the first method parameter will be at local variable array index zero. You will naturally ask, in the example above, add function uses iload_1 and iload_2 not iload_0, so method parameters are at index 1 and 2, then what is at index zero ? For instance methods like add, the first method parameter is implicit, and it is as you guess correctly the object instance=the famous “this”.

Now going back to BytecodeInterpreter class, lets see how these opcodes are actually implemented. Below I show the highly simplifiedswitch (opcode) statement for these 5 opcodes.

switch (opcode) {
  OPC_CONST_n(_iconst_1, INT, 1);
  OPC_LOAD_n(1);
  OPC_LOAD_n(2);
  OPC_INT_BINARY(add, Add, 0);
  CASE(_ireturn):
  {
    goto handle_return;
  }
  do_continue:
}
handle_return:
// do work for returning back to invoker frame

Since many opcodes are similar, there are preprocessor macros to define their implementations, as you see above, both iconst, iload_1 and iload_2 and iadd is defined as a macro. Only ireturn is implemented in place, and it simply jumps out of the while loop (since it is the last opcode in this method). You will also see the use of do_continue: label later in this post.

Lets look at the macros now, and please keep in mind I am simplifying these because sometimes they include all variants of the opcode for every type (e.g. iadd, fadd):

##define OPC_CONST_n(opcode, const_type, value)
  CASE(opcode):
    SET_STACK_ ##const_type(value, 0);
    UPDATE_PC_AND_TOS_AND_CONTINUE(1, 1);

OK, we need to look at SET_STACK_INT (remember it is used with const_type=INT above) and UPDATE_PC_AND_TOS_AND_CONTINUE as well.

##define OPC_LOAD_n(num)
  CASE(_iload_##num):
    SET_STACK_SLOT(LOCALS_SLOT(num), 0);
    UPDATE_PC_AND_TOS_AND_CONTINUE(1, 1);

We need to look at SET_STACK_SLOT and LOCALS_SLOT also, and the other method UPDATE_PC_AND_TOS_AND_CONTINUE is used here as well.

##define OPC_INT_BINARY(opcname, opname, test)
  CASE(_i##opcname):
    if (test && (STACK_INT(-1) == 0)) {
      VM_JAVA_ERROR(vmSymbols::java_lang_ArithmeticException(),
                    "/ by zero", note_div0Check_trap);
}
    SET_STACK_INT(VMint##opname(STACK_INT(-2), STACK_INT(-1)), -2);
    UPDATE_PC_AND_TOS_AND_CONTINUE(1, -1);

We have SET_STACK_INT, VMintAdd (##opname is set to Add above), STACK_INT and the same method UPDATE_PC_AND_TOS_AND_CONTINUE as above. An interesting thing to note is test is set to 0 in our case for iadd opcode, but as you guess for idiv and irem, there is a division by zero check exactly here. So now you know when you divide something by 0, the exception is thrown from here.

Lets look at the implementation of the other macros I mentioned above. For SET_STACK_INT and other macros, we should go to another source, cpu/zero/vm/bytecodeInterpreter_zero.hpp:

##define SET_STACK_INT(value, offset)    (*((jint *)&topOfStack[-(offset)]) = (value))

It sets the (top minus offset) index of stack to a value. With iconst_1, it is used as SET_STACK_INT(1, 0), so it simply sets the topmost element to 1, which means actually it pushes the value 1 to stack.

Similarly:

##define STACK_INT(offset)     (*((jint*) &topOfStack[-(offset)]))

STACK_INT is a peek operation with an offset. For STACK_INT(-1) it means the top element, STACK_INT(-2) is the one before.

Lets look at the SET_STACK_SLOT and LOCALS_SLOT:

##define SET_STACK_SLOT(value, offset)   (*(intptr_t*)&topOfStack[-(offset)] = *(intptr_t*)(value))
##define LOCALS_SLOT(offset)    ((intptr_t*)&locals[-(offset)])

LOCALS_SLOT is very similar to STACK_INT, but operates on locals array not on stack. Also, SET_STACK_SLOT is similar to SET_STACK_INT but works with intptr instead of jint type.

What about UPDATE_PC_AND_TOS_AND_CONTINUE:

##define UPDATE_PC_AND_TOS_AND_CONTINUE(opsize, stack) { \
  pc += opsize; MORE_STACK(stack); \
  DO_UPDATE_INSTRUCTION_COUNT(opcode); \
  DEBUGGER_SINGLE_STEP_NOTIFY(); \
  goto do_continue; \
}

Step by step:

  • pc (program counter) is increased.
  • stack size adjusted. See below for the implementation of MORE_STACK.
  • DO_UPDATE_INSTRUCTION_COUNT is for debugging, it does nothing to affect bytecode interpreter.
  • DEBUGGER_SINGLE_STEP_NOTIFY is for JVM Tool Interface (JVMTI) Agents. The definition of this macro depends on a compiler flag, since it is a separate feature “jvmti”. It is an empty statement if jvmti is not included.
  • Jump to continue to another cycle of while(1) infinite interpreter loop

Lets look at what MORE_STACK does, in share/vm/interpreter/bytecodeInterpreter.hpp:

##define MORE_STACK(count)  \
    (topOfStack -= ((count) * Interpreter::stackElementWords))

Simply adjusts the stack size.

You may have realized the stack operations are not implemented as very traditional (and dynamic) push and pop functions but with direct reference to the array underlying the stack. One reason for this is the size of stack is known at compile time (it is naturally 100% known by javac how much stack you need by local variables, opcodes etc.), so there is no need for a dynamic stack implementation by the interpreter.

You may have missed there is only one method/macro left to look at for this example, it is VMintAdd, and it is called by iadd as VMintAdd(STACK_INT(-2), STACK_INT(-1)), so with the two elements at the top of the stack. Going to cpu/zero/vm/bytecodeInterpreter_zero.inline.hpp:

inline jint BytecodeInterpreter::VMintAdd(jint op1, jint op2) {
  return op1 + op2;
}

Here is the function responsible for integer addition in JVM :) or more correctly the integer addition in CppInterpreter in the JVM.

I am sure you are a little bit lost following all of these. I will summarize it in a pseudo-ish code for Calculator::one method.

void
BytecodeInterpreter::run(interpreterState istate)
{
  // assume: pc    = {iconst_1, ireturn}
  //         stack = {}
  //         sp    = 0
  //         locals= {this}
  while (1)
  {
    opcode = *pc;
    switch (opcode)
    {
      case (iconst_1):
        stack[sp] = 1; // this makes stack = {1}
        pc += 1;
        sp += 1;
        goto do_continue;
      case (ireturn):
        goto handle_return;
      do_continue:
    }

  } // end of while (1)
  handle_return:
}

Now for the Calculator::add method:

void
BytecodeInterpreter::run(interpreterState istate)
{
  // assume: pc    = {iload_1, iload_2, iadd, ireturn}
  //         stack = {}
  //         sp    = 0
  //         locals= {this, param1, param2}
  while (1)
  {
    opcode = *pc;
    switch (opcode)
    {
      case (iload_1):
        stack[sp] = locals[1]; // now stack = {param1}
        pc += 1;
        sp += 1;
        goto do_continue;
      // when this is called sp = 1
      case (iload_2):
        stack[sp] = locals[2]; // now stack = {param1, param2}
        pc += 1
        sp += 1
        goto do_continue;
     // when this is called sp = 2
     case (iadd):
        stack[sp-2] = VMintAdd(stack[sp-2], stack[sp-1])
        // now stack = {result, param2}
        pc += 1
        sp -= 1
        goto do_continue;
      case (ireturn):
        goto handle_return;
      do_continue:
    }

  } // end of while(1)
  handle_return:
}

You may have realized I have reversed the stack direction. In actual OpenJDK implementation, if I understand correct, as the traditional way, stack grows downwards, so top of the stack or stack pointer is decreased when new items are added. In my pseudo-ish code, I did the reverse to explain the flow easier.

I should say, if you look at the OpenJDK source code, you will see many macros and definitions and it may look much more complicated than I summarized above. Most of it is due to enabling/disabling the features I mentioned in the beginning and for additional debugging capabilities.

Also, there are two ways how opcodes are fetched. The one I showed above is the normal case, but there is a PREFETCH_OPCODE definition and if it is defined the opcodes are fetched beforehand, not just before the switch (opcode) statement but after every opcode is interpreted and before entering while(1) loop.

There are also two ways how the switch opcode part works. Again the normal case is the one I showed above, but there is also a USELABELS definition, and if defined, it uses a big table/array (exactly with 256 elements since it is the maximum number of opcodes possible) consisting all opcodes to their labels inside the switch block above but without using switch (it jumps with goto), like the below:

const static void* const opclabels_data[256] = {
/* 0x00 */ &&opc_nop,     &&opc_aconst_null,&&opc_iconst_m1,&&opc_iconst_0,
...
}
register uintptr_t *dispatch_table = (uintptr_t*)&opclabels_data[0];
##define DISPATCH(opcode) goto *(void*)dispatch_table[opcode]
void
BytecodeInterpreter::run(interpreterState istate)
{

  // things to do before
  while (1)
  {
    opcode = *pc;

    DISPATCH(opcode);
    {
      // label for each opcode e.g.
      opc_iconst_1:
    }
  }
  // things to do after
}

So basically a big block of code with many labels where the execution jumps around the labels.

TemplateInterpreter

Now lets see why TemplateInterpreter is different. As you have seen above, there were no assembly code to interpret the bytecodes, even the basic “add” method was implemented as a C function returning op 1 + op 2.

TemplateInterpreter is defined and implemented in share/vm/interpreter/templateInterpreter.hpp and cpp , but it also uses other platform dependent files. In order to show you how it works, I need to add a main method to the Calculator, to be able to execute it with “java”.

public class Calculator {

  public int one() {
    return 1;
  }

  public int add(int x, int y) {
    return x+y;
  }

  public static void main(String[] args) {
    new Calculator().one();
    new Calculator().add(1, 2);
  }

}

As you might guess, if we disassemble this with javap, we will see the , one and add as before, and we will see themain method also. The reason I said I need to run the Calculator is because TemplateInterpreter is a runtime/on-the-fly generated code and we can only see it by running “java”. Sounds interesting ?

It is not that easy to understand how it works from the code, so it is good to see this slide for some visual explanation. The TemplateInterpreter contains a Template Table, for each opcode, and the entry there refers to a code cache to execute each opcode. Furthermore, the code in the code cache is a native code, and generated directly as machine code when you run the “java” executable. So basically an opcode is read, the Template Table is referred and execution jumps to the code addressed in the table, which is an address in the code cache, code in the cache runs and jumps back to repeat the cycle.

If we look at cpu/x86/templateTable_x86.cpp , for iconst opcode, we see this:

void TemplateTable::iconst(int value) {
  transition(vtos, itos);
  if (value == 0) {
    __ xorl(rax, rax);
  } else {
    __ movl(rax, value);
  }
}

First, it covers all iconst variants (m1, 0, 1, 2, 3, 4, 5), and double underscore (__) is defined as:

##define __ _masm->

So, __ movl(rax, value) is masm->movl(rax, value) . It is hard to believe but this code is implemented in cpu/x86/vm/assembler_x86.cpp as:

void Assembler::movl(Register dst, int32_t imm32) {
  int encode = prefix_and_encode(dst->encoding());
  emit_int8((unsigned char)(0xB8 | encode));
  emit_int32(imm32);
}

where it actually emits/generates the machine code ! Now to repeat from the beginning, we were looking for implementation of iconst_1 opcode, we found the iconst method in templateTable, and this method calls this movl method to generate machine code. So after all these, when we launch a “java”, we have a populated TemplateTable, and a CodeCache which contains the machine codes of the opcodes. Can we see this generated TemplateInterpreter or the generated opcodes ? Yes !

Another JVM option, -XX:+PrintInterpreter, prints the generated interpreter code. It is a diagnostic option so it should be used like:

java
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintInterpreter
Calculator

I highly recommend to redirect this to a file, since it will generate lots of output.

The output will contain:

--------------------------------------------------------------------
Interpreter
code size        =    139K bytes
total space      =    267K bytes
wasted space     =    128K bytes
## of codelets    =    270
avg codelet size =    528 bytes
--------------------------------------------------------------------
slow signature handler  [0x00007f2555cd37a0, 0x00007f2555cd3960]  448 bytes
Loaded disassembler from /home/mete/jdk-9/lib/server/hsdis-amd64.so
[Disassembling for mach='i386:x86-64']
... assembly code continues ...
----------------------------------------------------------------------
iconst_1  4 iconst_1  [0x00007f2555ce5300, 0x00007f2555ce5360]  96 bytes
  0x00007f2555ce5300: push   %rax
  0x00007f2555ce5301: jmpq   0x00007f2555ce533f
  0x00007f2555ce5306: sub    $0x8,%rsp
  0x00007f2555ce530a: vmovss %xmm0,(%rsp)
  0x00007f2555ce530f: jmpq   0x00007f2555ce533f
  0x00007f2555ce5314: sub    $0x10,%rsp
  0x00007f2555ce5318: vmovsd %xmm0,(%rsp)
  0x00007f2555ce531d: jmpq   0x00007f2555ce533f
  0x00007f2555ce5322: sub    $0x10,%rsp
  0x00007f2555ce5326: mov    %rax,(%rsp)
  0x00007f2555ce532a: movabs $0x0,%r10
  0x00007f2555ce5334: mov    %r10,0x8(%rsp)
  0x00007f2555ce5339: jmpq   0x00007f2555ce533f
  0x00007f2555ce533e: push   %rax
  0x00007f2555ce533f: mov    $0x1,%eax
  0x00007f2555ce5344: movzbl 0x1(%r13),%ebx
  0x00007f2555ce5349: inc    %r13
  0x00007f2555ce534c: movabs $0x7f2576e63b00,%r10
  0x00007f2555ce5356: jmpq   *(%r10,%rbx,8)
  0x00007f2555ce535a: nopw   0x0(%rax,%rax,1)
... assembly code continues ...

There are native functions with assembly code here for each java opcode, where I included only iconst_1. You may ask why there are 270 “codelets” where there can be maximum 256 java opcodes. The reason is there is actually 202 opcodes and there is an implementation here for every one of them, but additionally there is implementation of the reserved opcode breakpoint and 67 others that is used for different purposes by the JVM, I believe for example for optimization as in fast_agetfield , method entry point (kind = java_lang_math_sin) or for internal tasks as in exception handling . So in total, there are 270 (202+1+67).

Conclusion

As we have seen, JVM can be built in different variants, even more than the ones we usually know: server and client. Also, JVM contains two interpreters, namely; CppInterpreter and TemplateInterpreter. CppInterpreter is a more straight-forward and a platform agnostic implementation, whereas TemplateInterpreter is quite an optimized, runtime generated, platform specific machine code.

I am sure everybody had an idea about how the bytecodes can be interpreted but I hope this post made it very clear, by seeing two real world implementations.