
Compilation

The Virtual Machine Environment

The JaVM, like the RVM, employs two compilers: a baseline compiler, which performs inline expansion of bytecode, and an optimizing compiler, which performs a host of sophisticated optimizations. Both compilers translate Java code to machine code -- the JaVM does not employ an interpreter.

The JaVM's compilers operate inside a memory image which contains a variety of Java objects which model the virtual machine environment and the underlying physical machine. For example, each available hardware execution context is modelled using an instance of class VM_Processor; each Java thread of execution is modelled using an instance of VM_Thread (or, possibly of its subclass java.lang.Thread); each class loaded into the JaVM is modelled using an instance of VM_Class. The virtual machine classes are themselves cross-compiled by the JaVM compiler to generate an initial Java memory image within which application code can be loaded and run.

Two special, non-Java data structures are employed to implement the virtual machine. The JTOC is an int[] array which is used to store global Java data values. This includes literal constants such as integers, doubles (stored in two successive slots), strings (the array slot contains a reference -- an offset address -- to an instance of java.lang.String), etc. The JTOC also stores references to compiled code for static methods and static field data (e.g. VM_Scheduler.threads, which is an array of type VM_Thread[]). The compiler can invoke static methods and access or update static data by indexing into the JTOC.
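As an illustration, a JTOC-style table can be modelled in plain Java. The slot numbers and helper names below are invented for the sketch and are not the JaVM's real layout; the point is just that ints, double halves and reference offsets all live in one indexed int array:

```java
// Hypothetical sketch of JTOC-style global storage: an int[] whose slots
// hold literal ints, the two halves of doubles, and offsets standing in
// for references. Slot numbers would be fixed at compile time.
public class JtocSketch {
    static final int[] jtoc = new int[16];

    // Store/load an int literal in a slot chosen at "compile time".
    static void storeInt(int slot, int value) { jtoc[slot] = value; }
    static int loadInt(int slot) { return jtoc[slot]; }

    // A double occupies two successive slots (here: hi word then lo word).
    static void storeDouble(int slot, double d) {
        long bits = Double.doubleToLongBits(d);
        jtoc[slot] = (int) (bits >>> 32);
        jtoc[slot + 1] = (int) bits;
    }
    static double loadDouble(int slot) {
        long bits = ((long) jtoc[slot] << 32) | (jtoc[slot + 1] & 0xFFFFFFFFL);
        return Double.longBitsToDouble(bits);
    }
}
```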

The JTOC also stores TIBs, one per loaded class. A TIB is essentially a method table with bells and whistles. It is an int array which contains a reference to a VM_Class instance in slot 0, possibly some interface lookup tables in the next slot and then, in order, all current compiled instance methods for the class, including inherited as well as local methods. A TIB is created when a class is loaded and, when an instance is created, the TIB is stored in the header of the instance. The compiler can invoke an instance method by loading relative to the instance's TIB at an offset fixed at compile time. This caters for overriding because the TIB of a subclass which overrides a method will have a different code pointer from the superclass TIB at the relevant method's offset.
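The dispatch scheme can be sketched in Java. This is a loose model only: the slot layout, the use of Object[] rather than int[], and the lambdas standing in for compiled code are all simplifications for illustration:

```java
import java.util.function.IntUnaryOperator;

// Hypothetical model of TIB-based dispatch: slot 0 holds the class
// descriptor, later slots hold "compiled code" at offsets fixed at
// compile time. Lambdas stand in for compiled method bodies.
public class TibSketch {
    static final int METHOD_OFFSET = 1; // slot 0 is the class reference

    static final Object[] superTib = { "Super", (IntUnaryOperator) x -> x + 1 };
    // A subclass TIB has the same shape but a different code pointer at
    // the same offset -- which is exactly how overriding works.
    static final Object[] subTib   = { "Sub",   (IntUnaryOperator) x -> x * 2 };

    // The compiler emits a load at a fixed offset from the instance's TIB.
    static int dispatch(Object[] tib, int arg) {
        return ((IntUnaryOperator) tib[METHOD_OFFSET]).applyAsInt(arg);
    }
}
```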

Special Non-Java Operations

The compilers translate Java bytecodes into a sequence of machine instructions with the same semantics as the bytecode. In an interpreted virtual machine some of the operations of the virtual machine, such as memory allocation and collection, context switching etc, are normally implemented by callouts to a language like C. In Jamaica they are implemented using either callouts to routines built into the hardware bootstrap program (syscalls) or by invoking magic operations which the compiler recognises as special. These special operations are coded as invocations of overloaded APIs on two classes, VM_SysCall and VM_Magic. So, for example, VM_Magic.threadSwitch(VM_Thread t, VM_Registers s) can be used to effect a context switch to a new thread t, saving the current thread's register state in the register save state object s. Magic calls do not have a Java implementation. A dummy implementation of the class is provided so that the source can be compiled to bytecode by a Java bytecode compiler. The JaVM compiler recognises invocations of methods defined in these two classes and ignores the bytecode definitions, instead translating the calls into inline machine code sequences or into invocations of booter callbacks.
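The recognition step can be sketched as a simple dispatch on the declaring class of a call site. The class-name strings are taken from the text above, but the lowering labels and the method shape are invented for the sketch; the real compiler rewrites IR, not strings:

```java
// Hypothetical sketch of magic-call recognition: call sites whose
// declaring class matches a known name are not compiled from their
// (dummy) bytecode but replaced by an inline expansion.
public class MagicSketch {
    static boolean isMagic(String declaringClass) {
        return declaringClass.equals("VM_Magic")
            || declaringClass.equals("VM_SysCall");
    }

    // Returns a label describing how a call site would be lowered.
    static String lower(String declaringClass, String methodName) {
        if (isMagic(declaringClass)) {
            return "inline:" + methodName; // emit a machine-code sequence
        }
        return "call:" + declaringClass + "." + methodName; // ordinary call
    }
}
```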

Compilation Strategy

The baseline compiler uses a simple inline expansion scheme to translate bytecode into a corresponding sequence of machine instructions. Its translation strategy is described below.

The optimising compiler translates Java bytecode into machine code in 3 stages, each of which employs its own internal, transformed representation of the original bytecode. The High-level Intermediate Representation (HIR) is an almost direct translation of the original bytecode into a target platform-independent format. It is susceptible only to a small number of simple transformations. The Low-level Intermediate Representation (LIR) is a much richer representation which is still target-platform independent. It is repeatedly traversed, transformed and annotated in order to produce a machine-specific, Machine-Level Intermediate Representation (MIR) which can be readily transformed into target-platform machine code. All three IR formats consist of a graph of instructions which define or use shared data values. The level of abstraction of the instructions and the complexity of the graph change as the IR is massaged from HIR to LIR to MIR. Instructions and values may be annotated with type and other information. Inlining transformations mean that LIR and MIR graphs may correspond to more than one source Java method.

The HIR and LIR transforms are mostly generic so will not be described here. The transition from LIR to MIR includes a series of manipulations performed automatically using a BURS grammar, which is a series of pattern-based transformation rules. These include transformations which begin to take account of machine-specific features such as addressing modes, locking primitives etc. The MIR is in a format which requires only a few modifications in order to generate an assembled instruction sequence.
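The flavour of a BURS-style rule can be sketched as: match a small operand pattern, pick the cheapest machine form. The opcode names and the method below are invented for illustration and are not the JaVM's actual grammar:

```java
// Toy flavour of BURS-style instruction selection for an add: if one
// operand is a known constant, choose an immediate form of the machine
// instruction, otherwise the register-register form. Opcode names are
// invented for the sketch.
public class BursSketch {
    // rightConstant == null means the right operand is a register value.
    static String selectAdd(Integer rightConstant) {
        if (rightConstant != null) {
            return "addi r, r, " + rightConstant; // fold constant into immediate
        }
        return "add r, r, r"; // general register form
    }
}
```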

The rest of this document describes the stack and register management strategies adopted by the JaVM's two compilers. These strategies guide the process of transforming the LIR to MIR to machine code. Reference is made to some of the Java values which identify the Jamaica machine registers. These are defined in the compiler classes, specifically VM_Registers, VM_RegisterConstants, VM_BaselineConstants and VM_StackframeLayoutConstants. Occasional reference is also made to other system classes but these should be straightforward to understand. An understanding of the operation of the Jamaica chip is assumed.

General & Special Register Usage

The register allocation strategy for the OPT compiler is built around two lists of registers, those which are volatile across calls (defined in array VOLATILE_GPRS) and those which are non-volatile (defined in array NONVOLATILE_GPRS). The bulk of the per-context registers (O, I and X regs) are divided among these two sets.

The baseline allocator employs a small subset of the VOLATILE_GPRS using a much simpler strategy (it stores all its data on the stack and only loads values into registers temporarily, either to perform primitive data operations or to invoke methods).

There are a variety of special registers omitted from these two sets: the two SCRATCH_GPRS (X6 and X7), REG_PR (X4), REG_FP and REG_SP (I6 and O6), REG_JMP and REG_RCP (O7 and I7), REG_ZERO (G0) and REG_JTOC (G1). These registers are not managed by the register allocator.

The volatile registers comprise the remaining Os and Xs. The non-volatiles comprise the remaining Is. Clearly this division of Is and Os makes sense given Jamaica's windowing architecture. We get save and restore of I registers from the hardware so, unlike the RVM compiler, we don't actually have any spill/fill to do in software at call/return. The other registers are not guaranteed to be saved by the hardware but we have enough non-volatiles that we can usefully employ them all as non-volatiles and avoid having to resort to spilling/filling.

Use of the special registers is described first, followed by the register allocation strategy and then the use of registers during calling.

JTOC

JTOC identifies the table containing TIBs (type information blocks -- essentially a method table but with a class reference and, potentially, an interface table in the initial entries), i.e. this register establishes the global JVM state shared by all VM_Processors/threads of control in the JaVM. It can be global since it is written once only at bootstrap and is never written by the Java compilers (nor by *our* C compiler, which avoids all globals except G0). JTOC is therefore non-volatile so, unlike the RVM, we do not need to save and restore it when calling C code.

FP & SP

FP and SP point to the caller top of stack and callee top of stack. FP should not be confused with the frame link address stored in the framePointer field of each VM_Processor (confusingly also called FP -- the Jikes people don't know about Jamaica :). The latter is always equal to FP + 8, since the first thing any method does is to build a frame header, i.e.

assign SP from FP
push a return PC value
push the current value of the framePointer field
store the current SP (i.e. FP + 8) in the framePointer field.
push a compiled method id unique to the current method

On return FP is automatically restored as the caller top of stack.

JMP and RCP

JMP (O7) is traditionally used to load an address for a JSR. Actually the Opt compiler uses any spare register for this and we probably ought to make O7 a volatile GPR. RCP (I7 = return from call pointer) is where the hardware places a return address at a JSR. We actually write this to the stack in the frame header because the GC code, the stackwalk code and the interrupt handler code rely on being able to read the value out of the stack. Avoiding this would i) require much jiggery-pokery with spilling register windows and traversing the spill stack (we have to do this anyway for GC but not for the other two cases) and ii) make the code run much slower. So we live with two copies of the return address.

PR

This register identifies the processor and, via the latter's activeThread field, the local execution context for a running Java thread. The compiler uses this register to access per-VM_Processor and per-thread state. This register is all that is needed to establish a specific Java execution context (to complement the global context established by JTOC). It is written once-only (at bootstrap) by Java code, so is non-volatile wrt Java execution. However, our C compiler may try to use it so we have to save and restore it when we make a call to a C function.

PR is allocated from the X registers for three reasons. Firstly, so that it is available in all call frames. Secondly, so it is unique to each hardware context. Finally, so that the processor context is not lost across hardware idle.

What this last point is saying is that when a context goes idle it retains a handle on its extra window. So, when it resumes after a THJ it will still find the same VM_Processor in X4 and will find the VM_BranchThread which initiated the hardware idle in the processor's activeThread field. In other words, the Java execution context for the shipped task is the VM_BranchThread attached to the idle processor, and this provides a stack and a target for thread operations like suspending, locking, waiting, notifying etc.

SCRATCH_GPRS

These are spare registers which are not managed by the opt compiler register allocator. They guarantee the availability of some scratch space which can be used in the back end assembler to implement macro operations which require a temporary e.g. a LDA/LDAH pair, a THJ sequence etc. They are not needed by the baseline compiler register allocator as it works mostly on the stack and so does not need many registers.

Baseline General Register Allocation

The baseline compiler uses the stack for most of its operations so it rarely has to use registers other than to hold a limited number of temporaries used in the inline expansion of a bytecode.

Argument words are pushed onto the stack at the start of a method invocation. Local variables are stored on the stack in a space allocated above the argument data and loaded/saved as they are used/modified. Temporary values constructed as components of expressions or as call arguments are pushed on the stack as they are computed and popped as they are used.

The compiler defines four registers S0-S3 which are used to hold temporary data loaded from the stack or results from execution of a bytecode operation. So, for example, an ADD will pop two stack values into S0 and S1, add the values into S0 and push S0 back onto the stack. These four temporary registers are distinct from the four transfer registers and the special registers.
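The ADD expansion above can be sketched directly. The class and helper names are invented; local variables s0/s1 stand in for the scratch registers:

```java
// Sketch of the baseline scheme for an integer ADD bytecode: pop the two
// operands into scratch registers S0/S1, add into S0, push S0 back.
public class BaselineAddSketch {
    static final int[] stack = new int[32];
    static int sp = 0;

    static void push(int v) { stack[sp++] = v; }
    static int pop() { return stack[--sp]; }

    // The machine-code sequence the baseline compiler would emit for IADD.
    static void emitIadd() {
        int s1 = pop();   // S1 <- top of stack
        int s0 = pop();   // S0 <- next value
        s0 = s0 + s1;     // S0 <- S0 + S1
        push(s0);         // push the result back
    }
}
```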

Opt General Register Allocation

The opt compiler tries to keep data in the GPRS as much as possible (volatile or non-volatile, as appropriate). Where necessary it spills register data into a spill area on the stack, allocated (at a compile-time computed size) above the frame header. It prefers to shuffle data around between registers rather than do this.

Registers are initially allocated as logical registers from an unbounded pool. The mapping to physical registers happens in stages.

Argument transfer registers (see below) are marked as allocated at the start of the method according to the number of argument words employed by the method. The allocator may not reuse these registers unless the argument value has either been spilled or transferred to another register or is no longer needed.

N.B. arguments passed in registers are often moved into non-volatile registers early on in the method i) to allow argument registers to be reused for another call and ii) because the value is required after the call has completed. However, in certain simple methods arguments may simply be used then discarded, freeing the argument transfer register for reuse.

Mapping at call sites is done next ensuring that constraints imposed by the calling convention are obeyed (see below). Where a call employs a logical register the call argument is modified to the relevant physical register and a copy from logical to physical register is inserted before the call. Stack adjustments may also be inserted around the call. This may give rise to unnecessary moves which the compiler attempts to elide later on during compilation.

A data liveness analysis is then performed and the remaining logical registers are mapped to physical registers and spills/fills inserted as necessary. Physical registers are allocated in order from the end of the volatile or non-volatile list as they are needed. Registers are freed for reuse when the data they contain is no longer live or when it has been spilled to the spill area. Non-volatiles are only used when necessary (i.e. when a value must persist across a call or when there are no volatiles left). Data liveness and physical register constraints imposed by argument or call bindings are analysed in an attempt to minimise the amount of shuffling around of data between registers (bad) or spilling to the spill area (very bad).
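The allocate-from-the-end-of-a-list, spill-when-empty discipline can be sketched as follows. This is a drastic simplification: the class and labels are invented, and the real allocator also honours liveness and calling-convention constraints:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Highly simplified sketch of the physical-register mapping step: hand
// out registers from the end of a free list while any are left,
// otherwise fall back to a spill slot on the stack.
public class AllocSketch {
    final Deque<String> free = new ArrayDeque<>();
    int nextSpillSlot = 0;

    AllocSketch(String... regs) {
        for (String r : regs) free.push(r); // last register listed is handed out first
    }

    // Returns a physical register name, or a spill-slot label when none is free.
    String allocate() {
        if (!free.isEmpty()) return free.pop();
        return "spill[" + (nextSpillSlot++) + "]";
    }

    // A register whose value is dead (or spilled) becomes reusable.
    void release(String reg) {
        if (!reg.startsWith("spill")) free.push(reg);
    }
}
```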

Argument Transfer for Java methods

An initial segment of the sequence of registers defined by array VOLATILE_GPRS is used for transfer of 4 words of arguments from caller to callee. The number of available registers is defined by NUM_PARAMETER_GPRS, which in the JaVM is 4, meaning we use X0-X3 (we could also use X5 but don't).

The reason we stop at this point is that once we hit the O registers we find that these registers change name either side of a call (O0 becomes I0). They also change from volatile to non-volatile. Since the RVM compiler was built to transfer and receive via registers with the same name (and using volatile registers only) we have stopped at X3 so as to avoid having to rewrite the compiler to cope with this effect of windowing.

So, the first four words of the arguments are passed in X0--X3 respectively, and the remaining words are passed on the stack pushed in argument order (but remember the stack grows negative). Note that if there are N words of arguments (N > 4) then N words of stack are allocated at the call. The baseline compiler always pushes all N argument words before loading the top 4 into registers (it works by recursive evaluation so this is the natural way to assemble the argument data anyway). The Opt compiler will increment the stack by 4 before pushing the last N-4 argument words.
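The register/stack split for a Java call can be sketched as a mapping from argument-word index to location. The class and method names are invented; NUM_PARAMETER_GPRS = 4 and the X0-X3 names come from the text above:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the Java-call argument split: the first four words go to
// transfer registers X0-X3 and any remaining words go on the stack, in
// argument order.
public class ArgSplitSketch {
    static final int NUM_PARAMETER_GPRS = 4;

    // Returns a location name for each argument word in order.
    static List<String> locate(int argWords) {
        List<String> where = new ArrayList<>();
        for (int i = 0; i < argWords; i++) {
            if (i < NUM_PARAMETER_GPRS) {
                where.add("X" + i);
            } else {
                where.add("stack[" + (i - NUM_PARAMETER_GPRS) + "]");
            }
        }
        return where;
    }
}
```

So a 6-word argument list lands in X0-X3 with two words on the stack, matching the N > 4 case described above.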

Note that longs and doubles are pushed lo word first then high word (opposite to the RVM) and are loaded into registers in the same order. If the lo word comes fourth it is still passed in the fourth transfer register and the hi word goes on the stack.

Note also that we don't have FP registers so that float and double arguments are passed as one or two argument words, respectively, either in the transfer registers or on the stack. This also differs from the RVM.

Value Return for Java Calls

Java results are returned in the first or first two transfer registers (i.e. the first or first two volatile registers, X0 & X1), depending upon the number of words in the result.

The callee is responsible for popping the N argument words off the callee stack before returning, i.e. the final sequence in any Java method will be

SP <- FP - 8 ; restore callee stack to header(FP)
[X4 + framePointer Offset] <- [SP] ; restore saved frame pointer in VP
FP <- FP + N ; pop N words from call top of stack
RET ; return to caller

Calling C Code

A call to VM_SysCall.MethName gets translated to a call to the C function whose address is found in the BootRecord field MethNameIP (IP for instruction pointer). This normally translates to a C function defined in tools/BootImageRunner/jamaica/sys.c called sysMethName.

Similarly, calls to native methods which have been implemented in C (as opposed to those we have patched in using Java methods on class VM_NativeSupport) are translated to calls to C functions in a BootRecord field.

Finally, calls to sim builtins are made using a JSR which has a calling convention similar to C code.

The C compiler expects up to 6 arguments to be passed in the caller's O0-O5, received in the callee's I0-I5. Any remaining arguments must be pushed on the stack. All the X registers are considered volatile so X4 must be saved and restored around the call. The C code copies FP (plus a frame size) to SP to set up its initial top of stack. It does not write FP nor RCP so these are automatically restored on return (FP may on rare occasions need adjusting -- see below).

C code does not allocate a frame header but it will allocate a scratch area including space for a register spill if an interrupt occurs. So, something like 0xffffffa0 will be added to I6 when assigning O6. We don't actually care about this register spill space as our C callbacks are not supposed to generate interrupts.

The C compiler currently can only return single word results, which are returned via the caller's first out register (O0). Since the arguments transferred via registers have already been cleared from the stack the callee only needs to clear any extra arguments which were passed on the stack (usually there are none of these). So normally FP is unadjusted at return. If the call requires N argument words (N > 6) then the return must be preceded by FP <- FP + (N - 6).

Argument Transfer For SysCalls etc

So, for a syscall the argument words are loaded in order into registers O0-O5 and any remaining args are passed on the stack. In baseline code the arguments will have been assembled on the stack before loading into registers. So, with 6 or fewer words the stack is cleared of all arguments. With more than 6 words the bottom 6 words are removed from the stack and the remaining words shuffled down. So, with N argument words (N > 6) there are N - 6 extra words on the stack. With 6 or fewer arguments the stack is clear of any argument data before the call.

In opt code the arguments will either be in registers already or in the spill area located above the frame header. After filling registers O0-O5 any remaining arguments are pushed onto the stack. As with the baseline compiler, with N argument words (N > 6) there are N - 6 extra words on the stack. With 6 or fewer arguments the stack is clear of any argument data before the call.

For native method calls the caller uses Java calling conventions to invoke a stub method which wraps round the native implementation. If the native method has been implemented in Java the stub code sets up a call to the Java code using Java calling conventions. If it has been implemented in C then a call is set up using the syscall conventions. So, this does not introduce any extra call modes, merely an extra intermediate Java frame.

Calls to sim builtins are sometimes made direct from Java (e.g. to fetch the current cycle count, do float/double to/from int/long conversions, float/double arithmetic etc). Builtins also expect arguments to be passed in the caller's O1-O5 (I don't think any of them use more than O1 and O2). Builtins do not modify any of the other registers and, as for syscalls, the stack is clear of any argument data before the call.

Value Return for SysCalls etc

Syscalls currently can only receive one return word in O0. At the point of return the callee will have adjusted the stack to get rid of any stacked argument words (if there were more than 6 such words in total).

We have to fiddle the return of some long data from C functions (e.g. time values returned by sysGetTimeOfDay) by passing in a two-word array. Since only one call can be outstanding on any given VM_Processor, such transfer buffers must be allocated one per VM_Processor in order to avoid any problems with concurrent calls.
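The Java-side reassembly of such a buffer can be sketched as below. The class name is invented, and the word order (hi word in slot 0) is an assumption for the sketch, not a documented property of the JaVM's buffers:

```java
// Sketch of reassembling a long returned from C via a two-word transfer
// buffer: the C side fills a per-VM_Processor int[2] and the Java side
// combines the words. Word order (hi in slot 0) is assumed here.
public class LongReturnSketch {
    static long combine(int[] buffer) {
        return ((long) buffer[0] << 32) | (buffer[1] & 0xFFFFFFFFL);
    }
}
```

Masking the lo word with 0xFFFFFFFFL matters: without it, sign extension of a negative lo word would corrupt the hi half.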

Argument Handling/Value Return for Magic Code

Magic code is usually implemented by an inline code sequence. Occasionally this includes a JSR to a Java method, C function or sim builtin but details of parameter passing for these cases are all wrapped up inside the magic. Magic call arguments are set up by both compilers as Java calls (in the bytecode they look just like Java calls -- it is only when we see the invokevirtual that we know it is magic code).

So, in baseline code the arguments are stacked and then the magic code generator gets invoked. Inline code generation involves popping the parameters and consuming them using the relevant opcodes to produce result values, then pushing these results back onto the stack (although the magic invocation is nominally a call, the compiler folds a push of the 'result' into the magic code generation stage; it would normally append this push from the return register(s) after generating a cJSR for a conventional call).

In opt code the magic transformation happens when the call has logical registers as parameters. These logical registers correspond to values 'generated' (defined) by previous instructions, parameters or constants. Inline code generation involves inserting a sequence of instructions which 'consumes' (uses) some of these registers and 'generates' (defines) a return value in a logical result register.

This may involve the creation of temporary logical registers to hold intermediate results and/or insertion of call instructions as well as data manipulation instructions. The result is 'returned' by pushing it onto a stack of logical values maintained in the IR which is passed as a parameter to the inline generation code. This allows the return value to be 'copied' and 'consumed' (used) by subsequent instructions.

Page last modified on November 25, 2005, at 10:52 AM