APT Wiki | Jamaica / RegisterWindows

This is an old topic, but coincidentally one which is still an issue with the Jamaica architecture.

As you will know, the original JAMAICA architecture as proposed by Greg implements register windows. Ahmed's thesis also provides some rationale behind the choice; from memory, this was to do with the lower overhead associated with subroutine calls and returns. I thought it might be an idea to start some form of discussion, in the manner of updates and successive edits to this wiki page, leading to either

a strong argument for keeping them in the architecture, or
a discussion of the benefits to be gained by removing them from the architecture.

For a refresher on register windows you could look here.

On Register Windows - John Mashey

An interesting article on register windows

Some known issues ...

Hardware complexity of Register Windows

In terms of hardware complexity, i.e. the large decode stage and the cost of lazily mapping the [g|i|o|x][0..7] register names onto physical registers 0-191, register windows seem to be expensive.

It is interesting to note that in Jamaica we currently have 192 registers (24 windows of 8 registers). To put this into context, the x86 has a minimal number of registers (8 + 8 FP), and the PowerPC has 32 + 32 FP.

With register windows we can make use of any given number of registers (we just keep changing pointers each time we do a JSR/BSR or a RET instruction), whilst still only having to use 5 bits for each operand identifier in the opcode. If we were to move away from register windows and keep the same code base (or at least keep it with minimal changes), then each context would only be able to address 32 registers.
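
To make the pointer-changing idea concrete, here is a minimal Python sketch of how a 5-bit operand could be translated into one of the 192 physical registers. The layout is an assumption made purely for illustration, not the documented JAMAICA mapping: the 2-bit class encoding, the placement of the globals in block 0, the ins/extras/outs ordering, and the names `decode`, `physical`, `call` and `ret` are all hypothetical. Spilling and filling are not modelled.

```python
# Hypothetical layout, not the documented JAMAICA mapping:
# block 0 holds the globals; blocks 1..23 form a circular buffer of windows.
BLOCK = 8                  # registers per block
NUM_BLOCKS = 24            # 24 * 8 = 192 physical registers
WINDOWED = NUM_BLOCKS - 1  # blocks available to the circular buffer

# Assumed encoding: the top 2 bits of the 5-bit operand pick the class.
CLASSES = {0: "g", 1: "i", 2: "x", 3: "o"}
# Offset (in blocks) of each windowed class from the window pointer.
OFFSET = {"i": 0, "x": 1, "o": 2}


def decode(operand):
    """Split a 5-bit operand into (class letter, index 0..7)."""
    if not 0 <= operand < 32:
        raise ValueError("operand must fit in 5 bits")
    return CLASSES[operand >> 3], operand & 0b111


def physical(operand, wp):
    """Map a 5-bit operand to a physical register number 0..191."""
    cls, idx = decode(operand)
    if cls == "g":
        return idx  # globals ignore the window pointer
    block = 1 + (wp + OFFSET[cls]) % WINDOWED
    return block * BLOCK + idx


def call(wp):
    """JSR/BSR: advance so that the caller's outs become the callee's ins."""
    return (wp + 2) % WINDOWED


def ret(wp):
    """RET: step back to the caller's window."""
    return (wp - 2) % WINDOWED
```

Under these assumptions the caller's o3 and the callee's i3 resolve to the same physical register, which is how arguments are passed without any copying. The opcode only ever carries 5 bits per operand; the window pointer supplies the rest.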

Would 32 registers hamper the performance of Jamaica? (discuss...)

I personally can't see it causing a problem. We have plenty of room in the ISA for more instructions, especially if we implement FP instructions and don't keep all of the vector operations(?), so floating point operations could work from a different set of registers if this was felt to be necessary. There also aren't many other general purpose architectures with more than 32 registers.

Obviously there would be some associated overhead with interprocedural calls, but as the article linked above describes, this is only really going to be 5% and it is not clear that 5% would actually be lost in multi-threaded code.

The 5% overhead is a lower bound; the actual performance degradation is not estimated. The article above is interesting, but it doesn't consider argument passing and method return value overheads. Those issues are handled by "register precolouring" in flat register sets (the simple CACAO JIT used something similar), but it is not clear how efficient precolouring would be.

I conducted an experiment in my thesis with the rather crude assumption that all locals would require a spill/fill on method calls (if using a flat register file). The results showed that about 40% of instructions (not cycles) are removed when register windows are used. If we take that as an upper bound, then the overhead of method calls lies somewhere between 5% and 40%. That might at least suggest more potential for register windows.
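
For anyone wanting to play with the crude assumption, it can be written down as a small Python model. This is only a sketch of the counting argument, not the thesis experiment itself: the function names are hypothetical, and it assumes exactly one store before and one load after each call for every live local.

```python
def spill_fill_instructions(calls, live_locals):
    """Extra instructions under the crude assumption: one store before and
    one load after every call, for every live local."""
    return 2 * calls * live_locals


def fraction_removed(base_instructions, calls, live_locals):
    """Share of the flat-file instruction stream that is spill/fill code,
    i.e. the share that register windows would remove in this model."""
    extra = spill_fill_instructions(calls, live_locals)
    return extra / (base_instructions + extra)
```

The fraction grows with both the call frequency and the number of live locals per call, which is why call-heavy code is the interesting case for this comparison.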

Moreover, the 5% analysis is not based on object-oriented applications, where method calls are more frequent. I would suggest implementing a proper register-windows-aware register allocator and comparing performance.

-ahmed

Would losing registers be detrimental to the Thread shipping policy? (discuss...)

Can we not define a remote procedure calling convention that sets up the remote invocation in a subset of the 32 registers, say 16-23, and then use the mechanism we already have for THJ/THB?

We could easily establish a suitable convention regarding shipping. We could, for example, load into registers 8-15 before the THJ and have it deliver these values into registers 0-7. We could also implement a load into registers 8-23 and delivery into registers 0-15 as an alternative flavour of THJ with 16 words of params. Note that we would have to retain the values of the registers containing persistent thread state (i.e. JTOC and initial SP) between when a thread leaves its bottom frame and when it restarts. I guess with this scheme registers 16-31 would automatically hang around, so we could use, e.g., R28 for the JTOC, R29 for SP, R30 for a call/return register and R31 as a zero register. None of this presents any real problems.
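
The convention above can be sketched in a few lines of Python. This is only an illustration of the register numbering described in this comment, not a description of how THJ is implemented: the `ship` function and the `LINK` name for R30 are made up for the example, and each context is modelled as a plain list of 32 values.

```python
JTOC, SP, LINK, ZERO = 28, 29, 30, 31  # roles suggested above; LINK is a made-up name


def ship(sender, receiver, nparams=8):
    """Deliver sender R8.. into receiver R0.. for a THJ-style transfer.

    nparams is 8 for the basic flavour and 16 for the wider one.
    Returns a new 32-entry register list for the receiving context.
    """
    if nparams not in (8, 16):
        raise ValueError("only the 8- and 16-word flavours are sketched here")
    if len(sender) != 32 or len(receiver) != 32:
        raise ValueError("each context has 32 registers")
    delivered = list(receiver)
    delivered[0:nparams] = sender[8:8 + nparams]  # R8.. arrive as R0..
    delivered[ZERO] = 0                           # R31 always reads as zero
    return delivered
```

In both flavours registers 16-31 of the receiving context are never written by the delivery, so the JTOC and SP held there survive between a thread leaving its bottom frame and restarting.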

-aed

Deep recursion and Register Windows

I believe there is still a problem associated with deeply recursive methods: a) we end up thrashing through all our windows, and hence filling and spilling all the time, and b) we end up running out of spill/fill area in memory, an event which we presently don't handle. Clearly, without register windows, stack/heap management would change, but wouldn't we then only be limited by the full size of the stack/heap?
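
Both effects are easy to see in a toy model. The Python sketch below counts spills and fills for a window file that holds a fixed number of frames and has a bounded spill area. It is a simplification for discussion only: the class and its parameters are hypothetical, one frame is spilled or filled at a time, and the sizes are left as arguments rather than taken from the real design.

```python
class WindowFile:
    """Counts spills and fills for a window file holding `windows` frames."""

    def __init__(self, windows, spill_slots):
        self.windows = windows          # frames that fit in registers
        self.spill_slots = spill_slots  # frames that fit in the spill area
        self.resident = 0               # frames currently in registers
        self.spilled = 0                # frames currently in memory
        self.spills = 0
        self.fills = 0

    def call(self):
        if self.resident == self.windows:
            if self.spilled == self.spill_slots:
                raise MemoryError("spill/fill area exhausted")
            self.spilled += 1           # oldest frame goes to memory
            self.resident -= 1
            self.spills += 1
        self.resident += 1

    def ret(self):
        self.resident -= 1
        if self.resident == 0 and self.spilled > 0:
            self.spilled -= 1           # caller's frame comes back
            self.resident += 1
            self.fills += 1


def recurse(wf, depth):
    """Call `depth` levels deep, then return all the way out."""
    for _ in range(depth):
        wf.call()
    for _ in range(depth):
        wf.ret()
    return wf.spills, wf.fills
```

In this model a recursion that stays within the window count costs nothing, every level beyond it costs one spill and one fill, and a recursion deeper than windows plus spill slots raises the error that corresponds to case b).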

Register Windows and other architectures

As far as I am aware, and I am ready to stand corrected, the SPARC architecture is the only mainstream architecture still supporting register windows, and this is more for legacy reasons, as modern SPARC compilers can compile code to run without register windows.

The IA-64 instruction set architecture (which is not a mainstream architecture) uses a form of register windows. IA-64 defines "stack registers", which are effectively a variable-size window mechanism. Registers 0 to 31 are fixed registers. Stack registers are allocated with the alloc instruction. The allocated frame is divided into locals and outs (similar to JAMAICA, as locals are effectively ins).

The IA-64 architecture has 128 integer registers in total. The stack registers are registers 32 to 127, and they are used to speed up method calls. In addition, the stack registers support register rotation. Register rotation is useful for software pipelining, a loop transformation. It avoids the need to unroll the loop and the extra start-up and wind-down code otherwise required for software pipelining, thus making it beneficial to apply software pipelining even to loops with few iterations.
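
As a rough illustration of register rotation, the following Python sketch models a rotating region starting at register 32 with a rotating register base (rrb) that is decremented once per loop iteration. It is a simplified model for discussion, not a faithful description of the IA-64 hardware: the class name, the fixed region size and the explicit `rotate` call (standing in for the loop branch) are assumptions.

```python
class RotatingRegs:
    """Simplified model of a rotating region that starts at r32."""

    BASE = 32

    def __init__(self, size):
        self.size = size           # number of rotating registers
        self.rrb = 0               # rotating register base
        self.phys = [None] * size  # physical storage for the region

    def _slot(self, reg):
        if not self.BASE <= reg < self.BASE + self.size:
            raise IndexError("register outside the rotating region")
        return (reg - self.BASE + self.rrb) % self.size

    def write(self, reg, value):
        self.phys[self._slot(reg)] = value

    def read(self, reg):
        return self.phys[self._slot(reg)]

    def rotate(self):
        """End of one loop iteration: every value moves up one name."""
        self.rrb = (self.rrb - 1) % self.size
```

A value written to r32 in one iteration is read as r33 in the next and as r34 in the one after, so the stages of a software-pipelined loop can use the same instruction text in every iteration without unrolling.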

TLS and Register Windows

Do register windows make sense in Thread Level Speculation?

8 explicit registers are shared (the non-speculative thread's ins are the speculative thread's out window), plus the global and extra registers.
How do we predict return values, if there are 24 potential registers that can be changed between calls?
Do we have a set convention for interprocedural calling?
If not, then it becomes difficult to justify speculative state for 24 registers.