MattHorsnell

I am a final year PhD student working on the Jamaica project. My working thesis title is "Chip multi-clustering, scalable coherence and thread distribution", although you can expect this thesis title to change almost weekly!

Chip multi-clustering and the Jamaica architecture

During my PhD I have spent most of my time developing a simulation platform for the JAMAICA architecture, which allows exploration of the CMP design space. Unfortunately, whilst this has consumed a great deal of my time (and sanity), it is not what I expect to gain a PhD for, and so it appears in my thesis only as an evaluation tool.

Recently I have been working on extending the memory hierarchy within the JAMAICA architecture to include multiple levels of cache. This opens up the possibility of chip multi-clustering, that is, having multiple MPs on a single chip. Clusters provide two nice properties. Firstly, they allow many processors to be connected on the same chip, using the shared memory paradigm, without needing to connect all of the processors to the same interconnection network for coherence. Secondly, because each cluster runs against its own independent shared cache, some cache thrashing between larger threads, or at the application level, can hopefully be isolated.

This comes, however, at the expense of requiring an inclusive cache hierarchy, which does impact overall cache performance, although the impact is minimal when the shared levels of cache are sufficiently large (more than n times the size of the n caches above them). There is an argument that the transistor count available on a chip is already significantly larger than we can easily design for, and larger caches are touted as a reasonable use of this space.
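
As a rough illustration of the capacity argument (the cache sizes and class name below are made up for the example, not taken from the simulator): an inclusive shared cache must hold a copy of every line held in the caches above it, so its effective capacity shrinks by roughly the combined size of those upper caches.

    // A minimal sketch of the inclusion capacity argument. The sizes are
    // hypothetical; the point is only that when the shared cache is more than
    // n times the size of the n caches above it, the relative loss is small.
    public class InclusionOverhead {
        public static void main(String[] args) {
            int upperCaches = 8;       // n private caches above a shared, inclusive cache
            int upperSizeKB = 32;      // size of each upper cache
            int sharedSizeKB = 1024;   // size of the shared, inclusive cache

            int duplicatedKB = upperCaches * upperSizeKB;   // capacity spent on duplicated lines
            double lostFraction = (double) duplicatedKB / sharedSizeKB;

            System.out.printf("Capacity lost to inclusion: %d KB (%.1f%% of the shared cache)%n",
                    duplicatedKB, lostFraction * 100.0);
        }
    }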

The JAMAICA architecture's novelty lies in its thread distribution mechanism. In a multiple-cluster chip this mechanism is extended to support a notion of thread locality, such that threads can be distributed across the chip based on knowledge of the caches that they are likely to share. I am currently investigating whether distribution can be optimised to exploit this property using dynamic compilation and hardware feedback.
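
The following is only a sketch of the idea, with names of my own invention (LocalityDistributor, groupId and so on are hypothetical, not the JAMAICA mechanism): a new thread is steered towards the cluster whose shared cache its sibling threads already use, falling back to the least-loaded cluster otherwise.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of locality-aware thread distribution across clusters.
    public class LocalityDistributor {
        private final int numClusters;
        private final int[] load;                                        // threads currently on each cluster
        private final Map<Integer, Integer> groupHome = new HashMap<>(); // thread group -> home cluster

        public LocalityDistributor(int numClusters) {
            this.numClusters = numClusters;
            this.load = new int[numClusters];
        }

        /** Pick a cluster for a thread belonging to the given sharing group. */
        public int place(int groupId) {
            Integer home = groupHome.get(groupId);
            int cluster = (home != null) ? home : leastLoaded();
            groupHome.putIfAbsent(groupId, cluster);
            load[cluster]++;
            return cluster;
        }

        private int leastLoaded() {
            int best = 0;
            for (int c = 1; c < numClusters; c++) {
                if (load[c] < load[best]) best = c;
            }
            return best;
        }
    }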

PIMMS - A multiple-level cache coherence protocol

The PIMMS protocol implements an MSI-like cache coherence protocol with extensions that allow multiple levels of cache in a memory hierarchy. It should be noted that when I refer to multiple levels of cache I will normally be referring to multiple levels of shared cache, rather than simply multiple levels of private cache.
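
The exact PIMMS state encoding is not spelled out here, but as a hedged sketch, the per-line state kept by a shared cache can be thought of as an MSI state plus a flag saying that the freshest data actually lives in a subordinate cache; that flag is what forces the extra checks described next.

    // Illustrative per-line state for a shared cache in an MSI-like,
    // multiple-level protocol. The names are my own, not the PIMMS encoding.
    public class CacheLineState {
        enum Msi { MODIFIED, SHARED, INVALID }

        Msi state = Msi.INVALID;
        // The shared cache may hold a line as MODIFIED while the up-to-date
        // data actually sits in one of its subordinate (higher-level) caches;
        // that case triggers the hunt-and-gather transactions described below.
        boolean modifiedInSubordinate = false;

        boolean canSupplyDataLocally() {
            return state != Msi.INVALID && !modifiedInSubordinate;
        }
    }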

As you can imagine, adding multiple levels of cache complicates the checks that have to be made for cache coherence. A line may be marked as modified in, say, an L2$, while the actual modified data sits in one of the L1$s sharing that L2$. This means that some hunt-and-gather commands enter the protocol.

The "hunt and gather" transactions, can be defined by 4-phases, namely, Request, Action, ReAction, and Response. There are now 3 classes of transactions in the coherence system:

  1. Cache-to-Cache transfers: The transaction is satisfied during the initial request phase because a bus-local cache can supply the required permissions and data.
  2. Split-Transactions: The transaction cannot be satisfied by the bus during the request phase, say on an L2$ miss, and the data and permissions must be gathered from L3$/Memory, at which point the initial request is matched by a response phase on the bus.
  3. Hunt and Gather Transactions: These transactions require the full four-phase handshake in order to gather the permissions. They occur when, say, a read at the L3$ determines that a peer L2$ has the data, but in one of its subordinate L1$s. Here the initial Request, let's say a SH request, triggers an Action of MEM_WB into the modified branch of the hierarchy, which eventually hits the corresponding data. The data and permissions then bounce back as a ReAction, which eventually reaches the peer L2$ and sets that cache's data and permissions to modified. The initial Request can then be matched by a Cache-to-Cache transfer, as sketched below.
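
As a sketch of the example in transaction class 3 above (the phase names come from the protocol description; everything else is illustrative naming rather than the PIMMS implementation):

    // Walkthrough of the four-phase hunt-and-gather handshake for an L3$ read
    // that finds the modified line in an L1$ beneath a peer L2$.
    public class HuntAndGather {
        enum Phase { REQUEST, ACTION, REACTION, RESPONSE }

        public static void main(String[] args) {
            System.out.println(Phase.REQUEST  + ": SH read issued; the L3$ sees a peer L2$ as the owner of the line");
            System.out.println(Phase.ACTION   + ": MEM_WB sent into the modified branch until it hits the L1$ holding the data");
            System.out.println(Phase.REACTION + ": data and permissions bounce back, leaving the peer L2$ modified");
            System.out.println(Phase.RESPONSE + ": the original request completes as a cache-to-cache transfer");
        }
    }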

more to come on this...

Distributed Work Distribution

When JAMAICA was initially envisaged, multi-core architectures were in their infancy, and as such 4 - 8 processors, all sharing a data bus and one level of shared coherence, were seen as a large enough step in the architecture to consider. This is also reflected in the novel token distribution mechanism, based around a ring network connecting all of the processors. In future architectures, where conceivably the number of processors on a chip may be in the tens to hundreds, a single ring connecting all of the cores seems like a weak link, both from a design point of view and in terms of tolerance to failure.

In the JamML architecture, the token distribution (and hence the work distribution) needs to be spread amongst a far larger number of cores. Also, because the cores are not all attached to a single level of shared coherent memory, there is some non-uniformity and trade-off between the multiple levels of the hierarchy. Therefore a scheme based on cache distance is used, allowing the processors and the program to select a distribution metric based on cache locality.
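
A minimal sketch of such a cache-distance metric, under an assumed two-level sharing arrangement (cores in a cluster share an L2$, all clusters share an L3$) and with hypothetical names, might look like this:

    // Sketch of cache-distance based target selection: an idle core in the
    // same cluster is preferred over one that only shares the chip-level cache.
    public class CacheDistancePicker {
        private final int coresPerCluster;
        private final boolean[] idle;

        public CacheDistancePicker(int coresPerCluster, int numCores) {
            this.coresPerCluster = coresPerCluster;
            this.idle = new boolean[numCores];
        }

        public void setIdle(int core, boolean isIdle) { idle[core] = isIdle; }

        /** 0 = same core, 1 = shares the cluster's L2$, 2 = only shares the chip-level L3$. */
        int distance(int from, int to) {
            if (from == to) return 0;
            return (from / coresPerCluster == to / coresPerCluster) ? 1 : 2;
        }

        /** Pick the idle core nearest (in cache distance) to the distributing core, or -1 if none. */
        public int pickTarget(int from) {
            int best = -1;
            for (int c = 0; c < idle.length; c++) {
                if (!idle[c]) continue;
                if (best == -1 || distance(from, c) < distance(from, best)) best = c;
            }
            return best;
        }
    }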
