HotTopicsInComputerArchitecture

Following the IEE workshop on processor cores, the following appear to be the current hot topics or good ideas:

  • Long Instruction Words (LIW) - getting ILP but without the cost of out-of-order (OoO) execution
  • Single Instruction Multiple Data (SIMD) - most embedded tasks rely on good SIMD instructions
  • Local memories with no cache - keeping memory local is fast and improves the predictability of a system
  • Communicating Sequential Processes (CSP) - PicoChip demonstrates how cores can communicate over a switched network, avoiding the problem of memory bus saturation (according to Greg's thesis, Jamaica will saturate the memory bus at around 32 cores). Their new problems are utilising all their cores, code placement, and creating static switching tables, as PicoChip doesn't place addresses on the bus to route CSP data.
  • Exabyte computation (well, it was mentioned - Matt) - basically bit-addressed memories, 128-bit computing, etc.
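The CSP idea above can be sketched in Java (the language of the envisaged Jamaica software stack): two threads standing in for cores exchange a value over a rendezvous channel rather than through shared memory. A `SynchronousQueue` approximates CSP's unbuffered, blocking send/receive; the structure is illustrative only, not how PicoChip does it.

```java
import java.util.concurrent.SynchronousQueue;

// Minimal CSP-style sketch: a "producer core" and a "consumer core"
// communicate by rendezvous on a channel, with no shared memory traffic.
public class CspPing {
    // Blocking rendezvous: the producer's put and the consumer's take
    // complete together; neither side touches a shared bus or heap.
    static int exchange() throws InterruptedException {
        SynchronousQueue<Integer> channel = new SynchronousQueue<>();
        Thread producer = new Thread(() -> {
            try { channel.put(42); } catch (InterruptedException ignored) {}
        });
        producer.start();
        int value = channel.take(); // blocks until the producer's put
        producer.join();
        return value;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(exchange()); // prints 42
    }
}
```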

Greg's thesis doesn't actually state when bus saturation occurs; see page 184 (unless I missed it). He simply states that, for his estimated memory calculations given current (in 2000) and predicted future (2010) parameters, memory requests saturate at 8 (2000) and 32 (2010) threads. He states that for the 2010 predictions this gives a bus utilisation of 60%, which is fairly high. Clearly increasing the number of memory banks, increasing the number of ports, increasing the size of the memory queues, and doing early address resolution to route effectively into separate queues would all help; potentially separate banked groups could have separate queues. As memory saturation will occur whatever your interconnect fabric, I don't think we really need to worry too much about bus utilisation just yet. 32 threads is the same as Niagara, and I don't see them marketing full bus utilisation in their white papers ;) - Matt
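For intuition, a back-of-envelope model of the saturation behaviour discussed above: utilisation grows linearly with thread count until the bus is fully occupied. The per-thread miss rate and bus occupancy below are invented so that 32 threads land near the quoted 60%; they are not figures from Greg's thesis.

```java
// Toy bus-utilisation model. All numbers are illustrative assumptions,
// chosen only to reproduce the shape of the argument above.
public class BusUtilisation {
    // utilisation = threads * missesPerCycle * busCyclesPerMiss, capped at 1.0
    static double utilisation(int threads, double missesPerCycle,
                              double busCyclesPerMiss) {
        return Math.min(1.0, threads * missesPerCycle * busCyclesPerMiss);
    }

    public static void main(String[] args) {
        double missesPerCycle = 0.00375; // assumed per-thread miss rate
        double busCyclesPerMiss = 5.0;   // assumed bus occupancy per miss
        for (int t : new int[]{8, 16, 32, 64}) {
            System.out.printf("%d threads: %.0f%%%n",
                t, 100 * utilisation(t, missesPerCycle, busCyclesPerMiss));
        }
        // 32 threads gives 60%; doubling to 64 would already saturate the bus.
    }
}
```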

Update: sorry, Niagara uses a crossbar switch and write-through caches - Matt

I think the problem comes with memory saturation, and the only scalable solution is local memories and a different interconnect. I think it's telling that Niagara has 32 threads and picoChip 308. I don't advocate not having a global memory for a full picture, but cores should work on their local subset and keep the main memory coherent when necessary. This is quite similar to Cell. - Ian R

I still don't think picoChip technology is applicable to large codesets or code that can change over time; it really needs a small algorithm. They are mapping the algorithm onto the 308 cores (some of which are actually buffers, load/store units, etc., hence the slightly strange number 308). The only local memory each core has is code, which changes the function of the core on a cycle-by-cycle basis. It simply wouldn't be feasible to have such a scheme and transport code to each core to run something different each cycle. Niagara has 32 threads, all of which can run any code, dependent only on the location of their current PC. This is not possible using picoChip without reprogramming the chip, which will also run into saturation problems and will no longer have optimal routing. picoChip is a good solution for static-code embedded devices, not for GP computing. Also, picoChip, like dataflow machines in the past, will run into memory saturation problems (limited number of load/store ports to memory); however, they are typically using streaming data, where coherence is not particularly necessary: data comes in one side and goes out the other. - Matt

If you have enough contexts, then why does the code need to change over time? Large codesets exhibit the 90/10 rule, so you only need your local memory to hold the hot 10%. The problem with the local memory being too small is the same as the problem of the cache being too small. The problems with a cache compared to a local memory are the need for it to be content addressable (so for the same storage it's bigger) and that it lacks flexibility in layout (it's fixed by the associativity). If you have a model where the local memory is kept coherent with the global memory or other local memories through CSP messages, you needn't burden the CPU if the object model is fixed in hardware. The downside of this could be that passing objects around isn't as efficient as cache lines. A plus side is that it eases garbage collection, and as we can potentially use point-to-point buses between CPU cores, we avoid the problem of having to go to the top-level cache and then multicasting down the memory hierarchy. I don't propose that picoChip or RAW are architectures we'd want to clone wholesale for our software stack. But my belief is that they are architectures showing how to do orders of magnitude more performance than CMPs, at the cost of the programming model. If we can fix the programming model, then it seems we've solved our problem. - Ian R
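A minimal sketch of the local-memory-plus-CSP model described above: a "global memory" thread serves whole objects over a channel, and a plain software-managed map stands in for the core's local memory (no tags, no fixed associativity). All names and structure here are hypothetical, not a proposal for the real hardware interface.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of a core whose local memory is filled on demand by CSP-style
// messages to a global-memory process, rather than by a hardware cache.
public class LocalMemorySketch {
    record Request(int objectId, BlockingQueue<int[]> reply) {}

    static final BlockingQueue<Request> channel = new LinkedBlockingQueue<>();

    // "Global memory" process: serves whole objects over the channel.
    static void startGlobalMemory() {
        Thread global = new Thread(() -> {
            Map<Integer, int[]> heap = Map.of(1, new int[]{10, 20, 30});
            while (true) {
                try {
                    Request r = channel.take();
                    r.reply().put(heap.get(r.objectId()));
                } catch (InterruptedException e) { return; }
            }
        });
        global.setDaemon(true);
        global.start();
    }

    // The local memory is a plain map managed by software: layout is
    // flexible, and only the hot objects need to be kept resident.
    static int[] fetch(Map<Integer, int[]> localMemory, int id)
            throws InterruptedException {
        int[] hit = localMemory.get(id);
        if (hit != null) return hit;             // hot object already local
        BlockingQueue<int[]> reply = new LinkedBlockingQueue<>();
        channel.put(new Request(id, reply));     // CSP message to global memory
        int[] obj = reply.take();
        localMemory.put(id, obj);                // keep the hot 10% locally
        return obj;
    }

    public static void main(String[] args) throws Exception {
        startGlobalMemory();
        Map<Integer, int[]> localMemory = new HashMap<>();
        System.out.println(fetch(localMemory, 1)[1]); // prints 20
    }
}
```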

The picoChip and the picoBus fabric were very simple, interesting, and elegant solutions to simple, well-understood, regular DSP-like algorithms. Think what Ian Jason talked about a couple of weeks ago, but much bigger! Basically, they compile their algorithm into functions, and each function sits on a separate core. A matrix is calculated which determines what should write/read on which cycle, and a time-stepping dynamic network is created. Each switch in the network knows which connections to make each cycle. This allows regular algorithms to be mapped neatly onto the chip and run very efficiently (80-85% utilisation of cores). However, I still fail to see how this can be applied to Jamaica. It was like an optimised dataflow machine that went on to run the same algorithm repeatedly; presumably changing/reprogramming the cores/switches requires the host controller to send a message to every single core/switch, thus halting operation for as many cycles as there are components to be reprogrammed. Also, it was effectively a no-memory machine: once the inputs were read by the input cores, the results from each calculation got forwarded on to the next operation, and so on. picoChip doesn't use addresses on the bus because they want to resolve routing issues at compile time. This was an issue with transputers (hot spots and random routing); it also avoids any address calculation costs, routing algorithms, and head-of-line contention problems. This all reduces the power consumption, which is what they wanted to achieve: speed, parallelism, and low power. - Matt
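The compile-time switching described above might be sketched like this: each switch holds a per-cycle table of input-to-output connections, so routing at runtime is a table lookup with no addresses on the bus and no runtime routing decisions. The layout is an illustrative assumption, not picoBus's real implementation.

```java
// Sketch of a static, compile-time switching table: the schedule is fixed
// before the program runs, so each cycle's routing is a pure table lookup.
public class StaticSwitch {
    // schedule[cycle][outputPort] = input port to connect (-1 = idle)
    private final int[][] schedule;
    private final int period;

    StaticSwitch(int[][] schedule) {
        this.schedule = schedule;
        this.period = schedule.length;
    }

    // Forward one cycle's data: no address decode, no routing algorithm,
    // no contention — just (cycle mod period) indexing into the table.
    int[] route(long cycle, int[] inputs) {
        int[] conn = schedule[(int) (cycle % period)];
        int[] outputs = new int[conn.length];
        for (int out = 0; out < conn.length; out++)
            outputs[out] = conn[out] >= 0 ? inputs[conn[out]] : 0;
        return outputs;
    }

    public static void main(String[] args) {
        // Two-cycle schedule: cycle 0 swaps the ports, cycle 1 passes straight through.
        StaticSwitch sw = new StaticSwitch(new int[][]{{1, 0}, {0, 1}});
        System.out.println(java.util.Arrays.toString(sw.route(0, new int[]{7, 9}))); // [9, 7]
        System.out.println(java.util.Arrays.toString(sw.route(1, new int[]{7, 9}))); // [7, 9]
    }
}
```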

I think Jamaica has to solve small DSP-like problems, as this is what benchmarks eventually become. Most benchmarks exhibit single-mode behaviour (they're stuck constantly in the same loop). Programs are incredibly small compared to data. Even a large program like the Doom/Quake engine is about 600KB (Bill Gates was right to say "who needs more than a megabyte", but only if you think in terms of code). With a CSP system, parallel resources can work in parallel and communicate with each other, and theoretically extract the maximum parallelism. SMPs will always get limited by the shared memory resource; however, this is a great programmer convenience, so I wouldn't advocate throwing it out like picoChip have. - Ian R

Other ideas that are in the ether are:

  • Speculation - see the rest of the wiki to see conversation on this :-)
  • Object IDs as addresses - Greg Wright's new work advocates this as a means of simplifying garbage collection

We already believe in:

  • Chip multiprocessors - more cores the better, and industry believes this too
  • Simultaneous Multi-Threading (SMT) - having multiple contexts so that when the core must wait we can hide the fact by introducing new work from a different context

If all of these ideas are good, what do we end up with:

  • a CMP system
  • simple cores but with SIMD and LIW capabilities to extract some available ILP at low cost
  • local memories, so that cores don't go to a central bus thereby allowing greater scalability
  • some kind of global memory that is also the hardware interface
  • buses that enable cores to communicate with each other without going to global memory
  • a notion that a core's local memory and execution are speculative, so we don't necessarily commit local state back
  • SMT to hide delays in the CSP and global memory systems
  • an instruction set and buses oriented around memory just containing objects
  • parts of the object header capturing the sharing status of objects; coherent access can be maintained either through global memory (but with no caches, would this make sense?) or through CSP messages
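As a rough illustration of the last bullet, sharing status could be packed into a couple of header bits alongside a (hypothetical) object ID. The field widths and state names below are pure assumption, not a proposed header format.

```java
// Toy object header: 2 sharing-status bits packed below an object ID.
// The states and layout are illustrative assumptions only.
public class ObjectHeader {
    static final int LOCAL = 0, SHARED_READ = 1, SHARED_WRITE = 2, SPECULATIVE = 3;

    // Pack the object ID into the upper bits and the sharing state into
    // the low 2 bits of a 64-bit header word.
    static long pack(long objectId, int sharing) {
        return (objectId << 2) | sharing;
    }
    static int sharing(long header)  { return (int) (header & 0b11); }
    static long objectId(long header) { return header >>> 2; }

    public static void main(String[] args) {
        long h = pack(42, SHARED_READ);
        System.out.println(objectId(h) + " " + sharing(h)); // prints "42 1"
    }
}
```

A coherence controller (or CSP message handler) could test these bits to decide whether an access may proceed locally or must be negotiated with other cores.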

I believe a desirable goal would be a > 32-core CMP system, with cores capable of doing graphics tasks, meaning that expensive GPUs (in terms of both cost and power) can be dropped from the overall system design. It should be a system capable of running the envisaged Jamaica software stack of a Java OS combining the Jikes RVM compiler and the PearColator binary translator. Binary translation means we will want some efficient means to represent address spaces in an object-oriented memory. The object coherence scheme should allow for point-to-point buses, probably a combined software and hardware coherence protocol that can recognise that some contexts are speculative.

Page last modified on December 08, 2005, at 03:55 PM