Why did RISC architectures adopt the load-store architecture?

이동재 · July 28, 2021

I figure anyone working on computer architecture has wondered about this at least once, so I've borrowed a post from Quora.

Source: https://www.quora.com/Why-do-most-RISC-systems-implement-load-store-architecture

Two concepts come into play here:

  • Orthogonality
  • Minimalism

Orthogonality means you can combine operations together with minimal restrictions. For example, suppose bucket A contains “operand addressing modes” and bucket B contains “mathematical operations.” A fully-orthogonal architecture would let you combine any operand addressing mode from bucket A with any mathematical operation from bucket B.

Minimalism, especially in the context of a RISC architecture, means breaking down operations into their fundamental pieces in a way that allows them to be combined however you like. The goal is to have a minimal* number of pieces that can be combined into all the operations you could envision. The pieces should be simple enough to only require 1 execute cycle in the pipeline. If you need to add two numbers, then if at all possible you'd like to limit yourself to the fundamental ways of adding two numbers, without considering where the operands come from.

Note: If you examine MIPS, SPARC, ARM, and other RISC / RISC-like processors, you’ll discover there’s not necessarily a consensus on what constitutes a fundamental set of operations. For example, do you need separate signed and unsigned addition? Different word widths? In the end, if you can do it in a single cycle, you may get a pass…

In any case, this led to the idea of breaking apart memory accesses from computations. It allows you to retain orthogonality while still reaching for minimalism.

In CISC machines, you’ll find a mix of memory-register, register-memory, and register-register instructions. For example, in x86, I can do an ADD between a value in memory and a value in a register, writing the result back to memory. Or I can add two values in registers, writing the result to a register. Or, I can add a register with a value in memory, writing that back to a register. If you want to make such a machine orthogonal, you now end up with instructions for every combination of addressing mode crossed with every computation type you support.

In a RISC-like Load/Store architecture, the memory access is factored out to its own instructions. So, instead of needing O(modes×operations) instructions to reach full orthogonality, you only need O(modes+operations) instructions.
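As a toy sketch of that factoring (the register names, addresses, and three-step split are my own illustration, not from the post), a CISC-style "add register into memory" instruction decomposes into an explicit load, a register-only add, and a store:

```python
memory = {0x100: 7}            # toy memory: address -> value
regs = {"r1": 5, "r2": 0}      # toy register file

# CISC-style: one instruction both touches memory and computes.
def add_mem_reg(addr, reg):
    memory[addr] = memory[addr] + regs[reg]

# RISC-style load/store: the memory access is factored into its own instructions.
def load(reg, addr):
    regs[reg] = memory[addr]

def store(reg, addr):
    memory[addr] = regs[reg]

def add(dst, a, b):
    regs[dst] = regs[a] + regs[b]

add_mem_reg(0x100, "r1")       # one CISC op: memory[0x100] becomes 12

load("r2", 0x100)              # lw  r2, 0x100
add("r2", "r2", "r1")          # add r2, r2, r1
store("r2", 0x100)             # sw  r2, 0x100  -> memory[0x100] becomes 17
```

Only `load` and `store` ever need addressing modes; `add` works purely on registers, which is exactly what lets the two buckets vary independently.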

Your instruction pressure drops from quadratic to linear. That makes it much easier to achieve actual orthogonality.

Load/store has an additional benefit: Memory accesses are expensive. They often have large and unpredictable latencies. Factoring the memory accesses out from regular computation makes it easier to schedule the memory accesses independently of the instructions that depend on them. That’s a big part of why CISC architectures will crack CISC instructions into RISC-like µops under the hood: It makes it easier to handle the memory system effects.

The early RISC architectures didn’t leverage that the way modern machines do. The early RISC architectures thought that exposing the latency of a memory read was a good idea, and so introduced the world to the idea of a load delay slot. Unfortunately, that concept doesn’t scale when you need to change the pipeline. If you’re making embedded processors that you’re happy to statically schedule instructions for, you can get away with it. (I did for about 20 years.) It doesn’t really work for mainstream processors that have to run binaries for code you can’t recompile.

Modern RISC/RISC-like architectures just run with the inherent advantage that the memory access is split from the computation, and use various scoreboarding techniques to dynamically schedule instructions into the pipeline when their arguments are available.
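A heavily simplified sketch of that idea (my own toy model, not any real scoreboard design): an instruction is allowed to issue only once all of its source registers have their values ready, so computation naturally waits out memory latency without stalling unrelated work.

```python
# Which registers currently hold valid results; r2 is still waiting on a load.
ready = {"r1": True, "r2": False, "r3": True}

pending = [
    ("add", "r4", ("r1", "r3")),   # both sources ready -> can issue now
    ("add", "r5", ("r2", "r3")),   # must wait for the load into r2
]

issuable = [inst for inst in pending if all(ready[src] for src in inst[2])]
print([dest for _, dest, _ in issuable])
```

When the load completes and flips `ready["r2"]`, the second add becomes issuable on a later cycle.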

One final thought: These days, I consider RISC and CISC to be more marketing labels than any sort of strict classification. There are many machines that claim to be RISC that look nothing like the stark minimalism of the MIPS R2000. On the other side, practical modern CISC machines optimize their most RISC-like subset, making it easy to crack instructions into RISC-like pieces to run on the underlying microarchitecture.

  • OK, not absolutely minimal. There’s a practical threshold. You could reduce everything to NANDs at the limit, but you won’t if you’re building a practical processor. At some point cutting your pipeline stages even narrower makes common, serial, critical path operations take more clock cycles and more wall-clock time.

In practice, an integer addition at the machine word size seems to be the threshold, at least on the processors I’ve worked on. If you’re at or below that, you’re good. If you’re above that, you get cut into multiple cycles or multiple operations. Why? Because accumulation (foo += bar) is extremely common, and making that operation take multiple cycles would blow up the cycle count of too many things.

That means that practical RISC-branded machines aren’t quite as “Reduced” as the acronym implies. Like I said, it’s more a marketing label than a strict engineering classification.
