Science Fair Project Encyclopedia
The simplest processors are scalar processors. A scalar processor processes one data item at a time. In a vector processor, by contrast, a single instruction operates simultaneously on multiple data items. The difference is analogous to the difference between scalar and vector arithmetic. A superscalar processor combines aspects of both: each instruction still processes one data item, but there are multiple functional units, so several instructions can process separate data items at the same time.
A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. That is essentially the purpose of the architecture.
Merely processing multiple instructions at the same time does not make an architecture superscalar. Simple pipelining, where a CPU may be loading an instruction while doing arithmetic for the previous one and storing the results from the one before that (thus executing three instructions at the same time), is not superscalar processing, because such a pipeline still starts at most one instruction per cycle.
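The difference between pipelining and superscalar execution can be illustrated with an idealized cycle count. The sketch below is not a model of any real CPU: it assumes no stalls, no dependencies between instructions, and a pipeline that is always kept full. The function names and the parameters (stages, width) are illustrative only.

```python
import math

def pipelined_cycles(n_instructions, stages):
    """Ideal scalar pipeline: after the pipeline fills, one instruction
    completes per cycle (assumes no stalls of any kind)."""
    if n_instructions == 0:
        return 0
    return stages + n_instructions - 1

def superscalar_cycles(n_instructions, stages, width):
    """Ideal superscalar pipeline that starts `width` instructions per
    cycle (assumes every instruction is independent of the others)."""
    if n_instructions == 0:
        return 0
    return stages + math.ceil(n_instructions / width) - 1

def instructions_per_cycle(n_instructions, cycles):
    return n_instructions / cycles

n = 300
scalar = pipelined_cycles(n, stages=3)               # 302 cycles
dual = superscalar_cycles(n, stages=3, width=2)      # 152 cycles
print(instructions_per_cycle(n, scalar))  # slightly below 1
print(instructions_per_cycle(n, dual))    # close to 2
```

Under these assumptions the three-stage pipeline never exceeds one instruction per cycle, however many instructions are in flight, while the two-wide design approaches two.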
Essentially all general-purpose CPUs developed since about 1998 are superscalar.
In a superscalar CPU, there are several functional units of the same type, along with additional circuitry to dispatch instructions to the units. For instance, most superscalar designs include more than one integer unit (typically referred to as an ALU). The dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching them to the available units.
Performance of the dispatcher is key to the overall performance of a superscalar design. The task is not a simple one. The instructions
a = b + c; d = e + f
can be run in parallel because neither result depends on the other calculation. However, the instructions
a = b + c; d = a + f
cannot simply be run in parallel, because the second reads the value of a that the first produces; whether they can overlap at all depends on the order in which the instructions complete as they move through the units.
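The check a dispatcher has to make for the two instruction pairs above can be sketched in a few lines. This is a deliberately simplified model, not how any particular CPU implements it: it assumes a two-wide, in-order dispatcher, single-cycle instructions, and results that are available in the cycle after issue. The names Instr, can_issue_together and dual_issue_cycles are hypothetical.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    dest: str               # name written by the instruction
    srcs: Tuple[str, ...]   # names read by the instruction

def can_issue_together(first, second):
    """Return True if `second` may issue in the same cycle as `first`."""
    reads_result = first.dest in second.srcs   # second needs first's result
    same_target = first.dest == second.dest    # both write the same name
    return not (reads_result or same_target)

def dual_issue_cycles(program):
    """Count issue cycles for an in-order dispatcher that issues at most
    two adjacent instructions per cycle (illustrative model only)."""
    cycles = 0
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_issue_together(program[i], program[i + 1]):
            i += 2   # both instructions go to separate units this cycle
        else:
            i += 1   # the next instruction must wait for a later cycle
        cycles += 1
    return cycles

# a = b + c; d = e + f
independent = [Instr("a", ("b", "c")), Instr("d", ("e", "f"))]
# a = b + c; d = a + f
dependent = [Instr("a", ("b", "c")), Instr("d", ("a", "f"))]

print(dual_issue_cycles(independent))  # 1
print(dual_issue_cycles(dependent))    # 2
```

In this model the independent pair issues in a single cycle, while the dependent pair takes two because the second instruction has to wait for a.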
Much of modern CPU design is dedicated to increasing the accuracy of the dispatcher system, and allowing it to keep the multiple units in use at all times. This has become increasingly important as the number of units has increased. While early superscalar CPUs would have two ALUs and a single FPU, a modern design like the PowerPC 970 includes four ALUs, two FPUs and two SIMD units. If the dispatcher is ineffective in keeping all of these units fed with instructions, the performance of the system as a whole will suffer greatly.
Superscalar systems were originally implemented on RISC CPUs. This was because the RISC design results in a simple core, leaving room for several functional units to be built onto a single CPU. This was the reason that RISC designs were faster than CISC designs through the 1980s and into the 1990s, but as chip manufacturing processes improved, even "complex" designs like the IA-32 were able to go superscalar.
Dramatic improvements in the quality of the dispatch logic now appear unlikely, limiting future improvements in the speed of the basic superscalar design. One potential solution to this problem is to move the dispatcher logic out of the chip and into the compiler, which can spend considerably more time and effort on making the best decisions possible. This is the basic premise of very long instruction word (VLIW) CPU designs, an approach also known as static superscalar or compile-time scheduling.
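As a rough illustration of compile-time scheduling, the sketch below packs a straight-line sequence of instructions into fixed-width bundles ahead of time, padding unused slots with no-ops. It is a toy greedy packer under stated assumptions, not the algorithm of any real VLIW compiler: instructions are (destination, source, source) tuples, every instruction takes one cycle, no reordering is attempted, and the names conflicts and schedule_bundles are hypothetical.

```python
NOP = None  # placeholder for an empty slot in a bundle

def conflicts(earlier, later):
    """True if `later` cannot share a bundle with `earlier`."""
    e_dest, *e_srcs = earlier
    l_dest, *l_srcs = later
    return (
        e_dest in l_srcs       # later reads what earlier writes
        or e_dest == l_dest    # both write the same name
        or l_dest in e_srcs    # later overwrites what earlier reads (conservative)
    )

def schedule_bundles(program, width=2):
    """Greedily pack instructions, in program order, into bundles of
    `width` slots; a conflict or a full bundle starts a new bundle."""
    bundles = []
    current = []
    for instr in program:
        blocked = any(conflicts(prev, instr) for prev in current)
        if blocked or len(current) == width:
            bundles.append(current + [NOP] * (width - len(current)))
            current = []
        current.append(instr)
    if current:
        bundles.append(current + [NOP] * (width - len(current)))
    return bundles

program = [
    ("a", "b", "c"),  # a = b + c
    ("d", "a", "f"),  # d = a + f  (needs a, so it starts a new bundle)
    ("g", "h", "i"),  # g = h + i  (independent, shares the second bundle)
]
for bundle in schedule_bundles(program):
    print(bundle)
```

Because the packing happens before the program runs, the hardware in such a design only has to issue each bundle as written; the cost is the wasted slot in the first bundle, which a more ambitious compiler might fill by reordering.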
The contents of this article are licensed from www.wikipedia.org under the GNU Free Documentation License.