• To use an efficient instruction pipeline
    • To implement an instruction pipeline using a small number of suboperations, with each being executed in one clock cycle.
    • Because of the fixed-length instruction format, the decoding of the operation can occur at the same time as the register selection.
    • Therefore, the instruction pipeline can be implemented with two or three segments.
      • One segment fetches the instruction from program memory
      • The other segment executes the instruction in the ALU
      • A third segment may be used to store the result of the ALU operation in a destination register
    • The data transfer instructions in RISC are limited to load and store instructions.
      • These instructions use register indirect addressing. They usually need three or four stages in the pipeline.
      • To prevent conflicts between a memory access to fetch an instruction and to load or store an operand, most RISC machines use two separate buses with two memories.
      • Cache memory: operates at the same speed as the CPU clock
    • One of the major advantages of RISC is its ability to execute instructions at the rate of one per clock cycle.
      • In effect, a new instruction is started with each clock cycle, and the processor is pipelined to achieve the goal of single-cycle instruction execution.
      • RISC can divide the pipeline into segments that each require just one clock cycle (see the sketch after this list).
    • Compiler support that translates the high-level language program into a machine language program.
      • Instead of designing hardware to handle the difficulties associated with data conflicts and branch penalties, RISC processors rely on the efficiency of the compiler to detect and minimize the delays encountered with these problems.
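  As a rough illustration of the single-cycle claim, a minimal sketch (the segment and instruction counts here are arbitrary):

      # Cycles needed to run n instructions through a k-segment pipeline,
      # assuming one new instruction is issued per clock and no stalls.
      def pipeline_cycles(n_instructions: int, n_segments: int) -> int:
          # The first instruction takes n_segments cycles; each later one adds 1.
          return n_segments + n_instructions - 1

      # With a 3-segment pipeline, 100 instructions need 102 cycles,
      # i.e. about 1.02 cycles per instruction -- effectively one per clock.
      print(pipeline_cycles(100, 3))   # -> 102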

Example: Three-Segment Instruction Pipeline

  • There are three types of instructions:
    • The data manipulation instructions: operate on data in processor registers
    • The data transfer instructions: load and store, which move data between data memory and processor registers
    • The program control instructions: branches and jumps, which change the sequence of program execution
  • The control section fetches the instruction from program memory into an instruction register.
    • The instruction is decoded at the same time that the registers needed for the execution of the instruction are selected.
  • The processor unit consists of a number of registers and an arithmetic logic unit (ALU).
  • A data memory is used to load or store the data from a selected register in the register file.
  • The instruction cycle can be divided into three suboperations and implemented in three segments:
    • I: Instruction fetch
      • Fetches the instruction from program memory
    • A: ALU operation
      • The instruction is decoded and an ALU operation is performed.
      • It performs an operation for a data manipulation instruction.
      • It evaluates the effective address for a load or store instruction.
      • It calculates the branch address for a program control instruction.
    • E: Execute instruction
      • Directs the output of the ALU to one of three destinations, depending on the decoded instruction.
      • It transfers the result of the ALU operation into a destination register in the register file.
      • It transfers the effective address to a data memory for loading or storing.
      • It transfers the branch address to the program counter.
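  The division of work between the A and E segments can be modeled with a small sketch (the instruction fields and encodings are invented for illustration, not the actual hardware):

      # Simplified model of the A (ALU) and E (execute) segments of the
      # three-segment pipeline described above.
      def a_segment(instr, regs):
          kind = instr["kind"]
          if kind == "alu":                      # data manipulation, e.g. ADD
              return instr["op"](regs[instr["src1"]], regs[instr["src2"]])
          if kind in ("load", "store"):          # evaluate the effective address
              return regs[instr["base"]] + instr["offset"]
          if kind == "branch":                   # calculate the branch target
              return instr["pc"] + instr["offset"]

      def e_segment(instr, alu_out, regs, memory, state):
          kind = instr["kind"]
          if kind == "alu":
              regs[instr["dest"]] = alu_out          # result -> destination register
          elif kind == "load":
              regs[instr["dest"]] = memory[alu_out]  # read data memory at the address
          elif kind == "store":
              memory[alu_out] = regs[instr["src"]]   # write data memory at the address
          elif kind == "branch":
              state["pc"] = alu_out                  # transfer branch target to PC

      # Example: ADD R3 <- R1 + R2 passes through A (compute) and E (write back).
      regs = {"R1": 5, "R2": 7}
      add = {"kind": "alu", "op": lambda a, b: a + b,
             "src1": "R1", "src2": "R2", "dest": "R3"}
      out = a_segment(add, regs)
      e_segment(add, out, regs, memory={}, state={"pc": 0})
      print(regs["R3"])   # -> 12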

Delayed Load

  • Consider the operation of the following four instructions:
    • LOAD: R1 ← M[address 1]
    • LOAD: R2 ← M[address 2]
    • ADD: R3 ← R1 + R2
    • STORE: M[address 3] ← R3
  • There will be a data conflict in instruction 3 because the operand in R2 is not yet available in the A segment.
  • This can be seen from the timing of the pipeline shown in Fig. 4-9(a).
    • The E segment in clock cycle 4 is in the process of placing the memory data into R2.
    • The A segment in clock cycle 4 is using the data from R2.
  • It is up to the compiler to make sure that the instruction immediately following a load does not use the data being fetched from memory.
  • This concept of delaying the use of the data loaded from memory is referred to as delayed load.

             

                           Fig 4-9(a): Three segment pipeline timing - Pipeline timing with data conflict
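  The timing of Fig 4-9(a) can be reproduced with a small sketch, assuming one instruction enters the pipeline per clock, so instruction i occupies segment number s in clock cycle i + s:

      # Print the clock cycle in which each instruction occupies each segment.
      program = ["LOAD R1", "LOAD R2", "ADD R3,R1,R2", "STORE R3"]
      segments = ("I", "A", "E")

      for i, instr in enumerate(program, start=1):
          timing = {seg: i + s for s, seg in enumerate(segments)}
          print(f"{instr:14s} I={timing['I']} A={timing['A']} E={timing['E']}")

      # The output shows LOAD R2 still in E (writing R2) in clock cycle 4,
      # while ADD R3,R1,R2 is already in A (reading R2) -- the data conflict.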

 

  • Fig 4-9(b) shows the same program with a no-op instruction inserted after the load to R2 instruction.

                 

                             Fig 4-9(b): Three segment pipeline timing - Pipeline timing with delayed load

 

  • Thus the no-op instruction is used to advance one clock cycle in order to compensate for the data conflict in the pipeline.
  • The advantage of the delayed load approach is that the data dependency is taken care of by the compiler rather than the hardware.
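  A compiler pass for delayed load might look like the following sketch (the instruction representation is invented for illustration, not a real RISC encoding):

      # Insert a no-op whenever an instruction reads a register that the
      # immediately preceding load instruction is still fetching from memory.
      def insert_load_delays(program):
          out = []
          for instr in program:
              mnemonic, dest, sources = instr     # (mnemonic, dest reg, source regs)
              if out:
                  prev_mnemonic, prev_dest, _ = out[-1]
                  if prev_mnemonic == "LOAD" and prev_dest in sources:
                      out.append(("NOP", None, ()))   # fill the load delay slot
              out.append(instr)
          return out

      program = [
          ("LOAD",  "R1", ()),
          ("LOAD",  "R2", ()),
          ("ADD",   "R3", ("R1", "R2")),
          ("STORE", None, ("R3",)),
      ]
      # Produces LOAD, LOAD, NOP, ADD, STORE -- the schedule of Fig 4-9(b).
      print(insert_load_delays(program))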

Delayed Branch

  • The method used in most RISC processors is to rely on the compiler to redefine the branches so that they take effect at the proper time in the pipeline. This method is referred to as delayed branch.
  • The compiler is designed to analyze the instructions before and after the branch and rearrange the program sequence by inserting useful instructions in the delay steps.
  • It is up to the compiler to find useful instructions to put after the branch. Failing that, the compiler can insert no-op instructions (a sketch of such a pass follows).
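  A minimal sketch of such a pass (the two delay slots and the safety test are simplifying assumptions):

      # Move instructions from before the branch into its delay slots when they
      # do not affect the branch decision; otherwise fill the slots with no-ops.
      def fill_branch_delay_slots(before, branch, n_slots=2, is_safe=lambda i: True):
          movable = [i for i in before if is_safe(i)][-n_slots:]
          kept = [i for i in before if i not in movable]
          slots = movable + ["NOP"] * (n_slots - len(movable))
          return kept + [branch] + slots

      # The add and subtract below do not affect the branch, so they can be
      # placed in the delay slots, as in Fig 4-10(b).
      print(fill_branch_delay_slots(
          ["LOAD R1", "INC R2", "ADD R3,R4", "SUB R6,R5"], "BRANCH X"))
      # -> ['LOAD R1', 'INC R2', 'BRANCH X', 'ADD R3,R4', 'SUB R6,R5']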

An Example of Delayed Branch

  • The program for this example consists of five instructions:
    • Load from memory to R1
    • Increment R2
    • Add R3 to R4
    • Subtract R5 from R6
    • Branch to address X
  • In Fig. 4-10(a) the compiler inserts two no-op instructions after the branch instruction.
    • The branch address X is transferred to PC in clock cycle 7.

           

                                                  Fig 4-10(a): Using no operation instruction

  • The program in Fig. 4-10(b) is rearranged by placing the add and subtract instructions after the branch instruction.
    • PC is updated to the value of X in clock cycle 5.

           

                                                       Fig 4-10(b): Rearranging the instructions
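  A quick cross-check of the two figures, assuming the branch writes PC in its E segment, two clock cycles after it is fetched:

      # Clock cycle in which the branch target reaches PC, given the branch's
      # position in the instruction stream and one instruction issued per clock.
      def pc_update_cycle(branch_position):
          return branch_position + 2

      print(pc_update_cycle(5))  # branch is 5th in Fig 4-10(a) -> clock cycle 7
      print(pc_update_cycle(3))  # branch is 3rd in Fig 4-10(b) -> clock cycle 5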