- Wednesday, February 20 (11:59pm)
- Extension: Monday, February 25
Homework can be submitted via email or as a printout by the deadline.
Patterson & Hennessy, Chapter 2 and Chapter 3 (through Section 3.5)
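One can compare the CPU performance of two processors (with respect to a program) by measuring the time each takes to run the same program and taking the inverse ratio of those times. I.e.,

$$\frac{\text{Performance}_{M2}}{\text{Performance}_{M1}} = \frac{\text{Time}_{M1}}{\text{Time}_{M2}}$$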
For Program 1, TimeM1 / TimeM2 = 10/5 = 2. So M2 is twice as fast as M1.
For Program 2, TimeM1 / TimeM2 = 3/4, so M1 is 1.33 times as fast as M2.
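The two comparisons above can be checked with a short Python sketch. The function name `relative_performance` is illustrative and not from the text; the execution times are the ones used in the calculations above.

```python
def relative_performance(time_x, time_y):
    """Return Performance_X / Performance_Y, which equals Time_Y / Time_X."""
    return time_y / time_x

# Execution times in seconds, as used in the calculations above.
program1 = {"M1": 10, "M2": 5}
program2 = {"M1": 3, "M2": 4}

# Program 1: how much faster is M2 than M1?
p1 = relative_performance(program1["M2"], program1["M1"])
print(f"Program 1: M2 is {p1:.2f} times as fast as M1")

# Program 2: how much faster is M1 than M2?
p2 = relative_performance(program2["M1"], program2["M2"])
print(f"Program 2: M1 is {p2:.2f} times as fast as M2")
```

Note that the faster machine's time goes in the denominator, so the ratio is greater than 1 when the first argument belongs to the faster machine.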
Here we are given the number of instructions executed by the processors in running Program 1 (from the previous question), and asked to compute the (average) number of instructions executed per second.
For M1, that's 200 x 10^6 instructions / 10 sec = 2 x 10^7 instructions/sec.
For M2, that's 160 x 10^6 instructions / 5 sec = 3.2 x 10^7 instructions/sec.
Here we are given the clock rates (clock cycles per second) for our two processors, and asked for the respective CPI values over Program 1. By definition,

$$\text{Cycles per Instruction (CPI)} = \frac{\text{Clock rate (cycles per second)}}{\text{Instruction rate (instructions per second)}}$$
For M1, that's 200 MHz / (2 x 10^7 instructions/sec) = (200 x 10^6) / (20 x 10^6) = 10 cycles per instruction.
For M2, that's 300 MHz / (3.2 x 10^7 instructions/sec) = (300 x 10^6) / (32 x 10^6) = 9.375 cycles per instruction.
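As a cross-check of the instruction-rate and CPI arithmetic, here is a minimal Python sketch. The helper names `instruction_rate` and `cpi` are illustrative; the instruction counts, run times, and clock rates are the figures quoted above.

```python
def instruction_rate(instruction_count, seconds):
    """Average instructions executed per second."""
    return instruction_count / seconds

def cpi(clock_rate_hz, instructions_per_second):
    """Cycles per instruction = (cycles/sec) / (instructions/sec)."""
    return clock_rate_hz / instructions_per_second

# Program 1 figures from the text.
m1_rate = instruction_rate(200e6, 10)   # M1: 200 x 10^6 instructions in 10 s
m2_rate = instruction_rate(160e6, 5)    # M2: 160 x 10^6 instructions in 5 s

m1_cpi = cpi(200e6, m1_rate)            # M1 clock: 200 MHz
m2_cpi = cpi(300e6, m2_rate)            # M2 clock: 300 MHz

print(f"M1: {m1_rate:.3g} instructions/sec, CPI = {m1_cpi}")
print(f"M2: {m2_rate:.3g} instructions/sec, CPI = {m2_cpi}")
```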
To determine the peak performance of a processor, we need to identify an instruction sequence that the machine can execute maximally fast. Since the processor's clock rate is fixed, this is the same as identifying the sequence that executes (on the machine) in the fewest clock cycles per instruction.
How many instructions long would this sequence be? That does not matter, since it is the average time per instruction that counts. That average, measured in clock cycles, is just the CPI. So the problem has been reduced to finding the instruction sequence with the smallest possible CPI.
For M1, the peak performance will be achieved with a sequence of instructions of class A, which have a CPI of 1. The peak performance is thus 500 million instructions per second (500 MIPS).
For M2, a mixture of A and B instructions, both of which have CPI of 2, will achieve the peak performance, which is 375 MIPS.
An instruction sequence evenly divided among the four classes listed in the previous question would have a CPI of (1+2+3+4)/4 = 2.5 for M1, and (2+2+4+4)/4 = 3.0 for M2.
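The peak-rate and average-CPI figures can be reproduced with the sketch below. The per-class CPI values are taken from the averages computed above. The clock rates (500 MHz for M1, 750 MHz for M2) are an assumption: they are not stated in this text, but are inferred from the quoted peak figures (500 MIPS at CPI 1, 375 MIPS at CPI 2).

```python
# CPI for each instruction class, per machine (from the averages above).
CPI = {
    "M1": {"A": 1, "B": 2, "C": 3, "D": 4},
    "M2": {"A": 2, "B": 2, "C": 4, "D": 4},
}

# Assumed clock rates in MHz, inferred from the peak MIPS figures.
CLOCK_MHZ = {"M1": 500, "M2": 750}

def peak_mips(machine):
    """Peak rate: run only the instruction class with the lowest CPI."""
    return CLOCK_MHZ[machine] / min(CPI[machine].values())

def average_cpi(machine):
    """CPI of a sequence split evenly among the four classes."""
    values = list(CPI[machine].values())
    return sum(values) / len(values)

for m in ("M1", "M2"):
    print(f"{m}: peak = {peak_mips(m):.0f} MIPS, even-mix CPI = {average_cpi(m)}")
```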
For any given program, we can go from CPI to execution time using the following formula:

$$\text{CPU execution time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}$$
(This is the same formula as that presented at the start of Exercise Set A, but with some different term names.)
Let's use I to refer to the number of instructions in the "certain program" mentioned in this question. Thus,
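$$\text{Time}_{M1} = I \times \text{CPI}_{M1} \times \text{Clock cycle time}_{M1} = I \times 2.5 \times \frac{1}{500 \times 10^6\ \text{Hz}} = \frac{I}{200 \times 10^6}\ \text{seconds}$$

Similarly,

$$\text{Time}_{M2} = I \times \text{CPI}_{M2} \times \text{Clock cycle time}_{M2} = I \times 3.0 \times \frac{1}{750 \times 10^6\ \text{Hz}} = \frac{I}{250 \times 10^6}\ \text{seconds}$$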
Applying the formula from Question 1 above, we see that M2 is 250/200 or 1.25 times faster than M1.
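The following sketch evaluates the execution-time formula numerically. The instruction count `I` is a hypothetical value chosen only for illustration (it cancels in the ratio), and the clock rates of 500 MHz and 750 MHz are assumptions inferred from the peak MIPS figures rather than values stated in this text.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count x CPI x clock cycle time (1 / clock rate)."""
    return instruction_count * cpi / clock_rate_hz

I = 1_000_000            # hypothetical instruction count; cancels in the ratio

time_m1 = cpu_time(I, 2.5, 500e6)   # assumed 500 MHz clock for M1
time_m2 = cpu_time(I, 3.0, 750e6)   # assumed 750 MHz clock for M2

ratio = time_m1 / time_m2           # Performance_M2 / Performance_M1
print(f"Time M1 = {time_m1:.4f} s, Time M2 = {time_m2:.4f} s")
print(f"M2 is {ratio:.2f} times faster than M1")
```

Changing `I` scales both times equally, which is why the answer can be stated without knowing the program's length.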
We'll use I to denote the number of instructions in the program, and C to denote the number of clock cycles in the program. Observing that CPI x I = C, and that "clock rate" is just the reciprocal of "clock cycle time", the subsets of the variables that together can be used to calculate execution time are:
|
Therefore, Machine B executes 9/8 as many instructions as Machine A (that is, 12.5% more).
| Binary (Base 2) | Decimal (Base 10) | Hexadecimal (Base 16) |
| 0101 0000 | 80 | 0x50 |
| 1001 0000 | 144 | 0x90 |
| 1000 0001 | 129 | 0x81 |
| 1101 1000 | 216 | 0xD8 |
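The rows of the table can be regenerated with a few lines of Python, treating each bit pattern as an unsigned 8-bit value (an assumption consistent with the decimal column). The function name `convert` is illustrative.

```python
def convert(bits):
    """Convert a spaced binary string to (decimal, hexadecimal string)."""
    value = int(bits.replace(" ", ""), 2)   # parse as an unsigned base-2 number
    return value, f"0x{value:02X}"

for bits in ("0101 0000", "1001 0000", "1000 0001", "1101 1000"):
    decimal, hexadecimal = convert(bits)
    print(f"| {bits} | {decimal} | {hexadecimal} |")
```

Each group of four bits maps to exactly one hexadecimal digit, which is why the hex column can also be read off directly from the binary column.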
    begin:  addi $t0, $zero, 0        # initialize $t0 = 0
            addi $t1, $zero, 1        # initialize $t1 = 1
    loop:   slt  $t2, $a0, $t1        # test if $a0 < $t1
            bne  $t2, $zero, finish   # if true, go to finish
            add  $t0, $t0, $t1        # $t0 += $t1
            addi $t1, $t1, 2          # $t1 += 2
            j    loop                 # go to loop
    finish: add  $v0, $t0, $zero      # result $v0 = $t0
This program calculates the sum of the odd numbers 1 + 3 + 5 + ... + n (ending at n - 1 if n is even), where n is the value in $a0. This equals (ceiling(n/2))^2.
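To check the closed form, the loop can be mirrored in Python. This is a sketch of the control flow only, assuming a non-negative n; the function name `sum_of_odds` is illustrative, and the variables are named after the registers they stand in for.

```python
def sum_of_odds(n):
    """Mirror the MIPS loop: $a0 = n, $t0 = running sum, $t1 = next odd number."""
    t0 = 0                 # addi $t0, $zero, 0
    t1 = 1                 # addi $t1, $zero, 1
    while not (n < t1):    # slt/bne: leave the loop once $a0 < $t1
        t0 += t1           # add  $t0, $t0, $t1
        t1 += 2            # addi $t1, $t1, 2
    return t0              # add  $v0, $t0, $zero

# For integers n >= 0, (n + 1) // 2 equals ceiling(n / 2).
for n in range(0, 20):
    assert sum_of_odds(n) == ((n + 1) // 2) ** 2

print(sum_of_odds(5), sum_of_odds(6))
```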
Here's the program from p.127 of the text:
    Loop:   add  $t1, $s3, $s3        # Temp reg $t1 = 2 * i
            add  $t1, $t1, $t1        # Temp reg $t1 = 4 * i
            add  $t1, $t1, $s6        # $t1 = address of save[i]
            lw   $t0, 0($t1)          # Temp reg $t0 = save[i]
            bne  $t0, $s5, Exit       # go to Exit if save[i] != k
            add  $s3, $s3, $s4        # i = i + j
            j    Loop                 # go to Loop
    Exit:
If the number of loop iterations is 10 (save[i] != k on the 11th try) then the number of instructions executed is 7 x 10 + 5 (for the final pass) = 75.
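The count of 75 can be confirmed by stepping through the loop's control flow in Python. This is a sketch, not a MIPS simulator: the function name `count_original` and the array contents are hypothetical, chosen so that save[i] == k ten times and then fails on the 11th test.

```python
def count_original(save, i, j, k):
    """Count instructions executed by the original 7-instruction loop."""
    executed = 0
    while True:
        executed += 5          # add, add, add, lw, bne
        if save[i] != k:       # bne taken: branch to Exit
            return executed
        executed += 2          # add $s3, $s3, $s4 and j Loop
        i += j

# Hypothetical data: ten matching elements, then a mismatch.
save = [7] * 10 + [0]
print(count_original(save, i=0, j=1, k=7))
```

Each full iteration costs 7 instructions, while the final pass stops after the taken branch and costs only 5.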
We are asked to rewrite the program so that there is only one branch/jump instruction within the loop. This will have the effect of reducing the number of instructions inside the loop. Since those instructions are each executed 10 times, this will serve to reduce the total number of instructions executed.
To achieve this, we need to change the branch (BNE) instruction so that in the termination case, it can just "fall through" to Exit. Simply reversing the sense of the test (and changing the target address from Exit to Loop) is a start, but we will need to move [ADD $s3, $s3, $s4] somewhere. We can't just insert it at Loop, because then $s3 (that is, our loop index i) would start too high (i+j instead of i).
One approach is to move the first iteration of the loop "out".
            add  $t1, $s3, $s3        # Temp reg $t1 = 2 * i
            add  $t1, $t1, $t1        # Temp reg $t1 = 4 * i
            add  $t1, $t1, $s6        # $t1 = address of save[i]
            lw   $t0, 0($t1)          # Temp reg $t0 = save[i]
            bne  $t0, $s5, Exit       # go to Exit if save[i] != k
    Loop:   add  $s3, $s3, $s4        # i = i + j
            add  $t1, $s3, $s3        # Temp reg $t1 = 2 * i
            add  $t1, $t1, $t1        # Temp reg $t1 = 4 * i
            add  $t1, $t1, $s6        # $t1 = address of save[i]
            lw   $t0, 0($t1)          # Temp reg $t0 = save[i]
            beq  $t0, $s5, Loop       # go to Loop if save[i] == k
    Exit:
While our program is now longer, it will in fact execute fewer instructions. In the same scenario as before (save[i] != k first holds on the 11th test), the number of instructions executed is now 5 (for the first pass) + 6 x 10 = 65.
We can optimize the program a lot further (although the question doesn't ask for it) by observing that we can avoid multiplying i by 4 (to calculate the memory address) on each iteration if, instead of adding j to i and recalculating the address, we add 4 * j to the address each time around.
            add  $t2, $s4, $s4        # Temp reg $t2 = 2 * j
            add  $t2, $t2, $t2        # Temp reg $t2 = 4 * j
            add  $t1, $s3, $s3        # Temp reg $t1 = 2 * i
            add  $t1, $t1, $t1        # Temp reg $t1 = 4 * i
            add  $t1, $t1, $s6        # $t1 = address of save[i]
            lw   $t0, 0($t1)          # Temp reg $t0 = save[i]
            bne  $t0, $s5, Exit       # go to Exit if save[i] != k
    Loop:   add  $t1, $t1, $t2        # $t1 = address of save[i + m * j]
            lw   $t0, 0($t1)          # Temp reg $t0 = save[i + m * j]
            beq  $t0, $s5, Loop       # go to Loop if save[i + m * j] == k
    Exit:
Using this program, our 10 iterations will now execute in 7 + 10 x 3 = 37 instructions!
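The three instruction counts can be compared side by side as functions of the number of matching elements. The sketch below simply encodes the per-pass costs read off the three listings; the function names are illustrative, and `matches` is the number of times save[i] == k before the first mismatch.

```python
def original(matches):
    """7 instructions per full iteration, plus 5 for the final pass."""
    return 7 * matches + 5

def first_iteration_peeled(matches):
    """5 instructions for the first test, then 6 per trip around Loop."""
    return 5 + 6 * matches

def address_stepping(matches):
    """7 instructions of setup and first test, then 3 per trip around Loop."""
    return 7 + 3 * matches

for name, count in (("original", original),
                    ("first iteration peeled", first_iteration_peeled),
                    ("address stepping", address_stepping)):
    print(f"{name}: {count(10)} instructions for 10 matches")
```

With zero matches the third version is actually the slowest (7 instructions versus 5), because its extra setup only pays off once the loop runs at least once.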
    Init:   add  $t0, $zero, $zero    # initialize $t0, the loop variable (byte offset for i)
            lw   $t1, 0($s0)          # load the value of c into $t1
            addi $t2, $zero, 4        # step for loop variable
            addi $t3, $zero, 401      # limit for loop variable
    Loop:   add  $t4, $a1, $t0        # $t4 = address of b[i]
            lw   $t5, 0($t4)          # $t5 = b[i]
            add  $t6, $t5, $t1        # $t6 = b[i] + c
            add  $t7, $a0, $t0        # $t7 = address of a[i]
            sw   $t6, 0($t7)          # a[i] = b[i] + c
            add  $t0, $t0, $t2        # $t0 = $t0 + 4 (i.e., i = i + 1)
            slt  $t8, $t0, $t3        # $t8 = 1 if $t0 < 401 (i.e., i <= 100)
            bne  $t8, $zero, Loop     # loop if i <= 100
    Exit:
The number of instructions executed is 4 (for initialization) + 8 x 101 (loop trips) = 812. The number of memory data references is 1 (init) + 2 x 101 (loop) = 203.
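Both totals can be verified by walking the loop's control flow in Python. This is a counting sketch only (it does not model memory contents), and the function name `run_loop` is illustrative.

```python
def run_loop():
    """Count loop trips, instructions, and memory data references."""
    instructions = 4        # Init: add, lw, addi, addi
    memory_refs = 1         # the lw that loads c
    offset = 0              # $t0: byte offset of element i
    step, limit = 4, 401    # $t2 and $t3
    trips = 0
    while True:
        trips += 1
        instructions += 8   # the eight instructions in the loop body
        memory_refs += 2    # lw of b[i] and sw to a[i]
        offset += step      # add $t0, $t0, $t2
        if not (offset < limit):   # slt/bne: fall through once $t0 >= 401
            break
    return trips, instructions, memory_refs

trips, instructions, memory_refs = run_loop()
print(trips, instructions, memory_refs)
```

The loop runs for offsets 0, 4, ..., 400, which is 101 trips, matching elements i = 0 through 100.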