0706.205.01 Computer Organization

Homework 1

Performance / Simple Machine Instruction


Due Date

Wednesday, February 20 (11:59pm)
Extension: Monday, February 25

Form of Submission

Homework can be submitted via email or printout by the deadline.

Preparation

Patterson & Hennessy, Chapter 2 and Chapter 3 (through Section 3.5)

Questions

  1. Patterson & Hennessy: Exercise 2.1 [3 points]

    One can compare the CPU performance of two processors (with respect to a program) by timing the same program on each and taking the inverse ratio of the execution times. That is,

    $$\frac{\text{CPU Performance}_A}{\text{CPU Performance}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A}$$

    For Program 1, Time_M1 / Time_M2 = 10/5 = 2, so M2 is twice as fast as M1.

    For Program 2, Time_M1 / Time_M2 = 3/4, so M1 is 4/3 (about 1.33) times as fast as M2.
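
    The ratio above can be checked with a few lines of Python. This is only a minimal sketch: the function name relative_performance and the layout of the times dictionary are assumptions made for illustration, while the execution times are the ones used in the answer.

```python
def relative_performance(time_a, time_b):
    """Return how many times as fast machine A is compared with machine B.

    Performance is the reciprocal of execution time, so
    perf_A / perf_B = time_B / time_A.
    """
    return time_b / time_a


# Execution times in seconds, as used in the answer above.
times = {
    "Program 1": {"M1": 10, "M2": 5},
    "Program 2": {"M1": 3, "M2": 4},
}

# Program 1: how much faster is M2 than M1?
p1 = relative_performance(times["Program 1"]["M2"], times["Program 1"]["M1"])
# Program 2: how much faster is M1 than M2?
p2 = relative_performance(times["Program 2"]["M1"], times["Program 2"]["M2"])

print(f"Program 1: M2 is {p1:.2f} times as fast as M1")
print(f"Program 2: M1 is {p2:.2f} times as fast as M2")
```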

  2. P&H: Exercise 2.2 [3 points]

    Here we are given the number of instructions executed by the processors in running Program 1 (from the previous question), and asked to compute the (average) number of instructions executed per second.

    For M1, that's 200 x 10^6 instructions / 10 sec = 2 x 10^7 instructions/sec.

    For M2, that's 160 x 10^6 instructions / 5 sec = 3.2 x 10^7 instructions/sec.

  3. P&H: Exercise 2.3 [3 points]

    Here we are given the clock rates (clock cycles per second) for our two processors, and asked for the respective CPI values over Program 1. By definition,

    $$\text{Cycles per Instruction (CPI)} = \frac{\text{cycles per second}}{\text{instructions per second}}$$

    For M1, that's 200 MHz / (2 x 10^7) = (200 x 10^6) / (20 x 10^6) = 10 cycles per instruction.

    For M2, that's 300 MHz / (3.2 x 10^7) = (300 x 10^6) / (32 x 10^6) = 9.375 cycles per instruction.
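
    The arithmetic for Questions 2 and 3 can be reproduced as follows. This is a sketch under the figures quoted in the answers (instruction counts, run times, and clock rates); the helper names are hypothetical.

```python
def instructions_per_second(instruction_count, seconds):
    """Average instruction rate over one run of the program."""
    return instruction_count / seconds


def cycles_per_instruction(clock_rate_hz, instr_per_sec):
    """CPI = (cycles per second) / (instructions per second)."""
    return clock_rate_hz / instr_per_sec


# Program 1 figures used in the answers above.
m1_rate = instructions_per_second(200_000_000, 10)   # M1: 200 x 10^6 instructions in 10 s
m2_rate = instructions_per_second(160_000_000, 5)    # M2: 160 x 10^6 instructions in 5 s

m1_cpi = cycles_per_instruction(200_000_000, m1_rate)  # M1 clock: 200 MHz
m2_cpi = cycles_per_instruction(300_000_000, m2_rate)  # M2 clock: 300 MHz

print(f"M1: {m1_rate:.3g} instructions/sec, CPI = {m1_cpi}")
print(f"M2: {m2_rate:.3g} instructions/sec, CPI = {m2_cpi}")
```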

  4. P&H: Exercise 2.10 [3 points]

    To determine the peak performance of a processor, we need to identify an instruction sequence that the machine can execute as fast as possible. Since the processor's clock rate is fixed, this is the same as identifying the sequence that executes (on the machine) in the fewest clock cycles per instruction.

    How many instructions long would this sequence be? That does not matter, because only the average time per instruction matters, and that average is determined by the CPI. So the problem reduces to finding the instruction sequence with the smallest possible CPI.

    For M1, the peak performance will be achieved with a sequence of class A instructions, which have a CPI of 1. With a 500 MHz clock, the peak performance is thus 500 million instructions per second (MIPS).

    For M2, a mixture of A and B instructions, both of which have CPI of 2, will achieve the peak performance, which is 375 MIPS.
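
    As a rough check, the peak rate is the clock rate divided by the smallest per-class CPI. The sketch below assumes the class CPIs and clock rates that appear in the answers to this question and the next; the dictionary and function names are made up for illustration.

```python
# CPI of each instruction class, per machine (values taken from the answers).
CLASS_CPI = {
    "M1": {"A": 1, "B": 2, "C": 3, "D": 4},
    "M2": {"A": 2, "B": 2, "C": 4, "D": 4},
}

# Clock rates in MHz.
CLOCK_MHZ = {"M1": 500, "M2": 750}


def peak_mips(machine):
    """Peak millions of instructions per second: clock rate / smallest CPI."""
    best_cpi = min(CLASS_CPI[machine].values())
    return CLOCK_MHZ[machine] / best_cpi


for machine in ("M1", "M2"):
    print(f"{machine}: peak = {peak_mips(machine):.0f} MIPS")
```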

  5. P&H: Exercise 2.11 [3 points]

    An instruction sequence evenly divided among the four classes listed in the previous question would have a CPI of (1+2+3+4)/4 = 2.5 for M1, and (2+2+4+4)/4 = 3 for M2.

    For any given program, we can go from CPI to execution time using the following formula:

    $$\text{CPU execution time} = \frac{\text{Instructions}}{\text{Program}} \times \text{CPI} \times \frac{1}{\text{Clock rate}}$$

    (This is the same formula as that presented at the start of Exercise Set A, but with some different term names.)

    Let's use I to refer to the number of instructions in the "certain program" mentioned in this question. Thus,

    $$\text{Time}_{M1} = I \times \text{CPI}_{M1} \times \frac{1}{\text{Rate}_{M1}} = I \times 2.5 \times \frac{1}{500\ \text{MHz}} = \frac{I}{200\ \text{million}}$$

    Similarly,

    $$\text{Time}_{M2} = I \times \text{CPI}_{M2} \times \frac{1}{\text{Rate}_{M2}} = I \times 3.0 \times \frac{1}{750\ \text{MHz}} = \frac{I}{250\ \text{million}}$$

    Applying the formula from Question 1 above, Time_M1 / Time_M2 = 250/200, so M2 is 1.25 times as fast as M1.
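
    The same comparison can be done with exact fractions, which avoids carrying the unknown instruction count I around. This is a sketch only; the helper names are assumptions, and the CPI lists and clock rates are those used in the answer.

```python
from fractions import Fraction


def average_cpi(class_cpis):
    """CPI of a program whose instructions are evenly split among the classes."""
    return Fraction(sum(class_cpis), len(class_cpis))


def seconds_per_instruction(cpi, clock_hz):
    """Execution time divided by instruction count: CPI x (1 / clock rate)."""
    return cpi / clock_hz


cpi_m1 = average_cpi([1, 2, 3, 4])   # 5/2
cpi_m2 = average_cpi([2, 2, 4, 4])   # 3

t_m1 = seconds_per_instruction(cpi_m1, 500_000_000)   # Time_M1 / I
t_m2 = seconds_per_instruction(cpi_m2, 750_000_000)   # Time_M2 / I

# Performance ratio = inverse ratio of execution times; I cancels out.
speedup_m2_over_m1 = t_m1 / t_m2

print(f"Time_M1 = I x {t_m1} s, Time_M2 = I x {t_m2} s")
print(f"M2 is {float(speedup_m2_over_m1)} times as fast as M1")
```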

  6. P&H: Exercise 2.14 [6 points]

    We'll use I to denote the number of instructions in the program, and C to denote the number of clock cycles in the program. Observing that CPI x I = C, and that "clock rate" is just the reciprocal of "clock cycle time", the subsets of the variables that together can be used to calculate execution time are:

  7. Both Machine A and Machine B execute a certain program in the same amount of time. Machine A has a clock rate of 200MHz and a CPI of 1.8 (for the program), while Machine B has a clock rate of 300MHz and CPI of 2.4. Compare the number of instructions performed by Machine A and Machine B during the execution of the program. [3 points]

    Since both machines take the same CPU time:

    CPU time = Instructions_A x CPI_A x 1/200MHz   =   Instructions_B x CPI_B x 1/300MHz

    Instructions_A x 1.8 x 1/200MHz   =   Instructions_B x 2.4 x 1/300MHz
    Instructions_A x 0.9 x 1/100MHz   =   Instructions_B x 0.8 x 1/100MHz
    Instructions_A x 0.9              =   Instructions_B x 0.8

    Therefore, Instructions_B = (9/8) x Instructions_A: Machine B executes 9/8 as many instructions as Machine A, that is, 12.5% more.
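
    The 9/8 ratio can be confirmed with exact arithmetic. The sketch below assumes only the clock rates and CPIs stated in the question; instruction_ratio is a hypothetical helper name.

```python
from fractions import Fraction


def instruction_ratio(clock_a_mhz, cpi_a, clock_b_mhz, cpi_b):
    """Return Instructions_B / Instructions_A when both machines take equal time.

    I_A x CPI_A / f_A = I_B x CPI_B / f_B
    =>  I_B / I_A = (CPI_A / f_A) / (CPI_B / f_B)
    """
    return (cpi_a / clock_a_mhz) / (cpi_b / clock_b_mhz)


ratio = instruction_ratio(200, Fraction("1.8"), 300, Fraction("2.4"))

print(f"Instructions_B / Instructions_A = {ratio}")
print(f"Machine B executes {float(ratio - 1):.1%} more instructions")
```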

  8. Complete the empty slots in the following table: [8 points]
    (Note the "0x" convention to denote hexadecimal values)

    Binary (Base 2)    Decimal (Base 10)    Hexadecimal (Base 16)
    0101 0000          80                   0x50
    1001 0000          144                  0x90
    1000 0001          129                  0x81
    1101 1000          216                  0xD8
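
    The rows of the table can be cross-checked with Python's built-in formatting. This is a sketch for verification only; the describe helper is an assumption, and the four values are the ones in the completed table.

```python
def describe(value):
    """Return (binary, decimal, hexadecimal) strings for an 8-bit value."""
    bits = format(value, "08b")            # zero-padded 8-bit binary
    binary = f"{bits[:4]} {bits[4:]}"      # split into two nibbles
    return binary, str(value), f"0x{value:02X}"


# The four table rows, each written in a different notation on purpose.
rows = [0b0101_0000, 144, 0x81, 0xD8]

for value in rows:
    binary, decimal, hexadecimal = describe(value)
    print(f"{binary}    {decimal:>3}    {hexadecimal}")
```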

  9. Compute the following sums, staying in base 2 or base 16, as appropriate: [6 points]
    1. 1010 + 0011 (binary) = 1101
    2. 0111 + 0011 (binary) = 1010
    3. 0x14 + 0x19  = (0x10 + 0x04) + (0x10 + 0x09) = (0x10 + 0x10) + (0x04 + 0x09) = 0x20 + 0x0D = 0x2D
    4. 0x2F + 0x81 = (0x20 + 0x0F) + (0x80 + 0x01) = (0x20 + 0x80) + (0x0F + 0x01) = 0xA0 + 0x10 = 0xB0
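
    To mirror the "stay in the base" requirement, the sketch below adds digit by digit with a carry instead of converting to decimal first. The function add_in_base is a hypothetical helper written for this check, and the hexadecimal operands are passed without the 0x prefix.

```python
DIGITS = "0123456789ABCDEF"


def add_in_base(a, b, base):
    """Add two digit strings column by column, right to left, with carry."""
    a, b = a.upper(), b.upper()
    width = max(len(a), len(b))
    a, b = a.rjust(width, "0"), b.rjust(width, "0")

    carry = 0
    result = []
    for digit_a, digit_b in zip(reversed(a), reversed(b)):
        column = DIGITS.index(digit_a) + DIGITS.index(digit_b) + carry
        carry, digit = divmod(column, base)
        result.append(DIGITS[digit])
    if carry:
        result.append(DIGITS[carry])
    return "".join(reversed(result))


print(add_in_base("1010", "0011", 2))
print(add_in_base("0111", "0011", 2))
print("0x" + add_in_base("14", "19", 16))
print("0x" + add_in_base("2F", "81", 16))
```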

  10. Patterson & Hennessy: Exercise 3.1 [10 points]
    This question uses a MIPS instruction that was only alluded to in class (on Feb 7). ADDI is used to add a constant value (in machine language idiom, an immediate value) and a register value, storing the result in a register. Also note the two new register names ($a0 and $v0).
     
    begin:	addi	$t0, $zero, 0		# initialize $t0 = 0
    	addi	$t1, $zero, 1		# initialize $t1 = 1
    loop:	slt	$t2, $a0, $t1		# test if $a0 < $t1
    	bne	$t2, $zero, finish	# if true, go to finish
    	add	$t0, $t0, $t1		# $t0 += $t1
    	addi	$t1, $t1, 2		# $t1 += 2
    	j	loop			# go to loop
    finish:	add	$v0, $t0, $zero		# result $v0 = $t0
    

    This program calculates the sum of the odd numbers 1 + 3 + 5 + ... + n (ending at n - 1 if n is even), where n is the value in $a0, and leaves the result in $v0. For n >= 0 this equals (ceiling(n/2))^2.
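
    One way to convince yourself of the closed form is to mirror the MIPS loop in a high-level language and compare. The sketch below is a hand translation rather than an assembler run; the variable names follow the registers, and the function name sum_of_odds is an assumption.

```python
import math


def sum_of_odds(n):
    """Mirror of the MIPS loop: $a0 = n, $t0 = running sum, $t1 = next odd number."""
    t0 = 0                      # addi $t0, $zero, 0
    t1 = 1                      # addi $t1, $zero, 1
    while not (n < t1):         # slt / bne: leave the loop once $a0 < $t1
        t0 += t1                # add  $t0, $t0, $t1
        t1 += 2                 # addi $t1, $t1, 2
    return t0                   # add  $v0, $t0, $zero


for n in range(8):
    closed_form = math.ceil(n / 2) ** 2
    print(n, sum_of_odds(n), closed_form)
```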

  11. P&H: Exercise 3.9 [12 points]

    Here's the program from p.127 of the text:

     
    Loop:	add	$t1, $s3, $s3		# Temp reg $t1 = 2 * i
    	add	$t1, $t1, $t1		# Temp reg $t1 = 4 * i
    	add	$t1, $t1, $s6		# $t1 = address of save[i]
    	lw	$t0, 0($t1)		# Temp reg $t0 = save[i]
    	bne	$t0, $s5, Exit		# go to Exit if save[i] != k
    	add	$s3, $s3, $s4		# i = i + j
    	j	Loop			# go to Loop
    Exit:
    

    If the number of loop iterations is 10 (save[i] != k on the 11th try) then the number of instructions executed is 7 x 10 + 5 (for the final pass) = 75.

    We are asked to rewrite the program so that there is only one branch/jump instruction within the loop. This will have the effect of reducing the number of instructions inside the loop. Since those instructions are each executed 10 times, this will serve to reduce the total number of instructions executed.

    To achieve this, we need to change the branch (BNE) instruction so that in the termination case, it can just "fall through" to Exit. Simply reversing the sense of the test (and changing the target address from Exit to Loop) is a start, but we will need to move [ADD $s3, $s3, $s4] somewhere. We can't just insert it at Loop, because then $s3 (that is, our loop index i) would start too high (i+j instead of i).

    One approach is to move the first iteration of the loop "out".

     
    	add	$t1, $s3, $s3		# Temp reg $t1 = 2 * i
    	add	$t1, $t1, $t1		# Temp reg $t1 = 4 * i
    	add	$t1, $t1, $s6		# $t1 = address of save[i]
    	lw	$t0, 0($t1)		# Temp reg $t0 = save[i]
    	bne	$t0, $s5, Exit		# go to Exit if save[i] != k
    Loop:	add	$s3, $s3, $s4		# i = i + j
    	add	$t1, $s3, $s3		# Temp reg $t1 = 2 * i
    	add	$t1, $t1, $t1		# Temp reg $t1 = 4 * i
    	add	$t1, $t1, $s6		# $t1 = address of save[i]
    	lw	$t0, 0($t1)		# Temp reg $t0 = save[i]
    	beq	$t0, $s5, Loop		# go to Loop if save[i] == k
    Exit:
    

    While our program is now longer, it will in fact execute fewer instructions. Under the same assumption as before (save[i] != k on the 11th test), the number of instructions executed is now 5 (for the first pass) + 6 x 10 = 65.

    We can optimize the program a lot further (although the question doesn't ask for it) by observing that we can avoid having to multiply i by 4 (to calculate the memory address) on each iteration if, instead of adding j to i and recalculating the address, we add 4 * j to the address each time around. Note that this version no longer updates $s3, so it is appropriate only if the final value of i is not needed after the loop.

     
    	add	$t2, $s4, $s4		# Temp reg $t2 = 2 * j
    	add	$t2, $t2, $t2		# Temp reg $t2 = 4 * j
    	add	$t1, $s3, $s3		# Temp reg $t1 = 2 * i
    	add	$t1, $t1, $t1		# Temp reg $t1 = 4 * i
    	add	$t1, $t1, $s6		# $t1 = address of save[i]
    	lw	$t0, 0($t1)		# Temp reg $t0 = save[i]
    	bne	$t0, $s5, Exit		# go to Exit if save[i] != k
    Loop:	add	$t1, $t1, $t2		# $t1 = address of save[i + m * j]
    	lw	$t0, 0($t1)		# Temp reg $t0 = save[i + m * j]
    	beq	$t0, $s5, Loop		# go to Loop if save[i + m * j] == k
    Exit:
    

    Using this program, our 10 iterations now execute in 7 + 10 x 3 = 37 instructions!
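
    The three instruction counts follow the same pattern (a fixed part plus a per-iteration part), which the sketch below tabulates. The function names are made up for this comparison; full_iterations means the number of times save[i] == k before the test first fails.

```python
def original_count(full_iterations):
    # 7 instructions per full pass, plus 5 for the final pass that exits at bne.
    return 7 * full_iterations + 5


def peeled_count(full_iterations):
    # 5 instructions for the first test moved out of the loop,
    # then 6 instructions per trip through Loop.
    return 5 + 6 * full_iterations


def incremental_address_count(full_iterations):
    # 7 setup instructions (2 to form 4 * j, 5 for the first test),
    # then 3 instructions per trip through Loop.
    return 7 + 3 * full_iterations


n = 10
print("original:           ", original_count(n))
print("first pass moved out:", peeled_count(n))
print("address stepping:   ", incremental_address_count(n))
```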

  12. P&H: Exercise 3.11 [15 points]
    This question will require some work; the authors think that it will take you 30 minutes to complete. I used the new ADDI instruction for simplicity.

     
    Init:	add	$t0, $zero, $zero	# Initialize $t0, the loop variable (i)
    	lw	$t1, 0($s0)		# Load the value of c from memory into $t1
    	addi	$t2, $zero, 4		# step for loop variable
    	addi	$t3, $zero, 401		# limit for loop variable
    
    Loop:	add	$t4, $a1, $t0		# $t4 = the address of b[i]
    	lw	$t5, 0($t4)		# $t5 = b[i]
    	add	$t6, $t5, $t1		# $t6 = b[i] + c
    	add	$t7, $a0, $t0		# $t7 = the address of a[i]
    	sw	$t6, 0($t7)		# a[i] = b[i] + c
    	add	$t0, $t0, $t2		# $t0 = $t0 + 4 (i.e., i = i + 1)
    	slt	$t8, $t0, $t3		# $t8 = 1 if $t0 < 401 (i.e., i <= 100)
    	bne	$t8, $zero, Loop	# loop if i <= 100
    Exit:
    

    The number of instructions executed is 4 (for initialization) + 8 x 101 (loop trips) = 812. The number of memory data references is 1 (init) + 2 x 101 (loop) = 203.
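
    These totals can be double-checked by stepping through the loop in a high-level model and counting as we go. The sketch below is a hand-written model of the listing, not an assembler run; run_loop, the sample array b, and the value c=7 are assumptions chosen only for illustration.

```python
def run_loop(b, c):
    """Model a[i] = b[i] + c for i = 0..len(b)-1, counting work as the MIPS code does."""
    a = [0] * len(b)
    instructions = 4                 # Init: add, lw, addi, addi
    memory_refs = 1                  # the lw that loads c
    t0 = 0                           # byte offset of the current element
    limit = 4 * (len(b) - 1) + 1     # 401 when there are 101 elements

    while True:                      # the test is at the bottom, as in the listing
        a[t0 // 4] = b[t0 // 4] + c  # lw b[i], add, sw a[i]
        memory_refs += 2             # one load and one store per trip
        instructions += 8            # eight instructions in the loop body
        t0 += 4
        if not (t0 < limit):         # slt / bne
            break
    return a, instructions, memory_refs


b = list(range(101))                 # sample data: indices 0..100
a, instructions, memory_refs = run_loop(b, c=7)
print(instructions, memory_refs)
```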