The lab is organized into three parts, each with its own handin
In Part A you will write some simple Y86-64 programs and become familiar with the Y86-64 tools.
In Part B, you will extend the SEQ simulator with a new instruction.
These two parts will prepare you for Part C, the heart of the lab, where you will optimize the Y86-64 benchmark program and the processor design.
1 2 3 4 5 6 7 8 9 10 11 12 13 14
$ tar xvf archlab-handout.tar $ cat README This directory contains the files that you need for the CS:APP architecture lab.
****** Files: ******
Makefile Use this to handin your solutions README This file archlab.{ps,pdf} Lab writeup sim.tar Archive of the Y86-64 tools in tar format simguide.{ps,pdf} CS:APP Guide to Simulators document
1 2 3
$ tar xvf sim.tar $ cd sim $ make clean; make # for GUI version
1 2 3
$ vim Makefile # comment out TKLIBS and TKINC for non-GUI version $ make clean; make
PartA
You will be working in directory sim/misc in this part.
Your task is to write and simulate the following three Y86-64 programs.
put your name and ID in a comment at the beginning of each program
You can test your programs by first assemblying them with the program YAS and then running them with the instruction set simulator YIS.
you should follow the x86-64 conventions for passing function arguments, using registers, and using the stack. This includes saving and restoring any callee-save registers that you use.
The C code of the program is in sim/misc/example.c
/* * Architecture Lab: Part A * * High level specs for the functions that the students will rewrite * in Y86-64 assembly language */
/* $begin examples */ /* linked list element */ typedefstructELE { long val; structELE *next; } *list_ptr;
/* sum_list - Sum the elements of a linked list */ longsum_list(list_ptr ls) { long val = 0; while (ls) { val += ls->val; ls = ls->next; } return val; }
/* rsum_list - Recursive version of sum_list */ longrsum_list(list_ptr ls) { if (!ls) return0; else { long val = ls->val; long rest = rsum_list(ls->next); return val + rest; } }
/* copy_block - Copy src to dest and return xor checksum of src */ longcopy_block(long *src, long *dest, long len) { long result = 0; while (len > 0) { long val = *src++; *dest++ = val; result ^= val; len--; } return result; } /* $end examples */
Program 1
sum.ys: Iteratively sum linked list elements
Your program should consist of some code that sets up the stack structure, invokes a function, and then halts
Test your program using the following three-element list:
# code section start at address 0x0 .pos 0 code: # stack set up irmovq stack, %rsp
# entry of the program main: irmovq data,%rdi call sum_list halt
# sum_list(Node * ls) sum_list: # initialize return value to 0 irmovq $0 , %rax # test if ls is NULL irmovq $0 , %rbx subq %rbx, %rdi je return # guarded do strategy loop: mrmovq (%rdi),%rsi addq %rsi,%rax mrmovq 8(%rdi),%rdi subq %rbx , %rdi jne loop return: ret
# stack grows from address 0x500 to 0x0 (should be limited at 0x300) .pos 0x500 stack:
1 2 3 4 5 6 7 8
Stopped in 24 steps at PC = 0x1d. Status 'HLT', CC Z=1 S=0 O=0 Changes to registers: %rax: 0x0000000000000000 0x0000000000000cba %rsp: 0x0000000000000000 0x0000000000000500 %rsi: 0x0000000000000000 0x0000000000000c00
Changes to memory: 0x04f8: 0x0000000000000000 0x000000000000001d
Program 2
rsum.ys: Recursively sum linked list elements
This code should be similar to the code in sum.ys, except that it should use a function rsum_list that recursively sums a list of numbers, as shown with the C function rsum_list
My solution:
when using callee saved register in recursive function:
push it before use it
pop it after use it
if not used in some branch, not push and pop are needed at all.
# entry of the program main: irmovq data,%rdi call rsum_list halt
# rsum_list(Node * ls) rsum_list: irmovq $0 , %rax # initialize return value to 0 irmovq $0 , %rcx # get a 0 constant subq %rcx , %rdi je return # else: pushq %rbp mrmovq (%rdi),%rsi addq %rsi, %rax rrmovq %rax, %rbp mrmovq 8(%rdi),%rdi call rsum_list addq %rbp ,%rax popq %rbp return: ret
# Sample linked list # data section start at address 0x300 .pos 0x300 .align 8 data: ele1: .quad 0x00a .quad ele2 ele2: .quad 0x0b0 .quad ele3 ele3: .quad 0xc00 .quad 0 # stack grows from address 0x500 to 0x0 (should be limited at 0x300) .pos 0x500 stack:
1 2 3 4 5 6 7 8 9 10 11 12 13
Stopped in 48 steps at PC = 0x1d. Status 'HLT', CC Z=0 S=0 O=0 Changes to registers: %rax: 0x0000000000000000 0x0000000000000cba %rsp: 0x0000000000000000 0x0000000000000500 %rsi: 0x0000000000000000 0x0000000000000c00
copy.ys: Copy a source block to a destination block
Write a program (copy.ys) that copies a block of words from one part of memory to another (non-overlapping area) area of memory, computing the checksum (Xor) of all the words copied.
Test your program using the following three-element source and destination blocks:
You will be working in directory sim/seq in this part.
Your task in Part B is to extend the SEQ processor to support the iaddq
To add this instructions, you will modify the file seq-full.hcl, which implements the version of SEQ described in the CS:APP3e textbook. In addition, it contains declarations of some constants that you will need for your solution.
Your HCL file must begin with a header comment containing the following information:
Your name and ID.
A description of the computations required for the iaddq instruction. Use the descriptions of irmovq and OPq in Figure 4.18 in the CS:APP3e text as a guide.
Building and Testing Your Solution
you will need to build a new instance of the SEQ simulator (ssim) based on this HCL file, and then test it
Build
1
$ make VERSION=full
(This builds a version of ssim that uses the control logic you specified in seq-full.hcl. To save typing, you can assign VERSION=full in the Makefile.)
Test
Testing your solution on a simple Y86-64 program.
1
$ ./ssim -t ../y86-code/asumi.yo
Debug
1
$ ./ssim -g ../y86-code/asumi.yo
Retesting your solution using the benchmark programs
1
$ (cd ../y86-code; make testssim)
Note that none of these programs test the added instructions.
You are simply making sure that your solution did not inject errors for the original instructions. See ../y86-code/README file for more details.
Performing regression tests.
test everything except iaddq and leave
1
$ (cd ../ptest; make SIM=../seq/ssim)
test iaddq
1
$ (cd ../ptest; make SIM=../seq/ssim TFLAGS=-i)
For more information on the SEQ simulator refer to the handout CS:APP3e Guide to Y86-64 Processor Simulators (simguide.pdf).
My solution :
The problem could have been much harder. Due to the bloody clear comments and concise codes given by the file itself, adding the iaddq instruction becomes a piece of cake. Just adding IIADDQ at serveral different places will get the job done.
The teachers at CMU really did a splendid job, I really appreciate their efforts and passion.
#################################################################### # Control Signal Definitions. # ####################################################################
## What register should be used as the A source? word srcA = [ icode in { IRRMOVQ, IRMMOVQ, IOPQ, IPUSHQ } : rA; icode in { IPOPQ, IRET } : RRSP; 1 : RNONE; # Don't need register ];
## What register should be used as the B source? word srcB = [ icode in { IOPQ, IRMMOVQ, IMRMOVQ , IIADDQ } : rB; icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP; 1 : RNONE; # Don't need register ];
## What register should be used as the E destination? word dstE = [ icode in { IRRMOVQ } && Cnd : rB; icode in { IIRMOVQ, IOPQ , IIADDQ } : rB; icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP; 1 : RNONE; # Don't write any register ];
## What register should be used as the M destination? word dstM = [ icode in { IMRMOVQ, IPOPQ } : rA; 1 : RNONE; # Don't write any register ];
## Select input A to ALU word aluA = [ icode in { IRRMOVQ, IOPQ } : valA; icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ , IIADDQ } : valC; icode in { ICALL, IPUSHQ } : -8; icode in { IRET, IPOPQ } : 8; # Other instructions don't need ALU ];
## Select input B to ALU word aluB = [ icode in { IRMMOVQ, IMRMOVQ, IOPQ, ICALL, IPUSHQ, IRET, IPOPQ , IIADDQ } : valB; icode in { IRRMOVQ, IIRMOVQ } : 0; # Other instructions don't need ALU ];
## Set the ALU function word alufun = [ icode == IOPQ : ifun; 1 : ALUADD; ];
## Should the condition codes be updated? bool set_cc = icode in { IOPQ , IIADDQ };
## Set read control signal bool mem_read = icode in { IMRMOVQ, IPOPQ, IRET };
## Set write control signal bool mem_write = icode in { IRMMOVQ, IPUSHQ, ICALL };
## Select memory address word mem_addr = [ icode in { IRMMOVQ, IPUSHQ, ICALL, IMRMOVQ } : valE; icode in { IPOPQ, IRET } : valA; # Other instructions don't need address ];
## Select memory input data word mem_data = [ # Value from register icode in { IRMMOVQ, IPUSHQ } : valA; # Return PC icode == ICALL : valP; # Default: Don't write anything ];
## Determine instruction status word Stat = [ imem_error || dmem_error : SADR; !instr_valid: SINS; icode == IHALT : SHLT; 1 : SAOK; ];
################ Program Counter Update ############################
## What address should instruction be fetched at
word new_pc = [ # Call. Use instruction constant icode == ICALL : valC; # Taken branch. Use instruction constant icode == IJXX && Cnd : valC; # Completion of RET instruction. Use value from stack icode == IRET : valM; # Default: Use incremented PC 1 : valP; ]; #/* $end seq-all-hcl */
$ cd ../ptest; make SIM=../seq/ssim ./optest.pl -s ../seq/ssim Simulating with ../seq/ssim All 49 ISA Checks Succeed ./jtest.pl -s ../seq/ssim Simulating with ../seq/ssim All 64 ISA Checks Succeed ./ctest.pl -s ../seq/ssim Simulating with ../seq/ssim All 22 ISA Checks Succeed ./htest.pl -s ../seq/ssim Simulating with ../seq/ssim All 600 ISA Checks Succeed
Test iaddq
1 2 3 4 5 6 7 8 9 10 11 12 13
$ cd ../ptest; make SIM=../seq/ssim TFLAGS=-i ./optest.pl -s ../seq/ssim -i Simulating with ../seq/ssim All 58 ISA Checks Succeed ./jtest.pl -s ../seq/ssim -i Simulating with ../seq/ssim All 96 ISA Checks Succeed ./ctest.pl -s ../seq/ssim -i Simulating with ../seq/ssim All 22 ISA Checks Succeed ./htest.pl -s ../seq/ssim -i Simulating with ../seq/ssim All 756 ISA Checks Succeed
Part C
You will be working in directory sim/pipe in this part.
Your task in Part C is to modify ncopy.ys and pipe-full.hcl with the goal of making ncopy.ysrun as fast as possible.
Coding rules
You are free to make any modifications you wish, with the following constraints:
Your ncopy.ys function must work for arbitrary array sizes.
Your ncopy.ys function must run correctly with YIS
The assembled version of your ncopy file must not be more than 1000 bytes long
1 2
# length check $ ./check-len.pl < ncopy.yo
Your pipe-full.hcl implementation must pass the regression tests in ../y86-codeand ../ptest(without the -i flag that tests iaddq)
Other than that, you are free to implement the iaddq instruction if you think that will help.
You may make any semantics preserving transformations to the ncopy.ys function, such as reordering instructions, replacing groups of instructions with single instructions, deleting some instructions, and adding other instructions
(You may find it useful to read about loop unrolling in Section 5.8 )(Unfortunately, I haven’t progressed to Chapter 5)
Building and Running Your Solution
Each time you modify your ncopy.ys program, you can rebuild the driver programs by typing
1
$ make drivers
Each time you modify your pipe-full.hcl file, you can rebuild the simulator by typing
1
$ make psim VERSION=full
If you want to rebuild the simulator and the driver programs, type
1
$ make VERSION=full
Testing your driver files on the ISA simulator. Make sure that your ncopy.ys function works prop- erly with YIS:
1 2
$ make drivers $ ../misc/yis sdriver.yo
Testing your code on a range of block lengths with the ISA simulator
1
$ ./correctness.pl
The program will end with register %rax having the following value:
0xaaaa : All tests pass.
0xbbbb : Incorrect count
0xcccc : Function ncopy is more than 1000 bytes long.
0xdddd : Some of the source data was not copied to its destination.
0xeeee : Some word just before or just after the destination region was corrupted.
Testing your pipeline simulator on the benchmark programs.
1
$ (cd ../y86-code; make testpsim)
Testing your pipeline simulator with extensive regression tests
1
$ (cd ../ptest; make SIM=../pipe/psim TFLAGS=-i)
Testing your code on a range of block lengths with the pipeline simulator.
1
$ ./correctness.pl -p
My solution :
For ncopy.ys, I use conditional move to erase the small branch.
#################################################################### # Control Signal Definitions. # ####################################################################
## What address should instruction be fetched at word f_pc = [ # Mispredicted branch. Fetch at incremented PC M_icode == IJXX && !M_Cnd : M_valA; # Completion of RET instruction W_icode == IRET : W_valM; # Default: Use predicted value of PC 1 : F_predPC; ];
## Determine icode of fetched instruction word f_icode = [ imem_error : INOP; 1: imem_icode; ];
## What register should be used as the A source? word d_srcA = [ D_icode in { IRRMOVQ, IRMMOVQ, IOPQ, IPUSHQ } : D_rA; D_icode in { IPOPQ, IRET } : RRSP; 1 : RNONE; # Don't need register ];
## What register should be used as the B source? word d_srcB = [ D_icode in { IOPQ, IRMMOVQ, IMRMOVQ , IIADDQ } : D_rB; D_icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP; 1 : RNONE; # Don't need register ];
## What register should be used as the E destination? word d_dstE = [ D_icode in { IRRMOVQ, IIRMOVQ, IOPQ , IIADDQ } : D_rB; D_icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP; 1 : RNONE; # Don't write any register ];
## What register should be used as the M destination? word d_dstM = [ D_icode in { IMRMOVQ, IPOPQ } : D_rA; 1 : RNONE; # Don't write any register ];
## What should be the A value? ## Forward into decode stage for valA word d_valA = [ D_icode in { ICALL, IJXX } : D_valP; # Use incremented PC d_srcA == e_dstE : e_valE; # Forward valE from execute d_srcA == M_dstM : m_valM; # Forward valM from memory d_srcA == M_dstE : M_valE; # Forward valE from memory d_srcA == W_dstM : W_valM; # Forward valM from write back d_srcA == W_dstE : W_valE; # Forward valE from write back 1 : d_rvalA; # Use value read from register file ];
word d_valB = [ d_srcB == e_dstE : e_valE; # Forward valE from execute d_srcB == M_dstM : m_valM; # Forward valM from memory d_srcB == M_dstE : M_valE; # Forward valE from memory d_srcB == W_dstM : W_valM; # Forward valM from write back d_srcB == W_dstE : W_valE; # Forward valE from write back 1 : d_rvalB; # Use value read from register file ];
## Select input A to ALU word aluA = [ E_icode in { IRRMOVQ, IOPQ } : E_valA; E_icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ , IIADDQ } : E_valC; E_icode in { ICALL, IPUSHQ } : -8; E_icode in { IRET, IPOPQ } : 8; # Other instructions don't need ALU ];
## Select input B to ALU word aluB = [ E_icode in { IRMMOVQ, IMRMOVQ, IOPQ, ICALL, IPUSHQ, IRET, IPOPQ , IIADDQ } : E_valB; E_icode in { IRRMOVQ, IIRMOVQ } : 0; # Other instructions don't need ALU ];
## Set the ALU function word alufun = [ E_icode == IOPQ : E_ifun; 1 : ALUADD; ];
## Should the condition codes be updated? bool set_cc = E_icode in {IOPQ , IIADDQ } && # State changes only during normal operation !m_stat in { SADR, SINS, SHLT } && !W_stat in { SADR, SINS, SHLT };
## Generate valA in execute stage word e_valA = E_valA; # Pass valA through stage
## Set dstE to RNONE in event of not-taken conditional move word e_dstE = [ E_icode == IRRMOVQ && !e_Cnd : RNONE; 1 : E_dstE; ];
## Select memory address word mem_addr = [ M_icode in { IRMMOVQ, IPUSHQ, ICALL, IMRMOVQ } : M_valE; M_icode in { IPOPQ, IRET } : M_valA; # Other instructions don't need address ];
## Set read control signal bool mem_read = M_icode in { IMRMOVQ, IPOPQ, IRET };
## Set write control signal bool mem_write = M_icode in { IRMMOVQ, IPUSHQ, ICALL };
#/* $begin pipe-m_stat-hcl */ ## Update the status word m_stat = [ dmem_error : SADR; 1 : M_stat; ]; #/* $end pipe-m_stat-hcl */
## Set E port register ID word w_dstE = W_dstE;
## Set E port value word w_valE = W_valE;
## Set M port register ID word w_dstM = W_dstM;
## Set M port value word w_valM = W_valM;
## Update processor status word Stat = [ W_stat == SBUB : SAOK; 1 : W_stat; ];
################ Pipeline Register Control #########################
# Should I stall or inject a bubble into Pipeline Register F? # At most one of these can be true. bool F_bubble = 0; bool F_stall = # Conditions for a load/use hazard E_icode in { IMRMOVQ, IPOPQ } && E_dstM in { d_srcA, d_srcB } || # Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode };
# Should I stall or inject a bubble into Pipeline Register D? # At most one of these can be true. bool D_stall = # Conditions for a load/use hazard E_icode in { IMRMOVQ, IPOPQ } && E_dstM in { d_srcA, d_srcB };
bool D_bubble = # Mispredicted branch (E_icode == IJXX && !e_Cnd) || # Stalling at fetch while ret passes through pipeline # but not condition for a load/use hazard !(E_icode in { IMRMOVQ, IPOPQ } && E_dstM in { d_srcA, d_srcB }) && IRET in { D_icode, E_icode, M_icode };
# Should I stall or inject a bubble into Pipeline Register E? # At most one of these can be true. bool E_stall = 0; bool E_bubble = # Mispredicted branch (E_icode == IJXX && !e_Cnd) || # Conditions for a load/use hazard E_icode in { IMRMOVQ, IPOPQ } && E_dstM in { d_srcA, d_srcB};
# Should I stall or inject a bubble into Pipeline Register M? # At most one of these can be true. bool M_stall = 0; # Start injecting bubbles as soon as exception passes through memory stage bool M_bubble = m_stat in { SADR, SINS, SHLT } || W_stat in { SADR, SINS, SHLT };
# Should I stall or inject a bubble into Pipeline Register W? bool W_stall = W_stat in { SADR, SINS, SHLT }; bool W_bubble = 0; #/* $end pipe-all-hcl */
Again, we only need to do little modification to the codes, all complex stuff such as data forwarding, bubble injection and PC predication for ret and call, which I don’t quite understand, are done by the teachers.
By the way, don’t forget that iaddq need to set the condition code.
# test ncopy.ys $ ./correctness.pl Simulating with instruction set simulator yis ncopy 0 OK 1 OK 2 OK 3 OK 4 OK 5 OK 6 OK 7 OK 8 OK 9 OK 10 OK 11 OK 12 OK 13 OK 14 OK 15 OK 16 OK 17 OK 18 OK 19 OK 20 OK 21 OK 22 OK 23 OK 24 OK 25 OK 26 OK 27 OK 28 OK 29 OK 30 OK 31 OK 32 OK 33 OK 34 OK 35 OK 36 OK 37 OK 38 OK 39 OK 40 OK 41 OK 42 OK 43 OK 44 OK 45 OK 46 OK 47 OK 48 OK 49 OK 50 OK 51 OK 52 OK 53 OK 54 OK 55 OK 56 OK 57 OK 58 OK 59 OK 60 OK 61 OK 62 OK 63 OK 64 OK 128 OK 192 OK 256 OK 68/68 pass correctness test
# test with extensive regression tests $ (cd ../ptest; make SIM=../pipe/psim TFLAGS=-i) ./optest.pl -s ../pipe/psim -i Simulating with ../pipe/psim All 58 ISA Checks Succeed ./jtest.pl -s ../pipe/psim -i Simulating with ../pipe/psim All 96 ISA Checks Succeed ./ctest.pl -s ../pipe/psim -i Simulating with ../pipe/psim All 22 ISA Checks Succeed ./htest.pl -s ../pipe/psim -i Simulating with ../pipe/psim All 756 ISA Checks Succeed
# test correctness on modified pipeline $ ./correctness.pl -p Simulating with pipeline simulator psim ncopy 0 OK 1 OK 2 OK 3 OK 4 OK 5 OK 6 OK 7 OK 8 OK 9 OK 10 OK 11 OK 12 OK 13 OK 14 OK 15 OK 16 OK 17 OK 18 OK 19 OK 20 OK 21 OK 22 OK 23 OK 24 OK 25 OK 26 OK 27 OK 28 OK 29 OK 30 OK 31 OK 32 OK 33 OK 34 OK 35 OK 36 OK 37 OK 38 OK 39 OK 40 OK 41 OK 42 OK 43 OK 44 OK 45 OK 46 OK 47 OK 48 OK 49 OK 50 OK 51 OK 52 OK 53 OK 54 OK 55 OK 56 OK 57 OK 58 OK 59 OK 60 OK 61 OK 62 OK 63 OK 64 OK 128 OK 192 OK 256 OK 68/68 pass correctness test
I know there are many tricks to accelerate the program, such as expanding the loop, using different branch predication strategies and implementing data forwarding in pipe-lf.hcl. But I failed to figure the implementation on my own. Thanks for dreamanddead.
(In fact, the brenchmark.pl only use output of psim to judge the performance, so wrong implementation will be scored very high, for example, just delete the implementation of iaddq)