2021-09-03

csapp

architecturelab

Architecture lab

Introduction

The lab is organized into three parts, each with its own handin

In Part A you will write some simple Y86-64 programs and become familiar with the Y86-64 tools.
In Part B, you will extend the SEQ simulator with a new instruction.
These two parts will prepare you for Part C, the heart of the lab, where you will optimize the Y86-64 benchmark program and the processor design.

$ tar xvf archlab-handout.tar
$ cat README
This directory contains the files that you need for the CS:APP
architecture lab.

******
Files:
******

Makefile		Use this to handin your solutions
README			This file
archlab.{ps,pdf}	Lab writeup
sim.tar			Archive of the Y86-64 tools in tar format
simguide.{ps,pdf}	CS:APP Guide to Simulators document

1
2
3

$ tar xvf sim.tar
$ cd sim
$ make clean; make # for GUI version

1
2
3

$ vim Makefile
# comment out TKLIBS and TKINC for non-GUI version
$ make clean; make

PartA

You will be working in directory sim/misc in this part.

Your task is to write and simulate the following three Y86-64 programs.

put your name and ID in a comment at the beginning of each program
You can test your programs by first assemblying them with the program YAS and then running them with the instruction set simulator YIS.
you should follow the x86-64 conventions for passing function arguments, using registers, and using the stack. This includes saving and restoring any callee-save registers that you use.
The C code of the program is in sim/misc/example.c

/* 
 * Architecture Lab: Part A 
 * 
 * High level specs for the functions that the students will rewrite
 * in Y86-64 assembly language
 */

/* $begin examples */
/* linked list element */
typedef struct ELE {
    long val;
    struct ELE *next;
} *list_ptr;

/* sum_list - Sum the elements of a linked list */
long sum_list(list_ptr ls)
{
    long val = 0;
    while (ls) {
	val += ls->val;
	ls = ls->next;
    }
    return val;
}

/* rsum_list - Recursive version of sum_list */
long rsum_list(list_ptr ls)
{
    if (!ls)
	return 0;
    else {
	long val = ls->val;
	long rest = rsum_list(ls->next);
	return val + rest;
    }
}

/* copy_block - Copy src to dest and return xor checksum of src */
long copy_block(long *src, long *dest, long len)
{
    long result = 0;
    while (len > 0) {
	long val = *src++;
	*dest++ = val;
	result ^= val;
	len--;
    }
    return result;
}
/* $end examples */

Program 1

sum.ys: Iteratively sum linked list elements

Your program should consist of some code that sets up the stack structure, invokes a function, and then halts

Test your program using the following three-element list:

# Sample linked list 
.align 8
ele1:
	.quad 0x00a 
	.quad ele2
ele2:
	.quad 0x0b0 
	.quad ele3
ele3:
	.quad 0xc00 
	.quad 0

My solution :

use .pos 0xvalue to specify the address in code section
I know we can use gcc to first assemble example.c into x86_64 assembly code and then modify it to y86 code, but just not recommended.

# code section start at address 0x0
    .pos 0
code:
# stack set up
irmovq stack, %rsp

# entry of the program
main:
    irmovq data,%rdi
    call sum_list
    halt

# sum_list(Node * ls)
sum_list:
    # initialize return value to 0
    irmovq $0 , %rax
    # test if ls is NULL
    irmovq $0 , %rbx
    subq %rbx, %rdi
    je return
    # guarded do strategy
loop:
    mrmovq (%rdi),%rsi
    addq %rsi,%rax
    mrmovq 8(%rdi),%rdi
    subq %rbx , %rdi
    jne loop
return:
    ret

# Sample linked list 
# data section start at address 0x300
.pos 0x300
.align 8
data:
ele1:
	.quad 0x00a 
	.quad ele2
ele2:
	.quad 0x0b0 
	.quad ele3
ele3:
	.quad 0xc00 
	.quad 0

# stack grows from address 0x500 to 0x0 (should be limited at 0x300)
    .pos 0x500
stack:

Stopped in 24 steps at PC = 0x1d.  Status 'HLT', CC Z=1 S=0 O=0
Changes to registers:
%rax:	0x0000000000000000	0x0000000000000cba
%rsp:	0x0000000000000000	0x0000000000000500
%rsi:	0x0000000000000000	0x0000000000000c00

Changes to memory:
0x04f8:	0x0000000000000000	0x000000000000001d

Program 2

rsum.ys: Recursively sum linked list elements

This code should be similar to the code in sum.ys, except that it should use a function rsum_list that recursively sums a list of numbers, as shown with the C function rsum_list

My solution:

when using callee saved register in recursive function:

push it before use it
pop it after use it
if not used in some branch, not push and pop are needed at all.

# code section start at address 0x0
    .pos 0
code:

# stack set up
irmovq stack, %rsp

# entry of the program
main:
    irmovq data,%rdi
    call rsum_list
    halt

# rsum_list(Node * ls)
rsum_list:
    irmovq $0 , %rax # initialize return value to 0
    irmovq $0 , %rcx # get a 0 constant
    subq %rcx , %rdi
    je return
# else:
    pushq %rbp
    mrmovq (%rdi),%rsi
    addq %rsi, %rax
    rrmovq %rax, %rbp
    mrmovq 8(%rdi),%rdi
    call rsum_list
    addq %rbp ,%rax 
    popq %rbp
return:
    ret

# Sample linked list 
# data section start at address 0x300
.pos 0x300
.align 8
data:
ele1:
	.quad 0x00a 
	.quad ele2
ele2:
	.quad 0x0b0 
	.quad ele3
ele3:
	.quad 0xc00 
	.quad 0
# stack grows from address 0x500 to 0x0 (should be limited at 0x300)
    .pos 0x500
stack:

Stopped in 48 steps at PC = 0x1d.  Status 'HLT', CC Z=0 S=0 O=0
Changes to registers:
%rax:	0x0000000000000000	0x0000000000000cba
%rsp:	0x0000000000000000	0x0000000000000500
%rsi:	0x0000000000000000	0x0000000000000c00

Changes to memory:
0x04c8:	0x0000000000000000	0x0000000000000060
0x04d0:	0x0000000000000000	0x00000000000000b0
0x04d8:	0x0000000000000000	0x0000000000000060
0x04e0:	0x0000000000000000	0x000000000000000a
0x04e8:	0x0000000000000000	0x0000000000000060
0x04f8:	0x0000000000000000	0x000000000000001d

Program 3

copy.ys: Copy a source block to a destination block

Write a program (copy.ys) that copies a block of words from one part of memory to another (non-overlapping area) area of memory, computing the checksum (Xor) of all the words copied.

Test your program using the following three-element source and destination blocks:


.align 8
# Source block src:
.quad 0x00a .quad 0x0b0 .quad 0xc00
# Destination block
dest:
.quad 0x111 .quad 0x222 .quad 0x333

My solution :

Note that in y86 address is assigned by lines

So quad 0x00a .quad 0x0b0 .quad 0xc00 and

.quad 0x00a .quad 0x0b0 .quad 0xc00

are different.

# code section start at address 0x0
    .pos 0
code:

# stack set up
irmovq stack, %rsp

# entry of the program
main:
    irmovq src,%rdi
    irmovq dest,%rsi
    irmovq $3, %rcx
    call copy_block
    halt

copy_block:
    # initialize return value to 0
    irmovq $0 , %rax
    # set constants
    irmovq $0, %rbp
    irmovq $8, %rbx
    # jump to middle strategy
    jmp test
loop:
    mrmovq (%rdi),%rdx 
    addq %rbx,%rdi
    rmmovq %rdx,(%rsi)
    addq %rbx , %rsi
    xorq %rdx,%rax
    irmovq $1,%rdx
    subq %rdx,%rcx
test:
    subq %rbp , %rcx
    jg loop
    ret

# data section start at address 0x300
.pos 0x300
.align 8
# Source block 
src:
.quad 0x00a 
.quad 0x0b0 
.quad 0xc00
# Destination block
dest:
.quad 0x111 
.quad 0x222 
.quad 0x333

# stack grows from address 0x500 to 0x0 (should be limited at 0x300)
    .pos 0x500
stack:

Stopped in 40 steps at PC = 0x31.  Status 'HLT', CC Z=1 S=0 O=0
Changes to registers:
%rax:	0x0000000000000000	0x0000000000000cba
%rdx:	0x0000000000000000	0x0000000000000001
%rbx:	0x0000000000000000	0x0000000000000008
%rsp:	0x0000000000000000	0x0000000000000500
%rsi:	0x0000000000000000	0x0000000000000330
%rdi:	0x0000000000000000	0x0000000000000318

Changes to memory:
0x0318:	0x0000000000000111	0x000000000000000a
0x0320:	0x0000000000000222	0x00000000000000b0
0x0328:	0x0000000000000333	0x0000000000000c00
0x04f8:	0x0000000000000000	0x0000000000000031

Part B

You will be working in directory sim/seq in this part.

Your task in Part B is to extend the SEQ processor to support the iaddq

To add this instructions, you will modify the file seq-full.hcl, which implements the version of SEQ described in the CS:APP3e textbook. In addition, it contains declarations of some constants that you will need for your solution.

Your HCL file must begin with a header comment containing the following information:

Your name and ID.
A description of the computations required for the iaddq instruction. Use the descriptions of irmovq and OPq in Figure 4.18 in the CS:APP3e text as a guide.

Building and Testing Your Solution

you will need to build a new instance of the SEQ simulator (ssim) based on this HCL file, and then test it

Build

1	$ make VERSION=full

(This builds a version of ssim that uses the control logic you specified in seq-full.hcl. To save typing, you can assign VERSION=full in the Makefile.)

Test

Testing your solution on a simple Y86-64 program.

1	$ ./ssim -t ../y86-code/asumi.yo

Debug

1	$ ./ssim -g ../y86-code/asumi.yo

Retesting your solution using the benchmark programs

1	$ (cd ../y86-code; make testssim)

Note that none of these programs test the added instructions.

You are simply making sure that your solution did not inject errors for the original instructions. See ../y86-code/README file for more details.

Performing regression tests.

test everything except iaddq and leave

1	$ (cd ../ptest; make SIM=../seq/ssim)

test iaddq

1	$ (cd ../ptest; make SIM=../seq/ssim TFLAGS=-i)

For more information on the SEQ simulator refer to the handout CS:APP3e Guide to Y86-64 Processor Simulators (simguide.pdf).

My solution :

The problem could have been much harder. Due to the bloody clear comments and concise codes given by the file itself, adding the iaddq instruction becomes a piece of cake. Just adding IIADDQ at serveral different places will get the job done.

The teachers at CMU really did a splendid job, I really appreciate their efforts and passion.

Stage               iaddq V,rB

Fetch               icode:ifunc ← M1[PC]
                    rA:rB ← M1[PC+1]
                    valC ← M8[PC + 2]
                    valP = PC + 10
                    
Decode              valB ← R[rB]

Execute             valE ← valB + valC

Memory

Write back          R[rB] ← valE

PC update           PC ← valP

####################################################################
#    Control Signal Definitions.                                   #
####################################################################

################ Fetch Stage     ###################################

# Determine instruction code
word icode = [
	imem_error: INOP;
	1: imem_icode;		# Default: get from instruction memory
];

# Determine instruction function
word ifun = [
	imem_error: FNONE;
	1: imem_ifun;		# Default: get from instruction memory
];

bool instr_valid = icode in 
	{ INOP, IHALT, IRRMOVQ, IIRMOVQ, IRMMOVQ, IMRMOVQ,
	       IOPQ, IJXX, ICALL, IRET, IPUSHQ, IPOPQ , IIADDQ };

# Does fetched instruction require a regid byte?
bool need_regids =
	icode in { IRRMOVQ, IOPQ, IPUSHQ, IPOPQ, 
		     IIRMOVQ, IRMMOVQ, IMRMOVQ , IIADDQ };

# Does fetched instruction require a constant word?
bool need_valC =
	icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ, IJXX, ICALL , IIADDQ };

################ Decode Stage    ###################################

## What register should be used as the A source?
word srcA = [
	icode in { IRRMOVQ, IRMMOVQ, IOPQ, IPUSHQ  } : rA;
	icode in { IPOPQ, IRET } : RRSP;
	1 : RNONE; # Don't need register
];

## What register should be used as the B source?
word srcB = [
	icode in { IOPQ, IRMMOVQ, IMRMOVQ   , IIADDQ } : rB;
	icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
	1 : RNONE;  # Don't need register
];

## What register should be used as the E destination?
word dstE = [
	icode in { IRRMOVQ } && Cnd : rB;
	icode in { IIRMOVQ, IOPQ , IIADDQ } : rB;
	icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
	1 : RNONE;  # Don't write any register
];

## What register should be used as the M destination?
word dstM = [
	icode in { IMRMOVQ, IPOPQ } : rA;
	1 : RNONE;  # Don't write any register
];

################ Execute Stage   ###################################

## Select input A to ALU
word aluA = [
	icode in { IRRMOVQ, IOPQ } : valA;
	icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ  , IIADDQ } : valC;
	icode in { ICALL, IPUSHQ } : -8;
	icode in { IRET, IPOPQ } : 8;
	# Other instructions don't need ALU
];

## Select input B to ALU
word aluB = [
	icode in { IRMMOVQ, IMRMOVQ, IOPQ, ICALL, 
		      IPUSHQ, IRET, IPOPQ  ,  IIADDQ } : valB;
	icode in { IRRMOVQ, IIRMOVQ } : 0;
	# Other instructions don't need ALU
];

## Set the ALU function
word alufun = [
	icode == IOPQ : ifun;
	1 : ALUADD;
];

## Should the condition codes be updated?
bool set_cc = icode in { IOPQ , IIADDQ  };

################ Memory Stage    ###################################

## Set read control signal
bool mem_read = icode in { IMRMOVQ, IPOPQ, IRET };

## Set write control signal
bool mem_write = icode in { IRMMOVQ, IPUSHQ, ICALL };

## Select memory address
word mem_addr = [
	icode in { IRMMOVQ, IPUSHQ, ICALL, IMRMOVQ } : valE;
	icode in { IPOPQ, IRET } : valA;
	# Other instructions don't need address
];

## Select memory input data
word mem_data = [
	# Value from register
	icode in { IRMMOVQ, IPUSHQ } : valA;
	# Return PC
	icode == ICALL : valP;
	# Default: Don't write anything
];

## Determine instruction status
word Stat = [
	imem_error || dmem_error : SADR;
	!instr_valid: SINS;
	icode == IHALT : SHLT;
	1 : SAOK;
];

################ Program Counter Update ############################

## What address should instruction be fetched at

word new_pc = [
	# Call.  Use instruction constant
	icode == ICALL : valC;
	# Taken branch.  Use instruction constant
	icode == IJXX && Cnd : valC;
	# Completion of RET instruction.  Use value from stack
	icode == IRET : valM;
	# Default: Use incremented PC
	1 : valP;
];
#/* $end seq-all-hcl */

Simple y86 test

$ make VERSION=full
34 instructions executed
Status = HLT
Condition Codes: Z=1 S=0 O=0
Changed Register State:
%rax:	0x0000000000000000	0x0000abcdabcdabcd
%rsp:	0x0000000000000000	0x0000000000000200
%rdi:	0x0000000000000000	0x0000000000000038
%r8:	0x0000000000000000	0x0000000000000008
%r9:	0x0000000000000000	0x0000000000000001
%r10:	0x0000000000000000	0x0000a000a000a000
Changed Memory State:
0x01f0:	0x0000000000000000	0x0000000000000055
0x01f8:	0x0000000000000000	0x0000000000000013
ISA Check Succeeds

Benchmark test

$ cd ../y86_code ; make testssim
grep "ISA Check" *.seq
asum.seq:ISA Check Succeeds
asumr.seq:ISA Check Succeeds
cjr.seq:ISA Check Succeeds
j-cc.seq:ISA Check Succeeds
poptest.seq:ISA Check Succeeds
prog1.seq:ISA Check Succeeds
prog2.seq:ISA Check Succeeds
prog3.seq:ISA Check Succeeds
prog4.seq:ISA Check Succeeds
prog5.seq:ISA Check Succeeds
prog6.seq:ISA Check Succeeds
prog7.seq:ISA Check Succeeds
prog8.seq:ISA Check Succeeds
pushquestion.seq:ISA Check Succeeds
pushtest.seq:ISA Check Succeeds
ret-hazard.seq:ISA Check Succeeds

Regression test

$ cd ../ptest; make SIM=../seq/ssim
./optest.pl -s ../seq/ssim 
Simulating with ../seq/ssim
  All 49 ISA Checks Succeed
./jtest.pl -s ../seq/ssim 
Simulating with ../seq/ssim
  All 64 ISA Checks Succeed
./ctest.pl -s ../seq/ssim 
Simulating with ../seq/ssim
  All 22 ISA Checks Succeed
./htest.pl -s ../seq/ssim 
Simulating with ../seq/ssim
  All 600 ISA Checks Succeed

Test `iaddq`

$ cd ../ptest; make SIM=../seq/ssim TFLAGS=-i
./optest.pl -s ../seq/ssim -i
Simulating with ../seq/ssim
  All 58 ISA Checks Succeed
./jtest.pl -s ../seq/ssim -i
Simulating with ../seq/ssim
  All 96 ISA Checks Succeed
./ctest.pl -s ../seq/ssim -i
Simulating with ../seq/ssim
  All 22 ISA Checks Succeed
./htest.pl -s ../seq/ssim -i
Simulating with ../seq/ssim
  All 756 ISA Checks Succeed

Part C

You will be working in directory sim/pipe in this part.

Your task in Part C is to modify ncopy.ys and pipe-full.hcl with the goal of making ncopy.ys run as fast as possible.

Coding rules

You are free to make any modifications you wish, with the following constraints:

Your ncopy.ys function must work for arbitrary array sizes.
Your ncopy.ys function must run correctly with YIS
The assembled version of your ncopy file must not be more than 1000 bytes long
1
2
# length check
$ ./check-len.pl < ncopy.yo
Your pipe-full.hcl implementation must pass the regression tests in ../y86-codeand ../ptest(without the -i flag that tests iaddq)

Other than that, you are free to implement the iaddq instruction if you think that will help.

You may make any semantics preserving transformations to the ncopy.ys function, such as reordering instructions, replacing groups of instructions with single instructions, deleting some instructions, and adding other instructions

(You may find it useful to read about loop unrolling in Section 5.8 )(Unfortunately, I haven’t progressed to Chapter 5)

Building and Running Your Solution

Each time you modify your ncopy.ys program, you can rebuild the driver programs by typing

1	$ make drivers

Each time you modify your pipe-full.hcl file, you can rebuild the simulator by typing

1	$ make psim VERSION=full

If you want to rebuild the simulator and the driver programs, type

1	$ make VERSION=full

Testing your driver files on the ISA simulator. Make sure that your ncopy.ys function works prop- erly with YIS:

1 2	$ make drivers $ ../misc/yis sdriver.yo

Testing your code on a range of block lengths with the ISA simulator

1	$ ./correctness.pl

The program will end with register %rax having the following value:

0xaaaa : All tests pass.

0xbbbb : Incorrect count

0xcccc : Function ncopy is more than 1000 bytes long.

0xdddd : Some of the source data was not copied to its destination.

0xeeee : Some word just before or just after the destination region was corrupted.

Testing your pipeline simulator on the benchmark programs.

1	$ (cd ../y86-code; make testpsim)

Testing your pipeline simulator with extensive regression tests

1	$ (cd ../ptest; make SIM=../pipe/psim TFLAGS=-i)

Testing your code on a range of block lengths with the pipeline simulator.

1	$ ./correctness.pl -p

My solution :

For ncopy.ys, I use conditional move to erase the small branch.

##################################################################
# You can modify this portion
	# Loop header
        irmovq $1,%r11
	xorq %rax,%rax		# count = 0;
	andq %rdx,%rdx		# len <= 0?
	jle Done		# if so, goto Done:

Loop:	mrmovq (%rdi), %r10	# read val from src...
	rmmovq %r10, (%rsi)	# ...and store it to dst
        xorq %r8,%r8
	andq %r10, %r10		# val <= 0?
        cmovg %r11,%r8
	addq %r8, %rax		# count++
        iaddq $-1,%rdx
        iaddq $8,%rdi
        iaddq $8,%rsi
	andq %rdx,%rdx		# len > 0?
	jg Loop			# if so, goto Loop:
##################################################################

For pipe-full.hcl, I implement the iaddq instruction to reduce the number of instructions in ncopy.ys

####################################################################
#    Control Signal Definitions.                                   #
####################################################################

################ Fetch Stage     ###################################

## What address should instruction be fetched at
word f_pc = [
	# Mispredicted branch.  Fetch at incremented PC
	M_icode == IJXX && !M_Cnd : M_valA;
	# Completion of RET instruction
	W_icode == IRET : W_valM;
	# Default: Use predicted value of PC
	1 : F_predPC;
];

## Determine icode of fetched instruction
word f_icode = [
	imem_error : INOP;
	1: imem_icode;
];

# Determine ifun
word f_ifun = [
	imem_error : FNONE;
	1: imem_ifun;
];

# Is instruction valid?
bool instr_valid = f_icode in 
	{ INOP, IHALT, IRRMOVQ, IIRMOVQ, IRMMOVQ, IMRMOVQ,
	  IOPQ, IJXX, ICALL, IRET, IPUSHQ, IPOPQ  , IIADDQ };

# Determine status code for fetched instruction
word f_stat = [
	imem_error: SADR;
	!instr_valid : SINS;
	f_icode == IHALT : SHLT;
	1 : SAOK;
];

# Does fetched instruction require a regid byte?
bool need_regids =
	f_icode in { IRRMOVQ, IOPQ, IPUSHQ, IPOPQ, 
		     IIRMOVQ, IRMMOVQ, IMRMOVQ  , IIADDQ };

# Does fetched instruction require a constant word?
bool need_valC =
	f_icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ, IJXX, ICALL  , IIADDQ };

# Predict next value of PC
word f_predPC = [
	f_icode in { IJXX, ICALL } : f_valC;
	1 : f_valP;
];

################ Decode Stage ######################################


## What register should be used as the A source?
word d_srcA = [
	D_icode in { IRRMOVQ, IRMMOVQ, IOPQ, IPUSHQ  } : D_rA;
	D_icode in { IPOPQ, IRET } : RRSP;
	1 : RNONE; # Don't need register
];

## What register should be used as the B source?
word d_srcB = [
	D_icode in { IOPQ, IRMMOVQ, IMRMOVQ , IIADDQ  } : D_rB;
	D_icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
	1 : RNONE;  # Don't need register
];

## What register should be used as the E destination?
word d_dstE = [
	D_icode in { IRRMOVQ, IIRMOVQ, IOPQ , IIADDQ } : D_rB;
	D_icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
	1 : RNONE;  # Don't write any register
];

## What register should be used as the M destination?
word d_dstM = [
	D_icode in { IMRMOVQ, IPOPQ } : D_rA;
	1 : RNONE;  # Don't write any register
];

## What should be the A value?
## Forward into decode stage for valA
word d_valA = [
	D_icode in { ICALL, IJXX } : D_valP; # Use incremented PC
	d_srcA == e_dstE : e_valE;    # Forward valE from execute
	d_srcA == M_dstM : m_valM;    # Forward valM from memory
	d_srcA == M_dstE : M_valE;    # Forward valE from memory
	d_srcA == W_dstM : W_valM;    # Forward valM from write back
	d_srcA == W_dstE : W_valE;    # Forward valE from write back
	1 : d_rvalA;  # Use value read from register file
];

word d_valB = [
	d_srcB == e_dstE : e_valE;    # Forward valE from execute
	d_srcB == M_dstM : m_valM;    # Forward valM from memory
	d_srcB == M_dstE : M_valE;    # Forward valE from memory
	d_srcB == W_dstM : W_valM;    # Forward valM from write back
	d_srcB == W_dstE : W_valE;    # Forward valE from write back
	1 : d_rvalB;  # Use value read from register file
];

################ Execute Stage #####################################

## Select input A to ALU
word aluA = [
	E_icode in { IRRMOVQ, IOPQ } : E_valA;
	E_icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ , IIADDQ } : E_valC;
	E_icode in { ICALL, IPUSHQ } : -8;
	E_icode in { IRET, IPOPQ } : 8;
	# Other instructions don't need ALU
];

## Select input B to ALU
word aluB = [
	E_icode in { IRMMOVQ, IMRMOVQ, IOPQ, ICALL, 
		     IPUSHQ, IRET, IPOPQ , IIADDQ } : E_valB;
	E_icode in { IRRMOVQ, IIRMOVQ } : 0;
	# Other instructions don't need ALU
];

## Set the ALU function
word alufun = [
	E_icode == IOPQ : E_ifun;
	1 : ALUADD;
];

## Should the condition codes be updated?
bool set_cc = E_icode in {IOPQ , IIADDQ }  &&
	# State changes only during normal operation
	!m_stat in { SADR, SINS, SHLT } && !W_stat in { SADR, SINS, SHLT };

## Generate valA in execute stage
word e_valA = E_valA;    # Pass valA through stage

## Set dstE to RNONE in event of not-taken conditional move
word e_dstE = [
	E_icode == IRRMOVQ && !e_Cnd : RNONE;
	1 : E_dstE;
];

################ Memory Stage ######################################

## Select memory address
word mem_addr = [
	M_icode in { IRMMOVQ, IPUSHQ, ICALL, IMRMOVQ } : M_valE;
	M_icode in { IPOPQ, IRET } : M_valA;
	# Other instructions don't need address
];

## Set read control signal
bool mem_read = M_icode in { IMRMOVQ, IPOPQ, IRET };

## Set write control signal
bool mem_write = M_icode in { IRMMOVQ, IPUSHQ, ICALL };

#/* $begin pipe-m_stat-hcl */
## Update the status
word m_stat = [
	dmem_error : SADR;
	1 : M_stat;
];
#/* $end pipe-m_stat-hcl */

## Set E port register ID
word w_dstE = W_dstE;

## Set E port value
word w_valE = W_valE;

## Set M port register ID
word w_dstM = W_dstM;

## Set M port value
word w_valM = W_valM;

## Update processor status
word Stat = [
	W_stat == SBUB : SAOK;
	1 : W_stat;
];

################ Pipeline Register Control #########################

# Should I stall or inject a bubble into Pipeline Register F?
# At most one of these can be true.
bool F_bubble = 0;
bool F_stall =
	# Conditions for a load/use hazard
	E_icode in { IMRMOVQ, IPOPQ } &&
	 E_dstM in { d_srcA, d_srcB } ||
	# Stalling at fetch while ret passes through pipeline
	IRET in { D_icode, E_icode, M_icode };

# Should I stall or inject a bubble into Pipeline Register D?
# At most one of these can be true.
bool D_stall = 
	# Conditions for a load/use hazard
	E_icode in { IMRMOVQ, IPOPQ } &&
	 E_dstM in { d_srcA, d_srcB };

bool D_bubble =
	# Mispredicted branch
	(E_icode == IJXX && !e_Cnd) ||
	# Stalling at fetch while ret passes through pipeline
	# but not condition for a load/use hazard
	!(E_icode in { IMRMOVQ, IPOPQ } && E_dstM in { d_srcA, d_srcB }) &&
	  IRET in { D_icode, E_icode, M_icode };

# Should I stall or inject a bubble into Pipeline Register E?
# At most one of these can be true.
bool E_stall = 0;
bool E_bubble =
	# Mispredicted branch
	(E_icode == IJXX && !e_Cnd) ||
	# Conditions for a load/use hazard
	E_icode in { IMRMOVQ, IPOPQ } &&
	 E_dstM in { d_srcA, d_srcB};

# Should I stall or inject a bubble into Pipeline Register M?
# At most one of these can be true.
bool M_stall = 0;
# Start injecting bubbles as soon as exception passes through memory stage
bool M_bubble = m_stat in { SADR, SINS, SHLT } || W_stat in { SADR, SINS, SHLT };

# Should I stall or inject a bubble into Pipeline Register W?
bool W_stall = W_stat in { SADR, SINS, SHLT };
bool W_bubble = 0;
#/* $end pipe-all-hcl */

Again, we only need to do little modification to the codes, all complex stuff such as data forwarding, bubble injection and PC predication for ret and call, which I don’t quite understand, are done by the teachers.

By the way, don’t forget that iaddq need to set the condition code.

# test ncopy.ys
$ ./correctness.pl
Simulating with instruction set simulator yis
	ncopy
0	OK
1	OK
2	OK
3	OK
4	OK
5	OK
6	OK
7	OK
8	OK
9	OK
10	OK
11	OK
12	OK
13	OK
14	OK
15	OK
16	OK
17	OK
18	OK
19	OK
20	OK
21	OK
22	OK
23	OK
24	OK
25	OK
26	OK
27	OK
28	OK
29	OK
30	OK
31	OK
32	OK
33	OK
34	OK
35	OK
36	OK
37	OK
38	OK
39	OK
40	OK
41	OK
42	OK
43	OK
44	OK
45	OK
46	OK
47	OK
48	OK
49	OK
50	OK
51	OK
52	OK
53	OK
54	OK
55	OK
56	OK
57	OK
58	OK
59	OK
60	OK
61	OK
62	OK
63	OK
64	OK
128	OK
192	OK
256	OK
68/68 pass correctness test

# test benchmark programs on modified psim
$ (cd ../y86-code; make testpsim)
grep "ISA Check" *.pipe
asum.pipe:ISA Check Succeeds
asumr.pipe:ISA Check Succeeds
cjr.pipe:ISA Check Succeeds
j-cc.pipe:ISA Check Succeeds
poptest.pipe:ISA Check Succeeds
prog1.pipe:ISA Check Succeeds
prog2.pipe:ISA Check Succeeds
prog3.pipe:ISA Check Succeeds
prog4.pipe:ISA Check Succeeds
prog5.pipe:ISA Check Succeeds
prog6.pipe:ISA Check Succeeds
prog7.pipe:ISA Check Succeeds
prog8.pipe:ISA Check Succeeds
pushquestion.pipe:ISA Check Succeeds
pushtest.pipe:ISA Check Succeeds
ret-hazard.pipe:ISA Check Succeeds

# test with extensive regression tests
$ (cd ../ptest; make SIM=../pipe/psim TFLAGS=-i)
./optest.pl -s ../pipe/psim -i
Simulating with ../pipe/psim
  All 58 ISA Checks Succeed
./jtest.pl -s ../pipe/psim -i
Simulating with ../pipe/psim
  All 96 ISA Checks Succeed
./ctest.pl -s ../pipe/psim -i
Simulating with ../pipe/psim
  All 22 ISA Checks Succeed
./htest.pl -s ../pipe/psim -i
Simulating with ../pipe/psim
  All 756 ISA Checks Succeed

# test correctness on modified pipeline
$ ./correctness.pl -p
Simulating with pipeline simulator psim
	ncopy
0	OK
1	OK
2	OK
3	OK
4	OK
5	OK
6	OK
7	OK
8	OK
9	OK
10	OK
11	OK
12	OK
13	OK
14	OK
15	OK
16	OK
17	OK
18	OK
19	OK
20	OK
21	OK
22	OK
23	OK
24	OK
25	OK
26	OK
27	OK
28	OK
29	OK
30	OK
31	OK
32	OK
33	OK
34	OK
35	OK
36	OK
37	OK
38	OK
39	OK
40	OK
41	OK
42	OK
43	OK
44	OK
45	OK
46	OK
47	OK
48	OK
49	OK
50	OK
51	OK
52	OK
53	OK
54	OK
55	OK
56	OK
57	OK
58	OK
59	OK
60	OK
61	OK
62	OK
63	OK
64	OK
128	OK
192	OK
256	OK
68/68 pass correctness test

Evaluation

$ ./benchmark.pl
	ncopy
0	14
1	30	30.00
2	42	21.00
3	54	18.00
4	66	16.50
5	78	15.60
6	90	15.00
7	102	14.57
8	114	14.25
9	126	14.00
10	138	13.80
11	150	13.64
12	162	13.50
13	174	13.38
14	186	13.29
15	198	13.20
16	210	13.12
17	222	13.06
18	234	13.00
19	246	12.95
20	258	12.90
21	270	12.86
22	282	12.82
23	294	12.78
24	306	12.75
25	318	12.72
26	330	12.69
27	342	12.67
28	354	12.64
29	366	12.62
30	378	12.60
31	390	12.58
32	402	12.56
33	414	12.55
34	426	12.53
35	438	12.51
36	450	12.50
37	462	12.49
38	474	12.47
39	486	12.46
40	498	12.45
41	510	12.44
42	522	12.43
43	534	12.42
44	546	12.41
45	558	12.40
46	570	12.39
47	582	12.38
48	594	12.38
49	606	12.37
50	618	12.36
51	630	12.35
52	642	12.35
53	654	12.34
54	666	12.33
55	678	12.33
56	690	12.32
57	702	12.32
58	714	12.31
59	726	12.31
60	738	12.30
61	750	12.30
62	762	12.29
63	774	12.29
64	786	12.28
Average CPE	13.33
Score	0.0/60.0

I know there are many tricks to accelerate the program, such as expanding the loop, using different branch predication strategies and implementing data forwarding in pipe-lf.hcl. But I failed to figure the implementation on my own. Thanks for dreamanddead.

However, it only gives me 1 second reduction…..

$ ./benchmark.pl
	ncopy
0	14
1	29	29.00
2	40	20.00
3	51	17.00
4	62	15.50
5	73	14.60
6	84	14.00
7	95	13.57
8	106	13.25
9	117	13.00
10	128	12.80
11	139	12.64
12	150	12.50
13	161	12.38
14	172	12.29
15	183	12.20
16	194	12.12
17	205	12.06
18	216	12.00
19	227	11.95
20	238	11.90
21	249	11.86
22	260	11.82
23	271	11.78
24	282	11.75
25	293	11.72
26	304	11.69
27	315	11.67
28	326	11.64
29	337	11.62
30	348	11.60
31	359	11.58
32	370	11.56
33	381	11.55
34	392	11.53
35	403	11.51
36	414	11.50
37	425	11.49
38	436	11.47
39	447	11.46
40	458	11.45
41	469	11.44
42	480	11.43
43	491	11.42
44	502	11.41
45	513	11.40
46	524	11.39
47	535	11.38
48	546	11.38
49	557	11.37
50	568	11.36
51	579	11.35
52	590	11.35
53	601	11.34
54	612	11.33
55	623	11.33
56	634	11.32
57	645	11.32
58	656	11.31
59	667	11.31
60	678	11.30
61	689	11.30
62	700	11.29
63	711	11.29
64	722	11.28
Average CPE	12.33
Score	0.0/60.0

I have to admit that I failed on Part C.

(In fact, the brenchmark.pl only use output of psim to judge the performance, so wrong implementation will be scored very high, for example, just delete the implementation of iaddq)

	ncopy
0	14
1	19	19.00
2	19	9.50
3	19	6.33
4	19	4.75
5	19	3.80
6	19	3.17
7	19	2.71
8	19	2.38
9	19	2.11
10	19	1.90
11	19	1.73
12	19	1.58
13	19	1.46
14	19	1.36
15	19	1.27
16	19	1.19
17	19	1.12
18	19	1.06
19	19	1.00
20	19	0.95
21	19	0.90
22	19	0.86
23	19	0.83
24	19	0.79
25	19	0.76
26	19	0.73
27	19	0.70
28	19	0.68
29	19	0.66
30	19	0.63
31	19	0.61
32	19	0.59
33	19	0.58
34	19	0.56
35	19	0.54
36	19	0.53
37	19	0.51
38	19	0.50
39	19	0.49
40	19	0.47
41	19	0.46
42	19	0.45
43	19	0.44
44	19	0.43
45	19	0.42
46	19	0.41
47	19	0.40
48	19	0.40
49	19	0.39
50	19	0.38
51	19	0.37
52	19	0.37
53	19	0.36
54	19	0.35
55	19	0.35
56	19	0.34
57	19	0.33
58	19	0.33
59	19	0.32
60	19	0.32
61	19	0.31
62	19	0.31
63	19	0.30
64	19	0.30
Average CPE	1.41
Score	60.0/60.0

Architecture lab

Introduction

PartA

Program 1

My solution :

Program 2

My solution:

Program 3

My solution :

Part B

Building and Testing Your Solution

Build

Test

Testing your solution on a simple Y86-64 program.

Debug

Retesting your solution using the benchmark programs

Performing regression tests.

My solution :

Simple y86 test

Benchmark test

Regression test

Test iaddq

Part C

Coding rules

Building and Running Your Solution

My solution :

Evaluation

Test `iaddq`