architecturelab

Architecture lab

Introduction

The lab is organized into three parts, each with its own handin

  • In Part A you will write some simple Y86-64 programs and become familiar with the Y86-64 tools.
  • In Part B, you will extend the SEQ simulator with a new instruction.
  • These two parts will prepare you for Part C, the heart of the lab, where you will optimize the Y86-64 benchmark program and the processor design.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
$ tar xvf archlab-handout.tar
$ cat README
This directory contains the files that you need for the CS:APP
architecture lab.

******
Files:
******

Makefile Use this to handin your solutions
README This file
archlab.{ps,pdf} Lab writeup
sim.tar Archive of the Y86-64 tools in tar format
simguide.{ps,pdf} CS:APP Guide to Simulators document
1
2
3
$ tar xvf sim.tar
$ cd sim
$ make clean; make # for GUI version
1
2
3
$ vim Makefile
# comment out TKLIBS and TKINC for non-GUI version
$ make clean; make

PartA

You will be working in directory sim/misc in this part.

Your task is to write and simulate the following three Y86-64 programs.

  • put your name and ID in a comment at the beginning of each program
  • You can test your programs by first assemblying them with the program YAS and then running them with the instruction set simulator YIS.
  • you should follow the x86-64 conventions for passing function arguments, using registers, and using the stack. This includes saving and restoring any callee-save registers that you use.
  • The C code of the program is in sim/misc/example.c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
/* 
* Architecture Lab: Part A
*
* High level specs for the functions that the students will rewrite
* in Y86-64 assembly language
*/

/* $begin examples */
/* linked list element */
typedef struct ELE {
long val;
struct ELE *next;
} *list_ptr;

/* sum_list - Sum the elements of a linked list */
long sum_list(list_ptr ls)
{
long val = 0;
while (ls) {
val += ls->val;
ls = ls->next;
}
return val;
}

/* rsum_list - Recursive version of sum_list */
long rsum_list(list_ptr ls)
{
if (!ls)
return 0;
else {
long val = ls->val;
long rest = rsum_list(ls->next);
return val + rest;
}
}

/* copy_block - Copy src to dest and return xor checksum of src */
long copy_block(long *src, long *dest, long len)
{
long result = 0;
while (len > 0) {
long val = *src++;
*dest++ = val;
result ^= val;
len--;
}
return result;
}
/* $end examples */
Program 1

sum.ys: Iteratively sum linked list elements

Your program should consist of some code that sets up the stack structure, invokes a function, and then halts

Test your program using the following three-element list:

1
2
3
4
5
6
7
8
9
10
11
# Sample linked list 
.align 8
ele1:
.quad 0x00a
.quad ele2
ele2:
.quad 0x0b0
.quad ele3
ele3:
.quad 0xc00
.quad 0

My solution :

  1. use .pos 0xvalue to specify the address in code section

  2. I know we can use gcc to first assemble example.c into x86_64 assembly code and then modify it to y86 code, but just not recommended.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
# code section start at address 0x0
.pos 0
code:
# stack set up
irmovq stack, %rsp

# entry of the program
main:
irmovq data,%rdi
call sum_list
halt

# sum_list(Node * ls)
sum_list:
# initialize return value to 0
irmovq $0 , %rax
# test if ls is NULL
irmovq $0 , %rbx
subq %rbx, %rdi
je return
# guarded do strategy
loop:
mrmovq (%rdi),%rsi
addq %rsi,%rax
mrmovq 8(%rdi),%rdi
subq %rbx , %rdi
jne loop
return:
ret

# Sample linked list
# data section start at address 0x300
.pos 0x300
.align 8
data:
ele1:
.quad 0x00a
.quad ele2
ele2:
.quad 0x0b0
.quad ele3
ele3:
.quad 0xc00
.quad 0

# stack grows from address 0x500 to 0x0 (should be limited at 0x300)
.pos 0x500
stack:
1
2
3
4
5
6
7
8
Stopped in 24 steps at PC = 0x1d.  Status 'HLT', CC Z=1 S=0 O=0
Changes to registers:
%rax: 0x0000000000000000 0x0000000000000cba
%rsp: 0x0000000000000000 0x0000000000000500
%rsi: 0x0000000000000000 0x0000000000000c00

Changes to memory:
0x04f8: 0x0000000000000000 0x000000000000001d

Program 2

rsum.ys: Recursively sum linked list elements

This code should be similar to the code in sum.ys, except that it should use a function rsum_list that recursively sums a list of numbers, as shown with the C function rsum_list

My solution:

when using callee saved register in recursive function:

  1. push it before use it
  2. pop it after use it
  3. if not used in some branch, not push and pop are needed at all.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
# code section start at address 0x0
.pos 0
code:

# stack set up
irmovq stack, %rsp

# entry of the program
main:
irmovq data,%rdi
call rsum_list
halt

# rsum_list(Node * ls)
rsum_list:
irmovq $0 , %rax # initialize return value to 0
irmovq $0 , %rcx # get a 0 constant
subq %rcx , %rdi
je return
# else:
pushq %rbp
mrmovq (%rdi),%rsi
addq %rsi, %rax
rrmovq %rax, %rbp
mrmovq 8(%rdi),%rdi
call rsum_list
addq %rbp ,%rax
popq %rbp
return:
ret

# Sample linked list
# data section start at address 0x300
.pos 0x300
.align 8
data:
ele1:
.quad 0x00a
.quad ele2
ele2:
.quad 0x0b0
.quad ele3
ele3:
.quad 0xc00
.quad 0
# stack grows from address 0x500 to 0x0 (should be limited at 0x300)
.pos 0x500
stack:
1
2
3
4
5
6
7
8
9
10
11
12
13
Stopped in 48 steps at PC = 0x1d.  Status 'HLT', CC Z=0 S=0 O=0
Changes to registers:
%rax: 0x0000000000000000 0x0000000000000cba
%rsp: 0x0000000000000000 0x0000000000000500
%rsi: 0x0000000000000000 0x0000000000000c00

Changes to memory:
0x04c8: 0x0000000000000000 0x0000000000000060
0x04d0: 0x0000000000000000 0x00000000000000b0
0x04d8: 0x0000000000000000 0x0000000000000060
0x04e0: 0x0000000000000000 0x000000000000000a
0x04e8: 0x0000000000000000 0x0000000000000060
0x04f8: 0x0000000000000000 0x000000000000001d

Program 3

copy.ys: Copy a source block to a destination block

Write a program (copy.ys) that copies a block of words from one part of memory to another (non-overlapping area) area of memory, computing the checksum (Xor) of all the words copied.

Test your program using the following three-element source and destination blocks:

1
2
3
4
5
6
7

.align 8
# Source block src:
.quad 0x00a .quad 0x0b0 .quad 0xc00
# Destination block
dest:
.quad 0x111 .quad 0x222 .quad 0x333

My solution :

Note that in y86 address is assigned by lines

So quad 0x00a .quad 0x0b0 .quad 0xc00 and

.quad 0x00a .quad 0x0b0 .quad 0xc00

are different.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
# code section start at address 0x0
.pos 0
code:

# stack set up
irmovq stack, %rsp

# entry of the program
main:
irmovq src,%rdi
irmovq dest,%rsi
irmovq $3, %rcx
call copy_block
halt

copy_block:
# initialize return value to 0
irmovq $0 , %rax
# set constants
irmovq $0, %rbp
irmovq $8, %rbx
# jump to middle strategy
jmp test
loop:
mrmovq (%rdi),%rdx
addq %rbx,%rdi
rmmovq %rdx,(%rsi)
addq %rbx , %rsi
xorq %rdx,%rax
irmovq $1,%rdx
subq %rdx,%rcx
test:
subq %rbp , %rcx
jg loop
ret

# data section start at address 0x300
.pos 0x300
.align 8
# Source block
src:
.quad 0x00a
.quad 0x0b0
.quad 0xc00
# Destination block
dest:
.quad 0x111
.quad 0x222
.quad 0x333

# stack grows from address 0x500 to 0x0 (should be limited at 0x300)
.pos 0x500
stack:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Stopped in 40 steps at PC = 0x31.  Status 'HLT', CC Z=1 S=0 O=0
Changes to registers:
%rax: 0x0000000000000000 0x0000000000000cba
%rdx: 0x0000000000000000 0x0000000000000001
%rbx: 0x0000000000000000 0x0000000000000008
%rsp: 0x0000000000000000 0x0000000000000500
%rsi: 0x0000000000000000 0x0000000000000330
%rdi: 0x0000000000000000 0x0000000000000318

Changes to memory:
0x0318: 0x0000000000000111 0x000000000000000a
0x0320: 0x0000000000000222 0x00000000000000b0
0x0328: 0x0000000000000333 0x0000000000000c00
0x04f8: 0x0000000000000000 0x0000000000000031

Part B

You will be working in directory sim/seq in this part.

Your task in Part B is to extend the SEQ processor to support the iaddq

To add this instructions, you will modify the file seq-full.hcl, which implements the version of SEQ described in the CS:APP3e textbook. In addition, it contains declarations of some constants that you will need for your solution.

Your HCL file must begin with a header comment containing the following information:

  • Your name and ID.
  • A description of the computations required for the iaddq instruction. Use the descriptions of irmovq and OPq in Figure 4.18 in the CS:APP3e text as a guide.

Building and Testing Your Solution

you will need to build a new instance of the SEQ simulator (ssim) based on this HCL file, and then test it

Build

1
$ make VERSION=full

(This builds a version of ssim that uses the control logic you specified in seq-full.hcl. To save typing, you can assign VERSION=full in the Makefile.)

Test

Testing your solution on a simple Y86-64 program.
1
$ ./ssim -t ../y86-code/asumi.yo
Debug
1
$ ./ssim -g ../y86-code/asumi.yo
Retesting your solution using the benchmark programs
1
$ (cd ../y86-code; make testssim)

Note that none of these programs test the added instructions.

You are simply making sure that your solution did not inject errors for the original instructions. See ../y86-code/README file for more details.

Performing regression tests.

test everything except iaddq and leave

1
$ (cd ../ptest; make SIM=../seq/ssim)

test iaddq

1
$ (cd ../ptest; make SIM=../seq/ssim TFLAGS=-i)

For more information on the SEQ simulator refer to the handout CS:APP3e Guide to Y86-64 Processor Simulators (simguide.pdf).

My solution :

The problem could have been much harder. Due to the bloody clear comments and concise codes given by the file itself, adding the iaddq instruction becomes a piece of cake. Just adding IIADDQ at serveral different places will get the job done.

The teachers at CMU really did a splendid job, I really appreciate their efforts and passion.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Stage               iaddq V,rB

Fetch icode:ifunc ← M1[PC]
rA:rB ← M1[PC+1]
valC ← M8[PC + 2]
valP = PC + 10

Decode valB ← R[rB]

Execute valE ← valB + valC

Memory

Write back R[rB] ← valE

PC update PC ← valP
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
####################################################################
# Control Signal Definitions. #
####################################################################

################ Fetch Stage ###################################

# Determine instruction code
word icode = [
imem_error: INOP;
1: imem_icode; # Default: get from instruction memory
];

# Determine instruction function
word ifun = [
imem_error: FNONE;
1: imem_ifun; # Default: get from instruction memory
];

bool instr_valid = icode in
{ INOP, IHALT, IRRMOVQ, IIRMOVQ, IRMMOVQ, IMRMOVQ,
IOPQ, IJXX, ICALL, IRET, IPUSHQ, IPOPQ , IIADDQ };

# Does fetched instruction require a regid byte?
bool need_regids =
icode in { IRRMOVQ, IOPQ, IPUSHQ, IPOPQ,
IIRMOVQ, IRMMOVQ, IMRMOVQ , IIADDQ };

# Does fetched instruction require a constant word?
bool need_valC =
icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ, IJXX, ICALL , IIADDQ };

################ Decode Stage ###################################

## What register should be used as the A source?
word srcA = [
icode in { IRRMOVQ, IRMMOVQ, IOPQ, IPUSHQ } : rA;
icode in { IPOPQ, IRET } : RRSP;
1 : RNONE; # Don't need register
];

## What register should be used as the B source?
word srcB = [
icode in { IOPQ, IRMMOVQ, IMRMOVQ , IIADDQ } : rB;
icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
1 : RNONE; # Don't need register
];

## What register should be used as the E destination?
word dstE = [
icode in { IRRMOVQ } && Cnd : rB;
icode in { IIRMOVQ, IOPQ , IIADDQ } : rB;
icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
1 : RNONE; # Don't write any register
];

## What register should be used as the M destination?
word dstM = [
icode in { IMRMOVQ, IPOPQ } : rA;
1 : RNONE; # Don't write any register
];

################ Execute Stage ###################################

## Select input A to ALU
word aluA = [
icode in { IRRMOVQ, IOPQ } : valA;
icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ , IIADDQ } : valC;
icode in { ICALL, IPUSHQ } : -8;
icode in { IRET, IPOPQ } : 8;
# Other instructions don't need ALU
];

## Select input B to ALU
word aluB = [
icode in { IRMMOVQ, IMRMOVQ, IOPQ, ICALL,
IPUSHQ, IRET, IPOPQ , IIADDQ } : valB;
icode in { IRRMOVQ, IIRMOVQ } : 0;
# Other instructions don't need ALU
];

## Set the ALU function
word alufun = [
icode == IOPQ : ifun;
1 : ALUADD;
];

## Should the condition codes be updated?
bool set_cc = icode in { IOPQ , IIADDQ };

################ Memory Stage ###################################

## Set read control signal
bool mem_read = icode in { IMRMOVQ, IPOPQ, IRET };

## Set write control signal
bool mem_write = icode in { IRMMOVQ, IPUSHQ, ICALL };

## Select memory address
word mem_addr = [
icode in { IRMMOVQ, IPUSHQ, ICALL, IMRMOVQ } : valE;
icode in { IPOPQ, IRET } : valA;
# Other instructions don't need address
];

## Select memory input data
word mem_data = [
# Value from register
icode in { IRMMOVQ, IPUSHQ } : valA;
# Return PC
icode == ICALL : valP;
# Default: Don't write anything
];

## Determine instruction status
word Stat = [
imem_error || dmem_error : SADR;
!instr_valid: SINS;
icode == IHALT : SHLT;
1 : SAOK;
];

################ Program Counter Update ############################

## What address should instruction be fetched at

word new_pc = [
# Call. Use instruction constant
icode == ICALL : valC;
# Taken branch. Use instruction constant
icode == IJXX && Cnd : valC;
# Completion of RET instruction. Use value from stack
icode == IRET : valM;
# Default: Use incremented PC
1 : valP;
];
#/* $end seq-all-hcl */
Simple y86 test
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
$ make VERSION=full
34 instructions executed
Status = HLT
Condition Codes: Z=1 S=0 O=0
Changed Register State:
%rax: 0x0000000000000000 0x0000abcdabcdabcd
%rsp: 0x0000000000000000 0x0000000000000200
%rdi: 0x0000000000000000 0x0000000000000038
%r8: 0x0000000000000000 0x0000000000000008
%r9: 0x0000000000000000 0x0000000000000001
%r10: 0x0000000000000000 0x0000a000a000a000
Changed Memory State:
0x01f0: 0x0000000000000000 0x0000000000000055
0x01f8: 0x0000000000000000 0x0000000000000013
ISA Check Succeeds
Benchmark test
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
$ cd ../y86_code ; make testssim
grep "ISA Check" *.seq
asum.seq:ISA Check Succeeds
asumr.seq:ISA Check Succeeds
cjr.seq:ISA Check Succeeds
j-cc.seq:ISA Check Succeeds
poptest.seq:ISA Check Succeeds
prog1.seq:ISA Check Succeeds
prog2.seq:ISA Check Succeeds
prog3.seq:ISA Check Succeeds
prog4.seq:ISA Check Succeeds
prog5.seq:ISA Check Succeeds
prog6.seq:ISA Check Succeeds
prog7.seq:ISA Check Succeeds
prog8.seq:ISA Check Succeeds
pushquestion.seq:ISA Check Succeeds
pushtest.seq:ISA Check Succeeds
ret-hazard.seq:ISA Check Succeeds
Regression test
1
2
3
4
5
6
7
8
9
10
11
12
13
$ cd ../ptest; make SIM=../seq/ssim
./optest.pl -s ../seq/ssim
Simulating with ../seq/ssim
All 49 ISA Checks Succeed
./jtest.pl -s ../seq/ssim
Simulating with ../seq/ssim
All 64 ISA Checks Succeed
./ctest.pl -s ../seq/ssim
Simulating with ../seq/ssim
All 22 ISA Checks Succeed
./htest.pl -s ../seq/ssim
Simulating with ../seq/ssim
All 600 ISA Checks Succeed
Test iaddq
1
2
3
4
5
6
7
8
9
10
11
12
13
$ cd ../ptest; make SIM=../seq/ssim TFLAGS=-i
./optest.pl -s ../seq/ssim -i
Simulating with ../seq/ssim
All 58 ISA Checks Succeed
./jtest.pl -s ../seq/ssim -i
Simulating with ../seq/ssim
All 96 ISA Checks Succeed
./ctest.pl -s ../seq/ssim -i
Simulating with ../seq/ssim
All 22 ISA Checks Succeed
./htest.pl -s ../seq/ssim -i
Simulating with ../seq/ssim
All 756 ISA Checks Succeed

Part C

You will be working in directory sim/pipe in this part.

Your task in Part C is to modify ncopy.ys and pipe-full.hcl with the goal of making ncopy.ys run as fast as possible.

Coding rules

You are free to make any modifications you wish, with the following constraints:

  • Your ncopy.ys function must work for arbitrary array sizes.

  • Your ncopy.ys function must run correctly with YIS

  • The assembled version of your ncopy file must not be more than 1000 bytes long

    1
    2
    # length check
    $ ./check-len.pl < ncopy.yo
  • Your pipe-full.hcl implementation must pass the regression tests in ../y86-codeand ../ptest(without the -i flag that tests iaddq)

Other than that, you are free to implement the iaddq instruction if you think that will help.

You may make any semantics preserving transformations to the ncopy.ys function, such as reordering instructions, replacing groups of instructions with single instructions, deleting some instructions, and adding other instructions

(You may find it useful to read about loop unrolling in Section 5.8 )(Unfortunately, I haven’t progressed to Chapter 5)

Building and Running Your Solution

Each time you modify your ncopy.ys program, you can rebuild the driver programs by typing

1
$ make drivers

Each time you modify your pipe-full.hcl file, you can rebuild the simulator by typing

1
$ make psim VERSION=full

If you want to rebuild the simulator and the driver programs, type

1
$ make VERSION=full

Testing your driver files on the ISA simulator. Make sure that your ncopy.ys function works prop- erly with YIS:

1
2
$ make drivers
$ ../misc/yis sdriver.yo

Testing your code on a range of block lengths with the ISA simulator

1
$ ./correctness.pl

The program will end with register %rax having the following value:

0xaaaa : All tests pass.

0xbbbb : Incorrect count

0xcccc : Function ncopy is more than 1000 bytes long.

0xdddd : Some of the source data was not copied to its destination.

0xeeee : Some word just before or just after the destination region was corrupted.

Testing your pipeline simulator on the benchmark programs.

1
$ (cd ../y86-code; make testpsim)

Testing your pipeline simulator with extensive regression tests

1
$ (cd ../ptest; make SIM=../pipe/psim TFLAGS=-i)

Testing your code on a range of block lengths with the pipeline simulator.

1
$ ./correctness.pl -p

My solution :

For ncopy.ys, I use conditional move to erase the small branch.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
##################################################################
# You can modify this portion
# Loop header
irmovq $1,%r11
xorq %rax,%rax # count = 0;
andq %rdx,%rdx # len <= 0?
jle Done # if so, goto Done:

Loop: mrmovq (%rdi), %r10 # read val from src...
rmmovq %r10, (%rsi) # ...and store it to dst
xorq %r8,%r8
andq %r10, %r10 # val <= 0?
cmovg %r11,%r8
addq %r8, %rax # count++
iaddq $-1,%rdx
iaddq $8,%rdi
iaddq $8,%rsi
andq %rdx,%rdx # len > 0?
jg Loop # if so, goto Loop:
##################################################################

For pipe-full.hcl, I implement the iaddq instruction to reduce the number of instructions in ncopy.ys

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
####################################################################
# Control Signal Definitions. #
####################################################################

################ Fetch Stage ###################################

## What address should instruction be fetched at
word f_pc = [
# Mispredicted branch. Fetch at incremented PC
M_icode == IJXX && !M_Cnd : M_valA;
# Completion of RET instruction
W_icode == IRET : W_valM;
# Default: Use predicted value of PC
1 : F_predPC;
];

## Determine icode of fetched instruction
word f_icode = [
imem_error : INOP;
1: imem_icode;
];

# Determine ifun
word f_ifun = [
imem_error : FNONE;
1: imem_ifun;
];

# Is instruction valid?
bool instr_valid = f_icode in
{ INOP, IHALT, IRRMOVQ, IIRMOVQ, IRMMOVQ, IMRMOVQ,
IOPQ, IJXX, ICALL, IRET, IPUSHQ, IPOPQ , IIADDQ };

# Determine status code for fetched instruction
word f_stat = [
imem_error: SADR;
!instr_valid : SINS;
f_icode == IHALT : SHLT;
1 : SAOK;
];

# Does fetched instruction require a regid byte?
bool need_regids =
f_icode in { IRRMOVQ, IOPQ, IPUSHQ, IPOPQ,
IIRMOVQ, IRMMOVQ, IMRMOVQ , IIADDQ };

# Does fetched instruction require a constant word?
bool need_valC =
f_icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ, IJXX, ICALL , IIADDQ };

# Predict next value of PC
word f_predPC = [
f_icode in { IJXX, ICALL } : f_valC;
1 : f_valP;
];

################ Decode Stage ######################################


## What register should be used as the A source?
word d_srcA = [
D_icode in { IRRMOVQ, IRMMOVQ, IOPQ, IPUSHQ } : D_rA;
D_icode in { IPOPQ, IRET } : RRSP;
1 : RNONE; # Don't need register
];

## What register should be used as the B source?
word d_srcB = [
D_icode in { IOPQ, IRMMOVQ, IMRMOVQ , IIADDQ } : D_rB;
D_icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
1 : RNONE; # Don't need register
];

## What register should be used as the E destination?
word d_dstE = [
D_icode in { IRRMOVQ, IIRMOVQ, IOPQ , IIADDQ } : D_rB;
D_icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
1 : RNONE; # Don't write any register
];

## What register should be used as the M destination?
word d_dstM = [
D_icode in { IMRMOVQ, IPOPQ } : D_rA;
1 : RNONE; # Don't write any register
];

## What should be the A value?
## Forward into decode stage for valA
word d_valA = [
D_icode in { ICALL, IJXX } : D_valP; # Use incremented PC
d_srcA == e_dstE : e_valE; # Forward valE from execute
d_srcA == M_dstM : m_valM; # Forward valM from memory
d_srcA == M_dstE : M_valE; # Forward valE from memory
d_srcA == W_dstM : W_valM; # Forward valM from write back
d_srcA == W_dstE : W_valE; # Forward valE from write back
1 : d_rvalA; # Use value read from register file
];

word d_valB = [
d_srcB == e_dstE : e_valE; # Forward valE from execute
d_srcB == M_dstM : m_valM; # Forward valM from memory
d_srcB == M_dstE : M_valE; # Forward valE from memory
d_srcB == W_dstM : W_valM; # Forward valM from write back
d_srcB == W_dstE : W_valE; # Forward valE from write back
1 : d_rvalB; # Use value read from register file
];

################ Execute Stage #####################################

## Select input A to ALU
word aluA = [
E_icode in { IRRMOVQ, IOPQ } : E_valA;
E_icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ , IIADDQ } : E_valC;
E_icode in { ICALL, IPUSHQ } : -8;
E_icode in { IRET, IPOPQ } : 8;
# Other instructions don't need ALU
];

## Select input B to ALU
word aluB = [
E_icode in { IRMMOVQ, IMRMOVQ, IOPQ, ICALL,
IPUSHQ, IRET, IPOPQ , IIADDQ } : E_valB;
E_icode in { IRRMOVQ, IIRMOVQ } : 0;
# Other instructions don't need ALU
];

## Set the ALU function
word alufun = [
E_icode == IOPQ : E_ifun;
1 : ALUADD;
];

## Should the condition codes be updated?
bool set_cc = E_icode in {IOPQ , IIADDQ } &&
# State changes only during normal operation
!m_stat in { SADR, SINS, SHLT } && !W_stat in { SADR, SINS, SHLT };

## Generate valA in execute stage
word e_valA = E_valA; # Pass valA through stage

## Set dstE to RNONE in event of not-taken conditional move
word e_dstE = [
E_icode == IRRMOVQ && !e_Cnd : RNONE;
1 : E_dstE;
];

################ Memory Stage ######################################

## Select memory address
word mem_addr = [
M_icode in { IRMMOVQ, IPUSHQ, ICALL, IMRMOVQ } : M_valE;
M_icode in { IPOPQ, IRET } : M_valA;
# Other instructions don't need address
];

## Set read control signal
bool mem_read = M_icode in { IMRMOVQ, IPOPQ, IRET };

## Set write control signal
bool mem_write = M_icode in { IRMMOVQ, IPUSHQ, ICALL };

#/* $begin pipe-m_stat-hcl */
## Update the status
word m_stat = [
dmem_error : SADR;
1 : M_stat;
];
#/* $end pipe-m_stat-hcl */

## Set E port register ID
word w_dstE = W_dstE;

## Set E port value
word w_valE = W_valE;

## Set M port register ID
word w_dstM = W_dstM;

## Set M port value
word w_valM = W_valM;

## Update processor status
word Stat = [
W_stat == SBUB : SAOK;
1 : W_stat;
];

################ Pipeline Register Control #########################

# Should I stall or inject a bubble into Pipeline Register F?
# At most one of these can be true.
bool F_bubble = 0;
bool F_stall =
# Conditions for a load/use hazard
E_icode in { IMRMOVQ, IPOPQ } &&
E_dstM in { d_srcA, d_srcB } ||
# Stalling at fetch while ret passes through pipeline
IRET in { D_icode, E_icode, M_icode };

# Should I stall or inject a bubble into Pipeline Register D?
# At most one of these can be true.
bool D_stall =
# Conditions for a load/use hazard
E_icode in { IMRMOVQ, IPOPQ } &&
E_dstM in { d_srcA, d_srcB };

bool D_bubble =
# Mispredicted branch
(E_icode == IJXX && !e_Cnd) ||
# Stalling at fetch while ret passes through pipeline
# but not condition for a load/use hazard
!(E_icode in { IMRMOVQ, IPOPQ } && E_dstM in { d_srcA, d_srcB }) &&
IRET in { D_icode, E_icode, M_icode };

# Should I stall or inject a bubble into Pipeline Register E?
# At most one of these can be true.
bool E_stall = 0;
bool E_bubble =
# Mispredicted branch
(E_icode == IJXX && !e_Cnd) ||
# Conditions for a load/use hazard
E_icode in { IMRMOVQ, IPOPQ } &&
E_dstM in { d_srcA, d_srcB};

# Should I stall or inject a bubble into Pipeline Register M?
# At most one of these can be true.
bool M_stall = 0;
# Start injecting bubbles as soon as exception passes through memory stage
bool M_bubble = m_stat in { SADR, SINS, SHLT } || W_stat in { SADR, SINS, SHLT };

# Should I stall or inject a bubble into Pipeline Register W?
bool W_stall = W_stat in { SADR, SINS, SHLT };
bool W_bubble = 0;
#/* $end pipe-all-hcl */

Again, we only need to do little modification to the codes, all complex stuff such as data forwarding, bubble injection and PC predication for ret and call, which I don’t quite understand, are done by the teachers.

By the way, don’t forget that iaddq need to set the condition code.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
# test ncopy.ys
$ ./correctness.pl
Simulating with instruction set simulator yis
ncopy
0 OK
1 OK
2 OK
3 OK
4 OK
5 OK
6 OK
7 OK
8 OK
9 OK
10 OK
11 OK
12 OK
13 OK
14 OK
15 OK
16 OK
17 OK
18 OK
19 OK
20 OK
21 OK
22 OK
23 OK
24 OK
25 OK
26 OK
27 OK
28 OK
29 OK
30 OK
31 OK
32 OK
33 OK
34 OK
35 OK
36 OK
37 OK
38 OK
39 OK
40 OK
41 OK
42 OK
43 OK
44 OK
45 OK
46 OK
47 OK
48 OK
49 OK
50 OK
51 OK
52 OK
53 OK
54 OK
55 OK
56 OK
57 OK
58 OK
59 OK
60 OK
61 OK
62 OK
63 OK
64 OK
128 OK
192 OK
256 OK
68/68 pass correctness test
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# test benchmark programs on modified psim
$ (cd ../y86-code; make testpsim)
grep "ISA Check" *.pipe
asum.pipe:ISA Check Succeeds
asumr.pipe:ISA Check Succeeds
cjr.pipe:ISA Check Succeeds
j-cc.pipe:ISA Check Succeeds
poptest.pipe:ISA Check Succeeds
prog1.pipe:ISA Check Succeeds
prog2.pipe:ISA Check Succeeds
prog3.pipe:ISA Check Succeeds
prog4.pipe:ISA Check Succeeds
prog5.pipe:ISA Check Succeeds
prog6.pipe:ISA Check Succeeds
prog7.pipe:ISA Check Succeeds
prog8.pipe:ISA Check Succeeds
pushquestion.pipe:ISA Check Succeeds
pushtest.pipe:ISA Check Succeeds
ret-hazard.pipe:ISA Check Succeeds
1
2
3
4
5
6
7
8
9
10
11
12
13
14
# test with extensive regression tests
$ (cd ../ptest; make SIM=../pipe/psim TFLAGS=-i)
./optest.pl -s ../pipe/psim -i
Simulating with ../pipe/psim
All 58 ISA Checks Succeed
./jtest.pl -s ../pipe/psim -i
Simulating with ../pipe/psim
All 96 ISA Checks Succeed
./ctest.pl -s ../pipe/psim -i
Simulating with ../pipe/psim
All 22 ISA Checks Succeed
./htest.pl -s ../pipe/psim -i
Simulating with ../pipe/psim
All 756 ISA Checks Succeed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
# test correctness on modified pipeline
$ ./correctness.pl -p
Simulating with pipeline simulator psim
ncopy
0 OK
1 OK
2 OK
3 OK
4 OK
5 OK
6 OK
7 OK
8 OK
9 OK
10 OK
11 OK
12 OK
13 OK
14 OK
15 OK
16 OK
17 OK
18 OK
19 OK
20 OK
21 OK
22 OK
23 OK
24 OK
25 OK
26 OK
27 OK
28 OK
29 OK
30 OK
31 OK
32 OK
33 OK
34 OK
35 OK
36 OK
37 OK
38 OK
39 OK
40 OK
41 OK
42 OK
43 OK
44 OK
45 OK
46 OK
47 OK
48 OK
49 OK
50 OK
51 OK
52 OK
53 OK
54 OK
55 OK
56 OK
57 OK
58 OK
59 OK
60 OK
61 OK
62 OK
63 OK
64 OK
128 OK
192 OK
256 OK
68/68 pass correctness test

Evaluation

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
$ ./benchmark.pl
ncopy
0 14
1 30 30.00
2 42 21.00
3 54 18.00
4 66 16.50
5 78 15.60
6 90 15.00
7 102 14.57
8 114 14.25
9 126 14.00
10 138 13.80
11 150 13.64
12 162 13.50
13 174 13.38
14 186 13.29
15 198 13.20
16 210 13.12
17 222 13.06
18 234 13.00
19 246 12.95
20 258 12.90
21 270 12.86
22 282 12.82
23 294 12.78
24 306 12.75
25 318 12.72
26 330 12.69
27 342 12.67
28 354 12.64
29 366 12.62
30 378 12.60
31 390 12.58
32 402 12.56
33 414 12.55
34 426 12.53
35 438 12.51
36 450 12.50
37 462 12.49
38 474 12.47
39 486 12.46
40 498 12.45
41 510 12.44
42 522 12.43
43 534 12.42
44 546 12.41
45 558 12.40
46 570 12.39
47 582 12.38
48 594 12.38
49 606 12.37
50 618 12.36
51 630 12.35
52 642 12.35
53 654 12.34
54 666 12.33
55 678 12.33
56 690 12.32
57 702 12.32
58 714 12.31
59 726 12.31
60 738 12.30
61 750 12.30
62 762 12.29
63 774 12.29
64 786 12.28
Average CPE 13.33
Score 0.0/60.0

I know there are many tricks to accelerate the program, such as expanding the loop, using different branch predication strategies and implementing data forwarding in pipe-lf.hcl. But I failed to figure the implementation on my own. Thanks for dreamanddead.

However, it only gives me 1 second reduction…..

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
$ ./benchmark.pl
ncopy
0 14
1 29 29.00
2 40 20.00
3 51 17.00
4 62 15.50
5 73 14.60
6 84 14.00
7 95 13.57
8 106 13.25
9 117 13.00
10 128 12.80
11 139 12.64
12 150 12.50
13 161 12.38
14 172 12.29
15 183 12.20
16 194 12.12
17 205 12.06
18 216 12.00
19 227 11.95
20 238 11.90
21 249 11.86
22 260 11.82
23 271 11.78
24 282 11.75
25 293 11.72
26 304 11.69
27 315 11.67
28 326 11.64
29 337 11.62
30 348 11.60
31 359 11.58
32 370 11.56
33 381 11.55
34 392 11.53
35 403 11.51
36 414 11.50
37 425 11.49
38 436 11.47
39 447 11.46
40 458 11.45
41 469 11.44
42 480 11.43
43 491 11.42
44 502 11.41
45 513 11.40
46 524 11.39
47 535 11.38
48 546 11.38
49 557 11.37
50 568 11.36
51 579 11.35
52 590 11.35
53 601 11.34
54 612 11.33
55 623 11.33
56 634 11.32
57 645 11.32
58 656 11.31
59 667 11.31
60 678 11.30
61 689 11.30
62 700 11.29
63 711 11.29
64 722 11.28
Average CPE 12.33
Score 0.0/60.0

I have to admit that I failed on Part C.

(In fact, the brenchmark.pl only use output of psim to judge the performance, so wrong implementation will be scored very high, for example, just delete the implementation of iaddq)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
	ncopy
0 14
1 19 19.00
2 19 9.50
3 19 6.33
4 19 4.75
5 19 3.80
6 19 3.17
7 19 2.71
8 19 2.38
9 19 2.11
10 19 1.90
11 19 1.73
12 19 1.58
13 19 1.46
14 19 1.36
15 19 1.27
16 19 1.19
17 19 1.12
18 19 1.06
19 19 1.00
20 19 0.95
21 19 0.90
22 19 0.86
23 19 0.83
24 19 0.79
25 19 0.76
26 19 0.73
27 19 0.70
28 19 0.68
29 19 0.66
30 19 0.63
31 19 0.61
32 19 0.59
33 19 0.58
34 19 0.56
35 19 0.54
36 19 0.53
37 19 0.51
38 19 0.50
39 19 0.49
40 19 0.47
41 19 0.46
42 19 0.45
43 19 0.44
44 19 0.43
45 19 0.42
46 19 0.41
47 19 0.40
48 19 0.40
49 19 0.39
50 19 0.38
51 19 0.37
52 19 0.37
53 19 0.36
54 19 0.35
55 19 0.35
56 19 0.34
57 19 0.33
58 19 0.33
59 19 0.32
60 19 0.32
61 19 0.31
62 19 0.31
63 19 0.30
64 19 0.30
Average CPE 1.41
Score 60.0/60.0