perflab

Performance lab how to

warm up for cache lab

Read the writeup

This assignment deals with optimizing memory intensive code.

In this lab, we will consider two image processing operations: rotate, which rotates an image counter-clockwise by 90◦ , and smooth, which “smooths” or “blurs” an image.

rotate 90

smooth picture

The only file you will be modifying and handing in is kernels.c.

The driver.c program is a driver program that allows you to evaluate the performance of your solutions.

Use the command make driver to generate the driver code and run it with the command ./driver.

Performance measures

You will need to remake driver each time you change the code in kernels.c.

performance

Overall performance is calculated by
$$
R = \sqrt[5]{R_{64} \times R_{128} \times R_{256} \times R_{512} \times R_{1024} }
$$

Versioning

You will be writing many versions of the rotate and smooth routines.

To help you compare the performance of all the different versions you’ve written, we provide a way of “registering” functions.

For example, the file kernels.c that we have provided you contains the following functions

1
2
3
void register_rotate_functions() {
add_rotate_function(&rotate, rotate_descr);
}

This function contains one or more calls to add rotate function.

When adding new version:

1
2
3
4
5
6
void new_version(void){...}
char new_descr [] = "new version";
void register_rotate_functions() {
add_rotate_function(&rotate, rotate_descr);
add_rotate_function(&new_version , new_descr);
}

Coding Rules

  • You may not use any embedded assembly language statements.
  • You are allowed to define macros, additional global variables, and other procedures in these files.
  • You can make the assumption of N that N = k * 32

Getting start

This lab is a legacy lab, so we need to compile it as 32-bit program

1
2
$ sudo apt install gcc-multilib
# in order to support gcc -m32 option

The naive version’s performance:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
$ make driver
$ ./driver
Teamname: rubbishbin
Member 1: rubbish
Email 1: xxxxxx@xxx.xxx

Rotate: Version = naive_rotate: Naive baseline implementation:
Dim 64 128 256 512 1024 Mean
Your CPEs 0.7 0.8 1.9 3.0 3.8
Baseline CPEs 14.7 40.1 46.4 65.9 94.5
Speedup 21.7 52.3 24.0 21.9 25.0 27.2

Rotate: Version = rotate: Current working version:
Dim 64 128 256 512 1024 Mean
Your CPEs 0.7 0.8 1.9 3.3 3.9
Baseline CPEs 14.7 40.1 46.4 65.9 94.5
Speedup 20.8 48.6 24.1 20.0 24.5 26.0

Smooth: Version = smooth: Current working version:
Dim 32 64 128 256 512 Mean
Your CPEs 14.5 15.0 15.4 15.4 15.1
Baseline CPEs 695.0 698.0 702.0 717.0 722.0
Speedup 47.9 46.6 45.5 46.5 47.8 46.8

Smooth: Version = naive_smooth: Naive baseline implementation:
Dim 32 64 128 256 512 Mean
Your CPEs 14.5 15.0 14.9 14.4 14.9
Baseline CPEs 695.0 698.0 702.0 717.0 722.0
Speedup 47.8 46.6 47.1 49.8 48.3 47.9

Summary of Your Best Scores:
Rotate: 27.2 (naive_rotate: Naive baseline implementation)
Smooth: 47.9 (naive_smooth: Naive baseline implementation)

Baseline is configed in config.h

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
/*********************************************************
* config.h - Configuration data for the driver.c program.
*********************************************************/
#ifndef _CONFIG_H_
#define _CONFIG_H_

/*
* CPEs for the baseline (naive) version of the rotate function that
* was handed out to the students. Rd is the measured CPE for a dxd
* image. Run the driver.c program on your system to get these
* numbers.
*/
#define R64 14.7
#define R128 40.1
#define R256 46.4
#define R512 65.9
#define R1024 94.5

/*
* CPEs for the baseline (naive) version of the smooth function that
* was handed out to the students. Sd is the measure CPE for a dxd
* image. Run the driver.c program on your system to get these
* numbers.
*/
#define S32 695
#define S64 698
#define S128 702
#define S256 717
#define S512 722


#endif /* _CONFIG_H_ */

My solution :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
/***************
* ROTATE KERNEL
***************/

/******************************************************
* Your different versions of the rotate kernel go here
******************************************************/

/*
* naive_rotate - The naive baseline version of rotate
*/
char naive_rotate_descr[] = "naive_rotate: Naive baseline implementation";
void naive_rotate(int dim, pixel *src, pixel *dst)
{
int i, j;

for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++){
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
}

/*
* rotate - Your current working version of rotate
* IMPORTANT: This is the version you will be graded on
*/
char rotate_descr[] = "rotate: My implementation, Optimized";
void rotate(int dim, pixel *src, pixel *dst)
{
int i ;
int j ;
int bound = dim - 7; //unrolling by 8, since we can assume the dim is k*32 and make use of cache
//notice that unroll bound is N-step+1

int src_head ;
int dst_head ;

for ( i = 0 ; i < dim ; i++){

//compute start address for src and dst
src_head = RIDX(i, 0, dim);
dst_head = RIDX(dim - 1 , i, dim);

for (j = 0 ; j < bound ; j+=8){

int decreasement = j * dim;
int decreased_dst = dst_head - decreasement ;
int increased_src = src_head + j;

//all 8 statements can be executed independently(simultaneously) since there is no data dependency between them
dst[decreased_dst ] = src[increased_src];
dst[decreased_dst - dim] = src[increased_src + 1];
dst[decreased_dst - (dim << 1)] = src[increased_src + 2];
dst[decreased_dst - (dim << 1) - dim] = src[increased_src + 3];
dst[decreased_dst - (dim << 2)] = src[increased_src + 4];
dst[decreased_dst - (dim << 2) - dim] = src[increased_src + 5];
dst[decreased_dst - (dim << 2) - (dim << 1)] = src[increased_src + 6];
dst[decreased_dst - (dim << 3) + dim ] = src[increased_src + 7];
}

}

}
char copy_descr [] = "copyed from https://github.com/Exely/CSAPP-Labs/blob/master/notes/perflab.md";
void copyrotate(int dim, pixel *src, pixel *dst)
{
int i, j, a, b;
int sdim = dim - 1;
for (i = 0; i < dim; i += 8)
{
for (j = 0; j < dim; j += 8)
{
for (a = i; a < i + 8; a++)
{
for (b = j; b < j + 8; b++)
{
dst[RIDX(sdim - b, a, dim)] = src[RIDX(a, b, dim)];
}
}
}
}
}


/*********************************************************************
* register_rotate_functions - Register all of your different versions
* of the rotate kernel with the driver by calling the
* add_rotate_function() for each test function. When you run the
* driver program, it will test and report the performance of each
* registered test function.
*********************************************************************/

void register_rotate_functions()
{
add_rotate_function(&naive_rotate, naive_rotate_descr);
add_rotate_function(&rotate, rotate_descr);
add_rotate_function(&copyrotate, copy_descr);
/* ... Register additional test functions here */
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
$ ./driver
Rotate: Version = naive_rotate: Naive baseline implementation:
Dim 64 128 256 512 1024 Mean
Your CPEs 0.7 0.8 2.0 3.2 4.2
Baseline CPEs 0.6 0.8 1.9 3.0 4.3
Speedup 0.9 1.0 0.9 0.9 1.0 0.9

Rotate: Version = rotate: My implementation, Optimized:
Dim 64 128 256 512 1024 Mean
Your CPEs 0.8 1.0 2.2 3.3 4.4
Baseline CPEs 0.6 0.8 1.9 3.0 4.3
Speedup 0.8 0.8 0.9 0.9 1.0 0.9

Rotate: Version = copyed:
Dim 64 128 256 512 1024 Mean
Your CPEs 0.8 0.8 0.9 1.0 1.3
Baseline CPEs 0.6 0.8 1.9 3.0 4.3
Speedup 0.8 1.0 2.1 2.9 3.4 1.7

I don’t understand blocking well, my work seems to be a joke.

quote from Exely(author of the blocking code):

优化第一个函数用了循环分块的方法,这我不是很懂,书上好像也没找到相关内容,还是要多学习!

:crying_cat_face:


Figure out blocking in cache lab.