perflab

Performance lab how to

warm up for cache lab

Read the writeup

This assignment deals with optimizing memory intensive code.

In this lab, we will consider two image processing operations: rotate, which rotates an image counter-clockwise by 90◦ , and smooth, which “smooths” or “blurs” an image.

rotate 90

smooth picture

The only file you will be modifying and handing in is kernels.c.

The driver.c program is a driver program that allows you to evaluate the performance of your solutions.

Use the command make driver to generate the driver code and run it with the command ./driver.

Performance measures

You will need to remake driver each time you change the code in kernels.c.

performance

Overall performance is calculated by
$$
R = \sqrt[5]{R_{64} \times R_{128} \times R_{256} \times R_{512} \times R_{1024} }
$$

Versioning

You will be writing many versions of the rotate and smooth routines.

To help you compare the performance of all the different versions you’ve written, we provide a way of “registering” functions.

For example, the file kernels.c that we have provided you contains the following functions

1
2
3

void register_rotate_functions() {
add_rotate_function(&rotate, rotate_descr);
}

This function contains one or more calls to add rotate function.

When adding new version:

void new_version(void){...}
char new_descr [] = "new version";
void register_rotate_functions() {
add_rotate_function(&rotate, rotate_descr);
add_rotate_function(&new_version , new_descr);
}

Coding Rules

You may not use any embedded assembly language statements.
You are allowed to define macros, additional global variables, and other procedures in these files.
You can make the assumption of N that N = k * 32

Getting start

This lab is a legacy lab, so we need to compile it as 32-bit program

1 2	$ sudo apt install gcc-multilib # in order to support gcc -m32 option

The naive version’s performance:

$ make driver
$ ./driver
Teamname: rubbishbin
Member 1: rubbish
Email 1: xxxxxx@xxx.xxx

Rotate: Version = naive_rotate: Naive baseline implementation:
Dim		64	128	256	512	1024	Mean
Your CPEs	0.7	0.8	1.9	3.0	3.8
Baseline CPEs	14.7	40.1	46.4	65.9	94.5
Speedup		21.7	52.3	24.0	21.9	25.0	27.2

Rotate: Version = rotate: Current working version:
Dim		64	128	256	512	1024	Mean
Your CPEs	0.7	0.8	1.9	3.3	3.9
Baseline CPEs	14.7	40.1	46.4	65.9	94.5
Speedup		20.8	48.6	24.1	20.0	24.5	26.0

Smooth: Version = smooth: Current working version:
Dim		32	64	128	256	512	Mean
Your CPEs	14.5	15.0	15.4	15.4	15.1
Baseline CPEs	695.0	698.0	702.0	717.0	722.0
Speedup		47.9	46.6	45.5	46.5	47.8	46.8

Smooth: Version = naive_smooth: Naive baseline implementation:
Dim		32	64	128	256	512	Mean
Your CPEs	14.5	15.0	14.9	14.4	14.9
Baseline CPEs	695.0	698.0	702.0	717.0	722.0
Speedup		47.8	46.6	47.1	49.8	48.3	47.9

Summary of Your Best Scores:
  Rotate: 27.2 (naive_rotate: Naive baseline implementation)
  Smooth: 47.9 (naive_smooth: Naive baseline implementation)

Baseline is configed in config.h

/*********************************************************
 * config.h - Configuration data for the driver.c program.
 *********************************************************/
#ifndef _CONFIG_H_
#define _CONFIG_H_

/* 
 * CPEs for the baseline (naive) version of the rotate function that
 * was handed out to the students. Rd is the measured CPE for a dxd
 * image. Run the driver.c program on your system to get these
 * numbers.  
 */
#define R64    14.7
#define R128   40.1
#define R256   46.4
#define R512   65.9
#define R1024  94.5

/* 
 * CPEs for the baseline (naive) version of the smooth function that
 * was handed out to the students. Sd is the measure CPE for a dxd
 * image. Run the driver.c program on your system to get these
 * numbers.  
 */
#define S32   695
#define S64   698
#define S128  702
#define S256  717
#define S512  722


#endif /* _CONFIG_H_ */

My solution :

/***************
 * ROTATE KERNEL
 ***************/

/******************************************************
 * Your different versions of the rotate kernel go here
 ******************************************************/

/* 
 * naive_rotate - The naive baseline version of rotate 
 */
char naive_rotate_descr[] = "naive_rotate: Naive baseline implementation";
void naive_rotate(int dim, pixel *src, pixel *dst) 
{
    int i, j;
    
    for (i = 0; i < dim; i++)
	for (j = 0; j < dim; j++){
	    dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
        } 
}

/* 
 * rotate - Your current working version of rotate
 * IMPORTANT: This is the version you will be graded on
 */
char rotate_descr[] = "rotate: My implementation, Optimized";
void rotate(int dim, pixel *src, pixel *dst) 
{
    int i  ; 
    int j  ;
    int bound = dim - 7; //unrolling by 8, since we can assume the dim is k*32 and make use of cache
                         //notice that unroll bound is N-step+1

    int src_head ; 
    int dst_head ;

    for ( i = 0  ; i < dim ; i++){

            //compute start address for src and dst
            src_head = RIDX(i, 0, dim);
            dst_head = RIDX(dim - 1 , i, dim);

        for (j = 0 ; j < bound ; j+=8){

            int decreasement = j * dim;
            int decreased_dst = dst_head - decreasement ;
            int increased_src = src_head + j;

            //all 8 statements can be executed independently(simultaneously) since there is no data dependency between them
            dst[decreased_dst ] = src[increased_src];
            dst[decreased_dst - dim] = src[increased_src + 1];
            dst[decreased_dst  - (dim << 1)] = src[increased_src + 2];
            dst[decreased_dst  - (dim << 1) - dim] = src[increased_src + 3];
            dst[decreased_dst - (dim << 2)] = src[increased_src + 4];
            dst[decreased_dst - (dim << 2) - dim] = src[increased_src + 5];
            dst[decreased_dst - (dim << 2) - (dim << 1)] = src[increased_src + 6];
            dst[decreased_dst - (dim << 3) + dim ] = src[increased_src + 7];
        }

    }

}
char copy_descr [] = "copyed from https://github.com/Exely/CSAPP-Labs/blob/master/notes/perflab.md";
void copyrotate(int dim, pixel *src, pixel *dst)
{
    int i, j, a, b;
    int sdim = dim - 1;
    for (i = 0; i < dim; i += 8)
    {
        for (j = 0; j < dim; j += 8)
        {
            for (a = i; a < i + 8; a++)
            {
                for (b = j; b < j + 8; b++)
                {
                    dst[RIDX(sdim - b, a, dim)] = src[RIDX(a, b, dim)];
                }
            }
        }
    }
}


/*********************************************************************
 * register_rotate_functions - Register all of your different versions
 *     of the rotate kernel with the driver by calling the
 *     add_rotate_function() for each test function. When you run the
 *     driver program, it will test and report the performance of each
 *     registered test function.  
 *********************************************************************/

void register_rotate_functions() 
{
    add_rotate_function(&naive_rotate, naive_rotate_descr);   
    add_rotate_function(&rotate, rotate_descr);   
    add_rotate_function(&copyrotate, copy_descr);   
    /* ... Register additional test functions here */
}

$ ./driver
Rotate: Version = naive_rotate: Naive baseline implementation:
Dim		64	128	256	512	1024	Mean
Your CPEs	0.7	0.8	2.0	3.2	4.2
Baseline CPEs	0.6	0.8	1.9	3.0	4.3
Speedup		0.9	1.0	0.9	0.9	1.0	0.9

Rotate: Version = rotate: My implementation, Optimized:
Dim		64	128	256	512	1024	Mean
Your CPEs	0.8	1.0	2.2	3.3	4.4
Baseline CPEs	0.6	0.8	1.9	3.0	4.3
Speedup		0.8	0.8	0.9	0.9	1.0	0.9

Rotate: Version = copyed:
Dim		64	128	256	512	1024	Mean
Your CPEs	0.8	0.8	0.9	1.0	1.3
Baseline CPEs	0.6	0.8	1.9	3.0	4.3
Speedup		0.8	1.0	2.1	2.9	3.4	1.7

I don’t understand blocking well, my work seems to be a joke.

quote from Exely(author of the blocking code):

优化第一个函数用了循环分块的方法，这我不是很懂，书上好像也没找到相关内容，还是要多学习！

:crying_cat_face:

Figure out blocking in cache lab.