2021-08-05

csapp

Ch3PartIII

3.7 Procedures

Procedures are a key abstraction in software. They provide a way to package code that implements some functionality with a designated set of arguments and an optional return value.

Procedures come in many guises in different programming languages—functions, methods, subroutines, handlers, and so on—but they all share a general set of features

suppose procedure P calls procedure Q, and Q then executes and returns back to P. These actions involve one or more of the following mechanisms:

Passing control. The program counter must be set to the starting address of the code for Q upon entry and then set to the instruction in P following the call to Q upon return.
Passing data. P must be able to provide one or more parameters to Q, and Q must be able to return a value back to P.
Allocating and deallocating memory. Q may need to allocate space for local variables when it begins and then free that storage before it returns.

3.7.1 The Run-Time Stack

A key feature of the procedure-calling mechanism of C, and of most other languages, is that it can make use of the last-in, first-out memory management discipline provided by a stack data structure

As P calls Q, control and data information are added to the end of the stack. This information gets deallocated when P returns.

When an x86-64 procedure requires storage beyond what it can hold in registers, it allocates space on the stack. This region is referred to as the procedure’s stack frame

stack digram

The frame for the currently executing procedure is always at the top of the stack.
When procedure P calls procedure Q, it will push the return address onto the stack, indicating where within P the program should resume execution once Q returns. We consider the return address to be part of P’s stack frame, since it holds state relevant to P.
The stack frames for most procedures are of fixed size, allocated at the beginning of the procedure. Some procedures, however, require variable-size frames. This issue is discussed in Section 3.10.5.
many procedures have six or fewer arguments, and so all of their parameters can be passed in registers. Thus the “argument build area” can be omitted.

If more than 6 arguments are required, they can be stored by calling procedure within its stack frame prior to the call.
many functions do not even require a stack frame. This occurs when all of the local variables can be held in registers and the function does not call any other functions (sometimes referred to as a leaf procedure, in reference to the tree structure of procedure calls).

3.7.2 Control Transfer

Passing control from function P to function Q involves simply setting the program counter (PC) to the starting address of the code for Q

However, when it later comes time for Q to return, the processor must have some record of the code location where it should resume the execution of P.

call instruction pushes an address A onto the stack and sets the PC to the beginning of Q. The pushed address A is referred to as the return address and is computed as the address of the instruction immediately following the call instruction.

The counterpart instruction ret pops an address A off the stack and sets the PC to A

1
2
3

call Lable		# direct procedure call
call *Operand	# indirect procedure call
ret				# return from call

call and ret

#Beginning of function multstore
 0000000000400540 <multstore>: 
 400540: 53 		push %rbx 
 400541: 48 89 d3 	mov %rdx,%rbx 
...
#Return from function multstore
 40054d: c3 		retq 
...
#Call to multstore from main 
 400563: e8 d8 ff ff ff 	callq 400540 <multstore> 
 400568: 48 8b 54 24 08 	mov 0x8(%rsp),%rdx

The effect of the call is
- push the return address 0x400568 onto the stack
- jump to the first instruction in function multstore, at address 0x0400540
The effect of the ret is
- pops the value 0x400568 from the stack and jumps to this address, resuming the execution of main just after the call instruction

Practice Problem 3.32

The disassembled code for two functions first and last is shown below, along with the code for a call of first by function main:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Disassembly of last(long u, long v)
# u in %rdi, v in %rsi
 0000000000400540 <last>: 
 400540: 48 89 f8 		mov %rdi,%rax 		#L1: u 
 400543: 48 0f af c6 	imul %rsi,%rax 		#L2: u*v 
 400547: c3 			retq 				#L3: Return 
# Disassembly of last(long x)
# x in %rdi
  0000000000400548 <first>: 
 400548: 48 8d 77 01 	lea 0x1(%rdi),%rsi 	#F1: x+1 
 40054c: 48 83 ef 01 	sub $0x1,%rdi 		#F2: x-1 
 400550: e8 eb ff ff ff callq 400540 <last> #F3: Call last(x-1,x+1) 
 400555: f3 c3 			repz retq 			#F4: Return 
....
<main>:
 400560: e8 e3 ff ff ff callq 400548 <first>#M1: Call first(10)   
 400565: 48 89 c2 		mov %rax,%rdx 		#M2: Resume  
Each of these instructions is given a label. Starting with the calling of first(10) by main, fill in the following table to trace instruction execution through to the point where the program returns back to main.

Label PC Instruction %rdi %rsi %rax %rsp *%rsp Description

M1 0x400560 callq 10 X X 0x7fffffffe820 X Call first(10)

F1

F2

F3

L1

L2

L3

F4

M2

My solution : :white_check_mark:

Label	PC	Instruction	`%rdi`	`%rsi`	`%rax`	`%rsp`	`*%rsp`	Description
M1	0x400560	`callq`	10	X	X	0x7fffffffe820	X	Call `first(10)`
F1	0x400548	`lea`	10	X	X	0x7fffffffe818	0x400565	x + 1
F2	0x40054c	`sub`	10	11	X	0x7fffffffe818	0x400565	x - 1
F3	0x400550	`callq`	9	11	X	0x7fffffffe818	0x400565	Call `last(x-1,x+1)`
L1	0x400540	`mov`	9	11	X	0x7fffffffe810	0x400555	u
L2	0x400543	`imul`	9	11	9	0x7fffffffe810	0x400555	`u*v`
L3	0x400547	`retq`	9	11	99	0x7fffffffe810	0x400555	Return
F4	0x400555	`repz retq`	9	11	99	0x7fffffffe818	0x400565	Return
M2	0x400565	`mov`	9	11	99	0x7fffffffe820	X	Resume

3.7.3 Data Transfer

procedure calls may involve passing data as arguments, and returning from a procedure may also involve returning a value

With x86-64, most of these data passing to and from procedures take place via registers.
With x86-64, up to six integral (i.e., integer and pointer) arguments can be passed via registers.

They are used in a specified order.

When a function has more than six integral arguments, the other ones are passed on the stack.

The order of arguments are reversed, as shown in figure 3.25

int fun(int a1 , int a2 , int a3 , int a4 , int a5 , int a6 , int a7 , int a8 , int a9 );
//When calling fun
rdi = a1;
rsi = a2;
rdx = a3;
rcx = a4;
r8 = a5;
r9 = a6;
|-------stack-------|
|      ......       |
|_________a9________|
|_________a8________|
|_________a7________|
|_____resume addr___|

When passing parameters on the stack, all data sizes are rounded up to be multiples of eight.

Example

void proc(long a1, long *a1p,int a2, int *a2p,
	short a3, short *a3p,char a4, char *a4p)
{
*a1p += a1;
*a2p += a2;
*a3p += a3;
*a4p += a4;
}

/*
Arguments passed as follows:
a1 in %rdi (64 bits)
a1p in %rsi (64 bits)
a2 in %edx (32 bits)
a2p in %rcx (64 bits)
a3 in %r8w (16 bits)
a3p in %r9 (64 bits)
a4 at %rsp+8 ( 8 bits)
a4p at %rsp+16 (64 bits)
*/
proc:
	movq	16(%rsp), %rax
	addq	%rdi, (%rsi)
	addl	%edx, (%rcx)
	addw	%r8w, (%r9)
	movl	8(%rsp), %edx
	addb	%dl, (%rax)
	ret

stack frame

Practice Problem 3.33

A C function procprob has four arguments u, a, v, and b. Each is either a signed number or a pointer to a signed number, where the numbers have different sizes. The function has the following body
1
2
3
*u += a;
*v += b;
return sizeof(a) + sizeof(b);
It compiles to the following x86-64 code:
1
2
3
4
5
6
procprob:
	movslq %edi, %rdi
	addq %rdi, (%rdx)
	addb %sil, (%rcx)
	movl $6, %eax
	ret
Determine a valid ordering and types of the four parameters. There are two correct answers

My solution : :x:

procprob:
	movslq %edi, %rdi 
	addq %rdi, (%rdx)	# *p += ..	%rdi is a or b
	addb %sil, (%rcx)	# *p += ..	%rsi is a or b
	movl $6, %eax		# sizeof(a) + sizeof(b) = 6
	ret

addb %sil,(%rcx) just use %rsi‘s lower-order byte, so type of the second argument is unknown!

Solution on the book:

Let us first describe one answer and then explain the second possibility. If we assume the first addition (line 3) implements *u += a, while the second (line 4) implements *v += b, then we can see that a was passed as the first argument in %edi and converted from 4 bytes to 8 before adding it to the 8 bytes pointed to by %rdx. This implies that a must be of type int and u must be of type long *. We can also see that the low-order byte of argument b is added to the byte pointed to by %rcx. This implies that v must be of type char *, but the type of b is ambiguous—it could be 1, 2, 4, or 8 bytes long. This ambiguity is resolved by noting the return value of 6, computed as the sum of the sizes of a and b. Since we know a is 4 bytes long, we can deduce that b must be 2.

1	int procprobl(int a, short b, long u, char v)

Alternatively, we can see that the same assembly code would be valid if the two sums were computed in the assembly code in the opposite ordering as they are in the C code. This would result in interchanging arguments a and b and arguments u and v, yielding the following prototype:

1	int procprob(int b, short a, long v, char u);

3.7.4 Local Storage on the Stack

Typically, a procedure allocates space on the stack frame by decrementing the stack pointer. This results in the portion of the stack frame labeled “Local variables” in Figure 3.25.

example

Example2

As shown in Figure 3.30, arguments 7 and 8 are now at offsets 8 and 16 relative to the stack pointer, because the return address was pushed onto the stack.

3.7.5 Local Storage in Registers

Although only one procedure can be active at a given time, we must make sure that when one procedure (the caller) calls another (the callee), the callee does not overwrite some register value that the caller planned to use later.

By convention, registers %rbx, %rbp, and %r12–%r15 are classified as callee-saved registers. When procedure P calls procedure Q, Q must preserve the values of these registers, ensuring that they have the same values when Q returns to P as they did when Q was called.
- Procedure Q can preserve a register value by either not changing it at all or by pushing the original value on the stack, altering it, and then popping the old value from the stack before returning
- (The pushing of register values has the effect of creating the portion of the stack frame labeled “Saved registers” in Figure 3.25)(They are in Q’s frame, it is Q’s responsibility to save them!)
All other registers, except for the stack pointer %rsp, are classified as caller-saved registers. This means that they can be modified by any function
- Q can use these register freely
- It is incumbent upon P (the caller) to first save the data(move data from caller-saved to callee-saved ) before it makes the call.

As an example, consider the function P shown in Figure 3.34(a). It calls Q twice. During the first call, it must retain the value of x for use later. Similarly, during the second call, it must retain the value computed for Q(y).

Example

Note how they are popped in the reverse order from how they were pushed, to account for the last-in, first-out discipline of a stack.
what the purpose of subq $8,%rsp and addq $8, %rsp without using %rsp at all ? answer
- In order to align stack frame to $2^4=16$, 4 is gcc’s default value, this can be specified by -mpreferred-stack-boundary=n
- In this example, 8 bytes already for P’s return address, so P should allocate $16 -8 = 8$ bytes more.

Practice Problem 3.34

Consider a function P, which generates local values, named a0–a8. It then calls function Q using these generated values as arguments. Gcc produces the following code for the first part of P:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# long P(long x)
# x in %rdi
P:  
    pushq %r15
    pushq %r14
    pushq %r13
    pushq %r12
    pushq %rbp
    pushq %rbx
    subq $24, %rsp
    movq %rdi, %rbx
    leaq 1(%rdi), %r15 
    leaq 2(%rdi), %r14 
    leaq 3(%rdi), %r13 
    leaq 4(%rdi), %r12 
    leaq 5(%rdi), %rbp 
    leaq 6(%rdi), %rax 
    movq %rax, (%rsp) 
    leaq 7(%rdi), %rdx 
    movq %rdx, 8(%rsp) 
    movl $0, %eax 
    call Q 
    ...
A. Identify which local values get stored in callee-saved registers.

B. Identify which local values get stored on the stack.

C. Explain why the program could not store all of the local values in callee-saved registers.

My solution : :white_check_mark:

  	subq $24, %rsp			# 16 bytes for save local value, 8 bytes for alignment to 2^4
    movq %rdi, %rbx			# save first argument,x, in %rbx
    leaq 1(%rdi), %r15 		# save x+1 in %r15
    leaq 2(%rdi), %r14  	# save x+2 in %r14
    leaq 3(%rdi), %r13  	# save x+3 in %r13
    leaq 4(%rdi), %r12  	# save x+4 in %r12
    leaq 5(%rdi), %rbp  	# save x+5 in %rbp
    leaq 6(%rdi), %rax  	# save x+6 in stack, (%rsp)
    movq %rax, (%rsp)
    leaq 7(%rdi), %rdx		# save x+7 in stack, 8(%rsp) 
    movq %rdx, 8(%rsp)
    movl $0, %eax 
    call Q 
# only finite callee-saved registers can be used.

3.7.6 Recursive Procedures

The conventions we have described for using the registers and the stack allow x86-64 procedures to call themselves recursively

This method of implementing function calls and returns even works for more complex patterns, including mutual recursion (e.g., when procedure P calls Q, which in turn calls P).

recursion example

Practice Problem 3.35

For a C function having the general structure
1
2
3
4
5
6
7
long rfun(unsigned long x) {
if ( _____ )
return _______;
unsigned long nx = __________;
long rv = rfun(nx);
return __________;
}
gcc generates the following assembly code:
1
2
3
4
5
6
7
8
9
10
11
12
13
rfun:
	testq	%rdi, %rdi
	jne	.L8
	movl	$0, %eax
	ret
.L8:
	pushq	%rbx
	movq	%rdi, %rbx
	shrq	$2, %rdi
	call	rfun
	addq	%rbx, %rax
	popq	%rbx
	ret
A. What value does rfun store in the callee-saved register %rbx?

B. Fill in the missing expressions in the C code shown above.

My solution: :white_check_mark:

# long rfun(unsigned long x) 
rfun:
	testq	%rdi, %rdi	# if (x == 0) return 0;
	jne	.L8
	movl	$0, %eax
	ret
.L8:
	pushq	%rbx		
	movq	%rdi, %rbx	# save x in %rbx
	shrq	$2, %rdi	# x >>= 2;
	call	rfun		# rfun(x);
	addq	%rbx, %rax	# return rfun(x) + %rbx
	popq	%rbx
	ret

long rfun(unsigned long x) {
if ( x== 0 )
return x;
unsigned long nx = x >> 2;
long rv = rfun(nx);
return rv + x;
}

3.8 Array Allocation and Access

Arrays in C are one means of aggregating scalar data into larger data types. C uses a particularly simple implementation of arrays, and hence the translation into machine code is fairly straightforward

3.8.1 Basic Principles

For data type T and integer constant N

T A [N];

The declaration has two effects.

it allocates a contiguous region of $L \times N$ bytes in memory, where L is the size (in bytes) of data type T.
It introduces an identifier A that can be used as a pointer to the beginning of the array.

Array element i will be stored at address $A + L \times i$

The memory referencing instructions of x86-64 are designed to simplify array access.

For example, suppose E is an array of values of type int and we wish to evaluate E[i], where the address of E is stored in register %rdx and i is stored in register %rcx. Then the instruction

1	movl (%rdx,%rcx,4),%eax

will perform the address computation E + 4*i, read that memory location, and copy the result to register %eax. The allowed scaling factors of 1, 2, 4, and 8 cover the sizes of the common primitive data types.

Practice Problem 3.36

Consider the following declarations:
1
2
3
4
5
int P[5];
short Q[2];
int **R[9];
double *S[10];
short *T[2];
Fill in the following table describing the element size, the total size, and the address of element i for each of these arrays.

Array Element size Total size Start address Element i

P

Q

R

S

T

My solution : :white_check_mark:

Array	Element size	Total size	Start address	Element i
P	4	20	$x_p$	$x_p + 4 \times i$
Q	2	4	$x_q$	$x_q + 2 \times i$
R	8	72	$x_r$	$x_r + 8 \times i$
S	8	80	$x_s$	$x_s + 8 \times i$
T	8	16	$x_t$	$x_t + 8 \times i$

3.8.2 Pointer Arithmetic

C allows arithmetic on pointers, where the computed value is scaled according to the size of the data type referenced by the pointer.

That is, if p is a pointer to data of type T , and the value of p is $x_p$, then the expression p+i has value $x_p + L \times i$, where L is the size of data type T .

The array subscripting operation can be applied to both arrays and pointers. The array reference A[i] is identical to the expression *(A+i).

example

The final example shows that one can compute the difference of two pointers within the same data structure, with the result being data having type long and value equal to the difference of the two addresses divided by the size of the data type.

Note that &a will make a pointer with type a.

1
2
3

#define int Type
Type a ;
//&a have type of Type*

Example

    int array [10] ;
    printf("start address : \t%p\n" , array);
    printf("address of array[7] :\t%p\n" , &array[7]);
    printf("address of array+7 :\t%p\n" , array+7);
    printf("address of &array[0] :\t%p\n" , &array[0]);
    printf("address of &array[0] + 1 :\t%p\n" , &array[0] + 1);
    printf("address of &array[0] + 7*4 :\t%p\n" , &array[0] + 7 *4 );
    printf("sizeof(&array[0] + 1) == 8 ?  %u\t\n" , sizeof(&array[0] + 7*4) == 8 );
/* output
start address : 				0x7ffeefa45ae0
address of array[7] :			0x7ffeefa45afc
address of array+7 :			0x7ffeefa45afc
address of &array[0] :			0x7ffeefa45ae0
address of &array[0] + 1 :		0x7ffeefa45ae4
address of &array[0] + 7*4 :	0x7ffeefa45b50
sizeof(&array[0] + 1) == 8 ?  	1
*/

Practice Problem 3.37

Suppose $x_P$, the address of short integer array P, and long integer index i are stored in registers %rdx and %rcx, respectively. For each of the following expressions, give its type, a formula for its value, and an assembly-code implementation. The result should be stored in register %rax if it is a pointer and register element %ax if it has data type short.

Expression Type Value Assembly code

P[1]

P+3+i

P[i * 6 - 5]

P[2]

&P[i + 2]

My solution: :white_check_mark:

Expression	Type	Value	Assembly code
`P[1]`	`short`	$M[x_P + 2]$	`movw 2(%rdx),%ax`
`P+3+i`	`short*`	$x_P + (3+i) \times 2$	`leaq 6(%rdx,%rcx,2), %rax`
`P[i * 6 - 5]`	`short`	$M[ x_P + (i\times 6 - 5)\times 2]$
`P[2]`	`short`	$M[x_P + 2\times 2]$	`movw 4(%rdx),%ax`
`&P[i + 2]`	`short*`	$x_P + (i+2) \times 2$	`leaq 4(%rdx,%rcx,2),%rax`

# P[i*6 -5]
leaq	(%rcx,%rcx,2), %rcx
leaq	0(,%rcx,4), %rax
movzwl	-10(%rdx,%rax), %eax

3.8.3 Nested Arrays

The general principles of array allocation and referencing hold even when we create arrays of arrays

int A[5][3];
//equivalent to
typedef int row3_t[3];
row3_t A[5];
//but this type of alias is a bad idea
//the following one would be better
typedef struct array{
    int x [3];
}row3_t;

Data type row3_t is defined to be an array of three integers
Array A contains five such elements, each requiring 12 bytes to store the three integers. The total array size is then $4 \times 5 \times 3 = 60$ bytes.

The array elements are ordered in memory in row-major order, meaning all elements of row 0, which can be written A[0], followed by all elements of row 1 (A[1]), and so on

row major storage

This ordering is a consequence of our nested declaration. Viewing A as an array of five elements, each of which is an array of three int’s, we first have A[0], followed by A[1], and so on.

For an array declared as T D[R][C];
$$
&D[i][j] = D + L\times(C \times i + j)
$$

Programming languages or their standard libraries that support multi-dimensional arrays typically have a native row-major or column-major storage order for these arrays.

Row-major order is used in C/C++/Objective-C (for C-style arrays), PL/I,[4] Pascal,[5] Speakeasy,[*citation needed*] SAS,[6] and Rasdaman.[7]

Column-major order is used in Fortran, MATLAB,[8] GNU Octave, S-Plus,[9] R,[10] Julia,[11] and Scilab.[12]

So there is no way to declare a “column major” array in C

details

Practice Problem 3.38

Consider the following source code, where M and N are constants declared with
1
2
3
4
5
6
7
#define: ??

long P[M][N];
long Q[N][M];
long sum_element(long i, long j) {
return P[i][j] + Q[j][i];
}
In compiling this program, gcc generates the following assembly code:
1
2
3
4
5
6
7
8
9
10
11
12
13
#long sum_element(long i, long j)
#i in %rdi, j in %rsi
 sum_element: 
	leaq	0(,%rdi,8), %rdx
	subq	%rdi, %rdx
	addq	%rsi, %rdx
	leaq	(%rsi,%rsi,4), %rax
	addq	%rax, %rdi
	leaq	Q(%rip), %rax
	movq	(%rax,%rdi,8), %rax
	leaq	P(%rip), %rcx
	addq	(%rcx,%rdx,8), %rax
	ret
Use your reverse engineering skills to determine the values of M and N based on this assembly code.

My solution: :white_check_mark:

#long sum_element(long i, long j)
#i in %rdi, j in %rsi
 sum_element: 
	leaq	0(,%rdi,8), %rdx		# %rdx = 8*i
	subq	%rdi, %rdx				# %rdx = 7*i
	addq	%rsi, %rdx				# %rdx = 7*i + j
	leaq	(%rsi,%rsi,4), %rax		# %rax = 5*j
	addq	%rax, %rdi				# %rdi = i+5*j
	leaq	Q(%rip), %rax
	movq	(%rax,%rdi,8), %rax		# %rax = M[Q + 8*(5*j+i)]
	leaq	P(%rip), %rcx
	addq	(%rcx,%rdx,8), %rax		# %rax += M[P + 8 * (7*i+j)]
	ret

return P[i][j] + Q[j][i];
// <==>
//*(P+8*(N*i+j)) + *(Q+8*(M*j+i))
#define N 7
#define M 5

Test

#define N 7
#define M 5
long traversal1(){	//access with row major order
    long P[M][N];
    long sum = 0;
        for (int i = 0; i < M ; i++){
            for(int j = 0 ; j < N ; j++){
                sum += P[i][j];
            }
        }
    return sum;
}
long traversal2(){  //access with column major order
    long P[M][N];
    long sum = 0;
        for (int j = 0; j < N ; j++){
            for(int i = 0 ; i < M ; i++){
                sum += P[i][j];
            }
        }
    return sum;
}

traversal1:
	movl	$0, %r8d # int i = 0;
	movl	$0, %eax # long sum = 0;
.L7:
	cmpl	$4, %r8d # if (i > 4 ) goto .L12
	jg	.L12
	movl	$0, %ecx # int j = 0 ;
	cmpl	$6, %ecx # if (j > 6 ) goto .L13
	jg	.L13
	subq	$168, %rsp # first allocate 168 bytes, 8 bytes used to align stack
.L3:
	movslq	%ecx, %rdi 
	movslq	%r8d, %rsi
	leaq	0(,%rsi,8), %rdx # %rdx = 8*i
	subq	%rsi, %rdx       # %rdx = 7*i
	addq	%rdi, %rdx       # %rdx = 7*i + j
	addq	-120(%rsp,%rdx,8), %rax # sum += array[7*i + j] , allocate 120 bytes, array size = (5*7)*8 = 160 + 120 = 280, and the content is uninitialized.
	addl	$1, %ecx         # j++;
.L4:
	cmpl	$6, %ecx         # if(j <= 6) goto .L3 //loop continue
	jle	.L3
	addl	$1, %r8d         # i++;
	cmpl	$4, %r8d         # if (i > 4) goto .L1//break
	jg	.L14
	movl	$0, %ecx         # j = 0 ;
	jmp	.L4
.L14:
	addq	$168, %rsp       # deallocate stack space
	ret
.L13:
	addl	$1, %r8d        # j++;
	jmp	.L7
.L12:
	ret
traversal2:
	movl	$0, %r8d   # int j = 0
	movl	$0, %eax   # long sum = 0;
	jmp	.L21
.L28:
	addq	$168, %rsp # deallocate 168 bytes
	ret
.L27:
	addl	$1, %r8d  # j++
.L21:
	cmpl	$6, %r8d  # if (j > 6) goto .L26 //break
	jg	.L26
	movl	$0, %esi  # int i = 0;
	cmpl	$4, %esi  # if (i > 4) goto .L27
	jg	.L27
	subq	$168, %rsp # allocate 168 bytes for array
.L17:
	movslq	%r8d, %rdx # %rdx = j
	movslq	%esi, %rdi # %rdi = i
	leaq	0(,%rdi,8), %rcx # %rcx = 8*i
	subq	%rdi, %rcx # %rcx = 7*i
	addq	%rcx, %rdx # %rdx = j + 7*i
	addq	-120(%rsp,%rdx,8), %rax # allocate 120 bytes, sum += array[7*i + j] , gcc optimize the codes ... there's no difference between the two version....
	addl	$1, %esi 
.L18:
	cmpl	$4, %esi
	jle	.L17
	addl	$1, %r8d
	cmpl	$6, %r8d
	jg	.L28
	movl	$0, %esi
	jmp	.L18
.L26:
	ret

Even without optimization, gcc generate almost the same code for traversal1 and traversal2, the difference in time consuming do not show in machine level code. They have much to do with the cache mechanism.

3.8.4 Fixed-Size Arrays

The C compiler is able to make many optimizations for code operating on multidimensional arrays of fixed size.

optimization example

Practice Problem 3.40

The following C code sets the diagonal elements of one of our fixed-size arrays to val:
1
2
3
4
5
6
7
8
#define N 16
typedef int fix_matrix[N][N];
/* Set all diagonal elements to val */
void fix_set_diag(fix_matrix A, int val) {
long i;
for (i = 0; i < N; i++)
A[i][i] = val;
}
When compiled with optimization level -O1, gcc generates the following assembly code:
1
2
3
4
5
6
7
8
9
fix_set_diag:
	movq	%rdi, %rax
	addq	$1088, %rdi
.L2:
	movl	%esi, (%rax)
	addq	$68, %rax
	cmpq	%rdi, %rax
	jne	.L2
	rep ret
Create a C code program fix_set_diag_opt that uses optimizations similar to those in the assembly code, in the same style as the code in Figure 3.37(b). Use expressions involving the parameter N rather than integer constants, so that your code will work correctly if N is redefined.

My solution :

fix_set_diag:
	movq	%rdi, %rax		# %rax = A
	addq	$1088, %rdi		# A += 1088 (1088 = 16*16*4+16*4)
	# 16*16*4 is the size of the matrix, +(16*4) gives the next access address 
.L2:
	movl	%esi, (%rax)	# *A = val
	addq	$68, %rax		# A += 68 (68 = 16 * 4 + 4)
	# skip the last accessed space (4 bytes) , add a row (16*4)
	cmpq	%rdi, %rax		# if ( A != 1088) continue
	jne	.L2
	rep ret

#define N 16
typedef int fix_matrix[N][N];
/* Set all diagonal elements to val */
void fix_set_diag(fix_matrix A, int val) {
    int * start = &A[0][0];
	for ( int i = 0 ; i != (N*N + N) ; i += (N+1)){
        *(start+i) = val;
    }
}

$ diff my.s test.s
1c1
< 	.file	"my.c"
---
> 	.file	"test.c"

Solution on the book:

/* Set all diagonal elements to val */
void fix_set_diag_opt(fix_matrix A, int val) {
int *Abase = &A[0][0];
long i = 0;
long iend = N*(N+1);
do {
Abase[i] = val;
i += (N+1);
} while (i != iend);
}
//use do-while to indicate that the initial test can be omitted

3.8.5 Variable-Size Arrays

Historically, C only supported multidimensional arrays where the sizes (with the possible exception of the first dimension) could be determined at compile time.

Programmers requiring variable-size arrays had to allocate storage for these arrays using functions such as malloc or calloc, and they had to explicitly encode the mapping of multidimensional arrays into single-dimension ones via row-major indexing
ISO C99 introduced the capability of having array dimension expressions that are computed as the array is being allocated.
1
2
3
4
int n ;
scanf("%d" , &n);
int array [n];
//These codes will compile without error

int A[expr1][expr2].The dimensions of the array are determined by evaluating the expressions expr1 and expr2 at the time the declaration is encountered.

For example

//The parameter n must precede the parameter A[n][n], so that the function can compute the array dimensions as the parameter is encountered.
int var_ele(long n, int A[n][n], long i, long j) {
return A[i][j];
}


#int var_ele(long n, int A[n][n], long i, long j)
#n in %rdi, A in %rsi, i in %rdx, j in %rcx
var_ele:
 imulq %rdx, %rdi 			#Compute n*i
 leaq (%rsi,%rdi,4), %rax 	#Compute A + 4(n*i)
 movl (%rax,%rcx,4), %eax 	#Read from M[xA + 4(n*i) + 4j ]
 ret

The dynamic version must use a multiplication instruction to scale i by n, rather than a series of shifts and adds. In some processors, this multiplication can incur a significant performance penalty, but it is unavoidable in this case.

When variable-size arrays are referenced within a loop, the compiler can often optimize the index computations by exploiting the regularity of the access patterns.

Example

/* Compute i,k of variable matrix product */
 int var_prod_ele(long n, int A[n][n], int B[n][n], long i, long k) {
 long j;
 int result = 0;

 for (j = 0; j < n; j++)
 result += A[i][j] * B[j][k];

 return result;
}

# compiled with -O1
var_prod_ele:
	testq	%rdi, %rdi		# if(n<=0) return 0;
	jle	.L4
	salq	$2, %rcx		# %rcx = 4*i
	imulq	%rdi, %rcx		# %rcx = 4*i * n
	addq	%rcx, %rsi		# %rsi = A + 4*i * n
	leaq	0(,%rdi,4), %r9	# %r9 = 4*n
	leaq	(%rdx,%r8,4), %r8	# %r8 = B + 4*k 
	movl	$0, %edx		# %rdx = 0 //j = 0
	movl	$0, %eax		# %rax = 0
.L3:
	movl	(%rsi,%rdx,4), %ecx	# %rax = A[i * n + j]
	imull	(%r8), %ecx			# %rcx = B[k]*A[i*n + j]
	addl	%ecx, %eax			# res += %rcx
	addq	$1, %rdx			# j++
	addq	%r9, %r8			# %r8 = B + 4*k + 4*n
	cmpq	%rdx, %rdi			# if(n == j)  break
	jne	.L3						# else continue
	rep ret
.L4:
	movl	$0, %eax
	ret

int var_prod_ele(long n, int A[n][n], int B[n][n], long i, long k){
    if(n<=0) return 0;
    int j = 0;
    int res = 0;
    int * start_A = &A[i][j];
    int * start_B = &B[j][k];
    do{
        res += (*(start_A+j)) * (*start_B + n*j);
		 j++;
    }while(j != n);
    return res;
}

Whether gcc generates the pointer-based code of Figure 3.37(b) or the array-based code of Figure 3.38(b), these optimizations will significantly improve program performance.

3.9 Heterogeneous Data Structures

C provides two mechanisms for creating data types by combining objects of different types:

struct aggregate multiple objects into a single unit
union allow an object to be referenced using several different types.

3.9.1 Structures

The implementation of structures is similar to that of arrays in that all of the components of a structure are stored in a contiguous region of memory and a pointer to a structure is the address of its first byte

The compiler maintains information about each structure type indicating the byte offset of each field.

For example

struct rec {
int i;
int j;
int a[2];
int *p;
};

struct example

To access the fields of a structure, the compiler generates code that adds the appropriate offset to the address of the structure.

1 2	struct rec * r ; r->j = r->i;

1
2
3

# Registers: r in %rdi
movl (%rdi), %eax 	#Get r->i
movl %eax, 4(%rdi) 	#Store in r->j

1	int * ptr = &(r->a[1]);

1	leaq 8(%rdi,%rsi,4), %rax #Set %rax to &r->a[i],r in %rdi, i %rsi

As these examples show, the selection of the different fields of a structure is handled completely at compile time. The machine code contains no information about the field declarations or the names of the fields, only offset.

Practice Problem 3.41 :white_check_mark:

Consider the following structure declaration:
1
2
3
4
5
6
7
8
struct test {
	short *p;
	struct {
		short x;
		short y;
		} s;
	struct test *next;
};
This declaration illustrates that one structure can be embedded within another, just as arrays can be embedded within structures and arrays can be embedded within arrays.

The following procedure (with some expressions omitted) operates on this structure:
1
2
3
4
5
void st_init(struct test *st) {
st->s.y = ______;
st->p = ______;
st->next = _______;
}
A. What are the offsets (in bytes) of the following fields?

B. How many total bytes does the structure require?

C. The compiler generates the following assembly code for st_init:
1
2
3
4
5
6
7
st_init:
	movzwl	8(%rdi), %eax
	movw	%ax, 10(%rdi)
	leaq	10(%rdi), %rax
	movq	%rax, (%rdi)
	movq	%rdi, 16(%rdi)
	ret
On the basis of this information, fill in the missing expressions in the code for st_init.

//A
/*
p:0
s.x:8	
s.y:10
next:12
*/
//B
//It requires 20 bytes
//C
void st_init(struct test *st) {
st->s.y = st->s.x;
st->p = &(st->s.y);
st->next = st;
}

Perhaps the compiler pads some space to align with the max size in the struct, that is align to 8.

gcc gives that sizeof(test) == 24

Practice Problem 3.42

The following code shows the declaration of a structure of type ACE and the prototype for a function test:
1
2
3
4
5
struct ACE {
short v;
struct ACE *p;
};
short test(struct ACE *ptr);
gcc generates the following assembly code:
1
2
3
4
5
6
7
8
9
10
test:
	movl	$1, %eax
	jmp	.L2
.L3:
	imulw	(%rdi), %ax
	movq	8(%rdi), %rdi
.L2:
	testq	%rdi, %rdi
	jne	.L3
	rep ret
A. Use your reverse engineering skills to write C code for test.

B. Describe the data structure that this structure implements and the operation performed by test.

My solution: :white_check_mark:

short test(struct ACE *ptr){
	short res = 1;
    while(ptr != NULL){
        res *= ptr->v;
        ptr = ptr->p;
    }
    return res;
}
//the data structure is linked list
//test function traverse the linked list and accumulate the v part

3.9.2 Unions

Unions provide a way to circumvent the type system of C, allowing a single object to be referenced according to multiple types.

The syntax of a union declaration is identical to that for structures
Rather than having the different fields reference different blocks of memory, they all reference the same block.
the overall size of a union equals the maximum size of any of its fields.

Usage of unions

Unions can be useful in several contexts. However, they can also lead to nasty bugs, since they bypass the safety provided by the C type system

the use of two different fields in a data structure will be mutually exclusive.

Then, declaring these two fields as part of a union rather than a structure will reduce the total space allocated
- For example, suppose we want to implement a binary tree data structure where each leaf node has two double data values and each internal node has pointers to two children but no data.
- if we declare the node as
  1
  2
  3
  4
  5
  struct node_s {
  struct node_s *left;
  struct node_s *right;
  double data[2];
  };
  then every node requires 32 bytes, with half the bytes wasted for each type of node.
- if we declare the node as
  1
  2
  3
  4
  5
  6
  7
  union node_u {
  struct {
  union node_u *left;
  union node_u *right;
  } internal;
  double data[2];
  };
  then every node will require just 16 bytes
- However, there is no way to determine whether a given node is a leaf or an internal node.
- A common method is to introduce an enumerated type defining the different possible choices for the union, and then create a structure containing a tag field and the union:
  1
  2
  3
  4
  5
  6
  7
  8
  9
  10
  11
  typedef enum { N_LEAF, N_INTERNAL } nodetype_t;
  struct node_t {
  nodetype_t type;
  union {
  struct {
  struct node_t *left;
  struct node_t *right;
  } internal;
  double data[2];
  } info;
  };
  This structure requires a total of 24 bytes: 4 for type, and either 8 each for info.internal.left and info.internal.right or 16 for info.data(additional 4 bytes for padding)
- In this case, the savings gain of using a union is small relative to the awkwardness of the resulting code. For data structures with more fields, the savings can be more compelling.
Unions can also be used to access the bit patterns of different data types
- For example, suppose we use a function to convert a value d of type double to a value u of type unsigned long:
  1
  2
  3
  4
  5
  6
  7
  8
  unsigned long double2bits(double d) {
  union {
  double d;
  unsigned long u;
  } temp;
  temp.d = d;
  return temp.u;
  };
  In this code, we store the argument in the union using one data type and access it using another
When using unions to combine data types of different sizes, byte-ordering issues can become important.
- For example, suppose we write a procedure that will create an 8-byte double using the bit patterns given by two 4-byte unsigned values:
  1
  2
  3
  4
  5
  6
  7
  8
  9
  10
  double uu2double(unsigned word0, unsigned word1)
  {
  union {
  double d;
  unsigned u[2];
  } temp;
  temp.u[0] = word0;
  temp.u[1] = word1;
  return temp.d;
  }
  On a little-endian machine, such as an x86-64 processor, argument word0 will become the low-order 4 bytes of d, while word1 will become the high-order 4 bytes

Practice Problem 3.43

Suppose you are given the job of checking that a C compiler generates the proper code for structure and union access. You write the following structure declaration:
1
2
3
4
5
6
7
8
9
10
11
typedef union {
struct {
long u;
short v;
char w;
} t1;
struct {
int a[2];
char *p;
} t2;
} u_type;
You write a series of functions of the form
1
2
3
void get(u_type *up, type *dest) {
*dest = expr;
}
with different access expressions expr and with destination data type type set according to type associated with expr. You then examine the code generated when compiling the functions to see if they match your expectations.

Suppose in these functions that up and dest are loaded into registers %rdi and %rsi, respectively. Fill in the following table with data type type and sequences of one to three instructions to compute the expression and store the result at dest.
1
2
3
4
5
6
7
8
9
10
11
12
expr 				type 				Code
up->t1.u 			long 				movq (%rdi), %rax
										movq %rax, (%rsi)
up->t1.v

&up->t1.w

up->t2.a

up->t2.a[up->t1.u]

*up->t2.p

My solution : :warning:

/*
t1:
0         8     12    16
u________ v__00 w_000
t2:
0         4         8         16
a[0] ____ a[1] ____ p ________
*/
void get1(u_type * up, long * dest) {
*dest = up->t1.u;
// movq (%rdi),%rax
// movq %rax, (%rsi)
}
void get2(u_type * up, short * dest) {
*dest = up->t1.v;
// movw 8(%rdi),%ax
// movw %ax, (%rsi)
}
void get3(u_type * up, char* * dest) {
*dest = &up->t1.w;
// leaq 12(%rdi),%rax
// movq %rax, (%rsi)
}
void get4(u_type * up , int * * dest ){
*dest = up->t2.a;
// movq %rdi , (%rsi)
}
void get5(u_type * up , int  * dest ){
*dest = up->t2.a[up->t1.u];
// movq (%rdi) , %rax
// movl (%rdi , %rax, 4) , %eax
// movl %eax, (%rdi)
}
void get6(u_type * up , char  * dest ){
*dest = *up->t2.p;
// leaq 8(%rdi), %rax
// movb (%rax), %al
// movb %al , (%rsi)
}

in get3 gcc generate leaq 10(%rdi),%rax rather then add 12 to %rdi, indicating my padding strategy might be wrong.

char ‘s address need to be multiple of 1, that is, any value.

/*
t1:
0         8   10    16
u________ v__ w_00000
t2:
0         4         8         16
a[0] ____ a[1] ____ p ________
*/

3.9.3 Data Alignment

Many computer systems place restrictions on the allowable addresses for the primitive data types, requiring that the address for some objects must be a multiple of some value K (typically 2, 4, or 8)

For example, suppose a processor always fetches 8 bytes from memory with an address that must be a multiple of 8. If we can guarantee that any double will be aligned to have its address be a multiple of 8, then the value can be read or written with a single memory operation. Otherwise, we may need to perform two memory accesses, since the object might be split across two 8-byte memory blocks.

The x86-64 hardware will work correctly regardless of the alignment of data
However, Intel recommends that data be aligned to improve memory system performance

Their alignment rule is based on the principle that any primitive object of K bytes must have an address that is a multiple of K.
1
2
3
4
5
K Types
1 char
2 short
4 int, float
8 long, double, char *
The compiler places directives in the assembly code indicating the desired alignment for global data.
1
.align 8
This ensures that the data following it will start with an address that is a multiple of 8

For code involving structures, the compiler may need to insert gaps in the field allocation to ensure that each structure element satisfies its alignment requirement.

For example
1
2
3
4
5
6
7
8
9
struct S1 {
int i;
char c;
int j;
};
/*
0 4 5 9
i____c_j____
*/
This won’t satisfy the align principle. Instead, the compiler inserts a 3-byte gap between fields c and j:
1
2
0 4 8 12
i____c_000j____
In addition, the compiler may need to add padding to the end of the structure so that each element in an array of structures will satisfy its alignment requirement.

For example
1
2
3
4
5
struct S2 {
int i;
int j;
char c;
};
If we pack this structure into 9 bytes, we can still satisfy the alignment requirements, but the next element in the array won’t stick to the requirement.
1
2
0 4 8 9
i____j____c_ i____....
So the compiler pads S2 to 12 bytes, adding 3 bytes 0 at the end.

Practice Problem 3.44

For each of the following structure declarations, determine the offset of each field, the total size of the structure, and its alignment requirement for x86-64:
1
2
3
4
5
A. struct P1 { short i; int c; int *j; short *d; };
B. struct P2 { int i[2]; char c[8]; short s[4]; long *j; };
C. struct P3 { long w[2]; int *c[2] };
D. struct P4 { char w[16]; char *c[2] };
E. struct P5 { struct P4 a[2]; struct P1 t };

My solution : :white_check_mark:

//A
0	  4		8		   16		  24
i__00 c____ j ________ d ________
//B
	 0     4		8				16				24		 32
i[0] ____ i[1] ____ c[0~7] ________ s[0~3] ________ j ________
//C
0				8			 16				24			32
w[0] ________ w[1] ________ c[0] ________ c[1] ________
//D
0						 16				24			32
w[0~15] ________________ c[0] ________ c[1] ________
//E
0	32	 64 88
a[0] a[1] t

Verification:

struct P1 { short i; int c; int *j; short *d; };
struct P2 { int i[2]; char c[8]; short s[4]; long *j; };
struct P3 { long w[2]; int *c[2]; };
struct P4 { char w[16]; char *c[2] ;};
struct P5 { struct P4 a[2]; struct P1 t; };

void access(struct P1 * ptr){
    ptr->i = 123;
    ptr->c = 456;
    ptr->j = 0x0;
    ptr->d = 0x0;
}

access:
	movw	$123, (%rdi)
	movl	$456, 4(%rdi)
	movq	$0, 8(%rdi)
	movq	$0, 16(%rdi)
	ret

Practice Problem 3.45

Answer the following for the structure declaration
1
2
3
4
5
6
7
8
9
10
struct {
int *a;
float b;
char c;
short d;
long e;
double f;
int g;
char *h;
} rec;
A. What are the byte offsets of all the fields in the structure?

B. What is the total size of the structure?

C. Rearrange the fields of the structure to minimize wasted space, and then show the byte offsets and total size for the rearranged structure.

My solution : :white_check_mark:

1 2	0 8 12 14 16 24 32 40 a ________ b ____ c _0 d __ e ________ f ________ g ____0000 h ________ 48

After rearrangement

struct {
    int * a;
    char * h;
    double f;
    long e ;
    float b ;
    int g;
    short d;
    char c;
}
0                                                        40   42
a ________ h ________ f ________ e ________ b ____ g ____ d __ c _0 0000 48

Can’t decrease the size any more.

Verification:

typedef struct {
    int * a;
    char * h;
    double f;
    long e ;
    float b ;
    int g;
    short d;
    char c;
} rec;
/*typedef struct {
int *a;
float b;
char c;
short d;
long e;
double f;
int g;
char *h;
} rec;
*/
void access ( rec * ptr ){
    ptr->a = 0x0;
    ptr->b = 1.1;
    ptr->c = 'A';
    ptr->d = 123;
    ptr->e = 456;
    ptr->f = 2.2;
    ptr->g = 789;
    ptr->h = 0x0;
}
#include <stdio.h>
int main(){
    rec r;
    printf("%lu\n" , sizeof(r));
}

1 2	$ gcc -Og test.c && ./a.out 48