DataWrangling

Data Wrangling

All kinds of transforming data from one representation into another one can be called data wrangling

Sed

sed is a stream-style editor, it is a programming language

1
2
3
4
# The most common purpose you will use sed for is to replace
$ echo "hello world" | sed -E 's/\s/|/'
hello|world
# -E means use extended regular expressions instead of old version

Normally, regular expression will match only once per line, if you want to make it operate everywhere it matches, use g

This is same in Vim !

1
2
3
4
$ echo "hello world" | sed -E 's/[eo]/V/'
hVllo world
$ echo "hello world" | sed -E 's/[eo]/V/g'
hVllV wVrld

Capture group

anything wrappered with () is a capture group, which can be referred in replacement later

1
2
3
4
$ echo "hello world, told by John" | sed -E 's/(.*), .* by (.*)/\2/'
John
# \1 refer to the first capture group
# \2 refer to the second one

Tip

If your regular expression becomes really complex you shouldn’t use it , maybe you should consider other tools

If you’re fetching HTML data, pup might be helpful. For JSON data, try jq

Sort

sort text in lexicographical order

1
2
3
4
5
6
7
8
9
10
11
12
$ cat text | sort
1
12
3
I
Today
Went
angeles
in
los
restaurant
to

Uniq

eliminate duplicate lines

1
2
3
4
5
6
7
8
9
10
11
12
13
14
$ cat text | uniq -c
3 Today
1 I
1 restaurant
1 Went
1 to
1 I
1 1
2 restaurant
1 3
1 in
2 12
1 los
1 angeles
1
2
3
4
5
6
7
8
9
10
11
12
# sort numerically (-n)
$ cat number| sort -n
7
15
42
45
54
65
436
534
3241
67453

Awk

awk is column focus

awk is a powerful programming language for text processing

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ python -c "for i,v in enumerate([x for x in range(10,20)]):print(i,v)"
(0, 10)
(1, 11)
(2, 12)
(3, 13)
(4, 14)
(5, 15)
(6, 16)
(7, 17)
(8, 18)
(9, 19)
# print the second column
$ !! | awk '{print $2}'
python -c "for i,v in enumerate([x for x in range(10,20)]):print(i,v)" | awk '{print $2}'
10)
11)
12)
13)
14)
15)
16)
17)
18)
19)

Paste

like python join, paste multiple lines into single line separated by a deliminator

1
2
3
4
$ paste -s -d '|' text
Today|Today|Today|I|restaurant|Went|to|I|1|restaurant|restaurant|3|in|12|12|los|angeles
$ python -c "for i,v in enumerate([x for x in range(10,20)]):print(i,v)" | awk '{print $2}' | paste -sd,
10),11),12),13),14),15),16),17),18),19)

bc

you can just memorize it by “Berkeley Calculator”

You will always want to add -l flag to include the math library

1
2
3
4
5
6
7
8
9
10
11
12
13
$ bc
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
1 + 3
4
2 / 3
0
2^3
8
3-2-4
-3

It can be powerful when combined with paste or sed

1
2
3
# append text to line via sed
$ echo "11" | sed -E 's/$/+4/' | bc -l
15
1
2
3
4
$ paste -sd+ number 
3241+54+15+42+534+65+436+45+67453+7
$ paste -sd+ number | bc -l
71892

Gnuplot

Xargs

turn input into arguments

1
2
3
4
5
6
7
8
9
$ python -c "for i,v in enumerate([x for x in range(10,20)]):print(i , v)"  | awk '{print $1}' | sed -E 's/\(([0-9]),/\1/' | xargs
0 1 2 3 4 5 6 7 8 9

$ cat requirement.txt
requests==1.1
six==8.2.1
pandas==5.3
tensorflow==2.0.0
$ cat requirement.txt | sed -E 's/(.*)==.*/\1/' | xargs pip show

Tee

write the input to stdout and a file

1
2
3
4
$ echo "hi" | tee hilog
hi
$ cat hilog
hi
1
2
3
4
$ echo "3+4" | tee expressionlog |  bc -l
7
$ cat expressionlog
3+4

ffempeg