Data Wrangling
Any transformation of data from one representation into another can be called data wrangling.
Sed
sed is a stream editor, and it is also a small programming language in its own right
# The most common use of sed is search-and-replace
By default, the substitution replaces only the first match on each line. To replace every match, add the g flag, as in s/old/new/g
The same applies to :s in Vim!
$ echo "hello world" | sed -E 's/[eo]/V/'
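To make the effect of the g flag concrete, here is a minimal sketch that runs the substitution above with and without it. The variable names are only for illustration.

```shell
# Without g: only the first match on the line is replaced.
first_only=$(echo "hello world" | sed -E 's/[eo]/V/')
echo "$first_only"   # hVllo world

# With g: every match on the line is replaced.
all_matches=$(echo "hello world" | sed -E 's/[eo]/V/g')
echo "$all_matches"  # hVllV wVrld
```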
Capture group
anything wrapped in () is a capture group, which can be referenced later in the replacement as \1, \2, and so on
$ echo "hello world, told by John" | sed -E 's/(.*), .* by (.*)/\2/'
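Both groups can be used in the same replacement. The following sketch reuses the input above and reorders the two captured parts. The wording "said:" is just an illustrative choice.

```shell
# \1 captures "hello world" and \2 captures "John".
# The replacement swaps their order.
reordered=$(echo "hello world, told by John" | sed -E 's/(.*), .* by (.*)/\2 said: \1/')
echo "$reordered"  # John said: hello world
```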
Tip
If your regular expression becomes really complex, it is probably the wrong tool for the job, so consider other tools
If you’re fetching HTML data, pup might be helpful. For JSON data, try jq
Sort
sort text in lexicographical order
$ cat text | sort
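Lexicographical order compares lines character by character, which can be surprising for numbers. A small sketch with made-up input:

```shell
# Words come out in dictionary order.
words=$(printf 'pear\napple\nfig\n' | sort)
echo "$words"     # apple, fig, pear (one per line)

# Numbers are compared as strings, so "10" sorts before "2" and "9".
numbers=$(printf '9\n10\n2\n' | sort)
echo "$numbers"   # 10, 2, 9 (one per line)
```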
Uniq
eliminate adjacent duplicate lines (sort the input first to catch all duplicates); -c prefixes each line with its count
$ cat text | uniq -c
# sort numerically (-n)
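A common combination is sort, then uniq -c, then sort -n to rank lines by how often they occur. The sketch below uses made-up input. The trailing awk only normalizes the padding that uniq -c puts in front of each count.

```shell
# uniq only collapses adjacent duplicates, hence the first sort.
# sort -n then orders the "count value" lines by their numeric count.
counts=$(printf 'b\na\nb\nc\nb\na\n' | sort | uniq -c | sort -n | awk '{print $1, $2}')
echo "$counts"
# 1 c
# 2 a
# 3 b
```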
Awk
awk is column-oriented: it splits every line into whitespace-separated fields, available as $1, $2, and so on
awk is also a powerful programming language for text processing
$ python -c "for i,v in enumerate([x for x in range(10,20)]):print(i,v)"
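As a rough illustration of the column focus, the sketch below filters on one column and sums another. The two-column name/score input is hypothetical.

```shell
# Hypothetical input: a name and a score per line.
data='alice 30
bob 25
carol 41'

# A pattern before the braces selects lines. Here: score greater than 28.
high=$(echo "$data" | awk '$2 > 28 {print $1}')
echo "$high"    # alice, then carol

# Sum the second column. The END block runs after the last line.
total=$(echo "$data" | awk '{sum += $2} END {print sum}')
echo "$total"   # 96
```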
Paste
like Python's join: with -s, paste merges multiple lines into a single line separated by a delimiter (-d)
$ paste -s -d '|' text
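The same command also works on standard input. In this sketch the three input lines are made up, and the trailing "-" tells paste to read from stdin.

```shell
# -s joins all lines of the input serially; -d sets the delimiter.
joined=$(printf 'a\nb\nc\n' | paste -s -d '|' -)
echo "$joined"  # a|b|c
```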
bc
a handy mnemonic is “Berkeley Calculator”, although the name is usually expanded as “basic calculator”
You will almost always want the -l flag, which loads the math library and sets the default scale to 20 decimal digits
$ bc
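The difference the -l flag makes shows up most clearly with division. A minimal sketch, assuming bc is installed:

```shell
# Without -l, scale defaults to 0, so division truncates.
int_result=$(echo "7/2" | bc)
echo "$int_result"    # 3

# With -l, scale is 20, so the fractional part is kept.
float_result=$(echo "7/2" | bc -l)
echo "$float_result"  # 3.5 followed by zeros
```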
bc becomes powerful when combined with paste or sed
# append text to line via sed
$ paste -sd+ number
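One way to put the two together is to let paste build an arithmetic expression and hand it to bc. The sketch below assumes bc is installed and uses made-up numbers in place of a file.

```shell
# Hypothetical input: one number per line, joined with "+".
expression=$(printf '1\n2\n3\n' | paste -sd+ -)
echo "$expression"   # 1+2+3

# bc evaluates the joined expression.
sum=$(echo "$expression" | bc -l)
echo "$sum"          # 6
```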
Gnuplot
Xargs
turn lines of standard input into arguments for another command
$ python -c "for i,v in enumerate([x for x in range(10,20)]):print(i , v)" | awk '{print $1}' | sed -E 's/\(([0-9]),/\1/' | xargs
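A smaller sketch of the same idea, with made-up input. By default xargs passes everything to a single command invocation, and -n 1 runs the command once per argument.

```shell
# All input lines become arguments of one echo call.
one_call=$(printf 'a\nb\nc\n' | xargs echo)
echo "$one_call"   # a b c

# -n 1 runs the command once per argument instead.
per_arg=$(printf 'a\nb\nc\n' | xargs -n 1 echo item)
echo "$per_arg"    # item a, item b, item c (one per line)
```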
Tee
write the input to stdout and a file
$ echo "hi" | tee hilog
$ echo "3+4" | tee expressionlog | bc -l
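The sketch below checks both destinations: the data keeps flowing down the pipeline and also lands in the file. The temporary file stands in for a log file and is purely illustrative.

```shell
log=$(mktemp)   # temporary file standing in for a log

# tee copies its input to the file and passes it on down the pipeline.
upper=$(echo "hi" | tee "$log" | tr 'a-z' 'A-Z')
echo "$upper"   # HI
saved=$(cat "$log")
echo "$saved"   # hi

# -a appends to the file instead of overwriting it.
echo "again" | tee -a "$log" > /dev/null
lines=$(wc -l < "$log" | tr -d ' ')
echo "$lines"   # 2
rm -f "$log"
```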