SlideShare a Scribd company logo
MeasureCamp VI
*NIX for ETL
(Using Scripting to Automate Data Cleaning & Transformation)
Andrew Hood
@lynchpin
Is this a 1970s tribute presentation?
UNIX
BSD Mac OS X
Linux Android
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 2
Why is *NIX relevant to analytics?
• Most input data is crap. And increasingly large.
• Requires data processing:
– Sequences of transformations
– Reproducibility
– Flexibility
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 3
Download
file(s)
Remove dodgy
characters
Fix date
format
Load into
database
Do I have it/how do I get *NIX?
• You might have it already:
– Max OS X, Android, …
• Linux
– Stick on an old PC
• E.g. http://www.ubuntu.com/
– 12 months free on AWS
• http://aws.amazon.com/free/
• Windows
– MinGW/MSYS ports of all the common command line tools
• http://www.mingw.org/
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 4
3 Key UNIX Principles
1. Small is Beautiful
2. Everything is a Filter
3. Silence is Golden
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 5
Principle 1: Small is Beautiful
• A UNIX command does one simple thing efficiently
– ls: lists files in the current directory
– cat: concatenates files together
– sort: er, sorts stuff
– uniq: de-duplicates stuff
– grep: search/filter on steroids
– sed: search and replace on steroids
– awk: calculator on steroids
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 6
In contrast to…
• One tool that does lots of things badly.
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 7
Learning Curve Disclaimer
• Commands are named to minimise typing (not deliberately
maximise obfuscation!)
– ls = list
– mv = move
– cp = copy
– pwd = present working directory
• man [command] is your friend – the manual pages
– E.g. man ls
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 8
Principle 2: Everything is a Filter
• Pretty much every command can take an input and generate
an output based on that input.
– This is rather good for data processing!
• Key punctuation:
– | chains output of one command to input of the next
– > redirects output to a file (or device)
• Example:
– cat myfile.txt | sort | uniq > sorted-and-deduplicated.txt
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 9
Terry Pratchett Tribute Slide
• How to remember what pipes and redirection do
• Source: alt.fan.pratchett (yes, USENET)
• A better LOL:
– C|N>K
• Coffee (filtered) through nose (redirected) to keyboard
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 10
Key Principle 3: Silence is Golden
• Commands have 3 default streams:
– STDIN: text streaming into the command
– STDOUT: text streaming out of the command
– STDERR: any errors or warnings from running the command
• If a command completes successfully you get no messages
– No “Done.”, “15 files copied” or “Are you sure?” prompts!
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 11
Example 1: Quick Line/Field Count
• How big are these massive compressed Webtrends Logs for December 2014?
% zcat *2014-12*.log.gz | wc -l
1514139
• And how many fields are there in the file?
% zcat *.log.gz | head -n5 | awk '{ print NF+1 }'
20
19
19
19
19
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 12
Example 1: Quick Line/Field Count
• Why 20 fields in the first line and 19 in the rest?
% zcat *2014-12*.log.gz | head -n1
Fields: date time c-ip cs-username cs-host cs-method cs-uri-stem
cs-uri-query sc-status sc-bytes cs-version cs(User-Agent)
cs(Cookie) cs(Referer) dcs-geo dcs-dns origin-id dcs-id
• Let’s get rid of that header line and pop into one file:
% zcat *2014-12*.log.gz | grep –v '^Fields' > 2014-12.log
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 13
Example 2: Quick & Dirty Sums
• How many unique IP addresses in that logfile?
Fields: date time c-ip cs-username cs-host cs-method cs-uri-stem cs-uri-query sc-status sc-
bytes cs-version cs(User-Agent) cs(Cookie) cs(Referer) dcs-geo dcs-dns origin-id dcs-id
% cat 2014-12.log | cut -d" " -f3 | sort | uniq | wc -l
8695
• What are the top 5 IP addresses by hits?
% cat 2014-12.log | cut -d" " -f3 | sort | uniq -c | sort -nr | head -n5
1110 91.214.5.128
840 193.108.78.10
730 193.108.73.47
184 203.112.87.138
173 203.112.87.220
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 14
(A Note on Regular Expressions)
• Best learn to love these
– . is any character.
–  escapes a character that has a meaning i.e. . is a literal full stop
– * Matches 0 or more characters
– + Matches 1 or more characters
– ? Matches 0 or 1 characters
– ^ Starts With
– $ Ends With
– [ ] matches a list or group of charcaters (e.g. [A-Za-z] any letter)
– ( ) groups and captures characters
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 15
Example 3: Fixing The Quotes/Date
% cat test.csv
Joe Bloggs,56.78,05/12/2014
Bill Bloggs,123.99,12/06/2014
% sed 's/[[:alnum:] /.]*/"&"/g' test.csv
"Joe Bloggs","56.78","05/12/2014"
"Bill Bloggs","123.99","12/06/2014“
% sed 's/[[:alnum:] /.]*/"&"/g' test.csv | sed 's|(..)/(..)/(....)|3-2-1|'
"Joe Bloggs","56.78","2014-12-05"
"Bill Bloggs","123.99","2014-06-12"
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 16
Example 4: Clean & Database Load
cat input.csv 
| grep '^[0-9]' 
| sed 's/,,/,NULL,/g' 
| sed 's/[^[:print:]]//g' 
| psql -c "copy mytable from stdin with delimiter ','"
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 17
Example 5: Basic Scripting
#!/bin/sh
DATE=`date -v -1d +%Y-%m-%d`
SOURCE="input-$DATE.zip"
LOGFILE="$DATE.log"
sftp -b /dev/stdin username@host 2>&1 >>$LOGFILE <<EOF
cd incoming
get $SOURCE
EOF
if [ ! -f "$SOURCE" ]; then
echo "Source file $SOURCE not found" >> $LOGFILE
exit 99
fi
unzip $SOURCE
echo "Source file $SOURCE successfully downloaded and decompressed" >> $LOGFILE
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 18
Resources
• UNIX Fundamentals:
– http://freeengineer.org/learnUNIXin10minutes.html
– http://www.ee.surrey.ac.uk/Teaching/Unix/
• Cheat Sheet:
– http://www.pixelbeat.org/cmdline.html
• Basic Analytics Applications
– http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/
• Regular Expressions:
– http://www.lunametrics.com/regex-book/Regular-Expressions-Google-Analytics.pdf
– http://www.regexr.com/
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 19

More Related Content

Nix for etl using scripting to automate data cleaning & transformation

  • 1. MeasureCamp VI *NIX for ETL (Using Scripting to Automate Data Cleaning & Transformation) Andrew Hood @lynchpin
  • 2. Is this a 1970s tribute presentation? UNIX BSD Mac OS X Linux Android 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 2
  • 3. Why is *NIX relevant to analytics? • Most input data is crap. And increasingly large. • Requires data processing: – Sequences of transformations – Reproducibility – Flexibility 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 3 Download file(s) Remove dodgy characters Fix date format Load into database
  • 4. Do I have it/how do I get *NIX? • You might have it already: – Max OS X, Android, … • Linux – Stick on an old PC • E.g. http://www.ubuntu.com/ – 12 months free on AWS • http://aws.amazon.com/free/ • Windows – MinGW/MSYS ports of all the common command line tools • http://www.mingw.org/ 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 4
  • 5. 3 Key UNIX Principles 1. Small is Beautiful 2. Everything is a Filter 3. Silence is Golden 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 5
  • 6. Principle 1: Small is Beautiful • A UNIX command does one simple thing efficiently – ls: lists files in the current directory – cat: concatenates files together – sort: er, sorts stuff – uniq: de-duplicates stuff – grep: search/filter on steroids – sed: search and replace on steroids – awk: calculator on steroids 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 6
  • 7. In contrast to… • One tool that does lots of things badly. 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 7
  • 8. Learning Curve Disclaimer • Commands are named to minimise typing (not deliberately maximise obfuscation!) – ls = list – mv = move – cp = copy – pwd = present working directory • man [command] is your friend – the manual pages – E.g. man ls 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 8
  • 9. Principle 2: Everything is a Filter • Pretty much every command can take an input and generate an output based on that input. – This is rather good for data processing! • Key punctuation: – | chains output of one command to input of the next – > redirects output to a file (or device) • Example: – cat myfile.txt | sort | uniq > sorted-and-deduplicated.txt 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 9
  • 10. Terry Pratchett Tribute Slide • How to remember what pipes and redirection do • Source: alt.fan.pratchett (yes, USENET) • A better LOL: – C|N>K • Coffee (filtered) through nose (redirected) to keyboard 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 10
  • 11. Key Principle 3: Silence is Golden • Commands have 3 default streams: – STDIN: text streaming into the command – STDOUT: text streaming out of the command – STDERR: any errors or warnings from running the command • If a command completes successfully you get no messages – No “Done.”, “15 files copied” or “Are you sure?” prompts! 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 11
  • 12. Example 1: Quick Line/Field Count • How big are these massive compressed Webtrends Logs for December 2014? % zcat *2014-12*.log.gz | wc -l 1514139 • And how many fields are there in the file? % zcat *.log.gz | head -n5 | awk '{ print NF+1 }' 20 19 19 19 19 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 12
  • 13. Example 1: Quick Line/Field Count • Why 20 fields in the first line and 19 in the rest? % zcat *2014-12*.log.gz | head -n1 Fields: date time c-ip cs-username cs-host cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-version cs(User-Agent) cs(Cookie) cs(Referer) dcs-geo dcs-dns origin-id dcs-id • Let’s get rid of that header line and pop into one file: % zcat *2014-12*.log.gz | grep –v '^Fields' > 2014-12.log 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 13
  • 14. Example 2: Quick & Dirty Sums • How many unique IP addresses in that logfile? Fields: date time c-ip cs-username cs-host cs-method cs-uri-stem cs-uri-query sc-status sc- bytes cs-version cs(User-Agent) cs(Cookie) cs(Referer) dcs-geo dcs-dns origin-id dcs-id % cat 2014-12.log | cut -d" " -f3 | sort | uniq | wc -l 8695 • What are the top 5 IP addresses by hits? % cat 2014-12.log | cut -d" " -f3 | sort | uniq -c | sort -nr | head -n5 1110 91.214.5.128 840 193.108.78.10 730 193.108.73.47 184 203.112.87.138 173 203.112.87.220 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 14
  • 15. (A Note on Regular Expressions) • Best learn to love these – . is any character. – escapes a character that has a meaning i.e. . is a literal full stop – * Matches 0 or more characters – + Matches 1 or more characters – ? Matches 0 or 1 characters – ^ Starts With – $ Ends With – [ ] matches a list or group of charcaters (e.g. [A-Za-z] any letter) – ( ) groups and captures characters 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 15
  • 16. Example 3: Fixing The Quotes/Date % cat test.csv Joe Bloggs,56.78,05/12/2014 Bill Bloggs,123.99,12/06/2014 % sed 's/[[:alnum:] /.]*/"&"/g' test.csv "Joe Bloggs","56.78","05/12/2014" "Bill Bloggs","123.99","12/06/2014“ % sed 's/[[:alnum:] /.]*/"&"/g' test.csv | sed 's|(..)/(..)/(....)|3-2-1|' "Joe Bloggs","56.78","2014-12-05" "Bill Bloggs","123.99","2014-06-12" 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 16
  • 17. Example 4: Clean & Database Load cat input.csv | grep '^[0-9]' | sed 's/,,/,NULL,/g' | sed 's/[^[:print:]]//g' | psql -c "copy mytable from stdin with delimiter ','" 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 17
  • 18. Example 5: Basic Scripting #!/bin/sh DATE=`date -v -1d +%Y-%m-%d` SOURCE="input-$DATE.zip" LOGFILE="$DATE.log" sftp -b /dev/stdin username@host 2>&1 >>$LOGFILE <<EOF cd incoming get $SOURCE EOF if [ ! -f "$SOURCE" ]; then echo "Source file $SOURCE not found" >> $LOGFILE exit 99 fi unzip $SOURCE echo "Source file $SOURCE successfully downloaded and decompressed" >> $LOGFILE 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 18
  • 19. Resources • UNIX Fundamentals: – http://freeengineer.org/learnUNIXin10minutes.html – http://www.ee.surrey.ac.uk/Teaching/Unix/ • Cheat Sheet: – http://www.pixelbeat.org/cmdline.html • Basic Analytics Applications – http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/ • Regular Expressions: – http://www.lunametrics.com/regex-book/Regular-Expressions-Google-Analytics.pdf – http://www.regexr.com/ 16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 19