Nix for etl using scripting to automate data cleaning & transformation
- 2. Is this a 1970s tribute presentation?
UNIX
BSD Mac OS X
Linux Android
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 2
- 3. Why is *NIX relevant to analytics?
• Most input data is crap. And increasingly large.
• Requires data processing:
– Sequences of transformations
– Reproducibility
– Flexibility
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 3
Download
file(s)
Remove dodgy
characters
Fix date
format
Load into
database
- 4. Do I have it/how do I get *NIX?
• You might have it already:
– Max OS X, Android, …
• Linux
– Stick on an old PC
• E.g. http://www.ubuntu.com/
– 12 months free on AWS
• http://aws.amazon.com/free/
• Windows
– MinGW/MSYS ports of all the common command line tools
• http://www.mingw.org/
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 4
- 5. 3 Key UNIX Principles
1. Small is Beautiful
2. Everything is a Filter
3. Silence is Golden
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 5
- 6. Principle 1: Small is Beautiful
• A UNIX command does one simple thing efficiently
– ls: lists files in the current directory
– cat: concatenates files together
– sort: er, sorts stuff
– uniq: de-duplicates stuff
– grep: search/filter on steroids
– sed: search and replace on steroids
– awk: calculator on steroids
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 6
- 7. In contrast to…
• One tool that does lots of things badly.
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 7
- 8. Learning Curve Disclaimer
• Commands are named to minimise typing (not deliberately
maximise obfuscation!)
– ls = list
– mv = move
– cp = copy
– pwd = present working directory
• man [command] is your friend – the manual pages
– E.g. man ls
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 8
- 9. Principle 2: Everything is a Filter
• Pretty much every command can take an input and generate
an output based on that input.
– This is rather good for data processing!
• Key punctuation:
– | chains output of one command to input of the next
– > redirects output to a file (or device)
• Example:
– cat myfile.txt | sort | uniq > sorted-and-deduplicated.txt
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 9
- 10. Terry Pratchett Tribute Slide
• How to remember what pipes and redirection do
• Source: alt.fan.pratchett (yes, USENET)
• A better LOL:
– C|N>K
• Coffee (filtered) through nose (redirected) to keyboard
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 10
- 11. Key Principle 3: Silence is Golden
• Commands have 3 default streams:
– STDIN: text streaming into the command
– STDOUT: text streaming out of the command
– STDERR: any errors or warnings from running the command
• If a command completes successfully you get no messages
– No “Done.”, “15 files copied” or “Are you sure?” prompts!
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 11
- 12. Example 1: Quick Line/Field Count
• How big are these massive compressed Webtrends Logs for December 2014?
% zcat *2014-12*.log.gz | wc -l
1514139
• And how many fields are there in the file?
% zcat *.log.gz | head -n5 | awk '{ print NF+1 }'
20
19
19
19
19
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 12
- 13. Example 1: Quick Line/Field Count
• Why 20 fields in the first line and 19 in the rest?
% zcat *2014-12*.log.gz | head -n1
Fields: date time c-ip cs-username cs-host cs-method cs-uri-stem
cs-uri-query sc-status sc-bytes cs-version cs(User-Agent)
cs(Cookie) cs(Referer) dcs-geo dcs-dns origin-id dcs-id
• Let’s get rid of that header line and pop into one file:
% zcat *2014-12*.log.gz | grep –v '^Fields' > 2014-12.log
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 13
- 14. Example 2: Quick & Dirty Sums
• How many unique IP addresses in that logfile?
Fields: date time c-ip cs-username cs-host cs-method cs-uri-stem cs-uri-query sc-status sc-
bytes cs-version cs(User-Agent) cs(Cookie) cs(Referer) dcs-geo dcs-dns origin-id dcs-id
% cat 2014-12.log | cut -d" " -f3 | sort | uniq | wc -l
8695
• What are the top 5 IP addresses by hits?
% cat 2014-12.log | cut -d" " -f3 | sort | uniq -c | sort -nr | head -n5
1110 91.214.5.128
840 193.108.78.10
730 193.108.73.47
184 203.112.87.138
173 203.112.87.220
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 14
- 15. (A Note on Regular Expressions)
• Best learn to love these
– . is any character.
– escapes a character that has a meaning i.e. . is a literal full stop
– * Matches 0 or more characters
– + Matches 1 or more characters
– ? Matches 0 or 1 characters
– ^ Starts With
– $ Ends With
– [ ] matches a list or group of charcaters (e.g. [A-Za-z] any letter)
– ( ) groups and captures characters
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 15
- 16. Example 3: Fixing The Quotes/Date
% cat test.csv
Joe Bloggs,56.78,05/12/2014
Bill Bloggs,123.99,12/06/2014
% sed 's/[[:alnum:] /.]*/"&"/g' test.csv
"Joe Bloggs","56.78","05/12/2014"
"Bill Bloggs","123.99","12/06/2014“
% sed 's/[[:alnum:] /.]*/"&"/g' test.csv | sed 's|(..)/(..)/(....)|3-2-1|'
"Joe Bloggs","56.78","2014-12-05"
"Bill Bloggs","123.99","2014-06-12"
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 16
- 17. Example 4: Clean & Database Load
cat input.csv
| grep '^[0-9]'
| sed 's/,,/,NULL,/g'
| sed 's/[^[:print:]]//g'
| psql -c "copy mytable from stdin with delimiter ','"
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 17
- 18. Example 5: Basic Scripting
#!/bin/sh
DATE=`date -v -1d +%Y-%m-%d`
SOURCE="input-$DATE.zip"
LOGFILE="$DATE.log"
sftp -b /dev/stdin username@host 2>&1 >>$LOGFILE <<EOF
cd incoming
get $SOURCE
EOF
if [ ! -f "$SOURCE" ]; then
echo "Source file $SOURCE not found" >> $LOGFILE
exit 99
fi
unzip $SOURCE
echo "Source file $SOURCE successfully downloaded and decompressed" >> $LOGFILE
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 18
- 19. Resources
• UNIX Fundamentals:
– http://freeengineer.org/learnUNIXin10minutes.html
– http://www.ee.surrey.ac.uk/Teaching/Unix/
• Cheat Sheet:
– http://www.pixelbeat.org/cmdline.html
• Basic Analytics Applications
– http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/
• Regular Expressions:
– http://www.lunametrics.com/regex-book/Regular-Expressions-Google-Analytics.pdf
– http://www.regexr.com/
16 March 2015 © Lynchpin Analytics Limited, All Rights Reserved 19