Enterprise Data Workflows with Cascading and Windows Azure HDInsight

Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid
“Enterprise Data Workﬂows with
Cascading and Windows Azure
HDInsight”
1

Cascading and Windows Azure HDInsight
First, some highly recommended reading…
Using MapReduce with HDInsight
http://www.windowsazure.com/en-us/manage/
services/hdinsight/using-mapreduce-with-
hdinsight/
Drive Smarter Decisions with Hadoop and
Windows Azure HDInsight
http://www.slideshare.net/Hadoop_Summit/
drive-smarter-decisions-with-hadoop-and-
windows-azure-hdinsight
2

Cascading and Windows Azure HDInsight
Second, a highly recommended talk…
Hadoop Summit 2013 San Jose
Wed, June 26th, 2:05-2:45 pm
hadoopsummit.org/san-jose/schedule/
“Building Tools for the Hadoop Developer”
In this session we’ll ﬁrst discuss our experience extending Hadoop development to
new platforms & languages and then discuss our experiments and experiences
building supporting developer tools and plugins for those platforms. First, we’ll take a
hands on approach to showing our experiments and successes extending Hadoop to
languages such as JavaScript and .NET with LINQ. Second, we’ll walk through some
of the developer & developer ops tools and plugins we’ve experimented with in an
effort to simplify life for the Hadoop developer across both on premises and cloud-
based projects.
Matt Winkler, Microsoft
blogs.msdn.com/mwinkle
@mwinkle
3

Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
Cascading: background
4

Cascading – origins
API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular
data products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: would be difficult to find Java developers
to write complex Enterprise apps in MapReduce –
potential blocker for leveraging new open source
technology.
5

Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters
6

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading – deﬁnitions
• a pattern language for Enterprise Data Workﬂows
• simple to build, easy to test, robust in production
• design principles ⟹ ensure best practices at scale
7

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading – usage
• Java API, DSLs in Scala, Clojure,
Jython, JRuby, Groovy,ANSI SQL
• ASL 2 license, GitHub src,
http://conjars.org
• 5+ yrs production use,
multiple Enterprise verticals
8

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading – integrations
• partners: Microsoft Azure, Hortonworks,
Amazon AWS, MapR, EMC, SpringSource,
Cloudera
• taps: Memcached, Cassandra, MongoDB,
HBase, JDBC, Parquet, etc.
• serialization: Avro, Thrift, Kryo,
JSON, etc.
• topologies: Apache Hadoop,
tuple spaces, local mode
9

Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
10

Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
workﬂow abstraction addresses:
• stafﬁng bottleneck;
• system integration;
• operational complexity;
• test-driven development
11

Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
Cascading: sample code
12

void map (String doc_id, String text):
for each word w in segment(text):
emit(w, "1");
void reduce (String word, Iterator group):
int count = 0;
for each pc in group:
count += Int(pc);
emit(word, String(count));
The Ubiquitous Word Count
Definition:
count how often each word appears
in a collection of text documents
This simple program provides an excellent test case for
parallel processing, since it illustrates:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “HelloWorld” for Hadoop apps
Any distributed computing framework which can runWord
Count efficiently in parallel at scale can handle much
larger and more interesting compute problems.
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
count how often each word appears
in a collection of text documents
13

Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
1 map
1 reduce
18 lines code gist.github.com/3900702
word count – conceptual ﬂow diagram
cascading.org/category/impatient
14

word count – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
.addSource( docPipe, docTap )
.addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
15

mapreduce
Every('wc')[Count[decl:'count']]
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
GroupBy('wc')[by:['token']]
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
[head]
[tail]
[{2}:'token', 'count']
[{1}:'token']
[{2}:'doc_id', 'text']
[{2}:'doc_id', 'text']
wc[{1}:'token']
[{1}:'token']
[{1}:'token']
[{1}:'token']
word count – generated ﬂow diagram
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
16

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))
(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[[](),.)s]+"))
(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
; Paul Lam
; github.com/Quantisan/Impatient
word count – Cascalog / Clojure
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
17

github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
word count – Cascalog / Clojure
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
18

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
Tsv(args("doc"),
('doc_id, 'text),
skipHeader = true)
.read
.flatMap('text -> 'token) {
text : String => text.split("[ [](),.]")
}
.groupBy('token) { _.size('count) }
.write(Tsv(args("wc"), writeHeader = true))
}
word count – Scalding / Scala
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
19

github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual ﬂow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• signiﬁcant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
20

• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual ﬂow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API,Algebird, etc.
• signiﬁcant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
Cascalog and Scalding DSLs
leverage the functional aspects
of MapReduce, helping limit
complexity in process
21

Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
The Workflow Abstraction
22

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Enterprise Data Workﬂows
Let’s consider a “strawman” architecture
for an example app… at the front end
LOB use cases drive demand for apps
23

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Same example… in the back ofﬁce
Organizations have substantial investments
in people, infrastructure, process
24

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Same example… the heavy lifting!
“Main Street” ﬁrms are migrating
workﬂows to Hadoop, for cost
savings and scale-out
25

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading workﬂows – taps
• taps integrate other data frameworks, as tuple streams
• these are “plumbing” endpoints in the pattern language
• sources (inputs), sinks (outputs), traps (exceptions)
• text delimited, JDBC, Memcached,
HBase, Cassandra, MongoDB, etc.
• data serialization: Avro, Thrift,
Kryo, JSON, etc.
• extend a new kind of tap in just
a few lines of Java
schema and provenance get
derived from analysis of the taps
26

Cascading workﬂows – taps
wcFlow.complete();
source and sink taps
for TSV data in HDFS
27

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading workflows – topologies
• topologies execute workflows on clusters
• flow planner is like a compiler for queries
- Hadoop (MapReduce jobs)
- local mode (dev/test or special config)
- in-memory data grids (real-time)
• flow planner can be extended
to support other topologies
blend flows in different topologies
into the same app – for example,
batch (Hadoop) + transactions (IMDG)
28

Cascading workﬂows – topologies
wcFlow.complete();
ﬂow planner for
Apache Hadoop
topology
29

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading workflows – test-driven development
• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels
• trap edge cases as “data exceptions”
• TDD at scale:
1.start from raw inputs in the flow graph
2.define stream assertions for each stage
of transforms
3.verify exceptions, code to remove them
4.when impl is complete, app has full
test coverage
redirect traps in production
to Ops, QA, Support,Audit, etc.
30

Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading
– used for their large-scale production deployments
• new case studies for Cascading apps are mostly
based on domain-speciﬁc languages (DSLs) in JVM
languages which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
Why Adopting the Declarative Programming PracticesWill ImproveYour Return fromTechnology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-
practices-will-improve-your-return-from-technology/
31

Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce
software engineering costs at scale, over time
32

Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in the Java API,
to define workflows out of familiar elements: Pipes, Taps,
Tuple Flows, Filters, Joins, Traps, etc.
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
Data is represented as flows of tuples. Operations within
the flows bring functional programming aspects into Java
In formal terms, this provides a pattern language
33

Pattern Language
structured method for solving large, complex design
problems, where the syntax of the language ensures
the use of best practices – i.e., conveying expertise
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
34

Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
35

references…
by Don Knuth
Literate Programming
Univ of Chicago Press, 1992
literateprogramming.com/
“Instead of imagining that our main task is
to instruct a computer what to do, let us
concentrate rather on explaining to human
beings what we want a computer to do.”
36

Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
37

references…
by Edgar Codd
“A relational model of data for large shared data banks”
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685
rather than arguing between SQL vs. NoSQL…
structured vs. unstructured data frameworks…
this approach focuses on what apps do:
the process of structuring data
38

Two Avenues to the App Layer…
scale ➞
complexity➞
Enterprise: must contend with
complexity at scale everyday…
incumbents extend current practices and
infrastructure investments – using J2EE,
ANSI SQL, SAS, etc. – to migrate
workﬂows onto Apache Hadoop while
leveraging existing staff
Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding
39

Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
Lingual: ANSI SQL in Cascading
40

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading workflows – ANSI SQL
• collab with Optiq – industry-proven code base
• ANSI SQL parser/optimizer atop Cascading
flow planner
• JDBC driver to integrate into existing
tools and app servers
• relational catalog over a collection
of unstructured data
• SQL shell prompt to run queries
• enable analysts without retraining
on Hadoop, etc.
• transparency for Support, Ops,
Finance, et al.
a language for queries – not a database,
but ANSI SQL as a DSL for workflows
41

Lingual – CSV data in local ﬁle system
cascading.org/lingual
42

Lingual – shell prompt, catalog
43

Lingual – queries
44

abstraction RDBMS JVM Cluster
parser ANSI SQL
compliant parser
ANSI SQL
compliant parser
optimizer logical plan,
optimized based on stats
logical plan,
optimized based on stats
planner physical plan API “plumbing”
machine
data
query history,
table stats
app history,
tuple stats
topology b-trees, etc. heterogenous, distributed:
Hadoop, in-memory, etc.
visualization ERD ﬂow diagram
schema table schema tuple schema
catalog relational catalog tap usage DB
provenance (manual audit) data set
producers/consumers
abstraction layers in queries…
45

Lingual – articles
Open Source 'Lingual' Helps SQL Devs Unlock Hadoop
Thor Olavsrud, 2013-02-22
cio.com/article/729283/Open_Source_Lingual_Helps_SQL_Devs_Unlock_Hadoop
Hadoop AppsWithout MapReduce Mindsets
Adrian Bridgwater, 2013-02-28
drdobbs.com/open-source/hadoop-apps-without-mapreduce-mindsets/240149708
Concurrent gives old SQL users new Hadoop tricks
Jack Clark, 2013-02-20
theregister.co.uk/2013/02/20/hadoop_sql_translator_lingual_launches/
Concurrent Raises $4 Million,Will Expand Big Data App Framework
James Johnson, 2013-03-20
inquisitr.com/581755/concurrent-raises-4-million-will-expand-big-data-app-framework
Concurrent Releases Lingual, a SQL DSL for Hadoop
Boris Lublinsky, 2013-02-28
infoq.com/news/2013/02/Lingual
46

Lingual – JDBC driver
public void run() throws ClassNotFoundException, SQLException {
Class.forName( "cascading.lingual.jdbc.Driver" );
Connection connection =
DriverManager.getConnection(
"jdbc:lingual:local;schemas=src/main/resources/data/example" );
Statement statement = connection.createStatement();

ResultSet resultSet = statement.executeQuery(
"select *n"
+ "from "EXAMPLE"."SALES_FACT_1997" as sn"
+ "join "EXAMPLE"."EMPLOYEE" as en"
+ "on e."EMPID" = s."CUST_ID"" );

while( resultSet.next() ) {
int n = resultSet.getMetaData().getColumnCount();
StringBuilder builder = new StringBuilder();

for( int i = 1; i <= n; i++ ) {
builder.append( ( i > 1 ? "; " : "" )
+ resultSet.getMetaData().getColumnLabel( i )
+ "="
+ resultSet.getObject( i ) );
}
System.out.println( builder );
}

resultSet.close();
statement.close();
connection.close();
}
47

Lingual – JDBC result set
$ gradle clean jar
$ hadoop jar build/libs/lingual-examples–1.0.0-wip-dev.jar

CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian
Caveat: if you absolutely positively must have sub-second
SQL query response for Pb-scale data on a 1000+ node
cluster… Good luck with that! (call the MPP vendors)
This ANSI SQL library is primarily intended for batch
workﬂows – high throughput, not low-latency –
for many under-represented use cases in Enterprise IT.
In other words, SQL as a DSL.
48

# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
"~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
"jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/
tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
"SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)
library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
Lingual – connecting Hadoop and R
49

> summary(df$hire_age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.86 27.89 31.70 31.61 35.01 43.92
Lingual – connecting Hadoop and R
50

Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
Pattern: PMML in Cascading
51

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Pattern – model scoring
• migrate workloads: SAS,Teradata, etc.,
exporting predictive models as PMML
• great open source tools – R, Weka,
KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries –
Matrix API, etc.
• leverage PMML as another kind
of DSL
cascading.org/pattern
52

## train a RandomForest model

f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set

print(fit$importance)
print(fit)

predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV

write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
quote=FALSE, sep="t", row.names=FALSE)

## export RF model to PMML

saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
Pattern – create a model in R
53

<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.dmg.org/PMML-4_0
http://www.dmg.org/v4-0/pmml-4-0.xsd">
<Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
  <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
  <Application name="Rattle/PMML" version="1.2.30"/>
  <Timestamp>2012-10-22 19:39:28</Timestamp>
</Header>
<DataDictionary numberOfFields="4">
  <DataField name="label" optype="categorical" dataType="string">
   <Value value="0"/>
   <Value value="1"/>
  </DataField>
  <DataField name="var0" optype="continuous" dataType="double"/>
</DataDictionary>
<MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
   <MiningField name="label" usageType="predicted"/>
   <MiningField name="var0" usageType="active"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="majorityVote">
   <Segment id="1">
    <True/>
    <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
     <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
     </MiningSchema>
...
Pattern – capture model parameters as PMML
54

public static void main( String[] args ) throws RuntimeException {
String inputPath = args[ 0 ];
String classifyPath = args[ 1 ];
// set up the config properties
Tap inputTap = new Hfs( new TextDelimited( true, "t" ), inputPath );
Tap classifyTap = new Hfs( new TextDelimited( true, "t" ), classifyPath );
// handle command line options
OptionParser optParser = new OptionParser();
optParser.accepts( "pmml" ).withRequiredArg();
OptionSet options = optParser.parse( args );

FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );

if( options.hasArgument( "pmml" ) ) {
String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlPath ) )
.retainOnlyActiveIncomingFields()
.setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
flowDef.addAssemblyPlanner( pmmlPlanner );
}

Flow classifyFlow = flowConnector.connect( flowDef );
classifyFlow.writeDOT( "dot/classify.dot" );
classifyFlow.complete();
}
Pattern – score a model, within an app
55

Customer
Orders
Classify
Scored
Orders
GroupBy
token
Count
PMML
Model
M R
Failure
Traps
Assert
Confusion
Matrix
Pattern – score a model, using pre-deﬁned Cascading app
56

## run an RF classifier at scale

hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap
--pmml data/sample.rf.xml

## run an RF classifier at scale, assert regression test, measure confusion matrix

hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap
--pmml data/sample.rf.xml --assert --measure out/measure

## run a predictive model at scale, measure RMSE

hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap
--pmml data/iris.lm_p.xml --rmse out/measure
Pattern – score a model, using pre-deﬁned Cascading app
57

• established XML standard for predictive model markup
• organized by Data Mining Group (DMG), since 1997
http://dmg.org/
• members: IBM, SAS, Visa, NASA, Equifax, Microstrategy,
Microsoft, etc.
• PMML concepts for metadata, ensembles, etc., translate
directly into Cascading tuple ﬂows
“PMML is the leading standard for statistical and data mining models and
supported by over 20 vendors and organizations.With PMML, it is easy
to develop a model on one system using one application and deploy the
model on another system using another application.”
PMML – standard
wikipedia.org/wiki/Predictive_Model_Markup_Language
58

• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classiﬁers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• SupportVector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element
PMML – model coverage
ibm.com/developerworks/industry/library/ind-PMML2/
59

roadmap – existing algorithms for scoring
•

Random Forest
• Decision Trees
• Linear Regression
• GLM
• Logistic Regression
• K-Means Clustering
• Hierarchical Clustering
• SupportVector Machines
• Multinomial
also, model chaining and general support for ensembles
61

roadmap – top priorities for creating models at scale
•

Random Forest
• Logistic Regression
• K-Means Clustering
a wealth of recent research indicates many opportunities
to parallelize popular algorithms for training models at scale
on Apache Hadoop…
62

roadmap – next priorities for scoring
•

Time Series (ARIMA forecast)
• Association Rules (basket analysis)
• Naïve Bayes
• Neural Networks
algorithms extended based on customer use cases –
contact groups.google.com/forum/?fromgroups#!forum/pattern-user
63

experiments – comparing models
• much customer interest in leveraging Cascading and
Apache Hadoop to run customer experiments at scale
• run multiple variants, then measure relative “lift”
• Concurrent runtime – tag and track models
the following example compares two models trained
with different machine learning algorithms
this is exaggerated, one has an important variable
intentionally omitted to help illustrate the experiment
64

## train a Random Forest model
## example: http://mkseo.pe.kr/stats/?p=220

f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
print(fit)
saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))
experiments – Random Forest model
OOB estimate of error rate: 14%
Confusion matrix:
0 1 class.error
0 69 16 0.1882353
1 12 103 0.1043478
65

## train a Logistic Regression model (special case of GLM)
## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r

f <- as.formula("as.factor(label) ~ var0 + var2")
fit <- glm(f, family=binomial, data=data)
print(summary(fit))
saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))
experiments – Logistic Regression model
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.8524 0.3803 4.871 1.11e-06 ***
var0 -1.3755 0.4355 -3.159 0.00159 **
var2 -3.7742 0.5794 -6.514 7.30e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01
‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
NB: this model has “var1” intentionally omitted
66

experiments – comparing results
•

use a confusion matrix to compare results for the classiﬁers
• Logistic Regression has a lower “false negative” rate (5% vs. 11%)
however it has a much higher “false positive” rate (52% vs. 14%)
• assign a cost model to select a winner –
for example, in an ecommerce anti-fraud classiﬁer:
FN ∼ chargeback risk
FP ∼ customer support costs
67

Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
Design Patterns for Workflows,
Across Departments
68

Anatomy of an Enterprise app
Deﬁnition a typical Enterprise workﬂow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
69

ETL
data
prep
predictive
model
data
sources
end
uses
ANSI SQL for ETL
70

ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic
71

ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive models
72

ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive modelsANSI SQL for ETL most of the licensing costs…
73

ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic
most of the project costs…
74

ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Cascading allows multiple departments to combine their workﬂow components
into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all…
cascading.org
75

ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
FlowDef flowDef = FlowDef.flowDef()
.setName( "etl" )
.addSource( "example.employee", emplTap )
.addSource( "example.sales", salesTap )
.addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
.setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org
76

ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
FlowDef flowDef = FlowDef.flowDef()
.setName( "classifier" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlModel ) )
.retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
77

cascading.org
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
visual collaboration for the business logic is a great
way to improve how teams work together
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
78

ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
multiple departments, working in their respective
frameworks, integrate results into a combined app,
which runs at scale on a cluster… business process
combined in a common space (DAG) for ﬂow
planners, compiler, optimization, troubleshooting,
exception handling, notiﬁcations, security audit,
performance monitoring, etc.
cascading.org
79

Enterprise DataWorkﬂows
with Cascading
O’Reilly, 2013
amazon.com/dp/1449358721
references…
newsletter updates:
liber118.com/pxn/
80

blog, developer community, code/wiki/gists, maven repo,
commercial products, etc.:
cascading.org
zest.to/group11
github.com/Cascading
conjars.org
goo.gl/KQtUL
concurrentinc.com
drill-down…
81

Enterprise Data Workflows with Cascading and Windows Azure HDInsight

More Related Content

Enterprise Data Workflows with Cascading and Windows Azure HDInsight