SlideShare a Scribd company logo
PyCascading for Intuitive
Flow Processing With
Hadoop
Gabor Szabo
Senior Data Scientist
Twitter, Inc.
Outline
• Basic concepts in the Hadoop ecosystem, with an example
• Hadoop
• Cascading
• PyCascading
• Essential PyCascading operations
• PyCascading by example: discovering main interests among friends
• Miscellaneous remarks, caveats
2
Hadoop
Architecture
• The Hadoop file system (HDFS)
• Large, distributed file system
• Thousands of nodes, PBs of data
• The storage layer for Apache Hive, HBase, ...
• MapReduce
• Idea: ship the code to the data, not other way around
• Do aggregations locally
• Iterate on the results
• Map phase: process the input records, emit a key & a value
• Reduce phase: collect records with the same key from Map, emit a new (aggregate) record
• Fault tolerance
• Both storage and compute are fault tolerant (redundancy, replication, restart)
3
Hadoop
In practice
• Language
• Java
• Need to think in MapReduce
• Hard to translate the problem to MR
• Hard to maintain and make changes in the topology
• Best used for
• Archiving (HDFS)
• Batch processing (MR)
4

Recommended for you

Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explained

The document provides an overview of MapReduce and how it addresses the problem of processing large datasets in a distributed computing environment. It explains how MapReduce inspired by functional programming works by splitting data, mapping functions to pieces in parallel, and then reducing the results. Examples are given of word count and sorting word counts to find the most frequent word. Finally, it discusses how Hadoop popularized MapReduce by providing an open-source implementation and ecosystem.

apache hadoopmapreduce
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows

Map Reduce is a parallel and distributed approach developed by Google for processing large data sets. It has two key components - the Map function which processes input data into key-value pairs, and the Reduce function which aggregates the intermediate output of the Map into a final result. Input data is split across multiple machines which apply the Map function in parallel, and the Reduce function is applied to aggregate the outputs.

map reducehadoop on windowsparallel programming
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame

Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

sparkhadoop
Cascading	
The Cascading way: flow processing
• Cascading is built on top of Hadoop
• Introduces semi-structured flow processing of tuples with typed fields
• Analogy: data is flowing in pipes
• Input comes from source taps
• Output goes to sink taps
• Data is reshaped in the pipes by different operations
• Builds a DAG from the job, and optimizes the topology to minimize the number of
MapReduce phases
• The pipes analogy is more intuitive to use than raw MapReduce
5
Flow processing in (Py)Cascading
6
Source: cascading.org
Flow processing in (Py)Cascading
6
Source: cascading.org
Source tap
Flow processing in (Py)Cascading
6
Source: cascading.org
Source tap
Sink tap

Recommended for you

Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce

This is a deck of slides from a recent meetup of AWS Usergroup Greece, presented by Ioannis Konstantinou from the National Technical University of Athens. The presentation gives an overview of the Map Reduce framework and a description of its open source implementation (Hadoop). Amazon's own Elastic Map Reduce (EMR) service is also mentioned. With the growing interest on Big Data this is a good introduction to the subject.

elastic map reducemap reduceapache hadoop
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)

Apache Hadoop: design and implementation. Lecture in the Big data computing course (http://twiki.di.uniroma1.it/twiki/view/BDC/WebHome), Department of Computer Science, Sapienza University of Rome.

reducetaskhadoopnode manager
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite

Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.

dataworks summit 2017hadoop summitdws17
Flow processing in (Py)Cascading
6
Source: cascading.org
Source tap
Filter
Sink tap
Flow processing in (Py)Cascading
6
Source: cascading.org
Source tap
Filter
Sink tap
“Map”
Flow processing in (Py)Cascading
6
Source: cascading.org
Source tap
Filter
Sink tap
“Map” Join
Flow processing in (Py)Cascading
6
Source: cascading.org
Source tap
Filter
Group & aggregate
Sink tap
“Map” Join

Recommended for you

Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17

Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.

sparkcluster computingdata analytics
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130

This slide introduces Hadoop Spark. Just to help you construct an idea of Spark regarding its architecture, data flow, job scheduling, and programming. Not all technical details are included.

sparkhadoop
Map Reduce
Map ReduceMap Reduce
Map Reduce

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.

hadoopmapreduce
PyCascading
Design
• Built on top of Cascading
• Uses the Jython 2.5 interpreter
• Everything in Python
• Building the pipelines
• User-defined functions that operate on data
• Completely hides Java if the user wants it to
• However due to the strong ties with Java, it’s worth knowing the Cascading classes
7
Example: as always, WordCount
Writing MapReduce jobs by hand is hard
• WordCount: split the input file into words, and count how many times each word occurs
8
Example: as always, WordCount
Writing MapReduce jobs by hand is hard
• WordCount: split the input file into words, and count how many times each word occurs
8
M
Example: as always, WordCount
Writing MapReduce jobs by hand is hard
• WordCount: split the input file into words, and count how many times each word occurs
8
MR

Recommended for you

Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase

Spark Streaming is an extension of the core Spark API that enables continuous data stream processing. It is particularly useful when data needs to be processed in real-time. Carol McDonald, HBase Hadoop Instructor at MapR, will cover: + What is Spark Streaming and what is it used for? + How does Spark Streaming work? + Example code to read, process, and write the processed data

Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret

Apache Spark is a fast and general engine for large-scale data processing. It was originally developed in 2009 and is now supported by Databricks. Spark provides APIs in Java, Scala, Python and can run on Hadoop, Mesos, standalone or in the cloud. It provides high-level APIs like Spark SQL, MLlib, GraphX and Spark Streaming for structured data processing, machine learning, graph analytics and stream processing.

mllibdata preprocessingdata mining
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark

Created at the University of Berkeley in California, Apache Spark combines a distributed computing system through computer clusters with a simple and elegant way of writing programs. Spark is considered the first open source software that makes distribution programming really accessible to data scientists. Here you can find an introduction and basic concepts.

apachebig_dataspark
Example: as always, WordCount
Writing MapReduce jobs by hand is hard
• WordCount: split the input file into words, and count how many times each word occurs
8
MRSupportcodeSupportcode
Cascading WordCount
• Still in Java, but algorithm design is easier
• Need to write separate classes for each user-defined operation
9
Cascading WordCount
• Still in Java, but algorithm design is easier
• Need to write separate classes for each user-defined operation
9
M
Cascading WordCount
• Still in Java, but algorithm design is easier
• Need to write separate classes for each user-defined operation
9
MG

Recommended for you

Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem

This document introduces Apache Spark, an open-source cluster computing system that provides fast, general execution engines for large-scale data processing. It summarizes key Spark concepts including resilient distributed datasets (RDDs) that let users spread data across a cluster, transformations that operate on RDDs, and actions that return values to the driver program. Examples demonstrate how to load data from files, filter and transform it using RDDs, and run Spark programs on a local or cluster environment.

Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides

This document provides an agenda for an advanced Spark class covering topics such as RDD fundamentals, Spark runtime architecture, memory and persistence, shuffle operations, and Spark Streaming. The class will be held in March 2015 and include lectures, labs, and Q&A sessions. It notes that some slides may be skipped and asks attendees to keep Q&A low during the class, with a dedicated Q&A period at the end.

Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies

kelly technologies is the best Hadoop Training Institutes in Hyderabad. Providing Hadoop training by real time faculty in Hyderaba www.kellytechno.com

hadoop training in hyderabadhadoop institutes in hyderabadhadoop training centers in hyderabad
Cascading WordCount
• Still in Java, but algorithm design is easier
• Need to write separate classes for each user-defined operation
9
MGSupportcodeSupportcode
word_count.py
PyCascading minimizes programmer effort
10
word_count.py
PyCascading minimizes programmer effort
10
Map
word_count.py
PyCascading minimizes programmer effort
10
GMap

Recommended for you

Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial

This document describes how to set up a single-node Hadoop installation to perform MapReduce operations. It discusses supported platforms, required software including Java and SSH, and preparing the Hadoop cluster in either local, pseudo-distributed, or fully-distributed mode. The main components of the MapReduce execution pipeline are explained, including the driver, mapper, reducer, and input/output formats. Finally, a simple word count example MapReduce job is described to demonstrate how it works.

apache_hadoop mapreduce big_data cloud_computing
Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...

AngularJS, together with Node.js, is an extremely powerful combination for building single page applications. Unfortunately, its development requires writing HTML and Javascript, which is tedious and error prone. By using vibe.d, HTML is no longer necessary, and the developers can use the full power of a static-typed language for the development of the backend. Substituting Javascript with Typescript in addition to a little bit of CTFE D magic then removes the need for redundant data type declarations, and makes everything statically typed. At the end of the talk, the attendee will have witnessed the creation of a statically typed, asynchronous single page application that required little extra typing than its dynamically typed equivalent. Additionally, the attendees will be motivated to explore the presented combination of frameworks as a viable desktop application UI framework.

software developmentdconf
Design for Scalability in ADAM
Design for Scalability in ADAMDesign for Scalability in ADAM
Design for Scalability in ADAM

ADAM is an open source, high performance, distributed platform for genomic analysis built on Apache Spark. It defines a Scala API and data schema using Avro and Parquet to store data in a columnar format, addressing the I/O bottleneck in genomics pipelines. ADAM implements common genomics algorithms as data or graph parallel computations and minimizes data movement by sending code to the data using Spark. It is designed to scale to processing whole human genomes across distributed file systems and cloud infrastructure.

adamga4ghbdgenomics
word_count.py
PyCascading minimizes programmer effort
10
GSupportcodeMap
PyCascading workflow
The basics of writing a Cascading flow in Python
• There is one main script that must contain a main() function
• We build the pipeline in main()
• Pipes are joined together with the pipe operator |
• Pipe ends may be assigned to variables and reused (split)
• All the user-defined operations are Python functions
• Globally or locally-scoped
• Then submit the pipeline to be run to PyCascading
• The main Python script will be executed on each of the workers when they spin up to
import global declarations
• This is the reason we have to have main(), so that it won’t be executed again
11
PyCascading by example
Walk through the operations using an example
• Data
• A friendship network in long format
• List of interests per user, ordered by decreasing importance
• Question
• For every user, find which main interest among the friends occurs the most
• Workflow
• Take the most important interest per user, and join it to the friendship table
• For each user, count how many times each interest appears, and select the one with the
maximum count
12
Friendship network User interests
The full source
13

Recommended for you

Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache Spark

Since its debut in 2010, Apache Spark has become one of the most popular Big Data technologies in the Apache open source ecosystem. In addition to enabling processing of large data sets through its distributed computing architecture, Spark provides out-of-the-box support for machine learning, streaming and graph processing in a single framework. Spark has been supported by companies like Microsoft, Google, Amazon and IBM and in financial services, companies like Blackrock (http://bit.ly/1Q1DVJH ) and Bloomberg (http://bit.ly/29LXbPv ) have started to integrate Apache Spark into their tool chain and the interest is growing. Unlike other big-data technologies which require intensive programming using Java etc., Spark enables data scientists to work with a big-data technology using higher level languages like Python and R making it accessible to conduct experiments and for rapid prototyping. In this talk, we will introduce Apache Spark and discuss the key features that differentiate Apache Spark from other technologies. We will provide examples on how Apache Spark can help scale analytics and discuss how the machine learning API could be used to solve large-scale machine learning problems using Spark’s distributed computing framework. We will also illustrate enterprise use cases for scaling analytics with Apache Spark.

hyperloglogquantuniversityanalytics
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2

In these slides we analyze why the aggregate data models change the way data is stored and manipulated. We introduce MapReduce and its open source implementation Hadoop. We consider how MapReduce jobs are written and executed by Hadoop. Finally we introduce spark using a docker image and we show how to use anonymous function in spark. The topics of the next slides will be - Spark Shell (Scala, Python) - Shark Shell - Data Frames - Spark Streaming - Code Examples: Data Processing and Machine Learning

anonymous functionsscatter gatherjobs
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and Pipes

Hadoop Streaming allows any executable or script to be used as a MapReduce job. It works by launching the executable or script as a separate process and communicating with it via stdin and stdout. The executable or script receives key-value pairs in a predefined format and outputs new key-value pairs that are collected. Hadoop Streaming uses PipeMapper and PipeReducer to adapt the external processes to the MapReduce framework. It provides a simple way to run MapReduce jobs without writing Java code.

mapreducehadoop
Defining the inputs
14
Defining the inputs
14
Need to use Java
types since this is a
Cascading call
Shaping the fields: “mapping”
15
Shaping the fields: “mapping”
15
Replace the interest field
with the result yielded by
take_first, and call it
interest

Recommended for you

Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project

This document discusses experiences using Hadoop and HBase in the Perf-Log project. It provides an overview of the Perf-Log data format and architecture, describes how Hadoop and HBase were configured, and gives examples of using MapReduce jobs and HBase APIs like Put and Scan to analyze log data. Key aspects covered include matching Hadoop and HBase versions, running MapReduce jobs, using column families in HBase, and filtering Scan results.

hbaseperformancehadoop
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python

The document discusses using Python with Hadoop frameworks. It introduces Hadoop Distributed File System (HDFS) and MapReduce, and how to use the mrjob library to write MapReduce jobs in Python. It also covers using Python with higher-level Hadoop frameworks like Pig, accessing HDFS with snakebite, and using Python clients for HBase and the PySpark API for the Spark framework. Key advantages discussed are Python's rich ecosystem and ability to access Hadoop frameworks.

pythonmapreducepig
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reuse

This document presents m2r2, a framework for materializing and reusing results in high-level dataflow systems for big data. The framework operates at the logical plan level to be language-independent. It includes components for matching plans, rewriting queries to reuse past results, optimizing plans, caching results, and garbage collection. An evaluation using the TPC-H benchmark on Pig Latin showed the framework reduced query execution time by 65% on average by reusing past query results. Future work includes integrating it with more systems and minimizing materialization costs.

apache pigmapreducebigdata
Shaping the fields: “mapping”
15
Decorators annotate
user-defined functions
Replace the interest field
with the result yielded by
take_first, and call it
interest
Shaping the fields: “mapping”
15
Decorators annotate
user-defined functions
tuple is a Cascading
record type
Replace the interest field
with the result yielded by
take_first, and call it
interest
Shaping the fields: “mapping”
15
Decorators annotate
user-defined functions
tuple is a Cascading
record type
We can return any
number of new
records with yield
Replace the interest field
with the result yielded by
take_first, and call it
interest
Checkpointing
16

Recommended for you

Data Science
Data ScienceData Science
Data Science

The document provides an overview of data science with Python and integrating Python with Hadoop and Apache Spark frameworks. It discusses: - Why Python should be integrated with Hadoop and the ecosystem including HDFS, MapReduce, and Spark. - Key concepts of Hadoop including HDFS for storage, MapReduce for processing, and how Python can be integrated via APIs. - Benefits of Apache Spark like speed, simplicity, and efficiency through its RDD abstraction and how PySpark enables Python access. - Examples of using Hadoop Streaming and PySpark to analyze data and determine word counts from documents.

Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight

SF Bay Area Azure Developers meetup at Microsoft, SF on 2013-06-11 http://www.meetup.com/bayazure/events/120889902/

hadoopcascadingcascalog
Unit-5 [Pig] working and architecture.pptx
Unit-5 [Pig] working and architecture.pptxUnit-5 [Pig] working and architecture.pptx
Unit-5 [Pig] working and architecture.pptx

Features and Working with PIG

pigarchitectureworking
Checkpointing
16
Take the data EITHER from the cache
(ID: “users_first_interests”), OR
generate it if it’s not cached yet
Grouping & aggregating
17
Grouping & aggregating
17
Group by user, and call
the two result fields
user and friend
Grouping & aggregating
17
Define a UDF that takes the the
grouping fields, a tuple iterator, and
optional arguments
Group by user, and call
the two result fields
user and friend

Recommended for you

Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUGCassandra Talk: Austin JUG
Cassandra Talk: Austin JUG

This document provides an overview of Apache Cassandra, a distributed database designed for managing large amounts of structured data across commodity servers. It discusses Cassandra's data model, which is based on Dynamo and Bigtable, as well as its client API and operational benefits like easy scaling and high availability. The document uses a Twitter-like application called StatusApp to illustrate Cassandra's data model and provide examples of common operations.

distributedjavaaustin
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O

This document summarizes a presentation by Erin LeDell from H2O.ai about machine learning using the H2O software. H2O is an open-source machine learning platform that provides APIs for R, Python, Scala and other languages. It allows distributed machine learning on large datasets across clusters. The presentation covers H2O's architecture, algorithms like random forests and deep learning, and how to use H2O within R including loading data, training models, and running grid searches. It also discusses H2O on Spark via Sparkling Water and real-world use cases with customers.

workshopmachine learningbig data
Accelerating apache spark with rdma
Accelerating apache spark with rdmaAccelerating apache spark with rdma
Accelerating apache spark with rdma

"Apache Spark is today’s fastest growing Big Data analysis platform. Spark workloads typically maintain a persistent data set in memory, which is accessed multiple times over the network. Consequently, networking IO performance is a critical component in Spark systems. RDMA’s performance characteristics, such as high bandwidth, low latency, and low CPU overhead, offer a good opportunity for accelerating Spark by improving its data transfer facilities." "In this talk, we present a Java-based, RDMA network layer for Apache Spark. The implementation optimized both the RPC and the Shuffle mechanisms for RDMA. Initial benchmarking shows up to 25% improvement for Spark Applications." Watch the video presentation: http://wp.me/p3RLHQ-gzN Learn more: http://mellanox.com Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter

openfabrics workshopinfinibandinsidehpc
Grouping & aggregating
17
Define a UDF that takes the the
grouping fields, a tuple iterator, and
optional arguments
Use the .get getter
with the field name
Group by user, and call
the two result fields
user and friend
Grouping & aggregating
17
Define a UDF that takes the the
grouping fields, a tuple iterator, and
optional arguments
Use the .get getter
with the field name
Yield any number of
results
Group by user, and call
the two result fields
user and friend
Joins & field algebra
18
Joins & field algebra
18

Recommended for you

2014 hadoop wrocław jug
2014 hadoop   wrocław jug2014 hadoop   wrocław jug
2014 hadoop wrocław jug

This document provides an introduction to Hadoop, including: - An overview of big data and the challenges it poses for data storage and processing. - How Hadoop addresses these challenges through its distributed, scalable architecture based on MapReduce and HDFS. - Descriptions of key Hadoop components like MapReduce, HDFS, Hive, and Sqoop. - Examples of how to perform common data processing tasks like word counting and friend recommendations using MapReduce. - Some best practices, limitations, and other tools in the Hadoop ecosystem.

Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data

A short overview presentation on Emerging technologies /frameworks in Big Data covering Apache Parquet, Apache Flink, Apache Drill with basic concepts of Columnar Storage and Dremel.

columnar storageparquetdremel
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale

Description of some of the elements that go in to creating a PostgreSQL-as-a-Service for organizations with many teams and a diverse ecosystem of applications and teams.

as-a-serviceconsulpostgresql
Joins & field algebra
18
Join on the friend field from
the 1st stream, and on the
user field from the 2nd
Joins & field algebra
18
No field name
overlap is
allowed
Join on the friend field from
the 1st stream, and on the
user field from the 2nd
Joins & field algebra
18
No field name
overlap is
allowed
Keep certain
fields & rename
Join on the friend field from
the 1st stream, and on the
user field from the 2nd
Joins & field algebra
18
No field name
overlap is
allowed
Keep certain
fields & rename
Join on the friend field from
the 1st stream, and on the
user field from the 2nd

Recommended for you

Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014

Johnny Miller – Cassandra + Spark = Awesome This talk will discuss how Cassandra and Spark can work together to deliver real-time analytics. This is a technical discussion that will introduce the attendees to the basic principals on Cassandra and Spark, why they work well together and examples usecases.

johnny millerdatastaxnosql matters barcelona 2014
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala

Impala is an open source SQL query engine for Apache Hadoop that allows real-time queries on large datasets stored in HDFS and other data stores. It uses a distributed architecture where an Impala daemon runs on each node and coordinates query planning and execution across nodes. Impala allows SQL queries to be run directly against files stored in HDFS and other formats like Avro and Parquet. It aims to provide high performance for both analytical and transactional workloads through its C++ implementation and avoidance of MapReduce.

groverimpalamark
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab

PDF and Keynote version of the presentation available here: https://github.com/h2oai/h2o-meetups/tree/master/2017_04_04_HarvardMed_Scalable_Ensembles

machine learningensemble learning
Joins & field algebra
18
No field name
overlap is
allowed
Keep certain
fields & rename
Join on the friend field from
the 1st stream, and on the
user field from the 2nd
Joins & field algebra
18
No field name
overlap is
allowed
Keep certain
fields & rename
Join on the friend field from
the 1st stream, and on the
user field from the 2nd
Use built-in
aggregators
where possible
Joins & field algebra
18
No field name
overlap is
allowed
Keep certain
fields & rename
Join on the friend field from
the 1st stream, and on the
user field from the 2nd
Use built-in
aggregators
where possible
Joins & field algebra
18
No field name
overlap is
allowed
Keep certain
fields & rename
Join on the friend field from
the 1st stream, and on the
user field from the 2nd
Use built-in
aggregators
where possible

Recommended for you

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...

This document outlines the steps to build your own natural language processing (NLP) system, beginning with creating a streaming consumer, launching a message queue service, creating a data pre-processing service, serving an ML model, and publishing predictions to a messaging app. It discusses separating components for modularity and ease of testing/extensibility. The presenter recommends tools like Anaconda, Docker, Redis, Fast.ai and SpaCy and walks through setting up the environment and each step in a Jupyter notebook. The goal is to experiment with building your own end-to-end NLP system in a modular, reusable way.

Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh

In the same way that we need to make assertions about how code functions, we need to make assertions about data, and unit testing is a promising framework. In this talk, we'll explore what is unique about unit testing data, and see how Two Sigma's open source library Marbles addresses these unique challenges in several real-world scenarios.

The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski

TileDB is an open-source storage manager for multi-dimensional sparse and dense array data. It has a novel architecture that addresses some of the pain points in storing array data on “big-data” and “cloud” storage architectures. This talk will highlight TileDB’s design and its ability to integrate with analysis environments relevant to the PyData community such as Python, R, Julia, etc.

Joins & field algebra
18
No field name
overlap is
allowed
Keep certain
fields & rename
Join on the friend field from
the 1st stream, and on the
user field from the 2nd
Save this stream!
Use built-in
aggregators
where possible
Split & aggregate more
19
Split & aggregate more
19
Split the stream to
group by user, and
find the interest
that appears most
by count
Split & aggregate more
19
Split the stream to
group by user, and
find the interest
that appears most
by count
Once the data flow is
built, submit and run it!

Recommended for you

Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...

In this talk I will discuss exponential family embeddings, which are methods that extend the idea behind word embeddings to other data types. I will describe how we used dynamic embeddings to understand how data science skill-sets have transformed over the last 3 years using our large corpus of jobs. The key takeaway is that these models can enrich analysis of specialized datasets.

Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer

How many newspapers should be distributed to each store for sale every day? The data science group at The New York Times addresses this optimization problem using custom time series modeling and analytical solutions, while also incorporating qualitative business concerns. I'll describe our modeling and data engineering approaches, written in Python and hosted on Google Cloud Platform.

Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma

This document provides an introduction to graph theory concepts and working with graph data in Python. It begins with basic graph definitions and real-world graph examples. Various graph concepts are then demonstrated visually, such as vertices, edges, paths, cycles, and graph properties. Finally, it discusses working with graph data structures and algorithms in the NetworkX library in Python, including graph generation, analysis, and visualization. The overall goal is to introduce readers to graph theory and spark their interest in further exploration.

Running the script
Local or remote runs
• Cascading flows can run locally or on HDFS
• Local run for testing
• local_run.sh recommendation.py
• Remote run in production
• remote_deploy.sh -s server.com 
recommendation.py
• The example had 5 MR stages
• Although the problem was simple, doing it by hand would
have been inconvenient
20
Friendship network User interests
friends_interests_counts
recommendations
Some remarks
• Benefits
• Can use any Java class
• Can be mixed with Java code
• Can use Python libraries
• Caveats
• Only pure Python code can be used, no compiled C (numpy, scipy)
• But with streaming it’s possible to execute a CPython interpreter
• Some idiosyncrasies because of Jython’s representation of basic types
• Strings are OK, but Python integers are represented as java.math.BigInteger, so
before yielding explicit conversion is needed (joins!)
21
Contact
• Javadoc: http://cascading.org
• Other Cascading-based wrappers
• Scalding (Scala), Cascalog (Clojure), Cascading-JRuby (Ruby)
22
http://github.org/twitter/pycascading
http://pycascading.org
@gaborjszabo
gabor@twitter.com
Implementation details
Challenges due to an interpreted language
• We need to make code available on all workers
• Java bytecode is easy, same .jar everywhere
• Although Jython represents Python functions as classes, they cannot be serialized
• We need to start an interpreter on every worker
• The Python source of the UDFs is retrieved and shipped in the .jar
• Different Hadoop distributions explode the .jar differently, need to use the Hadoop distributed
cache
23

Recommended for you

Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...

To productionize data science work (and have it taken seriously by software engineers, CTOs, clients, or the open source community), you need to write tests! Except… how can you test code that performs nondeterministic tasks like natural language parsing and modeling? This talk presents an approach to testing probabilistic functions in code, illustrated with concrete examples written for Pytest.

RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro

Those of us who use TensorFlow often focus on building the model that's most predictive, not the one that's most deployable. So how to put that hard work to work? In this talk, we'll walk through a strategy for taking your machine learning models from Jupyter Notebook into production and beyond.

Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...

In September 2017, dockless bikeshare joined the transportation options in the District of Columbia. In March 2018, scooter share followed. During the pilot of these technologies, Python has helped District Department of Transportation answer some critical questions. This talk will discuss how Python was used to answer research questions and how it supported the evaluation of this demonstration.

More Related Content

What's hot

Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
DataWorks Summit
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
Gwen (Chen) Shapira
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
Dilip Reddy
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explained
Dmytro Sandu
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
Muhammad Shahid
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
Prashant Gupta
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
Newvewm
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
Emilio Coppa
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
DataWorks Summit
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
spark-project
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
Xuan-Chao Huang
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
MapR Technologies
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
Andrii Gakhov
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Datio Big Data
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
Bojan Babic
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
Kelly Technologies
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
Farzad Nozarian
 

What's hot (19)

Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explained
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 

Similar to PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)

Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...
Robert Schadek
 
Design for Scalability in ADAM
Design for Scalability in ADAMDesign for Scalability in ADAM
Design for Scalability in ADAM
fnothaft
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache Spark
QuantUniversity
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and Pipes
Hanborq Inc.
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
Mao Geng
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
Donald Miner
 
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reuse
Vasia Kalavri
 
Data Science
Data ScienceData Science
Data Science
Subhajit75
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Paco Nathan
 
Unit-5 [Pig] working and architecture.pptx
Unit-5 [Pig] working and architecture.pptxUnit-5 [Pig] working and architecture.pptx
Unit-5 [Pig] working and architecture.pptx
tripathineeharika
 
Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUGCassandra Talk: Austin JUG
Cassandra Talk: Austin JUG
Stu Hood
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
Sri Ambati
 
Accelerating apache spark with rdma
Accelerating apache spark with rdmaAccelerating apache spark with rdma
Accelerating apache spark with rdma
inside-BigData.com
 
2014 hadoop wrocław jug
2014 hadoop   wrocław jug2014 hadoop   wrocław jug
2014 hadoop wrocław jug
Wojciech Langiewicz
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
Rahul Jain
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
Sean Chittenden
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
NoSQLmatters
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
markgrover
 
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Sri Ambati
 

Similar to PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo) (20)

Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...
 
Design for Scalability in ADAM
Design for Scalability in ADAMDesign for Scalability in ADAM
Design for Scalability in ADAM
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache Spark
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and Pipes
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reuse
 
Data Science
Data ScienceData Science
Data Science
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Unit-5 [Pig] working and architecture.pptx
Unit-5 [Pig] working and architecture.pptxUnit-5 [Pig] working and architecture.pptx
Unit-5 [Pig] working and architecture.pptx
 
Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUGCassandra Talk: Austin JUG
Cassandra Talk: Austin JUG
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
Accelerating apache spark with rdma
Accelerating apache spark with rdmaAccelerating apache spark with rdma
Accelerating apache spark with rdma
 
2014 hadoop wrocław jug
2014 hadoop   wrocław jug2014 hadoop   wrocław jug
2014 hadoop wrocław jug
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
 

More from PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
PyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
PyData
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
PyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
PyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
PyData
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
PyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
PyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
PyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
PyData
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
PyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
PyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
PyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
PyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
PyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
PyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
PyData
 

More from PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Recently uploaded

TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
Emerging Tech
 
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
welrejdoall
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
Larry Smarr
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
Andrey Yasko
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
Yevgen Sysoyev
 
Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024
BookNet Canada
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
Eric D. Schabell
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
Adam Dunkels
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
ScyllaDB
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
ishalveerrandhawa1
 
20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf
Sally Laouacheria
 
20240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 202420240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 2024
Matthew Sinclair
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
Kief Morris
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
jackson110191
 
Password Rotation in 2024 is still Relevant
Password Rotation in 2024 is still RelevantPassword Rotation in 2024 is still Relevant
Password Rotation in 2024 is still Relevant
Bert Blevins
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
Mark Billinghurst
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
RaminGhanbari2
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
BookNet Canada
 

Recently uploaded (20)

TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
 
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
 
Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf
 
20240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 202420240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 2024
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
 
Password Rotation in 2024 is still Relevant
Password Rotation in 2024 is still RelevantPassword Rotation in 2024 is still Relevant
Password Rotation in 2024 is still Relevant
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
 

PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)

  • 1. PyCascading for Intuitive Flow Processing With Hadoop Gabor Szabo Senior Data Scientist Twitter, Inc.
  • 2. Outline • Basic concepts in the Hadoop ecosystem, with an example • Hadoop • Cascading • PyCascading • Essential PyCascading operations • PyCascading by example: discovering main interests among friends • Miscellaneous remarks, caveats 2
  • 3. Hadoop Architecture • The Hadoop file system (HDFS) • Large, distributed file system • Thousands of nodes, PBs of data • The storage layer for Apache Hive, HBase, ... • MapReduce • Idea: ship the code to the data, not other way around • Do aggregations locally • Iterate on the results • Map phase: process the input records, emit a key & a value • Reduce phase: collect records with the same key from Map, emit a new (aggregate) record • Fault tolerance • Both storage and compute are fault tolerant (redundancy, replication, restart) 3
  • 4. Hadoop In practice • Language • Java • Need to think in MapReduce • Hard to translate the problem to MR • Hard to maintain and make changes in the topology • Best used for • Archiving (HDFS) • Batch processing (MR) 4
  • 5. Cascading The Cascading way: flow processing • Cascading is built on top of Hadoop • Introduces semi-structured flow processing of tuples with typed fields • Analogy: data is flowing in pipes • Input comes from source taps • Output goes to sink taps • Data is reshaped in the pipes by different operations • Builds a DAG from the job, and optimizes the topology to minimize the number of MapReduce phases • The pipes analogy is more intuitive to use than raw MapReduce 5
  • 6. Flow processing in (Py)Cascading 6 Source: cascading.org
  • 7. Flow processing in (Py)Cascading 6 Source: cascading.org Source tap
  • 8. Flow processing in (Py)Cascading 6 Source: cascading.org Source tap Sink tap
  • 9. Flow processing in (Py)Cascading 6 Source: cascading.org Source tap Filter Sink tap
  • 10. Flow processing in (Py)Cascading 6 Source: cascading.org Source tap Filter Sink tap “Map”
  • 11. Flow processing in (Py)Cascading 6 Source: cascading.org Source tap Filter Sink tap “Map” Join
  • 12. Flow processing in (Py)Cascading 6 Source: cascading.org Source tap Filter Group & aggregate Sink tap “Map” Join
  • 13. PyCascading Design • Built on top of Cascading • Uses the Jython 2.5 interpreter • Everything in Python • Building the pipelines • User-defined functions that operate on data • Completely hides Java if the user wants it to • However due to the strong ties with Java, it’s worth knowing the Cascading classes 7
  • 14. Example: as always, WordCount Writing MapReduce jobs by hand is hard • WordCount: split the input file into words, and count how many times each word occurs 8
  • 15. Example: as always, WordCount Writing MapReduce jobs by hand is hard • WordCount: split the input file into words, and count how many times each word occurs 8 M
  • 16. Example: as always, WordCount Writing MapReduce jobs by hand is hard • WordCount: split the input file into words, and count how many times each word occurs 8 MR
  • 17. Example: as always, WordCount Writing MapReduce jobs by hand is hard • WordCount: split the input file into words, and count how many times each word occurs 8 MRSupportcodeSupportcode
  • 18. Cascading WordCount • Still in Java, but algorithm design is easier • Need to write separate classes for each user-defined operation 9
  • 19. Cascading WordCount • Still in Java, but algorithm design is easier • Need to write separate classes for each user-defined operation 9 M
  • 20. Cascading WordCount • Still in Java, but algorithm design is easier • Need to write separate classes for each user-defined operation 9 MG
  • 21. Cascading WordCount • Still in Java, but algorithm design is easier • Need to write separate classes for each user-defined operation 9 MGSupportcodeSupportcode
  • 26. PyCascading workflow The basics of writing a Cascading flow in Python • There is one main script that must contain a main() function • We build the pipeline in main() • Pipes are joined together with the pipe operator | • Pipe ends may be assigned to variables and reused (split) • All the user-defined operations are Python functions • Globally or locally-scoped • Then submit the pipeline to be run to PyCascading • The main Python script will be executed on each of the workers when they spin up to import global declarations • This is the reason we have to have main(), so that it won’t be executed again 11
  • 27. PyCascading by example Walk through the operations using an example • Data • A friendship network in long format • List of interests per user, ordered by decreasing importance • Question • For every user, find which main interest among the friends occurs the most • Workflow • Take the most important interest per user, and join it to the friendship table • For each user, count how many times each interest appears, and select the one with the maximum count 12 Friendship network User interests
  • 30. Defining the inputs 14 Need to use Java types since this is a Cascading call
  • 31. Shaping the fields: “mapping” 15
  • 32. Shaping the fields: “mapping” 15 Replace the interest field with the result yielded by take_first, and call it interest
  • 33. Shaping the fields: “mapping” 15 Decorators annotate user-defined functions Replace the interest field with the result yielded by take_first, and call it interest
  • 34. Shaping the fields: “mapping” 15 Decorators annotate user-defined functions tuple is a Cascading record type Replace the interest field with the result yielded by take_first, and call it interest
  • 35. Shaping the fields: “mapping” 15 Decorators annotate user-defined functions tuple is a Cascading record type We can return any number of new records with yield Replace the interest field with the result yielded by take_first, and call it interest
  • 37. Checkpointing 16 Take the data EITHER from the cache (ID: “users_first_interests”), OR generate it if it’s not cached yet
  • 39. Grouping & aggregating 17 Group by user, and call the two result fields user and friend
  • 40. Grouping & aggregating 17 Define a UDF that takes the the grouping fields, a tuple iterator, and optional arguments Group by user, and call the two result fields user and friend
  • 41. Grouping & aggregating 17 Define a UDF that takes the the grouping fields, a tuple iterator, and optional arguments Use the .get getter with the field name Group by user, and call the two result fields user and friend
  • 42. Grouping & aggregating 17 Define a UDF that takes the the grouping fields, a tuple iterator, and optional arguments Use the .get getter with the field name Yield any number of results Group by user, and call the two result fields user and friend
  • 43. Joins & field algebra 18
  • 44. Joins & field algebra 18
  • 45. Joins & field algebra 18 Join on the friend field from the 1st stream, and on the user field from the 2nd
  • 46. Joins & field algebra 18 No field name overlap is allowed Join on the friend field from the 1st stream, and on the user field from the 2nd
  • 47. Joins & field algebra 18 No field name overlap is allowed Keep certain fields & rename Join on the friend field from the 1st stream, and on the user field from the 2nd
  • 48. Joins & field algebra 18 No field name overlap is allowed Keep certain fields & rename Join on the friend field from the 1st stream, and on the user field from the 2nd
  • 49. Joins & field algebra 18 No field name overlap is allowed Keep certain fields & rename Join on the friend field from the 1st stream, and on the user field from the 2nd
  • 50. Joins & field algebra 18 No field name overlap is allowed Keep certain fields & rename Join on the friend field from the 1st stream, and on the user field from the 2nd Use built-in aggregators where possible
  • 51. Joins & field algebra 18 No field name overlap is allowed Keep certain fields & rename Join on the friend field from the 1st stream, and on the user field from the 2nd Use built-in aggregators where possible
  • 52. Joins & field algebra 18 No field name overlap is allowed Keep certain fields & rename Join on the friend field from the 1st stream, and on the user field from the 2nd Use built-in aggregators where possible
  • 53. Joins & field algebra 18 No field name overlap is allowed Keep certain fields & rename Join on the friend field from the 1st stream, and on the user field from the 2nd Save this stream! Use built-in aggregators where possible
  • 54. Split & aggregate more 19
  • 55. Split & aggregate more 19 Split the stream to group by user, and find the interest that appears most by count
  • 56. Split & aggregate more 19 Split the stream to group by user, and find the interest that appears most by count Once the data flow is built, submit and run it!
  • 57. Running the script Local or remote runs • Cascading flows can run locally or on HDFS • Local run for testing • local_run.sh recommendation.py • Remote run in production • remote_deploy.sh -s server.com recommendation.py • The example had 5 MR stages • Although the problem was simple, doing it by hand would have been inconvenient 20 Friendship network User interests friends_interests_counts recommendations
  • 58. Some remarks • Benefits • Can use any Java class • Can be mixed with Java code • Can use Python libraries • Caveats • Only pure Python code can be used, no compiled C (numpy, scipy) • But with streaming it’s possible to execute a CPython interpreter • Some idiosyncrasies because of Jython’s representation of basic types • Strings are OK, but Python integers are represented as java.math.BigInteger, so before yielding explicit conversion is needed (joins!) 21
  • 59. Contact • Javadoc: http://cascading.org • Other Cascading-based wrappers • Scalding (Scala), Cascalog (Clojure), Cascading-JRuby (Ruby) 22 http://github.org/twitter/pycascading http://pycascading.org @gaborjszabo gabor@twitter.com
  • 60. Implementation details Challenges due to an interpreted language • We need to make code available on all workers • Java bytecode is easy, same .jar everywhere • Although Jython represents Python functions as classes, they cannot be serialized • We need to start an interpreter on every worker • The Python source of the UDFs is retrieved and shipped in the .jar • Different Hadoop distributions explode the .jar differently, need to use the Hadoop distributed cache 23