Agile Data Science
January 2014
Agile Analytics Applications with Hadoop
2
About Me…Bearding.
• Bearding is my #1 natural talent.
• I’m going to beat this guy.
• Seriously.
• Salty Sea Beard
• Fortified with Pacific Ocean Minerals
2
3
Agile Data Science: The Book
A philosophy.
Not the only way,
but it’s a really good way!
Code: ‘AUTHD’ – 50% off
3
4
We Go Fast, But Don’t Worry!
• Download the slides - click the links - read examples!
• If it’s not on the blog (Hortonworks, Data Syndrome), it’s in the book!
• Order now: http://shop.oreilly.com/product/0636920025054.do
4

5
Agile Application
Development: Check
• LAMP stack mature
• Post-Rails frameworks to choose from
• Enable rapid feedback and agility
+ NoSQL
5
6
Data Warehousing
6
7
Scientific Computing / HPC
Tubes and Mercury (Old School) Cores and Spindles (New School)
UNIVAC and Deep Blue both fill a warehouse. We’re back!
7
‘Smart Kid’ Only: MPI, Globus, etc. Until Hadoop
8
Data Science?
Application
Development
Data Warehousing
Scientific Computing / HPC
8

9
Data Center as Computer
“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient
manner.” Click here for a paper on operating a ‘data center as computer.’
9
Warehouse Scale Computers and Applications
10
Hadoop to the Rescue!
• Easy to use (Pig, Hive, Cascading)
• CHEAP: 1% the cost of SAN/NAS
• A department can afford its own Hadoop cluster!
• Dump all your data in one place: Hadoop DFS
• Silos come CRASHING DOWN!
• JOIN like crazy!
• ETL like whoa!
• An army of mappers and reducers at your command
• OMGWTFBBQ IT’S SO GREAT! I FEEL AWESOME!
10
11
NOW
WHAT?
11
12
Analytics Apps: It takes a Team
• Broad skill-set
• Nobody has them all
• Inherently collaborative
12

13
Data Science Team
• 3-4 team members with broad, diverse skill-sets that overlap
• Transactional overhead dominates at 5+ people
• Expert researchers: lend 25-50% of their time to teams
• Creative workers. Like a studio, not an assembly line
• Total freedom... with goals and deliverables.
• Work environment matters most
13
14
How To Get Insight Into Product
• Back-end has gotten THICKER
• Generating $$$ insight can take 10-100x app dev
• Timeline disjoint: analytics vs agile app-dev/design
• How do you ship insights efficiently?
• Can you collaborate on research vs developer timeline?
14
15
The Wrong Way - Part One
“We made a great design.
Your job is to predict the future for it.”
15
16
The Wrong Way - Part Two
“What is taking you so long
to reliably predict the future?”
16

17
The Wrong Way - Part Three
“The users don’t understand
what 86% true means.”
17
18
The Wrong Way - Part Four
GHJIAEHGIEhjagigehganb!!!!!RJ(@J?!!
18
19
The Wrong Way - Conclusion
Inevitable Conclusion
Plane Mountain
19
20
Reminds me of... the waterfall model
:(
20

21
Chief Problem
You can’t design insight in analytics applications.
You discover it.
You discover by exploring.
21
22
-> Strategy
So make an app for exploring your data.
Which becomes a palette for what you ship.
Iterate and publish intermediate results.
22
23
Data Design
• Not the 1st query that = insight, it’s the 15th, or 150th
• Capturing “Ah ha!” moments
• Slow to do those in batch…
• Faster, better context in an interactive web application.
• Pre-designed charts wind up terrible. So bad.
• Easy to invest man-years in wrong statistical models
• Semantics of presenting predictions are complex
• Opportunity lies at intersection of data & design
23
24
How Do We Get Back to Agile?
24

25
Statement of Principles
(Then Tricks With Code)
25
26
Set Up an Environment Where:
• Insights repeatedly produced
• Iterative work shared with entire team
• Interactive from day Zero
• Data model is consistent end-to-end
• Minimal impedance between layers
• Scope and depth of insights grow
• Insights form the palette for what you ship
• Until the application pays for itself and more
26
27
Snowballing Audience
27
28
Value Document > Relation
Most data is dirty. Most data is semi-structured or unstructured. Rejoice!
28

29
Value Document > Relation
Note: Support for document and array types in Hive, ArrayQL, and NewSQL systems blurs this distinction.
29
30
Relational Data = Legacy Format
• Why JOIN? Storage is fundamentally cheap!
• Duplicate that JOIN data in one big record type!
• ETL once to document format on import, NOT every job
• Not zero JOINs, but far fewer JOINs
• Semi-structured documents preserve data’s actual structure
• Column compressed document formats beat JOINs!
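The "ETL once to document format" idea can be sketched in a few lines of Python. This is an illustrative sketch only: the table contents, field names, and the `to_documents` helper are assumptions, not code from the talk. It joins messages to recipients once and nests the recipients inside each message, so later jobs read one record instead of repeating the JOIN.

```python
from collections import defaultdict

# Hypothetical relational rows: a messages table and a recipients table.
messages = [
    {"message_id": "m1", "subject": "Hello", "from_address": "a@example.com"},
    {"message_id": "m2", "subject": "Re: Hello", "from_address": "b@example.com"},
]
recipients = [
    {"message_id": "m1", "reciptype": "to", "address": "b@example.com"},
    {"message_id": "m1", "reciptype": "cc", "address": "c@example.com"},
    {"message_id": "m2", "reciptype": "to", "address": "a@example.com"},
]


def to_documents(messages, recipients):
    """Join once at import time and nest recipients inside each message."""
    by_message = defaultdict(lambda: {"to": [], "cc": [], "bcc": []})
    for row in recipients:
        by_message[row["message_id"]][row["reciptype"]].append(row["address"])
    documents = []
    for msg in messages:
        grouped = by_message[msg["message_id"]]
        documents.append({
            "message_id": msg["message_id"],
            "subject": msg["subject"],
            "from": msg["from_address"],
            "tos": grouped["to"],
            "ccs": grouped["cc"],
            "bccs": grouped["bcc"],
        })
    return documents


documents = to_documents(messages, recipients)
```

Each resulting document carries its own recipients, which trades some duplicated storage for fewer JOINs downstream.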
30
31
Value Imperative > Declarative
• We don’t know what we want to SELECT.
• Data is dirty - check each step, clean iteratively.
• 85% of data scientist’s time spent munging. ETL.
• Imperative is optimized for our process.
• Process = iterative, snowballing insight
• Efficiency matters, self optimize
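A small Python sketch of what "check each step, clean iteratively" can look like in an imperative style. The records and the `check` helper are hypothetical; the point is only that every intermediate result has a name and can be inspected before moving on.

```python
# Hypothetical raw records; real data would come from a file or a cluster job.
raw = [
    {"address": " Bob@Example.com ", "subject": "Hi"},
    {"address": "", "subject": "No sender"},
    {"address": "connie@example.com", "subject": None},
]


def check(step_name, records):
    """Print a row count after each step so problems surface early."""
    print(f"{step_name}: {len(records)} records")
    return records


# Step 1: normalize addresses.
normalized = check("normalized", [
    {**r, "address": r["address"].strip().lower()} for r in raw
])
# Step 2: drop records with no sender.
with_sender = check("with_sender", [r for r in normalized if r["address"]])
# Step 3: fill in missing subjects.
cleaned = check("cleaned", [
    {**r, "subject": r["subject"] or "(no subject)"} for r in with_sender
])
```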
31
32
Value Dataflow > SELECT
32

33
Ex. Dataflow: ETL +
Email Sent Count
(I can’t read this either. Get a big version here.)
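Since the dataflow diagram is not legible here, the following Python sketch shows one plausible reading of an "email sent count": the number of emails for each (sender, recipient) pair. The sample documents and the `sent_counts` function are assumptions for illustration and are not taken from the original diagram.

```python
from collections import Counter

# Hypothetical email documents in the nested shape used elsewhere in the deck.
emails = [
    {"from": {"address": "bob@example.com"},
     "tos": [{"address": "connie@example.com"}]},
    {"from": {"address": "bob@example.com"},
     "tos": [{"address": "connie@example.com"}, {"address": "dave@example.com"}]},
    {"from": {"address": "connie@example.com"},
     "tos": [{"address": "bob@example.com"}]},
]


def sent_counts(emails):
    """Count emails for every (sender, recipient) pair."""
    counts = Counter()
    for email in emails:
        sender = email["from"]["address"]
        for recipient in email["tos"]:
            counts[(sender, recipient["address"])] += 1
    return counts


counts = sent_counts(emails)
```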
33
34
Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, JavaScript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.
Actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive.
See: HCatalog for Pig/Hive integration.
34
35
Localhost vs Petabyte Scale:
Same Tools
• Simplicity essential to scalability: highest level tools we can
• Prepare a good sample - tricky with joins, easy with documents
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust ‘databases,’ only HDFS @ n=3
• Everything we serve in our app is re-creatable via Hadoop.
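Why sampling is "easy with documents" can be sketched in Python. Because each document already contains its related data, a uniform random sample never separates a record from its join partners. The `sample_documents` helper, the seed, and the generated documents below are hypothetical.

```python
import random


def sample_documents(documents, k, seed=42):
    """Return up to k documents chosen uniformly at random, reproducibly."""
    rng = random.Random(seed)
    documents = list(documents)
    return rng.sample(documents, min(k, len(documents)))


# Hypothetical documents: each one carries its own recipients, so no join
# partner can be lost when we sample.
documents = [
    {"message_id": f"m{i}", "tos": [f"user{i}@example.com"]}
    for i in range(1000)
]
sample = sample_documents(documents, 10)
```

With separate relational tables, sampling each table independently would leave most sampled rows without matching rows on the other side of the join.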
35
36
Data-Value Pyramid
Climb it. Do not skip steps. See here.
36

37
0/1) Display Atomic Records
On The Web
37
38
0.0) Document - Serialize Events
• Protobuf
• Thrift
• JSON
• Avro - I use Avro because the schema is onboard.
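To make "the schema is onboard" concrete: an Avro data file stores its schema, written as JSON, in the file header next to the records. The sketch below builds such a schema for an email record using only the Python standard library. The field names follow the Pig schema shown later in the deck; the record names, namespace, and nullability choices are assumptions.

```python
import json

# Reusable record type for a sender or recipient (name is an assumption).
address_schema = {
    "type": "record",
    "name": "Address",
    "fields": [
        {"name": "address", "type": ["null", "string"], "default": None},
        {"name": "name", "type": ["null", "string"], "default": None},
    ],
}

# Email record; "Address" is defined inline once and referenced by name after.
email_schema = {
    "type": "record",
    "name": "Email",
    "namespace": "example.agile_data",  # hypothetical namespace
    "fields": [
        {"name": "message_id", "type": "string"},
        {"name": "date", "type": ["null", "string"], "default": None},
        {"name": "from", "type": ["null", address_schema], "default": None},
        {"name": "subject", "type": ["null", "string"], "default": None},
        {"name": "body", "type": ["null", "string"], "default": None},
        {"name": "tos", "type": {"type": "array", "items": "Address"}, "default": []},
        {"name": "ccs", "type": {"type": "array", "items": "Address"}, "default": []},
        {"name": "bccs", "type": {"type": "array", "items": "Address"}, "default": []},
    ],
}

# This JSON text is what an Avro writer would embed in the file header.
schema_json = json.dumps(email_schema, indent=2)
field_names = [field["name"] for field in email_schema["fields"]]
```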
38
39
0.1) Documents Via Relation ETL
/* Load the relational exports: one row per message, one row per recipient */
enron_messages = load '/enron/enron_messages.tsv' as (
    message_id:chararray,
    sql_date:chararray,
    from_address:chararray,
    from_name:chararray,
    subject:chararray,
    body:chararray);
enron_recipients = load '/enron/enron_recipients.tsv' as (
    message_id:chararray,
    reciptype:chararray,
    address:chararray,
    name:chararray);
/* Separate recipients by type, then group them per message */
split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;
/* Build one nested document per email */
emails = foreach with_headers generate
    enron_messages::message_id as message_id,
    CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
    TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
    enron_messages::subject as subject,
    enron_messages::body as body,
    headers::tos.(address, name) as tos,
    headers::ccs.(address, name) as ccs,
    headers::bccs.(address, name) as bccs;
store emails into '/enron/emails.avro' using AvroStorage(
Example here.
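The date handling in the script above can be illustrated with a rough Python equivalent of the `CustomFormatToISO` call. This is a sketch, not the Pig UDF itself: the function name `sql_date_to_iso` is hypothetical, and it assumes the input timestamps are already in UTC.

```python
from datetime import datetime


def sql_date_to_iso(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a SQL-style timestamp string to ISO 8601, treating it as UTC."""
    parsed = datetime.strptime(value, fmt)
    # Assumption: the source timestamps carry no timezone and are UTC.
    return parsed.strftime("%Y-%m-%dT%H:%M:%S.000Z")


iso = sql_date_to_iso("2001-01-09 06:38:00")
```

Storing dates as ISO 8601 strings keeps them sortable as plain text and readable by every layer of the stack.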
39
40
0.2) Serialize Events From Streams
import imaplib

class GmailSlurper(object):
    ...
    def init_imap(self, username, password):
        self.username = username
        self.password = password
        # Close any previous connection; ignore errors if none is open
        try:
            self.imap.shutdown()
        except Exception:
            pass
        self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
        self.imap.login(username, password)
        self.imap.is_readonly = True
    ...
    def write(self, record):
        # Append one email record to the Avro output file
        self.avro_writer.append(record)
    ...
    def slurp(self):
        if self.imap and self.imap_folder:
            for email_id in self.id_list:
                (status, email_hash, charset) = self.fetch_email(email_id)
                if status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash:
                    print(email_id, charset, email_hash['thread_id'])
                    self.write(email_hash)
Scrape your own Gmail in Python and Ruby.
40

41
0.3) ETL Logs
log_data = LOAD 'access_log'
    USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
    AS (remoteAddr,
        remoteLogname,
        user,
        time,
        method,
        uri,
        proto,
        bytes);
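For readers without Pig at hand, the same kind of log parsing can be sketched in Python with a regular expression for the Apache Common Log Format. This is an illustration under assumptions, not the piggybank loader's implementation: the pattern, the `parse_clf` helper, and the sample line are hypothetical. The group names mirror the field names above, plus the HTTP `status` field that Common Log Format includes.

```python
import re

# One named group per Common Log Format field.
CLF_PATTERN = re.compile(
    r'^(?P<remoteAddr>\S+) (?P<remoteLogname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)$'
)


def parse_clf(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line.strip())
    return match.groupdict() if match else None


line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_clf(line)
```

Returning None for lines that do not match makes dirty input visible instead of silently dropping it.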
41
42
1) Plumb Atomic Events->Browser
(Example stack that enables high productivity)
42
43
1.1) Cat Avro Serialized Events
me$ cat_avro ~/Data/enron.avro
{
    u'bccs': [],
    u'body': u'scamming people, blah blah',
    u'ccs': [],
    u'date': u'2000-08-28T01:50:00.000Z',
    u'from': {u'address': u'bob.dobbs@enron.com', u'name': None},
    u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
    u'subject': u'Re: Enron trade for frop futures',
    u'tos': [
        {u'address': u'connie@enron.com', u'name': None}
    ]
}
Get cat_avro in python, ruby
43
44
1.2) Load Events in Pig
me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails
emails: {
    message_id: chararray,
    datetime: chararray,
    from: tuple(address: chararray,name: chararray),
    subject: chararray,
    body: chararray,
    tos: {to: (address: chararray,name: chararray)},
    ccs: {cc: (address: chararray,name: chararray)},
    bccs: {bcc: (address: chararray,name: chararray)}
}
44

45
1.3) ILLUSTRATE Events in Pig
grunt> illustrate enron_emails
 ---------------------------------------------------------------------------
| emails |
| message_id:chararray |
| datetime:chararray |
| from:tuple(address:chararray,name:chararray) |
| subject:chararray |
| body:chararray |
| tos:bag{to:tuple(address:chararray,name:chararray)} |
| ccs:bag{cc:tuple(address:chararray,name:chararray)} |
| bccs:bag{bcc:tuple(address:chararray,name:chararray)} |
---------------------------------------------------------------------------
| |
| <1731.10095812390082.JavaMail.evans@thyme> |
| 2001-01-09T06:38:00.000Z |
| (bob.dobbs@enron.com, J.R. Bob Dobbs) |
| Re: Enron trade for frop futures |
| scamming people, blah blah |
| {(connie@enron.com,)} |
| {} |
| {} |
Upgrade to Pig 0.10+
45
46
1.4) Publish Events to a ‘Database’
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro \
  -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig
Which does this:
/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
/* By default, let's have 5 reducers */
set default_parallel 5
/* AvroStorage is assumed to be registered and defined already (for example from Piggybank) */
avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();
Full instructions here.
46
47
1.5) Check Events in ‘Database’
$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron
> show collections
emails
system.indexes
> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
  "_id" : ObjectId("502b4ae703643a6a49c8d180"),
  "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
  "date" : "2001-01-09T06:38:00.000Z",
  "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
  "subject" : "Re: Enron trade for frop futures",
  "body" : "Scamming more people...",
  "tos" : [ { "address" : "connie@enron.com", "name" : null } ],
  "ccs" : [ ],
  "bccs" : [ ]
}
47
48
1.6) Publish Events on the Web
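require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'

# Connect to the local MongoDB (legacy 1.x Ruby driver API)
connection = Mongo::Connection.new
# The 'enron' database and 'emails' collection match the mongourl used in step 1.4
database = connection['enron']
collection = database['emails']

# Serve a single email as JSON, looked up by its message ID
get '/email/:message_id' do |message_id|
  data = collection.find_one({:message_id => message_id})
  halt 404 unless data
  JSON.generate(data)
end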
48

49
1.6) Publish Events on the Web
49
50
One-Liner to Transition Stack
50
51
What’s the Point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.
• Entire team is grounded in reality!
• You’ll see how ugly your data really is.
• You’ll see how much work you have yet to do.
• Ship early and often!
• Feels agile, don’t it? Keep it up!
51
52
1.7) Wrap Events with Bootstrap
<head>
  <link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
  <div class="container" style="margin-top: 100px;">
    <table class="table table-striped table-bordered table-condensed">
      <thead>
        <tr>
          {% for key in data['keys'] %}
          <th>{{ key }}</th>
          {% endfor %}
        </tr>
      </thead>
      <tbody>
        <tr>
          {% for value in data['values'] %}
          <td>{{ value }}</td>
          {% endfor %}
        </tr>
      </tbody>
    </table>
  </div>
</body>
Complete example here with code here.
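The template above loops over data['keys'] and data['values'], so the web layer has to hand it two parallel lists. A minimal Python sketch of that reshaping follows; the function name to_table_data and the choice to skip the _id field are assumptions for illustration, not taken from the slides.

```python
def to_table_data(record, skip=("_id",)):
    """Split one event document into parallel 'keys' and 'values' lists.

    The field order of the record is preserved, so column headers and
    cells line up when the template iterates over each list.
    """
    keys = [k for k in record if k not in skip]
    values = [record[k] for k in keys]
    return {"keys": keys, "values": values}


# Hypothetical record shaped like the email document from step 1.5
email = {
    "_id": "502b4ae703643a6a49c8d180",
    "message_id": "<1731.10095812390082.JavaMail.evans@thyme>",
    "date": "2001-01-09T06:38:00.000Z",
    "subject": "Re: Enron trade for frop futures",
}
data = to_table_data(email)
print(data["keys"])
```

The resulting dictionary can be passed to the template as its data variable.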
52

53
1.7) Wrap Events with Bootstrap
53
54
Refine. Add Links Between Documents.
Not the Mona Lisa, but coming along... See: here
54
56
1.8) List Links to Sorted Events
Use your ‘database’, if it can sort:
$ mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
"_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
"message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
"from" : [
...
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w
emails_per_user = foreach (group emails by from.address) {
    sorted = order emails by date desc; /* newest first, so the limit keeps the latest emails */
    last_1000 = limit sorted 1000;
    generate group as from_address, last_1000 as emails;
};
store emails_per_user into '$mongourl' using MongoStorage();
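Once the events are sorted, the list page only needs a link per event. The sketch below shows one way to build those links in Python, assuming the /email/:message_id route from step 1.6; the helper name email_links and the sample records are hypothetical. Message IDs contain characters such as < and @, so they are percent-encoded before going into a URL.

```python
from urllib.parse import quote


def email_links(emails, limit=10):
    """Return (url, subject) pairs for the newest emails, newest first.

    ISO 8601 timestamps in a single format sort chronologically as plain
    strings, so no date parsing is needed here.
    """
    newest = sorted(emails, key=lambda e: e["date"], reverse=True)[:limit]
    return [
        ("/email/" + quote(e["message_id"], safe=""), e["subject"])
        for e in newest
    ]


# Hypothetical sample records
emails = [
    {"message_id": "<a@thyme>", "date": "2001-01-09T06:38:00.000Z", "subject": "first"},
    {"message_id": "<b@thyme>", "date": "2001-02-01T10:00:00.000Z", "subject": "second"},
]
links = email_links(emails)
for url, subject in links:
    print(url, subject)
```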
56
57
1.8) List Links to Sorted Documents
57

58
1.9) Make It Searchable
If you have a list, search is easy with ElasticSearch and Wonderdog...
/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
Test it with curl:
curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'
ElasticSearch has no security features. Take note. Isolate.
58
59
2) Create Simple Charts
59
60
2) Create Simple Tables and Charts
60
61
2) Create Simple Charts
• Start with an HTML table on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregating by properties and displaying the results is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count.
• Publish top N, time series, etc.
• Fill a page with charts.
• Add a chart to your event page.
61

62
2.1) Top N (of Anything) in Pig
pig -l /tmp -x local -v -w
top_things = foreach (group things by key) {
    sorted = order things by arbitrary_rank desc;
    top_10_things = limit sorted 10;
    generate group as key, top_10_things as top_10_things;
};
store top_things into '$mongourl' using MongoStorage();
Remember, this is the same structure the browser gets as JSON.
This would make a good Pig Macro.
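For readers who want to check the logic on a small sample before running it in batch, the same group, order and limit pattern can be sketched in Python. The function name top_n_by_key and the sample records are assumptions for illustration; the field names mirror the Pig script above.

```python
import heapq
from collections import defaultdict


def top_n_by_key(things, key_field, rank_field, n=10):
    """Group records by key_field and keep the n highest-ranked per group."""
    groups = defaultdict(list)
    for thing in things:
        groups[thing[key_field]].append(thing)
    # heapq.nlargest returns each group's members in descending rank order
    return {
        key: heapq.nlargest(n, members, key=lambda t: t[rank_field])
        for key, members in groups.items()
    }


# Hypothetical sample records
things = [
    {"key": "a", "arbitrary_rank": 3},
    {"key": "a", "arbitrary_rank": 9},
    {"key": "a", "arbitrary_rank": 5},
    {"key": "b", "arbitrary_rank": 1},
]
top = top_n_by_key(things, "key", "arbitrary_rank", n=2)
print(top)
```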
62
63
2.2) Time Series (of Anything) in Pig
pig -l /tmp -x local -v -w
/* Group by our key and date rounded to the month, get a total */
/* ISOToMonth is assumed to be defined already, for example from Piggybank */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
    generate flatten(group) as (key, month),
             COUNT_STAR(things) as total;
/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
    timeseries = order things_by_month by month;
    generate group as key, timeseries as timeseries;
};
store things_timeseries into '$mongourl' using MongoStorage();
Yet another good Pig Macro.
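The two-stage grouping above (count per key and month, then collect the months per key in order) can be sketched in Python as follows. This is an illustration under assumptions: the function name monthly_timeseries is hypothetical, and the month is taken as the first seven characters of an ISO 8601 timestamp rather than a truncated datetime.

```python
from collections import Counter, defaultdict


def monthly_timeseries(things):
    """Count records per (key, month), then list the months per key in order."""
    # "2001-01-09T06:38:00.000Z"[:7] is "2001-01"
    totals = Counter((t["key"], t["datetime"][:7]) for t in things)
    series = defaultdict(list)
    # Sorting the (key, month) pairs yields each key's months chronologically
    for (key, month), total in sorted(totals.items()):
        series[key].append({"month": month, "total": total})
    return dict(series)


# Hypothetical sample records
things = [
    {"key": "bob", "datetime": "2001-02-03T09:00:00.000Z"},
    {"key": "bob", "datetime": "2001-01-09T06:38:00.000Z"},
    {"key": "bob", "datetime": "2001-01-20T12:00:00.000Z"},
    {"key": "connie", "datetime": "2001-01-05T08:00:00.000Z"},
]
series = monthly_timeseries(things)
print(series["bob"])
```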
63
64
Data Processing in Our Stack
A new feature in our application might begin at any layer…
GREAT!
Any team member can add new features, no problemo!
I’m creative!
I know Pig!
I’m creative too!
I <3 Javascript!
omghi2u!
where r my legs?
send halp
64
65
Data Processing in Our Stack
... but we shift the data-processing towards batch, as we are able.
Ex: Overall total emails calculated in each layer
See real example here.
65

66
3) Exploring with Reports
66
67
3) Exploring with Reports
67
68
3.0) From Charts to Reports
• Extract entities from properties we aggregated by in charts (Step 2)
• Each entity gets its own type of web page
• Each unique entity gets its own web page
• Link to entities as they appear in atomic event documents (Step 1)
• Link the most related entities together, within and between types.
• More visualizations!
• Parameterize results via forms.
68
69
3.1) Looks Like This:
69

70
3.2) Cultivate Common Keyspaces
70
71
3.3) Get People Clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.
• See what’s interesting.
• Figure out what data needs cleaning and clean it.
• Start thinking about predictions & recommendations.
‘People’ could be just your team, if data is sensitive.
71
72
4) Predictions and Recommendations
72
73
4.0) Preparation
• We’ve already extracted entities, their properties and relationships
• Our charts show where our signal is rich
• We’ve cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data
• They got that understanding by exploring the data
• We are all on the same page!
73

74
4.2) Think in Different Perspectives
• Networks
• Time Series / Distributions
• Natural Language Processing
• Conditional Probabilities / Bayesian Inference
• Check out Chapter 2 of the book
74
75
4.3) Networks
75
76
4.3.1) Weighted Email Networks in Pig
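The slide's Pig code is not in this transcript. As a rough illustration of the idea, the Python sketch below builds a weighted, directed email network, assuming that an edge's weight is the number of emails sent from one address to another and that tos, ccs and bccs all count as recipients. The function name weighted_edges and both assumptions are mine, not the deck's.

```python
from collections import Counter


def weighted_edges(emails):
    """Count emails per (sender, recipient) pair across tos, ccs and bccs."""
    edges = Counter()
    for email in emails:
        sender = email["from"]["address"]
        for field in ("tos", "ccs", "bccs"):
            for recipient in email.get(field, []):
                edges[(sender, recipient["address"])] += 1
    return edges


# Hypothetical records using the schema from step 1.3
emails = [
    {"from": {"address": "bob.dobbs@enron.com"},
     "tos": [{"address": "connie@enron.com"}], "ccs": [], "bccs": []},
    {"from": {"address": "bob.dobbs@enron.com"},
     "tos": [{"address": "connie@enron.com"}],
     "ccs": [{"address": "dave@enron.com"}], "bccs": []},
]
edges = weighted_edges(emails)
for (sender, recipient), weight in edges.items():
    print(sender, "->", recipient, weight)
```

Each (sender, recipient, weight) triple maps directly onto one row of an edge list that a graph tool can import.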
76
77
4.3.2) Networks Viz with Gephi
77

78
4.3.3) Gephi = Easy
78
79
4.3.4) Social Network Analysis
79
80
4.4) Time Series & Distributions
80
81
4.4.1) Smooth Sparse Data
See here.
81
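The slide does not say which smoothing method it uses. As one simple possibility, swapped in purely for illustration, a centered moving average spreads isolated spikes across neighboring buckets. The function name moving_average and the sample counts are hypothetical.

```python
def moving_average(counts, window=3):
    """Smooth a list of counts with a centered moving average.

    The window is assumed to be odd. At the edges, the average is taken
    over whichever neighbors exist.
    """
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        lo = max(0, i - half)
        hi = min(len(counts), i + half + 1)
        neighbors = counts[lo:hi]
        smoothed.append(sum(neighbors) / len(neighbors))
    return smoothed


# Hypothetical sparse counts, for example emails per hour
sparse = [0, 0, 6, 0, 0, 3, 0]
smooth = moving_average(sparse, window=3)
print(smooth)
```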

82
4.4.2) Regress to Find Trends
JRuby Linear Regression UDF
Pig to use the UDF
Trend Line in your Application
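The JRuby UDF itself is not reproduced in this transcript. The Python sketch below shows ordinary least-squares regression over evenly spaced observations, which is one standard way to get a trend line; the function name linear_trend and the use of 0, 1, 2, ... as x values are assumptions for illustration.

```python
def linear_trend(ys):
    """Fit y = slope * x + intercept by least squares, with x = 0, 1, 2, ...

    Needs at least two observations, otherwise the slope is undefined.
    """
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly totals that grow by 2 each month
slope, intercept = linear_trend([1, 3, 5, 7])
print(slope, intercept)
```

A positive slope indicates a rising series, and the fitted line can be drawn over the chart as the trend line.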
82
83
4.5.1) Natural Language Processing
Example with code here and macro here.
83
84
4.5.2) NLP: Extract Topics!
84
85
4.5.3) NLP for All: Extract Topics!
• TF-IDF in Pig - 2 lines of code with Pig Macros:
• http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/
• LDA with Pig and the Lucene Tokenizer:
• http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html
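To make the TF-IDF idea concrete, here is a small Python sketch of one common variant: term frequency normalized by document length, multiplied by the natural log of inverse document frequency. It is not necessarily the weighting used by the linked Pig macro, and the function name tf_idf and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score terms per document.

    documents maps a doc ID to its list of tokens. Returns a dict mapping
    each doc ID to a {term: score} dict.
    """
    # Number of documents in which each term appears at least once
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))
    n_docs = len(documents)
    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Hypothetical tokenized email bodies
documents = {
    "a": ["enron", "trade", "trade"],
    "b": ["enron", "hadoop"],
}
scores = tf_idf(documents)
top_topic = max(scores["a"], key=scores["a"].get)
print(top_topic)
```

Terms that appear in every document score zero, so the highest-scoring terms per document serve as its candidate topics.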
85

86
4.6) Probability & Bayesian Inference
86
87
4.6.1) Gmail Suggested Recipients
87
88
4.6.1) Reproducing it with Pig
88
89
4.6.2) Step 1: COUNT (From -> To)
89

90
4.6.2) Step 2: COUNT (From, To, Cc)/Total
P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone
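
A minimal Python sketch of the two counting steps, offered as an illustration rather than the deck's own Pig script. It assumes that "Total" in Step 2 is the Step 1 count of (from, to) pairs; the sample emails, the field names, and the `cc_probabilities` helper are hypothetical.

```python
from collections import Counter

# Hypothetical sample emails; the field names are assumptions for illustration.
emails = [
    {"from": "russ@example.com", "to": ["kate@example.com"], "cc": ["bob@example.com"]},
    {"from": "russ@example.com", "to": ["kate@example.com"], "cc": ["bob@example.com", "mom@example.com"]},
    {"from": "russ@example.com", "to": ["kate@example.com"], "cc": []},
    {"from": "russ@example.com", "to": ["bob@example.com"], "cc": ["kate@example.com"]},
]


def cc_probabilities(emails):
    """Estimate P(cc | from, to) as COUNT(from, to, cc) / COUNT(from, to)."""
    pair_counts = Counter()    # Step 1: COUNT (From -> To)
    triple_counts = Counter()  # Step 2 numerator: COUNT (From, To, Cc)
    for email in emails:
        sender = email["from"]
        for to in set(email["to"]):
            pair_counts[(sender, to)] += 1
            for cc in set(email["cc"]):
                triple_counts[(sender, to, cc)] += 1
    # Divide each triple count by the count of its (from, to) pair.
    return {
        triple: count / pair_counts[triple[:2]]
        for triple, count in triple_counts.items()
    }


probs = cc_probabilities(emails)
```

Sorting the entries for one (from, to) pair by probability would give a ranked list of cc suggestions, which is the kind of result the Gmail suggested-recipients slide points at.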
90
91
4.6.3) Wait - Stop Here! It Works!
They match…
91
92
4.4) Add Predictions to Reports
92
93
5) Enable New Actions
93


94
Why Doesn’t Kate Reply to My Emails?
• What time is best to catch her?
• Are they too long?
• Are they meant to be replied to (original content)?
• Are they nice? (sentiment analysis)
• Do I reply to her emails (reciprocity)?
• Do I cc the wrong people (my mom)?
94
97
Thank You!
• Questions & Answers
97
• Follow: @rjurney
• Read the Blog: datasyndrome.com

 
Cloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco ProductsCloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco Products
 
LLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptxLLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptx
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
 
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeRohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
 
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model SafeRohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
 
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
 
Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
 
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeMahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 
Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...
 
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeLaxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 
University of the Sunshine Coast degree offer diploma Transcript
University of the Sunshine Coast  degree offer diploma TranscriptUniversity of the Sunshine Coast  degree offer diploma Transcript
University of the Sunshine Coast degree offer diploma Transcript
 
AIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on AzureAIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on Azure
 
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeMalviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
 
Victoria University degree offer diploma Transcript
Victoria University  degree offer diploma TranscriptVictoria University  degree offer diploma Transcript
Victoria University degree offer diploma Transcript
 
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeSouth Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
 

Agile Data Science: Building Hadoop Analytics Applications

  • 1. Agile Data Science January 2014 Agile Analytics Applications with Hadoop
  • 2. 2 About Me…Bearding. • Bearding is my #1 natural talent. • I’m going to beat this guy. • Seriously. • Salty Sea Beard • Fortified with Pacific Ocean Minerals 2
  • 3. 3 Agile Data Science: The Book A philosophy. Not the only way, but it’s a really good way! Code: ‘AUTHD’ – 50% off 3
  • 4. 4 We Go Fast, But Don’t Worry! • Download the slides - click the links - read examples! • If it’s not on the blog (Hortonworks, Data Syndrome), it’s in the book! • Order now: http://shop.oreilly.com/product/0636920025054.do 4
  • 5. 5 Agile Application Development: Check • LAMP stack mature • Post-Rails frameworks to choose from • Enable rapid feedback and agility + NoSQL 5
  • 7. 7 Scientific Computing / HPC Tubes and Mercury (Old School) Cores and Spindles (New School) UNIVAC and Deep Blue both fill a warehouse. We’re back! 7 ‘Smart Kid’ Only: MPI, Globus, etc. Until Hadoop
  • 9. 9 Data Center as Computer “A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.” Click here for a paper on operating a ‘data center as computer.’ 9 Warehouse Scale Computers and Applications
  • 10. 10 Hadoop to the Rescue! • Easy to use (Pig, Hive, Cascading) • CHEAP: 1% the cost of SAN/NAS • A department can afford its own Hadoop cluster! • Dump all your data in one place: Hadoop DFS • Silos come CRASHING DOWN! • JOIN like crazy! • ETL like whoa! • An army of mappers and reducers at your command • OMGWTFBBQ IT’S SO GREAT! I FEEL AWESOME! 10
  • 12. 12 Analytics Apps: It takes a Team • Broad skill-set • Nobody has them all • Inherently collaborative 12
  • 13. 13 Data Science Team • 3-4 team members with broad, diverse skill-sets that overlap • Transactional overhead dominates at 5+ people • Expert researchers: lend 25-50% of their time to teams • Creative workers. Like a studio, not an assembly line • Total freedom... with goals and deliverables. • Work environment matters most 13
  • 14. 14 How To Get Insight Into Product • Back-end has gotten THICKER • Generating $$$ insight can take 10-100x app dev • Timeline disjoint: analytics vs agile app-dev/design • How do you ship insights efficiently? • Can you collaborate on research vs developer timeline? 14
  • 15. 15 The Wrong Way - Part One “We made a great design. Your job is to predict the future for it.” 15
  • 16. 16 The Wrong Way - Part Two “What is taking you so long to reliably predict the future?” 16
  • 17. 17 The Wrong Way - Part Three “The users don’t understand what 86% true means.” 17
  • 18. 18 The Wrong Way - Part Four GHJIAEHGIEhjagigehganb!!!!!RJ(@J?!! 18
  • 19. 19 The Wrong Way - Conclusion Inevitable Conclusion Plane Mountain 19
  • 20. 20 Reminds me of... the waterfall model :( 20
  • 21. 21 Chief Problem You can’t design insight in analytics applications. You discover it. You discover by exploring. 21
  • 22. 22 -> Strategy So make an app for exploring your data. Which becomes a palette for what you ship. Iterate and publish intermediate results. 22
  • 23. 23 Data Design • Not the 1st query that = insight, it’s the 15th, or 150th • Capturing “Ah ha!” moments • Slow to do those in batch… • Faster, better context in an interactive web application. • Pre-designed charts wind up terrible. So bad. • Easy to invest man-years in wrong statistical models • Semantics of presenting predictions are complex • Opportunity lies at intersection of data & design 23
  • 24. 24 How Do We Get Back to Agile? 24
  • 25. 25 Statement of Principles (Then Tricks With Code) 25
  • 26. 26 Setup An Environment Where: • Insights repeatedly produced • Iterative work shared with entire team • Interactive from day Zero • Data model is consistent end-to-end • Minimal impedance between layers • Scope and depth of insights grow • Insights form the palette for what you ship • Until the application pays for itself and more 26
  • 28. 28 Value Document > Relation Most data is dirty. Most data is semi-structured or unstructured. Rejoice! 28
  • 29. 29 Value Document > Relation Note: Hive/ArrayQL/NewSQL’s support of documents/array types blur this distinction. 29
  • 30. 30 Relational Data = Legacy Format • Why JOIN? Storage is fundamentally cheap! • Duplicate that JOIN data in one big record type! • ETL once to document format on import, NOT every job • Not zero JOINs, but far fewer JOINs • Semi-structured documents preserve data’s actual structure • Column-compressed document formats beat JOINs! 30
  • 31. 31 Value Imperative > Declarative • We don’t know what we want to SELECT. • Data is dirty - check each step, clean iteratively. • 85% of data scientist’s time spent munging. ETL. • Imperative is optimized for our process. • Process = iterative, snowballing insight • Efficiency matters, self optimize 31
  • 32. 32 Value Dataflow > SELECT 32
  • 33. 33 Ex. Dataflow: ETL + Email Sent Count (I can’t read this either. Get a big version here.) 33
  • 34. 34 Value Pig > Hive (for app-dev) • Pigs eat ANYTHING • Pig is optimized for refining data, as opposed to consuming it • Pig is imperative, iterative • Pig is dataflows, and SQLish (but not SQL) • Code modularization/re-use: Pig Macros • ILLUSTRATE speeds dev time (even UDFs) • Easy UDFs in Java, JRuby, Jython, Javascript • Pig Streaming = use any tool, period. • Easily prepare our data as it will appear in our app. • If you prefer Hive, use Hive. Actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive. See: HCatalog for Pig/Hive integration. 34
  • 35. 35 Localhost vs Petabyte Scale: Same Tools • Simplicity essential to scalability: highest level tools we can • Prepare a good sample - tricky with joins, easy with documents • Local mode: pig -l /tmp -x local -v -w • Frequent use of ILLUSTRATE • 1st: Iterate, debug & publish locally • 2nd: Run on cluster, publish to team/customer • Consider skipping Object-Relational-Mapping (ORM) • We do not trust ‘databases,’ only HDFS @ n=3 • Everything we serve in our app is re-creatable via Hadoop. 35
  • 36. 36 Data-Value Pyramid Climb it. Do not skip steps. See here. 36
  • 37. 37 0/1) Display Atomic Records On The Web 37
  • 38. 38 0.0) Document - Serialize Events • Protobuf • Thrift • JSON • Avro - I use Avro because the schema is onboard. 38
  • 39. 39 0.1) Documents Via Relation ETL. Example here.
      enron_messages = load '/enron/enron_messages.tsv' as (
        message_id:chararray, sql_date:chararray, from_address:chararray,
        from_name:chararray, subject:chararray, body:chararray);
      enron_recipients = load '/enron/enron_recipients.tsv' as (
        message_id:chararray, reciptype:chararray, address:chararray, name:chararray);
      split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
      headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
      with_headers = join headers by group, enron_messages by message_id parallel 10;
      emails = foreach with_headers generate
        enron_messages::message_id as message_id,
        CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
        TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
        enron_messages::subject as subject,
        enron_messages::body as body,
        headers::tos.(address, name) as tos,
        headers::ccs.(address, name) as ccs,
        headers::bccs.(address, name) as bccs;
      store emails into '/enron/emails.avro' using AvroStorage(
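
The relation-to-document step in slide 39 can be sketched in plain Python, as a rough illustration of what the SPLIT, COGROUP and JOIN produce rather than a replacement for the Pig script. The sample rows and the `to_documents` helper are hypothetical; field names follow the slide. One deliberate difference: the Pig inner JOIN drops messages that have no recipient rows, while this sketch keeps them with empty lists.

```python
from collections import defaultdict

# Hypothetical rows standing in for the two TSV relations on the slide.
messages = [
    {"message_id": "m1", "sql_date": "2001-01-09 06:38:00",
     "from_address": "bob.dobbs@enron.com", "from_name": "J.R. Bob Dobbs",
     "subject": "Re: Enron trade for frop futures", "body": "..."},
]
recipients = [
    {"message_id": "m1", "reciptype": "to", "address": "connie@enron.com", "name": None},
    {"message_id": "m1", "reciptype": "cc", "address": "kate@enron.com", "name": None},
]


def to_documents(messages, recipients):
    """Nest recipient rows inside their message: one document per email."""
    # Rough equivalent of SPLIT + COGROUP: bucket recipients by message and type.
    headers = defaultdict(lambda: {"to": [], "cc": [], "bcc": []})
    for row in recipients:
        buckets = headers[row["message_id"]]
        if row["reciptype"] in buckets:
            buckets[row["reciptype"]].append(
                {"address": row["address"], "name": row["name"]})

    # Rough equivalent of JOIN + FOREACH ... GENERATE: build the nested record.
    documents = []
    for msg in messages:
        buckets = headers[msg["message_id"]]
        documents.append({
            "message_id": msg["message_id"],
            "date": msg["sql_date"],  # the slide also converts this to ISO 8601
            "from": {"address": msg["from_address"], "name": msg["from_name"]},
            "subject": msg["subject"],
            "body": msg["body"],
            "tos": buckets["to"],
            "ccs": buckets["cc"],
            "bccs": buckets["bcc"],
        })
    return documents


emails = to_documents(messages, recipients)
```

Each resulting dictionary has the same nested shape as the Avro record shown on the later slides, which is the point of doing the ETL once on import.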
  • 40. 40 0.2) Serialize Events From Streams. Scrape your own gmail in Python and Ruby.
      class GmailSlurper(object):
        ...
        def init_imap(self, username, password):
          self.username = username
          self.password = password
          try:
            self.imap.shutdown()
          except:
            pass
          self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
          self.imap.login(username, password)
          self.imap.is_readonly = True
        ...
        def write(self, record):
          self.avro_writer.append(record)
        ...
        def slurp(self):
          if(self.imap and self.imap_folder):
            for email_id in self.id_list:
              (status, email_hash, charset) = self.fetch_email(email_id)
              if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
                print email_id, charset, email_hash['thread_id']
                self.write(email_hash)
  • 41. 41 0.3) ETL Logs log_data = LOAD 'access_log' USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes); 41
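
For working on a small sample outside Pig, the log ETL in slide 41 can be approximated with a regular expression. This is a minimal sketch, assuming the standard Apache Common Log Format; the `CLF` pattern and `parse_line` helper are hypothetical names, and the `status` field is an addition that the slide's schema does not list.

```python
import re

# Common Log Format: host ident user [time] "method uri proto" status bytes
CLF = re.compile(
    r'(?P<remoteAddr>\S+) (?P<remoteLogname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Return one log line as a dict, or None if it does not match."""
    match = CLF.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Apache writes "-" when no bytes were sent.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_line(sample)
```

Returning `None` for lines that do not match keeps dirty input visible, so it can be counted and inspected instead of silently dropped.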
  • 42. 42 1) Plumb Atomic Events->Browser (Example stack that enables high productivity) 42
  • 43. 43 1.1) Cat Avro Serialized Events me$ cat_avro ~/Data/enron.avro { u'bccs': [], u'body': u'scamming people, blah blah', u'ccs': [], u'date': u'2000-08-28T01:50:00.000Z', u'from': {u'address': u'bob.dobbs@enron.com', u'name': None}, u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>', u'subject': u'Re: Enron trade for frop futures', u'tos': [ {u'address': u'connie@enron.com', u'name': None} ] } Get cat_avro in python, ruby 43
  • 44. 44 1.2) Load Events in Pig
      me$ pig -l /tmp -x local -v -w
      grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
      grunt> describe enron_emails
      emails: {
        message_id: chararray,
        datetime: chararray,
        from:tuple(address:chararray,name:chararray),
        subject: chararray,
        body: chararray,
        tos: {to: (address: chararray,name: chararray)},
        ccs: {cc: (address: chararray,name: chararray)},
        bccs: {bcc: (address: chararray,name: chararray)}
      }
  • 45. 45 1.3) ILLUSTRATE Events in Pig grunt> illustrate enron_emails  --------------------------------------------------------------------------- | emails | | message_id:chararray | | datetime:chararray | | from:tuple(address:chararray,name:chararray) | | subject:chararray | | body:chararray | tos:bag{to:tuple(address:chararray,name:chararray)} | | ccs:bag{cc:tuple(address:chararray,name:chararray)} | | bccs:bag{bcc:tuple(address:chararray,name:chararray)} | --------------------------------------------------------------------------- | | | <1731.10095812390082.JavaMail.evans@thyme> | | 2001-01-09T06:38:00.000Z | | (bob.dobbs@enron.com, J.R. Bob Dobbs) | | Re: Enron trade for frop futures | | scamming people, blah blah | | {(connie@enron.com,)} | | {} | | {} | Upgrade to Pig 0.10+ 45
  • 46. 46 1.4) Publish Events to a ‘Database’
      From Avro to MongoDB in one command:
      pig -l /tmp -x local -v -w -param avros=enron.avro -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig
      Which does this:
      /* MongoDB libraries and configuration */
      register /me/mongo-hadoop/mongo-2.7.3.jar
      register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
      register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
      /* Set speculative execution off to avoid chance of duplicate records in Mongo */
      set mapred.map.tasks.speculative.execution false
      set mapred.reduce.tasks.speculative.execution false
      define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
      /* By default, lets have 5 reducers */
      set default_parallel 5
      avros = load '$avros' using AvroStorage();
      store avros into '$mongourl' using MongoStorage();
      Full instructions here.
  • 47. 47 1.5) Check Events in ‘Database’
      $ mongo enron
      MongoDB shell version: 2.0.2
      connecting to: enron
      > show collections
      emails
      system.indexes
      > db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
      {
        "_id" : ObjectId("502b4ae703643a6a49c8d180"),
        "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
        "date" : "2001-01-09T06:38:00.000Z",
        "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
        "subject" : "Re: Enron trade for frop futures",
        "body" : "Scamming more people...",
        "tos" : [ { "address" : "connie@enron", "name" : null } ],
        "ccs" : [ ],
        "bccs" : [ ]
      }
  • 48. 48 1.6) Publish Events on the Web
      require 'rubygems'
      require 'sinatra'
      require 'mongo'
      require 'json'
      connection = Mongo::Connection.new
      database = connection['agile_data']
      collection = database['emails']
      get '/email/:message_id' do |message_id|
        data = collection.find_one({:message_id => message_id})
        JSON.generate(data)
      end
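
The same "one route, one document, JSON out" idea from slide 48 can be sketched in Python using only the standard library. This is an illustration under assumptions, not the talk's stack: the in-memory `EMAILS` dictionary is a hypothetical stand-in for the MongoDB collection, and `app` is a plain WSGI callable rather than a Sinatra route.

```python
import json

# Hypothetical in-memory stand-in for the MongoDB 'emails' collection.
EMAILS = {
    "<1731.10095812390082.JavaMail.evans@thyme>": {
        "message_id": "<1731.10095812390082.JavaMail.evans@thyme>",
        "date": "2001-01-09T06:38:00.000Z",
        "subject": "Re: Enron trade for frop futures",
    },
}

PREFIX = "/email/"


def app(environ, start_response):
    """WSGI app: GET /email/<message_id> returns the stored document as JSON."""
    path = environ.get("PATH_INFO", "")
    document = EMAILS.get(path[len(PREFIX):]) if path.startswith(PREFIX) else None
    headers = [("Content-Type", "application/json")]
    if document is None:
        start_response("404 Not Found", headers)
        return [json.dumps({"error": "not found"}).encode("utf-8")]
    start_response("200 OK", headers)
    return [json.dumps(document).encode("utf-8")]
```

To try it locally, pass `app` to `wsgiref.simple_server.make_server('', 8000, app)` and call `serve_forever()` on the result. The handler does no reshaping at all, which is what "minimal impedance between layers" looks like in practice: the record Pig wrote is the record the browser gets.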
  • 49. 49 1.6) Publish events on the web 49
  • 51. 51 What’s the Point? • A designer can work against real data. • An application developer can work against real data. • A product manager can think in terms of real data. • Entire team is grounded in reality! • You’ll see how ugly your data really is. • You’ll see how much work you have yet to do. • Ship early and often! • Feels agile, don’t it? Keep it up! 51
  • 52. 52 1.7) Wrap Events with Bootstrap. Complete example here with code here.
      <link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
      </head>
      <body>
        <div class="container" style="margin-top: 100px;">
          <table class="table table-striped table-bordered table-condensed">
            <thead>
              {% for key in data['keys'] %}
              <th>{{ key }}</th>
              {% endfor %}
            </thead>
            <tbody>
              <tr>
                {% for value in data['values'] %}
                <td>{{ value }}</td>
                {% endfor %}
              </tr>
            </tbody>
          </table>
        </div>
      </body>
  • 53. 53 1.7) Wrap Events with Bootstrap 53
  • 54. 54 Refine. Add Links Between Documents. Not the Mona Lisa, but coming along... See: here 54
  • 55. 56 1.8) List Links to Sorted Events
      Use your ‘database’, if it can sort:
      mongo enron
      > db.emails.ensureIndex({message_id: 1})
      > db.emails.find().sort({date: -1}).limit(10).pretty()
      {
        "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
        "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
        "from" : [ ...
      Use Pig, serve/cache a bag/array of email documents:
      pig -l /tmp -x local -v -w
      emails_per_user = foreach (group emails by from.address) {
        sorted = order emails by date;
        last_1000 = limit sorted 1000;
        generate group as from_address, last_1000 as emails;
      };
      store emails_per_user into '$mongourl' using MongoStorage();
  • 56. 57 1.8) List Links to Sorted Documents 57
  • 57. 58 1.9) Make It Searchable
      If you have a list, search is easy with ElasticSearch and Wonderdog...
      /* Load ElasticSearch integration */
      register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
      register '/me/elasticsearch-0.18.6/lib/*';
      define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
      emails = load '/me/tmp/emails' using AvroStorage();
      store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
      Test it with curl:
      curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'
      ElasticSearch has no security features. Take note. Isolate.
  • 58. 59 2) Create Simple Charts 59
  • 59. 60 2) Create Simple Tables and Charts 60
  • 60. 61 2) Create Simple Charts • Start with an HTML table on general principle. • Then use nvd3.js - reusable charts for d3.js • Aggregate by properties & displaying is first step in entity resolution • Start extracting entities. Ex: people, places, topics, time series • Group documents by entities, rank and count. • Publish top N, time series, etc. • Fill a page with charts. • Add a chart to your event page. 61
  • 61. 62 2.1) Top N (of Anything) in Pig
      pig -l /tmp -x local -v -w
      top_things = foreach (group things by key) {
        sorted = order things by arbitrary_rank desc;
        top_10_things = limit sorted 10;
        generate group as key, top_10_things as top_10_things;
      };
      store top_things into '$mongourl' using MongoStorage();
      Remember, this is the same structure the browser gets as JSON. This would make a good Pig Macro.
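
For checking the Top N dataflow from slide 62 against a local sample, here is a minimal Python sketch of the same group, sort and limit pattern. The `top_n_per_key` helper and the sample records are hypothetical; the field names `key` and `arbitrary_rank` follow the slide.

```python
import heapq
from collections import defaultdict


def top_n_per_key(things, n=10):
    """Group records by 'key' and keep the n highest-ranked records per group."""
    groups = defaultdict(list)
    for thing in things:
        groups[thing["key"]].append(thing)
    # heapq.nlargest returns its results sorted from highest to lowest rank.
    return [
        {"key": key,
         "top_things": heapq.nlargest(n, rows, key=lambda r: r["arbitrary_rank"])}
        for key, rows in groups.items()
    ]


things = [
    {"key": "a", "arbitrary_rank": 1},
    {"key": "a", "arbitrary_rank": 3},
    {"key": "b", "arbitrary_rank": 2},
]
top_things = top_n_per_key(things, n=1)
```

The output is one nested record per key, which matches the document shape the Pig job stores and the browser later reads.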
  • 62. 63 2.2) Time Series (of Anything) in Pig
      pig -l /tmp -x local -v -w
      /* Group by our key and date rounded to the month, get a total */
      things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
        generate flatten(group) as (key, month), COUNT_STAR(things) as total;
      /* Sort our totals per key by month to get a time series */
      things_timeseries = foreach (group things_by_month by key) {
        timeseries = order things_by_month by month;
        generate group as key, timeseries as timeseries;
      };
      store things_timeseries into '$mongourl' using MongoStorage();
      Yet another good Pig Macro.
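
The two-stage time series job from slide 63 (count per key and month, then sort each key's months) can be sketched in Python for local experiments. This assumes ISO 8601 `datetime` strings, so the first seven characters are the `YYYY-MM` month; the `monthly_timeseries` helper and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def monthly_timeseries(things):
    """Count records per (key, month) and return one sorted series per key."""
    # Stage 1: total per key and month. ISO 8601 strings start with 'YYYY-MM'.
    totals = Counter((t["key"], t["datetime"][:7]) for t in things)
    # Stage 2: regroup by key. Sorting the (key, month) pairs puts months in order.
    series = defaultdict(list)
    for (key, month), total in sorted(totals.items()):
        series[key].append({"month": month, "total": total})
    return dict(series)


things = [
    {"key": "bob", "datetime": "2001-02-03T10:00:00.000Z"},
    {"key": "bob", "datetime": "2001-01-09T06:38:00.000Z"},
    {"key": "bob", "datetime": "2001-01-20T12:00:00.000Z"},
]
timeseries = monthly_timeseries(things)
```

Months with no records are simply absent from the series, so a chart built on this output has gaps rather than zeros unless the gaps are filled in a later step.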
  • 63. 64 Data Processing in Our Stack A new feature in our application might begin at any layer… GREAT! Any team member can add new features, no problemo! I’m creative! I know Pig! I’m creative too! I <3 Javascript! omghi2u! where r my legs? send halp 64
  • 64. 65 Data Processing in Our Stack ... but we shift the data-processing towards batch, as we are able. Ex: Overall total emails calculated in each layer See real example here. 65
  • 65. 66 3) Exploring with Reports 66
  • 66. 67 3) Exploring with Reports 67
  • 67. 68 3.0) From Charts to Reports • Extract entities from properties we aggregated by in charts (Step 2) • Each entity gets its own type of web page • Each unique entity gets its own web page • Link to entities as they appear in atomic event documents (Step 1) • Link most related entities together, same and between types. • More visualizations! • Parameterize results via forms. 68
  • 68. 69 3.1) Looks Like This: 69
  • 69. 70 3.2) Cultivate Common Keyspaces 70
  • 70. 71 3.3) Get People Clicking. Learn. • Explore this web of generated pages, charts and links! • Everyone on the team gets to know your data. • Keep trying out different charts, metrics, entities, links. • See what’s interesting. • Figure out what data needs cleaning and clean it. • Start thinking about predictions & recommendations. ‘People’ could be just your team, if data is sensitive. 71
  • 72. 73 4.0) Preparation • We’ve already extracted entities, their properties and relationships • Our charts show where our signal is rich • We’ve cleaned our data to make it presentable • The entire team has an intuitive understanding of the data • They got that understanding by exploring the data • We are all on the same page! 73
  • 73. 74 4.2) Think in Different Perspectives • Networks • Time Series / Distributions • Natural Language Processing • Conditional Probabilities / Bayesian Inference • Check out Chapter 2 of the book 74
  • 76. 77 4.3.2) Networks Viz with Gephi 77
  • 77. 78 4.3.3) Gephi = Easy 78
  • 79. 80 4.4) Time Series & Distributions 80
  • 80. 81 4.4.1) Smooth Sparse Data See here. 81
  • 81. 82 4.4.2) Regress to Find Trends JRuby Linear Regression UDF Pig to use the UDF Trend Line in your Application 82
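
The slide only names the pieces (a JRuby linear regression UDF called from Pig), so the following is a generic ordinary least squares sketch in Python, not the talk's UDF. The `fit_trend` helper is hypothetical and assumes evenly spaced observations, using the index of each point as its x value.

```python
def fit_trend(ys):
    """Fit y = slope * x + intercept by least squares, with x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    xs = range(n)
    mean_x = (n - 1) / 2.0
    mean_y = sum(ys) / float(n)
    # slope = covariance(x, y) / variance(x)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


monthly_totals = [1, 3, 5, 7]
slope, intercept = fit_trend(monthly_totals)
trend_line = [slope * x + intercept for x in range(len(monthly_totals))]
```

The sign of `slope` says whether a series is rising or falling, and `trend_line` can be drawn over the monthly totals as a second chart series.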
  • 82. 83 4.5.1) Natural Language Processing Example with code here and macro here. 83
  • 84. 85 4.5.3) NLP for All: Extract Topics! • TF-IDF in Pig - 2 lines of code with Pig Macros: • http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/ • LDA with Pig and the Lucene Tokenizer: • http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html 85
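
To make the topic-extraction idea concrete, here is a small Python sketch of one common TF-IDF formulation: term frequency within a document times the log of inverse document frequency. It is not the linked Pig macro, whose exact weighting may differ; the `tf_idf` helper and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps doc_id -> list of tokens. Returns doc_id -> {term: score}."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = float(len(tokens))
        scores[doc_id] = dict(
            (term, (count / total) * math.log(n_docs / float(doc_freq[term])))
            for term, count in counts.items()
        )
    return scores


docs = {
    "email_1": ["enron", "trade", "trade"],
    "email_2": ["enron", "beard"],
}
scores = tf_idf(docs)
topic = max(scores["email_1"], key=scores["email_1"].get)
```

A term that appears in every document scores zero, which is why the highest-scoring terms of a document work as a rough topic summary.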
  • 85. 86 4.6) Probability & Bayesian Inference 86
  • 86. 87 4.6.1) Gmail Suggested Recipients 87
  • 88. 89 4.6.2) Step 1: COUNT (From -> To) 89
  • 89. 90 4.6.2) Step 2: COUNT (From, To, Cc)/Total P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone 90
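
The two counting steps can be combined into a conditional probability estimate. This is a sketch under one reading of the slides: P(cc | from, to) is estimated as COUNT(from, to, cc) divided by COUNT(from, to). The `cc_given_to` helper and the flat email records are hypothetical.

```python
from collections import Counter


def cc_given_to(emails):
    """Estimate P(cc | from, to) = COUNT(from, to, cc) / COUNT(from, to)."""
    pair_totals = Counter()    # Step 1: how often 'from' wrote to 'to'
    triple_counts = Counter()  # Step 2: how often 'cc' was copied on those emails
    for email in emails:
        for to in set(email["tos"]):
            pair_totals[(email["from"], to)] += 1
            for cc in set(email["ccs"]):
                triple_counts[(email["from"], to, cc)] += 1
    return dict(
        ((sender, to, cc), count / float(pair_totals[(sender, to)]))
        for (sender, to, cc), count in triple_counts.items()
    )


emails = [
    {"from": "me", "tos": ["kate"], "ccs": ["mom"]},
    {"from": "me", "tos": ["kate"], "ccs": []},
]
probabilities = cc_given_to(emails)
```

Ranking the candidate cc addresses for a given (from, to) pair by this ratio gives a simple suggested-recipients list.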
  • 90. 91 4.6.3) Wait - Stop Here! It Works! They match… 91
  • 91. 92 4.4) Add Predictions to Reports 92
  • 92. 93 5) Enable New Actions 93
  • 93. 94 Why Doesn’t Kate Reply to My Emails? • What time is best to catch her? • Are they too long? • Are they meant to be replied to (original content)? • Are they nice? (sentiment analysis) • Do I reply to her emails (reciprocity)? • Do I cc the wrong people (my mom)? 94
  • 94. 97 Thank You! • Questions & Answers • Follow: @rjurney • Read the Blog: datasyndrome.com 97