APRIL 30th
Nicola Cardace
• Auto-Scaling Using Amazon EC2 and Scalr
• Nginx and Memcached on EC2, a 400% boost!
• NASDAQ exchange re-play on AWS
• Persistent Django on Amazon EC2 and EBS
• Taking Massive Distributed Computing to the
Common Man - Hadoop on Amazon EC2/S3
AWS (Hadoop) Meetup 30.04.09
Auto-Scaling Using
Amazon EC2 and Scalr
Scalr, a redundant, self-curing, self-scaling hosting
solution built on EC2
Scalr sourcecode:


Scalr overview
• By using Scalr, you can create a server farm that uses prebuilt AMIs
for load balancing, web servers, and databases. You can also
customize a generic AMI, which you can use to host your actual
application.
• Scalr monitors the health of the entire server farm, ensuring that
instances stay running and that load averages stay below a
configurable threshold. If an instance crashes, another one of the
proper type will be launched and added to the load balancer.
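The monitoring behaviour described above can be sketched as a small decision function. This is only an illustration of the idea, not Scalr's actual code: the `Instance` class, the role names, and the `plan_actions` helper are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Instance:
    role: str            # hypothetical role names, e.g. "app", "db", "lb"
    load_average: float
    running: bool = True


def plan_actions(instances, role, load_threshold, max_instances):
    """Return the actions a farm monitor might take for one role."""
    members = [i for i in instances if i.role == role]
    actions = []

    # Self-curing: each crashed instance is replaced by one of the same role.
    crashed = [i for i in members if not i.running]
    actions += [("replace", role)] * len(crashed)

    # Self-scaling: add capacity when the average load of the healthy
    # instances exceeds the configured threshold.
    healthy = [i for i in members if i.running]
    if healthy and mean(i.load_average for i in healthy) > load_threshold:
        if len(members) < max_instances:
            actions.append(("launch", role))
    return actions


farm = [
    Instance("app", 3.5),
    Instance("app", 4.1),
    Instance("app", 0.0, running=False),
    Instance("db", 0.7),
]
app_actions = plan_actions(farm, "app", load_threshold=2.0, max_instances=5)
db_actions = plan_actions(farm, "db", load_threshold=2.0, max_instances=2)
```

In this toy farm the "app" role gets one replacement for the crashed instance plus one extra launch, while the lightly loaded "db" role is left alone.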
Scalr (2)
• Scalr is an open source, fully redundant, self-curing, and
self-scaling hosting environment that uses Amazon EC2.
• Scalr allows network administrators to create virtual
server farms, using prebuilt components. Scalr uses four
Amazon Machine Images (AMIs) for load balancing,
databases, application servers, and a generic base image.
• Administrators can preconfigure one machine and, when
the load warrants, bring online additional machines with
the same image, to handle the increased requests.
Nginx and Memcached on EC2
400% boost!
(with a five-minute config tweak!)

Nginx was originally developed by Igor Sysoev for (second largest
Russian web-site). It is a high-performance HTTP server / reverse
proxy known for its stability, performance, and ease of use. Its strong
track record, many useful modules, and active development
community have earned it a steady uptick of users.
memcached is a high-performance, distributed memory object
caching system, generic in nature, but intended for use in
speeding up dynamic web applications by alleviating database
load.
“Memcached, the darling of every web-developer, is
capable of turning almost any application into a speed-
demon. Benchmarking one of my own Rails applications
resulted in ~850 req/s on commodity, non-optimized
hardware - more than enough in the case of this
application. However, what if we took Mongrel out of the
equation? Nginx, by default, comes prepackaged with the
Memcached module, which allows us to bypass the
Mongrel (from rubyforge) servers and talk to Memcached
directly. Same hardware, and a quick test later: ~3,550
req/s, or almost a 400% improvement!”
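The pattern in the quote (answer from the cache when possible, and only fall through to the application server on a miss) can be sketched in a few lines. This is a toy model rather than an Nginx configuration: a plain dict stands in for Memcached, `render_page` stands in for the Rails/Mongrel application, and all names are hypothetical.

```python
def render_page(uri, cache):
    """Stand-in for the slow application server (assumption: it renders
    the page and stores the result in the cache, keyed by URI)."""
    body = f"<html>page for {uri}</html>"
    cache[uri] = body
    return body


class CacheFirstFrontend:
    """Stand-in for the front-end web server."""

    def __init__(self, cache, app):
        self.cache = cache      # dict standing in for Memcached
        self.app = app          # callable standing in for the app server
        self.app_calls = 0

    def handle(self, uri):
        body = self.cache.get(uri)
        if body is not None:
            return body         # cache hit: the app server is bypassed
        self.app_calls += 1     # cache miss: fall through to the app
        return self.app(uri, self.cache)


frontend = CacheFirstFrontend(cache={}, app=render_page)
first = frontend.handle("/products/1")
second = frontend.handle("/products/1")
```

Only the first request reaches the application; the second is served straight from the cache, which is where the throughput gain in the quote comes from.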

NASDAQ exchange
re-play on AWS
your homework
Persistent Django on Amazon
EC2 and EBS

Credit: Thomas Brox Røst,
Visiting researcher, Decision Systems Group, Harvard
Persistent Django
on Amazon EC2 and EBS - The easy way
Now that Amazon’s Elastic Block Store (EBS) is publicly available,
running a complete Django installation on Amazon Web Services
(AWS) is easier than ever.
EBS provides persistent storage, which means that the Django database
is kept safe even after the Django EC2 instances terminate.
To set up Django with a persistent PostgreSQL database on AWS:
Set up an AWS account
Download and install the Elasticfox Firefox extension
Add your AWS credentials to Firefox
Create a new EC2 security group
By default, EC2 instances are an introverted lot: they prefer keeping to themselves and don’t expose any
of their ports to the outside world. We will be running a web application on port 8000, so port
8000 has to be opened. (Normally we would open port 80, but since I will only be using the Django
development web server, port 8000 is preferable.) SSH access is also essential, so port 22 should be
opened as well. To make this happen we must create a new security group where these ports are opened.
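The default-deny behaviour of a security group can be pictured as an allow-list of ports. The sketch below is a toy model only, not the EC2 API: the `SecurityGroup` class, its methods, and the group name are hypothetical, and source-address matching is left out.

```python
class SecurityGroup:
    """Toy model: everything is closed unless a rule explicitly opens it."""

    def __init__(self, name):
        self.name = name
        self.rules = set()          # set of (protocol, port) pairs

    def authorize(self, protocol, port):
        self.rules.add((protocol, port))

    def allows(self, protocol, port):
        return (protocol, port) in self.rules


group = SecurityGroup("django-dev")     # hypothetical group name
group.authorize("tcp", 22)              # SSH access
group.authorize("tcp", 8000)            # Django development server

ssh_open = group.allows("tcp", 22)
dev_server_open = group.allows("tcp", 8000)
postgres_open = group.allows("tcp", 5432)   # never opened, so still closed
```

Keeping the PostgreSQL port closed to the outside is a natural consequence of this model: only the two ports named in the text are reachable.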

Set up a key pair
Launch an EC2 Instance
Connect to your new instance (SSH, e.g. using PuTTY)
- Install subversion
- Install, initialize and launch PostgreSQL
- Modify PostgreSQL config to avoid username/password problems
- Restart PostgreSQL to enable new security policy
- Set up a database for Django
- Install Django (checkout from SVN)
- Install psycopg2 (for database access from Python)
Set up a Django project
Test the installation
Launch the dev server
Create a Django app
Create and mount an EBS volume
Mount the filesystem
Move the database to persistent storage (with server stopped)
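As a rough illustration of where the database steps above end up, a Django project points at PostgreSQL through its settings module. The fragment below uses the `DATABASES` layout of current Django releases (older releases used differently named settings); the database name and user are placeholders, not values from the talk.

```python
# settings.py fragment (sketch). "djangodb" and "postgres" are placeholder
# values chosen for illustration only.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",  # uses the psycopg2 driver
        "NAME": "djangodb",
        "USER": "postgres",
        "HOST": "",   # empty string: connect over the local socket
        "PORT": "",   # empty string: use the default port
    }
}
```

Moving the PostgreSQL data directory onto the EBS volume does not change this fragment: Django only sees the database server, not where its files live.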
Elastic MapReduce
Amazon Elastic MapReduce

Data and Computing Trends:
Source: Facebook
• Explosion of Data
– Web Logs, Ad-Server logs, Sensor Networks, Seismic Data, DNA
sequences (?)
– User generated content/Web 2.0
– Data as BI => Data as product (Search, Ads, Digg, Quantcast, …)
• Declining Revenue/GB
– Milk @ $3/gallon => $15M / GB
– Ads @ 20c / 10^6 impressions => $1/GB
– Google Analytics, Facebook Lexicon == Free!
• Hardware Trends
– Commodity Rocks: $4K 1U box = 8 cores + 16GB mem + 4x1TB
– CPU: SMP => NUMA, Storage: $ Shared-Nothing << $ Shared,
Networking: Ethernet
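The ad-revenue line above can be unpacked with a little arithmetic. The slide does not state how large one logged impression is; the figure computed below is simply what its two numbers jointly imply, assuming 1 GB = 10^9 bytes.

```python
# Figures taken from the slide, expressed in cents to keep the maths exact.
cents_per_million_impressions = 20      # "Ads @ 20c / 10^6 impressions"
cents_per_gb = 100                      # "=> $1/GB"

# Assumption (not on the slide): decimal gigabytes.
bytes_per_gb = 10**9

# How many impressions must one gigabyte of log data hold to earn $1?
impressions_per_gb = cents_per_gb * 10**6 // cents_per_million_impressions

# Implied size of one logged impression.
implied_bytes_per_impression = bytes_per_gb // impressions_per_gb
```

Under these assumptions a gigabyte holds five million impressions, i.e. roughly 200 bytes of log data per impression.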

Hadoop
• Parallel Computing platform
– Distributed FileSystem (HDFS)
– Parallel Processing model (Map/Reduce)
– Express Computation in any language
– Job execution for Map/Reduce jobs
(scheduling+localization+retries/speculation)
• Open-Source
– Most popular Apache project!
– Highly Extensible Java Stack (@ expense of Efficiency)
– Develop/Test on EC2!
• Ride the commodity curve:
– Cheap (but reliable) shared nothing storage
– Data Local computing (don’t need high speed networks)
– Highly Scalable (@expense of Efficiency)
Map/Reduce DataFlow
Hadoop Running MapReduce
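Since the data-flow diagrams did not survive, here is a minimal single-process sketch of the same flow (map, then shuffle/group by key, then reduce) using the classic word-count example. It only illustrates the programming model; it is not Hadoop code, and the function names are made up for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)


def run_job(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in sorted(groups.items()))


counts = run_job(["Hadoop runs map reduce", "map tasks feed reduce tasks"])
```

On a real cluster the map and reduce calls run as separate tasks on many nodes and the shuffle moves data across the network; the structure of the computation is the same.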

In Pictures (Source: Facebook)
Looks like this:
[Figure: six nodes, each with local disks, on a network labeled 1 Gigabit and 4-8 Gigabit. Node = DataNode + Map-Reduce]
Why HIVE?
• Large installed base of SQL users
– i.e. map-reduce is for ultra-geeks
– much, much easier to write an SQL query
• Analytics SQL queries translate really well
to map-reduce
• Files as insufficient data management abstraction
– Tables, Schemas, Partitions, Indices

Hive Query Language
• Basic SQL
– From clause subquery
– ANSI JOIN (equi-join only)
– Multi-table Insert
– Multi group-by
– Sampling
– Objects traversal
• Extensibility
– Pluggable Map-reduce scripts using TRANSFORM
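To make the extensibility point concrete: Hive's TRANSFORM clause streams rows to an external script as tab-separated lines on standard input and reads tab-separated rows back from standard output. The sketch below shows what such a script could look like. The column layout, the script name, and the query in the comment are hypothetical examples, not taken from the talk.

```python
import io

# A script like this might be invoked from a query such as (hypothetical
# table and script names):
#   SELECT TRANSFORM(user_id, url) USING 'python extract_domain.py'
#   AS (user_id, domain) FROM page_views;


def transform(lines):
    """Turn 'user_id<TAB>url' rows into 'user_id<TAB>domain' rows."""
    for line in lines:
        user_id, url = line.rstrip("\n").split("\t")
        domain = url.split("/")[2] if "://" in url else url
        yield f"{user_id}\t{domain}"


def run(stream_in, stream_out):
    """A real script would call run(sys.stdin, sys.stdout)."""
    for row in transform(stream_in):
        stream_out.write(row + "\n")


# Demonstration with in-memory streams instead of stdin/stdout.
out = io.StringIO()
run(io.StringIO("42\thttp://example.com/a\n7\thttps://aws.amazon.com/ec2/\n"), out)
rows = out.getvalue().splitlines()
```

Because the contract is just lines in and lines out, the script can be written in any language, which is the point of the "pluggable" bullet.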
Data Warehousing at Facebook
(Scribe is a server for aggregating log data streamed in real time from a large
number of servers. It is designed to be scalable, extensible without client-side
modification, and robust to failure of the network or any specific machine)
[Diagram components: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]
Hadoop Usage @ Facebook
• Data warehouse running Hive
• 600 machines, 4800 cores
• 3200 jobs per day
• 50+ engineers have used Hadoop
• Data statistics:
– Total Data: ~2.5PB
– Net Data added/day: ~15TB
• 6TB of uncompressed source logs
• 4TB of uncompressed dimension data reloaded daily
– Compression Factor ~5x (gzip, more with bzip)
• Usage statistics:
– 3200 jobs/day with 800K map-reduce tasks/day
– 55TB of compressed data scanned daily
– 15TB of compressed output data written to HDFS
– 80 MM compute minutes/day
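A few per-machine and per-job averages follow directly from the figures above. The snippet below only divides numbers that appear on the slide; it assumes 1 TB = 1000 GB, and the results are averages that say nothing about how individual jobs vary.

```python
# Figures from the slide.
machines = 600
cores = 4800
jobs_per_day = 3200
tasks_per_day = 800_000
scanned_tb_per_day = 55

# Derived averages (assumption: 1 TB = 1000 GB).
cores_per_machine = cores // machines
tasks_per_job = tasks_per_day // jobs_per_day
scanned_gb_per_job = scanned_tb_per_day * 1000 / jobs_per_day
```

The 8 cores per machine matches the commodity "1U box" described on the hardware-trends slide, and the average job is made up of about 250 tasks scanning roughly 17 GB of compressed data.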
Hadoop Job types @ Facebook
• Production jobs: load data, compute
statistics, detect spam, etc
• Long experiments: machine learning, etc
• Small ad-hoc queries: Hive jobs, sampling
• GOAL: Provide fast response times for
small jobs and guaranteed service levels
for production jobs

Usage patterns in Yahoo
• ETL
– Put large data source (e.g. log files) onto the Hadoop File System
– Perform aggregations, transformations, normalizations on the data
– Load into RDBMS / data mart
• Reporting and Analytics
– Run canned and ad-hoc queries over large data
– Run analytics and data mining operations on large data
– Produce reports for end-user consumption or loading into data mart
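The first pattern above (land raw logs, aggregate them, load the result into a relational store) can be sketched end to end in miniature. Everything here is a stand-in: the log format is invented, a Python list plays the role of files on the Hadoop File System, and an in-memory SQLite database plays the role of the RDBMS / data mart.

```python
import sqlite3
from collections import Counter


def extract(lines):
    """Parse 'timestamp url status' records, skipping malformed lines."""
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            yield parts


def transform(records):
    """Aggregate: count successful hits per URL."""
    hits = Counter()
    for _timestamp, url, status in records:
        if status == "200":
            hits[url] += 1
    return sorted(hits.items())


def load(rows, connection):
    """Load the aggregated rows into a relational table."""
    connection.execute(
        "CREATE TABLE page_hits (url TEXT PRIMARY KEY, hits INTEGER)"
    )
    connection.executemany("INSERT INTO page_hits VALUES (?, ?)", rows)
    connection.commit()


log_lines = [
    "2009-04-30T10:00:00 /index.html 200",
    "2009-04-30T10:00:01 /index.html 200",
    "2009-04-30T10:00:02 /about.html 404",
    "2009-04-30T10:00:03 /about.html 200",
    "malformed line",
]
db = sqlite3.connect(":memory:")
load(transform(extract(log_lines)), db)
result = db.execute("SELECT url, hits FROM page_hits ORDER BY url").fetchall()
```

At scale the extract and transform steps would be map-reduce jobs over the raw logs, and only the much smaller aggregated table would be loaded into the database.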
Usage patterns in Yahoo
• Data Processing Pipelines
– Multi-step pipelines for data processing
– Coordination, scheduling, data collection and publishing of feeds
– SLA carrying, regularly scheduled jobs
• Machine Learning & Graph Algorithms
– Traverse large graphs and data sets, building models and classifiers
– Implement machine learning algorithms over massive data sets
• General Back end processing
– Implement significant portions of back-end, batch oriented processing on the grid
– General computation framework
– Simplify back-end architecture
What is Hadoop Pig?
Pig is a platform for analyzing large data sets that consists of a
high-level language for expressing data analysis programs, coupled
with infrastructure for evaluating these programs.
Thanks to the kind sponsorship to the AWS LONDON USER GROUP from
Thank you!



