This document summarizes key Hadoop concepts: MapReduce processing with the map and reduce functions, HDFS storage and its failure modes, and a brief tour of ecosystem tools such as Pig, Hive, HBase, Sqoop, and Flume.
Hadoop for sysadmins
1. A whirlwind tour of hadoop
By Eric Marshall
For LOPSA-NJ
An all too brief introduction to the world of big data
2. Eric Marshall
I work for Airisdata; we're hiring!
Smallest computer I lost sleep over: Timex Sinclair ZX81 – 1KB of memory
Largest computer I lost sleep over: SGI Altix 4700 – 1TB of memory
3. Vocabulary disclaimer
Just like your favorite swear word, which can act as many parts of speech and refer to many a thing, hadoop vocabulary has the same problem.
Casually, people refer to hadoop as the storage, the processing, the programming model(s), or the clustered machines themselves. The same problem exists for other terms in the lexicon, so ask me when I make less sense than usual.
4. My plan of attack
An intro: the good, the bad and the ugly at 50,000 ft.
2¢ tour of hadoop's processing – MapReduce
2¢ tour of hadoop's storage – HDFS
A blitz tour of the rest of the hadoop ecosystem
5. Why did this happen?
Old school –> scale up == a larger, costlier monolithic system (or a small cluster thereof), i.e. vertical scaling
Different approach – all roads lead to scale out:
Assume failures
Smart software, cheap hardware
Don't move data; bring processing to data
6. The Good
Simple development (when compared to Message Passing Interface programming)
Scale – no shared state, programmers don't need to know the topology, easy to add hardware
Automatic parallelization and distribution of tasks
Fault tolerance
Works with commodity hardware
Open source!
7. The Bad
Not a silver bullet :(
MapReduce is batch data processing – the time scale is minutes to hours
MapReduce is overly simplified/abstracted – you are stuck with the M/R model and it is hard to work smarter
MapReduce is low level compared to high-level languages like SQL
Not all work decomposes well into parallelized M/R
Open source :)
10. Map()
Imagine a number of servers with lists of first names – what is the most popular name?
Box 1-isabella William ava mia Emma Alexander
Box 2-Noah NOAH Isabella Isabella emma Emma
Box 3-emma Emma Liam liam mason Isabella
Map() would apply a function to each element, independent of order.
For example, capitalize each word.
(MapReduce is covered in greater detail in Chapter 2 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
11. Map()
So we would have:
Box 1-Isabella William Ava Mia Emma Alexander
Box 2-Noah Noah Isabella Isabella Emma Emma
Box 3-Emma Emma Liam Liam Mason Isabella
Map() could then apply a function to make pairs.
For example, Isabella becomes (Isabella, 1)
12. Map()
So we would have:
Box 1-(Isabella,1) (William,1) (Ava,1) (Mia,1) (Emma,1) (Alexander,1)
Box 2-(Noah,1) (Noah,1) (Isabella,1) (Isabella,1) (Emma,1) (Emma,1)
Box 3-(Emma,1) (Emma,1) (Liam,1) (Liam,1) (Mason,1) (Isabella,1)
Now we are almost ready for the reduce, but first the sort and shuffle.
13. Shuffle/Sort
So we would have:
Box 1-(Alexander,1) (Ava,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1)
Box 2-(Isabella,1) (Isabella,1) (Isabella,1) (Isabella,1)
Box 3-(Liam,1) (Liam,1) (Mason,1) (Mia,1) (Noah,1) (Noah,1) (William,1)
Now for the reduce: our function would sum all of the 1s and return each name and its count.
14. Reduce
So we would have:
Box 1-(Alexander,1) (Ava,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1)
Box 2-(Isabella,1) (Isabella,1) (Isabella,1) (Isabella,1)
Box 3-(Liam,1) (Liam,1) (Mason,1) (Mia,1) (Noah,1) (Noah,1) (William,1)
Now for the reduce: our function would sum all of the 1s and return each name and its count.
Box 1-(Alexander,1) (Ava,1) (Emma,5)
Box 2-(Isabella,4)
Box 3-(Liam,2) (Mason,1) (Mia,1) (Noah,2) (William,1)
(https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html for similar code in Java)
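The same pipeline can also be run with Hadoop Streaming; a minimal sketch as two Python scripts (the file names, input/output paths, and streaming-jar location are illustrative and vary by distribution):

#!/usr/bin/env python
# mapper.py - emit "Name<TAB>1" for every whitespace-separated name on stdin
import sys
for line in sys.stdin:
    for name in line.split():
        print("%s\t1" % name.capitalize())

#!/usr/bin/env python
# reducer.py - streaming sorts map output by key, so we can sum runs of identical names
import sys
current, count = None, 0
for line in sys.stdin:
    name, _, n = line.rstrip("\n").partition("\t")
    if name == current:
        count += int(n)
    else:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = name, int(n)
if current is not None:
    print("%s\t%d" % (current, count))

# run it (jar location illustrative):
# yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
#     -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
#     -input /names -output /name-counts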
15. (This architecture is covered in greater detail in Chapter 4 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
17. Map/reduce failures
Check the job if:
The job throws an uncaught exception.
The job exits with a nonzero exit code.
The job fails to report progress to the tasktracker for a configurable amount of time (i.e. hung, stuck, slow).
Check the node if:
the same node keeps killing jobs.
Check the JobTracker/RM if:
jobs are lost or stuck and then they all fail.
18. Instant MR test
Um, is the system working?
yarn jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
(your jar most likely will be somewhere else)
19. (HDFS is covered in greater detail in Chapter 3 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
20. There is an HDFS CLI
You already know some of the commands:
hdfs dfs -ls /
hdfs dfs -du /
hdfs dfs -rm /
hdfs dfs -cat /
There are other modes than dfs: dfsadmin, namenode, datanode, fsck, zkfc, balancer, etc.
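A quick round trip to try (paths and file names are illustrative):
hdfs dfs -mkdir -p /user/eric
hdfs dfs -put names.txt /user/eric/
hdfs dfs -cat /user/eric/names.txt
hdfs dfs -get /user/eric/names.txt ./names-copy.txt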
21. HDFS failures
Jobs fail due to missing blocks
Jobs fail due to data movement from down datanodes or huge ingest
Without NN HA – a single point of failure for everything
Regular file system mayhem that you already know and love, plus the usual perms issues
23. The rest of the garden
Distributed Filesystems
- Apache HDFS
outliers:
- Tachyon
- Apache GridGain
- Ignite
- XtreemFS
- Ceph Filesystem
- Red Hat GlusterFS
- Quantcast File System QFS
- Lustre
Security
outliers:
- Apache Sentry
- Apache Knox Gateway
- Apache Ranger
Distributed Programming
- Apache MapReduce also MRv2/YARN
- Apache Pig
outliers:
- JAQL
- Apache Spark
- Apache Flink (formerly Stratosphere)
- Netflix PigPen
- AMPLab SIMR
- Facebook Corona
- Apache Twill
- Damballa Parkour
- Apache Hama
- Datasalt Pangool
- Apache Tez
- Apache Llama
- Apache DataFu
- Pydoop
- Kangaroo
- TinkerPop
- Pachyderm MapReduce
NewSQL Databases
outliers:
- TokuDB
- HandlerSocket
- Akiban Server
- Drizzle
- Haeinsa
- SenseiDB
- Sky
- BayesDB
- InfluxDB
NoSQL Databases
:Columnar Data Model
- Apache HBase
outliers:
- Apache Accumulo
- Hypertable
- HP Vertica
:Key Value Data Model
- Apache Cassandra
- Riak
- Redis
- LinkedIn Voldemort
:Document Data Model
outliers:
- MongoDB
- RethinkDB
- ArangoDB
- CouchDB
:Stream Data Model
outliers:
- EventStore
:Key-Value Data Model
outliers:
- Redis DataBase
- Linkedin Voldemort
- RocksDB
- OpenTSDB
:Graph Data Model
outliers:
- Neo4j
- ArangoDB
- TitanDB
- OrientDB
- Intel GraphBuilder
- Giraph
- Pegasus
- Apache Spark
Scheduling
- Apache Oozie
outliers:
- Linkedin Azkaban
- Spotify Luigi
- Apache Falcon
24. 10 in 10 minutes!
Easier programming: Pig, Spark
SQL-like tools: Hive, Impala, HBase
Data pipefitting: Sqoop, Flume, Kafka
Bookkeeping: Oozie, Zookeeper
26. Pig
What is it: a high-level programming language for data manipulation that abstracts away M/R, from Yahoo
Why: a few lines of code to munge data
Example:
-- load the raw lines and split them into words (path illustrative):
lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
(Pig is covered in greater detail in Alan Gates's Programming Pig by O’Reilly and in Chapter 16 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
27. Spark
What is it: a computing framework from AMPLab, UC Berkeley
Why: high-level abstractions and better use of memory
Neat trick: in-memory RDDs
Example (textFile defined as in the Spark quickstart):
scala> val textFile = sc.textFile("README.md")
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
Or, in Python:
>>> textFile = sc.textFile("README.md")
>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
(Spark is covered in greater detail by Matei Zaharia et al. in Learning Spark by O’Reilly. Also of note is Advanced Analytics with Spark – it shows Spark’s capabilities well but moves way too quickly to be truly useful. Spark is also covered in Chapter 19 of Tom White’s Hadoop – The Definitive Guide by O’Reilly – latest ed. only)
29. Hive/HQL
What is it: a data infrastructure and query language from Facebook
Why: batched SQL queries against HDFS
Neat trick: stores metadata so you don't have to
Example:
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;
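For the LOAD above to work, the table must already exist; a minimal sketch with an illustrative two-column schema (not the full BX-Books layout):
hive> CREATE TABLE BXDataSet (booktitle STRING, yearofpublication INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';';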
(Hive is covered in greater detail by Jason Rutherglen et al. in Programming Hive by O’Reilly. Instant Apache Hive Essentials How-To by Darren Lee from Packt was useful to me as a tutorial. Hive is also covered in Chapter 17 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
30. Impala
What is it: SQL query engine from Cloudera
Why: fast ad hoc queries on subsets of data stored in hadoop
Example:
[impala-host:21000] > select count(*) from customer_address;
(nada, let me know if you hit pay dirt)
31. HBase
What is it: a non-relational database from Powerset
Why: fast access to large sparse data sets
Example:
hbase(main):001:0> create 'test', 'cf'
0 row(s) in 0.4170 seconds
Hbase::Table – test
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0850 seconds
hbase(main):006:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1421762485768, value=value1
(HBase is covered in Chapter 20 of Tom White’s Hadoop – The Definitive Guide by O’Reilly and in greater detail in Lars George’s HBase – The Definitive Guide by O’Reilly)
33. Sqoop
What is it: glue tool for moving data between relational databases and hadoop
Why: make the cumbersome easier
Example:
sqoop list-databases --connect jdbc:mysql://mysql/employees --username joe --password myPassword
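Listing databases is just a connectivity check; a typical next step is an import (table and target directory are illustrative):
sqoop import --connect jdbc:mysql://mysql/employees --username joe --password myPassword --table employees --target-dir /user/joe/employees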
(Sqoop is covered in greater detail in Chapter 15 of Tom White’s Hadoop – The Definitive Guide by O’Reilly. There is also a cookbook that covers a few worthy gotchas: Apache Sqoop Cookbook by Kathleen Ting et al., O’Reilly)
34. Flume
What is it: a service for collecting and aggregating logs
Why: because log ingestion is tougher than it seems
Example:
# Define a memory channel on agent called memory-channel.
agent.channels.memory-channel.type = memory
# Define a source on agent and connect to channel memory-channel.
agent.sources.tail-source.type = exec
agent.sources.tail-source.command = tail -F /var/log/system.log
agent.sources.tail-source.channels = memory-channel
# Define a sink that outputs to logger.
agent.sinks.log-sink.channel = memory-channel
agent.sinks.log-sink.type = logger
# Define a sink that outputs to hdfs.
agent.sinks.hdfs-sink.channel = memory-channel
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:54310/tmp/system.log/
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
# Finally, activate.
agent.channels = memory-channel
agent.sources = tail-source
agent.sinks = log-sink hdfs-sink
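To start an agent with that configuration (config file name illustrative):
flume-ng agent --conf conf --conf-file agent.conf --name agent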
(I haven’t read much on Flume; if you find something clever let me know!)
35. Kafka
What is it: message broker from LinkedIn
Why: fast handling of data feeds
Neat trick: no need to worry about missing data or double-processing data
Example:
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message
(I disliked the one book I read but I found the online docs very readable! http://kafka.apache.org/
Also check out the design docs http://kafka.apache.org/documentation.html#design )
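For programmatic access, a minimal sketch in Python, assuming the third-party kafka-python package is installed (broker address and topic are illustrative):

from kafka import KafkaProducer  # assumption: kafka-python is available

# send the same messages the console example used
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test", b"This is a message")
producer.send("test", b"This is another message")
producer.flush()  # block until the broker has acknowledged the messages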
37. Oozie
What is it: workflow scheduler from Yahoo! Bangalore
Why: because cron isn't perfect
Example:
oozie job -oozie http://localhost:8080/oozie -config examples/apps/map-reduce/job.properties -run
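The -config file supplies the variables the workflow XML references; a minimal job.properties along the lines of the shipped examples (hosts and ports are illustrative):
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
queueName=default
oozie.wf.application.path=${nameNode}/user/${user.name}/examples/apps/map-reduce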
(Oozie is covered in greater detail in Islam & Srinivasan’s Apache Oozie: The Workflow Scheduler by O’Reilly)
38. Zookeeper
What is it: a coordination service from Yahoo
Why: sync info for distributed systems (a similar idea to DNS or LDAP)
Example:
[zkshell: 14] set /zk_test junk
cZxid = 5
ctime = Fri Jun 05 13:57:06 PDT 2009
mZxid = 6
mtime = Fri Jun 05 14:01:52 PDT 2009
pZxid = 5
[zkshell: 15] get /zk_test
junk
cZxid = 5
ctime = Fri Jun 05 13:57:06 PDT 2009
mZxid = 6
mtime = Fri Jun 05 14:01:52 PDT 2009
pZxid = 5
(Zookeeper is covered in greater detail in ZooKeeper: Distributed Process Coordination by O’Reilly and in Chapter 21 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
44. And now a bit of common sense for sys-admin-ing Hadoop clusters
45. Avoid
The usual –
Don't let HDFS fill up
Don't use all the memory
Don't use up all the CPUs
Don't drop the network
<insert fav disaster>
Resource exhaustion by users
Hardware failure (drives are the king of this domain)
46. Um, backups?
Usual suspects plus:
Namenode's metadata!! (fsimage)
HDFS? Well, it would be nice but unlikely (if so, distcp)
Snapshots
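A couple of concrete starting points (paths and snapshot names are illustrative):
hdfs dfsadmin -fetchImage /backup/namenode/ (pulls the latest fsimage from the NameNode)
hdfs dfsadmin -allowSnapshot /user/important
hdfs dfs -createSnapshot /user/important nightly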
48. Monitoring
The usual suspects plus…
JMX support
JVM via jstat, jmap, etc.
HDFS
MapReduce
conf/hadoop-metrics.properties
http://namenode:50070/
http://namenode:50070/jmx
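The /jmx endpoint serves JSON, so it is easy to scrape into your monitoring system; a minimal sketch in Python (host, port, and the counters pulled are illustrative; 50070 is the pre-Hadoop-3 NameNode default):

import json
from urllib.request import urlopen

url = "http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
for bean in json.load(urlopen(url))["beans"]:
    # the FSNamesystem bean carries capacity and block counters, among others
    print(bean.get("CapacityUsed"), bean.get("CapacityRemaining"), bean.get("MissingBlocks"))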
49. User management
HDFS quotas (example below)
Access controls – internal and external
MR schedulers: FIFO, Fair, Capacity
Kerberos can be used as well
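For example, capping a user's file count and raw space with HDFS quotas (numbers are illustrative):
hdfs dfsadmin -setQuota 100000 /user/joe
hdfs dfsadmin -setSpaceQuota 2t /user/joe
hdfs dfs -count -q /user/joe (shows usage against both quotas)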
50. Configuration
/etc/hadoop/conf
Lots of knobs!
¡Ojo! (watch out) –
Lots of overrides
Get the basic system solid before security and performance
Watch the units – some are in megabytes but some are in bytes!
Have canary jobs
Ensure the same configs are everywhere (including uniform dns/host naming)
52. Fin
Thanks for listening
Slides: http://www.slideshare.net/ericwilliammarshall/hadoop-for-sysadmins
Any questions?
53. What’s in a name?
Doug Cutting seems to have been inspired by his family. Lucene is his wife’s middle name, and her maternal grandmother’s first name. His son, as a toddler, used Nutch as the all-purpose word for meal and later named a yellow stuffed elephant Hadoop. Doug said he “was looking for a name that wasn’t already a web domain and wasn’t trademarked, so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words.”
54. What to do?
Combinations of the usual stuff:
Numerical Summarizations
Filtering
Altering Data Organization
Joining Data
I/O