A follow-on from last month's AWS User Group "Intro to AWS" talk. This talk by Nicola Cardace looks at some of the offerings in the AWS ecosystem.
The document discusses memory management in Spark applications, summarizing the different approaches developers have tried to address out-of-memory errors in Spark executors. It analyzes the root causes of memory issues, such as executor overheads and data sizes, and evaluates fixes like increasing memory overhead, reducing cores per executor, and more frequent garbage collection. It then dives into Spark- and JVM-level memory configuration options, such as storage pool sizes, caching formats, and garbage collection settings, to improve the reliability, efficiency, and performance of Spark jobs.
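To make those levers concrete, here is a minimal PySpark configuration sketch. The option names are real Spark settings, but the values are illustrative assumptions, not recommendations from the deck:

```python
from pyspark.sql import SparkSession

# Illustrative values only -- tune against your own workload.
spark = (
    SparkSession.builder
    .appName("oom-tuning-sketch")
    # Extra off-heap headroom per executor, the usual first fix for
    # "Container killed by YARN for exceeding memory limits".
    .config("spark.executor.memoryOverhead", "2g")
    # Fewer concurrent tasks per executor means less peak memory pressure.
    .config("spark.executor.cores", "3")
    # Fraction of heap shared by execution and storage (unified memory model).
    .config("spark.memory.fraction", "0.6")
    # Portion of that pool protected for cached blocks.
    .config("spark.memory.storageFraction", "0.5")
    # GC choice is a JVM-level lever of the kind the talk also touches on.
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    .getOrCreate()
)
```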
Building a Scalable Web Crawler with Hadoop, by Ahad Rana from CommonCrawl. Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl's extensive use of Hadoop to fulfill their mission of building an open and accessible web-scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.
This document provides tips and best practices for optimizing Apache Spark performance and resource allocation. It discusses: - The components of Spark including executors, drivers, and tasks - Configuring Spark on YARN and dynamic resource allocation - Optimizing memory usage, avoiding data skew, and reducing serialization costs - Best practices for Spark Streaming around microbatching, fault tolerance, and performance - Recommendations for running Spark on cloud object stores like S3
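As a sketch of what dynamic resource allocation on YARN looks like in practice, assuming Spark 2.x on a YARN cluster with the external shuffle service available; the executor bounds are placeholder values:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-sketch")
    # Let Spark grow and shrink the executor pool with the workload.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # Required by dynamic allocation on YARN (pre-3.x) so shuffle files
    # outlive the executors that wrote them.
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)
```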
This document provides a retrospective on data infrastructure at Facebook from 2007-2011 written by the ex-Facebook data infrastructure lead. It summarizes the goals of building a universal data logging and computing platform, the state and growth of the Hadoop cluster from 10TB to 50PB, and key components like Hive, Scribe, and reporting tools that helped various teams access and analyze data. It also discusses challenges around query performance, unnecessary duplication, and a lack of APIs that were missed opportunities. The overall message is that building useful services around the software was more important than the software itself.
This document discusses stream computing from an engineer's perspective. It begins by contrasting batch and stream processing, noting that stream processing handles data one record at a time with an emphasis on latency over throughput. The document then explores how to achieve scalability, performance, durability and availability in stream processing systems. It notes the tradeoffs between these goals and discusses challenges like handling failures. Specific open-source stream processing systems like Storm, Flink and Apex are then analyzed in terms of how they work, strengths, weaknesses and failure handling. The document concludes by discussing using distributed databases for state management in stream processing applications.
Hive provides an SQL-like interface to query data stored in Hadoop's HDFS distributed file system and processed using MapReduce. It allows users without MapReduce programming experience to write queries that Hive then compiles into a series of MapReduce jobs. The document discusses Hive's components, data model, query planning and optimization techniques, and performance compared to other frameworks like Pig.
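For a feel of that SQL-like interface, here is a hedged Python sketch that runs one HiveQL query over HiveServer2. It assumes the PyHive package, a reachable HiveServer2 endpoint, and a hypothetical access_logs table:

```python
from pyhive import hive  # assumes the PyHive package is installed

# Connection details are placeholders.
conn = hive.connect(host="hive-server.example.com", port=10000, database="default")
cursor = conn.cursor()

# A single HiveQL statement like this is compiled by Hive into one or
# more MapReduce jobs (a shuffle for GROUP BY, another for ORDER BY).
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM access_logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for page, hits in cursor.fetchall():
    print(page, hits)
```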
Hue is a web interface tool for exploring, analyzing, and visualizing data with Apache Hadoop. It allows users to prepare and browse data, compose queries using various editors and APIs, and productionize workflows. Key features include querying data, building search dashboards, and scheduling workflows. Hue aims to improve the SQL and search experience, enhance metadata search capabilities, and adopt a single page layout user interface.
This document describes how Apache Spark and Apache Lucene can be used together for near-real-time predictive model building. It discusses representing streaming device data in Lucene documents that are indexed for fast search and retrieval. A framework called Trapezium is used to build batch, streaming, and API services on top of Spark and Lucene. It shows how to index large datasets in Lucene efficiently using Spark and analyze retrieved devices to generate statistical and predictive models.
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. It is written in Java and uses a pluggable backend. Presto is fast due to code generation and runtime compilation techniques. It provides a library and framework for building distributed services and fast Java collections. Plugins allow Presto to connect to different data sources like Hive, Cassandra, MongoDB and more.
A comprehensive introduction to the big data world in the AWS cloud: Hadoop, streaming, batch, Kinesis, DynamoDB, HBase, EMR, Athena, Hive, Spark, Pig, Impala, Oozie, Data Pipeline, security, cost, and best practices.
Zeppelin is an open-source web-based notebook that enables data ingestion, exploration, visualization, and collaboration on Apache Spark. It has built-in support for languages like SQL, Python, Scala and R. Zeppelin notebooks can be stored in S3 for persistence and sharing. Apache Livy is a REST API that allows managing Spark jobs and provides a way to securely run and share notebooks across multiple users.
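A minimal sketch of driving Spark through Livy's REST API from Python, using the documented /sessions and /statements endpoints; the Livy host is a placeholder:

```python
import json
import time
import requests

LIVY = "http://livy.example.com:8998"  # placeholder endpoint
headers = {"Content-Type": "application/json"}

# Start an interactive PySpark session.
session = requests.post(LIVY + "/sessions",
                        data=json.dumps({"kind": "pyspark"}),
                        headers=headers).json()
session_url = f"{LIVY}/sessions/{session['id']}"

# Wait until the session is idle, then submit a statement.
while requests.get(session_url, headers=headers).json()["state"] != "idle":
    time.sleep(2)

stmt = requests.post(session_url + "/statements",
                     data=json.dumps({"code": "sc.parallelize(range(100)).sum()"}),
                     headers=headers).json()

# Poll the statement for its result.
print(requests.get(f"{session_url}/statements/{stmt['id']}", headers=headers).json())
```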
This document provides an overview of using Apache Spark with object stores like Amazon S3, Azure Blob Storage, and Google Cloud Storage. It discusses the key challenges of classpath configuration, credentials, code examples, and ensuring data consistency and durability. Specific tips are provided for configuring and working with S3 and Azure Blob Storage. The document emphasizes that object stores can be treated like any other URL, but some configuration is needed, and performance and commit challenges exist.
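As an illustration of the "treated like any other URL" point, a minimal PySpark sketch against S3 via the s3a connector; the bucket and credentials are placeholders, and in practice credentials are better supplied via IAM roles:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-sketch")
    # Explicit keys here are purely illustrative; prefer instance roles
    # or environment-based credential providers.
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# Once configured, an object store path is "just another URL".
df = spark.read.csv("s3a://example-bucket/input/", header=True)
df.write.parquet("s3a://example-bucket/output/")
```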
An over-ambitious introduction to Spark programming, testing, and deployment. This deck tries to cover most core technologies and design patterns used in SpookyStuff, the fastest query engine for data collection/mashup from the deep web. For more information please follow: https://github.com/tribbloid/spookystuff A bug in PowerPoint used to cause the transparent background color to render improperly; this has been fixed in a recent upload.
This document provides an overview of Apache Sqoop, a tool for transferring bulk data between Apache Hadoop and structured data stores like relational databases. It describes how Sqoop can import data from external sources into HDFS or related systems, and export data from Hadoop to external systems. The document also demonstrates how to use basic Sqoop commands to list databases and tables, import and export data between MySQL and HDFS, and perform updates during export.
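A hedged sketch of those basic commands, shown here as Python subprocess calls around the sqoop CLI (the flags are standard Sqoop options; the JDBC string, table names, and HDFS paths are hypothetical):

```python
import subprocess

JDBC = "jdbc:mysql://db.example.com/shop"  # placeholder connection string

# Equivalent to running the sqoop CLI directly; requires sqoop on PATH.
# -P prompts interactively for the database password.
subprocess.run(["sqoop", "list-tables",
                "--connect", JDBC, "--username", "etl", "-P"], check=True)

# Import one table from MySQL into HDFS as delimited text files.
subprocess.run(["sqoop", "import",
                "--connect", JDBC, "--username", "etl", "-P",
                "--table", "orders", "--target-dir", "/warehouse/orders"],
               check=True)

# Export results back to MySQL, updating existing rows by key.
subprocess.run(["sqoop", "export",
                "--connect", JDBC, "--username", "etl", "-P",
                "--table", "order_totals", "--export-dir", "/warehouse/order_totals",
                "--update-key", "order_id", "--update-mode", "allowinsert"],
               check=True)
```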
AWS Big Data Demystified is all about knowledge sharing, because knowledge should be given for free. In this lecture we discuss the advantages of working with Zeppelin + Spark SQL, JDBC + Thrift, Ganglia, R + SparkR + Livy, and a little bit about Ganglia on EMR. Subscribe to our YouTube channel to see the video of this lecture: https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
This document summarizes a presentation about Presto, an open source distributed SQL query engine. It discusses Presto's distributed and plug-in architecture, query planning process, and cluster configuration options. For architecture, it explains that Presto uses coordinators, workers, and connectors to distribute queries across data sources. For query planning, it shows how SQL queries are converted into logical and physical query plans with stages, tasks, and splits. For configuration, it reviews single-server, multi-worker, and multi-coordinator cluster topologies. It also provides an overview of Presto's recent updates.
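To see the planning stage from a client's perspective, a small sketch using the presto-python-client package to ask the coordinator for a query plan; the host, catalog, and table names are assumptions:

```python
import prestodb  # assumes the presto-python-client package

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # placeholder coordinator
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# EXPLAIN shows the plan the coordinator builds before it is broken
# into stages, tasks, and splits across the workers.
cur.execute("EXPLAIN SELECT region, COUNT(*) FROM orders GROUP BY region")
for row in cur.fetchall():
    print(row[0])
```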
Typesafe has launched Spark support for Mesosphere's Data Center Operating System (DCOS). Typesafe engineers are contributing to the Mesos support for Spark and Typesafe will provide commercial support for Spark development and production deployment on Mesos. Mesos' flexibility allows many frameworks like Spark to run on top of it. This document discusses Spark on Mesos in coarse-grained and fine-grained modes and some features coming soon like dynamic allocation and constraints.
The document discusses energy efficiency and loss reduction in water management. It presents the main stages of the water cycle and points out that a large share of the water captured never reaches users, which represents inefficiency. It also highlights key areas of intervention, such as the optimization of commercial processes and the efficient management of revenues and assets.
The document summarizes Tribal Moose's public relations strategies and results for 2009 and outlines plans for 2010. In 2009, Tribal Moose saw success growing their social media presence on Twitter and Facebook and receiving positive blogger reviews. Their PR work also improved search engine rankings. However, some opportunities were missed. For 2010, goals include increasing sales, building the brand and reputation, and strengthening key relationships through PR efforts like events and trade shows. Success requires timely client responses and clear communication.
Special report "International Competitiveness of SMEs" for Planète PME, 15 June 2010, in the presence of the President of the Republic, Mr. Nicolas Sarkozy.
Hi, I’m Nick Inglis and I’m the SharePoint Program Manager at AIIM International. AIIM is the community that provides education, research, and best practices to help organizations find, control, and optimize their information… and I am the SharePoint guy at AIIM. You can learn more about us at http://www.AIIM.org. Today we’re going to be talking about how to Collaborate and Adopt SharePoint successfully.
Paul Hopton, @relayr_cloud - "The WunderBar: Bootstrapping the Internet of Things". How to move beyond corporate hype and make the Internet of Things happen (almost) now.
Learn about the only solution to instantly provision a full-featured ETL environment running on AWS for less than your Sunday newspaper!
Brightpearl is a cloud-based business management platform that provides e-commerce, inventory, order, customer, and shipping functionality to over 1,300 customers. It is built on Amazon Web Services (AWS) using various programming languages and services. Some challenges of building and scaling such a platform on AWS include designing for redundancy, performance, concurrency, cost efficiency, and failure tolerance.
These are the slides from my presentation at CLOUDCOMP 2009 on AppScale, an open-source platform for running Google App Engine apps. See our project home page at http://appscale.cs.ucsb.edu or our code page at http://code.google.com/p/appscale
You sit on a big pile of data and want to know how to leverage it in your company? Interested in use cases, examples, and practical demos of the full Hadoop stack? Looking for big-data inspiration? In this talk we will cover: - Use cases showing how implementing a Hadoop stack at TheNewMotion drastically helped us, software engineers, with our everyday challenges, and how Hadoop enables our management, marketing, and operations teams to become more data-driven. - A practical introduction to our data warehouse, analytical, and visualization stack: Apache Pig, Impala, Hue, Apache Spark, IPython notebook, and Angular with D3.js. - Easy deployment of the Hadoop stack to the cloud. - Hermes, our homegrown command-line tool which helps us automate data-related tasks. - Examples of exciting machine learning challenges that we are currently tackling. - Hadoop with the Azure and Microsoft stack.
Two popular tools for doing machine learning on top of the JVM ecosystem are H2O and SparkML. This presentation compares the two as machine learning libraries (it does not consider Spark's data munging capabilities). This work was done in June 2018.
Slide presentation from Webinar on February 17, 2016. People in analytical roles are demanding more and more compute and storage to get their jobs done. Instead of building out infrastructure for a few employees or a department, systems engineers and IT managers can find value in creating a compute stack in the cloud to meet the fluctuating demand of their clients. In this 45-minute webinar, you’ll learn: - How to identify the right analytical workloads - How to create a scalable compute environment using the cloud for analysts in under 10 minutes - How to best manage costs associated with the cloud compute stack - How to create dedicated client stacks with their own scratch space as well as general access to reference data Health systems departments, research & development departments, and business analyst groups all face silos of these challenging, compute-intensive use cases. By learning how to quickly build this flexible workflow that can be scaled up and down (or off) instantly, you can support business objectives while efficiently managing costs.
Introduction to Apache Spark: architecture, resilient distributed datasets, how it works, use cases, and a comparison with Hadoop.
Introduction to Apache Spark: understanding the architecture, resilient distributed datasets, and how it works.
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and show how it complements Apache Hadoop. By the end of the session, you'll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
A presentation-cum-workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing streaming applications to be written very quickly and easily. It supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper, and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
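A condensed sketch of the workshop's shape, assuming the receiver-based Kafka 0.8-era integration that shipped as pyspark.streaming.kafka; the ZooKeeper address, consumer group, and topic name are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # spark-streaming-kafka package

sc = SparkContext(appName="clickstream-sketch")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Receiver-based stream via ZooKeeper, as in the Kafka 0.8-era API.
clicks = KafkaUtils.createStream(
    ssc, "zk.example.com:2181", "clickstream-group", {"clicks": 1})

# Count page hits per micro-batch; each record's value is the clicked URL.
counts = clicks.map(lambda kv: (kv[1], 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```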
This document provides an overview of migrating applications and workloads to AWS. It discusses key considerations for different migration approaches including "forklift", "embrace", and "optimize". It also covers important AWS services and best practices for architecture design, high availability, disaster recovery, security, storage, databases, auto-scaling, and cost optimization. Real-world customer examples of migration lessons and benefits are also presented.
The document discusses strategies for scaling LAMP applications on cloud computing platforms like AWS. It recommends: 1) Moving static files to scalable services like S3 and using a CDN to distribute load. 2) Using dedicated caching systems like Memcache instead of local caches and storing sessions in Memcache or DynamoDB for scalability. 3) Scaling databases horizontally using master-slave replication or sharding across multiple availability zones for high availability and read scaling. 4) Leveraging auto-scaling and load balancing on AWS with tools like Elastic Load Balancers, CloudWatch, and scaling alarms to dynamically scale application instances based on metrics.
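For the dedicated-caching recommendation, a minimal cache-aside sketch with Memcache from Python; it assumes the pymemcache package, and db.fetch_user is a hypothetical database accessor standing in for your data layer:

```python
from pymemcache.client.base import Client  # assumes the pymemcache package

cache = Client(("memcache.example.com", 11211))  # placeholder host

def get_user(user_id, db):
    """Cache-aside read: hit Memcache first, fall back to the database."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    row = db.fetch_user(user_id)  # hypothetical DB accessor
    # Values must be bytes/str unless a serializer is configured;
    # cache for five minutes as an illustrative TTL.
    cache.set(key, row, expire=300)
    return row
```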
This is a presentation on Apache Hadoop technology. It may help beginners learn the terminology of Hadoop, and it contains pictures that describe how the technology works. I hope it will be helpful for beginners. Thank you.
This presentation is about Apache Hadoop technology. It may help beginners learn some of the terminology of Hadoop.
This presentation is about Apache Hadoop technology and may be helpful for beginners, who will learn some Hadoop terminology; it also includes diagrams that show how the technology works. Thank you.
This document discusses logging scenarios using DynamoDB and Elastic MapReduce. It covers collecting log data in real-time using tools like Fluentd and storing it in DynamoDB. It then describes using EMR to perform ETL processes on the data, extracting from DynamoDB, transforming the data across EC2 instances, and loading to S3 or DynamoDB. Finally, it discusses analyzing the data using Redshift for queries or CloudSearch for search capabilities.
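A minimal sketch of the real-time collection step, writing log records into DynamoDB with boto3; the table name, key schema, and region are assumptions:

```python
import time
import boto3

# Assumes the table already exists with a suitable hash/range key
# for time-ordered log retrieval; name and region are placeholders.
table = boto3.resource("dynamodb", region_name="us-east-1").Table("app_logs")

def write_log(host, message):
    table.put_item(Item={
        "host": host,                   # partition key
        "ts": int(time.time() * 1000),  # sort key, millisecond timestamp
        "message": message,
    })

write_log("web-01", "GET /index.html 200")
```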
The document discusses scaling a web application called Wanelo that is built on PostgreSQL. It describes 12 steps for incrementally scaling the application as traffic increases. The first steps involve adding more caching, optimizing SQL queries, and upgrading hardware. Further steps include replicating reads to additional PostgreSQL servers, using alternative data stores like Redis where appropriate, moving write-heavy tables out of PostgreSQL, and tuning PostgreSQL and the underlying filesystem. The goal is to scale the application while maintaining PostgreSQL as the primary database.
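As a sketch of the read-replication step, a simple read/write routing helper with psycopg2; the DSNs are placeholders, and real setups usually sit behind a connection pooler such as PgBouncer:

```python
import random
import psycopg2  # assumes the psycopg2 package

# One primary for writes, streaming replicas for reads (placeholder DSNs).
PRIMARY = "host=pg-primary.example.com dbname=appdb"
REPLICAS = ["host=pg-replica1.example.com dbname=appdb",
            "host=pg-replica2.example.com dbname=appdb"]

def connect(readonly=False):
    """Route reads to a randomly chosen replica and writes to the primary."""
    dsn = random.choice(REPLICAS) if readonly else PRIMARY
    return psycopg2.connect(dsn)

with connect(readonly=True) as conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM products")
    print(cur.fetchone()[0])
```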
This document provides an overview of SK Telecom's use of big data analytics and Spark. Some key points: - SKT collects around 250 TB of data per day which is stored and analyzed using a Hadoop cluster of over 1400 nodes. - Spark is used for both batch and real-time processing due to its performance benefits over other frameworks. Two main use cases are described: real-time network analytics and a network enterprise data warehouse (DW) built on Spark SQL. - The network DW consolidates data from over 130 legacy databases to enable thorough analysis of the entire network. Spark SQL, dynamic resource allocation in YARN, and BI integration help meet requirements for timely processing and quick responses.