In these slides we analyze why the aggregate data models change the way data is stored and manipulated. We introduce MapReduce and its open source implementation Hadoop. We consider how MapReduce jobs are written and executed by Hadoop.
Finally we introduce spark using a docker image and we show how to use anonymous function in spark.
The topics of the next slides will be
- Spark Shell (Scala, Python)
- Shark Shell
- Data Frames
- Spark Streaming
- Code Examples: Data Processing and Machine Learning
The document describes the Hadoop ecosystem and its core components. It discusses HDFS, which stores large files across clusters and is made up of a NameNode and DataNodes. It also discusses MapReduce, which allows distributed processing of large datasets using a map and reduce function. Other components discussed include Hive, Pig, Impala, and Sqoop.
In this lecture we analyze document oriented databases. In particular we consider why there are the first approach to nosql and what are the main features. Then, we analyze as example MongoDB. We consider the data model, CRUD operations, write concerns, scaling (replication and sharding).
Finally we presents other document oriented database and when to use or not document oriented databases.
Apache Spark is a fast, general engine for large-scale data processing. It provides unified analytics engine for batch, interactive, and stream processing using an in-memory abstraction called resilient distributed datasets (RDDs). Spark's speed comes from its ability to run computations directly on data stored in cluster memory and optimize performance through caching. It also integrates well with other big data technologies like HDFS, Hive, and HBase. Many large companies are using Spark for its speed, ease of use, and support for multiple workloads and languages.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
This document introduces Apache Spark, an open-source cluster computing system that provides fast, general execution engines for large-scale data processing. It summarizes key Spark concepts including resilient distributed datasets (RDDs) that let users spread data across a cluster, transformations that operate on RDDs, and actions that return values to the driver program. Examples demonstrate how to load data from files, filter and transform it using RDDs, and run Spark programs on a local or cluster environment.
Spark is an open source cluster computing framework for large-scale data processing. It provides high-level APIs and runs on Hadoop clusters. Spark components include Spark Core for execution, Spark SQL for SQL queries, Spark Streaming for real-time data, and MLlib for machine learning. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across nodes for parallel processing. A word count example demonstrates how to use transformations like flatMap and reduceByKey to count word frequencies from an input file in Spark.
Transformations and actions a visual guide training
The document summarizes key Spark API operations including transformations like map, filter, flatMap, groupBy, and actions like collect, count, and reduce. It provides visual diagrams and examples to illustrate how each operation works, the inputs and outputs, and whether the operation is narrow or wide.
This document provides an overview of big data ecosystems, including common log formats, compression techniques, data collection methods, distributed storage options like HDFS and S3, distributed processing frameworks like Hadoop MapReduce and Storm, workflow managers, real-time storage options, and other related topics. It describes technologies like Kafka, HBase, Cassandra, Pig, Hive, Oozie, and Azkaban; compares advantages and disadvantages of HDFS, S3, HBase and other storage systems; and provides references for further information.
This document discusses scaling extract, transform, load (ETL) processes with Apache Hadoop. It describes how data volumes and varieties have increased, challenging traditional ETL approaches. Hadoop offers a flexible way to store and process structured and unstructured data at scale. The document outlines best practices for extracting data from databases and files, transforming data using tools like MapReduce, Pig and Hive, and loading data into data warehouses or keeping it in Hadoop. It also discusses workflow management with tools like Oozie. The document cautions against several potential mistakes in ETL design and implementation with Hadoop.
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics, it uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking and supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
Spark Streaming allows processing live data streams using small batch sizes to provide low latency results. It provides a simple API to implement complex stream processing algorithms across hundreds of nodes. Spark SQL allows querying structured data using SQL or the Hive query language and integrates with Spark's batch and interactive processing. MLlib provides machine learning algorithms and pipelines to easily apply ML to large datasets. GraphX extends Spark with an API for graph-parallel computation on property graphs.
This document discusses Sqoop, a tool for importing structured data from databases into Hadoop. Sqoop automates the process of importing data from relational databases into HDFS and generating code to work with the imported data. It supports importing from popular databases into various data formats like text files or SequenceFiles. Sqoop also integrates with Hive, allowing SQL-like queries over imported data. The tool aims to simplify common ETL tasks and make database data accessible for MapReduce jobs and analytics on Hadoop.
This document summarizes Syncsort's high performance data integration solutions for Hadoop contexts. Syncsort has over 40 years of experience innovating performance solutions. Their DMExpress product provides high-speed connectivity to Hadoop and accelerates ETL workflows. It uses partitioning and parallelization to load data into HDFS 6x faster than native methods. DMExpress also enhances usability with a graphical interface and accelerates MapReduce jobs by replacing sort functions. Customers report TCO reductions of 50-75% and ROI within 12 months by using DMExpress to optimize their Hadoop deployments.
With Hadoop-3.0.0-alpha2 being released in January 2017, it's time to have a closer look at the features and fixes of Hadoop 3.0.
We will have a look at Core Hadoop, HDFS and YARN, and answer the emerging question whether Hadoop 3.0 will be an architectural revolution like Hadoop 2 was with YARN & Co. or will it be more of an evolution adapting to new use cases like IoT, Machine Learning and Deep Learning (TensorFlow)?
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
We will give a detailed introduction to Apache Spark and why and how Spark can change the analytics world. Apache Spark's memory abstraction is RDD (Resilient Distributed DataSet). One of the key reason why Apache Spark is so different is because of the introduction of RDD. You cannot do anything in Apache Spark without knowing about RDDs. We will give a high level introduction to RDD and in the second half we will have a deep dive into RDDs.
In these slides we introduce Column-Oriented Stores. We deeply analyze Google BigTable. We discuss about features, data model, architecture, components and its implementation. In the second part we discuss all the major open source implementation for column-oriented databases.
This document discusses how to setup HBase with Docker in three configurations: single-node standalone, pseudo-distributed single-machine, and fully-distributed cluster. It describes features of HBase like consistent reads/writes, automatic sharding and failover. It provides instructions for installing HBase in a single node using Docker, including building an image and running it with ports exposed. It also covers running HBase in pseudo-distributed mode with the processes running as separate containers and interacting with the HBase shell.
In this lecture we analyze graph oriented databases. In particular, we consider TtitanDB as graph database. We analyze how to query using gremlin and how to create edges and vertex.
Finally, we presents how to use rexster to visualize the storeg graph/
Hadoop MapReduce is an open source framework for distributed processing of large datasets across clusters of computers. It allows parallel processing of large datasets by dividing the work across nodes. The framework handles scheduling, fault tolerance, and distribution of work. MapReduce consists of two main phases - the map phase where the data is processed key-value pairs and the reduce phase where the outputs of the map phase are aggregated together. It provides an easy programming model for developers to write distributed applications for large scale processing of structured and unstructured data.
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective.
In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.
This document discusses the Spark analytics platform and its advantages over existing Hadoop-based platforms. Spark provides a unified data processing engine for batch, interactive, and streaming workloads. It includes components like Spark Core for distributed computing, Spark Streaming for real-time data processing, Shark for SQL and analytics, GraphX for graph processing, and MLlib for machine learning. Spark aims to be up to 100x faster than Hadoop for interactive queries by keeping data in-memory using its Resilient Distributed Datasets (RDDs). It also leverages other projects like Mesos for resource management and Tachyon for a fault-tolerant shared storage layer.
This document provides an introduction to HBase, including:
- An overview of BigTable, which HBase is modeled after
- Descriptions of the key features of HBase like being distributed, column-oriented, and versioned
- Examples of using the HBase shell to create tables, insert and retrieve data
- An explanation of the Java APIs for administering HBase, inserting/updating/retrieving data using Puts, Gets, and Scans
- Suggestions for setting up HBase with Docker for coding examples
The document discusses cloning Twitter using HBase. It describes some key features of Twitter like allowing users to post status updates, follow other users, mention users, and re-tweet posts. It then provides an overview of HBase including its features like consistency, automatic sharding and failover. It discusses how to install HBase in single node, pseudo-distributed and fully distributed modes using Docker. It also demonstrates some common HBase shell commands like creating and listing tables, putting and getting data. Finally, it discusses how to model the user, tweet, follower and following relationships in HBase.
The Information Technology have led us into an era where the production, sharing and use of information are now part of everyday life and of which we are often unaware actors almost: it is now almost inevitable not leave a digital trail of many of the actions we do every day; for example, by digital content such as photos, videos, blog posts and everything that revolves around the social networks (Facebook and Twitter in particular). Added to this is that with the "internet of things", we see an increase in devices such as watches, bracelets, thermostats and many other items that are able to connect to the network and therefore generate large data streams. This explosion of data justifies the birth, in the world of the term Big Data: it indicates the data produced in large quantities, with remarkable speed and in different formats, which requires processing technologies and resources that go far beyond the conventional systems management and storage of data. It is immediately clear that, 1) models of data storage based on the relational model, and 2) processing systems based on stored procedures and computations on grids are not applicable in these contexts. As regards the point 1, the RDBMS, widely used for a great variety of applications, have some problems when the amount of data grows beyond certain limits. The scalability and cost of implementation are only a part of the disadvantages: very often, in fact, when there is opposite to the management of big data, also the variability, or the lack of a fixed structure, represents a significant problem. This has given a boost to the development of the NoSQL database. The website NoSQL Databases defines NoSQL databases such as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable." These databases are: distributed, open source, scalable horizontally, without a predetermined pattern (key-value, column-oriented, document-based and graph-based), easily replicable, devoid of the ACID and can handle large amounts of data. These databases are integrated or integrated with processing tools based on the MapReduce paradigm proposed by Google in 2009. MapReduce with the open source Hadoop framework represent the new model for distributed processing of large amounts of data that goes to supplant techniques based on stored procedures and computational grids (step 2). The relational model taught courses in basic database design, has many limitations compared to the demands posed by new applications based on Big Data and NoSQL databases that use to store data and MapReduce to process large amounts of data.
Course Website http://pbdmng.datatoknowledge.it/
Contact me to download the slides
This document discusses cloning Twitter using Redis by storing user, follower, and post data in Redis keys and data structures. It provides examples of how to store:
1) User profiles as Hashes with fields like username and ID.
2) Follower and following relationships as Sorted Sets with user IDs and timestamps.
3) User posts and timelines as Lists by pushing new post IDs.
It explains that while Redis lacks tables, its keys and data structures like Hashes, Sets and Lists allow building the same data model without secondary indexes. The document also notes that the system can scale horizontally by sharding the data across multiple Redis servers.
In this era of ever growing data, the need for analyzing it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives like Hadoop, Spark, Storm etc. Spark, however is unique in providing batch as well as streaming capabilities, thus making it a preferred choice for lightening fast Big Data Analysis platforms.
DynamoDB is a key-value database that achieves high availability and scalability through several techniques:
1. It uses consistent hashing to partition and replicate data across multiple storage nodes, allowing incremental scalability.
2. It employs vector clocks to maintain consistency among replicas during writes, decoupling version size from update rates.
3. For handling temporary failures, it uses sloppy quorum and hinted handoff to provide high availability and durability guarantees when some replicas are unavailable.
- Data modeling for NoSQL databases is different than relational databases and requires designing the data model around access patterns rather than object structure. Key differences include not having joins so data needs to be duplicated and modeling the data in a way that works for querying, indexing, and retrieval speed.
- The data model should focus on making the most of features like atomic updates, inner indexes, and unique identifiers. It's also important to consider how data will be added, modified, and retrieved factoring in object complexity, marshalling/unmarshalling costs, and index maintenance.
- The _id field can be tailored to the access patterns, such as using dates for time-series data to keep recent
Oracle Solaris 11 as a BIG Data Platform Apache Hadoop Use Case
The document discusses using Oracle Solaris technologies for an Apache Hadoop cluster. It describes how Oracle Solaris Zones and ZFS provide benefits like fast provisioning of cluster nodes, high network throughput, large data capacity, and optimized I/O performance for Hadoop deployments. Various Oracle Solaris tools are also highlighted that can help monitor resource usage and troubleshoot performance issues for Hadoop workloads.
This document compares MapReduce and Spark frameworks. It discusses their histories and basic functionalities. MapReduce uses input, map, shuffle, and reduce stages, while Spark uses RDDs (Resilient Distributed Datasets) and transformations and actions. Spark is easier to program than MapReduce due to its interactive mode, but MapReduce has more supporting tools. Performance benchmarks show Spark is faster than MapReduce for sorting. The hardware and developer costs of Spark are also lower than MapReduce.
This document is a presentation on Apache Spark that compares its performance to MapReduce. It discusses how Spark is faster than MapReduce, provides code examples of performing word counts in both Spark and MapReduce, and explains features that make Spark suitable for big data analytics such as simplifying data analysis, providing built-in machine learning and graph libraries, and speaking multiple languages. It also lists many large companies that use Spark for applications like recommendations, business intelligence, and fraud detection.
The document discusses realtime analytics using Hadoop and HBase. It begins by introducing the speaker and their experience. It then discusses moving from batch processing with Hadoop to more realtime needs, and how systems like HBase can help bridge that gap. Several designs are presented for using HBase and Hadoop together to enable both realtime and batch analytics on large datasets.
This document provides an overview of the Apache Spark framework. It discusses how Spark allows distributed processing of large datasets across computer clusters using simple programming models. It also describes how Spark can scale from single servers to thousands of machines. Spark is designed to provide high availability by detecting and handling failures at the application layer. The document also summarizes Resilient Distributed Datasets (RDDs), which are Spark's fundamental data abstraction, and transformations and actions that can be performed on RDDs.
The document discusses cloud computing systems and MapReduce. It provides background on MapReduce, describing how it works and how it was inspired by functional programming concepts like map and reduce. It also discusses some limitations of MapReduce, noting that it is not designed for general-purpose parallel processing and can be inefficient for certain types of workloads. Alternative approaches like MRlite and DCell are proposed to provide more flexible and efficient distributed processing frameworks.
This document discusses Spark Streaming and its use for near real-time ETL. It provides an overview of Spark Streaming, how it works internally using receivers and workers to process streaming data, and an example use case of building a recommender system to find matches using both batch and streaming data. Key points covered include the streaming execution model, handling data receipt and job scheduling, and potential issues around data loss and (de)serialization.
The document provides an overview of data science with Python and integrating Python with Hadoop and Apache Spark frameworks. It discusses:
- Why Python should be integrated with Hadoop and the ecosystem including HDFS, MapReduce, and Spark.
- Key concepts of Hadoop including HDFS for storage, MapReduce for processing, and how Python can be integrated via APIs.
- Benefits of Apache Spark like speed, simplicity, and efficiency through its RDD abstraction and how PySpark enables Python access.
- Examples of using Hadoop Streaming and PySpark to analyze data and determine word counts from documents.
Spark improves on Hadoop MapReduce by keeping data in-memory between jobs. It reads data into resilient distributed datasets (RDDs) that can be transformed and cached in memory across nodes for faster iterative jobs. RDDs are immutable, partitioned collections distributed across a Spark cluster. Transformations define operations on RDDs, while actions trigger computation by passing data to the driver program.
Spark is a fast and general cluster computing system that improves on MapReduce by keeping data in-memory between jobs. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark core provides in-memory computing capabilities and a programming model that allows users to write programs as transformations on distributed datasets.
This document provides an overview of big data processing techniques including batch processing using MapReduce and Hive, iterative batch processing using Spark, stream processing using Apache Storm, and OLAP over big data using Dremel and Druid. It discusses techniques such as MapReduce, Hive, Spark RDDs, and Storm tuples for processing large datasets and compares small versus big data approaches. Example usages and technologies for different processing types are also outlined.
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and additional tools like Hive, Pig, HBase, Zookeeper, Flume, Sqoop and Oozie that make up its ecosystem. Advantages are its ability to handle unlimited data storage and high speed processing, while disadvantages include lower speeds for small datasets and limitations on data storage size.
The document describes the Hadoop ecosystem and its core components. It discusses HDFS, which stores large files across clusters and is made up of a NameNode and DataNodes. It also discusses MapReduce, which allows distributed processing of large datasets using a map and reduce function. Other components discussed include Hive, Pig, Impala, and Sqoop.
In this lecture we analyze document oriented databases. In particular we consider why there are the first approach to nosql and what are the main features. Then, we analyze as example MongoDB. We consider the data model, CRUD operations, write concerns, scaling (replication and sharding).
Finally we presents other document oriented database and when to use or not document oriented databases.
Apache Spark is a fast, general engine for large-scale data processing. It provides unified analytics engine for batch, interactive, and stream processing using an in-memory abstraction called resilient distributed datasets (RDDs). Spark's speed comes from its ability to run computations directly on data stored in cluster memory and optimize performance through caching. It also integrates well with other big data technologies like HDFS, Hive, and HBase. Many large companies are using Spark for its speed, ease of use, and support for multiple workloads and languages.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
This document introduces Apache Spark, an open-source cluster computing system that provides fast, general execution engines for large-scale data processing. It summarizes key Spark concepts including resilient distributed datasets (RDDs) that let users spread data across a cluster, transformations that operate on RDDs, and actions that return values to the driver program. Examples demonstrate how to load data from files, filter and transform it using RDDs, and run Spark programs on a local or cluster environment.
Spark is an open source cluster computing framework for large-scale data processing. It provides high-level APIs and runs on Hadoop clusters. Spark components include Spark Core for execution, Spark SQL for SQL queries, Spark Streaming for real-time data, and MLlib for machine learning. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across nodes for parallel processing. A word count example demonstrates how to use transformations like flatMap and reduceByKey to count word frequencies from an input file in Spark.
Transformations and actions a visual guide trainingSpark Summit
The document summarizes key Spark API operations including transformations like map, filter, flatMap, groupBy, and actions like collect, count, and reduce. It provides visual diagrams and examples to illustrate how each operation works, the inputs and outputs, and whether the operation is narrow or wide.
This document provides an overview of big data ecosystems, including common log formats, compression techniques, data collection methods, distributed storage options like HDFS and S3, distributed processing frameworks like Hadoop MapReduce and Storm, workflow managers, real-time storage options, and other related topics. It describes technologies like Kafka, HBase, Cassandra, Pig, Hive, Oozie, and Azkaban; compares advantages and disadvantages of HDFS, S3, HBase and other storage systems; and provides references for further information.
This document discusses scaling extract, transform, load (ETL) processes with Apache Hadoop. It describes how data volumes and varieties have increased, challenging traditional ETL approaches. Hadoop offers a flexible way to store and process structured and unstructured data at scale. The document outlines best practices for extracting data from databases and files, transforming data using tools like MapReduce, Pig and Hive, and loading data into data warehouses or keeping it in Hadoop. It also discusses workflow management with tools like Oozie. The document cautions against several potential mistakes in ETL design and implementation with Hadoop.
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics, it uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking and supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
Spark Streaming allows processing live data streams using small batch sizes to provide low latency results. It provides a simple API to implement complex stream processing algorithms across hundreds of nodes. Spark SQL allows querying structured data using SQL or the Hive query language and integrates with Spark's batch and interactive processing. MLlib provides machine learning algorithms and pipelines to easily apply ML to large datasets. GraphX extends Spark with an API for graph-parallel computation on property graphs.
This document discusses Sqoop, a tool for importing structured data from databases into Hadoop. Sqoop automates the process of importing data from relational databases into HDFS and generating code to work with the imported data. It supports importing from popular databases into various data formats like text files or SequenceFiles. Sqoop also integrates with Hive, allowing SQL-like queries over imported data. The tool aims to simplify common ETL tasks and make database data accessible for MapReduce jobs and analytics on Hadoop.
This document summarizes Syncsort's high performance data integration solutions for Hadoop contexts. Syncsort has over 40 years of experience innovating performance solutions. Their DMExpress product provides high-speed connectivity to Hadoop and accelerates ETL workflows. It uses partitioning and parallelization to load data into HDFS 6x faster than native methods. DMExpress also enhances usability with a graphical interface and accelerates MapReduce jobs by replacing sort functions. Customers report TCO reductions of 50-75% and ROI within 12 months by using DMExpress to optimize their Hadoop deployments.
With Hadoop-3.0.0-alpha2 being released in January 2017, it's time to have a closer look at the features and fixes of Hadoop 3.0.
We will have a look at Core Hadoop, HDFS and YARN, and answer the emerging question whether Hadoop 3.0 will be an architectural revolution like Hadoop 2 was with YARN & Co. or will it be more of an evolution adapting to new use cases like IoT, Machine Learning and Deep Learning (TensorFlow)?
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
We will give a detailed introduction to Apache Spark and why and how Spark can change the analytics world. Apache Spark's memory abstraction is RDD (Resilient Distributed DataSet). One of the key reason why Apache Spark is so different is because of the introduction of RDD. You cannot do anything in Apache Spark without knowing about RDDs. We will give a high level introduction to RDD and in the second half we will have a deep dive into RDDs.
In these slides we introduce Column-Oriented Stores. We deeply analyze Google BigTable. We discuss about features, data model, architecture, components and its implementation. In the second part we discuss all the major open source implementation for column-oriented databases.
This document discusses how to setup HBase with Docker in three configurations: single-node standalone, pseudo-distributed single-machine, and fully-distributed cluster. It describes features of HBase like consistent reads/writes, automatic sharding and failover. It provides instructions for installing HBase in a single node using Docker, including building an image and running it with ports exposed. It also covers running HBase in pseudo-distributed mode with the processes running as separate containers and interacting with the HBase shell.
In this lecture we analyze graph oriented databases. In particular, we consider TtitanDB as graph database. We analyze how to query using gremlin and how to create edges and vertex.
Finally, we presents how to use rexster to visualize the storeg graph/
Hadoop MapReduce is an open source framework for distributed processing of large datasets across clusters of computers. It allows parallel processing of large datasets by dividing the work across nodes. The framework handles scheduling, fault tolerance, and distribution of work. MapReduce consists of two main phases - the map phase where the data is processed key-value pairs and the reduce phase where the outputs of the map phase are aggregated together. It provides an easy programming model for developers to write distributed applications for large scale processing of structured and unstructured data.
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiSlim Baltagi
Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective.
In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.
This document discusses the Spark analytics platform and its advantages over existing Hadoop-based platforms. Spark provides a unified data processing engine for batch, interactive, and streaming workloads. It includes components like Spark Core for distributed computing, Spark Streaming for real-time data processing, Shark for SQL and analytics, GraphX for graph processing, and MLlib for machine learning. Spark aims to be up to 100x faster than Hadoop for interactive queries by keeping data in-memory using its Resilient Distributed Datasets (RDDs). It also leverages other projects like Mesos for resource management and Tachyon for a fault-tolerant shared storage layer.
This document provides an introduction to HBase, including:
- An overview of BigTable, which HBase is modeled after
- Descriptions of the key features of HBase like being distributed, column-oriented, and versioned
- Examples of using the HBase shell to create tables, insert and retrieve data
- An explanation of the Java APIs for administering HBase, inserting/updating/retrieving data using Puts, Gets, and Scans
- Suggestions for setting up HBase with Docker for coding examples
The document discusses cloning Twitter using HBase. It describes some key features of Twitter like allowing users to post status updates, follow other users, mention users, and re-tweet posts. It then provides an overview of HBase including its features like consistency, automatic sharding and failover. It discusses how to install HBase in single node, pseudo-distributed and fully distributed modes using Docker. It also demonstrates some common HBase shell commands like creating and listing tables, putting and getting data. Finally, it discusses how to model the user, tweet, follower and following relationships in HBase.
The Information Technology have led us into an era where the production, sharing and use of information are now part of everyday life and of which we are often unaware actors almost: it is now almost inevitable not leave a digital trail of many of the actions we do every day; for example, by digital content such as photos, videos, blog posts and everything that revolves around the social networks (Facebook and Twitter in particular). Added to this is that with the "internet of things", we see an increase in devices such as watches, bracelets, thermostats and many other items that are able to connect to the network and therefore generate large data streams. This explosion of data justifies the birth, in the world of the term Big Data: it indicates the data produced in large quantities, with remarkable speed and in different formats, which requires processing technologies and resources that go far beyond the conventional systems management and storage of data. It is immediately clear that, 1) models of data storage based on the relational model, and 2) processing systems based on stored procedures and computations on grids are not applicable in these contexts. As regards the point 1, the RDBMS, widely used for a great variety of applications, have some problems when the amount of data grows beyond certain limits. The scalability and cost of implementation are only a part of the disadvantages: very often, in fact, when there is opposite to the management of big data, also the variability, or the lack of a fixed structure, represents a significant problem. This has given a boost to the development of the NoSQL database. The website NoSQL Databases defines NoSQL databases such as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable." These databases are: distributed, open source, scalable horizontally, without a predetermined pattern (key-value, column-oriented, document-based and graph-based), easily replicable, devoid of the ACID and can handle large amounts of data. These databases are integrated or integrated with processing tools based on the MapReduce paradigm proposed by Google in 2009. MapReduce with the open source Hadoop framework represent the new model for distributed processing of large amounts of data that goes to supplant techniques based on stored procedures and computational grids (step 2). The relational model taught courses in basic database design, has many limitations compared to the demands posed by new applications based on Big Data and NoSQL databases that use to store data and MapReduce to process large amounts of data.
Course Website http://pbdmng.datatoknowledge.it/
Contact me to download the slides
This document discusses cloning Twitter using Redis by storing user, follower, and post data in Redis keys and data structures. It provides examples of how to store:
1) User profiles as Hashes with fields like username and ID.
2) Follower and following relationships as Sorted Sets with user IDs and timestamps.
3) User posts and timelines as Lists by pushing new post IDs.
It explains that while Redis lacks tables, its keys and data structures like Hashes, Sets and Lists allow building the same data model without secondary indexes. The document also notes that the system can scale horizontally by sharding the data across multiple Redis servers.
In this era of ever growing data, the need for analyzing it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives like Hadoop, Spark, Storm etc. Spark, however is unique in providing batch as well as streaming capabilities, thus making it a preferred choice for lightening fast Big Data Analysis platforms.
DynamoDB is a key-value database that achieves high availability and scalability through several techniques:
1. It uses consistent hashing to partition and replicate data across multiple storage nodes, allowing incremental scalability.
2. It employs vector clocks to maintain consistency among replicas during writes, decoupling version size from update rates.
3. For handling temporary failures, it uses sloppy quorum and hinted handoff to provide high availability and durability guarantees when some replicas are unavailable.
- Data modeling for NoSQL databases is different than relational databases and requires designing the data model around access patterns rather than object structure. Key differences include not having joins so data needs to be duplicated and modeling the data in a way that works for querying, indexing, and retrieval speed.
- The data model should focus on making the most of features like atomic updates, inner indexes, and unique identifiers. It's also important to consider how data will be added, modified, and retrieved factoring in object complexity, marshalling/unmarshalling costs, and index maintenance.
- The _id field can be tailored to the access patterns, such as using dates for time-series data to keep recent
Oracle Solaris 11 as a BIG Data Platform Apache Hadoop Use CaseOrgad Kimchi
The document discusses using Oracle Solaris technologies for an Apache Hadoop cluster. It describes how Oracle Solaris Zones and ZFS provide benefits like fast provisioning of cluster nodes, high network throughput, large data capacity, and optimized I/O performance for Hadoop deployments. Various Oracle Solaris tools are also highlighted that can help monitor resource usage and troubleshoot performance issues for Hadoop workloads.
This document compares MapReduce and Spark frameworks. It discusses their histories and basic functionalities. MapReduce uses input, map, shuffle, and reduce stages, while Spark uses RDDs (Resilient Distributed Datasets) and transformations and actions. Spark is easier to program than MapReduce due to its interactive mode, but MapReduce has more supporting tools. Performance benchmarks show Spark is faster than MapReduce for sorting. The hardware and developer costs of Spark are also lower than MapReduce.
This document is a presentation on Apache Spark that compares its performance to MapReduce. It discusses how Spark is faster than MapReduce, provides code examples of performing word counts in both Spark and MapReduce, and explains features that make Spark suitable for big data analytics such as simplifying data analysis, providing built-in machine learning and graph libraries, and speaking multiple languages. It also lists many large companies that use Spark for applications like recommendations, business intelligence, and fraud detection.
Realtime Analytics with Hadoop and HBaselarsgeorge
The document discusses realtime analytics using Hadoop and HBase. It begins by introducing the speaker and their experience. It then discusses moving from batch processing with Hadoop to more realtime needs, and how systems like HBase can help bridge that gap. Several designs are presented for using HBase and Hadoop together to enable both realtime and batch analytics on large datasets.
This document provides an overview of the Apache Spark framework. It discusses how Spark allows distributed processing of large datasets across computer clusters using simple programming models. It also describes how Spark can scale from single servers to thousands of machines. Spark is designed to provide high availability by detecting and handling failures at the application layer. The document also summarizes Resilient Distributed Datasets (RDDs), which are Spark's fundamental data abstraction, and transformations and actions that can be performed on RDDs.
The document discusses cloud computing systems and MapReduce. It provides background on MapReduce, describing how it works and how it was inspired by functional programming concepts like map and reduce. It also discusses some limitations of MapReduce, noting that it is not designed for general-purpose parallel processing and can be inefficient for certain types of workloads. Alternative approaches like MRlite and DCell are proposed to provide more flexible and efficient distributed processing frameworks.
This document discusses Spark Streaming and its use for near real-time ETL. It provides an overview of Spark Streaming, how it works internally using receivers and workers to process streaming data, and an example use case of building a recommender system to find matches using both batch and streaming data. Key points covered include the streaming execution model, handling data receipt and job scheduling, and potential issues around data loss and (de)serialization.
The document provides an overview of data science with Python and integrating Python with Hadoop and Apache Spark frameworks. It discusses:
- Why Python should be integrated with Hadoop and the ecosystem including HDFS, MapReduce, and Spark.
- Key concepts of Hadoop including HDFS for storage, MapReduce for processing, and how Python can be integrated via APIs.
- Benefits of Apache Spark like speed, simplicity, and efficiency through its RDD abstraction and how PySpark enables Python access.
- Examples of using Hadoop Streaming and PySpark to analyze data and determine word counts from documents.
Spark improves on Hadoop MapReduce by keeping data in-memory between jobs. It reads data into resilient distributed datasets (RDDs) that can be transformed and cached in memory across nodes for faster iterative jobs. RDDs are immutable, partitioned collections distributed across a Spark cluster. Transformations define operations on RDDs, while actions trigger computation by passing data to the driver program.
Spark is a fast and general cluster computing system that improves on MapReduce by keeping data in-memory between jobs. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark core provides in-memory computing capabilities and a programming model that allows users to write programs as transformations on distributed datasets.
This document provides an overview of big data processing techniques including batch processing using MapReduce and Hive, iterative batch processing using Spark, stream processing using Apache Storm, and OLAP over big data using Dremel and Druid. It discusses techniques such as MapReduce, Hive, Spark RDDs, and Storm tuples for processing large datasets and compares small versus big data approaches. Example usages and technologies for different processing types are also outlined.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It implements Google's MapReduce programming model and the Hadoop Distributed File System (HDFS) for reliable data storage. Key components include a JobTracker that coordinates jobs, TaskTrackers that run tasks on worker nodes, and a NameNode that manages the HDFS namespace and DataNodes that store application data. The framework provides fault tolerance, parallelization, and scalability.
This document discusses large scale computing with MapReduce. It provides background on the growth of digital data, noting that by 2020 there will be over 5,200 GB of data for every person on Earth. It introduces MapReduce as a programming model for processing large datasets in a distributed manner, describing the key aspects of Map and Reduce functions. Examples of MapReduce jobs are also provided, such as counting URL access frequencies and generating a reverse web link graph.
Unified Big Data Processing with Apache SparkC4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses a simple programming model called MapReduce that automatically parallelizes and distributes work across nodes. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and MapReduce execution engine for processing. HDFS stores data as blocks replicated across nodes for fault tolerance. MapReduce jobs are split into map and reduce tasks that process key-value pairs in parallel. Hadoop is well-suited for large-scale data analytics as it scales to petabytes of data and thousands of machines with commodity hardware.
This document provides an overview and introduction to Spark, including:
- Spark is a general purpose computational framework that provides more flexibility than MapReduce while retaining properties like scalability and fault tolerance.
- Spark concepts include resilient distributed datasets (RDDs), transformations that create new RDDs lazily, and actions that run computations and return values to materialize RDDs.
- Spark can run on standalone clusters or as part of Cloudera's Enterprise Data Hub, and examples of its use include machine learning, streaming, and SQL queries.
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
(Berkeley CS186 guest lecture)
Big Data Analytics Systems: What Goes Around Comes Around
Introduction to MapReduce, GFS, HDFS, Spark, and differences between "Big Data" and database systems.
The document discusses Spark, an open-source cluster computing framework. It describes Spark's Resilient Distributed Dataset (RDD) as an immutable and partitioned collection that can automatically recover from node failures. RDDs can be created from data sources like files or existing collections. Transformations create new RDDs from existing ones lazily, while actions return values to the driver program. Spark supports operations like WordCount through transformations like flatMap and reduceByKey. It uses stages and shuffling to distribute operations across a cluster in a fault-tolerant manner. Spark Streaming processes live data streams by dividing them into batches treated as RDDs. Spark SQL allows querying data through SQL on DataFrames.
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"IT Event
In this talk we’ll explore Apache Spark — the most popular cluster computing framework right now. We’ll look at the improvements that Spark brought over Hadoop MapReduce and what makes Spark so fast; explore Spark programming model and RDDs; and look at some sample use cases for Spark and big data in general.
This talk will be interesting for people who have little or no experience with Spark and would like to learn more about it. It will also be interesting to a general engineering audience as we’ll go over the Spark programming model and some engineering tricks that make Spark fast.
Apache Spark presentation at HasGeek FifthElelephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
This document provides an overview of the Hadoop framework. It introduces MapReduce and Hadoop, describes the Hadoop application architecture including HDFS, MapReduce programming model, and developing a typical Hadoop application. It discusses setting up a Hadoop environment and provides a sample word count demo to practice Hadoop.
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
This document discusses Apache Spark, a fast and general engine for big data processing. It describes how Spark generalizes the MapReduce model through its Resilient Distributed Datasets (RDDs) abstraction, which allows efficient sharing of data across parallel operations. This unified approach allows Spark to support multiple types of processing, like SQL queries, streaming, and machine learning, within a single framework. The document also outlines ongoing developments like Spark SQL and improved machine learning capabilities.
In this lecture we analyze document oriented databases. In particular we consider why there are the first approach to nosql and what are the main features. Then, we analyze as example MongoDB. We consider the data model, CRUD operations, write concerns, scaling (replication and sharding).
Finally we presents other document oriented database and when to use or not document oriented databases.
The Information Technology have led us into an era where the production, sharing and use of information are now part of everyday life and of which we are often unaware actors almost: it is now almost inevitable not leave a digital trail of many of the actions we do every day; for example, by digital content such as photos, videos, blog posts and everything that revolves around the social networks (Facebook and Twitter in particular). Added to this is that with the "internet of things", we see an increase in devices such as watches, bracelets, thermostats and many other items that are able to connect to the network and therefore generate large data streams. This explosion of data justifies the birth, in the world of the term Big Data: it indicates the data produced in large quantities, with remarkable speed and in different formats, which requires processing technologies and resources that go far beyond the conventional systems management and storage of data. It is immediately clear that, 1) models of data storage based on the relational model, and 2) processing systems based on stored procedures and computations on grids are not applicable in these contexts. As regards the point 1, the RDBMS, widely used for a great variety of applications, have some problems when the amount of data grows beyond certain limits. The scalability and cost of implementation are only a part of the disadvantages: very often, in fact, when there is opposite to the management of big data, also the variability, or the lack of a fixed structure, represents a significant problem. This has given a boost to the development of the NoSQL database. The website NoSQL Databases defines NoSQL databases such as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable." These databases are: distributed, open source, scalable horizontally, without a predetermined pattern (key-value, column-oriented, document-based and graph-based), easily replicable, devoid of the ACID and can handle large amounts of data. These databases are integrated or integrated with processing tools based on the MapReduce paradigm proposed by Google in 2009. MapReduce with the open source Hadoop framework represent the new model for distributed processing of large amounts of data that goes to supplant techniques based on stored procedures and computational grids (step 2). The relational model taught courses in basic database design, has many limitations compared to the demands posed by new applications based on Big Data and NoSQL databases that use to store data and MapReduce to process large amounts of data.
Course Website http://pbdmng.datatoknowledge.it/
Contact me for other informations and to download
The Information Technology have led us into an era where the production, sharing and use of information are now part of everyday life and of which we are often unaware actors almost: it is now almost inevitable not leave a digital trail of many of the actions we do every day; for example, by digital content such as photos, videos, blog posts and everything that revolves around the social networks (Facebook and Twitter in particular). Added to this is that with the "internet of things", we see an increase in devices such as watches, bracelets, thermostats and many other items that are able to connect to the network and therefore generate large data streams. This explosion of data justifies the birth, in the world of the term Big Data: it indicates the data produced in large quantities, with remarkable speed and in different formats, which requires processing technologies and resources that go far beyond the conventional systems management and storage of data. It is immediately clear that, 1) models of data storage based on the relational model, and 2) processing systems based on stored procedures and computations on grids are not applicable in these contexts. As regards the point 1, the RDBMS, widely used for a great variety of applications, have some problems when the amount of data grows beyond certain limits. The scalability and cost of implementation are only a part of the disadvantages: very often, in fact, when there is opposite to the management of big data, also the variability, or the lack of a fixed structure, represents a significant problem. This has given a boost to the development of the NoSQL database. The website NoSQL Databases defines NoSQL databases such as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable." These databases are: distributed, open source, scalable horizontally, without a predetermined pattern (key-value, column-oriented, document-based and graph-based), easily replicable, devoid of the ACID and can handle large amounts of data. These databases are integrated or integrated with processing tools based on the MapReduce paradigm proposed by Google in 2009. MapReduce with the open source Hadoop framework represent the new model for distributed processing of large amounts of data that goes to supplant techniques based on stored procedures and computational grids (step 2). The relational model taught courses in basic database design, has many limitations compared to the demands posed by new applications based on Big Data and NoSQL databases that use to store data and MapReduce to process large amounts of data.
Course Website http://pbdmng.datatoknowledge.it/
Contact me for other informations and to download the slides
The Information Technology have led us into an era where the production, sharing and use of information are now part of everyday life and of which we are often unaware actors almost: it is now almost inevitable not leave a digital trail of many of the actions we do every day; for example, by digital content such as photos, videos, blog posts and everything that revolves around the social networks (Facebook and Twitter in particular). Added to this is that with the "internet of things", we see an increase in devices such as watches, bracelets, thermostats and many other items that are able to connect to the network and therefore generate large data streams. This explosion of data justifies the birth, in the world of the term Big Data: it indicates the data produced in large quantities, with remarkable speed and in different formats, which requires processing technologies and resources that go far beyond the conventional systems management and storage of data. It is immediately clear that, 1) models of data storage based on the relational model, and 2) processing systems based on stored procedures and computations on grids are not applicable in these contexts. As regards the point 1, the RDBMS, widely used for a great variety of applications, have some problems when the amount of data grows beyond certain limits. The scalability and cost of implementation are only a part of the disadvantages: very often, in fact, when there is opposite to the management of big data, also the variability, or the lack of a fixed structure, represents a significant problem. This has given a boost to the development of the NoSQL database. The website NoSQL Databases defines NoSQL databases such as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable." These databases are: distributed, open source, scalable horizontally, without a predetermined pattern (key-value, column-oriented, document-based and graph-based), easily replicable, devoid of the ACID and can handle large amounts of data. These databases are integrated or integrated with processing tools based on the MapReduce paradigm proposed by Google in 2009. MapReduce with the open source Hadoop framework represent the new model for distributed processing of large amounts of data that goes to supplant techniques based on stored procedures and computational grids (step 2). The relational model taught courses in basic database design, has many limitations compared to the demands posed by new applications based on Big Data and NoSQL databases that use to store data and MapReduce to process large amounts of data.
Course Website http://pbdmng.datatoknowledge.it/
Contact me to download the slides
1. Introduction to the Course "Designing Data Bases with Advanced Data Models...Fabio Fumarola
The Information Technology have led us into an era where the production, sharing and use of information are now part of everyday life and of which we are often unaware actors almost: it is now almost inevitable not leave a digital trail of many of the actions we do every day; for example, by digital content such as photos, videos, blog posts and everything that revolves around the social networks (Facebook and Twitter in particular). Added to this is that with the "internet of things", we see an increase in devices such as watches, bracelets, thermostats and many other items that are able to connect to the network and therefore generate large data streams. This explosion of data justifies the birth, in the world of the term Big Data: it indicates the data produced in large quantities, with remarkable speed and in different formats, which requires processing technologies and resources that go far beyond the conventional systems management and storage of data. It is immediately clear that, 1) models of data storage based on the relational model, and 2) processing systems based on stored procedures and computations on grids are not applicable in these contexts. As regards the point 1, the RDBMS, widely used for a great variety of applications, have some problems when the amount of data grows beyond certain limits. The scalability and cost of implementation are only a part of the disadvantages: very often, in fact, when there is opposite to the management of big data, also the variability, or the lack of a fixed structure, represents a significant problem. This has given a boost to the development of the NoSQL database. The website NoSQL Databases defines NoSQL databases such as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable." These databases are: distributed, open source, scalable horizontally, without a predetermined pattern (key-value, column-oriented, document-based and graph-based), easily replicable, devoid of the ACID and can handle large amounts of data. These databases are integrated or integrated with processing tools based on the MapReduce paradigm proposed by Google in 2009. MapReduce with the open source Hadoop framework represent the new model for distributed processing of large amounts of data that goes to supplant techniques based on stored procedures and computational grids (step 2). The relational model taught courses in basic database design, has many limitations compared to the demands posed by new applications based on Big Data and NoSQL databases that use to store data and MapReduce to process large amounts of data.
Course Website http://pbdmng.datatoknowledge.it/
Docker allows building and running applications inside lightweight containers. Some key benefits of Docker include:
- Portability - Dockerized applications are completely portable and can run on any infrastructure from development machines to production servers.
- Consistency - Docker ensures that application dependencies and environments are always the same, regardless of where the application is run.
- Efficiency - Docker containers are lightweight since they don't need virtualization layers like VMs. This allows for higher density and more efficient use of resources.
This document provides information about Linux containers and Docker. It discusses:
1) The evolution of IT from client-server models to thin apps running on any infrastructure and the challenges of ensuring consistent service interactions and deployments across environments.
2) Virtual machines and their benefits of full isolation but large disk usage, and Vagrant which allows packaging and provisioning of VMs via files.
3) Docker and how it uses Linux containers powered by namespaces and cgroups to deploy applications in lightweight portable containers that are more efficient than VMs. Examples of using Docker are provided.
This document lists and describes several large network dataset collections for research purposes. It includes social networks, communication networks, citation networks, collaboration networks, web graphs, product networks, road networks, and more. Sources provided include the Stanford Large Network Dataset Collection, a Twitter dataset, leaked Facebook pages, UCIrvine Datasets, and additional results. The datasets cover a wide range of network types and can be used to study interactions in online social networks, information cascades, and networked communities.
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce Fabio Fumarola
The document presents MrAdam, a parallel algorithm for approximate frequent itemset mining using MapReduce. MrAdam avoids expensive communication and synchronization costs by mining approximate frequent itemsets from big data with statistical error guarantees. It combines a statistical approach based on the Chernoff bound with MapReduce-based local model discovery and global combination through an SE-tree and structural interpolation. Experiments show MrAdam is 2 to 100 times faster than previous frequent itemset mining algorithms using MapReduce.
NoSQL databases are currently used in several applications scenarios in contrast to Relations Databases. Several type of Databases there exist. In this presentation we compare Key Value, Column Oriented, Document Oriented and Graph Databases. Using a simple case study there are evaluated pros and cons of the NoSQL databases taken into account.
Amazon DocumentDB(MongoDB와 호환됨)는 빠르고 안정적이며 완전 관리형 데이터베이스 서비스입니다. Amazon DocumentDB를 사용하면 클라우드에서 MongoDB 호환 데이터베이스를 쉽게 설치, 운영 및 규모를 조정할 수 있습니다. Amazon DocumentDB를 사용하면 MongoDB에서 사용하는 것과 동일한 애플리케이션 코드를 실행하고 동일한 드라이버와 도구를 사용하는 것을 실습합니다.
### Data Description and Analysis Summary for Presentation
#### 1. **Importing Libraries**
Libraries used:
- `pandas`, `numpy`: Data manipulation
- `matplotlib`, `seaborn`: Data visualization
- `scikit-learn`: Machine learning utilities
- `statsmodels`, `pmdarima`: Statistical modeling
- `keras`: Deep learning models
#### 2. **Loading and Exploring the Dataset**
**Dataset Overview:**
- **Source:** CSV file (`mumbai-monthly-rains.csv`)
- **Columns:**
- `Year`: The year of the recorded data.
- `Jan` to `Dec`: Monthly rainfall data.
- `Total`: Total annual rainfall.
**Initial Data Checks:**
- Displayed first few rows.
- Summary statistics (mean, standard deviation, min, max).
- Checked for missing values.
- Verified data types.
**Visualizations:**
- **Annual Rainfall Time Series:** Trends in annual rainfall over the years.
- **Monthly Rainfall Over Years:** Patterns and variations in monthly rainfall.
- **Yearly Total Rainfall Distribution:** Distribution and frequency of annual rainfall.
- **Box Plots for Monthly Data:** Spread and outliers in monthly rainfall.
- **Correlation Matrix of Monthly Rainfall:** Relationships between different months' rainfall.
#### 3. **Data Transformation**
**Steps:**
- Ensured 'Year' column is of integer type.
- Created a datetime index.
- Converted monthly data to a time series format.
- Created lag features to capture past values.
- Generated rolling statistics (mean, standard deviation) for different window sizes.
- Added seasonal indicators (dummy variables for months).
- Dropped rows with NaN values.
**Result:**
- Transformed dataset with additional features ready for time series analysis.
#### 4. **Data Splitting**
**Procedure:**
- Split the data into features (`X`) and target (`y`).
- Further split into training (80%) and testing (20%) sets without shuffling to preserve time series order.
**Result:**
- Training set: `(X_train, y_train)`
- Testing set: `(X_test, y_test)`
#### 5. **Automated Hyperparameter Tuning**
**Tool Used:** `pmdarima`
- Automatically selected the best parameters for the SARIMA model.
- Evaluated using metrics such as AIC and BIC.
**Output:**
- Best SARIMA model parameters and statistical summary.
#### 6. **SARIMA Model**
**Steps:**
- Fit the SARIMA model using the training data.
- Evaluated on both training and testing sets using MAE and RMSE.
**Output:**
- **Train MAE:** Indicates accuracy on training data.
- **Test MAE:** Indicates accuracy on unseen data.
- **Train RMSE:** Measures average error magnitude on training data.
- **Test RMSE:** Measures average error magnitude on testing data.
#### 7. **LSTM Model**
**Preparation:**
- Reshaped data for LSTM input.
- Converted data to `float32`.
**Model Building and Training:**
- Built an LSTM model with one LSTM layer and one Dense layer.
- Trained the model on the training data.
**Evaluation:**
- Evaluated on both training and testing sets using MAE and RMSE.
**Output:**
- **Train MAE:** Accuracy on training data.
- **T
2. Contents
• Aggregate and Cluster
• Scatter Gather and MapReduce
• MapReduce
• Why Spark?
• Spark:
– Example, task and stages
– Docker Example
– Scala and Anonymous Functions
• Next Topics in 2/2
2
3. Aggregates and Clusters
• Aggregate-oriented databases change the rules for
data storage (CAP)
• But running on a cluster changes also computation
models
• When you store data on a cluster you can process
data in parallel
3
4. Database vs Client Processing
• With a centralized database data can be processed
on the database server or on the client machine
• Running on the client:
– Pros: flexibility and programming languages
– Cons: data transfer from the server to the client
• Running on the server:
– Pros: Data locality
– Cons: programming languages and debugging
4
5. Cluster and Computation
• We can spread the computation across the cluster
• However, we have to reduce the amount of data
transferred over the network
• We need have computation locality
• That is process the data in the same node where is
stored
5
6. Scatter-Gather
6
Use a Scatter-Gather that broadcasts a message to multiple recipients and
re-aggregates the responses back into a single message.
2003
7. Map-Reduce
• It is a way to organize processing by taking
advantage of clusters
• It gained prominence with Google’s MapReduce
framework (Dean and Ghemawat 2004)
• It was then implemented in Hadoop Framework
7
http://research.google.com/archive/mapreduce.html
https://hadoop.apache.org/
8. Programming Model: MapReduce
• We have a huge text document
• Count the number of times each distinct word
appears in the file
• Sample applications
– Analyze web server logs to find popular URLs
– Term statistics for search
8
9. Word Count
• Assumption: the input file is too large for memory,
but all <word, count> pairs fit in memory
• We can compute the pairs by
– wc file.txt | sort | uniq -c
• Encyclopedia Britannica, 11th Edition, Volume 4, Part
3 (http://www.gutenberg.org/files/19699/19699.zip)
9
10. Word Count Steps
wc file.txt | sort | uniq –c
•Map
• Scan input file record-at-a-time
• Extract keys from each record
•Group by key
• Sort and shuffle
•Reduce
• Aggregate, summarize, filter or transform
• Write the results
10
16. Example: Language Model
• Count the number of times each 5-word sequence
occurs in a large corpus of documents
• Map
– Extract <5-word sequence, count> pairs from each
document
• Reduce
– Combine the counts
16
18. Physical Execution: Concerns
• Mapper intermediate results are send to a single
reduces:
– This is the only steps that require a communication over
the network
• Thus the Partition and Shuffle phase are critical
18
19. Partition and Shuffle
Reducing the cost of these steps, dramatically reduces
the cost in time of the computation:
•The Partitioner determines which partition a given
(key, value) pair will go to.
•The default partitioner computes a hash value for the
key and assigns the partition based on this result.
•The Shuffler moves map outputs to the reducers.
19
20. MapReduce: Features
• Partioning the input data
• Scheduling the program’s execution across a set of
machines
• Performing the group by key step
• Handling node failures
• Managing required inter-machine communication
20
22. Hadoop
• An Open-Source software for distributed storage of large
dataset on commodity hardware
• Provides a programming model/framework for processing
large dataset in parallel
22
Map
Map
Map
Reduce
Reduce
Input Output
24. Distributed File System
• Data is kept in “chunks” spread across machines
• Each chink is replicated on different machines
(Persistence and Availability)
24
25. Distributed File System
• Chunk Servers
– File is split into contiguous chunks (16-64 MB)
– Each chunk is replicated 3 times
– Try to keep replicas on different machines
• Master Node
– Name Node in Hadoop’s HDFS
– Stores metadata about where files are stored
– Might be replicated
25
29. Apache Spark
• A big data analytics cluster-computing framework
written in Scala.
• Open Sourced originally in AMPLab at UC Berkley
• Provides in-memory analytics based on RDD
• Highly compatible with Hadoop Storage API
– Can run on top of an Hadoop cluster
• Developer can write programs using multiple
programming languages
29
32. Spark Data Flow
32
iter. 1iter. 1 iter. 2iter. 2 . . .
Input
Not tied to 2 stage Map
Reduce paradigm
1. Extract a working set
2. Cache it
3. Query it repeatedly
Logistic regression in Hadoop and Spark
HDFS
read
34. Spark Programming Model
34
User
(Developer)
Writes
sc=new SparkContext
rDD=sc.textfile(“hdfs://…”)
rDD.filter(…)
rDD.Cache
rDD.Count
rDD.map
sc=new SparkContext
rDD=sc.textfile(“hdfs://…”)
rDD.filter(…)
rDD.Cache
rDD.Count
rDD.map
Driver Program
RDD
(Resilient
Distributed
Dataset)
RDD
(Resilient
Distributed
Dataset)
• Immutable Data structure
• In-memory (explicitly)
• Fault Tolerant
• Parallel Data Structure
• Controlled partitioning to
optimize data placement
• Can be manipulated using rich set
of operators.
• Immutable Data structure
• In-memory (explicitly)
• Fault Tolerant
• Parallel Data Structure
• Controlled partitioning to
optimize data placement
• Can be manipulated using rich set
of operators.
35. RDD
• Programming Interface: Programmer can perform 3 types of
operations
35
Transformations
•Create a new dataset from
and existing one.
•Lazy in nature. They are
executed only when some
action is performed.
•Example :
• Map(func)
• Filter(func)
• Distinct()
Transformations
•Create a new dataset from
and existing one.
•Lazy in nature. They are
executed only when some
action is performed.
•Example :
• Map(func)
• Filter(func)
• Distinct()
Actions
•Returns to the driver
program a value or exports
data to a storage system
after performing a
computation.
•Example:
• Count()
• Reduce(funct)
• Collect
• Take()
Actions
•Returns to the driver
program a value or exports
data to a storage system
after performing a
computation.
•Example:
• Count()
• Reduce(funct)
• Collect
• Take()
Persistence
•For caching datasets in-
memory for future
operations.
•Option to store on disk or
RAM or mixed (Storage
Level).
•Example:
• Persist()
• Cache()
Persistence
•For caching datasets in-
memory for future
operations.
•Option to store on disk or
RAM or mixed (Storage
Level).
•Example:
• Persist()
• Cache()
36. How Spark works
• RDD: Parallel collection with partitions
• User application create RDDs, transform them, and
run actions.
• This results in a DAG (Directed Acyclic Graph) of
operators.
• DAG is compiled into stages
• Each stage is executed as a series of Task (one Task
for each Partition).
36
42. Execution Plan
Stages are sequences of RDDs, that don’t have a Shuffle in
between
42
textFile map map
reduceByKey
collect
Stage 1 Stage 2
43. Execution Plan
43
textFile map map
reduceByK
ey
collect
Stage
1
Stage
2
Stage
1
Stage
2
1. Read HDFS split
2. Apply both the maps
3. Start Partial reduce
4. Write shuffle data
1. Read shuffle data
2. Final reduce
3. Send result to driver
program
44. Stage Execution
• Create a task for each Partition in the new RDD
• Serialize the Task
• Schedule and ship Tasks to Slaves
And all this happens internally (you need to do anything)
44
Task 1
Task 2
Task 2
Task 2
46. Summary of Components
• Task: The fundamental unit of execution in Spark
• Stage: Set of Tasks that run parallel
• DAG: Logical Graph of RDD operations
• RDD: Parallel dataset with partitions
46
47. Start the docker container
From
•https://github.com/sequenceiq/docker-spark
docker pull sequenceiq/spark:1.3.0
docker run -i -t -h sandbox sequenceiq/spark:1.3.0-ubuntu
/etc/ bootstrap.sh bash
•Run the spark shell using yarn or local
spark-shell --master yarn-client --driver-memory 1g --executor-memory
1g --executor-cores 2
47
48. Separate Container Master/Worker
$ docker pull snufkin/spark-master
$ docker pull snufkin/spark-worker
•These images are based on snufkin/spark-base
$ docker run … master
$ docker run … worker
48
49. Running the example and Shell
• To Run an example
$ run-example SparkPi 10
• We can start a spark shell via
–spark-shell -- master local n
• The -- master specifies the master URL for a
distributed cluster
• Example applications are also provided in Python
–spark-submit example/src/main/python/pi.py 10
49
51. Scala vs Java vs Python
• Spark was originally written in Scala, which allows
concise function syntax and interactive use
• Java API added for standalone applications
• Python API added more recently along with an
interactive shell.
51
52. Why Scala?
• High-level language for the JVM
– Object oriented + functional programming
• Statistically typed
– Type Inference
• Interoperates with Java
– Can use any Java Class
– Can be called from Java code
52
54. Laziness
• For variables we can define lazy val, that are evaluated when
called
lazy val x = 10 * 10 * 10 * 10 //long computation
• For methods we can define call by value and call by name for
the parameters
def square(x: Double) // call by value
def square(x: => Double) // call by name
• It changes the order the parameter are evaluated
54
55. Anonymous functions
55
scala> val square = (x: Int) => x * x
square: Int => Int = <function1>
We define an anonymous function from Int to Int
The square is a val square of type Function1, which is equivalent to
scala> def square(x: Int) = x * x
square: (x: Int)Int
56. Anonymous Functions
(x: Int) => x * x
This is a syntactic sugar for
new Function1[Int ,Int] {
def apply(x: Int): Int = x * x
}
56
57. Currying
Converting a function with multiple arguments into a function
with a single argument that returns another function.
def gen(f: Int => Int)(x: Int) = f(x)
def identity(x: Int) = gen(i => i)(x)
def square(x: Int) = gen(i => i * i)(x)
def cube(x: Int) = gen(i => i * i * i)(x)
57
58. Anonymous Functions
//Explicit type declaration
val call1 = doWithOneAndTwo((x: Int, y: Int) => x + y)
//The compiler expects 2 ints so x and y types are inferred
val call2 = doWithOneAndTwo((x, y) => x + y)
//Even more concise syntax
val call3 = doWithOneAndTwo(_ + _)
58
60. High Order Functions
Methods that take as parameter functions
val list = (1 to 4).toList
list.foreach( x => println(x))
list.foreach(println)
list.map(x => x + 2)
list.map(_ + 2)
list.filter(x => x % 2 == 1)
list.filter(_ % 2 == 1)
list.reduce((x,y) => x + y)
list.reduce(_ + _)
60
61. Function Methods on Collections
61
http://www.scala-lang.org/api/2.11.6/index.html#scala.collection.Seq
63. Next Topics
• Spark Shell
– Scala
– Python
• Shark Shell
• Data Frames
• Spark Streaming
• Code Examples: Processing and Machine Learning
63
Editor's Notes
Resilient Distributed Datasets or RDD are the distributed memory abstractions that lets programmer perform in-memory parallel computations on large clusters. And that too in a highly fault tolerant manner.
This is the main concept around which the whole Spark framework revolves around.
Currently 2 types of RDDs:
Parallelized collections: Created by calling parallelize method on an existing Scala collection. Developer can specify the number of slices to cut the dataset into. Ideally 2-3 slices per CPU.
Hadoop Datasets: These distributed datasets are created from any file stored on HDFS or other storage systems supported by Hadoop (S3, Hbase etc). These are created using SparkContext’s textFile method. Default number of slices in this case is 1 slice per file block.
Transformations: Like map – takes an RDD as an input, passes & process each element to a function, and return a new transformed RDD as an output.
By default, each transformed RDD is recomputed each time you run an action on it. Unless you specify the RDD to be cached in memory. Spark will try to keep the elements around the cluster for faster access.
RDD can be persisted on discs as well.
Caching is the Key tool for iterative algorithms.
Using persist, one can specify the Storage Level for persisting an RDD. Cache is just a short hand for default storage level. Which is MEMORY_ONLY.
MEMORY_ONLY
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they&apos;re needed. This is the default level.
MEMORY_AND_DISK
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don&apos;t fit on disk, and read them from there when they&apos;re needed.
MEMORY_ONLY_SER
Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER
Similar to MEMORY_ONLY_SER, but spill partitions that don&apos;t fit in memory to disk instead of recomputing them on the fly each time they&apos;re needed.
DISK_ONLY
Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2 etc
Same as the levels above, but replicate each partition on two cluster nodes.
Which Storage level is best:
Few things to consider:
Try to keep in-memory as much as possible
Try not to spill to disc unless your computed datasets are memory expensive
Use replication only if you want fault tolerance