Big Data with Hadoop & Spark Training: http://bit.ly/2IUsWca
This CloudxLab Running in a Cluster tutorial helps you to understand running Spark in the cluster in detail. Below are the topics covered in this tutorial:
1) Spark Runtime Architecture
2) Driver Node
3) Scheduling Tasks on Executors
4) Understanding the Architecture
5) Cluster Managers
6) Executors
7) Launching a Program using spark-submit
8) Local Mode & Cluster-Mode
9) Installing Standalone Cluster
10) Cluster Mode - YARN
11) Launching a Program on YARN
12) Cluster Mode - Mesos and AWS EC2
13) Deployment Modes - Client and Cluster
14) Which Cluster Manager to Use?
15) Common flags for spark-submit
The document discusses Spark exceptions and errors related to shuffling data between nodes. It notes that tasks can fail due to out of memory errors or files being closed prematurely. It also provides explanations of Spark's shuffle operations and how data is written and merged across nodes during shuffles.
The document provides an overview of Apache Spark internals and Resilient Distributed Datasets (RDDs). It discusses:
- RDDs are Spark's fundamental data structure - they are immutable distributed collections that allow transformations like map and filter to be applied.
- RDDs track their lineage or dependency graph to support fault tolerance. Transformations create new RDDs while actions trigger computation.
- Operations on RDDs include narrow transformations like map that don't require data shuffling, and wide transformations like join that do require shuffling.
- The RDD abstraction allows Spark's scheduler to optimize execution through techniques like pipelining and cache reuse.
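To make the narrow/wide distinction and lineage tracking concrete, here is a minimal Scala sketch assuming a SparkContext named sc is already available (as in spark-shell); the input path is only a placeholder:

// Narrow transformations (map, filter): each output partition depends on a single input partition.
val lines  = sc.textFile("hdfs:///path/to/input.txt")   // placeholder path
val errors = lines.filter(_.contains("ERROR"))          // narrow: no shuffle
val pairs  = errors.map(line => (line, 1))              // narrow: no shuffle

// Wide transformation: reduceByKey needs all values for a key, so Spark inserts a shuffle here.
val counts = pairs.reduceByKey(_ + _)

// Nothing has run yet; only the lineage (dependency graph) has been recorded.
println(counts.toDebugString)      // prints the lineage, with the shuffle boundary visible
counts.take(5).foreach(println)    // the action finally triggers the computation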
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2L4rPmM
This CloudxLab Basics of RDD tutorial helps you to understand Basics of RDD in detail. Below are the topics covered in this tutorial:
1) What is RDD - Resilient Distributed Datasets
2) Creating RDD in Scala
3) RDD Operations - Transformations & Actions
4) RDD Transformations - map() & filter()
5) RDD Actions - take() & saveAsTextFile()
6) Lazy Evaluation & Instant Evaluation
7) Lineage Graph
8) flatMap and Union
9) Scala Transformations - Union
10) Scala Actions - saveAsTextFile(), collect(), take() and count()
11) More Actions - reduce()
12) Can We Use reduce() for Computing Average?
13) Solving Problems with Spark
14) Compute Average and Standard Deviation with Spark
15) Pick Random Samples From a Dataset using Spark
Apache Spark in Depth: Core Concepts, Architecture & Internals (Anton Kirillov)
Slides cover core Apache Spark concepts such as RDDs, the DAG, the execution workflow, how stages of tasks are formed, and the shuffle implementation, and also describe the architecture and main components of the Spark driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo containing example Spark applications and a dockerized Hadoop environment to experiment with.
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2kyRTuW
This CloudxLab Advanced Spark Programming tutorial helps you to understand Advanced Spark Programming in detail. Below are the topics covered in this slide:
1) Shared Variables - Accumulators & Broadcast Variables
2) Accumulators and Fault Tolerance
3) Custom Accumulators - Version 1.x & Version 2.x
4) Examples of Broadcast Variables
5) Key Performance Considerations - Level of Parallelism
6) Serialization Format - Kryo
7) Memory Management
8) Hardware Provisioning
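As a small illustration of topics 1 and 4 above, here is a hedged Scala sketch using a 1.x-style accumulator and a broadcast variable; it assumes an existing SparkContext sc, and the lookup data is made up for the example:

// Broadcast variable: a read-only lookup table shipped once to every executor.
val countryCodes = sc.broadcast(Map("in" -> "India", "us" -> "United States"))

// Accumulator: a counter the executors add to and only the driver reads.
val unknownCodes = sc.accumulator(0, "unknown country codes")

val codes = sc.parallelize(Seq("in", "us", "uk", "in"))
val countries = codes.map { code =>
  countryCodes.value.getOrElse(code, {
    unknownCodes += 1   // note: accumulators in transformations may double-count on task retries
    "unknown"
  })
}

countries.collect().foreach(println)
println("Unknown codes seen: " + unknownCodes.value)   // read on the driver after the action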
Here are the steps to complete the assignment:
1. Create RDDs to filter each file for lines containing "Spark":
val readme = sc.textFile("README.md").filter(_.contains("Spark"))
val changes = sc.textFile("CHANGES.txt").filter(_.contains("Spark"))
2. Perform WordCount on each:
val readmeCounts = readme.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
val changesCounts = changes.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
3. Join the two RDDs:
val joined = readmeCounts.join(changesCounts)
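A possible final step (not part of the original text) to inspect a few joined results; each value is a pair of counts from the two files:

// Each element is (word, (countInReadme, countInChanges))
joined.take(10).foreach { case (word, (r, c)) =>
  println(s"$word: README=$r CHANGES=$c")
}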
Spark & Spark Streaming Internals - Nov 15 (1) (Akhil Das)
This document summarizes Spark and Spark Streaming internals. It discusses the Resilient Distributed Dataset (RDD) model in Spark, which allows for fault tolerance through lineage-based recomputation. It provides an example of log mining using RDD transformations and actions. It then discusses Spark Streaming, which provides a simple API for stream processing by treating streams as series of small batch jobs on RDDs. Key concepts discussed include Discretized Stream (DStream), transformations, and output operations. An example Twitter hashtag extraction job is outlined.
This document provides an overview of Apache Spark, including how it compares to Hadoop, the Spark ecosystem, Resilient Distributed Datasets (RDDs), transformations and actions on RDDs, the directed acyclic graph (DAG) scheduler, Spark Streaming, and the DataFrames API. Key points covered include Spark's faster performance versus Hadoop through its use of memory instead of disk, the RDD abstraction for distributed collections, common RDD operations, and Spark's capabilities for real-time streaming data processing and SQL queries on structured data.
Apache Spark is a fast, general engine for large-scale data processing. It supports batch, interactive, and stream processing using a unified API. Spark uses resilient distributed datasets (RDDs), which are immutable distributed collections of objects that can be operated on in parallel. RDDs support transformations like map, filter, and reduce and actions that return final results to the driver program. Spark provides high-level APIs in Scala, Java, Python, and R and an optimized engine that supports general computation graphs for data analysis.
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
The document discusses Spark job failures and Spark/YARN architecture. It describes a Spark job failure due to a task failing 4 times with a NumberFormatException when parsing a string. It then explains that Spark jobs are divided into stages made up of tasks, and the entire job fails if a stage fails. The document also provides an overview of the Spark and YARN architectures, showing how Spark jobs are submitted to and run via the YARN resource manager.
Apache Spark - Loading & Saving data | Big Data Hadoop Spark Tutorial | Cloud... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2shXBpj
This CloudxLab Apache Spark - Loading & Saving data tutorial helps you to understand Loading & Saving data in Apache Spark in detail. Below are the topics covered in this tutorial:
1) Common Data Sources
2) Common Supported File Formats
3) Handling Text Files using Scala
4) Loading CSV
5) SequenceFiles
6) Object Files
7) Hadoop Input and Output Format - Old and New API
8) Protocol Buffers
9) File Compression
10) Handling LZO
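As a rough sketch of topics 3-5 above, the following Scala snippet loads a simple CSV by splitting lines and reads a SequenceFile; the paths and column layout are placeholders:

// Text / CSV: load as plain text and split each line by hand (simple 1.x-style approach)
val csv  = sc.textFile("hdfs:///path/to/people.csv")                  // placeholder path
val rows = csv.map(_.split(",")).map(cols => (cols(0), cols(1).trim)) // assumes two columns

// SequenceFile with (Text, IntWritable) records, converted to plain Scala types
import org.apache.hadoop.io.{IntWritable, Text}
val seq = sc.sequenceFile("hdfs:///path/to/data.seq", classOf[Text], classOf[IntWritable])
            .map { case (k, v) => (k.toString, v.get) }

// Save an RDD back out as text
rows.saveAsTextFile("hdfs:///path/to/out")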
This presentation shows the main Spark characteristics, like RDDs, transformations and actions.
I used this presentation for many Spark intro workshops in the Cluj-Napoca Big Data community: http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/
Spark is a distributed data processing framework that uses RDDs (Resilient Distributed Datasets) to represent data distributed across a cluster. RDDs support transformations like map, filter, and actions like reduce to operate on the distributed data in a parallel and fault-tolerant manner. Key concepts include lazy evaluation of transformations, caching of RDDs, and use of broadcast variables and accumulators for sharing data across nodes.
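A short sketch of the lazy evaluation and caching described above, assuming sc from spark-shell and a placeholder path:

val logs   = sc.textFile("hdfs:///path/to/logs")   // placeholder path; nothing is read yet
val errors = logs.filter(_.contains("ERROR"))      // still lazy: only the lineage is recorded

errors.cache()                                     // ask Spark to keep this RDD in memory once computed

val total    = errors.count()                      // first action: reads the file and computes errors
val distinct = errors.distinct().count()           // second action: reuses the cached partitions
println(s"total=$total distinct=$distinct")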
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python (Christian Perone)
This document provides an introduction to Apache Spark and collaborative filtering. It discusses big data and the limitations of MapReduce, then introduces Apache Spark including Resilient Distributed Datasets (RDDs), transformations, actions, and DataFrames. It also covers Spark Machine Learning (ML) libraries and algorithms such as classification, regression, clustering, and collaborative filtering.
Spark is an open source cluster computing framework for large-scale data processing. It provides high-level APIs and runs on Hadoop clusters. Spark components include Spark Core for execution, Spark SQL for SQL queries, Spark Streaming for real-time data, and MLlib for machine learning. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across nodes for parallel processing. A word count example demonstrates how to use transformations like flatMap and reduceByKey to count word frequencies from an input file in Spark.
This document provides an overview of Spark SQL and its architecture. Spark SQL allows users to run SQL queries over SchemaRDDs, which are RDDs with a schema and column names. It introduces a SQL-like query abstraction over RDDs and allows querying data in a declarative manner. The Spark SQL component consists of Catalyst, a logical query optimizer, and execution engines for different data sources. It can integrate with data sources like Parquet, JSON, and Cassandra.
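The description above refers to the early SchemaRDD API; a minimal sketch of the same idea with the later SQLContext/DataFrame API (roughly Spark 1.4+ style) might look like this, with the JSON path as a placeholder:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("hdfs:///path/to/people.json")   // placeholder path

people.registerTempTable("people")                                  // 1.x API for exposing a table to SQL
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()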
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
This document provides an overview of installing and deploying Apache Spark, including:
1. Spark can be installed via prebuilt packages or by building from source.
2. Spark runs in local, standalone, YARN, or Mesos cluster modes and the SparkContext is used to connect to the cluster.
3. Jobs are deployed to the cluster using the spark-submit script which handles building jars and dependencies.
We will see the internal architecture of a Spark cluster, i.e., what the driver, worker, executor and cluster manager are, how a Spark program runs on a cluster, and what jobs, stages and tasks are.
This document discusses the internals of Apache Spark, including its architecture, execution workflow, and key concepts like tasks, stages and jobs. It begins with an overview of the Spark cluster architecture consisting of driver programs, executors, worker nodes and a cluster manager. It then defines tasks as individual units of execution, stages as collections of tasks, and jobs as actions submitted to process RDDs. The document also explains how the DAG scheduler creates a DAG of stages to evaluate the final result and split the graph across workers.
We will see the internal architecture of a Spark cluster, i.e., what the driver, worker, executor and cluster manager are, how a Spark program runs on a cluster, and what jobs, stages and tasks are.
Introduction to Machine Learning in Spark. Presented at Bangalore Apache Spark Meetup by Shashank L and Shashidhar E S on 17/10/2015.
http://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/225649429/
This is an introductory tutorial to Apache Spark given at the Lagos Scala Meetup II. We discussed the basics of the Spark processing engine and how it relates to Hadoop MapReduce, with a little hands-on at the end of the session.
Spark adds abstractions, generalizations and performance optimizations to achieve much better efficiency, especially in iterative workloads. However, Spark does not concern itself with being a distributed file system, whereas Hadoop includes HDFS.
Spark can leverage existing distributed file systems (like HDFS), a distributed database (like HBase), traditional databases through its JDBC or ODBC adapters, and flat files in local file systems or on a file store like S3 in the Amazon cloud.
The Hadoop MapReduce framework is similar to Spark in that it uses a master-slave paradigm. It has one master node (which consists of a job tracker, a name node, and RAM) and worker nodes (each consisting of a task tracker, a data node, and RAM). The task tracker on a worker node is analogous to an executor in the Spark environment.
This document provides steps to install and run Apache Spark. It discusses:
1. Installing Scala, Spark, and configuring environment variables for Hadoop.
2. Running Spark programs using RDDs and transformations in standalone, YARN, and Hadoop modes.
3. Using SparkSQL and SparkR to read CSV files from HDFS and perform operations on DataFrames.
Powerful big data processing and storage combined, this presentation walks thru the basics of integrating Apache Spark and Apache Cassandra. Presented by Alex Thompson at the Sydney Cassandra Meetup.
In this one day workshop, we will introduce Spark at a high level context. Spark is fundamentally different than writing MapReduce jobs so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
The benefits of running Spark on your own Docker (Itai Yaffe)
Shir Bromberg (Big Data team leader) @ Yotpo:
Nowadays, many of an organization’s main applications rely on Spark pipelines. As these applications become more significant to businesses, so does the need to quickly deploy, test and monitor them.
The standard way of running Spark jobs is to deploy them on a dedicated managed cluster. However, this solution is relatively expensive, with potentially high setup time. Therefore, we developed a way to run Spark on any container orchestration platform. This allows us to run Spark in a simple, custom and testable way.
In this talk, we will present our open-source dockers for running Spark on Nomad servers. We will cover:
* The issues we had running Spark on managed clusters and the solution we developed.
* How to build a Spark Docker image.
* And finally, what you may achieve by using Spark on Nomad.
Workshop - How to Build Recommendation Engine using Spark 1.6 and HDP
a) Hands-on - Build a data analytics application using Spark, Hortonworks, and Zeppelin. The session explains RDD concepts, DataFrames and sqlContext, uses SparkSQL for working with DataFrames, and explores the graphical abilities of Zeppelin.
b) Follow along - Build a Recommendation Engine - This will show how to build a predictive analytics (MLlib) recommendation engine with scoring. This will give a better understanding of the architecture and coding in Spark for ML.
This document provides an overview of Spark driven big data analytics. It begins by defining big data and its characteristics. It then discusses the challenges of traditional analytics on big data and how Apache Spark addresses these challenges. Spark improves on MapReduce by allowing distributed datasets to be kept in memory across clusters. This enables faster iterative and interactive processing. The document outlines Spark's architecture including its core components like RDDs, transformations, actions and DAG execution model. It provides examples of writing Spark applications in Java and Java 8 to perform common analytics tasks like word count.
The DAGScheduler is responsible for computing the DAG of stages for a Spark job and submitting them to the TaskScheduler. The TaskScheduler then submits individual tasks from each stage for execution and works with the DAGScheduler to handle failures through task and stage retries. Together, the DAGScheduler and TaskScheduler coordinate the execution of jobs by breaking them into independent stages of parallel tasks across executor nodes.
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It extends the MapReduce model of Hadoop to efficiently use it for more types of computations, which includes interactive queries and stream processing. This slide shares some basic knowledge about Apache Spark.
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... (Databricks)
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized spark environment.
In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr... (Anyscale)
Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data. Project Hydrogen is a major Apache Spark initiative to bring the best AI and big data solutions together. It introduced barrier execution mode to Spark 2.4.0 release to help distributed model training, and it explores optimized data exchange to accelerate distributed model inference.
In this talk, we will explain why barrier execution mode is needed, how it works, and how to use it to integrate distributed DL training on Spark. We will demonstrate HorovodRunner, the first Spark+AI integration powered by Project Hydrogen. It is based on the Horovod framework developed by Uber and Databricks Runtime 5.0 for Machine Learning.
We will also share our experience and performance tips on how to combine Pandas UDF from Spark and AI frameworks to scale complex model inference workload.
This document provides an overview of Apache Spark's architectural components through the life of simple Spark jobs. It begins with a simple Spark application analyzing airline on-time arrival data, then covers Resilient Distributed Datasets (RDDs), the cluster architecture, job execution through Spark components like tasks and scheduling, and techniques for writing better Spark applications like optimizing partitioning and reducing shuffle size.
Understanding computer vision with Deep Learning (CloudxLab)
Computer vision is a branch of computer science that deals with recognising objects and people and identifying patterns in visuals. It is roughly analogous to the vision of an animal.
Topics covered:
1. Overview of Machine Learning
2. Basics of Deep Learning
3. What is computer vision and its use-cases?
4. Various algorithms used in Computer Vision (mostly CNN)
5. Live hands-on demo of either Auto Cameraman or Face recognition system
6. What next?
This document provides an agenda for an introduction to deep learning presentation. It begins with an introduction to basic AI, machine learning, and deep learning terms. It then briefly discusses use cases of deep learning. The document outlines how to approach a deep learning problem, including which tools and algorithms to use. It concludes with a question and answer section.
This document discusses recurrent neural networks (RNNs) and their applications. It begins by explaining that RNNs can process input sequences of arbitrary lengths, unlike other neural networks. It then provides examples of RNN applications, such as predicting time series data, autonomous driving, natural language processing, and music generation. The document goes on to describe the fundamental concepts of RNNs, including recurrent neurons, memory cells, and different types of RNN architectures for processing input/output sequences. It concludes by demonstrating how to implement basic RNNs using TensorFlow's static_rnn function.
Natural Language Processing (NLP) is a field of artificial intelligence that deals with interactions between computers and human languages. NLP aims to program computers to process and analyze large amounts of natural language data. Some common NLP tasks include speech recognition, text classification, machine translation, question answering, and more. Popular NLP tools include Stanford CoreNLP, NLTK, OpenNLP, and TextBlob. Vectorization is commonly used to represent text in a way that can be used for machine learning algorithms like calculating text similarity. Tf-idf is a common technique used to weigh words based on their frequency and importance.
- Naive Bayes is a classification technique based on Bayes' theorem that uses "naive" independence assumptions. It is easy to build and can perform well even with large datasets.
- It works by calculating the posterior probability for each class given predictor values using the Bayes theorem and independence assumptions between predictors. The class with the highest posterior probability is predicted.
- It is commonly used for text classification, spam filtering, and sentiment analysis due to its fast performance and high success rates compared to other algorithms.
An autoencoder is an artificial neural network that is trained to copy its input to its output. It consists of an encoder that compresses the input into a lower-dimensional latent-space encoding, and a decoder that reconstructs the output from this encoding. Autoencoders are useful for dimensionality reduction, feature learning, and generative modeling. When constrained by limiting the latent space or adding noise, autoencoders are forced to learn efficient representations of the input data. For example, a linear autoencoder trained with mean squared error performs principal component analysis.
The document discusses challenges in training deep neural networks and solutions to those challenges. Training deep neural networks with many layers and parameters can be slow and prone to overfitting. A key challenge is the vanishing gradient problem, where the gradients shrink exponentially small as they propagate through many layers, making earlier layers very slow to train. Solutions include using initialization techniques like He initialization and activation functions like ReLU and leaky ReLU that do not saturate, preventing gradients from vanishing. Later improvements include the ELU activation function.
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/5u2RiS )
This CloudxLab Reinforcement Learning tutorial helps you to understand Reinforcement Learning in detail. Below are the topics covered in this tutorial:
1) What is Reinforcement?
2) Reinforcement Learning an Introduction
3) Reinforcement Learning Example
4) Learning to Optimize Rewards
5) Policy Search - Brute Force Approach, Genetic Algorithms and Optimization Techniques
6) OpenAI Gym
7) The Credit Assignment Problem
8) Inverse Reinforcement Learning
9) Playing Atari with Deep Reinforcement Learning
10) Policy Gradients
11) Markov Decision Processes
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori... (CloudxLab)
The document provides information about key-value RDD transformations and actions in Spark. It defines transformations like keys(), values(), groupByKey(), combineByKey(), sortByKey(), subtractByKey(), join(), leftOuterJoin(), rightOuterJoin(), and cogroup(). It also defines actions like countByKey() and lookup() that can be performed on pair RDDs. Examples are given showing how to use these transformations and actions to manipulate key-value RDDs.
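A small Scala sketch exercising several of the pair-RDD transformations and actions listed above, with made-up data and an existing SparkContext sc assumed:

val sales  = sc.parallelize(Seq(("apple", 3), ("banana", 2), ("apple", 5)))
val prices = sc.parallelize(Seq(("apple", 0.5), ("banana", 0.25)))

sales.keys.collect()                              // Array(apple, banana, apple)
sales.values.collect()                            // Array(3, 2, 5)
sales.groupByKey().mapValues(_.sum).collect()     // Array((apple,8), (banana,2))
sales.sortByKey().collect()                       // sorted by key
sales.join(prices).collect()                      // ((apple,(3,0.5)), (apple,(5,0.5)), (banana,(2,0.25)))
sales.subtractByKey(prices).collect()             // keys not present in prices (empty here)

sales.countByKey()                                // Map(apple -> 2, banana -> 1)
prices.lookup("apple")                            // Seq(0.5)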
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2sf2z6i
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Introduction to DataFrames
2) Creating DataFrames from JSON
3) DataFrame Operations
4) Running SQL Queries Programmatically
5) Datasets
6) Inferring the Schema Using Reflection
7) Programmatically Specifying the Schema
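As a rough illustration of topics 1, 3, 4 and 6 in the list above, here is a hedged Scala 1.x-style sketch; the case class, names and ages are invented for the example:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)         // schema inferred from this case class via reflection

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._                     // enables .toDF() on RDDs of case classes

val peopleDF = sc.parallelize(Seq(Person("Ankit", 25), Person("Jose", 31))).toDF()

peopleDF.printSchema()                            // DataFrame operations
peopleDF.filter(peopleDF("age") > 26).show()

peopleDF.registerTempTable("people")              // running SQL queries programmatically (1.x API)
sqlContext.sql("SELECT name FROM people WHERE age > 26").show()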
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2sh5b3E
This CloudxLab Hadoop Streaming tutorial helps you to understand Hadoop Streaming in detail. Below are the topics covered in this tutorial:
1) Hadoop Streaming and Why Do We Need it?
2) Writing Streaming Jobs
3) Testing Streaming jobs and Hands-on on CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
This document provides instructions for getting started with TensorFlow using a free CloudxLab. It outlines the following steps:
1. Open CloudxLab and enroll if not already enrolled. Otherwise go to "My Lab".
2. In "My Lab", open Jupyter and run commands to clone an ML repository containing TensorFlow examples.
3. Go to the deep learning folder in Jupyter and open the TensorFlow notebook to get started with examples.
Introduction to Deep Learning | CloudxLab
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/goQxnL )
This CloudxLab Deep Learning tutorial helps you to understand Deep Learning in detail. Below are the topics covered in this tutorial:
1) What is Deep Learning
2) Deep Learning Applications
3) Artificial Neural Network
4) Deep Learning Neural Networks
5) Deep Learning Frameworks
6) AI vs Machine Learning
In this tutorial, we will learn the following topics:
+ The Curse of Dimensionality
+ Main Approaches for Dimensionality Reduction
+ PCA - Principal Component Analysis
+ Kernel PCA
+ LLE
+ Other Dimensionality Reduction Techniques
In this tutorial, we will learn the following topics:
+ Voting Classifiers
+ Bagging and Pasting
+ Random Patches and Random Subspaces
+ Random Forests
+ Boosting
+ Stacking
In this tutorial, we will learn the following topics:
+ Training and Visualizing a Decision Tree
+ Making Predictions
+ Estimating Class Probabilities
+ The CART Training Algorithm
+ Computational Complexity
+ Gini Impurity or Entropy?
+ Regularization Hyperparameters
+ Regression
+ Instability
In this tutorial, we will learn the following topics:
+ Linear SVM Classification
+ Soft Margin Classification
+ Nonlinear SVM Classification
+ Polynomial Kernel
+ Adding Similarity Features
+ Gaussian RBF Kernel
+ Computational Complexity
+ SVM Regression
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2wLh5aF
This CloudxLab Introduction to Linux helps you to understand Linux in detail. Below are the topics covered in this tutorial:
1) Linux Overview
2) Linux Components - The Programs, The Kernel, The Shell
3) Overview of Linux File System
4) Connect to Linux Console
5) Linux - Quick Start Commands
6) Overview of Linux File System
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Pig is an engine for executing data flows in parallel on Hadoop. It uses a language called Pig Latin to analyze large datasets. Pig provides relational operators like FOREACH, GROUP, and FILTER to process data in parallel. A hands-on example demonstrates loading dividend data, grouping it by stock symbol, calculating the average dividend for each symbol, and storing the results.
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2JjXp2u
This CloudxLab Oozie tutorial helps you to understand Oozie in detail. Below are the topics covered in this tutorial:
1) Introduction to Oozie
2) Oozie - Workflow & Coordinator Jobs
3) Oozie - Workflow jobs - DAG (Directed Acyclic Graph)
4) Oozie Use cases
5) Oozie Workflow - XML
6) Oozie Hands-on on the command line and Hue
7) Oozie WorkFlow for Hive
8) Execute shell script using Oozie Workflow
9) Run and debug the Spark task on Oozie
3. Basics of RDD
Machine or Node 1 Machine or Node 2 Machine or Node 3
Spark Runtime Architecture
4. Basics of RDD
Worker Node
Machine or Node 1
Worker Node
Machine or Node 2
Worker Node
Machine or Node 3
Driver
Machine or Node 4
Spark Runtime Architecture
5. Basics of RDD
Worker Node
Machine or Node 1
Worker Node
Machine or Node 2
Worker Node
Machine or Node 3
Driver
Machine or Node 4
User
Spark Runtime Architecture
6. Basics of RDD
● Process where main() method runs
● When you launch a Spark shell, you’ve created a driver program
● Once the driver terminates, the application is finished.
The Driver
7. Basics of RDD
● Process where main() method runs
● When you launch a Spark shell, you’ve created a driver program
● Once the driver terminates, the application is finished.
● While running, it performs the following:
○ Converting a user program into tasks
■ Convert a user program into tasks - units of execution.
■ Converts DAG (logical graph) into a physical execution plan
The Driver
8. Basics of RDD
● Process where main() method runs
● When you launch a Spark shell, you’ve created a driver program
● Once the driver terminates, the application is finished.
● While running, it performs the following:
○ Converting a user program into tasks
■ Convert a user program into tasks - units of execution.
■ Converts DAG (logical graph) into a physical execution plan
○ Scheduling tasks on executors
The Driver
10. Basics of RDD
● Coordinate the scheduling of individual tasks on executors
Driver: Scheduling tasks on executors
11. Basics of RDD
● Coordinate the scheduling of individual tasks on executors
● Schedule tasks based on data placement
Driver: Scheduling tasks on executors
12. Basics of RDD
● Coordinate the scheduling of individual tasks on executors
● Schedule tasks based on data placement
● Tracks cached data and uses it to schedule future tasks
Driver: Scheduling tasks on executors
13. Basics of RDD
● Coordinate the scheduling of individual tasks on executors
● Schedule tasks based on data placement
● Tracks cached data and uses it to schedule future tasks
● Runs Spark web interface at port 4040.
Driver: Scheduling tasks on executors
14. Basics of RDD
Understanding The Architecture
Worker Node
Machine or Node 1
Worker Node
Machine or Node 2
Worker Node
Machine or Node 3
Driver
Machine or Node 4
User
Resource Manager
YARN/MESOS/EC2
15. Basics of RDD
Understanding The Architecture
Worker Node
Machine or Node 1
Worker Node
Machine or Node 2
Worker Node
Machine or Node 3
Driver
Machine or Node 4
User
Resource Manager
YARN/MESOS/EC2/Standalone
16. Basics of RDD
● It is a pluggable component in Spark.
● This allows Spark to run on YARN, Mesos & the built-in Standalone cluster manager
Cluster Manager
17. Basics of RDD
Understanding The Architecture
Worker Node
Machine or Node 1
Worker Node
Machine or Node 2
Worker Node
Machine or Node 3
Driver
Machine or Node 4
User
Resource Manager
YARN/MESOS/EC2/Standalone
Spark Context
18. Basics of RDD
Understanding The Architecture
Worker Node
Machine or Node 1
Worker Node
Machine or Node 2
Worker Node
Machine or Node 3
Driver
Machine or Node 4
User
Resource Manager
YARN/MESOS/EC2/Standalone
Spark Context
Executor Executor Executor
19. Basics of RDD
● Worker processes that run tasks of a job
Executors
20. Basics of RDD
● Worker processes that run tasks of a job
● Return results to the driver
Executors
21. Basics of RDD
● Worker processes that run tasks of a job
● Return results to the driver
● Launched once at the beginning of a Spark application
Executors
22. Basics of RDD
● Worker processes that run tasks of a job
● Return results to the driver
● Launched once at the beginning of a Spark application
● Run for the entire lifetime of an application,
Executors
23. Basics of RDD
● Worker processes that run tasks of a job
● Return results to the driver
● Launched once at the beginning of a Spark application
● Run for the entire lifetime of an application,
● Provide in-memory storage for cached RDDs via Block Manager
Executors
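A brief sketch of how an application asks executors to cache an RDD in their Block Managers; the path is a placeholder and sc is assumed to exist:

import org.apache.spark.storage.StorageLevel

val lines  = sc.textFile("hdfs:///path/to/big-file.txt")   // placeholder path
val parsed = lines.map(_.split("\t"))

// Keep computed partitions in executor memory, spilling to disk if they don't fit.
// The blocks live in each executor's Block Manager until unpersist() or eviction.
parsed.persist(StorageLevel.MEMORY_AND_DISK)

parsed.count()      // first action materialises and caches the partitions
parsed.count()      // second action is served from the executors' cache
parsed.unpersist()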
24. Basics of RDD
Understanding The Architecture
Worker Node
Machine or Node 1
Worker Node
Machine or Node 2
Worker Node
Machine or Node 3
Driver
Machine or Node 4
User
Resource Manager
YARN/MESOS/EC2/Standalone
Spark Context
Executor Executor Executor
Task Task Task Task Task
25. Basics of RDD
Understanding The Architecture
Worker Node
Machine or Node 1
Worker Node
Machine or Node 2
Worker Node
Machine or Node 3
Driver
Machine or Node 4
User
Resource Manager
YARN/MESOS/EC2/Standalone
Spark Context
Executor Executor Executor
Task Task Task Task Task
Maintains RDD &
executes workloads
26. Basics of RDD
Understanding The Architecture
Worker Node
Machine or Node 1
Worker Node
Machine or Node 2
Worker Node
Machine or Node 3
Driver
Machine or Node 4
User
Resource Manager
YARN/MESOS/EC2/Standalone
Spark Context
Executor Executor Executor
Task Task Task Task Task
Maintains RDD &
executes workloads
Converts the user's program into tasks & launches Spark applications.
28. Basics of RDD
● The user submits an application using spark-submit.
Launching a Program
Spark-Submit
29. Basics of RDD
● The user submits an application using spark-submit.
● spark-submit launches the driver program
Launching a Program
Spark-Submit Driver
30. Basics of RDD
● The user submits an application using spark-submit.
● spark-submit launches the driver program
● The driver invokes main() and creates the SparkContext
● The driver program contacts the cluster manager for resources
Launching a Program
Spark-Submit Driver
Cluster
Manager
31. Basics of RDD
Launching a Program
Spark-Submit Driver
Cluster
Manager
Starts
Executors
● The user submits an application using spark-submit.
● spark-submit launches the driver program
● The driver invokes main() and creates the SparkContext
● The driver program contacts the cluster manager for resources
● The cluster manager launches executors
● The driver process runs through the user application.
● The driver sends work to executors in the form of tasks.
● Tasks are run on executor processes to compute and save results.
32. Basics of RDD
Launching a Program
Spark-Submit Driver
Cluster
Manager
Starts
Executors
● The user submits an application using spark-submit.
● spark-submit launches the driver program
● The driver invokes main() and creates the SparkContext
● The driver program contacts the cluster manager for resources
● The cluster manager launches executors
● The driver process runs through the user application.
● The driver sends work to executors in the form of tasks.
● Tasks are run on executor processes to compute and save results.
● Executors are terminated and resources released when the driver's main() exits or sc.stop() is called
Exit
sc.stop()
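Putting the lifecycle above together, a minimal Scala driver program might look like this sketch (the class name and input path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // The driver starts here: main() creates the SparkContext,
    // which contacts the cluster manager chosen via --master.
    val conf = new SparkConf().setAppName("SimpleApp")
    val sc   = new SparkContext(conf)

    // Work is shipped to executors as tasks when an action runs.
    val count = sc.textFile("hdfs:///path/to/input.txt")   // placeholder path
      .filter(_.contains("Spark"))
      .count()
    println(s"Lines with Spark: $count")

    // Stopping the context releases the executors and cluster resources.
    sc.stop()
  }
}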
33. Running On A Cluster
Getting Started - Two Modes
1. Local Mode
2. Cluster Mode
34. Running On A Cluster
Getting Started - Two Modes
1. Local Mode
2. Cluster Mode
spark-shell --master ...
35. Running On A Cluster
Local Mode or Spark in-process
1. Default Mode
2. Does not require any resource manager
a. Simply download and run.
3. Good for utilizing multiple cores for processing
4. Partitions are generally equal to the number of CPUs.
5. Used generally for testing
36. Running On A Cluster
We can run spark-shell, spark-submit with
○ spark-shell
○ spark-shell --master local
○ spark-shell --master local[n]
○ spark-shell --master local[*]
Getting Started - Local Mode
37. Running On A Cluster
Local Mode - Check!
scala> sc.isLocal
res0: Boolean = true
38. Running On A Cluster
Local Mode - Check!
scala> sc.isLocal
res0: Boolean = true
scala> sc.master
res0: String = local[*]
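The same local-mode choices can be made programmatically; a minimal sketch, assuming no other SparkContext is active (the app name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// Equivalent of "spark-shell --master local[*]" inside an application:
val conf = new SparkConf()
  .setAppName("LocalModeExample")
  .setMaster("local[*]")   // or "local", "local[4]", or a cluster URL such as "spark://host:7077"

val sc = new SparkContext(conf)
println(sc.master)          // e.g. local[*]
println(sc.isLocal)         // true when running in local mode
sc.stop()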
40. Running On A Cluster
Cluster Modes
Different kind of Resource Managers
a. Standalone
b. YARN
c. Mesos
d. EC2
41. Running On A Cluster
Cluster Mode - Standalone
Uses Spark's built-in (standalone) resource manager
How to set up?
a. Install Spark on all nodes.
b. Inform all nodes about each other.
c. Launch Spark on all nodes.
d. The Spark nodes will discover each other.
42. Running On A Cluster
Installing Standalone Cluster
1. Copy a compiled version of Spark to the same location on all your machines—for
example, /home/yourname/spark.
2. Set up password-less SSH access from your master machine to the others.
3. Edit the conf/slaves file on your master and fill in the workers’ hostnames.
4. run sbin/start-all.sh on your master
5. Check http://masternode:8080
6. To stop the cluster, run sbin/stop-all.sh on your master node.
43. Running On A Cluster
To run Spark inside Hadoop's YARN.
Tasks are run inside YARN containers.
How to use?
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
spark-shell --master yarn
Cluster Mode - YARN
44. Running On A Cluster
Launching a program on yarn - Hands On
1. export YARN_CONF_DIR=/etc/hadoop/conf/
2. export HADOOP_CONF_DIR=/etc/hadoop/conf/
3. spark-submit --master yarn --class org.apache.spark.examples.SparkPi
/usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
45. Running On A Cluster
Launching a program on yarn - Hands On
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
git clone https://github.com/cloudxlab/bigdata
cd bigdata
cd spark/
cd projects/
cd apache-log-parsing_sbt/
sbt clean
sbt package
spark-submit --master yarn
target/scala-2.10/apache-log-parsing_2.10-0.0.1.jar 10 10
/data/spark/project/access/access.log.45.gz
46. Running On A Cluster
Hands On
Launching a program on yarn
47. Running On A Cluster
Cluster Mode - MESOS
1. Mesos is a general-purpose cluster manager
2. It runs both analytics workloads and long-running services (e.g., databases)
3. To use Spark on Mesos, pass a mesos:// URI to spark-submit:
spark-submit --master mesos://masternode:5050 yourapp
4. You can use ZooKeeper to elect the master in Mesos in a multi-master setup
5. Use a mesos://zk:// URI pointing to a list of ZooKeeper nodes.
6. E.g., if you have 3 nodes (n1, n2, n3) running ZooKeeper on port 2181, use the URI:
mesos://zk://n1:2181/mesos,n2:2181/mesos,n3:2181/mesos
48. Running On A Cluster
Cluster Mode - Amazon EC2
● Spark comes with a built-in script to launch clusters on Amazon EC2.
● First create an Amazon Web Services (AWS) account
● Obtain an access key ID and secret access key.
● export these as environment variables:
○ export AWS_ACCESS_KEY_ID="..."
○ export AWS_SECRET_ACCESS_KEY="..."
● Create an EC2 SSH key pair and download its private key file (helps in SSH)
● Launch command of the spark-ec2 script:
○ cd /path/to/spark/ec2
○ ./spark-ec2 -k mykeypair -i mykeypair.pem launch mycluster
49. Running On A Cluster
Deployment Modes
● Based on where the driver runs.
● Two ways:
○ Client
○ Cluster
50. Running On A Cluster
Deployment Modes
● Based on where the driver runs.
● Two ways:
○ Client - launch the driver program locally. Default
○ Cluster
51. Running On A Cluster
Deployment Modes
● Based on where the driver runs.
● Two ways:
○ Client - launch the driver program locally. Default
○ Cluster - on one of the worker machines inside the
cluster
52. Running On A Cluster
Architecture Yarn Client Mode
1. The driver application runs outside YARN
a. On the machine where it is launched
2. If the driver application shuts down, the job is killed
3. Does not have resilience, but is quicker to run.
53. Running On A Cluster
1. The driver application runs inside YARN, in the application master
2. If the launcher shuts down, the application continues like a batch process
a. In the background
3. Preferred way to run long-running processes
Architecture Yarn Cluster Mode
54. Running On A Cluster
Architecture Yarn Cluster Mode - Example
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
spark-submit --master yarn --deploy-mode cluster --class
org.apache.spark.examples.SparkPi
/usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
To check the status, use:
○ http://e.cloudxlab.com:4040/
○ http://a.cloudxlab.com:8088/cluster
55. Running On A Cluster
Architecture Yarn Cluster Mode - Demo
Hands On
56. Running On A Cluster
Which Cluster Manager to Use?
1. Start with a local mode if this is a new deployment.
2. To use richer resource scheduling capabilities (e.g., queues), use YARN or Mesos
3. When sharing amongst many users is the primary criterion, use Mesos
4. In all cases, it is best to run Spark on the same nodes as HDFS for fast access to
storage.
a. You can either install Mesos or Standalone cluster on Datanodes
b. Or Hadoop distributions already install YARN and HDFS together
57. Running On A Cluster
Packaging Your Code and Dependencies
1. Bundle all the libraries that your program depends upon
2. No need to bundle the Spark libraries (org.apache.spark) or language libraries (java…)
3. Python users can:
a. Either install on all nodes using pip or easy_install
b. Or use the --py-files argument of spark-submit (ships files to every node's working directory)
4. Java & Scala
a. Submit libraries using --jars
b. If there are many libraries, use a build tool such as sbt or Maven (see the sketch below)
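For the sbt route, a minimal build.sbt along these lines is typical; the versions below are assumptions and should match your cluster (the hands-on earlier uses a Scala 2.10 jar):

// build.sbt (illustrative versions)
name := "apache-log-parsing"
version := "0.0.1"
scalaVersion := "2.10.6"

// Spark itself is "provided": the cluster supplies it, so it is not bundled into your jar.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"

// Other third-party libraries are ordinary dependencies; bundle them with a plugin
// such as sbt-assembly, or pass them at submit time with --jars.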
58. Running On A Cluster
Common flags for spark-submit
Flag: master
Explanation: Indicates the cluster manager to connect to. The options for this flag are described earlier.
59. Running On A Cluster
Common flags for spark-submit
Flag: deploy-mode
Explanation: Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”). In client mode spark-submit will run your driver on the same machine where spark-submit is itself being invoked. In cluster mode, the driver will be shipped to execute on a worker node in the cluster. The default is client mode.
60. Running On A Cluster
Common flags for spark-submit
Flag: class
Explanation: The “main” class of your application if you’re running a Java or Scala program.
61. Running On A Cluster
Common flags for spark-submit
Flag: name
Explanation: A human-readable name for your application. This will be displayed in Spark’s web UI.
62. Running On A Cluster
Common flags for spark-submit
Flag: jars
Explanation: A list of JAR files to upload and place on the classpath of your application. If your application depends on a small number of third-party JARs, you can add them here.
63. Running On A Cluster
Common flags for spark-submit
Flag: files
Explanation: A list of files to be placed in the working directory of your application. This can be used for data files that you want to distribute to each node.
64. Running On A Cluster
Common flags for spark-submit
Flag: py-files
Explanation: A list of files to be added to the PYTHONPATH of your application. This can contain .py, .egg, or .zip files.
65. Running On A Cluster
Common flags for spark-submit
Flag: executor-memory
Explanation: The amount of memory to use for executors, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
66. Running On A Cluster
Common flags for spark-submit
Flag: driver-memory
Explanation: The amount of memory to use for the driver process, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).