https://aka.ms/spark-architecture https://aka.ms/distributed-programming https://twitter.com/AdiPolak ------------------------------------------------------------------------------------------------- Spark is quickly gaining steam in the Big Data analytics world, and it also has excellent features for stream processing and machine learning development. Big Data systems are a new reality. Whether we're creating microservices-based solutions or machine learning-based products, we work with data, most often big data. Apache Spark is your friend if you want to position yourself on the road to success as a Big Data Developer, Data Engineer, or Data Scientist with hands-on production experience. Come to this session to learn the Apache Spark basics and how YOU can get started with it.
Millions of websites depend on the WordPress.org infrastructure every day. Parts of it are almost a decade old, ancient in software terms, and most of it is not even based on WordPress itself. By the time of WordCamp San Antonio, we will hopefully have finished a project that drastically modernizes the theme repository and moves it over to a WordPress installation, complete with a redesigned front-end and a more robust backend. This session will go over the changes involved, and will look into how WordPress.org works and what its underlying structure looks like.
APIs are everywhere today and can be a great building block of modern applications. But all too often APIs are not truly great. Rather than love your API, developers curse it. How can you avoid that fate? In this session we'll look at the most common mistakes API providers make and how you can avoid making them too. Do you offer a bad developer experience (DX)? Poor, inconsistent API design? Unreliable services? This talk is a deep dive on not just what to avoid but what to do instead. And you'll leave knowing how to get developers to love your API, not hate it.
Why Docker? Is it cool? Is it the newest thing? Does it solve _my_ problem? In reality, as DevOps thought leaders and professionals, the question is really, "Do the benefits of adopting Docker outweigh its costs -- in terms of risk and opportunity cost -- for my company?"
This document discusses the challenges of API documentation and differences between technical authors and API writers. It outlines that API writers require both strong technical skills like coding abilities as well as writing skills, which can be difficult to find. It also notes differences in tools used for general documentation versus API documentation. Finally, it provides recommendations like improving skills through training, having realistic expectations of skills, improving standards and tools, and making documentation a shared team responsibility.
This document outlines three principles for designing mobile APIs: 1. Reduce round trips to the server by bundling data into single requests to minimize network overhead which conserves mobile resources. 2. Control verbosity by purging unnecessary data, using compression, and allowing clients to specify desired response fields to reduce payload size. 3. Restrict access by identifying request sources, denying unauthorized requests, and protecting sensitive data with a mobile-friendly security model. Examples are provided to illustrate each principle in action.
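The second principle above, controlling verbosity, can be sketched in a few lines of plain Python. This is a hypothetical illustration, not code from the slides: a server-side helper that returns only the fields a mobile client asked for, so the payload carries nothing extra over the network.

```python
def select_fields(resource, fields=None):
    """Return only the requested fields of a response dict.

    If `fields` is None, the client asked for everything, so the
    full resource is returned unchanged.
    """
    if fields is None:
        return dict(resource)
    return {k: v for k, v in resource.items() if k in fields}

# Hypothetical resource: a mobile client listing users needs only id and name.
user = {"id": 7, "name": "Ada", "bio": "Loves APIs",
        "avatar_url": "https://example.com/a.png"}
slim = select_fields(user, {"id", "name"})  # -> {"id": 7, "name": "Ada"}
```

In a real API this would typically be driven by a query parameter such as `?fields=id,name`, combined with compression at the transport layer.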
The document discusses contributing to the pandas open source project. It provides information on how to become a contributor, the number of current contributors and issues, and steps for setting up a development environment, making changes, writing tests, and submitting pull requests.
Spring Data REST allows creating RESTful APIs for data access without writing any endpoints or services. It generates REST endpoints for CRUD operations on data sources based on Spring Data repositories. With Spring Data REST, microservices require less code and fewer services and endpoints, while gaining more possibilities through HATEOAS links between resources. Documentation on Spring Data REST can be found on the Spring website and in the reference guide.
The document discusses RxJS, a library for reactive programming using observables. It begins with an introduction from a beginner and expert perspective on RxJS. It then covers topics like creating observables, best practices for importing RxJS, choosing operators, avoiding subscriptions, wrapping APIs, and the benefits of "same-shapedness". Code examples are provided for creating observables, getting input changes as an observable, using operators like switchMap, and merging multiple observable data sources.
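The core idea the deck builds on, an observable as a push-based stream that operators transform, can be sketched in a few lines. This toy version is written in plain Python rather than RxJS, purely to show the shape of `create`/`map`/`subscribe`; the real library's operator set and semantics are much richer.

```python
class Observable:
    """A toy observable: wraps a function that pushes values to an observer."""

    def __init__(self, subscribe_fn):
        self._subscribe_fn = subscribe_fn

    def subscribe(self, observer):
        # Start the stream: the producer pushes values into `observer`.
        self._subscribe_fn(observer)

    def map(self, fn):
        # Operators return a *new* observable that transforms each value,
        # which is the "same-shapedness" the talk highlights.
        return Observable(lambda obs: self._subscribe_fn(lambda v: obs(fn(v))))

def of(*values):
    """Emit a fixed sequence of values, like RxJS's `of`."""
    return Observable(lambda obs: [obs(v) for v in values])

received = []
of(1, 2, 3).map(lambda x: x * 10).subscribe(received.append)
# received is now [10, 20, 30]
```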
In general, this presentation is about API documentation, plus a quick introduction to RAML and its new version, 1.0.
Espresso Logic builds and runs RESTful servers for SQL databases, with advanced support for row/column-level security, and business logic via JavaScript events and reactive programming.
Imagine we have Ada, our data science intern. Let's run through a very simple wordcount Spark job, and find a handful of potential failure points. Dozens of failures can and should happen when running Spark jobs on commodity hardware. Given the basic foundation for infrastructure-level expectations, this talk gives Ada tools to ensure her job isn't caught dead. Once the simple example job runs reliably, with the potential to scale, our data scientist can apply the same toolset to focus on some more interesting algorithms. Turn SNAFUs into successes by anticipating and handling infra failures gracefully. Note: this talk is a Spark-focused extension of Part I, "Just Enough DevOps For Data Scientists" from Scale by The Bay 2018 https://www.youtube.com/watch?v=RqpnBl5NgW0&t=19s
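The wordcount job Ada runs boils down to two stages: a map stage that turns lines into `(word, 1)` pairs, and a reduce-by-key stage that sums counts per word. Here is that logic sketched in plain Python (with hypothetical input lines); in Spark the same shape runs distributed, and the reduce step is what triggers a shuffle.

```python
from collections import defaultdict

# Hypothetical input: in the real job these lines come from a file on HDFS/S3.
lines = ["to be or not to be", "to know is to know"]

# Map stage: each line becomes (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# reduceByKey stage: sum counts per word -- at scale, this is the shuffle,
# and it is where many of the talk's failure points (disk, partitions) live.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n
# counts -> {"to": 4, "be": 2, "or": 1, "not": 1, "know": 2, "is": 1}
```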
Anya Bida is a senior member of technical staff working on Spark tuning at Salesforce. She has a PhD from Mayo Clinic and BS from Johns Hopkins. The document discusses DevOps concepts for data scientists like handling infrastructure failures when running Spark jobs. It provides an overview of Spark operations like map, reduceByKey and saveAsTextFile. It also discusses best practices for avoiding common Spark and HDFS failures through techniques like high availability, sufficient disk space, optimizing partitions, and persisting or checkpointing data.
The document summarizes an agenda for a Big Data & Data Science event. The agenda includes: 1. A presentation on the Apache Spark ecosystem by an expert. 2. A talk on running SQL on large Hadoop clusters using SparkSQL and BigSQL. 3. A student presentation on a social data machine learning project. 4. A presentation on the Data Science Experience tool. 5. A talk on using data to detect anomalies. The event will conclude with a question and answer session.
Spark 2.0 is a major release of Apache Spark. This release has brought many changes to the APIs and libraries of Spark. So in this KnolX, we will be looking at some improvements made in Spark 2.0. In these slides we will also get an introduction to some new features in Spark 2.0, like the SparkSession API and Structured Streaming.
This document provides an overview of optimizing Spark SQL performance. It begins with introducing the speaker and their background with Spark. It then discusses reading query plans, interpreting them to understand optimizations, and tuning plans by pushing down filters, avoiding implicit casts, and other techniques. It emphasizes tracking query execution through the Spark UI to analyze jobs, stages and tasks for bottlenecks. The document aims to help understand how to maximize Spark SQL performance.
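One tuning technique the deck covers, pushing filters down, is easy to see in miniature. The sketch below is plain Python, not Spark SQL: applying a selective predicate before a join shrinks the rows the join must touch, which is the same effect the optimizer aims for when it pushes a filter below a join or into the data source.

```python
# Hypothetical tables: (order_id, region) and (order_id, item).
orders = [(1, "EU"), (2, "US"), (3, "EU")]
items = [(1, "book"), (2, "pen"), (3, "mug")]

# Pushed-down: filter first, so the join only sees the surviving rows.
eu_orders = [o for o in orders if o[1] == "EU"]
joined = [(oid, item) for oid, _ in eu_orders
          for iid, item in items if iid == oid]
# joined -> [(1, "book"), (3, "mug")]
```

In real Spark SQL you would confirm the pushdown by reading the physical plan (e.g. `PushedFilters` in a scan node) and by watching input sizes in the Spark UI, as the talk describes.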
Michal Malohlava talks about the PySparkling Water package for Spark and Python users. - Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Big Data is all about being able to access and process data in various formats, and from various sources. Apache Bahir provides extensions to distributed analytics platforms, giving them access to different data sources. In this talk we will introduce you to Apache Bahir and its various connectors that are available for Apache Spark and Apache Flink. We will also go over the details of how to build, test and deploy a Spark application using the MQTT data source for the new Apache Spark 2.0 Structured Streaming functionality.
This document discusses AppsFlyer's experience running Spark on Mesos in production for retention data processing and analytics. Key points include: - AppsFlyer processes over 30 million installs and 5 billion sessions daily for retention reporting across 18 dimensions using Spark, Mesos, and S3. - Challenges included timeouts and errors when using Spark's S3 connectors due to the eventual consistency of S3, which was addressed by using more robust connectors and configuration options. - A coarse-grained Mesos scheduling approach was found to be more stable than fine-grained, though it has limitations like static core allocation that future Mesos improvements may address. - Tuning jobs for coarse-
In this presentation we'll explain how to use the R programming language with Spark using a Databricks notebook and the SparkR package. We'll discuss how to push data wrangling to the Spark nodes for massive scale and how to bring it back to a single node so we can use open source packages on the data. We'll demonstrate converting SQL tables into R distributed data frames and how to convert R data frames to SQL tables. We'll also have a look at how to train predictive models using data distributed over the Spark nodes. Bring your popcorn. This is a fun and interesting presentation. Speaker: Bryan Cafferky
Spark is in high demand for several reasons: it offers low-latency processing by keeping data in memory, supports streaming analytics, machine learning algorithms, and graph processing. It also introduces DataFrames for easier data analysis and integrates well with Hadoop for processing large datasets. Spark can sort 100TB of data 3 times faster than MapReduce using fewer resources, making it a popular big data processing engine.
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
The document discusses creating a fast data pipeline using Apache Spark's Structured Streaming and Spark Streaming. It presents a sensor anomaly detection pipeline that uses Structured Streaming for data exploration, preparation, and anomaly detection, and Spark Streaming for online model creation and training. It compares the execution and abstraction models of Structured Streaming and Spark Streaming, and demonstrates how to build the sensor anomaly detection pipeline using Kafka sources and sinks with SQL operations, event time windows, and watermarks.
Kappa Architecture is an alternative to Lambda Architecture that simplifies real-time data processing. It uses a distributed log like Kafka to store all input data immutably to allow reprocessing from the beginning if the processing code changes. This avoids having to maintain separate batch and real-time processing systems. The ASPgems team has implemented Kappa Architecture for several clients using Kafka, Spark Streaming, and Cassandra to provide real-time analytics and metrics in sectors like telecommunications, IoT, insurance, and energy.
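The Kappa idea can be shown in miniature: every event lives in one append-only log, and changing the processing code just means replaying that log from the beginning to rebuild the derived view. The sketch below is plain Python with a list standing in for Kafka, purely to illustrate why no separate batch layer is needed.

```python
# Immutable event log (stand-in for a Kafka topic): (event_type, count) pairs.
log = [("page_view", 1), ("click", 1), ("page_view", 1)]

def build_view(events, process):
    """Replay the full log through a processing function to build a view."""
    view = {}
    for kind, n in events:
        process(view, kind, n)
    return view

# v1 of the processing code: a single total counter.
v1 = build_view(log, lambda view, kind, n:
                view.update(total=view.get("total", 0) + n))

# v2 changes the logic to count per event type. No batch system to keep in
# sync -- we simply replay the same log from offset 0 with the new code.
def count_by_kind(view, kind, n):
    view[kind] = view.get(kind, 0) + n

v2 = build_view(log, count_by_kind)
```

In the real architecture the log is Kafka with long retention, the processing function is a Spark Streaming job, and the view lands in a store like Cassandra.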
This Edureka Spark Tutorial will help you to understand all the basics of Apache Spark. This Spark tutorial is ideal for both beginners as well as professionals who want to learn or brush up Apache Spark concepts. Below are the topics covered in this tutorial: 1) Big Data Introduction 2) Batch vs Real Time Analytics 3) Why Apache Spark? 4) What is Apache Spark? 5) Using Spark with Hadoop 6) Apache Spark Features 7) Apache Spark Ecosystem 8) Demo: Earthquake Detection Using Apache Spark
In this era of ever-growing data, the need for analyzing it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives like Hadoop, Spark, Storm etc. Spark, however, is unique in providing batch as well as streaming capabilities, thus making it a preferred choice for lightning-fast Big Data analysis platforms.
This talk will provide a brief update on Microsoft’s recent history in Open Source with specific emphasis on Azure Databricks, a fast, easy and collaborative Apache Spark-based analytics service. Attendees will learn how to integrate MongoDB Atlas with Azure Databricks using the MongoDB Connector for Spark. This integration allows users to process data in MongoDB with the massive parallelism of Spark, its machine learning libraries, and streaming API.
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, the analytics requirements are changing rapidly, forcing businesses to either keep up or be left behind.
This slide deck accompanied the presentation at #SUGUK on 20180322 in London, UK. PowerApps allows you to build business applications with no code, and is included in most Office 365 plans.
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
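The pattern behind Spark Listeners is a plain callback interface: the scheduler fires events as jobs and stages complete, and every registered listener's handler runs. The sketch below shows that shape in plain Python (names like `Scheduler` and `on_job_end` are illustrative stand-ins; in real code you extend `SparkListener` and register it with the SparkContext).

```python
class Scheduler:
    """Stand-in for the Spark scheduler that emits lifecycle events."""

    def __init__(self):
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)

    def run_job(self, job_id):
        # ... job executes ...
        # On completion, notify every registered listener.
        for listener in self.listeners:
            listener.on_job_end(job_id)

class MetricsListener:
    """A listener that records completed jobs, e.g. to ship to a dashboard."""

    def __init__(self):
        self.completed = []

    def on_job_end(self, job_id):
        self.completed.append(job_id)

scheduler = Scheduler()
metrics = MetricsListener()
scheduler.add_listener(metrics)
scheduler.run_job(42)  # metrics.completed is now [42]
```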
This document discusses processing large messages with Kafka Streams and Kafka Connect. It describes how large messages can exceed Kafka's maximum message size limit. It proposes using an S3-backed serializer to store large messages in S3 and send pointers to Kafka instead. This allows processing logic to remain unchanged while handling large messages. The serializer transparently retrieves messages from S3 during deserialization.
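The pointer pattern the document describes can be sketched with a dict standing in for S3. This is an illustration of the idea, not the actual serializer's API: payloads over the broker's size limit go to the object store, only a small pointer record travels through Kafka, and deserialization resolves the pointer transparently so downstream logic never notices.

```python
FAKE_S3 = {}            # stand-in for an S3 bucket
MAX_KAFKA_BYTES = 16    # tiny limit, purely for illustration

def serialize(key: str, payload: bytes) -> bytes:
    if len(payload) <= MAX_KAFKA_BYTES:
        return b"inline:" + payload        # small enough: send as-is
    FAKE_S3[key] = payload                 # too big: park it in "S3"
    return b"s3:" + key.encode()           # and send only the pointer

def deserialize(record: bytes) -> bytes:
    if record.startswith(b"inline:"):
        return record[len(b"inline:"):]
    # Pointer record: fetch the real payload from "S3" transparently.
    return FAKE_S3[record[len(b"s3:"):].decode()]
```

A small message round-trips inline, while a large one round-trips via the store; either way, the consumer's processing code sees the original bytes.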
By running your workloads in Kubernetes, you can focus on designing and building your applications instead of managing the infrastructure that runs them. But wait! What about the cost? With the Virtual Kubelet provider for AKS and Azure Container Instances, both Linux and Windows containers can be scheduled on a container instance as if it were a standard Kubernetes node. This configuration allows you to take advantage of both the capabilities of Kubernetes and the management value and cost benefit of container instances. In this talk you will learn how to deploy an application to AKS and ACI with Virtual Kubelet, leveraging the scalability of Kubernetes and the cost efficiency of ACI. https://dev.to/adipolak/kubernetes-and-virtual-kubelet-in-a-nutshell-gn4
This document discusses AI at scale using Apache Spark on Azure. It provides an overview of Apache Spark, how it can be used for machine learning with tools like MLlib and Databricks, and how cognitive services can be combined with Spark. It also discusses using Azure services like Databricks, HDInsight and AKS for running Spark workloads at scale, and the roles of data engineers and data scientists.
Breaking up a monolith, or switching from a desktop client to the web at scale, requires us to think about many factors: the engineering team and the knowledge it already possesses, the technologies that exist, how to build the infrastructure right, and much more. How can we use Kubernetes with Virtual Kubelet to cut costs and use the right service for the workload, whether it is a burst workload or a steady one?