Big Data with Hadoop & Spark Training: http://bit.ly/2sf2z6i
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Introduction to DataFrames
2) Creating DataFrames from JSON
3) DataFrame Operations
4) Running SQL Queries Programmatically
5) Datasets
6) Inferring the Schema Using Reflection
7) Programmatically Specifying the Schema
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
The convergence of big data technology towards traditional database domain has became an industry trend. At present, open source big data processing engines, such as Apache Spark, Apache Hadoop, Apache Flink, etc., already support SQL interfaces, and the usage of SQL basically occupies a dominant position. Companies use above open source software to build their own ETL framework and OLAP technology. However, in terms of OLTP technology, it is still a strong point of traditional databases. One of the main reasons is the support of ACID by traditional databases.
Spark is an open source cluster computing framework for large-scale data processing. It provides high-level APIs and runs on Hadoop clusters. Spark components include Spark Core for execution, Spark SQL for SQL queries, Spark Streaming for real-time data, and MLlib for machine learning. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across nodes for parallel processing. A word count example demonstrates how to use transformations like flatMap and reduceByKey to count word frequencies from an input file in Spark.
This document provides an overview and introduction to NoSQL databases. It begins with an agenda that explores key-value, document, column family, and graph databases. For each type, 1-2 specific databases are discussed in more detail, including their origins, features, and use cases. Key databases mentioned include Voldemort, CouchDB, MongoDB, HBase, Cassandra, and Neo4j. The document concludes with references for further reading on NoSQL databases and related topics.
Spark Streaming allows processing of live data streams in Spark. It integrates streaming data and batch processing within the same Spark application. Spark SQL provides a programming abstraction called DataFrames and can be used to query structured data in Spark. Structured Streaming in Spark 2.0 provides a high-level API for building streaming applications on top of Spark SQL's engine. It allows running the same queries on streaming data as on batch data and unifies streaming, interactive, and batch processing.
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
Jim Boriotti presents an overview and demo of Azure Synapse Analytics, an integrated data platform for business intelligence, artificial intelligence, and continuous intelligence. Azure Synapse Analytics includes Synapse SQL for querying with T-SQL, Synapse Spark for notebooks in Python, Scala, and .NET, and Synapse Pipelines for data workflows. The demo shows how Azure Synapse Analytics provides a unified environment for all data tasks through the Synapse Studio interface.
This document discusses CQRS and event sourcing patterns for Java developers. It begins with an overview of classical monolithic architectures versus modern microservice architectures. It then contrasts CRUD with CQRS, explaining that CQRS separates reads from writes by using commands for writes and queries for reads. Events evolve from commands and represent things that occurred in the past. The document provides an example implementation using the Lagom framework that demonstrates separating the write side from the read side and persisting events. It emphasizes that with this approach, all state changes are stored as events and the current state can be recreated by replaying events. The document encourages the use of Lagom due to benefits like asynchronous programming, developer productivity, and production readiness.
Real-Time Stock Market Analysis using Spark Streaming
This document proposes using support vector machines (SVMs) to model high-frequency limit order book dynamics and predict metrics like mid-price movement and price spread crossing. It describes representing each limit order book entry as a vector of attributes, then using multi-class SVMs to build models for each metric. Experiments on real data show the selected features are effective for short-term price forecasts. The document provides background on SVMs, describing how they find an optimal separating hyperplane to classify data points into labels.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
This session covers how to work with PySpark interface to develop Spark applications. From loading, ingesting, and applying transformation on the data. The session covers how to work with different data sources of data, apply transformation, python best practices in developing Spark Apps. The demo covers integrating Apache Spark apps, In memory processing capabilities, working with notebooks, and integrating analytics tools into Spark Applications.
Spark started at Facebook as an experiment when the project was still in its early phases. Spark's appeal stemmed from its ease of use and an integrated environment to run SQL, MLlib, and custom applications. At that time the system was used by a handful of people to process small amounts of data. However, we've come a long way since then. Currently, Spark is one of the primary SQL engines at Facebook in addition to being the primary system for writing custom batch applications. This talk will cover the story of how we optimized, tuned and scaled Apache Spark at Facebook to run on 10s of thousands of machines, processing 100s of petabytes of data, and used by 1000s of data scientists, engineers and product analysts every day. In this talk, we'll focus on three areas: * *Scaling Compute*: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogenous machines in disaggregated (shared-storage) clusters. * *Optimizing Core Engine*: How we continuously tune, optimize and add features to the core engine in order to maximize the useful work done per second. * *Scaling Users:* How we make Spark easy to use, and faster to debug to seamlessly onboard new users.
Speakers: Ankit Agarwal, Sameer Agarwal
This document provides an overview of Apache Spark, an open-source unified analytics engine for large-scale data processing. It discusses Spark's core APIs including RDDs and transformations/actions. It also covers Spark SQL, Spark Streaming, MLlib, and GraphX. Spark provides a fast and general engine for big data processing, with explicit operations for streaming, SQL, machine learning, and graph processing. The document includes installation instructions and examples of using various Spark components.
This document provides an overview and introduction to NoSQL databases. It discusses key-value stores like Dynamo and BigTable, which are distributed, scalable databases that sacrifice complex queries for availability and performance. It also explains column-oriented databases like Cassandra that scale to massive workloads. The document compares the CAP theorem and consistency models of these databases and provides examples of their architectures, data models, and operations.
This document describes the MapReduce programming model for processing large datasets in a distributed manner. MapReduce allows users to write map and reduce functions that are automatically parallelized and run across large clusters. The input data is split and the map tasks run in parallel, producing intermediate key-value pairs. These are shuffled and input to the reduce tasks, which produce the final output. The system handles failures, scheduling and parallelization transparently, making it easy for programmers to write distributed applications.
This document provides an overview of a talk on Apache Spark. It introduces the speaker and their background. It acknowledges inspiration from a previous Spark training. It then outlines the structure of the talk, which will include: a brief history of big data; a tour of Spark including its advantages over MapReduce; and explanations of Spark concepts like RDDs, transformations, and actions. The document serves to introduce the topics that will be covered in the talk.
The document introduces data engineering and provides an overview of the topic. It discusses (1) what data engineering is, how it has evolved with big data, and the required skills, (2) the roles of data engineers, data scientists, and data analysts in working with big data, and (3) the structure and schedule of an upcoming meetup on data engineering that will use an agile approach over monthly sprints.
This document provides a summary of improvements made to Hive's performance through the use of Apache Tez and other optimizations. Some key points include:
- Hive was improved to use Apache Tez as its execution engine instead of MapReduce, reducing latency for interactive queries and improving throughput for batch queries.
- Statistics collection was optimized to gather column-level statistics from ORC file footers, speeding up statistics gathering.
- The cost-based optimizer Optiq was added to Hive, allowing it to choose better execution plans.
- Vectorized query processing, broadcast joins, dynamic partitioning, and other optimizations improved individual query performance by over 100x in some cases.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Of all the developers’ delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs-RDDs, DataFrames, and Datasets-available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practices 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (this will be vocalization of the blog, along with the latest developments in Apache Spark 2.x Dataframe/Datasets and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
This session of the workshop introduces Spark SQL along with DataFrames, Datasets. Datasets give us the ability to easily intermix relational and functional style programming. So that we can explore the new Dataset API this iteration will be focused in Scala.
This document provides an agenda for a 3+ hour workshop on Apache Spark 2.x on Databricks. It includes introductions to Databricks, Spark fundamentals and architecture, new features in Spark 2.0 like unified APIs, and workshops on DataFrames/Datasets, Spark SQL, and structured streaming concepts. The agenda covers lunch and breaks and is divided into hour and half hour segments.
Introduction to Spark Datasets - Functional and relational together at last
Spark Datasets are an evolution of Spark DataFrames which allow us to work with both functional and relational transformations on big data with the speed of Spark.
This chapter discusses Spark SQL, which allows querying Spark data with SQL. It covers initializing Spark SQL, loading data from sources like Hive, Parquet, JSON and RDDs, caching data, writing UDFs, and performance tuning. The JDBC server allows sharing cached tables and queries between programs. SchemaRDDs returned by queries or loaded from data represent the data structure that SQL queries operate on.
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
This document discusses how Spark and Cassandra can be used together. It begins with an introduction to Spark and Cassandra individually, explaining their architectures and key features. It then details the Spark-Cassandra connector, describing how Cassandra tables can be exposed as Spark RDDs and DataFrames. Various use cases for Spark and Cassandra are presented, including data cleaning, schema migration, and analytics. The document emphasizes the importance of data locality when performing joins and writes between Spark and Cassandra. Code examples are provided for common tasks like data cleaning, migration, and analytics.
This document provides an introduction and overview of Apache Spark with Python (PySpark). It discusses key Spark concepts like RDDs, DataFrames, Spark SQL, Spark Streaming, GraphX, and MLlib. It includes code examples demonstrating how to work with data using PySpark for each of these concepts.
The document discusses a meetup about building modern applications with DataFrames in Spark. It provides an agenda for the meetup that includes an introduction to Spark and DataFrames, a discussion of the Catalyst internals, and a demo. The document also provides background on Spark, noting its open source nature and large-scale usage by many organizations.
The document discusses a meetup about building modern applications with DataFrames in Spark. It provides an agenda for the meetup that includes an introduction to Spark and DataFrames, a discussion of the Catalyst internals, and a demo. The document also provides background on Spark, noting its open source nature and large-scale usage by many organizations.
The document discusses loading data into Spark SQL and the differences between DataFrame functions and SQL. It provides examples of loading data from files, cloud storage, and directly into DataFrames from JSON and Parquet files. It also demonstrates using SQL on DataFrames after registering them as temporary views. The document outlines how to load data into RDDs and convert them to DataFrames to enable SQL querying, as well as using SQL-like functions directly in the DataFrame API.
- Apache Spark is an open-source cluster computing framework that provides fast, in-memory processing for large-scale data analytics. It can run on Hadoop clusters and standalone.
- Spark allows processing of data using transformations and actions on resilient distributed datasets (RDDs). RDDs can be persisted in memory for faster processing.
- Spark comes with modules for SQL queries, machine learning, streaming, and graphs. Spark SQL allows SQL queries on structured data. MLib provides scalable machine learning. Spark Streaming processes live data streams.
- Apache Spark is an open-source cluster computing framework that provides fast, general engine for large-scale data processing. It introduces Resilient Distributed Datasets (RDDs) that allow in-memory processing for speed.
- The document discusses Spark's key concepts like transformations, actions, and directed acyclic graphs (DAGs) that represent Spark job execution. It also summarizes Spark SQL, MLlib, and Spark Streaming modules.
- The presenter is a solutions architect who provides an overview of Spark and how it addresses limitations of Hadoop by enabling faster, in-memory processing using RDDs and a more intuitive API compared to MapReduce.
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands-On Labs
This document provides an overview of Spark and using Spark on HDInsight. It discusses Spark concepts like RDDs, transformations, and actions. It also covers Spark extensions like Spark SQL, Spark Streaming, and MLlib. Finally, it highlights benefits of using Spark on HDInsight like integration with Azure services, scalability, and support.
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. We will cover approaches of processing Big Data on Spark cluster for real time analytic, machine learning and iterative BI and also discuss the pros and cons of using Spark in Azure cloud.
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
Abstract:- Of all the developers delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs - RDDs, DataFrames, and Datasets available in Apache Spark 2.x. In particular, I will emphasize why and when you should use each set as best practices, outline its performance and optimization benefits, and underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them.
Building highly scalable data pipelines with Apache Spark
This document provides a summary of Apache Spark, including:
- Spark is a framework for large-scale data processing across clusters that is faster than Hadoop by relying more on RAM and minimizing disk IO.
- Spark transformations operate on resilient distributed datasets (RDDs) to manipulate data, while actions return results to the driver program.
- Spark can receive data from various sources like files, databases, sockets through its datasource APIs and process both batch and streaming data.
- Spark streaming divides streaming data into micro-batches called DStreams and integrates with messaging systems like Kafka. Structured streaming is a newer API that works on DataFrames/Datasets.
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
SparkR provides an R interface to Apache Spark that allows R developers to leverage Spark's distributed processing capabilities through DataFrames. DataFrames impose a schema on data from RDDs, making the data easier to access and manipulate compared to raw RDDs. DataFrames in SparkR allow R-like syntax and interactions with data while leveraging Spark's optimizations by passing operations to the JVM. The speaker demonstrated SparkR DataFrame features and discussed the project's roadmap, including expanding SparkR to support Spark machine learning capabilities.
The document describes the steps to configure Oracle sharding in an Oracle 12c environment. It includes installing Oracle software on shardcat, shard1, and shard2 nodes, creating an SCAT database, installing the GSM software, configuring the shard catalog, registering the shard nodes, creating a shard group and adding shards, deploying the shards to create databases on shard1 and shard2, verifying the shard configuration, creating a global service, and creating a sample schema and shard table to verify distribution across shards.
Computer vision is a branch of computer science which deals with recognising objects, people and identifying patterns in visuals. It is basically analogous to the vision of an animal.
Topics covered:
1. Overview of Machine Learning
2. Basics of Deep Learning
3. What is computer vision and its use-cases?
4. Various algorithms used in Computer Vision (mostly CNN)
5. Live hands-on demo of either Auto Cameraman or Face recognition system
6. What next?
This document provides an agenda for an introduction to deep learning presentation. It begins with an introduction to basic AI, machine learning, and deep learning terms. It then briefly discusses use cases of deep learning. The document outlines how to approach a deep learning problem, including which tools and algorithms to use. It concludes with a question and answer section.
This document discusses recurrent neural networks (RNNs) and their applications. It begins by explaining that RNNs can process input sequences of arbitrary lengths, unlike other neural networks. It then provides examples of RNN applications, such as predicting time series data, autonomous driving, natural language processing, and music generation. The document goes on to describe the fundamental concepts of RNNs, including recurrent neurons, memory cells, and different types of RNN architectures for processing input/output sequences. It concludes by demonstrating how to implement basic RNNs using TensorFlow's static_rnn function.
Natural Language Processing (NLP) is a field of artificial intelligence that deals with interactions between computers and human languages. NLP aims to program computers to process and analyze large amounts of natural language data. Some common NLP tasks include speech recognition, text classification, machine translation, question answering, and more. Popular NLP tools include Stanford CoreNLP, NLTK, OpenNLP, and TextBlob. Vectorization is commonly used to represent text in a way that can be used for machine learning algorithms like calculating text similarity. Tf-idf is a common technique used to weigh words based on their frequency and importance.
- Naive Bayes is a classification technique based on Bayes' theorem that uses "naive" independence assumptions. It is easy to build and can perform well even with large datasets.
- It works by calculating the posterior probability for each class given predictor values using the Bayes theorem and independence assumptions between predictors. The class with the highest posterior probability is predicted.
- It is commonly used for text classification, spam filtering, and sentiment analysis due to its fast performance and high success rates compared to other algorithms.
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service) that is a tool for curating and processing massive amounts of data and developing, training and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and Machine Learning Library (Mllib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.
This document provides an overview of NoSQL databases and compares them to relational databases. It discusses the different types of NoSQL databases including key-value stores, document databases, wide column stores, and graph databases. It also covers some common concepts like eventual consistency, CAP theorem, and MapReduce. While NoSQL databases provide better scalability for massive datasets, relational databases offer more mature tools and strong consistency models.
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Databricks
The convergence of big data technology towards traditional database domain has became an industry trend. At present, open source big data processing engines, such as Apache Spark, Apache Hadoop, Apache Flink, etc., already support SQL interfaces, and the usage of SQL basically occupies a dominant position. Companies use above open source software to build their own ETL framework and OLAP technology. However, in terms of OLTP technology, it is still a strong point of traditional databases. One of the main reasons is the support of ACID by traditional databases.
Spark is an open source cluster computing framework for large-scale data processing. It provides high-level APIs and runs on Hadoop clusters. Spark components include Spark Core for execution, Spark SQL for SQL queries, Spark Streaming for real-time data, and MLlib for machine learning. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across nodes for parallel processing. A word count example demonstrates how to use transformations like flatMap and reduceByKey to count word frequencies from an input file in Spark.
This document provides an overview and introduction to NoSQL databases. It begins with an agenda that explores key-value, document, column family, and graph databases. For each type, 1-2 specific databases are discussed in more detail, including their origins, features, and use cases. Key databases mentioned include Voldemort, CouchDB, MongoDB, HBase, Cassandra, and Neo4j. The document concludes with references for further reading on NoSQL databases and related topics.
Spark Streaming allows processing of live data streams in Spark. It integrates streaming data and batch processing within the same Spark application. Spark SQL provides a programming abstraction called DataFrames and can be used to query structured data in Spark. Structured Streaming in Spark 2.0 provides a high-level API for building streaming applications on top of Spark SQL's engine. It allows running the same queries on streaming data as on batch data and unifies streaming, interactive, and batch processing.
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020Timothy McAliley
Jim Boriotti presents an overview and demo of Azure Synapse Analytics, an integrated data platform for business intelligence, artificial intelligence, and continuous intelligence. Azure Synapse Analytics includes Synapse SQL for querying with T-SQL, Synapse Spark for notebooks in Python, Scala, and .NET, and Synapse Pipelines for data workflows. The demo shows how Azure Synapse Analytics provides a unified environment for all data tasks through the Synapse Studio interface.
CQRS and Event Sourcing for Java DevelopersMarkus Eisele
This document discusses CQRS and event sourcing patterns for Java developers. It begins with an overview of classical monolithic architectures versus modern microservice architectures. It then contrasts CRUD with CQRS, explaining that CQRS separates reads from writes by using commands for writes and queries for reads. Events evolve from commands and represent things that occurred in the past. The document provides an example implementation using the Lagom framework that demonstrates separating the write side from the read side and persisting events. It emphasizes that with this approach, all state changes are stored as events and the current state can be recreated by replaying events. The document encourages the use of Lagom due to benefits like asynchronous programming, developer productivity, and production readiness.
Real-Time Stock Market Analysis using Spark StreamingSigmoid
This document proposes using support vector machines (SVMs) to model high-frequency limit order book dynamics and predict metrics like mid-price movement and price spread crossing. It describes representing each limit order book entry as a vector of attributes, then using multi-class SVMs to build models for each metric. Experiments on real data show the selected features are effective for short-term price forecasts. The document provides background on SVMs, describing how they find an optimal separating hyperplane to classify data points into labels.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
This session covers how to work with PySpark interface to develop Spark applications. From loading, ingesting, and applying transformation on the data. The session covers how to work with different data sources of data, apply transformation, python best practices in developing Spark Apps. The demo covers integrating Apache Spark apps, In memory processing capabilities, working with notebooks, and integrating analytics tools into Spark Applications.
Spark started at Facebook as an experiment when the project was still in its early phases. Spark's appeal stemmed from its ease of use and an integrated environment to run SQL, MLlib, and custom applications. At that time the system was used by a handful of people to process small amounts of data. However, we've come a long way since then. Currently, Spark is one of the primary SQL engines at Facebook in addition to being the primary system for writing custom batch applications. This talk will cover the story of how we optimized, tuned and scaled Apache Spark at Facebook to run on 10s of thousands of machines, processing 100s of petabytes of data, and used by 1000s of data scientists, engineers and product analysts every day. In this talk, we'll focus on three areas: * *Scaling Compute*: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogenous machines in disaggregated (shared-storage) clusters. * *Optimizing Core Engine*: How we continuously tune, optimize and add features to the core engine in order to maximize the useful work done per second. * *Scaling Users:* How we make Spark easy to use, and faster to debug to seamlessly onboard new users.
Speakers: Ankit Agarwal, Sameer Agarwal
This document provides an overview of Apache Spark, an open-source unified analytics engine for large-scale data processing. It discusses Spark's core APIs including RDDs and transformations/actions. It also covers Spark SQL, Spark Streaming, MLlib, and GraphX. Spark provides a fast and general engine for big data processing, with explicit operations for streaming, SQL, machine learning, and graph processing. The document includes installation instructions and examples of using various Spark components.
This document provides an overview and introduction to NoSQL databases. It discusses key-value stores like Dynamo and BigTable, which are distributed, scalable databases that sacrifice complex queries for availability and performance. It also explains column-oriented databases like Cassandra that scale to massive workloads. The document compares the CAP theorem and consistency models of these databases and provides examples of their architectures, data models, and operations.
This document describes the MapReduce programming model for processing large datasets in a distributed manner. MapReduce allows users to write map and reduce functions that are automatically parallelized and run across large clusters. The input data is split and the map tasks run in parallel, producing intermediate key-value pairs. These are shuffled and input to the reduce tasks, which produce the final output. The system handles failures, scheduling and parallelization transparently, making it easy for programmers to write distributed applications.
This document provides an overview of a talk on Apache Spark. It introduces the speaker and their background. It acknowledges inspiration from a previous Spark training. It then outlines the structure of the talk, which will include: a brief history of big data; a tour of Spark including its advantages over MapReduce; and explanations of Spark concepts like RDDs, transformations, and actions. The document serves to introduce the topics that will be covered in the talk.
The document introduces data engineering and provides an overview of the topic. It discusses (1) what data engineering is, how it has evolved with big data, and the required skills, (2) the roles of data engineers, data scientists, and data analysts in working with big data, and (3) the structure and schedule of an upcoming meetup on data engineering that will use an agile approach over monthly sprints.
This document provides a summary of improvements made to Hive's performance through the use of Apache Tez and other optimizations. Some key points include:
- Hive was improved to use Apache Tez as its execution engine instead of MapReduce, reducing latency for interactive queries and improving throughput for batch queries.
- Statistics collection was optimized to gather column-level statistics from ORC file footers, speeding up statistics gathering.
- The cost-based optimizer Optiq was added to Hive, allowing it to choose better execution plans.
- Vectorized query processing, broadcast joins, dynamic partitioning, and other optimizations improved individual query performance by over 100x in some cases.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
Of all the developers’ delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs-RDDs, DataFrames, and Datasets-available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practices 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (this will be vocalization of the blog, along with the latest developments in Apache Spark 2.x Dataframe/Datasets and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
This session of the workshop introduces Spark SQL along with DataFrames, Datasets. Datasets give us the ability to easily intermix relational and functional style programming. So that we can explore the new Dataset API this iteration will be focused in Scala.
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
This document provides an agenda for a 3+ hour workshop on Apache Spark 2.x on Databricks. It includes introductions to Databricks, Spark fundamentals and architecture, new features in Spark 2.0 like unified APIs, and workshops on DataFrames/Datasets, Spark SQL, and structured streaming concepts. The agenda covers lunch and breaks and is divided into hour and half hour segments.
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
Spark Datasets are an evolution of Spark DataFrames which allow us to work with both functional and relational transformations on big data with the speed of Spark.
This chapter discusses Spark SQL, which allows querying Spark data with SQL. It covers initializing Spark SQL, loading data from sources like Hive, Parquet, JSON and RDDs, caching data, writing UDFs, and performance tuning. The JDBC server allows sharing cached tables and queries between programs. SchemaRDDs returned by queries or loaded from data represent the data structure that SQL queries operate on.
Cassandra and Spark, closing the gap between no sql and analytics codemotio...Duyhai Doan
This document discusses how Spark and Cassandra can be used together. It begins with an introduction to Spark and Cassandra individually, explaining their architectures and key features. It then details the Spark-Cassandra connector, describing how Cassandra tables can be exposed as Spark RDDs and DataFrames. Various use cases for Spark and Cassandra are presented, including data cleaning, schema migration, and analytics. The document emphasizes the importance of data locality when performing joins and writes between Spark and Cassandra. Code examples are provided for common tasks like data cleaning, migration, and analytics.
This document provides an introduction and overview of Apache Spark with Python (PySpark). It discusses key Spark concepts like RDDs, DataFrames, Spark SQL, Spark Streaming, GraphX, and MLlib. It includes code examples demonstrating how to work with data using PySpark for each of these concepts.
Building a modern Application with DataFramesDatabricks
The document discusses a meetup about building modern applications with DataFrames in Spark. It provides an agenda for the meetup that includes an introduction to Spark and DataFrames, a discussion of the Catalyst internals, and a demo. The document also provides background on Spark, noting its open source nature and large-scale usage by many organizations.
Building a modern Application with DataFramesSpark Summit
The document discusses a meetup about building modern applications with DataFrames in Spark. It provides an agenda for the meetup that includes an introduction to Spark and DataFrames, a discussion of the Catalyst internals, and a demo. The document also provides background on Spark, noting its open source nature and large-scale usage by many organizations.
The document discusses loading data into Spark SQL and the differences between DataFrame functions and SQL. It provides examples of loading data from files, cloud storage, and directly into DataFrames from JSON and Parquet files. It also demonstrates using SQL on DataFrames after registering them as temporary views. The document outlines how to load data into RDDs and convert them to DataFrames to enable SQL querying, as well as using SQL-like functions directly in the DataFrame API.
An Introduct to Spark - Atlanta Spark Meetupjlacefie
- Apache Spark is an open-source cluster computing framework that provides fast, in-memory processing for large-scale data analytics. It can run on Hadoop clusters and standalone.
- Spark allows processing of data using transformations and actions on resilient distributed datasets (RDDs). RDDs can be persisted in memory for faster processing.
- Spark comes with modules for SQL queries, machine learning, streaming, and graphs. Spark SQL allows SQL queries on structured data. MLib provides scalable machine learning. Spark Streaming processes live data streams.
- Apache Spark is an open-source cluster computing framework that provides fast, general engine for large-scale data processing. It introduces Resilient Distributed Datasets (RDDs) that allow in-memory processing for speed.
- The document discusses Spark's key concepts like transformations, actions, and directed acyclic graphs (DAGs) that represent Spark job execution. It also summarizes Spark SQL, MLlib, and Spark Streaming modules.
- The presenter is a solutions architect who provides an overview of Spark and how it addresses limitations of Hadoop by enabling faster, in-memory processing using RDDs and a more intuitive API compared to MapReduce.
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands-On Labs
This document provides an overview of Spark and using Spark on HDInsight. It discusses Spark concepts like RDDs, transformations, and actions. It also covers Spark extensions like Spark SQL, Spark Streaming, and MLlib. Finally, it highlights benefits of using Spark on HDInsight like integration with Azure services, scalability, and support.
Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. We will cover approaches of processing Big Data on Spark cluster for real time analytic, machine learning and iterative BI and also discuss the pros and cons of using Spark in Azure cloud.
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiData Con LA
Abstract:- Of all the developers delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs - RDDs, DataFrames, and Datasets available in Apache Spark 2.x. In particular, I will emphasize why and when you should use each set as best practices, outline its performance and optimization benefits, and underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them.
Building highly scalable data pipelines with Apache SparkMartin Toshev
This document provides a summary of Apache Spark, including:
- Spark is a framework for large-scale data processing across clusters that is faster than Hadoop by relying more on RAM and minimizing disk IO.
- Spark transformations operate on resilient distributed datasets (RDDs) to manipulate data, while actions return results to the driver program.
- Spark can receive data from various sources like files, databases, sockets through its datasource APIs and process both batch and streaming data.
- Spark streaming divides streaming data into micro-batches called DStreams and integrates with messaging systems like Kafka. Structured streaming is a newer API that works on DataFrames/Datasets.
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)Spark Summit
SparkR provides an R interface to Apache Spark that allows R developers to leverage Spark's distributed processing capabilities through DataFrames. DataFrames impose a schema on data from RDDs, making the data easier to access and manipulate compared to raw RDDs. DataFrames in SparkR allow R-like syntax and interactions with data while leveraging Spark's optimizations by passing operations to the JVM. The speaker demonstrated SparkR DataFrame features and discussed the project's roadmap, including expanding SparkR to support Spark machine learning capabilities.
The document describes the steps to configure Oracle sharding in an Oracle 12c environment. It includes installing Oracle software on shardcat, shard1, and shard2 nodes, creating an SCAT database, installing the GSM software, configuring the shard catalog, registering the shard nodes, creating a shard group and adding shards, deploying the shards to create databases on shard1 and shard2, verifying the shard configuration, creating a global service, and creating a sample schema and shard table to verify distribution across shards.
Similar to Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLab (20)
Understanding computer vision with Deep LearningCloudxLab
Computer vision is a branch of computer science which deals with recognising objects, people and identifying patterns in visuals. It is basically analogous to the vision of an animal.
Topics covered:
1. Overview of Machine Learning
2. Basics of Deep Learning
3. What is computer vision and its use-cases?
4. Various algorithms used in Computer Vision (mostly CNN)
5. Live hands-on demo of either Auto Cameraman or Face recognition system
6. What next?
This document provides an agenda for an introduction to deep learning presentation. It begins with an introduction to basic AI, machine learning, and deep learning terms. It then briefly discusses use cases of deep learning. The document outlines how to approach a deep learning problem, including which tools and algorithms to use. It concludes with a question and answer section.
This document discusses recurrent neural networks (RNNs) and their applications. It begins by explaining that RNNs can process input sequences of arbitrary lengths, unlike other neural networks. It then provides examples of RNN applications, such as predicting time series data, autonomous driving, natural language processing, and music generation. The document goes on to describe the fundamental concepts of RNNs, including recurrent neurons, memory cells, and different types of RNN architectures for processing input/output sequences. It concludes by demonstrating how to implement basic RNNs using TensorFlow's static_rnn function.
Natural Language Processing (NLP) is a field of artificial intelligence that deals with interactions between computers and human languages. NLP aims to program computers to process and analyze large amounts of natural language data. Some common NLP tasks include speech recognition, text classification, machine translation, question answering, and more. Popular NLP tools include Stanford CoreNLP, NLTK, OpenNLP, and TextBlob. Vectorization is commonly used to represent text in a way that can be used for machine learning algorithms like calculating text similarity. Tf-idf is a common technique used to weigh words based on their frequency and importance.
- Naive Bayes is a classification technique based on Bayes' theorem that uses "naive" independence assumptions. It is easy to build and can perform well even with large datasets.
- It works by calculating the posterior probability for each class given predictor values using the Bayes theorem and independence assumptions between predictors. The class with the highest posterior probability is predicted.
- It is commonly used for text classification, spam filtering, and sentiment analysis due to its fast performance and high success rates compared to other algorithms.
An autoencoder is an artificial neural network that is trained to copy its input to its output. It consists of an encoder that compresses the input into a lower-dimensional latent-space encoding, and a decoder that reconstructs the output from this encoding. Autoencoders are useful for dimensionality reduction, feature learning, and generative modeling. When constrained by limiting the latent space or adding noise, autoencoders are forced to learn efficient representations of the input data. For example, a linear autoencoder trained with mean squared error performs principal component analysis.
The document discusses challenges in training deep neural networks and solutions to those challenges. Training deep neural networks with many layers and parameters can be slow and prone to overfitting. A key challenge is the vanishing gradient problem, where the gradients shrink exponentially small as they propagate through many layers, making earlier layers very slow to train. Solutions include using initialization techniques like He initialization and activation functions like ReLU and leaky ReLU that do not saturate, preventing gradients from vanishing. Later improvements include the ELU activation function.
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/5u2RiS )
This CloudxLab Reinforcement Learning tutorial helps you to understand Reinforcement Learning in detail. Below are the topics covered in this tutorial:
1) What is Reinforcement?
2) Reinforcement Learning an Introduction
3) Reinforcement Learning Example
4) Learning to Optimize Rewards
5) Policy Search - Brute Force Approach, Genetic Algorithms and Optimization Techniques
6) OpenAI Gym
7) The Credit Assignment Problem
8) Inverse Reinforcement Learning
9) Playing Atari with Deep Reinforcement Learning
10) Policy Gradients
11) Markov Decision Processes
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...CloudxLab
The document provides information about key-value RDD transformations and actions in Spark. It defines transformations like keys(), values(), groupByKey(), combineByKey(), sortByKey(), subtractByKey(), join(), leftOuterJoin(), rightOuterJoin(), and cogroup(). It also defines actions like countByKey() and lookup() that can be performed on pair RDDs. Examples are given showing how to use these transformations and actions to manipulate key-value RDDs.
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2kyRTuW
This CloudxLab Advanced Spark Programming tutorial helps you to understand Advanced Spark Programming in detail. Below are the topics covered in this slide:
1) Shared Variables - Accumulators & Broadcast Variables
2) Accumulators and Fault Tolerance
3) Custom Accumulators - Version 1.x & Version 2.x
4) Examples of Broadcast Variables
5) Key Performance Considerations - Level of Parallelism
6) Serialization Format - Kryo
7) Memory Management
8) Hardware Provisioning
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2sm9c61
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Loading XML
2) What is RPC - Remote Process Call
3) Loading AVRO
4) Data Sources - Parquet
5) Creating DataFrames From Hive Table
6) Setting up Distributed SQL Engine
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
(Big Data with Hadoop & Spark Training: http://bit.ly/2IUsWca
This CloudxLab Running in a Cluster tutorial helps you to understand running Spark in the cluster in detail. Below are the topics covered in this tutorial:
1) Spark Runtime Architecture
2) Driver Node
3) Scheduling Tasks on Executors
4) Understanding the Architecture
5) Cluster Managers
6) Executors
7) Launching a Program using spark-submit
8) Local Mode & Cluster-Mode
9) Installing Standalone Cluster
10) Cluster Mode - YARN
11) Launching a Program on YARN
12) Cluster Mode - Mesos and AWS EC2
13) Deployment Modes - Client and Cluster
14) Which Cluster Manager to Use?
15) Common flags for spark-submit
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2LCTufA
This CloudxLab Introduction to SparkR tutorial helps you to understand SparkR in detail. Below are the topics covered in this tutorial:
1) SparkR (R on Spark)
2) SparkR DataFrames
3) Launch SparkR
4) Creating DataFrames from Local DataFrames
5) DataFrame Operation
6) Creating DataFrames - From JSON
7) Running SQL Queries from SparkR
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
1) NoSQL databases are non-relational and schema-free, providing alternatives to SQL databases for big data and high availability applications.
2) Common NoSQL database models include key-value stores, column-oriented databases, document databases, and graph databases.
3) The CAP theorem states that a distributed data store can only provide two out of three guarantees around consistency, availability, and partition tolerance.
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2sh5b3E
This CloudxLab Hadoop Streaming tutorial helps you to understand Hadoop Streaming in detail. Below are the topics covered in this tutorial:
1) Hadoop Streaming and Why Do We Need it?
2) Writing Streaming Jobs
3) Testing Streaming jobs and Hands-on on CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabCloudxLab
This document provides instructions for getting started with TensorFlow using a free CloudxLab. It outlines the following steps:
1. Open CloudxLab and enroll if not already enrolled. Otherwise go to "My Lab".
2. In "My Lab", open Jupyter and run commands to clone an ML repository containing TensorFlow examples.
3. Go to the deep learning folder in Jupyter and open the TensorFlow notebook to get started with examples.
Introduction to Deep Learning | CloudxLabCloudxLab
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/goQxnL )
This CloudxLab Deep Learning tutorial helps you to understand Deep Learning in detail. Below are the topics covered in this tutorial:
1) What is Deep Learning
2) Deep Learning Applications
3) Artificial Neural Network
4) Deep Learning Neural Networks
5) Deep Learning Frameworks
6) AI vs Machine Learning
In this tutorial, we will learn the the following topics -
+ The Curse of Dimensionality
+ Main Approaches for Dimensionality Reduction
+ PCA - Principal Component Analysis
+ Kernel PCA
+ LLE
+ Other Dimensionality Reduction Techniques
In this tutorial, we will learn the the following topics -
+ Voting Classifiers
+ Bagging and Pasting
+ Random Patches and Random Subspaces
+ Random Forests
+ Boosting
+ Stacking
In this tutorial, we will learn the the following topics -
+ Training and Visualizing a Decision Tree
+ Making Predictions
+ Estimating Class Probabilities
+ The CART Training Algorithm
+ Computational Complexity
+ Gini Impurity or Entropy?
+ Regularization Hyperparameters
+ Regression
+ Instability
Measuring the Impact of Network Latency at TwitterScyllaDB
Widya Salim and Victor Ma will outline the causal impact analysis, framework, and key learnings used to quantify the impact of reducing Twitter's network latency.
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxSynapseIndia
Your comprehensive guide to RPA in healthcare for 2024. Explore the benefits, use cases, and emerging trends of robotic process automation. Understand the challenges and prepare for the future of healthcare automation
The DealBook is our annual overview of the Ukrainian tech investment industry. This edition comprehensively covers the full year 2023 and the first deals of 2024.
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Erasmo Purificato
Slide of the tutorial entitled "Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Emerging Trends" held at UMAP'24: 32nd ACM Conference on User Modeling, Adaptation and Personalization (July 1, 2024 | Cagliari, Italy)
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionBert Blevins
Cybersecurity is a major concern in today's connected digital world. Threats to organizations are constantly evolving and have the potential to compromise sensitive information, disrupt operations, and lead to significant financial losses. Traditional cybersecurity techniques often fall short against modern attackers. Therefore, advanced techniques for cyber security analysis and anomaly detection are essential for protecting digital assets. This blog explores these cutting-edge methods, providing a comprehensive overview of their application and importance.
Best Practices for Effectively Running dbt in Airflow.pdfTatiana Al-Chueyr
As a popular open-source library for analytics engineering, dbt is often used in combination with Airflow. Orchestrating and executing dbt models as DAGs ensures an additional layer of control over tasks, observability, and provides a reliable, scalable environment to run dbt models.
This webinar will cover a step-by-step guide to Cosmos, an open source package from Astronomer that helps you easily run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code. We’ll walk through:
- Standard ways of running dbt (and when to utilize other methods)
- How Cosmos can be used to run and visualize your dbt projects in Airflow
- Common challenges and how to address them, including performance, dependency conflicts, and more
- How running dbt projects in Airflow helps with cost optimization
Webinar given on 9 July 2024
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc
Six months into 2024, and it is clear the privacy ecosystem takes no days off!! Regulators continue to implement and enforce new regulations, businesses strive to meet requirements, and technology advances like AI have privacy professionals scratching their heads about managing risk.
What can we learn about the first six months of data privacy trends and events in 2024? How should this inform your privacy program management for the rest of the year?
Join TrustArc, Goodwin, and Snyk privacy experts as they discuss the changes we’ve seen in the first half of 2024 and gain insight into the concrete, actionable steps you can take to up-level your privacy program in the second half of the year.
This webinar will review:
- Key changes to privacy regulations in 2024
- Key themes in privacy and data governance in 2024
- How to maximize your privacy program in the second half of 2024
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...Bert Blevins
Today’s digitally connected world presents a wide range of security challenges for enterprises. Insider security threats are particularly noteworthy because they have the potential to cause significant harm. Unlike external threats, insider risks originate from within the company, making them more subtle and challenging to identify. This blog aims to provide a comprehensive understanding of insider security threats, including their types, examples, effects, and mitigation techniques.
Choose our Linux Web Hosting for a seamless and successful online presencerajancomputerfbd
Our Linux Web Hosting plans offer unbeatable performance, security, and scalability, ensuring your website runs smoothly and efficiently.
Visit- https://onliveserver.com/linux-web-hosting/
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfjackson110191
These fighter aircraft have uses outside of traditional combat situations. They are essential in defending India's territorial integrity, averting dangers, and delivering aid to those in need during natural calamities. Additionally, the IAF improves its interoperability and fortifies international military alliances by working together and conducting joint exercises with other air forces.
An invited talk given by Mark Billinghurst on Research Directions for Cross Reality Interfaces. This was given on July 2nd 2024 as part of the 2024 Summer School on Cross Reality in Hagenberg, Austria (July 1st - 7th)
Best Programming Language for Civil EngineersAwais Yaseen
The integration of programming into civil engineering is transforming the industry. We can design complex infrastructure projects and analyse large datasets. Imagine revolutionizing the way we build our cities and infrastructure, all by the power of coding. Programming skills are no longer just a bonus—they’re a game changer in this era.
Technology is revolutionizing civil engineering by integrating advanced tools and techniques. Programming allows for the automation of repetitive tasks, enhancing the accuracy of designs, simulations, and analyses. With the advent of artificial intelligence and machine learning, engineers can now predict structural behaviors under various conditions, optimize material usage, and improve project planning.
Transcript: Details of description part II: Describing images in practice - T...BookNet Canada
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and slides: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
Implementations of Fused Deposition Modeling in real worldEmerging Tech
The presentation showcases the diverse real-world applications of Fused Deposition Modeling (FDM) across multiple industries:
1. **Manufacturing**: FDM is utilized in manufacturing for rapid prototyping, creating custom tools and fixtures, and producing functional end-use parts. Companies leverage its cost-effectiveness and flexibility to streamline production processes.
2. **Medical**: In the medical field, FDM is used to create patient-specific anatomical models, surgical guides, and prosthetics. Its ability to produce precise and biocompatible parts supports advancements in personalized healthcare solutions.
3. **Education**: FDM plays a crucial role in education by enabling students to learn about design and engineering through hands-on 3D printing projects. It promotes innovation and practical skill development in STEM disciplines.
4. **Science**: Researchers use FDM to prototype equipment for scientific experiments, build custom laboratory tools, and create models for visualization and testing purposes. It facilitates rapid iteration and customization in scientific endeavors.
5. **Automotive**: Automotive manufacturers employ FDM for prototyping vehicle components, tooling for assembly lines, and customized parts. It speeds up the design validation process and enhances efficiency in automotive engineering.
6. **Consumer Electronics**: FDM is utilized in consumer electronics for designing and prototyping product enclosures, casings, and internal components. It enables rapid iteration and customization to meet evolving consumer demands.
7. **Robotics**: Robotics engineers leverage FDM to prototype robot parts, create lightweight and durable components, and customize robot designs for specific applications. It supports innovation and optimization in robotic systems.
8. **Aerospace**: In aerospace, FDM is used to manufacture lightweight parts, complex geometries, and prototypes of aircraft components. It contributes to cost reduction, faster production cycles, and weight savings in aerospace engineering.
9. **Architecture**: Architects utilize FDM for creating detailed architectural models, prototypes of building components, and intricate designs. It aids in visualizing concepts, testing structural integrity, and communicating design ideas effectively.
Each industry example demonstrates how FDM enhances innovation, accelerates product development, and addresses specific challenges through advanced manufacturing capabilities.
Blockchain technology is transforming industries and reshaping the way we conduct business, manage data, and secure transactions. Whether you're new to blockchain or looking to deepen your knowledge, our guidebook, "Blockchain for Dummies", is your ultimate resource.
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...Toru Tamaki
Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, Philip Torr "A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models" arXiv2023
https://arxiv.org/abs/2307.12980
How RPA Help in the Transportation and Logistics Industry.pptxSynapseIndia
Revolutionize your transportation processes with our cutting-edge RPA software. Automate repetitive tasks, reduce costs, and enhance efficiency in the logistics sector with our advanced solutions.
4. Spark SQL & Dataframes
Uniform Data Access
○ Source:
■ HDFS,
■ Hive
■ Relational Databases
○ Avro, Parquet, ORC, JSON
○ You can even join data across these sources.
○ Hive Compatibility
○ Standard Connectivity
Spark SQL
5. Spark SQL & Dataframes
DataFrames
1 sandeep
2 ted
3 thomas
4 priya
5 kush
RDD
Unstructured
Need code for processing
6. Spark SQL & Dataframes
DataFrames
1 sandeep
2 ted
3 thomas
4 priya
5 kush
RDD
ID Name
1 sandeep
2 ted
3 thomas
4 priya
5 kush
Data Frame
Unstructured Structured
Need code for processing Can use SQL or R like syntax:
df.sql("select Id where name = 'priya'")
head(where(df, df$ID > 21))
7. Spark SQL & Dataframes
● Collection with named columns
● Distributed
● <> Same as database table
● <> A data frame in R/Python
Data Frames
col1 col2 col3 Partition1
Partition2
8. Spark SQL & Dataframes
Data Frames
col1 col2 col3 Partition1
Partition2
Structured data:
CSV, JSON
Hive RDBMS RDDs
Can be constructed from
9. Spark SQL & Dataframes
Data Frames
DataFrame API is available in
col1 col2 col3 Partition1
Partition2
10. Spark SQL & Dataframes
● Available in Spark 2.0x onwards.
● Using usual interfaces
○ Spark-shell
○ Spark Application
○ Pyspark
○ Java
○ etc.
Getting Started
12. Spark SQL & Dataframes
$ export HADOOP_CONF_DIR=/etc/hadoop/conf/
$ export YARN_CONF_DIR=/etc/hadoop/conf/
$ ls /usr/
bin games include jdk64 lib64 local share spark1.6
spark2.0.2 tmp etc hdp java lib libexec sbin spark1.2.1
spark2.0.1 src
Getting Started
13. Spark SQL & Dataframes
$ export HADOOP_CONF_DIR=/etc/hadoop/conf/
$ export YARN_CONF_DIR=/etc/hadoop/conf/
$ ls /usr/
bin games include jdk64 lib64 local share spark1.6
spark2.0.2 tmp etc hdp java lib libexec sbin spark1.2.1
spark2.0.1 src
$ /usr/spark2.0.2/bin/spark-shell
Getting Started
14. Spark SQL & Dataframes
Getting Started
Spark context Web UI available at http://172.31.60.179:4040
Spark context available as 'sc' (master = local[*], app id = local-1498489557917).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_ version 2.0.2
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
15. Spark SQL & Dataframes
Getting Started
Spark context Web UI available at http://172.31.60.179:4040
Spark context available as 'sc' (master = local[*], app id = local-1498489557917).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_ version 2.0.2
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
27. Spark SQL & Dataframes
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
Running SQL Queries Programmatically
28. Spark SQL & Dataframes
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
Running SQL Queries Programmatically
29. Spark SQL & Dataframes
Running SQL Queries Programmatically
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
30. Spark SQL & Dataframes
Datasets
● Similar to RDDs
● instead Java serialization or Kryo
● use a specialized Encoder
● use Encoder to serialize
Encoders
● Are dynamically generated code
● Perform operations with deserializing
Datasets
31. Spark SQL & Dataframes
// Encoders for most common types are automatically
provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)
Creating Datasets
32. Spark SQL & Dataframes
case class Person(name: String, age: Long)
// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+
Creating Datasets
34. Spark SQL & Dataframes
Interoperating with RDDs
How to convert an RDD into dataframe?
RDD
ID Name
1 sandeep
2 ted
3 thomas
4 priya
5 kush
Data Frame
1 sandeep
2 ted
3 thomas
4 priya
5 kush
35. Spark SQL & Dataframes
Interoperating with RDDs
Two ways to convert RDDs to DF:
a. Inferring the Schema Using Reflection
b.
RDD
ID Name
1 sandeep
2 ted
3 thomas
4 priya
5 kush
Data Frame
1 sandeep
2 ted
3 thomas
4 priya
5 kush
36. Spark SQL & Dataframes
Interoperating with RDDs
Two ways to convert RDDs to DF:
a. Inferring the Schema Using Reflection
b. Programmatically Specifying the Schema
RDD
ID Name
1 sandeep
2 ted
3 thomas
4 priya
5 kush
Data Frame
1 sandeep
2 ted
3 thomas
4 priya
5 kush
37. Spark SQL & Dataframes
Inferring the Schema Using Reflection
● Spark SQL can convert an RDDs with case classes to a DataFrame
● The names of case class arguments are read using reflection and become columns
● Case classes can be nested or contain complex types
● Let us try to convert people.txt into dataframe
people.txt:
Michael, 29
Andy, 30
Justin, 19
38. Spark SQL & Dataframes
Inferring the Schema Using Reflection
https://github.com/cloudxlab/bigdata/blob/master/spark/examples/dataframes/rdd_to_df.scala
39. Spark SQL & Dataframes
scala> import spark.implicits._
import spark.implicits._
Inferring the Schema Using Reflection
40. Spark SQL & Dataframes
scala> import spark.implicits._
import spark.implicits._
scala> case class Person(name: String, age: Long)
defined class Person
Inferring the Schema Using Reflection
41. Spark SQL & Dataframes
scala> import spark.implicits._
import spark.implicits._
scala> case class Person(name: String, age: Long)
defined class Person
scala> val textRDD = sc.textFile("/data/spark/people.txt")
textRDD: org.apache.spark.rdd.RDD[String] = /data/spark/people.txt
MapPartitionsRDD[3] at textFile at <console>:30
Inferring the Schema Using Reflection
42. Spark SQL & Dataframes
scala> import spark.implicits._
import spark.implicits._
scala> case class Person(name: String, age: Long)
defined class Person
scala> val textRDD = sc.textFile("/data/spark/people.txt")
textRDD: org.apache.spark.rdd.RDD[String] = /data/spark/people.txt
MapPartitionsRDD[3] at textFile at <console>:30
scala> val arrayRDD = textRDD.map(_.split(","))
arrayRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[4] at
map at <console>:32
Inferring the Schema Using Reflection
43. Spark SQL & Dataframes
scala> import spark.implicits._
import spark.implicits._
scala> case class Person(name: String, age: Long)
defined class Person
scala> val textRDD = sc.textFile("/data/spark/people.txt")
textRDD: org.apache.spark.rdd.RDD[String] = /data/spark/people.txt
MapPartitionsRDD[3] at textFile at <console>:30
scala> val arrayRDD = textRDD.map(_.split(","))
arrayRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[4] at
map at <console>:32
scala> val personRDD = arrayRDD.map(attributes => Person(attributes(0),
attributes(1).trim.toInt))
personRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[5] at map
at <console>:36
Inferring the Schema Using Reflection
44. Spark SQL & Dataframes
scala> val peopleDF = personRDD.toDF()
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
Inferring the Schema Using Reflection
46. Spark SQL & Dataframes
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")
// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND
19")
Inferring the Schema Using Reflection
47. Spark SQL & Dataframes
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")
// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND
19")
// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+
Inferring the Schema Using Reflection
48. Spark SQL & Dataframes
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")
// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND
19")
// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+
// or by field name
teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+
Inferring the Schema Using Reflection
49. Spark SQL & Dataframes
Inferring the Schema Using Reflection
// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
// Array(Map("name" -> "Justin", "age" -> 19))
50. Spark SQL & Dataframes
● When case classes can't be defined during time of coding
a. E.g. The fields expected in case classes are passed as arguments
● We need to programmatically create the dataframe:
Programmatically Specifying the Schema
51. Spark SQL & Dataframes
Programmatically Specifying the Schema
● When case classes can't be defined during time of coding
a. E.g. The fields expected in case classes are passed as arguments
● We need to programmatically create the dataframe:
1. Create RDD of Row objects
2. Create schema represented by StructType
3. Apply schema with createDataFrame
54. Spark SQL & Dataframes
Programmatically Specifying the Schema
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// The schema is encoded in a string
// User provided variable
val schemaString = "name age"
val filename = "/data/spark/people.txt"
val fieldsArray = schemaString.split(" ")
val fields = fieldsArray.map(
f => StructField(f, StringType, nullable = true)
)
val schema = StructType(fields)
55. Spark SQL & Dataframes
Programmatically Specifying the Schema
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// The schema is encoded in a string
// User provided variable
val schemaString = "name age"
val filename = "/data/spark/people.txt"
val fieldsArray = schemaString.split(" ")
val fields = fieldsArray.map(
f => StructField(f, StringType, nullable = true)
)
val schema = StructType(fields)
val peopleRDD = spark.sparkContext.textFile(filename)
val rowRDD = peopleRDD.map(_.split(",")).map(
attributes => Row.fromSeq(attributes)
)
val peopleDF = spark.createDataFrame(rowRDD, schema)
56. Spark SQL & Dataframes
Programmatically Specifying the Schema
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// The schema is encoded in a string
// User provided variable
val schemaString = "name age"
val filename = "/data/spark/people.txt"
val fieldsArray = schemaString.split(" ")
val fields = fieldsArray.map(
f => StructField(f, StringType, nullable = true)
)
val schema = StructType(fields)
val peopleRDD = spark.sparkContext.textFile(filename)
val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row.fromSeq(attributes))
val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.show()
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+