Z-Platform is a new, powerful, and complex platform that ingests data of any kind, stores it as JSON documents in MongoDB, and maintains a sparse representation of the same data in a Neo4j graph database. Mahesh discusses how he tackled deadlocks and significantly improved the performance of the system. The test environment included small graphs (up to 10,000 relationships) and very large graphs (up to 39 million relationships). The average performance of the system is 3741 relationships per minute.
This document discusses using Redis to solve real world problems. It provides 4 patterns for using Redis: [1] as a simple, fast object store using hashes; [2] indexing objects with sorted sets; [3] using bitmaps for unique value counting; and [4] resolving locations with geohashing. Each pattern is explained and code examples are provided to illustrate how to model problems in Redis and leverage its data structures and performance. A variety of use cases are presented that can benefit from each pattern.
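As a rough illustration of pattern [3], the sketch below counts unique visitors with one bit per user ID. It is a pure-Python stand-in, not the code from the presentation: the `Bitmap` class and the `daily_visitors` name are hypothetical, and the two methods only mimic the behaviour of the Redis `SETBIT` and `BITCOUNT` commands, assuming user IDs are small non-negative integers.

```python
class Bitmap:
    """In-memory stand-in for a Redis string used as a bitmap (hypothetical helper)."""

    def __init__(self):
        self._bits = bytearray()

    def setbit(self, offset, value):
        """Set the bit at `offset` and return its previous value, like SETBIT."""
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(self._bits):
            # Grow with zero bytes so the target bit exists.
            self._bits.extend(b"\x00" * (byte_index + 1 - len(self._bits)))
        mask = 0x80 >> bit_index  # bit 0 is the most significant bit of byte 0
        previous = 1 if self._bits[byte_index] & mask else 0
        if value:
            self._bits[byte_index] |= mask
        else:
            self._bits[byte_index] &= ~mask & 0xFF
        return previous

    def bitcount(self):
        """Count the bits that are set, like BITCOUNT."""
        return sum(bin(byte).count("1") for byte in self._bits)


# Each visit sets the bit at the visitor's ID; repeat visits change nothing.
daily_visitors = Bitmap()
for user_id in [3, 17, 3, 42, 17, 3]:
    daily_visitors.setbit(user_id, 1)

unique_visitors = daily_visitors.bitcount()
```

Six visits from three distinct users leave three bits set, so the count is independent of how often each user returns. Memory grows with the largest ID rather than with the number of visits.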
An introduction to Apache Spark, covering its architecture, resilient distributed datasets, and how it works.
The document provides an overview of MongoDB sharding, including:
- Sharding allows horizontal scaling of data by partitioning a database across multiple servers or shards.
- The MongoDB sharding architecture consists of shards to hold data, config servers to store metadata, and mongos processes to route requests.
- Data is partitioned into chunks based on a shard key and chunks can move between shards as the data distribution changes.
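The routing idea in this overview can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not MongoDB's implementation: the chunk boundaries, shard names, and the `route` and `move_chunk` helpers are all hypothetical, and the shard key is assumed to be a single integer with range-based chunks (inclusive lower bound, exclusive upper bound).

```python
from bisect import bisect_right

# Hypothetical chunk metadata, as a config server might hold it.
# Chunk i covers [chunk_lower_bounds[i], chunk_lower_bounds[i + 1]).
chunk_lower_bounds = [float("-inf"), 100, 200, 300]
chunk_owner = ["shard0", "shard1", "shard0", "shard2"]


def find_chunk(shard_key_value):
    """Return the index of the chunk whose range contains the key value."""
    return bisect_right(chunk_lower_bounds, shard_key_value) - 1


def route(shard_key_value):
    """Return the shard a router would send this key value to."""
    return chunk_owner[find_chunk(shard_key_value)]


def move_chunk(chunk_index, new_shard):
    """Reassign a chunk to another shard; only the metadata changes here."""
    chunk_owner[chunk_index] = new_shard


before = route(150)          # chunk [100, 200) lives on shard1
move_chunk(find_chunk(150), "shard2")
after = route(150)           # same key, now routed to shard2
```

The point of the sketch is that routing depends only on the chunk map, so moving a chunk changes where a key is sent without changing the key or the chunk boundaries.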
This presentation about Apache Spark covers the basics that a beginner needs to get started with Spark: the history of Apache Spark, what Spark is, and the difference between Hadoop and Spark. You will learn the different components of Spark and how Spark works, with the help of its architecture, and you will understand the different cluster managers on which Spark can run. Finally, you will see various applications of Spark and a use case at Conviva.
The following topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark use case
What is this Big Data Hadoop training course about? The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives? Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn? By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
Netflix’s Big Data Platform team manages a data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. At this scale, output committers that create extra copies or can’t handle task failures are no longer practical. This talk will explain the problems that are caused by the available committers when writing to S3, and show how Netflix solved the committer problem. In this session, you’ll learn:
- Some background about Spark at Netflix
- About output committers, and how both Spark and Hadoop handle failures
- How HDFS and S3 differ, and why HDFS committers don’t work well
- A new output committer that uses the S3 multi-part upload API
- How you can use this new committer in your Spark applications to avoid duplicating data
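The general idea behind committing through multipart uploads can be shown with a toy model. The sketch below is not Netflix's committer and makes no S3 calls: `FakeObjectStore` and `run_job` are hypothetical names, and the only property borrowed from S3 is that a multipart upload stays invisible until it is completed and leaves nothing behind if it is aborted.

```python
class FakeObjectStore:
    """Toy object store: an upload becomes visible only once it is completed."""

    def __init__(self):
        self.visible = {}    # key -> bytes that readers can see
        self._pending = {}   # upload_id -> (key, list of uploaded parts)
        self._next_id = 0

    def start_upload(self, key):
        self._next_id += 1
        self._pending[self._next_id] = (key, [])
        return self._next_id

    def upload_part(self, upload_id, data):
        self._pending[upload_id][1].append(data)

    def complete_upload(self, upload_id):
        key, parts = self._pending.pop(upload_id)
        self.visible[key] = b"".join(parts)

    def abort_upload(self, upload_id):
        self._pending.pop(upload_id, None)


def run_job(store, tasks):
    """Run tasks given as (key, data, succeeds); return True if the job committed."""
    pending = []
    for key, data, succeeds in tasks:
        upload_id = store.start_upload(key)
        store.upload_part(upload_id, data)
        pending.append(upload_id)
        if not succeeds:
            # A failed task aborts the whole job: nothing becomes visible.
            for uid in pending:
                store.abort_upload(uid)
            return False
    # Job commit: completing the uploads publishes the data without copying it.
    for uid in pending:
        store.complete_upload(uid)
    return True


store = FakeObjectStore()
failed = run_job(store, [("part-0", b"a", True), ("part-1", b"b", False)])
visible_after_failure = dict(store.visible)
committed = run_job(store, [("part-0", b"a", True), ("part-1", b"b", True)])
```

In this model a failed job leaves no partial output, and a successful job publishes each file exactly once with no rename or copy step.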
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
The architectural tradeoffs between the map/reduce paradigm and parallel databases have been a long and open discussion since the dawn of MapReduce more than a decade ago. At Facebook, we have spent the past several years independently building and scaling both Presto and Spark to Facebook-scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto’s state-of-the-art low-latency evaluation with Spark’s robust and fault-tolerant execution engine.
The document discusses Apache Spark, an open source cluster computing framework for real-time data processing. It notes that Spark is up to 100 times faster than Hadoop for in-memory processing and 10 times faster on disk. The main feature of Spark is its in-memory cluster computing capability, which increases processing speeds. Spark runs on a driver-executor model and uses resilient distributed datasets and directed acyclic graphs to process data in parallel across a cluster.
This document provides an introduction and overview of Apache NiFi 1.11.4. It discusses new features such as improved support for partitions in Azure Event Hubs, encrypted repositories, class loader isolation, and support for IBM MQ and the Hortonworks Schema Registry. It also summarizes new reporting tasks, controller services, and processors. Additional features include JDK 11 support and parameter improvements to support CI/CD. The document provides examples of using NiFi with Docker, Kubernetes, and in the cloud. It concludes with useful links for additional NiFi resources.
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and NiFi. There are also many new tools that are built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for. This talk will discuss the details of the ORC and Parquet formats and what the relevant tradeoffs are. In particular, it will discuss how to format your data and the options to use to maximize your read performance, including when and how to use ORC’s schema evolution, bloom filters, and predicate push down. It will also show you how to use the tools to translate ORC files into human-readable formats, such as JSON, and display the rich metadata from the file, including the types in the file and the min, max, and count for each column.
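To make the link between per-column min/max metadata and predicate push down concrete, here is a minimal sketch in plain Python. It does not use the ORC library or its file layout: the `Stripe` class, `make_stripe`, and `scan_equal` are hypothetical, and each "stripe" is just a list of integers with its minimum and maximum recorded at write time.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    """A block of column values plus the statistics stored alongside it."""
    rows: list
    minimum: int
    maximum: int


def make_stripe(values):
    """Record min and max when the stripe is written."""
    return Stripe(rows=list(values), minimum=min(values), maximum=max(values))


def scan_equal(stripes, target):
    """Return the matching values and how many stripes had to be read."""
    matches = []
    stripes_read = 0
    for stripe in stripes:
        # Skip the stripe when its statistics prove no row can match.
        if target < stripe.minimum or target > stripe.maximum:
            continue
        stripes_read += 1
        matches.extend(value for value in stripe.rows if value == target)
    return matches, stripes_read


stripes = [make_stripe([1, 4, 9]), make_stripe([10, 15, 15]), make_stripe([20, 31])]
matches, stripes_read = scan_equal(stripes, 15)
```

Only one of the three stripes is read for this lookup. The skipping works best when the data is sorted or clustered on the filtered column, because the min/max ranges of different stripes then overlap less.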
The document is a presentation about using Apache Arrow to improve the speed of graph queries and analytics. It discusses how Arrow uses columnar data formats and vectorization to enable faster data processing. It also provides an example of how Arrow could be incorporated into Spark jobs and used with Beam to perform embarrassingly parallel graph processing. The presentation envisions future Neo4j integrations that would make Arrow processing transparent to data scientists.
This presentation on Oracle Real Application Clusters 19c covers best practices and new features for upgrading to Oracle 19c. It discusses upgrading Oracle RAC to Linux 7 with minimal downtime using node draining and relocation techniques. Oracle 19c allows for upgrading the Grid Infrastructure management repository and patching faster using a new Oracle home. The presentation also covers new resource modeling for PDBs in Oracle 19c and improved Clusterware diagnostics.
So you know you want to write a streaming app, but any developer of a non-trivial streaming app has to think about these questions:
- How do I manage offsets?
- How do I manage state?
- How do I make my Spark Streaming job resilient to failures? Can I avoid some failures?
- How do I gracefully shut down my streaming job?
- How do I monitor and manage my streaming job (e.g. retry logic)?
- How can I better manage the DAG in my streaming job?
- When do I use checkpointing, and for what? When should I not use checkpointing?
- Do I need a WAL when using a streaming data source? Why? When don’t I need one?
This session will share practices that no one talks about when you start writing your streaming app, but you’ll inevitably need to learn along the way.
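The offset question can be illustrated without any streaming framework. The sketch below is one common approach, not the talk's recommendation and not Spark Streaming code: `process_batch`, the `log` list, and the failing `handler` are all hypothetical, and the offset is advanced only after a whole batch has been handled, which gives at-least-once delivery.

```python
def process_batch(log, committed_offset, batch_size, handler):
    """Handle one batch and return the new offset; raise without advancing on failure."""
    batch = log[committed_offset:committed_offset + batch_size]
    for record in batch:
        handler(record)  # if this raises, the caller keeps the old offset
    return committed_offset + len(batch)


log = ["a", "b", "c", "d"]
seen = []
remaining_failures = {"d": 1}  # simulate one transient failure on record "d"


def handler(record):
    if remaining_failures.get(record, 0) > 0:
        remaining_failures[record] -= 1
        raise RuntimeError(f"transient failure on {record!r}")
    seen.append(record)


offset = 0
while offset < len(log):
    try:
        offset = process_batch(log, offset, 2, handler)
    except RuntimeError:
        pass  # offset was not advanced, so the same batch is retried
```

After the retry, `seen` contains `"c"` twice: the record was handled before the failure and again when the batch was replayed. This is why committing offsets after processing usually needs idempotent output or deduplication downstream.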
Slides cover core concepts of Apache Spark such as RDD, DAG, execution workflow, forming stages of tasks, and shuffle implementation, and also describe the architecture and main components of the Spark Driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo that contains example Spark applications and a dockerized Hadoop environment to experiment with.
Introduction: This workshop will provide a hands-on introduction to Apache Spark using the HDP Sandbox on students’ personal machines. Format: A short introductory lecture about Apache Spark components used in the lab followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions. Objective: To provide a quick and short hands-on introduction to Apache Spark. This lab will use the following Spark and Apache Hadoop components: Spark, Spark SQL, Apache Hadoop HDFS, Apache Hadoop YARN, Apache ORC, and Apache Ambari User Views. You will learn how to move data into HDFS using Spark APIs, create Apache Hive tables, explore the data with Spark and Spark SQL, transform the data and then issue some SQL queries. Pre-requisites: Registrants must bring a laptop that can run the Hortonworks Data Cloud. Speaker: Robert Hryniewicz, Developer Advocate, Hortonworks
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames, and some other high-level stuff, and can be used as an introduction to Apache Spark.
Dashtable is a hashtable implementation inside Dragonfly. It supports incremental resizes and fast, cache-friendly operations. In this talk, we will learn how Dashtable helps Dragonfly keep its tail latency in check. In Dashtable, long-tail latencies have been reduced by a factor of 1000, but P999 latencies are 7x longer. Find out why we still think this is a good tradeoff.
Keynote presentation at GraML 2018, The Second Workshop on the Intersection of Graph Algorithms and Machine Learning, Co-located with IPDPS 2018.