NoSQL databases are non-relational and schema-free, providing alternatives to SQL databases for big data and high-availability applications. Common NoSQL data models include key-value stores, column-oriented databases, document databases, and graph databases. The CAP theorem states that a distributed data store can simultaneously provide at most two of three guarantees: consistency, availability, and partition tolerance.
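The key-value model mentioned above can be sketched as a minimal in-memory store. This is a toy illustration of the data model only, not the API of any particular database:

```python
class KeyValueStore:
    """Toy in-memory key-value store illustrating the NoSQL key-value model."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Values are opaque to the store: no schema is enforced.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:42", {"name": "Ada", "visits": 3})  # value can be any structure
print(store.get("user:42")["name"])  # Ada
```

Because values are opaque, two keys can hold entirely different structures, which is what "schema-free" means in practice.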
A quick introduction to the moving parts inside Cassandra, plus essential commands and tasks for system administrators.
This talk will introduce TeraCache, a new scalable cache for Spark that avoids both garbage collection (GC) and serialization overheads. Existing Spark caching options incur either significant GC overheads for large managed heaps over persistent memory or significant serialization overheads to place objects off-heap on large storage devices. Our analysis shows that: (1) serialization increases execution time by up to 30% and (2) caching on the managed heap increases GC time by 20%. In addition, these overheads become worse as datasets grow.
This document provides an overview of Apache Spark, including why it was created, how it works, and how to get started with it. Some key points: - Spark was initially developed at UC Berkeley as a class project in 2009 to test cluster management systems like Mesos, and was open sourced in 2010; it became a top-level Apache project in 2014. - Spark is faster than Hadoop for machine learning tasks because it keeps data in memory between jobs rather than writing to disk, and it has a smaller codebase. - The basic unit of data in Spark is the resilient distributed dataset (RDD), an immutable collection of objects distributed across a cluster. RDDs support transformations and actions.
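The distinction between lazy transformations and eager actions can be sketched with a toy RDD-like class. This is a simplified model of the concept, not Spark's actual implementation or API:

```python
class ToyRDD:
    """Toy model of an RDD: transformations are lazy, actions force evaluation."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []          # pending transformations, not yet applied

    def map(self, fn):                 # transformation: returns a new dataset, runs nothing
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):            # transformation: also lazy
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):                 # action: applies all pending ops and returns results
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

Nothing is computed until `collect()` is called; the chained `map` and `filter` calls only record the operations to run later, which is what lets Spark plan and pipeline work across a cluster.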
This document provides an introduction to Apache Spark, including: - A brief history of Spark, which started at UC Berkeley in 2009 and was donated to the Apache Foundation in 2013. - An overview of what Spark is: an open-source, efficient, and productive cluster computing system that is interoperable with Hadoop. - Descriptions of Spark's core abstractions, including resilient distributed datasets (RDDs), transformations, and actions, and of how Spark loads and saves data. - Mentions of Spark's machine learning, SQL, streaming, and graph processing capabilities through projects like MLlib, Spark SQL, Spark Streaming, and GraphX.
This is a demo that uses Apps Script to build a lambda-architecture dashboard. The Apps Script publishes an endpoint, and a client uses Fluentd to post data to both Apps Script and BigQuery, so you can see real-time and batch query results in the same view.
The document discusses the Apache Hadoop ecosystem and versions. It provides details on Hadoop versioning from 0.1 to the current versions of 0.22, 0.23, and 1.0. It summarizes the key features and testing of Hadoop 0.22, which has been stabilized by eBay for production use. The document recommends Hadoop 0.22 as a reliable version to use until further versions are released.
This document provides tips and best practices for optimizing Apache Spark performance and resource allocation. It discusses: - The components of Spark including executors, drivers, and tasks - Configuring Spark on YARN and dynamic resource allocation - Optimizing memory usage, avoiding data skew, and reducing serialization costs - Best practices for Spark Streaming around microbatching, fault tolerance, and performance - Recommendations for running Spark on cloud object stores like S3
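The resource-allocation settings discussed above are typically passed at submit time. A sketch of a `spark-submit` invocation for Spark on YARN with dynamic allocation follows; the memory, core, and executor-count values are illustrative examples, not recommendations for any specific cluster:

```shell
# Illustrative spark-submit for Spark on YARN with dynamic allocation.
# Dynamic allocation requires the external shuffle service so executors
# can be released without losing their shuffle files.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4g \
  --executor-cores 4 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --conf spark.shuffle.service.enabled=true \
  my_app.py
```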
Redis modules allow new capabilities, such as machine learning models, to be added to Redis. The Redis-ML module stores machine learning models like random forests and supports operations such as model evaluation and prediction directly from Redis for low latency. Spark can be used to train models, which are then saved into Redis via the Redis-ML module, allowing models to be easily deployed and accessed from services and clients.
An over-ambitious introduction to Spark programming, testing, and deployment. This slide deck tries to cover most of the core technologies and design patterns used in SpookyStuff, the fastest query engine for data collection/mashup from the deep web. For more information please follow: https://github.com/tribbloid/spookystuff A PowerPoint bug used to prevent the transparent background color from rendering properly; this has been fixed in a recent upload.
An introduction to Apache Spark covering its architecture, resilient distributed datasets (RDDs), and how it works.
A #NYCCassandra2013 talk wherein I outline Outbrain's automation infrastructure and how we go from metal to working cluster nodes.
From common errors seen when running Spark applications (e.g., OutOfMemory, NoClassFound, disk I/O bottlenecks, History Server crashes, cluster under-utilization) to advanced settings used to resolve large-scale Spark SQL workloads (HDFS block size vs. Parquet block size, how best to run the HDFS Balancer to redistribute file blocks, etc.), you will get the full scoop in this information-packed presentation.
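The HDFS Balancer mentioned above is run from the command line. A sketch of a typical invocation follows; the threshold and bandwidth values are illustrative, not recommendations:

```shell
# Optionally raise the per-DataNode bandwidth the balancer may use
# (bytes/sec; 104857600 = 100 MB/s), so rebalancing finishes sooner.
hdfs dfsadmin -setBalancerBandwidth 104857600

# Run the balancer: -threshold is the allowed deviation, in percent,
# of each DataNode's disk utilization from the cluster average.
hdfs balancer -threshold 10
```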
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
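The partitioning step at the heart of a shuffle can be sketched as follows. This is a simplified model of hash partitioning, not Spark's actual shuffle writer:

```python
def hash_partition(records, num_partitions):
    """Assign each (key, value) record to a reduce-side partition by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # All records with the same key land in the same partition,
        # so a downstream reducer sees every value for its keys.
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = hash_partition(records, 2)
a_parts = {i for i, part in enumerate(parts) for k, _ in part if k == "a"}
print(len(a_parts))  # 1: both "a" records share a partition
```

The expensive parts of a real shuffle come after this step: each partition's records must be serialized, optionally compressed, written to disk by the shuffle writer, and later fetched over the network by shuffle readers.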
The document discusses Spark internals and provides an overview of key components such as the Spark code base size and growth over time, core developers, Scala basics used in Spark, RDDs, tasks, caching/block management, and schedulers for running Spark on clusters including Mesos and YARN. It also includes tips for using IntelliJ IDEA to work with Spark's Scala code base.
MySQL Cluster provides high availability through data replication across multiple nodes, automatic failover, and synchronous replication to ensure data integrity, but it is limited in that the entire database must reside in memory, so database size is restricted by available memory. Other options for high availability with MySQL include using MySQL Proxy to split reads and writes across nodes, replication with multi-master setups, and technologies like DRBD that replicate data for recovery. Planning for failures, keeping implementations simple, and separating data high availability from connectivity high availability are important principles for highly available MySQL architectures.
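The read/write splitting idea mentioned above can be sketched as a tiny router. This is a toy illustration using hypothetical `primary` and `replicas` connection handles, not MySQL Proxy's actual implementation:

```python
import itertools


class ReadWriteRouter:
    """Toy router: writes go to the primary, reads round-robin across replicas."""

    def __init__(self, primary, replicas):
        self._primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        # A real proxy parses SQL properly; this keyword check is a simplification.
        if sql.lstrip().upper().startswith(("SELECT", "SHOW")):
            return next(self._replicas)   # read: any replica can serve it
        return self._primary              # write: must hit the primary


router = ReadWriteRouter("primary", ["replica1", "replica2"])
print(router.route("SELECT * FROM users"))    # replica1
print(router.route("INSERT INTO users ..."))  # primary
```

Note that routing reads to asynchronous replicas can return slightly stale data, which is one reason the talk stresses separating data high availability from connectivity high availability.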