This document discusses integrating Solr and Spark. It provides an example of using Solr as a sink for streaming data from Spark Streaming. It also describes reading data from Solr into Spark using SolrRDD and exposing it as a Spark SQL DataFrame. Additional capabilities covered include querying Solr from the Spark shell, document matching using stored queries, and reading term vectors from Solr for machine learning with MLlib.
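As a minimal sketch of the DataFrame read path, assuming the Lucidworks spark-solr connector is on the classpath and a SolrCloud collection exists (the collection name, field names, and ZooKeeper host here are illustrative placeholders):

```python
from pyspark.sql import SparkSession

# Assumes the spark-solr connector is available, e.g. launched with:
#   pyspark --packages com.lucidworks.spark:spark-solr:<version>
spark = SparkSession.builder.appName("solr-read-sketch").getOrCreate()

# Read a Solr collection into a DataFrame via the spark-solr data source.
# "zkhost" points at the ZooKeeper ensemble backing SolrCloud.
df = (spark.read.format("solr")
      .option("zkhost", "localhost:9983")
      .option("collection", "products")   # hypothetical collection name
      .load())

# Expose it to Spark SQL and query it like any other table.
df.createOrReplaceTempView("products")
spark.sql("SELECT id, price_f FROM products WHERE price_f > 10").show()
```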
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL, with results that can be joined with other data sources. This plugin will soon be renamed to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename, and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the Pulsar community's plans to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
This document discusses PostgreSQL architecture patterns for running the database at scale when a relational database-as-a-service like Amazon RDS won't meet your needs. It describes challenges faced with MySQL, Redshift, and Vertica, and how PostgreSQL was better suited through techniques like partitioning by date, TOAST compression, foreign data wrappers, and poor man's parallel processing. Key takeaways are that PostgreSQL scaled to petabytes of data, delivered sub-second queries across large date ranges, and supported the custom extensions needed, while avoiding the limitations and expense of the other database options.
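To make the date-partitioning technique concrete, here is a rough sketch using declarative range partitioning (available since PostgreSQL 10) through the psycopg2 driver; the table, column, and connection parameters are hypothetical:

```python
import psycopg2

# Connection parameters are placeholders.
conn = psycopg2.connect("dbname=metrics user=analyst")
cur = conn.cursor()

# Parent table partitioned by date; each child holds one month.
# Queries constrained by event_date scan only the matching partitions,
# and old months can be detached or dropped without touching hot data.
# Large jsonb values are TOASTed (compressed out-of-line) automatically.
cur.execute("""
    CREATE TABLE events (
        event_date date  NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (event_date)
""")
cur.execute("""
    CREATE TABLE events_2024_01 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01')
""")
conn.commit()
```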
Netflix’s Big Data Platform team manages a data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. With a data warehouse at this scale, it is a constant challenge to keep improving performance. This talk will focus on Iceberg, a new table metadata format that is designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes to guarantee jobs always use consistent table snapshots. In this session, you'll learn:
• Some background about big data at Netflix
• Why Iceberg is needed and the drawbacks of the current tables used by Spark and Hive
• How Iceberg maintains table metadata to make queries fast and reliable
• The benefits of Iceberg's design and how it is changing the way Netflix manages its data warehouse
• How you can get started using Iceberg
Speaker: Ryan Blue, Software Engineer, Netflix
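For the "getting started" angle, a minimal PySpark sketch of creating and querying an Iceberg table, assuming the iceberg-spark-runtime package is on the classpath; the catalog name, warehouse path, and schema are placeholders:

```python
from pyspark.sql import SparkSession

# Assumes iceberg-spark-runtime is on the classpath.
spark = (SparkSession.builder
         .appName("iceberg-sketch")
         .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.demo.type", "hadoop")
         .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")  # placeholder
         .getOrCreate())

# Iceberg tracks table state in metadata and manifest files, so query
# planning reads the current snapshot instead of listing S3 directories.
# days(ts) is a hidden partition transform: no ds column to manage.
spark.sql("""
    CREATE TABLE demo.db.logs (
        ts    timestamp,
        level string,
        msg   string
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")
spark.sql("SELECT count(*) FROM demo.db.logs WHERE level = 'ERROR'").show()
```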
This document summarizes Flipkart's approach to building a real-time search index for e-commerce. It discusses the need for real-time search to address issues like showing out-of-stock products, and describes Flipkart's traffic volume and catalog size. It then covers challenges such as high update rates and a microservices architecture. It evaluates Apache SolrCloud, then describes Flipkart's approach: a near-real-time forward index combined with an inverted index for filtering, enabling real-time sorting, filtering, and querying with latency comparable to columnar storage. The system processes over 150 signals and 50k updates per second in production.
This document discusses how to achieve scale with MongoDB. It covers optimization tips like schema design, indexing, and monitoring. Vertical scaling involves upgrading hardware like RAM and SSDs. Horizontal scaling involves adding shards to distribute load. The document also discusses how MongoDB scales for large customers through examples of deployments handling high throughput and large datasets.
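A small pymongo sketch of two of the levers mentioned: an index matching a common query shape, and hashed sharding to spread load horizontally. It assumes a running sharded cluster reached through mongos; database, collection, and field names are illustrative:

```python
from pymongo import ASCENDING, MongoClient

# Connect through mongos (the sharded-cluster router); host is a placeholder.
client = MongoClient("mongodb://localhost:27017")
db = client.shop

# A compound index matching a common query shape avoids collection scans.
db.orders.create_index([("user_id", ASCENDING), ("created_at", ASCENDING)])

# Horizontal scaling: enable sharding on the database, then distribute
# documents by hashed user_id so inserts spread evenly across shards.
client.admin.command("enableSharding", "shop")
client.admin.command("shardCollection", "shop.orders",
                     key={"user_id": "hashed"})
```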
This document provides an overview of Logstash, Elasticsearch, and Kibana for log analysis. It discusses how logging is used for troubleshooting, security, and monitoring. It then introduces Logstash as an open-source log collection and parsing tool. Elasticsearch is described as a search and analytics engine that indexes log data from Logstash. Kibana provides a web interface for visualizing and searching logs stored in Elasticsearch. The document concludes by discussing a demo, installation, scaling, and deployment considerations for these log analysis tools.
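To give a flavor of the Elasticsearch side, a brief sketch with the elasticsearch Python client (8.x API) querying recent error lines across Logstash-written indices; the index pattern and the @timestamp/message fields follow common Logstash defaults, and the endpoint is a placeholder:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Find "error" log lines from the last hour across daily Logstash indices.
resp = es.search(
    index="logstash-*",
    query={
        "bool": {
            "must": [{"match": {"message": "error"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
        }
    },
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["message"])
```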
In Spark SQL, the physical plan provides the fundamental information about how a query executes. The objective of this talk is to build understanding of and familiarity with query plans in Spark SQL, and to use that knowledge to achieve better performance from Apache Spark queries. We will walk you through the most common operators you might find in a query plan and explain the information they carry that is useful for understanding details of the execution. If you understand the query plan, you can look for the weak spots and rewrite the query to achieve a better plan that leads to more efficient execution. The main content of this talk is based on the Spark source code, but it reflects real-life queries that we run while processing data. We will show examples of query plans, explain how to interpret them and what information can be taken from them, and describe what happens under the hood when the plan is generated, focusing mainly on the physical planning phase. In general, we want to share what we have learned from both the Spark source code and the real-life queries we run in our daily data processing.
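To make the inspection step concrete, a tiny PySpark example of printing a physical plan; the formatted explain mode is available in Spark 3.x, and the data here is synthetic:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("explain-sketch").getOrCreate()

orders = spark.range(1_000_000).withColumn("amount", F.rand())
users = spark.range(1_000).withColumnRenamed("id", "user_id")

joined = orders.join(users, orders["id"] % 1_000 == users["user_id"])

# Prints the physical plan. Look for operators such as Exchange (a
# shuffle), BroadcastHashJoin vs. SortMergeJoin, and scan nodes with
# pushed filters -- these are where the weak spots tend to show up.
joined.explain(mode="formatted")
```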
Flink Forward San Francisco 2022. When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment. by Jun Qin & Karl Friedrich
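One widely used mitigation for key skew (a common remedy in general, not necessarily the one presented in this talk) is two-stage aggregation with key salting: pre-aggregate under a randomized sub-key, then strip the salt and combine. A rough PyFlink sketch under that assumption; note that an unbounded keyed reduce emits rolling partial results, which is fine for illustration:

```python
import random

from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
kv = Types.TUPLE([Types.STRING(), Types.INT()])

# A hot key ("user_1") that would overload a single subtask.
events = env.from_collection(
    [("user_1", 1), ("user_1", 1), ("user_1", 1), ("user_2", 1)],
    type_info=kv)

# Stage 1: append a random salt so the hot key spreads over 8 sub-keys.
salted = events.map(lambda t: (f"{t[0]}#{random.randrange(8)}", t[1]),
                    output_type=kv)
partial = salted.key_by(lambda t: t[0]) \
                .reduce(lambda a, b: (a[0], a[1] + b[1]))

# Stage 2: strip the salt and merge the (much smaller) partial counts.
totals = partial.map(lambda t: (t[0].split("#")[0], t[1]), output_type=kv) \
                .key_by(lambda t: t[0]) \
                .reduce(lambda a, b: (a[0], a[1] + b[1]))

totals.print()
env.execute("key-salting-sketch")
```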
Trino (formerly known as PrestoSQL) is an open source distributed SQL query engine for running fast analytical queries against data sources of all sizes. Some key updates since being rebranded from PrestoSQL to Trino include new security features, language features like window functions and temporal types, performance improvements through dynamic filtering and partition pruning, and new connectors. Upcoming improvements include support for MERGE statements, MATCH_RECOGNIZE patterns, and materialized view enhancements.
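A quick way to try one of the newer language features (window functions) from Python is the trino client package against the built-in tpch catalog; host, port, and user below are placeholders:

```python
import trino

# Placeholders: adjust host/port/user for your cluster.
conn = trino.dbapi.connect(host="localhost", port=8080, user="demo",
                           catalog="tpch", schema="tiny")
cur = conn.cursor()

# Window function: rank orders by price within each customer.
cur.execute("""
    SELECT custkey, orderkey, totalprice,
           rank() OVER (PARTITION BY custkey ORDER BY totalprice DESC) AS rnk
    FROM orders
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```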
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
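In Spark terms, the sort prerequisite and the two size knobs can look like the sketch below. parquet.block.size and parquet.dictionary.page.size are standard parquet-mr settings, the values are illustrative, and on some Spark versions they must be set through the Hadoop configuration (spark.hadoop.*) rather than as writer options:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-tuning-sketch").getOrCreate()
df = spark.read.parquet("s3://bucket/raw/events")  # placeholder path

(df
 # Sorting within files keeps column values clustered, so row-group
 # statistics (min/max) and dictionary filtering can skip more data.
 .sortWithinPartitions("event_type")
 .write
 # Smaller row groups -> finer-grained skipping via statistics.
 .option("parquet.block.size", 64 * 1024 * 1024)
 # Larger dictionary page threshold -> fewer fallbacks to plain encoding.
 .option("parquet.dictionary.page.size", 4 * 1024 * 1024)
 .parquet("s3://bucket/tuned/events"))
```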
Flink Forward San Francisco 2022. Apache Flink and Delta Lake together let you build the foundation of a data lakehouse by ensuring the reliability of your concurrent streams, from processing down to the underlying cloud object store. The Flink/Delta Connector lets you store data in Delta tables, harnessing Delta's ACID transactions and scalability while maintaining Flink's end-to-end exactly-once processing. Writes from Flink to Delta tables are idempotent, so even if the Flink pipeline is restarted from its checkpoint, no data is lost or duplicated, preserving Flink's exactly-once semantics. by Scott Sandre & Denny Lee
The document discusses Netflix's use of Elasticsearch for querying log events. It describes how Netflix evolved from storing logs in files to using Elasticsearch to enable interactive exploration of billions of log events. It also summarizes some of Netflix's best practices for running Elasticsearch at scale, such as automatic sharding and replication, flexible schemas, and extensive monitoring.