Presto generates Java bytecode at runtime to optimize query execution. Key query operations such as filtering, projections, joins and aggregations are compiled into efficient Java methods with the ASM bytecode library, alongside fast primitive collections from fastutil. This bytecode generation reportedly improves performance by about 30% through techniques like compiling row hashing for join lookups directly into bytecode, which the JVM's JIT compiler can then turn into machine instructions.
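As a rough illustration of the idea only (Presto emits JVM bytecode; this is a Python analogy, not Presto's generator), the sketch below builds the source of a filter-and-project function specialized to one query, compiles it once, and then runs the compiled function over rows. The helper name `compile_filter_project` and the column names are hypothetical.

```python
def compile_filter_project(predicate, projections, columns):
    """Generate and compile a function specialized to one predicate and projection list.

    `predicate` and each entry of `projections` are Python expressions over the
    column names. Assumption: the expressions come from a trusted planner; never
    pass untrusted text to exec().
    """
    unpack = ", ".join(columns)
    if len(columns) == 1:
        unpack += ","  # trailing comma so a one-column row still unpacks as a tuple
    source = (
        "def generated(rows):\n"
        "    out = []\n"
        f"    for {unpack} in rows:\n"
        f"        if {predicate}:\n"
        f"            out.append(({', '.join(projections)},))\n"
        "    return out\n"
    )
    namespace = {}
    # Compile once; every later call runs the specialized loop with no per-row interpretation
    # of the query expression tree.
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["generated"]


rows = [(1, 10.0), (2, 25.5), (3, 40.0)]
fn = compile_filter_project("price > 20", ["orderkey", "price * 2"], ["orderkey", "price"])
result = fn(rows)
```

The design point is that the predicate and projections are baked into the generated loop, so the per-row work is a direct comparison and arithmetic rather than a walk over a generic expression tree.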
This document provides an overview of five steps to improve PostgreSQL performance: 1) hardware optimization, 2) operating system and filesystem tuning, 3) configuration of postgresql.conf parameters, 4) application design considerations, and 5) query tuning. The document discusses various techniques for each step such as selecting appropriate hardware components, spreading database files across multiple disks or arrays, adjusting memory and disk configuration parameters, designing schemas and queries efficiently, and leveraging caching strategies.
We will show the advantages of having a geo-distributed database cluster and how to create one using Galera Cluster for MySQL. We will also discuss the configuration and status variables that are involved and how to deal with typical situations on the WAN such as slow, untrusted or unreliable links, latency and packet loss. We will demonstrate a multi-region cluster on Amazon EC2 and perform some throughput and latency measurements in real-time (video http://galeracluster.com/videos/using-galera-replication-to-create-geo-distributed-clusters-on-the-wan-webinar-video-3/)
LiquiBase is an open source tool for tracking, managing and applying database changes, where database changes are stored in an XML file called a changelog that is executed to handle different revisions. It aims to provide consistent database changes across environments by managing databases at different states and keeping a history of all changes made through automatic rollback support and ability to effectively manage variable changes. Problems with manual database changes include inconsistent application of changes and databases becoming out of sync between environments.
Solving Enterprise Data Challenges with Apache Arrow - Wes McKinney
This document discusses Apache Arrow, an open-source library that enables fast and efficient data interchange and processing. It summarizes the growth of Arrow and its ecosystem, including new features like the Arrow C++ query engine and Arrow Rust DataFusion. It also highlights how enterprises are using Arrow to solve challenges around data interoperability, access speed, query performance, and embeddable analytics. Case studies describe how companies like Microsoft, Google Cloud, Snowflake, and Meta leverage Arrow in their products and platforms. The presenter promotes Voltron Data's enterprise subscription and upcoming conference to support business use of Apache Arrow.
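MySQL DB redundancy for 24/7/365 service.
We look at approaches to MySQL redundancy and talk through the issues we wrestled with while operating them.
Contents
1. Why DB redundancy is needed
2. Redundancy approaches
- Hardware redundancy
- Redundancy with MySQL Replication
3. Failures when operating a redundant setup
4. DNS and VIP
5. Comparison of MySQL redundancy solutions
Audience
- Infrastructure engineers who run MySQL in production
- Developers interested in MySQL redundancy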
YugaByte DB Internals - Storage Engine and Transactions - Yugabyte
This document introduces YugaByte DB, a high-performance, distributed, transactional database. It is built to scale horizontally on commodity servers across data centers for mission-critical applications. YugaByte DB uses a transactional document store based on RocksDB, Raft-based replication for resilience, and automatic sharding and rebalancing. It supports ACID transactions across documents, provides APIs compatible with Cassandra and Redis, and is open source. The architecture is designed for high performance, strong consistency, and cloud-native deployment.
ProxySQL High Availability and Configuration Management Overview - René Cannaò
The document provides an overview of high availability and configuration management options for ProxySQL. It discusses deploying ProxySQL locally on application servers, in a dedicated layer, or using both approaches. When deploying in a dedicated layer, options for high availability include keepalived, load balancers, Consul, and Kubernetes. Configuration can be managed through tools like Ansible, Puppet, or by loading SQL files. ProxySQL Cluster enables syncing configuration across nodes.
Evening out the uneven: dealing with skew in Flink - Flink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by Jun Qin & Karl Friedrich
The document discusses the JavaScript event loop, which is how JavaScript handles concurrency. It explains that JavaScript is single-threaded but uses an event loop model to simulate parallelism. Key points are:
- JavaScript uses a single thread of execution but handles I/O asynchronously by placing callbacks into a queue to be executed later.
- This allows I/O-heavy operations like networking to occur "in parallel" without blocking the main thread.
- The event loop continuously runs through the call stack and queue, executing functions and callbacks.
- While efficient for I/O, CPU-intensive tasks would block the single thread, so JavaScript is not ideal for those types of applications.
CDC Stream Processing with Apache Flink - Timo Walther
An instant world requires instant decisions at scale. This includes the ability to digest and react to changes in real-time. Thus, event logs such as Apache Kafka can be found in almost every architecture, while databases and similar systems still provide the foundation. Change Data Capture (CDC) has become popular for propagating changes. Nevertheless, integrating all these systems, which often have slightly different semantics, can be a challenge.
In this talk, we highlight what it means for Apache Flink to be a general data processor that acts as a data integration hub. Looking under the hood, we demonstrate Flink's SQL engine as a changelog processor that ships with an ecosystem tailored to processing CDC data and maintaining materialized views. We will discuss the semantics of different data sources and how to perform joins or stream enrichment between them. This talk illustrates how Flink can be used with systems such as Kafka (for upsert logging), Debezium, JDBC, and others.
MySQL Parallel Replication: All the 5.7 and 8.0 Details (LOGICAL_CLOCK) - Jean-François Gagné
To get better replication speed and less lag, MySQL implements parallel replication in the same schema, also known as LOGICAL_CLOCK. But fully benefiting from this feature is not as simple as just enabling it.
In this talk, I explain in detail how this feature works. I also cover how to optimize parallel replication and the improvements made in MySQL 8.0 and back-ported in 5.7 (Write Sets), greatly improving the potential for parallel execution on replicas (but needing RBR).
Come to this talk to get all the details about MySQL 5.7 and 8.0 Parallel Replication.
This talk explores PostgreSQL 15 enhancements (along with some history) and looks at how they improve developer experience (MERGE and SQL/JSON), optimize support for backups and compression, logical replication improvements, enhanced security and performance, and more.
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra... - HostedbyConfluent
Active-Active, Active-Passive, and stretch clusters are hallmark patterns that have been the gold standard in Apache Kafka® disaster recovery architectures for years. Moving to Kubernetes requires unpacking these patterns and choosing a configuration that allows you to meet the same RTO and RPO requirements.
In this talk, we will cover how Active-Active/Active-Passive modes for disaster recovery have worked in the past and how the architecture evolves with deploying Apache Kafka on Kubernetes. We'll also look at how stretch clusters sitting on this architecture give a disaster recovery solution that's built-in!
Armed with this information, you will be able to architect your new Apache Kafka Kubernetes deployment (or retool your existing one) to achieve the resilience you require.
2021 04-20 apache arrow and its impact on the database industry.pptx - Andrew Lamb
The talk will motivate why Apache Arrow and related projects (e.g. DataFusion) are a good choice for implementing modern analytic database systems. It reviews the major components in most databases, explains where Apache Arrow fits in, and describes the additional integration benefits of using Arrow.
This presentation briefly describes key features of Apache Cassandra. It was held at the Apache Cassandra Meetup in Vienna in January 2014. You can access the meetup here: http://www.meetup.com/Vienna-Cassandra-Users/
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. It is written in Java and uses a pluggable backend. Presto is fast due to code generation and runtime compilation techniques. It provides a library and framework for building distributed services and fast Java collections. Plugins allow Presto to connect to different data sources like Hive, Cassandra, MongoDB and more.
The document discusses the future of server-side JavaScript. It covers various Node.js frameworks and libraries that support both synchronous and asynchronous programming styles. Common Node aims to provide interoperability across platforms by implementing the synchronous CommonJS proposals using fibers. Examples demonstrate how Common Node allows synchronous-looking code while maintaining asynchronous behavior under the hood. Benchmarks show it has comparable performance to Node.js. The author advocates for toolkits over frameworks and continuing development of common standards and packages.
Node has captured the attention of early adopters by clearly differentiating itself as being asynchronous from the ground up while remaining accessible. Now that server-side JavaScript is at the cutting edge of the asynchronous, real-time web, it is in a much better position to establish itself as the go-to language for synchronous CRUD web apps as well and to gain a stronger foothold on the server.
This talk covers the current state of server side JavaScript beyond Node. It introduces Common Node, a synchronous CommonJS compatibility layer using node-fibers which bridges the gap between the different platforms. We look into Common Node's internals, compare its performance to that of other implementations such as RingoJS and go through some ideal use cases.
The document provides an overview of RxJava and its advantages over traditional Java streams and callbacks. It discusses key RxJava concepts like Observables, Observers, and Subscriptions. It demonstrates how to create Observables, subscribe to them, and compose operations like filter, map, and zip. It shows how to leverage schedulers to control threading. The document also provides examples of using RxJava with HTTP requests and the Twitter API to asynchronously retrieve user profiles and tweets. It highlights scenarios where RxJava is useful, like handling asynchronous operations, and discusses some pitfalls like its learning curve and need to understand backpressure.
This talk was given at the Dutch PHP Conference 2011 and details the use of Comet (aka reverse ajax or ajax push) technologies and the importance of websockets and server-sent events. More information is available at http://joind.in/3237.
DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ... - Aman Kohli
The power of Gatling is the DSL it provides to allow writing meaningful and expressive tests. We provide an overview of the framework, a description of their development environment and goals, and present their test results.
Source code available at https://github.com/lawlessc/random-response-time
Introduction to Big Data Technology and Apache Hadoop - Sages
The document introduces concepts related to Big Data technology including volume, variety, and velocity of data. It discusses Hadoop architecture including HDFS, MapReduce, YARN, and the Hadoop ecosystem. Examples are provided of common Big Data problems and how they can be solved using Hadoop frameworks like Pig, Hive, and Ambari.
The document discusses Concurrency-oriented Programming (COP) using Erlang. It explains how Erlang programs work using lightweight processes that communicate asynchronously via message passing. This allows for high performance, reliability, and scalability. It provides examples of stateless server processes and using CouchDB for schema-free document storage accessible via REST APIs. Ruby libraries for interacting with CouchDB are also mentioned.
RestMQ is a message queue system based on Redis that allows storing and retrieving messages through HTTP requests. It uses Redis' data structures like lists, sets, and hashes to maintain queues and messages. Messages can be added to and received from queues using RESTful endpoints. Additional features include status monitoring, queue control, and support for protocols like JSON, Comet, and WebSockets. The core functionality is language-agnostic but implementations exist in Python and Ruby.
A Deep Dive into Query Execution Engine of Spark SQL - Databricks
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to the executable physical plans consisting of transformations and actions on RDDs with the generated Java code. The code is compiled to Java bytecode, executed at runtime by JVM and optimized by JIT to native machine code at runtime. This talk will take a deep dive into Spark SQL execution engine. The talk includes pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, lineage based RDD transformation and action.
Finagle is an asynchronous RPC framework from Twitter that provides client/server abstractions over various protocols like HTTP and Thrift. It uses Futures to handle asynchronous operations and provides methods like map, flatmap, and handle to transform Futures. The Java Service Framework builds on Finagle to add features like metrics, logging, and rate limiting for Java services. It allows configuring options like enabling specific logs and metrics through a Proxy builder.
Leveraging Hadoop in your PostgreSQL Environment - Jim Mlodgenski
This talk will begin with a discussion of the strengths of PostgreSQL and Hadoop. We will then lead into a high level overview of Hadoop and its community of projects like Hive, Flume and Sqoop. Finally, we will dig down into various use cases detailing how you can leverage Hadoop technologies for your PostgreSQL databases today. The use cases will range from using HDFS for simple database backups to using PostgreSQL and Foreign Data Wrappers to do low latency analytics on your Big Data.
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices" - DataStax Academy
The ColumnFamily data model and wide-row support provide the ability to store and access data efficiently in a de-normalized state. Recent enhancements for CQL's sparse tables and built-in indexing provide the capability to store data in a manner similar to that of relational databases. For many use cases hybrid approaches are needed, because complete de-normalization is appropriate for some access patterns whereas more structured data is appropriate for others. At times a single logical event becomes multiple insertions across multiple column families. Likewise, a user request might require several reads across different column families. This talk describes some of these scenarios and demonstrates how advanced operations such as multi-step procedures, filtering, intersection, and paging can be implemented client side or server side with the help of the IntraVert plugin.
Intravert Server side processing for Cassandra - Edward Capriolo
The document provides examples of using CQL (Cassandra Query Language) to create and query tables in Cassandra. It shows how to create tables to store user and video data, insert sample records, and perform queries. It then discusses using the IntraVert library to execute more complex queries directly against Cassandra, such as joins, filters, and multi-table operations, in order to reduce network traffic and processing compared to doing everything on the client side.
The Road To Reactive with RxJava JEEConf 2016 - Frank Lyaruu
This document introduces Reactive Programming with RxJava and how it can be used to create non-blocking applications. It discusses the limitations of blocking code and how RxJava uses Observables and Subscribers to implement reactive and asynchronous operations. It provides examples of converting blocking servlets and HTTP calls to non-blocking using RxJava. While non-blocking code is not always faster, it allows asynchronous operations to utilize threads more efficiently.
CouchDB Mobile - From Couch to 5K in 1 Hour - Peter Friese
This document provides an overview of CouchDB, a NoSQL database that uses JSON documents with a flexible schema. It demonstrates CouchDB's features like replication, MapReduce, and filtering. The presentation then shows how to build a mobile running app called Couch25K that tracks locations using CouchDB and syncs data between phones and a server. Code examples are provided in Objective-C, Java, and JavaScript for creating databases, saving documents, querying, and syncing.
This is my presentation from TechBeats #3, hosted by Applause, about a server-side Swift framework called Vapor.
Swift is a great language, and being able to use it on the backend as well is a huge benefit for any iOS developer out there. Using Vapor is a seamless experience. With this framework, creating advanced APIs is as easy for an iOS developer as writing a simple iOS app.
https://www.meetup.com/TechBeats-hosted-by-Applause/events/254910023/
This document discusses best practices for developing Node.js applications. It recommends using frameworks like Express for building web apps, libraries like Async to avoid callback hell, and organizing code into modular sub-applications. It also covers testing, error handling, documentation, and open-sourcing projects. Standards like Felix's Style Guide and domain-driven design principles are advocated. Communication channels like events, HTTP APIs, and WebSockets are examined.
The document discusses composing reusable extract-transform-load (ETL) processes on Hadoop. It covers the data science lifecycle of acquiring, analyzing and taking action on data. It states that 80% of work in data science is spent on acquiring and preparing data. The document then discusses using Cascading, an abstraction framework for building MapReduce jobs, to create reusable ETL processes that are linearly scalable and follow a single-purpose composable design.
Continuous Application with Structured Streaming 2.0 - Anyscale
Introduction to Continuous Application with Apache Spark 2.0 Structured Streaming. This presentation is a culmination and curation from talks and meetups presented by Databricks engineers.
The notebooks on Structured Streaming demonstrate aspects of the Structured Streaming APIs.
The document discusses various machine learning clustering algorithms like K-means clustering, DBSCAN, and EM clustering. It also discusses neural network architectures like LSTM, bi-LSTM, and convolutional neural networks. Finally, it presents results from evaluating different chatbot models on various metrics like validation score.
The document discusses challenges with using reinforcement learning for robotics. While simulations allow fast training of agents, there is often a "reality gap" when transferring learning to real robots. Other approaches like imitation learning and self-supervised learning can be safer alternatives that don't require trial-and-error. To better apply reinforcement learning, robots may need model-based approaches that learn forward models of the world, as well as techniques like active localization that allow robots to gather targeted information through interactive perception. Closing the reality gap will require finding ways to better match simulations to reality or allow robots to learn from real-world experiences.
[243] Deep Learning to help student’s Deep Learning - NAVER D2
This document describes research on using deep learning to predict student performance in massive open online courses (MOOCs). It introduces GritNet, a model that takes raw student activity data as input and predicts outcomes like course graduation without feature engineering. GritNet outperforms baselines by more than 5% in predicting graduation. The document also describes how GritNet can be adapted in an unsupervised way to new courses using pseudo-labels, improving predictions in the first few weeks. Overall, GritNet is presented as the state-of-the-art for student prediction and can be transferred across courses without labels.
[234] Fast & Accurate Data Annotation Pipeline for AI applications - NAVER D2
This document provides a summary of new datasets and papers related to computer vision tasks including object detection, image matting, person pose estimation, pedestrian detection, and person instance segmentation. A total of 8 papers and their associated datasets are listed with brief descriptions of the core contributions or techniques developed in each.
[226] NAVER Ads deep click prediction: from modeling to serving - NAVER D2
This document presents a formula for calculating the loss function J(θ) in machine learning models. The formula averages the negative log likelihood of the predicted probabilities being correct over all samples S, and includes a regularization term λ that penalizes predicted embeddings being dissimilar from actual embeddings. It also defines the cosine similarity term used in the regularization.
[214] AI Serving Platform: the struggle to handle hundreds of millions of inferences a day - NAVER D2
The document discusses running a TensorFlow Serving (TFS) container using Docker. It shows commands to:
1. Pull the TFS Docker image from a repository
2. Define a script to configure and run the TFS container, specifying the model path, name, and port mapping
3. Run the script to start the TFS container exposing port 13377
The document discusses linear algebra concepts including:
- Representing a system of linear equations as a matrix equation Ax = b where A is a coefficient matrix, x is a vector of unknowns, and b is a vector of constants.
- Solving for the vector x that satisfies the matrix equation using linear algebra techniques such as row reduction.
- Examples of matrix equations and their component vectors are shown.
This document describes the steps to convert a TensorFlow model to a TensorRT engine for inference. It includes steps to parse the model, optimize it, generate a runtime engine, serialize and deserialize the engine, as well as perform inference using the engine. It also provides code snippets for a PReLU plugin implementation in C++.
The document discusses machine reading comprehension (MRC) techniques for question answering (QA) systems, comparing search-based and natural language processing (NLP)-based approaches. It covers key milestones in the development of extractive QA models using NLP, from early sentence-level models to current state-of-the-art techniques like cross-attention, self-attention, and transfer learning. It notes the speed and scalability benefits of combining search and reading methods for QA.
How RPA Help in the Transportation and Logistics Industry.pptx - SynapseIndia
Revolutionize your transportation processes with our cutting-edge RPA software. Automate repetitive tasks, reduce costs, and enhance efficiency in the logistics sector with our advanced solutions.
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf - Neo4j
Presented at Gartner Data & Analytics, London, May 2024. BT Group has used the Neo4j Graph Database to enable impressive digital transformation programs over the last 6 years. By re-imagining their operational support systems to adopt self-serve and data-led principles, they have substantially reduced the number of applications and the complexity of their operations. The result has been a substantial reduction in risk and costs while improving time to value, innovation, and process automation. Join this session to hear their story, the lessons they learned along the way and how their future innovation plans include the exploration of uses of EKG + Generative AI.
Measuring the Impact of Network Latency at Twitter - ScyllaDB
Widya Salim and Victor Ma will outline the causal impact analysis, framework, and key learnings used to quantify the impact of reducing Twitter's network latency.
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo... - Chris Swan
Have you noticed the OpenSSF Scorecard badges on the official Dart and Flutter repos? It's Google's way of showing that they care about security. Practices such as pinning dependencies, branch protection, required reviews, continuous integration tests etc. are measured to provide a score and accompanying badge.
You can do the same for your projects, and this presentation will show you how, with an emphasis on the unique challenges that come up when working with Dart and Flutter.
The session will provide a walkthrough of the steps involved in securing a first repository, and then what it takes to repeat that process across an organization with multiple repos. It will also look at the ongoing maintenance involved once scorecards have been implemented, and how aspects of that maintenance can be better automated to minimize toil.
The Rise of Supernetwork Data Intensive Computing - Larry Smarr
Invited Remote Lecture to SC21
The International Conference for High Performance Computing, Networking, Storage, and Analysis
St. Louis, Missouri
November 18, 2021
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em... - Erasmo Purificato
Slides of the tutorial entitled "Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Emerging Trends" held at UMAP'24: 32nd ACM Conference on User Modeling, Adaptation and Personalization (July 1, 2024 | Cagliari, Italy)
Quality Patents: Patents That Stand the Test of Time - Aurora Consulting
Is your patent a vanity piece of paper for your office wall? Or is it a reliable, defendable, assertable, property right? The difference is often quality.
Is your patent simply a transactional cost and a large pile of legal bills for your startup? Or is it a leverageable asset worthy of attracting precious investment dollars, worth its cost in multiples of valuation? The difference is often quality.
Is your patent application only good enough to get through the examination process? Or has it been crafted to stand the tests of time and varied audiences if you later need to assert that document against an infringer, find yourself litigating with it in an Article 3 Court at the hands of a judge and jury, God forbid, end up having to defend its validity at the PTAB, or even needing to use it to block pirated imports at the International Trade Commission? The difference is often quality.
Quality will be our focus for a good chunk of the remainder of this season. What goes into a quality patent, and where possible, how do you get it without breaking the bank?
** Episode Overview **
In this first episode of our quality series, Kristen Hansen and the panel discuss:
⦿ What do we mean when we say patent quality?
⦿ Why is patent quality important?
⦿ How to balance quality and budget
⦿ The importance of searching, continuations, and draftsperson domain expertise
⦿ Very practical tips, tricks, examples, and Kristen’s Musts for drafting quality applications
https://www.aurorapatents.com/patently-strategic-podcast.html
Sustainability requires ingenuity and stewardship. Did you know that Pigging Solutions pigging systems help you achieve your sustainable manufacturing goals AND provide a rapid return on investment?
How? Our systems recover over 99% of product in transfer piping. Recovering trapped product from transfer lines that would otherwise become flush-waste, means you can increase batch yields and eliminate flush waste. From raw materials to finished product, if you can pump it, we can pig it.
YOUR RELIABLE WEB DESIGN & DEVELOPMENT TEAM — FOR LASTING SUCCESS
WPRiders is a web development company specialized in WordPress and WooCommerce websites and plugins for customers around the world. The company is headquartered in Bucharest, Romania, but our team members are located all over the world. Our customers are primarily from the US and Western Europe, but we have clients from Australia, Canada and other areas as well.
Some facts about WPRiders and why we are one of the best firms around:
More than 700 five-star reviews! You can check them here.
1500 WordPress projects delivered.
We respond 80% faster than other firms! Data provided by Freshdesk.
We’ve been in business since 2015.
We are located in 7 countries and have 22 team members.
With so many projects delivered, our team knows what works and what doesn’t when it comes to WordPress and WooCommerce.
Our team members are:
- highly experienced developers (employees & contractors with 5 -10+ years of experience),
- great designers with an eye for UX/UI with 10+ years of experience
- project managers with development background who speak both tech and non-tech
- QA specialists
- Conversion Rate Optimisation - CRO experts
They are all working together to provide you with the best possible service. We are passionate about WordPress, and we love creating custom solutions that help our clients achieve their goals.
At WPRiders, we are committed to building long-term relationships with our clients. We believe in accountability, in doing the right thing, as well as in transparency and open communication. You can read more about WPRiders on the About us page.
Paper introduction: A Systematic Survey of Prompt Engineering on Vision-Language Foundation ... - Toru Tamaki
Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, Philip Torr "A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models" arXiv2023
https://arxiv.org/abs/2307.12980
How Social Media Hackers Help You to See Your Wife's Message.pdf - HackersList
In the modern digital era, social media platforms have become integral to our daily lives. These platforms, including Facebook, Instagram, WhatsApp, and Snapchat, offer countless ways to connect, share, and communicate.
Kief Morris rethinks the infrastructure code delivery lifecycle, advocating for a shift towards composable infrastructure systems. We should shift to designing around deployable components rather than code modules, use more useful levels of abstraction, and drive design and deployment from applications rather than bottom-up, monolithic architecture and delivery.
Mitigating the Impact of State Management in Cloud Stream Processing Systems - ScyllaDB
Stream processing is a crucial component of modern data infrastructure, but constructing an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has emerged as an effective solution to these challenges, but it can introduce high latency issues, especially when dealing with complex continuous queries that necessitate managing extra-large internal states.
In this talk, we focus on addressing the high latency issues associated with S3 storage in stream processing systems that employ a decoupled compute and storage architecture. We delve into the root causes of latency in this context and explore various techniques to minimize the impact of S3 latency on stream processing performance. Our proposed approach is to implement a tiered storage mechanism that leverages a blend of high-performance and low-cost storage tiers to reduce data movement between the compute and storage layers while maintaining efficient processing.
Throughout the talk, we will present experimental results that demonstrate the effectiveness of our approach in mitigating the impact of S3 latency on stream processing. By the end of the talk, attendees will have gained insights into how to optimize their stream processing systems for reduced latency and improved cost-efficiency.
8. Who uses Presto - Airbnb
Airpal - a web-based query execution tool
"Presto is amazing. It's an order of magnitude faster than Hive in most of our use cases. It reads directly from HDFS, so unlike Redshift, there isn't a lot of ETL before you can use it. It just works."
- Christopher Gutierrez, Manager of Online Analytics, Airbnb
12. Presto is
•Fast !!! (10x faster than Hive)
•Even faster with new Presto ORC reader
•Written in Java with a pluggable backend
•Not SQL-like, but ANSI-SQL
•Code generation like LLVM
•Not only source is open, but open sourced (No private branch)
15. Presto is
CREATE TABLE mysql.hello.order_item AS
SELECT o.*, i.*
FROM hive.world.orders o -- TABLESAMPLE SYSTEM (10)
JOIN mongo.deview.lineitem i -- TABLESAMPLE BERNOULLI (40)
ON o.orderkey = i.orderkey
WHERE conditions..
16. Presto - Planner
[Diagram: query planning pipeline]
Coordinator: SQL Analyzer → Analysis → Logical Plan → Optimizer → Plan → Plan Fragmenter → Plan Fragment → Distributed Query Scheduler → Stage Execution Plan → Workers
Worker: Local Execution Planner → TASK, TASK
17. Presto - Page
PAGE
- positionCount
- VAR_WIDTH BLOCK: nulls, offsets, values
- FIXED_WIDTH BLOCK: nulls, values
Page / Block serialization
Page: positionCount | blockCount | blocks
FIXED_WIDTH block: 11 (name length) | "FIXED_WIDTH" | positionCount | bit-encoded nullFlags | values length | values
VARIABLE_WIDTH block: 14 (name length) | "VARIABLE_WIDTH" | positionCount | offsets[0] | offsets[1] | offsets[2…] | offsets[pos-1] | offsets[pos] | bit-encoded nullFlags | values
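The variable-width layout can be illustrated with a small Java sketch. This is not Presto's Block API: the class name VarWidthBlockSketch and its methods are hypothetical, and it only assumes what the slide shows (a position count, positionCount + 1 offsets, null flags, and one contiguous values buffer).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a variable-width block; not Presto's Block API.
final class VarWidthBlockSketch {
    private final int positionCount;
    private final int[] offsets;   // positionCount + 1 entries; offsets[i + 1] - offsets[i] is the length of row i
    private final boolean[] nulls; // one null flag per position
    private final byte[] values;   // all non-null values concatenated

    VarWidthBlockSketch(String... rows) {
        positionCount = rows.length;
        offsets = new int[positionCount + 1];
        nulls = new boolean[positionCount];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < positionCount; i++) {
            if (rows[i] == null) {
                nulls[i] = true; // a null row contributes no bytes
            } else {
                byte[] bytes = rows[i].getBytes(StandardCharsets.UTF_8);
                out.write(bytes, 0, bytes.length);
            }
            offsets[i + 1] = out.size();
        }
        values = out.toByteArray();
    }

    int getPositionCount() {
        return positionCount;
    }

    boolean isNull(int position) {
        return nulls[position];
    }

    int getLength(int position) {
        return offsets[position + 1] - offsets[position];
    }

    String getString(int position) {
        if (nulls[position]) {
            return null;
        }
        return new String(values, offsets[position], getLength(position), StandardCharsets.UTF_8);
    }
}
```

Reading a value is one offset lookup plus a slice of the shared buffer, which is why the serialized form only needs the offsets, the null flags and the raw bytes.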
18. Presto - Cluster Memory Manager
[Diagram: Coordinator and three Workers, GET /v1/memory]
@Config("query.max-memory") = 20G
@Config("query.max-memory-per-node") = 1G
@Config("resources.reserved-system-memory") = 40% of -Xmx
Memory pools:
- System: reserved-system-memory
- Reserved: max-memory-per-node
- General
Task
Block All tasks
26. Airlift - Distributed service framework
•https://github.com/airlift/airlift
•Core of Presto communication
•HTTP
•Bootstrap
•Node discovery
•RESTful API
•Dependency Injection
•Configuration
•Utilities
@Path("/v2/event")
public class EventResource {
    @POST
    public Response createQuery(EventRequests events) {
        …
    }
}
public class CollectorMainModule implements ConfigurationAwareModule {
    @Override
    public synchronized void configure(Binder binder) {
        discoveryBinder(binder).bindHttpAnnouncement("collector");
        jsonCodecBinder(binder).bindJsonCodec(EventRequest.class);
        jaxrsBinder(binder).bind(EventResource.class);
    }
    public static void main(String[] args) {
        Bootstrap app = new Bootstrap(ImmutableList.of(
                new NodeModule(),
                new DiscoveryModule(),
                new HttpServerModule(),
                new JsonModule(), new JaxrsModule(true),
                new EventModule(),
                new CollectorMainModule()
        ));
        Injector injector = app.strictConfig().initialize();
        injector.getInstance(Announcer.class).start();
    }
}
ex) https://github.com/miniway/presto-event-collector
27. Fastutil - Fast Java collection
•FastUtil 6.6.0 turned out to be consistently fast.
•Koloboke is getting second in many tests.
•GS implementation is good enough, but is slower than FastUtil and Koloboke.
http://java-performance.info/hashmap-overview-jdk-fastutil-goldman-sachs-hppc-koloboke-trove-january-2015/
28. ASM - Bytecode manipulation
package pkg;
public interface SumInterface {
    long sum(long value);
}
public class MyClass implements SumInterface {
    private long result = 0L;
    public MyClass(long value) {
        result = value;
    }
    @Override
    public long sum(long value) {
        result += value;
        return result;
    }
}
// COMPUTE_MAXS lets ASM compute max stack and locals itself
ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_MAXS);
cw.visit(V1_7, ACC_PUBLIC,
        "pkg/MyClass", null,
        "java/lang/Object",
        new String[] { "pkg/SumInterface" });
// the initial-value argument is only used for static fields, so pass null
cw.visitField(ACC_PRIVATE,
        "result", "J", null, null).visitEnd();
// constructor
MethodVisitor m = cw.visitMethod(ACC_PUBLIC,
        "<init>", "(J)V", null, null);
m.visitCode();
// call super()
m.visitVarInsn(ALOAD, 0); // this
m.visitMethodInsn(INVOKESPECIAL,
        "java/lang/Object", "<init>", "()V",
        false);
29. ASM - Bytecode manipulation (Cont.)
// this.result = value
m.visitVarInsn(ALOAD, 0); // this
m.visitVarInsn(LLOAD, 1); // value
m.visitFieldInsn(PUTFIELD,
        "pkg/MyClass", "result", "J");
m.visitInsn(RETURN);
m.visitMaxs(-1, -1); // arguments are ignored under COMPUTE_MAXS
m.visitEnd();
// public long sum(long value)
m = cw.visitMethod(ACC_PUBLIC, "sum", "(J)J",
        null, null);
m.visitCode();
m.visitVarInsn(ALOAD, 0); // this
m.visitVarInsn(ALOAD, 0); // this
m.visitFieldInsn(GETFIELD,
        "pkg/MyClass", "result", "J");
m.visitVarInsn(LLOAD, 1); // value
// this.result + value
m.visitInsn(LADD);
m.visitFieldInsn(PUTFIELD,
        "pkg/MyClass", "result", "J");
m.visitVarInsn(ALOAD, 0); // this
m.visitFieldInsn(GETFIELD,
        "pkg/MyClass", "result", "J");
m.visitInsn(LRETURN);
m.visitMaxs(-1, -1); // arguments are ignored under COMPUTE_MAXS
m.visitEnd();
cw.visitEnd();
byte[] bytes = cw.toByteArray();
// defineClass is protected, so call it from inside a ClassLoader subclass
Class<?> clazz = defineClass("pkg.MyClass", bytes, 0, bytes.length);
30. Library - Misc.
•JDK 8u40 +
•Guice - Lightweight dependency injection
•Guava - Replacing Java8 Stream, Optional and Lambda
•ANTLR4 - Parser generator, SQL parser
•Jetty - HTTP Server and Client
•Jackson - JSON
•Jersey - RESTful API
32. Code Generation
•ASM
•Generates Java classes and methods at runtime, based on the SQL
•30% performance gain
•Where
•Filter and Projection
•Join Lookup source
•Join Probe
•Order By
•Aggregation
33. Code Generation - Filter
SELECT * FROM lineitem
WHERE orderkey = 100 AND quantity = 200
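Logical Planner: AND( EQ(#0, 100), EQ(#1, 200) )
class AndOperator extends Operator {
    private Operator left = new EqualOperator(0, 100L);  // orderkey (#0) = 100
    private Operator right = new EqualOperator(1, 200L); // quantity (#1) = 200
    @Override
    public boolean evaluate(Cursor cur)
    {
        if (!left.evaluate(cur)) {
            return false;
        }
        return right.evaluate(cur);
    }
}
class EqualOperator extends Operator {
    private final int position; // column index in the cursor
    private final Object value; // constant to compare against
    EqualOperator(int position, Object value)
    {
        this.position = position;
        this.value = value;
    }
    @Override
    public boolean evaluate(Cursor cur)
    {
        return cur.getValue(position).equals(value);
    }
}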
34. Code Generation - Filter
// invoke MethodHandle( $operator$EQUAL(#0, 100) )
push cursor.getValue(#0)
push 100
$stack = invokeDynamic bootstrap(0) $operator$EQUAL
if (!$stack) { goto end; }
push cursor.getValue(#1)
push 200
$stack = invokeDynamic bootstrap(0) $operator$EQUAL
end:
return $stack
@ScalarOperator(EQUAL)
@SqlType(BOOLEAN)
public static boolean equal(@SqlType(BIGINT) long left,
        @SqlType(BIGINT) long right) {
    return left == right;
}
=> MethodHandle("$operator$EQUAL(long, long): boolean")
Local Execution Planner: AND( $op$EQ(#0, 100), $op$EQ(#1, 200) )
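The same short-circuit filter can be approximated with the JDK's java.lang.invoke API alone. The sketch below is not Presto's generated bytecode: FilterHandleSketch, getValue and lineitemFilter are hypothetical names, a row is modelled as a plain long[], and MethodHandles.guardWithTest stands in for the "if (!$stack) goto end" jump.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical sketch: composes "col0 = 100 AND col1 = 200" from method handles.
final class FilterHandleSketch {
    // Stand-in for the BIGINT equality operator shown on the slide.
    static boolean equal(long left, long right) {
        return left == right;
    }

    // Stand-in for cursor.getValue(#channel); a row is just a long[] here.
    static long getValue(long[] row, int channel) {
        return row[channel];
    }

    // Builds a handle of type (long[]) -> boolean that tests row[channel] == constant.
    static MethodHandle equalsConstant(int channel, long constant) {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodHandle eq = lookup.findStatic(FilterHandleSketch.class, "equal",
                    MethodType.methodType(boolean.class, long.class, long.class));
            MethodHandle get = lookup.findStatic(FilterHandleSketch.class, "getValue",
                    MethodType.methodType(long.class, long[].class, int.class));
            MethodHandle column = MethodHandles.insertArguments(get, 1, channel);  // (long[]) -> long
            MethodHandle eqConst = MethodHandles.insertArguments(eq, 1, constant); // (long) -> boolean
            return MethodHandles.filterArguments(eqConst, 0, column);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Short-circuit AND: the right side runs only when the left side is true.
    static MethodHandle and(MethodHandle left, MethodHandle right) {
        MethodHandle alwaysFalse = MethodHandles.dropArguments(
                MethodHandles.constant(boolean.class, false), 0, long[].class);
        return MethodHandles.guardWithTest(left, right, alwaysFalse);
    }

    // WHERE orderkey = 100 AND quantity = 200, with orderkey at #0 and quantity at #1.
    static MethodHandle lineitemFilter() {
        return and(equalsConstant(0, 100L), equalsConstant(1, 200L));
    }

    static boolean matches(MethodHandle filter, long[] row) {
        try {
            return (boolean) filter.invokeExact(row);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

For example, FilterHandleSketch.matches(FilterHandleSketch.lineitemFilter(), new long[] {100L, 200L}) evaluates both comparisons, while a row whose first column is not 100 never reaches the second one.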
37. Code Generation - PageHash (Cont.)
// generic version: loops over the hash channels
long hashRow(int position,
        Block[] blocks) {
    int result = 0;
    for (int i = 0; i < hashChannels.size(); i++) {
        int hashChannel = hashChannels.get(i);
        Type type = types.get(hashChannel);
        result = result * 31 +
                type.hash(blocks[i], position);
    }
    return result;
}
// compiled version: loop unrolled for the two hash columns
long hashRow(int position,
        Block[] blocks) {
    int result = 0;
    result = result * 31 +
            type_colX.hash(blocks[0], position);
    result = result * 31 +
            type_colY.hash(blocks[1], position);
    return result;
}
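A self-contained sketch can confirm that the unrolled form computes the same hash as the loop. It makes simplifying assumptions that are not from the slides: blocks are plain long[] columns, the hypothetical ColumnHash interface replaces Type, and BIGINT hashing is taken to be Long.hashCode.

```java
// Hypothetical sketch: generic versus unrolled row hashing over long[] columns.
final class HashRowSketch {
    // Stand-in for Type.hash(block, position).
    interface ColumnHash {
        int hash(long[] block, int position);
    }

    // Assumed hash for a 64-bit integer column.
    static final ColumnHash BIGINT = (block, position) -> Long.hashCode(block[position]);

    // Generic version: one channel lookup and one interface call per column, per row.
    static long hashRowGeneric(ColumnHash[] types, int[] hashChannels, long[][] blocks, int position) {
        long result = 0;
        for (int i = 0; i < hashChannels.length; i++) {
            ColumnHash type = types[hashChannels[i]];
            result = result * 31 + type.hash(blocks[i], position);
        }
        return result;
    }

    // "Compiled" version for exactly two BIGINT hash columns: no loop, no lookups.
    static long hashRowUnrolled(long[][] blocks, int position) {
        long result = 0;
        result = result * 31 + Long.hashCode(blocks[0][position]);
        result = result * 31 + Long.hashCode(blocks[1][position]);
        return result;
    }
}
```

The unrolled method is what a generator could emit once the number and types of the hash columns are known, so the per-row work shrinks to straight-line arithmetic.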
38. Code Generation - PageHash (Cont.)
// generic version: loops over the hash channels
boolean equalsRow(
        int leftBlockIndex, int leftPosition,
        int rightPosition, Block[] rightBlocks) {
    for (int i = 0; i < hashChannels.size(); i++) {
        int hashChannel = hashChannels.get(i);
        Type type = types.get(hashChannel);
        Block leftBlock =
                channels.get(hashChannel)
                        .get(leftBlockIndex);
        if (!type.equalTo(leftBlock, leftPosition,
                rightBlocks[i], rightPosition)) {
            return false;
        }
    }
    return true;
}
// compiled version: loop unrolled for the two hash columns
boolean equalsRow(
        int leftBlockIndex, int leftPosition,
        int rightPosition, Block[] rightBlocks) {
    Block leftBlock =
            channels_colX.get(leftBlockIndex);
    if (!type_colX.equalTo(leftBlock, leftPosition,
            rightBlocks[0], rightPosition)) {
        return false;
    }
    leftBlock =
            channels_colY.get(leftBlockIndex);
    if (!type_colY.equalTo(leftBlock, leftPosition,
            rightBlocks[1], rightPosition)) {
        return false;
    }
    return true;
}
39. Method Variable Binding
1. regexp_like(string, pattern) → boolean
2. regexp_like(string, cast(pattern as RegexType)) // OperatorType.CAST
3. regexp_like(string, new Regex(pattern))
4. MethodHandle handle = MethodHandles.insertArguments(regexpLike, 1, new Regex(pattern))
5. handle.invoke(string)
@ScalarOperator(OperatorType.CAST)
@SqlType("RegExp")
public static Regex castToRegexp(@SqlType(VARCHAR) Slice pattern) {
    return new Regex(pattern.getBytes(), 0, pattern.length());
}
@ScalarFunction
@SqlType(BOOLEAN)
public static boolean regexpLike(@SqlType(VARCHAR) Slice source,
        @SqlType("RegExp") Regex pattern) {
    Matcher m = pattern.matcher(source.getBytes());
    int offset = m.search(0, source.length());
    return offset != -1;
}
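The binding in steps 4 and 5 can be tried with the JDK alone. In this sketch java.util.regex.Pattern stands in for the Regex class on the slide and String stands in for Slice; RegexBindingSketch, bindPattern and test are hypothetical names. MethodHandles.insertArguments is the real JDK call that fixes the pattern argument once, so the pattern is compiled a single time rather than once per row.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.regex.Pattern;

// Hypothetical sketch: bind a pre-compiled pattern into a two-argument function.
final class RegexBindingSketch {
    // Two-argument form, analogous to regexp_like(string, pattern).
    static boolean regexpLike(String source, Pattern pattern) {
        return pattern.matcher(source).find();
    }

    // Compiles the pattern once and binds it as argument 1; the result takes only the string.
    static MethodHandle bindPattern(String pattern) {
        try {
            MethodHandle handle = MethodHandles.lookup().findStatic(
                    RegexBindingSketch.class, "regexpLike",
                    MethodType.methodType(boolean.class, String.class, Pattern.class));
            return MethodHandles.insertArguments(handle, 1, Pattern.compile(pattern));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Invokes the bound handle, whose type is now (String) -> boolean.
    static boolean test(MethodHandle bound, String source) {
        try {
            return (boolean) bound.invokeExact(source);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

A caller would build the handle once per query, for example with RegexBindingSketch.bindPattern("^pre"), and then call RegexBindingSketch.test(handle, value) for each row.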
42. Plugin - Raptor
•Stores data in flash on the Presto machines in ORC format
•Metadata is stored in MySQL (Extendable)
•Near real-time loads (5 - 10mins)
•3TB / day, 80B rows/day , 5 secs query
•CREATE VIEW myview AS SELECT …
•DELETE FROM tab WHERE conditions…
•UPDATE (Future)
•Coarse grained Index : min / max value of all columns
•Compaction
•Backup Store (Extendable)
No more ?!
43. Plugin - How to write
•https://prestodb.io/docs/current/develop/spi-overview.html
•ConnectorFactory
•ConnectorMetadata
•ConnectorSplitManager
•ConnectorHandleResolver
•ConnectorRecordSetProvider (PageSourceProvider)
•ConnectorRecordSinkProvider (PageSinkProvider)
•Add new Type
•Add new Function (A.K.A UDF)
44. Plugin - MongoDB
•https://github.com/facebook/presto/pull/3337
•5 Non-business days
•Predicate Pushdown
•Add a Type (ObjectId)
•Add UDFs (objectid(), objectid(string))
public class MongoPlugin implements Plugin {
    @Override
    public <T> List<T> getServices(Class<T> type) {
        if (type == ConnectorFactory.class) {
            return ImmutableList.of(
                    new MongoConnectorFactory(…));
        } else if (type == Type.class) {
            return ImmutableList.of(OBJECT_ID);
        } else if (type == FunctionFactory.class) {
            return ImmutableList.of(
                    new MongoFunctionFactory(typeManager));
        }
        return ImmutableList.of();
    }
}
45. Plugin - MongoDB
class MongoConnectorFactory implements ConnectorFactory {
    @Override
    public Connector create(String connectorId) {
        Bootstrap app = new Bootstrap(new
                MongoClientModule());
        return app.initialize()
                .getInstance(MongoConnector.class);
    }
}
class MongoClientModule implements Module {
    @Override
    public void configure(Binder binder) {
        binder.bind(MongoConnector.class)
                .in(SINGLETON);
        …
        configBinder(binder)
                .bindConfig(MongoClientConfig.class);
    }
}
class MongoConnector implements Connector {
    @Inject
    public MongoConnector(
            MongoSession mongoSession,
            MongoMetadata metadata,
            MongoSplitManager splitManager,
            MongoPageSourceProvider
                    pageSourceProvider,
            MongoPageSinkProvider
                    pageSinkProvider,
            MongoHandleResolver
                    handleResolver) {
        …
    }
}
46. Plugin - MongoDB UDF
public class MongoFunctionFactory
        implements FunctionFactory {
    @Override
    public List<ParametricFunction> listFunctions()
    {
        return new FunctionListBuilder(typeManager)
                .scalar(ObjectIdFunctions.class)
                .getFunctions();
    }
}
public class ObjectIdType
        extends AbstractVariableWidthType {
    public static final ObjectIdType OBJECT_ID = new ObjectIdType();
    @JsonCreator
    public ObjectIdType() {
        super(parseTypeSignature("ObjectId"),
                Slice.class);
    }
}
public class ObjectIdFunctions {
    @ScalarFunction("objectid")
    @SqlType("ObjectId")
    public static Slice ObjectId() {
        return Slices.wrappedBuffer(
                new ObjectId().toByteArray());
    }
    @ScalarFunction("objectid")
    @SqlType("ObjectId")
    public static Slice ObjectId(@SqlType(VARCHAR) Slice value) {
        return Slices.wrappedBuffer(
                new ObjectId(value.toStringUtf8()).toByteArray());
    }
    @ScalarOperator(EQUAL)
    @SqlType(BOOLEAN)
    public static boolean equal(@SqlType("ObjectId") Slice left,
            @SqlType("ObjectId") Slice right) {
        return left.equals(right);
    }
}