The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Move to Spark-(Yandu Oppacher, Shopify)

•

4 likes•2,504 views

The document discusses the challenges faced by Shopify in using its existing data warehouse and ETL processes due to increasing data volume and complexity. It describes Shopify's attempts to use Pig and Luigi as well as Platfora to address these issues, but notes they did not meet Shopify's needs. Shopify then moved to using Spark due to its fast performance, nice development model using Python, and ability to better handle their data and query complexity. The summary provides an overview of why Shopify changed its data warehousing approach and the key technology it adopted.

Recommended for you

Disrupting Big Data with Apache Spark in the Cloud

This document discusses the challenges of big data analytics and how Apache Spark and Databricks can help address them. It summarizes that: 1) There is a gap between the growth of data and ability to perform real-time analytics on that data due to challenges in managing infrastructure, empowering teams, and establishing production-ready applications. 2) Databricks provides a cloud-hosted platform that uses Apache Spark to allow for just-in-time processing of data across storage silos, with an integrated workspace for interactive exploration, machine learning, and production-ready workflows. 3) Databricks Enterprise Security provides an end-to-end security solution for Apache Spark to address challenges in securing file

•by Jen Aman

#apachespark

Simplify and Scale Data Engineering Pipelines with Delta Lake

We’re always told to ‘Go for the Gold!,’ but how do we get there? This talk will walk you through the process of moving your data to the finish fine to get that gold metal! A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion (‘Bronze’ tables), transformation/feature engineering (‘Silver’ tables), and machine learning training or prediction (‘Gold’ tables). Combined, we refer to these tables as a ‘multi-hop’ architecture. It allows data engineers to build a pipeline that begins with raw data as a ‘single source of truth’ from which everything flows. In this session, we will show how to build a scalable data engineering data pipeline using Delta Lake, so you can be the champion in your organization.

•by Databricks

*  apache spark   *big data   *ai   *

Spark and Online Analytics: Spark Summit East talky by Shubham Chopra

Apache Spark was designed as a batch analytics system. By caching RDDs, Spark speeds up jobs that iteratively process the same data. This pattern is also applicable to online analytics. We use Bloomberg’s Spark Server as a server runtime for online analytics. Our framework implements certain useful patterns applicable to online query processing and is centered on the idea of “Managed” DataFrames that can be refreshed and updated as per user requirements, without violating the immutability of RDDs/DataFrames. However, Spark presents significant challenges with respect to availability and resilience in an online setting where Spark is required to respond to queries with high SLAs. In this talk, we try to identify specific areas where slow-down or failures can result in the largest hits on online-query performance and potential solutions to address these.

•by Spark Summit

spark summit eastapache spark

Couple of false starts
5
Pig + Luigi
Pig + Oozie
Platfora

–platfora.com
“Without coding or ETL, data
warehousing, BI tools, or breaking a
sweat.”
6

Enter Spark
• Fast
• Nice development model
• Python
7

Recommended for you

How Apache Spark Is Helping Tame the Wild West of Wi-Fi

Spark is helping Wi-Fi provider iPass tame the unpredictability of Wi-Fi hotspots. iPass analyzes over 21 billion Wi-Fi scans to understand characteristics of over 500 million records and 100 million hotspots globally. Using Spark on AWS Databricks, iPass can automatically scale to handle real-time analytics on this large and growing data in a cost-effective way. This allows iPass to build an understanding of Wi-Fi network characteristics to improve their services.

•by Spark Summit

apache spark

Spark Summit EU talk by Pat Patterson

This document discusses building data pipelines with Spark and StreamSets. It describes how StreamSets Data Collector can be used to build pipelines that run on Spark today by leveraging Kafka RDDs and containers on Spark. It also outlines future directions for deeper Spark integration, including running pipelines on Databricks and developing a standalone Spark processor. The document concludes with a demo of StreamSets Data Collector capabilities.

•by Spark Summit

apache spark

Shifting Data Science into High Gear

Rob Thomas discusses IBM's investments in Apache Spark and the IBM Data Science Experience. IBM is a major contributor to Spark and has introduced tools like SparkSQL and Stocator. The presentation also introduces the IBM Data Science Experience, an analytics IDE built on Spark that provides learning resources, project sharing capabilities, and community features to enable collaboration. Thomas explains how IBM is growing the ecosystem around the Data Science Experience through deep integrations with IBM tools and light integrations with independent software vendors.

•by Spark Summit

#apachespark #sparksummit

GMV
A Case Study
9
165,000+
ACTIVE SHOPIFY MERCHANTS
$8 BILLION+
CUMULATIVE GMV

Growing pains
• Joins
• Groupings
• General data skew
• Getting to know python’s performance quirks
10

Starscream
11
• specialized joins
• resolvers
• range
• cassandra
• overby
• contracts
• incrementalized fact
builds

Our current stack
12
Kafka
OLTP
HDFS
Cassandra
Spark
FrontroomBackroom
Redshift
Tableau

Recommended for you

Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...

Workday Prism Analytics enables data discovery and interactive Business Intelligence analysis for Workday customers. Workday is a “pure SaaS” company, providing a suite of Financial and HCM (Human Capital Management) apps to about 2000 companies around the world, including more than 30% from Fortune-500 list. There are significant business and technical challenges to support millions of concurrent users and hundreds of millions daily transactions. Using memory-centric graph-based architecture allowed to overcome most of these problems. As Workday grew, data transactions from existing and new customers generated vast amounts of valuable and highly sensitive data. The next big challenge was to provide in-app analytics platform, which for the multiple types of accumulated data, and also would allow using blend in external datasets. Workday users wanted it to be super-fast, but also intuitive and easy-to-use both for the financial and HR analysts and for regular, less technical users. Existing backend technologies were not a good fit, so we turned to Apache Spark. In this presentation, we will share the lessons we learned when building highly scalable multi-tenant analytics service for transactional data. We will start with the big picture and business requirements. Then describe the architecture with batch and interactive modules for data preparation, publishing, and query engine, noting the relevant Spark technologies. Then we will dive into the internals of Prism’s Query Engine, focusing on Spark SQL, DataFrames and Catalyst compiler features used. We will describe the issues we encountered while compiling and executing complex pipelines and queries, and how we use caching, sampling, and query compilation techniques to support interactive user experience. Finally, we will share the future challenges for 2018 and beyond.

•by Databricks

apache sparksparkaisummit

Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...

Quby, an Amsterdam-based technology company, offers solutions to empower homeowners to stay in control of their electricity, gas and water usage. Using Europe’s largest energy dataset, consisting of petabytes of IoT data, the company has developed AI powered products that are used by hundreds of thousands of users on a daily basis. Delta Lake ensures the quality of incoming records though schema enforcement and evolution. But it is the Data Engineers role to check whether the expected data is ingested in to the Delta Lake at the right time with expected metrics so that downstream processes will function their duties. Re-training models and serving on the fly might go wrong unless we put the right monitoring infrastructure too.

•by Databricks

Insights Without Tradeoffs: Using Structured Streaming

Apache Spark 2.0 introduced Structured Streaming which allows users to continually and incrementally update your view of the world as new data arrives while still using the same familiar Spark SQL abstractions. Michael Armbrust from Databricks talks about the progress made since the release of Spark 2.0 on robustness, latency, expressiveness and observability, using examples of production end-to-end continuous applications. Speaker: Michael Armbrust Video: http://go.databricks.com/videos/spark-summit-east-2017/using-structured-streaming-apache-spark This talk was originally presented at Spark Summit East 2017.

•by Databricks

spark summitstructured streamingdatabricks

Thank you
13
Yandu Oppacher (@yandu)
Data Infrastructure

What's hot

Zeppelin at Twitter

Prasad Wagle

ASPgems - kappa architecture

Juantomás García Molina

Kappa Architecture is an alternative to Lambda Architecture that simplifies real-time data processing. It uses a distributed log like Kafka to store all input data immutably to allow reprocessing from the beginning if the processing code changes. This avoids having to maintain separate batch and real-time processing systems. The ASPgems team has implemented Kappa Architecture for several clients using Kafka, Spark Streaming, and Cassandra to provide real-time analytics and metrics in sectors like telecommunications, IoT, insurance, and energy.

Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...

Databricks

PixieDust is a new open source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark be more efficient. PixieDust speeds up data manipulation and display with features like: auto-visualization of Spark DataFrames, real-time Spark job progress monitoring, automated local install of Python and Scala kernels running with Spark, and much more. Come along and learn how you can use this tool in your own projects to visualize and explore data effortlessly with no coding. Oh, and if you prefer working with a Scala Notebook, this session is also for you, as PixieDust can also run on a Scala Kernel. Imagine being able to visualize your favorite Python chart engines from a Scala Notebook! We’ll finish the session with a demo combining Twitter, Watson Tone Analyzer, Spark Streaming, and some fun real-time visualizations–all running within a Notebook.

Disrupting Big Data with Apache Spark in the Cloud

Jen Aman

Simplify and Scale Data Engineering Pipelines with Delta Lake

Databricks

Spark and Online Analytics: Spark Summit East talky by Shubham Chopra

Spark Summit

How Apache Spark Is Helping Tame the Wild West of Wi-Fi

Spark Summit

Spark Summit EU talk by Pat Patterson

Spark Summit

Shifting Data Science into High Gear

Spark Summit

Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...

Databricks

Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...

Databricks

Insights Without Tradeoffs: Using Structured Streaming

Databricks

Bridging the Gap Between Datasets and DataFrames

Databricks

Apple leverages Apache Spark for processing large datasets to power key components of Apple's production services. The majority of users rely on Spark SQL to benefit from state-of-the-art optimizations in Catalyst and Tungsten. As there are multiple APIs to interact with Spark SQL, users have to make a wise decision which one to pick. While DataFrames and SQL are widely used, they lack type safety so that the analysis errors will not be detected during the compile time such as invalid column names or types. Also, the ability to apply the same functional constructions as on RDDs is missing in DataFrames. Datasets expose a type-safe API and support for user-defined closures at the cost of performance. This talk will explain cases when Spark SQL cannot optimize typed Datasets as much as it can optimize DataFrames. We will also present an effort to use bytecode analysis to convert user-defined closures into native Catalyst expressions. This helps Spark to avoid the expensive conversion between the internal format and JVM objects as well as to leverage more Catalyst optimizations. A consequence, we can bridge the gap in performance between Datasets and DataFrames, so that users do not have to sacrifice the benefits of Datasets for performance reasons.

Using Hadoop to build a Data Quality Service for both real-time and batch data

DataWorks Summit/Hadoop Summit

Griffin is a data quality platform built by eBay on Hadoop and Spark to provide a unified process for detecting data quality issues in both real-time and batch data across multiple systems. It defines common data quality dimensions and metrics and calculates measurement values and quality scores, storing results and generating trending reports. Griffin provides a centralized data quality service for eBay and has been deployed processing over 1.2PB of data and 800M daily records using 100+ metrics. It is open source and contributions are welcome.

Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...

Databricks

Spark SQL is a very effective distributed SQL engine for OLAP and widely adopted in Baidu production for many internal BI projects. However, Baidu has also been facing many challenges for large scale including tuning the shuffle parallelism for thousands of jobs, inefficient execution plan, and handling data skew. In this talk, we will explore Intel and Baidu’s joint efforts to address challenges in large scale and offer an overview of an adaptive execution mode we implemented for Baidu’s Big SQL platform which is based on Spark SQL. At runtime, adaptive execution can change the execution plan to use a better join strategy and handle skewed join automatically. It can also change the number of reducer to better fit the data scale. In general, adaptive execution decreases the effort involved in tuning SQL query parameters and improves the execution performance by choosing a better execution plan and parallelism at runtime. We’ll also share our experience of using adaptive execution in Baidu’s production cluster with thousands of server, where adaptive execution helps to improve the performance of some complex queries by 200%. After further analysis we found that several special scenarios in Baidu data analysis can benefit from the optimization of choosing better join type. We got 2x performance improvement in the scenario where the user wanted to analysis 1000+ advertisers’ cost from both web and mobile side and each side has a full information table with 10 TB parquet file per-day. Now we are writing probe jobs to detect more scenarios from current daily jobs of our users. We are also considering to expose the strategy interface based on the detailed metrics collected form adaptive execution mode for the upper users.

Apache Spark in Scientific Applciations

Dr. Mirko Kämpf

This document provides an overview of Apache Spark, including: - Apache Spark is a next generation data processing engine for Hadoop that allows for fast in-memory processing of huge distributed and heterogeneous datasets. - Spark offers tools for data science and components for data products and can be used for tasks like machine learning, graph processing, and streaming data analysis. - Spark improves on MapReduce by being faster, allowing parallel processing, and supporting interactive queries. It works on both standalone clusters and Hadoop clusters.

Spark at Airbnb

Hao Wang

Rapid Data Analytics @ Netflix

Data Con LA

At Netflix, we've spent a lot of time thinking about how we can make our analytics group move quickly. Netflix's Data Engineering & Analytics organization embraces the company's culture of "Freedom & Responsibility". How does a company with a $40 billion market cap and $6 billion in annual revenue keep their data teams moving with the agility of a tiny company? How do hundreds of data engineers and scientists make the best decisions for their projects independently, without the analytics environment devolving into chaos? We'll talk about how Netflix equips its business intelligence and data engineers with: the freedom to leverage cloud-based data tools - Spark, Presto, Redshift, Tableau and others - in ways that solve our most difficult data problems the freedom to find and introduce right software for the job - even if it isn't used anywhere else in-house the freedom to create and drop new tables in production without approval the freedom to choose when a question is a one-off, and when a question is asked often enough to require a self-service tool the freedom to retire analytics and data processes whose value doesn't justify their support costs Speaker Bios Monisha Kanoth is a Senior Data Architect at Netflix, and was one of the founding members of the current streaming Content Analytics team. She previously worked as a big data lead at Convertro (acquired by AOL) and as a data warehouse lead at MySpace. Jason Flittner is a Senior Business Intelligence Engineer at Netflix, focusing on data transformation, analysis, and visualization as part of the Content Data Engineering & Analytics team. He previously led the EC2 Business Intelligence team at Amazon Web Services and was a business intelligence engineer with Cisco. Chris Stephens is a Senior Data Engineer at Netflix. He previously served as the CTO at Deep 6 Analytics, a machine learning & content analytics company in Los Angeles, and on the data warehouse teams at the FOX Audience Network and Anheuser-Busch.

Building Robust Production Data Pipelines with Databricks Delta

Databricks

"Most data practitioners grapple with data quality issues and data pipeline complexities—it's the bane of their existence. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets. Databricks Delta, part of Databricks Runtime, is a next-generation unified analytics engine built on top of Apache Spark. Built on open standards, Delta employs co-designed compute and storage and is compatible with Spark API’s. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingests, fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data pipelines, the challenges data engineers face when it comes to data reliability and performance and how Delta can help. Through presentation, code examples and notebooks, we will explain pipeline challenges and the use of Delta to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain. This tutorial will be both instructor-led and hands-on interactive session. Instructions in how to get tutorial materials will be covered in class. WHAT YOU’LL LEARN: – Understand the key data reliability and performance data pipelines challenges – How Databricks Delta helps build robust pipelines at scale – Understand how Delta fits within an Apache Spark™ environment – How to use Delta to realize data reliability improvements – How to deliver performance gains using Delta PREREQUISITES: – A fully-charged laptop (8-16GB memory) with Chrome or Firefox – Pre-register for Databricks Community Edition" Speakers: Steven Yu, Burak Yavuz

Realtime streaming architecture in INFINARIO

Jozo Kovac

What's hot (20)

Zeppelin at Twitter

ASPgems - kappa architecture

Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...

Disrupting Big Data with Apache Spark in the Cloud

Simplify and Scale Data Engineering Pipelines with Delta Lake

Spark and Online Analytics: Spark Summit East talky by Shubham Chopra

How Apache Spark Is Helping Tame the Wild West of Wi-Fi

Spark Summit EU talk by Pat Patterson

Shifting Data Science into High Gear

Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...

Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...

Insights Without Tradeoffs: Using Structured Streaming

Bridging the Gap Between Datasets and DataFrames

Using Hadoop to build a Data Quality Service for both real-time and batch data

Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...

Apache Spark in Scientific Applciations

Spark at Airbnb

Rapid Data Analytics @ Netflix

Building Robust Production Data Pipelines with Databricks Delta

Realtime streaming architecture in INFINARIO

Viewers also liked

Open Stack Cheat Sheet V1

Anuchit Chalothorn

This document provides a cheat sheet of common commands for managing OpenStack services including Nova (compute), Neutron (networking), Glance (images), and Keystone (identity). Key commands covered include adding and removing networks, floating IPs, security groups, images, instances, and users/tenants. The cheat sheet contains over 30 commands organized by service to help users manage basic operations and troubleshoot issues in an OpenStack deployment.

Tachyon-2014-11-21-amp-camp5

Haoyuan Li

Tachyon is a memory-centric distributed storage system that provides reliable data sharing at memory speed across various cluster computing frameworks. It addresses issues with current storage systems like slow data sharing due to disk writes, cache loss when processes crash, and in-memory data duplication. Tachyon keeps only one copy of data in memory, tracks data lineage for fault tolerance, and enables fast sharing of data within and across frameworks and jobs. It provides a simple API and allows frameworks like Spark and MapReduce to access data reliably from memory without code changes.

Linux Filesystems, RAID, and more

Mark Wong

Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher B...

Spark Summit

Christopher Bradford presented lessons learned from using Apache Spark at the US Patent and Trademark Office to improve their process of loading data from Cassandra into Solr. The initial Spark implementation performed poorly due to opening and closing a Solr connection for each document. Optimizations like opening a single connection per partition and pushing documents in batches significantly improved performance, resulting in a solution that was 5 times faster than the original process. Future work involves further optimizing this Spark job and exploring additional uses of Spark.

The Hot Rod Protocol in Infinispan

Galder Zamarreño

In an environment where cloud-scaling applications is becoming more and more important, client-server architectures paradigms, as shown by memcached, are back with vengeance. In this talk, Galder will talk about Hot Rod, Infinispan's new client/server binary protocol, explaining the key differences compared to memcached's binary protocol, such as the possibility of receiving cluster topology changes. Audience of this talk will learn of the importance of Hot Rod in 'cloud-scale' application server clustering, where stateless application server instances could use Infinispan Hot Rod clients to retrieve state from an elastic farm of Infinispan Hot Rod servers, improving capabilities to run application server instances as a PaaS. The talk will finish with a brief demo of a cluster of Infinispan Hot Rod servers running on EC2 being accessed from a non-Java client. The audience is expected to have an intermediate understanding of client-server software architectures and cloud deployments.

Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS

Ceph Community

This document summarizes the performance of an all-NVMe Ceph cluster using Intel P3700 NVMe SSDs. Key results include achieving over 1.35 million 4K random read IOPS and 171K 4K random write IOPS with sub-millisecond latency. Partitioning the NVMe drives into multiple OSDs improved performance and CPU utilization compared to a single OSD per drive. The cluster also demonstrated over 5GB/s of sequential bandwidth.

Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Daniel Krook

Lightning talk from the OpenStack NYC meetup on October 8, 2014. http://bit.ly/ibm-os-meetup By Gil Vernik The integration between Apache Spark and Swift, and the use of Storlets for smart retrieval via filtering and privacy-support. The content of this talk is a statement from the IBM Research division, not IBM product divisions, and is not a statement from IBM regarding its plans, directions or product intents. Any activities described by this talk are subject to change.

Scaling up genomic analysis with ADAM

fnothaft

The document discusses ADAM, a new framework for scalable genomic data analysis. It aims to make genomic pipelines horizontally scalable by using a columnar data format and in-memory computing. This avoids disk I/O bottlenecks. The framework represents genomic data as schemas and stores data in Parquet for efficient column-based access. It has been shown to reduce genome analysis pipeline times from 100 hours to 1 hour by enabling analysis on large datasets in parallel across many nodes.

ELC-E 2010: The Right Approach to Minimal Boot Times

andrewmurraympc

Velox: Models in Action

Dan Crankshaw

Naïveté vs. Experience

Mike Fogus

The document discusses the authors' initial naive expectations for using Scala and Clojure compared to their actual experience. They initially thought the languages would solve issues like boilerplate code and Java compiler problems, but found they still had to deal with interoperability, immutability, and other functional programming concepts. In the end, Clojure met more of their needs due to its emphasis on seamless interoperability, traits, and pattern matching.

SparkR: Enabling Interactive Data Science at Scale

jeykottalam

The document discusses SparkR, which enables interactive data science using R on Apache Spark clusters. SparkR allows users to create and manipulate resilient distributed datasets (RDDs) from R and run R analytics functions in parallel on large datasets. It provides examples of using SparkR for tasks like word counting on text data and digit classification using the MNIST dataset. The API is designed to be similar to PySpark for ease of use.

SampleClean: Bringing Data Cleaning into the BDAS Stack

jeykottalam

OpenStack Cheat Sheet V2

Anuchit Chalothorn

This document provides a cheat sheet summarizing common commands for managing various OpenStack services, including Nova for compute, Glance for images, Keystone for identity, Cinder for volumes, Neutron for networking, Heat for orchestration, and Ceilometer for alarms and notifications. It lists commands for viewing status, creating and managing resources, and common operations for each service in 3 sentences or less.

A Curious Course on Coroutines and Concurrency

David Beazley (Dabeaz LLC)

David Beazley gave a tutorial on coroutines and concurrency at PyCon 2009. The tutorial provided an overview of coroutines, how they can be used, and whether they are useful. It explored coroutines in Python using generators and the send method added in Python 2.5. The tutorial was meant to determine if coroutines have practical applications as an approach to concurrency. It focused on practical examples over academic theory and included some later performance measurements.

Lab 5: Interconnecting a Datacenter using Mininet

Zubair Nabi

This document discusses using Mininet, an emulator for real-world networks that uses real kernel, switch, and application code on a single machine. It describes how Mininet uses Linux containers to emulate hosts, switches, and links. It also explains that Mininet creates a container and network namespace for each virtual host, with virtual interfaces connecting hosts to software switches via veth links. Finally, it briefly outlines Mininet's command line and Python interfaces.

Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)

Spark Summit

Typesafe has launched Spark support for Mesosphere's Data Center Operating System (DCOS). Typesafe engineers are contributing to the Mesos support for Spark and Typesafe will provide commercial support for Spark development and production deployment on Mesos. Mesos' flexibility allows many frameworks like Spark to run on top of it. This document discusses Spark on Mesos in coarse-grained and fine-grained modes and some features coming soon like dynamic allocation and constraints.

Best Practices for Virtualizing Apache Hadoop

Hortonworks

Join this webinar to discuss best practices for designing and building a solid, robust and flexible Hadoop platform on an enterprise virtual infrastructure. Attendees will learn the flexibility and operational advantages of Virtual Machines such as fast provisioning, cloning, high levels of standardization, hybrid storage, vMotioning, increased stabilization of the entire software stack, High Availability and Fault Tolerance. This is a can`t miss presentation for anyone wanting to understand design, configuration and deployment of Hadoop in virtual infrastructures.

In Search of the Perfect Global Interpreter Lock

David Beazley (Dabeaz LLC)

Python in Action (Part 2)

David Beazley (Dabeaz LLC)

The document discusses a Python tutorial presentation on systems programming. It describes building programs to analyze Firefox browser cache files, including a findcache.py program that recursively searches a directory for Firefox cache folders. The goal is to demonstrate Python for practical system tasks like file parsing and processing. Disclaimers note the code only uses standard Python and is intended for educational purposes.

Viewers also liked (20)

Open Stack Cheat Sheet V1

Tachyon-2014-11-21-amp-camp5

Linux Filesystems, RAID, and more

Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher B...

The Hot Rod Protocol in Infinispan

Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS

Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Scaling up genomic analysis with ADAM

ELC-E 2010: The Right Approach to Minimal Boot Times

Velox: Models in Action

Naïveté vs. Experience

SparkR: Enabling Interactive Data Science at Scale

SampleClean: Bringing Data Cleaning into the BDAS Stack

OpenStack Cheat Sheet V2

A Curious Course on Coroutines and Concurrency

Lab 5: Interconnecting a Datacenter using Mininet

Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)

Best Practices for Virtualizing Apache Hadoop

In Search of the Perfect Global Interpreter Lock

Python in Action (Part 2)

Similar to The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Move to Spark-(Yandu Oppacher, Shopify)

Pig on Spark

mortardata

5 Things that Make Hadoop a Game Changer

Caserta

5 Things that Make Hadoop a Game Changer Webinar by Elliott Cordo, Caserta Concepts There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that makes Hadoop a truly unique solution - a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi workload processing capabilities enabled by YARN, and the 3 other important things you need to know about Hadoop. To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029 For more information the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/

Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark

Databricks

Spark provides a unified platform for processing big data from multiple sources and in different formats. It allows for just-in-time processing of data without needing to wait for ETL into a data warehouse. This provides lower latency and makes it easy to combine data. Spark also unifies batch, streaming, and machine learning functionality into a single engine. This was demonstrated on a large online service company that leverages Spark for interactive queries, machine learning, and combining data from various sources for analytics and predictive services.

100 Exadata Implementations Later-Tim Fox

Enkitec

Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL

Inside Analysis

The Briefing Room with Dr. Robin Bloor and Splice Machine Live Webcast August 11, 2015 Watch the archive: https://bloorgroup.webex.com/bloorgroup/onstage/g.php?MTID=e1b33c9d45b178e13784b4a971a4c1349 The ETL process was born out of necessity, and for decades it has been the glue between data sources and target applications. But as data growth soars and increased competition demands real-time data, standard ETL has become brittle and often unmanageable. Scaling up resources can do the trick, but it’s very costly and only a matter of time before the processes hit another bottleneck. When outmoded ETL stands in the way of real-time analytics, it might be time to consider a completely new approach. Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor as he explains how modern, data-driven architectures must adopt an equally capable data integration strategy. He’ll be briefed by Rich Reimer of Splice Machine, who will discuss how his company solves ETL performance issues and enables real-time analytics and reports on big data. He will show that by leveraging the scale-out power of Hadoop and the in-memory speed of Spark, users can bring both analytical and operational systems together, eventually performing transformations only when needed. Visit InsideAnalysis.com for more information.

PyData: The Next Generation | Data Day Texas 2015

Cloudera, Inc.

This document discusses the past, present, and future of Python for big data analytics. It provides background on the rise of Python as a data analysis tool through projects like NumPy, pandas, and scikit-learn. However, as big data systems like Hadoop became popular, Python was not initially well-suited for problems at that scale. Recent projects like PySpark, Blaze, and Spartan aim to bring Python to big data, but challenges remain around data formats, distributed computing interfaces, and competing with Scala. The document calls for continued investment in high performance Python tools for big data to ensure its relevance in coming years.

State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...

Big Data Spain

http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote Machine Learning is not new. Big Machine Learning is qualitatively different: More data beats algorithm improvement, scale trumps noise and sample size effects, can brute-force manual tasks. Session presented at Big Data Spain 2014 Conference 18th Nov 2014 Kinépolis Madrid http://www.bigdataspain.org Event promoted by: http://www.paradigmatecnologico.com Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014

Teradata Partners Conference Oct 2014 Big Data Anti-Patterns

Douglas Moore

Douglas Moore discusses common anti-patterns seen when implementing big data solutions based on lessons learned from working with over 50 clients. He covers anti-patterns in hardware and infrastructure like relying on outdated reference architectures, tooling like trying to do analytics directly in NoSQL databases, and big data warehousing like over-curating data during ETL. The key is to understand the strengths and weaknesses of different tools and deploy the right solution for the intended workload.

Hadoop and SAP BI

Praveen Kumar (Tyagi)

This document provides an overview of big data and Hadoop. It discusses that Hadoop is a software framework for distributed processing of large datasets across clusters of computers. It can handle terabytes or petabytes of data stored on hundreds or thousands of nodes. The document also summarizes key Hadoop concepts like HDFS for storage, MapReduce for processing, and YARN for resource management. It highlights top reasons for using big data and Hadoop like scalability, flexibility to handle all data formats, and ability to use inexpensive commodity hardware. Finally, it presents a typical big data architecture showing data sources, storage, processing, visualization, and machine learning capabilities.

Michael Stonebraker: Big Data, Disruption, and the 800 Pound Gorilla in the ...

TamrMarketing

The document discusses big data and the challenges it poses. It identifies data integration at scale as the "800 pound gorilla" problem. Traditional extract, transform, load (ETL) and master data management (MDM) tools do not scale sufficiently to integrate the many diverse data sources that enterprises have. A better solution is to use machine learning for schema integration, deduplication, and golden record resolution. This approach can automatically classify large volumes of records instead of relying on error-prone manual rules. As machine learning and complex analytics replace traditional business intelligence, effective data integration will be critical but challenging to achieve at scale.

Splice machine-bloor-webinar-data-lakes

Edgar Alejandro Villegas

Cloud-native Semantic Layer on Data Lake

Databricks

Stream based Data Integration

Jeffrey T. Pollock

SnappyData Toronto Meetup Nov 2017

SnappyData

SnappyData is a new open source project started by Pivotal GemFire founders to provide a unified platform for OLTP, OLAP and streaming analytics using Spark. It aims to simplify big data architectures by supporting mixed workloads in a single clustered database, allowing for real-time operational analytics on live data without copying to other systems. This provides faster insights than current approaches that require periodic data copying between different databases and analytics systems.

Unlocked: the Hybrid Cloud - 12th May 2014 / All Slides (morning)

Rackspace Academy

Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive

confluent

1. A streaming platform like Kafka can provide the benefits of Hadoop for batch processing but in a faster, real-time way by processing data as it arrives rather than storing all data. 2. Virtual reality applications require stream processing to power features like VR mirroring and capture in real-time. Kafka's stream processing capabilities address challenges like this for VR. 3. The document discusses how AltspaceVR uses Kafka stream processing for applications like VR mirroring and capture, presence tracking, scheduled tasks, and more to power their real-time VR experiences.

Big Data, Big Dream

Wayne Weixin

This document provides an overview of Big Data concepts including architecture, use cases, and examples. It discusses how Big Data can help tell what will happen, find unexpected relationships, monitor situations, and fix problems. Real examples mentioned include Google leaving China, input methods, Google ads, Netflix recommendations, and Walmart personalized offers. The Big Data architecture described includes collecting data, using queues to decouple systems, processing data in Hadoop clusters, storing data in data stores like HDFS, HBase and Elasticsearch, and visualizing results. Cloud services can also be used to build projects quickly.

Apache Tajo - BWC 2014

Gruter

Spark Summit EU talk by Bas Geerdink

Spark Summit

Get rid of traditional ETL processes and move to using Spark instead. Spark allows for parallel processing of data and can run on Hadoop systems where data is typically stored. It treats batch and streaming data equally and supports continuous processing without waiting for phases to complete. Spark code can extract, transform, and load data as well as perform machine learning tasks for data enrichment and prediction. This provides more flexibility than traditional ETL tools.

Distributed Data Quality - Technical Solutions for Organizational Scaling

Justin Cunningham

The document discusses Yelp's distributed data architecture and quality solutions for organizational scaling. It describes how Yelp connects over 500 engineers across many services through shared data stored in databases like MySQL, Cassandra and Elasticsearch. The data is ingested through Kafka and processed using tools like Flink. Schematizer provides documentation, discovery and ownership of data. It also enables data lineage tracking and auditing to ensure quality as the data is transformed and loaded into data lakes and warehouses. The goal is to provide reliable, up-to-date shared data to align teams and enable autonomy through self-service data access.

Similar to The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Move to Spark-(Yandu Oppacher, Shopify) (20)

Pig on Spark

5 Things that Make Hadoop a Game Changer

Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark

100 Exadata Implementations Later-Tim Fox

Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL

PyData: The Next Generation | Data Day Texas 2015

State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...

Teradata Partners Conference Oct 2014 Big Data Anti-Patterns

Hadoop and SAP BI

Michael Stonebraker: Big Data, Disruption, and the 800 Pound Gorilla in the ...

Splice machine-bloor-webinar-data-lakes

Cloud-native Semantic Layer on Data Lake

Stream based Data Integration

SnappyData Toronto Meetup Nov 2017

Unlocked: the Hybrid Cloud - 12th May 2014 / All Slides (morning)

Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive

Big Data, Big Dream

Apache Tajo - BWC 2014

Spark Summit EU talk by Bas Geerdink

Distributed Data Quality - Technical Solutions for Organizational Scaling

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang

Spark Summit

In this session we will present a Configurable FPGA-Based Spark SQL Acceleration Architecture. It is target to leverage FPGA highly parallel computing capability to accelerate Spark SQL Query and for FPGA’s higher power efficiency than CPU we can lower the power consumption at the same time. The Architecture consists of SQL query decomposition algorithms, fine-grained FPGA based Engine Units which perform basic computation of sub string, arithmetic and logic operations. Using SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations and according to their patterns each is fed into an Engine Unit. SQL Engine Units are highly configurable and can be chained together to perform complex Spark SQL queries, finally one SQL query is transformed into a Hardware Pipeline. We will present the performance benchmark results comparing the queries with FGPA-Based Spark SQL Acceleration Architecture on XEON E5 and FPGA to the ones with Spark SQL Query on XEON E5 with 10X ~ 100X improvement and we will demonstrate one SQL query workload from a real customer.

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...

Spark Summit

In this talk, we’ll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems that are used to personalize the Netflix experience for their 99 millions members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize Machine Learning Models.

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu

Spark Summit

This presentation introduces how we design and implement a real-time processing platform using latest Spark Structured Streaming framework to intelligently transform the production lines in the manufacturing industry. In the traditional production line there are a variety of isolated structured, semi-structured and unstructured data, such as sensor data, machine screen output, log output, database records etc. There are two main data scenarios: 1) Picture and video data with low frequency but a large amount; 2) Continuous data with high frequency. They are not a large amount of data per unit. However the total amount of them is very large, such as vibration data used to detect the quality of the equipment. These data have the characteristics of streaming data: real-time, volatile, burst, disorder and infinity. Making effective real-time decisions to retrieve values from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to the Spark we are able to build a low-latency, high-throughput and reliable operation system involving data acquisition, transmission, analysis and storage. The actual user case proved that the system meets the needs of real-time decision-making. The system greatly enhance the production process of predictive fault repair and production line material tracking efficiency, and can reduce about half of the labor force for the production lines.

Improving Traffic Prediction Using Weather Data with Ramya Raghavendra

Spark Summit

As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how driver’s plan their day by alerting users before they travel, find the best times to travel, and over time, learn from new IoT data such as road conditions, incidents, etc. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As a part of this work, we conducted a case study over five large metropolitans in the US, 2.58 billion traffic records and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture with Apache Spark being used to build prediction models with weather and traffic data, and Spark Streaming used to score the model and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark to analyze geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms that were used with Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work to scale the system to provide predictions in over a 100 cities. Audience will learn about our experience scaling using Spark in offline and streaming mode, building statistical and deep-learning pipelines with Spark, and techniques to work with geospatial and time-series data.

A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...

Spark Summit

Graph is on the rise and it’s time to start learning about scalable graph analytics! In this session we will go over two Spark-based Graph Analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this Deep-Dive by example presentation, we will demonstrate some common traversals and explain how, at a Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API as well as the powerful GraphFrame Motif api as we show examples of both simultaneously. No need to be familiar with Graphs or Spark for this presentation as we’ll be explaining everything from the ground up!

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...

Spark Summit

Building accurate machine learning models has been an art of data scientists, i.e., algorithm selection, hyper parameter tuning, feature selection and so on. Recently, challenges to breakthrough this “black-arts” have got started. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit attractive advantages of Spark. The evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discovers the most accurate ones in minutes on a Ultra High Density Server, which employs 272 CPU cores, 2TB memory and 17TB SSD in 3U chassis. We will also share open challenges to learn such a massive amount of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown on Spark Summit SF’17 (#SFds5) but from more technical perspective.

Apache Spark and Tensorflow as a Service with Jim Dowling

Spark Summit

In Sweden, from the Rise ICE Data Center at www.hops.site, we are providing to reseachers both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databrick’s Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written on Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both Spark UI and Tensorboard, and how to examine logs and monitor training.

Apache Spark and Tensorflow as a Service with Jim Dowling

Spark Summit

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...

Spark Summit

With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as: low level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released The Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner and how to efficiently deploy a Spark library across multiple platforms.

Next CERN Accelerator Logging Service with Jakub Wozniak

Spark Summit

The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service. The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex. The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments. During this talk, Jakub will speak about NXCALS requirements and design choices that lead to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service and the Spark-based Extraction API where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN specific and can be of interest to other companies or institutes confronted with similar Big Data problems.

Powering a Startup with Apache Spark with Kevin Kim

Spark Summit

In Between (A mobile App for couples, downloaded 20M in Global), from daily batch for extracting metrics, analysis and dashboard. Spark is widely used by engineers and data analysts in Between, thanks to the performance and expendability of Spark, data operating has become extremely efficient. Entire team including Biz Dev, Global Operation, Designers are enjoying data results so Spark is empowering entire company for data driven operation and thinking. Kevin, Co-founder and Data Team leader of Between will be presenting how things are going in Between. Listeners will know how small and agile team is living with data (how we build organization, culture and technical base) after this presentation.

Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra

Spark Summit

Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...

Spark Summit

In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying of the projects and/or the fluency of communication between the different profiles of people involved in the projects. In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows using the already existing functional blocks or supporting the creation of new functional blocks. The created workflow can then be deployed in a Spark infrastructure and used through a REST API. For better understanding of Moriarty, the prototyping process and the way it hides the Spark environment to the Big Data users and developers, we will present it together with a couple of examples based on a Industry 4.0 success cases and other on a logistic success case.

How Nielsen Utilized Databricks for Large-Scale Research and Development with...

Spark Summit

Nielsen used Databricks to test new digital advertising rating methodologies on a large scale. Databricks allowed Nielsen to run analyses on thousands of advertising campaigns using both small panel data and large production data. This identified edge cases and performance gains faster than traditional methods. Using Databricks reduced the time required to test and deploy improved rating methodologies to benefit Nielsen's clients.

Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...

Spark Summit

Goal Based Data Production with Sim Simeonov

Spark Summit

Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enables a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.

Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...

Spark Summit

Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and a simple machine learning solution to create an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check if the platform is experiencing any operational problems that can impact in revenue losses. The application monitor distributed systems and provides notifications stating the problem detected, that way users can operate quickly to avoid serious problems which directly impact the company’s revenue and reduce the time for action. We will present an architecture for not only a monitoring system, but also an active actor for our outages recoveries. At the end of the presentation you will have access to our training program source code and you will be able to adapt and implement in your company. This solution already helped to prevent about US$3mi in losses last year.

Getting Ready to Use Redis with Apache Spark with Dvir Volk

Spark Summit

Getting Ready to use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities Redis provides. We cover the basic data types provided by Redis and cover the module system. Using an ad serving use-case, we look at how Redis can improve the performance and reduce the cost of using complex ML-models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed focusing primarily on decision trees and regression (linear and logistic) with code examples to demonstrate how to use these feature. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex, ML-models with high performance.

Deduplication and Author-Disambiguation of Streaming Records via Supervised M...

Spark Summit

Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself by – Application of Databricks and AWS makes this a scalable implementation. Compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial as Elsevier’s Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts covering a few hundred years. – We create a fingerprint for each content by deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TFIDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities namely such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search and hence it can offer a basic recommender for cross-product applications where we may not have a dedicate recommender engine designed. – Traditional author-disambiguation or record deduplication algorithms are batch-processing with small to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. Hence, it is crucial to maintain historical profiles and hence we have developed a machine learning implementation to deal with data streams and process them in mini batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it and how to process the raw data of pairwise similarity function into final clusters. Lessons learned from this talk can help all sort of companies where they want to integrate their data or deduplicate their user/customer/product databases.

MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...

Spark Summit

The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside the Spark SQL, and optimize the matrix execution plan based on Spark SQL Catalyst. We conduct case studies on a series of ML models and matrix computations with special features on different datasets. These are PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets are: social network data ( e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance.

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu

Improving Traffic Prediction Using Weather Data with Ramya Raghavendra

A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...

Apache Spark and Tensorflow as a Service with Jim Dowling

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...

Next CERN Accelerator Logging Service with Jakub Wozniak

Powering a Startup with Apache Spark with Kevin Kim

Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra

Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...

How Nielsen Utilized Databricks for Large-Scale Research and Development with...

Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...

Goal Based Data Production with Sim Simeonov

Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...

Getting Ready to Use Redis with Apache Spark with Dvir Volk

Deduplication and Author-Disambiguation of Streaming Records via Supervised M...

MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...

Recently uploaded

Australian Catholic University degree offer diploma Transcript

taqyea

学历认证补办制【微信：A575476】【(ACU毕业证）澳大利亚天主教大学毕业证成绩单offer】【微信：A575476】（留信学历认证永久存档查询）采用学校原版纸张，特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信：A575476】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信：A575476】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份【微信：A575476】 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才 → 【关于价格问题（保证一手价格）我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：可来公司面谈，可签订合同，会陪同客户一起到教育部认证窗口递交认证材料，客户在教育部官方认证查询网站查询到认证通过结果后付款，不成功不收费！办理(ACU毕业证）澳大利亚天主教大学毕业证【微信：A575476】外观非常精致，由特殊纸质材料制成，上面印有校徽、校名、毕业生姓名、专业等信息。办理(ACU毕业证）澳大利亚天主教大学毕业证【微信：A575476】格式相对统一，各专业都有相应的模板。通常包括以下部分：校徽：象征着学校的荣誉和传承。校名:学校英文全称授予学位：本部分将注明获得的具体学位名称。毕业生姓名：这是最重要的信息之一，��志着该证书是由特定人员获得的。颁发日期：这是毕业正式生效的时间，也代表着毕业生学业的结束。其他信息：根据不同的专业和学位，可能会有一些特定的信息或章节。办理(ACU毕业证）澳大利亚天主教大学毕业证【微信：A575476】价值很高，需要妥善保管。一般来说，应放置在安全、干燥、防潮的地方，避免长时间暴露在阳光下。如需使用，最好使用复印件而不是原件，以免丢失。综上所述，办理(ACU毕业证）澳大利亚天主教大学毕业证【微信：A575476 】是证明身份和学历的高价值文件。外观简单庄重，格式统一，包括重要的个人信息和发布日期。对持有人来说，妥善保管是非常重要的。

LLM powered Contract Compliance Application.pptx

Jyotishko Biswas

Supervised Learning (Data Science).pptx

TARIKU ENDALE

Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...

javier ramirez

Los sistemas distribuidos son difíciles. Los sistemas distribuidos de alto rendimiento, más. Latencias de red, mensajes sin confirmación de recibo, reinicios de servidores, fallos de hardware, bugs en el software, releases problemáticas, timeouts... hay un montón de motivos por los que es muy difícil saber si un mensaje que has enviado se ha recibido y procesado correctamente en destino. Así que para asegurar mandas el mensaje otra vez.. y otra... y cruzas los dedos para que el sistema del otro lado tenga tolerancia a los duplicados. QuestDB es una base de datos open source diseñada para alto rendimiento. Nos queríamos asegurar de poder ofrecer garantías de "exactly once", deduplicando mensajes en tiempo de ingestión. En esta charla, te cuento cómo diseñamos e implementamos la palabra clave DEDUP en QuestDB, permitiendo deduplicar y además permitiendo Upserts en datos en tiempo real, añadiendo solo un 8% de tiempo de proceso, incluso en flujos con millones de inserciones por segundo. Además, explicaré nuestra arquitectura de log de escrituras (WAL) paralelo y multithread. Por supuesto, todo esto te lo cuento con demos, para que veas cómo funciona en la práctica.

Streamlining Legacy Complexity Through Modernization

sanjay singh

Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe

kumkum tuteja$A17

iot paper presentation FINAL EDIT by kiran.pptx

KiranKumar139571

Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe

aashuverma204

Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe

shruti singh$A17

Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe

nehadubay1

Niagara College degree offer diploma Transcript

taqyea

原版制作【微信：A575476】【(NC毕业证)尼亚加拉学院毕业证成绩单offer】【微信：A575476】（留信学历认证永久存档查询）采用学校原版纸张（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信：A575476】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信：A575476】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份【微信：A575476】 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才 → 【关于价格问题（保证一手价格）我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：可来公司面谈，可签订合同，会陪同客户一起到教育部认证窗口递交认证材料，客户在教育部官方认证查询网站查询到认证通过结果后付款，不成功不收费！办理(NC毕业证)尼亚加拉学院毕业证【微信：A575476】外观非常精致，由特殊纸质材料制成，上面印有校徽、校名、毕业生姓名、专业等信息。办理(NC毕业证)尼亚加拉学院毕业证【微信：A575476】格式相对统一，各专业都有相应的模板。通常包括以下部分：校徽：象征着学校的荣誉和传承。校名:学校英文全称授予学位：本部分将注明获得的具体学位名称。毕业生姓名：这是最重要的信息之一，标志着该证书是由特定人员获得的。颁发日期：这是毕业正式生效的时间，也代表着毕业生学业的结束。其他信息：根据不同的专业和学位，可能会有一些特定的信息或章节。办理(NC毕业证)尼亚加拉学院毕业证【微信：A575476】价值很高，需要妥善保管。一般来说，应放置在安全、干燥、防潮的地方，避免长时间暴露在阳光下。如需使用，最好使用复印件而不是原件，以免丢失。综上所述，办理(NC毕业证)尼亚加拉学院毕业证【微信：A575476 】是证明身份和学历的高价值文件。外观简单庄重，格式统一，包括重要的个人信息和发布日期。对持有人来说，妥善保管是非常重要的。

NPS_Presentation_V3.pptx it is regarding National pension scheme

ASISHSABAT3

RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe

Alisha Pathan $A17

Maruti Wagon R on road price in Faridabad - CarDekho

kamli sharma#S10

[D3T1S02] Aurora Limitless Database Introduction

Amazon Web Services Korea

Amazon Aurora 클러스터를 초당 수백만 건의 쓰기 트랜잭션으로 확장하고 페타바이트 규모의 데이터를 관리할 수 있으며, 사용자 지정 애플리케이션 로직을 생성하거나 여러 데이터베이스를 관리할 필요 없이 Aurora에서 관계형 데이터베이스 워크로드를 단일 Aurora 라이터 인스턴스의 한도 이상으로 확장할 수 있는 Amazon Aurora Limitless Database를 소개합니다.

Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe

kumkum tuteja$A17

University of the Sunshine Coast degree offer diploma Transcript

taqyea

学历认证补办制【微信：A575476】【(USC毕业证）阳光海岸大学毕业证成绩单offer】【微信：A575476】（留信学历认证永久存档查询）采用学校原版纸张，特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信：A575476】文凭即可二、回国进私企、外企、自己��生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信：A575476】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份【微信：A575476】 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才 → 【关于价格问题（保证一手价格）我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：可来公司面谈，可签订合同，会陪同客户一起到教育部认证窗口递交认证材料，客户在教育部官方认证查询网站查询到认证通过结果后付款，不成功不收费！办理(USC毕业证）阳光海岸大学毕业证【微信：A575476】外观非常精致，由特殊纸质材料制成，上面印有校徽、校名、毕业生姓名、专业等信息。办理(USC毕业证）阳光海岸大学毕业证【微信：A575476】格式相对统一，各专业都有相应的模板。通常包括以下部分：校徽：象征着学校的荣誉和传承。校名:学校英文全称授予学位：本部分将注明获得的具体学位名称。毕业生姓名：这是最重要的信息之一，标志着该证书是由特定人员获得的。颁发日期：这是毕业正式生效的时间，也代表着毕业生学业的结束。其他信息：根据不同的专业和学位，可能会有一些特定的信息或章节。办理(USC毕业证）阳光海岸大学毕业证【微信：A575476】价值很高，需要妥善保管。一般来说，应放置在安全、干燥、防潮的地方，避免长时间暴露在阳光下。如需使用，最好使用复印件而不是原件，以免丢失。综上所述，办理(USC毕业证）阳光海岸大学毕业证【微信：A575476 】是证明身份和学历的高价值文件。外观简单庄重，格式统一，包括重要的个人信息和发布日期。对持有人来说，妥善保管是非常重要的。

[D3T1S03] Amazon DynamoDB design puzzlers

Amazon Web Services Korea

AIRLINE_SATISFACTION_Data Science Solution on Azure

SanelaNikodinoska1

Simon Fraser University degree offer diploma Transcript

taqyea

学历认证补办制【微信：A575476】【(SFU毕业证）西蒙弗雷泽大学毕业证成绩单offer】【微信：A575476】（留信学历认证永久存档查询）采用学校原版纸张，特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信：A575476】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信：A575476】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份【微信：A575476】 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才 → 【关于价格问题（保证一手价格）我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：可来公司面谈，可签订合同，会陪同客户一起到教育部认证窗口递交认证材料，客户在教育部官方认证查询网站查询到认证通过结果后付款，不成功不收费！办理(SFU毕业证）西蒙弗雷泽大学毕业证【微信：A575476】外观非常精致，由特殊纸质材料制成，上面印有校徽、校名、毕业生姓名、专业等信息。办理(SFU毕业证）西蒙弗雷泽大学毕业证【微信：A575476】格式相对统一，各专业都有相应的模板。通常包括以下部分：校徽：象征着学校的荣誉和传承。校名:学校英文全称授予学位：本部分将注明获得的具体学位名称。毕业生姓名：这是最重要的信息之一，标志着该证书是由特定人员获得的。颁发日期：这是毕业正式生效的时间，也代表着毕业生学业的结束。其他信息：根据不同的专业和学位，可能会有一些特定的信息或章节。办理(SFU毕业证）西蒙弗雷泽大学毕业证【微信：A575476】价值很高，需要妥善保管。一般来说，应放置在安全、干燥、防潮的地方，避免长时间暴露在阳光下。如需使用，最好使用复印件而不是原件，以免丢失。综上所述，办理(SFU毕业证）西蒙弗雷泽大学毕业证【微信：A575476 】是证明身份和学历的高价值文件。外观简单庄重，格式统一，包括重要的个人信息和发布日期。对持有人来说，妥善保管是非常重要的。

Recently uploaded (20)

Australian Catholic University degree offer diploma Transcript

LLM powered Contract Compliance Application.pptx

Supervised Learning (Data Science).pptx

Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...

Streamlining Legacy Complexity Through Modernization

Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe

iot paper presentation FINAL EDIT by kiran.pptx

Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe

Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe

Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe

Niagara College degree offer diploma Transcript

NPS_Presentation_V3.pptx it is regarding National pension scheme

RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe

Maruti Wagon R on road price in Faridabad - CarDekho

[D3T1S02] Aurora Limitless Database Introduction

Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe

University of the Sunshine Coast degree offer diploma Transcript

[D3T1S03] Amazon DynamoDB design puzzlers

AIRLINE_SATISFACTION_Data Science Solution on Azure

Simon Fraser University degree offer diploma Transcript