Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher Bradford, Open Source Connections)

•

4 likes•2,651 views

Christopher Bradford presented lessons learned from using Apache Spark at the US Patent and Trademark Office to improve their process of loading data from Cassandra into Solr. The initial Spark implementation performed poorly due to opening and closing a Solr connection for each document. Optimizations like opening a single connection per partition and pushing documents in batches significantly improved performance, resulting in a solution that was 5 times faster than the original process. Future work involves further optimizing this Spark job and exploring additional uses of Spark.

Recommended for you

Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley

- MLlib has rapidly developed over the past 5 years, growing from a few algorithms to over 50 algorithms and featurizers for classification, regression, clustering, recommendation, and more. - This growth has shifted from just adding algorithms to improving algorithms, infrastructure, and integrating ML workflows with Spark's broader capabilities like SQL, DataFrames, and streaming. - Going forward, areas of focus include continued scalability improvements, enhancing core algorithms, extensible APIs, and making MLlib a more comprehensive standard library.

•by Databricks

spark summitapache spark

A Journey into Databricks' Pipelines: Journey and Lessons Learned

With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform throughout our data pipeline for use cases such as ETL, data warehousing, and real time analysis. We will demonstrate how these applications empower engineering and data analytics. We will also share some lessons learned from building our data pipeline around security and operations. This talk will include examples on how to use Structured Streaming (a.k.a Streaming DataFrames) for online analysis, SparkR for offline analysis, and how we connect multiple sources to achieve a Just-In-Time Data Warehouse.

•by Databricks

databricksspark summitapache spark

Web-Scale Graph Analytics with Apache® Spark™

Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently. At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you’ll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries, and hear about real-world applications.

•by Databricks

graphframesapache spark 2.0databricks

EST – Data Loading
CSS Ingestion (CSS2C) Solr Ingestion (C2S)

EST – C2S Process
Note: some connections are omitted for clarity

EST – C2S Process (Scaled Out)
Note: some connections are omitted for clarity

Recommended for you

Building a SIMD Supported Vectorized Native Engine for Spark SQL

Spark SQL works very well with structured row-based data. Vectorized reader and writer for parquet/orc can make I/O much faster. It also used WholeStageCodeGen to improve the performance by Java JIT code. However Java JIT is usually not working very well on utilizing latest SIMD instructions under complicated queries. Apache Arrow provides columnar in-memory layout and SIMD optimized kernels as well as a LLVM based SQL engine Gandiva. These native based libraries can accelerate Spark SQL by reduce the CPU usage for both I/O and execution.

•by Databricks

Understanding Query Plans and Spark UIs

"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark. "

•by Databricks

Advanced Data Science on Spark-(Reza Zadeh, Stanford)

The document provides an overview of Spark and its machine learning library MLlib. It discusses how Spark uses resilient distributed datasets (RDDs) to perform distributed computing tasks across clusters in a fault-tolerant manner. It summarizes the key capabilities of MLlib, including its support for common machine learning algorithms and how MLlib can be used together with other Spark components like Spark Streaming, GraphX, and SQL. The document also briefly discusses future directions for MLlib, such as tighter integration with DataFrames and new optimization methods.

•by Spark Summit

spark summit 2015apache spark

EST – C2S Review
Did it work?
Why change it?
How could we make it better?

Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher Bradford, Open Source Connections)

EST – Old C2S Process
Note: some connections are omitted for clarity

EST – Spark C2S Process
Note: some connections are omitted for clarity

Recommended for you

Spark r under the hood with Hossein Falaki

SparkR is a new and evolving interface to Apache Spark. It offers a wide range of APIs and capabilities to Data Scientists and Statisticians. Being a distributed system with a JVM core some R users find SparkR errors unfamiliar. In this talk we will show what goes on under the hood when you interact with SparkR. We will look at SparkR architecture, performance bottlenecks and API semantics. Equipped with those, we will show how some common errors can be eliminated. I will use debugging examples based on our experience with real SparkR use cases.

•by Databricks

spark summitapache spark

Top 5 Mistakes to Avoid When Writing Apache Spark Applications

The document discusses 5 common mistakes people make when writing Spark applications: 1) Not properly sizing executors for memory and cores. 2) Having shuffle blocks larger than 2GB which can cause jobs to fail. 3) Not addressing data skew which can cause joins and shuffles to be very slow. 4) Not properly managing the DAG to minimize shuffles and stages. 5) Classpath conflicts from mismatched dependencies causing errors.

•by Cloudera, Inc.

sparkhadoopapache spark

Introduction to Spark ML

This document provides an introduction and overview of machine learning with Spark ML. It discusses the speaker and TAs, previews the topics that will be covered which include Spark's ML APIs, running an example with one API, model save/load, and serving options. It also briefly describes the different pieces of Spark including SQL, streaming, languages APIs, MLlib, and community packages. The document provides examples of loading data with Spark SQL and Spark CSV, constructing a pipeline with transformers and estimators, training a decision tree model, adding more features to the tree, and cross validation. Finally, it discusses serving models and exporting models to PMML format.

•by Holden Karau

apache sparksparkmachine learning

Poor Performance
joinedRDD = …
joinedRDD.foreach()
document = … // build document
sc = new SolrConnection()
sc.push(document)
sc.disconnect()
// Job is done

Poor Performance
sc = new SolrConnection()
sc.push(document)
sc.disconnect()

Optimum Performance
joinedRDD = …
sc = new SolrConnection()
joinedRDD.foreach()
document = … // build document
sc.push(document)
sc.disconnect()
// Job is done
joinedRDD = …
joinedRDD.foreachPartition()
sc = new SolrConnection()
partition.foreach()
document = … // build document
sc.push(document)
sc.disconnect()
// Job is done
Almost

Recommended for you

Unified Big Data Processing with Apache Spark (QCON 2014)

This document discusses Apache Spark, a fast and general engine for big data processing. It describes how Spark generalizes the MapReduce model through its Resilient Distributed Datasets (RDDs) abstraction, which allows efficient sharing of data across parallel operations. This unified approach allows Spark to support multiple types of processing, like SQL queries, streaming, and machine learning, within a single framework. The document also outlines ongoing developments like Spark SQL and improved machine learning capabilities.

•by Databricks

sparkhadoopbig data

Operational Tips For Deploying Apache Spark

Operational Tips for Deploying Apache Spark provides an overview of Apache Spark configuration, pipeline design best practices, and debugging techniques. It discusses how to configure Spark through command line options, programmatically, and Hadoop configs. It also covers topics like file formats, compression codecs, partitioning, and monitoring Spark jobs. The document provides tips on common issues like OutOfMemoryErrors, debugging SQL queries, and tuning shuffle partitions.

•by Databricks

spark summitdatabricksapache spark

Writing Continuous Applications with Structured Streaming in PySpark

We are in the midst of a Big Data Zeitgeist in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that reacts and interacts with data in real-time. We call this a continuous application. In this talk we will explore the concepts and motivations behind continuous applications and how Structured Streaming Python APIs in Apache Spark 2.x enables writing them. We also will examine the programming model behind Structured Streaming and the APIs that support them. Through a short demo and code examples, Jules will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames, and Datasets APIs.

•by Databricks

apache spark 2.3pysparkstructured streaming

The Solution!
joinedRDD = …
joinedRDD.mapPartitions()
sc = new SolrConnection()
partition.foreach()
document = … // build
document
sc.push(document)
sc.close()
return partition.rows
.collect()
joinedRDD = …
joinedRDD.mapPartitions()
sc = new SolrConnection()
partition.foreach()
document = … // build
document
sc.push(document)
sc.close()
return partitions.rows.count
.collect()

Better Solr Indexing
Note: some connections are omitted for clarity

Recommended for you

Microservices, Containers, and Machine Learning

Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.

•by Paco Nathan

machine learningdata sciencebig data

Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials

The document provides an outline for the Spark Camp @ Strata CA tutorial. The morning session will cover introductions and getting started with Spark, an introduction to MLlib, and exercises on working with Spark on a cluster and notebooks. The afternoon session will cover Spark SQL, visualizations, Spark streaming, building Scala applications, and GraphX examples. The tutorial will be led by several instructors from Databricks and include hands-on coding exercises.

•by Databricks

apache sparksparkcampdatabricks

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3

This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and exposes them how to analyze big data with Spark SQL and DataFrames. In this partly instructor-led and self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames in Databricks Community Edition. Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it. * Apache Spark Basics & Architecture * Spark SQL * DataFrames * Brief Overview of Databricks Certified Developer for Apache Spark

•by Databricks

apache spark 2.3spark sqldataframes

EST – Spark C2S Process v2
Note: some connections are omitted for clarity

Success?
YUP
5x faster than the original C2S process (with optimizations)

What’s Next?
•  Optimization of the C2S Spark job
•  More Spark jobs
•  Newer version of Spark & DSE
•  Scala Spark jobs instead of Java

What's hot

How Apache Spark fits into the Big Data landscape

Paco Nathan

Boulder/Denver Spark Meetup, 2014-10-02 @ Datalogix http://www.meetup.com/Boulder-Denver-Spark-Meetup/events/207581832/ Apache Spark is intended as a general purpose engine that supports combinations of Batch, Streaming, SQL, ML, Graph, etc., for apps written in Scala, Java, Python, Clojure, R, etc. This talk provides an introduction to Spark — how it provides so much better performance, and why — and then explores how Spark fits into the Big Data landscape — e.g., other systems with which Spark pairs nicely — and why Spark is needed for the work ahead.

Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark

Databricks

Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is an Apache Spark Package library that makes practical deep learning simple based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk, we discuss the philosophy behind Deep Learning Pipelines, as well as the main tools it provides, how they fit into the deep learning ecosystem, and how they demonstrate Spark's role in deep learning.

SparkApplicationDevMadeEasy_Spark_Summit_2015

Lance Co Ting Keh

The document discusses Spark application development and common problems that can occur. It introduces Unravel Data, a startup that aims to help developers visualize Spark job data, optimize performance through automated analysis and diagnoses, and strategize to prevent problems and meet goals. Key points include discussing common issues like failures, wrong results, poor performance, and resource problems; the difficulty of debugging using logs alone; and demonstrating Unravel's platform to address these challenges.

Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley

Databricks

A Journey into Databricks' Pipelines: Journey and Lessons Learned

Databricks

Web-Scale Graph Analytics with Apache® Spark™

Databricks

Building a SIMD Supported Vectorized Native Engine for Spark SQL

Databricks

Understanding Query Plans and Spark UIs

Databricks

Advanced Data Science on Spark-(Reza Zadeh, Stanford)

Spark Summit

Spark r under the hood with Hossein Falaki

Databricks

Top 5 Mistakes to Avoid When Writing Apache Spark Applications

Cloudera, Inc.

Introduction to Spark ML

Holden Karau

Unified Big Data Processing with Apache Spark (QCON 2014)

Databricks

Operational Tips For Deploying Apache Spark

Databricks

Writing Continuous Applications with Structured Streaming in PySpark

Databricks

Microservices, Containers, and Machine Learning

Paco Nathan

Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials

Databricks

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3

Databricks

Spark Community Update - Spark Summit San Francisco 2015

Databricks

This document summarizes Spark community updates from June 2014 to June 2015. It notes that Spark has become the most active open source project for data processing, with the number of contributors and lines of code doubling over the past year. New features in Spark include support for the R programming language and machine learning pipelines inspired by scikit-learn. The document outlines ongoing work to improve the Spark engine and platform APIs, as well as previews upcoming developments through Spark 1.5.

What's New in Apache Spark 2.3 & Why Should You Care

Databricks

The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support. This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements: * Continuous Processing in Structured Streaming. * PySpark support for vectorization, giving Python developers the ability to run native Python code fast. * Native Kubernetes support, marrying the best of container orchestration and distributed data processing.

What's hot (20)

How Apache Spark fits into the Big Data landscape

Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark

SparkApplicationDevMadeEasy_Spark_Summit_2015

Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley

A Journey into Databricks' Pipelines: Journey and Lessons Learned

Web-Scale Graph Analytics with Apache® Spark™

Building a SIMD Supported Vectorized Native Engine for Spark SQL

Understanding Query Plans and Spark UIs

Advanced Data Science on Spark-(Reza Zadeh, Stanford)

Spark r under the hood with Hossein Falaki

Top 5 Mistakes to Avoid When Writing Apache Spark Applications

Introduction to Spark ML

Unified Big Data Processing with Apache Spark (QCON 2014)

Operational Tips For Deploying Apache Spark

Writing Continuous Applications with Structured Streaming in PySpark

Microservices, Containers, and Machine Learning

Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3

Spark Community Update - Spark Summit San Francisco 2015

What's New in Apache Spark 2.3 & Why Should You Care

Viewers also liked

Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)

Spark Summit

The document discusses NASA's use of Apache Spark for big data analytics. It provides context on Chris Mattmann's involvement with Spark through his roles at NASA JPL and the Apache Software Foundation. It outlines some of NASA's big data challenges around handling large volumes of Earth observation data from instruments and simulations. NASA is interested in using Spark for tasks like data triage, archiving, and knowledge extraction to help address these challenges and enable new scientific insights.

The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...

Spark Summit

The document discusses the challenges faced by Shopify in using its existing data warehouse and ETL processes due to increasing data volume and complexity. It describes Shopify's attempts to use Pig and Luigi as well as Platfora to address these issues, but notes they did not meet Shopify's needs. Shopify then moved to using Spark due to its fast performance, nice development model using Python, and ability to better handle their data and query complexity. The summary provides an overview of why Shopify changed its data warehousing approach and the key technology it adopted.

Open Stack Cheat Sheet V1

Anuchit Chalothorn

This document provides a cheat sheet of common commands for managing OpenStack services including Nova (compute), Neutron (networking), Glance (images), and Keystone (identity). Key commands covered include adding and removing networks, floating IPs, security groups, images, instances, and users/tenants. The cheat sheet contains over 30 commands organized by service to help users manage basic operations and troubleshoot issues in an OpenStack deployment.

Tachyon-2014-11-21-amp-camp5

Haoyuan Li

Tachyon is a memory-centric distributed storage system that provides reliable data sharing at memory speed across various cluster computing frameworks. It addresses issues with current storage systems like slow data sharing due to disk writes, cache loss when processes crash, and in-memory data duplication. Tachyon keeps only one copy of data in memory, tracks data lineage for fault tolerance, and enables fast sharing of data within and across frameworks and jobs. It provides a simple API and allows frameworks like Spark and MapReduce to access data reliably from memory without code changes.

Linux Filesystems, RAID, and more

Mark Wong

The Hot Rod Protocol in Infinispan

Galder Zamarreño

In an environment where cloud-scaling applications is becoming more and more important, client-server architectures paradigms, as shown by memcached, are back with vengeance. In this talk, Galder will talk about Hot Rod, Infinispan's new client/server binary protocol, explaining the key differences compared to memcached's binary protocol, such as the possibility of receiving cluster topology changes. Audience of this talk will learn of the importance of Hot Rod in 'cloud-scale' application server clustering, where stateless application server instances could use Infinispan Hot Rod clients to retrieve state from an elastic farm of Infinispan Hot Rod servers, improving capabilities to run application server instances as a PaaS. The talk will finish with a brief demo of a cluster of Infinispan Hot Rod servers running on EC2 being accessed from a non-Java client. The audience is expected to have an intermediate understanding of client-server software architectures and cloud deployments.

Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS

Ceph Community

This document summarizes the performance of an all-NVMe Ceph cluster using Intel P3700 NVMe SSDs. Key results include achieving over 1.35 million 4K random read IOPS and 171K 4K random write IOPS with sub-millisecond latency. Partitioning the NVMe drives into multiple OSDs improved performance and CPU utilization compared to a single OSD per drive. The cluster also demonstrated over 5GB/s of sequential bandwidth.

Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Daniel Krook

Lightning talk from the OpenStack NYC meetup on October 8, 2014. http://bit.ly/ibm-os-meetup By Gil Vernik The integration between Apache Spark and Swift, and the use of Storlets for smart retrieval via filtering and privacy-support. The content of this talk is a statement from the IBM Research division, not IBM product divisions, and is not a statement from IBM regarding its plans, directions or product intents. Any activities described by this talk are subject to change.

Scaling up genomic analysis with ADAM

fnothaft

The document discusses ADAM, a new framework for scalable genomic data analysis. It aims to make genomic pipelines horizontally scalable by using a columnar data format and in-memory computing. This avoids disk I/O bottlenecks. The framework represents genomic data as schemas and stores data in Parquet for efficient column-based access. It has been shown to reduce genome analysis pipeline times from 100 hours to 1 hour by enabling analysis on large datasets in parallel across many nodes.

ELC-E 2010: The Right Approach to Minimal Boot Times

andrewmurraympc

Velox: Models in Action

Dan Crankshaw

Naïveté vs. Experience

Mike Fogus

The document discusses the authors' initial naive expectations for using Scala and Clojure compared to their actual experience. They initially thought the languages would solve issues like boilerplate code and Java compiler problems, but found they still had to deal with interoperability, immutability, and other functional programming concepts. In the end, Clojure met more of their needs due to its emphasis on seamless interoperability, traits, and pattern matching.

SparkR: Enabling Interactive Data Science at Scale

jeykottalam

The document discusses SparkR, which enables interactive data science using R on Apache Spark clusters. SparkR allows users to create and manipulate resilient distributed datasets (RDDs) from R and run R analytics functions in parallel on large datasets. It provides examples of using SparkR for tasks like word counting on text data and digit classification using the MNIST dataset. The API is designed to be similar to PySpark for ease of use.

SampleClean: Bringing Data Cleaning into the BDAS Stack

jeykottalam

OpenStack Cheat Sheet V2

Anuchit Chalothorn

This document provides a cheat sheet summarizing common commands for managing various OpenStack services, including Nova for compute, Glance for images, Keystone for identity, Cinder for volumes, Neutron for networking, Heat for orchestration, and Ceilometer for alarms and notifications. It lists commands for viewing status, creating and managing resources, and common operations for each service in 3 sentences or less.

A Curious Course on Coroutines and Concurrency

David Beazley (Dabeaz LLC)

David Beazley gave a tutorial on coroutines and concurrency at PyCon 2009. The tutorial provided an overview of coroutines, how they can be used, and whether they are useful. It explored coroutines in Python using generators and the send method added in Python 2.5. The tutorial was meant to determine if coroutines have practical applications as an approach to concurrency. It focused on practical examples over academic theory and included some later performance measurements.

Lab 5: Interconnecting a Datacenter using Mininet

Zubair Nabi

This document discusses using Mininet, an emulator for real-world networks that uses real kernel, switch, and application code on a single machine. It describes how Mininet uses Linux containers to emulate hosts, switches, and links. It also explains that Mininet creates a container and network namespace for each virtual host, with virtual interfaces connecting hosts to software switches via veth links. Finally, it briefly outlines Mininet's command line and Python interfaces.

Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)

Spark Summit

Typesafe has launched Spark support for Mesosphere's Data Center Operating System (DCOS). Typesafe engineers are contributing to the Mesos support for Spark and Typesafe will provide commercial support for Spark development and production deployment on Mesos. Mesos' flexibility allows many frameworks like Spark to run on top of it. This document discusses Spark on Mesos in coarse-grained and fine-grained modes and some features coming soon like dynamic allocation and constraints.

Best Practices for Virtualizing Apache Hadoop

Hortonworks

Join this webinar to discuss best practices for designing and building a solid, robust and flexible Hadoop platform on an enterprise virtual infrastructure. Attendees will learn the flexibility and operational advantages of Virtual Machines such as fast provisioning, cloning, high levels of standardization, hybrid storage, vMotioning, increased stabilization of the entire software stack, High Availability and Fault Tolerance. This is a can`t miss presentation for anyone wanting to understand design, configuration and deployment of Hadoop in virtual infrastructures.

Python in Action (Part 2)

David Beazley (Dabeaz LLC)

The document discusses a Python tutorial presentation on systems programming. It describes building programs to analyze Firefox browser cache files, including a findcache.py program that recursively searches a directory for Firefox cache folders. The goal is to demonstrate Python for practical system tasks like file parsing and processing. Disclaimers note the code only uses standard Python and is intended for educational purposes.

Viewers also liked (20)

Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)

The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...

Open Stack Cheat Sheet V1

Tachyon-2014-11-21-amp-camp5

Linux Filesystems, RAID, and more

The Hot Rod Protocol in Infinispan

Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS

Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Scaling up genomic analysis with ADAM

ELC-E 2010: The Right Approach to Minimal Boot Times

Velox: Models in Action

Naïveté vs. Experience

SparkR: Enabling Interactive Data Science at Scale

SampleClean: Bringing Data Cleaning into the BDAS Stack

OpenStack Cheat Sheet V2

A Curious Course on Coroutines and Concurrency

Lab 5: Interconnecting a Datacenter using Mininet

Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)

Best Practices for Virtualizing Apache Hadoop

Python in Action (Part 2)

Similar to Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher Bradford, Open Source Connections)

Jump Start on Apache Spark 2.2 with Databricks

Anyscale

Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data. In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas: Agenda: • Overview of Spark Fundamentals & Architecture • What’s new in Spark 2.x • Unified APIs: SparkSessions, SQL, DataFrames, Datasets • Introduction to DataFrames, Datasets and Spark SQL • Introduction to Structured Streaming Concepts • Four Hands-On Labs

Jumpstart on Apache Spark 2.2 on Databricks

Databricks

In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas: Agenda: • Overview of Spark Fundamentals & Architecture • What’s new in Spark 2.x • Unified APIs: SparkSessions, SQL, DataFrames, Datasets • Introduction to DataFrames, Datasets and Spark SQL • Introduction to Structured Streaming Concepts • Four Hands On Labs You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL. Level: Beginner to intermediate, not for advanced Spark users. Prerequisite: You will need a laptop with Chrome or Firefox browser installed with at least 8 GB. Introductory or basic knowledge Scala or Python is required, since the Notebooks will be in Scala; Python is optional. Bio: Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.

Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...

CloudxLab

Big Data with Hadoop & Spark Training: http://bit.ly/2sf2z6i This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide: 1) Introduction to DataFrames 2) Creating DataFrames from JSON 3) DataFrame Operations 4) Running SQL Queries Programmatically 5) Datasets 6) Inferring the Schema Using Reflection 7) Programmatically Specifying the Schema

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...

Databricks

Of all the developers’ delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs-RDDs, DataFrames, and Datasets-available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practices 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (this will be vocalization of the blog, along with the latest developments in Apache Spark 2.x Dataframe/Datasets and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)

Tuning and Debugging in Apache Spark

Databricks

A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji

Data Con LA

Abstract:- Of all the developers delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs - RDDs, DataFrames, and Datasets available in Apache Spark 2.x. In particular, I will emphasize why and when you should use each set as best practices, outline its performance and optimization benefits, and underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them.

Spark to DocumentDB connector

Denny Lee

Jdbc drivers

Prabhat gangwar

Apache Spark Fundamentals Training

Eren Avşaroğulları

Building Robust ETL Pipelines with Apache Spark

Databricks

Stable and robust ETL pipelines are a critical component of the data infrastructure of modern enterprises. ETL pipelines ingest data from a variety of sources and must handle incorrect, incomplete or inconsistent records and produce curated, consistent data for consumption by downstream applications. In this talk, we’ll take a deep dive into the technical details of how Apache Spark “reads” data and discuss how Spark 2.2’s flexible APIs; support for a wide variety of datasources; state of art Tungsten execution engine; and the ability to provide diagnostic feedback to users, making it a robust framework for building end-to-end ETL pipelines.

Tuning and Debugging in Apache Spark

Patrick Wendell

Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.

Spark SQL - 10 Things You Need to Know

Kristian Alexander

The document discusses loading data into Spark SQL and the differences between DataFrame functions and SQL. It provides examples of loading data from files, cloud storage, and directly into DataFrames from JSON and Parquet files. It also demonstrates using SQL on DataFrames after registering them as temporary views. The document outlines how to load data into RDDs and convert them to DataFrames to enable SQL querying, as well as using SQL-like functions directly in the DataFrame API.

USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016

Chris Fregly

Title: Real-time, Advanced Analytics and Recommendations using Machine Learning, Natural Language Processing, Graph Processing, and Approximations with Apache Spark, Stanford CoreNLP, and Twitter Algebird Agenda Intro Live, Interactive Recommendations Demo Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker Types of Similarity Euclidean vs. Non-Euclidean Similarity User-to-User Similarity Content-based, Item-to-Item Similarity (Amazon) Collaborative-based, User-to-Item Similarity (Netflix) Graph-based, Item-to-Item Similarity Pathway (Spotify) Similarity Approximations at Scale Twitter Algebird MinHash and Bucketing Locality Sensitive Hashing (LSH) Netflix Recommendations: From Ratings to Real-Time DVD-Ratings-based $1M Netflix Prize (2009) Streaming-based "Trending Now" (2016) Wrap Up Q & A *Bio* Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer. Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com. Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix. *Related Links* https://github.com/fluxcapacitor/pipeline/wiki http://cdn.oreillystatic.com/en/assets/1/event/105/Algebra%20for%20Scalable%20Analytics%20Presentation.pdf http://static.echonest.com/BoilTheFrog/ http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/ http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf

Building a modern Application with DataFrames

Databricks

Building a modern Application with DataFrames

Spark Summit

2 rel-algebra

Mahesh Jeedimalla

The document provides an overview of the relational data model and relational algebra. It discusses how the relational model represents data using tables of attribute-value pairs and allows standard logical operations. Key concepts covered include the relational operations of projection, selection, join, union, difference, and divide. SQL is introduced as the standard language for querying and manipulating relational data using these algebraic operations.

Quick Guide to Refresh Spark skills

Ravindra kumar

Engineering Document Collaboration with Office 365

JoAnna Cheshire

This document discusses challenges with sharing engineering information and proposes using Office 365 and Microsoft Azure to address them. It describes an engineering firm that adopted Office 365 but still lacks functionality to track design submittals. A demo then shows how the Taskforce application integrates with SharePoint and Office 365 to enable task-based sharing with deadlines, responses and notifications. The Taskforce dashboard provides a single view of issued and assigned tasks across projects.

Scylla Summit 2018: Building Recoverable (and optionally Async) Spark Pipelines

ScyllaDB

Have you ever had a Spark job fail in it’s second to last stage after a “trivial” update or been part of the way through debugging a pipeline to wish you could look at it’s data or had an “exploratory” notebook turn into something less exploratory? Come join me for a surprisingly simple adventure into how to build recoverable pipelines and have more debuggable pipelines. Then join me on the adventure where in we find out our “simple” solution has a bunch of hidden flaws, how to work around them, and end on the reminder of how important it is to test your code.

Spark Cassandra Connector: Past, Present and Furure

DataStax Academy

The document discusses the past, present, and future of the Spark Cassandra Connector. In the past, integrating Hadoop and Cassandra required expertise and was difficult. The Spark Cassandra Connector was first released in 2014 and makes it easier to access Cassandra data from Spark applications. Currently, the connector can read and write Cassandra data into RDDs, push filters down to Cassandra, and support Java APIs. It also enables working with DataFrames/SQL for Cassandra data.

Similar to Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher Bradford, Open Source Connections) (20)

Jump Start on Apache Spark 2.2 with Databricks

Jumpstart on Apache Spark 2.2 on Databricks

Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...

Tuning and Debugging in Apache Spark

A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji

Spark to DocumentDB connector

Jdbc drivers

Apache Spark Fundamentals Training

Building Robust ETL Pipelines with Apache Spark

Tuning and Debugging in Apache Spark

Spark SQL - 10 Things You Need to Know

USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016

Building a modern Application with DataFrames

2 rel-algebra

Quick Guide to Refresh Spark skills

Engineering Document Collaboration with Office 365

Scylla Summit 2018: Building Recoverable (and optionally Async) Spark Pipelines

Spark Cassandra Connector: Past, Present and Furure

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang

Spark Summit

In this session we will present a Configurable FPGA-Based Spark SQL Acceleration Architecture. It is target to leverage FPGA highly parallel computing capability to accelerate Spark SQL Query and for FPGA’s higher power efficiency than CPU we can lower the power consumption at the same time. The Architecture consists of SQL query decomposition algorithms, fine-grained FPGA based Engine Units which perform basic computation of sub string, arithmetic and logic operations. Using SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations and according to their patterns each is fed into an Engine Unit. SQL Engine Units are highly configurable and can be chained together to perform complex Spark SQL queries, finally one SQL query is transformed into a Hardware Pipeline. We will present the performance benchmark results comparing the queries with FGPA-Based Spark SQL Acceleration Architecture on XEON E5 and FPGA to the ones with Spark SQL Query on XEON E5 with 10X ~ 100X improvement and we will demonstrate one SQL query workload from a real customer.

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...

Spark Summit

In this talk, we’ll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems that are used to personalize the Netflix experience for their 99 millions members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize Machine Learning Models.

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu

Spark Summit

This presentation introduces how we design and implement a real-time processing platform using latest Spark Structured Streaming framework to intelligently transform the production lines in the manufacturing industry. In the traditional production line there are a variety of isolated structured, semi-structured and unstructured data, such as sensor data, machine screen output, log output, database records etc. There are two main data scenarios: 1) Picture and video data with low frequency but a large amount; 2) Continuous data with high frequency. They are not a large amount of data per unit. However the total amount of them is very large, such as vibration data used to detect the quality of the equipment. These data have the characteristics of streaming data: real-time, volatile, burst, disorder and infinity. Making effective real-time decisions to retrieve values from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to the Spark we are able to build a low-latency, high-throughput and reliable operation system involving data acquisition, transmission, analysis and storage. The actual user case proved that the system meets the needs of real-time decision-making. The system greatly enhance the production process of predictive fault repair and production line material tracking efficiency, and can reduce about half of the labor force for the production lines.

Improving Traffic Prediction Using Weather Data with Ramya Raghavendra

Spark Summit

As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how driver’s plan their day by alerting users before they travel, find the best times to travel, and over time, learn from new IoT data such as road conditions, incidents, etc. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As a part of this work, we conducted a case study over five large metropolitans in the US, 2.58 billion traffic records and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture with Apache Spark being used to build prediction models with weather and traffic data, and Spark Streaming used to score the model and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark to analyze geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms that were used with Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work to scale the system to provide predictions in over a 100 cities. Audience will learn about our experience scaling using Spark in offline and streaming mode, building statistical and deep-learning pipelines with Spark, and techniques to work with geospatial and time-series data.

A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...

Spark Summit

Graph is on the rise and it’s time to start learning about scalable graph analytics! In this session we will go over two Spark-based Graph Analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this Deep-Dive by example presentation, we will demonstrate some common traversals and explain how, at a Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API as well as the powerful GraphFrame Motif api as we show examples of both simultaneously. No need to be familiar with Graphs or Spark for this presentation as we’ll be explaining everything from the ground up!

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...

Spark Summit

Building accurate machine learning models has been an art of data scientists, i.e., algorithm selection, hyper parameter tuning, feature selection and so on. Recently, challenges to breakthrough this “black-arts” have got started. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit attractive advantages of Spark. The evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discovers the most accurate ones in minutes on a Ultra High Density Server, which employs 272 CPU cores, 2TB memory and 17TB SSD in 3U chassis. We will also share open challenges to learn such a massive amount of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown on Spark Summit SF’17 (#SFds5) but from more technical perspective.

Apache Spark and Tensorflow as a Service with Jim Dowling

Spark Summit

In Sweden, from the Rise ICE Data Center at www.hops.site, we are providing to reseachers both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databrick’s Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written on Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both Spark UI and Tensorboard, and how to examine logs and monitor training.

Apache Spark and Tensorflow as a Service with Jim Dowling

Spark Summit

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...

Spark Summit

With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as: low level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released The Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner and how to efficiently deploy a Spark library across multiple platforms.

Next CERN Accelerator Logging Service with Jakub Wozniak

Spark Summit

The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service. The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex. The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments. During this talk, Jakub will speak about NXCALS requirements and design choices that lead to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service and the Spark-based Extraction API where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN specific and can be of interest to other companies or institutes confronted with similar Big Data problems.

Powering a Startup with Apache Spark with Kevin Kim

Spark Summit

In Between (A mobile App for couples, downloaded 20M in Global), from daily batch for extracting metrics, analysis and dashboard. Spark is widely used by engineers and data analysts in Between, thanks to the performance and expendability of Spark, data operating has become extremely efficient. Entire team including Biz Dev, Global Operation, Designers are enjoying data results so Spark is empowering entire company for data driven operation and thinking. Kevin, Co-founder and Data Team leader of Between will be presenting how things are going in Between. Listeners will know how small and agile team is living with data (how we build organization, culture and technical base) after this presentation.

Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra

Spark Summit

Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...

Spark Summit

In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying of the projects and/or the fluency of communication between the different profiles of people involved in the projects. In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows using the already existing functional blocks or supporting the creation of new functional blocks. The created workflow can then be deployed in a Spark infrastructure and used through a REST API. For better understanding of Moriarty, the prototyping process and the way it hides the Spark environment to the Big Data users and developers, we will present it together with a couple of examples based on a Industry 4.0 success cases and other on a logistic success case.

How Nielsen Utilized Databricks for Large-Scale Research and Development with...

Spark Summit

Nielsen used Databricks to test new digital advertising rating methodologies on a large scale. Databricks allowed Nielsen to run analyses on thousands of advertising campaigns using both small panel data and large production data. This identified edge cases and performance gains faster than traditional methods. Using Databricks reduced the time required to test and deploy improved rating methodologies to benefit Nielsen's clients.

Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...

Spark Summit

Goal Based Data Production with Sim Simeonov

Spark Summit

Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enables a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.

Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...

Spark Summit

Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and a simple machine learning solution to create an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check if the platform is experiencing any operational problems that can impact in revenue losses. The application monitor distributed systems and provides notifications stating the problem detected, that way users can operate quickly to avoid serious problems which directly impact the company’s revenue and reduce the time for action. We will present an architecture for not only a monitoring system, but also an active actor for our outages recoveries. At the end of the presentation you will have access to our training program source code and you will be able to adapt and implement in your company. This solution already helped to prevent about US$3mi in losses last year.

Getting Ready to Use Redis with Apache Spark with Dvir Volk

Spark Summit

Getting Ready to use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities Redis provides. We cover the basic data types provided by Redis and cover the module system. Using an ad serving use-case, we look at how Redis can improve the performance and reduce the cost of using complex ML-models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed focusing primarily on decision trees and regression (linear and logistic) with code examples to demonstrate how to use these feature. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex, ML-models with high performance.

Deduplication and Author-Disambiguation of Streaming Records via Supervised M...

Spark Summit

Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself by – Application of Databricks and AWS makes this a scalable implementation. Compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial as Elsevier’s Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts covering a few hundred years. – We create a fingerprint for each content by deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TFIDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities namely such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search and hence it can offer a basic recommender for cross-product applications where we may not have a dedicate recommender engine designed. – Traditional author-disambiguation or record deduplication algorithms are batch-processing with small to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. Hence, it is crucial to maintain historical profiles and hence we have developed a machine learning implementation to deal with data streams and process them in mini batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it and how to process the raw data of pairwise similarity function into final clusters. Lessons learned from this talk can help all sort of companies where they want to integrate their data or deduplicate their user/customer/product databases.

MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...

Spark Summit

The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside the Spark SQL, and optimize the matrix execution plan based on Spark SQL Catalyst. We conduct case studies on a series of ML models and matrix computations with special features on different datasets. These are PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets are: social network data ( e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance.

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu

Improving Traffic Prediction Using Weather Data with Ramya Raghavendra

A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...

Apache Spark and Tensorflow as a Service with Jim Dowling

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...

Next CERN Accelerator Logging Service with Jakub Wozniak

Powering a Startup with Apache Spark with Kevin Kim

Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra

Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...

How Nielsen Utilized Databricks for Large-Scale Research and Development with...

Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...

Goal Based Data Production with Sim Simeonov

Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...

Getting Ready to Use Redis with Apache Spark with Dvir Volk

Deduplication and Author-Disambiguation of Streaming Records via Supervised M...

MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...

Recently uploaded

Sunshine Coast University diploma

cwavvyy

原版一模一样【微信：741003700 】【阳光海岸大学毕业证成绩单】【微信：741003700 】学位证，留信学历认证（真实可查，永久存档）原件一模一样纸张工艺/offer、在读证明、外壳等材料/诚信可靠,可直接看成品样本，帮您解决无法毕业带来的各种难题！外壳，原版制作，诚信可靠，可直接看成品样本。行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备。十五年致力于帮助留学生解决难题，包您满意。本公司拥有海外各大学样板无数，能完美还原。 1:1完美还原海外各大学毕业材料上的工艺：水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠。文字图案浮雕、激光镭射、紫外荧光、温感、复印防伪等防伪工艺。材料咨询办理、认证咨询办理请加学历顾问Q/微741003700 【主营项目】一.毕业证【q微741003700】成绩单、使馆认证、教育部认证、雅思托福成绩单、学生卡等！二.真实使馆公证(即留学回国人员证明,不成功不收费) 三.真实教育部学历学位认证（教育部存档！教育部留服网站永久可查）四.办理各国各大学文凭(一对一专业服务,可全程监控跟踪进度) 如果您处于以下几种情况： ◇在校期间，因各种原因未能顺利毕业……拿不到官方毕业证【q/微741003700】 ◇面对父母的压力，希望尽快拿到； ◇不清楚认证流程以及材料该如何准备； ◇回国时间很长，忘记办理； ◇回国马上就要找工作，办给用人单位看； ◇企事业单位必须要求办理的 ◇需要报考公务员、购买免税车、落转户口 ◇申请留学生创业基金留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才办理阳光海岸大学毕业证【微信：741003700 】外观非常简单，由纸质材料制成，上面印有校徽、校名、毕业生姓名、专业等信息。办理阳光海岸大学毕业证【微信：741003700 】格式相对统一，各专业都有相应的模板。通常包括以下部分：校徽：象征着学校的荣誉和传承。校名:学校英文全称授予学位：本部分将注明获得的具体学位名称。毕业生姓名：这是最重要的信息之一，标志着该证书是由特定人员获得的。颁发日期：这是毕业正式生效的时间，也代表着毕业生学业的结束。其他信息：根据不同的专业和学位，可能会有一些特定的信息或章节。办理阳光海岸大学毕业证【微信：741003700 】价值很高，需要妥善保管。一般来说，应放置在安全、干燥、防潮的地方，避免长时间暴露在阳光下。如需使用，最好使用复印件而不是原件，以免丢失。综上所述，办理阳光海岸大学毕业证【微信：741003700 】是证明身份和学历的高价值文件。外观简单庄重，格式统一，包括重要的个人信息和发布日期。对持有人来说，妥善保管是非常重要的。

Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe

nehadubay1

How We Added Replication to QuestDB - JonTheBeach

javier ramirez

Building a database that can beat industry benchmarks is hard work, and we had to use every trick in the book to keep as close to the hardware as possible. In doing so, we initially decided QuestDB would scale only vertically, on a single instance. A few years later, data replication —for horizontally scaling reads and for high availability— became one of the most demanded features, especially for enterprise and cloud environments. So, we rolled up our sleeves and made it happen. Today, QuestDB supports an unbounded number of geographically distributed read-replicas without slowing down reads on the primary node, which can ingest data at over 4 million rows per second. In this talk, I will tell you about the technical decisions we made, and their trade offs. You'll learn how we had to revamp the whole ingestion layer, and how we actually made the primary faster than before when we added multi-threaded Write Ahead Logs to deal with data replication. I'll also discuss how we are leveraging object storage as a central part of the process. And of course, I'll show you a live demo of high-performance multi-region replication in action.

[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습

Amazon Web Services Korea

Amazon DocumentDB(MongoDB와 호환됨)는 빠르고 안정적이며 완전 관리형 데이터베이스 서비스입니다. Amazon DocumentDB를 사용하면 클라우드에서 MongoDB 호환 데이터베이스를 쉽게 설치, 운영 및 규모를 조정할 수 있습니다. Amazon DocumentDB를 사용하면 MongoDB에서 사용하는 것과 동일한 애플리케이션 코드를 실행하고 동일한 드라이버와 도구를 사용하는 것을 실습합니다.

Supervised Learning (Data Science).pptx

TARIKU ENDALE

Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe

bookmybebe1

Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe

aarusi sexy model

South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe

simmi singh$A17

Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe

butwhat24

Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe

jiya khan$A17

[D3T1S02] Aurora Limitless Database Introduction

Amazon Web Services Korea

Amazon Aurora 클러스터를 초당 수백만 건의 쓰기 트랜��션으로 확장하고 페타바이트 규모의 데이터를 관리할 수 있으며, 사용자 지정 애플리케이션 로직을 생성하거나 여러 데이터베이스를 관리할 필요 없이 Aurora에서 관계형 데이터베이스 워크로드를 단일 Aurora 라이터 인스턴스의 한도 이상으로 확장할 수 있는 Amazon Aurora Limitless Database를 소개합니다.

Australian Catholic University degree offer diploma Transcript

taqyea

学历认证补办制【微信：A575476】【(ACU毕业证）澳大利亚天主教大学毕业证成绩单offer】【微信：A575476】（留信学历认证永久存档查询）采用学校原版纸张，特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信：A575476】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信：A575476】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份【微信：A575476】 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才 → 【关于价格问题（保证一手价格）我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：可来公司面谈，可签订合同，会陪同客户一起到教育部认证窗口递交认证材料，客户在教育部官方认证查询网站查询到认证通过结果后付款，不成功不收费！办理(ACU毕业证）澳大利亚天主教大学毕业证【微信：A575476】外观非常精致，由特殊纸质材料制成，上面印有校徽、校名、毕业生姓名、专业等信息。办理(ACU毕业证）澳大利亚天主教大学毕业证【微信：A575476】格式相对统一，各专业都有相应的模板。通常包括以下部分：校徽：象征着学校的荣誉和传承。校名:学校英文全称授予学位：本部分将注明获得的具体学位名称。毕业生姓名：这是最重要的信息之一，标志着该证书是由特定人员获得的。颁发日期：这是毕业正式生效的时间，也代表着毕业生学业的结束。其他信息：根据不同的专业和学位，可能会有一些特定的信息或章节。办理(ACU毕业证）澳大利亚天主教大学毕业证【微信：A575476】价值很高，需要妥善保管。一般来说，应放置在安全、干燥、防潮的地方，避免长时间暴露在阳光下。如需使用，最好使用复印件而不是原件，以免丢失。综上所述，办理(ACU毕业证）澳大利亚天主教大学毕业证【微信：A575476 】是证明身份和学历的高价值文件。外观简单庄重，格式统一，包括重要的个人信息和发布日期。对持有人来说，妥善保管是非常重要的。

Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...

javier ramirez

Los sistemas distribuidos son difíciles. Los sistemas distribuidos de alto rendimiento, más. Latencias de red, mensajes sin confirmación de recibo, reinicios de servidores, fallos de hardware, bugs en el software, releases problemáticas, timeouts... hay un montón de motivos por los que es muy difícil saber si un mensaje que has enviado se ha recibido y procesado correctamente en destino. Así que para asegurar mandas el mensaje otra vez.. y otra... y cruzas los dedos para que el sistema del otro lado tenga tolerancia a los duplicados. QuestDB es una base de datos open source diseñada para alto rendimiento. Nos queríamos asegurar de poder ofrecer garantías de "exactly once", deduplicando mensajes en tiempo de ingestión. En esta charla, te cuento cómo diseñamos e implementamos la palabra clave DEDUP en QuestDB, permitiendo deduplicar y además permitiendo Upserts en datos en tiempo real, añadiendo solo un 8% de tiempo de proceso, incluso en flujos con millones de inserciones por segundo. Además, explicaré nuestra arquitectura de log de escrituras (WAL) paralelo y multithread. Por supuesto, todo esto te lo cuento con demos, para que veas cómo funciona en la práctica.

[D3T1S03] Amazon DynamoDB design puzzlers

Amazon Web Services Korea

Greater Kailash @ℂall @Girls ꧁❤ 9873777170 ❤꧂Glamorous sonam Mehra Top Model ...

shoeb2926

Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe

dipti singh$A17

Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe

bookmybebe1

Simon Fraser University degree offer diploma Transcript

taqyea

学历认证补办制【微信：A575476】【(SFU毕业证）西蒙弗雷泽大学毕业证成绩单offer】【微信：A575476】（留信学历认证永久存档查询）采用学校原版纸张，特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信：A575476】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信：A575476】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份【微信：A575476】 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才 → 【关于价格问题（保证一手价格）我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：可来公司面谈，可签订合同，会陪同客户一起到教育部认证窗口递交认证材料，客户在教育部官方认证查询网站查询到认证通过结果后付款，不成功不收费！办理(SFU毕业证）西蒙弗雷泽大学毕业证【微信：A575476】外观非常精致，由特殊纸质材料制成，上面印有校徽、校名、毕业生姓名、专业等信息。办理(SFU毕业证）西蒙弗雷泽大学毕业证【微信：A575476】格式相对统一，各专业都有相应的模板。通常包括以下部分：校徽：象征着学校的荣誉和传承。校名:学校英文全称授予学位：本部分将注明获得的具体学位名称。毕业生姓名：这是最重要的信息之一，标志着该证书是由特定人员获得的。颁发日期：这是毕业正式生效的时间，也代表着毕业生学业的结束。其他信息：根据不同的专业和学位，可能会有一些特定的信息或章节。办理(SFU毕业证）西蒙弗雷泽大学毕业证【微信：A575476】价值很高，需要妥善保管。一般来说，应放置在安全、干燥、防潮的地方，避免长时间暴露在阳光下。如需使用，最好使用复印件而不是原件，以免丢失。综上所述，办理(SFU毕业证）西蒙弗雷泽大学毕业证【微信：A575476 】是证明身份和学历的高价值文件。外观简单庄重，格式统一，包括重要的个人信息和发布日期。对持有人来说，妥善保管是非常重要的。

From Clues to Connections: How Social Media Investigators Expose Hidden Networks

Milind Agarwal

Streamlining Legacy Complexity Through Modernization

sanjay singh

Recently uploaded (20)

Sunshine Coast University diploma

Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe

How We Added Replication to QuestDB - JonTheBeach

[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습

Supervised Learning (Data Science).pptx

Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe

Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe

South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe

Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe

Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe

[D3T1S02] Aurora Limitless Database Introduction

Australian Catholic University degree offer diploma Transcript

Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...

[D3T1S03] Amazon DynamoDB design puzzlers

Greater Kailash @ℂall @Girls ꧁❤ 9873777170 ❤꧂Glamorous sonam Mehra Top Model ...

Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe

Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe

Simon Fraser University degree offer diploma Transcript

From Clues to Connections: How Social Media Investigators Expose Hidden Networks

Streamlining Legacy Complexity Through Modernization

Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher Bradford, Open Source Connections)

1. Lessons Learned with Spark at the US Patent & Trademark Office Christopher Bradford Big Data Architect at OpenSource Connections

2. Christopher Bradford Twitter: @bradfordcp GitHub: bradfordcp

3. OpenSource Connections

4. Exploring Search Technologies - EST

5. EST – Technology Stack

6. EST – Data Loading CSS Ingestion (CSS2C) Solr Ingestion (C2S)

7. EST – C2S Process Note: some connections are omitted for clarity

8. EST – C2S Process (Scaled Out) Note: some connections are omitted for clarity

9. EST – C2S Review Did it work? Why change it? How could we make it better?

11. EST – Old C2S Process Note: some connections are omitted for clarity

12. EST – Spark C2S Process Note: some connections are omitted for clarity

13. How did this work out? Poorly

14. Poor Performance joinedRDD = … joinedRDD.foreach() document = … // build document sc = new SolrConnection() sc.push(document) sc.disconnect() // Job is done

15. Poor Performance sc = new SolrConnection() sc.push(document) sc.disconnect()

16. Optimum Performance joinedRDD = … sc = new SolrConnection() joinedRDD.foreach() document = … // build document sc.push(document) sc.disconnect() // Job is done joinedRDD = … joinedRDD.foreachPartition() sc = new SolrConnection() partition.foreach() document = … // build document sc.push(document) sc.disconnect() // Job is done Almost

17. The Solution! joinedRDD = … joinedRDD.mapPartitions() sc = new SolrConnection() partition.foreach() document = … // build document sc.push(document) sc.close() return partition.rows .collect() joinedRDD = … joinedRDD.mapPartitions() sc = new SolrConnection() partition.foreach() document = … // build document sc.push(document) sc.close() return partitions.rows.count .collect()

18. Results?

19. Solr Indexing

20. Better Solr Indexing Note: some connections are omitted for clarity

21. EST – Spark C2S Process v2 Note: some connections are omitted for clarity

22. Success? YUP 5x faster than the original C2S process (with optimizations)

23. What’s Next? •  Optimization of the C2S Spark job •  More Spark jobs •  Newer version of Spark & DSE •  Scala Spark jobs instead of Java

Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher Bradford, Open Source Connections)

Related slideshows

Recommended for you

Recommended for you

Recommended for you

Recommended for you

Recommended for you

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher Bradford, Open Source Connections)

Similar to Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher Bradford, Open Source Connections) (20)

More from Spark Summit

More from Spark Summit (20)

Recently uploaded

Recently uploaded (20)

Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher Bradford, Open Source Connections)