Hector Yee works at Airbnb developing machine learning models using their open-source ML stack Aerosolve. He has developed complex models in Spark to predict demand for Airbnb listings. His models for pricing recommendations consider numerous features like location, listing quality, seasonality and events to predict booking probabilities. The models are interpretable using techniques like controlled feature engineering and quantization, and regularization helps prevent overfitting of high-capacity models.
Visualisation is essential to the data science process because it lets us look at a portrait of our data and develop new hypotheses about our problem. However, visualisation does not scale well: we are limited by the number of pixels on our screen (at least for static graphics). This deck presents the bin-summarize-smooth approach to visualising big data, developed by Hadley Wickham and implemented in his R package bigvis.
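The bin and summarize steps can be illustrated without the bigvis package itself; this pure-Python sketch bins values into fixed-width intervals and counts occupancy (the data, bin width, and function name are made up for illustration, and the smoothing step is omitted):

```python
# A minimal sketch of "bin" (fixed-width intervals) and "summarize"
# (counts per bin); not the bigvis implementation, just the idea.
from collections import Counter

def bin_summarize(xs, width):
    """Assign each value to a fixed-width bin and count bin occupancy."""
    counts = Counter(int(x // width) for x in xs)
    # Return (bin_lower_edge, count) pairs, sorted by bin.
    return sorted((b * width, n) for b, n in counts.items())

data = [0.2, 0.4, 1.1, 1.3, 1.4, 2.9]
print(bin_summarize(data, width=1.0))  # [(0.0, 2), (1.0, 3), (2.0, 1)]
```

Reducing millions of raw points to a few thousand (bin, summary) pairs is what makes the subsequent smoothing and plotting tractable on a pixel-limited screen.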
In this session, we will share cutting-edge deep learning innovations and present emerging trends in the AI community. This session is for data scientists and developers who have a keen interest in getting started on an AI project and want to learn the tools of the trade. We will draw on practical experience from working on various AI projects and share the key learnings and pitfalls.
The document discusses different solutions for visualizing terrain data generated from Python in Unreal Engine. Solution 4 involves a web-based front-end that links to a Python back-end server and raster image cache server. The Python back-end would process input data like rasters, shapefiles and LIDAR to generate terrain which is sent to Unreal Engine for visualization. The front-end allows non-technical users to query datasets and visualize results.
Andy Feng discusses Yahoo's use of scalable machine learning for search and advertisement applications with massive datasets and features. Three machine learning algorithms - gradient boosted decision trees, logistic regression, and ad-query vectors - presented challenges of scale that were addressed using Hadoop and YARN across hundreds of servers. Approximate computing techniques like streaming, distributed training, and in-memory processing enabled speedups of 30x to 1000x and scaling to billions of examples and terabytes of data, allowing daily model training. Hadoop and distributed processing on CPU and GPU resources were critical to solving Yahoo's needs for scalable machine learning on big data.
This document discusses scaling out logistic regression with Apache Spark. It describes the need to classify a large number of websites using machine learning. Several approaches to logistic regression were tried, including a single machine Java implementation and moving to Spark for better scalability. Spark's L-BFGS algorithm was chosen for its out of the box distributed logistic regression solution. Challenges implementing logistic regression at large scale are discussed, such as overfitting and regularization. Methods used to address these challenges include L2 regularization, cross-validation to select the regularization parameter, and extensions made to Spark's LBFGS implementation.
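Before reaching for a distributed solver, the core of L2-regularized logistic regression fits in a few lines; this toy single-machine sketch uses plain gradient descent (the data, learning rate, and regularization strength `lam` are illustrative, not from the talk, which used L-BFGS on Spark):

```python
# A toy sketch of L2-regularized logistic regression by gradient descent.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lam=0.1, lr=0.5, steps=200):
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        grad = [lam * wj for wj in w]          # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj / n
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # bias + one feature
y = [0, 0, 1, 1]
w = train(X, y)
print(sigmoid(w[0] + w[1] * 3.0) > 0.5)  # True: x=3 classified positive
```

The `lam * wj` term is what keeps the weights small and the model from overfitting; choosing `lam` by cross-validation, as the talk describes, is what remains once the loss itself is in place.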
The document summarizes a lecture on texture mapping in computer graphics. It discusses topics like texture mapping fundamentals, texture coordinates, texture filtering including mipmapping and anisotropic filtering, wrap modes, cube maps, and texture formats. It also provides examples of texture mapping in games and an overview of the texture sampling process in the graphics pipeline.
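Mipmap selection can be summarized numerically; this sketch follows the common convention of taking log2 of the larger screen-space texel footprint (real GPUs add level clamping, LOD bias, and anisotropic refinements, so treat this as an approximation):

```python
# Hedged sketch: pick a mip level from texture-coordinate derivatives.
import math

def mip_level(du_dx, dv_dx, du_dy, dv_dy, tex_size):
    """Mip level from per-pixel (u, v) derivatives, for a square texture."""
    # Footprint of one pixel, in texel units, along screen x and y.
    fx = math.hypot(du_dx, dv_dx) * tex_size
    fy = math.hypot(du_dy, dv_dy) * tex_size
    rho = max(fx, fy)
    return max(0.0, math.log2(rho))

# One screen pixel spans 4 texels of a 256x256 texture -> level 2.
print(mip_level(4 / 256, 0.0, 0.0, 4 / 256, 256))  # 2.0
```

Each mip level halves the texture's resolution, so level 2 is the one where that 4-texel footprint shrinks to roughly one texel, which is what minification filtering wants.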
This document summarizes a lecture on graphics transformations, clipping, and culling. It discusses how vertex positions are transformed from object space to normalized device coordinates space using the modelview and projection matrices. It also covers generalized clipping against the view frustum and user-defined clip planes, as well as back face culling. The lecture provides examples of translation, rotation, scaling, orthographic, and perspective transformations.
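The transform chain above composes 4x4 matrices; this minimal sketch (plain row-major lists, illustrative matrices only) shows how composition order works, with the rightmost matrix applied to the vertex first:

```python
# A minimal sketch of composing 4x4 homogeneous transforms.
def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transform(m, p):
    x, y, z = p
    v = [x, y, z, 1.0]                   # homogeneous point
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def translate(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def scale(s):
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]

# Scale first, then translate (rightmost matrix applies first).
m = mat_mul(translate(1, 0, 0), scale(2))
print(transform(m, (1, 1, 1)))  # [3.0, 2.0, 2.0, 1.0]
```

The same pattern, with the projection matrix multiplied on the left of the modelview matrix, takes vertices from object space toward normalized device coordinates.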
How to represent a digital image in MATLAB
https://www.youtube.com/watch?v=-6U8le3HQlI
https://www.slideshare.net/mustafa_92/working-with-images-inmatlabgraphics-251331243
https://github.com/Mustafa-nafaa/Multimedia-TechnologyLab/tree/main/Week2:Image%20Representation

Topics: what image data is; data types in MATLAB; supported image formats; reading an image from a graphics file; getting information about a graphics file; writing an image to a graphics file; converting an RGB image or colormap to grayscale; image histograms in MATLAB; resizing images in MATLAB; image representation, sampling, and quantization; sampling an image in MATLAB; quantizing an image in MATLAB.

Key functions:
- imread() – read an image with different file extensions
- imresize() – resize an image to any given size
- figure – open a new graphics window
- subplot(#rows, #cols, location) – show several plots/images in one graphics window
- imshow() – display an image
- imquantize(A, levels) – quantize an image

Review questions: What is sampling? What is spatial resolution? What is quantization? What is grey-level resolution?
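The quantization step covered by imquantize(A, levels) can be illustrated outside MATLAB; this pure-Python sketch maps each grey value to the number of thresholds it exceeds (pixel values and thresholds here are made up):

```python
# Language-neutral sketch of threshold quantization, mimicking the idea
# behind MATLAB's imquantize(A, levels): each output value is the count
# of thresholds the input exceeds.
def quantize(pixels, thresholds):
    return [sum(p > t for t in thresholds) for p in pixels]

row = [10, 60, 130, 200, 250]          # one row of 8-bit grey values
print(quantize(row, thresholds=[50, 128, 192]))  # [0, 1, 2, 3, 3]
```

Three thresholds yield four output levels, which is exactly what reducing grey-level resolution means: fewer distinct intensities represent the same spatial samples.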
This document discusses generative adversarial networks (GANs) and the LAPGAN model. It explains that GANs use two neural networks, a generator and discriminator, that compete against each other. The generator learns to generate fake images to fool the discriminator, while the discriminator learns to distinguish real from fake images. LAPGAN improves upon GANs by using a Laplacian pyramid to decompose images into multiple scales, with separate generator and discriminator networks for each scale. This allows LAPGAN to generate sharper images by focusing on edges and conditional information at each scale.
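The Laplacian pyramid decomposition LAPGAN builds on can be shown in one dimension; this toy uses pair-averaging to downsample and repetition to upsample (far cruder than real pyramid kernels, and the signal is made up), but it preserves the key property that coarse level plus residual reconstructs the input exactly:

```python
# 1-D toy of one Laplacian pyramid level: downsample, upsample back,
# keep the residual; coarse + upsampled residual reconstructs exactly.
def down(xs):
    return [(xs[i] + xs[i + 1]) / 2 for i in range(0, len(xs), 2)]

def up(xs):
    return [x for x in xs for _ in (0, 1)]

def laplacian_level(xs):
    coarse = down(xs)
    residual = [a - b for a, b in zip(xs, up(coarse))]
    return coarse, residual

signal = [1.0, 3.0, 2.0, 2.0]
coarse, residual = laplacian_level(signal)
rec = [a + b for a, b in zip(up(coarse), residual)]
print(coarse, residual, rec)  # [2.0, 2.0] [-1.0, 1.0, 0.0, 0.0] [1.0, 3.0, 2.0, 2.0]
```

In LAPGAN, a separate generator works at each level: it is conditioned on the upsampled coarse image and only has to produce the residual (the edges and high-frequency detail), which is why the results look sharper than a single generator producing whole images.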
This talk provides additional details around the hybrid real-time rendering pipeline we developed at SEED for Project PICA PICA. At Digital Dragons 2018, we presented how leveraging Microsoft's DirectX Raytracing enables intuitive implementations of advanced lighting effects, including soft shadows, reflections, refractions, and global illumination. We also dove into the unique challenges posed by each of those domains, discussed the tradeoffs, and evaluated where raytracing fits in the spectrum of solutions.
YouTube: https://www.youtube.com/playlist?list=PLeeHDpwX2Kj55He_jfPojKrZf22HVjAZY Paper review of "Auto-Encoding Variational Bayes"
- The document summarizes a lecture on viewing and representing 3D objects in computer graphics. It discusses representing objects as triangle meshes and storing vertex data in arrays indexed by triangle lists. It also covers transforms like glFrustum and gluLookAt for viewing, and examples of modeling transforms.
- Common ways to represent 3D objects include procedural descriptions, explicit polygon meshes, and implicit surfaces. Triangle meshes stored as unique vertex positions plus triangle index lists are popular due to their efficiency and compatibility with OpenGL/GPU rendering.
- The lecture also covered projection transforms, modeling transforms, lighting, and "look at" camera positioning for 3D viewing. The next lecture will discuss mesh properties and OpenGL rendering details.
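The indexed triangle-mesh layout described above is easy to sketch: unique vertex positions live in one array, and each triangle is a triple of indices into it (the quad geometry here is made up for illustration):

```python
# Indexed triangle mesh: vertices stored once, triangles as index triples.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
            (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
triangles = [(0, 1, 2), (0, 2, 3)]      # a unit quad split into two triangles

def triangle_corners(tri):
    """Resolve one triangle's indices to its three corner positions."""
    return [vertices[i] for i in tri]

print(triangle_corners(triangles[0]))
```

The shared edge (vertices 0 and 2) is stored once rather than duplicated per triangle, which is the efficiency win that makes this layout the natural fit for OpenGL's indexed drawing.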
Logistic regression can be used not only for modeling binary outcomes but also multinomial outcomes with some extension. In this talk, DB will walk through the basic idea of binary logistic regression step by step, then extend it to the multinomial case. He will show how easy it is with Spark to parallelize this iterative algorithm, using the in-memory RDD cache to scale horizontally (in the number of training examples). However, there is a mathematical limitation on scaling vertically (in the number of training features), while many recent applications in document classification and computational linguistics are of this type. He will talk about how to address this problem with the L-BFGS optimizer instead of the Newton optimizer. Bio: DB Tsai is a machine learning engineer at Alpine Data Labs. He has recently been working with the Spark MLlib team to add support for the L-BFGS optimizer and multinomial logistic regression upstream. He also led Apache Spark development at Alpine Data Labs. Before joining Alpine Data Labs, he worked on large-scale optimization of optical quantum circuits at Stanford as a PhD student.
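The multinomial extension boils down to a softmax over one score per class; this sketch shows the prediction side only (the weights and input are made up, and training is omitted):

```python
# Sketch of the softmax at the heart of multinomial logistic regression:
# one weight vector per class; scores become class probabilities.
import math

def softmax(scores):
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict(W, x):
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for w in W]
    return softmax(scores)

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 classes, 2 features
probs = predict(W, [2.0, 0.5])
print(probs.index(max(probs)))  # 0
```

With K classes and D features the weight matrix is K x D, which is why the model scales easily in examples (each row of data is independent) but hits limits in features, motivating the L-BFGS discussion above.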
The slides of the talk at http://www.meetup.com/R-Users-Sydney/events/223867196/ A web version is available here: http://wush978.github.io/FeatureHashing/index.html
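The hashing trick behind the FeatureHashing package can be sketched in a few lines; features are indexed by a hash of their name instead of a fitted dictionary, so the output vector has a fixed size no matter how many feature names appear (the hash choice, sign trick, and bucket count here are illustrative, not the package's exact scheme):

```python
# Sketch of the hashing trick: fold (name, value) features into a
# fixed-size vector via a deterministic hash of the feature name.
import zlib

def hash_features(pairs, n_buckets=8):
    vec = [0.0] * n_buckets
    for name, value in pairs:
        h = zlib.crc32(name.encode())            # deterministic hash
        idx = h % n_buckets
        sign = 1.0 if (h >> 31) & 1 == 0 else -1.0  # sign bit reduces collision bias
        vec[idx] += sign * value
    return vec

v = hash_features([("city=Sydney", 1.0), ("age", 0.5)])
print(len(v))  # 8
```

Because no dictionary has to be built or shipped, hashing is popular for high-cardinality categorical features, at the cost of occasional collisions the signed sum partially cancels out.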
1. DiscoGAN is a method for learning to discover cross-domain relations without explicitly paired data, using generative adversarial networks.
2. It uses two coupled GANs that map each domain into the other, allowing domain transfer while preserving key attributes.
3. Results show DiscoGAN performs better than other methods and is more robust to the mode-collapse problem, thanks to the symmetry granted by coupling the two GANs.
This document summarizes the DiscoGAN model, which uses generative adversarial networks to discover relations between image domains without paired training examples. It introduces GANs and the DiscoGAN model, which uses two generators and discriminators with reconstruction and adversarial losses to learn bijective mappings between domains. Experiments show DiscoGAN can discover relations like azimuth angle between car images and translate attributes like gender between faces while maintaining other features. Code links for TensorFlow and PyTorch implementations are also provided.
For many companies, recommendation systems solve important machine learning problems. But as recommendation systems grow to millions of users and millions of items, they pose significant challenges when deployed at scale. The user-item matrix can have trillions of entries (or more), most of which are zero. To make common ML techniques practical, sparse data requires special techniques. Learn how to use MXNet to build neural network models for recommendation systems that can scale efficiently to large sparse datasets.
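The sparsity problem above is easy to see in code: storing only the nonzero ratings keeps memory proportional to observed interactions, not to users times items. This dict-of-dicts sketch (data made up; MXNet uses CSR arrays, not Python dicts) illustrates the principle:

```python
# Sketch of sparse user-item storage: only nonzero ratings are kept;
# everything else is an implicit zero.
def make_sparse(ratings):
    """ratings: iterable of (user, item, value) with value != 0."""
    m = {}
    for user, item, value in ratings:
        m.setdefault(user, {})[item] = value
    return m

def lookup(m, user, item):
    return m.get(user, {}).get(item, 0.0)    # implicit zeros

r = make_sparse([(0, 7, 5.0), (0, 9, 3.0), (2, 7, 1.0)])
print(lookup(r, 0, 9), lookup(r, 1, 4))  # 3.0 0.0
```

A trillion-entry dense matrix is infeasible, but a few billion observed interactions stored this way fit comfortably; the same reasoning motivates the CSR formats and sparse operators MXNet provides.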
The document provides an overview of Spark and its machine learning library MLlib. It discusses how Spark uses resilient distributed datasets (RDDs) to perform distributed computing tasks across clusters in a fault-tolerant manner. It summarizes the key capabilities of MLlib, including its support for common machine learning algorithms and how MLlib can be used together with other Spark components like Spark Streaming, GraphX, and SQL. The document also briefly discusses future directions for MLlib, such as tighter integration with DataFrames and new optimization methods.
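The RDD programming model is a chain of transformations over partitioned data; without assuming a Spark installation, this pure-Python word count mirrors the flatMap -> map -> reduceByKey flow an RDD pipeline would express (the input lines are made up):

```python
# Local analogue of an RDD word-count pipeline, step for step.
from collections import defaultdict

lines = ["spark makes clusters", "spark makes pipelines"]

# flatMap: split each line into words.
words = [w for line in lines for w in line.split()]
# map: pair each word with a count of 1.
pairs = [(w, 1) for w in words]
# reduceByKey: sum counts per word.
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))
```

In Spark the same three steps run per partition across the cluster, and lineage (which transformations produced which RDD) is what lets lost partitions be recomputed for fault tolerance.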
This document summarizes Toyota's use of Apache Spark for customer analytics. It discusses how Spark has helped Toyota improve the performance of customer experience analytics jobs from 160 hours to 4 hours. It also outlines Toyota's efforts to develop machine learning models in Spark for categorizing social media conversations. Specifically, it details the development of an SVM model for classifying conversations about brake noise, achieving over 80% accuracy. The document advocates for continuous improvement and shares lessons learned in working with Spark.
This document discusses Spark ML pipelines for machine learning workflows. It begins with an introduction to Spark MLlib and the various algorithms it supports. It then discusses how ML workflows can be complex, involving multiple data sources, feature transformations, and models. Spark ML pipelines allow specifying the entire workflow as a single pipeline object. This simplifies debugging, re-running on new data, and parameter tuning. The document provides an example text classification pipeline and demonstrates how data is transformed through each step via DataFrames. It concludes by discussing upcoming improvements to Spark ML pipelines.
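The pipeline idea is framework-independent: each stage transforms the data and hands it to the next, so the whole workflow is a single object that can be re-run or re-tuned. This minimal sketch (not Spark ML's actual Pipeline/Transformer API, just the shape of it) shows a two-stage text-processing chain:

```python
# Framework-free sketch of the pipeline pattern: stages applied in order.
class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def run(self, data):
        for stage in self.stages:
            data = stage(data)
        return data

def lowercase(docs):
    return [d.lower() for d in docs]

def tokenize(docs):
    return [d.split() for d in docs]

pipe = Pipeline([lowercase, tokenize])
print(pipe.run(["Spark ML", "ML Pipelines"]))  # [['spark', 'ml'], ['ml', 'pipelines']]
```

Spark ML's version adds the fit/transform split (Estimators learn parameters, Transformers apply them) and passes DataFrames between stages, which is what makes parameter tuning and re-running on new data uniform across the whole workflow.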
This document discusses running Spark applications on YARN and managing Spark clusters. It covers challenges like predictable job execution times and optimal cluster utilization. Spark on YARN is introduced as a way to leverage YARN's resource management. Techniques like dynamic allocation, locality-aware scheduling, and resource queues help improve cluster sharing and utilization for multi-tenant workloads. Security considerations for shared clusters running sensitive data are also addressed.
Spark on YARN provides resource management and security features through YARN, but still has areas for improvement. Dynamic allocation lets Spark applications grow and shrink their executors based on task demand, though latency and data locality could be enhanced. Security support includes Kerberos authentication and delegation tokens, but long-lived applications face token expiration issues, and encryption needs improvement for the control plane, shuffle files, and user interfaces. Overall, usability, security, and performance remain areas of focus.
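The dynamic-allocation behaviour described above is driven by a handful of Spark configuration properties; the executor counts and timeout below are illustrative values, not recommendations from the talk:

```properties
# Let Spark grow and shrink executors with task demand. The external
# shuffle service is required so shuffle files survive executor removal.
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         2
spark.dynamicAllocation.maxExecutors         50
# Release an executor after it has been idle this long.
spark.dynamicAllocation.executorIdleTimeout  60s
```

Bounding executors per application and routing applications into YARN resource queues is how a shared cluster keeps one tenant from starving the others.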
This document discusses the challenges of big data analytics and how Apache Spark and Databricks can help address them. It summarizes that:
1) There is a gap between the growth of data and the ability to perform real-time analytics on that data, due to challenges in managing infrastructure, empowering teams, and establishing production-ready applications.
2) Databricks provides a cloud-hosted platform that uses Apache Spark to allow for just-in-time processing of data across storage silos, with an integrated workspace for interactive exploration, machine learning, and production-ready workflows.
3) Databricks Enterprise Security provides an end-to-end security solution for Apache Spark to address challenges in securing file
The document provides an agenda for a DevOps advanced class on Spark being held in June 2015. The class will cover topics such as RDD fundamentals, Spark runtime architecture, memory and persistence, Spark SQL, PySpark, and Spark Streaming. It will include labs on DevOps 101 and 102. The instructor has over 5 years of experience providing Big Data consulting and training, including over 100 classes taught.