The document discusses machine learning concepts and approaches for practical implementation in enterprises. It defines key terms like business analytics, predictive analytics, and machine learning. Business analytics answer questions about past data through queries, while predictive analytics uses algorithms to predict future probabilities and outcomes. The document also outlines challenges to enterprise adoption of machine learning and how vendors are helping to address skills gaps through cloud-based tools and services.
Seeing Redshift: How Amazon Changed Data Warehousing Forever (Inside Analysis)
The Briefing Room with Claudia Imhoff and Birst
Live Webcast April 9, 2013
What a difference a day can make! When Amazon announced their new Redshift offering – a data warehouse in the cloud – the entire industry of information management changed. The most notable disruption? Price. At a whopping $1,000 per year for a terabyte, Redshift achieved a price-point improvement that amounts to at least two orders of magnitude, if not three, when compared to its top-tier competitors. But pricing is just one change; there's also the entire process by which data warehousing is done.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Claudia Imhoff explain why a new cloud-based reality for data warehousing significantly changes the game for business intelligence and analytics. She'll be briefed by Brad Peters of Birst, who will tout his company's BI solution, which has been specifically architected for cloud-based hosting. Peters will discuss several key intricacies of doing BI in the cloud, including the unique provisioning, loading and modeling requirements. Founded in 2004, Birst has nearly a decade of experience in cloud-based BI and analytics.
Visit: http://www.insideanalysis.com
The right architecture is key for any IT project. This is especially the case for big data projects, where there are no standard architectures which have proven their suitability over years. This session discusses the different Big Data Architectures which have evolved over time, including traditional Big Data Architecture, Streaming Analytics architecture as well as Lambda and Kappa architecture and presents the mapping of components from both Open Source as well as the Oracle stack onto these architectures.
The right architecture is key for any IT project. This holds for big data projects as well, but there are not yet many standard architectures that have proven their suitability over the years.
This session discusses different Big Data Architectures which have evolved over time, including traditional Big Data Architecture, Event Driven architecture as well as Lambda and Kappa architecture.
Each architecture is presented in a vendor- and technology-independent way using a standard architecture blueprint. In a second step, these architecture blueprints are used to show how a given architecture can support certain use cases and which popular open source technologies can help to implement a solution based on a given architecture.
Analysing data analytics use cases to understand big data platform (dataeaze systems)
Get the big picture of data platform architecture by understanding its purpose and the problems it solves.
These slides take a top-down approach, starting with the basic purpose of a data platform, i.e., serving analytics use cases. They categorise use cases and analyse their expectations of the data platform.
What’s New with Databricks Machine Learning (Databricks)
In this session, the Databricks product team provides a deeper dive into the machine learning announcements. Join us for a detailed demo that gives you insights into the latest innovations that simplify the ML lifecycle — from preparing data, discovering features, and training and managing models in production.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2OUz6dt.
Chris Riccomini talks about the current state-of-the-art in data pipelines and data warehousing, and shares some of the solutions to current problems dealing with data streaming and warehousing. Filmed at qconsf.com.
Chris Riccomini works as a Software Engineer at WePay.
Motorists insurance company was facing challenges from aging systems, data silos, and an inability to analyze new types of data sources. They partnered with Saama Technologies to implement a hybrid Hadoop and SQL data warehouse ecosystem to consolidate their internal and external data in a scalable and cost-effective manner. This allowed Motorists to gain new insights from claims data, reduce load times by 30% with potential for 70% improvements, and save hundreds of hours on report building. Saama's Fluid Analytics for Insurance solution established a robust data foundation and provided self-service reporting and predictive analytics capabilities. The new environment enabled enterprise-wide data access and advanced analytics to improve business performance.
There are patterns for things such as domain-driven design, enterprise architectures, continuous delivery, microservices, and many others.
But where are the data science and data engineering patterns?
Sometimes, data engineering reminds me of cowboy coding - many workarounds, immature technologies and lack of market best practices.
This document discusses big data and AWS tools for managing it. It defines big data as data with high volume, velocity and variety. AWS provides scalable tools like EC2, EMR, Kinesis and Redshift to handle the ingestion, storage, processing and analysis of large and diverse datasets in the cloud. These tools work together in an integrated environment and auto-scale based on demand, providing a cost-effective solution for big data challenges. An example use case of real-time IoT analytics is presented to illustrate how different AWS products interact to build scalable data pipelines.
Everyone is awash in the new buzzword, Big Data, and it seems as if you can’t escape it wherever you go. But there are real companies with real use cases creating real value for their businesses by using big data. This talk will discuss some of the more compelling current or recent projects, their architecture & systems used, and successful outcomes.
Big Data & Analytics - Use Cases in Mobile, E-commerce, Media and more (Amazon Web Services)
This document discusses how companies can use Amazon Web Services (AWS) big data and analytics services like Amazon Elastic MapReduce (EMR), Amazon Redshift, Amazon DynamoDB, and Amazon Kinesis to gain insights from massive amounts of data. It provides examples of how companies in various industries like mobile, e-commerce, media, and gaming use these AWS services for use cases like recommendations, targeted advertising, fraud detection, and real-time analytics. The document also compares different AWS analytics services and discusses best practices for deploying big data solutions on AWS.
Vladimir Slobodyanyuk, «DWH & BigData – architecture approaches» (Anna Shymchenko)
This document discusses approaches to data warehouse (DWH) and big data architectures. It begins with an overview of big data, describing its large size and complexity that makes it difficult to process with traditional databases. It then compares Hadoop and relational database management systems (RDBMS), noting pros and cons of each for distributed computing. The document outlines how Hadoop uses MapReduce and has a structure including HDFS, HBase, Hive and Pig. Finally, it proposes using Hadoop as an ETL and data quality tool to improve traceability, reduce costs and handle exception data cleansing more effectively.
Democratizing data science using Spark, Hive and Druid (DataWorks Summit)
MZ is re-inventing how the entire world experiences data via our mobile games division MZ Games Studios, our digital marketing division Cognant, and our live data platform division Satori.
The growing need for data science capabilities across the organization requires an architecture that can democratize building these applications and disseminating insight from their outcomes to the wider organization.
Attend this session to learn how we built a platform for data science using Spark, Hive, and Druid, specifically for our performance marketing division Cognant. This platform powers several data science applications, like fraud detection and bid optimization, at large scale.
We will be sharing lessons learned over the past 3 years of building this platform, walking through some of the actual data science applications built on top of it.
Attendees with ML engineering and data science backgrounds can gain deep insight from our experience of building this platform.
Speakers
Pushkar Priyadarshi, Director of Engineering, Machine Zone Inc.
Igor Yurinok, Staff Software Engineer, MZ
This talk given at the Hadoop Summit in San Jose on June 28, 2016, analyzes a few major trends in Big Data analytics.
These are a few takeaways from this talk:
- Adopt Apache Beam for easier development and portability between Big Data Execution Engines.
- Adopt stream analytics for faster time to insight, competitive advantages and operational efficiency.
- Accelerate your Big Data applications with In-Memory open source tools.
- Adopt Rapid Application Development of Big Data applications: APIs, Notebooks, GUIs, Microservices…
- Make Machine Learning part of your strategy, or passively watch your industry be completely transformed!
- Advance your strategy for hybrid integration between cloud and on-premise deployments.
This document discusses building a scalable data science platform with R. It describes R as a popular statistical programming language with over 2.5 million users. It notes that while R is widely used, its open source nature means it lacks enterprise capabilities for large-scale use. The document then introduces Microsoft R Server as a way to bring enterprise capabilities like scalability, efficiency, and support to R in order to make it suitable for production use on big data problems. It provides examples of using R Server with Hadoop and HDInsight on the Azure cloud to operationalize advanced analytics workflows from data cleaning and modeling to deployment as web services at scale.
Forget becoming a Data Scientist, become a Machine Learning Engineer instead (Data Con LA)
Data Con LA 2020
Description
Machine learning is an essential skill in today's job market. But when it comes to learning machine learning, beginners get a lot of conflicting advice. I have been teaching ML to software engineers for years. In this talk:
* I will dispel some of the myths surrounding machine learning
* give you a solid, tangible plan on how to go about learning ML
* give you good pointers to start from
* and steer you away from common mistakes
Speaker
Sujee Maniyam, Elephant Scale, Founder, Principal instructor
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016 (StampedeCon)
The document discusses using a data lake approach with EMC Isilon storage to address various business use cases. It describes how the solution provides shared storage for multiple workloads through multi-protocol support, enables data protection and isolation of client data, and allows testing applications across Hadoop distributions through a common platform. Examples are given of how this approach supports an enterprise data hub, data warehouse offloading, data integration, and enrichment services.
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo (Databricks)
This document discusses Databricks' goal of democratizing access to Spark. It introduces the Databricks cloud platform, which provides a hosted model for Spark with rapid releases, dynamic scaling, and security controls. The platform is used for just-in-time data warehousing, advanced analytics, and real-time use cases. Many companies struggle with the steep learning curve and costs of big data projects. To empower more developers, Databricks trained thousands on Spark and launched online courses with over 100,000 students. They are announcing the Databricks Community Edition, a free version of their platform, to further democratize access to Spark through mini clusters, notebooks, APIs, and continuous delivery of learning content.
Modernizing to a Cloud Data Architecture (Databricks)
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how the benefits of elastic compute models helped one customer scale their analytics and AI workloads, along with best practices from their experience of a successful migration of their data and workloads to the cloud.
Provides a brief overview of what machine learning is, how it works (theory), how to prepare data for a machine learning problem, an example case study, and additional resources.
The document discusses how startups can build scalable applications without servers by leveraging serverless architectures on AWS. It describes how Dean Bryen challenged himself to build an image processing microservice within 45 minutes using only AWS services like Lambda, API Gateway, S3, and DynamoDB without any servers or monolithic code. The microservice included a static site in S3, an API built with API Gateway that triggers a Lambda function for image processing, and stores results in DynamoDB. This demonstrated how platform services can provide high availability and scalability without the need to manage infrastructure. The document also discusses how Gousto evolved from a monolithic PHP application to a microservices architecture using Lambda and other AWS services.
Roll Your Own API Management Platform with nginx and Lua (Jon Moore)
We recently replaced a proprietary API management solution with an in-house implementation built with nginx and Lua that let us get to a continuous delivery practice in a handful of months. Learn about our development process and the overall architecture that allowed us to write minimal amounts of code, enjoying native code performance while permitting interactive coding, and how we leveraged other open source tools like Vagrant, Ansible, and OpenStack to build an automation-rich delivery pipeline. We will also take an in-depth look at our capacity management approach, which differs from the rate limiting concept prevalent in the API community.
Application of machine learning in industrial applications (Anish Das)
The group will present an introduction to machine learning, the basics of machine learning, and applications of machine learning in industry such as product categorization, improving the accuracy of inertial measurement units using supervised machine learning, data mining techniques, and machine learning for medical diagnosis. They will also discuss the future scope of machine learning.
Production and Beyond: Deploying and Managing Machine Learning Models (Turi, Inc.)
1) Deploying machine learning models into production involves evaluating, monitoring, deploying, and managing models over their lifecycle.
2) Evaluation involves continuously tracking metrics on both historical and live data to determine when models need to be updated. Monitoring involves choosing between existing models, such as by using A/B testing or multi-armed bandits.
3) Dato provides tools to simplify each stage of the machine learning lifecycle from batch training to real-time predictions to continuous evaluation and management of models in production.
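The model-comparison idea in point 2 (A/B tests, multi-armed bandits) can be sketched as a minimal epsilon-greedy loop. The two model names and their reward rates below are invented for illustration and are not taken from Dato's tooling:

```python
import random

random.seed(7)

# Hypothetical per-request success rates for two deployed model versions.
true_rates = {"model_a": 0.05, "model_b": 0.11}
counts = {m: 0 for m in true_rates}
wins = {m: 0 for m in true_rates}
epsilon = 0.1  # fraction of traffic spent exploring


def estimate(m):
    # Observed success rate so far (0 if the arm was never pulled).
    return wins[m] / counts[m] if counts[m] else 0.0


for _ in range(5000):
    if random.random() < epsilon:
        arm = random.choice(list(true_rates))  # explore a random model
    else:
        arm = max(true_rates, key=estimate)    # exploit the current best
    counts[arm] += 1
    wins[arm] += random.random() < true_rates[arm]

# Over time, most traffic drifts toward the better-performing model,
# while the epsilon fraction keeps evaluating the alternative.
print(counts)
```

A plain A/B test would instead split traffic 50/50 for a fixed window and compare rates once at the end; the bandit trades some statistical cleanliness for lower opportunity cost during the experiment.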
Production machine learning infrastructure (joshwills)
This document discusses building machine learning infrastructure to scale data science from the lab to production. It describes two types of data scientists - those focused on investigative analytics in the lab and those building production systems in the factory. Moving analytics from the lab to the factory requires a shift from question-driven and ad-hoc work to metric-driven and automated systems. The document outlines steps to begin this transition such as choosing a good problem, logging everything, and hiring more data scientists. It also describes tools and techniques for experimentation in production machine learning.
Machine Learning and Real-World Applications (MachinePulse)
This presentation was created by Ajay, Machine Learning Scientist at MachinePulse, to present at a Meetup on Jan. 30, 2015. These slides provide an overview of widely used machine learning algorithms. The slides conclude with examples of real world applications.
Ajay Ramaseshan is a Machine Learning Scientist at MachinePulse. He holds a Bachelor's degree in Computer Science from NITK Surathkal and a Master's in Machine Learning and Data Mining from Aalto University School of Science, Finland. He has extensive experience in the machine learning domain and has dealt with various real-world problems.
This document discusses challenges in running machine learning applications in production environments. It notes that while Kaggle competitions focus on accuracy, real-world applications require balancing accuracy with interpretability, speed and infrastructure constraints. It also emphasizes that machine learning in production is as much a software and systems problem as a modeling problem. Key aspects that are discussed include flexible and scalable deployment architectures, model versioning, packaging and serving, online evaluation and experiments, and ensuring reproducibility of results.
Gluecon Monitoring Microservices and Containers: A Challenge (Adrian Cockcroft)
This document discusses the challenges of monitoring microservices and containers. It provides six rules for effective monitoring: 1) spend more time on analysis than data collection, 2) reduce latency of key metrics to under 10 seconds, 3) validate measurement accuracy, 4) make monitoring more available than services monitored, 5) optimize for distributed cloud-native applications, 6) fit metrics to models to understand relationships. It also examines models for infrastructure, flow, and ownership and discusses speed, scale, failures, and testing challenges with microservices.
The document discusses building data pipelines in the cloud. It covers serverless data pipeline patterns using services like BigQuery, Cloud Storage, Cloud Dataflow, and Cloud Pub/Sub. It also compares Cloud Dataflow and Cloud Dataproc for ETL workflows. Key questions around ingestion and ETL are discussed, focusing on volume, variety, velocity and veracity of data. Cloud vendor offerings for streaming and ETL are also compared.
This document is a thesis submitted by Jinxing Lin to Cranfield University in partial fulfillment of a Master of Science degree. The thesis investigates applying machine learning techniques for sales forecasting. It includes a literature review covering machine learning algorithms that have been applied for sales forecasting, such as regression trees, support vector machines, neural networks, and extreme learning machine. The methodology section describes the data source and preparation, as well as techniques to be applied including random forest regression, time series forecasting, and evaluating results. The thesis aims to study machine learning algorithms and apply them to a dataset to perform sales forecasting.
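As a flavor of the random forest regression the thesis reviews for sales forecasting, here is a minimal lag-feature sketch in scikit-learn. The monthly series is synthetic and the choice of three lag features is an assumption for illustration, not the thesis's actual setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic monthly sales: a drifting random walk, for demonstration only.
rng = np.random.default_rng(1)
sales = 100 + np.cumsum(rng.normal(1.0, 5.0, 120))

# Turn the series into a supervised problem: predict each month from
# the previous three months (lag features).
X = np.column_stack([sales[i:i + 117] for i in range(3)])
y = sales[3:]

# Train on the first 100 months, forecast the remaining 17.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:100], y[:100])
preds = model.predict(X[100:])
print(preds.shape)  # (17,)
```

Note that a time-ordered train/test split is used rather than a random one, since shuffling would leak future values into the training set.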
1. The document discusses rational decision making and business intelligence. It defines rational decision making as selecting the optimal alternative based on analyzing past data and considering various performance criteria.
2. It describes the typical cycle of a business intelligence analysis as involving defining objectives, generating insights from data analysis, making decisions based on insights, and evaluating performance.
3. Key components of business intelligence architectures are data sources, data warehouses/marts for storing and processing data, and business intelligence tools for generating insights and supporting decision making.
Credit card fraud detection using Python machine learning (Sandeep Garg)
This document provides an overview of machine learning tools, technologies, and the data preparation process. It discusses collecting and selecting relevant data, data visualization, labeling data for supervised learning, and transforming raw data into a tidy format. The document also covers various data preprocessing techniques, including data cleaning, formatting, handling missing values and outliers, smoothing, aggregation, generalization, and data reduction methods. The goal of these preprocessing steps is to prepare raw data into a structured format suitable for machine learning modeling.
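A few of the preprocessing steps listed above (handling missing values, dropping unlabeled rows, capping outliers) can be sketched with pandas; the column names and values here are invented for illustration:

```python
import pandas as pd

# Tiny hypothetical raw dataset with a missing value, a missing label,
# and one extreme outlier.
raw = pd.DataFrame({
    "amount": [12.0, None, 15.5, 4800.0, 14.2],
    "label":  ["ok", "ok", "fraud", "fraud", None],
})

clean = raw.copy()
# Fill missing numeric values with the column median.
clean["amount"] = clean["amount"].fillna(clean["amount"].median())
# Supervised learning needs labels, so drop rows without one.
clean = clean.dropna(subset=["label"])
# Cap extreme outliers at the 95th percentile (a simple winsorization).
cap = clean["amount"].quantile(0.95)
clean["amount"] = clean["amount"].clip(upper=cap)

print(clean)
```

The thresholds (median imputation, 95th percentile cap) are arbitrary choices for the sketch; the document's point is that such decisions are part of turning raw data into a tidy, model-ready table.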
This white paper discusses how companies can apply data science insights to improve products and operations. It describes the typical data science project lifecycle, including problem definition, data collection, model building and testing. However, many companies struggle to deploy models into production applications. The paper argues that data science teams need tools that allow models to be easily updated and redeployed without disrupting operations. The Yhat platform aims to streamline this process and help companies more quickly turn insights into data-driven products.
A Data Warehouse And Business Intelligence Application (Kate Subramanian)
The document outlines a project to develop a real-time fraud detection system for banking transactions by capturing functional and non-functional requirements, including system capabilities, interfaces, performance needs, security requirements, and an overall design architecture. The goal is to help banks identify fraudulent transactions in real-time through analyzing banking data and transactions based on pre-defined rules to flag suspicious activity and prevent financial losses from fraud.
The objective of this project is to discuss the importance of Machine Learning in different sectors and how it solves problems in the Marketing Analytics field. We have discussed Marketing Segmentation, Advertisement, and Fraud detection in our project. We used different Machine Learning algorithms and R and Python libraries to predict and solve these problems. After building models and running test data on them, we got the following results:
• We trained Decision Tree and Random Forest classifier models that achieve 73% accuracy in predicting whether a person will be a defaulter based on credit history, income, job type, dependents, etc.
• We segmented social networking profiles based on the likes and dislikes of a person using K-Means Clustering.
• We built a predictive model of the messages a customer receives and determined whether a message is spam or not, with an accuracy of 97%. We used a Naïve Bayes classifier for this model.
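A minimal sketch of that kind of Naïve Bayes spam classifier, using scikit-learn on a tiny invented message sample (the messages and labels are illustrative, not the project's data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training messages with spam/ham labels.
messages = [
    "win a free prize now", "limited offer claim cash",
    "meeting moved to noon", "see you at lunch tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feed a multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["claim your free cash prize"])[0])  # spam
```

On a real SMS/email corpus the same pipeline applies unchanged; only the training data and an evaluation split (to measure the kind of 97% accuracy the project reports) would be added.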
The document describes a Driverless ML API that was created to automate machine learning workflows including feature engineering, model validation, tuning, selection, and deployment. The API uses machine learning interpretability techniques to provide visualizations and explanations of models. It aims to help scale data science efforts and enable both expert and junior data scientists to more quickly develop accurate, production-ready models. Key capabilities of the API include automated exploratory data analysis, feature selection and engineering, model selection and hyperparameter tuning using GPUs for faster training, and model interpretability visualizations.
This document discusses Oracle's approach to big data and information architecture. It begins by explaining what makes big data different from traditional data, noting that big data refers to large datasets that are challenging to store, search, share, visualize, and analyze due to their volume, velocity, and variety. It then provides an overview of big data architecture capabilities and describes how to integrate big data capabilities into an organization's overall information architecture. The document concludes by outlining some key big data architecture considerations and best practices.
This document discusses Oracle's approach to big data and information architecture. It begins by explaining what makes big data different from traditional data, noting that big data refers to large datasets that are challenging to store, search, share, visualize, and analyze due to their volume, velocity, and variety. It then provides an overview of big data architecture capabilities and describes how to integrate big data capabilities into an organization's overall information architecture. The document concludes by outlining some key big data use cases and best practices for organizations adopting big data.
1) The document discusses a self-study approach to learning data science through project-based learning using various online resources.
2) It recommends breaking down projects into 5 steps: defining problems/solutions, data extraction/preprocessing, exploration/engineering, model implementation, and evaluation.
3) Each step requires different skillsets from domains like statistics, programming, SQL, visualization, mathematics, and business knowledge.
Machine Learning: The First Salvo of the AI Business Revolution (Cognizant)
Machine learning (ML), a branch of artificial intelligence (AI), is coming into its own as a force in the business landscape, performing a variety of innovative and highly skilled activities that enhance customer experience and offer market advantages. This is a brief guide to getting started with ML, the thinking, tools and frameworks to make it a powerful business tool.
Accelerating Machine Learning as a Service with Automated Feature Engineering (Cognizant)
Building scalable machine learning as a service, or MLaaS, is critical to enterprise success. Key to translating machine learning project success into program success is solving the evolving, convoluted data engineering challenge using local and global data. Enabling the sharing of data features across a multitude of models, within and across various lines of business, is pivotal to program success.
Machine learning engineers are computer programmers who develop machines and systems that can learn and apply knowledge without specific direction. This article explores the work of machine learning engineers, the skills and education needed for the role, and how to become a machine learning engineer. Key skills include computer programming, strong mathematical skills, and knowledge of machine learning algorithms and libraries. A master's or PhD is typically required for machine learning engineer roles.
To effectively leverage the power of rich visualizations in making data-driven decisions, you must significantly reduce front-end data preparation time.
In order to create visualizations that lead to answers quickly, you need to prepare your data in the right way. Together, Alteryx and Tableau can help. This paper will show you how.
Operational Analytics: Best Software For Sourcing Actionable Insights 2013 (Newton Day Uploads)
Actionable Insights are those views of data that cause managers to ask new questions about how processes work and take action. They differ from traditional key performance measures and daily operating reports that focus on delivering a picture of progress against a strategic objective, operating budget or forecast. What software is best for your business to source these game-changing perspectives of your enterprise?
BA is used to gain insights that inform business decisions and can be used to automate and optimize business processes. Data-driven companies treat their data as a corporate asset and leverage it for a competitive advantage. Successful business analytics depends on data quality, skilled analysts who understand the technologies and the business, and an organizational commitment to data-driven decision-making.
Business analytics examples
Business analytics techniques break down into two main areas. The first is basic business intelligence. This involves examining historical data to get a sense of how a business department, team or staff member performed over a particular time. This is a mature practice that most enterprises are fairly accomplished at using.
Module Overview Careers in Analytics In this module, we .docx (audeleypearl)
Module Overview | Careers in Analytics

In this module, we will evaluate the various quantitative data collection and analysis methods in standard industry practice. These methods are what will be used throughout this program, so you should become familiar with the terminology.

The second part of this module presents a variety of career paths for data analysts and an overview of how several industries are currently using data analytics. Pay special attention to the intersection of skills necessary for a data analyst to possess, and think of the steps you can take to gain or improve on these in your own skill set. This may give you an idea of the career path and industry you would like to pursue, or enhance your understanding of a career path and industry you have already chosen.
Industry Practice
Learning Objectives
Explain the technical elements and steps associated with analytics practices and processes
Explore industry practice of data analytics
Typical Quantitative Techniques Used in Advanced Analytics
Several quantitative techniques apply to analytics projects, including:
Simulation: Randomized repetitions of a set of discrete events in order to model real-world systems and phenomena (e.g., queues)
Optimization: An algorithm selects the best possible outcome, subject to satisfying constraints
Matrix Algebra: Calculations involving matrices solve multidimensional problems
Fitting Functions to Data: Also called “curve fitting”; using numerical methods to interpolate data
Survival Analysis: Originally used by life scientists, but adopted by marketers and actuaries
Time Series: Used when data are “auto-correlated,” such as time-dependent data (also called “Box-Jenkins”)
Predictive Analytics and Machine Learning:
Classical Statistics
  Descriptive: calculates metrics to characterize the distribution of values of data (mean, standard deviation, range, etc.)
  Predictive: estimates parameters using historical data and makes predictions of future outcomes (multivariate regression, generalized linear regression, etc.)
Learning
  Unsupervised learning: characterizes the data to establish classes without using explicit metrics, e.g., k-means clustering
  Supervised learning: classifies and describes the data with pre-defined ‘labels,’ e.g., decision trees
Bayesian: Used to augment classical analysis when there is prior knowledge about how the data was generated
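As a small illustration of the “fitting functions to data” technique above, here is a least-squares quadratic fit with NumPy; the data points are synthetic, generated purely for this sketch:

```python
import numpy as np

# Synthetic observations of y = 2x^2 - 3x + 1 with Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x**2 - 3.0 * x + 1.0 + rng.normal(0, 1.0, x.size)

# polyfit returns coefficients highest degree first: [a, b, c] for ax^2 + bx + c.
coeffs = np.polyfit(x, y, deg=2)
print(coeffs)  # close to [2, -3, 1]
```

The same numerical machinery underlies many of the other rows: optimization minimizes the squared error here, and predictive regression is the same idea with multiple explanatory variables.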
Typical Challenges and Pitfalls in an Analytics Project
1. Poorly defined problem
• Unclear goal of problem-solving
• Scope is unclear, e.g., how many SKUs to analyze
• Mixed objectives, e.g., economic analysis of a product category promotion with retailer versus CPG goals mixed
2. Limited IT resources
• Cloud data can’t be acquired off-line within a reasonable time
• Can’t run the complete model due to computation limitations
• Too slow to generate results in real time
• Can’t share.
Module Overview Careers in Analytics In this module, we .docxroushhsiu
Module Overview | Careers in
Analytics
In this module, we will evaluate the various quantitative data collection and analysis methods in
standard industry practice. These methods are what will be used throughout this program, so
you should become familiar with the terminology.
The second part of this module presents a variety of career paths for data analysts and an
overview of how several industries are currently using data analytics. Pay special attention to
the skills a data analyst needs to possess, and think about the steps you can take to gain or
improve these in your own skill set. This may give you an idea of the career path and industry
you would like to pursue, or deepen your understanding of a career path and industry you have
already chosen.
Industry Practice
Learning Objectives
Explain the technical elements and steps associated with analytics practices and processes
Explore industry practice of data analytics
Typical Quantitative Techniques Used in Advanced Analytics
Several quantitative techniques apply to analytics projects, including:
Simulation: Randomized repetitions of a set of discrete events in order to model real-world
systems and phenomena (e.g., queues)

Optimization: An algorithm selects the best possible outcome, subject to satisfying constraints

Matrix Algebra: Calculations involving matrices solve multidimensional problems

Fitting Functions to Data: Also called “curve fitting”; numerical methods are used to
interpolate data

Survival Analysis: Originally used by life scientists, but adopted by marketers and actuaries

Time Series: Used when data are “auto-correlated,” such as time-dependent data (also called
“Box-Jenkins”)
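The simulation technique above can be made concrete with a short Monte Carlo model of a single-server queue: randomized discrete events (arrivals and services) are repeated many times to estimate average waiting time. The arrival and service rates here are invented purely for illustration.

```python
import random

def simulate_queue(arrival_rate, service_rate, n_customers, seed=42):
    """Monte Carlo simulation of a single-server queue.

    Returns the average time a customer spends waiting in line.
    """
    rng = random.Random(seed)
    arrival_time = 0.0
    server_free_at = 0.0
    total_wait = 0.0
    for _ in range(n_customers):
        # Exponentially distributed inter-arrival and service times
        arrival_time += rng.expovariate(arrival_rate)
        service_start = max(arrival_time, server_free_at)
        total_wait += service_start - arrival_time
        server_free_at = service_start + rng.expovariate(service_rate)
    return total_wait / n_customers

# A busier server (higher utilization) should produce longer average waits
light_load = simulate_queue(arrival_rate=1.0, service_rate=5.0, n_customers=10_000)
heavy_load = simulate_queue(arrival_rate=1.0, service_rate=1.2, n_customers=10_000)
print(light_load < heavy_load)
```

Repeating the simulation with different seeds and averaging the results is what turns a single randomized run into a usable estimate.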
Predictive Analytics and Machine Learning

Classical Statistics: Descriptive statistics calculate metrics that characterize the
distribution of data values (mean, standard deviation, range, etc.). Predictive statistics
estimate parameters from historical data and make predictions of future outcomes (multivariate
regression, generalized linear regression, etc.)

Learning: Unsupervised learning characterizes the data to establish classes without
pre-defined labels, e.g., k-means clustering. Supervised learning classifies and describes the
data using pre-defined ‘labels,’ e.g., decision trees

Bayesian: Used to augment classical analysis when there is prior knowledge about how the data
were generated
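To make the descriptive/predictive distinction concrete, the sketch below computes descriptive metrics for an invented monthly sales series, then estimates the parameters of a least-squares trend line from that historical data to predict the next value.

```python
import statistics

sales = [12.0, 15.0, 14.0, 18.0, 21.0, 20.0, 24.0, 27.0]  # invented monthly sales

# Descriptive: characterize the distribution of the observed values
mean = statistics.mean(sales)
stdev = statistics.stdev(sales)
value_range = max(sales) - min(sales)

# Predictive: estimate parameters (slope, intercept) from historical data
# via ordinary least squares, then predict the next month's value
months = list(range(len(sales)))
mean_x = statistics.mean(months)
slope = (sum((x - mean_x) * (y - mean) for x, y in zip(months, sales))
         / sum((x - mean_x) ** 2 for x in months))
intercept = mean - slope * mean_x
forecast = slope * len(sales) + intercept
print(f"mean={mean:.2f} stdev={stdev:.2f} range={value_range} forecast={forecast:.2f}")
```

The descriptive metrics summarize what has already happened; only the fitted model makes a claim about a value not yet observed.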
Typical Challenges and Pitfalls in an Analytics Project
1. Poorly defined problem
• Unclear problem-solving goal
• Unclear scope, e.g., how many SKUs to analyze
• Mixed objectives, e.g., an economic analysis of a product-category promotion that conflates
retailer and CPG perspectives
2. Limited IT resources
• Cloud data can’t be acquired offline within a reasonable time
• The complete model can’t be run due to computational limits
• Too slow to generate results in real time
• Can’t share ...
The purpose of this presentation is to highlight what end-to-end machine learning looks like in a real-world enterprise. It is meant to provide insight to aspiring data scientists whose ML courses or education have mostly focused on ML algorithms rather than the end-to-end pipeline.
The architecture and components mentioned in Slide 11 will be discussed in detail in a series of LinkedIn posts over the course of the next few months.
To get updates, follow me on LinkedIn or search/follow the hashtag #end2endDS. Posts will begin in August 2019 and continue through September 2019.
Data Analytics Presentation - Management Career Institute
1. The basic definition of Data, Analytics, and Data Analytics
2. Definitions: Data: Data is a set of values of qualitative or quantitative variables; it is information in raw or unorganized form. It may consist of facts, figures, characters, symbols, etc.
Analytics: Analytics is the discovery, interpretation, and communication of meaningful patterns in data, and the application of those patterns to effective decision making.
Data Analytics: Data analytics refers to the qualitative and quantitative techniques and processes used to enhance productivity and business gain.
3. Types of analytics: Predictive Analytics (What could happen?)
Prescriptive Analytics (What should we do?)
Descriptive Analytics (What has happened?)
4. Why data analytics? Data analytics is needed in Business-to-Consumer (B2C) applications
5. The process of data analytics: Data requirements,
Data collection, Data processing, Data cleaning, Exploratory data analysis,
Modeling and algorithms, Data product, Communication
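One way to picture the process above is as a pipeline of small functions, one per stage. The stage functions and toy data here are purely illustrative; a real project would use tools such as pandas for each step.

```python
# Each stage is a small function; chaining them mirrors the analytics process
raw = ["12", "15", None, "14", "1000", "18"]  # invented collected values

def clean(values):
    """Data cleaning: drop missing entries and convert to numbers."""
    return [float(v) for v in values if v is not None]

def explore(values):
    """Exploratory analysis: drop obvious outliers (here, > 3x the median)."""
    median = sorted(values)[len(values) // 2]
    return [v for v in values if v <= 3 * median]

def model(values):
    """Modeling: the simplest possible 'model', a mean forecast."""
    return sum(values) / len(values)

def communicate(result):
    """Communication: format the finding for stakeholders."""
    return f"Expected value: {result:.2f}"

print(communicate(model(explore(clean(raw)))))
```

Each stage's output is the next stage's input, which is exactly why problems in an early stage (collection, cleaning) propagate into the model and the final communication.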
6. The scope of data analytics: Data analytics has a bright future; many professionals and students are interested in it as a career.
7. Importance of data analytics:
1. Predict customer trends and behaviors
2. Analyze, interpret, and deliver data in meaningful ways
3. Increase business productivity
4. Drive effective decision-making
8. Why become a data analyst? A shortage of skilled candidates, good salaries for freshers, and a strong growth path
9. What recruiters look for in applicants: Problem-Solving Skills, Analytical Mind, Maths and Statistics Skills, Communication (both oral and written), Teamwork Abilities
10. Skills required for data analytics:
1.) Analytical Skills
2.) Numeracy Skills
3.) Technical and Computer Skills
4.) Attention to Details
5.) Business Skills
6.) Communication Skills
11. Data analytics tools
1. SAS: SAS (Statistical Analysis System) is a software suite developed by SAS Institute. The SAS language is a programming language used primarily for statistical analysis. It can read data from databases and common spreadsheets.
2. R: R is a programming language and software environment for statistical analysis, graphics representation, and reporting. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems such as Linux, Windows, and Mac.
3. PYTHON: Python is a popular, powerful, flexible, open-source language that is easy to use and has powerful libraries for data manipulation and analysis.
4. TABLEAU: Tableau Software is a software company that produces interactive data visualization products focused on business intelligence.
Practical Machine Learning: discerning differences and selecting the best approach
TABLE OF CONTENTS
Executive summary
Introduction
Concepts
Process and Practicalities
Accessible to Data Scientists & Business Users
Accessible to Developers & BI/DW Professionals
Key Takeaways
References and Resources
Table of Abbreviations
About Lynn Langit
About Mark Tabladillo
Executive summary
A formal definition of Machine Learning: the ability of computing systems to gain knowledge
from experience. Practical ML enables your organization to answer business questions more
effectively because of that experience. Machine Learning solutions consist of models that
combine your input data with statistical and data mining algorithms.
Until relatively recently, applied ML (as contrasted with ML for research) was simply too
specialized, difficult and expensive to achieve broad adoption outside of the academic
community and a few commercial domains (finance, ad serving). However, improvements in
languages and libraries, as well as new commercial offerings (including cloud-only products),
have greatly increased the practicality of implementing ML applications. Demand has also been
fueled by Big Data: more data encourages more powerful methods of processing to gain
understanding from that data.
This report will discuss technologies and implementation approaches for creating enterprise
data solutions that include one or more machine learning components. The report will also detail
the tradeoffs of each solution and determine which approach best fits organizational needs.
Introduction
The term ‘Predictive Analytics’ is used somewhat interchangeably with Machine Learning. The
central idea is that Machine Learning enables the creation of important business insights based
on analyzing some set of input data with one or more data mining or statistical algorithms.
Where Machine Learning is used
In some sectors, particularly academic research, statistical analysis and data mining have been
standard analytical techniques for years. These sectors tend to use open source languages, tools
and libraries. Academics commonly use specialty languages such as R, or Python libraries
(SciPy/NumPy/Pandas), rather than enterprise languages such as Java, for their ML research
projects. Researchers also tend to work with wide (many attributes) and shallow (relatively
small sample size) datasets. This academic dataset size is significant because many of the commonly
used tools, such as RStudio or even Weka, are designed for small (albeit rich) datasets and are
limited to working with datasets that fit in the memory of an analyst’s desktop computer,
rather than requiring server or even cloud-scale processing power.
In a few commercial sectors, such as financial (for example with credit scoring) and security (for
example for email spam detection), use of ML (via data mining) is not a new approach. In these
areas, highly specialized tools and specially trained professionals have supported these types of
solutions. These vertical-specific ML solution development cycles run to hundreds of
thousands or even millions of dollars to implement. These costs include software licenses,
powerful hardware, proprietary development and management tools, and consulting fees. These
types of projects have also commonly taken months or even years to implement.
However, the ML market landscape is rapidly changing with the availability of Big Data/cloud
storage, processing and data pipelines. These new services enable faster and cheaper data
collection, storage and processing. Also the growth of IoT (mostly sensor) data is increasing the
volumes of available data for analysis. These market changes are making the overall ‘entry point’
for ML projects less risky, i.e. cheaper and faster. Another driver of adoption is the effort
that commercial vendors are putting into creating usable ML tooling, most of which runs on that
particular vendor’s cloud infrastructure (such as IBM Watson on Bluemix, Microsoft Azure ML
on Azure, or Amazon ML). ML projects are increasingly seen as a realistic possibility given the
larger market landscape. Simply put, more data means a need for more powerful methods of
deriving meaning from increasingly large and complex datasets. Enter the
democratization of Machine Learning.
Challenges to Adoption
Although tools are reducing the complexity of applying the power of statistical and data mining
techniques to increasingly larger data sets, the enterprise market is in the early stages of ML
adoption. One of the key blockers is complexity -- creating useful predictive analytics or ML
differs substantially from the more traditional business analytics.
Because the application (and demand) for technical professionals skilled in applied statistics and
data mining had traditionally been a small market, we are faced with a lack of trained, working
professionals who can produce useful results in this area. Specifically, we lack people with
experience performing the tasks in the enterprise ML solution lifecycle – such as cleaning and
grooming the input data, selecting appropriate techniques and algorithms, building and
evaluating models, and moving the results of their work to production.
Vendors are stepping in to close this gap. Several major commercial vendors have launched
general-purpose machine learning suites this year. As mentioned, the majority of these new
offerings are cloud-based. Some solutions offer the ability to train, test and deploy either in
the cloud or on premises, while other solutions are cloud-only, such as BigML.
Concepts
Taxonomies and terms for Machine Learning solutions have important and nuanced differences
in meaning; a proper understanding of them is key to differentiating the products and solutions
available in the ML space. To begin, we’ll provide definitions of associated technologies.
What is the difference between business analytics and predictive analytics?
Business Analytics is defined as finding answers to business questions by querying data and
producing a definite result or result set. For example: “What are the top five items that are
found in the shopping basket of a 38 year old man from California who is shopping on a
Saturday at 5pm at a major grocery chain?” The answer to this question (via a query to source
data) is a deterministic result set, usually shown as a report or a dashboard. Stated
differently, business analytics are used to analyze “what has happened” for past events.
Predictive Analytics is defined as finding answers to business questions by applying one or
more probabilistic algorithms to some set of input data and producing one or more
probabilistic results. For example: “Consider the items which appear together in the
shopping baskets of all 38 year old men from California who are shopping on a Saturday at
5pm at any of the major grocery chain stores for which we have data and predict how many of
a given item from this set the stores should have on hand to ensure proper supply for this type
of customer.” In this case, the type of algorithm is regression, because it is used to predict a
future value or set of values. To get a result, one or more regression algorithms are applied to
the source data – for example, linear regression. Because the results are probabilistic, i.e. a
percentage or likelihood score for a result, it is common to apply more than one algorithm and
then to evaluate the quality of each result. This process is called ‘evaluating the model.’ The
best model result is selected and is presented either via statistical output (probability) or
via a customized visualization. Stated differently, predictive analytics are used to analyze
“what will happen” for potential or future events. The graphic below illustrates and
contrasts sample results in business and predictive analytics.
Figure 1 - Two Types of Analytics
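A minimal sketch of the contrast, using invented basket data: the business-analytics question is answered by a deterministic query, while the predictive question yields a probability. Here a simple empirical frequency stands in for a real regression model.

```python
from collections import Counter

# Invented basket data for the customer segment in the example
baskets = [
    ["milk", "bread", "beer"], ["bread", "beer", "chips"],
    ["milk", "beer"], ["bread", "chips", "salsa"], ["beer", "chips"],
]

# Business analytics: a deterministic query over past events ('what HAS happened?')
top_items = Counter(item for b in baskets for item in b).most_common(2)
print(top_items)

# Predictive analytics: a probabilistic estimate ('what WILL happen?');
# the empirical probability that a future basket contains beer stands in
# for a trained regression model
p_beer = sum("beer" in b for b in baskets) / len(baskets)
print(f"P(basket contains beer) = {p_beer:.2f}")
```

Running the query twice always returns the same answer; the probabilistic estimate, by contrast, is only a score of likelihood and would normally be compared across several candidate models.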
What is the difference between data mining and predictive analytics?
Data Mining encompasses a broader set of tasks than those included in predictive analytics. In
addition to regression algorithms, data mining also includes other types of predictive analysis.
Specifically, finding groupings in the source data by matching new data to existing labeled (or
categorized) data is called classification. Classification algorithms are characterized as
‘supervised’ because an authoritative set of data is used, in addition to an algorithm, to
process the input data. For example: “In a set of data
there are examples of pictures or drawings of objects that we’ve identified and labeled as
particular animals – i.e. ‘this is a picture of a dog and that is a picture of a cat.’ “ A classification
task is to evaluate the likelihood of a new picture being a dog or a cat based on pattern matching
to the set of known states. An example of a classification algorithm is decision trees. Of note is
that regression is also ‘supervised’ because a data set with ‘known values’ is used in conjunction
with the application of the regression algorithm when evaluating the probability of a result using
new input data.
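The supervised idea can be sketched with a deliberately simple learner. For brevity this uses a 1-nearest-neighbor rule rather than a decision tree, and the numeric features and labels are invented stand-ins for the cat/dog pictures; the point is the same: pre-labeled examples drive the prediction for new data.

```python
import math

# Labeled training data: (feature vector, label), standing in for the
# 'pictures we've identified as cats or dogs' example
training = [
    ((8.0, 30.0), "dog"), ((9.0, 35.0), "dog"), ((7.5, 28.0), "dog"),
    ((3.0, 4.0), "cat"), ((3.5, 5.0), "cat"), ((2.8, 4.5), "cat"),
]

def classify(point):
    """1-nearest-neighbor: assign the label of the closest known example."""
    _, label = min(
        ((math.dist(point, features), label) for features, label in training),
        key=lambda pair: pair[0],
    )
    return label

print(classify((8.5, 32.0)))  # near the dog examples
print(classify((3.2, 4.8)))   # near the cat examples
```

Because the authoritative labeled set does all the work, the quality of the predictions is bounded by the quality and coverage of the labeled examples.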
Discovering natural groupings in source data, for which there are no known states or labels is
called clustering. Since there are no known states when clustering algorithms are used, this
type of machine learning is called ‘unsupervised’. An example of this technique is ‘here are some
pictures, group them into subsets based on characteristics (or labels) that are discovered
during the process of running the algorithm.’ As with the other types of ML, when
implementing clustering it is common to use multiple clustering algorithms, such as k-means,
then to evaluate the model results and finally to select the top performing algorithm and model
for the particular business problem.
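The unsupervised case can be sketched with a tiny one-dimensional k-means, assuming invented unlabeled data; no labels are supplied, and the two groupings are discovered by the algorithm itself.

```python
def kmeans_1d(values, k, iterations=20):
    """Minimal k-means for one-dimensional data: discover k natural groupings."""
    # Spread the initial centroids across the sorted data
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iterations):
        # Assign each value to its nearest centroid
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its assigned values
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two natural groupings (around 2 and around 10) with no labels supplied
data = [1.8, 2.0, 2.2, 1.9, 9.8, 10.0, 10.2, 10.1]
print(kmeans_1d(data, k=2))
```

As the text notes, in practice one would run several clustering algorithms (or several values of k), evaluate the resulting models, and keep the best performer for the business problem.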
What is the difference between predictive analytics and machine learning?
Machine Learning is evolving to support the increasing volumes, varieties and velocities of Big
Data projects, rather than the smaller, simpler datasets that typified data mining projects,
particularly in academia. Another way to understand ML is as the next generation of data
mining. Machine learning is a superset of predictive analytics because it involves more than the
application of one or more predictive analytic techniques (and associated algorithms) to sets of
input data. Another consideration is the current push toward commercial ‘productization’ of
machine learning applications. Although data mining and statistical analysis have been widely
used in particular domains, their broadest application, academic research, is implemented
quite differently than commercial applications are.
Specifically there are many steps in data preparation for predictive analytics (or ML) projects
that are different from data preparation common for business analytics projects. Steps to
prepare input data for predictive analytics include such tasks as the following:
• Evaluating data types and detecting or creating labels (for classification)
• Evaluating number / ratio of null values
• Evaluating quality/usefulness of input data based on statistical analysis (mean, mode,
etc.)
• Removing outlier values (exceptions)
• Creating groupings (called ‘bucketing’)
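The preparation tasks above can be sketched in a few lines of plain Python, using an invented column that contains nulls and an outlier; real pipelines would typically use pandas for the same steps.

```python
import statistics

# Invented raw input column with missing values and one outlier
ages = [34, None, 29, 41, None, 38, 999, 45, 31]

# Evaluate the ratio of null values; too many may disqualify the column
null_ratio = sum(1 for a in ages if a is None) / len(ages)

# Remove nulls, then remove outliers beyond 2 standard deviations of the mean
present = [a for a in ages if a is not None]
mean, stdev = statistics.mean(present), statistics.stdev(present)
cleaned = [a for a in present if abs(a - mean) <= 2 * stdev]

# 'Bucketing': group the continuous values into decade-wide bins
buckets = {}
for a in cleaned:
    buckets.setdefault(f"{(a // 10) * 10}s", []).append(a)

print(round(null_ratio, 2), sorted(buckets))
```

The commercial data visualizers mentioned below automate exactly these checks, but the underlying operations are this simple.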
Commercial tools provide data visualizers, which assist with data quality assessment at this stage
and facilitate easy modification of the input data. After the data preparation tasks have
been completed, there is a 3-step process to implement a machine learning solution or model. It
is quite common for the modeling process to be iterative (because the outputs are probabilistic)
during the model creation phase. Iterations often include returning to the data preparation
phase, because adjusting the quality of the input data affects the outputs. The need for iteration
over increasingly large data sets marries nicely with the scalability of cloud-based ML solutions.
These steps include the following:
• Input Data
o Ingest – in this step you ingest source data; common ingest methods are file-
based and database-based, and accepting streaming input is increasingly a requirement.
o Evaluate & Clean – in this step you review the input data (often using
statistical analysis) and tune that data so that it is prepared for inclusion in one or
more ML models
• Model
o Select ML Algorithm and Initialize Model(s) – in this step you match the
business question and input data to an ML technique (regression, classification or
clustering) and one or more algorithms from within that technique (such as linear
regression, decision trees or k-means clustering) to evaluate the possibility of
building a useful model with this information
o Train Model(s) – in this step you create the model and load it with data; you
then process the model and view the output
o Score Model(s) – in this step you evaluate the effectiveness of model results against
the ‘random guess’ line to understand the potential use of the model(s) for future
prediction, classification and clustering tasks
• Predict
o Perform Prediction – in this step you evaluate new data against the model in
order to predict the likelihood of selected results.
These steps are often performed iteratively, as model scoring differentiates between
multiple models. You may decide to repeat some or all of the cycle with slightly different
input data, different algorithms, different algorithm parameters, etc. in order to produce one or
more ‘useful’ models. Wizards and visualization tools found in ML products speed up these
iterative cycles.
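The train / score / predict loop can be made concrete with a minimal pure-Python sketch. The data points and the ordinary-least-squares fit are illustrative assumptions, not any particular product's workflow:

```python
# Hypothetical ingested data: (feature, label) pairs, e.g. ad spend vs. sales.
data = [(1, 2.1), (2, 4.2), (3, 5.9), (4, 8.1), (5, 9.8)]

# Train: fit y = a*x + b by ordinary least squares.
n = len(data)
sx = sum(x for x, _ in data)
sy = sum(y for _, y in data)
sxx = sum(x * x for x, _ in data)
sxy = sum(x * y for x, y in data)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

# Score: compare the model's squared error to a 'random guess' baseline
# that always predicts the mean label.
mean_y = sy / n
model_err = sum((y - (a * x + b)) ** 2 for x, y in data)
baseline_err = sum((y - mean_y) ** 2 for x, y in data)

# Predict: evaluate new data against the trained model.
prediction = a * 6 + b
```

If the model's error were not clearly better than the baseline's, you would iterate: adjust the input data, the algorithm or its parameters, and re-score.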
Shown below is Shiny, an open source project from RStudio. Shiny is used by many R
developers because it allows them to quickly and easily visualize (and query) models they created
in the R programming language. Note the use of input parameters via slider bars and text boxes.
These controls allow the ML developer to ‘try out’ different values in evaluating the usefulness of
their model. Lightweight visualization tools for rapid iteration are particularly
valuable for ML scenarios.
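The kind of interactive parameter exploration Shiny's sliders provide can be mimicked in plain code by sweeping candidate values and scoring each one. The scores, labels and thresholds below are hypothetical:

```python
# Hypothetical model scores with their true labels (1 = positive class).
samples = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]

def accuracy(threshold):
    # Classify as positive when the score clears the threshold.
    correct = sum(1 for score, label in samples
                  if (score >= threshold) == bool(label))
    return correct / len(samples)

# 'Slider sweep': evaluate a range of cut-off values and keep the best.
results = {t / 10: accuracy(t / 10) for t in range(1, 10)}
best = max(results, key=results.get)
```

A visual tool does the same thing, but lets a human steer the sweep and see the effect of each value immediately.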
Figure 2 - Visualization of R results using Shiny
Is data science the same thing as machine learning?
Data science is a superset of machine learning: in addition to all of the tasks described in
the last section, data science also includes hypothesis formation, or more simply, ‘asking the
right question(s)’. Data science, as shown in the graphic, involves domain expertise, healthy
curiosity, scientific thinking, and an understanding of math, statistics, algorithms, data input sets and
visualization. Increasingly, a team of people in the enterprise is responsible for data science
projects, because the skill sets needed are simply not found in any one or two people. Also, these
teams benefit from using enterprise-grade tools, which facilitate communication and other
enterprise needs, such as security, source control and others.
Figure 3 - Skills needed for Data Science
What is Artificial Intelligence and how does it relate to machine learning?
An AI (Artificial Intelligence) solution contains one or more intelligent agents. AI intelligent
agents automate tasks that would normally require a highly trained person, such as speech
recognition and translation. An AI system is one that responds to complex problems in a
human-like way. A well-known recent AI success is the celebrated win of the IBM Watson AI
system against two top human players on the TV trivia game show Jeopardy.
In some ways, AI has more to do with process automation than learning, because AI systems
ingest vast amounts of source data and perform iterative ML processes, often over a period of
years. In practice, AI includes a number of ML components, so that the system and its processes
can be increasingly optimized, or can learn, over time. You can see commercial applications of AI
systems in domains as disparate as medical diagnostics, self-driving cars, face and speech
recognition and bank fraud detection.
What is Deep Learning and how does it relate to machine learning?
Deep Learning is a relatively new aspect of machine learning. It is a set of ML algorithms that
attempt to model high-level abstractions in data by using multiple non-linear transformations.
Deep Learning focuses on improving the efficiency of unsupervised or semi-supervised
feature learning algorithms. It is based on research in human neuroscience, such as human
neural coding. Its algorithms are deep neural networks, and its problem sets include computer vision,
natural language processing and speech recognition. Deep Learning has also been called the new
definition of the ‘neural networks’ data-mining algorithm.
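The ‘multiple non-linear transformations’ idea can be sketched in a few lines: each layer applies a weighted sum followed by a non-linearity, and layers are stacked. The weights here are fixed, made-up values; real networks learn them from data:

```python
import math

def layer(inputs, weights, bias):
    # One linear combination followed by a tanh non-linearity.
    return math.tanh(sum(i * w for i, w in zip(inputs, weights)) + bias)

x = [0.5, -0.2]                 # input features
h = layer(x, [1.0, 2.0], 0.1)   # first non-linear transformation
out = layer([h], [1.5], -0.3)   # second transformation stacked on the first
```

A ‘deep’ network simply stacks many such transformations, which is what makes GPU acceleration so valuable for training.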
Advances in hardware, particularly in GPU computational capabilities, have facilitated the use
of Deep Learning by shrinking model-processing times from weeks or days to a
more practical level, i.e. minutes. However, given the computational intensity, it is still the case
that processing-time requirements limit the widespread application of Deep
Learning algorithms.
Deep Learning is also called ‘strong AI’ because of its potential to disrupt a large number of
processes. Major software companies are investing millions of dollars in research to improve
the usability of Deep Learning in their own core products (such as the voice recognition
systems Google Now, Microsoft Cortana and Apple Siri). Although the
potential of Deep Learning is exciting, the reality is that, due to time, cost, complexity and the
skills needed, broad application of its results is still limited to experimental and (mostly) research
projects at a small subset of companies, such as Google, IBM and Microsoft.
What is the importance of real-time analytics?
Broader adoption of technologies such as in-memory databases and streaming Hadoop
frameworks (Spark Streaming, Storm and Samza), along with new types of data providers, e.g. IoT
input devices, is increasing the demand for real-time analytics as a category. In addition, the creation of
cloud-based data pipeline libraries and products enables the creation of more complex conduits
for incoming data, including routing through multiple processing pipelines.
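A defining constraint of real-time analytics is that each event must be processed as it arrives, without storing the full history. A minimal sketch, with hypothetical sensor readings:

```python
# Update a running mean incrementally, one event at a time, as a
# streaming pipeline would; no full history is retained.
count, mean = 0, 0.0
for reading in [10.0, 12.0, 11.0, 14.0]:  # hypothetical IoT readings
    count += 1
    mean += (reading - mean) / count      # incremental (Welford-style) update
```

Streaming frameworks generalize this pattern to windowed aggregates and model scoring over unbounded event flows.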
Along with these advances in real-time Big Data technologies comes demand for
products that enable rapid creation of solutions that also include real-time predictive
analytics. Major software vendors are creating consumer products and services, such as adaptive
voice input (Google Now, Microsoft Cortana and Apple Siri), that use real-time predictive
analytics. These types of applications are igniting consumer imagination and fueling demand in
general.
Process and Practicalities
Let’s take a deeper look at the processes involved in creating commercial machine learning
solutions. We do so because, as mentioned, the process for creating useful commercial
predictive analytics is quite different from that of creating business analytics. Digging into the
detailed processes involved will help in our understanding of the usability of the libraries, tools
and products currently available.
Business data projects are driven by the need to gain more or better business insights. Given
that, what types of use cases can machine learning solutions address? Remembering
the core functionality of ML, i.e. predicting one or more discrete, future values, classifying or
labeling new data into known groups and/or detecting natural groups in new data, here is a short
list of common use cases:
• Facilities & Manufacturing -- Smart Buildings, Predictive Maintenance
• Sales & Marketing -- Demand Forecasting, Churn Analysis, Target Advertising
• Biomedical -- Life Science Research, Healthcare Outcomes (patient re-admission rates)
• Security -- Fraud Detection, Network Intrusion Detection
• Logistics -- Routing
As mentioned, the steps involved in creating an end-to-end machine learning solution include a
number of considerations. Before the advent of cloud-based data storage, pipelines and machine
learning model tooling, the costs involved in creating what were then called data mining solutions
blocked many enterprises. These costs included high hardware and software license fees (often
well over $100k; spending up to $1 million simply to start what was often a multi-year project was not
unheard of). Additionally, the costs of re-training staff or hiring specialty consultants to
implement the data mining projects added to project costs and complexity. Prior to cloud-
based data storage and cloud-based data pipeline products, the costs associated with unearthing
enterprise data from the various (and often proprietary) on-premise data silos added further adoption
blockers. Yet another blocker to implementing traditional data mining was that the domains of the
business analyst (or, in some cases, statistician) were wholly separated from those of the developers who
would be charged with creating application interfaces for the results of the data mining work
produced by the business analysts.
Cloud storage, combined with new types of Big Data storage, has driven overall enterprise data
volumes up dramatically. Increasingly large and complex data sets are becoming progressively
more difficult for the enterprise to analyze in a meaningful way. Driven by particular sectors,
such as the ML analysis of the massive amounts of behavioral data collected in social gaming (Angry
Birds, Halo, etc.), the enterprise appetite for getting started with ML projects has increased
sharply over the last 12 months.
Although the landscape is improving due to the release of improved open source libraries and tools
as well as new commercial tools, for most enterprises ML projects are a new type of analytics.
Given that, for traditional enterprises the newly released set of cloud-based ML tools and
services, such as Azure ML, IBM Watson, Predixion Software, AWS ML, BigML and others, is a
welcome complement to the existing (mostly open source) languages, libraries and tools.
Another new item in the emerging ecosystem of enterprise tools and products designed to
support enterprise ML projects is the emergence of commercial data markets. IBM, Microsoft
and Predixion Software all include the ability to directly ‘publish’ the results of one or more
useful ML experiments into their cloud-based repository or marketplace. Technically, most
enable the ML experiment to be published as a REST-based web service endpoint.
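To illustrate, a client of such a published endpoint typically POSTs a JSON document of input rows and receives scored results back. The endpoint URL and field names below are purely hypothetical; each vendor defines its own request schema:

```python
import json

# Placeholder endpoint; a real published service supplies its own URL and API key.
endpoint = "https://example.com/ml/score"

# Hypothetical request body: one row of input features to be scored.
payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["age", "income"],
            "Values": [[34, 52000]],
        }
    }
}
body = json.dumps(payload)  # this JSON document is what would be POSTed
```

Because the contract is plain HTTP plus JSON, any application stack can consume a published model without ML-specific tooling.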
Interestingly, cloud vendors are leveraging integration with their own cloud services. For
example, Amazon ML includes the ability to enable real-time ML via a single button click, as shown
in the screenshot below. This real-time capability is integrated with AWS S3 storage; at this
time, AWS ML integrates with S3, RDS or Redshift.
Figure 4 - Amazon ML Model Usage Options
This functionality not only facilitates quick and easy deployment of commercial ML services to
production, but also has the interesting implication of providing the enterprise with a commercial
platform from which it can monetize the results of its ML experiments by making those
results available as a commercial offering.
Shown below is a chart that lists many of the major offerings – either commercial or open source.
| Phase | Azure | AWS | Google | Commercial | Open Source |
|---|---|---|---|---|---|
| Ingest | Stream Insight | Kinesis | Big Query | Data Torrent | Flume |
| Pipeline | Data Pipeline | Data Pipeline | Data Pipeline | Data Torrent | Kafka |
| Storage | BLOB, DocumentDB, SQL Azure, HDInsight | S3, DynamoDB, RDS (SQL), Redshift, EMR | BLOB, H/R Datastore, MySQL, Hadoop on GCE | SAS | NoSQL, Hadoop |
| Create Predictive Models | Azure ML, Revolution Analytics for R Language | AWS ML | Prediction API | SAS, IBM Watson, Predixion Software, BigML, Matlab, Mathematica, PredictionIO… | R, Mahout, Python Pandas, Weka |
| Predictive Results Publication and/or Visualization | Excel, Power BI Gateway, PowerView, Azure Data Market | AWS Lambdas, Partners | Google Charts | BigML, Dato, Predixion Marketplace, Tableau, Wolfram Language | D3 |
In some verticals, such as biomedical, it is common to have some form of academic data mining
or statistics work (data sets and/or data mining models) to use as a basis for creating
commercial machine learning solutions. One example is turning that academic
research into commercial biomedical products. Given that, we’ll list data mining languages,
libraries and tools that are commonly used in academic research. It has also been the case
that traditional statistical tools and languages, e.g. Matlab and Mathematica, have high adoption
in the research sector.
ML academic languages, tools and libraries: some are open source, and most have free
versions for academic research. Shown below is a chart that summarizes many of these items.
We have included the Communities category because academic data science communities are at
the front edge of work on improving open source tools and libraries, and they bear watching when you
are assessing the state of ML tools and products.
| Category | Objects | Notes |
|---|---|---|
| Languages | R Language | Stats Language |
| | SciPy/NumPy/Pandas | Python Libraries for ML |
| | Matlab | Stats Language |
| | Mathematica | Stats Language |
| | Julia | Scalable Stats Language |
| | Mahout | ML for Hadoop |
| | Weka | Research Stats Language |
| Tools | R Studio | IDE for R |
| | Shiny for R | Visualization for R |
| | Weka Studio | IDE for Weka |
| | PyCharm | IDE for Python |
| | Sublime | IDE for Python and more |
| Communities | KDNuggets | Website |
| | Kaggle | Competition |
| | DataKind | Community |
| | Open Gov/Open Data | Community |
| | Code for America | Community |
Accessible to Data Scientists & Business Users
A key question around the practicality of ML solutions for the enterprise is this: who exactly will
develop the ML solutions in the enterprise? Given the diverse set of skills needed to successfully
implement any type of data science solution, much less the smaller, even more complex subset
around ML, the first part of the answer is the most critical: ML projects are best implemented by
a team of skilled professionals. Our answer to the common question “Do I just need
to hire a statistician to implement an ML project?” is an unqualified “No!” Commercial ML
differs substantially from ML for academic research. While the image of the lone scientist,
toiling away in his or her lab and carefully analyzing the results via complex statistical calculations,
is the heritage of ML, this image bears little relationship to the practicalities of implementing
ML in the enterprise.
While there is definitely a place for a dedicated statistician on an enterprise ML team, this is no
longer a requirement for all ML projects. That being said, ML tools complement (but do not
substitute for) statistical and data mining domain expertise. What has changed with the advent
of these tools is the ability for your key team members to work with others (business analysts,
decision makers, developers, DevOps, etc.), because the tools use common interfaces and well-
designed dataflow visualizations. Also, most tools are cloud-based, which means zero installation
and configuration and quick environment start-up time. Additionally, commercial tools are designed
to scale storage and processing via cloud capacity, enabling faster movement from small-dataset
experiments to full-scale production deployments. Cloud-based tools are particularly well
suited for building quick proof-of-concept projects for the enterprise.
Given the democratization of tooling, you may be wondering whether this new tooling is
sophisticated enough for classically trained data scientists and academics to make full
use of their complete skill sets. The answer is a conditional yes: some, but not all, commercial
products, such as Azure ML, integrate with commonly used statistical languages (the R
language and Python libraries) and allow re-use of scripts created in these languages.
Additionally, it’s important for researchers to have visibility into algorithms and algorithm
parameters; this matters for reproducibility of published experiment results. Shown below
is an Azure ML model that uses a two-class support vector machine to perform
classification (of Tweets, in this sample). Also of note is the ability to use R language scripts in an
ML workflow:
Figure 5 - Azure ML Experiment
Model evaluation is a key component of an ML experiment. Here is sample output from the Azure
ML model evaluation visualization. You’ll note that both score information (table) and graphical
output are included in the visualization:
Figure 6 - Azure ML Model Evaluation Output
For comparison, shown below is output from a sample Amazon ML model evaluation:
Figure 7 - Amazon ML Model Evaluation Visualization
Accessible to Developers & BI/DW Professionals
An interesting and somewhat unexpected aspect of enterprise ML projects is that having one or
more Big Data repositories is in no way a requirement for undertaking this type of project.
Due to the origins of ML, i.e. academic research using statistics and data mining, some of the
most useful ML projects are, in fact, based on applying these techniques to LOB data. You
can think of it as being able to ask different kinds of questions of your current data.
Understanding when to use ML (and when not to) relates directly to the definitions of business
and predictive analytics. Simply put, use ML when you want to ask business questions that will
result in probabilistic answers.
The ability to ask predictive questions of LOB data often yields useful results. For example, it
has been quite common to begin ML projects in sales and marketing departments, using CRM
data as the source for ML experiments that answer business questions like ‘What are the
characteristics of the customers who produce the most revenue?’ (clustering) and ‘What type of
cross-sell opportunities can we introduce on our website based on known customer purchase
patterns?’ (classification).
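As a sketch of the clustering case, here is a minimal one-dimensional k-means (k=2) that groups hypothetical per-customer revenue into high-value and low-value segments; a real project would use a library implementation over many customer attributes:

```python
# Hypothetical annual revenue per customer (in $ thousands).
revenue = [1.0, 1.2, 0.9, 10.0, 11.0, 9.5]

# Naive k-means, k=2: start centers at the extremes, then alternate
# between assigning points to the nearest center and recomputing centers.
centers = [min(revenue), max(revenue)]
for _ in range(10):  # fixed iteration budget; this toy case converges quickly
    groups = [[], []]
    for r in revenue:
        nearest = 0 if abs(r - centers[0]) <= abs(r - centers[1]) else 1
        groups[nearest].append(r)
    centers = [sum(g) / len(g) for g in groups]  # both groups stay non-empty here
```

The ‘natural groups’ the algorithm detects are exactly the customer segments the business question asks about.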
Another common ‘entry point’ for ML solutions in the enterprise is IT (log) data.
Regulatory (access auditing) and compliance requirements, as well as general security concerns,
drive ML experiments such as ‘At what day/time can I expect network bandwidth usage to
spike to a particular level (value) for a particular segment of my corporate users?’ (regression).
In general, the enterprise can find value in appropriately applying predictive analytics via ML
solutions to a broad spectrum of domains. In addition to sales and marketing or DevOps,
enterprises can apply ML to other scenarios for which probabilistic analysis would yield useful
results. For example, questions such as these can now be addressed:
• Which employee attributes correlate most closely with the highest revenue
production of that employee’s team?
• At what future point in time do our customers in a certain segment (i.e.
demographic, geographic…) tend to make a subsequent purchase?
• What groups of our public resources (website, GitHub, YouTube…), such as trial or free
items, tend to be used by browsers who become our customers?
As mentioned, the integrated tooling provided by commercial vendors enables simpler deployment
and embedding of ML model results into enterprise applications via ‘publish as a web
service’ functionality. Given that relatively few enterprise application developers have familiarity
with, much less expertise in, ML languages, tools and libraries, using commercial ML tools that include
‘click to publish’ functionality significantly speeds up time to market.
Another advantage of using commercial ML tools in the enterprise is their built-in connectors to
disparate incoming data sources. Given that it is increasingly common to use a broad variety of
data sources as ML ingest sources, the availability of pre-built connectors once again speeds
development cycles. It is common to include connectors for LOB data, i.e. RDBMS systems (both
on-premise and cloud-based), as well as for some of the newer NoSQL databases, Hadoop and
one or more types of incoming data stream.
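The connector idea can be sketched with two standard-library stand-ins: a file-based (CSV) source and a database-based source, with SQLite standing in for an RDBMS. The table and column names are made up:

```python
import csv
import io
import sqlite3

# File-based ingest: parse CSV text into rows of dicts.
csv_text = "customer,revenue\na,100\nb,200\n"
file_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Database-based ingest: query an RDBMS-style source (in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("a", 100), ("b", 200)])
db_rows = conn.execute(
    "SELECT customer, revenue FROM sales ORDER BY customer").fetchall()
```

Pre-built connectors hide exactly this per-source boilerplate behind a uniform ingest interface.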
Also useful are the quick statistical snapshots that most commercial ML tools provide of the datasets
in your ML project. For example, the AWS ML dataset console view includes the visualization
shown below:
Figure 8 - AWS ML Datasources Attribute Information
The AWS viewer not only allows the ML team to ‘see’ the attribute names, correlations,
uniqueness of data, and most/least frequent categories; it also includes an inline
‘Preview’ visualization of the uniqueness of the data.
As mentioned, integrated commercial ML tooling that includes ‘one-click’ deploy capabilities
increases usability for developers and BI professionals. Additionally, capabilities that
essentially advertise published ML web services, such as Microsoft Azure Data Market, provide
additional discoverability and usability, and commerce opportunities for published services are
also emerging. An example is shown below.
Figure 9 - Azure Machine Learning Test Harness
Visualization of results is another element of ML solution usability. To that end, we’ve included
a sample from IBM Watson Analytics. This service includes flexible visualizations at all phases
of the ML process (i.e. data discovery, modeling, etc.); an example is shown below.
Figure 10 - IBM Watson ML Visualization
Our last example of model visualization is from the commercial cloud-based vendor BigML and
is shown below. Also interesting is how vendors such as BigML enable community by providing
a platform for their users to get more value from their ML models. You’ll note that BigML allows
users to upload, share, rate and also sell models for use by others in their own ML scenarios.
Figure 11 - BigML Model Visualization
Key Takeaways
Incorporating the results of machine learning experiments into production data solutions adds
significant complexity to the overall projects. Given this, a solid understanding of technology
choices around machine learning solutions is essential for designing and delivering solutions
that provide business value to the organization.
• Use commercial machine learning products when team members new to
machine learning processes are creating your solution. Due to fundamental
differences at every stage in the data pipeline, i.e. data preparation, hypothesis formation,
algorithm selection, model training and evaluation, ML projects introduce a set of
complex processes into the enterprise. If your data paradigm consists of an OLTP store
alone, you would be best served by leveraging commercial ML development suites rather
than attempting to cobble together solutions from tools and libraries that were built
primarily for statisticians.
• Select tools or coding libraries that perform at the data ingest and
processing speed and scale that your business problems and chosen
machine learning methods require. Enterprises will benefit from leveraging cloud
storage and processing of Big Data workloads as sources for ML solutions, because their data
volumes are generally significantly larger than those of academic research. Also, in-
memory streams are increasingly relevant, particularly with the advent of more and more
IoT scenarios.
• Teams that have already implemented pure open source data solutions are
most capable of adding pure open source machine learning solutions.
Domains where data mining and/or statistics may have already been in use, such as
academic research, will have more success using open source tools and libraries, so long as
their input data does not overrun the capabilities of those tools.
• Plan for and test your model deployment topology to ensure ML experiments
deliver production business value. Commercial vendors are incorporating one-click
deploy functionality in their ML studio environments, given the common challenges
around deployment of ML models; such functionality enables faster time to market for
production solutions. Also consider the vendor’s path to implementing streaming or near-
real-time ML solutions if that is part of your requirements.
• Select tools, or plan for coding, appropriate types of visualization solutions.
ML outputs are unfamiliar to many business users, and standard reports and
dashboards have not been designed to display ML results in a meaningful way. Selecting
ML vendors whose results integrate easily into other commercial solutions or common
libraries results in broader usability for ML solutions.
References and Resources
This section lists the references and resources referred to in this article.
Data Science graphic -- http://civicscience.com/data-science-a-visual-guide/
Shiny for R-Studio -- http://shiny.rstudio.com/gallery/movie-explorer.html
Deep Learning and the Hololens -- https://technoptimist.wordpress.com/2015/01/25/deep-learning-and-the-hololens
Collection of papers on how IBM Watson works -- http://www.andrew.cmu.edu/user/ooo/watson/
What is AI? -- http://www.techopedia.com/definition/190/artificial-intelligence-ai
How Google is Teaching Computers to See -- https://gigaom.com/2012/06/25/how-google-is-teaching-computers-to-see/
Need Deep Learning? Here are 4 Lessons from Google -- https://gigaom.com/2015/01/29/new-to-deep-learning-here-are-4-easy-lessons-from-google/
Getting started with AWS ML -- http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html
AzureML on Windows Azure DataMarket / Binary Classifier Sample -- https://datamarket.azure.com/dataset/aml_labs/log_regression
BigML Sample Model -- https://bigml.com/user/ashikiar/gallery/model/53b2f21ec8db635905000d33
Kaggle Community -- https://www.kaggle.com/
DataKind Community -- http://www.datakind.org/
Table of Abbreviations

| Abbreviation | Full Term |
|---|---|
| AI | Artificial Intelligence |
| AWS | Amazon Web Services |
| BI | Business Intelligence |
| CRM | Customer Relationship Management |
| DW | Data Warehouse |
| GPU | Graphics Processing Unit |
| IoT | Internet of Things |
| LOB | Line of Business |
| ML | Machine Learning |
| NoSQL | No SQL |
| OLAP | Online Analytical Processing |
| OLTP | Online Transactional Processing |
| POC | Proof of Concept |
| RDBMS | Relational Database Management System |
About Lynn Langit
Lynn Langit is a Big Data and Cloud Architect who has been working with database solutions for
more than 15 years. Over the past 4 years, she’s been working as an independent architect using
these technologies, mostly in the biotech, education, manufacturing and facilities verticals. Lynn
has done POCs and has helped teams build solutions on the AWS, Azure, Google and Rackspace
Clouds. She has done work with SQL Server, MySQL, AWS Redshift, AWS MapReduce, Cloudera
Hadoop, MongoDB, Neo4j, Aerospike and many other database systems. In addition to building
solutions, Lynn also partners with all major cloud vendors, providing early technical
feedback on their Big Data and Cloud offerings. She is an AWS Community Hero, Google
Developer Expert (Cloud), Microsoft MVP (SQL Server) and a MongoDB Master. Lynn is also a
Cloudera certified instructor (for MapReduce Programming).
Prior to re-entering the consulting world 3 years ago, Lynn spent over 10 years as a
Microsoft Certified instructor and a Microsoft vendor, and then 4 years as a Microsoft employee. She’s
published 3 books on SQL Server Business Intelligence and has most recently worked with the
SQL Azure team at Microsoft. She continues to write and screencast, and hosts a Big Data channel
on YouTube (http://www.youtube.com/SoCalDevGal) with over 150 different technical videos
on Cloud and BigData topics. Lynn is also a committer on several open source projects
(http://github.com/lynnlangit).
About Mark Tabladillo
Mark Tabladillo is a Senior Data Scientist at midtown Atlanta's Predictix/LogicBlox. He has used
and promoted Microsoft Azure Machine Learning, Microsoft SQL Server Data Mining, Microsoft
BI Stack, Power BI, SAS, SPSS, R, and Julia. He is a SQL Server MVP and has a research
doctorate (PhD) from Georgia Tech. He is chapter leader for the PASS Data Science Virtual Chapter,
which has periodic live meetings and its own YouTube channel.