The Future of Data Science
- 1. The Future of Data Science
SARITH DIVAKAR M | LBS COLLEGE OF ENGINEERING, KASARAGOD
www.sarithdivakar.info sarith@cusat.ac.in
- 3. Data Scientist
“A data scientist is someone who is better
at statistics than any software engineer
and better at software engineering than
any statistician”
- 4. An Interview with Lisa Qian, Airbnb
WHICH SKILLS OR PROGRAMMING LANGUAGES DO YOU
MOST FREQUENTLY USE IN YOUR WORK, AND WHY?
“At Airbnb, we all use Hive to query data and build derived
tables. I use R to do analysis and build models. I use Hive and R
every day of the job. A lot of data scientists use Python instead
of R – it’s just a matter of what we were familiar with when we
came in. There have also been recent efforts to use Spark to
build large-scale machine learning models.”
- 6. Data Scientist Salaries
Average Salary (2015): $118,709 per year
Minimum: $76,000
Maximum: $148,000
Median Salary (2015): $93,991 per year
Total Pay Range: $63,524 – $138,123
- 7. Data Scientist Qualifications
Master’s degree 80%
PhD 46%
Math and statistics 32%
Computer Science 19%
Engineering 16%
Reference: The Burtch Works Study, “http://www.burtchworks.com/big-data-analyst-salary/big-data-career-tips/”
- 8. Data Scientist Job Outlook
McKinsey reported that by 2018 the U.S. could face a
shortage of 140,000 to 190,000 “people with deep
analytic skills”
Reference: Report of McKinsey Global Institute, “http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation”
- 12. Past and Future of Data Science
Descriptive analytics
Describing what has already taken place
Predictive analytics and real-time
analytics in pursuit of business goals
Improving the customer experience
Improving products and services
Reducing costs
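As a minimal sketch of the distinction on this slide, the snippet below first summarizes a small series of monthly sales (descriptive) and then fits a straight line to estimate the next month (predictive). The sales figures and the helper name `fit_line` are illustrative assumptions, not data or code from the talk.

```python
# Hypothetical monthly sales figures (illustrative only).
sales = [10, 12, 14, 16]

# Descriptive analytics: summarize what has already happened.
average_sales = sum(sales) / len(sales)


def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Predictive analytics: use the fitted trend to estimate the next month.
slope, intercept = fit_line(sales)
next_month = slope * len(sales) + intercept

print(f"average so far: {average_sales}")
print(f"forecast for next month: {next_month}")
```

A real predictive model would be validated on held-out data; the straight-line fit here only shows the shift from describing the past to estimating the future.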
- 13. Where should data scientists prioritize their focus?
Amazon, Google and Netflix.
Python
Variety of tools, perspectives and approaches
Identify methods and models most appropriate
in a particular use case.
Reference: Devavrat Shah, Professor, Department of Electrical Engineering and
Computer Science, MIT, “http://blog.edx.org/future-data-science-qa-mit-professional-educations-devavrat-shah”
- 15. Data Science to refine the “Crude Oil”
Volume
Variety
Velocity
Veracity
Value
(add your own V here…..)
- 16. Where does big data come from?
A huge amount of data is created every day!
It comes from us!
Non-digitized processes become digitized
Digital India
Programme to transform India into a digitally
empowered society and knowledge economy
- 17. Excavating Hidden Treasures from Big Data
Insights into data can provide business advantage
Some key early indications can mean fortunes to
business
More precise analysis with more data
Integrate Big Data with traditional data: Enhance
business intelligence analysis
- 19. Challenges in big data
Heterogeneity and
incompleteness
Scale
Timeliness
Privacy
Human collaboration
- 20. RDBMS : Why not for Big Data?
Limitations in RDBMS
Traditional RDBMS struggle to scale to petabytes of data
Seek time of disk drives is improving more slowly than the transfer
rate of data
RDBMS are not built to handle unstructured or semi-structured
data
Normalization makes large data sets difficult to handle
Example: web logs
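The seek-time point can be made concrete with a back-of-the-envelope calculation. The sketch below compares reading a whole dataset sequentially with touching a small share of its records through random seeks. Every figure (1 TB of data, 100-byte records, 100 MB/s transfer, 10 ms per seek) is an assumption chosen for round numbers, not a measurement.

```python
# All figures below are illustrative assumptions, not measurements.
DATASET_BYTES = 10**12                  # 1 TB of data
RECORD_BYTES = 100                      # size of one record
TRANSFER_BYTES_PER_SEC = 100 * 10**6    # 100 MB/s sequential read
SEEK_MS = 10                            # time for one disk seek


def streaming_seconds():
    """Time to read the entire dataset sequentially."""
    return DATASET_BYTES / TRANSFER_BYTES_PER_SEC


def seeking_seconds(percent):
    """Time spent on seeks alone when touching `percent` of the records at random."""
    records = DATASET_BYTES // RECORD_BYTES
    touched = records * percent // 100
    return touched * SEEK_MS / 1000


stream = streaming_seconds()
seek = seeking_seconds(1)

print(f"stream the whole dataset: {stream:.0f} s")
print(f"seek to 1% of the records: {seek:.0f} s")
```

Under these assumptions, seeking to even 1% of the records costs far more time than streaming everything, which is why batch systems favour sequential scans over seek-heavy access.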
- 21. Distributed computing
Dividing large problems into smaller ones that are solved
concurrently ("in parallel")
Connecting multiple machines together for
Storing big files
Parallel processing
Data locality
Redundancy
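The divide-and-solve-concurrently idea can be sketched on a single machine. The example below splits a list into chunks, lets a pool of workers solve each chunk independently, and combines the partial results. The function names are hypothetical, and a thread pool stands in for the separate machines of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Divide a list into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(items) // parts))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(chunk):
    """The small sub-problem each worker solves on its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, workers=4):
    """Solve the sub-problems concurrently, then combine the partial sums."""
    chunks = split(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


numbers = list(range(1, 1001))
total = parallel_sum_of_squares(numbers)
print(total)
```

Threads in CPython do not speed up CPU-bound work, so this shows only the structure. In a distributed system each chunk would sit on a different machine and the computation would be sent to the data (data locality).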
- 22. Challenges in distributed computing
Distributed computing posed several challenges that made
organizations reluctant to depend on it:
Concurrency control
Data synchronization
Atomic commit
Transaction split into small tasks
Leader election
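To illustrate the atomic commit challenge, here is a toy, in-process sketch of two-phase commit, one common protocol for committing a transaction that has been split across several nodes. The `Participant` class and `two_phase_commit` function are hypothetical names for this example. The sketch ignores crashes, timeouts and network failures, which are the parts that make the real problem hard.

```python
class Participant:
    """A hypothetical node holding one task of a split transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if the local task can be completed.
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit on every participant or on none; return True on commit."""
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


healthy = [Participant("a"), Participant("b")]
faulty = [Participant("c"), Participant("d", can_commit=False)]
print(two_phase_commit(healthy), [p.state for p in healthy])
print(two_phase_commit(faulty), [p.state for p in faulty])
```

The second run shows the all-or-nothing property: a single refusal aborts the task on every node, including the one that voted yes.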
- 23. Big data and cloud: converging
technologies
Big data: Extracting value out of “variety,
velocity and volume” from unstructured
information available
Cloud: On-demand, elastic, scalable, pay-per-use,
self-service model
- 24. Answer these before moving to big data
analysis
Do you have an effective big data problem?
Can the business benefit from using Big Data?
Do your data volumes really require these distributed
mechanisms?
- 25. Technology to handle big data
Google was among the first companies to use big data effectively
Engineers at Google created massively distributed
systems
Collected and analyzed massive collections of web pages
& relationships between them and created “Google
Search Engine” capable of querying billions of pages
- 26. First generation of Distributed systems
Proprietary
Custom Hardware and software
Centralized data
Hardware based fault recovery
E.g. Teradata, Netezza
- 27. Second generation of Distributed systems
Open source
Commodity hardware
Distributed data
Software based fault recovery
E.g. Hadoop, HPCC
- 28. Why do we need a new generation?
A lot has changed since 2000
Both hardware and software have gone through changes
Big data has now become a necessity
Let’s look at what changed over the decade
- 29. Changes in Hardware
State of hardware in 2000 | State of hardware now
Disk was cheap, so disk was the primary source of data | RAM is king
Network was costly, so data locality mattered | RAM is the primary source of data and disk is the fallback
RAM was very costly | Network is faster
Single-core machines were dominant | Multi-core machines are commonplace
- 30. Shortcomings of Second generation
Batch processing is primary objective
Not designed to change depending upon use cases
Tight coupling between API and run time
Do not exploit new hardware capabilities
Too complex
- 31. Third generation distributed systems
Handle both batch processing and real time
Exploit RAM as much as disk
Multi-core aware
Do not reinvent the wheel
They use
HDFS for storage
Apache Mesos / YARN for distribution
Plays well with Hadoop
- 32. Hadoop vs Spark
Hadoop | Spark
Stores data on disk | Stores data in memory (RAM)
Commodity hardware can be utilized | Needs high-end systems with more RAM
Uses replication to achieve fault tolerance | Uses different data storage models to achieve fault tolerance (e.g. RDD)
Processing is slower due to disk reads and writes | Up to 100x faster than Hadoop for in-memory workloads
Supports only Java & R | Supports Java, Python, R, Scala etc. Ease of programming is high.
Everything is just Map and Reduce | Supports Map, Reduce, SQL, Streaming etc.
Data should be in HDFS | Data can be in HDFS, Cassandra, HBase or S3. Runs on Hadoop, Cloud, Mesos or standalone
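The "everything is just Map and Reduce" row refers to a programming pattern that can be sketched in plain Python. The word count below walks through the three steps (map, shuffle, reduce) on an in-memory list. It uses no Hadoop or Spark API, and all function names are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: combine all values for one key into a single count."""
    return reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run map, shuffle and reduce over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return {word: reduce_phase(values) for word, values in shuffle(pairs).items()}


lines = ["big data is big", "data science refines data"]
counts = word_count(lines)
print(counts)
```

In a disk-based MapReduce job the intermediate pairs are written to disk between phases, whereas an in-memory engine can keep them in RAM, which is the main difference the table draws between the two systems.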
- 36. References
1. The Burtch Works Study, http://www.burtchworks.com/big-data-analyst-salary/big-data-career-tips/
2. Mathrubhumi, http://digitalpaper.mathrubhumi.com/943320/kochi/21-Sept-2016#page/6/2
3. Report of McKinsey Global Institute, http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
4. Devavrat Shah, Professor, Department of Electrical Engineering and Computer Science, MIT, http://blog.edx.org/future-data-science-qa-mit-professional-educations-devavrat-shah
5. “Data Mining and Data Warehousing”, M. Sudheep Elayidom, SOE, CUSAT
6. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012, April 2012. Best Paper Award.
7. “What is Big Data”, https://www-01.ibm.com/software/in/data/bigdata/
8. “Apache Hadoop”, https://hadoop.apache.org/
9. “Apache Spark”, http://spark.apache.org/
Editor's Notes
- Glassdoor helps you find a job and company you love. Reviews, salaries and benefits from employees. Interview questions from candidates. Millions of jobs.
PayScale, Inc. or payscale.com is an online salary, benefits and compensation information company, which launched its service on January 1, 2002. It was founded by Joe Giordano, a former Microsoft and drugstore.com manager, and John Gaffney
- Math (e.g. linear algebra, calculus and probability)
Statistics (e.g. hypothesis testing and summary statistics)
Machine learning tools and techniques (e.g. k-nearest neighbors, random forests, ensemble methods, etc.)
Software engineering skills (e.g. distributed computing, algorithms and data structures)
Data mining
Data visualization (e.g. ggplot and d3.js) and reporting techniques
Unstructured data techniques
R and/or SAS languages
SQL databases and database querying languages
Python (most common), C/C++, Java, Perl
Big data platforms like Hadoop, Hive & Pig
Cloud tools like Amazon S3
- Devavrat Shah received his Bachelor of Technology in Computer Science and Engineering from the Indian Institute of Technology, Bombay in 1999 with the President of India Gold Medal, awarded to the best graduating student across all engineering disciplines. He received his PhD in Computer Science from Stanford University in 2004.