How it works- Data Science
- 2. Slide2
www.edureka.in/data-science
How it Works?
Class Recording in LMS
24/7 Post Class Support
Module Wise Quiz
Project Work on Large Database
Verifiable Certificate
- 3. Slide3
Topics for the Day
Big Data
Big Data Scenarios
Big Data Challenges
Introduction to Data Science
Data Science: Components
Types of Data Scientists
Data Science: Core Components
Use-Cases
Introduction to Hadoop and R
R and Hadoop Integration
Machine Learning with Mahout
Assignment, Pre-work and Agenda for the Next Class
What’s Within the LMS
References
- 4. Objectives
At the end of this module, you will be able to
Understand Big Data and its challenges
Implement Big Data in real time scenarios
List and explain the components and prospects of Data Science
Learn the implementation of Hadoop on Big Data
Analyze some real-world use-cases with the help of the R programming language
Understand machine learning concepts
- 7. Slide7
What is Big Data?
Lots of Data (Terabytes or Petabytes)
Systems/Enterprises generate huge amounts of data, from terabytes to even petabytes of information.
http://www.today.mccombs.utexas.edu/2012/04/the-big-data-machine
- 12. Slide12
Big Data Scenarios : Hospital Care
Hospitals are analyzing medical data and patient records to predict which patients are likely to seek readmission within a few months of discharge. The hospital can then intervene in hopes of preventing another costly hospital stay.
A medical diagnostics company analyzes millions of lines of data to develop the first non-intrusive test for predicting coronary artery disease. To do so, researchers at the company analyzed over 100 million gene samples to ultimately identify the 23 primary predictive genes for coronary artery disease.
- 14. Slide14
Big Data Scenarios : Amazon.com
Amazon has an unrivalled bank of data on online consumer purchasing behaviour that it can mine from its 152 million customer accounts.
Amazon also uses Big Data to monitor, track and secure the 1.5 billion items in its retail store that are lying around its 200 fulfilment centres around the world. Amazon stores the product catalogue data in S3.
S3 can write, read and delete objects of up to 5 TB each. The catalogue stored in S3 receives more than 50 million updates a week, and every 30 minutes all data received is crunched and reported back to the different warehouses and the website.
http://wp.streetwise.co/wp-content/uploads/2012/08/Amazon-Recommendations.png
- 19. Slide19
IBM’s Definition
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Web logs
Images
Videos
Audios
Sensor Data
VOLUME
VELOCITY
VARIETY
- 20. Slide20
IBM’s Definition
IBM’s Definition – Big Data Characteristics
3 Vs of Big Data
Volume: Terabytes; Records; Transactions; Tables, files
Velocity: Batch; Near Time; Real Time; Streams
Variety: Structured; Unstructured; Semi-structured; All the above
http://www-01.ibm.com/software/data/bigdata/
- 22. Slide22
Hello There!!
My name is Annie. I love quizzes and
puzzles and I am here to make you guys think and answer my questions.
Annie’s Introduction
- 23. Slide23
Map the following to corresponding type: Structured/ Unstructured/ Semi- structured.
-XML Files
-Word Docs, PDF files, Text files
-E-Mail body
-Data from Enterprise systems (ERP, CRM etc.)
Annie’s Question
- 24. Slide24
XML Files -> Semi-structured data
Word Docs, PDF files, Text files -> Unstructured Data
E-Mail body -> Unstructured Data
Data from Enterprise systems (ERP, CRM etc.) -> Structured Data
Annie’s Answer
- 26. Slide26
Big Data Challenges
Data security and Privacy
High variety of Information
High veracity of Data
Data Acquisition
High velocity of processed Data
Information search and Analytics
High volume of Data
Information storage and Analytics
Big Data: Challenges
- 30. Slide30
No matter how clever your algorithm is, it can often be beaten simply by having more data (and a less sophisticated algorithm).
Big Data is here
Bad News
We are struggling to store and analyze it.
Good News
Data Science
- 34. Slide34
Types of Data Scientists
Based on clustering the ways that data is handled by Data Scientists, the following 4 categories can be created:
Data Businesspeople are the product and profit-focused data scientists. They’re leaders, managers, and entrepreneurs, but with a technical bent. A common educational path is an engineering degree paired with an MBA.
Data Creatives are eclectic jacks-of-all-trades, able to work with a broad range of data and tools. They may think of themselves as artists or hackers, and excel at visualization and open source technologies.
Data Developers are focused on writing software to do analytic, statistical, and machine learning tasks, often in production environments. They often have computer science degrees, and often work with so-called “big data”.
Data Researchers apply their scientific training, and the tools and techniques they learned in academia, to organizational data. They may have PhDs, and their creative applications of mathematical tools yield valuable insights and products.
http://datacommunitydc.org/blog/2013/06/there-is-more-than-one-kind-of-data-scientist/
- 40. Slide40
Understanding the Machine Learning algorithm to be used
Implementing Machine Learning in Hadoop on Big Data
Visualisation of the analysis
Understanding the problem statement and defining the solution
Exploring ways to integrate R with Hadoop
Implementing Machine Learning algorithm in R on the smaller dataset
Use-Case Implementation:Process Flow Diagram
- 41. Slide41
Media Use-Case
Domain of the Dataset:
Communications and Media. However, the application of the algorithm is not limited to Communications and Media. The technique is useful for any domain which requires organizing documents to improve retrieval and support browsing.
Problem Statement:
A top media company wants to browse through the popular news from a collection that appeared on the Reuters newswire in 1987.
Clustering/Grouping documents based on their contents will make the analysis easier.
The Reuters-21578 data set composition
- 42. Slide42
Media Use-Case: K-means Clustering
First we will understand the implementation of the technique in R on a smaller dataset
Then we will understand how to achieve document clustering on Big Data using Mahout libraries on Hadoop
K-Means Clustering can be implemented on this dataset
Communications and Media Dataset to be Clustered based on their contents
R Implementation
Hadoop Implementation
Machine Learning Implementation
Content-wise Clustered/Grouped documents
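As a rough illustration of the clustering idea on a small dataset, here is a minimal K-means sketch in plain Python. It is not the R or Mahout implementation used in the course: the four toy headlines, the term-count vectors, the function names and the fixed starting centroids (`init_indices`) are all assumptions made for this example.

```python
import math
from collections import Counter

def vectorize(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, init_indices, max_iter=20):
    """Plain K-means; init_indices picks the starting centroids (an assumption
    made here so the toy run is repeatable)."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = [None] * len(vectors)
    for _ in range(max_iter):
        # Assignment step: each document joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: distance(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical mini-corpus standing in for newswire articles.
docs = [
    "oil prices rise as crude supply falls",
    "crude oil supply and oil prices",
    "wheat harvest and grain exports grow",
    "grain exports rise after wheat harvest",
]
labels = kmeans(vectorize(docs), k=2, init_indices=[0, 2])
print(labels)
```

With this toy input the two oil headlines end up in one cluster and the two grain headlines in the other. On a real corpus the vectors would normally be weighted (for example with TF-IDF) and the starting centroids chosen at random.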
- 43. Slide43
Market Basket Use-Case
Domain of the Dataset:
Products and Retail. However, the application of the algorithm is not limited to Products and Retail. The technique can be applied wherever we want to discover the co-occurrence relationship amongst various activities.
Problem Statement:
Market Basket Analysis.
A retail outlet wants to understand the purchase behavior of a buyer. This information will enable the retailer to understand the buyer's needs.
The analysis might tell a retailer that customers often purchase shampoo and conditioner together, so putting both items on promotion at the same time would not create a significant increase in profit, while a promotion involving just one of the items would likely drive sales of the other.
Market Basket Analysis
98% of people who purchased items A and B also purchased item C
- 44. Slide44
Market Basket Use-Case: Association Rule Mining
Product and Retail Dataset
Understand the implementation of the technique on a smaller dataset
Understand how to achieve the same on Big Data using Mahout libraries on Hadoop
The technique used is Affinity Analysis or Association Rule Mining
R Implementation
Hadoop Implementation
Machine Learning Implementation
Market Basket Analysis
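The two basic measures behind association rules, support and confidence, can be sketched in a few lines of Python. This is only an illustration on a smaller dataset, not the R or Mahout code from the course: the five baskets and the function names are made up for the example.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))

def support(transactions, items):
    """Fraction of all transactions that contain the item set."""
    return count(transactions, items) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    n_antecedent = count(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0
    return count(transactions, set(antecedent) | set(consequent)) / n_antecedent

# Hypothetical baskets, invented for illustration.
baskets = [
    {"shampoo", "conditioner", "soap"},
    {"shampoo", "conditioner"},
    {"shampoo", "toothpaste"},
    {"conditioner", "soap"},
    {"shampoo", "conditioner", "toothpaste"},
]
pair_support = support(baskets, {"shampoo", "conditioner"})
rule_confidence = confidence(baskets, {"shampoo"}, {"conditioner"})
print(pair_support, rule_confidence)
```

Here shampoo and conditioner appear together in 3 of 5 baskets (support 0.6), and 3 of the 4 shampoo baskets also contain conditioner (confidence 0.75). A statement such as "98% of people who purchased items A and B also purchased item C" is a confidence value of this kind.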
- 45. Slide45
Health Care Use-Case
Domain of the Dataset:
Life Science and Health Care. However, the application of the algorithm is not limited to Life Science and Health Care. The technique can be applied wherever we want to forecast the occurrence of an event on the basis of certain conditions.
Problem Statement:
A health care organization wants to forecast the onset of diabetes mellitus in Indians using a certain set of patient attributes as input, such as:
Plasma glucose concentration
Diastolic blood pressure
Triceps skin fold thickness
etc.
http://www.thenewstribe.com/2013/11/15/diabetes-is-killing-one-patient-every-six-seconds/
- 46. Slide46
Understand the basic implementation of the technique on a smaller dataset using R
Achieve parallel processing on the same algorithm using a parallel processing library provided by Revolution R.
Understand how to achieve the same on Big Data using Mahout libraries on Hadoop
The technique used is Affinity Analysis or Association Rule Mining.
R Implementation
Hadoop Implementation
Machine Learning Implementation
Forecast the onset of diabetes mellitus in Indians
Life Science and Health Care Dataset with some attributes of patients as input.
Health Care Use-Case: Parallel Processing
- 47. Slide47
Social Media Use-Case
Domain of the Dataset:
Social Media. However, the application of the algorithm is not limited to Social Media. The technique can be applied wherever we want to put documents into categories without going through the contents of all the documents.
Problem Statement:
A Social Media research firm wants to know the trends of topics discussed on Twitter. For easy analysis it wants to classify them in the following categories:
apparel (clothes, shoes, watches, …)
art (Book, DVD, Music, …)
camera
event (travel, concert, …)
health (beauty, spa, …)
home (kitchen, furniture, garden, …)
tech (computer, laptop, tablet, …)
http://www.mobigyaan.com/images/stories/Miscellaneous/mobigyaan-twitter-chat.jpg
- 48. Slide48
Social Media Use-Case: Naïve Bayes Classifier
Understand the basic implementation of the technique on a smaller dataset using R.
Understand how to achieve the same on Big Data using Mahout libraries on Hadoop.
The technique used is Naïve Bayes Classifier.
Social Media dataset
R Implementation
Hadoop Implementation
Machine Learning Implementation
Categorical classification of the tweets
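To make the classification step concrete, here is a minimal multinomial Naïve Bayes sketch in plain Python with add-one (Laplace) smoothing. It is an illustration only, not the R or Mahout implementation from the course: the four labelled tweets, the two categories used and the function names are assumptions for this example.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label) pairs.
    Returns per-label word counts, per-label document counts and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen (word, label) pairs from zeroing the score.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled tweets, invented for illustration.
tweets = [
    ("new laptop with a fast processor", "tech"),
    ("tablet and laptop deals today", "tech"),
    ("relaxing spa and beauty day", "health"),
    ("beauty tips for healthy skin", "health"),
]
model = train(tweets)
prediction = classify("which laptop or tablet to buy", model)
print(prediction)
```

The classifier never reads a tweet "by hand": it only compares how often each word occurred under each category in the training examples, which is why it scales to collections too large to inspect manually.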
- 49. Slide49
Going forward in this class, we will briefly introduce the concepts of Hadoop, R and Machine Learning in turn.
These topics are covered in detail in their respective modules during the course.
Data Science: Core Components
- 53. Slide53
Hadoop Core Components
HDFS Cluster: Name node (on the Admin Node) and Data Nodes
MapReduce Engine: Job Tracker (on the Admin Node) and Task Trackers
Each of the four worker nodes shown runs a Data Node and a Task Tracker.
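The MapReduce programming model itself can be sketched without a cluster. The following single-process Python word count is only an illustration of the map, shuffle and reduce phases; it does not use any Hadoop API, and the function names and input lines are assumptions for this example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical input lines, invented for illustration.
counts = word_count(["big data is big", "data science uses data"])
print(counts)
```

On a Hadoop cluster the map and reduce functions run as tasks on many nodes in parallel, close to the HDFS blocks they read, and the framework performs the shuffle between them.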
- 55. Slide55
Large data sets. Hadoop is also capable of processing small data sets; however, to experience its true power one needs data in the terabyte range, because this is where an RDBMS takes hours or fails, whereas Hadoop does the same in a couple of minutes.
Annie’s Answer
- 59. Slide59
R : Characteristics
R is open source and free.
R has lots of packages and multiple ways of doing the same thing.
By default, R stores data in RAM.
R has the most advanced graphics. You need much better programming skills.
R has GUIs to help make learning easier.
Customization needs the command line.
R can connect to many databases and data types.
- 61. Slide61
Comparing R with Base SAS* / SAS Stat*
R
Base SAS* / SAS Stat*
R is open source and free
Base SAS*, SAS/Stat*, SAS/ET*, SAS/OR*, SAS/Graph* are relatively expensive because of annual licenses
Open source R has support from email lists, Twitter, Stack Overflow
SAS Institute* products have dedicated support and extensive documentation
R is slower on the desktop than Base SAS for datasets of ~4-5 GB
By default R stores data in RAM, so we can use the cloud
R has much better graphics
You need much better programming skills
You can create custom functions in R easily
Customization needs the command line
R has multiple GUIs that are free
SAS GUIs are more expensive
*Copyright © 2012 SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513, USA. All rights reserved.
- 67. Slide67
Annie’s Answer
Most of the user-visible functions in R are written in R.
It is possible for the user to interface to procedures written in the C, C++, or FORTRAN languages for efficiency.
- 69. Slide69
R and Hadoop Integration
R and Hadoop are a natural match in Big Data Analytics and visualization.
One of the most well-known R packages to support Hadoop functionalities is RHadoop.
RHadoop was developed by Revolution Analytics.
RHadoop is a collection of three R packages: rmr, rhdfs and rhbase.
The rmr package provides Hadoop MapReduce functionality in R, rhdfs provides HDFS file management in R, and rhbase provides HBase database management from within R.
- 72. Slide72
Machine Learning: Mahout
Machine Learning is a class of algorithms which is data-driven, i.e. unlike "normal" algorithms it is the data that "tells" what the "good answer" is.
Example:
A hypothetical non-machine learning algorithm for face recognition in images would try to define what a face is (round skin-like-colored disk, with dark areas where you expect the eyes, etc.).
A machine learning algorithm would not have such a coded definition, but will "learn-by-examples": you'll show several images of faces and not-faces and a good algorithm will eventually learn and be able to predict whether or not an unseen image is a face.
http://endthelie.com/2012/08/24/fbi-sharing-facial-recognition-software-with-police-departments-across-america/
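The "learn-by-examples" idea can be shown with a very small sketch: a 1-nearest-neighbour classifier that contains no definition of a face and simply copies the label of the most similar training example. The two numeric features (a roundness score and a skin-tone fraction), their values and the function name are hypothetical, chosen only for illustration; real face recognition works on far richer features.

```python
import math

def nearest_neighbor(training, query):
    """Return the label of the training example closest to `query`."""
    features, label = min(training, key=lambda ex: math.dist(ex[0], query))
    return label

# Hypothetical (roundness, skin_tone_fraction) features with labels.
examples = [
    ((0.9, 0.8), "face"),
    ((0.8, 0.9), "face"),
    ((0.2, 0.1), "not-face"),
    ((0.1, 0.3), "not-face"),
]
print(nearest_neighbor(examples, (0.85, 0.7)))
print(nearest_neighbor(examples, (0.3, 0.2)))
```

Changing the behaviour of this classifier means changing the examples, not the code, which is the sense in which the data "tells" the algorithm what the good answer is.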
- 73. Slide73
Mahout Overview
Mahout is about scalable Machine Learning
Mahout has functionality for many of today’s common machine learning tasks
Machine Learning is all over the web today
MapReduce magic in action
- 74. Slide74
Hadoop and MapReduce magic in action
Write intelligent applications using Apache Mahout
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
LinkedIn Recommendations
Machine Learning: LinkedIn Recommendations
- 78. Slide78
Agenda for Next Class
In the next class you will be able to
Understand what R is
Describe why R is used
Implement R Programming Concepts
Learn Data Import Techniques
Analyze the Processing of Data
- 83. Slide83
References
http://www.today.mccombs.utexas.edu/2012/04/the-big-data-machine
http://www.espncricinfo.com/
http://www.majorprojects.vic.gov.au/our-projects/our-past-projects/austin-hospital
http://wp.streetwise.co/wp-content/uploads/2012/08/Amazon-Recommendations.png
http://smhttp.23575.nexcesscdn.net/80ABE1/sbmedia/blog/wp-content/uploads/2013/03/netflix-in-asia.png
http://www.crowdsourcing.org/article/-nasa-tries-to-free-creativity-with-big-data-challenge/19984
http://whatsthebigdata.files.wordpress.com/2013/11/batman-on-big-data.jpg
http://spinnakr.com/blog/wp-content/uploads/2013/08/Using-Big-Data-.jpg
http://thesocietypages.org/sociologylens/files/2013/09/BIgDataDilbert_Cartoon.jpg
http://abstrusegoose.com/55
http://www.thenewstribe.com/2013/11/15/diabetes-is-killing-one-patient-every-six-seconds/
http://www.mobigyaan.com/images/stories/Miscellaneous/mobigyaan-twitter-chat.jpg
http://www.r-project.org/
http://endthelie.com/2012/08/24/fbi-sharing-facial-recognition-software-with-police-departments-across-america/