DATA MINING AND
MACHINE LEARNING
TWO-SEMESTER COURSE PROPOSAL*
2016-06-01 (YYYY-MM-DD)
version 0.1.3
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
JAKUB RUZICKA
linkedin.com/in/littlerose
jameslittlerose@gmail.com
* something I may consider teaching in the future,
if I gain enough experience and find collaborators
(do not hesitate to drop me a line! =))
OUTLINE
ANNOTATION
INTENDED LEARNING OUTCOMES
ENTRY REQUIREMENTS
EXAMINATION
TEACHING METHODS
DATA MINING SYLLABUS
MACHINE LEARNING SYLLABUS
LITERATURE
ANNOTATION
DATA MINING ANNOTATION
The course introduces students to the interdisciplinary nature of data mining.
Its goal is to expose them to a variety of data and enable them to obtain and
process such data, quickly find their feet, and perform exploratory analysis as a
basis for drawing conclusions for decision-making and/or for subsequent
automation and prediction employing machine learning models.
MACHINE LEARNING ANNOTATION
The Machine Learning course builds on the Data Mining course by introducing
students to the most widely used machine learning algorithms and to building
machine learning models for prediction, decision-making, and/or automation of
data analysis in a computer program /application.
INTENDED LEARNING
OUTCOMES
DATA MINING ILOs
Upon completion of the course, the students will:
■ be able to handle a problem in a wide range of business cases and
scientific disciplines, assess whether and how the problem is solvable by
data mining, obtain and process the necessary data, perform exploratory
analysis, create visualizations, a full report, and an executive summary for
decision-making, and/or prepare the data for further processing
■ take a positive approach towards data science and computer science, gain
confidence in basic operations, get an overview of advanced data mining
methods and applications
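As a rough illustration of the "obtain, process, explore" steps named above, here is a minimal Python sketch that uses only the standard library. The inline CSV data, the column names (`city`, `temperature`), and the helpers `load_rows` and `summarize` are hypothetical and exist only for this example; the course itself may well use other libraries and data sources.

```python
import csv
import io
import statistics

# Hypothetical sample data standing in for a file, database, or API response.
RAW_CSV = """city,temperature
Prague,21.5
Brno,23.0
Prague,19.5
Brno,
Ostrava,20.0
"""

def load_rows(text):
    """Parse CSV text and drop rows with a missing temperature (basic cleaning)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        if row["temperature"]:
            rows.append({"city": row["city"], "temperature": float(row["temperature"])})
    return rows

def summarize(rows):
    """Group temperatures by city and compute the count and mean of each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["city"], []).append(row["temperature"])
    return {
        city: {"count": len(values), "mean": statistics.mean(values)}
        for city, values in groups.items()
    }

rows = load_rows(RAW_CSV)
summary = summarize(rows)
for city, stats in sorted(summary.items()):
    print(f"{city}: n={stats['count']}, mean={stats['mean']:.2f}")
```

A per-group summary like this is one possible starting point for the visualizations and the executive summary; a real project would add validation and handle far messier input.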
MACHINE LEARNING ILOs
Upon completion of the course, the students will:
■ be able to handle the iterative process of selecting a machine learning
model suitable for a particular problem, perform preprocessing, feature
extraction, dimensionality reduction, training, testing, and tweaking of
a machine learning model, and develop a simple web application for
reporting and/or for plugging into a larger project /product
■ conceptually understand the mathematics and principles behind the most
widely used machine learning algorithms
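The iterative "train, test, tweak" loop mentioned above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy dataset, the choice of a k-nearest-neighbours classifier, the candidate values of `k`, and all function names are hypothetical, and preprocessing, feature extraction, and dimensionality reduction are left out for brevity.

```python
import random

def make_dataset(n_per_class, seed=0):
    """Generate a hypothetical two-class toy dataset of 2-D points."""
    rng = random.Random(seed)
    data = []
    for label, center in ((0, (0.0, 0.0)), (1, (3.0, 3.0))):
        for _ in range(n_per_class):
            point = (rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
            data.append((point, label))
    rng.shuffle(data)
    return data

def predict(train, point, k):
    """Classify a point by majority vote among its k nearest training points."""
    by_distance = sorted(
        train,
        key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2,
    )
    votes = [label for _, label in by_distance[:k]]
    return max(set(votes), key=votes.count)

def accuracy(train, test, k):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(1 for point, label in test if predict(train, point, k) == label)
    return correct / len(test)

data = make_dataset(n_per_class=50)
train, validation, test = data[:60], data[60:80], data[80:]

# "Tweaking": pick the k that performs best on the validation set.
best_k = max((1, 3, 5, 7), key=lambda k: accuracy(train, validation, k))

# "Testing": evaluate once on data that was not used for any choice above.
test_accuracy = accuracy(train, test, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

Keeping a separate validation set for tuning and a held-out test set for the final check is one common way to avoid an over-optimistic estimate; in practice a library implementation would normally replace the hand-written classifier.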
ENTRY REQUIREMENTS
DATA MINING ENTRY REQUIREMENTS
It would be helpful if students had completed an introductory computer
science and programming course and a basic statistics and probability course,
or had related professional experience.* However, motivated students will be
able to deal with the content without these prerequisites. Our main tool will be
the general-purpose programming language Python; prior knowledge of it is
neither required nor expected, as Python training is a fundamental part of the
Data Mining course.
* All of these (and much more) can be reviewed using online open educational resources even before the course
begins. We also plan to organize an information meeting for anyone interested in taking this course.
Note: Vacant places in the course might be offered free of charge to the
general public, including but not limited to employees, freelancers, high school
students, individuals on parental leave, senior citizens, and so on.
MACHINE LEARNING ENTRY REQUIREMENTS
Completion of the introductory Data Mining course or an equivalent, or related
professional experience.*
* All of these (and much more) can be reviewed using online open educational resources even before the course
begins. We also plan to organize an information meeting for anyone interested in taking this course.
Note: Vacant places in the course might be offered free of charge to the
general public, including but not limited to employees, freelancers, high school
students, individuals on parental leave, senior citizens, and so on.
EXAMINATION
DATA MINING EXAMINATION
A research project developed in teams of (roughly) 3 students, delivered in the form of an executive
summary for a business and/or scientific institution*, which will be assessed by the course lecturers
and clients alike.
In an effort to bring your projects closer to a real-world scenario, the assignment and the evaluation
criteria will be specified based on your discussion with your client (in collaboration with the course
lecturers).
Each team member is expected to specialize in the area she/he finds most meaningful with
regard to her/his goals. This specialization will be discussed with her/him during the project defence;
it ensures individual contributions from all members of a team and also gives each student an
opportunity to use the project as a basis for her/his Bachelor’s thesis.
* Without financial remuneration but with an opportunity to earn your first professional contact (long-term
collaboration and/or professional recommendation) and a successful data science project on your CV.
MACHINE LEARNING EXAMINATION
A simple machine learning web application developed in teams of (roughly) 3 students for a
business and/or scientific institution*, which will be assessed by the course lecturers and clients alike.
In an effort to bring your projects closer to a real-world scenario, the assignment and the evaluation
criteria will be specified based on your discussion with your client (in collaboration with the course
lecturers).
Each team member is expected to specialize in the area she/he finds most meaningful with
regard to her/his goals. This specialization will be discussed with her/him during the project defence;
it ensures individual contributions from all members of a team and also gives each student an
opportunity to use the project as a basis for her/his Bachelor’s /Master’s thesis.
* Without financial remuneration but with an opportunity to earn your first professional contact (long-term
collaboration and/or professional recommendation) and a successful data science project on your CV.
MOTIVATION
■ preparation for conducting commercial or academic research /actively participating in the
development of larger applications
■ an opportunity to try everything out under supervision and get feedback on your work
■ practicing working with open source tools, which lowers the financial burden (and therefore the
barriers) for your prospective clients
■ practicing teamwork skills and collaboration within a larger workgroup /institution
TEACHING METHODS
F2F BLOCK TEACHING SESSIONS
■ interactive /live /real-time rather than ‘scripted’ lectures (semi-formal discussion and on-time
explanation of a concept) including hands-on tutorials /labs
■ top-down and bottom-up approach: demonstration of a particular analysis and its output (door-
opening moment, motivation), gradually disassembling it from the higher-level concepts to the
basic building blocks /necessary prerequisites (underlying math, algorithms, technologies,
program code, ...), then going back from the lower-level details showing their implementation in a
much wider range of applications and comparing different approaches to solving a problem
■ one-day data science ‘hackathons’
■ BYOD (Bring Your Own Device) as you’ll need to set up and use your own development
environment
DISTANCE LEARNING
■ open educational resources suggested for each session
■ Q&A forum
(How do I ask a good question? stackoverflow.com/help/how-to-ask)
■ sharing your work in progress and discussing it with others
■ (if you agree) shared notes /study material /wiki /... created by the students of the course and
for the students of the course (also reviewed and co-created by the lecturers), where one can
focus on creating background for her/his specialization
■ voluntary ‘challenges’, small data mining /machine learning tasks to reinforce your skills
GUEST LECTURERS (POSSIBLY WEBINARS)
■ professionals, researchers, authors, prospective employers, …
■ expertise in a particular core topic of our course and/or on request
(based on what is most meaningful to you with regard to your final project
and/or your personal professional /academic goals and interests)
Note: Class attendance is voluntary (naturally). Interact with the course in a way that suits you best.
It’s totally fine if you’re a self-driven learner who approaches the lecturers only when you need their
help. Similarly, you might just want to audit the course (without completing it) and/or hand-pick
only the topics that interest you. On the other hand, if you are not engaged because you think we
can do better, by all means, tell us so that we can work on it!
Make the course our joint project. Let’s adjust and approve the course structure and course
requirements at the very beginning so that it supports your individual and our common goals
(answering all ‘Why?’ questions should boost your self-motivation). Take the initiative and come up
with ideas for lectures /course topics /guests /..., get involved by teaching what you know /are good
at /what you want to improve in /..., contribute to the development of the course in order to obtain
your desired life /professional /academic /... skills.
OUR MAIN TOOLS
■ Python
■ Linux command line
■ VirtualBox
DATA MINING SYLLABUS
1. INTRODUCTION
■ Computer Science, Data Science, Data Mining, Data Analysis
■ Data Science Applications in Business and Various Academic Disciplines
■ Law and Ethics of Data Mining
2. PREREQUISITES
■ Pseudocoding
■ Linux Command Line
■ Programming in Python (and alternative data science tools comparison)
■ Data Sources, Types, and Formats
■ SQL and NoSQL Database Systems
■ Web Technologies and Web Programming
■ The Git Version Control System
3. DATA MINING
■ Getting Data: Files, Databases, APIs, Web Scraping
■ Data Exploration and Preprocessing
■ Basic Statistics, Probability and Linear Algebra Review
■ Mathematics and Statistics in Python
■ Bayesians and Frequentists
■ Graph Theory
■ Social Network Analysis
■ Mining the Social and Semantic Web
■ Text Mining
■ Natural Language Processing
■ Sensory Data
■ Signal Processing: Image, Video, Audio, Speech, ...
■ Geographic Information Systems
■ Metadata
■ Introducing Popular Data Mining Languages and Software (apps /tools /platforms /services /...)
for Various Uses
Note: The topics of the 3rd block of the course (to which we’ll probably dedicate most of our time)
cover theoretical background, Python (possibly also alternative tools) implementation, Python
libraries, methods, procedures, and applications.
4. REPORTING
■ Data Visualization
■ JavaScript and Python Data Visualization Libraries (and alternatives comparison)
■ Data Storytelling
5. INTRODUCTION TO MACHINE LEARNING
■ Supervised and Unsupervised Learning
■ Classification and Clustering
■ Recommender Systems
■ Large-Scale Datasets
6. APPLICATIONS & GETTING OUT OF YOUR COMFORT ZONE
■ A/B Testing
■ Price Modeling
■ Fraud Detection
■ Revenue Assurance
■ Supply Chain Management
■ Market Basket Analysis
■ Face Recognition
■ Robotics and Computer Vision
■ Music Genre Classification
■ Speech Recognition
■ Internet of Things and Raspberry Pi
■ SCADA
■ Web Usage Mining
■ Intelligence Gathering
■ Packet Analysis (in Digital Forensics)
■ Computer Virus Detection
■ Weather Forecasting
■ Sports Analytics
■ Spatial Analysis
■ Investigative Journalism
■ Educational Data Mining
■ Text Retrieval and Search Engines
■ Autocomplete
■ Topic Modeling
■ Content Discovery
■ Semantic Reasoners
■ Sentiment Analysis
■ Decision Support Systems
■ Medical Recommendation Systems
■ Model-based Drug Development
■ Neuroimaging Data Mining
■ DNA Sequencing
■ (...)
...why you want to attend the follow-up course =)
Let’s not limit ourselves to just one discipline. ICT and math skills can be applied
anywhere. Let’s first work on becoming the James Bonds (/MacGyvers /Chuck
Norrises /...) of Data Science, which should not only improve our hands-on
abilities, but also help us acquire the general data science mindset that will
later allow us to innovate in our areas of expertise (outside this course and also
as part of each student’s specialization - see the Examination section).
MACHINE LEARNING SYLLABUS
1. INTRODUCTION
■ Big Data, Data Science, Machine Learning
■ Predictive Modelling
■ Security and Data Ownership
2. PREREQUISITES
■ The Data Mining Course Topics Review (MailBox Mining)
■ Bayesian Statistics and Multidimensional Data Analysis (in ‘Classical’ Statistics) Review
■ Object-oriented Programming and Design Patterns in Python, Style Guide for Python Code
■ Working with Large Scale Datasets
Note: The 2nd-7th blocks of the course cover theoretical background, Python (possibly also
alternative tools) implementation, Python libraries, methods, procedures, and applications.
3. (ADVANCED) DATA MINING FOLLOW-UP
■ Advanced Web Scraping, Web Crawling, Web Automation
■ Probabilistic Graphical Models and Inference
■ Advanced Text Mining and Natural Language Processing
■ Pattern Recognition and Computer Vision
4. MACHINE LEARNING
■ Getting Data and Preprocessing
■ Dimensionality Reduction
■ Feature Selection
■ Model Learning and Testing
■ Supervised and Semi-supervised Learning
■ Classification and Regression Algorithms
■ Unsupervised Learning
■ Clustering and Dimensionality Reduction Algorithms
■ Recommender Systems
■ Collective Intelligence and Filtering Methods
■ Neural Networks and Deep Learning
■ Reinforcement Learning
■ Evolutionary Algorithms
5. BIG DATA PROCESSING
■ MapReduce
■ Spark and Hadoop Ecosystems
■ Distributed Storage
■ Cluster Computing
■ Cloud Computing
6. SERVER-SIDE PROGRAMMING
■ Developing Data Products and SaaS
■ Data Stream Mining
■ Deployment, Testing, Error Handling, Optimization, Licensing
7. APPLICATIONS & GETTING OUT OF YOUR COMFORT ZONE
■ Simulations
■ Artificial Intelligence
■ Big Data in Business, Finance, Healthcare, Education, Agriculture, Marketing, Astronomy,
Natural Sciences, Geosciences, Social Sciences, ...
■ The Remaining 99% of Data Science and Data Science Buzzwords
LITERATURE
WORK IN PROGRESS (REARRANGEMENT, COMPLETION, CITATION)
LITERATURE
The literature is common to both courses (we’ll start with the basics and get to the more difficult
topics in the follow-up course). The students are not required to read any of the following publications
but might find them handy when looking for inspiration, reference, or sample code, or when some part
of the course catches their interest and they want to follow it up with more in-depth self-directed study.
Further online /paperback study resources, tutorials, libraries, frameworks, and other tools will be
introduced within specific topics of the course.
Note: This list is by no means comprehensive and we’ll be able to give you a (much) more
targeted recommendation if you tell us where you are (regarding your current knowledge and
skills in a particular area) and where you want to be. On top of that: “Practice, practice, practice.”
DATA SCIENCE, DATA MINING, DATA ANALYSIS
[01] Doing Data Science
[02] Data Science from Scratch
[03] Python For Data Analysis
[04] Learning Data Mining with Python
[05] A Programmer's Guide to Data Mining
[06] Data Analysis with Open Source Tools
[07] Practical Data Analysis
[08] Bad Data Handbook
[09] Practical Data Science Cookbook
[10] Data Mining: The Textbook
[11] Data Mining for the Masses
[12] Data Smart
[13] Superforecasting
[14] Python Data Science Cookbook
[15] Mastering Python for Data Science
[16] Python Data Science Handbook [expected]
[17] Python Data Science Essentials
[18] Foundations for Analytics with Python [expected]
[19] Mastering Python Data Analysis [expected]
BAYESIAN STATISTICS AND PGMs
[20] Think Bayes
[21] Bayesian Data Analysis
[22] Bayesian Methods for Hackers
[23] Learning Bayesian Networks
[24] Probabilistic Graphical Models
[25] Building Probabilistic Graphical Models with Python
[26] The Signal and The Noise
SOCIAL NETWORK ANALYSIS
[27] Analyzing Social Media Networks with NodeXL
[28] Social Network Analysis for Startups
SOCIAL AND SEMANTIC WEB
[29] Mining the Social Web
[30] Analyzing the Social Web
[31] Web Scraping with Python
[32] Learning Scrapy
[33] Programming Semantic Web
[34] Linked Data
[35] A Developer’s Guide to the Semantic Web
[36] Social Media Mining with Python [expected]
[37] Mastering Social Media Mining with Python [expected]
TEXT MINING AND NLP
[38] Natural Language Processing with Python
[39] Python 3 Text Processing with NLTK 3 Cookbook
[40] Mastering Natural Language Processing with Python [expected]
[41] Speech and Language Processing
[42] Natural Language Annotation
IMAGE, SOUND, AND BEYOND
[43] Image Processing and Acquisition using Python
[44] Programming Computer Vision with Python
[45] Practical Computer Vision with SimpleCV
[46] OpenCV for Secret Agents
[47] Introduction to Sound Processing
[48] Python for Signal Processing
[49] Think Digital Signal Processing
[50] Learning Geospatial Analysis with Python
[51] Python Scripting for ArcGIS
[52] Python for Secret Agents
[53] Internet of Things with Python [expected]
DATA VISUALIZATION
[54] The Visual Display of Quantitative Information
[55] Envisioning Information
[56] Beautiful Visualization
[57] D3.js in Action
[58] Interactive Data Visualization for the Web
[59] Data Visualization with D3.js Cookbook
[60] Data Visualization with JavaScript
[61] HTML5 Graphics Data Visualization CookBook
[62] Python Data Visualization CookBook
[63] Data Visualization Cookbook [expected]
[64] Visualizing Data
[65] Making Sense of Data
MACHINE LEARNING AND ALGORITHMS
[66] Learning From Data
[67] Python Machine Learning
[68] Building Machine Learning Systems with Python
[69] Introduction to Machine Learning with Python [expected]
[70] Machine Learning in Python
[71] Applied Predictive Modeling
[72] Think Machine Learning [expected]
[73] An Introduction to Statistical Learning with Applications in R
[74] The Elements of Statistical Learning
[75] The Top Ten Algorithms in Data Mining
[76] Data Mining and Analysis
[77] Data Mining: Practical Machine Learning Tools and Techniques
[78] Machine Learning
[79] Mastering Machine Learning with scikit-learn
[80] scikit-learn Cookbook
[81] Programming Collective Intelligence
[82] Practical Recommender Systems [expected]
[83] Machine Learning, A Probabilistic Perspective
[84] Neural Networks and Deep Learning
[85] Fundamentals of Deep Learning [expected]
[86] Deep Learning: A Practitioner's Approach [expected]
[87] Pattern Recognition and Machine Learning
[88] Machine Learning: The Art and Science of Algorithms that Make Sense of Data
[89] Designing Machine Learning Systems with Python
[90] Real-World Machine Learning [expected]
[91] Practical Machine Learning
[92] Python Machine Learning Cookbook [expected]
[93] Introducing Data Science: Big Data, Machine Learning, and More, Using Python Tools [expected]
[94] Large Scale Machine Learning with Python [expected]
[95] Machine Learning for the Web [expected]
BIG DATA AND CLOUD COMPUTING
[96] Mining of Massive Datasets
[97] Data Algorithms
[98] Big Data Principles and Best Practices
[99] Learning Spark
[100] Advanced Analytics with Spark
[101] Fast Data Processing with Spark [expected]
[102] Hadoop
[103] Hadoop Application Architectures
[104] Data Intensive Text Processing with MapReduce
[105] Python and HDF5
[106] Amazon Web Services in Action
[107] Programming Amazon EC2
[108] Amazon Web Services For Dummies
[109] Cloudera Admin Handbook
[110] Real-Time Analytics
APPLICATIONS AND OUT OF YOUR COMFORT ZONE
[111] Data Science for Business
[112] Python for Finance
[113] Raspberry Pi Cookbook
[114] Internet of Things
[115] Bioinformatics Data Skills
[116] Bioinformatics with Python Cookbook
[117] Effective Computation in Physics
[118] Artificial Intelligence
[119] Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms
[120] Artificial Intelligence for Humans, Volume 2: Nature-Inspired Algorithms
[121] Artificial Intelligence for Humans, Volume 3: Deep Learning and Neural Networks
(...)
(WEB) APPLICATION DEVELOPMENT
[122] Web Technologies
[123] Flask Web Development
[124] Instant Flask Web Development
[125] The Architecture of Privacy
[126] Data Jujitsu
[127] Version Control with Git
[128] Pro Git
INTRODUCTORY STATISTICS AND MATHEMATICS
[129] Think Stats
[130] Statistics in a Nutshell
[131] Doing Math with Python
[132] Numerical Python
[133] Mathematics for Computer Science
[134] Mathematics for Computer Scientists
[135] How to Lie with Statistics
PYTHON PROGRAMMING
[136] Learning Python
[137] Dive Into Python
[138] Learn Python the Hard Way
[139] Real Python
[140] Regular Expressions Cookbook
[141] Python 3 Object-Oriented Programming
DBMSs AND LANGUAGES
[142] JavaScript: The Definitive Guide
[143] JavaScript: The Good Parts
[144] Learning SQL
[145] MongoDB
[146] NoSQL Distilled
[147] Seven Databases in Seven Weeks
[148] Graph Databases
[149] Building Web Applications with Python and Neo4j
[150] Redis Essentials
[151] Elasticsearch
[152] RDF Database Systems
[153] Learning SPARQL
ONLINE
Self-directed learners, those who prefer distance /blended learning, those who want to know more,
or those who don’t want to rely on one source of information only might want to expand
/complement /substitute different parts of the course on:
■ pycon.org
■ pydata.org
■ conference.scipy.org
■ pyladies.com
■ kaggle.com
■ topcoder.com
■ github.com/vinta/awesome-python
■ stackexchange.com
■ github.com
■ reddit.com
■ programmableweb.com
■ w3schools.com
■ aws.amazon.com/documentation
(...)
■ youtube.com
■ coursera.org
■ ocw.mit.edu
■ edx.org
■ udacity.com
■ online.stanford.edu
■ extension.harvard.edu
■ webcast.berkeley.edu
■ nptel.ac.in
■ blog.agupieware.com/2014/05/online-learning-bachelors-level.html
■ class-central.com
■ tutorialspoint.com
■ iversity.org
■ canvas.net
■ futurelearn.com
■ saylor.org
■ novoed.com/courses
■ edventis.com
■ udemy.com
■ lynda.com
■ codecademy.com
■ khanacademy.org
■ howstuffworks.com
■ wikipedia.org
(...)
■ oreilly.com
■ packtpub.com
■ manning.com
■ eu.wiley.com
■ elsevier.com
■ nostarch.com
■ store.elsevier.com/Syngress/IMP_76/
■ store.elsevier.com/Morgan-Kaufmann/IMP_16/
■ pearsoned.co.uk/imprints/addison-wesley/
■ pragprog.com
■ springer.com
■ apress.com
■ mhprofessional.com
(...)
…and many other [yourfavoritesearchengine] it & learn it resources
PS: Don't forget to share on the course forum the awesome resources you’ve found! (ideally resources
that are freely available online to compensate for our conventional
‘backing-up-the-course-syllabus-using-lots-of-books’ approach =))

More Related Content

Data Mining and Machine Learning

  • 1. DATA MINING AND MACHINE LEARNING TWO-SEMESTER COURSE PROPOSAL* 2016-06-01 (YYYY-MM-DD) version 0.1.3 This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. JAKUB RUZICKA linkedin.com/in/littlerose jameslittlerose@gmail.com * something I may consider teaching in the future, if I gain enough experience and find collaborators (do not hesitate to drop me a line! =))
  • 2. OUTLINE LITERATURE MACHINE LEARNING SYLLABUS TEACHING METHODS EXAMINATION ENTRY REQUIREMENTS INTENDED LEARNING OUTCOMES ANNOTATION DATA MINING SYLLABUS
  • 4. DATA MINING ANNOTATION The course introduces students to data mining in its interdisciplinary nature, with the goal of being exposed to and being able to obtain variety of data, process them, quickly find one’s feet, and perform exploratory analysis as a basis for drawing conclusions for decision-making and/or subsequent automation and prediction employing machine learning models.
  • 5. MACHINE LEARNING ANNOTATION The Machine Learning course follows the Data Mining course with introducing students to the most widely used machine learning algorithms and building machine learning models for prediction, decision-making, and/or automation of data analysis in a computer program /application.
  • 7. DATA MINING ILOs Upon completion of the course, the students will: ■ be able to handle a problem in a wide range of business cases and scientific disciplines, assess whether and how the problem is solvable by data mining, obtain and process the necessary data, perform exploratory analysis, create visualizations, full report, and executive summary for decision-making, and/or prepare the data for further processing ■ take a positive approach towards data science and computer science, gain confidence in basic operations, get an overview of advanced data mining methods and applications
  • 8. MACHINE LEARNING ILOs Upon completion of the course, the students will: ■ be able to handle the iterative process of selecting a machine learning model suitable for a particular problem, perform preprocessing, feature extraction, dimensionality reduction, training, testing and tweaking a machine learning model, and develop a simple web application for reporting and/or ready to be plugged into a larger project /product ■ conceptually understand the mathematics and principles behind the most widely used machine learning algorithms
  • 10. DATA MINING ENTRY REQUIREMENTS It would be convenient if students had completed any introductory computer science and programming course, and a basic statistics and probability course, or had related professional experience.* However, motivated students will be able to deal with the content without these prerequisites. Our main tool will be the Python general-purpose programming language, whose knowledge is neither required nor expected as Python training is a fundamental part of the Data Mining course. * All of these (and much more) can be reviewed using online open educational resources even before the course begins. We also plan to organize an information meeting for anyone interested in taking this course. Note: Course vacancies might be, free of charge, offered to the general public including but not limited to employees, freelancers, high school students, individuals on parental leave, senior citizens, and so on.
  • 11. MACHINE LEARNING ENTRY REQUIREMENTS The introductory Data Mining course, its alternative, or related professional experience.* * All of these (and much more) can be reviewed using online open educational resources even before the course begins. We also plan to organize an information meeting for anyone interested in taking this course. Note: Course vacancies might be, free of charge, offered to the general public including but not limited to employees, freelancers, high school students, individuals on parental leave, senior citizens, and so on.
  • 13. DATA MINING EXAMINATION Research project developed in teams consisting of (roughly) 3 students in the form of an executive summary for a business and/or scientific institution*, which will be assessed by the course lecturers and clients alike. In an effort to bring you projects closer to a real-world scenario, the assignment and the evaluation criteria will be specified based on your discussion (in collaboration with the course lecturers) with your client. Specialization of each team member on a particular area she/he finds the most meaningful with regards to her/his goals is expected. It will be discussed with her/him during the project defence, ensure individual contributions of all members of a team, and therefore also gives one an opportunity to use the project as a basis for her/his Bachelor’s thesis. * Without financial remuneration but with an opportunity to earn your first professional contact (long-term collaboration and/or professional recommendation) and a successful data science project on your CV.
  • 14. MACHINE LEARNING EXAMINATION Simple machine learning web application developed in teams consisting of (roughly) 3 students for a business and/or scientific institution*, which will be assessed by the course lecturers and clients alike. In an effort to bring you projects closer to a real-world scenario, the assignment and the evaluation criteria will be specified based on your discussion (in collaboration with the course lecturers) with your client. Specialization of each team member on a particular area she/he finds the most meaningful with regards to her/his goals is expected. It will be discussed with her/him during the project defence, ensure individual contributions of all members of a team, and therefore also gives one an opportunity to use the project as a basis for her/his Bachelor’s /Master’s thesis. * Without financial remuneration but with an opportunity to earn your first professional contact (long-term collaboration and/or professional recommendation) and a successful data science project on your CV.
  • 15. MOTIVATION ■ preparation for conducting a commercial or academic research /active participation in development of larger applications ■ an opportunity to try everything out under supervision and get feedback on your work ■ practicing working with open source tools lowering the financial burden (and therefore barriers) of your prospective clients ■ practicing teamwork skills and collaboration within a larger workgroup /institution
  • 17. F2F BLOCK TEACHING SESSIONS ■ interactive /live /real-time rather than ‘scripted’ lectures (semi-formal discussion and on-time explanation of a concept) including hands-on tutorials /labs ■ top-down and bottom-up approach: demonstration of a particular analysis and its output (door- opening moment, motivation), gradually disassembling it from the higher-level concepts to the basic building blocks /necessary prerequisites (underlying math, algorithms, technologies, program code, ...), then going back from the lower-level details showing their implementation in a much wider range of applications and comparing different approaches to solving a problem ■ one-day data science ‘hackathons’ ■ BYOD (Bring Your Own Device) as you’ll need to set up and use your own development environment
  • 18. DISTANCE LEARNING ■ open educational resources suggested for each session ■ Q&A forum (How do I ask a good question? stackoverflow.com/help/how-to-ask) ■ sharing your work in progress and discussing it with others ■ (if you agree) shared notes /study material /wiki /... created by the students of the course and for the students of the course (also reviewed and co-created by the lecturers), where one can focus on creating background for her/his specialization ■ voluntary ‘challenges’, small data mining /machine learning tasks to reinforce your skills
  • 19. GUEST LECTURERS (POSSIBLY WEBINARS) ■ professionals, researchers, authors, prospective employers, … ■ expertise in a particular core topic of our course and/or on request (based on what is most meaningful to you with regard to your final project and/or your personal professional /academic goals and interests) Note: Class attendance is voluntary (naturally). Interact with the course in a way that suits you best. It’ s totally fine if you’re a self-driven learner who approaches the lecturers only when she/he needs their help. Similarly, you might just want to audit the course (you don’t want to complete it) and/or hand- pick only the topics that interest you. On the other hand, if you are not engaged because you think we can do better, by all means, tell us so that we can work on it! Make the course our joint project. Let’s adjust and approve the course structure and course requirements at the very beginning so that it supports your individual and our common goals (answering all ‘Why?’ questions should boost your self-motivation). Take the initiative and come up with ideas for lectures /course topics /guests /..., get involved by teaching what you know /are good at /what you want to improve in /..., contribute to the development of the course in order to obtain your desired life /professional /academic /... skills.
  • 20. OUR MAIN TOOLS ■ Python ■ Linux command line ■ VirtualBox
  • 22. 1. INTRODUCTION ■ Computer Science, Data Science, Data Mining, Data Analysis ■ Data Science Applications in Business and Various Academic Disciplines ■ Law and Ethics of Data Mining
  • 23. 2. PREREQUISITES ■ Pseudocoding ■ Linux Command Line ■ Programming in Python (and alternative data science tools comparison) ■ Data Sources, Types, and Formats ■ SQL and NoSQL Database Systems ■ Web Technologies and Web Programming ■ The Git Version Control System
  • 24. 3. DATA MINING ■ Getting Data: Files, Databases, APIs, Web Scraping ■ Data Exploration and Preprocessing ■ Basic Statistics, Probability and Linear Algebra Review ■ Mathematics and Statistics in Python ■ Bayesians and Frequentists ■ Graph Theory ■ Social Network Analysis ■ Mining the Social and Semantic Web ■ Text Mining ■ Natural Language Processing ■ Sensory Data ■ Signal Processing: Image, Video, Audio, Speech, ... ■ Geographic Information Systems ■ Metadata ■ Introducing Popular Data Mining Languages and Software (apps /tools /platforms /services /...) for Various Uses Note: The topics of the 3rd block of the course (to which we’ll probably dedicate most of our time) cover theoretical background, Python (possibly also alternative tools) implementation, Python libraries, methods, procedures, and applications.
  • 25. 4. REPORTING ■ Data Visualization ■ JavaScript and Python Data Visualization Libraries (and alternatives comparison) ■ Data Storytelling
  • 26. 5. INTRODUCTION TO MACHINE LEARNING ■ Supervised and Unsupervised Learning ■ Classification and Clustering ■ Recommender Systems ■ Large-Scale Datasets
  • 27. 6. APPLICATIONS & GETTING OUT OF YOUR COMFORT ZONE ■ A/B Testing ■ Price Modeling ■ Fraud Detection ■ Revenue Assurance ■ Supply Chain Management ■ Market Basket Analysis ■ Face Recognition ■ Robotics and Computer Vision ■ Music Genre Classification ■ Speech Recognition ■ Internet of Things and Raspberry Pi ■ SCADA ■ Web Usage Mining ■ Intelligence Gathering ■ Packet Analysis (in Digital Forensics) ■ Computer Virus Detection ■ Weather Forecasting ■ Sports Analytics ■ Spatial Analysis ■ Investigative Journalism ■ Educational Data Mining ■ Text Retrieval and Search Engines ■ Autocomplete ■ Topic Modeling ■ Content Discovery ■ Semantic Reasoners ■ Sentiment Analysis ■ Decision Support Systems ■ Medical Recommendation Systems ■ Model-based Drug Development ■ Neuroimaging Data Mining ■ DNA Sequencing ■ (...) ...why you want to attend the follow-up course =) Let’s not limit ourselves to just one discipline. ICT and math skills can be applied anywhere. Let’s first work on becoming the James Bonds (/MacGyvers /Chuck Norrises /...) of Data Science, which should not only improve our hands-on abilities, but also facilitate acquiring the general data science mindset that will later allow us to innovate in our areas of expertise (outside this course and also as part of each student’s specialization - see the Examination section).
  • 29. 1. INTRODUCTION ■ Big Data, Data Science, Machine Learning ■ Predictive Modelling ■ Security and Data Ownership
  • 30. 2. PREREQUISITES ■ The Data Mining Course Topics Review (MailBox Mining) ■ Bayesian Statistics and Multidimensional Data Analysis (in ‘Classical’ Statistics) Review ■ Object-oriented Programming and Design Patterns in Python, Style Guide for Python Code ■ Working with Large Scale Datasets Note: The 2nd-7th blocks of the course cover theoretical background, Python (possibly also alternative tools) implementation, Python libraries, methods, procedures, and applications.
  • 31. 3. (ADVANCED) DATA MINING FOLLOW-UP ■ Advanced Web Scraping, Web Crawling, Web Automation ■ Probabilistic Graphical Models and Inference ■ Advanced Text Mining and Natural Language Processing ■ Pattern Recognition and Computer Vision
  • 32. 4. MACHINE LEARNING ■ Getting Data and Preprocessing ■ Dimensionality Reduction ■ Feature Selection ■ Model Learning and Testing ■ Supervised and Semi-supervised Learning ■ Classification and Regression Algorithms ■ Unsupervised Learning ■ Clustering and Dimensionality Reduction Algorithms ■ Recommender Systems ■ Collective Intelligence and Filtering Methods ■ Neural Networks and Deep Learning ■ Reinforcement Learning ■ Evolutionary Algorithms
  • 33. 5. BIG DATA PROCESSING ■ MapReduce ■ Spark and Hadoop Ecosystems ■ Distributed Storage ■ Cluster Computing ■ Cloud Computing
  • 34. 6. SERVER-SIDE PROGRAMMING ■ Developing Data Products and SaaS ■ Data Stream Mining ■ Deployment, Testing, Error Handling, Optimization, Licensing
  • 35. 7. APPLICATIONS & GETTING OUT OF YOUR COMFORT ZONE ■ Simulations ■ Artificial Intelligence ■ Big Data in Business, Finance, Healthcare, Education, Agriculture, Marketing, Astronomy, Natural Sciences, Geosciences, Social Sciences, ... ■ The Remaining 99% of Data Science and Data Science Buzzwords
  • 36. LITERATURE WORK IN PROGRESS (REARRANGEMENT, COMPLETION, CITATION)
  • 37. LITERATURE The literature is common to both courses (we’ll start with the basics and get to the more difficult topics in the follow-up course). The students are not required to read any of the following publications but might find them handy when looking for inspiration, reference, or sample code, or when some part of the course catches their interest and they want to follow it up with more in-depth self-directed study. Further online /paperback study resources, tutorials, libraries, frameworks, and other tools will be introduced within specific topics of the course. Note: This list is by no means comprehensive and we’ll be able to give you a (much) more targeted recommendation if you tell us where you are (regarding your current knowledge and skills in a particular area) and where you want to be. On top of that: “Practice, practice, practice.”
  • 38. DATA SCIENCE, DATA MINING, DATA ANALYSIS [01] Doing Data Science [02] Data Science from Scratch [03] Python For Data Analysis [04] Learning Data Mining with Python [05] A Programmer's Guide to Data Mining [06] Data Analysis with Open Source Tools [07] Practical Data Analysis [08] Bad Data Handbook [09] Practical Data Science Cookbook [10] Data Mining: The Textbook
  • 39. DATA SCIENCE, DATA MINING, DATA ANALYSIS [11] Data Mining for the Masses [12] Data Smart [13] Superforecasting [14] Python Data Science Cookbook [15] Mastering Python for Data Science [16] Python Data Science Handbook [expected] [17] Python Data Science Essentials [18] Foundations for Analytics with Python [expected] [19] Mastering Python Data Analysis [expected]
  • 40. BAYESIAN STATISTICS AND PGMs [20] Think Bayes [21] Bayesian Data Analysis [22] Bayesian Methods for Hackers [23] Learning Bayesian Networks [24] Probabilistic Graphical Models [25] Building Probabilistic Graphical Models with Python [26] The Signal and The Noise
  • 41. SOCIAL NETWORK ANALYSIS [27] Analyzing Social Media Networks with NodeXL [28] Social Network Analysis for Startups
  • 42. SOCIAL AND SEMANTIC WEB [29] Mining the Social Web [30] Analyzing the Social Web [31] Web Scraping with Python [32] Learning Scrapy [33] Programming the Semantic Web [34] Linked Data [35] A Developer’s Guide to the Semantic Web [36] Social Media Mining with Python [expected] [37] Mastering Social Media Mining with Python [expected]
  • 43. TEXT MINING AND NLP [38] Natural Language Processing with Python [39] Python 3 Text Processing with NLTK 3 Cookbook [40] Mastering Natural Language Processing with Python [expected] [41] Speech and Language Processing [42] Natural Language Annotation
  • 44. IMAGE, SOUND, AND BEYOND [43] Image Processing and Acquisition using Python [44] Programming Computer Vision with Python [45] Practical Computer Vision with SimpleCV [46] OpenCV for Secret Agents [47] Introduction to Sound Processing [48] Python for Signal Processing [49] Think Digital Signal Processing [50] Learning Geospatial Analysis with Python [51] Python Scripting for ArcGIS [52] Python for Secret Agents [53] Internet of Things with Python [expected]
  • 45. DATA VISUALIZATION [54] The Visual Display of Quantitative Information [55] Envisioning Information [56] Beautiful Visualization [57] D3.js in Action [58] Interactive Data Visualization for the Web [59] Data Visualization with D3.js Cookbook [60] Data Visualization with JavaScript [61] HTML5 Graphics Data Visualization Cookbook [62] Python Data Visualization Cookbook [63] Data Visualization Cookbook [expected]
  • 46. DATA VISUALIZATION [64] Visualizing Data [65] Making Sense of Data
  • 47. MACHINE LEARNING AND ALGORITHMS [66] Learning From Data [67] Python Machine Learning [68] Building Machine Learning Systems with Python [69] Introduction to Machine Learning with Python [expected] [70] Machine Learning in Python [71] Applied Predictive Modeling [72] Think Machine Learning [expected] [73] An Introduction to Statistical Learning with Applications in R [74] The Elements of Statistical Learning [75] The Top Ten Algorithms in Data Mining
  • 48. MACHINE LEARNING AND ALGORITHMS [76] Data Mining and Analysis [77] Data Mining: Practical Machine Learning Tools and Techniques [78] Machine Learning [79] Mastering Machine Learning with scikit-learn [80] scikit-learn Cookbook [81] Programming Collective Intelligence [82] Practical Recommender Systems [expected] [83] Machine Learning: A Probabilistic Perspective [84] Neural Networks and Deep Learning [85] Fundamentals of Deep Learning [expected] [86] Deep Learning: A Practitioner's Approach [expected]
  • 49. MACHINE LEARNING AND ALGORITHMS [87] Pattern Recognition and Machine Learning [88] Machine Learning: The Art and Science of Algorithms that Make Sense of Data [89] Designing Machine Learning Systems with Python [90] Real-World Machine Learning [expected] [91] Practical Machine Learning [92] Python Machine Learning Cookbook [expected] [93] Introducing Data Science: Big Data, Machine Learning, and More, Using Python Tools [expected] [94] Large Scale Machine Learning with Python [expected] [95] Machine Learning for the Web [expected]
  • 50. BIG DATA AND CLOUD COMPUTING [96] Mining of Massive Datasets [97] Data Algorithms [98] Big Data Principles and Best Practices [99] Learning Spark [100] Advanced Analytics with Spark [101] Fast Data Processing with Spark [expected] [102] Hadoop [103] Hadoop Application Architectures [104] Data-Intensive Text Processing with MapReduce [105] Python and HDF5
  • 51. BIG DATA AND CLOUD COMPUTING [106] Amazon Web Services in Action [107] Programming Amazon EC2 [108] Amazon Web Services For Dummies [109] Cloudera Admin Handbook [110] Real-Time Analytics
  • 52. APPLICATIONS AND OUT OF YOUR COMFORT ZONE [111] Data Science for Business [112] Python for Finance [113] Raspberry Pi Cookbook [114] Internet of Things [115] Bioinformatics Data Skills [116] Bioinformatics with Python Cookbook [117] Effective Computation in Physics [118] Artificial Intelligence [119] Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms [120] Artificial Intelligence for Humans, Volume 2: Nature-Inspired Algorithms [121] Artificial Intelligence for Humans, Volume 3: Deep Learning and Neural Networks (...)
  • 53. (WEB) APPLICATION DEVELOPMENT [122] Web Technologies [123] Flask Web Development [124] Instant Flask Web Development [125] The Architecture of Privacy [126] Data Jujitsu [127] Version Control with Git [128] Pro Git
  • 54. INTRODUCTORY STATISTICS AND MATHEMATICS [129] Think Stats [130] Statistics in a Nutshell [131] Doing Math with Python [132] Numerical Python [133] Mathematics for Computer Science [134] Mathematics for Computer Scientists [135] How to Lie with Statistics
  • 55. PYTHON PROGRAMMING [136] Learning Python [137] Dive Into Python [138] Learn Python the Hard Way [139] Real Python [140] Regular Expressions Cookbook [141] Python 3 Object-Oriented Programming
  • 56. DBMSs AND LANGUAGES [142] JavaScript: The Definitive Guide [143] JavaScript: The Good Parts [144] Learning SQL [145] MongoDB [146] NoSQL Distilled [147] Seven Databases in Seven Weeks [148] Graph Databases [149] Building Web Applications with Python and Neo4j
  • 57. DBMSs AND LANGUAGES [150] Redis Essentials [151] Elasticsearch [152] RDF Database Systems [153] Learning SPARQL
  • 58. ONLINE ■ pycon.org ■ pydata.org ■ conference.scipy.org ■ pyladies.com ■ kaggle.com ■ topcoder.com ■ github.com/vinta/awesome-python ■ stackexchange.com ■ github.com ■ reddit.com ■ programmableweb.com ■ w3schools.com ■ aws.amazon.com/documentation (...)
  • 59. Self-directed learners, those who prefer distance /blended learning, those who want to know more, or those who don’t want to rely on one source of information only might want to expand /complement /substitute different parts of the course on: youtube.com coursera.org ocw.mit.edu edx.org udacity.com online.stanford.edu extension.harvard.edu webcast.berkeley.edu nptel.ac.in blog.agupieware.com/2014/05/online-learning-bachelors-level.html class-central.com tutorialspoint.com iversity.org canvas.net futurelearn.com saylor.org novoed.com/courses edventis.com udemy.com lynda.com codecademy.com khanacademy.org howstuffworks.com wikipedia.org (...) oreilly.com packtpub.com manning.com eu.wiley.com elsevier.com nostarch.com store.elsevier.com/Syngress/IMP_76/ store.elsevier.com/Morgan-Kaufmann/IMP_16/ pearsoned.co.uk/imprints/addison-wesley/ pragprog.com springer.com apress.com mhprofessional.com (...) …and many other [yourfavoritesearchengine] it & learn it resources PS: Don't forget to share on the course forum the awesome resources you’ve found! (ideally resources that are freely available online to compensate for our conventional ‘backing-up-the-course-syllabus-using-lots-of-books’ approach =))