bit.ly/ds-event
Data Science Your Vacation
Introductions
➔ What's your name?
➔ What brought you here today?
➔ What is your programming experience?
About Thinkful
We train web developers and
data scientists through 1x1
mentorship and project-based
learning.
Guaranteed.
Vacationing is fun.
Planning a vacation is
not.
Data Science can help.
Data, Vacation, and AI
To do that we need 3
things:
➔ What’s your name?
➔ What do you do?
➔ Why are you
interested in Data
Science?
What We’re Building
A text analyzer to take your
write-up of your dream vacation
and find your best match.
➔ Has information about
the hotel (name,
location, etc)
➔ Information about the
reviewer
➔ Review Text
➔ Rating
The Data
The data tonight is a sample of
reviews of 1000 hotels
collected by Datafiniti,
available on Kaggle here.
➔ Text processing is a
slow and involved
process
➔ This way we can make
a model and perform
matching in a
relatively quick
amount of time
Why Only 1000 Hotels?
Why is it slow?
Text data is often referred to as
'unstructured data'.
But what is structured data?
Let’s Talk About Text
This data is nice.
It's a table with
columns and we know what
to expect.
Structured Data
This data is not as nice.
It's unpredictable, varying in length
and we don't really know what's what.
It just kind of looks like one big
thing.
The text above (and this text here) is
unstructured data....
Unstructured Data
➔ What is a data
point?
➔ How do we compare
data?
➔ What parts of the
data matter?
The Problems with Unstructured Data
Unstructured data gives us a
few specific problems:
An Example
This is our test sentence.
So what parts of this sentence matter?
What are our data points?
An Example
This is our test sentence.
The words matter! And whitespace
gives us a way to find them.
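A minimal sketch of that idea, assuming plain Python and nothing beyond the built-in `str.split()`: splitting on whitespace turns the sentence into a list of words. Note that the period stays attached to the last word.

```python
sentence = "This is our test sentence."

# str.split() with no arguments splits on any run of whitespace.
tokens = sentence.split()

print(tokens)
```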
This is our test sentence.
An Example
This is our test sentence.
And this is a second sentence.
An Example
We’ve taken our data and
turned it into a table.
We added structure!
This is called a 'bag of words' approach.
(It's also called vectorizing.)
➔ We took our initial sentence and
created a bag for each word.
➔ Count the number of times we found a
word that matched.
➔ Words are columns, sentences are rows, and each cell is a count
Bag of Words
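The bag-of-words table for the two example sentences could be built roughly like this. This is a sketch using only the standard library; the names `tokenize`, `vocabulary`, and `rows` are made up for illustration and are not taken from the workshop notebook. No cleanup is applied yet, so 'This' and 'this' land in separate columns.

```python
from collections import Counter

sentences = [
    "This is our test sentence.",
    "And this is a second sentence.",
]

def tokenize(text):
    # Whitespace split only; case and punctuation are left untouched.
    return text.split()

# One column per distinct word, in the order the words are first seen.
vocabulary = []
for sentence in sentences:
    for token in tokenize(sentence):
        if token not in vocabulary:
            vocabulary.append(token)

# One row per sentence; each cell counts how often that word appears.
rows = []
for sentence in sentences:
    counts = Counter(tokenize(sentence))
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```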
Punctuation and Case
However, in looking at our example,
something should seem logically off.
This is our test sentence.
And this is a second sentence.
Punctuation and Case
Things like 'This' and 'this' are
not considered equal because the
computer doesn't see them as the
same. The case is a difference.
This is why you (almost) always
preprocess text data.
Back to the Example
This is our test sentence. ---> this is our test sentence
And this is a second sentence. ---> and this is a second sentence
Getting rid of case and punctuation
makes comparisons easier and more
effective (particularly on small
data)
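One way this preprocessing step might look in Python, using only the standard library. The function name `preprocess` is an illustrative choice, and real pipelines often handle more cases (accents, numbers, apostrophes) than this sketch does.

```python
import string

def preprocess(text):
    # Lowercase so that 'This' and 'this' compare equal.
    lowered = text.lower()
    # Delete every ASCII punctuation character.
    return lowered.translate(str.maketrans("", "", string.punctuation))

print(preprocess("This is our test sentence."))
print(preprocess("And this is a second sentence."))
```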
Stop Words
But there's more!
Some words don't matter. They don't
really tell us anything.
These are called 'stop words'.
Things like 'it', 'is', 'the' are
usually just thrown out.
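A small sketch of stop-word removal. The `STOP_WORDS` set below is a tiny hand-picked list for illustration only; libraries ship much longer lists, and which words count as stop words is a judgment call.

```python
# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "and", "is", "it", "our", "the", "this"}

def remove_stop_words(tokens):
    # Keep only the tokens that are not in the stop-word set.
    return [token for token in tokens if token not in STOP_WORDS]

first = remove_stop_words("this is our test sentence".split())
second = remove_stop_words("and this is a second sentence".split())

print(first)
print(second)
```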
Back to the Example
This is our test sentence. ---> this is our test sentence
And this is a second sentence. ---> and this is a second sentence
Now we have vectors of the essentials for each
sentence.
This is something we can build a model on!
The Model
Our model is going to be a Random Forest.
A random forest is an ensemble of
decision trees that predicts the most likely
class of an outcome variable.
What does that mean?
Decision Trees
A set of rules that get us to
a prediction, in the form of a
tree. You can think of it like
a computer building a version
of 20 questions.
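To make the "20 questions" picture concrete, here is a hand-written tree expressed as nested questions about word counts. In practice the computer learns the questions from data; the rules, thresholds, and labels below are invented purely for illustration.

```python
def recommend(word_counts):
    """Walk a tiny hand-written decision tree.

    word_counts maps each word to how often it appears in the write-up.
    """
    # Question 1: does the text mention 'beach' at all?
    if word_counts.get("beach", 0) >= 1:
        # Question 2: does it mention 'sun' two or more times?
        if word_counts.get("sun", 0) >= 2:
            return "beach resort"
        return "coastal town"
    # Question 3: does it mention 'museum'?
    if word_counts.get("museum", 0) >= 1:
        return "city break"
    return "countryside"

print(recommend({"beach": 1, "sun": 3}))
```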
Decision Trees: Golf?
Random Forest
A random forest builds a lot of different decision
trees and then lets each one vote.
Our questions will be things like "Contains the word
'beach'" or "Contains the word 'sun' 2 or more
times".
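The voting step can be sketched as follows. This shows only the majority vote: a real random forest also learns each tree from a random sample of the data and features, whereas the three "trees" here are hand-written one-question rules with made-up labels.

```python
from collections import Counter

# Three hypothetical one-question trees, each voting 'beach' or 'city'.
def tree_a(counts):
    return "beach" if counts.get("beach", 0) >= 1 else "city"

def tree_b(counts):
    return "beach" if counts.get("sun", 0) >= 2 else "city"

def tree_c(counts):
    return "city" if counts.get("museum", 0) >= 1 else "beach"

def forest_predict(counts, trees):
    # Each tree casts one vote; the most common answer wins.
    votes = Counter(tree(counts) for tree in trees)
    return votes.most_common(1)[0][0]

trees = [tree_a, tree_b, tree_c]
print(forest_predict({"beach": 1, "sun": 1}, trees))
```

With three trees and two possible answers there is always a clear majority, so this sketch does not need a tie-breaking rule.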
The Notebook
We're going to use a Google-hosted
Python notebook to build this model.
http://bit.ly/ideal-vacation
Our model has a few weaknesses:
➔ What about relative
frequency?
➔ What about context?
Let’s Talk About Text
It has a nice beach.
vs
10 pages of text that says the word beach once.
Relative Frequency
Each one scores a 1 for beach.
TF-IDF is the answer. It rates each
word by its relative frequency in a document
and down-weights words that are common
across all documents.
So the word beach in a ten-word
sentence counts more than one
mention in 10000 words.
http://bit.ly/tfidf-wiki
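A rough sketch of one common TF-IDF formulation, written with the standard library only. Several variants exist (libraries often add smoothing), so treat the exact formulas here as an assumption; the three-document corpus is also invented for illustration.

```python
import math

def tf(term, tokens):
    # Term frequency: share of the document's words that are this term.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Inverse document frequency: rarer across documents means a higher weight.
    # Assumes the term appears in at least one document.
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

short_doc = "it has a nice beach".split()
long_doc = ["beach"] + ["filler"] * 999   # one mention in 1000 words
other_doc = "the city has great museums".split()
corpus = [short_doc, long_doc, other_doc]

short_score = tfidf("beach", short_doc, corpus)
long_score = tfidf("beach", long_doc, corpus)
print(short_score > long_score)
```

The third document matters: if every document mentioned 'beach', this unsmoothed IDF would be log(1) = 0 and both scores would collapse to zero.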
Context
'I hate beaches and love cities'
vs
'I love beaches and hate cities'
Our model would see
these as the same thing.
Context / N-Grams
We can get a sense of context with
n-grams. Each feature is a sequence of
adjacent words rather than an individual
word.
So we'd get features like 'love
cities' and 'hate beaches' rather
than 'love' 'cities' 'hate'
'beaches'.
http://bit.ly/ngram-wiki
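A short sketch of how bigrams (n-grams with n = 2) separate the two example sentences. The helper name `ngrams` is an illustrative choice; the input is assumed to be already lowercased and stripped of punctuation.

```python
def ngrams(tokens, n):
    # Slide a window of n adjacent words across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "i hate beaches and love cities".split()
b = "i love beaches and hate cities".split()

# As bags of single words, the two sentences are identical.
print(sorted(a) == sorted(b))

# As bigrams, they differ.
bigrams_a = ngrams(a, 2)
bigrams_b = ngrams(b, 2)
print(bigrams_a)
print(bigrams_b)
```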
There’s A Lot More
This all falls under the banner of
Natural Language Processing, or
NLP, one of the largest and most
exciting fields of data science
and artificial intelligence.
It's the basis for things like
chatbots and Siri and the Turing
test itself. There is a lot of fun
to be had in this space.
Data Science at Thinkful
➔ Flexible, project-based curriculum to
help you become the data scientist
you want to be
➔ You don’t just learn skills, you get
to make things
➔ Mentor support from experts in the
industry
➔ Also, there's a job guarantee
Ways to Learn Data Science
➔ Start with Python and Statistics
➔ Personal Program Manager
➔ Unlimited Q&A Sessions
➔ Student Slack Community
➔ bit.ly/freetrial-ds
Thinkful Two-Week Free Trial
The Student Experience
Marnie Boyer, Thinkful Graduate
Capstone
Wolfgang Hall, Thinkful Graduate
Capstone
➔ bit.ly/tf-event-feedback
Survey
