Tf dsyv
- 5. What We’re Building
A text analyzer to take your write-up of your dream vacation and find your best match.
To do that we need 3 things:
➔ What’s your name?
➔ What do you do?
➔ Why are you interested in Data Science?
- 6. The Data
The data tonight is a sample of reviews of 1000 hotels collected by Datafiniti, available on Kaggle.
➔ Has information about the hotel (name, location, etc.)
➔ Information about the reviewer
➔ Review Text
➔ Rating
- 7. Why Only 1000 Hotels?
➔ Text processing is a slow and involved process
➔ This way we can make a model and perform matching in a relatively short amount of time
Why is it slow?
- 8. Let’s Talk About Text
Text data is often referred to as 'unstructured data'.
But what is structured data?
- 9. Structured Data
This data is nice.
It's a table with columns and we know what to expect.
- 10. Unstructured Data
This data is not as nice.
It's unpredictable, varying in length, and we don't really know what's what. It just kind of looks like one big thing.
The text above (and this text here) is unstructured data.
- 11. The Problems with Unstructured
Unstructured data gives us a few specific problems:
➔ What is a data point?
➔ How do we compare data?
➔ What parts of the data matter?
- 15. An Example
This is our test sentence.
And this is a second sentence.
We’ve taken our data and turned it into a table.
We added structure!
- 16. Bag of Words
This is called a 'bag of words' approach. (It's also called vectorizing.)
➔ We took our initial sentence and created a bag for each word.
➔ Count the number of times we found a word that matched.
➔ Words are columns, sentences are rows, and each cell is a count
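Here is a minimal sketch of that counting step, using only Python's standard library. The helper name `bag_of_words` is made up for this example, and the tokenizer is an assumption: it splits on whitespace and nothing else, so case and punctuation are left exactly as written.

```python
from collections import Counter

def bag_of_words(sentences):
    """Return (vocabulary, rows) with one row of counts per sentence."""
    # Assumption for this sketch: tokens are whatever whitespace separates.
    token_lists = [s.split() for s in sentences]
    # Every distinct token becomes one column of the table.
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    rows = []
    for toks in token_lists:
        counts = Counter(toks)
        # A Counter returns 0 for tokens the sentence does not contain.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

sentences = ["This is our test sentence.", "And this is a second sentence."]
vocabulary, rows = bag_of_words(sentences)
print(vocabulary)
for row in rows:
    print(row)
```

Note that 'This' and 'this' end up in separate columns, and the final word is stored as 'sentence.' with its period attached.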
- 18. Punctuation and Case
Things like 'This' and 'this' are not considered equal because the computer compares them character by character, and the case is a difference. The same goes for 'sentence' and 'sentence.' with its trailing period.
This is why you (almost) always preprocess text data.
- 19. Back to the Example
This is our test sentence. ---> this is our test sentence
And this is a second sentence. ---> and this is a second sentence
Getting rid of case and punctuation makes comparisons easier and more effective (particularly on small data).
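One way to do this cleanup, sketched with the standard library only. The function name `normalize` is hypothetical, and it only strips ASCII punctuation (the characters in `string.punctuation`), which is an assumption that would not cover curly quotes or other Unicode marks.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Remove ASCII punctuation, then lowercase everything."""
    return text.translate(_STRIP_PUNCT).lower()

for sentence in ["This is our test sentence.", "And this is a second sentence."]:
    print(sentence, "--->", normalize(sentence))
```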
- 20. Stop Words
But there's more!
Some words don't matter. They don't really tell us anything.
These are called 'stop words'.
Things like 'it', 'is', 'the' are usually just thrown out.
- 21. Back to the Example
This is our test sentence. ---> this is our test sentence
And this is a second sentence. ---> and this is a second sentence
Now we have vectors of the essentials for each sentence.
This is something we can build a model on!
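A rough sketch of stop-word removal on the two example sentences. The stop list here is a tiny hand-picked assumption for illustration (real stop lists are much longer), and `essentials` is a hypothetical helper name.

```python
import string

# Tiny illustrative stop list; an assumption, not a standard list.
STOP_WORDS = {"a", "and", "is", "it", "our", "the", "this"}

def essentials(text):
    """Lowercase, strip ASCII punctuation, then drop stop words."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
    return [word for word in cleaned.split() if word not in STOP_WORDS]

print(essentials("This is our test sentence."))
print(essentials("And this is a second sentence."))
```

With this particular stop list, only 'test sentence' and 'second sentence' survive, which is the part of each sentence that actually tells them apart.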
- 22. The Model
Our model is going to be a Random Forest.
A random forest is an ensemble of decision trees that predicts the most likely class of an outcome variable.
What does that mean?
- 23. Decision Trees
A set of rules that get us to a prediction, in the form of a tree. You can think of it like a computer building a version of 20 questions.
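To make the 20-questions idea concrete, here is a toy tree written by hand. Everything in it is an assumption for illustration: the questions, the labels, and the nested-dict layout. A real decision tree learns its questions from data instead of having them typed in.

```python
# Internal nodes ask "does this word appear at least min_count times?".
# Leaves are plain strings holding the predicted label.
TREE = {
    "word": "beach", "min_count": 1,
    "yes": {
        "word": "sun", "min_count": 2,
        "yes": "sunny beach resort",
        "no": "beach hotel",
    },
    "no": "city hotel",
}

def predict(node, counts):
    """Walk from the root to a leaf, answering one question per level."""
    while isinstance(node, dict):
        answer = counts.get(node["word"], 0) >= node["min_count"]
        node = node["yes"] if answer else node["no"]
    return node

print(predict(TREE, {"beach": 1, "sun": 2}))
print(predict(TREE, {"museum": 3}))
```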
- 25. Random Forest
A random forest builds a lot of different decision trees and then lets each one vote.
Our questions will be things like "Contains the word 'beach'" or "Contains the word 'sun' 2 or more times".
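The voting step can be sketched in a few lines. This example only shows the vote: the three "trees" are hand-written, hypothetical one-question rules, whereas a real random forest learns each tree from a random sample of the data and usually builds many more of them.

```python
from collections import Counter

# Three stand-in trees, each asking a single question about word counts.
def tree_beach(counts):
    return "beach" if counts.get("beach", 0) >= 1 else "city"

def tree_sun(counts):
    return "beach" if counts.get("sun", 0) >= 2 else "city"

def tree_museum(counts):
    return "city" if counts.get("museum", 0) >= 1 else "beach"

FOREST = [tree_beach, tree_sun, tree_museum]

def forest_predict(counts):
    """Let every tree vote and return the most common answer."""
    votes = Counter(tree(counts) for tree in FOREST)
    return votes.most_common(1)[0][0]

# 'sun' appears only once, so one tree disagrees, but the majority wins.
print(forest_predict({"beach": 1, "sun": 1}))
```

Using an odd number of trees with two possible labels means the vote can never tie in this toy setup.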
- 27. Let’s Talk About Text
Our model has a few weaknesses:
➔ What about relative frequency?
➔ What about context?
- 28. Relative Frequency
It has a nice beach.
vs
10 pages of text that says the word beach once.
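- 29. Relative Frequency
Each one scores a 1 for beach.
TF-IDF (term frequency-inverse document frequency) is the answer. It rates each word by how often it appears relative to the length of the text, and gives less weight to words that show up in nearly every document.
So the word beach in a ten-word sentence counts more than one mention in 10000 words.
http://bit.ly/tfidf-wiki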
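A small worked version of that comparison, standard library only. The formula used here is one common textbook variant (term count divided by document length, times the natural log of total documents over documents containing the word); libraries often use smoothed variants, so treat the exact numbers as an assumption of this sketch. The three documents and the name `tf_idf` are made up for illustration.

```python
import math

def tf_idf(word, doc, corpus):
    """Score one word in one tokenized document against a corpus."""
    tf = doc.count(word) / len(doc)               # share of the document
    df = sum(1 for d in corpus if word in d)      # documents containing it
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

short_doc = "it has a nice beach".split()
long_doc = ["filler"] * 9999 + ["beach"]          # one mention in 10000 words
other_doc = "the city has great museums".split()
corpus = [short_doc, long_doc, other_doc]

short_score = tf_idf("beach", short_doc, corpus)
long_score = tf_idf("beach", long_doc, corpus)
print(short_score > long_score)
```

Both documents mention beach exactly once, but the short one scores far higher because that single mention is a much larger share of its text.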
- 30. Context
'I hate beaches and love cities'
vs
'I love beaches and hate cities'
Our model would see these as the same thing, because both sentences contain exactly the same words.
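A quick check of that claim, assuming a plain word-count representation built with `collections.Counter`:

```python
from collections import Counter

hates_beaches = Counter("i hate beaches and love cities".split())
loves_beaches = Counter("i love beaches and hate cities".split())

# Word order is lost in a bag of words, so the two counts are identical.
print(hates_beaches == loves_beaches)
```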
- 31. Context / N-Grams
We can get a sense of context with n-grams. Each feature is a sequence of adjacent words rather than an individual word.
So we'd get features like 'love cities' and 'hate beaches' rather than 'love', 'cities', 'hate', 'beaches'.
http://bit.ly/ngram-wiki
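A minimal sketch of pulling 2-grams (bigrams) out of a sentence. The helper name `ngrams` and the whitespace tokenizer are assumptions for this example.

```python
def ngrams(text, n=2):
    """Return every run of n adjacent words, joined by spaces."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

hates_beaches = ngrams("I hate beaches and love cities")
loves_beaches = ngrams("I love beaches and hate cities")

print(hates_beaches)
print(loves_beaches)
# Unlike single-word counts, the two feature sets now differ.
print(set(hates_beaches) == set(loves_beaches))
```

The only bigram the two sentences share is 'beaches and'; 'hate beaches' and 'love cities' appear in the first sentence only.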
- 32. There’s A Lot More
This all falls under the banner of Natural Language Processing, or NLP, one of the largest and most exciting fields of data science and artificial intelligence.
It's the basis for things like chatbots and Siri and the Turing test itself. There is a lot of fun to be had in this space.
- 33. Data Science at Thinkful
➔ Flexible, project-based curriculum to help you become the data scientist you want to be
➔ You don’t just learn skills, you get to make things
➔ Mentor support from experts in the industry
➔ Also, there's a job guarantee
- 35. Thinkful Two-Week Free Trial
➔ Start with Python and Statistics
➔ Personal Program Manager
➔ Unlimited Q&A Sessions
➔ Student Slack Community
➔ bit.ly/freetrial-ds