Tf dsyv

bit.ly/ds-event
Data
Science
Your
Vacation

Introductions
➔ What's your name?
➔ What brought you here today?
➔ What is your programming experience?

About
Thinkful
We train web developers and
data scientists through 1x1
mentorship and project-based
learning.
Guaranteed.

Vacationing is fun.
Planning a vacation is
not.
Data Science can help.
Data,
Vacation,
and AI

To do that we need 3
things:
➔ What’s your name?
➔ What do you do?
➔ Why are you
interested in Data
Science?
What
We’re
Building
A text analyzer to take your
write-up of your dream vacation
and find your best match.

➔ Has information about
the hotel (name,
location, etc)
➔ Information about the
reviewer
➔ Review Text
➔ Rating
The
Data
The data tonight is a sample of
reviews of 1000 hotels
collected by Datafiniti,
available on Kaggle here.

➔ Text processing is a
slow and involved
process
➔ This way we can make
a model and perform
matching in a
relatively quick
amount of time
Why
Only
1000
Hotels?
Why is it slow?

Text data is often referred to as
'unstructured data'.
But what is structured data?
Let’s
Talk
About
Text

This data is nice.
It's a table with
columns and we know what
to expect.
Structured
Data

This data is not as nice.
It's unpredictable, varying in length
and we don't really know what's what.
It just kind of looks like one big
thing.
The text above (and this text here) is
unstructured data....
Unstructured
Data

➔ What is a data
point?
➔ How do we compare
data?
➔ What parts of the
data matter?
The
Problems
with
Unstructured
Unstructured data gives us a
few specific problems:

An
Example
This is our test sentence.
So what parts of this sentence matter?
What are our data points?

An
Example
The words matter! And whitespace
gives us a way to find them.
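
A minimal sketch of that idea in Python, assuming a plain whitespace split is good enough for a first pass:

```python
# Split the test sentence on whitespace to get candidate data points (words).
sentence = "This is our test sentence."
words = sentence.split()

# Note that the period stays attached to the last word.
print(words)
```

Splitting on whitespace alone leaves punctuation stuck to the words, which matters once we start comparing them.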

An
Example

And this is a second sentence.
An
Example
We’ve taken our data and
turned it into a table.
We added structure!

This is called a 'bag of words' approach.
(It's also called vectorizing.)
➔ We took our initial sentence and
created a bag for each word.
➔ Count the number of times we found a
word that matched.
➔ Words are columns, sentences are rows,
and each cell is a count
Bag
of
Words
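
A rough sketch of the bag of words table for the two example sentences, using only the Python standard library. The variable names are illustrative, and splitting on whitespace alone is a simplifying assumption:

```python
from collections import Counter

sentences = [
    "This is our test sentence.",
    "And this is a second sentence.",
]

# One bag (word -> count) per sentence, splitting on whitespace only.
bags = [Counter(s.split()) for s in sentences]

# The table's columns: every distinct word, in order of first appearance.
vocabulary = []
for s in sentences:
    for word in s.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One row per sentence; each cell counts how often that word occurs.
table = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for row in table:
    print(row)
```

With no cleanup at all, 'This' and 'this' end up in separate columns.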

Punctuation
and
Case
However, in looking at our example,
something should seem logically off.
And this is a second sentence.

Punctuation
and
Case
Things like 'This' and 'this' are
not considered equal because the
computer doesn't see them as the
same. The case is a difference.
This is why you (almost) always
preprocess text data.

Back
to
the
Example
This is our test sentence. ---> this is our test sentence
And this is a second sentence. ---> and this is a second sentence
Getting rid of case and punctuation
makes comparisons easier and more
effective (particularly on small
data)
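
One possible way to do this preprocessing in Python. The helper name preprocess is made up for this example, and it only handles ASCII punctuation:

```python
import string


def preprocess(text):
    """Lowercase the text and remove ASCII punctuation."""
    lowered = text.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))


print(preprocess("This is our test sentence."))
print(preprocess("And this is a second sentence."))
```

After this step 'This' and 'this' become the same word, and 'sentence.' becomes 'sentence'.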

Stop
Words
But there's more!
Some words don't matter. They don't
really tell us anything.
These are called 'stop words'.
Things like 'it', 'is', 'the' are
usually just thrown out.

Back
to
the
Example
This is our test sentence. ---> test sentence
And this is a second sentence. ---> second sentence
Now we have vectors of the essentials for each
sentence.
This is something we can build a model on!
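
A small sketch of stop word removal. The stop word list here is hand-picked for the two example sentences (an assumption); real lists are much longer, and the helper name essentials is hypothetical:

```python
import string
from collections import Counter

# A tiny stop word list chosen for this example only (an assumption).
STOP_WORDS = {"a", "and", "is", "it", "our", "the", "this"}


def essentials(text):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOP_WORDS]


for sentence in ["This is our test sentence.", "And this is a second sentence."]:
    # Count what is left: this is the vector for the sentence.
    print(Counter(essentials(sentence)))
```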

The
Model
Our model is going to be a Random Forest.
A random forest is an ensemble of
decision trees that together predict the
most likely class of an outcome variable.
What does that mean?

Decision
Trees
A set of rules that get us to
a prediction, in the form of a
tree. You can think of it like
a computer building a version
of 20 questions.
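
To make the 20 questions idea concrete, here is a tiny hand-written tree in Python. The questions and the trip labels are invented for illustration; a real decision tree learns its questions from the data rather than having them typed in:

```python
def suggest_trip(word_counts):
    """Walk a tiny hand-written decision tree over word counts."""
    # First question: does the text mention a beach at all?
    if word_counts.get("beach", 0) >= 1:
        # Follow-up question: is the sun mentioned two or more times?
        if word_counts.get("sun", 0) >= 2:
            return "tropical resort"
        return "seaside town"
    # No beach: ask about museums instead.
    if word_counts.get("museum", 0) >= 1:
        return "city break"
    return "mountain retreat"


print(suggest_trip({"beach": 1, "sun": 3}))
print(suggest_trip({"museum": 2}))
```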

Random
Forest
A random forest builds a lot of different decision
trees and then lets each one vote.
Our questions will be things like "Contains the word
'beach'" or "Contains the word 'sun' 2 or more
times".
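
The voting step can be sketched in a few lines of Python. The three "trees" below are hand-written stand-ins with made-up rules and labels; in a real random forest each tree is trained on a random sample of the data and features:

```python
from collections import Counter


# Three hand-written stand-ins for trained trees; each asks a different question.
def tree_a(counts):
    return "beach" if counts.get("beach", 0) >= 1 else "city"


def tree_b(counts):
    return "beach" if counts.get("sun", 0) >= 2 else "city"


def tree_c(counts):
    return "city" if counts.get("museum", 0) >= 1 else "beach"


def forest_predict(counts, trees=(tree_a, tree_b, tree_c)):
    """Let every tree vote and return the label with the most votes."""
    votes = Counter(tree(counts) for tree in trees)
    return votes.most_common(1)[0][0]


print(forest_predict({"beach": 1, "sun": 1, "museum": 1}))
print(forest_predict({"beach": 2, "sun": 2}))
```

Using an odd number of trees with two labels avoids tied votes in this toy setup.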

The
Notebook
We're going to use a Google-hosted
Python notebook to build this model.
http://bit.ly/ideal-vacation

Our model has a few weaknesses:
➔ What about relative
frequency?
➔ What about context?
Let’s
Talk
About
Text

It has a nice beach.
vs
10 pages of text that says the word beach once.
Relative
Frequency

Relative
Frequency
Each one scores a 1 for beach.
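TF-IDF is the answer. It weights each
word by how often it appears relative
to the length of the text, and gives
less weight to words that are common
across all texts.
So the word beach in a ten-word
sentence counts more than one
mention in 10,000 words.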
http://bit.ly/tfidf-wiki
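
A minimal TF-IDF sketch in plain Python. The function name tf_idf and the example documents are made up, and the smoothed IDF formula used here is just one common variant:

```python
import math


def tf_idf(term, doc, corpus):
    """Score `term` in `doc` (a list of words) against `corpus` (a list of docs)."""
    # Term frequency: share of the document's words that are this term.
    tf = doc.count(term) / len(doc)
    # Document frequency: how many documents contain the term at all.
    df = sum(1 for d in corpus if term in d)
    # Smoothed inverse document frequency (one common variant).
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf


short_doc = "it has a nice beach".split()
long_doc = ["filler"] * 9999 + ["beach"]  # 10,000 words, one mention of beach
other_doc = "we walked around the old city".split()
corpus = [short_doc, long_doc, other_doc]

short_score = tf_idf("beach", short_doc, corpus)
long_score = tf_idf("beach", long_doc, corpus)
print(short_score > long_score)
```

Both documents mention beach exactly once, but the short one scores far higher because the word makes up a much larger share of its text.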

Context
'I hate beaches and love cities'
vs
'I love beaches and hate cities'
Our model would see
these as the same thing.

Context
/
N-Grams
We can get a sense of context with
n-grams. Each feature is a set of
words rather than individual
words.
So we'd get features like 'love
cities' and 'hate beaches' rather
than 'love' 'cities' 'hate'
'beaches'.
http://bit.ly/ngram-wiki
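
A short sketch of building n-grams in Python, here with n = 2 (bigrams). The helper name ngrams is illustrative:

```python
def ngrams(text, n=2):
    """Return every run of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


a = "I hate beaches and love cities"
b = "I love beaches and hate cities"

# As single words, the two sentences look identical...
print(sorted(a.lower().split()) == sorted(b.lower().split()))

# ...but their bigrams are different.
print(ngrams(a))
print(ngrams(b))
```

Bigrams such as 'hate beaches' and 'love beaches' give the model something to tell the two sentences apart.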

There’s
A
Lot
More
This all falls under the banner of
Natural Language Processing, or
NLP, one of the largest and most
exciting fields of data science
and artificial intelligence.
It's the basis for things like
chatbots and Siri and the Turing
test itself. There is a lot of fun
to be had in this space.

Data
Science
at
Thinkful
➔ Flexible, project-based curriculum to
help you become the data scientist
you want to be
➔ You don’t just learn skills, you get
to make things
➔ Mentor support from experts in the
industry
➔ Also, there's a job guarantee

➔ Start with Python and Statistics
➔ Personal Program Manager
➔ Unlimited Q&A Sessions
➔ Student Slack Community
➔ bit.ly/freetrial-ds
Thinkful
Two-Week
Free
Trial

The
Student
Experience
Marnie Boyer, Thinkful Graduate
Capstone
Wolfgang Hall, Thinkful Graduate
Capstone

➔ bit.ly/tf-event-feedback
Survey
