Sentiment Analysis in Machine Learning
Jennifer D. Davis, Ph.D.
Association for Computing Machinery (ACM), Austin Chapter
Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD)
June 2, 2015
Who uses sentiment analysis anyway?
What is sentiment analysis?
 Machine learning technique that classifies
comments and phrases by the sentiment they express,
trained on what is called a ‘corpus’: a group of
annotated texts in which words are given numerical weights
 Defined as:
 “Sentiment analysis (opinion mining) refers to the
use of natural language processing, text analysis and
computational linguistics to identify and extract
subjective information in source materials.”
Wikipedia
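To make the idea of numerical word weights concrete, here is a minimal sketch in Python. The four-word lexicon and its weights are invented for illustration; in practice they would be derived from an annotated corpus, and a real classifier would do more than add weights together.

```python
# Hypothetical word weights; a real lexicon would come from an annotated corpus.
LEXICON = {"great": 2.0, "good": 1.0, "boring": -1.0, "terrible": -2.0}


def score(text):
    """Sum the weights of known words; unknown words contribute 0."""
    return sum(LEXICON.get(word, 0.0) for word in text.lower().split())


def label(text):
    """Map the numeric score to a coarse sentiment label."""
    total = score(text)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


print(label("A great cast and a good script"))
print(label("Boring plot and terrible pacing"))
```

Summing weights ignores word order and negation, which is one reason the methods later in the deck go beyond a plain lexicon.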
Sentiment Analysis: Not your Mother’s Twitter Feed!
 Sentiment Analysis can be used to:
 Understand the intent behind language in an
unbiased manner
 Business areas that frequently use Sentiment
Analysis:
 Retail
 Entertainment
 Healthcare
 Any customer-centered organization
 Respond to customer complaints with better
solutions, a sort of virtual call center (e.g. Amelia)
Retail
 Introduce new products more successfully by
understanding culture & social media
 Understand and respond to customer needs using
internal data sources such as customer reviews or
feedback
 Develop new products based on customer wants and
needs as expressed in reviews, on-line and social media
Entertainment
 Create interest or excitement about movies by
understanding the market segment
 Target movie advertising or recommender systems
based on social commentary and collaborative
filtering
 Target advertising by gender, population or
cultural affinity.
Healthcare and Medical Treatment
 Healthcare:
 Learn about patient wellness:
 Potentially detect depression from journal entries
 Assist with patient adherence to treatment
 Learn about patient satisfaction and what is working
 Gather outcomes measures associated with patient
satisfaction
 This is a hot area of research and several academic
institutions are investing in research related to
patient outcomes and sentiment analysis.
What are the overall steps for sentiment analysis?
 Gather unstructured data from your own sources, web sources, databases
(healthcare.gov, surprisingly, has some) and competitions like Kaggle.
 Parse out unnecessary punctuation and “stop” words or phrases, perform
other pre-processing as needed or appropriate.
 Transform the words or phrases to a numerical representation such as a
vector
 Choose an appropriate classification algorithm. For example, Random Forest
often achieves high accuracy, but isn’t always computationally efficient. We
discussed several other methods previously.
 Apply your algorithm to a training set and, if enough data is available,
cross-validate. Tune the algorithm’s parameters to suit your features,
but avoid over-fitting.
 Apply the algorithm to test data (the fun part).
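The pre-processing, vectorizing and cross-validation steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's code: the stop-word list is a short invented sample, and the function names (`preprocess`, `build_vocabulary`, `vectorize`, `k_fold_indices`) are hypothetical helpers written for this example.

```python
import re
from collections import Counter

# Illustrative stop-word list; real projects usually use a much longer one.
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}


def preprocess(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(documents):
    """Assign each distinct token a column index, in sorted order."""
    vocab = sorted({tok for doc in documents for tok in preprocess(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(text, vocabulary):
    """Turn a document into a vector of word counts."""
    counts = Counter(preprocess(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


reviews = ["This movie was great, great fun!", "The plot is boring."]
vocab = build_vocabulary(reviews)
print(vocab)
print([vectorize(r, vocab) for r in reviews])
```

Each fold serves once as the validation set while the classifier is trained on the rest; the held-out test data is only touched after tuning is finished.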
What techniques can we use?
 Many are under development by machine-learning-focused
corporations and in academic linguistics
laboratories
 Often an ensemble of algorithms works best and is most
accurate
 Text data is often unstructured data. You will spend a
portion of time cleaning and organizing data. Not fun,
but necessary.
 Today we will very briefly give a high-level overview of 3
methods: (i) Bayesian probability classification, (ii)
Word2Vec and (iii) Recursive Neural Networks
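The ensemble idea mentioned on this slide can be sketched as a simple majority vote. The three rule-based classifiers below are toy stand-ins invented for this example; in practice each would be a trained model such as the ones introduced next.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the classifiers' predictions.

    Ties go to the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]


# Three deliberately simple, hypothetical classifiers.
def has_positive_word(text):
    words = ("great", "good", "fun")
    return "positive" if any(w in text.lower() for w in words) else "negative"


def lacks_negative_word(text):
    words = ("boring", "bad", "dull")
    return "negative" if any(w in text.lower() for w in words) else "positive"


def exclamation_heuristic(text):
    return "positive" if "!" in text else "negative"


CLASSIFIERS = [has_positive_word, lacks_negative_word, exclamation_heuristic]


def ensemble_predict(text, classifiers=CLASSIFIERS):
    """Ask every classifier for a label and return the majority label."""
    return majority_vote([clf(text) for clf in classifiers])


print(ensemble_predict("Great fun!"))
print(ensemble_predict("A good but dull film"))
```

Using an odd number of classifiers avoids ties in a two-class problem.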
Bayesian Probability and classification method
 Naïve Bayes classification applies Bayes’ theorem under
the assumption that all features are independent of
one another given the class
 Despite this simplification, it is surprisingly accurate in
most cases, and typically can yield accuracies of 70-80%
 You can read more about this in the textbook for
this course, “Building Machine Learning Systems
with Python”
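As a rough illustration of the independence assumption, here is a minimal Naïve Bayes classifier over word tokens in plain Python. It is a sketch, not the textbook's or the notebook's code: the four training examples are invented, and add-one (Laplace) smoothing is an assumption added so that unseen words do not zero out a class.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (tokens, label). Returns the counts the classifier needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(tokens, model):
    """Pick the label with the highest log posterior, using add-one smoothing."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior P(label)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # Independence assumption: per-word likelihoods multiply,
            # which becomes addition in log space.
            likelihood = (word_counts[label][tok] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training data.
training = [
    (["great", "fun", "movie"], "pos"),
    (["good", "acting"], "pos"),
    (["boring", "plot"], "neg"),
    (["bad", "boring", "movie"], "neg"),
]
model = train_naive_bayes(training)
print(predict(["boring", "movie"], model))
print(predict(["great", "acting"], model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.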
Word2vec “deep” learning method
 This method builds on the “Bag of Words” idea: it learns from the
words that surround each target word in semi-structured data
 Many tools are available in the scikit-learn and NLTK Python
libraries (we will show some in our Jupyter (IPython)
notebook)
 Invented by Google engineers, who describe it as a “tool [that
provides] an efficient implementation of the continuous
bag-of-words and skip-gram architectures for computing vector
representations of words”
 In other words (pun intended), each word is assigned a vector of
numbers that represents its meaning and usage
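The two architectures named in the quote differ in what they predict. The sketch below only shows how training examples are formed from a sentence for each one; it does not train any vectors. The function names and the `window` parameter are assumptions made for this illustration.

```python
def skip_gram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the center word.

    Returns (center, context) pairs for every word within `window` positions.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Continuous bag-of-words: predict the center word from its context.

    Returns (context_words, center) examples; word order in the context
    is ignored by the model, hence "bag of words".
    """
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples


sentence = ["the", "movie", "was", "fun"]
print(skip_gram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

A network trained on either kind of example ends up with one vector per word, and words that occur in similar contexts receive similar vectors.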
Recursive neural network method
 A convenient and well-regarded library is Stanford
University’s Natural Language Processing library.
 The method applies a neural network recursively over the structure of a
sentence, so it can distinguish between phrases based upon the order of
words & phrases
 For example “this movie has humor that could not be denied”
would be graded as positive whereas “this movie did not have
any humor whatsoever” would be graded as negative based
on order and choice of words & phrases.
 The Stanford NLP Group can be found at: nlp.stanford.edu; their live
demonstration is available at: nlp.stanford.edu/sentiment
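To show why word order matters in this kind of model, here is a toy sketch of recursive composition over a binary tree. It is not the Stanford model: the 2-dimensional word vectors, the weight matrix `W` and the bias `B` are hand-picked, hypothetical values, whereas a trained model would learn them from labeled parse trees and add a sentiment classifier on top of each node.

```python
import math

# Hypothetical word vectors and composition weights, chosen by hand.
WORD_VECTORS = {
    "not": [-1.0, 0.5],
    "funny": [1.0, 0.2],
    "very": [0.3, 0.9],
}
W = [[0.5, -0.2, 0.5, 0.1],   # 2 x 4 matrix applied to [left; right]
     [0.1, 0.4, -0.3, 0.6]]
B = [0.0, 0.0]


def compose(left, right):
    """Combine two child vectors into a parent vector: tanh(W [left; right] + b)."""
    children = left + right  # list concatenation
    return [math.tanh(sum(w * x for w, x in zip(row, children)) + b)
            for row, b in zip(W, B)]


def encode(tree):
    """Recursively encode a binary tree given as nested tuples of words."""
    if isinstance(tree, str):
        return WORD_VECTORS[tree]
    left, right = tree
    return compose(encode(left), encode(right))


print(encode(("not", ("very", "funny"))))
print(encode((("not", "very"), "funny")))
```

Because the same composition function is applied at every node, swapping two children or regrouping the tree changes the resulting phrase vector, which is what lets such a model treat "not funny" differently from "funny".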
So which do I choose?
 It depends upon the complexity of data you are
analyzing
 It depends upon the accuracy you desire versus
scalability (always a balancing act)
 It depends on your time frame and how you will
integrate the knowledge derived from using
sentiment analysis
 Out-of-the-box solutions can work, but sometimes
you will need to build your own
So now we can give it a try!
 A Jupyter Notebook has been created and can be accessed via
my GitHub account at:
https://github.com/jddavis-100/Statistics-and-Machine-Learning/
 Data is available at:
 Kaggle.com by joining the Kaggle Competition
 The test set was designed by me, and I can provide it to you or
Omar.
 Gather your own data from a number of APIs or web
crawlers, such as:
 Rotten Tomatoes API
 Twitter API
 Web-scraping tools such as Scrapy (Python tool available at
scrapy.org)
GitHub Repository
 Tutorial:
 https://github.com/jddavis-100/Statistics-and-Machine-Learning/wiki/Sentiment-Analysis--Class-for-ACM,-SIGKDD,-Austin-Chapter
 Repo: https://github.com/jddavis-100/Statistics-and-Machine-Learning

Editor's Notes

  1. http://www.entrepreneur.com/article/245827 Amelia is an AI platform that can sense human emotions and innuendo using sentiment analysis