SENTIMENT CLASSIFICATION
Practical Machine Learning and Rails, Part 2
TRAINING DATA:
- tweets
- positive/negative
  - use emoticons from Twitter: :-) or :-(
BUILDING TRAINING DATA:
  NEGATIVE
  is upset that he cant update his Facebook by texting it... and might cry as a
  result School today also. Blah!
  I couldnt bear to watch it. And I thought the UA loss was embarrassing
  I hate when I have to call and wake people up


  POSITIVE
  Just woke up. Having no school is the best feeling ever
  Im enjoying a beautiful morning here in Phoenix
  dropping molly off getting ice cream with Aaron
FEATURES:
 BAG OF WORDS MODEL
 split the text into words, create a dictionary,
 and replace text with word counts
BAG OF WORDS
tweets:                   word vectors:
I ran fast                [1 1 1 0 0 0]
Bob ran far               [0 1 0 1 1 0]
I ran to Bob              [1 1 0 1 0 1]

   dictionary = %w{I ran fast Bob far to}
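
A minimal Ruby sketch of the same idea (not from the original deck): build the dictionary from the tweets, then replace each tweet with a vector of word counts over that dictionary.

tweets = ["I ran fast", "Bob ran far", "I ran to Bob"]

# build the dictionary from every word that appears in the tweets
dictionary = tweets.flat_map(&:split).uniq
# => ["I", "ran", "fast", "Bob", "far", "to"]

# replace each tweet with a vector of word counts over the dictionary
word_vectors = tweets.map do |tweet|
  counts = Hash.new(0)
  tweet.split.each { |word| counts[word] += 1 }
  dictionary.map { |word| counts[word] }
end
# => [[1, 1, 1, 0, 0, 0], [0, 1, 0, 1, 1, 0], [1, 1, 0, 1, 0, 1]]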
CLASSIFIER:
training examples: word vectors -> labels
            ↓
classification algorithm
            ↓
          model
WEKA
• open source Java app
• contains common ML algorithms
• GUI interface
• can access it from JRuby
• helps with:
    • converting words into vectors (sketch below)
    • training/test splits, cross-validation, metrics
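
A hedged JRuby sketch of the "converting words into vectors" point above, using Weka's StringToWordVector filter. It assumes `data` is a Weka Instances object with a string text attribute, as in the querying code later on.

require 'java'
java_import 'weka.filters.Filter'
java_import 'weka.filters.unsupervised.attribute.StringToWordVector'

filter = StringToWordVector.new
filter.set_input_format(data)                 # data: Instances with a string text attribute
vectorized = Filter.use_filter(data, filter)  # each word becomes a numeric count attribute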
ARFF FILE
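
The deck doesn't include the file itself (the speaker notes say it was generated with RARFF); a minimal sketch of what a sentiment ARFF file could look like, with a string attribute for the text and a nominal class attribute:

@relation sentiment

@attribute text string
@attribute class {negative,positive}

@data
'I hate when I have to call and wake people up',negative
'Im enjoying a beautiful morning here in Phoenix',positive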
TRAINING IN WEKA

[SHOW EXAMPLE HERE]
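
The talk shows this step live in the Weka GUI. As a hedged JRuby alternative (not the presenters' code), training could look roughly like this, using a FilteredClassifier so the string attribute is converted to word counts automatically; NaiveBayes is just one possible classifier choice.

require 'java'
java_import 'java.io.FileReader'
java_import 'weka.core.Instances'
java_import 'weka.core.SerializationHelper'
java_import 'weka.classifiers.meta.FilteredClassifier'
java_import 'weka.classifiers.bayes.NaiveBayes'
java_import 'weka.filters.unsupervised.attribute.StringToWordVector'

# load the training data and mark the last attribute as the class label
data = Instances.new(FileReader.new("data/sentiment.arff"))
data.set_class_index(data.num_attributes - 1)

# wrap a classifier in a bag-of-words filter and train it
classifier = FilteredClassifier.new
classifier.set_filter(StringToWordVector.new)
classifier.set_classifier(NaiveBayes.new)
classifier.build_classifier(data)

# serialize the trained model so the Rails app can load it later
SerializationHelper.write("models/sentiment.model", classifier)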
EVALUATION
• correctly classified
• mean squared error
• false negatives/positives
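
A hedged JRuby sketch of getting those numbers from Weka's Evaluation class via 10-fold cross-validation; it assumes `classifier` and `data` from the training sketch above.

java_import 'weka.classifiers.Evaluation'

evaluation = Evaluation.new(data)
evaluation.cross_validate_model(classifier, data, 10, java.util.Random.new(1))

puts evaluation.to_summary_string   # correctly classified %, error metrics, ...
puts evaluation.to_matrix_string    # confusion matrix: false positives/negatives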
SENTIMENT CLASSIFICATION EXAMPLE
https://github.com/ryanstout/mlexample
QUERYING

require 'java'
java_import 'java.io.FileReader'
java_import 'weka.core.Instances'
java_import 'weka.core.SparseInstance'
java_import 'weka.core.SerializationHelper'

# load the ARFF file that defines the attribute structure
arff_path = Rails.root.join("data/sentiment.arff").to_s
arff = FileReader.new(arff_path)

# load the trained model (a serialized Java object)
model_path = Rails.root.join("models/sentiment.model").to_s
classifier = SerializationHelper.read(model_path)

# build the dataset from the ARFF header and mark the last attribute as the class
data = Instances.new(arff, 1).tap do |instances|
  if instances.class_index == -1
    instances.set_class_index(instances.num_attributes - 1)
  end
end

# create a sparse instance for the incoming text and attach it to the dataset
instance = SparseInstance.new(data.num_attributes)
instance.set_dataset(data)
instance.set_value(data.attribute(0), params[:sentiment][:message])

# distribution_for_instance returns the predicted probability of each class;
# the first value is the probability of the first (negative) class
result = classifier.distribution_for_instance(instance).first
percent_positive = 1 - result.to_f

@message = "The text is #{(percent_positive * 100.0).round}% positive"
HOW DO WE IMPROVE?

• bigger dictionary
• bi-grams/tri-grams (sketch below)
• part-of-speech tagging
• more data
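
A minimal Ruby sketch of the bi-gram idea, using the example from the speaker notes ("the cat ran out the door"):

def bigrams(text)
  words = text.downcase.split
  words.each_cons(2).map { |pair| pair.join(" ") }
end

bigrams("the cat ran out the door")
# => ["the cat", "cat ran", "ran out", "out the", "the door"]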
Feature Generation

think about what information is valuable to an expert
remove data that isn't useful (attribute selection)
ATTRIBUTE SELECTION

[SHOW ATTRIBUTE SELECTION EXAMPLE]
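
The talk demos attribute selection in the Weka GUI; a hedged JRuby sketch of doing the same thing programmatically, ranking attributes by information gain (assumes `data` is the training Instances):

java_import 'weka.attributeSelection.AttributeSelection'
java_import 'weka.attributeSelection.InfoGainAttributeEval'
java_import 'weka.attributeSelection.Ranker'

selector = AttributeSelection.new
selector.set_evaluator(InfoGainAttributeEval.new)  # score each attribute by information gain
selector.set_search(Ranker.new)                    # rank attributes by that score
selector.SelectAttributes(data)

puts selector.to_results_string                    # shows which attributes are worth keeping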
DOMAIN PRICE PREDICTION

• predict how much a domain would sell for
TRAINING DATA

• domains
• historical sale prices for domains
FEATURES
• split domain by words
• generate features for each word (sketch below)
   • how common the word is
   • number of Google results for each word
   • CPC for the word
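
A hedged Ruby sketch of assembling those per-word stats into a fixed-length feature vector. The speaker notes assume a maximum of three words and pad with zeros when a domain has fewer; the stats themselves (word frequency, Google result count, CPC) are looked up elsewhere, and the numbers below are made up for illustration.

MAX_WORDS = 3   # assumption from the speaker notes: pad out to three words

# word_stats: one [frequency, google_results, cpc] triple per word in the domain
def domain_features(word_stats)
  padded = (word_stats.first(MAX_WORDS) + [[0, 0, 0]] * MAX_WORDS).first(MAX_WORDS)
  padded.flatten
end

domain_features([[0.8, 120_000_000, 1.4], [0.6, 90_000_000, 2.1]])
# => [0.8, 120000000, 1.4, 0.6, 90000000, 2.1, 0, 0, 0]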
ALGORITHM

support vector regression
   (functions > SMOreg in Weka)
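
A hedged JRuby sketch of using SMOreg; it assumes `data` is an Instances object whose class attribute is the historical sale price, and `instance` is a new domain's feature vector built as above.

java_import 'weka.classifiers.functions.SMOreg'

model = SMOreg.new                    # support vector regression
model.build_classifier(data)          # class attribute = historical sale price
predicted_price = model.classify_instance(instance)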
WHAT WE DIDN'T COVER

• collaborative filtering
• clustering
• theorem proving (classical AI)
ADDITIONAL RESOURCES

Stanford machine learning class: ml-class.org
TOOLS
• Weka
• LIBSVM, LIBLINEAR
• Vowpal Wabbit (big dictionaries)
• recommendify (https://github.com/paulasmuth/recommendify)
QUESTIONS

contact us on Twitter at @tectonic and @ryanstout


Editor's Notes

• Having an example makes it easier to understand the process.
• For training data, you could also use movie/product review data.
• Bag of words is a way of generating features from text that only looks at which words occur in the text; it doesn't look at word order, syntax, grammar, punctuation, etc.
• Words in the dictionary array are replaced with their counts in the text.
• Training examples pair word vectors with labels.
• The ARFF file was generated using RARFF.
• Querying: load the ARFF, load the model (a serialized Java object), and load a dataset.
• Create a sparse instance, set the dataset, and get the distribution (predicted values for each class).
• Bi-grams: "the cat ran out the door" becomes [the cat] [cat ran] [ran out] ...
• Domain features: assume a maximum of three words; each word gets its own set of features, padded with 0s when there are fewer words.
• Clustering: similar documents, related terms.
• Vowpal Wabbit is good for large datasets and contains different algorithms (matrix factorization, collaborative filtering, LDA, etc.).
• Hopefully this helped you get to know the tools and techniques; you can teach yourself the rest. Feel free to contact us.