Practical Machine Learning and Rails, Part 2
- 8. BUILDING TRAINING DATA:
NEGATIVE
is upset that he cant update his Facebook by texting it... and might cry as a result School today also. Blah!
I couldnt bear to watch it. And I thought the UA loss was embarrassing
I hate when I have to call and wake people up
POSITIVE
Just woke up. Having no school is the best feeling ever
Im enjoying a beautiful morning here in Phoenix
dropping molly off getting ice cream with Aaron
- 12. FEATURES:
BAG OF WORDS MODEL
split the text into words, create a dictionary,
and replace text with word counts
- 17. BAG OF WORDS
tweets: word vectors:
I ran fast [1 1 1 0 0 0]
Bob ran far [0 1 0 1 1 0]
I ran to Bob [1 1 0 1 0 1]
dictionary = %w{I ran fast Bob far to}
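The counting described above can be sketched in a few lines of plain Ruby (a minimal sketch; `tally` needs Ruby 2.7+; the example data is from the slide):

```ruby
# bag of words: build a dictionary from the corpus, then replace
# each text with a vector of word counts
tweets = ["I ran fast", "Bob ran far", "I ran to Bob"]

# dictionary: every unique word, in order of first appearance
dictionary = tweets.flat_map(&:split).uniq
# => ["I", "ran", "fast", "Bob", "far", "to"]

# each tweet becomes a count of every dictionary word
vectors = tweets.map do |tweet|
  counts = tweet.split.tally
  dictionary.map { |word| counts.fetch(word, 0) }
end
# => [[1, 1, 1, 0, 0, 0], [0, 1, 0, 1, 1, 0], [1, 1, 0, 1, 0, 1]]
```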
- 28-31. WEKA
• open source java app
• contains common ML algorithms
• gui interface
• can access it from jruby
• helps with:
• converting words into vectors
• training/test, cross-validation, metrics
- 36. SENTIMENT CLASSIFICATION EXAMPLE
https://github.com/ryanstout/mlexample
- 37. QUERYING
# assumes: java_import "java.io.FileReader", "weka.core.Instances",
#          "weka.core.SerializationHelper"
arff_path = Rails.root.join("data/sentiment.arff").to_s
arff = FileReader.new(arff_path)

# the model is a serialized java object
model_path = Rails.root.join("models/sentiment.model").to_s
classifier = SerializationHelper.read(model_path)

data = Instances.new(arff, 1).tap do |instances|
  # tell weka which attribute is the class label (here, the last one)
  if instances.class_index == -1
    instances.set_class_index(instances.num_attributes - 1)
  end
end
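The next step, per the editor's notes, is to create a sparse instance, set its dataset, and get the distribution (predicted values for each class). A hedged JRuby sketch, not runnable without weka.jar on the classpath; `SparseInstance` and the snake_cased method names come from Weka's Java API, and the word indices here are illustrative:

```ruby
# continuing from the QUERYING slide (JRuby with weka.jar loaded)
# java_import "weka.core.SparseInstance"

# one slot per attribute in the dataset
instance = SparseInstance.new(data.num_attributes)
instance.set_dataset(data)

# set counts for the words that occur in the tweet
# (indices into the bag-of-words dictionary; illustrative values)
instance.set_value(0, 1.0)
instance.set_value(1, 1.0)

# predicted probability for each class (e.g. [negative, positive])
distribution = classifier.distribution_for_instance(instance)
```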
- 41-43. HOW DO WE IMPROVE?
• bigger dictionary
• bi-grams/tri-grams
• part of speech tagging
• more data
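Bi-grams extend the bag of words with adjacent word pairs, as in the editor's notes ("the cat ran out the door" -> [the cat] [cat ran] [ran out] ...). A minimal Ruby sketch:

```ruby
# bi-grams: every pair of adjacent words in the text
def bigrams(text)
  text.split.each_cons(2).map { |pair| pair.join(" ") }
end

bigrams("the cat ran out the door")
# => ["the cat", "cat ran", "ran out", "out the", "the door"]
```

Tri-grams work the same way with `each_cons(3)`.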
- 46. Feature Generation
think about what information is valuable to an expert
remove data that isn't useful (attribute selection)
- 47. ATTRIBUTE SELECTION
[SHOW ATTRIBUTE SELECTION EXAMPLE]
- 49. DOMAIN PRICE PREDICTION
• predict how much a domain would sell for
- 57-58. FEATURES
• split domain by words
• generate features for each word
• how common the word is
• number of google results for each word
• cpc for the word
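The editor's notes assume a maximum of three words per domain, with 0's when there are fewer. A hedged Ruby sketch of that padding, assuming the domain has already been segmented into words and using a hypothetical `frequency` lookup (the Google-result and CPC numbers would come from external APIs):

```ruby
# one feature slot per word, up to a fixed maximum of three words
MAX_WORDS = 3

def word_features(words, frequency)
  # look up each word's score; unknown words get 0
  feats = words.first(MAX_WORDS).map { |w| frequency.fetch(w, 0) }
  # zero-pad so every domain yields a vector of the same length
  feats + [0] * (MAX_WORDS - feats.size)
end

freq = { "ice" => 120, "cream" => 95 }  # hypothetical frequency table
word_features(["ice", "cream"], freq)   # => [120, 95, 0]
```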
- 63. WHAT WE DIDN’T COVER
• collaborative filtering
• clustering
• theorem proving (classical AI)
- 64. ADDITIONAL RESOURCES
stanford machine learning class: ml-class.org
- 65. TOOLS
• weka
• libsvm, liblinear
• vowpal wabbit (big dictionaries)
• recommendify
• https://github.com/paulasmuth/recommendify
Editor's Notes
- having an example makes it easier to understand the process
- also could use movie/product review data
- bag of words: a way of generating features from text that only looks at which words occur in the text; doesn't look at word order, syntax, grammar, punctuation, etc.
- words in the dictionary array are replaced with the counts in the text
- word vectors/labels
- generated using RARFF
- load the arff; load the model (a serialized java object); load a dataset
- create a sparse instance, set the dataset; get distribution (predicted values for each class)
- "the cat ran out the door" -> [the cat] [cat ran] [ran out] ...
- assume a max of three words; each feature holds three words, with 0's if there are fewer
- clustering: similar documents, related terms
- vowpal: good for large datasets, contains different algorithms (matrix factorization, collab filtering, lda, etc.)
- hopefully this helped you learn the tools and techniques; you can teach yourself; feel free to contact us