Avito Duplicate Ads Detection @ kaggle
Alexey Grigorev
Team ololobhi (Abhishek & ololo)
Data set
● ~3 mln train pairs, ~1 mln test pairs
● ~10.8 mln images (~45 GB)
● Evaluation metric: AUC
● Fields: Title, Category_ID, Pictures, Price, Description, locationID, attrsJSON, Target
● No seller data
Process Overview
● ItemInfo train/test + images → Preprocessing (text cleaning, image raw features) → mongo
● mongo → Computing features (text comparisons, image comparisons) → CSVs with pair features (then train the models on them)
● Features computed in small batches (4k-20k rows per batch)
Features 1
● Simple Features
○ CategoryID (plain, no OHE)
○ Number of images
○ Absolute price difference
● Simple Text Features
○ Number of Russian / English / digit characters
○ Length of Title, Description
○ 2-4 ngram similarity on char level
○ Fuzzy string matches (via FuzzyWuzzy)
● Simple Picture Features
○ Channel statistics (min, mean, max, etc)
○ File size differences
○ Geometry matches
○ Num of exact matches via md5 hash
● Simple GEO Features
○ MetroID
○ LocationID
○ Euclidean distance
Example (channel statistics): stats 10, 5, 15 from one ad vs 1, 50 from the other → pairwise absolute differences |10 - 1|, |5 - 1|, |15 - 1|, |10 - 50|, |5 - 50|, |15 - 50| = 9, 4, 14, 40, 45, 35 → reshape → stats
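A minimal sketch of this comparison (the function and the exact set of summary statistics are assumptions; the slide only shows pairwise absolute differences, a reshape, and stats):

import numpy as np

def channel_stat_features(stats_a, stats_b):
    # Pairwise |difference| between per-image channel statistics of two ads,
    # flattened and summarized, as in the 10, 5, 15 vs 1, 50 example above.
    a = np.asarray(stats_a, dtype=float)              # e.g. [10, 5, 15]
    b = np.asarray(stats_b, dtype=float)              # e.g. [1, 50]
    diffs = np.abs(a[:, None] - b[None, :]).ravel()   # 9, 4, 14, 40, 45, 35 (in some order)
    return {'min': diffs.min(), 'mean': diffs.mean(), 'max': diffs.max()}

# channel_stat_features([10, 5, 15], [1, 50]) -> {'min': 4.0, 'mean': 24.5, 'max': 45.0}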
Features 2: Attributes
● Regularized Jaccard of values and key=value pairs
● Number of fields both ads didn’t fill
● TF-IDF on key=value pairs
○ dot product in TF-IDF space (norm=None) was better than cosine
● Cosine in SVD of TF-IDF
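The slides do not define “Regularized Jaccard”; a common smoothed variant adds a constant to the denominator so that two ads with only a couple of filled attributes do not get a spuriously high similarity. A minimal sketch under that assumption (the constant alpha is hypothetical):

def regularized_jaccard(a, b, alpha=5.0):
    # Jaccard similarity with an additive constant in the denominator,
    # damping the score when both sets are small.
    a, b = set(a), set(b)
    return len(a & b) / (len(a | b) + alpha)

# key=value pairs from the two ads' attrsJSON:
# regularized_jaccard({'brand=apple', 'color=black'}, {'brand=apple'})  # -> 1 / (2 + 5) ≈ 0.14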
Features 3: Text
● Jaccard & Cosine on digits only and on English tokens only
● Russian chars in English words
○ E.g. “о” in “iphоne” is Cyrillic
● Cosine in TF, TF-IDF, BM25, cosine of SVD of them
● Common tokens & differences:
○ Text1: “продам iphone”, Text2: “продам айфон”
○ Common: {продам}, Difference: {iphone, айфон}
○ Cosine in TF (binary), SVD of it
● Word2Vec & GloVe
○ Cosine and Manhattan distance between average title vectors
○ Stats of pairwise cosines between all tokens excluding the same ones
○ Tokens from title, description, title + description, nouns only
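A minimal sketch of the common/difference token split and the binary-TF cosine from the example above; scikit-learn is used here only for illustration, the slides do not name a library:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def token_overlap_features(text1, text2):
    # Common / difference token sets plus cosine in a binary term-frequency space.
    t1, t2 = set(text1.split()), set(text2.split())
    common, diff = t1 & t2, t1 ^ t2                 # {продам}, {iphone, айфон}
    X = CountVectorizer(binary=True).fit_transform([text1, text2])
    cos = cosine_similarity(X[0], X[1])[0, 0]
    return {'n_common': len(common), 'n_diff': len(diff), 'cosine_binary_tf': cos}

# token_overlap_features('продам iphone', 'продам айфон')
# -> {'n_common': 1, 'n_diff': 2, 'cosine_binary_tf': 0.5}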
Features 3 cont’d: Word Mover’s Distance
● “True” WMD is complex and slow
● “Poor Man’s” WMD is faster:
○ WMD(A, B): For each term in doc A take distance to closest term in doc B, sum over them
○ WMD_sym(A, B) = WMD(A, B) + WMD(B, A)
figure from https://github.com/mkusner/wmd
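A minimal sketch of the “Poor Man’s” WMD described above, assuming word vectors are already available as a dict of numpy arrays (all names here are hypothetical):

import numpy as np

def poor_mans_wmd(tokens_a, tokens_b, vectors):
    # For each term of doc A, Euclidean distance to the closest term of doc B, summed.
    total = 0.0
    for ta in tokens_a:
        total += min(np.linalg.norm(vectors[ta] - vectors[tb]) for tb in tokens_b)
    return total

def poor_mans_wmd_sym(tokens_a, tokens_b, vectors):
    # Symmetric version: WMD(A, B) + WMD(B, A).
    return poor_mans_wmd(tokens_a, tokens_b, vectors) + poor_mans_wmd(tokens_b, tokens_a, vectors)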
Features 3 cont’d: Misspellings
● Idea: the same author tends to make the same types of mistakes
○ No space after dot/comma (“продам айфон.дешево”)
○ Morphological errors
○ And others
● Represent ads as “Bag of Misspellings”
● Use Regularized Jaccard and Cosine
● Misspellings extracted with languagetool.org
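The misspellings themselves were extracted with LanguageTool; as a self-contained illustration, here is a hedged sketch that detects only one of the error types listed above (no space after a dot or comma) and builds the “bag of misspellings”, to be compared with regularized Jaccard or cosine as on the attributes slide:

import re

def missing_space_errors(text):
    # Tokens written without a space after '.' or ',', e.g. 'айфон.дешево'.
    return set(re.findall(r'\w+[.,]\w+', text))

bag1 = missing_space_errors('продам айфон.дешево')         # {'айфон.дешево'}
bag2 = missing_space_errors('отдам айфон.дешево, срочно')  # {'айфон.дешево'}
# regularized_jaccard(bag1, bag2) as defined earlier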
Features 4: Images
● Stuff that everybody used
○ Image hashes from the imagehash library and the forums
○ Chi2 & Bhattacharyya on histograms (with OpenIMAJ)
○ SIFT keypoints + matching (with OpenIMAJ)
○ Structural Similarity (computed with pyssim)
● Perceptual hashes computed with ImageMagick
○ Hashes computed on each channel separately and on the mean channel
● Image moments: Centroids (“Ellipses” in ImageMagick)
○ Centroids = centers of mass of each channel
○ Distances between image centroids in each channel
● Image moment invariants (ImageMagick)
○ 7 moments, invariant to translation, scale and rotation
○ Put all 7 invariants in a vector, compute cosine and distance
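A minimal sketch of two of these features for a single image pair, using the imagehash library mentioned above (the file paths and the choice of phash are illustrative):

import hashlib
import imagehash
from PIL import Image

def image_pair_features(path1, path2):
    # Exact-duplicate check via MD5 of the raw bytes, plus perceptual-hash Hamming distance.
    md5_1 = hashlib.md5(open(path1, 'rb').read()).hexdigest()
    md5_2 = hashlib.md5(open(path2, 'rb').read()).hexdigest()
    phash_distance = imagehash.phash(Image.open(path1)) - imagehash.phash(Image.open(path2))
    return {'md5_match': int(md5_1 == md5_2), 'phash_distance': int(phash_distance)}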
Features 5: GEO
● Reverse-geocode (lat, lon) to a location
● Features like same_region, same_city, same_zip
● |zip1 - zip2|
● Example: (53.87, 27.66) → (region, city, zip) via OpenStreetMap
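A hedged sketch of the reverse-geocoding step; the slides only say OpenStreetMap was used, so geopy’s Nominatim client here is an assumption (and with millions of pairs the original almost certainly used a local OSM setup rather than the public API):

from geopy.geocoders import Nominatim

geocoder = Nominatim(user_agent='avito-duplicates-demo')   # user_agent is required by Nominatim

def reverse_geocode(lat, lon):
    # (lat, lon) -> (region, city, zip) from the OpenStreetMap address record.
    addr = geocoder.reverse((lat, lon), language='en').raw.get('address', {})
    return addr.get('state'), addr.get('city'), addr.get('postcode')

# region1, city1, zip1 = reverse_geocode(53.87, 27.66)
# same_region = int(region1 == region2); zip_diff = abs(int(zip1) - int(zip2))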
Feature Selection
● Correlation
○ A lot of features, many of them correlated
○ Find groups of features with ≥ 0.90 correlation
○ Keep only one feature from each group
● XGBoost Feature Importance
○ https://github.com/Far0n/xgbfi
○ Run xgb on a sample with 100 trees
○ Use xgbfi to extract most important features
● Combined
○ In a correlated group, choose the most important feature using xgbfi output
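A minimal sketch of the combined step, assuming a pandas DataFrame of computed features and a dict of xgbfi importances (both variable names are hypothetical):

def drop_correlated(features_df, importance, threshold=0.90):
    # features_df: pandas DataFrame of pair features, importance: {feature: xgbfi score}.
    # Greedily group features with |correlation| >= threshold and keep only
    # the most important member of each group.
    corr = features_df.corr().abs()
    kept, dropped = [], set()
    for col in sorted(features_df.columns, key=lambda c: importance.get(c, 0.0), reverse=True):
        if col in dropped:
            continue
        kept.append(col)
        dropped.update(corr.index[(corr[col] >= threshold) & (corr.index != col)])
    return features_df[kept]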
Most Important Features
SVM fit on common & diff tokens
Models & Ensembling
● Parameter tuning
○ Random search
● My best model: 0.939 public LB
○ XGB with depth=8 and 2.5k trees
○ Trained for a few days
● Ensembling:
○ Sample a group of features
○ Randomly choose the parameters
○ Build ETs and XGBs
○ Stack with Log Reg (L2 regularization with low C)
● Our final model:
○ Neural network on the ET and XGB outputs + some selected 1st-level features
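A hedged sketch of the stacking step: out-of-fold predictions of the first-level ETs/XGBs become the inputs of a strongly regularized logistic regression (the fold count and C value are assumptions; only “L2 with low C” comes from the slide):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fit_stacker(X_meta, y):
    # X_meta: out-of-fold predictions of the ET / XGB models
    #         (plus a few selected first-level features); y: duplicate labels.
    stacker = LogisticRegression(penalty='l2', C=0.01)   # low C = strong regularization
    oof = cross_val_predict(stacker, X_meta, y, cv=3, method='predict_proba')[:, 1]
    stacker.fit(X_meta, y)
    return stacker, oof   # oof can be scored with AUC to validate the stack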
Lessons Learned
● It’s important to get CV right
○ My scheme: shuffled 3-fold (leaky)
○ Couldn’t use some nice features because of it
○ CV score of the ensemble was too good
○ Result: had to validate via the LB
○ The right scheme: split by connected components
● A lot of features is not always good
○ Computed too many features
○ Had a hard time using them all
○ Had to start stacking early
That’s a graph!
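Indeed: pairs share items, so a shuffled split can put the same ad into both train and validation folds. A minimal sketch of splitting by connected components instead, using networkx (an assumption; any union-find would do, and balancing fold sizes is left out):

import networkx as nx

def component_folds(pairs, n_folds=3):
    # Assign each pair to a fold by the connected component of its items,
    # so no ad ever ends up in more than one fold.
    g = nx.Graph()
    g.add_edges_from(pairs)                    # pairs of (itemID_1, itemID_2)
    component_of = {}
    for comp_id, comp in enumerate(nx.connected_components(g)):
        for item in comp:
            component_of[item] = comp_id
    # both items of a pair are in the same component by construction
    return [component_of[a] % n_folds for a, b in pairs]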
https://github.com/alexeygrigorev/avito-duplicates-kaggle
Questions?
