Avito Duplicate Ads Detection @ kaggle
- 2. Data set
● ~3 mln train pairs, ~1 mln test pairs
● ~10.8 mln images (~45 GB)
● Target: is the pair a duplicate?
● Evaluation metric: AUC
- 4. Process Overview
● Pipeline: ItemInfo train/test + images → Preprocessing → Computing features → CSVs with pair features (then train the models on them)
● Preprocessing (text cleaning, image raw features), results stored in mongo
● Computing features: text comparisons, image comparisons
● Features computed in small batches (4k-20k rows per batch)
- 5. Features 1
● Simple Features
○ CategoryID (plain, no OHE)
○ Number of images
○ Absolute price difference
● Simple Text Features
○ Num of Russian/English/digit chars
○ Length of title, description
○ Character-level 2-4 n-gram similarity
○ Fuzzy string matches (via FuzzyWuzzy)
● Simple Picture Features
○ Channel statistics (min, mean, max, etc)
○ File size differences
○ Geometry matches
○ Num of exact matches via MD5 hash
● Simple GEO Features
○ MetroID
○ LocationID
○ Euclidean distance
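The character-level 2-4 n-gram similarity above can be sketched as a Jaccard overlap of n-gram sets; a minimal version (function names are mine, not from the slides):

```python
def char_ngrams(text, n_min=2, n_max=4):
    """Set of character n-grams of lengths n_min..n_max."""
    text = text.lower()
    return {text[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}

def ngram_jaccard(a, b):
    """Jaccard similarity of two ads' character n-gram sets."""
    na, nb = char_ngrams(a), char_ngrams(b)
    if not na and not nb:
        return 1.0  # two empty strings count as identical
    return len(na & nb) / len(na | nb)
```

Character n-grams are robust to small edits and word reorderings, which is why they work well on noisy ad titles.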
- 6. (Figure: channel statistics example — stats of image 1 are {10, 5, 15}, of image 2 are {1, 50}; take all pairwise absolute differences |10 - 1|, |5 - 1|, |15 - 1|, |10 - 50|, |5 - 50|, |15 - 50| = {9, 4, 14, 40, 45, 35}, reshape into a single vector, then compute summary stats over it.)
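The figure's pairwise-difference-then-stats trick can be sketched in NumPy (a minimal sketch; the exact set of summary stats used in the competition is not given on the slide):

```python
import numpy as np

def pairwise_diff_stats(stats_a, stats_b):
    """All pairwise absolute differences between two images'
    channel statistics, summarized into fixed-size features."""
    a = np.asarray(stats_a, dtype=float)
    b = np.asarray(stats_b, dtype=float)
    # broadcast to the full |a_i - b_j| matrix, then flatten ("reshape")
    diffs = np.abs(a[:, None] - b[None, :]).ravel()
    return {"min": diffs.min(), "mean": diffs.mean(), "max": diffs.max()}

# The slide's example: stats {10, 5, 15} vs {1, 50}
feats = pairwise_diff_stats([10, 5, 15], [1, 50])
```

This turns two variable-length stat lists into a fixed number of features, regardless of how many stats each image contributes.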
- 7. Features 2: Attributes
● Regularized Jaccard of values and key=value pairs
● Number of fields both ads didn’t fill
● TF-IDF on key=value pairs
○ dot product in TF-IDF space (norm=None) was better than cosine
● Cosine in SVD of TF-IDF
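A regularized Jaccard over key=value pairs might look like the following; the additive smoothing constant `alpha` is my assumption, since the slides don't specify the regularization:

```python
def regularized_jaccard(a, b, alpha=5.0):
    """Jaccard with an additive penalty so tiny sets (ads with few
    filled attributes) can't produce a spuriously high score.
    alpha is an assumed smoothing constant, not from the slides."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / (len(a | b) + alpha)

kv1 = {"brand=apple", "model=iphone", "color=black"}
kv2 = {"brand=apple", "model=iphone"}
sim = regularized_jaccard(kv1, kv2)
```

Without the penalty, two ads that each filled a single identical field would score a perfect 1.0; the regularizer damps exactly that case.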
- 8. Features 3: Text
● Jaccard & Cosine on digits only and on English tokens only
● Russian chars in English words
○ E.g. “о” in “iphоne” is Cyrillic
● Cosine in TF, TF-IDF and BM25 spaces, plus cosine of their SVD projections
● Common tokens & differences:
○ Text1: “продам iphone”, Text2: “продам айфон”
○ Common: {продам}, Difference: {iphone, айфон}
○ Cosine in TF (binary), SVD of it
● Word2Vec & GloVe
○ Cosine and Manhattan distances between average title vectors
○ Stats of pairwise cosines between all tokens, excluding identical ones
○ Tokens from title, description, title + description, nouns only
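The "Russian chars in English words" feature can be sketched as counting tokens that are mostly Latin but contain Cyrillic look-alikes (a minimal sketch; the mostly-Latin threshold is my assumption):

```python
def cyrillic_in_latin_words(text):
    """Count tokens that are mostly Latin letters but contain
    Cyrillic look-alikes, e.g. the Cyrillic 'о' in 'iphоne'."""
    def is_cyr(c):
        return '\u0400' <= c <= '\u04ff'
    def is_lat(c):
        return ('a' <= c <= 'z') or ('A' <= c <= 'Z')
    count = 0
    for token in text.split():
        letters = [c for c in token if is_cyr(c) or is_lat(c)]
        n_cyr = sum(is_cyr(c) for c in letters)
        # flag tokens that are mostly Latin with some Cyrillic mixed in
        if letters and 0 < n_cyr < len(letters) / 2:
            count += 1
    return count
```

Sellers sometimes swap look-alike characters to evade exact-match deduplication, so both duplicates sharing this quirk is a useful signal.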
- 9. Features 3 cont’d: Word Mover’s Distance
● “True” WMD is complex and slow
● “Poor Man’s” WMD is faster:
○ WMD(A, B): For each term in doc A take distance to closest term in doc B, sum over them
○ WMD_sym(A, B) = WMD(A, B) + WMD(B, A)
figure from https://github.com/mkusner/wmd
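The "Poor Man's" WMD described above can be sketched directly; `vectors` is assumed to be a dict of word embeddings (e.g. from Word2Vec):

```python
import numpy as np

def poor_mans_wmd(doc_a, doc_b, vectors):
    """For each term in doc A, take the distance to the closest
    term in doc B, and sum over terms."""
    total = 0.0
    for ta in doc_a:
        total += min(np.linalg.norm(vectors[ta] - vectors[tb])
                     for tb in doc_b)
    return total

def poor_mans_wmd_sym(doc_a, doc_b, vectors):
    """Symmetrized version: WMD(A, B) + WMD(B, A)."""
    return (poor_mans_wmd(doc_a, doc_b, vectors)
            + poor_mans_wmd(doc_b, doc_a, vectors))
```

Unlike true WMD, this greedy nearest-term matching needs no optimal-transport solver, so it runs in O(|A|·|B|) distance evaluations per pair.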
- 10. Features 3 cont’d: Misspellings
● Idea: same author can make same types of mistakes
○ No space after dot/comma (“продам айфон.дешево”)
○ Morphological errors
○ And others
● Represent ads as “Bag of Misspellings”
● Use Regularized Jaccard and Cosine
● Misspellings extracted with languagetool.org
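The real pipeline extracted misspellings with languagetool.org; as a toy illustration, here is one error type from the slide (missing space after punctuation) turned into a bag entry and compared with plain Jaccard:

```python
import re

def misspelling_bag(text):
    """Toy extractor for one error type from the slide: no space
    after '.' or ',' (the real pipeline used languagetool.org)."""
    bag = set()
    if re.search(r'[а-яa-z][.,][а-яa-z]', text, re.IGNORECASE):
        bag.add("no_space_after_punct")
    return bag

def bag_jaccard(a, b):
    """Jaccard over the two ads' bags of misspelling types."""
    ba, bb = misspelling_bag(a), misspelling_bag(b)
    if not ba and not bb:
        return 0.0
    return len(ba & bb) / len(ba | bb)
```

With many error types in the bag, the same similarity acts as a stylometric fingerprint of the author.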
- 11. Features 4: Images
● Stuff that everybody used
○ Image hashes from the imagehash library and forums
○ Chi2 & Bhattacharyya distances on histograms (with OpenIMAJ)
○ SIFT keypoints + matching (with OpenIMAJ)
○ Structural Similarity (computed with pyssim)
● Perceptual hashes computed with ImageMagick
○ Hashes computed on each channel separately and on the mean channel
● Image moments: centroids (“Ellipses” in ImageMagick)
○ Centroids = centers of mass of each channel
○ Distances between image centroids in each channel
● Image moment invariants (ImageMagick)
○ 7 moments, invariant to translation, scale and rotation
○ Put all 7 invariants in a vector, compute cosine and distance
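The slides used imagehash and ImageMagick; as a self-contained stand-in, here is a minimal average hash (aHash), the simplest member of the perceptual-hash family (the block-mean downsampling is my simplification):

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """Minimal average hash: downsample a grayscale image to
    hash_size x hash_size block means, threshold at the mean,
    flatten to a bit vector."""
    h, w = gray.shape
    ys = np.linspace(0, h, hash_size + 1, dtype=int)
    xs = np.linspace(0, w, hash_size + 1, dtype=int)
    small = np.array([[gray[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
                       for j in range(hash_size)]
                      for i in range(hash_size)])
    return (small > small.mean()).ravel()

def hamming(h1, h2):
    """Differing bits — small values mean near-duplicate images."""
    return int(np.count_nonzero(h1 != h2))
```

Near-duplicate photos that were re-encoded or slightly resized land at small Hamming distances, which exact MD5 matching misses.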
- 12. Features 5: GEO
● Reverse-geocode (lat, lon) to a location
● Features like same_region, same_city, same_zip
● |zip1 - zip2|
(Example: reverse-geocoding (53.87, 27.66) → (region, city, zip) via OpenStreetMap.)
- 13. Feature Selection
● Correlation
○ A lot of features. Many correlated ones
○ Find feature groups of 0.90 correlation
○ Keep only one of the features
● XGBoost Feature Importance
○ https://github.com/Far0n/xgbfi
○ Run xgb on a sample with 100 trees
○ Use xgbfi to extract most important features
● Combined
○ In a correlated group, choose the most important feature using xgbfi output
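The combined selection step above can be sketched as a greedy pass over features ordered by importance (a minimal sketch; `importance` stands in for scores parsed from xgbfi output):

```python
import numpy as np

def drop_correlated(X, names, importance, threshold=0.90):
    """X: (n_samples, n_features) array; names: feature names;
    importance: name -> score (e.g. parsed from xgbfi output).
    Keeps one feature per correlated group, preferring the most
    important one."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    # visit features from most to least important
    order = sorted(range(len(names)),
                   key=lambda i: -importance.get(names[i], 0.0))
    keep, dropped = [], set()
    for i in order:
        if i in dropped:
            continue
        keep.append(names[i])
        # drop everything highly correlated with the kept feature
        for j in range(len(names)):
            if j != i and corr[i, j] > threshold:
                dropped.add(j)
    return keep
```

Visiting features in importance order guarantees each correlated group is represented by its strongest member.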
- 15. Models & Ensembling
● Parameter tuning
○ Random search
● My best model: 0.939 public LB
○ XGB with depth=8 and 2.5k trees
○ Trained for a few days
● Ensembling:
○ Sample group of features
○ Randomly choose the parameters
○ Build ETs and XGBs
○ Stack with Log Reg (L2 regularization with low C)
● Our final model:
○ Neural network on ETs and XGBs outputs + some selected 1st level features
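The log-reg stacking step can be sketched as an L2-regularized logistic regression over base-model predictions; this gradient-descent version is a self-contained stand-in (sklearn's `LogisticRegression` with low `C` would serve the same role):

```python
import numpy as np

def train_stacker(preds, y, C=0.1, lr=0.1, n_iter=2000):
    """Level-2 stacker: L2-regularized logistic regression over the
    base models' predictions. Low C = strong regularization, which
    keeps the blend of correlated base models conservative."""
    X = np.column_stack([preds, np.ones(len(y))])  # add bias column
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y) / len(y)
        grad[:-1] += w[:-1] / (C * len(y))  # L2 penalty, bias excluded
        w -= lr * grad
    return w

def stack_predict(preds, w):
    """Blend base-model predictions with the learned weights."""
    X = np.column_stack([preds, np.ones(len(preds))])
    return 1.0 / (1.0 + np.exp(-X @ w))
```

Base models built from random feature groups and random parameters are highly correlated, which is exactly why the slide stresses strong L2 regularization at the stacking level.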
- 16. Lessons Learned
● It’s important to get CV right
○ My scheme: shuffled 3-fold (leaky)
○ Couldn’t use some nice features because of it
○ CV score of the ensemble was too good
○ Result: CV via LB
○ The right one: by connected components
● A lot of features is not always good
○ Computed too many features
○ Had a hard time using them all
○ Had to start stacking early
(Ad pairs form a graph — that’s why the right CV split is by connected components.)
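Splitting by connected components can be sketched with union-find over ad ids (a minimal sketch; fold assignment over the resulting components is left out):

```python
def connected_components(pairs):
    """Union-find over ad ids: ads linked by any chain of pairs end
    up in one component; CV folds must keep whole components together,
    otherwise information leaks between train and validation."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)
    comps = {}
    for x in parent:
        comps.setdefault(find(x), set()).add(x)
    return list(comps.values())
```

A plain shuffled split puts pairs sharing an ad on both sides, so the model memorizes ads instead of learning similarity; component-level splits remove that leak.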