2nd Place Solution
Instacart Market Basket Analysis
Agenda
• My Background
• Problem Overview
• Main Approach
• Feature Engineering
• Feature Importance
• Important Findings
• F1 maximization
My Background
• Bachelor of Economics
• Programmer in the financial industry
• Consultant in the financial industry
• 2nd Place at KDDCUP2015
• Data Scientist at Yahoo! JAPAN
Problem Overview
• In this competition, we have to predict reorders.
• So it is a little different from general recommendation.
• I mean,
Problem Overview
• How hot(user)?
*The prior set is regarded as training data
Problem Overview
• How hot(item)?
*Clipped at 500
Problem Overview
• Evaluation metric is mean F1 score
• Precision and Recall
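As a rough illustration of the metric, the sketch below computes F1 for each order from sets of predicted and actual products and then averages over orders. The function names and the toy orders are hypothetical, and representing an order with no reorders by an explicit "None" item is an assumption made here to keep the example simple.

```python
def f1_score(predicted, actual):
    """F1 between two sets of products for a single order."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, truths):
    """Average the per-order F1 over all orders (dict keys are order IDs)."""
    scores = [f1_score(predictions[order_id], truths[order_id]) for order_id in truths]
    return sum(scores) / len(scores)


# Toy data (hypothetical): order 2 has no reorders, so it carries the "None" item.
truths = {1: {"banana", "milk"}, 2: {"None"}}
predictions = {1: {"banana"}, 2: {"None"}}
print(mean_f1(predictions, truths))
```

Because the score is averaged per order, a short order counts as much as a long one, which is part of why the choice of items per order matters so much later on.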
Problem Overview
• Links between the files
Main Approach
• We are given orders.csv
Main Approach
• We are given order_products.csv
Main Approach
• Reorder Prediction
user_id product_id label
Main Approach
• None Prediction
user_id label
Feature Engineering
• I made 4 types of features
1. User
• What this user is like
2. Item
• What this item is like
3. User x Item
• How the user feels about the item
4. Datetime
• What this day and hour are like
*For the None model, I can’t use the above features except user and datetime, so I convert the others to per-user stats (min, mean, max, sum, std…).
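The conversion to stats for the None model could look roughly like the sketch below, which collapses a user x item table to one row per user. The table contents are made up for illustration; only the feature name total_buy and the "feature-stat" naming (as in total_buy-max) come from the slides.

```python
import pandas as pd

# Hypothetical user x item feature table (values are illustrative only).
useritem = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "product_id": [10, 11, 12, 10, 13],
    "total_buy": [3, 1, 5, 2, 2],
})

# Collapse the item-level rows to one row per user with summary statistics.
stats = useritem.groupby("user_id")["total_buy"].agg(["min", "mean", "max", "sum", "std"])

# Name the columns "<feature>-<stat>", e.g. total_buy-max.
stats.columns = [f"total_buy-{name}" for name in stats.columns]
user_features = stats.reset_index()
print(user_features)
```

Each row of user_features can then be joined to the user and datetime features to form one training row per order for the None model.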
Feature Importance for reorder
Feature Importance for None
Important Findings for reorder - 1
• user_id: 54035
Important Findings for reorder - 2
• days_last_order-max is the difference between days_since_last_order_this_item and useritem_order_days_max
• days_since_last_order_this_item is a user x item feature: how many days have passed since the user last ordered this item
• useritem_order_days_max is also a user x item feature: the maximum span (in days) between the user’s orders of this item
• For more detail, see the next page
Important Findings for reorder - 2
• See index 0: the user bought this item 14 days ago, and the max span is 30 days
• So I think this feature says whether or not the user has become bored with that item
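A minimal sketch of how this feature could be derived is shown below. The purchase log, the order_day and target_day columns, and the sign convention (days since last order minus max span) are assumptions for illustration; user 1 is chosen to match the example of 14 days since the last purchase and a max span of 30 days.

```python
import pandas as pd

# Hypothetical purchase log: the day (counted from the user's first order)
# on which each user bought each item.
purchases = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "product_id": [10, 10, 10, 10, 10],
    "order_day": [0, 30, 40, 0, 7],
})
# Hypothetical day of the order we want to predict, per user.
target_day = pd.Series({1: 54, 2: 60})

purchases = purchases.sort_values(["user_id", "product_id", "order_day"])
# Days between consecutive purchases of the same item by the same user.
purchases["gap"] = purchases.groupby(["user_id", "product_id"])["order_day"].diff()

feat = purchases.groupby(["user_id", "product_id"]).agg(
    last_day=("order_day", "max"),
    useritem_order_days_max=("gap", "max"),
).reset_index()

feat["days_since_last_order_this_item"] = feat["user_id"].map(target_day) - feat["last_day"]
# Assumed sign convention: negative means the user is still within their longest gap.
feat["days_last_order-max"] = (
    feat["days_since_last_order_this_item"] - feat["useritem_order_days_max"]
)
print(feat)
```

Under this convention, a large positive value means the user has gone longer without the item than ever before, which fits the reading that the user may have lost interest in it.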
Important Findings for reorder - 3
• We already know that fruits are reordered more frequently than vegetables (3 Million Instacart Orders, Open Sourced)
• I wanted to know how often
• So I made an item_10to1_ratio feature, defined as the reorder ratio after an item is ordered vs. not ordered
• See the next page for more details
Important Findings for reorder - 3
• Let’s say userA bought itemA at order_number 1 and 4
• And userB bought itemA at order_number 1 and 3
• item_10to1_ratio is 0.5
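The slides do not spell out the exact computation, so the sketch below is only one reading that reproduces the 0.5 in this example: for each item, find every "ordered, then not ordered" pattern (1, 0) in a user's order history and measure how often the item is ordered again in the following order. The function signature and the 0/1 history encoding are assumptions.

```python
from collections import defaultdict


def item_10to1_ratio(histories):
    """histories maps (user, item) to a list of 0/1 flags, one per consecutive order.

    Returns, per item, the share of (1, 0) patterns that are followed by a 1.
    """
    hits = defaultdict(int)
    total = defaultdict(int)
    for (_user, item), flags in histories.items():
        # Slide a window of three consecutive orders over the history.
        for two_back, one_back, current in zip(flags, flags[1:], flags[2:]):
            if (two_back, one_back) == (1, 0):
                total[item] += 1
                hits[item] += current
    return {item: hits[item] / total[item] for item in total}


histories = {
    ("userA", "itemA"): [1, 0, 0, 1],  # bought at order_number 1 and 4
    ("userB", "itemA"): [1, 0, 1],     # bought at order_number 1 and 3
}
print(item_10to1_ratio(histories))  # {'itemA': 0.5}
```

Here userA skips itemA again after the (1, 0) pattern while userB comes back to it, so one of the two patterns ends in a reorder.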
Important Findings for None - 1
• Useritem_sum_pos_cart(User A, Item B) is the average position in User A’s cart
that Item B falls into
• Useritem_sum_pos_cart-mean(User A) is the mean of the above feature across all
items
• So this feature essentially captures
the average position of an item in a user’s
cart, and we can see that users who
don’t buy many items all at once are
more likely to be None
Important Findings for None - 2
• total_buy is the total number of times a user has ordered an item
• If userA bought itemA 3 times in the past, this would be 3
• So total_buy-max is the max of the above feature per user
• We can see that it predicts whether or not a user will make a reorder
Important Findings for None - 3
• t-1_is_None(User A) is a binary feature that says whether or not the
user’s previous order was None.
• If the previous order is None,
then the next order will also be
None with 30% probability.
F1 maximization
• In this competition, the evaluation metric was an F1 score, which is a way of
capturing both precision and recall in a single metric.
• Thus, we needed to convert reorder probabilities into binary 1/0 (Yes/No)
numbers.
• However, in order to perform this conversion, we need to know a threshold. At
first, I used grid search to find a universal threshold of 0.2. But I saw
comments on the Kaggle discussion boards that said different orders should
have different thresholds.
• To understand why, let’s look at an example.
F1 maximization
• In the first example, the threshold is between 0.9 and 0.3
• In the second example, the threshold is lower than 0.2
• As shown, each order should have its own threshold
• But with the above calculation, we first have to prepare all patterns of probability
• So I needed to come up with another calculation
• See the next page
F1 maximization
• Let’s say our model predicts Item A will be reordered with probability 0.9, and Item B with probability 0.3. I then
simulate 9,999 target labels (whether A and B will be ordered or not) using these probabilities.
• For example, the simulated labels might look like this.
• I then calculate the expected F1 score for each set of labels,
starting from the highest probability items, and then adding items
(e.g., [A], then [A, B], then [A, B, C], etc) until the F1 score
peaks and then decreases.
• We don’t need to calculate every pattern (A, B, AB…)
• Because if we select itemB, we should select itemA as well, since itemA has the higher probability
F1 maximization
• F1score_mean( , [A]) -> 0.809747641431
• F1score_mean( , [A,B]) -> 0.709004233757
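The simulation can be sketched as follows. This is a minimal NumPy version under stated assumptions rather than the original implementation: the function names, the fixed seed, and scoring an order with no true items as F1 = 0 are all choices made here (the last one is consistent with the two values above, whose exact expectations work out to 0.81 and 0.71).

```python
import numpy as np


def expected_f1_top_k(probs, k, n_sim=9999, seed=0):
    """Estimate the expected F1 of predicting the k most probable items."""
    p = np.sort(np.asarray(probs, dtype=float))[::-1]  # highest probability first
    rng = np.random.default_rng(seed)
    labels = rng.random((n_sim, p.size)) < p            # simulated target labels
    true_pos = labels[:, :k].sum(axis=1)                # hits among the top-k picks
    n_actual = labels.sum(axis=1)                       # items actually reordered
    # F1 = 2*TP / (|predicted| + |actual|); the denominator is at least k >= 1.
    return float(np.mean(2 * true_pos / (k + n_actual)))


def best_k(probs, **kwargs):
    """Add items in descending probability until the expected F1 stops improving."""
    best, best_score = 1, expected_f1_top_k(probs, 1, **kwargs)
    for k in range(2, len(probs) + 1):
        score = expected_f1_top_k(probs, k, **kwargs)
        if score <= best_score:
            break
        best, best_score = k, score
    return best, best_score


probs = [0.9, 0.3]  # Item A and Item B
print(expected_f1_top_k(probs, 1))  # roughly 0.81, i.e. predicting [A]
print(expected_f1_top_k(probs, 2))  # roughly 0.71, i.e. predicting [A, B]
print(best_k(probs))
```

Only the prefixes of the probability-sorted list are evaluated, so an order with n candidate items needs at most n estimates instead of 2^n.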
F1 maximization - Predicting None
• One way to think about None is as the probability (1 - Item A)
* (1 - Item B) * …
• But another method is to try to predict None as a special
case.
• By using our None model and treating None as just another
item, we can boost the F1 score from 0.400 to 0.407.
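The first way of thinking about None can be written in a few lines. The sketch assumes the item probabilities are independent, which the product formula implies; the function name is hypothetical, and the separate None model is not reproduced here.

```python
import math


def none_probability(item_probs):
    """P(no item is reordered), assuming items are reordered independently."""
    return math.prod(1.0 - p for p in item_probs)


# Item A at 0.9 and Item B at 0.3 give (1 - 0.9) * (1 - 0.3).
print(none_probability([0.9, 0.3]))  # about 0.07
```

A dedicated None model can pick up signals this product ignores, such as whether the user's previous order was None.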
Appendix
1 month to go…
Kaggle meetup #3 instacart 2nd place solution
7 days to go…
2 days to go…
(´-`).。oO(
1 hour to go…
30 minutes to go…
Did I make it?!
(I didn't)
20 minutes to go…
EOP
