
I'm trying to classify instances of a dataset into one of two classes, a or b. Class b is a minority class that makes up only 8% of the dataset. Every instance is assigned an id indicating which subject generated the data. Because every subject generated multiple instances, ids are repeated frequently in the dataset.

The table below is just an example; the real table has about 100,000 instances, and each subject id has about 100 instances. Every subject is tied to exactly one class, as you can see with "larry" below.

    * field  * field  *   id   *  class  
*******************************************
 0  *   _    *   _    *  bob   *    a
 1  *   _    *   _    *  susan *    a
 2  *   _    *   _    *  susan *    a
 3  *   _    *   _    *  bob   *    a
 4  *   _    *   _    *  larry *    b
 5  *   _    *   _    *  greg  *    a
 6  *   _    *   _    *  larry *    b
 7  *   _    *   _    *  bob   *    a
 8  *   _    *   _    *  susan *    a
 9  *   _    *   _    *  susan *    a
 10 *   _    *   _    *  bob   *    a
 11 *   _    *   _    *  greg  *    a
 ...   ...      ...      ...       ...

I would like to use cross-validation to tune the model and must stratify the dataset so that each fold contains a few examples of the minority class, b. The problem is that I have a second constraint, the same id must never appear in two different folds as this would leak information about the subject.

I'm using Python's scikit-learn library. I need a method that combines LabelKFold, which ensures labels (ids) are not split among folds, with StratifiedKFold, which ensures every fold has a similar ratio of classes. How can I accomplish this with scikit-learn? If it is not possible to split on two constraints in sklearn, how can I effectively split the dataset by hand or with other Python libraries?

1 Answer


The following is a bit tricky with respect to indexing (it helps to use something like pandas for it), but conceptually simple.

Suppose you make a dummy dataset whose only columns are id and class, and remove the duplicate id entries so that each id appears exactly once.

For your cross validation, run stratified cross validation on the dummy dataset. At each iteration:

  1. Find out which ids were selected for the train and the test

  2. Go back to the original dataset and insert all the instances belonging to those ids into the train and test sets, respectively.

This works because:

  1. As you stated, each id is associated with a single class.

  2. Since we run stratified CV, each class is represented proportionally.

  3. Since each id appears only in the train set or the test set (but not both), no information about a subject leaks between folds.
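The procedure above can be sketched with pandas and scikit-learn's StratifiedKFold. The synthetic DataFrame and the column names (`id`, `class`) are illustrative assumptions, not part of the original answer:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: ~100,000 rows, ~1,000 subjects, each subject tied to
# exactly one class, with class "b" making up roughly 8% of subjects.
rng = np.random.default_rng(0)
ids = [f"subj{i}" for i in range(1000)]
id_class = {s: ("b" if rng.random() < 0.08 else "a") for s in ids}
df = pd.DataFrame({"id": rng.choice(ids, size=100_000)})
df["class"] = df["id"].map(id_class)

# 1. Dummy dataset: one row per id, keeping its (single) class.
dummy = df.drop_duplicates("id").reset_index(drop=True)

# 2. Stratified CV on the dummy dataset, stratifying by class.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(dummy["id"], dummy["class"]):
    train_ids = set(dummy["id"].iloc[train_idx])
    test_ids = set(dummy["id"].iloc[test_idx])

    # 3. Expand back to the full dataset: every instance follows its id.
    train = df[df["id"].isin(train_ids)]
    test = df[df["id"].isin(test_ids)]

    # No id appears in both folds, so nothing leaks between them.
    assert train_ids.isdisjoint(test_ids)
```

Because stratification is done over unique ids rather than instances, the instance-level class ratio in each fold is only approximately 8%; it comes out close when ids have similar numbers of instances.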

  • This is perfect, thank you. I'm surprised scikit-learn doesn't support such a common CV problem.
    – Chris F.
    Commented Sep 8, 2016 at 15:05
  • @ChrisF. how did you do this in your code? I want to do the same thing, where I use oversampling to duplicate the minority class instances. I already apply a Label K-Fold cross-validation iterator, but I want to combine it with Stratified K-Fold to leverage the duplicate entries. Commented Oct 14, 2016 at 3:13
  • The specific implementation depends on your data set, but I literally followed the steps above. I first eliminated all duplicate ids, then ran sklearn's plain Stratified K-Fold, stratifying by class. I then preserved the resulting folds but added the rest of each id's data back into its fold. This worked for me because all of my ids have about the same number of instances, so I still got roughly the % split I was looking for with the same class balance, and because every id has only one class across all its instances.
    – Chris F.
    Commented Oct 14, 2016 at 5:04
