All Questions

0 votes
0 answers
92 views

Seeking datasets for training a Language Model on U.S. mortgage loan processes

I'm in the process of training a Language Model (LLM) and require datasets that encompass various aspects of the U.S. mortgage loan process. The model's aim is to understand and simulate decision-...
Anand's user avatar
0 votes
2 answers
74 views

Been stuck on a DS problem. Just need to know whether this problem statement is solvable or not

I've been stuck on the following problem for weeks now. To be clear I'm not asking the community to provide a full solution. Just a few ideas or at least confirmation on whether this problem statement ...
Aakash Dusane's user avatar
0 votes
2 answers
109 views

How do I separate periodic data from time series data?

I am currently working on a classification task of gym exercises based on accelerometer data. I am trying to modularize window extraction so I can train my model based on metrics within a window (which ...
David Chen's user avatar
1 vote
2 answers
553 views

What qualifies as data leakage?

I am currently working on a binary classification problem using imbalanced data. The algorithm that I am using is random forest. The problem is about predicting whether each sales project will meet ...
The Great's user avatar
  • 2,585
5 votes
1 answer
351 views

Are imbalanced data problems solvable? [closed]

I have been working as a data scientist for the past 2 years, during which I have worked on problems related to binary classification, revenue prediction, etc. In the past two years, I have had 2 problems that ...
The Great's user avatar
  • 2,585
0 votes
1 answer
85 views

Classification of noisy data

What method can be used to classify data in the following example? There is a table (hundreds of rows and hundreds of columns). Several columns in this table uniquely allow you to classify each row:...
Mic's user avatar
  • 1
1 vote
1 answer
48 views

How to increase retention?

As you might already know, there is a concept of retention. Let's say I have created a game and today a hundred people have downloaded it. Let's say tomorrow 47 out of yesterday's hundred people are ...
Narek's user avatar
  • 121
0 votes
1 answer
33 views

Classification for choice data

It is essentially a choice modelling problem, but hopefully can be addressed by classification. Suppose one needs to choose a route to drive to work among many candidates in his mind. These candidates ...
GDI's user avatar
  • 101
2 votes
0 answers
52 views

Should credit be given to AI model - low data scenario [closed]

In my office, we recently built an AI model for project success prediction using binary classification. Though the dataset size was small (977 records), my boss still wanted to go ahead with the POC ...
The Great's user avatar
  • 2,585
1 vote
1 answer
176 views

Labelling for churn measurement

I have 3 domains of supplier data (Jan 2017 to Jan 2022), and they are as follows: a) Purchase data - Contains all the purchase (of product) data made by the suppliers with us. It contains columns such ...
The Great's user avatar
  • 2,585
0 votes
1 answer
78 views

Comparing two groups at large scale

Let's consider we have two datasets. Dataset "A" and Dataset "B". Dataset "A" has two columns. Supplier_id and "Status" (pass and fail are values for status ...
The Great's user avatar
  • 2,585
1 vote
1 answer
30 views

Looking for in-depth knowledge of evaluation metrics

I am dealing with an imbalanced dataset. My dataset has 1273 instances in total: 174 in the Yes class and 1099 in the No class, so the imbalance ratio is roughly 1:6. Now I know ...
Encipher's user avatar
  • 361
1 vote
0 answers
467 views

Intuitive explanation of FOIL's gain in Rule-based classification

I encountered the formula for calculating FOIL's gain as below: $$\text{FOIL's gain} = p_0\left(\log_2\frac{p_1}{p_1+n_1} - \log_2\frac{p_0}{p_0+n_0}\right)$$ unlike Information gain or Gini index used to measure ...
tmo's user avatar
  • 11
0 votes
1 answer
409 views

How to interpret the score output by a binary classifier when using a threshold < 0.5?

My understanding is that a score output by a binary classifier (e.g. logistic regression) for an input instance is interpreted as the probability of the instance belonging to class 1. The threshold 0.5 ...
David Tian's user avatar
1 vote
0 answers
1k views

SMOTE before categorical encoding vs SMOTE after categorical encoding

I have a small dataset of 977 rows with a class proportion of 77:23. For the sake of metrics improvement, I have kept my minority class ('default') as class 1 (and 'not default' as class 0). My input ...
The Great's user avatar
  • 2,585
