13

I got feedback from a reviewer, and it is really important for me to answer this question. I would appreciate any help.

It was mentioned that 1% of the data was used for training while 99% was used for testing. This is unusual and calls for careful evaluation of the actual need for ML tools for this problem. In short, if just 1% is sufficient to build an ML model, it may mean that the data is essentially trivial, such that using ML may not be necessary at all. For this reason, it would be good for the authors to provide a rather strong justification for the motivation of this work.

Actually, we also tried a 10/90 split and got the same result. We wanted to show that even with a small amount of training data we could obtain good predictions. Any idea how we could write that, compared to an 80/20 split, there is not much difference?
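For illustration, here is a minimal sketch of this kind of split comparison. The data and model are hypothetical stand-ins (not our actual FEM data or TPOT pipeline), and it assumes scikit-learn is installed:

```python
# Sketch: compare test performance across several train/test split ratios.
# Hypothetical stand-in data and model (the real study used FEM-generated
# data and a stacked TPOT pipeline); assumes scikit-learn is installed.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.5]) + 0.1 * rng.normal(size=10_000)

scores = {}
for train_frac in (0.01, 0.10, 0.80):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_frac, random_state=0
    )
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    scores[train_frac] = r2_score(y_te, model.predict(X_te))

# Inspect how test performance varies with the training fraction.
for frac, r2 in sorted(scores.items()):
    print(f"train fraction {frac:.0%}: test R^2 = {r2:.3f}")
```

If the scores stay close across fractions, that is the kind of "less training data is enough" evidence we would like to report.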

4
  • 3
    This is an interesting question, in my opinion. I have a few ideas, but could you please add a few details if possible: what kind of task is this? Which ML algorithm are you using? If this is classification, how many classes are there and how many instances? And if relevant, how was the gold standard label obtained?
    – Erwan
    Commented Dec 17, 2022 at 10:07
  • The task is regression, with a stacked ML model from the TPOT library. The training data was synthetically produced with FEM software. Commented Dec 17, 2022 at 15:01
  • 1
    Re: "the training data was synthetically produced from an FEM software": and the test data? Like Erwan and Mario hint below, I also have doubts whether ML makes sense on your data. FEM should produce deterministic, reproducible data based only on known physical models. It may be that your ML is just reverse-engineering the formulas used in generating the data.
    – Igor F.
    Commented Dec 19, 2022 at 9:44
  • It was verified with experimental data. Commented Dec 21, 2022 at 20:17

1 Answer

21

Given the information in the question and in the comments, it seems to me that the real issue the reviewer raises is not the 1/99 proportion used for splitting the data: the reviewer takes this information as a clue that the data might be too simple or too homogeneous to require ML. I'm guessing that the performance is also very high, isn't it?

In this case I'm afraid that this reviewer might be right: if the training data was generated automatically with some software, then it's possible that the generated data is homogeneous and that it doesn't have any statistical noise. In other words, it's possible that an expert could manually figure out a formula to calculate the target value. If so, then it's indeed questionable to use ML for this task.

So imho your problem is not the 1/99 proportion, it's motivating the work as a whole:

  • If this task can be done with real data, you should definitely try to evaluate on real data. It's unlikely that real data would be so easy to predict, so this would counter the criticism.
  • If that is not possible, you could justify the use of this artificial data by showing some evidence that the data is complex and that guessing the target value is hard.
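One way to gather that kind of evidence is to compare a simple, interpretable baseline against the ML model: if a near-linear formula already explains the target, the reviewer's concern stands; a large gap in favour of the ML model suggests the target is not trivial. A minimal sketch, with hypothetical data standing in for the FEM output (assumes scikit-learn):

```python
# Sketch: test whether a simple, interpretable baseline already explains the
# data. A large gap between baseline and ML performance is evidence that the
# target is non-trivial. Hypothetical stand-in data; assumes scikit-learn.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(2_000, 4))
# Nonlinear target with interactions, as a stand-in for FEM output.
y = np.sin(3 * X[:, 0]) * X[:, 1] + X[:, 2] ** 2 - 0.5 * X[:, 3]

baseline_r2 = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="r2"
).mean()
ml_r2 = cross_val_score(
    GradientBoostingRegressor(random_state=0), X, y, cv=5, scoring="r2"
).mean()

print(f"linear baseline R^2: {baseline_r2:.3f}, ML R^2: {ml_r2:.3f}")
```

If the baseline comes close to the ML model's score on your actual data, that would support the reviewer's "too simple for ML" reading rather than refute it.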
4
  • 1
    If that theory holds and the data is not complex enough to need ML, then the feature engineering phase (using a correlation matrix, PCA, or manually plotting pairs of features) should reveal signals strong enough to write an explicit function as a model.
    – Mario
    Commented Dec 17, 2022 at 17:27
  • @Erwan Performance in most of the scientific papers was very high; I actually checked them. Some even reach 0.99! So, is there any justification we can find? It is very important for us. This is totally new work and nothing has been published on it before. Commented Dec 18, 2022 at 5:02
  • 5
    @AhmadTurani I always give this very general advice about writing a paper: it's crucial to be as clear as possible about what is being done and why (motivations). Sometimes the authors assume that it's obvious and instead focus on explaining their method, but the reviewers don't have the same background and might not understand. Since you say that this is 'very important to you', you should try to explain why in the paper. It's also important to explain why it makes sense to use this artificial data: maybe alternatives are not suitable, maybe there is evidence that the generation ...
    – Erwan
    Commented Dec 18, 2022 at 11:46
    ... process is very reliable, etc. Finally it would be good to give some indication proving that the generated data is not too trivial, for example in the way Mario proposed above.
    – Erwan
    Commented Dec 18, 2022 at 11:48
