How to compare the time cost of training the same model via different hardware architecture?

Question

For example, I plan to compare the performance of training the same model via:

8 GPUs in one mode; vs
two nodes with 8 GPUs in each node (equals 16 GPUs).

How can I measure the time used the train the model in these 2 cases? I know in simple python code, I can use the cell magic %timeit to measure; but how to do this in large language model training? What things should I aware when comparing training time?

Thank you!

Franck Dernoncourt · Accepted Answer · 2023-11-02 20:57:30Z

1

What things should I aware when comparing training time?

The only difference between LLM training code and regular Python code is the time it takes to ends. Therefore, it's preferable to time within the LLM training code, e.g. time to complete x batches.

answered Nov 2, 2023 at 20:57

Franck Dernoncourt

2,8333 silver badges23 bronze badges

Thank you Franck!
– Dmitry J
Commented Nov 6, 2023 at 17:49

Add a comment |

Stack Exchange Network

How to compare the time cost of training the same model via different hardware architecture?

1 Answer 1

Not the answer you're looking for? Browse other questions tagged
llm
or ask your own question.

Hot Network Questions

How to compare the time cost of training the same model via different hardware architecture?

1 Answer 1

Not the answer you're looking for? Browse other questions tagged llm or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
llm
or ask your own question.