For example, I plan to compare the performance of training the same model via:

  1. 8 GPUs in one node; vs
  2. two nodes with 8 GPUs in each node (16 GPUs in total).

How can I measure the time used to train the model in these two cases? I know that in simple Python code I can use the %timeit magic to measure it, but how do I do this in large language model training? What should I be aware of when comparing training times?

Thank you!

1 Answer

What should I be aware of when comparing training times?

The only real difference between LLM training code and regular Python code is how long it takes to finish. It is therefore preferable to time inside the LLM training code itself, e.g. measure the time to complete x batches, rather than timing the whole run.
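As a rough illustration of timing x batches, here is a minimal sketch using only the standard library. The names `time_batches`, `step_fn`, `warmup`, and `sync_fn` are hypothetical and not part of any training framework, and `fake_train_step` is a stand-in for one real training step.

```python
import time


def time_batches(step_fn, num_batches, warmup=2, sync_fn=None):
    """Time `num_batches` calls of `step_fn` after `warmup` untimed calls.

    `sync_fn` is an optional hook (an assumption of this sketch) called
    before the clock starts and before it stops, so that asynchronous
    device work is finished when the timestamps are taken.
    """
    # Warm-up steps are excluded: the first iterations often include
    # one-off costs such as memory allocation or compilation.
    for _ in range(warmup):
        step_fn()
    if sync_fn is not None:
        sync_fn()

    start = time.perf_counter()
    for _ in range(num_batches):
        step_fn()
    if sync_fn is not None:
        sync_fn()
    elapsed = time.perf_counter() - start

    # Return total seconds and throughput in batches per second.
    return elapsed, num_batches / elapsed


def fake_train_step():
    # Placeholder for forward pass, backward pass and optimizer step.
    time.sleep(0.01)


elapsed, batches_per_sec = time_batches(fake_train_step, num_batches=20)
print(f"{elapsed:.3f} s for 20 batches ({batches_per_sec:.1f} batches/s)")
```

With PyTorch on GPUs, one would probably pass `torch.cuda.synchronize` as `sync_fn`, since CUDA kernels run asynchronously. When comparing 8 against 16 GPUs, note that if the per-GPU batch size stays the same, each step on 16 GPUs processes twice as many samples, so samples (or tokens) per second is likely a fairer metric than seconds per batch.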

  • Thank you Franck!
    – Dmitry J
    Commented Nov 6, 2023 at 17:49
