For example, I plan to compare the performance of training the same model via:
- 8 GPUs in one node; vs
- two nodes with 8 GPUs in each node (16 GPUs in total).
How can I measure the time it takes to train the model in these two cases? I know that in simple Python code I can use the %timeit magic to measure this, but how do I do it for large language model training? What should I be aware of when comparing training times?
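To make the question concrete, here is a minimal sketch of the kind of measurement I have in mind. It assumes the training loop can be reduced to a single callable; `train_step`, `samples_per_step`, and `sync` are hypothetical names for illustration, not part of any framework. It skips a few warm-up steps, times the rest with a wall clock, and reports throughput, so that the 8-GPU and 16-GPU runs could be compared in samples per second rather than raw seconds.

```python
import time


def measure_throughput(train_step, num_steps, warmup_steps, samples_per_step, sync=None):
    """Time `num_steps` calls of `train_step` after `warmup_steps` untimed calls.

    `train_step` is a hypothetical zero-argument callable that runs one
    optimizer step. `sync` is an optional callable that blocks until pending
    device work has finished (with PyTorch on GPUs this could be
    torch.cuda.synchronize, since GPU kernels run asynchronously).
    """
    # Warm-up steps are excluded: the first iterations often include one-off
    # costs such as memory allocation and communication setup.
    for _ in range(warmup_steps):
        train_step()
    if sync is not None:
        sync()

    start = time.perf_counter()
    for _ in range(num_steps):
        train_step()
    if sync is not None:
        sync()  # make sure all timed work has actually finished
    elapsed = time.perf_counter() - start

    return {
        "elapsed_s": elapsed,
        "s_per_step": elapsed / num_steps,
        # samples_per_step should be the global batch size (summed over all GPUs)
        "samples_per_s": num_steps * samples_per_step / elapsed,
    }


if __name__ == "__main__":
    # Stand-in for a real training step, only so the sketch runs on its own.
    def fake_step():
        time.sleep(0.01)

    print(measure_throughput(fake_step, num_steps=20, warmup_steps=2, samples_per_step=64))
```

In a multi-node run I would presumably record this on one rank only and keep the global batch size fixed across both setups, but I am not sure that is enough for a fair comparison.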
Thank you!