
I am using multiple GPUs (num_gpus = 4) to train one model with multiple towers. The model trains fine on one set of GPUs (CUDA_VISIBLE_DEVICES = 0,1,2,3), but it hits an OOM problem during the first graph evaluation with CUDA_VISIBLE_DEVICES = 0,1,4,5.
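
For context, the towers are placed one per visible GPU in the usual way. A minimal sketch of that placement (illustrative only: the tensors and variables below are stand-ins, and variable sharing between towers is omitted):

import tensorflow as tf

num_gpus = 4  # one tower per visible GPU
tower_losses = []
for i in range(num_gpus):
    # /gpu:i is the i-th *visible* device, i.e. the i-th entry of
    # CUDA_VISIBLE_DEVICES, not the physical GPU index.
    with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
        x = tf.random_normal([3, 1024])                  # batch size 3 per tower
        w = tf.get_variable('w_%d' % i, [1024, 10])
        tower_losses.append(tf.reduce_mean(tf.square(tf.matmul(x, w))))
loss = tf.add_n(tower_losses) / num_gpus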

Does anyone have any idea why this is happening?

The following options are used to create the session:

session_config = tf.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=False)
session_config.gpu_options.per_process_gpu_memory_fraction = 0.94
session_config.gpu_options.allow_growth = False
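
The config is then passed when the session is constructed, along these lines (sketch; the actual session setup may differ):

sess = tf.Session(config=session_config)

Note that with allow_growth = False, roughly 0.94 × 7.92 GiB ≈ 7.4 GiB is reserved up front on each visible GPU.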

The batch size is already very small (3).

System information

TensorFlow 1.0
CUDA 8.0
Ubuntu 14.04.5 LTS
All GPUs: GeForce GTX 1080

Logs

name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:07:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xcc4593a0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:08:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xd2404670
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 2 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:18:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xd25591b0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 3 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:1c:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:07:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080, pci bus id: 0000:08:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080, pci bus id: 0000:18:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080, pci bus id: 0000:1c:00.0)

I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 47441 get requests, put_count=8461 evicted_count=1000 eviction_rate=0.118189 and unsatisfied allocation rate=0.844839
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.33GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.08GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.08GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.98GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.98GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.17GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.68GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.86GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 2698 get requests, put_count=8709 evicted_c

1 Answer


Is the log from a good run, or a bad run? There is no error there, only warnings.

If your system has a dual root complex, GPUs 0,1,4,5 could be on different partitions; the DMA matrix would show that. Copies between GPUs on the same root complex are generally faster than copies across root complexes. If a copy has to hold a reference to the tensor for longer because the transfer is slower, you might see increased peak memory usage, which leads to OOM if your model is already close to the memory limit. Of course, this is just a theory, and without further debugging info it's difficult to tell for sure.
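
If you want to inspect the topology yourself, one option (assuming a driver recent enough to support it) is nvidia-smi's topology matrix, e.g. invoked from Python:

import subprocess

# Print the GPU interconnect matrix. Links reported as "SYS" traverse the
# inter-socket interconnect (i.e. a different root complex), while
# "PIX"/"PXB"/"PHB" stay under the same host bridge / CPU.
print(subprocess.check_output(['nvidia-smi', 'topo', '-m']).decode())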
