
I am writing this post because, since I started using Slurm, I have not been able to use Ray correctly. Whenever I use the commands:

  • ray.init
  • trainer = A3CTrainer(env="my_env") (I have registered my env in tune)

the program crashes with the following message:

core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

The program works fine on my computer; the problem appeared when I started using Slurm. I only ask Slurm for one GPU.

Thank you for reading and maybe answering. Have a great day.

Some details about the code

@Alex I used the following code:

import ray
from ray.rllib.agents.a3c import A3CTrainer
import tensorflow as tf
from MM1c_queue_env import my_env #my_env is already registered in tune

ray.shutdown()
ray.init(ignore_reinit_error=True)
trainer = A3CTrainer(env="my_env")

print("success")

Both the trainer line and the init line cause the program to crash with the error mentioned in my previous comment. To launch the program with Slurm, I use the following batch script:

#!/bin/bash

#SBATCH --job-name=rl_for_insensitive_policies
#SBATCH --time=0:05:00 
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu

module load anaconda3/2020.02/gcc-9.2.0
python test.py
  • Can you post additional details about how you're deploying ray on slurm?
    – Alex
    Commented Jun 1, 2022 at 19:33
  • @Alex I added some details in the question. Thank you for answering. Commented Jun 2, 2022 at 4:12
  • Can you add any relevant log information from /tmp/ray/session_latest/logs after running that script? Also any network/file system configurations on the slurm cluster that may be relevant?
    – Alex
    Commented Jun 2, 2022 at 17:55

3 Answers

5

Limit the number of CPUs

Ray will launch as many worker processes as your execution node has CPUs (or CPU cores). If that is more than you reserved, Slurm will start killing processes.

You can limit the number of worker processes like this:

import ray
ray.init(ignore_reinit_error=True, num_cpus=4)
print("success")
2

You can find detailed instructions for running Ray with SLURM in the documentation. The instructions below are based on it; I also used the information in this link.

You should launch one process for the head node and as many processes as you have worker nodes. Then, the worker nodes must be connected to the head node.

#!/bin/bash
#SBATCH -p gpu
#SBATCH -t 00:05:00 
#SBATCH --job-name=rl_for_insensitive_policies

According to the documentation, --tasks-per-node must be one.

#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1

After specifying some resources, load your environment:

module load anaconda3/2020.02/gcc-9.2.0

Then, you need to obtain the IP address of the head node.

Getting the node names

nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

if [[ "$head_node_ip" == *" "* ]]; then
  IFS=' ' read -ra ADDR <<<"$head_node_ip"
  if [[ ${#ADDR[0]} -gt 16 ]]; then
    head_node_ip=${ADDR[2]}
  else
    head_node_ip=${ADDR[0]}
  fi
  echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi

port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
redis_password=$(uuidgen)
echo "redis_password: "$redis_password

nodeManagerPort=6700
objectManagerPort=6701
rayClientServerPort=10001
redisShardPorts=6702
minWorkerPort=10002
maxWorkerPort=19999

The code below launches the head node.

echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" \
        --port=$port \
        --node-manager-port=$nodeManagerPort \
        --object-manager-port=$objectManagerPort \
        --ray-client-server-port=$rayClientServerPort \
        --redis-shard-ports=$redisShardPorts \
        --min-worker-port=$minWorkerPort \
        --max-worker-port=$maxWorkerPort \
        --redis-password=$redis_password \
        --num-cpus "${SLURM_CPUS_PER_TASK}" \
        --num-gpus "${SLURM_GPUS_PER_TASK}" \
        --block &

sleep 10

The number of nodes other than the head node:

worker_num=$((SLURM_JOB_NUM_NODES - 1))

The loop below launches the workers (one per node).

for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    srun --nodes=1 --ntasks=1 -w "$node_i" \
        ray start --address "$ip_head" \
        --redis-password=$redis_password \
        --num-cpus "${SLURM_CPUS_PER_TASK}" \
        --num-gpus "${SLURM_GPUS_PER_TASK}" \
        --block &
    sleep 5
done

It is better to add some argparse arguments to your code so that you can pass it the allocated resources and the Redis password.

python test.py --redis-password $redis_password --num-cpus $SLURM_CPUS_PER_TASK --num-gpus $SLURM_GPUS_PER_TASK

If you get an "unable to connect to GCS server" error, use the values below or pick some new ones. Two users cannot use the same ports.

port=6380
nodeManagerPort=6800
objectManagerPort=6801
rayClientServerPort=20001
redisShardPorts=6802
minWorkerPort=20002
maxWorkerPort=29999

In your test.py, add the arguments and initialize Ray:

import os
import argparse
import ray

parser = argparse.ArgumentParser(description="Script for training RLLIB agents")
parser.add_argument("--num-cpus", type=int, default=0)
parser.add_argument("--num-gpus", type=int, default=0)
parser.add_argument("--redis-password", type=str, default=None)
args = parser.parse_args()

ray.init(_redis_password=args.redis_password, address=os.environ["ip_head"])

config = {}  # RLlib trainer config; only the resource settings are overridden here
config["num_gpus"] = args.num_gpus
config["num_workers"] = args.num_cpus
0

I had the same issue. Try bringing up a single-node Ray cluster first with ray start --head; then using ray.init(address='auto') might help.
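
A minimal sketch of that suggestion, assuming ray start --head has already been run on the same node inside the SLURM allocation:

import ray

# Attach to the cluster started with `ray start --head` instead of letting
# ray.init() spawn a fresh one.
ray.init(address="auto")
print(ray.cluster_resources())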

  • Didn't work for me - Connection refused (port issue?)
    – jtlz2
    Commented Nov 28, 2023 at 13:34
