
I am writing this post because, since I started using Slurm, I have not been able to use Ray correctly. Whenever I use the commands:

  • ray.init
  • trainer = A3CTrainer(env="my_env") (I have registered my env in tune)

the program crashes with the following message:

core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

The program works fine on my computer; the problem appeared when I started using Slurm. I only ask Slurm for one GPU.

Thank you for reading and maybe answering. Have a great day.

Some details about the code

@Alex I used the following code:

import ray
from ray.rllib.agents.a3c import A3CTrainer
import tensorflow as tf
from MM1c_queue_env import my_env #my_env is already registered in tune

ray.shutdown()
ray.init(ignore_reinit_error=True)
trainer = A3CTrainer(env="my_env")

print("success")

Both the trainer line and the init line cause the program to crash with the error mentioned in my previous comment. To launch the program with Slurm, I use the following batch script:

#!/bin/bash

#SBATCH --job-name=rl_for_insensitive_policies
#SBATCH --time=0:05:00 
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu

module load anaconda3/2020.02/gcc-9.2.0
python test.py
  • Can you post additional details about how you're deploying ray on slurm?
    – Alex
    Commented Jun 1, 2022 at 19:33
  • @Alex I added some details in the question. Thank you for answering. Commented Jun 2, 2022 at 4:12
  • Can you add any relevant log information from /tmp/ray/session_latest/logs after running that script? Also any network/file system configurations on the slurm cluster that may be relevant?
    – Alex
    Commented Jun 2, 2022 at 17:55

3 Answers

5

Limit the number of CPUs

Ray will launch as many worker processes as your execution node has CPUs (or CPU cores). If that is more than you reserved, Slurm will start killing processes.

You can limit the number of worker processes like this:

import ray
ray.init(ignore_reinit_error=True, num_cpus=4)
print("success")
2

You can find detailed instructions for running Ray with SLURM in the documentation. The instructions below are based on it; I also used the information in this link.

You should launch one process for the head node and as many processes as you have worker nodes. Then, the worker nodes must be connected to the head node.

#!/bin/bash
#SBATCH -p gpu
#SBATCH -t 00:05:00 
#SBATCH --job-name=rl_for_insensitive_policies

According to the documentation, --tasks-per-node must be one.

#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1

After specifying some resources, load your environment:

module load anaconda3/2020.02/gcc-9.2.0

Then, you need to obtain the IP address of the head node.

Getting the node names

nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

if [[ "$head_node_ip" == *" "* ]]; then
  IFS=' ' read -ra ADDR <<<"$head_node_ip"
  if [[ ${#ADDR[0]} -gt 16 ]]; then
    head_node_ip=${ADDR[2]}
  else
    head_node_ip=${ADDR[0]}
  fi
  echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi

port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
redis_password=$(uuidgen)
echo "redis_password: "$redis_password

nodeManagerPort=6700
objectManagerPort=6701
rayClientServerPort=10001
redisShardPorts=6702
minWorkerPort=10002
maxWorkerPort=19999

The code below launches the head node.

echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" \
        --port=$port \
        --node-manager-port=$nodeManagerPort \
        --object-manager-port=$objectManagerPort \
        --ray-client-server-port=$rayClientServerPort \
        --redis-shard-ports=$redisShardPorts \
        --min-worker-port=$minWorkerPort \
        --max-worker-port=$maxWorkerPort \
        --redis-password=$redis_password \
        --num-cpus "${SLURM_CPUS_PER_TASK}" \
        --num-gpus "${SLURM_GPUS_PER_TASK}" \
        --block &

sleep 10

The number of nodes other than the head node:

worker_num=$((SLURM_JOB_NUM_NODES - 1))

The loop below launches the workers (one per node).

for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    srun --nodes=1 --ntasks=1 -w "$node_i" \
        ray start --address "$ip_head" \
        --redis-password=$redis_password \
        --num-cpus "${SLURM_CPUS_PER_TASK}" \
        --num-gpus "${SLURM_GPUS_PER_TASK}" \
        --block &
    sleep 5
done

It is better to add some argparse arguments to your code so that you can pass it the allocated resources and the Redis password.

python test.py --redis-password $redis_password --num-cpus $SLURM_CPUS_PER_TASK --num-gpus $SLURM_GPUS_PER_TASK

If you get an "unable to connect to GCS server" error, use the values below or pick some new ones. Two users cannot use the same ports.

port=6380
nodeManagerPort=6800
objectManagerPort=6801
rayClientServerPort=20001
redisShardPorts=6802
minWorkerPort=20002
maxWorkerPort=29999

In your test.py, add the arguments and initialize Ray:

import os
import argparse
import ray

parser = argparse.ArgumentParser(description="Script for training RLLIB agents")
parser.add_argument("--num-cpus", type=int, default=0)
parser.add_argument("--num-gpus", type=int, default=0)
parser.add_argument("--redis-password", type=str, default=None)
args = parser.parse_args()

ray.init(_redis_password=args.redis_password, address=os.environ["ip_head"])

config = {}  # RLlib trainer config; only the resource settings are overridden here
config["num_gpus"] = args.num_gpus
config["num_workers"] = args.num_cpus
0

I had the same issue. Try bringing up a single-node Ray cluster first with ray start --head; then using ray.init(address='auto') might help.
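
A minimal sketch of that suggestion, assuming ray start --head has already been run on the same node inside the SLURM allocation:

import ray

# Attach to the cluster started with `ray start --head` instead of letting
# ray.init() spawn a fresh one.
ray.init(address="auto")
print(ray.cluster_resources())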

  • Didn't work for me - Connection refused (port issue?)
    – jtlz2
    Commented Nov 28, 2023 at 13:34
