
Questions tagged [slurm]

The tag has no usage guidance.

0 votes
0 answers
562 views

Resolving Slurm cgroups Plugin Errors on Ubuntu 22.04 Nodes

I'm working with Slurm and facing issues specifically with the cgroups plugin on Ubuntu 22.04 nodes. Our team is relatively new to Slurm, and we've been trying to optimize our resource management for ...
Francisco Maria Calisto's user avatar
1 vote
1 answer
1k views

Error code 140 in command running through Nextflow on SLURM

[Note: question heavily edited to correspond to the actual problem] I'm trying to debug a command that fails only in specific conditions. The failure is with exit code 140, but I have no other ...
Alexlok's user avatar
  • 113
0 votes
0 answers
91 views

Dynamically checking and allocating SLURM nodes within a python script

I have a computationally expensive simulation function I am looking to distribute across a multi-node cluster. The code looks something like this: input_tasks = [input_0, input_1, ..., input_n] for i ...
GTOgod's user avatar
  • 109
0 votes
1 answer
2k views

How to sync UIDs and GIDs across multiple machines with minimal impact on users' experience?

I have two workstations, WS 1 and WS 2, and a server, S, all running Ubuntu 22.04. These machines were previously managed independently, so users could have accounts on some or all of them, and ...
Matt's user avatar
  • 103
1 vote
0 answers
2k views

How to make a host file in SLURM with $SLURM_JOB_NODELIST

I have access to an HPC cluster with 40 cores on each node. I have a batch file to run a total of 35 codes, which are in separate folders. Each is an OpenMP code that requires 4 cores. So how do I ...
Libin Varghese's user avatar
0 votes
1 answer
887 views

Common home folder for slurm cluster user on nodes and front end

I am trying to put together a SLURM cluster with an Odroid XU4 front end (Ubuntu 20.04-5.4 mate), Odroid MC1 nodes (12 nodes total: Ubuntu 20.04.1-5.4-minimal), and an Odroid HC1 NFS server (...
odroidnewbie's user avatar
0 votes
1 answer
67 views

How can I pass two arguments to `--mail-type` option of `salloc`?

I would like to pass two arguments to an option of a shell command, specifically, for salloc. I can choose to do either of the following salloc -n 1 -t 24:00:00 --mail-type=BEGIN salloc -n 1 -t 24:00:...
zyy's user avatar
  • 189
0 votes
1 answer
91 views

Linux Mint "slurm" appears on login screen

The text slurm recently appeared on my login screen, above my login name. What could be the reason? How can it be removed? I use Linux Mint version 19.1 'Tessa' with its Cinnamon desktop environment. ...
bmv's user avatar
  • 147
0 votes
1 answer
2k views

SLURM setting nodes to drain due to low socket-core-thread-cpu count

I have SLURM set up with a couple of workstations. There are different kinds, but let's take one with a CPU which has 4 cores and no additional SMT, so 4 threads in total. lscpu shows me the following:...
Martin Ueding's user avatar
1 vote
1 answer
1k views

slurmd: Invalid job credential

I'm having some problems with a test configuration of Slurm on my laptop. I'm trying to run four slurmd instances on one machine, which is also the same machine as slurmctld runs on. I have a local ...
lukas's user avatar
  • 11
1 vote
0 answers
913 views

Slurm - GPU enforcement with cgroups

I am running Slurm 19.05 on a single machine (Ubuntu 18.04) for scheduling GPU tasks. However, I am having trouble setting up GPU enforcement with cgroups. If I set ConstrainDevice=yes in my cgroup....
Jonas's user avatar
  • 11
0 votes
1 answer
402 views

Slurm nodes on AWS set to drain at boot

I am working to configure Slurm on an AWS cluster created with CloudFormation. At boot time some of the nodes get set to a "drain" state, with the stated reason being "Low socketcorethread count". ...
user3159132's user avatar
3 votes
2 answers
11k views

How to cancel a job that is on completing (CG) state?

I submitted some jobs as usual using sbatch and later canceled some of them using scancel. However, they are stuck in state CG and I cannot remove them from my list. Is there any way to get rid of ...
Iago Carvalho's user avatar
2 votes
1 answer
10k views

Slurm on AWS returns slurmstepd: error: execve(): : No such file or directory

I have installed a Burstable and Event-driven HPC Cluster on AWS Using Slurm according to this tutorial. With this installation I can burst instances and run jobs in the Slurm environment on EC2. ...
Serialchiller's user avatar
1 vote
1 answer
232 views

Ubuntu 18.10 and modify installed package - OpenMPI

I've installed openmpi-bin (OpenMPI 3.1) on Ubuntu 18.10. I also run Slurm on the same machine and would like to recompile or reconfigure my installation of OpenMPI to work with Slurm. If one ...
Paer's user avatar
  • 21
