
I work in a research laboratory with multiple physical machines of different specifications: the machines have different CPUs (some Intel, some AMD), different amounts of RAM, and some have discrete GPUs while others don't.

Our current solution is based on SSSD and Kerberos, so users can log in to their accounts from any terminal and have access to their files. The problem is that, this way, users are "tied" to one machine while they are working, resulting in sub-optimal resource allocation.

Therefore, we are looking for an alternative solution for our cluster. Our main goal is to truly unify all the machines, i.e., from the user's point of view the cluster appears as a single machine. However, from what we gather, a solution such as Slurm is not ideal, since we do not want to rely on a job scheduler. The solution we envision goes something like this: when a user logs in, they can specify the specs they need (RAM, number of CPUs, discrete GPU, etc.), and a virtualized environment with the desired specs is created (a Docker image or a virtual machine, for instance). The user can then use that environment as a regular "computer." Crucially, the resources for this virtual environment should be drawn from the cluster as a whole and not from a single machine. It should also be possible to share large datasets that can be accessed by every "virtual environment", and the cluster should have an authentication and permission system.

We have been searching for clustering tools that can achieve our goal, but we are unsure which one to pick. We have looked into Mesos, OS/DB, Docker Swarm, Kubernetes, and oVirt, but we do not know whether what we want is achievable with these tools, and if so, which one is the best pick. We think that containers might be a good option for production, but probably not the best choice for R&D. Can you help us out with some pointers on what to do and where to start?

Best regards, pinxau1000

  • Before downvoting, note that this question was originally posted on Stack Overflow; however, we were advised to move it here.
    – pinxau1000
    Commented Dec 20, 2022 at 22:33
  • This is not achievable with any of these tools. Each individual partition, container, or VM can draw no more resources than a single machine provides. There have been projects that made a cluster appear as a single machine with combined resources (Kerrighed); this is called a Single System Image (SSI) cluster, but even there each job (i.e., an OS process) still has to run on a single participating computer and cannot span several. If you want to spread your load, use job scheduling, period.
    Commented Dec 22, 2022 at 19:23

2 Answers

Agree with @NikitaKipriyanov that you cannot combine resources from multiple systems into a single image. There have been commercial products that did this in the past, and they relied on InfiniBand to keep latency down (IMHO, it did not work well). Slurm can be used as a scheduler, but you can also use it for interactive jobs, in which case it acts more as a resource manager.

Each job can specify the number of CPU cores, the number and type of GPUs, the amount of memory, and so on. The scheduler will then pick an appropriate, unused system and give you a shell prompt. X11 forwarding is available if needed.
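For example, an interactive session could be requested like this (a minimal sketch; the resource numbers are placeholders, and it assumes the cluster's GPUs are configured as a GRES named "gpu"):

    # Ask Slurm for an interactive shell with 8 cores, 32 GB of RAM and one GPU;
    # the scheduler picks a free machine that satisfies the request
    srun --cpus-per-task=8 --mem=32G --gres=gpu:1 --pty bash

    # The same request with X11 forwarding enabled
    srun --cpus-per-task=8 --mem=32G --gres=gpu:1 --x11 --pty bash

From the user's point of view this behaves like logging into a machine with exactly the requested specs, which is close to the workflow described in the question.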

Also, containers can be quite useful in an R&D environment. You should not dismiss them just because you don't yet see their utility, but they are not the solution to this problem.

  • Thanks for the reply. If I understand correctly, it is impossible to "unify" all the physical machines' resources as if they were a single machine, and I should look for a job scheduler that selects an appropriate machine from the cluster and runs the job according to the user's specification. Can you provide examples of tools or frameworks that can be used to achieve that goal?
    – pinxau1000
    Commented Dec 26, 2022 at 15:51
  • @pinxau1000 Slurm is widely used for this and is well supported. It scales easily from a few systems to thousands of servers. Heterogeneous systems are the norm, and each server (or group of servers) is defined with its capabilities. The user specifies the needed resources and the scheduler picks a system that meets or exceeds the requirements. Jobs can be run interactively, scheduled to run as soon as resources are available, or set to run at a specific time. PBS and LSF are other options.
    – doneal24
    Commented Dec 26, 2022 at 16:21

It's not possible.

  1. Different CPUs mean the available instruction sets may differ. This is a nightmare if you want to migrate running code between CPUs.
  2. Memory latency is in nanoseconds, network latency in tens of microseconds.

Depending on your workload, it may be possible to restructure it to run on multiple computers and communicate data between them. For some problems this is trivial: you can slice the dataset into smaller partitions and work on them in parallel. For other workloads it is difficult. But this requires modifications to the workload, not to the operating system.
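As an illustration of the trivial case, here is a minimal sketch that splits a dataset into independent chunks and processes them in parallel across the cluster as a Slurm array job (it assumes GNU coreutils and a Slurm setup like the one described in the other answer; data.csv and process_chunk are placeholders for your own data and program):

    # Split the dataset into 10 independent chunks (chunk_00 ... chunk_09)
    split --number=l/10 --numeric-suffixes data.csv chunk_

    # Write a Slurm array job script: each array task processes one chunk,
    # on whichever node the scheduler finds free resources
    cat > job.sh <<'EOF'
    #!/bin/bash
    #SBATCH --array=0-9
    #SBATCH --cpus-per-task=4
    ./process_chunk "chunk_0${SLURM_ARRAY_TASK_ID}"
    EOF

    # Submit the job; the 10 tasks run in parallel across the cluster
    sbatch job.sh

Each task is still an ordinary OS process confined to one machine; the parallelism comes from how the data is partitioned, not from any attempt to merge the machines.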
