17

Is it possible to design a distributed system to generate unique ids where there is a requirement that every generated id is guaranteed to be unique with no possibility of collision?

I didn't think that this was possible, given the fact that there is a finite number of ids which can fit into some structure. So for instance, if we generate ids with the numbers 0-9, and store them in 32 bytes, then there are at most 10^32 possibilities for ids. This gives a very small possibility that there will be a collision, but there is still a possibility. Am I missing something, or is there a way to engineer a system such that there is guaranteed to never be a collision (i.e., all ids are truly unique)?

Even if we generate ids based on a timestamp, say, then it's still not clear to me that there will never be a collision.

15
  • 22
    Collision probabilities for random UUIDv4 are astronomically low. However, the other UUID variants are deterministic. For example, UUIDv1 uses a node ID + timestamp + counter. (Officially UUIDv1 uses MAC addresses as the node ID, but this is problematic in practice.) As long as all participants in your distributed system can agree on unique node identifiers, and as long as you place reasonable limits on the rate at which IDs can be generated, this is easy to solve. I recommend reading up on alternatives to UUIDs using this approach, e.g. Snowflake: en.wikipedia.org/wiki/Snowflake_ID
    – amon
    Commented Jan 15 at 18:18
  • 22
    Just a note - UUID and "unique ID" are not synonymous. UUIDs are universally unique. This means that it is intended to be unique both within the source system and across any third-party system. Often, an ID only needs to be unique within some domain and does not need to be a UUID. Once universal uniqueness is removed as a requirement, it can become trivially simple to guarantee uniqueness.
    – Dancrumb
    Commented Jan 16 at 3:34
  • 16
    Raymond Chen wrote a really good article about the probabilities involved: devblogs.microsoft.com/oldnewthing/20160114-00/?p=92851
    – rhellen
    Commented Jan 16 at 8:51
  • 9
    Not a direct answer to your question but you talk about 32-bit numbers here and your math isn't right. 32-bits can hold 2^32 (around 4.3 billion) different values. It's important to understand that a 128-bit number space is not just 4 times larger than a 32-bit number space. You can represent 79,228,162,514,264,337,593,543,950,336 times as many values in a 128-bit space. This is an inconceivably large number space. The more important thing to know is that if your RNG isn't very good, collisions can become likely.
    – JimmyJames
    Commented Jan 16 at 19:03
  • 6
    @JimmyJames OP said 32 bytes, not bits. The full quote is: "if we generate ids with the numbers 0-9, and store them in 32 bytes, then there are at most 10^32 possibilities". I think they mean generating 32 decimal digits, and storing each digit in a byte (presumably as an ASCII character). That's not a very compact way to represent numbers up to 10^32, but it's exactly the sort of thing that's extremely common e.g. when you're using JSON as a transport. They weren't talking about how many ids you can get out of 32-bit binary numbers.
    – Ben
    Commented Jan 17 at 2:06

11 Answers

32

there is a finite number of ids which can fit into some structure

Correct. If you have n bits in your structure, after you have generated 2^n IDs, the next one must be a collision; this is the pigeonhole principle.

However, assuming you are dealing with an actual physical system rather than a pure thought experiment, there are "only" about 10^80 ≈ 2^266 electrons in the universe, so if your data structure is bigger than 266 bits or so then, given a perfect algorithm, you'll never be able to generate all the IDs.
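
For a sense of how much earlier random generation gets into trouble than exhaustion does, here is a minimal Python sketch of the usual birthday-paradox approximation (assuming a perfect RNG):

    import math

    def collision_probability(bits: int, generated: int) -> float:
        # Standard birthday approximation: p ~ 1 - exp(-k^2 / 2^(bits+1))
        return 1.0 - math.exp(-(generated ** 2) / 2.0 ** (bits + 1))

    # 2^61 random 122-bit values (UUIDv4's effective entropy): ~39% chance
    print(collision_probability(122, 2 ** 61))

    # One billion random IDs per second, every second for a century:
    print(collision_probability(122, 10 ** 9 * 3600 * 24 * 365 * 100))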

8
  • 11
    The pigeonhole principle gives an upper bound on an ideal system, but for real-world systems the limit may be lower. For example if using version 4 UUIDs you have 2^122 possible values, and would expect a collision after a "mere" 2^61 generations due to the birthday paradox.
    – James_pic
    Commented Jan 16 at 10:16
  • 5
    @James_pic I'm thinking of the general concept of "unique IDs" here rather than the specific implementation(s) that are UUIDs. I acknowledge the question somewhat mixes up the two concepts. Commented Jan 16 at 10:20
  • 2
    Even if you only allow for 2^50 generations, that's still 150,000 IDs for every person on earth. Commented Jan 16 at 17:58
  • 6
    The real limitation is time rather than electrons. You only need 265 bits of RAM (only a small fraction of the 2^265 electrons will be used for that RAM) and a lot of time to exhaust all 2^265 values. Commented Jan 17 at 3:05
  • 3
    @ChaiT.Rex If you're not storing the IDs once you've generated them, I assert that's not an interesting problem. Assume some kind of distributed generation of IDs. Commented Jan 17 at 13:57
20

You want version 1 UUIDs. They don't collide but they leak the generating computer's MAC address and require a monotonic clock.

From RFC 4122 we have this format:

4 bytes time-low
2 bytes time-mid
2 bytes time-high & version (high nybble is a 1)
1 byte clock-seq-high & reserved (high two bits are 1 and 0)
1 byte clock-seq-low
6 bytes node (MAC address)

Clock-sequence is actually the subsecond portion of the clock now.

The thing about the multi-byte fields is that they are single fields, not sequences of one-byte fields. If you mess the code up, you get different results on little-endian and big-endian machines, and you have a chance of collision again. The string form looks like this no matter what your endianness is:

????????-????-1???-[89AB]???-?[02468ACE]??????????

The thing about this is when it works, it works. If you don't have MAC address collisions (the general rule is if you get colliding MAC addresses, RMA both cards) and if you actually have a monotonic clock, this works 100% of the time.

There's a work around for not having a monotonic clock. That is, do it the old way. Serialize all UUID generations and use clock-seq as a counter that increments with each generation.

The catch is that you will have bad generation, and most likely collisions, if the PC's clock battery dies on any node that generates.

It's been pointed out in comments that VMs don't really get globally unique MAC addresses, and it has also been pointed out that the solution to this is to hand out the node IDs yourself. When you do this you are supposed to set the multicast bit (the one marked [02468ACE] above, so that it would match [13579BDF] instead).

If you can't tolerate leaking the MAC addresses and can't issue node IDs, you're stuck with lesser means. V4 UUIDs are really good for all terrestrial use under modern hardware assumptions (has thermal diode or other hardware RNG). V5 UUIDs can do the job under specific circumstances (they're a hash of something else; the main problem is they're subject to adversarial attack).
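
For illustration, a minimal Python sketch, assuming you issue node IDs yourself: the standard library's uuid.uuid1 accepts an explicit node (and clock_seq), so nothing forces you to leak the MAC address. The random node below is only a stand-in for whatever your coordinator hands out.

    import secrets
    import uuid

    # Stand-in for a node ID issued by your own coordinator. RFC 4122 says
    # locally assigned node IDs must set the multicast bit, i.e. the least
    # significant bit of the first octet (bit 40 of the 48-bit value).
    node_id = secrets.randbits(48) | (1 << 40)

    uid = uuid.uuid1(node=node_id)
    print(uid, uid.version)  # version 1; the node's second hex digit is odd: [13579BDF]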

9
  • 8
    This does require that MAC addresses are truly unique for the UUIDs to be guaranteed unique, and that is no longer true. Hardware MAC addresses are split into a 24-bit OUI representing the hardware vendor and 24 bits, or 16.7 million addresses, per OUI. Bigger vendors might have several OUIs to their name; however, a 10BASE-T NIC from 3 decades ago is likely defunct. Even a card sold 5 years ago is unlikely to be sent to the same customer as one sold today, so it's accepted to re-use MACs within reason. Virtual Machines/Containers just randomly generate MACs on creation.
    – penguin359
    Commented Jan 17 at 3:33
  • 2
    @penguin359: Reuse of old MAC addresses really isn't a problem though because the timestamp guards against it.
    – Joshua
    Commented Jan 17 at 3:34
  • 1
    @penguin359: Enh; he can probably just use v4 UUIDs and doesn't know it.
    – Joshua
    Commented Jan 17 at 3:36
  • 4
    OP is asking for the ideal case of "really really" unique UUIDs though. Doesn't matter how unlikely it is.
    – AnoE
    Commented Jan 17 at 8:05
  • 1
    @penguin359 MAC address collisions are more likely than you'd hope. Leaving aside spoofing (where it's done on purpose), manufacturers make mistakes. Misconfigured or badly cloned virtual machines and embedded devices are especially prone to it, but it happens even for standard network cards. To be safe from this, the nodes need to negotiate the ids amongst themselves.
    – Bergi
    Commented Jan 17 at 23:44
14

Is it possible, when K different hosts are generating IDs? Yes. And we can do it with readily available libraries.

ULIDs do a pretty good job already, but they're not collision free. It looks like the format may soon be known as UUIDv7. We allocate 48 bits for a millisecond timestamp that won't overflow for millennia, plus 80 bits of entropy, for a 128-bit GUID.

So by the birthday paradox, those K servers would (collectively) need to generate on the order of 2 ** 40 random IDs within a single millisecond, to have a decent chance of producing a collision. Suppose K is "large", and servers are fast, and your appetite for risk says the odds aren't looking good enough to you. Fair enough. We could allocate 208 bits of entropy, but the collision risk is still non-zero. There must be another way.

Use a Distributed Consensus protocol, perhaps based on Raft, to enroll all servers in your scheme and assign each a distinct host_id in the range 1 .. K. I don't know how many servers you have. So let us allocate 32 bits to store such IDs. That's enough to accommodate all publicly accessible IPv4 destinations in the global internet. The GUID prefix will be {timestamp, host_id}. Now we have a mere 48 bits of entropy to work with. So if any single host rolled on the order of 2 ** 24 random IDs within a millisecond, it would have a fair chance of self colliding. That's a mere 16 million IDs, achievable if we spend about a tenth of a nanosecond generating each ID. Perhaps plausible for a core router that puts IDs on each forwarded packet.

But wait! By hypothesis these IDs are all being generated locally on a single host. So we can just use a counter for uniqueness. Or several counters, one per functional sub-unit, if for example our router has several line cards or our GPU has several cores.

If all sub-units consume IDs at roughly the same rate, then we can essentially stretch the 32-bit host_id into something longer which encompasses the sub-units. If that's still not enough, we just request additional host_ids.

For sub-units that process IDs at diverse rates, it's easy to make some central thread hand out big counter blocks of a thousand or a million. Oracle database servers have been allocating PK IDs in this way for decades.

So there you have it. Your hosts, and their sub-units, can only generate so many IDs per second, and 2 ** 80 is a large number. Allocate prefix bits to hosts and optionally to sub-units, and then you just need counter(s) for the remaining bits.

This makes the (reasonably weak) assumptions that a host can learn approximately what the current time is, that it can persist such timestamps so upon reboot it only lets the clock march monotonically forward, and that it can keep track of what happened in the last millisecond or so, to coordinate ID uniqueness across its sub-units. If you require even greater rates of ID generation, or if some assumption can't feasibly be engineered into your use case, then just allocate more than 128 bits total for each distinct ID.
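
A minimal Python sketch of the {timestamp, host_id, counter} layout described above; the host_id is assumed to come from your consensus layer, and the counter stands in for the entropy bits:

    import itertools
    import time

    class HostIdGenerator:
        # 128-bit ID = 48-bit ms timestamp | 32-bit host_id | 48-bit counter.
        def __init__(self, host_id: int):
            assert 0 <= host_id < 2 ** 32     # host_id issued via Raft or similar
            self.host_id = host_id
            self.counter = itertools.count()

        def next_id(self) -> int:
            ms = time.time_ns() // 1_000_000          # fits in 48 bits for millennia
            seq = next(self.counter) & (2 ** 48 - 1)  # sketch: assume no wrap per ms
            return (ms << 80) | (self.host_id << 48) | seq

    gen = HostIdGenerator(host_id=7)
    print(f"{gen.next_id():032x}")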

2
  • There are negatives though to using UUIDs with timestamps or server ids embedded in them: they leak info about where and when they were created. For a lot of purposes that's perfectly ok. But if you're counting on the UUID for randomization and anonymization, that can be unacceptable. Commented Jan 16 at 2:24
  • 2
    Well, it's all about how the requirements drive engineering tradeoffs. I feel the chief weakness of my proposal is that, given even a single observed GUID, it is trivial for an attacker to forge GUIDs that collide with high probability. (Note that a host can keep requesting host IDs so it always has two available, stop using the old one, and cycle through ephemeral IDs as often as it likes.) OTOH, in a context where "tiny chance of collision" is acceptable, the ULID timestamp prefix is enormously winning, in that it promotes locality of reference in any RDBMS that uses a GUID as a Primary Key.
    – J_H
    Commented Jan 16 at 4:23
11

It is easy enough, but requires some degree of coordination.

The general scheme is relatively simple: partition the bit-space of the ID in two sections, one with a "generator ID" unique to each generator, and one with a strictly monotonically increasing sequence.

For example: the UUID v1 scheme uses MAC addresses as generator ID, and then timestamp + counter as strictly monotonically increasing sequence.

There are 2 challenges to solve:

  • Ensuring uniqueness of the generator ID.
  • Ensuring strict monotonicity of the sequence.

The first challenge typically requires some form of coordination across generators. This can be a one-off coordination system -- assign a unique ID when setting up the generator -- or it can be a more active form of coordination -- such as registering with a central entity on start-up.

The second challenge requires either knowledge of the system, guardrails, or persistence:

  • Persistence: each generated sequence number is persisted, on start-up a generator restarts from the last persisted number. In practice, persistence is likely best performed in batch: acquire a batch of N numbers (bumping the persisted item by N), then distribute them, rinse and repeat.
  • Knowledge: if the operation for which IDs are provided is known not to exceed a generation rate of N/s, then the sequence number can be seeded on start-up by a timestamp of appropriate resolution, which is then simply incremented.
  • Guardrails: Same as Knowledge, except that a check is made that the maximum rate of ID generation is not exceeded.

As a practical example, I've regularly implemented such a scheme to generate unique 64-bit IDs within a multithreaded application. In such a case, the partition key was a thread index (from 0 to N) and the sequence was seeded with a nanosecond-resolution timestamp stored in a thread-local variable, and incremented on each use. The maximum rate of generation being 1/ns/thread, we knew it wouldn't be exceeded -- our application was plenty fast, but still.

For export, the ID was prepended with a 64-bit ID unique to the running process, obtained from a shared ZooKeeper cluster. The running process would lock the use of this ID for the duration -- via a lease system -- ensuring no other process could possibly use it, and thus we had a 128-bit ID guaranteed unique across all applications.
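
A loose Python sketch of that scheme; the exact field widths above aren't specified, so the 8-bit thread index and 56-bit sequence here are assumptions:

    import itertools
    import threading
    import time

    INDEX_BITS = 8                            # up to 256 threads (assumption)
    SEQ_MASK = (1 << (64 - INDEX_BITS)) - 1   # 56-bit sequence

    _thread_index = itertools.count()         # 0..N, one index per thread
    _local = threading.local()

    def next_id() -> int:
        # Seed the per-thread sequence once from a nanosecond clock, then
        # increment; unique as long as no thread exceeds 1 ID per nanosecond.
        if not hasattr(_local, "seq"):
            _local.index = next(_thread_index)
            _local.seq = time.time_ns() & SEQ_MASK  # truncation: sketch-level shortcut
        _local.seq = (_local.seq + 1) & SEQ_MASK
        return (_local.index << (64 - INDEX_BITS)) | _local.seq

    print(hex(next_id()))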

3
  • Sure you mean uniquity, not *unicity?
    – tchrist
    Commented Jan 17 at 11:11
  • I'd have said "uniqueness". Also, "guardrails" not "railguards" is the English word for barriers that keep you on the road or walkway. (en.wikipedia.org/wiki/Guard_rail) Commented Jan 17 at 12:12
  • @PeterCordes: Thanks a lot! My French is slipping through I am afraid :x Commented Jan 17 at 12:16
6

No, but the chance that the Earth is hit in the near future by a meteor big enough to eradicate most or all of human existence, including the computer that generates the UUIDs, is much higher than the chance of actually having a UUID collision.

So you had better watch for meteors rather than UUID collisions :).

In general, you just throw an error if a collision happens and the data associated with it is lost; if that happens once in your lifetime, you will still be quite lucky, and the business impact of a one-time accident should not be that critical. In any case, the internet is an unstable place: there is a fair chance a request is lost on the way, or that the server or client quits without finishing what it started, etc. Just handle the possible duplication the same way you handle those critical errors.

7
  • 9
    The problem is in detecting collisions — if we could detect them instantly, we would not release those that collided and regenerate a new one right away, so the caller would never even see the throw.
    – Erik Eidt
    Commented Jan 15 at 23:04
  • @ErikEidt - if you throw this kind of error once in your lifetime, you would be a lucky man. So it's fine to just risk it.
    – libik
    Commented Jan 16 at 11:03
  • If you throw this error once in your lifetime, that’s too often. But anyway, check out whether throwing an exception (with absolutely untested handling) or just ignoring the collision is worse. It’s quite possible that a collision with another UUID generated ten years ago will have no negative effect at all.
    – gnasher729
    Commented Jan 16 at 22:37
  • 2
    Even though the probability of randomly picking the same UUID twice is astronomically low, you usually have to carefully consider what would happen when there are collisions anyway, not "just risk it". Even if you throw an exception inserting the record, what state is the (distributed!) system in after that exception; is it valid, or vulnerable to attack? Are you sure that no other records were inserted assuming they were referring to the one that was rejected that are actually now referring to the other one that tried to use that id?
    – Ben
    Commented Jan 17 at 2:21
  • 4
    Because collisions can happen due to clients that are erroneous or malicious, not just due to the RNG actually rolling the same UUID again. You generally can't assume the probability of that is near-zero (especially malicious clients). This point applies to schemes that guarantee uniqueness too though; if you can't assume all clients are non-malicious and bugfree, you have to deal with potential collisions anyway.
    – Ben
    Commented Jan 17 at 2:22
4

There are GUID and UUID schemes that allow independently operating machines to generate universally unique identifiers with near certainty.

But the question asks about a distributed system. Generally, distributed systems involve communication between the nodes. So, yes, you could have a distributed system that generates truly universally unique identifiers by coordinating across a network.

In the simplest case, a central entity generates all IDs and passes them out to requesting nodes. You can improve scalability in a variety of ways. For one, you could carve up the ID space into blocks and delegate authority over those blocks to different nodes.

If you consider this cheating, consider that GUID schemes that rely on a machine's MAC address work exactly like this, because the MAC address space is divvied up by a central authority that gave vendors authority over non-overlapping blocks.
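
A toy Python sketch of that block-based carve-up, with a lock standing in for the network round-trip to the central entity:

    import itertools
    import threading

    class BlockAllocator:
        # Central entity: hands out non-overlapping ranges of the ID space.
        def __init__(self, block_size: int = 1_000_000):
            self.block_size = block_size
            self._next_block = itertools.count()
            self._lock = threading.Lock()

        def claim_block(self) -> range:
            with self._lock:                  # in reality: an RPC to the authority
                block = next(self._next_block)
            start = block * self.block_size
            return range(start, start + self.block_size)

    allocator = BlockAllocator()
    node_a, node_b = allocator.claim_block(), allocator.claim_block()
    print(node_a.start, node_b.start)         # 0 and 1000000: disjoint by construction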

3

Is it possible to design a distributed system to generate unique ids where there is a requirement that every generated id is guaranteed to be unique with no possibility of collision?

It is possible if:

  1. You can guarantee that each node in the system has a unique ID.
  2. Your nodes have a clock (or counter) which you can guarantee will increase monotonically.
  3. Your nodes can be trusted to actually follow the prescribed generation process.
  4. You are ok with anyone who sees the UUIDs being able to determine which UUIDs were generated by the same node.
  5. You are ok with your system having an "expiry date" at some time in the future.

This is what "version 1" UUIDs do. A clock (which will roll over around 3400 AD) is combined with the MAC address of the node (which is supposed to be uniquely allocated by network card vendors based on allocations they receive from the IEEE).

This gives a very small possibility that there will be a collision, but there is still a possibility.

Ultimately there are all sorts of low-probability random events that could ruin your day. At a certain point you just have to decide that the probability is low enough that you will accept the risk. For any reasonable number of UUIDs, the chance of random UUIDs, or hash-based UUIDs generated from unique non-malicious input, accidentally colliding is extremely low.

IMO a far bigger concern than two UUIDs randomly colliding is how you ensure that the systems generating IDs and introducing them to your larger distributed system are doing so non-maliciously.

1
  • At some point, the probability that your software claims two different UUIDs are the same because of some bug is much higher than the probability of your software generating two identical UUIDs that should have been different.
    – gnasher729
    Commented Apr 27 at 10:35
2

Seed each local part of the distributed system with a prefix. After that each local system creates a series of ids by adding one to the previous one.

A simple, easy scheme for a distributed system to generate an unbounded number of guaranteed-unique identifiers.

System 1 would generate the identifiers 1:0, 1:1, 1:2, and so on. System 432 would have 432:0, 432:1, 432:2, and so on, with no possibility of collision.
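
A minimal Python sketch; assigning the prefix is the only coordination step:

    import itertools

    def make_generator(prefix: int):
        # One counter per system; distinct prefixes make collisions impossible.
        counter = itertools.count()
        return lambda: f"{prefix}:{next(counter)}"

    gen_1, gen_432 = make_generator(1), make_generator(432)
    print(gen_1(), gen_1(), gen_432())   # 1:0 1:1 432:0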

In practice however, it is generally not worth the effort, for reasons that have been explained in other answers.

2

You ask (emphasis mine)

guaranteed to be unique with no possibility of collision

No. Compromised or damaged hardware, little things like cosmic rays, and other software, buggy and/or malicious, mean that you can never be 100% sure. What you can be is sure enough.

A 128-bit random int is going to collide a lot less frequently than bits get flipped by *cosmic rays.

*Yes, I know it's not actually cosmic rays doing the flipping, but that is the standard term for it.

0

I didn't think that this was possible,

This answer shows one rather easy possibility to create such a system. It can work as described in certain circumstances (and has been implemented like this in business systems I have been involved with), but some aspects would be impractical on larger scales. Still, it is a proof of concept that it is, indeed, possible.

  • Create a central DB which stores all generated UUIDs. This needs to be nothing special - it just needs enough storage to store all of them. If that should be an issue over time, there would be tricks to get around that ("high water marks" and such). The DB requires only a single standard RDBMS feature (the ability to create a simple unique index).
  • Every system that wants to generate a new UUID can do so freely by any algorithm you might think of - for example the one based on a relatively unique MAC address, timestamp, and counter. Or it may simply generate a random number by any means.
  • As the last step of UUID creation, that UUID has to be sent to the central DB. The DB will insert the UUID into its table of UUIDs. If at that point a duplicate is detected, it will tell the generating system and refuse the UUID. Otherwise, it reports success. Collisions are impossible since we assume standard sane/safe RDBMS behaviour.
  • If the generator receives a failure answer, then it generates another UUID. This is repeated until success.

This still respects all the specifications in the OP's post - the OP does not require that the distributed system be allowed to split into disjoint partitions, so we can assume connectivity to the central system; it is not specified that this needs to work across different companies etc. either, so we can waive many practical aspects for now.
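
A minimal proof-of-concept in Python, with an in-memory SQLite database standing in for the central RDBMS; the only feature used is the unique constraint:

    import sqlite3
    import uuid

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ids (id TEXT PRIMARY KEY)")

    def register_new_uuid() -> str:
        # Generate by any means, then let the central DB arbitrate uniqueness.
        while True:
            candidate = str(uuid.uuid4())
            try:
                with db:   # commits on success, rolls back on error
                    db.execute("INSERT INTO ids VALUES (?)", (candidate,))
                return candidate
            except sqlite3.IntegrityError:
                continue   # duplicate detected centrally: generate another

    print(register_new_uuid())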

1
  • As this is receiving roughly the same number of upvotes and downvotes: downvoters, feel free to comment on how this answer could be improved or why you feel that it does not answer the question.
    – AnoE
    Commented Jan 19 at 16:05
0

If your goal is having no collisions, then there is a simple strategy that achieves it, at the expense of having many collisions if there are any:

You have a large number of processes that generate UUIDs. For example, my computer might be one such process. Each process generates one random UUID, and from then on returns the previous value plus one every time. If two processes each generate a million UUIDs, then you get a collision only if the initial UUIDs are less than a million apart. However, if you have one collision, you will have many; the probability of having any collision at all is much smaller, while the expected number of collisions is exactly the same as with random UUIDs. What is very important is that the first random number is truly 128 bits or more.
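
A minimal Python sketch of one such process; as noted, the crucial assumption is that the starting point is truly random and at least 128 bits:

    import secrets

    class SequentialSource:
        # One truly random 128-bit start, then consecutive values.
        def __init__(self):
            self._next = secrets.randbits(128)

        def next_uuid(self) -> int:
            value = self._next
            self._next = (self._next + 1) % 2 ** 128
            return value

    src = SequentialSource()
    print(f"{src.next_uuid():032x}")
    print(f"{src.next_uuid():032x}")   # previous value + 1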

But "guaranteed unique UUIDs" is only possible if you have one database keeping track of all UUIDs ever created, and not allowing any duplicate entries. What would be good enough is a central database that hands out ranges of say one million UUIDs to processes wanting to create UUIDs, and those processes are responsible for not creating duplicate UUIDs in their range, and no process is ever allowed to create a UUID outside that range. Of course there is no way to enforce this.

But then, this is all pointless. If people go on creating proper random UUIDs forever, without anyone checking for duplicates, whether they are unique or not only matters if you can find two identical UUIDs. And that is impossible if they are random, because you would need to compare about 2^62 UUIDs to another 2^62 UUIDs to find a duplicate, and that is not feasible.

Of course if someone acts in an utterly stupid way, and for example creates UUIDs based on the current time in seconds (less than 32 million per year), then you will have lots of duplicates. But if you have two UUIDs, one created by a stupid process, and one created in a proper random way, you will never manage to find identical UUIDs from two sources.

Say the developers at Apple went completely mad and their UUID creator on iOS started creating UUIDs based only on the second of the current time (much less than a billion different UUIDs over the next 30 years). There would be plenty of collisions among those, but you would still have to check about 2^94 properly random UUIDs to find a collision between the two sets.

And most of the times collisions between UUIDs don't actually matter. Say my music player uses UUIDs to identify songs in my library, and in everyone else's libraries all over the world. And my photo app uses UUIDs to identify photos. Neither my music player nor yours nor any of our photo apps will ever compare the UUID of a song with the UUID of a photo. So if there is a collision between those groups, nobody will ever know and nobody will ever be affected by it.
