2

I have a method that will generate 50,000 random strings, save them all to a file, and then run through the file, and delete all duplicates of the strings that occur. Out of those 50,000 random strings, after using set() to generate unique ones, on average 63 of them are left.

Function to generate the strings:

def random_strings(size=8, chars=string.ascii_uppercase + string.digits + string.ascii_lowercase):
    return ''.join(random.choice(chars) for _ in xrange(size))

Delete duplicates:

    with open("dicts/temp_dict.txt", "a+") as data:
        created = 0
        while created != 50000:
            string = random_strings()
            data.write(string + "\n")
            created += 1
            sys.stdout.write("\rCreating password: {} out of 50000".format(created))
            sys.stdout.flush()

        print "\nRemoving duplicates.."
        with open("dicts\\rainbow-dict.txt", "a+") as rewrite:
            rewrite.writelines(set(data))

Example of before and after: https://gist.github.com/Ekultek/a760912b40cb32de5f5b3d2fc580b99f

How can I generate completely random unique strings without duplicates?

3
  • Do you require 2 files or do you just want 50000 unique strings? Commented Oct 18, 2016 at 18:26
  • What is set(data) supposed to do?
    – thebjorn
    Commented Oct 18, 2016 at 18:27
  • @SimonBlack 50k unique strings Commented Oct 18, 2016 at 18:29

2 Answers

3

You can use set from the start

created = set()
while len(created) < 50000:
    created.add(random_strings())

And save once outside the loop
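Spelled out in full, the idea above might look like the following minimal sketch. It is written for Python 3 (`range` rather than `xrange`), and the helper names `unique_random_strings` and `save_strings` are illustrative assumptions, not part of the original code.

```python
import random
import string

CHARS = string.ascii_uppercase + string.digits + string.ascii_lowercase


def random_string(size=8, chars=CHARS):
    """Return one random string of `size` characters drawn from `chars`."""
    return ''.join(random.choice(chars) for _ in range(size))


def unique_random_strings(count, size=8, chars=CHARS):
    """Keep drawing until the set holds `count` distinct strings."""
    if count > len(chars) ** size:
        raise ValueError("count exceeds the number of possible strings")
    created = set()
    while len(created) < count:
        # Adding a duplicate leaves the set unchanged, so the loop simply
        # draws again until enough distinct strings have been collected.
        created.add(random_string(size, chars))
    return created


def save_strings(strings, path):
    """Write every string once, one per line, after generation is finished."""
    with open(path, "w") as out:
        out.writelines(s + "\n" for s in strings)
```

A call such as `save_strings(unique_random_strings(50000), "unique_strings.txt")` (the file name is only a placeholder) writes the file once. Because the set is filled before anything is written, no separate de-duplication pass over the file is needed.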

5
  • Wouldn't this slow down this process a whole lot though? Commented Oct 18, 2016 at 18:29
  • @Pyth0nicPenguin, less than rewriting files. And if you remove duplicates after creating 50k words, you end up with fewer than 50k words.
    – volcano
    Commented Oct 18, 2016 at 18:32
  • BTW, if you are worried about execution time, you probably shouldn't log every one of the 50k combinations.
    – volcano
    Commented Oct 18, 2016 at 18:33
  • I'm not too worried about the execution time; I was just curious why I was only getting 63 out of 50k. I'll give this a shot and see what happens, thank you Commented Oct 18, 2016 at 18:35
  • 2
    @Pyth0nicPenguin, probably because you get too many repetitions. random is not as random as advertised :-) . And you are welcome
    – volcano
    Commented Oct 18, 2016 at 18:37
0

You could guarantee unique strings by generating unique numbers, starting with a random number in a range that is 1/50000th of the total number of possibilities (62^8). Then generate more random numbers, each time determining the window from which the next number can be selected. This is not perfectly random, but I believe it is close enough in practice.
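For a sense of scale, here is a rough calculation of the size of that space, the width of a 1/50000th window, and a birthday-style estimate of how many collisions 50,000 independent draws would produce. It assumes Python 3 (true division) and uniformly distributed draws. The variable names are illustrative only.

```python
import string

chars = string.ascii_uppercase + string.digits + string.ascii_lowercase
size = 8
count = 50000

total = len(chars) ** size   # 62 ** 8 possible strings
window = total // count      # width of a 1/50000th slice of the space

# Birthday-style approximation for the expected number of colliding pairs
# among `count` uniform draws from `total` values: n * (n - 1) / (2 * N).
expected_collisions = count * (count - 1) / (2 * total)

print(total)
print(window)
print(expected_collisions)
```

Under those assumptions the expected number of collisions is far below one, so chance repetition on its own would not be expected to remove a noticeable share of 50,000 strings.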

Then these numbers can each be converted to strings by treating each one as a base-62 number. Here is the code, and a test at the end to check that indeed all 50000 strings are unique:

import string
import random

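def random_strings(count, size=8, chars=string.ascii_uppercase + string.digits + string.ascii_lowercase):
    # Largest number that fits in `size` digits of base len(chars)
    max_value = len(chars) ** size - 1
    start = 0
    choices = []
    for i in range(count):
        # Pick the next number from a window at the front of the remaining range
        start = random.randint(start, start + (max_value - start) // (count - i))
        # Convert the number to a fixed-width string, least significant digit first
        digits = []
        temp = start
        while len(digits) < size:
            temp, digit = divmod(temp, len(chars))
            digits.append(chars[digit])
        choices.append(''.join(digits))
        # Move past the chosen number so that later picks are strictly larger
        start += 1
    return choices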

choices = random_strings(50000)
# optional shuffle, since they are produced in order of `chars`
random.shuffle(choices)
# Test: output how many distinct values there are:
print (len(set(choices)))

See it run on repl.it

This produces your strings in linear time. With the above parameters you'll have the answer within a second on the average PC.
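If uniform sampling matters more than the windowing trick, one alternative (a different technique from the code above, not a restatement of it) is to let `random.sample` pick distinct integers from the whole range and then apply the same base-62 encoding. This is a minimal Python 3 sketch. `to_base` and `sampled_unique_strings` are hypothetical helper names.

```python
import random
import string

CHARS = string.ascii_uppercase + string.digits + string.ascii_lowercase


def to_base(number, size, chars=CHARS):
    """Encode `number` as exactly `size` digits in base len(chars)."""
    digits = []
    for _ in range(size):
        number, index = divmod(number, len(chars))
        digits.append(chars[index])
    # Digits are emitted least significant first, as in the code above.
    return ''.join(digits)


def sampled_unique_strings(count, size=8, chars=CHARS):
    """Draw `count` distinct integers, then encode each one as a string."""
    total = len(chars) ** size
    # random.sample never repeats a value, and range() is lazy, so the
    # full space of `total` numbers is never built in memory.
    numbers = random.sample(range(total), count)
    return [to_base(n, size, chars) for n in numbers]
```

Since each integer below `len(chars) ** size` maps to a different fixed-width string, distinct numbers give distinct strings. One caveat: `random.sample` needs `len()` of the range, so this sketch only works while the total number of possibilities fits in a platform-sized integer, which holds for 62^8 on 64-bit systems.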
