2

If one uses a passphrase which is grammatical, such as "I put my keys under the doormat because they're safe there."... how many words (on average) must it contain to be as secure as a password with ~ 14 characters, such as 28fjha9;582g-jg ? (say we allow 26 lowercase, 26 capital, 10 digits and 10 punctuation, so total of 72).

I've heard people say that only a few thousand English words are commonly used, so let's say that there are 5000 possible words. If the passphrase were not grammatical, then using 8 words would give us better entropy than a 14-character password (assuming the words weren't all super short).

5000^8 = 3.906*10^29 and 72^14 = 1.006*10^26
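
To put the two counts on a common scale, they can be converted to bits with a base-2 logarithm. Here is a minimal Python sketch, assuming every word or character is drawn uniformly and independently at random. The helper name `entropy_bits` is illustrative only.

```python
import math

def entropy_bits(alphabet_size: int, length: int) -> float:
    """Bits of entropy for `length` symbols drawn uniformly at random
    from an alphabet of `alphabet_size` symbols."""
    return length * math.log2(alphabet_size)

# 8 random words from a 5000-word list (assumption: uniform, independent picks)
passphrase_bits = entropy_bits(5000, 8)
# 14 random characters from the 72-symbol set described in the question
password_bits = entropy_bits(72, 14)

# Smallest number of random words that matches the random password
min_words = math.ceil(password_bits / math.log2(5000))

print(f"passphrase: {passphrase_bits:.1f} bits")
print(f"password:   {password_bits:.1f} bits")
print(f"random words needed: {min_words}")
```

Under these assumptions the passphrase comes to roughly 98.3 bits and the password to roughly 86.4 bits. Seven random words fall just short of the password, so eight is the minimum for the non-grammatical case.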

However, if our passphrase is grammatical, I would think this severely restricts the number of combinations.

Given this restriction on the passphrase, how long does it need to be in order to be as secure as a 14-character password randomly composed of ~ 72 characters? What if we apply some restriction on the password as well, expecting that it will be mostly composed of dictionary words and a few extra characters?

EDIT: I just realized my calculation of 5000^8 also doesn't take into account punctuation and capital letters, so presumably it would be much better than that.

4 Answers

2

Unfortunately, I don't think our industry has the information we would need to answer your question. You are right that going from a random sequence of words to a grammatically valid sequence of words leaves far fewer combinations, but there is no easy way to determine how many grammatical combinations can be produced. That count is exactly what you would need in order to compare passphrases with random passwords.

As you increase the number of words in a sequence, you also exponentially increase the overall number of possibilities to check, which results in the problem of not having enough time or computing power to check them all. So while you can check a specific sequence fairly quickly, you can't check all possible sequences very fast once you reach a certain number of words (6, maybe 7?). Pre-tagging words with their relevant part of speech and constructing sentence possibilities by selecting appropriate tags would be faster than a brute-force approach, but I don't know how much faster. I don't think we'd reach 1.006*10^26 acceptable combinations before hitting the computational wall. I could be overestimating the burden of this work, but that's my initial suspicion.
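
The tag-based idea can be illustrated by counting rather than enumerating. The sketch below is a toy model, not a real grammar. The tag names, the per-tag word counts (chosen to sum to 5000) and the three templates are all made-up assumptions, and each word is assumed to carry exactly one tag so that no sentence is counted twice.

```python
from math import log2, prod

# Hypothetical lexicon: part-of-speech tag -> number of words with that tag.
# Counts are invented for illustration and sum to 5000.
LEXICON_SIZES = {
    "DET": 10, "ADJ": 1000, "NOUN": 2500,
    "VERB": 1200, "PREP": 40, "ADV": 250,
}

# Hypothetical sentence templates; a real grammar would have far more.
TEMPLATES = [
    ("DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"),
    ("DET", "NOUN", "VERB", "PREP", "DET", "ADJ", "NOUN"),
    ("DET", "NOUN", "ADV", "VERB", "DET", "NOUN"),
]

def template_count(template):
    """Number of word sequences that fit one tag template."""
    return prod(LEXICON_SIZES[tag] for tag in template)

# Templates are distinct and tags are disjoint, so the counts simply add up.
total = sum(template_count(t) for t in TEMPLATES)
total_bits = log2(total)

# Unrestricted sequences of the same maximum length, for comparison
unrestricted_bits = 7 * log2(sum(LEXICON_SIZES.values()))

print(f"grammatical (toy templates): {total_bits:.1f} bits")
print(f"unrestricted 7-word sequences: {unrestricted_bits:.1f} bits")
```

Counting per template is cheap. The expensive part for an attacker is testing every candidate, and that cost grows with the number of templates and the size of each tag class.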

As the CMU paper that @Steve-dl mentioned (Effect of Grammar on Security of Long Passwords) discusses, there is a further problem: attackers will narrow the set of acceptable grammatical sequences down to the likely grammatical sequences. This is where I've seen most passphrase-cracking research focused. Researchers try to home in on the likely choices, using either word sequences pulled directly from public sources (song titles and lyrics, book passages, famous quotes, Wikipedia sentences, etc.) or partial-phrase combinator attacks ("iloveto" + wordlist entry). But the number of word combinations this produces will vary from attacker to attacker, so it is still difficult to say exactly what numbers these approaches can produce.

So to have a 'secure' grammatically correct passphrase you would not only need to make sure it was formed from a sufficiently long combination of words, but also make sure it has not appeared in popular media that could be part of an attacker's wordlist.

3
  • 2
    It's worth noting that there is a further decrease in possibilities from the space of semantically 'legit' and commonly observed sentences to the space of sentences that one would pick as an auth factor to remember. With passwords we can tell that some words (password, test, monkey, etc.) are much likelier to appear than others. It's not unreasonable to expect the same phenomenon with passphrases, and that is what we have absolutely no data to reason about at the moment... Commented Aug 16, 2014 at 0:19
  • @SteveDL True. Much of our spoken language is repetitive, dull, and vague ("yeah, that's true", "they probably wouldn't have done that"), while much of the content of, say, Wikipedia is very technical. The pass phrases that most people would pick would probably be somewhere in between - not overly technical but containing enough "interesting" words to be easy to remember. Commented Aug 16, 2014 at 2:42
  • Wikipedia is actually biased towards sports events and pop stars. Otherwise it's a very rich but not so technical dataset. (I've done latent semantic analysis with it in a distant past) Commented Aug 16, 2014 at 9:10
9

Shannon's original paper on the subject and further research give grammatical English about 1.3 bits of entropy per character. You can compute the entropy of a random password by raising the size of the character set to the power of the password length and then taking the log base 2 of that value, e.g., an n-character password created using the 96 printable ASCII characters would have an entropy of log2(96^n).

From this, you can compute the length of English text needed to get the same security. For example, an eight-character random password is equivalent to (log2(96^8)/1.3), or about a 40-character sentence.
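
The same calculation can be applied to other password sizes. This is a minimal Python sketch under the answer's own assumptions: a uniformly random password and a flat 1.3 bits per character for English text. The function names are illustrative only.

```python
import math

BITS_PER_CHAR_ENGLISH = 1.3  # average estimate quoted in the answer

def random_password_bits(alphabet_size: int, length: int) -> float:
    """Entropy in bits of a uniformly random password: log2(alphabet_size ** length)."""
    return length * math.log2(alphabet_size)

def equivalent_english_chars(alphabet_size: int, length: int,
                             bits_per_char: float = BITS_PER_CHAR_ENGLISH) -> float:
    """Characters of English text carrying the same number of bits,
    assuming a constant per-character rate."""
    return random_password_bits(alphabet_size, length) / bits_per_char

# The answer's example: 8 characters from a 96-symbol set
print(f"{equivalent_english_chars(96, 8):.1f}")
# The question's case: 14 characters from a 72-symbol set
print(f"{equivalent_english_chars(72, 14):.1f}")
```

With these inputs, the 8-character example comes out at about 40.5 characters of English and the question's 14-character password at about 66.4. The 1.3-bit figure is an average over ordinary text, so this is a rough estimate rather than a guarantee for any particular sentence.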

5
  • Wow, Shannon was really ahead of his time. Cool paper, thanks. Commented Aug 15, 2014 at 22:22
  • 5
    There has been some disagreement on how useful Shannon entropy is in trying to measure password strength: http://www.passwordresearch.com/papers/paper229.html
    – PwdRsch
    Commented Aug 15, 2014 at 22:36
  • Indeed. Passwords do not have the same distributions as English words and so it'd be dangerous to assume passphrases would, without some real-world evidence to back this up. Commented Aug 15, 2014 at 22:43
  • Sadly, the link is now broken. Commented Mar 24, 2017 at 1:11
  • 1
    @CareyGregory, link fixed
    – Mark
    Commented Mar 24, 2017 at 1:22
1

You're comparing apples and oranges.

Just as you can predict a dictionary word from the letters typed so far (or a word from the 'password language', as Telepathwords does, which is basically the same NLP problem with a better corpus to train your predictions on), you can predict the next words of a sentence as soon as you have a corpus to base your predictions on.

Assuming this, you can treat the language of your targets, the domain of the service where the factor is in use, and any prior credential leaks as prior information, and reason in terms of information leakage. Of all the possible grammatically correct combinations of words, how many are likely to be semantically correct (basic latent semantic analysis on a language's corpus), and which are more relevant to your user (which is a sort of meaning-disambiguation learning problem)?

What I mean is that basically, the passphrase alphabet makes it easier for users to create meaningful passphrases and whilst this is a good thing usability-wise it also means that you're more likely to find prior information valuable than with actual random passwords (as opposed to truly user-chosen passwords). It's very nice to want to put scores on different credentials with 'objective' metrics but the environment and way in which they are used contributes much more to both their relative security and usability than their form itself.

4
  • So if I hear you right, you're saying the comparison would be more appropriate between non-random passwords and non-random passphrases. That's reasonable. In that sort of situation, do you have an off-the-cuff idea of which is more secure? Or any links to papers? Commented Aug 15, 2014 at 21:37
  • You hear me right. I've seen some calculations made on SE, I've seen a lot of papers by folks at CMU CUPS on passphrases (Google Scholar will help tremendously). I don't consider either to be definitive and objective knowledge because of the conditions in which these studies are done. There's basically no real data on passphrase usage so it's hard to discuss what characteristics the passphrase space would exhibit and then what 'entropy' it'd have or how it'd 'leak information'. Commented Aug 15, 2014 at 21:41
  • 2
    Until then, the CMU paper on correct horse battery staple provides some nice heuristics to reason with, and this other paper on how grammar affects passphrases might be relevant (disclaimer: haven't read it). Commented Aug 15, 2014 at 21:43
  • If someone tells you that they have a definitive answer to that question (not for the random space but the actual space) I would rather assume they don't know what they're saying. There are very serious validity concerns with lab studies of passwords, on top of the usual validity and applicability issues of lab work. Commented Aug 15, 2014 at 21:44
1

I did an analysis for Google based on a sample of 15 million un-salted hashed passwords used for Google accounts by actual users. Some slides based on the results are published here, "Limits to Anti phishing".

http://research.google.com/pubs/SecurityCryptographyandPrivacy.html

If you look at page 14, the bubble chart shows the frequency of common passwords. One of the biggest conclusions is that a dictionary of just 1000 passwords is used by 6.1% of accounts. About 1 million passwords are used by 50% of accounts. Thus, brute-force attacks on a known salted hash are fairly trivial, and even online password-guessing attacks can be relatively successful against some fraction of accounts, unless your login system detects online password-guessing attempts (which all secure systems should).

To the extent that 1000 passwords are shared by a fraction of accounts, 1000 ~ 2^10, which is consistent with the estimate 8 x 1.3 = 10.4 bits.
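
The comparison above can be checked numerically. This short Python sketch uses only the figures quoted in this answer. The assumption is that an 8-character password is the right length to compare against, as the 8 x 1.3 estimate implies.

```python
import math

dictionary_size = 1000        # passwords covering 6.1% of accounts (from the slides)
large_dictionary = 1_000_000  # passwords covering about 50% of accounts

dictionary_bits = math.log2(dictionary_size)
large_dictionary_bits = math.log2(large_dictionary)
shannon_estimate_bits = 8 * 1.3  # 8 characters at 1.3 bits per character

print(f"1000-password dictionary:    {dictionary_bits:.2f} bits")
print(f"1-million-password dictionary: {large_dictionary_bits:.2f} bits")
print(f"Shannon-style estimate:      {shannon_estimate_bits:.1f} bits")
```

The 1000-password dictionary corresponds to just under 10 bits, close to the 10.4-bit estimate. The million-password dictionary that covers about half of all accounts corresponds to just under 20 bits.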
