83
$\begingroup$

If I have a ciphertext, what information about the probable ciphers used can I infer from the ciphertext itself? For example, information from:

  • the length
  • the spacing / grouping of characters
  • the range of characters used
  • frequency counts

and any other information that is intrinsic to the ciphertext itself.

I'm talking about pencil-and-paper ciphers of the kind often used in cryptogram puzzles, rather than real, 'strong', modern forms of crypto.

$\endgroup$
7
  • 7
    $\begingroup$ Wow... look what we have done to ourselves. A gem of a question, I wish I could upvote it 10 times. $\endgroup$
    – skv
    Commented Dec 4, 2014 at 17:04
  • 2
    $\begingroup$ Inspired by this TED talk I saw a while back, I was wondering what types of tools/programs people on this site are using to attack ciphers and spot patterns. Is anyone using any sort of data visualization tools? If so, how do they add to the question of this post? $\endgroup$
    – BmyGuest
    Commented Dec 16, 2014 at 7:50
  • $\begingroup$ @BmyGuest, I'd find that interesting too. I suggest you ask it as a new question. $\endgroup$
    – A E
    Commented Dec 16, 2014 at 10:22
  • $\begingroup$ @AE I wanted to do this first, but then considered it a bit 'out of scope' for this site. It is strongly going into cryptography and computer science. That's why I posted it as a comment here and not as a question. (See parallels to my double-encryption question.) $\endgroup$
    – BmyGuest
    Commented Dec 16, 2014 at 10:24
  • $\begingroup$ @BmyGuest, I think it's in-scope if you word it as 'tools for solving cryptogram puzzles' or something like that. It's about methods for solving puzzles, rather than 'real' modern cryptography, after all. $\endgroup$
    – A E
    Commented Dec 16, 2014 at 10:28

5 Answers

45
$\begingroup$

I'll start with a few simple ones to get us rolling.

Base64 has a 2/3 chance of requiring one or two trailing = signs - if you spot any of these, there's a very good chance you're dealing with Base64 encoded text.
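
A quick way to test the Base64 guess is to check the alphabet, the length and the padding, and then simply try to decode. The sketch below assumes the standard Base64 alphabet (`A-Z`, `a-z`, `0-9`, `+`, `/`); the function name `looks_like_base64` is made up for illustration.

```python
import base64
import binascii
import re

# Standard Base64 alphabet, with at most two '=' padding characters at the end.
BASE64_RE = re.compile(r"[A-Za-z0-9+/]*={0,2}")

def looks_like_base64(ciphertext: str) -> bool:
    """Return True if the text is well-formed standard Base64."""
    text = "".join(ciphertext.split())  # ignore spaces and line breaks
    if not text or len(text) % 4 != 0:
        return False  # padded Base64 always comes in blocks of 4 characters
    if not BASE64_RE.fullmatch(text):
        return False
    try:
        base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True
```

A positive result is only a hint: any short run of letters whose length is a multiple of 4 will also pass, so look at the decoded bytes before drawing conclusions.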

A Bacon cipher is built from two distinct symbols (in effect, binary bits), which can be represented by anything (eg. could be upper/lowercase as in one recent question, or could be based on whether a cat is white/black, as seen in another). The tell-tale sign is if the ciphertext length is divisible by 5, as each letter requires 5 binary bits of Bacon. Yum.
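
Once the two symbols have been mapped to `A` and `B`, decoding is mechanical. This is a minimal sketch that assumes the 26-letter variant of the Bacon alphabet (A = AAAAA, B = AAAAB, and so on in binary order); the older 24-letter variant merges I/J and U/V, so some letters would come out differently. The helper names are illustrative.

```python
def bacon_decode(bits: str) -> str:
    """Decode a string of 'A'/'B' symbols, five per plaintext letter."""
    if len(bits) % 5 != 0:
        raise ValueError("length is not a multiple of 5")
    out = []
    for i in range(0, len(bits), 5):
        group = bits[i:i + 5]
        # Read the group as a 5-bit binary number: A = 0, B = 1.
        value = int(group.replace("A", "0").replace("B", "1"), 2)
        out.append(chr(ord("A") + value) if value < 26 else "?")
    return "".join(out)

def case_to_bits(text: str) -> str:
    """Assumed mapping: lowercase letter = 'A', uppercase letter = 'B'."""
    return "".join("B" if c.isupper() else "A" for c in text if c.isalpha())
```

If the output is gibberish, it is worth swapping the roles of the two symbols before giving up on Bacon.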

Frequency analysis is most useful for Caesar ciphers (also known as Caesarian shifts) or similar ciphers where a letter is always encoded the same way (ie. if an E encodes to a K anywhere in the text, it will encode to a K everywhere in the text). Many websites have frequency tables for how often you should expect a letter to appear, and if you have a reasonably large block of ciphertext, this may be a good way to get some initial letters.
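
As a rough illustration of frequency counting, the sketch below tallies the letters and then tries the simplest possible guess for a Caesar shift: that the most common ciphertext letter stands for E. That guess is an assumption that often fails on short texts, and all function names here are illustrative.

```python
from collections import Counter
import string

def letter_frequencies(text: str) -> list[tuple[str, int]]:
    """Count A-Z (case-insensitive), most common first."""
    letters = [c for c in text.upper() if c in string.ascii_uppercase]
    return Counter(letters).most_common()

def caesar_shift(text: str, shift: int) -> str:
    """Shift every ASCII letter forward by `shift`, leaving the rest alone."""
    out = []
    for c in text:
        if c.isascii() and c.isalpha():
            base = ord("A") if c.isupper() else ord("a")
            out.append(chr((ord(c) - base + shift) % 26 + base))
        else:
            out.append(c)
    return "".join(out)

def guess_caesar(ciphertext: str) -> str:
    """Assume the most common ciphertext letter is E and undo the shift."""
    most_common = letter_frequencies(ciphertext)[0][0]
    shift = (ord("E") - ord(most_common)) % 26
    return caesar_shift(ciphertext, shift)
```

With only 25 possible shifts, it is just as practical to print `caesar_shift(ciphertext, s)` for every `s` and read the results.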

A very simple numerical substitution cipher (often called A1Z26) replaces letters with their numeric position in the alphabet (eg. A is 1, B is 2, etc) - this may also be combined with a Caesarian shift. If you have a sequence of numbers where none are over 26, try a quick substitution and see if you get anywhere. Numbers above 26 may still be substitutable using modulo, but it's less common.
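
The number-to-letter substitution described above takes one line to try. In this sketch the optional `shift` parameter is an assumption about how a Caesar shift might be layered on top, and numbers above 26 are wrapped with modulo as suggested.

```python
import re

def decode_letter_numbers(ciphertext: str, shift: int = 0) -> str:
    """Turn 1..26 into A..Z, optionally undoing a Caesar shift first."""
    numbers = [int(n) for n in re.findall(r"\d+", ciphertext)]
    # Subtract 1 so that A = 0, undo the shift, then wrap into 0..25.
    return "".join(chr((n - 1 - shift) % 26 + ord("A")) for n in numbers)
```

For example, `decode_letter_numbers("8 5 12 12 15")` gives `HELLO`.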

Playfair ciphers are composed of digrams (pairs of letters) so the pair of letters HI may encode to something like PK. If you have an even-length ciphertext, split it into digrams and look for duplicates. If you spot the same digram appearing multiple times, you should begin to suspect a Playfair cipher (or other digram-based cipher).
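
The digram check can be sketched as follows; the function name is illustrative, and the code only counts pairs at even offsets, which assumes no letters have been dropped from the start of the ciphertext.

```python
from collections import Counter

def repeated_digrams(ciphertext: str) -> dict[str, int]:
    """Split the letters into pairs and return the pairs seen more than once."""
    letters = "".join(c for c in ciphertext.upper() if c.isalpha())
    if len(letters) % 2 != 0:
        raise ValueError("odd number of letters; not a plain digram cipher")
    pairs = [letters[i:i + 2] for i in range(0, len(letters), 2)]
    return {pair: n for pair, n in Counter(pairs).items() if n > 1}
```

One extra hint: standard Playfair never produces a pair made of the same letter twice, so a doubled pair such as `LL` at an even offset argues against it.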

Polybius ciphers (or Polybius squares) are composed of the same 5 characters over and over again (for example, A, B, C, D, E) and also happen to be digram-based so it must be an even-length ciphertext. If you have only 5 repeated characters and can split them into digrams, it's highly likely you're dealing with Polybius.
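
A minimal Polybius decoder, assuming an unkeyed 5x5 square filled in alphabetical order with I and J sharing a cell, and assuming the first symbol of each pair is the row. Puzzles often use a keyed square or column-first order, so treat this as a first attempt only.

```python
def polybius_decode(ciphertext: str, symbols: str = "ABCDE") -> str:
    """Decode pairs of coordinate symbols using an unkeyed 5x5 square."""
    square = "ABCDEFGHIKLMNOPQRSTUVWXYZ"  # 25 letters, J is folded into I
    coords = [c for c in ciphertext if c in symbols]
    if len(coords) % 2 != 0:
        raise ValueError("odd number of coordinate symbols")
    out = []
    for i in range(0, len(coords), 2):
        row = symbols.index(coords[i])
        col = symbols.index(coords[i + 1])
        out.append(square[row * 5 + col])
    return "".join(out)
```

Passing `symbols="12345"` handles the common numeric form of the square.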

One more common-sense point - if you have a keyword then you can usually safely rule out any cipher which doesn't require a keyword. For example, Base64 is just an algorithm which you put text in and it gets spat out the other end, while a Playfair cipher uses a keyword to make the cipher stronger.

$\endgroup$
6
  • 11
    $\begingroup$ For Base64, I'd only refer to it as an "Encoding" and not an encryption. $\endgroup$ Commented Dec 4, 2014 at 16:24
  • 1
    $\begingroup$ @CoreyOgburn, you have a good point re terminology, but I think it's in-scope of what I intended with the question. All these possibilities aren't really what I'd call proper 'encryption', at least not these days! Historical interest only, as real-world encryption techniques. 'Cipher' is a relevant term but 'encoding' is certainly another one. $\endgroup$
    – A E
    Commented Dec 4, 2014 at 17:19
  • 1
    $\begingroup$ FWIW, in crypto terms a Caesar shift is a substitution cipher (with a particular additional constraint). Frequency analysis helps with all substitution ciphers, not just Caesar shifts. If you get a distribution of code symbols that looks like letter frequencies then you have a lead. It also helps with a Vigenère cipher if you can guess the key length (or try different lengths until you get a plausible-looking spectrum). $\endgroup$ Commented Dec 4, 2014 at 21:18
  • $\begingroup$ Note: Playfair doesn't do "E" -> "PK" it does "TH" -> "AB" and "TO" -> "XY". It substitutes two letters for two other letters. $\endgroup$
    – user2322
    Commented Dec 5, 2014 at 0:19
  • $\begingroup$ @MichaelT wow, I brainfarted hard there :P Updated, cheers $\endgroup$
    – Joe
    Commented Dec 5, 2014 at 8:08
22
$\begingroup$

Groups of 5 or total length divisible by 5

Might be the Baconian Cipher, which uses a set of 5 binary-coded ('A' or 'B') items to represent each letter in the plaintext.

Example: Lolcat Steganography: Find the message hidden within the transport medium of humorous feline photography has 20 cats, and some of the possible plaintext answers are 4 characters in length. So each plaintext character could be represented by 5 cats, which makes it worth checking for Bacon.

Short groups (1 to 4 characters)

Check for Morse Code. Is there a binary distinction between the characters which could easily be mapped into dots and dashes? Look particularly for circles or lines in the ciphertext symbols, which could map into dots and dashes.

Example: Decode the message enciphered in these symbols: ◳◰ ◓◨ ◨◧◕ ◎◌ ◱◯◱◯ ◍◌○ ◉◉ ◇◔◓◕ ◐►◓◒ ◒◑ ◈◑ ◆◆◓ ◉◉◉

Letters need at most 4 dots and dashes. The maximum group length rises to 5 if digits are used and 6 if punctuation is used, and prosigns can be even longer. So if the ciphertext is almost all groups of up to 4 characters then it still might be worth checking for Morse.
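
Once a candidate dot symbol and dash symbol have been picked, a small decoder makes it quick to test the Morse guess. This sketch covers letters only, and the `dot` and `dash` parameters are placeholders for whatever two symbols the puzzle uses.

```python
MORSE = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E", "..-.": "F",
    "--.": "G", "....": "H", "..": "I", ".---": "J", "-.-": "K", ".-..": "L",
    "--": "M", "-.": "N", "---": "O", ".--.": "P", "--.-": "Q", ".-.": "R",
    "...": "S", "-": "T", "..-": "U", "...-": "V", ".--": "W", "-..-": "X",
    "-.--": "Y", "--..": "Z",
}

def morse_decode(ciphertext: str, dot: str = ".", dash: str = "-") -> str:
    """Decode whitespace-separated groups; unknown groups become '?'."""
    table = str.maketrans({dot: ".", dash: "-"})
    return "".join(MORSE.get(group.translate(table), "?")
                   for group in ciphertext.split())
```

If the result is mostly `?`, try swapping the two symbols before ruling Morse out.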

Lengths of groups = lengths of possible plaintext words

Perhaps the spaces between the groups correspond directly to spaces between the words in the plaintext. If so then this could be a simple substitution cipher such as Caesar or even good old ROT-13! Frequency analysis might yield interesting results.
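
Both checks are cheap to run. The sketch below lists the group lengths, so they can be compared with plausible word lengths, and tries ROT-13 using the `rot_13` codec from Python's standard library; the function names are illustrative.

```python
import codecs

def group_lengths(ciphertext: str) -> list[int]:
    """Lengths of the whitespace-separated groups, in order."""
    return [len(group) for group in ciphertext.split()]

def try_rot13(ciphertext: str) -> str:
    """Apply ROT-13, which is its own inverse."""
    return codecs.decode(ciphertext, "rot_13")
```

A ciphertext such as `GUR PNG FNG` has group lengths 3, 3, 3, which looks like English words, and ROT-13 turns it into `THE CAT SAT`.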

Ciphertext consists of arbitrary objects rather than textual characters

Can a binary distinction be made which divides the objects into 2 types? If so, and the total number of objects is divisible by 5, then this could well be a Baconian Cipher.

"Knowledge Is Powe"
Detail from a photograph of World War I cryptographers trained by William and Elizebeth Friedman, Aurora, Illinois, early 1918. By facing either forward or sideways, the soldiers formed a coded phrase utilizing Francis Bacon’s biliteral cipher. The intended message was the Baconian motto “Knowledge is power,” but there were insufficient people to complete the r (and the w was compromised by one soldier looking the wrong way).

How to Make Anything Signify Anything, Cabinet issue 40 winter 2010/2011, William H. Sherman

Ends in an equals symbol

Try Base64 - the '=' symbol is sometimes (but not always) needed for padding, to make the ciphertext the right length.

$\endgroup$
6
  • 3
    $\begingroup$ The link to How to Make Anything Signify Anything is extremely interesting reading for anyone who hasn't read it - click it already! $\endgroup$
    – Joe
    Commented Dec 4, 2014 at 14:54
  • 1
    $\begingroup$ Your paragraph on morse code isn't entirely accurate. There's for example also the eight dots mistake code, which could occur in live encryption. Still, most of the time you'll mostly see shorter codes of course. But if I'm not mistaken, off the top of my head, the @ sign has six letters. $\endgroup$
    – user6196
    Commented Dec 5, 2014 at 6:57
  • 1
    $\begingroup$ ... But also most punctuation signs have six. $\endgroup$
    – user6196
    Commented Dec 5, 2014 at 7:04
  • 1
    $\begingroup$ Thank you, but eight dots is not 'say again'. It's rather 'I made a mistake, scrap that last character' $\endgroup$
    – user6196
    Commented Dec 5, 2014 at 13:24
  • 1
    $\begingroup$ @CamilStaps, oh, is it? Amending. :) $\endgroup$
    – A E
    Commented Dec 5, 2014 at 13:25
10
$\begingroup$

One class of ciphers that can be identified in an interesting way is the polyalphabetic ciphers. Of these the Vigenère cipher is the best known. They cycle through different alphabets periodically. This period can be identified, which will both identify the cipher as polyalphabetic, and also provide a starting point for decrypting it.

An effective way to break a polyalphabetic cipher is by using a technique known as index of coincidence. This is a measure of how frequently a pair of independent characters happen to be identical.

For example if a text has 100 characters, there will be 100*99/2 = 4950 ways to pick two characters. One can count how large a percentage of these 4950 pairs happen to be identical.

If 100 characters were encrypted using a polyalphabetic cipher that cycles through 10 different alphabets, then each alphabet would be used 10 times. A pair of characters encrypted using different alphabets would be mostly independent. They are not very likely to be identical. But a pair of characters encrypted using the same alphabet is more likely to be identical.

If one guesses that the period is 10, one can compute the percentage of identical characters for only those pairs using the same alphabet. There are 10 alphabets, and for each alphabet there are 10*9/2 = 45 possible pairs of characters. That's a total of 450 pairs, which is enough for the percentage to still be statistically significant.

If you do that calculation with a correct period, you get more identical pairs than if you do the calculation with an incorrect period. So you can simply try different period lengths, and for each length compute the frequency of identical pairs.

If you plot those frequencies, you should see a visible spike when you hit the correct period. If the guessed period and the correct period are different but share a prime factor, there will be a small spike. For example, if the correct period was 10, you would see the main spike at 10 and small spikes at 2, 4, 5, 6, 8, 12, 14 and 15.
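
The procedure above can be sketched as follows. For a guessed period, the letters are split into columns that would share an alphabet, and the fraction of identical pairs is computed over all same-column pairs. The function name is illustrative, and non-letters are simply dropped, which assumes they were not counted when the cipher advanced its alphabet.

```python
from collections import Counter

def ic_for_period(ciphertext: str, period: int) -> float:
    """Fraction of identical pairs among letters sharing an alphabet."""
    letters = [c for c in ciphertext.upper() if c.isalpha()]
    identical = total = 0
    for start in range(period):
        column = letters[start::period]  # every letter using this alphabet
        n = len(column)
        total += n * (n - 1) // 2        # e.g. 10*9/2 = 45 pairs for n = 10
        identical += sum(k * (k - 1) // 2 for k in Counter(column).values())
    return identical / total if total else 0.0
```

To look for the spike, compute `ic_for_period(ciphertext, p)` for `p` from 1 up to, say, 20 and plot or print the values. A period of 1 gives the plain index of coincidence of the whole text.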

$\endgroup$
2
  • $\begingroup$ Are there any particular online resources or references on that which you'd recommend? $\endgroup$
    – A E
    Commented Dec 4, 2014 at 22:28
  • $\begingroup$ @AE I don't know of any particular online resources to recommend. What I explained was just an algorithm I happened to read about in a book back in 1997 and then implemented a few years later, and I have long since forgotten the name of the book. $\endgroup$
    – kasperd
    Commented Dec 4, 2014 at 23:51
9
$\begingroup$

unique character mapping?

If you already know that a cipher maps some kind of symbol onto letters, then counting how many different symbols appear can tell you whether each letter has a single representation or several.

For example, there could be a cipher where B stands for A but C also stands for A.

An example for this can be found in the My new cipher with numbers puzzle:

$(2_ 2,5_ 2,7_ 5,8_ 5,9_ 4,14_ 4),(6_ 2,7_ 1),(3_ 2,5_ 4,30_ 4,70_ 2)$

In this puzzle all the number pairs are different, even though we are told that each pair stands for one letter and that the ciphertext is a full sentence.

Clearly a specific letter can be expressed in different ways, which already rules out many of the known ciphers.
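
Counting distinct symbols is easy to automate. The sketch below reads each `a_b` term of the puzzle as the pair `(a, b)`, which is an assumption about the notation, and the function name is made up for illustration.

```python
def symbol_report(symbols: list, alphabet_size: int = 26) -> dict:
    """Summarise how many distinct ciphertext symbols occur."""
    distinct = len(set(symbols))
    return {
        "total": len(symbols),
        "distinct": distinct,
        # No symbol repeats: suspicious for a one-to-one mapping of a sentence.
        "all_unique": distinct == len(symbols),
        # More symbols than letters: some letter must have several codes.
        "exceeds_alphabet": distinct > alphabet_size,
    }

pairs = [(2, 2), (5, 2), (7, 5), (8, 5), (9, 4), (14, 4),
         (6, 2), (7, 1), (3, 2), (5, 4), (30, 4), (70, 2)]
report = symbol_report(pairs)
```

Here all 12 pairs are distinct. A 12-letter sentence with no repeated letter is possible but unlikely, which supports the idea that one letter can be written in several ways.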

$\endgroup$
0
3
$\begingroup$

Adding to the previous excellent answers:

If the letter frequency distribution looks like regular English (lots of ETAOIN, a bit of SHRDLU), and there are maybe (but not necessarily) some short recognisable word fragments in the mix, you are probably dealing with a transposition cipher of one kind or another. If the word lengths look reasonable, it may be just anagrams.

If you have a lot of numbers, mostly between 65 and 122, you probably have ASCII-encoded text (65 to 90 are the capitals A to Z, and 97 to 122 are the lowercase a to z). This is often used as an additional layer, if the cipher puzzle is based on mathematics.

If the ciphertext only has the digits 0-9 and the letters A to F, then it is probably hex-encoded. If the ciphertext is mostly made of pairs that are in the range from 41 to 7A, you have hex-encoded ASCII text.
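
Both ASCII layers can be peeled off in a couple of lines. This is a minimal sketch with illustrative function names; it assumes decimal codes are separated by non-digits and that the hex digits form whole byte pairs.

```python
import re

def decode_decimal_ascii(ciphertext: str) -> str:
    """Treat every run of digits as a decimal character code."""
    return "".join(chr(int(n)) for n in re.findall(r"\d+", ciphertext))

def decode_hex_ascii(ciphertext: str) -> str:
    """Treat the text as hex byte pairs; whitespace between pairs is ignored."""
    hex_digits = "".join(ciphertext.split())
    return bytes.fromhex(hex_digits).decode("ascii")
```

`decode_hex_ascii` raises `ValueError` on non-hex input or on bytes above 7F, which is itself a useful sign that the text is not plain hex-encoded ASCII.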

If the ciphertext has irregular spaces at the beginning of lines, but otherwise looks like regular text, it's probable that the creator is trying to hide (or create) a simple acrostic.

If you have only capital letters in groups of five, you are probably dealing with a strong cipher, like Enigma, or another military cipher. Since these are too hard to crack for puzzling purposes, they will usually come with hints and/or encryption keys given.

  • If you see mentions of wheels, rotors, rings, or plugboards (or a group of Roman numerals from I to VIII and some German text), it's almost certainly Enigma.
  • If you don't have those, but you've got a password, Vigenère would be a good guess. (In its day, the Vigenère cipher was considered strong enough to be virtually unbreakable, so hiding word and sentence lengths made a lot of sense then.)
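
If a password is supplied, trying Vigenère takes only a few lines. The sketch below assumes the usual convention (key letter A means no shift, B means shift by 1, and so on) and drops spaces, so five-letter groups are read as one continuous stream; the function name is illustrative.

```python
def vigenere_decrypt(ciphertext: str, key: str) -> str:
    """Subtract the repeating key from the ciphertext letters, mod 26."""
    shifts = [ord(k) - ord("A") for k in key.upper() if "A" <= k <= "Z"]
    if not shifts:
        raise ValueError("key must contain at least one letter")
    out = []
    i = 0  # counts letters only, so spaces do not advance the key
    for c in ciphertext.upper():
        if "A" <= c <= "Z":
            shift = shifts[i % len(shifts)]
            out.append(chr((ord(c) - ord("A") - shift) % 26 + ord("A")))
            i += 1
    return "".join(out)
```

For example, `vigenere_decrypt("LXFOP VEFRN HR", "LEMON")` gives `ATTACKATDAWN`.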

And of course, if the ciphertext consists only of the words a'la'ih and do'neh'lini, you are up against Navajo code talkers.

$\endgroup$
