Cracking Ciphers with Transformers

This is a write-up of my experiment to leverage the power of deep learning to crack historic substitution ciphers. It was performed in November 2023, but I procrastinated until now before summarising it…

Theoretical background

Substitution ciphers

Substitution ciphers are a method of encryption in which units of plaintext are replaced with units of ciphertext according to a fixed system. The best-known example is the Caesar cipher, which shifts the alphabet by a fixed number of places. For example, with a shift of 3, A would be replaced by D, B would become E, and so on.

Such ciphers, categorised as monoalphabetic, are extremely vulnerable to frequency analysis, which is a technique that exploits the fact that some letters appear more frequently than others in a given language. In order to make the ciphertext less vulnerable to this attack, polyalphabetic ciphers were developed, such as the Vigenère cipher.

The Vigenère cipher operates by using a short keyword that is repeated to the length of the plaintext and then used to shift the plaintext letters, on the same principle as the Caesar cipher. For centuries after its invention, cryptanalysts tried to break it without success, and it was not until the 19th century that Friedrich Kasiski published a general solution, now known as the Kasiski examination.
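As an aside, the mechanism is simple enough to fit in a few lines; the sketch below is mine, not part of the original experiment.

import itertools

def vigenere_encrypt(plain: str, keyword: str) -> str:
    # Repeat the keyword to the plaintext's length, then shift each
    # plaintext letter by the corresponding key letter (A = shift 0).
    return ''.join(
        chr((ord(p) - ord('A') + ord(k) - ord('A')) % 26 + ord('A'))
        for p, k in zip(plain, itertools.cycle(keyword))
    )

assert vigenere_encrypt('ATTACKATDAWN', 'LEMON') == 'LXFOPVEFRNHR'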

Kasiski noticed that repeated sequences in the plaintext, when they happen to align with the same position in the repeating keyword, produce identical sequences in the ciphertext, so the distances between such repetitions are multiples of the keyword length. Once the length is known, the Vigenère cipher can be treated as a series of interleaved Caesar ciphers, and the cryptanalyst can break each of them with frequency analysis.
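A toy version of the Kasiski examination (again my own sketch, not the author's code) collects the gaps between repeated trigrams; the keyword length divides most of them.

from collections import defaultdict
from functools import reduce
from math import gcd

def kasiski_distances(cipher: str, n: int = 3) -> list[int]:
    # Record where each n-gram occurs, then collect the gaps between
    # consecutive occurrences of repeated n-grams.
    positions = defaultdict(list)
    for i in range(len(cipher) - n + 1):
        positions[cipher[i:i + n]].append(i)
    return [q - p for occ in positions.values() if len(occ) > 1
            for p, q in zip(occ, occ[1:])]

def guess_key_length(cipher: str) -> int:
    # A crude estimate: the GCD of all gaps (0 if there are no repeats).
    return reduce(gcd, kasiski_distances(cipher), 0)

In practice the plain GCD is brittle (a single accidental repeat ruins it); real attacks tally the divisors of the gaps instead.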

A variation of the Vigenère cipher exists, known as the autokey cipher, which uses the keyword (the ‘primer’) concatenated with the plaintext itself as the key. This method is considered more secure than the original; however, it unavoidably introduces common words into the key, and attackers can hypothesise the positions of such words to break the encryption.

The Enigma machine, used by Nazi Germany during the Second World War, completely revolutionised the field of cryptography. It was an extremely sophisticated electromechanical device that encrypted messages with a series of rotors, each of which implements a substitution table. Enigma’s security rested on the fact that the rightmost rotor rotated (thus changing the configuration) with every key press, and at certain positions its movement carried the next rotor along, and so on. At the end sits a reflector (UKW, Umkehrwalze, literally “reversing rotor”), which makes the encryption process reciprocal. Military versions also featured a plugboard, which swapped up to ten pairs of letters, further complicating the encryption. Considering the number of possible configurations, the success of the Polish and British cryptanalysts in breaking Enigma was a remarkable achievement. However, this article will not delve into the details of cryptography and the mathematics behind it, but rather focus on the application of deep learning to cracking substitution ciphers.

Deep learning

As we all know, neural networks are universal approximators, so it could be hypothesised that they can recover the plaintext given an arbitrary ciphertext alone. The encryption process can be abstracted as a function:

Encrypt: (key, plaintext) → ciphertext,

Similarly, decryption is

Decrypt: (key, ciphertext) → plaintext,

which is equivalent to

Decrypt: key → (ciphertext → plaintext).

The task of the neural network is thus to learn an outer function general enough to cover all possible keys, so that given any ciphertext, it can output the plaintext without knowledge of the key.
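To make the currying above concrete, here is a small illustration of my own (not from the original experiment), using the Caesar cipher: fixing the key yields an ordinary ciphertext → plaintext function, and the network has to approximate this whole family of functions at once.

from typing import Callable

def make_decrypt_caesar(shift: int) -> Callable[[str], str]:
    # Fixing the key (the shift) curries Decrypt into ciphertext -> plaintext.
    def decrypt(cipher: str) -> str:
        return ''.join(chr((ord(c) - ord('A') - shift) % 26 + ord('A')) for c in cipher)
    return decrypt

decrypt3 = make_decrypt_caesar(3)
assert decrypt3('DEF') == 'ABC'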

Experiments

The model

The processes of encryption and decryption with substitution ciphers have a very convenient property: the lengths of the plaintext and the ciphertext are equal. Thus, cipher breaking is a typical sequence-to-sequence problem, a perfect fit for the Transformer architecture. Here it is assumed that both the plaintext and the ciphertext are composed of the 26 letters of the Latin alphabet, but this could easily be extended to other alphabets.

In my experimental setup, a vanilla Transformer is used (8 attention heads, embedding dimension 32, all other hyperparameters at their defaults). The architecture is as follows:

Embedding → Positional Encoding (sin-cos) → Transformer (as a black box).
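The original model code is not included in this write-up, so below is a minimal sketch of the setup in PyTorch. Only d_model = 32 and nhead = 8 come from the text; the class name CipherBreaker, the start-of-sequence token (used for teacher forcing in the training sketch further down), and every other detail are my assumptions.

import math
import torch
import torch.nn as nn

VOCAB_SIZE = 27  # 26 letters plus an assumed start-of-sequence token
BOS = 26

class PositionalEncoding(nn.Module):
    """The sin-cos encoding from 'Attention Is All You Need'."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, batch, d_model)
        return x + self.pe[:x.size(0)]

class CipherBreaker(nn.Module):
    def __init__(self, d_model: int = 32, nhead: int = 8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = PositionalEncoding(d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead)
        self.out = nn.Linear(d_model, 26)  # one logit per letter

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # src: ciphertext indices, tgt: decoder input; both (seq_len, batch).
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(0))
        h = self.transformer(self.pos(self.embed(src)),
                             self.pos(self.embed(tgt)),
                             tgt_mask=mask)
        return self.out(h)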

Training data

I tried to break the following ciphers, in order of increasing difficulty: the Caesar cipher, the autokey cipher, and Enigma. A single function is used to convert plaintext and ciphertext strings into tensors, and each data-providing function has the signature gen_plain_cipher() -> tuple[str, str].
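The conversion function itself is not reproduced here; a minimal version (my assumption, not the original code) could look like this:

import torch

def text_to_tensor(text: str) -> torch.Tensor:
    # Map 'A'..'Z' to the integer indices 0..25.
    return torch.tensor([ord(c) - ord('A') for c in text], dtype=torch.long)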

Caesar cipher

import random
import string

LEN_TEXT = 48  # sample length in letters (see ‘Training’)
SHIFT = 3

def gen_plain_cipher_caesar() -> tuple[str, str]:
    # Shift every letter of a random uppercase plaintext by SHIFT places.
    plain = ''.join(random.choice(string.ascii_uppercase) for _ in range(LEN_TEXT))
    cipher = ''.join(chr((ord(c) - ord('A') + SHIFT) % 26 + ord('A')) for c in plain)
    return plain, cipher

Autokey cipher

def tabula_recta_encrypt(plain: str, key: str) -> str:
    """Encrypt the single character `plain` with the single character `key`."""
    return chr((ord(plain) - ord('A') + ord(key) - ord('A')) % 26 + ord('A'))

PRIMER = 'DATASCIENCE'

def gen_plain_cipher_autokey() -> tuple[str, str]:
    plain = ''.join(random.choice(string.ascii_uppercase) for _ in range(LEN_TEXT))
    # The key is the primer followed by the plaintext itself.
    key = PRIMER + plain[:-len(PRIMER)]
    cipher = ''.join(tabula_recta_encrypt(plain[i], key[i]) for i in range(LEN_TEXT))
    return plain, cipher

Enigma

I used an existing Enigma simulator and only made minor changes to its API. Several data generators are defined (a hedged sketch of one follows the list):

  1. Fixed encryption parameters, including the order of the rotors, the initial positions, the plugboard, etc.; fixed substrings of the plaintext, with the prefix ANX (AN is German for ‘to’; X stands for a space) and the sequence KEINEXBESONDERENXEREIGNISSE (German for ‘no special events’) appearing in the middle.
  2. Fixed encryption parameters, without fixed substrings of the plaintext.
  3. Fixed rotors, but randomised the plugboard.
  4. Three rotors randomly chosen from the five used by the Wehrmacht, randomised order and initial positions, and randomised plugboard.
  5. All parameters randomised.
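The simulator’s API is not part of this write-up, so the sketch below only illustrates generator 4; EnigmaMachine and its parameters are hypothetical stand-ins for whatever the real simulator exposes.

ROTORS = ['I', 'II', 'III', 'IV', 'V']  # the five Wehrmacht rotors

def random_plugboard(pairs: int = 10) -> list[tuple[str, str]]:
    # Draw 20 distinct letters and pair them off.
    letters = random.sample(string.ascii_uppercase, 2 * pairs)
    return list(zip(letters[::2], letters[1::2]))

def gen_plain_cipher_enigma() -> tuple[str, str]:
    machine = EnigmaMachine(                      # hypothetical API
        rotors=random.sample(ROTORS, 3),          # random choice and order
        positions=[random.randrange(26) for _ in range(3)],
        plugboard=random_plugboard(),
    )
    plain = ''.join(random.choice(string.ascii_uppercase) for _ in range(LEN_TEXT))
    return plain, machine.encrypt(plain)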

Training

Training data are generated dynamically, and only one batch, consisting of 600 samples of 48 letters each, is processed during one ‘epoch’. Thanks to this endless stream of fresh data, overfitting is not a concern. I originally planned to train the model for 100,000 epochs, checkpointing every 2,500 epochs; however, the model converged very rapidly in all the cases listed above, with accuracy reaching 100%.
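For completeness, here is a hedged reconstruction of such a training loop, built on the sketches above (CipherBreaker, text_to_tensor, and the data generators); the optimiser, learning rate, and teacher-forcing scheme are my assumptions, not the author’s recorded choices.

import torch

BATCH_SIZE = 600

model = CipherBreaker()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100_000):
    # A fresh batch every ‘epoch’; swap in any generator from above.
    pairs = [gen_plain_cipher_caesar() for _ in range(BATCH_SIZE)]
    src = torch.stack([text_to_tensor(c) for _, c in pairs], dim=1)
    tgt = torch.stack([text_to_tensor(p) for p, _ in pairs], dim=1)
    # Teacher forcing: the decoder sees BOS plus the plaintext shifted right.
    dec_in = torch.cat([torch.full((1, BATCH_SIZE), BOS), tgt[:-1]], dim=0)

    logits = model(src, dec_in)
    loss = criterion(logits.reshape(-1, 26), tgt.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 2_500 == 0:
        torch.save(model.state_dict(), f'checkpoint_{epoch + 1}.pt')

A sample output of the last test is shown below: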

MZMVOKNXXVEFXLONIZOVKGLEXTLSGJNDDYXVCVQSPLSPGZPN
UGJGZWSAIIFQDXDJWLMHQYSBIBGDUBCXWBSIIELDVJPDSJJG
UGJGZWSAIIFQDXDJWLMHQYSBIBGDUBCXWBSIIELDVJPDSJJG

QMGFCJPQLTMWGXPOTDRVTHZGQPXUYLMIPHVNPHLUHIQZQGFN
AGJCFPTFTZGUNTYPFYLHXZAFZVSHQXORYUFCEXKPXVLQFFET
AGJCFPTFTZGUNTYPFYLHXZAFZVSHQXORYUFCEXKPXVLQFFET

LPEPLCQDFONJIVXSWUOSUXVRIIYVEXIYRWJOVZNCZHGJNEWJ
CNOGANKNOTXQREQUEHFLEWLWDAKCGMWUZGPIPSZOGCRODNNK
CNOGANKNOTXQREQUEHFLEWLWDAKCGMWUZGPIPSZOGCRODNNK

WVYLCDWAXNONSLVYXEVQBLTSKMAXRNOHADDCPXMJBUTVEDMO
NXDREJTREECJZFJELCJJXIDMEHCRFJCLIMIACKQREIVKIPCT
NXDREJTREECJZFJELCJJXIDMEHCRFJCLIMIACKQREIVKIPCT

YAJEIOQDWLFPMEKDRBVDFOCPUUFASVDYMHRCGXEWSXROMYAX
ZFUSQNLBCDWLRPSVOGWIYJYOHHINVJNAWPGWDNMGMTGZEAZQ
ZFUSQNLBCDWLRPSVOGWIYJYOHHINVJNAWPGWDNMGMTGZEAZQ

OYSXXWOBNZFGRFMCMJWLCEMEOSWZRDNRCYQYWDVMOUSLJWFW
RLBMVKMDSODQSNPZCZAYYRIJHGNPZAQFDSMNXEWEBIOEUIZG
RLBMVKMDSODQSNPZCZAYYRIJHGNPZAQFDSMNXEWEBIOEUIZG

EBYXPPZPJKJUMCUHUEMLXTHOVEYSRKIKANDJERAWSQLDYLVX
SNQVAJRICWFYKFFAXVWXZPQZDUSDJGYOVKZPJKCNQRUGUNTL
SNQVAJRICWFYKFFAXVWXZPQZDUSDJGYOVKZPJKCNQRUGUNTL

Accuracy: 1.0
Loss: 0.002676135650835931

Conclusions

A simple, small Transformer model can crack ciphers as complex as the Enigma, given enough training data. Amazing, ain’t it?

Afterthoughts

What I did not consider last year is the fact that one ciphertext can correspond to multiple plaintexts. This only happens in the Enigma experiments with randomised parameters; the Caesar and autokey setups use fixed keys, so their mappings are deterministic. So why on earth did the accuracy reach 100%? I no longer have time to investigate this, but I hope a kind reader can enlighten me.