Sunday, July 10, 2016

Cracking a OTP Cipher: Python Unit Testing by Example - Part 1 Encryption Motivation

This article is special to me because it combines 3 things I love dearly in one project. Today I am going to talk about how someone (namely me) can use The Python language to automate an attack on an implementation of the One Time Pad Encryption Scheme that ignored the "One-Time" portion of the name. Python, Unit Testing, and Attacking Crypto...does it get any better?




My motivation for this topic came from the Cryptography Course offered Online from the University of Maryland. If you enjoy Crypto problems like the following I highly recommend taking the Course on Coursera (https://www.coursera.org/learn/cryptography)
This is going to be one of my more technical posts, so cinch down those thinking Hoodies and let's dive in.

Dealing with Cryptographic Systems  in general requires a lot of proofs. Mathematical, Statistical, Practical, the list goes on. As you should know by now writing code also means Unit Testing. So why not combine the two? To write a Python framework to crack the One Time Pad Reuse problem, we first need define the algorithm for cracking the problem. Then we need to develop Unit Tests that will tell us when our code has satisfied those conditions. Finally we need to write the code to pass the Unit Tests.

First off: A little background on the One Time Pad.
let E(Gen,Enc,Dec) be the One-Time Pad encryption scheme we are discussing
m = Hexadecimal of some English language text chosen randomly from all possible texts
i = # of bytes in m
The algorithms for the scheme are defined as:
Gen(i){k = Uniformly selected i bytes}
Enc(k,m){c = k XOR m}
Dec(k,c){m = k XOR c}
You can pause here to prove to yourself that m = Dec(k,Enc(k,m)) and that the Pr[m|c] = Pr[m] (The probability of any message m, given the cipher text c, is the same as the a priori probability of any given message m).

So what is the problem if we are using a "perfectly secret" algorithm like OTP? Well it comes in the stipulation that the key never be reused.
Suppose you have 2 cipher texts, c1 and c2. It is a mathematical fact that c1 XOR c2 = m1 XOR m2. So the XOR of two cipher texts using the same key is equal to the XOR of the two plain text messages.
Mathematically it look like this:
(k XOR m1) XOR (k XOR m2) == (k XOR k) XOR (m1 XOR m2)
Since the k XOR k cancels itself out that just leaves (m1 XOR m2)
This leaks usable information about the two underlying plain-text messages.

I will call this XOR of the two Ciphers XORXOR. Converting the XORXOR to a binary string will tell you exactly where the two messages differed.
Lets take a 1 byte XORXOR 0x07 as an example in Binary that byte is 00000111 immediately you can see that, whatever the parent bytes were, they were nearly identical except the last 3 bits.
We can formulate this as a search on the set of all possible 1 Byte XOR values as such:
Any single byte pairs that result in the XOR 0x07 
In fact, there are 256 possible single byte parent combinations for XORXOR 0x07 (or any other single byte).

Furthermore, by exploiting a bit of knowledge about ASCII -> Binary conversion you can see that any Letter byte starts with the prefix 01 and the space character starts with the prefix 00. So the XOR of any two English letters will result in a byte starting with the prefix 00 and the XOR of a letter and a space character will result in byte string with the prefix 01. This is a heuristic that can help us narrow the possible parents down even further. Using this logic we can see the above example is most likely the XOR of two letter bytes.

Our new search would be:
Any 1 byte pairs that result in the XOR 0x07 
AND are both Character bytes (between int value 64 - 128)
Which yields 62 possible byte combinations
* list includes both possibilities (m1=a,m2=b) and (m1=b,m2=a), the number of unique combinations is 1/2 (31  Unique parent byte combinations for XORXOR 0x07)

Note a couple of those pairs include bytes less common to certain plain texts. You can further reduce these if you apply Domain Knowledge about the ciphers. If we assume the underlying messages were both valid English texts we can reduce the set even further to only those combinations where both characters are in the normal typographic range. These reductions produce a Message Possibility table like this (bold indicates more likely pairs):

   Message X        |     Message Y
-------------------------------------------------------
    @ (0x40)        |        G (0x47)
    A (0x41)         |        F (0x46)
    B (0x42)         |        E (0x45)
    C (0x43)         |        D (0x44)
    H (0x48)        |        O (0x4f)
    I (0x49)          |        N (0x4e)
    J (0x4a)          |        M (0x4d)
    K (0x4b)        |        L (0x4c)
    P (0x50)         |        W (0x57)
    Q (0x51)        |        V (0x56)
    R (0x52)        |        U (0x55)
    S (0x53)         |        T (0x54)

    X (0x58)        |        _ (0x5f)
    Y (0x59)        |        ^ (0x5e)
    Z (0x5a)        |        ] (0x5d)
    [ (0x5b)         |        \ (0x5c)
    ` (0x60)        |        g (0x67)
    a (0x61)        |        f (0x66)
    b (0x62)        |        e (0x65)
    c (0x63)         |        d (0x64)
    h (0x68)        |        o (0x6f)
    i (0x69)         |        n (0x6e)
    j (0x6a)         |        m (0x6d)
    k (0x6b)        |        l (0x6c)
    p (0x70)        |        w (0x77)
    q (0x71)        |        v (0x76)
    r (0x72)         |        u (0x75)
    s (0x73)         |        t (0x74)

    y (0x79)        |        ~ (0x7e)
    z (0x7a)        |        } (0x7d)
    { (0x7b)        |        | (0x7c)

With a single cipher there is no sure way of telling which message owned which character. If the XORXOR is long enough you may be able to reason about it, but the above 1 byte example gives you only an If/Then idea. This problem is more interesting when you think of multiple messages. Suppose you have a set of 3 Cipher texts. All of which were produced using the same OTP key. [c1,c2,c3]

XORXOR c1,c2 = 0x02 (00000010)
Table of Possibilities for XORXOR 0x02
    Message X        |     Message Y
----------------------------------------------

    A (0x41)        |        C (0x43)
    D (0x44)        |        F (0x46)
    E (0x45)        |        G (0x47)
    H (0x48)        |        J (0x4a)
    I (0x49)        |        K (0x4b)
    L (0x4c)        |        N (0x4e)
    M (0x4d)        |        O (0x4f)
    P (0x50)        |        R (0x52)
    Q (0x51)        |        S (0x53)
    T (0x54)        |        V (0x56)
    U (0x55)        |        W (0x57)
    X (0x58)        |        Z (0x5a)
    a (0x61)        |        c (0x63)
    d (0x64)        |        f (0x66)
    e (0x65)        |        g (0x67)
    h (0x68)        |        j (0x6a)
    i (0x69)        |        k (0x6b)
    l (0x6c)        |        n (0x6e)
    m (0x6d)        |        o (0x6f)
    p (0x70)        |        r (0x72)
    q (0x71)        |        s (0x73)
    t (0x74)        |        v (0x76)
    u (0x75)        |        w (0x77)
    x (0x78)        |        z (0x7a)


XORXOR c1,c3 = 0x64 (01100100)
Table of Possibilities for XORXOR 0x64
    Message X        |     Message Y
-------------------------------------------------------
      (0x20)        |        D (0x44)


XORXOR c2,c3 = 0x66 (01100110)
Table of Possibilities for XORXOR 0x66
    Message X        |     Message Y
-------------------------------------------------------
      (0x20)        |        F (0x46)

Since c1,c3 and c2,c3 both were XORs with a space and c1,c2 was not we can say conclusively that c3 is the space character. Which means c2 is "F" and c1 is "D". We have now cracked the One Time Pad for a single byte, with just 3 cipher texts. It is much easier to crack the XOR of a space and a character as you can see from the table above.

You can imagine, hopefully, repeating the same method above for each byte of a multi-byte XORXOR for each pair of messages. Doing so would generate a series of tables similar to the above. Finding the parent message then becomes a Combinatorics problem. Now suppose each byte of a 7 byte cipher has an average of 31 possible pairs of parent bytes, that is ~31^7 possible outputs. Still too big to reasonably search through manually in my opinion. Plus there is no need. A lot of these outputs will be garbage that is not likely to be plain text. Therefore, we can reduce the field once again using our assumption of the English language plain-text. By producing the different combinations and seeing if they comprise valid English words (using Python and an English Dictionary available in the python-enchant package) we can reduce our search set to only those combinations that might produce two valid English Messages.
Also, notice how the XOR of capitol and lower case letter pairs are equivalent in the tables above? You can further reduce the number of possibilities simply by not concerning yourself with the capitalization of the underlying texts. This is fine in a lot of cases, but not all obviously.

Finally, the idea can be expanded to include texts from other languages but the underlying assumptions about XORing of bytes would need to be updated as well.

Since this method relies on heuristics it cannot be said to work all the time. If the underlying text is not English, or is poorly formed the Dictionary Heuristic may throw it out. Nor have I mentioned anything about dealing with numeric sentences or special cases of text. you can expand the syntactic checker with any knowledge you have about the underlying texts. Are they chat messages going back and forth? Are they business documents on a server? Different types of texts require different insights into cracking so I will stick with the base case for now and leave it up to you the reader to expand or generalize it as you need.

This concludes the Cryptographic Motivation section. Hopefully you understand now what the weakness in the One Time Pad I am attacking is, and how I plan to attack it. In the next post I will be discussing the Python Unit Tests I have written to help with several of the functions above. Namely generating the tables of parent bytes and, combinations.


1 comment: