Sci-Fi Crypto

From YobiWiki
Revision as of 21:37, 24 November 2010 by <bdi>PhilippeTeuwen</bdi> (talk | contribs) (Reverted edits by Etegohy (Talk) to last revision by PhilippeTeuwen)
Jump to navigation Jump to search

This is the copy of an idea I've spread during the HackCamp about Sci-Fi crypto

The original essay

Sci-Fi(?) crypto: brainstorming on how to securely exchange messages while Big Brother has at its disposal nice quantum computers (or how to defeat brute-force attacks on message ciphers)

Some time ago I was reading "Digital Fortress" by Dan Brown.
A thriller in the NSA headquarters where the head of the crypto division is facing an unbreakable code, resistant to the brute-force attacks of the NSA's supercomputer TRANSLTR.
The thrilling fact is that the author of this code threatened NSA that the code will be revealed to the entire world if he happens to die.
And his death is the start of the book...

So I was reading and while the protagonists were busy imaginating how this unbreakable code was designed, I thought:
"Bu..sh..![1] To make an unbreakable code this is not like that that you have to do. You've rather to..."
And that's how I imaginated what I'm about to talk about.

Reading further, I found one quote much smarter than the rest of the book, which illustrates very well my technique:
"you could have broken the code but maybe you didn't see it!"

Let's try to create a message cipher resistant to brute force.
I said "message cipher" and not "stream cipher" because this cipher will not work on binary data but english text.
Brute force by definition can break any known symmetric cipher, but what if you don't know you broke it?
Generating all the possible plaintexts is only half of the game, the other half is to have a decision criterion to discriminate the right plaintext amongst the others.
And that's this half that we'll try to make harder for the cryptanalyst.
Here I need a new word because usually crypto talks about plaintext and ciphertext but not about possibly wrongly decoded ciphertexts.
We'll call that an attemptext (tentative of plaintext).

Imagine a ciphered text that is always deciphered to a grammatically correct attemptext, no matter which (wrong) key you used.
Cryptanalysis requires therefore a large amount of work to retrieve the correct plaintext among all attemptexts.
This is the start of a war between artificial intelligences...
Anything the cryptanalyst program can find to discard an attemptext means that you can probably tune the initial grammar model to create more plausible attemptexts.
The problem for the cryptanalyst is a variation of the Turing test as it's a test of humanity, with some differences with the original test:

  • the test procedure is not a dialog but consists of n texts of which only one was written by a human
  • the tester is rather a computer than a human (like for a captcha(tm)[6])
  • the human does not want to prove his humanity, to the contrary, so he can deliberately introduce some non-sense in his text [7]

Let's represent a language by a graph.
Each sentence is one path through the language graph.
Taking another random path will lead to another grammatically correct sentence.
The perfect language graph represents all possible sentences but no junk, so a cryptanalysis has to be done on another level than the grammar or other easy tricks like alphabet frequency analysis.


To cipher our message, the first step is not a cryptographic one but the encoding of the sentence as a path along the grammar graph.
This path has to be identified numerically (enumerated) amongst the possible paths.
Ideally the enumeration must be balanced by the frequency of common grammatical constructions and vocabulary, for example something you get for free if you manage to map some Huffman coding on it.
If there is a complete map between all the paths up to a given length and a bounded set of integers, then we have the guarantee that any random pick in the set will be accepted by the deciphering routine and will lead to an attemptext.
And the numerical representation can now be ciphered by any classic symmetric cipher.
So the complete solution has to follow some rules:

  • it must run without revealing any metadata confirming the right key when brute-forced, so e.g. don't introduce any checksum in the plaintext [2] but on the ciphered text!
  • any wrong key must lead to a proper deciphering and an attemptext, no rejection.

Some additional reflections:
Such a cipher method covering a balanced language graph would become for free one of the best natural language text compressor available (like when you order in a chinese restaurant the numbers 3, 10 and 12 (I recommend the 12)).
This gives plausible deniability, especially if re-decoded (to another attemptext): you can even deny it is a message encoded with this method, this text was just the poetry of a drunken man ;-)
A little junk is tolerated in the brute-forced plaintexts (even a lot! 99% of detectable junk = loss of 6.6 bits of key material).
A limited language coverage with a stronger grammar can help but then we need a special predictive editor to create only sentences respectful of the cipher grammar.
We covered english language but this could be extended to any formal media description if a grammar can be written for it.
Problems: Grammar/Vocabulary could be limited, we need special care when entering a sentence and be sure it'll be handled properly by the grammar.
We also have to foresee loopholes for whatever will not be in a dictionary (family names, brands, acronyms, numbers...)

Proof-of-concept:
Good starting points are grammar and spelling checkers, maybe Turing games (mimicking a human, elisa etc).

$ apt-cache search grammar [3]
link-grammar - Carnegie Mellon University's link grammar parser for English
$ apt-cache show link-grammar
    In Selator, D. and Temperly, D. "Parsing English with a Link Grammar" (1991), the authors defined a new formal grammatical system called a "link grammar". A sequence of words is in the language of a link grammar if there is a way to draw "links" between words in such a way that the local requirements of each word are satisfied, the links do not cross, and the words form a connected graph. The authors encoded English grammar into such a system, and wrote this program to parse English using this grammar.

Links, graphs, mmm this description talks to me...

$ echo "the cat is chasing the mouse" |link-grammar en
                  +-----Os-----+
+-Ds-+--Ss-+-Pg*b-+      +--Ds-+
|       |        |          |       |        |
the cat.n is.v chasing.v the mouse.n 

The difficulty is the enumeration of paths that would cover the key space.
For a first attempt I will keep the grammar links of the plaintext and replace every word by another which respects the same grammar structure.
After scripting some bash around link-grammar and its dictionnaries, here's what I get:

$ echo my example illustrates a mean to obfuscate a complex sentence easily with a basic number|./encode            
@23:2 n.1:2865 v.4.2:1050 a n.1:4906 to v.4.1:1352 a n.1:1762 n.1:7124 adv.1:369 with a adj.1:338 @3:3

This is the encoding of the input, every word is replaced by a reference to a wordlist and its position in the list.

$ echo my example illustrates a mean to obfuscate a complex sentence easily with a basic number|./encode|./decode 
my example illustrates a mean to obfuscate a complex sentence easily with a basic number

So far, so good.

$ echo my example illustrates a mean to obfuscate a complex sentence easily with a basic number|./encode 123     
@23:1 n.1:7695 v.4.2:2054 a n.1:2757 to v.4.1:2068 a n.1:8659 n.1:2548 adv.1:236 with a adj.1:302 @3:6

Here I'm changing the positions using a secret key with a very stupid 16bit numeric cipher.

$ echo my example illustrates a mean to obfuscate a complex sentence easily with a basic number|./encode 123|./decode 123
my example illustrates a mean to obfuscate a complex sentence easily with a basic number

So far, so good.
But because any wrong key leads to an attemptext, we could send this attemptext instead of the encoded path so it looks less like ciphered stuff than like psychedelic poem:

$ echo my example illustrates a mean to obfuscate a complex sentence easily with a basic number|./encode 123|./decode
its stampede subtends an enchantress to sustain a venture dowager crassly with an avaricious bulk
$ echo its stampede subtends an enchantress to sustain a venture dowager crassly with an avaricious bulk|./encode|./decode 123
my example illustrates a mean to obfuscate a complex sentence easily with a basic number

So far, so good.

$ echo my example illustrates a mean to obfuscate a complex sentence easily with a basic number|./encode 123|./decode 456
their triangle uncorks a chap to belittle a simpleton guitarist uneventfully with an impersonal group

Deciphering with a wrong key gives a grammatically correct sentence.[4]

$ echo their triangle uncorks a chap to belittle a simpleton guitarist uneventfully with an impersonal group|./encode 456|./decode 123
my example illustrates a mean to obfuscate a complex sentence easily with a basic number
$ echo their triangle uncorks a chap to belittle a simpleton guitarist uneventfully with an impersonal group|./encode 334|./decode 1
my example illustrates a mean to obfuscate a complex sentence easily with a basic number

With this simple cipher, it's a bit like stepping forward and backward and 123-456+334-1=0

Brute-forcing:

$ echo "@23:1 n.1:7695 v.4.2:2054 a n.1:2757 to v.4.1:2068 a n.1:8659 n.1:2548 adv.1:236 with a adj.1:302 @3:6"|./force     
...
31158:their leak envisages a readership to groom a gene walkway formally with a treacherous bulk
31159:its pontiff typifies a tenor to barbecue a militarist castle hoarsely with a unrepentant handful
31160:your stalk mesmerizes a beholder to prohibit a recess eye informally with an accidental group
31161:my grocery transships a panhandle to amass a doge studio profitably with a putrid majority
31162:their mould mass-produces a shoemaker to pip an iguana assay satisfactorily with a sincere minority
31163:its rioter comprehends a washbasin to dynamite a paragon counterplot stealthily with a textured number
31164:our echelon mangles a lute to oversee a cleaner savannah centrally with a machiavellian bunch
31165:my jailer comforts a pronoun to disembarrass a flytrap twentieth culturally with an onerous batch
31166:their peso relinquishes a subtitle to suffuse a magician bucketful doubly with a procedural bulk
31167:your compliment codes a heater to deify a beautician playschool intransitively with a fricative handful
31168:our gamekeeper reinforces a nude to span a daybook spokesperson manfully with an inarticulate group
31169:my memoir explicates a scenario to insure a helpmate airflow offhand with a lifeless majority
31170:its blowhole refunds an expedition to seduce a wallaby mole symmetrically with a coastal minority
31171:your dirigible exemplifies a lark to illustrate a casualty rerun unjustly with a discreet number
31172:our humorist unfrocks a pluralist to brainwash an eyebrow toll adoringly with a flaccid bunch
...

This was just a proof-of-concepts, the numeric cipher is stupid, probably leading to some cycles in some wordlists, the key is much too short, the words in a list are not balanced by their frequency, the grammar links are not ciphered at all, my script retains only the first path found by link-grammar and this can lead to situations the grammar links discovered on a modified plaintext does not correspond to the initial plaintext links [5], etc.
BTW why mentioning quantum computers? I'm not sure they would help in brute-force, they will help breaking public key cryptography for sure, but I found it sexy in the title ;-)

polygen /usr/share/polygen/eng/debian/thanks.grm

Philippe Teuwen <phil teuwen org>
Any comment is welcome!

Scripts: (require link-grammar to be installed)
http://www.hack.lu/pres/Scificrypto/

Notes:
[1] there is more than one technological error in this book (cf e.g. http://victoria.tc.ca/int-grps/books/techrev/bkdgtlft.rvw ), one of the most obvious ones is probably that NSA never considered that the "unbreakable message" could have been random garbage.
Giving a "ciphered text" as a challenge to break, without even giving access to the cipher itself, even not as oracle, is by no mean a proof your cipher is unbreakable, even if I remember in the past having faced such a marketing communication from fools trying to sell a new cipher.
Note that what is finally revealed as the Digital Fortress unbreakable code makes eventually sense in its context (if the proper terminology was used), think about the integer overflow vulnerability in the command "file" for example (CVE-2007-2799)
No I will not tell more, for those who didn't read the book yet and plan to do so, I'm not a spoiler ;-)
[2] that's how I broke my first 32bit cipher for HP48 when I was 17, its author was so confident that he prepended the plaintext with the key.
[3] I also discovered "polygen": PolyGen is a program for generating random sentences according to a grammar definition, that is following custom syntactical and lexical rules.
Not suitable for us unless someone writes a complete grammar of english, but so funny, try it ;-)
[4] Actually on this particular input link-grammar found 2 paths and the first one considers "complex" as a noun rather than an adjective, using the second path would have been more correct.
[5] Here's an example (secret diffusion was commented out in the code to build this case more easily) echo "it's a complete push" |./encode 1|./decode|./encode|./decode 1 => it 's a complement nonentity
It happened because the intermediate sentence was "it 's a complex sentence", "complex" was build from the adjective list (complete->complex) but analysed in "|./encode|" by link-grammar as a noun, so deciphered as (complex->complement)
It's a rare border case but if the message has to be conveyed as ciphered sentence instead of a ciphered path, check first that it can be deciphered properly.
[6] CAPTCHA is an acronym: "Completely Automated Public Turing test to tell Computers and Humans Apart" and is trademarked by Carnegie Mellon University.
Note that in our case, a human could manually finish the test if the computer already managed to reduce the original set of n texts to a more reasonable amount.
[7] The intended recipient won't have any problem to distinguish in one single text which parts are meaningful and which are just bananas jumping after the carpet.

Scripts: (require link-grammar to be installed)
http://www.hack.lu/pres/Scificrypto/

SECURITY UPDATE link-grammar versions prior to v4.2.5 have an exploitable buffer-overflow bug:

echo xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|link-grammar en
linkparser> Segmentation fault

Be sure to not parse unknown text or to upgrade to v2.4.5

Feel free to share your opinion