This is a copy of an idea I presented during a Hack.lu 2007 HackCamp about Sci-Fi crypto; the original link is now dead.
Sci-Fi crypto
Brainstorming on how to securely exchange messages while Big Brother has nice quantum computers at its disposal,
or how to defeat brute-force attacks on message ciphers
Some time ago I was reading "Digital Fortress" by Dan Brown.
A thriller in the NSA headquarters where the head of the crypto division is facing an unbreakable code, resistant to the brute-force attacks of the NSA's supercomputer TRANSLTR.
The thrilling fact is that the author of this code threatened the NSA that the code would be revealed to the entire world if he happened to die.
And his death is the start of the book...
So I was reading, and while the protagonists were busy imagining how this unbreakable code was designed, I thought:
"Bu..sh..![1] That's not how you make an unbreakable code. You'd rather have to..."
And that's how I imagined what I'm about to talk about.
Reading further, I found one quote much smarter than the rest of the book, which illustrates my technique very well:
"you could have broken the code but maybe you didn't see it!"
Let's try to create a message cipher resistant to brute force.
I said "message cipher" and not "stream cipher" because this cipher will not work on binary data but english text.
Brute force by definition can break any known symmetric cipher, but what if you don't know you broke it?
Generating all the possible plaintexts is only half of the game, the other half is to have a decision criterion to discriminate the right plaintext amongst the others.
And it's this half that we'll try to make harder for the cryptanalyst.
Here I need a new word because usually crypto talks about plaintext and ciphertext but not about possibly wrongly decoded ciphertexts.
We'll call that an attemptext (a tentative plaintext).
Imagine a ciphered text that is always deciphered to a grammatically correct attemptext, no matter which (wrong) key you used.
Cryptanalysis therefore requires a large amount of work to retrieve the correct plaintext among all the attemptexts.
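To make that second half concrete, here is a hedged sketch (my own illustration, not part of the original material) of the most naive decision criterion a brute-forcer could script: check each word of an attemptext against a system dictionary, assuming one exists at /usr/share/dict/words. Against the scheme described below every attemptext passes this kind of filter, which is exactly the point.
#!/bin/bash
# Naive brute-force filter: flag attemptext words missing from the system
# dictionary. The attemptext is taken from the brute-force run shown later.
attemptext="their appetite test-drives a dioxide to wangle a fevered librarian"
for w in $attemptext; do
  if grep -q -i -x "$w" /usr/share/dict/words; then
    echo "known word: $w"
  else
    echo "unknown word: $w"
  fi
done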
This is the start of a war between artificial intelligences...
Anything the cryptanalyst program can find to discard an attemptext means that you can probably tune the initial grammar model to create more plausible attemptexts.
The problem for the cryptanalyst is a variation of the Turing test, as it's a test of humanity, with some differences from the original test:
- the test procedure is not a dialog but consists of n texts, of which only one was written by a human
- the tester is a computer rather than a human (as for a CAPTCHA(tm) [6])
- the human does not want to prove his humanity; on the contrary, he can deliberately introduce some nonsense into his text [7]
Let's represent a language by a graph.
Each sentence is one path through the language graph.
Taking another random path will lead to another grammatically correct sentence.
The perfect language graph represents all possible sentences but no junk, so cryptanalysis has to be done at a level other than grammar or other easy tricks like letter frequency analysis.
To cipher our message, the first step is not a cryptographic one but the encoding of the sentence as a path along the grammar graph.
This path has to be identified numerically (enumerated) amongst the possible paths.
Ideally the enumeration must be balanced by the frequency of common grammatical constructions and vocabulary, something you get for free if, for example, you manage to map some Huffman coding onto it.
If there is a complete map between all the paths up to a given length and a bounded set of integers, then we have the guarantee that any random pick in the set will be accepted by the deciphering routine and will lead to an attemptext.
And the numerical representation can now be ciphered by any classic symmetric cipher.
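As a minimal sketch of this encode-then-cipher idea (a toy of my own, not the proof-of-concept scripts below): a word is encoded as its 1-based position in a small hypothetical wordlist, that position is the number a classic cipher would operate on, and decoding is just the reverse lookup.
#!/bin/bash
# Toy illustration of the encoding step: a word becomes its 1-based position
# in a wordlist; that position is what a classic symmetric cipher would then
# operate on, and decoding is the reverse lookup.
printf "cat\ndog\nmouse\nsentence\n" > /tmp/toy-nouns.txt   # hypothetical wordlist
word="mouse"
n=$(grep -n -x "$word" /tmp/toy-nouns.txt | cut -d: -f1)    # encode: word -> index
echo "encoded: n.1:$n"                                       # prints: encoded: n.1:3
back=$(sed -n "${n}p" /tmp/toy-nouns.txt)                    # decode: index -> word
echo "decoded: $back"                                        # prints: decoded: mouse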
So the complete solution has to follow some rules:
- it must run without revealing any metadata confirming the right key when brute-forced, so e.g. don't introduce any checksum in the plaintext [2]; put it on the ciphered text instead (see the sketch after this list)!
- any wrong key must lead to a proper deciphering and an attemptext, no rejection.
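As a hedged illustration of the first rule (my own addition, not part of the proof-of-concept): if integrity is needed, authenticate the ciphered text with an HMAC under a separate key, so a brute-forcer never gets verification data that confirms a key guess on the plaintext. The key names and sample message below are assumptions.
#!/bin/bash
# Integrity goes on the ciphered text only, never on the plaintext.
secret=123
mac_key="hypothetical-authentication-key"
plaintext="my example illustrates a mean to obfuscate a complex sentence easily"
ciphertext=$(echo "$plaintext" | ./encode "$secret")
mac=$(echo -n "$ciphertext" | openssl dgst -sha256 -hmac "$mac_key" -r | cut -d' ' -f1)
echo "$ciphertext $mac"   # send both; the recipient recomputes the HMAC before deciphering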
Some additional thoughts:
Such a cipher method covering a balanced language graph would, for free, become one of the best natural-language text compressors available (like when you order numbers 3, 10 and 12 in a Chinese restaurant (I recommend the 12)).
This gives plausible deniability, especially if re-decoded (to another attemptext): you can even deny it is a message encoded with this method; this text was just the poetry of a drunken man ;-)
A little junk is tolerated in the brute-forced plaintexts (even a lot! 99% of detectable junk only costs about 6.6 bits of key material, since the attacker's candidate set shrinks by a factor of 100).
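The 6.6-bit figure is just log2(100), since discarding 99% of the attemptexts shrinks the candidate set by a factor of 100; a quick check with bc (l() is the natural logarithm):
$ echo "l(100)/l(2)" | bc -l | cut -c1-4
6.64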
A limited language coverage with a stronger grammar can help, but then we need a special predictive editor so that only sentences respecting the cipher grammar can be written.
We covered the English language, but this could be extended to any formal media description for which a grammar can be written.
Problems: grammar/vocabulary coverage could be limited; we need to take special care when entering a sentence to be sure it will be handled properly by the grammar.
We also have to foresee loopholes for whatever will not be in a dictionary (family names, brands, acronyms, numbers...)
Proof-of-concept
Good starting points are grammar and spelling checkers, maybe Turing games (mimicking a human, ELIZA etc).
$ apt-cache search grammar
link-grammar - Carnegie Mellon University's link grammar parser
See also [3]
$ apt-cache show link-grammar
In Selator, D. and Temperly, D. "Parsing English with a Link Grammar" (1991), the authors defined a new formal grammatical system called a "link grammar". A sequence of words is in the language of a link grammar if there is a way to draw "links" between words in such a way that the local requirements of each word are satisfied, the links do not cross, and the words form a connected graph. The authors encoded English grammar into such a system, and wrote this program to parse English using this grammar.
Links, graphs, mmm, this description speaks to me...
# apt-get install link-grammar
$ echo "the cat is chasing the mouse" |link-grammar en
                  +-----Os-----+
+-Ds-+--Ss-+-Pg*b-+      +--Ds-+
|    |     |      |      |     |
the cat.n is.v chasing.v the mouse.n
The difficulty is the enumeration of paths that would cover the key space.
For a first attempt I will keep the grammar links of the plaintext and replace every word by another which respects the same grammar structure.
After scripting some bash around link-grammar and its dictionaries, here's what I get (see the bottom of the page for the source code):
$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode
@24:2 n.1:3651 v.4.2:1105 a n.1:6281 to v.4.1:1430 a adj.1:973 n.1:9127 adv.1:385 with a @476:16 n.1:2021
This is the encoding of the input: every word is replaced by a reference to a wordlist and its position in that list.
$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode|./decode
my example illustrates a mean to obfuscate a complex sentence easily with an impressive code
So far, so good.
$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode 123
@24:1 n.1:8481 v.4.2:1980 a n.1:1626 to v.4.1:469 a adj.1:6271 n.1:8763 adv.1:1393 with a @476:28 n.1:6878
Here I'm changing the positions using a secret key, with a very very stupid 16-bit numeric cipher.
$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode 123|./decode 123
my example illustrates a mean to obfuscate a complex sentence easily with an impressive code
So far, so good.
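For reference, here is a standalone sketch of the modular shift this cipher applies to each position. It is simplified (it ignores the chaining on the previous position that the scripts below also apply) and the position and list size are made up:
#!/bin/bash
# Toy version of the shift: position n in a wordlist of t entries is moved
# forward by the diffused secret on encode and moved back on decode.
secret=$(( 123 * 1234567 % 32768 ))            # same key diffusion as the scripts
n=3651; t=10000                                # hypothetical position and list size
nx=$(( (n - 1 + secret) % t + 1 ))             # encode: shift forward
n2=$(( (nx - 1 - secret % t + t) % t + 1 ))    # decode: shift back, kept positive
echo "$n -> $nx -> $n2"                        # prints: 3651 -> 8480 -> 3651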
But because any wrong key leads to an attemptext, we could send this attemptext instead of the encoded path, so it looks less like ciphered data and more like a psychedelic poem:
$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode 123|./decode
its rehash romanticizes a caterpillar to curry a unobstructed rotter tensely with a undeniable nose
$ echo its rehash romanticizes a caterpillar to curry a unobstructed rotter tensely with a undeniable nose|./encode|./decode 123
my example illustrates a mean to obfuscate a complex sentence easily with a basic number
So far, so good.
$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode 123|./decode 456
their age attunes a deciliter to co-produce a pious kingpin anon with a plausible subcontinent
Deciphering with a wrong key gives a grammatically correct sentence.[4]
$ echo their age attunes a deciliter to co-produce a pious kingpin anon with a plausible subcontinent|./encode 456|./decode 123
my example illustrates a mean to obfuscate a complex sentence easily with a basic number
$ echo their age attunes a deciliter to co-produce a pious kingpin anon with a plausible subcontinent|./encode 334|./decode 1
my example illustrates a mean to obfuscate a complex sentence easily with a basic number
With this simple additive cipher it's a bit like stepping forward and backward, and 123-456+334-1=0, so the shifts cancel out.
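A quick sanity check (using bc, as the scripts do) that the diffused secrets cancel exactly as well; the chaining on previous positions adds some subtlety, but the net shift of these four steps really is zero:
$ echo "(123*1234567 % 32768) - (456*1234567 % 32768) + (334*1234567 % 32768) - (1*1234567 % 32768)" | bc
0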
Brute-forcing:
$ echo "@24:1 n.1:8481 v.4.2:1980 a n.1:1626 to v.4.1:469 a adj.1:6271 n.1:8763 adv.1:1393 with a @476:28 n.1:6878"|./force ... 28364:its budgie drapes a firefly to facet an inseparable neighbour deeply with a regrettable wager 28365:your armistice models a dispersion to plagiarize a whiny lirone unfeelingly with a paradoxical teaching 28366:our vine segregates a connotation to uncork a neat humour ploddingly with a self-evident slight 28367:their appetite test-drives a dioxide to wangle a fevered librarian negatively with a false tailspin 28368:its vendor cross-indexes a conch to donate a tractable horse globally with a unbelievable siren 28369:your swimsuit implements a bulldog to notate an inflammable fragrance categorically with a understandable repair 28370:my uptake misplaces a commercial to overstrain a constant hitter adroitly with a significant shuttle 28371:their super scotches a bristle to tamp a rhombohedral force rowdily with a debatable regard 28372:its shipwright blunts an apricot to decolonize a fallback drawing languorously with an improbable pioneer 28373:your reconnaissance facelifts a vernacular to jinx a thorough correlation enquiringly with a conceivable neckband 28374:my shadow imitates an ante to mimic an opinionated donor defectively with a regrettable pianoforte 28375:their rationalist recolors a uttermost to short-change a comprehensible conveyor unflinchingly with a paradoxical musket 28376:its pestilence walls a surface to clamp a reproducible can pneumatically with a self-evident levy 28377:our radical blackmails a underclass to convey an isodicentric constabulary negligently with a false motive 28378:my penchant expropriates a succession to incense an adiabatic caber gloomily with a unbelievable lawmaker 28379:their moneylender overtaxes a shareholder to renounce an obstinate auricle causally with a understandable hindrance 28380:your patron rearranges a stroller to rival a frightful bunk adversely with a significant lair 28381:our misfit videotapes a sensationalist to brand a unavoidable aspect royally with a debatable heliport ^C
Disclaimers
This was just a proof-of-concept
- The numeric cipher is stupid, probably leading to some cycles in some wordlists
- The key is much too short
- The words in a list are not balanced by their frequency
- The grammar links are not ciphered at all
- The script retains only the first path found by link-grammar, and this can lead to situations where the grammar links discovered on a modified plaintext do not correspond to the initial plaintext links [5]
- The encoding step is painfully slow
- If the link-grammar wordlists get updated in a new version, this will change the way sentences are enciphered, and deciphering of old encoded messages will be broken
- etc.
BTW, why mention quantum computers?
I'm not sure they would help with brute force; they will help break public-key cryptography for sure, but in 2007 I found it sexy in the title ;-)
polygen /usr/share/polygen/eng/debian/thanks.grm
Footnotes
[1] there is more than one technological error in this book (cf. e.g. http://victoria.tc.ca/int-grps/books/techrev/bkdgtlft.rvw ); one of the most obvious ones is probably that the NSA never considered that the "unbreakable message" could have been random garbage.
Giving a "ciphered text" as a challenge to break, without even giving access to the cipher itself, not even as an oracle, is by no means a proof that your cipher is unbreakable, even if I remember having faced such marketing communication in the past from fools trying to sell a new cipher.
Note that what is finally revealed as the Digital Fortress unbreakable code eventually makes sense in its context (if the proper terminology had been used); think about the integer-overflow vulnerability in the "file" command for example (CVE-2007-2799).
No, I will not tell more; for those who haven't read the book yet and plan to do so, I'm not a spoiler ;-)
[2] that's how I broke my first 32-bit cipher for the HP48 when I was 17: its author was so confident that he prepended the key to the plaintext.
[3] I also discovered "polygen": PolyGen is a program for generating random sentences according to a grammar definition, that is following custom syntactical and lexical rules.
Not suitable for us unless someone writes a complete grammar of English, but so funny, try it ;-)
[4] Actually, on this particular input link-grammar found 2 paths and the first one considers "complex" as a noun rather than an adjective; using the second path would have been more correct.
[5] Here's an example with an older version of link-grammar (secret diffusion was commented out in the code to build this case more easily)
echo "it's a complete push" |./encode 1|./decode|./encode|./decode 1 => it 's a complement nonentity
It happened because the intermediate sentence was "it 's a complex sentence": "complex" was built from the adjective list (complete->complex) but analysed by link-grammar in the second "./encode" as a noun, so it was deciphered as (complex->complement).
This specific example works again in the latest versions of link-grammar, but I'm not sure it could never happen again with other sentences...
It's a rare border case, but if the message has to be conveyed as a ciphered sentence instead of a ciphered path, check first that it can be deciphered properly.
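Here is a hedged sketch of such a pre-flight check (the key and message below are made up): compare what a direct encode/decode with the right key yields against what the recipient would recover from the attemptext.
#!/bin/bash
# Round-trip check before sending the attemptext itself instead of the encoded path.
key=123
msg="my example illustrates a mean to obfuscate a complex sentence easily"
direct=$(echo "$msg" | ./encode "$key" | ./decode "$key")      # reference result
attemptext=$(echo "$msg" | ./encode "$key" | ./decode)         # what would be sent
recovered=$(echo "$attemptext" | ./encode | ./decode "$key")   # what the recipient sees
[ "$recovered" = "$direct" ] && echo "OK" || echo "border case, rephrase and retry"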
[6] CAPTCHA is an acronym: "Completely Automated Public Turing test to tell Computers and Humans Apart" and is trademarked by Carnegie Mellon University.
Note that in our case, a human could manually finish the test if the computer already managed to reduce the original set of n texts to a more reasonable number.
[7] The intended recipient won't have any problem to distinguish in one single text which parts are meaningful and which are just bananas jumping after the carpet.
Scripts
They require link-grammar.
The latest script revisions were tested with link-grammar version 4.7.4.
SECURITY WARNING: link-grammar versions prior to v4.2.5 have an exploitable buffer-overflow bug:
echo xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|link-grammar en
linkparser> Segmentation fault
Be sure not to parse unknown text, and upgrade link-grammar.
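As a possible pre-flight check on Debian-like systems (my own suggestion, assuming link-grammar was installed as a package), the scripts could refuse to run against a vulnerable version:
#!/bin/bash
# Refuse to run if the installed link-grammar is missing or older than the fixed 4.2.5.
ver=$(dpkg-query -W -f='${Version}' link-grammar 2>/dev/null)
if [ -z "$ver" ] || dpkg --compare-versions "$ver" lt 4.2.5; then
    echo "link-grammar missing or older than 4.2.5, please upgrade" >&2
    exit 1
fi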
encode
The script can be downloaded via [{{#file: encode}} this link]
#!/bin/bash
# Copyright Philippe Teuwen <phil_a_teuwen_point_org>
# License: GPLv3
LGPATH=/usr/share/link-grammar
DEBUG=false
if [ "$1" == "-v" ]; then
DEBUG=true
shift
fi
secret=$1
secret=${secret:-0}
input=$(cat)
# Keep max counts in the encoded stream
# if not, can be reconstructed at decode time
KEEP_MAX=false
# Grammar parsing:
##################
sentence=$(echo -e '!width=5000\n'$input |\
link-parser en 2>/dev/null |\
awk '/^[[:space:]]*\|[[:space:]]*\|[[:space:]]*\|/ {
getline;
sub(/ ?[A-Z]+-WALL ?/,"",$0);
print;
exit;
}')
$DEBUG && echo -n "Parsing: " 1>&2
$DEBUG && echo $sentence 1>&2
# Encoding:
##################
$DEBUG && echo -n "Encoding1: " 1>&2
sentencex=""
for word in $sentence; do
# Try to find the word in one of the dict files
file=$(echo '!!'$word |\
link-parser 2>/dev/null |\
grep "$word "|\
grep "<" |\
sed 's/.*<\(.*\)>.*/\1/')
if [ "$file" != "" ]; then
# See which occurrence is our word in that file, that's our "encoding"
n=$(cat $LGPATH/$file |\
sed 's/ $//;s/ /\n/g;'|\
grep -x -n $word|\
sed 's/:.*$//')
t=$(cat $LGPATH/$file | wc -w)
file=${file#en/words/words.}
# We also encode t, but it's just to make ciphering easier
# as t is just the size of the wordlist
wordx="$file:$n/$t"
else
wordx=$word
fi
[ "$sentencex" != "" ] && sentencex="$sentencex "
sentencex="$sentencex$wordx"
$DEBUG && echo -n "$wordx " 1>&2
done
$DEBUG && echo 1>&2
# Prepare common words dict for phases 2&3
###########################
TMPDICT=/tmp/4.0.dict.short
# we eliminate the lists "a<>an" and "such a<>such an"
# as the grammar cannot tell the difference
# At decoding we'll try to make a smart guess
[ -e $TMPDICT ] ||\
cat $LGPATH/en/4.0.dict |\
tr '\n;' ' \n'|\
sed 's/^ \+//;s/ / /g'|\
grep '^[a-z].* [a-z].*:'|\
sed 's/:.*$//'|\
grep -v "^\(a an\|such_a such_an\)$" \
> $TMPDICT
# Encoding, phase 2
###################
$DEBUG && echo -n "Encoding2: " 1>&2
# Restore words with spaces
dictspace=$(cat $TMPDICT |egrep -o "[a-z]+(_[a-z]+)+")
for d in $dictspace; do
sentencex=$(echo $sentencex|sed "s/$(echo $d|tr '_' ' ')/$d/g")
done
$DEBUG && echo $sentencex 1>&2
# Encoding, phase 3
###################
$DEBUG && echo -n "Encoding3: " 1>&2
sentencey=""
for wordx in $sentencex; do
if [[ "$wordx" =~ : ]]; then
# This is an already encoded word
wordy=$wordx
else
# Look in the tmpdict for a second chance
# result is "line_nr:list of words"
list=$(cat $TMPDICT|grep -n -w -m 1 $wordx|tr : ' ')
# grep -w considers e.g. vis-a-vis as 3 words
# so we want to check the match by ourselves
list=$(echo " $list "|grep " $wordx ")
if [ "$list" != "" ]; then
# We got the word in a list
eval $(echo $list|sed 's/^\([0-9]\+\) \(.*\)/l=\1;list="\2"/')
n=$(echo $list|\
tr ' ' '\n'|\
grep -n $wordx|\
sed 's/:.*//')
t=$(echo $list|wc -w)
wordy="@$l:$n/$t"
else
# Replace all "an" by "a" as the grammar cannot make the diff
# At decoding we ll try to do smart guess
[ "$wordx" == "an" ] && wordx="a"
# Clean word extension
wordy=${wordx/.*}
fi
fi
[ "$sentencey" != "" ] && sentencey="$sentencey "
sentencey="$sentencey$wordy"
$DEBUG && echo -n "$wordy " 1>&2
done
$DEBUG && echo 1>&2
# Ciphering:
##################
$DEBUG && echo -n "Ciphering: " 1>&2
# lousy way to diffuse the secret
secret=$(echo "$secret * 1234567 % 32768"|bc)
oldnx=0
sentencez=""
for wordy in $sentencey; do
if [[ "$wordy" =~ : ]]; then
# This is an encoded word
eval $(echo $wordy | sed 's/\(.*\):\(.*\)\/\(.*\)/file=\1;n=\2;t=\3/')
# Very stupid ciphering: apply secret as a common shift to all the words
nx=$(echo "($n - 1 + $secret + $oldnx) % $t + 1"|bc)
[ $secret -ne 0 ] && oldnx=$nx
wordz="$file:$nx"
$KEEP_MAX && wordz="$wordz/$t"
else
wordz=$wordy
fi
[ "$sentencez" != "" ] && sentencez="$sentencez "
sentencez="$sentencez$wordz"
$DEBUG && echo -n "$wordz " 1>&2
done
$DEBUG && echo 1>&2
echo $sentencez
decode
The script can be downloaded via [{{#file: decode}} this link]
#!/bin/bash
# Copyright Philippe Teuwen <phil_a_teuwen_point_org>
# License: GPLv3
LGPATH=/usr/share/link-grammar
sentencez=$(cat)
DEBUG=false
if [ "$1" == "-v" ]; then
DEBUG=true
shift
fi
secret=$1
secret=${secret:-0}
# Prepare common words dict
###########################
TMPDICT=/tmp/4.0.dict.short
# we eliminate the list "a<>an" and "such a<>such an"
# as there should have been a stricter rule
[ -e $TMPDICT ] ||\
cat $LGPATH/en/4.0.dict |\
tr '\n;' ' \n'|\
sed 's/^ \+//;s/ / /g'|\
grep '^[a-z].* [a-z].*:'|\
sed 's/:.*$//'|\
grep -v "^\(a an\|such_a such_an\)$" \
> $TMPDICT
# Deciphering:
##############
# lousy way to diffuse the secret
$DEBUG && echo -n "key: $secret"
secret=$(echo "$secret * 1234567 % 32768"|bc)
$DEBUG && echo "=>$secret"
oldnx=0
sentencex=""
for wordz in $sentencez; do
if [[ "$wordz" =~ : ]]; then
# This is a ciphered word
if [[ "$wordz" =~ / ]]; then
eval $(echo $wordz | sed 's/\(.*\):\(.*\)\/\(.*\)/file=\1;nx=\2;t=\3/')
else
eval $(echo $wordz | sed 's/\(.*\):\(.*\)/file=\1;nx=\2/')
if [[ "$file" =~ ^@ ]]; then
l=${file##@}
t=$(cat $TMPDICT |\
awk "FNR==$l{print NF}")
else
t=$(cat $LGPATH/en/words/words.$file | wc -w)
fi
fi
n=$(echo "($nx - 1 - $secret - $oldnx + ($t * ((($secret + $oldnx) / $t)+1))) % $t + 1"|bc)
[ $secret -ne 0 ] && oldnx=$nx
wordx="$file:$n"
else
wordx=$wordz
fi
[ "$sentencex" != "" ] && sentencex="$sentencex "
sentencex="$sentencex$wordx"
done
$DEBUG && echo -n "Deciphering:" 1>&2
$DEBUG && echo $sentencex 1>&2
# Decoding:
###########
was_a=false
for wordx in $sentencex; do
if [[ "$wordx" =~ : ]]; then
# This is an encoded word
eval $(echo $wordx | sed 's/\(.*\):\(.*\)/file=\1;n=\2/')
if [[ "$file" =~ ^@ ]]; then
l=${file##@}
word=$(cat $TMPDICT |\
awk "FNR==$l{print \$$n}")
word=${word//_/ }
else
word=$(cat $LGPATH/en/words/words.$file |\
sed 's/ $//;s/ /\n/g;'|\
awk "FNR==$n")
# Schtroumpf mode
# Example: ./decode 0 n smurf
[ "$3" != "" ] && [[ "$word" =~ \.$2 ]] && word=$3
fi
# Clean word extension
word=${word/.*}
is_a=false
else
word=${wordx}
[ "$word" == "a" ] && is_a=true || is_a=false
fi
# as the grammar cannot tell the difference between "a" and "an"
# we'll try to make a smart guess
if $was_a; then
a="a"
[[ "$word" =~ ^[aeio] ]] && a="an"
word="$a $word"
fi
was_a=$is_a
if ! $is_a; then
[ "$sentence" != "" ] && sentence="$sentence "
sentence="$sentence$word"
fi
done
$DEBUG && echo -n "Decoding: " 1>&2
$DEBUG && echo $sentence 1>&2
echo $sentence
force
An innocent attempt at brute force
The script can be downloaded via [{{#file: force}} this link]
#!/bin/bash
# Copyright Philippe Teuwen <phil_a_teuwen_point_org>
# License: GPLv3
input=$(cat)
seed=$RANDOM
for ((i=0;i<100;i++)); do
echo -n "$(($i+$seed)):"
echo $input|./decode $(($i+$seed)) || exit
done
Related work
Which means *I* relate those topics to mine, not that they rely on me :-)
Format-preserving encryption
In cryptography, format-preserving encryption (FPE) refers to encrypting in such a way that the output (the ciphertext) is in the same format as the input (the plaintext). The meaning of "format" varies. Typically only finite domains are discussed, for example:
[...] To encrypt an English word so that the ciphertext is another English word. (added in April 2010)
Once the sentence is tokenised, those techniques could help with a better word substitution.
Those folks who treated foreign languages like ciphers
Check Kevin Knight's work, and this article through which I discovered his work: Knight was part of an extremely small group of machine-translation researchers who treated foreign languages like ciphers—as if Russian, for example, were just a series of cryptological symbols representing English words. In code-breaking, he explained, the central job is to figure out the set of rules for turning the cipher's text into plain words: which letters should be swapped, when to turn a phrase on its head, when to ignore a word altogether. Establishing that type of rule set, or "key," is the main goal of machine translators too. Except that the key for translating Russian into English is far more complex. Words have multiple meanings, depending on context. Grammar varies widely from language to language. And there are billions of possible word combinations.
Philippe Teuwen <phil teuwen org>