Difference between revisions of "Sci-Fi Crypto"

From YobiWiki
Jump to navigation Jump to search
m (Reverted edits by Etegohy (Talk) to last revision by PhilippeTeuwen)
 
(21 intermediate revisions by the same user not shown)
Line 1: Line 1:
This is the copy of [http://hack.lu/index.php/Sci-Fi_Crypto an idea I've spread during the HackCamp] about Sci-Fi crypto
+
This is the copy of an idea I've spread during a Hack.lu 2007 HackCamp about Sci-Fi crypto, original link is now dead.
   
  +
==Sci-Fi crypto==
===The original essay===
 
'''Sci-Fi(?) crypto: brainstorming on how to securely exchange messages while Big Brother has at its disposal nice quantum computers
+
===Brainstorming on how to securely exchange messages while Big Brother has at its disposal nice quantum computers===
(or how to defeat brute-force attacks on message ciphers)'''
+
===or how to defeat brute-force attacks on message ciphers===
   
 
Some time ago I was reading "Digital Fortress" by Dan Brown.
 
Some time ago I was reading "Digital Fortress" by Dan Brown.
Line 48: Line 48:
 
* any wrong key must lead to a proper deciphering and an attemptext, no rejection.
 
* any wrong key must lead to a proper deciphering and an attemptext, no rejection.
   
Some additional reflections:
+
Some additional thoughts:
 
<br>Such a cipher method covering a balanced language graph would become for free one of the best natural language text compressor available (like when you order in a chinese restaurant the numbers 3, 10 and 12 (I recommend the 12)).
 
<br>Such a cipher method covering a balanced language graph would become for free one of the best natural language text compressor available (like when you order in a chinese restaurant the numbers 3, 10 and 12 (I recommend the 12)).
 
<br>This gives plausible deniability, especially if re-decoded (to another attemptext): you can even deny it is a message encoded with this method, this text was just the poetry of a drunken man ;-)
 
<br>This gives plausible deniability, especially if re-decoded (to another attemptext): you can even deny it is a message encoded with this method, this text was just the poetry of a drunken man ;-)
Line 57: Line 57:
 
<br>We also have to foresee loopholes for whatever will not be in a dictionary (family names, brands, acronyms, numbers...)
 
<br>We also have to foresee loopholes for whatever will not be in a dictionary (family names, brands, acronyms, numbers...)
   
Proof-of-concept:
+
===Proof-of-concept===
 
<br>Good starting points are grammar and spelling checkers, maybe Turing games (mimicking a human, elisa etc).
 
<br>Good starting points are grammar and spelling checkers, maybe Turing games (mimicking a human, elisa etc).
   
$ apt-cache search grammar [3]
+
$ apt-cache search grammar
link-grammar - Carnegie Mellon University's link grammar parser for English
+
link-grammar - Carnegie Mellon University's link grammar parser
  +
See also [3]
   
 
$ apt-cache show link-grammar
 
$ apt-cache show link-grammar
Line 67: Line 68:
   
 
Links, graphs, mmm this description talks to me...
 
Links, graphs, mmm this description talks to me...
  +
# apt-get install link-grammar
 
$ echo "the cat is chasing the mouse" |link-grammar en
 
$ echo "the cat is chasing the mouse" |link-grammar en
+-----Os-----+
+
+-----Os-----+
+-Ds-+--Ss-+-Pg*b-+ +--Ds-+
+
+-Ds-+--Ss-+-Pg*b-+ +--Ds-+
| | | | | |
+
| | | | | |
 
the cat.n is.v chasing.v the mouse.n
 
the cat.n is.v chasing.v the mouse.n
   
 
The difficulty is the enumeration of paths that would cover the key space.
 
The difficulty is the enumeration of paths that would cover the key space.
 
<br>For a first attempt I will keep the grammar links of the plaintext and replace every word by another which respects the same grammar structure.
 
<br>For a first attempt I will keep the grammar links of the plaintext and replace every word by another which respects the same grammar structure.
<br>After scripting some bash around link-grammar and its dictionnaries, here's what I get:
+
<br>After scripting some bash around link-grammar and its dictionaries, here's what I get (see at the bottom of the page for source code):
   
$ echo my example illustrates a mean to obfuscate a complex sentence easily with a basic number|./encode
+
$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode
@23:2 n.1:2865 v.4.2:1050 a n.1:4906 to v.4.1:1352 a n.1:1762 n.1:7124 adv.1:369 with a adj.1:338 @3:3
+
@24:2 n.1:3651 v.4.2:1105 a n.1:6281 to v.4.1:1430 a adj.1:973 n.1:9127 adv.1:385 with a @476:16 n.1:2021
   
 
This is the encoding of the input, every word is replaced by a reference to a wordlist and its position in the list.
 
This is the encoding of the input, every word is replaced by a reference to a wordlist and its position in the list.
   
$ echo my example illustrates a mean to obfuscate a complex sentence easily with a basic number|./encode|./decode
+
$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode|./decode
my example illustrates a mean to obfuscate a complex sentence easily with a basic number
+
my example illustrates a mean to obfuscate a complex sentence easily with an impressive code
   
 
So far, so good.
 
So far, so good.
   
$ echo my example illustrates a mean to obfuscate a complex sentence easily with a basic number|./encode 123
+
$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode 123
@23:1 n.1:7695 v.4.2:2054 a n.1:2757 to v.4.1:2068 a n.1:8659 n.1:2548 adv.1:236 with a adj.1:302 @3:6
+
@24:1 n.1:8481 v.4.2:1980 a n.1:1626 to v.4.1:469 a adj.1:6271 n.1:8763 adv.1:1393 with a @476:28 n.1:6878
   
Here I'm changing the positions using a secret key with a very stupid 16bit numeric cipher.
+
Here I'm changing the positions using a secret key with a very very stupid 16-bit numeric cipher.
   
$ echo my example illustrates a mean to obfuscate a complex sentence easily with a basic number|./encode 123|./decode 123
+
$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode 123|./decode 123
my example illustrates a mean to obfuscate a complex sentence easily with a basic number
+
my example illustrates a mean to obfuscate a complex sentence easily with an impressive code
   
 
So far, so good.
 
So far, so good.
 
<br>But because any wrong key leads to an attemptext, we could send this attemptext instead of the encoded path so it looks less like ciphered stuff than like psychedelic poem:
 
<br>But because any wrong key leads to an attemptext, we could send this attemptext instead of the encoded path so it looks less like ciphered stuff than like psychedelic poem:
   
$ echo my example illustrates a mean to obfuscate a complex sentence easily with a basic number|./encode 123|./decode
+
$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode 123|./decode
its stampede subtends an enchantress to sustain a venture dowager crassly with an avaricious bulk
+
its rehash romanticizes a caterpillar to curry a unobstructed rotter tensely with a undeniable nose
   
$ echo its stampede subtends an enchantress to sustain a venture dowager crassly with an avaricious bulk|./encode|./decode 123
+
$ echo its rehash romanticizes a caterpillar to curry a unobstructed rotter tensely with a undeniable nose|./encode|./decode 123
 
my example illustrates a mean to obfuscate a complex sentence easily with a basic number
 
my example illustrates a mean to obfuscate a complex sentence easily with a basic number
   
 
So far, so good.
 
So far, so good.
   
$ echo my example illustrates a mean to obfuscate a complex sentence easily with a basic number|./encode 123|./decode 456
+
$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode 123|./decode 456
their triangle uncorks a chap to belittle a simpleton guitarist uneventfully with an impersonal group
+
their age attunes a deciliter to co-produce a pious kingpin anon with a plausible subcontinent
   
 
Deciphering with a wrong key gives a grammatically correct sentence.[4]
 
Deciphering with a wrong key gives a grammatically correct sentence.[4]
   
$ echo their triangle uncorks a chap to belittle a simpleton guitarist uneventfully with an impersonal group|./encode 456|./decode 123
+
$ echo their age attunes a deciliter to co-produce a pious kingpin anon with a plausible subcontinent|./encode 456|./decode 123
 
my example illustrates a mean to obfuscate a complex sentence easily with a basic number
 
my example illustrates a mean to obfuscate a complex sentence easily with a basic number
   
$ echo their triangle uncorks a chap to belittle a simpleton guitarist uneventfully with an impersonal group|./encode 334|./decode 1
+
$ echo their age attunes a deciliter to co-produce a pious kingpin anon with a plausible subcontinent|./encode 334|./decode 1
 
my example illustrates a mean to obfuscate a complex sentence easily with a basic number
 
my example illustrates a mean to obfuscate a complex sentence easily with a basic number
   
Line 121: Line 123:
 
Brute-forcing:
 
Brute-forcing:
   
$ echo "@23:1 n.1:7695 v.4.2:2054 a n.1:2757 to v.4.1:2068 a n.1:8659 n.1:2548 adv.1:236 with a adj.1:302 @3:6"|./force
+
$ echo "@24:1 n.1:8481 v.4.2:1980 a n.1:1626 to v.4.1:469 a adj.1:6271 n.1:8763 adv.1:1393 with a @476:28 n.1:6878"|./force
...
 
31158:their leak envisages a readership to groom a gene walkway formally with a treacherous bulk
 
31159:its pontiff typifies a tenor to barbecue a militarist castle hoarsely with a unrepentant handful
 
31160:your stalk mesmerizes a beholder to prohibit a recess eye informally with an accidental group
 
31161:my grocery transships a panhandle to amass a doge studio profitably with a putrid majority
 
31162:their mould mass-produces a shoemaker to pip an iguana assay satisfactorily with a sincere minority
 
31163:its rioter comprehends a washbasin to dynamite a paragon counterplot stealthily with a textured number
 
31164:our echelon mangles a lute to oversee a cleaner savannah centrally with a machiavellian bunch
 
31165:my jailer comforts a pronoun to disembarrass a flytrap twentieth culturally with an onerous batch
 
31166:their peso relinquishes a subtitle to suffuse a magician bucketful doubly with a procedural bulk
 
31167:your compliment codes a heater to deify a beautician playschool intransitively with a fricative handful
 
31168:our gamekeeper reinforces a nude to span a daybook spokesperson manfully with an inarticulate group
 
31169:my memoir explicates a scenario to insure a helpmate airflow offhand with a lifeless majority
 
31170:its blowhole refunds an expedition to seduce a wallaby mole symmetrically with a coastal minority
 
31171:your dirigible exemplifies a lark to illustrate a casualty rerun unjustly with a discreet number
 
31172:our humorist unfrocks a pluralist to brainwash an eyebrow toll adoringly with a flaccid bunch
 
 
...
 
...
  +
28364:its budgie drapes a firefly to facet an inseparable neighbour deeply with a regrettable wager
  +
28365:your armistice models a dispersion to plagiarize a whiny lirone unfeelingly with a paradoxical teaching
  +
28366:our vine segregates a connotation to uncork a neat humour ploddingly with a self-evident slight
  +
28367:their appetite test-drives a dioxide to wangle a fevered librarian negatively with a false tailspin
  +
28368:its vendor cross-indexes a conch to donate a tractable horse globally with a unbelievable siren
  +
28369:your swimsuit implements a bulldog to notate an inflammable fragrance categorically with a understandable repair
  +
28370:my uptake misplaces a commercial to overstrain a constant hitter adroitly with a significant shuttle
  +
28371:their super scotches a bristle to tamp a rhombohedral force rowdily with a debatable regard
  +
28372:its shipwright blunts an apricot to decolonize a fallback drawing languorously with an improbable pioneer
  +
28373:your reconnaissance facelifts a vernacular to jinx a thorough correlation enquiringly with a conceivable neckband
  +
28374:my shadow imitates an ante to mimic an opinionated donor defectively with a regrettable pianoforte
  +
28375:their rationalist recolors a uttermost to short-change a comprehensible conveyor unflinchingly with a paradoxical musket
  +
28376:its pestilence walls a surface to clamp a reproducible can pneumatically with a self-evident levy
  +
28377:our radical blackmails a underclass to convey an isodicentric constabulary negligently with a false motive
  +
28378:my penchant expropriates a succession to incense an adiabatic caber gloomily with a unbelievable lawmaker
  +
28379:their moneylender overtaxes a shareholder to renounce an obstinate auricle causally with a understandable hindrance
  +
28380:your patron rearranges a stroller to rival a frightful bunk adversely with a significant lair
  +
28381:our misfit videotapes a sensationalist to brand a unavoidable aspect royally with a debatable heliport
  +
^C
   
  +
===Disclaimers===
This was just a proof-of-concepts, the numeric cipher is stupid, probably leading to some cycles in some wordlists, the key is much too short, the words in a list are not balanced by their frequency, the grammar links are not ciphered at all, my script retains only the first path found by link-grammar and this can lead to situations the grammar links discovered on a modified plaintext does not correspond to the initial plaintext links [5], etc.
 
  +
This was just a proof-of-concept
<br>BTW why mentioning quantum computers? I'm not sure they would help in brute-force, they will help breaking public key cryptography for sure, but I found it sexy in the title ;-)
 
  +
* The numeric cipher is stupid, probably leading to some cycles in some wordlists
  +
* The key is much too short
  +
* The words in a list are not balanced by their frequency
  +
* The grammar links are not ciphered at all
  +
* The script retains only the first path found by link-grammar and this can lead to situations the grammar links discovered on a modified plaintext does not correspond to the initial plaintext links [5]
  +
* The encoding step is painfully slow
  +
* If link-grammar wordlists get updated in a new version, this will change the way sentences are enciphered and deciphering of old encoded messages will be broken
  +
* etc.
  +
BTW why mentioning quantum computers?
  +
<br>I'm not sure they would help in brute-force, they will help breaking public key cryptography for sure, but in 2007 I found it sexy in the title ;-)
   
  +
<source lang=bash>
 
polygen /usr/share/polygen/eng/debian/thanks.grm
 
polygen /usr/share/polygen/eng/debian/thanks.grm
  +
</source>
   
  +
===Footnotes===
Philippe Teuwen <phil teuwen org>
 
<br>Any comment is welcome!
 
 
Scripts: (require link-grammar to be installed)
 
<br>http://www.hack.lu/pres/Scificrypto/
 
 
Notes:
 
 
<br>[1] there is more than one technological error in this book (cf e.g. http://victoria.tc.ca/int-grps/books/techrev/bkdgtlft.rvw ), one of the most obvious ones is probably that NSA never considered that the "unbreakable message" could have been random garbage.
 
<br>[1] there is more than one technological error in this book (cf e.g. http://victoria.tc.ca/int-grps/books/techrev/bkdgtlft.rvw ), one of the most obvious ones is probably that NSA never considered that the "unbreakable message" could have been random garbage.
 
<br>Giving a "ciphered text" as a challenge to break, without even giving access to the cipher itself, even not as oracle, is by no mean a proof your cipher is unbreakable, even if I remember in the past having faced such a marketing communication from fools trying to sell a new cipher.
 
<br>Giving a "ciphered text" as a challenge to break, without even giving access to the cipher itself, even not as oracle, is by no mean a proof your cipher is unbreakable, even if I remember in the past having faced such a marketing communication from fools trying to sell a new cipher.
Line 160: Line 171:
 
<br>Not suitable for us unless someone writes a complete grammar of english, but so funny, try it ;-)
 
<br>Not suitable for us unless someone writes a complete grammar of english, but so funny, try it ;-)
 
<br>[4] Actually on this particular input link-grammar found 2 paths and the first one considers "complex" as a noun rather than an adjective, using the second path would have been more correct.
 
<br>[4] Actually on this particular input link-grammar found 2 paths and the first one considers "complex" as a noun rather than an adjective, using the second path would have been more correct.
<br>[5] Here's an example (secret diffusion was commented out in the code to build this case more easily)
+
<br>[5] Here's an example with an older version of link-grammar (secret diffusion was commented out in the code to build this case more easily)
 
echo "it's a complete push" |./encode 1|./decode|./encode|./decode 1 => it 's a complement nonentity
 
echo "it's a complete push" |./encode 1|./decode|./encode|./decode 1 => it 's a complement nonentity
<br>It happened because the intermediate sentence was "it 's a complex sentence", "complex" was build from the adjective list (complete->complex) but analysed in "|./encode|" by link-grammar as a noun, so deciphered as (complex->complement)
+
<br>It happened because the intermediate sentence was "it 's a complex sentence", "complex" was build from the adjective list (complete->complex) but analysed in "|./encode|" by link-grammar as a noun, so deciphered as (complex->complement)
  +
<br>This specific example works again in latest versions of link-grammar but I'm not sure it could never happen again with other sentences...
 
<br>It's a rare border case but if the message has to be conveyed as ciphered sentence instead of a ciphered path, check first that it can be deciphered properly.
 
<br>It's a rare border case but if the message has to be conveyed as ciphered sentence instead of a ciphered path, check first that it can be deciphered properly.
 
<br>[6] CAPTCHA is an acronym: "Completely Automated Public Turing test to tell Computers and Humans Apart" and is trademarked by Carnegie Mellon University.
 
<br>[6] CAPTCHA is an acronym: "Completely Automated Public Turing test to tell Computers and Humans Apart" and is trademarked by Carnegie Mellon University.
Line 168: Line 180:
 
<br>[7] The intended recipient won't have any problem to distinguish in one single text which parts are meaningful and which are just bananas jumping after the carpet.
 
<br>[7] The intended recipient won't have any problem to distinguish in one single text which parts are meaningful and which are just bananas jumping after the carpet.
   
  +
===Scripts===
Scripts: (require link-grammar to be installed)
 
  +
They require link-grammar
<br>http://www.hack.lu/pres/Scificrypto/
 
  +
<br>Latest scripts revision were tested with link-grammar version 4.7.4)
 
'''SECURITY UPDATE''' link-grammar versions prior to v4.2.5 have an [http://secunia.com/advisories/27340/ exploitable buffer-overflow bug]:
+
<br>'''SECURITY WARNING''' link-grammar versions prior to v4.2.5 have an [http://secunia.com/advisories/27340/ exploitable buffer-overflow bug]:
 
echo xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|link-grammar en
 
echo xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|link-grammar en
 
linkparser> Segmentation fault
 
linkparser> Segmentation fault
Be sure to not parse unknown text or to upgrade to v2.4.5
+
Be sure to not parse unknown text and to upgrade link-grammar
  +
====encode====
  +
Script can be easily downloaded by [{{#file: encode}} this link]
  +
<source lang=bash>
  +
#!/bin/bash
  +
  +
# Copyright Philippe Teuwen <phil_a_teuwen_point_org>
  +
# License: GPLv3
  +
  +
LGPATH=/usr/share/link-grammar
  +
DEBUG=false
  +
if [ "$1" == "-v" ]; then
  +
DEBUG=true
  +
shift
  +
fi
  +
secret=$1
  +
secret=${secret:-0}
  +
input=$(cat)
  +
# Keep max counts in the encoded stream
  +
# if not, can be reconstructed at decode time
  +
KEEP_MAX=false
  +
  +
# Grammar parsing:
  +
##################
  +
  +
sentence=$(echo -e '!width=5000\n'$input |\
  +
link-parser en 2>/dev/null |\
  +
awk '/^[[:space:]]*\|[[:space:]]*\|[[:space:]]*\|/ {
  +
getline;
  +
sub(/ ?[A-Z]+-WALL ?/,"",$0);
  +
print;
  +
exit;
  +
}')
  +
$DEBUG && echo -n "Parsing: " 1>&2
  +
$DEBUG && echo $sentence 1>&2
  +
  +
# Encoding:
  +
##################
  +
  +
$DEBUG && echo -n "Encoding1: " 1>&2
  +
sentencex=""
  +
for word in $sentence; do
  +
# Try to find the word in one of the dict files
  +
file=$(echo '!!'$word |\
  +
link-parser 2>/dev/null |\
  +
grep "$word "|\
  +
grep "<" |\
  +
sed 's/.*<\(.*\)>.*/\1/')
  +
if [ "$file" != "" ]; then
  +
# See which occurence is our word in that file, that's our "encoding"
  +
n=$(cat $LGPATH/$file |\
  +
sed 's/ $//;s/ /\n/g;'|\
  +
grep -x -n $word|\
  +
sed 's/:.*$//')
  +
t=$(cat $LGPATH/$file | wc -w)
  +
file=${file#en/words/words.}
  +
# We encode also t but it's just to make ciphering easier
  +
# as t is just the size of the wordlist
  +
wordx="$file:$n/$t"
  +
else
  +
wordx=$word
  +
fi
  +
[ "$sentencex" != "" ] && sentencex="$sentencex "
  +
sentencex="$sentencex$wordx"
  +
$DEBUG && echo -n "$wordx " 1>&2
  +
done
  +
$DEBUG && echo 1>&2
  +
  +
# Prepare common words dict for phases 2&3
  +
###########################
  +
  +
TMPDICT=/tmp/4.0.dict.short
  +
# we eliminate the list "a<>an" and "such a<>such an"
  +
# as the grammar cannot make the diff
  +
# At decoding we ll try to do smart guess
  +
[ -e $TMPDICT ] ||\
  +
cat $LGPATH/en/4.0.dict |\
  +
tr '\n;' ' \n'|\
  +
sed 's/^ \+//;s/ / /g'|\
  +
grep '^[a-z].* [a-z].*:'|\
  +
sed 's/:.*$//'|\
  +
grep -v "^\(a an\|such_a such_an\)$" \
  +
> $TMPDICT
  +
  +
# Encoding, phase 2
  +
###################
  +
  +
$DEBUG && echo -n "Encoding2: " 1>&2
  +
# Restore words with spaces
  +
dictspace=$(cat $TMPDICT |egrep -o "[a-z]+(_[a-z]+)+")
  +
for d in $dictspace; do
  +
sentencex=$(echo $sentencex|sed "s/$(echo $d|tr '_' ' ')/$d/g")
  +
done
  +
  +
$DEBUG && echo $sentencex 1>&2
  +
  +
# Encoding, phase 3
  +
###################
  +
  +
$DEBUG && echo -n "Encoding3: " 1>&2
  +
sentencey=""
  +
for wordx in $sentencex; do
  +
if [[ "$wordx" =~ : ]]; then
  +
# This is an already encoded word
  +
wordy=$wordx
  +
else
  +
# Look in the tmpdict for a second chance
  +
# result is "line_nr:list of words"
  +
list=$(cat $TMPDICT|grep -n -w -m 1 $wordx|tr : ' ')
  +
# grep -w considers e.g. vis-a-vis as 3 words
  +
# so we want to check the match by ourselves
  +
list=$(echo " $list "|grep " $wordx ")
  +
if [ "$list" != "" ]; then
  +
# We got the word in a list
  +
eval $(echo $list|sed 's/^\([0-9]\+\) \(.*\)/l=\1;list="\2"/')
  +
n=$(echo $list|\
  +
tr ' ' '\n'|\
  +
grep -n $wordx|\
  +
sed 's/:.*//')
  +
t=$(echo $list|wc -w)
  +
wordy="@$l:$n/$t"
  +
else
  +
# Replace all "an" by "a" as the grammar cannot make the diff
  +
# At decoding we ll try to do smart guess
  +
[ "$wordx" == "an" ] && wordx="a"
  +
# Clean word extension
  +
wordy=${wordx/.*}
  +
fi
  +
fi
  +
[ "$sentencey" != "" ] && sentencey="$sentencey "
  +
sentencey="$sentencey$wordy"
  +
$DEBUG && echo -n "$wordy " 1>&2
  +
done
  +
$DEBUG && echo 1>&2
  +
  +
# Ciphering:
  +
##################
  +
  +
$DEBUG && echo -n "Ciphering: " 1>&2
  +
# loosy way to diffuse the secret
  +
secret=$(echo "$secret * 1234567 % 32768"|bc)
  +
oldnx=0
  +
  +
sentencez=""
  +
for wordy in $sentencey; do
  +
if [[ "$wordy" =~ : ]]; then
  +
# This is an encoded word
  +
eval $(echo $wordy | sed 's/\(.*\):\(.*\)\/\(.*\)/file=\1;n=\2;t=\3/')
  +
# Very stupid ciphering: apply secret as a common shift to all the words
  +
nx=$(echo "($n - 1 + $secret + $oldnx) % $t + 1"|bc)
  +
[ $secret -ne 0 ] && oldnx=$nx
  +
wordz="$file:$nx"
  +
$KEEP_MAX && wordz="$wordz/$t"
  +
else
  +
wordz=$wordy
  +
fi
  +
[ "$sentencez" != "" ] && sentencez="$sentencez "
  +
sentencez="$sentencez$wordz"
  +
$DEBUG && echo -n "$wordz " 1>&2
  +
done
  +
$DEBUG && echo 1>&2
  +
  +
echo $sentencez
  +
</source>
  +
====decode====
  +
Script can be easily downloaded by [{{#file: decode}} this link]
  +
<source lang=bash>
  +
#!/bin/bash
  +
  +
# Copyright Philippe Teuwen <phil_a_teuwen_point_org>
  +
# License: GPLv3
  +
  +
LGPATH=/usr/share/link-grammar
  +
sentencez=$(cat)
  +
DEBUG=false
  +
if [ "$1" == "-v" ]; then
  +
DEBUG=true
  +
shift
  +
fi
  +
secret=$1
  +
secret=${secret:-0}
  +
  +
# Prepare common words dict
  +
###########################
  +
  +
TMPDICT=/tmp/4.0.dict.short
  +
# we eliminate the list "a<>an" and "such a<>such an"
  +
# as there should have been a stricter rule
  +
[ -e $TMPDICT ] ||\
  +
cat $LGPATH/en/4.0.dict |\
  +
tr '\n;' ' \n'|\
  +
sed 's/^ \+//;s/ / /g'|\
  +
grep '^[a-z].* [a-z].*:'|\
  +
sed 's/:.*$//'|\
  +
grep -v "^\(a an\|such_a such_an\)$" \
  +
> $TMPDICT
  +
  +
# Deciphering:
  +
##############
  +
  +
# loosy way to diffuse the secret
  +
$DEBUG && echo -n "key: $secret"
  +
secret=$(echo "$secret * 1234567 % 32768"|bc)
  +
$DEBUG && echo "=>$secret"
  +
oldnx=0
  +
  +
sentencex=""
  +
for wordz in $sentencez; do
  +
if [[ "$wordz" =~ : ]]; then
  +
# This is a ciphered word
  +
if [[ "$wordz" =~ / ]]; then
  +
eval $(echo $wordz | sed 's/\(.*\):\(.*\)\/\(.*\)/file=\1;nx=\2;t=\3/')
  +
else
  +
eval $(echo $wordz | sed 's/\(.*\):\(.*\)/file=\1;nx=\2/')
  +
if [[ "$file" =~ ^@ ]]; then
  +
l=${file##@}
  +
t=$(cat $TMPDICT |\
  +
awk "FNR==$l{print NF}")
  +
else
  +
t=$(cat $LGPATH/en/words/words.$file | wc -w)
  +
fi
  +
fi
  +
n=$(echo "($nx - 1 - $secret - $oldnx + ($t * ((($secret + $oldnx) / $t)+1))) % $t + 1"|bc)
  +
[ $secret -ne 0 ] && oldnx=$nx
  +
wordx="$file:$n"
  +
else
  +
wordx=$wordz
  +
fi
  +
[ "$sentencex" != "" ] && sentencex="$sentencex "
  +
sentencex="$sentencex$wordx"
  +
done
  +
  +
$DEBUG && echo -n "Deciphering:" 1>&2
  +
$DEBUG && echo $sentencex 1>&2
  +
  +
# Decoding:
  +
###########
  +
  +
was_a=false
  +
for wordx in $sentencex; do
  +
if [[ "$wordx" =~ : ]]; then
  +
# This is an encoded word
  +
eval $(echo $wordx | sed 's/\(.*\):\(.*\)/file=\1;n=\2/')
  +
if [[ "$file" =~ ^@ ]]; then
  +
l=${file##@}
  +
word=$(cat $TMPDICT |\
  +
awk "FNR==$l{print \$$n}")
  +
word=${word//_/ }
  +
else
  +
word=$(cat $LGPATH/en/words/words.$file |\
  +
sed 's/ $//;s/ /\n/g;'|\
  +
awk "FNR==$n")
  +
# Schtroumpf mode
  +
# Example: ./decode 0 n smurf
  +
[ "$3" != "" ] && [[ "$word" =~ \.$2 ]] && word=$3
  +
fi
  +
# Clean word extension
  +
word=${word/.*}
  +
is_a=false
  +
else
  +
word=${wordx}
  +
[ "$word" == "a" ] && is_a=true || is_a=false
  +
fi
  +
# as the grammar cannot make the diff between "a" and "an"
  +
# we ll try to do smart guess
  +
if $was_a; then
  +
a="a"
  +
[[ "$word" =~ ^[aeio] ]] && a="an"
  +
word="$a $word"
  +
fi
  +
was_a=$is_a
  +
if ! $is_a; then
  +
[ "$sentence" != "" ] && sentence="$sentence "
  +
sentence="$sentence$word"
  +
fi
  +
done
  +
  +
$DEBUG && echo -n "Decoding: " 1>&2
  +
$DEBUG && echo $sentence 1>&2
  +
echo $sentence
  +
</source>
  +
====force====
  +
Innocent attempt to use brute-force
  +
<br>Script can be easily downloaded by [{{#file: force}} this link]
  +
<source lang=bash>
  +
#!/bin/bash
  +
  +
# Copyright Philippe Teuwen <phil_a_teuwen_point_org>
  +
# License: GPLv3
  +
  +
input=$(cat)
  +
seed=$RANDOM
  +
for ((i=0;i<100;i++)); do
  +
echo -n "$(($i+$seed)):"
  +
echo $input|./decode $(($i+$seed)) || exit
  +
done
  +
</source>
  +
===Related work===
  +
Which means *I* relate those topics to mine, not that they rely on me :-)
  +
====[http://www.cosic.esat.kuleuven.be/publications/article-915.pdf Lexical natural language steganography systems with human interaction (pdf)]====
  +
Word substitution over an IRC channel.
  +
<br>It's actually pure steganography: using e.g. choice between 4 synonyms to encode 2 bits
  +
  +
====[https://en.wikipedia.org/wiki/Format-preserving_encryption Format-preserving encryption]====
  +
''In cryptography, format-preserving encryption (FPE) refers to encrypting in such a way that the output (the ciphertext) is in the same format as the input (the plaintext). The meaning of "format" varies. Typically only finite domains are discussed, for example:''
  +
<br>''[...] To encrypt an English word so that the ciphertext is another English word.'' (added in April 2010)
  +
<br><br>Once the sentence is tokenised, those techniques could help for a better word substitution
  +
====[http://pages.cs.wisc.edu/~rist/papers/HoneyEncryptionpre.pdf Honey encryption and decoys (pdf)]====
  +
''We introduce honey encryption, a form of password-based encryption in which decrypting with incorrect passwords yields fake, but realistic-looking, plaintexts.'' (added in January 2014)
  +
  +
====Those folks who treat foreign languages like ciphers====
  +
Check [http://www.isi.edu/~knight/ Kevin Knight's work], and [http://www.wired.com/dangerroom/2012/11/ff-the-manuscript/all/ this fascinating article] though which I discovered his work:
  +
<br>''Knight was part of an extremely small group of machine-translation researchers who treated foreign languages like ciphers—as if Russian, for example, were just a series of cryptological symbols representing English words. In code-breaking, he explained, the central job is to figure out the set of rules for turning the cipher’s text into plain words: which letters should be swapped, when to turn a phrase on its head, when to ignore a word altogether. Establishing that type of rule set, or “key,” is the main goal of machine translators too. Except that the key for translating Russian into English is far more complex. Words have multiple meanings, depending on context. Grammar varies widely from language to language. And there are billions of possible word combinations.''
  +
  +
====Those folks who try to compress natural language====
  +
See [https://en.wikipedia.org/wiki/Hutter_Prize Hutter Prize]
  +
<br>Best result if I read correctly is reducing initial text size of 100Mb to 15.95Mb (including the decompressor!).
  +
<br>For comparison, a 7zip self-extractible archive is 21.2Mb large.
  +
  +
====[https://github.com/ESultanik/lenticrypt Lenticrypt]====
  +
''A simple cryptosystem that provides provable plausibly deniable encryption. Lenticrypt can generate a single ciphertext file such that different plaintexts are generated depending on which key is used for decryption.''
   
 
===Feel free to share your opinion===
 
===Feel free to share your opinion===
  +
Philippe Teuwen <phil teuwen org>
  +
<br>@doegox

Latest revision as of 12:48, 25 May 2015

This is the copy of an idea I've spread during a Hack.lu 2007 HackCamp about Sci-Fi crypto, original link is now dead.

Sci-Fi crypto

Brainstorming on how to securely exchange messages while Big Brother has at its disposal nice quantum computers

or how to defeat brute-force attacks on message ciphers

Some time ago I was reading "Digital Fortress" by Dan Brown.
A thriller in the NSA headquarters where the head of the crypto division is facing an unbreakable code, resistant to the brute-force attacks of the NSA's supercomputer TRANSLTR.
The thrilling fact is that the author of this code threatened NSA that the code will be revealed to the entire world if he happens to die.
And his death is the start of the book...

So I was reading and while the protagonists were busy imaginating how this unbreakable code was designed, I thought:
"Bu..sh..![1] To make an unbreakable code this is not like that that you have to do. You've rather to..."
And that's how I imaginated what I'm about to talk about.

Reading further, I found one quote much smarter than the rest of the book, which illustrates very well my technique:
"you could have broken the code but maybe you didn't see it!"

Let's try to create a message cipher resistant to brute force.
I said "message cipher" and not "stream cipher" because this cipher will not work on binary data but english text.
Brute force by definition can break any known symmetric cipher, but what if you don't know you broke it?
Generating all the possible plaintexts is only half of the game, the other half is to have a decision criterion to discriminate the right plaintext amongst the others.
And that's this half that we'll try to make harder for the cryptanalyst.
Here I need a new word because usually crypto talks about plaintext and ciphertext but not about possibly wrongly decoded ciphertexts.
We'll call that an attemptext (tentative of plaintext).

Imagine a ciphered text that is always deciphered to a grammatically correct attemptext, no matter which (wrong) key you used.
Cryptanalysis requires therefore a large amount of work to retrieve the correct plaintext among all attemptexts.
This is the start of a war between artificial intelligences...
Anything the cryptanalyst program can find to discard an attemptext means that you can probably tune the initial grammar model to create more plausible attemptexts.
The problem for the cryptanalyst is a variation of the Turing test as it's a test of humanity, with some differences with the original test:

  • the test procedure is not a dialog but consists of n texts of which only one was written by a human
  • the tester is rather a computer than a human (like for a captcha(tm)[6])
  • the human does not want to prove his humanity, to the contrary, so he can deliberately introduce some non-sense in his text [7]

Let's represent a language by a graph.
Each sentence is one path through the language graph.
Taking another random path will lead to another grammatically correct sentence.
The perfect language graph represents all possible sentences but no junk, so a cryptanalysis has to be done on another level than the grammar or other easy tricks like alphabet frequency analysis.


To cipher our message, the first step is not a cryptographic one but the encoding of the sentence as a path along the grammar graph.
This path has to be identified numerically (enumerated) amongst the possible paths.
Ideally the enumeration must be balanced by the frequency of common grammatical constructions and vocabulary, for example something you get for free if you manage to map some Huffman coding on it.
If there is a complete map between all the paths up to a given length and a bounded set of integers, then we have the guarantee that any random pick in the set will be accepted by the deciphering routine and will lead to an attemptext.
And the numerical representation can now be ciphered by any classic symmetric cipher.
So the complete solution has to follow some rules:

  • it must run without revealing any metadata confirming the right key when brute-forced, so e.g. don't introduce any checksum in the plaintext [2] but on the ciphered text!
  • any wrong key must lead to a proper deciphering and an attemptext, no rejection.

Some additional thoughts:
Such a cipher method covering a balanced language graph would become for free one of the best natural language text compressor available (like when you order in a chinese restaurant the numbers 3, 10 and 12 (I recommend the 12)).
This gives plausible deniability, especially if re-decoded (to another attemptext): you can even deny it is a message encoded with this method, this text was just the poetry of a drunken man ;-)
A little junk is tolerated in the brute-forced plaintexts (even a lot! 99% of detectable junk = loss of 6.6 bits of key material).
A limited language coverage with a stronger grammar can help but then we need a special predictive editor to create only sentences respectful of the cipher grammar.
We covered english language but this could be extended to any formal media description if a grammar can be written for it.
Problems: Grammar/Vocabulary could be limited, we need special care when entering a sentence and be sure it'll be handled properly by the grammar.
We also have to foresee loopholes for whatever will not be in a dictionary (family names, brands, acronyms, numbers...)

Proof-of-concept


Good starting points are grammar and spelling checkers, maybe Turing games (mimicking a human, elisa etc).

$ apt-cache search grammar
link-grammar - Carnegie Mellon University's link grammar parser

See also [3]

$ apt-cache show link-grammar
    In Selator, D. and Temperly, D. "Parsing English with a Link Grammar" (1991), the authors defined a new formal grammatical system called a "link grammar". A sequence of words is in the language of a link grammar if there is a way to draw "links" between words in such a way that the local requirements of each word are satisfied, the links do not cross, and the words form a connected graph. The authors encoded English grammar into such a system, and wrote this program to parse English using this grammar.

Links, graphs, mmm this description talks to me...

# apt-get install link-grammar
$ echo "the cat is chasing the mouse" |link-grammar en
                   +-----Os-----+
 +-Ds-+--Ss-+-Pg*b-+      +--Ds-+
 |    |     |      |      |     |
the cat.n is.v chasing.v the mouse.n 

The difficulty is the enumeration of paths that would cover the key space.
For a first attempt I will keep the grammar links of the plaintext and replace every word by another which respects the same grammar structure.
After scripting some bash around link-grammar and its dictionaries, here's what I get (see at the bottom of the page for source code):

$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode 
@24:2 n.1:3651 v.4.2:1105 a n.1:6281 to v.4.1:1430 a adj.1:973 n.1:9127 adv.1:385 with a @476:16 n.1:2021

This is the encoding of the input, every word is replaced by a reference to a wordlist and its position in the list.

$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode|./decode
my example illustrates a mean to obfuscate a complex sentence easily with an impressive code

So far, so good.

$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode 123
@24:1 n.1:8481 v.4.2:1980 a n.1:1626 to v.4.1:469 a adj.1:6271 n.1:8763 adv.1:1393 with a @476:28 n.1:6878

Here I'm changing the positions using a secret key with a very very stupid 16-bit numeric cipher.

$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode 123|./decode 123
my example illustrates a mean to obfuscate a complex sentence easily with an impressive code

So far, so good.
But because any wrong key leads to an attemptext, we could send this attemptext instead of the encoded path so it looks less like ciphered stuff than like psychedelic poem:

$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode 123|./decode
its rehash romanticizes a caterpillar to curry a unobstructed rotter tensely with a undeniable nose
$ echo its rehash romanticizes a caterpillar to curry a unobstructed rotter tensely with a undeniable nose|./encode|./decode 123
my example illustrates a mean to obfuscate a complex sentence easily with a basic number

So far, so good.

$ echo my example illustrates a mean to obfuscate a complex sentence easily with an impressive code|./encode 123|./decode 456
their age attunes a deciliter to co-produce a pious kingpin anon with a plausible subcontinent

Deciphering with a wrong key gives a grammatically correct sentence.[4]

$ echo their age attunes a deciliter to co-produce a pious kingpin anon with a plausible subcontinent|./encode 456|./decode 123
my example illustrates a mean to obfuscate a complex sentence easily with a basic number
$ echo their age attunes a deciliter to co-produce a pious kingpin anon with a plausible subcontinent|./encode 334|./decode 1
my example illustrates a mean to obfuscate a complex sentence easily with a basic number

With this simple cipher, it's a bit like stepping forward and backward and 123-456+334-1=0

Brute-forcing:

$ echo "@24:1 n.1:8481 v.4.2:1980 a n.1:1626 to v.4.1:469 a adj.1:6271 n.1:8763 adv.1:1393 with a @476:28 n.1:6878"|./force     
...
28364:its budgie drapes a firefly to facet an inseparable neighbour deeply with a regrettable wager
28365:your armistice models a dispersion to plagiarize a whiny lirone unfeelingly with a paradoxical teaching
28366:our vine segregates a connotation to uncork a neat humour ploddingly with a self-evident slight
28367:their appetite test-drives a dioxide to wangle a fevered librarian negatively with a false tailspin
28368:its vendor cross-indexes a conch to donate a tractable horse globally with a unbelievable siren
28369:your swimsuit implements a bulldog to notate an inflammable fragrance categorically with a understandable repair
28370:my uptake misplaces a commercial to overstrain a constant hitter adroitly with a significant shuttle
28371:their super scotches a bristle to tamp a rhombohedral force rowdily with a debatable regard
28372:its shipwright blunts an apricot to decolonize a fallback drawing languorously with an improbable pioneer
28373:your reconnaissance facelifts a vernacular to jinx a thorough correlation enquiringly with a conceivable neckband
28374:my shadow imitates an ante to mimic an opinionated donor defectively with a regrettable pianoforte
28375:their rationalist recolors a uttermost to short-change a comprehensible conveyor unflinchingly with a paradoxical musket
28376:its pestilence walls a surface to clamp a reproducible can pneumatically with a self-evident levy
28377:our radical blackmails a underclass to convey an isodicentric constabulary negligently with a false motive
28378:my penchant expropriates a succession to incense an adiabatic caber gloomily with a unbelievable lawmaker
28379:their moneylender overtaxes a shareholder to renounce an obstinate auricle causally with a understandable hindrance
28380:your patron rearranges a stroller to rival a frightful bunk adversely with a significant lair
28381:our misfit videotapes a sensationalist to brand a unavoidable aspect royally with a debatable heliport
^C

Disclaimers

This was just a proof-of-concept

  • The numeric cipher is stupid, probably leading to some cycles in some wordlists
  • The key is much too short
  • The words in a list are not balanced by their frequency
  • The grammar links are not ciphered at all
  • The script retains only the first path found by link-grammar and this can lead to situations the grammar links discovered on a modified plaintext does not correspond to the initial plaintext links [5]
  • The encoding step is painfully slow
  • If link-grammar wordlists get updated in a new version, this will change the way sentences are enciphered and deciphering of old encoded messages will be broken
  • etc.

BTW why mentioning quantum computers?
I'm not sure they would help in brute-force, they will help breaking public key cryptography for sure, but in 2007 I found it sexy in the title ;-)

polygen /usr/share/polygen/eng/debian/thanks.grm

Footnotes


[1] there is more than one technological error in this book (cf e.g. http://victoria.tc.ca/int-grps/books/techrev/bkdgtlft.rvw ), one of the most obvious ones is probably that NSA never considered that the "unbreakable message" could have been random garbage.
Giving a "ciphered text" as a challenge to break, without even giving access to the cipher itself, even not as oracle, is by no mean a proof your cipher is unbreakable, even if I remember in the past having faced such a marketing communication from fools trying to sell a new cipher.
Note that what is finally revealed as the Digital Fortress unbreakable code makes eventually sense in its context (if the proper terminology was used), think about the integer overflow vulnerability in the command "file" for example (CVE-2007-2799)
No I will not tell more, for those who didn't read the book yet and plan to do so, I'm not a spoiler ;-)
[2] that's how I broke my first 32bit cipher for HP48 when I was 17, its author was so confident that he prepended the plaintext with the key.
[3] I also discovered "polygen": PolyGen is a program for generating random sentences according to a grammar definition, that is following custom syntactical and lexical rules.
Not suitable for us unless someone writes a complete grammar of english, but so funny, try it ;-)
[4] Actually on this particular input link-grammar found 2 paths and the first one considers "complex" as a noun rather than an adjective, using the second path would have been more correct.
[5] Here's an example with an older version of link-grammar (secret diffusion was commented out in the code to build this case more easily) echo "it's a complete push" |./encode 1|./decode|./encode|./decode 1 => it 's a complement nonentity
It happened because the intermediate sentence was "it 's a complex sentence", "complex" was build from the adjective list (complete->complex) but analysed in "|./encode|" by link-grammar as a noun, so deciphered as (complex->complement)
This specific example works again in latest versions of link-grammar but I'm not sure it could never happen again with other sentences...
It's a rare border case but if the message has to be conveyed as ciphered sentence instead of a ciphered path, check first that it can be deciphered properly.
[6] CAPTCHA is an acronym: "Completely Automated Public Turing test to tell Computers and Humans Apart" and is trademarked by Carnegie Mellon University.
Note that in our case, a human could manually finish the test if the computer already managed to reduce the original set of n texts to a more reasonable amount.
[7] The intended recipient won't have any problem to distinguish in one single text which parts are meaningful and which are just bananas jumping after the carpet.

Scripts

They require link-grammar
Latest scripts revision were tested with link-grammar version 4.7.4)
SECURITY WARNING link-grammar versions prior to v4.2.5 have an exploitable buffer-overflow bug:

echo xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|link-grammar en
linkparser> Segmentation fault

Be sure to not parse unknown text and to upgrade link-grammar

encode

Script can be easily downloaded by [{{#file: encode}} this link]

#!/bin/bash

# Copyright Philippe Teuwen <phil_a_teuwen_point_org>
# License: GPLv3

LGPATH=/usr/share/link-grammar
DEBUG=false
if [ "$1" == "-v" ]; then
  DEBUG=true
  shift
fi
secret=$1
secret=${secret:-0}
input=$(cat)
# Keep max counts in the encoded stream
# if not, can be reconstructed at decode time
KEEP_MAX=false

# Grammar parsing:
##################

sentence=$(echo -e '!width=5000\n'$input |\
    link-parser en 2>/dev/null |\
    awk '/^[[:space:]]*\|[[:space:]]*\|[[:space:]]*\|/ {
	    getline;
	    sub(/ ?[A-Z]+-WALL ?/,"",$0);
	    print;
	    exit;
	}')
$DEBUG && echo -n "Parsing:   " 1>&2
$DEBUG && echo $sentence 1>&2

# Encoding:
##################

$DEBUG && echo -n "Encoding1: " 1>&2
sentencex=""
for word in $sentence; do
    # Try to find the word in one of the dict files
    file=$(echo '!!'$word |\
	link-parser 2>/dev/null |\
	grep "$word "|\
	grep "<" |\
	sed 's/.*<\(.*\)>.*/\1/')
    if [ "$file" != "" ]; then
	# See which occurence is our word in that file, that's our "encoding"
	n=$(cat $LGPATH/$file |\
		sed 's/ $//;s/ /\n/g;'|\
		grep -x -n $word|\
		sed 's/:.*$//')
	t=$(cat $LGPATH/$file | wc -w)
	file=${file#en/words/words.}
	# We encode also t but it's just to make ciphering easier
	# as t is just the size of the wordlist
	wordx="$file:$n/$t"
    else
	wordx=$word
    fi
    [ "$sentencex" != "" ] && sentencex="$sentencex "
    sentencex="$sentencex$wordx"
    $DEBUG && echo -n "$wordx " 1>&2
done
$DEBUG && echo 1>&2

# Prepare common words dict for phases 2&3
###########################

TMPDICT=/tmp/4.0.dict.short
# we eliminate the list "a<>an" and "such a<>such an" 
# as the grammar cannot make the diff
# At decoding we ll try to do smart guess
[ -e $TMPDICT ] ||\
    cat $LGPATH/en/4.0.dict |\
	tr '\n;' ' \n'|\
	sed 's/^ \+//;s/  / /g'|\
	grep '^[a-z].* [a-z].*:'|\
	sed 's/:.*$//'|\
	grep -v "^\(a an\|such_a such_an\)$" \
	> $TMPDICT

# Encoding, phase 2
###################

$DEBUG && echo -n "Encoding2: " 1>&2
# Restore words with spaces
dictspace=$(cat $TMPDICT |egrep -o "[a-z]+(_[a-z]+)+")
for d in $dictspace; do
    sentencex=$(echo $sentencex|sed "s/$(echo $d|tr '_' ' ')/$d/g")
done

$DEBUG && echo $sentencex 1>&2

# Encoding, phase 3
###################

$DEBUG && echo -n "Encoding3: " 1>&2
sentencey=""
for wordx in $sentencex; do
    if [[ "$wordx" =~ : ]]; then
	# This is an already encoded word
	wordy=$wordx
    else
	# Look in the tmpdict for a second chance
	# result is "line_nr:list of words"
	list=$(cat $TMPDICT|grep -n -w -m 1 $wordx|tr : ' ')
	# grep -w considers e.g. vis-a-vis as 3 words
	# so we want to check the match by ourselves
	list=$(echo " $list "|grep " $wordx ")
	if [ "$list" != "" ]; then
	    # We got the word in a list
	    eval $(echo $list|sed 's/^\([0-9]\+\) \(.*\)/l=\1;list="\2"/')
	    n=$(echo $list|\
		tr ' ' '\n'|\
		grep -n $wordx|\
		sed 's/:.*//')
	    t=$(echo $list|wc -w)
	    wordy="@$l:$n/$t"
	else
	    # Replace all "an" by "a" as the grammar cannot make the diff
	    # At decoding we ll try to do smart guess
	    [ "$wordx" == "an" ] && wordx="a"
	    # Clean word extension
	    wordy=${wordx/.*}
	fi
    fi
    [ "$sentencey" != "" ] && sentencey="$sentencey "
    sentencey="$sentencey$wordy"
    $DEBUG && echo -n "$wordy " 1>&2
done
$DEBUG && echo 1>&2

# Ciphering:
##################

$DEBUG && echo -n "Ciphering: " 1>&2
# loosy way to diffuse the secret
secret=$(echo "$secret * 1234567 % 32768"|bc)
oldnx=0

sentencez=""
for wordy in $sentencey; do
    if [[ "$wordy" =~ : ]]; then
	# This is an encoded word
	eval $(echo $wordy | sed 's/\(.*\):\(.*\)\/\(.*\)/file=\1;n=\2;t=\3/')
	# Very stupid ciphering: apply secret as a common shift to all the words
	nx=$(echo "($n - 1 + $secret + $oldnx) % $t + 1"|bc)
	[ $secret -ne 0 ] && oldnx=$nx
	wordz="$file:$nx"
	$KEEP_MAX && wordz="$wordz/$t"
    else
	wordz=$wordy
    fi
    [ "$sentencez" != "" ] && sentencez="$sentencez "
    sentencez="$sentencez$wordz"
    $DEBUG && echo -n "$wordz " 1>&2
done
$DEBUG && echo 1>&2

echo $sentencez

decode

Script can be easily downloaded by [{{#file: decode}} this link]

#!/bin/bash

# Copyright Philippe Teuwen <phil_a_teuwen_point_org>
# License: GPLv3

LGPATH=/usr/share/link-grammar
sentencez=$(cat)
DEBUG=false
if [ "$1" == "-v" ]; then
  DEBUG=true
  shift
fi
secret=$1
secret=${secret:-0}

# Prepare common words dict
###########################

TMPDICT=/tmp/4.0.dict.short
# we eliminate the list "a<>an" and "such a<>such an" 
# as there should have been a stricter rule
[ -e $TMPDICT ] ||\
    cat $LGPATH/en/4.0.dict |\
	tr '\n;' ' \n'|\
	sed 's/^ \+//;s/  / /g'|\
	grep '^[a-z].* [a-z].*:'|\
	sed 's/:.*$//'|\
	grep -v "^\(a an\|such_a such_an\)$" \
	> $TMPDICT

# Deciphering:
##############

# loosy way to diffuse the secret
$DEBUG && echo -n "key: $secret"
secret=$(echo "$secret * 1234567 % 32768"|bc)
$DEBUG && echo "=>$secret"
oldnx=0

sentencex=""
for wordz in $sentencez; do
    if [[ "$wordz" =~ : ]]; then
	# This is a ciphered word
	if [[ "$wordz" =~ / ]]; then
	    eval $(echo $wordz | sed 's/\(.*\):\(.*\)\/\(.*\)/file=\1;nx=\2;t=\3/')
	else
	    eval $(echo $wordz | sed 's/\(.*\):\(.*\)/file=\1;nx=\2/')
	    if [[ "$file" =~ ^@ ]]; then
		l=${file##@}
		t=$(cat $TMPDICT |\
		    awk "FNR==$l{print NF}")
	    else
	    	t=$(cat $LGPATH/en/words/words.$file | wc -w)
	    fi
	fi
	n=$(echo "($nx - 1 - $secret - $oldnx + ($t * ((($secret + $oldnx) / $t)+1))) % $t + 1"|bc)
	[ $secret -ne 0 ] && oldnx=$nx
	wordx="$file:$n"
    else
	wordx=$wordz
    fi
    [ "$sentencex" != "" ] && sentencex="$sentencex "
    sentencex="$sentencex$wordx"
done

$DEBUG && echo -n "Deciphering:" 1>&2
$DEBUG && echo $sentencex 1>&2

# Decoding:
###########

was_a=false
for wordx in $sentencex; do
    if [[ "$wordx" =~ : ]]; then
	# This is an encoded word
	eval $(echo $wordx | sed 's/\(.*\):\(.*\)/file=\1;n=\2/')
	if [[ "$file" =~ ^@ ]]; then
	    l=${file##@}
	    word=$(cat $TMPDICT |\
		awk "FNR==$l{print \$$n}")
	    word=${word//_/ }
	else
	    word=$(cat $LGPATH/en/words/words.$file |\
		sed 's/ $//;s/ /\n/g;'|\
		awk "FNR==$n")
	    # Schtroumpf mode
	    # Example: ./decode 0 n smurf
	    [ "$3" != "" ] && [[ "$word" =~ \.$2 ]] && word=$3
	fi
	# Clean word extension
	word=${word/.*}
	is_a=false
    else
	word=${wordx}
	[ "$word" == "a" ] && is_a=true || is_a=false
    fi
    # as the grammar cannot make the diff between "a" and "an"
    # we ll try to do smart guess
    if $was_a; then
	a="a"
	[[ "$word" =~ ^[aeio] ]] && a="an"
	word="$a $word"
    fi
    was_a=$is_a
    if ! $is_a; then
	[ "$sentence" != "" ] && sentence="$sentence "
	sentence="$sentence$word"
    fi
done

$DEBUG && echo -n "Decoding:  " 1>&2
$DEBUG && echo $sentence 1>&2
echo $sentence

force

Innocent attempt to use brute-force
Script can be easily downloaded by [{{#file: force}} this link]

#!/bin/bash

# Copyright Philippe Teuwen <phil_a_teuwen_point_org>
# License: GPLv3

input=$(cat)
seed=$RANDOM
for ((i=0;i<100;i++)); do
    echo -n "$(($i+$seed)):"
    echo $input|./decode $(($i+$seed)) || exit
done

Related work

Which means *I* relate those topics to mine, not that they rely on me :-)

Lexical natural language steganography systems with human interaction (pdf)

Word substitution over an IRC channel.
It's actually pure steganography: using e.g. choice between 4 synonyms to encode 2 bits

Format-preserving encryption

In cryptography, format-preserving encryption (FPE) refers to encrypting in such a way that the output (the ciphertext) is in the same format as the input (the plaintext). The meaning of "format" varies. Typically only finite domains are discussed, for example:
[...] To encrypt an English word so that the ciphertext is another English word. (added in April 2010)

Once the sentence is tokenised, those techniques could help for a better word substitution

Honey encryption and decoys (pdf)

We introduce honey encryption, a form of password-based encryption in which decrypting with incorrect passwords yields fake, but realistic-looking, plaintexts. (added in January 2014)

Those folks who treat foreign languages like ciphers

Check Kevin Knight's work, and this fascinating article though which I discovered his work:
Knight was part of an extremely small group of machine-translation researchers who treated foreign languages like ciphers—as if Russian, for example, were just a series of cryptological symbols representing English words. In code-breaking, he explained, the central job is to figure out the set of rules for turning the cipher’s text into plain words: which letters should be swapped, when to turn a phrase on its head, when to ignore a word altogether. Establishing that type of rule set, or “key,” is the main goal of machine translators too. Except that the key for translating Russian into English is far more complex. Words have multiple meanings, depending on context. Grammar varies widely from language to language. And there are billions of possible word combinations.

Those folks who try to compress natural language

See Hutter Prize
Best result if I read correctly is reducing initial text size of 100Mb to 15.95Mb (including the decompressor!).
For comparison, a 7zip self-extractible archive is 21.2Mb large.

Lenticrypt

A simple cryptosystem that provides provable plausibly deniable encryption. Lenticrypt can generate a single ciphertext file such that different plaintexts are generated depending on which key is used for decryption.

Feel free to share your opinion

Philippe Teuwen <phil teuwen org>
@doegox