Hut 8

Steganography is the practice of hiding data within other data, such that a third party doesn’t suspect the presence of the hidden data (the “payload”) inside of the readily apparent data (the “carrier” or “canvas”). The type of the carrier is known as a “channel,” which in this case is a Unicode text document. Some parts can be adapted to plain ASCII. All of these methods are easily detectable programmatically, as they involve unusual character sequences, such as Cyrillic characters in an otherwise Latin word.

Steganography is useful in situations where cryptography is prohibited. Cryptography is very easy to detect, as almost all decent encryption algorithms produce ciphertext that looks like a uniformly random string of bytes. The only other algorithms that do that are for compression. If a block of data is uniformly random and does not appear to be compressed using a standard algorithm (of which there are many: gzip, bz2, etc.), it is probably encrypted. Also, most protocols involving cryptography (e.g., TLS/SSL) use standard headers for negotiating keys and cipher suites, which makes them trivial to detect even without statistical analysis.
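To make the "uniformly random" test concrete, here is a rough sketch that estimates the Shannon entropy of a block of bytes. The function name byte_entropy is my own, and the idea that "close to 8 bits per byte means encrypted or compressed" is a rule of thumb rather than a precise detector.

```ruby
require 'securerandom'

# Estimate the Shannon entropy of a byte string, in bits per byte.
# Values near 8.0 suggest encrypted or compressed data; plain English
# text usually scores far lower. (byte_entropy is a hypothetical name.)
def byte_entropy(data)
  bytes = data.bytes
  return 0.0 if bytes.empty?

  counts = Hash.new(0)
  bytes.each { |b| counts[b] += 1 }
  total = bytes.length.to_f

  counts.values.sum do |count|
    prob = count / total
    -prob * Math.log2(prob)
  end
end

text = "The quick brown fox jumps over the lazy dog. " * 100
random = SecureRandom.random_bytes(4096)
puts byte_entropy(text)   # well below 8 bits per byte
puts byte_entropy(random) # close to 8 bits per byte
```

Random bytes stand in for ciphertext here, since good ciphertext should be statistically indistinguishable from them.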

Detection of all of these methods is very simple, but not as easy as detecting cryptography.
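As a sketch of how simple detection can be, the following flags words that mix Latin and Cyrillic letters and lists any invisible format characters (Unicode general category Cf, which includes U+200B and U+FEFF). The method names are made up for illustration, and a real detector would need to handle many more scripts and legitimate uses.

```ruby
# Return the words that contain both Latin and Cyrillic letters.
# (mixed_script_words is a hypothetical helper name.)
def mixed_script_words(text)
  text.scan(/[[:alpha:]]+/).select do |word|
    word.match?(/\p{Latin}/) && word.match?(/\p{Cyrillic}/)
  end
end

# Return the code points of all invisible "format" characters (category Cf).
# (format_characters is a hypothetical helper name.)
def format_characters(text)
  text.each_char
      .select { |char| char.match?(/\p{Cf}/) }
      .map { |char| format('U+%04X', char.ord) }
end

suspicious = "p\u0430yload is hidden\u200B\uFEFF here"
p mixed_script_words(suspicious) # the word containing Cyrillic U+0430
p format_characters(suspicious)  # the two zero-width code points
```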

The following are various strategies for steganography of arbitrary data inside of Unicode text documents.

Here are methods to encode the payload in:

Spaces

Spaces: the final frontier. Space characters are not limited to ASCII 0x20.

Unicode 6.2 lists spaces in table 6-2 (page 194):

Unicode Code Point	Description
U+0020	Space
U+00A0	No-break Space
U+1680	Ogham Space Mark (most fonts display as a dash)
U+180E	Mongolian Vowel Separator
U+2000	En Quad
U+2001	Em Quad
U+2002	En Space
U+2003	Em Space
U+2004	Three-Per-Em Space
U+2005	Four-Per-Em Space
U+2006	Six-Per-Em Space
U+2007	Figure Space
U+2008	Punctuation Space
U+2009	Thin Space
U+200A	Hair Space
U+202F	Narrow No-Break Space
U+205F	Medium Mathematical Space
U+3000	Ideographic Space

Unicode also classifies these code points as spaces, even though they have no width (i.e., spaces that take up no space):

Unicode Code Point	Description
U+200B	Zero-Width Space
U+FEFF	Zero-Width No-Break Space (BOM)

In ASCII, newlines were specified using either a CR/LF combination (Windows), LF (*NIX) or CR (old Mac). To “simplify” this, Unicode has:

Unicode Code Point	Description
U+2028	Line Separator
U+2029	Paragraph Separator

Typically, a space is typeset with 1/4 em width, so an ordinary U+0020 should display identically to U+2005 (Four-Per-Em Space).
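That similarity suggests a way to hide bits in the spaces a canvas already has. This is only a sketch under my own assumptions: a 1 bit turns a space into U+2005, a 0 bit leaves it as U+0020, bits are stored least-significant first, and the reader knows the payload length in advance. The method names are invented for the example.

```ruby
FOUR_PER_EM = "\u2005"

# Hide payload bits in the canvas's existing spaces:
# 1 => U+2005 (Four-Per-Em Space), 0 => U+0020 (left unchanged).
def hide_in_spaces(canvas, payload)
  bits = payload.bytes.flat_map { |byte| (0..7).map { |i| byte[i] } }
  capacity = canvas.count(' ')
  if bits.length > capacity
    raise ArgumentError, "canvas has #{capacity} spaces, need #{bits.length}"
  end

  # Spaces beyond the end of the payload are left untouched.
  canvas.gsub(' ') { bits.shift == 1 ? FOUR_PER_EM : ' ' }
end

# Recover byte_count bytes from the spaces in the text.
def reveal_from_spaces(text, byte_count)
  bits = text.each_char
             .select { |char| char == ' ' || char == FOUR_PER_EM }
             .map { |char| char == FOUR_PER_EM ? 1 : 0 }

  bits.first(byte_count * 8).each_slice(8).map do |slice|
    slice.each_with_index.map { |bit, i| bit << i }.sum
  end.pack('C*')
end

canvas = "the quick brown fox jumps over the lazy dog " * 2
stego = hide_in_spaces(canvas, "Hi")
puts reveal_from_spaces(stego, 2) # prints "Hi"
```

The capacity is one bit per space, so this needs a much longer canvas than the zero-width approach.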

This simplified implementation only uses U+200B and U+FEFF to encode arbitrary data. You could append data into a canvas using shell redirection, cat and this utility.

#!/usr/bin/env ruby

class UnicodeSpaceStegoEngine
  def initialize(stream)
    @stream = stream
  end

  # Decode U+200B => 0
  #        U+FEFF => 1
  def decode
    byte = 0
    bit_ix = 0
    @stream.each_char do |char|
      case char
      when [0x200B].pack('U') # Unset bit
        byte &= (~(1 << bit_ix))
        bit_ix += 1
      when [0xFEFF].pack('U') # Set bit
        byte |= (1 << bit_ix)
        bit_ix += 1
      end
      if bit_ix == 8
        $stdout.write([byte].pack("C")) # emit the completed byte
        byte = 0
        bit_ix = 0
      end
    end
  end

  # Encode 0 => U+200B
  #        1 => U+FEFF
  def encode
    @stream.each_byte do |byte|
      (0..7).each do |bit_ix|
        set_bit = (byte & (1 << bit_ix)) != 0
        codepoint = set_bit ? 0xFEFF : 0x200B
        print [codepoint].pack 'U'
      end
    end
  end
end

# CLI Follows
def usage
  puts "usage: #{$0} <encode|decode> [data_file, ...]"
  puts "or     #{$0} <encode|decode> < data"
end

operation = ARGV.shift
engine = UnicodeSpaceStegoEngine.new(ARGF)

case operation
when "encode"
  engine.encode
when "decode"
  engine.decode
else
  usage
end
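For experimenting without the command line, here is a string-returning variant of the same idea. It is a sketch, not a drop-in replacement for the utility above: the method names are mine, and it keeps the same mapping (U+200B for 0, U+FEFF for 1, least-significant bit first).

```ruby
ZERO = "\u200B" # Zero-Width Space          => bit 0
ONE  = "\uFEFF" # Zero-Width No-Break Space => bit 1

# Turn each payload byte into eight invisible characters.
def zero_width_encode(payload)
  payload.bytes.flat_map do |byte|
    (0..7).map { |i| byte[i] == 1 ? ONE : ZERO }
  end.join
end

# Collect the invisible characters from the text and rebuild the bytes.
# Any other characters are ignored, as is an incomplete trailing byte.
def zero_width_decode(text)
  bits = text.each_char
             .select { |char| char == ZERO || char == ONE }
             .map { |char| char == ONE ? 1 : 0 }

  bits.each_slice(8).select { |slice| slice.length == 8 }.map do |slice|
    slice.each_with_index.map { |bit, i| bit << i }.sum
  end.pack('C*')
end

canvas = "Nothing to see here."
stego = canvas + zero_width_encode("secret") # 48 invisible characters appended
puts zero_width_decode(stego)                # prints "secret"
```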

Cyrillic Characters

This strategy is applicable more generally to alphabets containing Latin-alphabet-looking characters. Unicode has quite a few characters (over 110,000 as of version 6.2), some of which are indistinguishable from others in most fonts. I found that Cyrillic characters are generally well-supported in common fonts.

Latin	Cyrillic	Cyrillic Code Point
A	А	U+0410
a	а	U+0430
B	В	U+0412
C	С	U+0421
c	с	U+0441
E	Е	U+0415
e	е	U+0435
H	Н	U+041D
I	І	U+0406
i	і	U+0456
K	К	U+041A
M	М	U+041C
m	м	U+043C
O	О	U+041E
o	о	U+043E
P	Р	U+0420
p	р	U+0440
S	Ѕ	U+0405
T	Т	U+0422
y	у	U+0443
X	Х	U+0425
x	х	U+0445

So there is the option of replacing a character in the Latin column with a character in the Cyrillic column. The byte-offset and bit-offset positions of the payload are tracked in the algorithm. Whenever a replaceable character is read from the canvas, a test is performed to see if the replacement should occur (in C):

int perform_replacement = payload[byte_offset] & (1 << bit_offset);

Whenever a potential replacement is performed (i.e., a viable replacement character from the canvas is either replaced or not), the bit_offset is incremented. When bit_offset == 8, bit_offset is zeroed and byte_offset is incremented. Note that the range of bit_offset is 0-7.
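The bookkeeping described above can be sketched in Ruby as follows. This is my own illustration rather than a finished tool: it uses only a subset of the table, the method names are invented, it assumes the canvas contains no Cyrillic look-alikes to begin with, and the decoder needs to be told the payload length.

```ruby
# Subset of the table above: Latin letter => Cyrillic look-alike.
HOMOGLYPHS = {
  'A' => "\u0410", 'a' => "\u0430", 'C' => "\u0421", 'c' => "\u0441",
  'E' => "\u0415", 'e' => "\u0435", 'O' => "\u041E", 'o' => "\u043E",
  'P' => "\u0420", 'p' => "\u0440", 'X' => "\u0425", 'x' => "\u0445"
}.freeze
LATIN_FOR = HOMOGLYPHS.invert.freeze

def hide_in_homoglyphs(canvas, payload)
  bytes = payload.bytes
  byte_offset = 0
  bit_offset = 0

  result = canvas.each_char.map do |char|
    # Pass through anything that is not replaceable, or once the payload is done.
    next char unless HOMOGLYPHS.key?(char) && byte_offset < bytes.length

    perform_replacement = bytes[byte_offset][bit_offset] == 1
    bit_offset += 1
    if bit_offset == 8
      bit_offset = 0
      byte_offset += 1
    end
    perform_replacement ? HOMOGLYPHS[char] : char
  end.join

  raise ArgumentError, 'canvas too short for payload' if byte_offset < bytes.length
  result
end

def reveal_from_homoglyphs(text, byte_count)
  # A Cyrillic look-alike is a 1 bit; a replaceable Latin letter is a 0 bit.
  bits = text.each_char.map do |char|
    if LATIN_FOR.key?(char) then 1
    elsif HOMOGLYPHS.key?(char) then 0
    end
  end.compact

  bits.first(byte_count * 8).each_slice(8).map do |slice|
    slice.each_with_index.map { |bit, i| bit << i }.sum
  end.pack('C*')
end

canvas = "peace " * 10
stego = hide_in_homoglyphs(canvas, "Hi")
puts reveal_from_homoglyphs(stego, 2) # prints "Hi"
```

Each replaceable letter carries one bit, so the canvas needs at least eight such letters per payload byte.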

Text direction characters

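Unicode has support for indicating that text should suddenly start flowing in a given direction (LRM / RLM), as well as for embedding or overriding the direction of a run of text.

Embeddings and overrides can actually be nested up to 61 levels deep, just in case you have a deeply nested quote from someone, e.g., an Arabic document quoting English that contains Arabic that contains English (etc., etc., …). The relevant code points are:

Unicode Code Point	Description
U+200E	Left-to-Right Mark
U+200F	Right-to-Left Mark
U+202A	Left-to-Right Embedding
U+202B	Right-to-Left Embedding
U+202C	Pop Directional Formatting
U+202D	Left-to-Right Override
U+202E	Right-to-Left Override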

Annotation characters

Unicode Code Point	Description
U+FFF9	Interlinear Annotation Anchor
U+FFFA	Interlinear Annotation Separator
U+FFFB	Interlinear Annotation Terminator

These symbols are for general annotations on documents. The program rendering the document will probably show the annotations, but it should not display the annotation characters themselves.

Language tag characters

RFC 2482 provides a method for embedding the language of text as metadata inside of a Unicode text document. Nobody uses this anymore, and it was deprecated in Unicode 5.1; however, deprecated doesn’t mean removed. The Unicode spec says that these symbols should not be displayed, so I suppose you could hide data in fake language tags.
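One way this might look: the tag characters U+E0020 through U+E007E mirror printable ASCII, so a printable ASCII payload can be shifted up by 0xE0000. The sketch below does only that. The method names are invented, it skips the U+E0001 Language Tag introducer that a well-formed tag would start with, and whether a given renderer really hides these characters is something to check rather than assume.

```ruby
TAG_BASE = 0xE0000

# Map each printable ASCII character to its invisible tag-character twin.
def to_tag_characters(ascii)
  ascii.each_char.map do |char|
    code = char.ord
    unless (0x20..0x7E).cover?(code)
      raise ArgumentError, "not printable ASCII: #{char.inspect}"
    end
    [TAG_BASE + code].pack('U')
  end.join
end

# Pick out the tag characters from the text and map them back to ASCII.
def from_tag_characters(text)
  text.each_codepoint
      .select { |cp| (0xE0020..0xE007E).cover?(cp) }
      .map { |cp| (cp - TAG_BASE).chr }
      .join
end

stego = "Hello" + to_tag_characters("meet at noon")
puts from_tag_characters(stego) # prints "meet at noon"
```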

The Unicode Confusables

The canonical reference on Unicode symbols that look like other symbols is the Recommended confusable mapping for IDN.

It would be very easy to write a script that uses all of these; however, most are either not found in any common font or don’t really look that similar.

Text Steganography - Sun, Jan 26, 2014