.. CantoGraph documentation master file, created by
   sphinx-quickstart on Thu Jul 12 20:36:05 2012.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

CantoGraph
==========

Author: Richard Liming

Overview
--------

A method for hiding a message in a Cantonese vocabulary list, with plausible deniability for the recipient or viewer.

Here, a basic method is described along with a few issues and potential enhancements.

The basic method is, I believe, a `steganographic`_ technique with the following qualities: 

    -   The information is carried not on the visible plane but on the implied aural plane: the tones used in spoken Cantonese.

    -   Since Mandarin and Cantonese are written with largely the same characters, a viewer doesn't necessarily
        realize that a Cantonese reading is needed to interpret the message.  The likely assumption is that the list is Mandarin.

    -   With practice, a person who knows Cantonese can write and decode the message in their head without any software
        or other type of decoder.  Since nothing is required to decode it, in some circumstances this can also
        provide plausible deniability for the reader.


Demo Pages
----------

A `demo`_ application is available. The page has a button that pops up the encoding/decoding table; if that doesn't work in your browser, use the direct link to the `encode decode table`_.


Encoding Process
----------------

Assuming the six (6) tones of Cantonese as represented in `Jyutping`_ romanization, two digits (in this case, the tones of two Han characters) can represent 6 squared, or 36, combinations.

This maps fairly nicely to the 26 characters of the English alphabet plus the 10 digits from 0-9. Call this the basic alphanumeric set ('a' through 'z' and 0 through 9), or 'basic set'.  With the basic method, no distinction can be made between upper and lower case letters.

The process for mapping starts with giving each element in the set (a-z,0-9) a decimal number representation from 0 to 35:

+-------+--------+
|element| decimal|
+-------+--------+
|a      | 0      |
+-------+--------+
|b      | 1      |
+-------+--------+
|c      | 2      |
+-------+--------+
|...    |        |
+-------+--------+
|z      | 25     |
+-------+--------+
|0      | 26     |
+-------+--------+
|1      | 27     |
+-------+--------+
|...    |        |
+-------+--------+
|9      | 35     |
+-------+--------+
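
As a minimal sketch in Python (the names ``BASIC_SET`` and ``element_to_decimal`` are illustrative and are not taken from the demo source), the numbering above is just each element's position in the string 'a' through 'z' followed by '0' through '9'. Upper case is folded to lower case, since the basic method cannot distinguish the two:

.. code-block:: python

    import string

    # The 36-element basic set: 'a'..'z' followed by '0'..'9'.
    BASIC_SET = string.ascii_lowercase + string.digits

    def element_to_decimal(ch):
        """Return the 0-35 position of ch in the basic set (case-insensitive)."""
        if len(ch) != 1:
            raise ValueError("expected a single character")
        # str.index raises ValueError for anything outside the basic set.
        return BASIC_SET.index(ch.lower())

Anything outside the basic set, such as a space or a period, raises ``ValueError`` in this sketch; a real encoder might prefer to skip such characters.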

Then this decimal number is converted to a two-digit base 6 representation (the usual name is 'senary'; it is called 'hexary' here), where each digit will map to one Han character.  Just as base 10 (decimal) has a maximum digit of 9 in each position, and base 2 (binary) has a maximum digit of 1, base 6 (hexary) has a maximum digit of 5 in any position:

+-------+--------+----------------+
|element| decimal|two digit hexary|
+-------+--------+----------------+
|a      | 0      | 00             |
+-------+--------+----------------+
|b      | 1      | 01             |
+-------+--------+----------------+
|c      | 2      | 02             |
+-------+--------+----------------+
|d      | 3      | 03             |
+-------+--------+----------------+
|e      | 4      | 04             |
+-------+--------+----------------+
|f      | 5      | 05             |
+-------+--------+----------------+
|g      | 6      | 10             |
+-------+--------+----------------+
|h      | 7      | 11             |
+-------+--------+----------------+
|...    | ..     | ..             |
+-------+--------+----------------+
|m      | 12     | 20             |
+-------+--------+----------------+
|n      | 13     | 21             |
+-------+--------+----------------+
|...    | ..     | ..             |
+-------+--------+----------------+
|9      | 35     | 55             |
+-------+--------+----------------+
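
The two-digit base 6 conversion in the table amounts to one integer division by 6. A small sketch (the function name is my own, chosen for illustration):

.. code-block:: python

    def to_two_digit_base6(n):
        """Return the (high, low) base 6 digits of n, for 0 <= n <= 35."""
        if not 0 <= n <= 35:
            raise ValueError("two base 6 digits only cover 0 through 35")
        # divmod gives the quotient (high digit) and remainder (low digit).
        return divmod(n, 6)

For example, ``to_two_digit_base6(13)`` gives ``(2, 1)``, matching the row for 'n'.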

However, the Jyutping romanization uses the numbers 1 through 6 to represent the tones, not 0 through 5. So, we make a simple mapping conversion by adding one to each position in the hexary representation:

+-------+--------+----------------+-----------+
|element| decimal|two digit hexary|hexary + 1 |
+-------+--------+----------------+-----------+
|a      | 0      | 00             | 11        |
+-------+--------+----------------+-----------+
|b      | 1      | 01             | 12        |
+-------+--------+----------------+-----------+
|c      | 2      | 02             | 13        |
+-------+--------+----------------+-----------+ 
|d      | 3      | 03             | 14        |
+-------+--------+----------------+-----------+
|e      | 4      | 04             | 15        |
+-------+--------+----------------+-----------+
|f      | 5      | 05             | 16        |
+-------+--------+----------------+-----------+
|g      | 6      | 10             | 21        |
+-------+--------+----------------+-----------+
|h      | 7      | 11             | 22        |
+-------+--------+----------------+-----------+
|...    | ..     | ..             | ..        |
+-------+--------+----------------+-----------+
|m      | 12     | 20             | 31        |
+-------+--------+----------------+-----------+
|n      | 13     | 21             | 32        |
+-------+--------+----------------+-----------+
|...    | ..     | ..             | ..        |
+-------+--------+----------------+-----------+
|9      | 35     | 55             | 66        |
+-------+--------+----------------+-----------+
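
Putting the three steps together (position in the basic set, base 6 digits, add one to each digit) can be sketched as below. The names are hypothetical and this is not the demo's actual code, only an illustration of the table:

.. code-block:: python

    import string

    # The 36-element basic set: 'a'..'z' followed by '0'..'9'.
    BASIC_SET = string.ascii_lowercase + string.digits

    def element_to_tones(ch):
        """Map one basic-set element to a pair of Jyutping tone numbers (1-6)."""
        high, low = divmod(BASIC_SET.index(ch.lower()), 6)
        # Jyutping tones run 1..6, so shift each base 6 digit up by one.
        return high + 1, low + 1

Here ``element_to_tones("a")`` gives ``(1, 1)`` and ``element_to_tones("9")`` gives ``(6, 6)``, the first and last rows of the table.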


With this basic mapping we can represent an English or other romanized message with Han characters, using their tones.  Referencing the chart above, to represent an 'a' we need two characters, both 1st tone (1 and 1), or 11. We could use::

    貓  cat maau1       雞  chicken gai1

or::

    鷹  eagle   jing1       獅  lion    si1

and so on.

The word 'the' requires six characters, two per English letter. From the table we see that 't' is 42, 'h' is 22 and 'e' is 15, so one possible mapping is::

 't' .. 42 .. 牛  cow     ngau4       狗  dog     gau2  
 'h' .. 22 .. 虎  tiger   fu2         煮  cook    zyu2 
 'e' .. 15 .. 貓  cat     maau1       鳥  bird    niu5

To get the original message back, the tones are simply mapped back, two at a time, to the letters and digits of the basic set.
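
Decoding can be sketched the same way. This is an illustration under my own assumptions, not the demo's code: the input is a list of Jyutping syllables whose last character is the tone number (1 through 6), and the function name is made up:

.. code-block:: python

    import string

    # The 36-element basic set: 'a'..'z' followed by '0'..'9'.
    BASIC_SET = string.ascii_lowercase + string.digits

    def decode_jyutping(syllables):
        """Recover the message from Jyutping syllables such as 'ngau4'."""
        # The tone is the trailing digit of each syllable.
        tones = [int(s[-1]) for s in syllables]
        if len(tones) % 2:
            raise ValueError("need an even number of syllables")
        chars = []
        for high, low in zip(tones[0::2], tones[1::2]):
            # Undo the +1 shift, then read the pair as a two-digit base 6 number.
            chars.append(BASIC_SET[(high - 1) * 6 + (low - 1)])
        return "".join(chars)

With the six syllables from the 'the' example above, ``decode_jyutping(["ngau4", "gau2", "fu2", "zyu2", "maau1", "niu5"])`` returns ``'the'``.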

Current Encoding Software
-------------------------

The current encoding software in the demo uses a preset list of characters, but it can easily be set to use the full `Unihan`_ data set.  To generate an encoded vocabulary list, the software takes the first character it can find that meets the tone requirement and does not reuse that character unless it runs out of new ones, so that the result looks like an authentic vocabulary list.  The initial lists were made by hand from common subjects (animals, food, etc.) so that the generated lists look more like legitimate vocabulary lists with common themes.  However, they have only 75 characters, so a long message may contain repeats.  Longer lists, or the full Unihan data set, can be used in a non-demo situation.
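
The selection rule described above can be sketched roughly as follows. This is my reading of the behaviour, not the demo's source: the pool format (a list of ``(character, jyutping)`` tuples), the function name, and the choice to start reusing a tone's characters from the top once they are exhausted are all assumptions:

.. code-block:: python

    def pick_characters(tones, pool):
        """Pick one (character, jyutping) entry per tone, avoiding repeats.

        pool is a list of (character, jyutping) tuples; the tone is the
        last character of the jyutping string.
        """
        used = set()
        picked = []
        for tone in tones:
            candidates = [e for e in pool if int(e[1][-1]) == tone]
            if not candidates:
                raise ValueError("no character with tone %d in pool" % tone)
            fresh = [e for e in candidates if e not in used]
            if not fresh:
                # All candidates for this tone are used up: allow repeats.
                used.difference_update(candidates)
                fresh = candidates
            choice = fresh[0]  # first character that meets the tone requirement
            used.add(choice)
            picked.append(choice)
        return picked

Ordering the pool by theme (animals first, then food, and so on) would keep the generated list looking like an ordinary vocabulary list.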

Issues and Enhancements
-----------------------

The basic method only encodes the letters 'a' through 'z' and the digits 0 through 9, so some characters are lost, including spaces, apostrophes, periods and any other character not in the list. In practice the message is understandable without these characters, even without spaces, but their absence adds a little difficulty.

Any enhancement that requires software to decode potentially removes the recipient's deniability.

I shared this with a friend who speaks Cantonese and described the dilemma: six tones only allow for 36 characters, and I couldn't find a good source for the nine tones of Cantonese.  Of course it is easy to come up with variations of the code and methods that add more characters for encoding, but I still wanted a version that could be decoded by a human with no software on the decoding end.  After a bit he explained the p, t, k finals, or as Wikipedia calls them, 'checked tones' (http://en.wikipedia.org/wiki/Checked_tone).  So, I think the easiest way to get a few more characters without much added complexity is to treat a character whose romanization ends in p, t, or k as a seventh digit value, giving base 7.  Thus you get 49 possible combinations, or 13 more than with base 6, which could be used for spaces, periods, and whatever other punctuation is critical to your specific message.
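
A rough sketch of the base 7 idea follows. The names are hypothetical, and I have not settled which symbols the 13 extra combinations should stand for, so the sketch only shows how a syllable becomes a digit and which pairs are new:

.. code-block:: python

    def syllable_digit(jyutping):
        """Return a base 7 digit (0-6) for one Jyutping syllable.

        Syllables ending in p, t or k before the tone number count as the
        extra digit 6; all others use tone - 1, as in the basic method.
        """
        final, tone = jyutping[-2], int(jyutping[-1])
        if final in "ptk":
            return 6
        return tone - 1

    # Pairs containing the extra digit are the new combinations (49 - 36 = 13).
    EXTRA_PAIRS = [(h, l) for h in range(7) for l in range(7) if 6 in (h, l)]

Pairs that do not contain the extra digit can keep their meaning from the basic table, so a reader who already knows the 36 basic combinations only has to learn the 13 new ones.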

Many more sophisticated options exist, but they tend to necessitate software. One example is changing the index into the alphabet, so that, say, 'b' = 0 instead of 'a'.  The offset could come from some external or dynamic source, such as the sunrise time.  But again, a human can get used to the 36 or 49 characters and essentially read and write in code using the basic method without any software.
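
For illustration only, a shifted index might look like the sketch below. The ``shift`` parameter, the wrap-around behaviour for elements before the new starting point, and the function name are all my assumptions; how the shift would be agreed on (sunrise time or otherwise) is left out:

.. code-block:: python

    import string

    # The 36-element basic set: 'a'..'z' followed by '0'..'9'.
    BASIC_SET = string.ascii_lowercase + string.digits

    def element_to_tones_shifted(ch, shift):
        """Like the basic mapping, but with BASIC_SET[shift] as element 0."""
        # Elements before the new starting point wrap around to the end.
        n = (BASIC_SET.index(ch.lower()) - shift) % len(BASIC_SET)
        high, low = divmod(n, 6)
        return high + 1, low + 1

With ``shift=1``, 'b' takes the tone pair ``(1, 1)`` and 'a' wraps around to ``(6, 6)``.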

Related
-------

The mechanism is similar to a `Bacon Cipher`_, but with the advantage of not being visible to the naked eye, and using base 6 math instead of what is essentially binary. Most of the pages describing the Bacon Cipher are, for me anyway, unclear and difficult to understand, but a few pictures of actual messages in the script of that time period make it easier to grasp, for example: http://farm3.static.flickr.com/2223/2198740846_62a2e3af86_b.jpg

.. _`demo`: http://gravityfails.org/cgi-bin/cantograph.py
.. _`encode decode table`: http://gravityfails.org/cantograph/table.html
.. _`Jyutping`: http://en.wikipedia.org/wiki/Jyutping
.. _`steganographic`: http://en.wikipedia.org/wiki/Steganography
.. _`Unihan`: http://www.unicode.org/reports/tr38/
.. _`Bacon Cipher`: http://en.wikipedia.org/wiki/Bacon%27s_cipher


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`