Copyright 2010-2023, Richard Liming

Hiding English in Cantonese


I suppose this is classified as steganography. Such techniques have been around for a long time. One well-known example is the Bacon Cipher.

The Bacon Cipher is interesting, but a keen observer will notice the different font. While I was working on censorship circumvention efforts and studying Mandarin, which is a tonal language, it occurred to me that tones could be used for signalling. However, Mandarin has only 4 tones, or 5 counting the neutral tone. Encoding each English letter with tones, even in lowercase or uppercase only, requires 26 possible values, and including the digits 0-9 requires 36. With 4 tones, two characters give only 16 combinations, so each letter would need 3 Chinese characters (64 combinations), which seemed cumbersome. The same holds for 5 tones, where two characters give only 25 combinations, and the neutral tone brings some complications of its own.
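The arithmetic behind these counts can be checked with a short sketch. The function name chars_needed is only illustrative; it finds the smallest number of Han characters whose tone combinations cover a given number of symbols.

```python
def chars_needed(num_tones: int, num_symbols: int) -> int:
    """Return the smallest n with num_tones ** n >= num_symbols."""
    n = 1
    while num_tones ** n < num_symbols:
        n += 1
    return n


# Characters needed per symbol for 26 letters, and for 36 letters plus
# digits, with the 4 or 5 tones of Mandarin.
for tones in (4, 5):
    print(tones, chars_needed(tones, 26), chars_needed(tones, 36))
```

Both Mandarin cases come out at 3 characters per symbol, for the letters alone as well as for letters plus digits.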

From studying Chinese, I knew that Cantonese has more tones: at least 6 and possibly 9. With base 6 math (properly ‘senary’, though I call it ‘hexary’ here), two characters yield 36 possible values, enough to encode the alphabet from A to Z without case distinction as well as the 10 digits from 0-9. After some research into various romanizations, I found the popular Jyutping romanization, which includes the 6 tone numbers. There were sufficient online resources using Jyutping, such as dictionaries, vocabulary lists, and the large Unihan database, to make it viable. At that point I couldn’t readily find a 9-tone romanization, but 6 was enough to start.

Initially, I was hoping to write meaningful Chinese sentences that carried the encoded message, but I realized that, if not impossible, it was likely beyond my capabilities and definitely beyond my free time, so I settled on the less ambitious goal of using vocabulary lists. The goal was still a fairly dense encoding that could ideally be done ‘in your head’ without software or other assistance. The 6 tones of Cantonese, with two characters per English letter providing 36 values, still seemed to fit the bill.

What follows is a basic method for encoding a message in a limited character set (A-Z, 0-9) using the tones of Cantonese for signalling, along with a few issues and potential enhancements.

The basic qualities are:

  • The information is not carried on the visible plane, but on the implied aural plane of the tones used in spoken Cantonese.
  • Since the characters for Mandarin and Cantonese are the same, a reader won’t necessarily realize that a Cantonese reading is needed to interpret the message. They will probably assume it is Mandarin, which may provide a bit more obscurity.
  • With practice, a person who knows Cantonese can write and decode the message in their head without any software or other type of decoder. Since nothing is required to decode it, in some circumstances this can also provide plausible deniability for the reader.
  • A secret message can be conveyed orally. For instance, in a fake teaching session, say between prisoners in different cells within hearing distance of each other, they could read a vocabulary list aloud as if learning English or Chinese.

Demo Pages

A demo application is available. The page has a button to pop up the encoding/decoding table; if that doesn’t work in your browser, use the direct link to the encode/decode table. The demo generates a vocabulary list titled “Lesson 8”. Eight is a lucky number in Chinese, so we stick with that for the demo; maybe we’ll get lucky and our secret message will get through undetected!

Encoding Process

Assuming initially that there are six (6) tones in Cantonese, as represented in the Jyutping romanization, two digits (here, the tones of two Han characters) can represent 6 squared, or 36, combinations. Technically there may be 9 tones, but more on that later.

This maps fairly nicely to the 26 characters of the English alphabet plus the 10 digits from 0-9. Call this the basic alphanumeric set (‘a’ through ‘z’ and 0 through 9), or ‘basic set’. With the basic method, no distinction can be made between upper and lower case letters.

The process for mapping starts with giving each element in the set (a-z,0-9) a decimal number representation from 0 to 35:

element  decimal
a        0
b        1
c        2
...      ...
z        25
0        26
1        27
...      ...
9        35

Then this decimal number is converted to a two-digit base 6 (hexary) representation; the two digits will map to two Han characters. Just as each position in base 10 (decimal) holds at most 9, and each position in base 2 (binary) holds at most 1, each position in base 6 (hexary) holds at most 5:

element  decimal  two digit hexary
a        0        00
b        1        01
c        2        02
d        3        03
e        4        04
f        5        05
g        6        10
h        7        11
...      ...      ...
m        12       20
n        13       21
...      ...      ...
9        35       55
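As a minimal sketch of this conversion, Python’s built-in divmod splits the decimal value into the two base 6 digits. The function name is illustrative and not taken from the demo software.

```python
def to_two_digit_base6(value: int) -> str:
    """Convert a decimal value from 0 to 35 to two base 6 digits."""
    if not 0 <= value <= 35:
        raise ValueError("value must be between 0 and 35")
    high, low = divmod(value, 6)  # high = sixes place, low = ones place
    return f"{high}{low}"


print(to_two_digit_base6(13))  # 'n' is decimal 13; prints 21
```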

However, the Jyutping romanization uses the numbers 1 through 6 to represent the tones, not 0 through 5. So, we make a simple mapping conversion by adding one to each position in the hexary representation:

element  decimal  two digit hexary  hexary + 1
a        0        00                11
b        1        01                12
c        2        02                13
d        3        03                14
e        4        04                15
f        5        05                16
g        6        10                21
h        7        11                22
...      ...      ...               ...
m        12       20                31
n        13       21                32
...      ...      ...               ...
9        35       55                66
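The rows left out of the table can be generated rather than written out by hand. The sketch below builds the complete 36-entry mapping; the names BASIC_SET, tone_pair and TONE_TABLE are chosen here for illustration.

```python
import string

# 'a'..'z' followed by '0'..'9', in the order used by the tables above.
BASIC_SET = string.ascii_lowercase + string.digits


def tone_pair(element: str) -> str:
    """Return the two Jyutping tone numbers (1-6) for one element."""
    high, low = divmod(BASIC_SET.index(element.lower()), 6)
    return f"{high + 1}{low + 1}"


TONE_TABLE = {element: tone_pair(element) for element in BASIC_SET}

for element, tones in TONE_TABLE.items():
    print(element, tones)
```

Because upper case is folded to lower case before the lookup, ‘T’ and ‘t’ produce the same pair, which matches the basic method’s lack of case distinction.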

Now with this basic mapping we can represent an English or other romanized message with Han characters using their tones. Referencing the chart above, to represent an ‘a’ we need two characters, both first tone (1 and 1), or 11. We could use:

貓  cat     maau1       雞  chicken gai1
鷹  eagle   jing1       獅  lion    si1

and so on.

The word ‘the’ requires six characters, two per English letter. Following the table, ‘t’ is 42 (decimal 19, hexary 31, plus one in each position), ‘h’ is 22 and ‘e’ is 15, so one possible mapping is:

't' .. 42 .. 牛  cow     ngau4       狗  dog     gau2
'h' .. 22 .. 虎  tiger   fu2         煮  cook    zyu2
'e' .. 15 .. 貓  cat     maau1       鳥  bird    niu5

To get the original message back, each pair of tones is simply mapped back to its letter or digit.
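Decoding can be sketched in a few lines, assuming the Jyutping for each character is available and ends in its tone number. The function name and the input format (a list of Jyutping syllables) are assumptions made for illustration.

```python
import string

# 'a'..'z' followed by '0'..'9', indexed 0 to 35.
BASIC_SET = string.ascii_lowercase + string.digits


def decode(jyutping_syllables):
    """Map each pair of Jyutping tones back to one letter or digit."""
    tones = [int(syllable[-1]) for syllable in jyutping_syllables]
    if len(tones) % 2:
        raise ValueError("expected an even number of syllables")
    message = []
    for first, second in zip(tones[0::2], tones[1::2]):
        # Undo the "+ 1" on each position, then read the pair as base 6.
        message.append(BASIC_SET[(first - 1) * 6 + (second - 1)])
    return "".join(message)


# The six readings from the 'the' example above.
print(decode(["ngau4", "gau2", "fu2", "zyu2", "maau1", "niu5"]))  # prints: the
```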

Current Encoding Software

The current encoding software in the demo uses a preset list of characters. It can easily be set to use the full Unihan data set, or some other large reference or dictionary with Jyutping romanization. To generate authentic-looking vocabulary lists, the software picks the first character it can find that meets the tone requirement and does not reuse that character unless it runs out of new ones. The initial lists were made by hand from common subjects (animals, food, etc.) so that the generated lists look more like legitimate vocabulary lists with common themes. However, they contain only 75 characters, so a long message may have repeats. Larger source databases, dictionaries, etc. can be used instead.
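The selection strategy can be sketched as follows. This is not the demo’s actual code: the function name, the tuple layout, and in particular the fallback (reusing the first matching entry once no unused one remains) are assumptions about how such a picker might behave.

```python
def pick_characters(tones, vocabulary):
    """Pick one vocabulary entry per required tone, avoiding reuse.

    vocabulary is a list of (character, gloss, jyutping) tuples.
    """
    unused = list(vocabulary)
    picked = []
    for tone in tones:
        entry = next((e for e in unused if int(e[2][-1]) == tone), None)
        if entry is not None:
            unused.remove(entry)
        else:
            # Assumed fallback: no unused entry has this tone, so reuse
            # the first matching entry from the full list.
            entry = next((e for e in vocabulary if int(e[2][-1]) == tone), None)
            if entry is None:
                raise LookupError(f"no character with tone {tone}")
        picked.append(entry)
    return picked


VOCABULARY = [
    ("貓", "cat", "maau1"),
    ("雞", "chicken", "gai1"),
    ("虎", "tiger", "fu2"),
    ("牛", "cow", "ngau4"),
    ("狗", "dog", "gau2"),
    ("鳥", "bird", "niu5"),
]

# Tone sequence 4 2 2 2 1 5; the third tone-2 slot forces a repeat.
for character, gloss, jyutping in pick_characters([4, 2, 2, 2, 1, 5], VOCABULARY):
    print(character, gloss, jyutping)
```

With only two second-tone entries in this tiny list, the third second-tone slot repeats 虎, which is the kind of repeat a larger source list would avoid.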

Issues and Enhancements

Character set size

The basic method only encodes the letters A through Z (without case) and the digits 0 through 9, so some characters are lost, including spaces, apostrophes, periods and any other character not in the list. In practice the message is understandable without these characters, even without spaces, though their absence adds a little difficulty.

I shared this with a friend who speaks Cantonese and described the dilemma: six tones allow for only 36 characters, and I couldn’t find a good source for the nine tones of Cantonese. Of course it is easy to come up with variations of the code and methods that add more characters for encoding, but I still wanted a version that could be decoded by a human with no software on the decoding end. After a bit he explained the p, t, k finals, or as Wikipedia calls them, ‘checked tones’.

So, I think the easiest way to get a few more characters without much added complexity is to treat a character that ends in any of the three p, t, or k finals as one additional signal, giving base 7. That yields 49 possible characters, 13 more than with base 6, which could be used for spaces, periods, colons, and whatever other punctuation is critical to your specific message. Alternatively, the three finals could be treated individually to give 9 tones, yielding 81 possible characters.
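A rough sketch of the base 7 variant follows. The 13 extra symbols and their order are a hypothetical choice, as are the function names; the sketch also assumes the same index-then-convert scheme as the basic method, which means the pairs for the existing letters differ from those in the base 6 table.

```python
import string

BASIC_SET = string.ascii_lowercase + string.digits
# Hypothetical choice of 13 extra symbols; any 13 would do.
EXTRA_SYMBOLS = " .,:;?!'\"-()/"
EXTENDED_SET = BASIC_SET + EXTRA_SYMBOLS  # 49 symbols in total


def signal(jyutping: str) -> int:
    """Return 7 for a syllable ending in p, t or k, else its tone (1-6)."""
    if jyutping[-2] in "ptk":
        return 7
    return int(jyutping[-1])


def decode_base7(jyutping_syllables):
    """Map each pair of signals (1-7) back to one of the 49 symbols."""
    signals = [signal(syllable) for syllable in jyutping_syllables]
    if len(signals) % 2:
        raise ValueError("expected an even number of syllables")
    pairs = zip(signals[0::2], signals[1::2])
    return "".join(EXTENDED_SET[(a - 1) * 7 + (b - 1)] for a, b in pairs)


print(signal("sik6"))   # ends in k, so the seventh signal
print(signal("maau1"))  # an ordinary first-tone syllable
```

Here a sixth-tone syllable followed by a second-tone syllable gives the value 36, the first of the extra symbols, which this sketch assigns to the space.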

Many more sophisticated options exist, but they tend to necessitate software. One is changing the index into the alphabet so that, say, ‘b’ = 0 instead of ‘a’. The letter equal to 0 could be determined daily from some external or dynamic source, such as the sunrise time. But again, a human can get used to the 36 characters, or maybe even 49, and essentially read and write in code using the basic method without any software.
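The shifted index could look something like the sketch below. It assumes the shift wraps around the whole 36-element basic set, so elements before the new zero move to the end; the text does not specify this, and the function name is illustrative.

```python
import string

BASIC_SET = string.ascii_lowercase + string.digits


def shifted_tone_pair(element: str, zero_element: str) -> str:
    """Tone pair for element when zero_element is assigned the value 0."""
    offset = BASIC_SET.index(zero_element.lower())
    # Assumption: values wrap around the full 36-element set.
    value = (BASIC_SET.index(element.lower()) - offset) % len(BASIC_SET)
    high, low = divmod(value, 6)
    return f"{high + 1}{low + 1}"


print(shifted_tone_pair("b", "b"))  # 'b' is now 0; prints 11
print(shifted_tone_pair("a", "b"))  # 'a' wraps to 35; prints 66
```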

Tone ambiguity

There are two general cases that I am aware of that could be a source of trouble; two sides of the same coin, actually. Both involve characters that have more than one reading, for which the usual term is ‘polyphonic characters’ (多音字), or heteronyms. Also, I can only describe the issue as it relates to Mandarin.

Regardless, a first example in Mandarin is (教) ‘teach’, which is pronounced jiao1 with the first tone, but as part of the word for ‘classroom’ (jiao4shi4) the same character is pronounced with the fourth tone. A Cantonese dictionary lists its readings as (gaau3 gaau1), so first or third tone. This case isn’t technically an issue in a vocabulary list, because you know whether the word is ‘teach’ or ‘classroom’ and thus the proper tone.

Another case would be perhaps (的) de, the possessive particle, which has the neutral tone. It can also be pronounced totally differently as part of the word meaning ‘indeed’ (的確, or ‘di2que4’). I suppose again this is not a great example, as the reading is clear from which word you are using. Suffice it to say that there are likely situations where the same character can have multiple potential tones. I think the tone is usually disambiguated when the character is part of a different word, or when the definition is included, as would be the case in a vocabulary list.

Definitely, in any case, including the Jyutping alongside the character removes any ambiguity.
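When building a character list from a dictionary, characters with more than one possible tone could be flagged automatically. This is a minimal sketch, assuming each entry’s readings arrive as a space-separated Jyutping string like the ‘gaau3 gaau1’ entry quoted above; the function names are illustrative.

```python
def possible_tones(readings: str) -> set:
    """Return the set of tones among space-separated Jyutping readings."""
    return {int(reading[-1]) for reading in readings.split()}


def is_ambiguous(readings: str) -> bool:
    """True when the readings do not all share a single tone."""
    return len(possible_tones(readings)) > 1


print(is_ambiguous("gaau3 gaau1"))  # two different tones
print(is_ambiguous("maau1"))        # a single reading
```

Characters flagged this way could either be left out of the source list or always be printed with their Jyutping.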
