The Chinese script is used across languages in East Asia, with intricacies that are not often recognized. The following are the choices I have made about what goes into the font. My intention in building the Cantonese Font is the pragmatic one of helping with learning and teaching the language, so priority is given to representing modern usage.
Characters
The selection of characters was extracted from modern corpora (words.hk & HKCANCOR). This was further supplemented over several rounds:
- by the Hong Kong Cantonese supplementary characters listing (version 1.8, 2022-09-17) by 內木一郎. Hong Kong characters from classes 1-6 and a-c were included as much as possible. These were in turn supplemented by
- extensive testing against specialty texts (e.g., Chinese medicine, modern history), and
- Cantonese forums
There is a long tail of Chinese characters, so omissions do occur. I encourage you to report these to the Discord channel.
Some glyphs cannot be included, mostly due to the glyph shape being excluded from Source/Noto Sans Han, or because I could not assign a pronunciation to the character (e.g., 﨧). Multi-syllable characters are also excluded.
Only Traditional Chinese characters were included, prioritizing Hong Kong variants. This means that, for example, amongst 廣 (Zh-HK/TW), 広 (Zh-JP), and 广 (Zh-S), only 廣 would behave semantically as “broad”.
You would actually find that 广 is included, but as the Zh-T character referring to “houses built next to the cliff-face”, read as jim5. You would also, in this case, find that 広 is included for pragmatic reasons; it would, however, not support any contextual substitutions (see below).
Contextual substitution occurs when particular words appear consecutively. Let’s take as an example the word 廣傳: if we handle this only with Traditional Chinese, there is exactly one way that this could be encoded. However, if we allow for {廣, 广, 広} and {傳, 传, 伝}, there are now nine combinations (3 × 3) that need to be supported. That is not the most pragmatic use of the total limit of 65k glyphs.
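To make the combinatorics concrete, here is a minimal Python sketch that enumerates the spellings of 廣傳 once regional variants are allowed. The variant sets come from the example above; the function name `variant_spellings` is hypothetical and is not part of the font's build process.

```python
from itertools import product

# Variant sets taken from the example in the text.
first = ["廣", "广", "広"]
second = ["傳", "传", "伝"]


def variant_spellings(*variant_sets):
    """Return every spelling formed by picking one variant per position."""
    return ["".join(combo) for combo in product(*variant_sets)]


spellings = variant_spellings(first, second)
print(len(spellings))  # number of spellings a substitution rule would need to cover
print(spellings)
```

The count is the product of the set sizes, so a three-character word with three variants per position would already need 27 entries.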
If characters disappear when you are copy-pasting, it could be that characters that look similar are actually encoded as different characters belonging to another regional standard. Examples include 淨浄净, 惡悪恶, or 榮栄荣. Sometimes the input software could also mistake your entry (this is quite bad with Apple’s handwriting input, which strongly favors what is common and often presents Zh-S).
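If you suspect a look-alike character, one way to check is to print the Unicode code point of each character you pasted. The sketch below is only an illustration: the `supported` set is a hypothetical stand-in for the font's real character list, and both helper names are made up for this example.

```python
def code_points(text):
    """Return the Unicode code point of each character, e.g. 'U+5EE3'."""
    return [f"U+{ord(ch):04X}" for ch in text]


def unsupported(text, supported):
    """Return the characters in text that are not in the supported set."""
    return [ch for ch in text if ch not in supported]


# Three look-alike characters from the examples above.
lookalikes = "淨浄净"
for ch, cp in zip(lookalikes, code_points(lookalikes)):
    print(ch, cp)

# Hypothetical supported set containing only the Traditional form.
print(unsupported(lookalikes, {"淨"}))
```

Each of the three characters prints a different code point, which is why a font that covers one of them does not automatically cover the others.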
This preference for Zh-T extends to the choice of pronunciation. When the same glyph is used in Traditional Chinese and is also borrowed for a different purpose in Simplified Chinese, I have chosen to keep the historical meaning/pronunciation that is associated with Zh-T. Examples include 体 ban1, which in Zh-S takes on the semantic meaning of 體; or 叶 tou3, which is used as the simplified form of 葉.
When alternative ways of writing the same character are present (異體字: 刦 刧 劫; 㘭 坳) and they are likely to be encountered in regular usage, they are included. Be careful: some characters look alike but are not the same; 刺 / 剌, 哂 / 晒 / 唒, 囗 / 口, and 埶 / 執 come to mind. If the pronunciation seems strange, it could be that you have input something other than the character you have in mind.
Words
Words were selected using similar sources. Each of the ~100,000 words in the initial list was checked against the character list for its pronunciation. If any character in the word differs in pronunciation from its standalone sound, that word is included in the font.
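The selection rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual build script: the function name `needs_word_entry`, the space-separated jyutping format, and the tiny sample tables are all hypothetical, and every character in a word is assumed to be present in the standalone table.

```python
def needs_word_entry(word, word_reading, standalone):
    """Return True if any character is read differently inside the word.

    word_reading is a space-separated jyutping string, one syllable per
    character. standalone maps each character to its default reading.
    """
    syllables = word_reading.split()
    if len(syllables) != len(word):
        raise ValueError("expected one syllable per character")
    return any(standalone[ch] != syl for ch, syl in zip(word, syllables))


# Illustrative sample data only.
standalone = {"皇": "wong4", "冠": "gun3", "星": "sing1", "宿": "suk1", "舍": "se3"}
candidates = {"皇冠": "wong4 gun1", "星宿": "sing1 sau3", "宿舍": "suk1 se3"}

# Keep only the words whose reading cannot be derived character by character.
selected = [w for w, r in candidates.items() if needs_word_entry(w, r, standalone)]
print(selected)
```

With this sample, 皇冠 and 星宿 are kept because 冠 and 宿 deviate from their standalone readings, while 宿舍 is dropped because the per-character defaults already produce the right result.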
Pronunciation
The initial pronunciations were assigned using the rime-upstream database. For standalone characters, about 98% of the assignments are what I would have made. To correct the remaining 2%, I manually curated the pronunciation of each character contained in the font. Preference was given to 廣州話正音字典.
Often, because a character has multiple etymologies, it also has different pronunciations. An example is 宿, which is sau3 in the context of the zodiac (星宿) but suk1 in the context of habitat. In these cases, I selected the pronunciation that modern users are most likely to encounter (so suk1 in this case). Another example is 尻, which historically would be pronounced haau1, but a modern audience would expect gau1, and it is presented as such here. Pragmatic and descriptive is how I’d characterize the governing philosophy.
In some cases, two or more pronunciations are equally likely. An example is 冠, which may be gun1 or gun3. In these cases I made the call based on which usage is likely to occur in a flexible context. 冠 is gun1 in 衣冠 / 皇冠 / 花冠 / 雞冠, but there are many more places where it is gun3; so it is assigned gun3, and we use words to catch the alternative pronunciations. As another example, 塞 could be sak1 or coi3, but the latter is almost only found in the context of 要塞 or 塞翁失馬, so the standalone character is assigned sak1.
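The interplay between a standalone default and word-level exceptions can be pictured with a small Python analogy. The font itself does this through contextual substitution, not through code like this; the function `annotate`, the greedy longest-match strategy, and the sample tables are assumptions made purely for illustration.

```python
def annotate(text, standalone, words):
    """Assign jyutping, preferring the longest matching word entry.

    words maps a multi-character word to a space-separated reading;
    standalone maps a single character to its default reading.
    """
    max_len = max((len(w) for w in words), default=1)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate word first, down to two characters.
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in words:
                result.extend(words[chunk].split())
                i += n
                break
        else:
            # No word entry matched: fall back to the standalone reading.
            result.append(standalone.get(text[i], "?"))
            i += 1
    return result


# Illustrative sample data only.
standalone = {"冠": "gun3", "皇": "wong4", "軍": "gwan1"}
words = {"皇冠": "wong4 gun1"}

print(annotate("皇冠", standalone, words))
print(annotate("冠軍", standalone, words))
```

Here 皇冠 is caught by the word entry and reads gun1, while 冠軍 has no entry and falls back to the standalone gun3. Assigning the more flexible reading as the default keeps the exception list short.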
There are characters where there isn’t a good resolution. 呀 could be aa3 or aa4; 咪 may be mai1 or mai6; 平 could be peng4 / ping4; 婆婆 could be po4 po2 or po4 po1. In these cases I just made a judgment call. Sorry.
In the future, I may introduce a syntax to allow calling for alternative pronunciations, e.g., 宿(1) / 宿.1 to access an alternative glyph. That, however, requires more careful thinking.
Overall performance
The overall performance can be benchmarked against different kinds of text. The results are shown below.
Modern text
v1.0 of the Cantonese Font was used to prepare the jyutping for a colloquial Cantonese translation of several chapters of The Wizard of Oz, and an editor then assessed its accuracy. The accuracy was assessed to exceed 99%.
Conversation Transcripts
The font was used to convert a conversation transcript, and the accuracy using v1.0 exceeded 99.5%.
Ming prose
Tang poem
Error Correction
Despite manual curation, errors may persist, and important omissions may remain. You are welcome to submit bug reports and request additions through the following forms: *****.