“简体字”不简单 Jiǎn tǐ zì bù jiǎn dān: The Complexity of Simplified Chinese
新华，新写法 Xīn huá, xīn xiě fǎ: New China, New Writing
In the 1950s and 1960s, under the leadership of Mao Zedong 毛澤東 and Zhou Enlai 周恩來, the People's Republic of China (PRC) began a campaign of language reform. In addition to establishing pinyin 拼音 as the standard system of romanization for Mandarin, the authorities also simplified thousands of Chinese characters.
These orthographic reforms were intended to make it easier for China's vast and, at that time, largely uneducated populace to learn Chinese. The reforms undoubtedly also served a secondary purpose, political and societal in nature, of distinguishing the New China (“新华” xīn huá) from the “old” and “feudal” society (封建社会 fēng jiàn shè huì) of the past.
Many of the simplified characters represent nothing more than official standardization of handwritten and cursive (草書 cǎo shū) calligraphic forms which had already been in widespread use for millennia. However, opponents of character simplification point out that the authorities were politically motivated when they ignored radicals, etymologies, and phonetics to create certain new simplified characters. A favorite example is “圣” shèng (holy, sacred, saint, sage). The traditional form of the character, “聖” shèng, bears the radical “王” wáng, which means “king.” The argument goes that this was intentionally perverted to “土” tǔ, which means “earth, soil, dust,” in order to devalue the religions, saints and kings of the old China [Wikipedia: Simplified Chinese].
简繁之争 Jiǎn fán zhī zhēng: The Orthography War
Is simplified Chinese really easier to learn? Did these orthographic reforms really contribute to increasing literacy rates in mainland China?
These are just two among many questions which remain the subject of debate among Chinese to this day, not only on both sides of the Taiwan Strait but also everywhere else in the world where Chinese have settled. Increasing trade and communication between China and Chinese communities outside the mainland has also resulted in pragmatic questions being raised even inside the mainland about whether knowing only simplified character forms is enough. As recently as 2004, the Beijing City Educational Committee rejected a proposal to teach traditional character forms alongside simplified forms at the elementary and middle school levels, citing inconsistencies with the country's language laws and concerns about placing too much burden on students [市教委驳回政协委员普及繁体字教学建议 (beijing.qianlong.com)].
To add more fuel to the fire surrounding simplified vs. traditional characters (“简繁之争” jiǎn fán zhī zhēng), the United Nations (UN) has now announced that beginning in 2008 it will no longer use traditional full-form characters (正體字) [Taipei Times editorial], [台湾岛内掀起“简繁之争” (中国语言文字网 www.china-language.gov.cn)].
So the answers you receive in the debate over simplified vs. traditional characters will depend upon whom you talk to. And you might get an earful of politics to boot.
But guess what? At the end of the day trying to decide whether simplified characters are a good or bad thing is almost irrelevant. Even the U.N.'s decision is, in a certain sense, irrelevant too. Of course the U.N.'s decision reflects China's growing influence as a world power. And perhaps the U.N. really does save a few dollars by not having to produce documents in both orthographies. But these debates are largely irrelevant because they are overshadowed by larger realities that are not subject to debate.
网上文化 Wǎng shàng wén huà: (Chinese) Culture Online
The larger, quieter reality is that Chinese all over the world, just like everyone else all over the world, are participating in the new online culture, where they do everything from searching for information on Google and Yahoo, to buying and selling goods, to crafting new entries in Wikipedia. Some of these people are using simplified characters. Some are using traditional characters. And everyone is reading what everyone else has written. Although there is clear evidence that usage of simplified Chinese is increasing in places like Hong Kong, this increase is not occurring at the expense of traditional Chinese. Numerous media outlets in Hong Kong, including PRC-funded media, show no signs of moving to simplified Chinese [Simplified Chinese (Wikipedia)].
Large commercial and well-funded online media producers are trying to accommodate this new reality by providing web sites in both simplified and traditional characters. But the web is so much more than just Xinhua news, the Industrial and Commercial Bank of China in Hong Kong, or the Taiwanese electronics giant Tatung. Most of the really interesting stuff on the web is going to be available in only one or the other orthography. Not both.
Welcome to the new reality! Now everyone has to have at least some reading knowledge of both orthographies!
Take-home Message Number One: There is a huge need for software systems that recognize this new reality and address it in intelligent ways.
This new reality has many facets and dimensions. Let's take a look at the obvious problem of converting between simplified and traditional Chinese.
The Conversion Problem
So what is the big deal? Can't someone just write a little Perl script (like this HanConvert script) that maps between simplified and traditional character forms, just as one might convert lowercase Latin text to uppercase?
Unfortunately, that's only part of the problem. It is true that there is a one-to-one mapping between many simplified and traditional characters. Simple lookup tables are sufficient for converting between this class of characters.
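As a rough illustration, such a lookup-table conversion really is as simple as it sounds. The mapping below is a tiny, hypothetical sample (a real table would cover thousands of characters), but the mechanism is the whole story for this class of characters:

```python
# Minimal sketch of simplified-to-traditional conversion for characters
# that have an unambiguous one-to-one mapping. The table here is a tiny
# illustrative sample, not a complete mapping.
S2T_ONE_TO_ONE = {
    "国": "國",  # country
    "马": "馬",  # horse
    "门": "門",  # door
}

def convert_one_to_one(text: str) -> str:
    """Replace each character that has exactly one traditional form;
    leave everything else (including ambiguous characters) untouched."""
    return "".join(S2T_ONE_TO_ONE.get(ch, ch) for ch in text)

print(convert_one_to_one("国门"))  # → 國門
```

Note that the function deliberately passes through any character not in the table; as the next paragraphs explain, the characters it must pass through are exactly where the trouble starts.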
However in numerous cases one simplified form corresponds to two or more traditional forms. As Jack Halpern and Jouni Kerman of the CJK Dictionary Institute in Tokyo point out, this is the central issue in simplified-to-traditional conversion [The Pitfalls and Complexities of Chinese to Chinese Conversion]. Halpern and Kerman provide one of the “classic” examples: the etymologically distinct characters “發” fā (to send out, to emit) and “髮” fà (hair) were simplified to a single character, “发”.
For this reason, accurate conversion from simplified to traditional Chinese depends on context. Converting from traditional to simplified Chinese is a little easier, but likewise depends on context. In both cases, it is not sufficient to look at individual characters. (Try out the aforementioned Perl script and you will see that while it may serve as a useful tool to speed up manual conversion by a human, its results are completely unacceptable as the basis of a machine-based solution.) The most important conversion operation must occur at the level of whole words. This requires parsing Chinese text into words and then converting at the word level, which is not a trivial task, for many reasons, as we will examine below.
Halpern and Kerman provide many details in their Chinese Conversion article which I trust those interested will read. Instead of rehashing their work, I would like to now critique it and update it a little bit. The purpose of this exercise is to establish the proper context in which we may investigate possible automated solutions that will really work.
Character Sets and Encoding Are No Longer A Problem
Let's start by making it clear which areas of the larger problem domain are no longer a problem. Numerous older writings discuss at length the issues of incompatible character sets and encoding systems: GB 2312 vs. Big5 and so on. Halpern and Kerman are no different and go on about this stuff at some length. The answer to this problem is of course Unicode and, on the web especially, UTF-8. Smart content producers such as Wikipedia (维基百科) realized a long time ago that UTF-8 was the right place to start. Case closed.
Levels of Conversion
Halpern and Kerman describe three levels of conversion:
- Level 1: Code Conversion
- Level 2: Orthographic Conversion
- Level 3: Lexemic Conversion
Level 1 Code conversion is appropriate for converting that subset of characters that have a one-to-one mapping between simplified and traditional. But the process must stop there.
The previously-mentioned Perl script, HanConvert, goes beyond this by putting the 1-to-n and n-to-1 cases into the lookup tables. When more than one mapping is possible, HanConvert and similar software simply write out all of the possibilities to the result set surrounded by square brackets. The user is then required to manually edit the result set until a correct document is obtained. Depending on the content, this task can be either trivial or enormous.
Level 2 Orthographic Conversion refers to word-level conversion of characters. Words in Chinese may consist of single characters, compounds of two or more characters, and multi-character idiomatic phrases. Halpern and Kerman call these word-units, and they represent the linguistically meaningful segmentation of a text. This is where the action has to be to get the most bang for your Chinese conversion buck.
Level 3 Lexemic Conversion refers to converting not just the orthography but the actual words used in a text, because many new technical terms coined in the past fifty years differ between the mainland, Taiwan, Hong Kong, and Singapore. For example, in mainland China a computer is normally called a “计算机” jì suàn jī, but in Taiwan it is called a “電腦” diàn nǎo. In my opinion, Halpern and Kerman have overestimated the importance of this level of conversion by failing to examine the wider problem domain.
In addition to these three levels of conversion, Halpern and Kerman mention the problem of handling proper nouns; we will return to this issue a bit later.
The Problem of Lexemic Conversion
I have a few issues with parts of Halpern and Kerman's presentation and analysis.
Let's discuss presentation first. In their treatment of lexemic conversion, Halpern and Kerman refer to “计算机” jì suàn jī as “SC” (Simplified Chinese) while “電腦” diàn nǎo is said to be “TC” (Traditional Chinese). The problem here is that they are using terms that refer strictly to orthography (i.e., “SC” vs. “TC”) to describe lexical differences between the mainland and Taiwan. “计算机” jì suàn jī is a word that can be correctly written using either orthography: “计算机” (SC) or “計算機” (TC). The same is true for “電腦” diàn nǎo, which can be correctly written as “电脑” (SC) or “電腦” (TC). It would have been better if the authors had drawn a clearer distinction between these orthographic and lexical differences.
But this is just nit-picking. I have a bigger issue with Halpern and Kerman's analysis of lexemic conversion as an essential component of a robust conversion system. The problem is that they fail to distinguish between several different problem domains, and therefore miss opportunities to tailor solutions to the actual problem at hand.
There are a few applications where lexemic conversion is important. If I am an automobile manufacturer and I hire a company to translate my product documentation into various languages, then yes, I want my American customers to see the words “hood” and “gasoline” while my UK customers see “bonnet” and “petrol” instead. If my company's automobiles have onboard computers, then yes, my mainland Chinese customers may prefer to read “计算机” jì suàn jī while my Taiwan customers see “電腦” diàn nǎo. But this is a translation problem. It is not an orthography conversion problem.
In many other application domains, Halpern and Kerman's lexemic conversion would, in the worst cases, corrupt the text. For example, suppose we wrote a web page, in Chinese, which explained some of the lexical differences between the mainland and Taiwan. Automated “lexemic conversion” would clearly be a total disaster in this case.
In fact, when we think carefully about the requirements of the Chinese-language world wide web in general, it should be clear that Halpern and Kerman's “lexemic conversion” should be used sparingly, if at all. When I browse the web, I want to be able to read what the original author wrote in every case. I don't want some software in there automatically converting “计算机” jì suàn jī into “電腦” diàn nǎo, or “bonnet” into “hood”. (But I wouldn't mind at all if I could choose a preferred Chinese orthography.)
Halpern and Kerman also talk about the lexemic conversion of proper nouns, especially foreign names such as “Kennedy,” which is usually written as “肯尼迪” in the mainland but as “甘迺迪” in Taiwan. These are minor differences in the transliteration of foreign names. There are only a very few application domains where automated conversion of proper nouns at this level would be required. In my mind, this problem is about as important as the problem of converting British spellings like “analyse” and “centre” into American “analyze” and “center”, which is to say, not important at all.
Take-home Message Number Two: Intelligently-designed systems that perform Level 1 (one-to-one character mapping) and Level 2 (word-level orthographic) conversions are greatly needed. Level 3 (lexemic) conversion is at best required only in a very narrow set of problem domains.
Where Do Things Currently Stand?
Where do things currently stand? How do search engines perform when presented with query terms in one or the other orthography? Does quality software already exist to perform high-quality automated conversion between the two orthographic systems? We will examine these questions in Part II.