Chinese word-segmented writing

Chinese sentences are written as strings of characters, with no marks between words. Hence, word segmentation according to the context (done either consciously or unconsciously) is a task for the reader. Chinese word-segmented writing, or Chinese word-separated writing, is a new writing style where texts are written with spaces between words like written English.[1]

Word-segmented writing has been advocated for several reasons. An important one is the existence of ambiguous texts, where only the author knows the intended meaning and therefore the correct segmentation. For example, "美國會不同意。 美国会不同意。" (given here in traditional and simplified characters) may mean "美國 會 不同意。 美国 会 不同意。" (The US will not agree.) or "美 國會 不同意。 美 国会 不同意。" (The US Congress does not agree.).[2]

History

In ancient China, texts were written without punctuation marks, so readers had to spend considerable time finding sentence boundaries. It was not until the early 1900s that the present punctuation marks were adopted.[3]

In the 1950s, Chinese linguists discussed a proposal to adopt word-segmented writing, but it was not passed.[3]

In 1987, the idea of Chinese word-segmented writing was put forward again by Chen Liwei at an international conference on Chinese information processing.[4]

Chinese word-segmented writing was put into practice no later than 1998, when a paper entitled "Written Chinese Word Segmentation Revisited: Ten advantages of word-segmented writing" was published in a key academic journal in China.[5] The whole seven-page paper was written in word-segmented form, with the abstract presented as:

摘要: 单词 的 切分 对 现代 汉语 的 运用、研究 和 计算机 信息 处理 等 都 具有 相当 重要 的 意义。本文 阐述 书面 汉语 分词 连写 的 十 大 好处 , 并 讨论 一些 实施 方面 的 问题。文章 全文 分词 连写。

In 2018, a one-paragraph article entitled "Word segmentation of Hanzi" was published on Wikiversity,[6] with the Chinese text word-segmented as follows:

历史上,中国古文 是 没有 标点符号的。读者 需要 付出 额外的 精力 专注于 断句,而且 稍有差池 便会 造成 误读。所谓 差之毫厘 失之千里。引入 标点符号 是 一次 重大的 文字改革,使得 汉字文本的 阅读效率 有了 很大的 提高。但 中文的 改革 才 刚刚 起步, 远未达到 尽善尽美的 程度。至少 在 阅读效率 方面 仍然 存在着 一个 显而易见的 障碍 - 断词 (汉字的 分词连写)。

The first book written in word-segmented form was 语言理论 (Language theories), published in 2000.[7]

Methods

The following are some methods or skills for word-segmented writing.

Guidance of the main purpose

The most important purpose of word-segmented writing is to express the writer's intended meaning accurately and clearly. For example, the traditional unsegmented text "乒乓球拍卖完了。" has two possible meanings, which can be expressed in word-segmented writing as "乒乓 球拍 卖完了。" (The ping-pong bats are sold out.) and "乒乓球 拍卖 完了。" (The ping-pong balls have been auctioned off.). The author chooses the segmentation that expresses the intended meaning without ambiguity.[3]
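The ambiguity in this example can be made concrete with a small sketch that lists every way of splitting an unsegmented string into dictionary words. The function name and the six-word toy lexicon below are assumptions chosen for illustration; they are not taken from any cited source.

```python
def all_segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in all_segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon covering only the example sentence.
toy_lexicon = {"乒乓", "乒乓球", "球拍", "拍卖", "卖完了", "完了"}

for words in all_segmentations("乒乓球拍卖完了", toy_lexicon):
    print(" ".join(words))
```

With this lexicon the sketch finds the two readings discussed above. The software can list the candidates, but only the author knows which one is meant.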

Use of word dictionaries and linguistic knowledge

A writer who is unsure whether a character string is a legitimate word can look it up in a reliable dictionary, such as Xiandai Hanyu Cidian, 重編國語辭典修訂本 (Guoyu Dictionary)[8] or CEDICT, or check whether it qualifies as a word according to lexical, morphological and syntactic knowledge.[9]

Reference of the rules for Pinyin word segmentation

The Basic rules of Chinese phonetic alphabet orthography[10] is a Chinese national standard for Pinyin spelling and word segmentation.

The general rules are:

  1. Use words as the basic writing units for Pinyin expressions. For example: rén (人, person), pǎo (跑, run), māmɑ (妈妈, mother), yuèdú (阅读, read), túshūɡuǎn (图书馆, library).
  2. Two-syllable and three-syllable expressions of a concept are written consecutively (without spaces). For example: huánbǎo (环保, environmental protection), ɡōnɡɡuān (公关, public relations), chánɡyònɡcí (常用词, commonly-used words), duìbuqǐ (对不起, sorry).
  3. Names of four or more syllables that represent a single concept are written in segments, divided by words or by syllable groups (segments separated by speech pauses inside the phrase). Those that cannot be divided in this way are written consecutively. For example: wúfènɡ ɡānɡɡuǎn (无缝钢管, seamless steel pipe), huánjìnɡ bǎohù guīhuà (环境保护规划, environmental protection planning), Zhōnɡɡuó Shèhuì Kēxuéyuàn (中国社会科学院, Chinese Academy of Social Sciences), yánjiūshēnɡyuàn (研究生院, graduate school), hónɡshízìhuì (红十字会, Red Cross Society).
  4. Single-syllable repeating words are to be written consecutively; double-syllable repeating words are written separately. For example: rénrén (人人, everyone), kànkɑn (看看, look), hónɡhónɡ de (红红的, very red), yánjiū yánjiū (研究研究, research research), xuěbái xuěbái (雪白雪白, snow white snow white). Repeating words in AABB structure are written consecutively. For example: láiláiwǎnɡwǎnɡ (来来往往, coming and going), qīnɡqīnɡchǔchǔ (清清楚楚, crystal clear), fānɡfānɡmiànmiàn (方方面面, all aspects).
  5. Monosyllabic prefixes (副 vice, 总 general/chief, 非 non, 反 anti, 超 super, 老 old, 阿 A, 可 able, 无 non, 半 semi, etc.) or monosyllabic suffixes (子 zi, 儿 er, 头 man, 性 -ity, 者 person, 员 member, 家 expert, 手 specialist, 化 -ize, 们 plural, etc.) are written consecutively with the main word. For example: fùbùzhǎnɡ (副部长, vice minister), zǒnɡɡōnɡchénɡshī (总工程师, chief engineer), fùzǒnɡɡōnɡchénɡshī (副总工程师, vice chief engineer), fēijīnshǔ (非金属, non-metallic), kēxuéxìnɡ (科学性, scientific / scientificity), chénɡwùyuán (乘务员, flight attendant), xiàndàihuà (现代化, modernization), háizimen (孩子们, children).
  6. For the convenience of reading and understanding, a hyphen can be used between some parallel words or morphemes, or in some abbreviations. For example: bā-jiǔ tiān (八九天, eight or nine days), rén-jī duìhuà (人机对话, human-computer dialogue), Jīnɡ-Zànɡ Gāosù Gōnɡlù (京藏高速公路, Beijing-Tibet Expressway).
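As a rough illustration of rule 4 above, the following sketch decides whether a reduplicated expression is written consecutively or separately. It is a simplified reading of that one rule, not an implementation of the standard: the function names are hypothetical, the input is assumed to be already split into syllables, and tones are ignored when comparing syllables so that a neutral-tone repeat such as "kan" still matches "kàn".

```python
import unicodedata


def strip_tones(syllable):
    """Remove tone marks (and other diacritics) so syllables can be compared."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def write_reduplication(syllables):
    """Join the syllables of a reduplicated expression following rule 4.

    ABAB (a two-syllable word repeated) -> two words separated by a space.
    AA and AABB                         -> written consecutively.
    """
    base = [strip_tones(s) for s in syllables]
    is_abab = (
        len(base) == 4
        and base[0] == base[2]
        and base[1] == base[3]
        and base[0] != base[1]
    )
    if is_abab:
        return "".join(syllables[:2]) + " " + "".join(syllables[2:])
    return "".join(syllables)


print(write_reduplication(["kàn", "kan"]))                 # AA
print(write_reduplication(["yán", "jiū", "yán", "jiū"]))   # ABAB
print(write_reduplication(["lái", "lái", "wǎng", "wǎng"])) # AABB
```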

In addition to the general rules, there are specific rules for nouns, verbs, adjectives, pronouns, numerals, quantifiers, adverbs, prepositions, conjunctions, auxiliary words, interjections, onomatopoeia, idioms and sayings, as well as for names of people and places.

For example, Article 1 of the Universal Declaration of Human Rights in simplified Chinese characters:[11]

人人生而自由,在尊严和权利上一律平等。他们赋有理性和良心,并应以兄弟关系的精神相对待。

can be transcribed into word-segmented Pinyin as

Rénrén shēng ér zìyóu, zài zūnyán hé quánlì shàng yīlǜ píngděng. Tāmen fùyǒu lǐxìng hé liángxīn, bìng yīng yǐ xiōngdì guānxì de jīngshén xiāng duìdài.

Accordingly, the Chinese character text can be segmented into

人人 生 而 自由,在 尊严 和 权利 上 一律 平等。 他们 赋有 理性 和 良心, 并 应 以 兄弟 关系 的 精神 相 对待。
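The step from segmented Pinyin to segmented characters can be sketched in code: count the syllables in each Pinyin word and take that many characters from the unsegmented text. The sketch below relies on a heuristic that is an assumption of this example, not part of the standard: once tone marks are removed, each syllable is taken to contain exactly one run of vowel letters. This works for the sentence above but would miscount forms such as erhua ("huār", two characters). All function names are hypothetical.

```python
import re
import unicodedata

# Heuristic: without tone marks, one run of vowel letters = one syllable.
# "ɑ" is the script-a letter that some Pinyin texts use instead of "a".
VOWEL_RUN = re.compile(r"[aeiouɑ]+")


def count_syllables(pinyin_word):
    """Estimate the number of syllables in one Pinyin word."""
    decomposed = unicodedata.normalize("NFD", pinyin_word.lower())
    bare = "".join(c for c in decomposed if not unicodedata.combining(c))
    return len(VOWEL_RUN.findall(bare))


def is_hanzi(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def segment_like_pinyin(hanzi, pinyin):
    """Insert spaces into `hanzi` at the word boundaries found in `pinyin`."""
    chars = [c for c in hanzi if not c.isspace()]
    words, i = [], 0
    for token in pinyin.split():
        needed = count_syllables(token)
        if needed == 0:
            continue
        word = ""
        # Punctuation before the next word stays attached to the previous one.
        while i < len(chars) and not is_hanzi(chars[i]):
            if words:
                words[-1] += chars[i]
            else:
                word += chars[i]
            i += 1
        # Take one character per syllable.
        while i < len(chars) and needed > 0:
            if is_hanzi(chars[i]):
                needed -= 1
            word += chars[i]
            i += 1
        if word:
            words.append(word)
    tail = "".join(chars[i:])  # closing punctuation, if any
    if tail:
        if words:
            words[-1] += tail
        else:
            words.append(tail)
    return " ".join(words)


print(segment_like_pinyin(
    "人人生而自由,在尊严和权利上一律平等。",
    "Rénrén shēng ér zìyóu, zài zūnyán hé quánlì shàng yīlǜ píngděng.",
))
```

Unlike the hand-segmented text above, this sketch also leaves a space after each punctuation mark, because it joins all words with single spaces.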

Reference of spoken language

In spoken language, a pause may occur between two words but not within a word, so it is natural to mark each word boundary in written language with a space.

Methods for identifying word boundaries are also described in the "Word boundaries" section of the Wikipedia article "Word".

Width of a space

The space between two words should be half the width of a Chinese character, which is narrower than the distance between two lines. Because the average Chinese word is about two characters long, a space as wide as a full Chinese character (wider than the inter-line distance) would make the lines of words look scattered rather than compact.[12]
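The effect of the space width on line length can be estimated with simple arithmetic. The sketch below assumes one space per word and ignores punctuation; the function name is hypothetical. With the figures given above (words of about two characters, half-width spaces) it yields an increase of about one quarter, which agrees with the "about 1/4" estimate listed among the disadvantages below, although whether that estimate was derived this way is an assumption.

```python
def extra_length_ratio(avg_word_chars, space_width_chars):
    """Approximate extra line length caused by inter-word spaces.

    Both arguments are measured in widths of one Chinese character.
    Assumes one space per word and ignores punctuation.
    """
    return space_width_chars / avg_word_chars


half_width = extra_length_ratio(avg_word_chars=2, space_width_chars=0.5)
full_width = extra_length_ratio(avg_word_chars=2, space_width_chars=1.0)
print(f"half-width spaces: {half_width:.0%} longer")
print(f"full-width spaces: {full_width:.0%} longer")
```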

Mark of proper nouns

To further help the reader, proper nouns should be marked as well, for example by underlining.[3] This is already done in the Holy Bible (Union Version with modern punctuation).[13]

Comments

Word-segmented writing has both advantages and disadvantages.

Advantages

The advantages of Chinese word-segmented writing include: [14]

  1. Word-segmented writing is beneficial to language expression and understanding.
  2. Word-segmented writing is beneficial to Chinese teaching and learning.
  3. Word-segmented writing is beneficial to linguistic research.
  4. Word-segmented writing is beneficial to the definition, segmentation and application of Chinese words.
  5. Word-segmented writing is beneficial to computer natural language processing.
  6. Word-segmented writing is beneficial to automatic conversion between pinyin and Chinese characters.
  7. Word-segmented writing is beneficial to simplified-traditional Chinese character conversion.
  8. Word-segmented writing is beneficial to proofreading articles and preventing typos.
  9. Word-segmented writing is beneficial to document typesetting.
  10. Word-segmented writing is beneficial to software Sinicization or Westernization.

Disadvantages

The disadvantages of Chinese word-segmented writing include:[3]

  1. Word-segmented writing takes up more space (about 1/4 more).
  2. People are not used to writing in this way.
  3. The writer needs to identify every word.
  4. Sentences do not look as tidy and neat as in the traditional format without spaces.
  5. Most Chinese words are one or two characters long, so it is not difficult to identify a word even without boundary marks.

Computer-based word segmentation

Until word-segmented writing becomes widespread, computer-based word segmentation is often used in language information processing. Its quality keeps improving, but the output still needs post-editing by humans, and it will never be as reliable as segmentation done by the author personally.[15][16]
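The sources cited here do not name a particular algorithm. As one illustration of why automatic output needs checking, the sketch below uses maximum matching, a classic dictionary-based baseline, in both its forward and backward variants. The three-word toy lexicon and the function names are assumptions made for this example.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: always take the longest known word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted as a fallback word.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """The same greedy strategy, scanning from the end of the text."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]


# Hypothetical toy lexicon for the ambiguous sentence from the introduction.
toy_lexicon = {"美国", "国会", "不同意"}

print(" ".join(forward_max_match("美国会不同意", toy_lexicon)))
print(" ".join(backward_max_match("美国会不同意", toy_lexicon)))
```

On this input the two scans disagree: the forward scan gives the "US will not agree" reading and the backward scan the "US Congress" reading. Neither variant has access to the author's intent, which is the point made above.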

References

  1. Chen, Liwei (陈力为) (1996). "汉语书面语的分词问题- - 一个有关全民的信息化问题 (Written Chinese Word Segmentation: An issue relevant to national information technology)". Journal of Chinese Information Processing (中文信息学报). 10 (1996) (1): 11–13.
  2. Zhang, Xiaoheng (张小衡) (1998). "也谈汉语书面语的分词问题——分词连写十大好处 (Written Chinese Word Segmentation Revisited: Ten advantages of word-segmented writing)". Journal of Chinese Information Processing (中文信息学报). 12 (1998) (3): 57–63.
  3. Chen 1996, p. 12.
  4. Chen, Liwei (陈力为) (1987). "当前中文信息处理中的几个问题及其发展前景 (Some issues in Chinese information processing and their perspective development)". Chinese Computer World (计算机世界). 21 (34).
  5. Zhang 1998, pp. 57–63.
  6. "English-Chinese/Word segmentation of Hanzi - Wikiversity".
  7. Peng, Zerun (彭泽润、李葆嘉 eds) (2000). 语言理论 (Language theories) (in Chinese). Changsha: 中南大学出版社 (Central South University Press). ISBN 978-7-810-61342-2.
  8. "教育部《重編國語辭典修訂本》2021".
  9. Zhang 1998, p. 61.
  10. "Basic rules of Chinese phonetic alphabet orthography". http://www.moe.gov.cn/ewebeditor/uploadfile/2015/01/13/20150113091717604.pdf
  11. "Universal Declaration of Human Rights - Chinese, Mandarin (Simplified)". unicode.org.
  12. Zhang 1998, p. 62.
  13. Chinese Baptist Press, Hong Kong (translation) (1998). 聖經 現代標點和合本 (Holy Bible, Union Version with modern punctuation) (in Chinese). Hong Kong: Chinese Baptist Press (浸信會出版社). ISBN 962-933-101-2.
  14. Zhang 1998, pp. 57–61.
  15. "Chinese Word Segmentation".
  16. Zhang 1998, p. 57.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.