Ping-ying (phonogram) based input methods are built into almost every operating system installed on modern PCs, which makes them a convenient way to type Chinese text. The problem is that the Chinese writing system contains many homophones, and even more candidates arise once words with approximate pronunciations are taken into account. As a consequence, electronic documents in Chinese contain many Ping-ying related text input errors.
Modern electronic document processing systems usually come with spell checking functions to assist users with their writing. Such functions work very well for alphabetic writing systems, like English, but usually fail to fulfill the same role for ideographic writing systems, like Chinese. The problem is rooted deeply in the different ways that alphabetic and ideographic writing systems treat their characters and words.
A Chinese character, like 字, can be a meaningful word by itself or just part of a longer word, like 字典. Educated readers are expected to work out which reading applies from the context of those characters. Thus, there are no firm rules that tell when a symbol should be treated as a character or as a word; readers must use the surrounding context to decide for themselves. This works well for Chinese-speaking people but poses great challenges for Chinese computing researchers.
Considerable research effort has been put into Chinese computing to take up those challenges. That work has focused mainly on Chinese input methods and Chinese word segmentation, and little of it addresses spell checking for electronic documents in Chinese. In this thesis, we propose a Ping-Ying corpus based Chinese spell checking method to take up this challenge.
The spell checking mechanism proposed in this master's thesis consists of two phases. In the first phase, we employ a simple word segmentation algorithm to break the document under consideration into sequences of words and translate those words into separate phonogram strings. In the second phase, we systematically reconstruct all possible homophones and words with approximate pronunciations from the phonogram strings collected in the first phase. By comparing the words built from the phonogram strings with the words collected in the first phase, we can make intelligent suggestions for possible text input errors.
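The two phases can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the toy lexicon, the character-to-phonogram table (toneless romanized syllables standing in for the phonogram strings), the forward maximum matching segmenter, and the list of confusable initials are all assumptions chosen only to make the idea concrete, and every function name is hypothetical.

```python
from itertools import product

# Toy data, assumed for illustration only (not the thesis corpus).
CHAR_PHONOGRAM = {"我": "wo", "查": "cha", "字": "zi",
                  "典": "dian", "點": "dian", "支": "zhi"}
LEXICON = {"我", "查", "字", "典", "點", "支", "字典"}
MAX_WORD_LEN = max(len(w) for w in LEXICON)

# Assumed pairs of initials treated as approximate pronunciations.
CONFUSABLE_INITIALS = [("zh", "z"), ("ch", "c"), ("sh", "s")]


def to_phonograms(word):
    """Translate a word into a tuple of per-character phonogram strings."""
    return tuple(CHAR_PHONOGRAM.get(ch, ch) for ch in word)


# Index from phonogram sequence to every lexicon word pronounced that way.
PHONOGRAM_INDEX = {}
for _word in LEXICON:
    PHONOGRAM_INDEX.setdefault(to_phonograms(_word), set()).add(_word)


def segment(text):
    """Phase 1: forward maximum matching against the lexicon."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            # Fall back to a single character when nothing longer matches.
            if text[i:i + n] in LEXICON or n == 1:
                words.append(text[i:i + n])
                i += n
                break
    return words


def syllable_variants(syllable):
    """Return the syllable plus its approximate pronunciations."""
    variants = {syllable}
    for a, b in CONFUSABLE_INITIALS:
        if syllable.startswith(a):
            variants.add(b + syllable[len(a):])
        elif syllable.startswith(b):
            variants.add(a + syllable[len(b):])
    return variants


def suggest(text):
    """Phase 2: rebuild words from phonograms and compare with the input."""
    words = segment(text)
    suggestions = []
    for i in range(len(words)):
        for n in range(1, len(words) - i + 1):
            chunk = "".join(words[i:i + n])
            if len(chunk) > MAX_WORD_LEN:
                break
            if chunk in LEXICON:
                continue  # already a known word, nothing to suggest
            candidates = set()
            keys = product(*(syllable_variants(s) for s in to_phonograms(chunk)))
            for key in keys:
                candidates |= PHONOGRAM_INDEX.get(key, set())
            if candidates:
                suggestions.append((chunk, sorted(candidates)))
    return suggestions


print(suggest("我查字點"))  # homophone error: 點 typed for 典
print(suggest("我查支典"))  # approximate pronunciation: zhi typed for zi
```

In this sketch a run of adjacent segments is flagged only when it is not itself a lexicon word but its phonogram sequence, or an approximate variant of it, matches one. A misused homophone that happens to form another valid word would pass unnoticed, so a practical checker would need additional context information.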