In this dissertation, a new approach is proposed to extract lexical knowledge from large text corpora. This approach not only avoids the human cost of compiling knowledge by hand, but also eliminates the inconsistency that human compilers introduce. It first partially parses the text corpora, and then uses simple rules or existing knowledge to build more useful lexical knowledge. We have used this approach to extract noun phrases, acquire predicate-argument structures, and classify adjectives.

A new model for machine translation systems is also proposed. This model integrates the qualitative and quantitative viewpoints on machine translation. Not only does it reduce the computation time for processing long sentences, but it also utilizes the lexical knowledge extracted from large corpora. The model still consists of three components, analysis, transfer, and synthesis, but their mechanisms differ from those of a traditional transfer-based translation system. The analysis module is a shallow parser, namely a probabilistic chunker. In addition, two knowledge extractors are proposed to obtain from the input sentence the knowledge indispensable for enhancing translation quality: the first acquires the noun phrases in the input sentences, and the second determines their predicate-argument structures.

To further improve the performance of the MT system, several supporting tools are also proposed. A sentence alignment method based on content tags is shown to be very useful for aligning sentences, and applying it benefits the acquisition of bilingual lexical knowledge. In addition, a topic identification algorithm is devised that serves not only anaphora resolution but also text domain identification.
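To make the proposed architecture concrete, the following is a minimal Python sketch of the pipeline outlined above: a probabilistic chunker standing in for the analysis module, the two knowledge extractors (noun phrases and predicate-argument structures), and a translate driver standing in for transfer and synthesis. It is not the dissertation's implementation; the function names (chunk_sentence, extract_noun_phrases, extract_pred_args, translate), the toy boundary probabilities, and the closed verb list are all illustrative assumptions.

# Minimal sketch of the proposed pipeline; all probabilities, word lists,
# and function names are illustrative placeholders, not the actual system.
from dataclasses import dataclass, field


@dataclass
class Chunk:
    words: list          # surface words grouped into one chunk
    label: str           # e.g. "NP" or "VP"


@dataclass
class Analysis:
    chunks: list = field(default_factory=list)
    noun_phrases: list = field(default_factory=list)
    pred_args: list = field(default_factory=list)   # (predicate, [arguments])


# Hypothetical chunk-boundary probabilities; a real probabilistic chunker
# would estimate these from a corpus.  A value above 0.5 starts a new chunk.
BOUNDARY_PROB = {"the": 0.9, "a": 0.9, "into": 0.8}
TOY_VERBS = {"groups", "maps", "aligns"}             # hypothetical verb list


def chunk_sentence(words):
    """Stand-in for the probabilistic chunker: start a new chunk wherever
    the toy boundary probability exceeds 0.5, and label a chunk VP if it
    contains a known verb, NP otherwise."""
    chunks, current = [], []

    def flush():
        if current:
            label = "VP" if any(w in TOY_VERBS for w in current) else "NP"
            chunks.append(Chunk(words=list(current), label=label))
            current.clear()

    for w in words:
        if current and BOUNDARY_PROB.get(w, 0.0) > 0.5:
            flush()
        current.append(w)
    flush()
    return chunks


def extract_noun_phrases(chunks):
    """First knowledge extractor: collect the NP chunks of the analysis."""
    return [" ".join(c.words) for c in chunks if c.label == "NP"]


def extract_pred_args(chunks):
    """Second knowledge extractor: pair each verbal chunk with the NP chunks
    of the sentence as a crude predicate-argument structure."""
    args = [" ".join(c.words) for c in chunks if c.label == "NP"]
    return [(" ".join(c.words), args) for c in chunks if c.label == "VP"]


def translate(words):
    """Analysis -> knowledge extraction -> transfer -> synthesis.  Transfer
    and synthesis would map the chunks and extracted structures into the
    target language; here we simply return the analysis."""
    chunks = chunk_sentence(words)
    return Analysis(
        chunks=chunks,
        noun_phrases=extract_noun_phrases(chunks),
        pred_args=extract_pred_args(chunks),
    )


if __name__ == "__main__":
    print(translate("the chunker groups words into phrases".split()))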