|
In this thesis, we present a method that not only segments characters in mixed Chinese/English printed documents but also splits touching characters. We divide the segmented components into five classes: Chinese character, alphanumeric character, punctuation mark, touching Chinese characters, and touching alphanumeric characters. First, to confirm a Chinese character formed by a single rectangle, we compute the horizontal and vertical stroke crossing counts as a measure of character complexity; if the complexity is high and the rectangle fits the square shape and the size of Chinese characters, we confirm it as a Chinese character. Second, to confirm a Chinese character formed by two rectangles, we check whether the joined rectangle also fits the square shape and the size of Chinese characters; if it does, we pass the larger rectangle to the alphanumeric OCR, and if the OCR rejects it, the joined rectangles are taken to be a Chinese character. Otherwise, the joined rectangles are not treated as a Chinese character. We determine the attribute of touching characters from their contextual relationships and from the information provided by the characters themselves. The character attributes are then used to judge and correct erroneous attributes. To segment touching characters, we adopt a two-stage strategy: the first stage searches for all possible cutting positions, and the second stage selects the proper cutting positions among them. In experiments on 12 documents containing 7378 characters, of which 5536 are Chinese characters and 1672 are alphanumeric characters, 14 Chinese characters and 35 alphanumeric characters were segmented erroneously. The correct rates are 99.7% for Chinese characters and 97.9% for alphanumeric characters, and the overall correct rate is 99.3%.
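As an illustration of the single-rectangle test described above, the following is a minimal sketch of a stroke crossing count computed on a binarized character image. It is not the implementation used in this thesis: the function names, the choice of the maximum crossing count as the complexity measure, and the threshold and tolerance values (`min_crossings`, `size_tol`, `aspect_tol`) are all assumptions made for the example.

```python
def crossings(line):
    """Count background-to-foreground (0 -> 1) transitions along one scan line."""
    count, prev = 0, 0
    for pixel in line:
        if pixel and not prev:
            count += 1
        prev = pixel
    return count


def complexity(img):
    """Return (max crossings over rows, max crossings over columns).

    img is a list of equal-length rows of 0/1 pixels (1 = ink).
    Using the maximum as the complexity measure is an assumption.
    """
    row_counts = [crossings(row) for row in img]
    col_counts = [crossings(col) for col in zip(*img)]
    return max(row_counts), max(col_counts)


def is_chinese_candidate(img, expected_size,
                         min_crossings=3, size_tol=0.2, aspect_tol=0.2):
    """Hypothetical decision rule: complex, roughly square, and of expected size."""
    height, width = len(img), len(img[0])
    squareish = abs(width - height) <= aspect_tol * max(width, height)
    right_size = abs(max(width, height) - expected_size) <= size_tol * expected_size
    max_rows, max_cols = complexity(img)
    # Assumption: high complexity in either scan direction is sufficient.
    complex_enough = max(max_rows, max_cols) >= min_crossings
    return squareish and right_size and complex_enough


# Toy 5x5 glyphs: a three-bar shape with a vertical stem, and a single stem.
THREE_BARS = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
]
SINGLE_STEM = [[0, 0, 1, 0, 0] for _ in range(5)]

print(complexity(THREE_BARS), is_chinese_candidate(THREE_BARS, expected_size=5))
print(complexity(SINGLE_STEM), is_chinese_candidate(SINGLE_STEM, expected_size=5))
```

In this sketch the three-bar glyph is crossed three times by a vertical scan line and is accepted as a Chinese candidate, while the single stem is crossed only once in each direction and is rejected even though its bounding rectangle is square and of the expected size.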
|