Asmus Freytag wrote on 11/9/2004, 1:55 AM:

> Note that their approach used n-grams in byte space. A 4-gram would be
> just a pair of DBCS characters, a 2-gram would effectively be a frequency
> table.

Not really. Only about half of the 4-grams in byte space will be a pair of DBCS characters. The other half will be one DBCS character flanked by the trail byte of the previous character and the lead byte of the next character. This won't be a big deal in the case of Shift-JIS or Big5, since their lead bytes and trail bytes use quite different ranges. However, for GB2312, EUC-KR, EUC-TW, and EUC-JP it is a big problem, since the lead byte and trail byte of the most common characters use the same range (0xa1-0xfe).

Received on Thursday, 11 November 2004 20:59:13 GMT
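
To make the alignment argument concrete, here is a minimal Python sketch. It is not from the original thread. It assumes pure double-byte GB2312 text with no ASCII mixed in, and the helper names (`four_grams`, `alignment_counts`, `all_in_euc_range`) are hypothetical, chosen only for this illustration.

```python
# Sketch: classify byte-space 4-grams of pure double-byte text by alignment.
# Assumption: every character encodes to exactly two bytes (no ASCII mixed in).

def four_grams(data: bytes):
    """Yield (offset, 4-byte window) for every window position in data."""
    for i in range(len(data) - 3):
        yield i, data[i:i + 4]

def alignment_counts(data: bytes):
    """Count windows that start on a character boundary vs. mid-character."""
    aligned = straddling = 0
    for offset, _ in four_grams(data):
        if offset % 2 == 0:
            aligned += 1      # lead, trail, lead, trail: a pair of characters
        else:
            straddling += 1   # trail, lead, trail, lead: one whole character
    return aligned, straddling

def all_in_euc_range(window: bytes) -> bool:
    """True if every byte lies in 0xA1-0xFE, the shared lead/trail range."""
    return all(0xA1 <= b <= 0xFE for b in window)

text = "中文编码检测"            # six hanzi, all present in GB2312
gb = text.encode("gb2312")       # 12 bytes
aligned, straddling = alignment_counts(gb)

# In GB2312 no window can be told apart by byte range alone, because
# lead and trail bytes both fall in 0xA1-0xFE.
euc_ambiguous = all(all_in_euc_range(w) for _, w in four_grams(gb))

print(aligned, straddling, euc_ambiguous)
```

For these six characters there are nine 4-byte windows: five start on a character boundary and four straddle three characters, and every window consists solely of bytes in 0xa1-0xfe, so byte values alone cannot reveal which kind of window you are looking at.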