- From: Andrzej Novosiolov <anovos@rs-ukraine.kiev.ua>
- Date: Fri, 24 Mar 2000 11:47:13 -0600
- To: html-tidy@w3.org
For now, Tidy can be used to clean up Cyrillic pages in -raw mode. But there is a way to preserve Cyrillic letters and still benefit from converting character codes 127-159 to the corresponding entities.

There are 5 different Cyrillic encodings used in web pages:

koi8-r - Cyrillic letter codes are in range 160-255
windows-1251 - in range 160-255
iso-8859-5 - in range 160-255
cp866 (or ibm866) - in range 128-175, 224-255
x-mac-cyrillic - in range 128-159, 224-255

The most widely used encodings are koi8-r and windows-1251; the other 3 are extremely rare. So I propose to add 2 more encodings:

CYRWIN for windows-1251
CYRKOI for koi8-r and iso-8859-5

and still use RAW for cp866 and x-mac-cyrillic.

The source changes are minimal (plus 2 more constants in html.h and handling of 2 more switches in tidy.c and config.c):

tidy.c, line 373:
-----------------
if (in->encoding == RAW || in->encoding == ISO2022 || in->encoding == CYRKOI)
    /* don't convert koi8-r and iso-8859-5 to Unicode */

pprint.c, line 407:
-------------------
if (MakeClean && CharEncoding != CYRKOI)
    /* CYRWIN may be cleaned up too */

pprint.c, line 483:
-------------------
if ((c < ' ' && c != '\t') || (c > 126 && c < 160) || (c > 255) || ((CharEncoding == ASCII) && (c > 126)))
    /* let's convert character codes 127-159 to entities for CYRWIN and CYRKOI */
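To make the effect of the pprint.c condition easier to see, here is a minimal standalone sketch of it as a predicate. It is not Tidy's actual code: the function name needs_entity is hypothetical, and the numeric values of the encoding constants are assumptions (the real ones would be defined in html.h). It mirrors only the quoted condition, not the RAW handling that surrounds it in pprint.c.

```c
#include <stdbool.h>

/* Encoding constants named as in the message; the numeric values are
   assumptions made for this sketch, not the ones in html.h. */
enum Encoding { RAW, ASCII, ISO2022, CYRWIN, CYRKOI };

/* Hypothetical helper mirroring the condition proposed for pprint.c:
   returns true when character code c should be written as an entity. */
bool needs_entity(unsigned int c, enum Encoding enc)
{
    if (c < ' ' && c != '\t')      /* control characters other than tab */
        return true;
    if (c > 126 && c < 160)        /* codes 127-159 */
        return true;
    if (c > 255)                   /* anything outside the 8-bit range */
        return true;
    if (enc == ASCII && c > 126)   /* ASCII output: escape every non-ASCII code */
        return true;
    return false;                  /* codes 160-255 pass through unchanged */
}
```

Under these assumptions, code 200 is left alone for both CYRWIN and CYRKOI, so Cyrillic letters in the 160-255 range survive, while code 150 (an en dash in windows-1251) is flagged for conversion to an entity.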
Received on Friday, 24 March 2000 13:12:54 UTC