- From: Terje Bless <link@tss.no>
- Date: Mon, 18 Jun 2001 17:39:12 +0200
- To: Christian Smith <csmith@barebones.com>
- cc: Martin Duerst <duerst@w3.org>, Nick Kew <nick@webthing.com>, Esmond Walshe <esmond.walshe@eeng.dcu.ie>, www-validator@w3.org
On 18.06.01 at 10:38, Christian Smith <csmith@barebones.com> wrote:

>On Monday, June 18, 2001 at 1:39 AM, link@tss.no (Terje Bless) wrote:
>
>>Right. We can't guarantee a lossless roundtrip to ISO 10646[0] so every
>>offset -- be it byte or character -- would need to be in terms of the
>>UTF-8 encoded ISO 10646 version of the file. How would that affect your
>>Error Browser Chris?
>
>Dealing with byte offsets would be nearly impossible. As long as I get
>back a character offset everything should be fine.

Yeah, but if your file is in MacRoman, will you be able to make use of a
character offset obtained _after_ we've converted it into some encoding of
Unicode?

>>Could we perhaps convert to Normalization Form C[2] and report Unicode
>>character offsets (or even bytes if it's easier) from beginning of file?
>
>I don't see how the character offset is going to be any different between
>UTF-8 and UTF-16. Byte based offsets would be different but character
>offsets should be the same.

Right. Did I imply anything else? :-)

The problem I'm envisioning is where there are several possible Unicode
forms of a given MacRoman (or ISO-whatever, or ...) character. Some
accented characters can be expressed either as a single precomposed code
point, or as a base character followed by one or more combining characters.

For instance, "LATIN CAPITAL LETTER A WITH RING ABOVE" ("Å") can be encoded
as U+00C5, or decomposed as U+0041 U+030A; "LATIN CAPITAL LETTER A" followed
by "COMBINING RING ABOVE". Which one you get is dependent on whether you use
Normalization Form C (precomposed characters) or Normalization Form D
(decomposed) or whatever random variant made sense at the time. :-|
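
To make the offset concern concrete, here is a minimal sketch in Python. It is only an illustration, not anything the validator does: the sample bytes and variable names are made up, and it assumes Python's standard `mac_roman` codec and `unicodedata` module. It decodes a short MacRoman string containing "Å" and compares where the "<" lands under each normalization form.

```python
import unicodedata

# Hypothetical sample input: in MacRoman, byte 0x81 is "Å".
raw = b"\x81 <p>"
text = raw.decode("mac_roman")

# Normalize the same text both ways.
nfc = unicodedata.normalize("NFC", text)  # "Å" stays one code point, U+00C5
nfd = unicodedata.normalize("NFD", text)  # "Å" becomes U+0041 U+030A

# Offset of "<" in the original bytes, and in code points per form.
source_byte_offset = raw.index(b"<")
nfc_char_offset = nfc.index("<")
nfd_char_offset = nfd.index("<")

# Byte offsets in the UTF-8 encoding of each form.
nfc_utf8_offset = nfc.encode("utf-8").index(b"<")
nfd_utf8_offset = nfd.encode("utf-8").index(b"<")

print(source_byte_offset, nfc_char_offset, nfd_char_offset)
print(nfc_utf8_offset, nfd_utf8_offset)
```

For this input the NFC character offset happens to match the position in the single-byte MacRoman source, while the NFD character offset is one higher, and both UTF-8 byte offsets differ from the source byte offset. A client working on the original file would therefore need to know which form the reported offsets refer to.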
Received on Monday, 18 June 2001 11:45:50 UTC