- From: Christian Smith <csmith@barebones.com>
- Date: Mon, 18 Jun 2001 11:56:16 -0400
- To: Terje Bless <link@tss.no>
- cc: Martin Duerst <duerst@w3.org>, Nick Kew <nick@webthing.com>, Esmond Walshe <esmond.walshe@eeng.dcu.ie>, www-validator@w3.org
On Monday, June 18, 2001 at 5:39 PM, link@tss.no (Terje Bless) wrote:

> On 18.06.01 at 10:38, Christian Smith <csmith@barebones.com> wrote:
>
> >On Monday, June 18, 2001 at 1:39 AM, link@tss.no (Terje Bless) wrote:
> >
> >>Right. We can't guarantee a lossless roundtrip to ISO 10646[0] so every
> >>offset -- be it byte or character -- would need to be in terms of the
> >>UTF-8 encoded ISO 10646 version of the file. How would that affect your
> >>Error Browser Chris?
> >
> >Dealing with byte offsets would be nearly impossible. as long as I get
> >back a character offset everything should be fine.
>
> Yeah, but if your file is in MacRoman, will you be able to make use of a
> character offset obtained _after_ we've converted it into some encoding of
> UNICODE?

I see no reason why not. Character X of the document should be the same
regardless of whether the file is in 7-bit ASCII or encoded as Unicode.
The byte offset will be different but not the character offset.

> The problem I'm envisioning is where there are several possible UNICODE
> forms of a given MacRoman (or ISO-whatever, or ...) character. For
> instance, some combination symbols can be expressed either as their own
> unique code point, or as a set of equivalent combination characters. For
> instance, "LATIN CAPITAL LETTER A WITH RING ABOVE" ("Å") can be encoded as
> U+00C5, or decomposited as U+0041 U+030A; "LATIN CAPITAL LETTER A" followed
> by "COMBINING RING ABOVE". Which one you get is dependant on whether you
> use Normalization Form C (use combination chars) or Normalization Form D
> (decompose) or whatever random variant made sense at the time. :-|

And would this cause you to report a different character offset, or would
you report the same character offset regardless?

In either case I'm not overly concerned at this point. The worst that
happens is that the insertion point may not be set quite right in some
cases. C'est la vie.
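
The difference between byte and character offsets discussed above, and the
effect of normalization on the latter, can be illustrated with a small
Python sketch. This is not the validator's code. The `offsets` helper and
the sample string are made up for illustration, and it assumes that a
"character offset" means a count of Unicode code points, which is only one
possible reading of the term.

```python
import unicodedata


def offsets(text, target):
    """Return (code point offset, UTF-8 byte offset) of target in text.

    Hypothetical helper: "character offset" is taken to mean the number
    of Unicode code points that precede the target.
    """
    char_offset = text.index(target)
    byte_offset = len(text[:char_offset].encode("utf-8"))
    return char_offset, byte_offset


# Sample text with two accented letters that MacRoman can also represent.
sample = "\u00c5ngstr\u00f6m unit"

nfc = unicodedata.normalize("NFC", sample)  # precomposed: U+00C5, U+00F6
nfd = unicodedata.normalize("NFD", sample)  # decomposed: U+0041 U+030A, ...

# MacRoman uses one byte per character, so byte and character offsets agree.
mac_offset = sample.encode("mac_roman").index(b"unit")

print("MacRoman offset:", mac_offset)
print("NFC (chars, UTF-8 bytes):", offsets(nfc, "unit"))
print("NFD (chars, UTF-8 bytes):", offsets(nfd, "unit"))
```

Under these assumptions the MacRoman offset and the NFC code point offset
agree, while the NFD code point offset is larger by one for every
decomposed character that precedes the target. An offset counted after
conversion would then point slightly past the intended position in the
original file, which is the "insertion point not set quite right" case.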
-- 
Christian Smith  |  csmith@barebones.com  |  http://web.barebones.com
He who dies with the most friends... Is still dead!
Received on Monday, 18 June 2001 11:56:20 UTC