- From: Terje Bless <link@tss.no>
- Date: Mon, 18 Jun 2001 01:39:01 +0200
- To: Martin Duerst <duerst@w3.org>
- cc: Christian Smith <csmith@barebones.com>, Nick Kew <nick@webthing.com>, Esmond Walshe <esmond.walshe@eeng.dcu.ie>, www-validator@w3.org
On 14.06.01 at 12:01, Martin Duerst <duerst@w3.org> wrote:

>At 13:58 01/06/13 -0400, Christian Smith wrote:
>>start offset (the offset of the beginning of the error)
>>end offset (the offset of the end of the error)
>>
>>Note: These might be the same but would preferably specify a range. In
>>any event the value should be the character offset from the beginning of
>>the file.
>
>Please be very careful with this. We currently get a character position,
>but please don't confuse this with what you probably are looking for (as
>you are speaking about non-opened files, I guess it could be byte
>positions). Once the data is converted from an arbitrary encoding to
>utf-8, byte positions are pretty much lost. It's not completely impossible
>to get them back, but it's a lot of dirty work.

Right. We can't guarantee a lossless roundtrip to ISO 10646[0], so every offset -- be it byte or character -- would need to be in terms of the UTF-8 encoded ISO 10646 version of the file. How would that affect your Error Browser, Chris?

OTOH, we should be able to get lossless roundtrips for US-ASCII, ISO-8859-(1-5,7,9,10,13-16) and Windows-125(0,1,2) unless I'm misremembering badly here[1]. However, that would require some special cases for the app.

Could we perhaps convert to Normalization Form C[2] and report Unicode character offsets (or even bytes if it's easier) from beginning of file? That would require that applications using the interface perform the same transformation on the source file if they want to do more than just display the results, and this would have the same roundtrip problem that we have.

[0] - Because there are several one-to-many mappings when mapping into Unicode as well as many-to-one when mapping in the other direction, and I think there is even an overlap between the two groups.

[1] - Unfortunately, I think MacRoman isn't possible and I'm unsure about the rest of the MacFoo character repertoires.

[2] - I think Charlie does this (Martin?), but there are a few other libs that can do it and it's not impossible -- though not my idea of a fun time -- to roll our own normalizer for just NFC (or NFKC?).
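
To make the offset problem concrete, here is a minimal Python sketch. It is not validator code: the helper name `offsets` and the sample string are invented for illustration, and it assumes the consumer can run the same NFC normalization that the reporting side used. It shows that one error position yields different numbers depending on whether you count characters or bytes, in which encoding, and before or after normalization.

```python
import unicodedata


def offsets(text: str, needle: str, encoding: str = "utf-8"):
    """Return (character offset, byte offset) of the first `needle` in `text`.

    Hypothetical helper: the byte offset is counted in `encoding`.
    """
    char_off = text.index(needle)
    byte_off = len(text[:char_off].encode(encoding))
    return char_off, byte_off


# Sample input (an assumption): "e" followed by U+0301 COMBINING ACUTE ACCENT,
# i.e. a decomposed e-acute, followed by a tag where an error might be reported.
source = "cafe\u0301 <b>"

# NFC composes "e" + U+0301 into the single character U+00E9.
nfc = unicodedata.normalize("NFC", source)

print(offsets(source, "<", "utf-8"))    # un-normalized text, UTF-8 bytes
print(offsets(nfc, "<", "utf-8"))       # NFC text, UTF-8 bytes
print(offsets(nfc, "<", "iso-8859-1"))  # NFC text, ISO-8859-1 bytes
```

The three calls give three different offset pairs for the same "<", so an offset reported against the NFC, UTF-8 form is only usable by an application that applies the same transformation to its own copy of the file before seeking.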
Received on Sunday, 17 June 2001 20:04:35 UTC