Re: Producing an XML report from the validator from Terje Bless on 2001-06-17 (www-validator@w3.org from June 2001)

From: Terje Bless <link@tss.no>
Date: Mon, 18 Jun 2001 01:39:01 +0200
To: Martin Duerst <duerst@w3.org>
cc: Christian Smith <csmith@barebones.com>, Nick Kew <nick@webthing.com>, Esmond Walshe <esmond.walshe@eeng.dcu.ie>, www-validator@w3.org
Message-ID: <20010618020427-b01010704-f205f45a-0910-010c@192.146.238.91>

On 14.06.01 at 12:01, Martin Duerst <duerst@w3.org> wrote:

>At 13:58 01/06/13 -0400, Christian Smith wrote:
>>start offset    (the offset of the beginning of the error)
>>end offset              (the offset of the end of the error)
>>
>>Note: These might be the same but would preferably specify a range. In
>>any event the value should be the character offset from the beginning of
>>the file.
>
>Please be very careful with this. We currently get a character position,
>but please don't confuse this with what you probably are looking for (as
>you are speaking about non-opened files, I guess it could be byte
>positions). Once the data is converted from an arbitrary encoding to
>utf-8, byte positions are pretty much lost. It's not completely impossible
>to get them back, but it's a lot of dirty work.

Right. We can't guarantee a lossless roundtrip to ISO 10646[0] so every
offset -- be it byte or character -- would need to be in terms of the UTF-8
encoded ISO 10646 version of the file. How would that affect your Error
Browser, Chris? OTOH, we should be able to get lossless roundtrips for
US-ASCII, ISO-8859-(1-5,7,9,10,13-16) and Windows-125(0,1,2) unless I'm
misremembering badly here[1]. However that would require some special cases
for the app.
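
To make the roundtrip and offset question concrete, here is a minimal sketch in Python. It is an illustration only, not what the validator does: the function names are made up for this example, and it assumes the source encoding is known and is one Python's codecs support without a byte-order mark.

```python
def roundtrips_losslessly(raw: bytes, encoding: str) -> bool:
    """Return True if decoding and re-encoding reproduces the same bytes.

    Hypothetical helper: a False result means byte offsets in the
    original file cannot be recovered from the decoded text alone.
    """
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False


def char_to_byte_offset(raw: bytes, encoding: str, char_offset: int) -> int:
    """Map a character offset in the decoded text to a byte offset.

    Hypothetical helper: only meaningful when the roundtrip is lossless,
    and assumes the codec does not prepend a byte-order mark.
    """
    text = raw.decode(encoding)
    return len(text[:char_offset].encode(encoding))


latin1 = "na\u00efve".encode("iso-8859-1")
utf8 = "na\u00efve".encode("utf-8")

# Same character offset, different byte offsets depending on the encoding.
latin1_offset = char_to_byte_offset(latin1, "iso-8859-1", 3)
utf8_offset = char_to_byte_offset(utf8, "utf-8", 3)
```

With a single-byte encoding the two kinds of offset coincide, which is why those encodings could be special-cased; with UTF-8 they drift apart after the first non-ASCII character.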

Could we perhaps convert to Normalization Form C[2] and report Unicode
character offsets (or even bytes if it's easier) from beginning of file?
That would require that applications using the interface perform the same
transformation on the source file if they want to do more than just display
the results and this would have the same roundtrip problem that we have.
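
A small sketch of why the client would have to apply the same transformation, using Python's standard unicodedata module. The sample string and the helper name are assumptions made up for illustration; nothing here reflects the validator's actual code.

```python
import unicodedata


def offset_of(marker: str, text: str, form: str = "") -> int:
    """Return the character offset of marker, optionally after normalizing.

    Hypothetical helper: form is a normalization form name such as "NFC",
    or the empty string to leave the text untouched.
    """
    if form:
        text = unicodedata.normalize(form, text)
    return text.index(marker)


# 'e' followed by U+0301 COMBINING ACUTE ACCENT, then a tag to locate.
source = "cafe\u0301 <b>"

raw_offset = offset_of("<", source)         # offset in the file as stored
nfc_offset = offset_of("<", source, "NFC")  # offset after NFC composition

# UTF-8 byte offsets differ as well, since NFC merges the two code points.
nfc_text = unicodedata.normalize("NFC", source)
raw_bytes = len(source[:raw_offset].encode("utf-8"))
nfc_bytes = len(nfc_text[:nfc_offset].encode("utf-8"))
```

An offset reported against the NFC text points one character too early in the unnormalized file here, so a client that skips the normalization step would highlight the wrong position.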

[0] - Because there are several one-to-many mappings when mapping into
      Unicode as well as many-to-one when mapping in the other direction,
      and I think there is even an overlap between the two groups.

[1] - Unfortunately, I think MacRoman isn't possible and I'm unsure
      about the rest of the MacFoo character repertoires.

[2] - I think Charlie does this (Martin?), but there are a few other
      libs that can do it and it's not impossible -- though not my idea
      of a fun time -- to roll our own normalizer for just NFC (or NFKC?).

Received on Sunday, 17 June 2001 20:04:35 UTC