Re: Producing an XML report from the validator

On 18.06.01 at 10:38, Christian Smith <csmith@barebones.com> wrote:

>On Monday, June 18, 2001 at 1:39 AM, link@tss.no (Terje Bless) wrote:
>
>>Right. We can't guarantee a lossless roundtrip to ISO 10646[0] so every
>>offset -- be it byte or character -- would need to be in terms of the
>>UTF-8 encoded ISO 10646 version of the file. How would that affect your
>>Error Browser Chris?
>
>Dealing with byte offsets would be nearly impossible. As long as I get
>back a character offset everything should be fine.

Yeah, but if your file is in MacRoman, will you be able to make use of a
character offset obtained _after_ we've converted it into some encoding of
UNICODE?


>>Could we perhaps convert to Normalization Form C[2] and report Unicode
>>character offsets (or even bytes if it's easier) from beginning of file?
>
>I don't see how the character offset is going to be any different between
>UTF-8 and UTF-16. Byte based offsets would be different but character
>offsets should be the same.

Right. Did I imply anything else? :-)
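
To make the byte-versus-character distinction concrete, here is a minimal
Python sketch. The sample string is an illustrative assumption and not
validator output, and "character offset" here means a count of Unicode code
points from the start of the text.

```python
# Sample text "xÅy", with Å as the single code point U+00C5.
text = "x\u00C5y"

# Character (code point) offset of "y"; independent of the encoding used.
char_offset = text.index("y")

# Byte offset of "y" once the same text is serialized in each encoding.
utf8_byte_offset = len(text[:char_offset].encode("utf-8"))       # x=1, Å=2
utf16_byte_offset = len(text[:char_offset].encode("utf-16-be"))  # x=2, Å=2

print(char_offset, utf8_byte_offset, utf16_byte_offset)  # 2 3 4
```

The character offset survives a trip through either encoding, while the
byte offsets only make sense for the encoding they were measured in.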


The problem I'm envisioning is where there are several possible UNICODE
forms of a given MacRoman (or ISO-whatever, or ...) character. Some
accented characters can be expressed either as their own precomposed code
point, or as a base character followed by one or more combining
characters. For instance, "LATIN CAPITAL LETTER A WITH RING ABOVE" ("Å")
can be encoded as U+00C5, or decomposed as U+0041 U+030A: "LATIN CAPITAL
LETTER A" followed by "COMBINING RING ABOVE". Which one you get is
dependent on whether you use Normalization Form C (compose into
precomposed characters), Normalization Form D (decompose), or whatever
random variant made sense at the time. :-|
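
A small Python sketch of how the offsets can drift, using the standard
unicodedata module. The three-byte MacRoman "file" is a made-up example, and
nothing here reflects what the validator actually does; it only shows that
the character offset of the same "x" depends on the normalization form.

```python
import unicodedata

# Hypothetical MacRoman file contents: "Å", "Å", "x" (0x81 is Å in MacRoman).
raw = b"\x81\x81x"
text = raw.decode("mac_roman")

# The same text in composed (NFC) and decomposed (NFD) form.
nfc = unicodedata.normalize("NFC", text)
nfd = unicodedata.normalize("NFD", text)

print(raw.index(b"x"))           # 2: offset in the original MacRoman bytes
print(len(nfc), nfc.index("x"))  # 3 2: each Å is one code point, U+00C5
print(len(nfd), nfd.index("x"))  # 5 4: each Å is U+0041 followed by U+030A
```

In this example the NFC offset happens to match the offset in the original
single-byte file, while the NFD offset does not.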

Received on Monday, 18 June 2001 11:45:50 UTC