Re: Bytes, character encodings and characters

From: Johannes Koch <johannes.koch@fit.fraunhofer.de>
Date: Wed, 10 May 2006 09:38:28 +0200
Message-ID: <44619874.7040104@fit.fraunhofer.de>
To: public-wai-ert@w3.org

Shadi Abou-Zahra wrote:
> Just to have a practical example:
> Say we have a string of 2 characters that are above U+007F (but let's 
> still represent the string as "AB"). We want to store this string as a 
> snippet and point to the second character ("B") as the start of an 
> error. Let's say we start counting characters at 0, and so it would have 
> charOffset of 1.
> Now we serialize this string as ASCII (despite its bad 
> internationalization support) and get something like "&U0080;&U0081;" 
> (use your imagination).

&#x0080;&#x0081; :-)

> Would charOffset remain 1 (resolve character 
> references first, then count) or change to 7 (count in actual ASCII 
> characters)?

It's still charOffset 1. The EARL-reading XML-aware software will read 
&#x0080;&#x0081; and then create the two characters U+0080 and U+0081.
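
A minimal Python sketch of this resolve-then-count behaviour, for illustration only. The wrapper element name "snippet" is an assumption (it is not taken from the EARL schema); any XML parser that resolves character references should give the same result.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element; the name "snippet" is an assumption.
serialized = "<snippet>&#x0080;&#x0081;</snippet>"

# The parser resolves the two character references into two characters.
resolved = ET.fromstring(serialized).text

# Offset of the second character, counted in resolved characters (0-based).
char_offset = resolved.index("\u0081")

# For comparison: offset of the second reference in the serialized content.
raw_offset = serialized.index("&#x0081;") - len("<snippet>")
```

With this spelling of the references, char_offset is 1 and raw_offset is 8 (it would be 7 with the imaginary "&U0080;" form above).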

> I'm assuming the first approach (resolve then count) but we need to 
> agree on this and document it for others.


A different case:
The text snippet is 'foo&Ouml;bar'.
The XML for this is




We want to point to the 'b'. Now the charOffset is 9 because the 
resolved characters preceding the b are 'foo&Ouml;', not 'fooÖ'.
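
A small Python sketch of this case, under the same assumption of a hypothetical "snippet" wrapper element. It only illustrates that an escaped ampersand resolves back to the literal text '&Ouml;', which then counts as six characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The snippet as literal text: the page source contained "&Ouml;" itself.
literal = "foo&Ouml;bar"

# Serializing as XML escapes the ampersand ("snippet" is a hypothetical name).
xml_text = "<snippet>" + escape(literal) + "</snippet>"

# Parsing resolves "&amp;" back to "&"; "Ouml;" is left as plain characters.
resolved = ET.fromstring(xml_text).text

# 0-based offset of the 'b', counted in resolved characters.
char_offset = resolved.index("b")
```

Here resolved equals the original literal text, and char_offset is 9.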

> Also the fact that we start 
> counting strings at 0 (or 1 if people prefer).

Yep, that's important. The same goes for line and column numbers.
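
To show why the base has to be agreed on, here is a rough Python sketch that derives line and column numbers from a 0-based character offset. The function name and the "base" parameter are made up for this example, and it assumes "\n" line endings.

```python
def line_and_column(text, char_offset, base=0):
    """Return (line, column) for a 0-based char_offset, numbered from base."""
    before = text[:char_offset]
    # Number of line breaks before the offset gives the 0-based line.
    line = before.count("\n")
    # Characters since the last line break give the 0-based column.
    column = char_offset - (before.rfind("\n") + 1)
    return line + base, column + base


zero_based = line_and_column("ab\ncd", 4)
one_based = line_and_column("ab\ncd", 4, base=1)
```

The same 'd' is reported as (1, 1) when counting from 0 and as (2, 2) when counting from 1, so a report is ambiguous unless the base is documented.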

Johannes Koch - Competence Center BIKA
Fraunhofer Institute for Applied Information Technology (FIT.LIFE)
Schloss Birlinghoven, D-53757 Sankt Augustin, Germany
Phone: +49-2241-142628
Received on Wednesday, 10 May 2006 07:39:15 UTC