Re: Bytes, character encodings and characters

Shadi Abou-Zahra wrote:
> 
> Just to have a practical example:
> 
> Say we have a string of 2 characters that are above U007F (but let's 
> still represent the string as "AB"). We want to store this string as a 
> snippet and point to the second character ("B") as the start of an 
> error. Let's say we start counting characters at 0, and so it would have 
> charOffset of 1.
> 
> Now we serialize this string as ASCII (despite its bad 
> internationalization support) and get something like "&U0080;&U0081;" 
> (use your imagination).

&#x80;&#x81; :-)

> Would charOffset remain 1 (resolve character 
> references first, then count) or change to 7 (count in actual ASCII 
> characters)?

It's still charOffset 1. The EARL-reading XML-aware software will read 
&#x80;&#x81; and then create the two characters U0080 and U0081.
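
A minimal sketch of "resolve, then count" in Python, using the standard 
library XML parser. The element name 'snippet' is only a placeholder, and 
the numeric character references stand in for the two characters from the 
example above; neither is taken from the EARL schema.

```python
import xml.etree.ElementTree as ET

# ASCII-only serialization: both characters are written as numeric
# character references. 'snippet' is a hypothetical element name.
serialized = b'<snippet>&#x80;&#x81;</snippet>'

# The XML parser resolves the references into real characters.
resolved = ET.fromstring(serialized).text

# Offset of the second character, counted on the resolved text (0-based).
char_offset_resolved = resolved.index('\u0081')

# For comparison: offset of the second reference in the raw ASCII markup.
# With references of this length it is 6; a longer notation gives a
# larger number.
raw = '&#x80;&#x81;'
char_offset_raw = raw.index('&#x81;')

print(len(resolved), char_offset_resolved, char_offset_raw)
```

Counting on the resolved text keeps the offset independent of how the 
string happened to be serialized.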

> I'm assuming the first approach (resolve then count) but we need to 
> agree on this and document it for others.

OK

A different case:
The text snippet is 'fooÖbar'.
The XML for this is

   <earl:textSnippet><![CDATA[foo&Ouml;bar]]></earl:textSnippet>

or

   <earl:textSnippet>foo&amp;Ouml;bar</earl:textSnippet>

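We want to point to the 'b'. Here the charOffset is 9, because after XML 
parsing the characters preceding the 'b' are 'foo&Ouml;' (9 characters), 
not 'fooÖ' (4 characters). Both forms above carry the literal text 
'&Ouml;', which the XML parser does not resolve any further.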
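
As a rough check in Python with the standard library parser: both XML 
forms should yield the same text, and the offset of 'b' is counted on 
that text. The namespace URI below is a placeholder assumed for the 
sketch, not the real EARL namespace.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (an assumption for this sketch only).
NS = 'http://example.org/earl#'

cdata_form = (
    f'<earl:textSnippet xmlns:earl="{NS}">'
    '<![CDATA[foo&Ouml;bar]]></earl:textSnippet>'
)
escaped_form = (
    f'<earl:textSnippet xmlns:earl="{NS}">'
    'foo&amp;Ouml;bar</earl:textSnippet>'
)

# After parsing, both forms contain the literal characters '&Ouml;'.
text_cdata = ET.fromstring(cdata_form).text
text_escaped = ET.fromstring(escaped_form).text

# 0-based offset of the 'b', counted on the parsed text.
char_offset = text_cdata.index('b')

print(text_cdata == text_escaped, char_offset)
```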

> Also the fact that we start 
> counting strings at 0 (or 1 if people prefer).

Yep, that's important. The same goes for line and column numbers.

-- 
Johannes Koch - Competence Center BIKA
Fraunhofer Institute for Applied Information Technology (FIT.LIFE)
Schloss Birlinghoven, D-53757 Sankt Augustin, Germany
Phone: +49-2241-142628

Received on Wednesday, 10 May 2006 07:39:15 UTC