Bytes, character encodings and characters

Hi group,

I didn't want to be rude, but I really cannot see a problem with the 
textContent property. So let me try to clarify my view.

1. I make a request for a resource I want to check.
2. I get a response containing a sequence of bytes and, if it is a text 
resource, hopefully a character encoding (CE) via some metadata 
(Content-Type header in HTTP). Otherwise I use a default CE.
3. I use that CE (specified or default; call it CE1) to transform the 
sequence of bytes into a sequence of characters. From now on, I'm on the 
character level; there are no bytes around anymore.
4. I extract a text snippet from the resource characters.
5. I create an EARL report containing the snippet. I'm still on the 
character level.
6. I want to store the EARL report on the file system or send it over 
the network. Therefore I have to transform the EARL report characters 
into a sequence of bytes using a character encoding CE2. CE2 is not 
required to be the same as CE1. However, CE2 must be able to represent 
every character in the EARL report.
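
To make steps 2 to 6 concrete, here is a rough sketch in Python. The 
function names, the sample text, the <snippet> element and the 
ISO-8859-1 default are only my assumptions for illustration; nothing 
here is defined by EARL.

```python
from typing import Optional


def decode_resource(body: bytes, declared_ce: Optional[str],
                    default_ce: str = "iso-8859-1") -> str:
    """Steps 2-3: pick CE1 (declared or default) and turn bytes into characters."""
    ce1 = declared_ce or default_ce
    return body.decode(ce1)


def encode_report(report: str, ce2: str) -> bytes:
    """Step 6: turn report characters into bytes using CE2.

    Raises UnicodeEncodeError if CE2 cannot represent a character.
    """
    return report.encode(ce2)


# Hypothetical response body: "Gr", u-umlaut, sharp s, "e ..." stored with CE0.
body = "Gr\u00fc\u00dfe aus Sankt Augustin".encode("iso-8859-1")

text = decode_resource(body, "iso-8859-1")   # character level from here on
snippet = text[:5]                           # step 4: five characters, not five bytes
report = "<snippet>" + snippet + "</snippet>"  # step 5: still characters

stored = encode_report(report, "utf-8")      # step 6: CE2 differs from CE1, no harm done

# A CE2 without mappings for all report characters fails loudly.
try:
    encode_report(report, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The snippet survives unchanged although CE1 and CE2 differ, because 
everything between decoding and encoding happens on characters.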

Of course, there is a step #0 prior to #1:
0. An author creates a text document by writing characters, then stores 
the document on the web server file system by transforming the 
characters into a sequence of bytes using a character encoding CE0.

It may also happen that the document is created by merging different 
sources on the byte level instead of the character level. Then there is 
a problem when source1 uses a different CE than source2: no single CE 
will transform the merged bytes into the intended sequence of 
characters. But this is a problem with resource document creation, not 
with EARL report creation.
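
A small Python sketch of that byte-level merge problem; the two sample 
strings and the choice of UTF-8 and ISO-8859-1 are my assumptions, 
picked only to show the effect.

```python
# Two sources stored with different character encodings.
source1 = "caf\u00e9".encode("utf-8")        # e-acute becomes two bytes: C3 A9
source2 = "na\u00efve".encode("iso-8859-1")  # i-diaeresis becomes one byte: EF

# Merging on the byte level.
merged_bytes = source1 + b" " + source2

# Decoding everything as UTF-8 fails: EF starts a multi-byte sequence,
# but the byte that follows is not a continuation byte.
try:
    merged_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding everything as ISO-8859-1 never fails, but source1 is garbled:
# C3 A9 comes out as two characters instead of one e-acute.
as_latin1 = merged_bytes.decode("iso-8859-1")

# Merging on the character level: decode each source with its own CE first.
merged_chars = source1.decode("utf-8") + " " + source2.decode("iso-8859-1")
```

Only the character-level merge gives the intended text; neither single 
CE recovers it from the merged bytes.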
-- 
Johannes Koch - Competence Center BIKA
Fraunhofer Institute for Applied Information Technology (FIT.LIFE)
Schloss Birlinghoven, D-53757 Sankt Augustin, Germany
Phone: +49-2241-142628

Received on Wednesday, 3 May 2006 15:47:01 UTC