Re: UTF-8 encoding error from Steven Pemberton on 2023-05-30 (public-ixml@w3.org from May 2023)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Tue, 30 May 2023 09:50:38 +0000
To: "Norm Tovey-Walsh" <norm@saxonica.com>, public-ixml@w3.org
Message-Id: <1685437862185.320233974.1989910658@cwi.nl>

On Monday 29 May 2023 18:11:31 (+02:00), Norm Tovey-Walsh wrote:

> > I got the first submission to my processor this week with a UTF-8
> > encoding error, which managed to hang the processor.
> 
> Curiously, I have no trouble with the grammar. But I also haven’t
> provided any way for the user to specify an encoding, so I’m not sure
> what Java is doing.

Yes, I have long laboured over how to properly deal with Unicode encoding errors.

My ixml system asks my Unicode decoder for the next Unicode character.
In UTF-8, each Unicode character is encoded as a variable-length sequence of one to four bytes:

 input: u*.
 u: u1; u2; u3; u4.

 u1: s1.             {ascii}
 u2: s2, s0.         {#80-#7FF}
 u3: s3, s0, s0.     {#800-#FFFF}
 u4: s4, s0, s0, s0. {#10000-#10FFFF}

 s1: [#0-#7F]. {Single byte Unicode characters are just the ASCII characters}
 s2: [#C0-#DF].
 s3: [#E0-#EF].
 s4: [#F0-#F7].

 s0: [#80-#BF]. {Continuation bytes can never start a Unicode character}
 {#F8-#FF are illegal anywhere}

Currently my system only looks at the first byte, and returns a string of the length that byte implies; it doesn't check that the following bytes are s0's. If it finds an illegal start byte, it just returns an empty string, and lets the caller deal with the problem. In my case I report an error, and skip that byte.
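
As a rough illustration only, the behaviour described above could be sketched in Python as follows. This is not the actual decoder: the function names (sequence_length, next_char, scan) and the idea of collecting error offsets in a list are assumptions made for the example. As in the description, only the first byte is inspected, and the continuation bytes are not checked.

```python
def sequence_length(lead: int) -> int:
    """Length implied by the lead byte, or 0 if it cannot start a character."""
    if lead <= 0x7F:
        return 1  # s1: ASCII
    if 0xC0 <= lead <= 0xDF:
        return 2  # s2
    if 0xE0 <= lead <= 0xEF:
        return 3  # s3
    if 0xF0 <= lead <= 0xF7:
        return 4  # s4
    return 0      # s0 (#80-#BF) or #F8-#FF: illegal as a start byte


def next_char(data: bytes, pos: int) -> bytes:
    """Return the bytes of the next character, judged by the lead byte alone.

    An empty result means the byte at pos is an illegal start byte.
    The following bytes are NOT checked to be continuation bytes.
    """
    return data[pos:pos + sequence_length(data[pos])]


def scan(data: bytes):
    """Hypothetical caller: collect characters, record and skip bad bytes."""
    chars, errors, pos = [], [], 0
    while pos < len(data):
        chunk = next_char(data, pos)
        if not chunk:
            errors.append(pos)  # report the error ...
            pos += 1            # ... and skip that one byte
        else:
            chars.append(chunk)
            pos += len(chunk)
    return chars, errors


print(scan(b"a\xffb"))  # prints ([b'a', b'b'], [1])
```

Note that the caller has to advance past the offending byte itself; a caller that retries at the same position after an empty result would never make progress.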

I could imagine that other systems just return the erroneous byte, which is in the Latin-1 range.
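
A minimal sketch of what such a fallback might look like, again in Python and purely as an assumption about how another system could behave (the function name decode_with_latin1_fallback is made up for the example). Unlike a decoder that trusts the first byte, this version lets Python's strict UTF-8 codec validate each sequence, and passes any byte that fails through as the character with the same code point.

```python
def decode_with_latin1_fallback(data: bytes) -> str:
    """Decode UTF-8, passing each erroneous byte through as a Latin-1 character."""
    out, pos = [], 0
    while pos < len(data):
        lead = data[pos]
        # Length implied by the lead byte; 0 means it cannot start a character.
        if lead <= 0x7F:
            length = 1
        elif 0xC0 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF7:
            length = 4
        else:
            length = 0
        try:
            if length == 0:
                raise ValueError("illegal start byte")
            # Strict decoding rejects bad continuation bytes and truncation.
            out.append(data[pos:pos + length].decode("utf-8"))
            pos += length
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            out.append(chr(lead))  # bytes #80-#FF map to U+0080-U+00FF
            pos += 1
    return "".join(out)


# Latin-1 encoded input comes through unchanged instead of failing.
assert decode_with_latin1_fallback(b"\xe9t\xe9") == "\u00e9t\u00e9"
```

The cost of this design is that a genuinely corrupt UTF-8 file is accepted silently, with stray Latin-1 characters in the result rather than a reported error.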

Steven

Received on Tuesday, 30 May 2023 09:50:47 UTC