Re: EXPath binary module - comments - bin:decode-string() from Michael Sokolov on 2013-08-05 (public-expath@w3.org from August 2013)

From: Michael Sokolov <sokolov@falutin.net>
Date: Mon, 05 Aug 2013 10:06:45 -0400
To: John Lumley <john@saxonica.com>
Cc: EXPath ML <public-expath@w3.org>
Message-id: <51FFB175.1050401@falutin.net>

By far the largest source of decoding errors I've seen is the confusion
of various 1-byte character sets (like windows-1252 or ISO 8859-1) with
UTF-8.  In this case the majority of characters (the ASCII range) will
decode identically, but there will be occasional errors. It's reasonable
either to provide some sort of error recovery, i.e. to insert
replacement characters where there are errors, or to simply raise an
error.  Both behaviors are useful in different circumstances.
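
To make the two behaviors concrete, here is a rough sketch in Python (used
only as a stand-in; this is not the EXPath API, and the sample string is
made up). It encodes text as windows-1252, then decodes it as UTF-8 both
strictly and with replacement:

```python
# Text encoded as windows-1252 but mistakenly decoded as UTF-8.
data = "café au lait".encode("cp1252")  # the 'é' becomes the single byte 0xE9

# Strict decoding: an error is raised at the first invalid byte.
try:
    data.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc  # exc.start holds the offending byte offset

# Recovering decoding: the invalid byte becomes U+FFFD and decoding continues.
recovered = data.decode("utf-8", errors="replace")
```

The ASCII characters come through unchanged in both cases; only the
accented character is affected.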

Given the API as currently specified, I think strict checking is the 
only choice.

It seems this spec is getting close to baked, but perhaps in some future 
revision of this spec, we might want to consider adding error recovery?

If we provide error recovery, there must be some way to indicate that
an error occurred, even while returning a value.  We could return a
sequence of (decoded-string, result-status), for example.
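
A minimal sketch of that shape, again in Python rather than XPath. The
function name and the "ok"/"recovered" status values are hypothetical,
chosen only for illustration:

```python
def decode_with_status(data: bytes, encoding: str = "utf-8"):
    """Return (decoded_string, result_status). Hypothetical helper."""
    try:
        # Strict pass first: if it succeeds there is nothing to report.
        return data.decode(encoding), "ok"
    except UnicodeDecodeError:
        # Fall back to replacement characters, but flag that recovery happened.
        return data.decode(encoding, errors="replace"), "recovered"
```

The caller always gets a string back and can inspect the second item to
decide whether the result is trustworthy.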

Another useful feature here could be the ability to detect and strip or 
replace invalid characters taken from some character set (not 
encoding).  An example that comes up repeatedly for us is invalid HTML 
characters provided in XML (where they are not invalid).  Is there an 
opportunity to scan for this in some efficient way here?  So a possible 
elaboration of character decoding might be:

1) default behavior is to throw an error "an invalid byte sequence XXXX 
for character set YYYY was encountered at byte position ZZZZ" or 
something like that
2) user may request error recovery: either dropping the offending 
character, or replacing it with some other character
3) character sets would optionally be enhanced with additional
restrictions (HTML chars only, XML chars only)
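
The three points above could look something like the following Python
sketch. Everything here is an assumption for illustration: the function
name, the on_error and xml_only parameters, and the choice of U+FFFD as
the replacement. The restriction shown is the XML 1.0 Char production;
an HTML restriction would need a different character class.

```python
import re

# Anything outside the XML 1.0 Char production.
NON_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def decode(data: bytes, encoding: str = "utf-8",
           on_error: str = "error", xml_only: bool = False) -> str:
    """Hypothetical decoder: on_error is 'error', 'drop' or 'replace'."""
    if on_error not in ("error", "drop", "replace"):
        raise ValueError(f"unknown on_error mode: {on_error}")

    if on_error == "error":
        # 1) default: report the byte sequence, character set and position
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError as exc:
            bad = data[exc.start:exc.end].hex().upper()
            raise ValueError(
                f"an invalid byte sequence {bad} for character set "
                f"{encoding} was encountered at byte position {exc.start}"
            ) from exc
    else:
        # 2) recovery: drop the offending bytes or substitute U+FFFD
        mode = "ignore" if on_error == "drop" else "replace"
        text = data.decode(encoding, errors=mode)

    if xml_only:
        # 3) optional restriction applied to the decoded characters
        if on_error == "error":
            match = NON_XML_CHARS.search(text)
            if match:
                raise ValueError(
                    f"character U+{ord(match.group()):04X} is not an XML character"
                )
        else:
            text = NON_XML_CHARS.sub("" if on_error == "drop" else "\uFFFD", text)
    return text
```

For example, decode(b"a\x0bb", on_error="drop", xml_only=True) strips the
vertical tab, which is valid UTF-8 but not a legal XML 1.0 character.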

-Mike

On 8/5/13 9:30 AM, John Lumley wrote:
> There is an outstanding issue about handling decoding errors when 
> decoding strings which will need some addressing.  Such errors can 
> occur under the following circumstances:
>
>  1. The encoding is known but defined incorrectly (e.g. using UTF-8
>     when UTF-16 was used to encode)
>  2. The length to decode wasn't 'complete', i.e. some hanging
>     multi-octet characters were incomplete
>  3. There was a phasing error at the start, i.e. the start point was
>     not at a code-point boundary.
>
> We must assume that the decoding error can be detected of course. The 
> question then is what should be done, and whether any form of recovery 
> should be supported.
>
> The simplest of course is to throw an error (which try/catch can field) 
> - but do we want to try and tell what the error is? In some cases the 
> 'replacement character' can be substituted - this is especially true 
> with self-synchronising encodings such as UTF-8. But even then do we 
> want to signal the error, and if so, to where does the 'decode with 
> replacement character' string get returned? (In XSLT 3.0 we could 
> build a reporting structure that was bound to the $err:value 
> variable... XSLT-2.0 of course doesn't have a try)
>
> Others in this community will have far more experience of this issue 
> than I, so I'd welcome your thoughts. Decoding error management does 
> need to be defined for this function.
>
> -- 
> *John Lumley* MA PhD CEng FIEE
> john@saxonica.com
> on behalf of Saxonica Ltd

Received on Monday, 5 August 2013 14:07:46 UTC