Re: EXPath binary module - comments - bin:decode-string() from John Lumley on 2013-08-06 (public-expath@w3.org from August 2013)

From: John Lumley <john@saxonica.com>
Date: Tue, 06 Aug 2013 14:23:20 +0100
To: Michael Sokolov <sokolov@falutin.net>
CC: EXPath ML <public-expath@w3.org>
Message-ID: <5200F8C8.5030608@saxonica.com>

On 05/08/2013 15:06, Michael Sokolov wrote:
> By far the largest source of decoding errors I've seen is the 
> confusion of various 1-byte character sets (like windows-1252, iso 
> 8859-1) with utf-8.  In this case the majority of characters will be 
> equivalent, but there will be occasional errors. It's reasonable
> either to provide some sort of error recovery, i.e. to insert
> replacement characters where there are errors, or simply to raise an
> error.  Both behaviors are useful in different circumstances.
>
> Given the API as currently specified, I think strict checking is the 
> only choice.
I agree - at this stage raising a straightforward 'malformed or 
unmappable input in decoding' error will be simplest. An implementation 
could furnish further information (e.g. bound to $err:value in 
xsl:catch) which could be used by an application programmer for some 
form of recovery, but this would be implementation-dependent.
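
As a rough illustration of the strict behaviour (in Python rather than XPath, purely as a sketch): the function name decode_strict and the exact message wording below are assumptions, not part of the binary module spec. The offending bytes and byte position play the role of the extra, implementation-dependent information an implementation might expose.

```python
def decode_strict(data: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical strict decoder: any malformed input raises an error."""
    try:
        return data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        # Offending bytes and offset are the kind of extra detail an
        # implementation could attach to the error (an assumption here).
        bad = exc.object[exc.start:exc.end].hex()
        raise ValueError(
            "malformed or unmappable input in decoding: "
            f"bytes {bad} for {encoding} at byte position {exc.start}"
        ) from exc
```

For instance, b"caf\xe9" (the windows-1252 bytes for "café") fails when decoded as utf-8, and the error reports the bad byte at position 3.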
>
> It seems this spec is getting close to baked, but perhaps in some 
> future revision of this spec, we might want to consider adding error 
> recovery?
Yes - for a later version, unless someone raises strong objections now.
>
> If we provide error recovery, there must be some way to indicate
> that an error occurred, even while returning a value.  We could
> return, e.g., a sequence of (decoded-string, result-status).
>
> Another useful feature here could be the ability to detect and strip 
> or replace invalid characters taken from some character set (not 
> encoding).  An example that comes up repeatedly for us is invalid HTML 
> characters provided in XML (where they are not invalid).  Is there an 
> opportunity to scan for this in some efficient way here?  So a 
> possible elaboration of character decoding might be:
>
> 1) default behavior is to throw an error "an invalid byte sequence 
> XXXX for character set YYYY was encountered at byte position ZZZZ" or 
> something like that
> 2) user may request error recovery: either dropping the offending 
> character, or replacing it with some other character
> 3) character sets would optionally be enhanced with additional
> restrictions (HTML chars only, XML chars only)
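
The three points above could be sketched roughly as follows (Python, for illustration only). The function decode_string, its parameters on_error and xml_chars_only, and the (text, clean) return pair are all hypothetical and only mirror the (decoded-string, result-status) idea; the restriction shown is the XML 1.0 Char production, and an HTML-only set would need its own pattern.

```python
import re

# Characters outside the XML 1.0 "Char" production.
_NON_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def decode_string(data, encoding="utf-8", on_error="strict",
                  xml_chars_only=False):
    """Hypothetical decoder returning (decoded_string, clean).

    on_error: "strict" raises, "replace" substitutes U+FFFD,
    "drop" removes the offending bytes or characters.
    """
    if on_error not in ("strict", "replace", "drop"):
        raise ValueError("on_error must be 'strict', 'replace' or 'drop'")
    clean = True
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError as exc:
        if on_error == "strict":
            bad = exc.object[exc.start:exc.end].hex()
            raise ValueError(
                f"an invalid byte sequence {bad} for character set "
                f"{encoding} was encountered at byte position {exc.start}"
            ) from exc
        # Recovery was requested: decode again with a lenient handler
        # and record that the result is not a faithful decoding.
        clean = False
        handler = "replace" if on_error == "replace" else "ignore"
        text = data.decode(encoding, errors=handler)
    if xml_chars_only:
        if on_error == "strict":
            match = _NON_XML_CHAR.search(text)
            if match:
                raise ValueError(
                    f"non-XML character U+{ord(match.group()):04X} "
                    f"at character position {match.start()}"
                )
        else:
            substitute = "\uFFFD" if on_error == "replace" else ""
            text, count = _NON_XML_CHAR.subn(substitute, text)
            if count:
                clean = False
    return text, clean
```

With this sketch, decode_string(b"caf\xe9 au lait", on_error="drop") yields the text "caf au lait" together with a False status, so the caller can tell that recovery took place.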


-- 
*John Lumley* MA PhD CEng FIEE
john@saxonica.com
on behalf of Saxonica Ltd

Received on Tuesday, 6 August 2013 13:23:46 UTC