- From: John Lumley <john@saxonica.com>
- Date: Tue, 06 Aug 2013 14:23:20 +0100
- To: Michael Sokolov <sokolov@falutin.net>
- CC: EXPath ML <public-expath@w3.org>
- Message-ID: <5200F8C8.5030608@saxonica.com>
On 05/08/2013 15:06, Michael Sokolov wrote:
> By far the largest source of decoding errors I've seen is the
> confusion of various 1-byte character sets (like windows-1252, iso
> 8859-1) with utf-8. In this case the majority of characters will be
> equivalent, but there will be occasional errors. It's reasonable
> either to provide some sort of error recovery: ie to insert
> replacement characters where there are errors, or to simply raise an
> error. Both behaviors are useful in different circumstances.
>
> Given the API as currently specified, I think strict checking is the
> only choice.

I agree - at this stage raising a straightforward 'malformed or unmappable input in decoding' error will be simplest. An implementation could furnish further information (e.g. bound to $err:value in xsl:catch) which could be used by an application programmer for some form of recovery, but this would be implementation dependent.

>
> It seems this spec is getting close to baked, but perhaps in some
> future revision of this spec, we might want to consider adding error
> recovery?

Yes - for a later version, unless someone raises strong objections now.

>
> If we provide error recovery, there must be some way to indicate
> that an error occurred, even while returning a value. We could return
> a sequence of (decoded-string, result-status), eg.
>
> Another useful feature here could be the ability to detect and strip
> or replace invalid characters taken from some character set (not
> encoding). An example that comes up repeatedly for us is invalid HTML
> characters provided in XML (where they are not invalid). Is there an
> opportunity to scan for this in some efficient way here? So a
> possible elaboration of character decoding might be:
>
> 1) default behavior is to throw an error "an invalid byte sequence
> XXXX for character set YYYY was encountered at byte position ZZZZ" or
> something like that
> 2) user may request error recovery: either dropping the offending
> character, or replacing it with some other character
> 3) character sets would optionally be enhanced with additional
> restrictions (HTML chars only: XML chars only)

--
*John Lumley* MA PhD CEng FIEE
john@saxonica.com
on behalf of Saxonica Ltd
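
For illustration only, the strict and recovering decoding behaviours discussed in this message can be sketched in Python. This is not part of the EXPath specification or of any Saxon API: the function names `decode_strict` and `decode_with_recovery` are hypothetical, the error wording is modelled on the message quoted above, and the `(decoded-string, result-status)` pair is represented here as a tuple whose second item is `True` when no error occurred.

```python
def decode_strict(data: bytes, encoding: str = "utf-8") -> str:
    """Strict decoding: raise on the first malformed byte sequence."""
    try:
        return data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start / exc.end delimit the offending bytes in the input.
        bad = data[exc.start:exc.end].hex()
        raise ValueError(
            f"invalid byte sequence {bad} for character set {encoding} "
            f"at byte position {exc.start}"
        ) from exc


def decode_with_recovery(data: bytes, encoding: str = "utf-8",
                         drop: bool = False) -> tuple[str, bool]:
    """Recovering decoding: return (decoded-string, result-status).

    result-status is True when the input decoded cleanly. Otherwise the
    offending bytes are dropped (drop=True) or replaced with U+FFFD.
    """
    try:
        return data.decode(encoding, errors="strict"), True
    except UnicodeDecodeError:
        mode = "ignore" if drop else "replace"
        return data.decode(encoding, errors=mode), False


if __name__ == "__main__":
    # windows-1252 bytes for "café au lait", wrongly treated as utf-8.
    raw = "café au lait".encode("windows-1252")
    print(decode_with_recovery(raw))
    print(decode_with_recovery(raw, drop=True))
    try:
        decode_strict(raw)
    except ValueError as err:
        print(err)
```

Returning the status alongside the string keeps the recovering variant from hiding the fact that an error occurred, which is the concern raised in the quoted text.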
Received on Tuesday, 6 August 2013 13:23:46 UTC