- From: Michael Sokolov <sokolov@falutin.net>
- Date: Mon, 05 Aug 2013 10:06:45 -0400
- To: John Lumley <john@saxonica.com>
- Cc: EXPath ML <public-expath@w3.org>
- Message-id: <51FFB175.1050401@falutin.net>
By far the largest source of decoding errors I've seen is the confusion of various 1-byte character sets (like windows-1252, ISO 8859-1) with UTF-8. In this case the majority of characters will be equivalent, but there will be occasional errors.

It's reasonable either to provide some sort of error recovery, i.e. to insert replacement characters where there are errors, or simply to raise an error. Both behaviors are useful in different circumstances. Given the API as currently specified, I think strict checking is the only choice. It seems this spec is getting close to baked, but perhaps in some future revision we might want to consider adding error recovery?

If we provide error recovery, there must be some way to indicate that an error occurred, even while returning a value. We could return a sequence of (decoded-string, result-status), for example.

Another useful feature here could be the ability to detect and strip or replace invalid characters taken from some character set (not encoding). An example that comes up repeatedly for us is invalid HTML characters provided in XML (where they are not invalid). Is there an opportunity to scan for this in some efficient way here?

So a possible elaboration of character decoding might be:

1) default behavior is to throw an error "an invalid byte sequence XXXX for character set YYYY was encountered at byte position ZZZZ" or something like that

2) user may request error recovery: either dropping the offending character, or replacing it with some other character

3) character sets would optionally be enhanced with additional restrictions (HTML chars only; XML chars only)

-Mike

On 8/5/13 9:30 AM, John Lumley wrote:
> There is an outstanding issue about handling decoding errors when
> decoding strings which will need some addressing. Such errors can
> occur under the following circumstances:
>
> 1. The encoding is known but defined incorrectly (e.g. using UTF-8
>    when UTF-16 was used to encode)
> 2. The length to decode wasn't 'complete', i.e. some hanging
>    multi-octet characters were incomplete
> 3. There was a phasing error at the start, i.e. the start point was
>    not at a code-point boundary.
>
> We must assume that the decoding error can be detected of course. The
> question then is what should be done, and whether any form of recovery
> should be supported.
>
> The simplest of course is to throw an error (which try/catch can field)
> - but do we want to try and tell what the error is? In some cases the
> 'replacement character' can be substituted - this is especially true
> with self-synchronising encodings such as UTF-8. But even then do we
> want to signal the error, and if so, to where does the 'decode with
> replacement character' string get returned? (In XSLT 3.0 we could
> build a reporting structure that was bound to the $err:value
> variable... XSLT-2.0 of course doesn't have a try)
>
> Others in this community will have far more experience of this issue
> than I, so I'd welcome your thoughts. Decoding error management does
> need to be defined for this function
>
> --
> *John Lumley* MA PhD CEng FIEE
> john@saxonica.com <mailto:john@saxonica.com>
> on behalf of Saxonica Ltd
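
For illustration only, the three-part elaboration proposed in the reply above (strict error by default, optional recovery, optional character restrictions) could be sketched roughly as follows. This is Python rather than XPath, and it is not the EXPath binary API: the function name `decode`, its parameters, and the status values "ok" and "recovered" are all hypothetical. It assumes "XML chars only" means the XML 1.0 `Char` production, and that replacing a disallowed character counts as a recovery.

```python
import re

# Anything outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NON_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def decode(octets, encoding="utf-8", on_error="strict", xml_only=False):
    """Hypothetical helper returning (decoded_string, result_status).

    on_error: "strict"  -> raise on the first invalid byte sequence
              "replace" -> substitute U+FFFD for each invalid sequence
              "ignore"  -> drop each invalid sequence
    """
    status = "ok"
    try:
        text = octets.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        if on_error == "strict":
            # Report the offending bytes, the encoding and the byte position.
            bad = octets[exc.start:exc.end].hex()
            raise ValueError(
                f"invalid byte sequence {bad} for encoding {exc.encoding} "
                f"at byte position {exc.start}"
            ) from exc
        # Recovery was requested: decode again with the lenient handler
        # and record that the result is not a faithful decoding.
        text = octets.decode(encoding, errors=on_error)
        status = "recovered"
    if xml_only:
        # Optional character-set restriction, applied after decoding.
        text, replaced = _NON_XML_CHARS.subn("\ufffd", text)
        if replaced:
            status = "recovered"
    return text, status


if __name__ == "__main__":
    latin1_bytes = "café au lait".encode("windows-1252")
    print(decode(latin1_bytes, "windows-1252"))       # correct encoding
    print(decode(latin1_bytes, on_error="replace"))   # mistaken for UTF-8
```

Returning the status alongside the string mirrors the (decoded-string, result-status) idea: the caller still gets a usable value but can tell that recovery took place.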
Received on Monday, 5 August 2013 14:07:46 UTC