- From: Richard Tobin <richard@cogsci.ed.ac.uk>
- Date: Fri, 14 Feb 2003 19:17:33 GMT
- To: www-xml-blueberry-comments@w3.org
This describes the issues I encountered in adding XML 1.1 support to RXP. RXP now accepts both XML 1.0 and XML 1.1 documents, applying the appropriate parsing rules to documents of each kind.

The main changes required were:

- Recording for each parser instance and entity its version.

- Checking that entities did not have a version number later than the document entity.

- Adding a new character type table. I accepted the overhead of an extra 64k here though I could have used more bits from the 1.0 table at the expense of slower lookup.

- Using the appropriate table for checking character legality in the input functions, depending on the version of the document.

- Using the appropriate table for checking name character legality in the name parsing functions, depending on the version of the document. Since characters are represented internally in UTF-16, and XML 1.1 allows characters above 0xffff in names, names can now contain surrogates. Fortunately the legal name characters above 0xffff can be distinguished by the high surrogate only, so it is not necessary to look at pairs when parsing names.

- Changing the way input errors (e.g. bad UTF-8) are recorded: previously a SUB character was placed in the input stream, but SUB can now occur in the replacement text of internal entities (but see spec issues below).

- Changes to line-end normalization on input, depending on the version of the document.

- Changing character reference processing to allow references to control codes for 1.1 documents.

- The structure of the parser makes it hard to handle NEL and LSEP characters in the XML declaration, because the encoding is not yet known. I do not allow them, see spec issues below.

- I did not attempt to implement Unicode normalization checking.

XML 1.1 is not appreciably more complex than XML 1.0, but some complexity arises from supporting both within the same parser.

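To illustrate the version-dependent line-end normalization mentioned in the list above, here is a minimal sketch. It is not RXP's code: the names xml_version and normalize_line_ends are made up for this example, and it assumes the input has already been decoded into a buffer of UTF-16 code units. The rules assumed are that 1.0 turns CR LF and a lone CR into LF, while 1.1 additionally treats CR NEL, NEL (0x85) and LSEP (0x2028) as line ends.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum xml_version { XML_1_0, XML_1_1 };

/*
 * Normalize line ends in place in a buffer of UTF-16 code units.
 * Returns the new length, which is never greater than len.
 *
 * Assumed rules:
 *   1.0: CR LF -> LF, lone CR -> LF
 *   1.1: as 1.0, plus CR NEL -> LF, NEL -> LF, LSEP -> LF
 */
size_t normalize_line_ends(uint16_t *buf, size_t len, enum xml_version v)
{
    size_t in = 0, out = 0;

    while (in < len) {
        uint16_t c = buf[in++];

        if (c == 0x0D) {
            /* Swallow the second half of a two-character line end. */
            if (in < len &&
                (buf[in] == 0x0A || (v == XML_1_1 && buf[in] == 0x85)))
                in++;
            c = 0x0A;
        } else if (v == XML_1_1 && (c == 0x85 || c == 0x2028)) {
            c = 0x0A;
        }

        buf[out++] = c;
    }

    return out;
}
```

A real input function would do this while decoding rather than in a separate pass, and would have to cope with a CR falling at the end of one input block with its partner at the start of the next; the sketch ignores that.
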
The need to apply different checks to characters in the input inner loop depending on the document version adds some overhead. If necessary this could be overcome by having 1.0 and 1.1 versions of the input code, but there are already several versions for different encodings and this would double the number. On the other hand, I estimate that the overhead is only around 3%.

Issues with the spec:

XML 1.1 is not a strict superset of XML 1.0 (in the sense that a well-formed or valid 1.0 document does not necessarily remain well-formed if it is relabelled as 1.1) because of the exclusion of the C1 controls and DEL from the Char production. The main gain from this will be the detection of documents in Microsoft encodings such as cp1252 which are mislabelled as Latin-1. Whether this is worth the loss of superset-ness and the implementation overhead I am not sure.

XML 1.0 was careful to only allow ASCII characters in the XML declaration. XML 1.1 should continue this, explicitly prohibiting NEL and LSEP as whitespace in the XML declaration.

As it stands, the spec requires double escaping of control characters if they appear in internal entities. This is because internal entity replacement text must match the content production, and therefore can only contain characters matching Char. This is pointless and would be expensive to implement, so it should be changed.

There should be a production for Char + the escapable controls, so that other specs can refer to it as the "internal" character set of XML documents.

The exclusion of NUL is important for existing APIs and the spec is right to exclude it.

Canonical XML (and the unofficial canonical XML used in the test suite) will need to be changed to output controls as character references.

-- Richard
Received on Friday, 14 February 2003 14:17:35 UTC