Notes on 
Guidelines For The Use of XML in IETF Protocols
draft-hollenbeck-ietf-xml-guidelines-01.txt
http://www.imc.org/ietf-xml-use/draft-hollenbeck-ietf-xml-guide.html

General - good, well thought out document, admirably concise and well referenced. Some vagueness remains, which I assume is inadvertent rather than a set of deliberate loopholes, so it should be tightened up. Overall, a model of the sort of pithy BCP directives I would like to see the TAG producing.

1.2 Scope - is it protocol in the sense of an XML-based transport, or in the sense of data that is transported over networks? "widely-used mechanism for representing structured data in protocol exchanges" seems to imply the latter. It's important to know which one, to be able to evaluate this document. The scope statement is confusing: on the one hand the scope is "guidelines for the use of XML content within a larger protocol", and on the other, use of "higher-level representation frameworks, based on XML, that have been designed as carriers of certain classes of information" is out of scope.

From reading the rest of the document, small XML fragments that form part of a larger protocol (for example, XML headers) seem to be the target. The scope could more explicitly note that "all XML which is transmitted over a network" is out of scope.

2. XML Selection Considerations
"XML processing speed can be an issue in some environments. XML processing can be slower because XML data streams may be larger than other representations, and the use of general purpose XML parsers will add a software layer with its own performance costs." 
Gives reasons *not* to use XML but little counter argument. Perhaps the desire is to stop wholesale thoughtless adoption of XML regardless of applicability, but a balanced selection section should also mention benefits. An example for the piece quoted above would be to append "which should be balanced against the footprint and lack of optimisation of a myriad of special purpose parsers in the same software".

4.1 XML Declarations
"In some cases, the XML used is a small fragment in a larger context, where the XML version and character encoding are specified externally. In those cases, the XML declaration might add extra overhead. " 
This seems to imply that XML might be encoded in something other than UTF-8 or UTF-16, or use a version other than 1.0, and omit this important information if the percentage of XML to overall data is small. This is clearly wrong, and will result in reduced interoperability and the growth of sniffing. Some implementations will report a well-formedness error, others will not. This is very bad. The growth of out-of-band alternatives to the XML encoding declaration should be resisted. We have one already, due to the interaction with the text/* media types, and it is one too many.
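To make the interoperability worry concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The sample documents and the helper name try_parse are made up for illustration; the point is only that the same bytes are rejected or accepted depending on whether the encoding is declared in-band.

```python
import xml.etree.ElementTree as ET

# "é" stored as the single Latin-1 byte 0xE9; both documents are invented samples.
undeclared = b"<name>Ren\xe9</name>"
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Ren\xe9</name>'


def try_parse(data):
    """Return the root element's text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


# Without a declaration the parser assumes UTF-8, and 0xE9 followed by "<"
# is not a valid UTF-8 sequence, so the document is not well-formed.
print(try_parse(undeclared))
# With the declaration the same byte decodes to U+00E9.
print(try_parse(declared))
```

A receiver that is told the encoding out of band would have to re-decode or sniff the first document, which is exactly the behaviour that differs between implementations.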

4.3 Well-Formedness
"An XML instance that is not well-formed is not really XML; well-formedness is the basis for syntactic compatibility with XML. Without well-formedness, most of the advantages of using XML disappear. For this reason, it is imperative that protocol specifications REQUIRE that XML instances be well-formed."

This is good but should be strengthened. s/not really/NOT. Add that the parser MUST halt on a well-formedness error. s/most of the advantages/all of the advantages. What advantages could remain if a protocol specifies a non-well-formed use of XML? None that I can see, and a huge downside the first time that such a protocol gets defined. Nitpicking, "REQUIRE" looks impressive but is not part of the vocabulary of RFC 2119. "REQUIRED" is, and is a synonym for "MUST". Rewording to "For this reason, protocol specifications MUST require that XML instances be well-formed and that processors MUST halt on well-formedness errors." would be clearer.

(Incidentally, the link in the document for RFC 2119 points to an ftp site and was broken when I tried it, but http://www.ietf.org/rfc/rfc2119.txt worked... also, it's not clear whether they use "must" to mean "MUST" or not, which affects the meaning of a lot of the document).

4.4 Validity and Extensibility
Wading past the very necessary 'should we mandate W3C XML Schema or let a thousand flowers bloom' portions, the most important part to my mind is:
"For whatever formalism chosen, there are often additional constraints that cannot be expressed in that formalism. These additional requirements should be clearly called out in the specification. "
I agree; using a DTD/Schema/whatever as the primary formalism should trigger a search for remaining constraints that cannot be described in that formalism and strenuous effort to make such additional constraints exactly described, clearly flagged as additions beyond the primary formalism, and ideally machine checkable. 

Suggest adding a sentence about the desirability of an overall validator for a particular usage, that first checks for well formedness; if ok, applies the primary formalism and, if the instance passes, applies the other constraints so that the entire set, or as much as is machine processable, can be checked at the one time.
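A rough sketch of such a layered validator, in Python with only the standard library. Everything specific here is an assumption for illustration: the <range min=".." max=".."/> message, the function names, and the hand-written check_structure that stands in for whatever primary formalism (DTD, schema, ...) a protocol actually picks.

```python
import xml.etree.ElementTree as ET


def check_structure(root):
    """Stand-in for the primary formalism: a hypothetical <range> element
    that must carry 'min' and 'max' attributes."""
    problems = []
    if root.tag != "range":
        problems.append("root element must be <range>")
    for name in ("min", "max"):
        if name not in root.attrib:
            problems.append(f"missing attribute '{name}'")
    return problems


def check_extra_constraints(root):
    """Constraints a DTD cannot express: integer values with min <= max."""
    try:
        low, high = int(root.get("min")), int(root.get("max"))
    except ValueError:
        return ["min and max must be integers"]
    return [] if low <= high else ["min must not exceed max"]


def validate(data):
    """Return a list of problems; an empty list means every stage passed."""
    try:
        root = ET.fromstring(data)           # stage 1: well-formedness
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]   # halt: later stages are meaningless
    problems = check_structure(root)         # stage 2: primary formalism
    if problems:
        return problems
    return check_extra_constraints(root)     # stage 3: the remaining constraints


print(validate(b'<range min="1" max="5"/>'))
print(validate(b'<range min="9" max="5"/>'))
```

The ordering is the design point: each stage runs only if the previous one passed, so a single call reports the first level at which an instance fails.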

4.5 Namespaces
Good, but implies that the only function of namespaces is avoidance of element and attribute name conflicts. Since they define "XML" to mean "the associated body of XML-related specifications" (or I assume they do, otherwise the claim in 2. XML Selection Considerations that "XML is still evolving. The formal specifications are still being influenced and updated as use experience is gained and applied." has little merit) then similarly "namespaces" should mean "the associated body of namespace-related, or namespace-aware, specifications".

That being the case: firstly, their wording implies that any random namespace URI can be chosen as long as it avoids a name clash in a given instance, and this is incorrect; secondly, the impact of namespaces on object models should be mentioned. The methods and attributes on an object will vary depending on whether that object is in its own namespace, no namespace, or the wrong (randomly chosen) namespace. Elements and attributes should be in their correct namespaces and the namespaces should be declared even if there is no clash.
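As a small illustration of the object-model point, here is a sketch with Python's xml.etree.ElementTree, which exposes the namespace as part of each element's name. The namespace URIs are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Three spellings of "the same" element; the URIs are invented for the example.
docs = {
    "own namespace": '<address xmlns="urn:example:addr"/>',
    "no namespace": "<address/>",
    "wrong namespace": '<address xmlns="urn:example:other"/>',
}

# ElementTree reports each name as "{namespace-uri}local-name".
tags = {label: ET.fromstring(text).tag for label, text in docs.items()}
for label, tag in tags.items():
    print(f"{label}: {tag}")

# A namespace-aware consumer looking for the real name matches only one form.
wanted = "{urn:example:addr}address"
matches = [label for label, tag in tags.items() if tag == wanted]
print(matches)
```

To a namespace-aware processor these are three different elements, even though no name clash occurs in any of the three documents.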

The points about the desirability of namespace URIs pointing to "something" using HTTP are well made and about as strong as current practice allows.

"In the case of namespaces in IETF standards-track documents, it would be useful if there were some permanent part of the IETF's own web space that could be used for this purpose. "
Yes, great. Wish we could assign them an action to go do that and say what the base URI is?

"In lieu of such, other permanent URIs can be used, e.g., URNs in the IETF URN namespace (see [13] and [14])." That makes me feel uncomfortable, but I don't have articulate arguments right now to clarify the source of discomfort, so I note it here and may get back to it later, or others might.

4.5.1 Namespaces and Attributes
"There is a frequently misunderstood aspect of the relationship between unprefixed attributes and the default XML namespace"
Great stuff, well done for pointing this out.
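For anyone who has not been bitten by this yet, a short sketch in Python (xml.etree.ElementTree); the namespace URI and the attribute names are hypothetical. An unprefixed attribute stays in no namespace even when a default namespace is in scope, while a prefixed one is namespace-qualified.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:addr"  # hypothetical namespace URI

root = ET.fromstring(
    f'<address xmlns="{NS}" xmlns:a="{NS}" type="ipv4" a:scope="global"/>'
)

# The element picks up the default namespace...
print(root.tag)
# ...but the unprefixed attribute "type" does not; only the prefixed
# attribute "a:scope" is in the namespace.
print(sorted(root.attrib))
```

So a lookup for the namespace-qualified name of "type" finds nothing, which is the usual source of the misunderstanding.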

"As described in XML Alternatives there is no standard mechanism in XML for indicating whether or not new extensions are mandatory to recognize. XML-based protocol specifications should thus explicitly describe extension mechanisms and requirements to recognize or ignore extensions."
True in general but there are specific examples of such mechanisms so they should be listed here and the advice elsewhere in the document to re-use existing mechanisms should be applied here also. I'm thinking of the DOM feature tests and the SMIL and SVG requiredFeatures, requiredExtensions and test and switch capabilities.

4.6 Element and Attribute Design Considerations
Good material in general, helpful to the XML protocol designer. "Attribute values can contain only simple XML data types" is true to a limited extent depending on the formalism used and its expressive power. Attribute data is clearly used to describe richer content than many schema languages can express (such as the very simple width="3.5mm") and the interaction with section 4.4 Validity and Extensibility in terms of "additional constraints that cannot be expressed in that formalism" should be made explicit - it's a trade-off with advantages and disadvantages that should be made on a case by case basis, not a "can only contain". This applies to element content too, of course.

"Attributes used in protocol elements should contain only meta-data that describes the value of the enclosing element. " Yes, but that does not preclude richer attribute data, or indeed markup that is sparser than the fullest possible, as their own example illustrates. Fully marked up it might be

<address>
  <type>
    <protocol>ip</protocol>
    <version>4</version>
  </type>
  <v4value>
    <first>10</first>
    <second>1</second>
    <third>2</third>
    <fourth>3</fourth>
  </v4value>
</address>

26 DOM nodes, including text nodes, and including the ones that contain just whitespace.

And even so, schema languages might have difficulty expressing the constraint that the second child of address is v4value if the value of the content of the version child of the first child type element is "4". It's clear that combining the ip and the 4 into a commonly used compound ipv4 and applying it as an attribute on address is a better method.

<address type="ipv4">
  <v4value>
    <first>10</first>
    <second>1</second>
    <third>2</third>
    <fourth>3</fourth>
  </v4value>
</address>

17 DOM nodes
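The node counts can be checked mechanically. A sketch with Python's xml.dom.minidom, assuming the closing tags are written as proper end tags and counting element and text nodes from the document element down (attributes are not child nodes in the DOM, so they are not counted); the names FULL, WITH_ATTRIBUTE, count_nodes and count_in are mine.

```python
from xml.dom import minidom

FULL = """<address>
  <type>
    <protocol>ip</protocol>
    <version>4</version>
  </type>
  <v4value>
    <first>10</first>
    <second>1</second>
    <third>2</third>
    <fourth>3</fourth>
  </v4value>
</address>"""

WITH_ATTRIBUTE = """<address type="ipv4">
  <v4value>
    <first>10</first>
    <second>1</second>
    <third>2</third>
    <fourth>3</fourth>
  </v4value>
</address>"""


def count_nodes(node):
    """Count this node plus all of its descendants (elements and text nodes)."""
    return 1 + sum(count_nodes(child) for child in node.childNodes)


def count_in(xml_text):
    """Parse the text and count nodes starting at the document element."""
    return count_nodes(minidom.parseString(xml_text).documentElement)


print(count_in(FULL))            # 26
print(count_in(WITH_ATTRIBUTE))  # 17
```

Most of the difference is whitespace-only text nodes and wrapper elements, which is rather the point.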

It's also at least a defensible design choice to combine the four children of v4value into a commonly used dotted quad, even though this forgoes validation using schemas that there are four children, in the correct order and with content in the allowed range.

<address type="ipv4">10.1.2.3</address>

<asbestos>
Three DOM nodes, easy readability, immediately understood by target audience, validation requires a microparser, so what.
</asbestos>
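For scale, the microparser in question is about this big. A minimal sketch in Python; the function name parse_dotted_quad and the exact error messages are my own, and it accepts only plain decimal fields.

```python
def parse_dotted_quad(text):
    """Return the four octets as a tuple of ints, or raise ValueError."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    octets = []
    for part in parts:
        # Require ASCII decimal digits only (no signs, spaces or empty fields).
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"not a decimal number: {part!r}")
        value = int(part)
        if value > 255:
            raise ValueError(f"octet out of range: {value}")
        octets.append(value)
    return tuple(octets)


print(parse_dotted_quad("10.1.2.3"))
```

This recovers the checks the schema would have done (four fields, in order, each in range) at the cost of one small function per compound type.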

5. Internationalization Considerations
Good to see this considered. Good to see a similar form of title to "Security Considerations" since Internationalisation, like Security, is a matter of design, not a feature.

5.1 Character Sets
"XML provides native support for encoding information using the Unicode character set and its more compact representations including UTF-8 [4] and UTF-16 [21]."

Ugh. Mixes up the universal character set, character repertoires and character encodings straight off the bat. See
Character Set Considered Harmful
May 2, 1995
http://www.w3.org/MarkUp/html-spec/charset-harmful
linked from http://www.w3.org/TR
Side note to Dan Connolly - technically that expired in November, 1995 but is still applicable. Suggest ensuring it is fully reflected in the character model work from I18N WG.


Suggested rewording (requires renumbering of subsequent sections)

5.1 Character Repertoire
XML performs all character processing in terms of the Universal Character Set (UCS, link to Unicode 3.2 and ISO 10646). This provides a base for, but does not itself guarantee, an internationalized use of XML for protocols. There is a frequently misunderstood aspect of the relationship between numerical character references (NCRs) and encoding. NCRs always refer to the UCS, never to the encoding in use. Thus, any encoding can represent the entire character repertoire of the UCS by using NCRs.
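A quick sketch of the NCR point in Python (xml.etree.ElementTree); the <price> element is an invented sample. U+20AC (the euro sign) has no code point in ISO-8859-1 and none in US-ASCII, yet documents in either encoding can carry it as a character reference.

```python
import xml.etree.ElementTree as ET

# A Latin-1 encoded document; the reference &#x20AC; names a UCS code point,
# not a position in the document's encoding.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><price>&#x20AC;10</price>'
text = ET.fromstring(latin1_doc).text
print(hex(ord(text[0])))  # the first character is U+20AC

# Going the other way: serialising to US-ASCII forces the character out as an NCR.
ascii_bytes = ET.tostring(ET.fromstring("<price>\u20ac10</price>"), encoding="us-ascii")
print(ascii_bytes)

# Round trip: the ASCII-only bytes still carry the full character.
print(ET.fromstring(ascii_bytes).text == text)
```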

5.2 Encoding
XML mandates support of the UTF-8 [4] and UTF-16 [21] encodings (in older IETF parlance, "character sets"), and such usage may have but does not require an encoding declaration. Other encodings are also permitted, with no guarantee of support in a given XML parser, and if used MUST be specified using an "encoding" attribute in a document's XML declaration. XML parsers often support additional encodings, but the selection of these is regionally dependent and should not be relied upon even for data exchange within a geographically limited region. Because of this, to ensure interoperability, UTF-8 or UTF-16 SHOULD be mandated for protocols that represent data using XML.

5.3 Other Considerations
Good points about needing markup to convey certain information such as ruby and embedding controls. The use of markup to specify language changes and to allow bidi embedding more than one level deep could also usefully be noted here.

s/extended character sets/character repertoires greater than US-ASCII
because "character set" is often seen as synonymous with "charset" and thus with "encoding", and that is not the issue here.