- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Mon, 30 Oct 2023 11:31:03 -0600
- To: ixml <public-ixml@w3.org>
Four weeks ago I took an action (2023-10-03-d) to start a draft FAQ. My first attempt follows. Comments, suggestions, and additions (Qs, As, or QA pairs) welcome. Michael ................................................................ Frequently asked questions about Invisible XML * What is invisible XML? Invisible XML (or ixml) is a method of describing data (i.e. information in files or data streams) using a grammar, processing the data using the grammar, and obtaining an XML representation of the data. An ixml processor takes an *input string* and an *input grammar* as inputs, and produces an XML representation of a parse tree as output, if the grammar describes the input. If the grammar does not describe the input, it produces an XML document saying so. * What is the input string? The input *string* is a sequence of Unicode characters. As a consequence of this, code points explicitly identified by the Unicode spec as non-characters, and 16-bit values used in UTF-8 to represent surrogate characters, are not allowed in the input string. (For more details see below.) Which version of Unicode is used is implementation-defined. * What is the input grammar? The input grammar is also a sequence of Unicode characters, which is an ixml grammar. That is, it is described by the invisible-XML specification grammar (which is itself an ixml grammar). Like any string described by an ixml grammar, the input grammar has an XML representation which can be produced by an ixml processor. Some processors may allow grammars to be supplied in this XML representation. * What is the output? The ixml spec says that the output of a conforming ixml processor is an XML document. * What restrictions are there on ixml grammars? Any context-free grammar can be handled by an ixml processor. The ixml input grammar is not required to be an LL(1) grammar, or an LALR(1) grammar, or LL((/k/) or LALR(/k/) for some positive integer /k/. It is not required to be unambiguous. An ixml processor does check for some properties of the grammar that are likely to indicate mistakes made in preparing the grammar. Each nonterminal symbol must be defined exactly once: multiple definitions are not allowed, nor are undefined symbols. * Is ixml case-sensitive or case-insensitive? Like XML, Invisible XML is case-sensitive. The only place this becomes an issue is in the recogntion of nonterminal symbols. So concretely, the case-sensitivity of ixml means: Nonterminals are case-sensitive. So, for example, the nonterminal /x/ and the nonterminal /X/ are distinct. * Can I parse binary input? Binary formats can often be described using context-free grammars. But the ixml spec assumes that the input to the parsing process is a sequence of characters in the 'universal character set' described by ISO 10646 and Unicode. So binary output cannot be described by a standard ixml grammar. An ixml processor may offer extensions to support parsing of binary input, but that is outside the scope of the ixml specification. * What about non-Unicode characters? Or Unicode non-characters? Both the input string and the input grammar are sequences of Unicode characters. *Non-Unicode characters* -- that is, characters used in some writing system but not assigned code points in the (applicable version of the) universal character set -- can be represented using code points in the Unicode private-use area. In most applications, such characters are vanishingly rare. *Unicode non-characters* are code points explicitly identified by the Unicode spec as reserved for internal use, which will never be assigned to represent characters. There are 66 such code points: - the range U+FDD0 to U+FDEF; - the last two code points in the Basic Multilingual Plane, i.e. U+FFFE and U+FFFF; and - the last two code points in each of the 16 supplementary planes, i.e. U+01FFFE, U+01FFFF, U+02FFFE, U+02FFFF, U+03FFFE, U+03FFFF, ... U+0FFFFE, U+0FFFFF, U+10FFFE, U+10FFFF. Note that the 16-bit values used in UTF-8 to represent surrogate characters are not non-characters in the technical sense (they have a defined use in Unicode) but since they do not individually represent characters, they cannot appear individually in the input string. If they appear in syntactically well formed UTF-8 input, then for purposes of ixml processing what is present in the input is a character in one of the 16 supplementary planes of the Unicode space. * Can I produce non-XML output? Some processors may provide additional output formats, e.g. JSON, but those formats and the behavior that produces them are not defined by the ixml spec. * Can I use invisible XML from inside XSLT? XQuery? There are ixml processors which can be called from within XSLT or XQuery, either by using an extension function or by loading an XQuery or XSLT library which defines an ixml parsing function. There is currently (October 2023) work afoot to add ixml parsing functions to the standard XPath 4.0 function library. * What do 'implementation-defined' and 'implementation-dependent' mean? A feature is *implementation-defined* if the precise specification of the feature is provided by the documentation of the ixml implementation. For example: which version of Unicode is used by the processor? A behavior is *implementation-dependent* if it can vary from one processor to another, but the precise behavior is not defined -- neither by the ixml specification nor by the documentation of the processor. Sometimes things are described as implementation-dependent because it is impossible to predict what will happen in any given case: the behavior of a processor that runs out of memory is an example. In other cases, something is described as implementation-dependent (and not implementation-defined) in order to signal that processor behavior should not be relied upon. -- C. M. Sperberg-McQueen Black Mesa Technologies LLC http://blackmesatech.com
Received on Monday, 30 October 2023 17:33:38 UTC