draft faq (action 2023-10-03-d)

Four weeks ago I took an action (2023-10-03-d) to start a draft FAQ.

My first attempt follows.  Comments, suggestions, and additions (Qs, As,
or QA pairs) welcome.

Michael

................................................................

Frequently asked questions about Invisible XML

* What is invisible XML?

Invisible XML (or ixml) is a method of describing data
(i.e. information in files or data streams) using a grammar,
processing the data using the grammar, and obtaining an XML
representation of the data.

An ixml processor takes an *input string* and an *input grammar* as
inputs, and produces an XML representation of a parse tree as output,
if the grammar describes the input.  If the grammar does not describe
the input, it produces an XML document saying so.

* What is the input string?

The input *string* is a sequence of Unicode characters.

As a consequence of this, code points explicitly identified by the
Unicode spec as non-characters, and unpaired 16-bit values of the kind
used in UTF-16 as surrogates, are not allowed in the input string.
(For more details see below.)

Which version of Unicode is used is implementation-defined.

* What is the input grammar?

The input grammar is also a sequence of Unicode characters, one that
forms an ixml grammar.  That is, it is described by the grammar given
in the Invisible XML specification (which is itself an ixml grammar).

Like any string described by an ixml grammar, the input grammar has an
XML representation which can be produced by an ixml processor.  Some
processors may allow grammars to be supplied in this XML
representation.

* What is the output?

The ixml spec says that the output of a conforming ixml processor is
an XML document.

* What restrictions are there on ixml grammars?

Any context-free grammar can be handled by an ixml processor.

The ixml input grammar is not required to be an LL(1) grammar, or an
LALR(1) grammar, or LL(/k/) or LALR(/k/) for some positive integer
/k/.  It is not required to be unambiguous.

An ixml processor does check for some properties of the grammar that
are likely to indicate mistakes made in preparing the grammar.  Each
nonterminal symbol must be defined exactly once: multiple definitions
are not allowed, nor are undefined symbols.
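
The "defined exactly once" check can be sketched in a few lines of
Python.  This is only an illustration under simplifying assumptions:
the grammar is modelled as a list of (defined name, referenced names)
pairs, and the function name check_nonterminals is made up for this
example.  It does not describe how any actual ixml processor is
implemented.

```python
from collections import Counter

def check_nonterminals(rules):
    """Report multiply defined and undefined nonterminals.

    `rules` is a hypothetical, simplified grammar model: a list of
    (defined_name, referenced_names) pairs, one pair per rule.
    """
    # Count how many rules define each nonterminal.
    definitions = Counter(name for name, _ in rules)
    multiply_defined = sorted(n for n, c in definitions.items() if c > 1)

    # Every nonterminal mentioned on a right-hand side needs a rule.
    referenced = {ref for _, refs in rules for ref in refs}
    undefined = sorted(referenced - set(definitions))

    return multiply_defined, undefined

# 'year' has two rules and 'month' has none.
rules = [
    ("date", ["year", "month"]),
    ("year", []),
    ("year", []),
]
multiply_defined, undefined = check_nonterminals(rules)
print(multiply_defined, undefined)
```

A grammar passes this check only when both returned lists are empty.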

* Is ixml case-sensitive or case-insensitive?

Like XML, Invisible XML is case-sensitive.

The only place this becomes an issue is in the recognition of
nonterminal symbols.  So concretely, the case-sensitivity of ixml
means: Nonterminals are case-sensitive.

So, for example, the nonterminal /x/ and the nonterminal /X/ are
distinct.

* Can I parse binary input?

Binary formats can often be described using context-free grammars.

But the ixml spec assumes that the input to the parsing process is a
sequence of characters in the 'universal character set' described by
ISO 10646 and Unicode.  So binary input cannot be described by a
standard ixml grammar.

An ixml processor may offer extensions to support parsing of binary
input, but that is outside the scope of the ixml specification.

* What about non-Unicode characters?  Or Unicode non-characters?

Both the input string and the input grammar are sequences of Unicode
characters.

*Non-Unicode characters* -- that is, characters used in some writing
system but not assigned code points in the (applicable version of the)
universal character set -- can be represented using code points in the
Unicode private-use area.  In most applications, such characters are
vanishingly rare.

*Unicode non-characters* are code points explicitly identified by the
Unicode spec as reserved for internal use, which will never be
assigned to represent characters.  There are 66 such code points:

  - the range U+FDD0 to U+FDEF;
    
  - the last two code points in the Basic Multilingual Plane, i.e.
    U+FFFE and U+FFFF; and
    
  - the last two code points in each of the 16 supplementary planes, i.e.
    U+01FFFE, U+01FFFF,
    U+02FFFE, U+02FFFF,
    U+03FFFE, U+03FFFF,
    ...   
    U+0FFFFE, U+0FFFFF,
    U+10FFFE, U+10FFFF.
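
As a cross-check of the count, the list above can be written as a
small Python predicate.  This is only an illustrative sketch; the name
is_noncharacter is invented here and is not part of the ixml
specification or of any Unicode library.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the Unicode non-character code points."""
    if not 0 <= cp <= 0x10FFFF:
        return False  # outside the Unicode code space altogether
    # The contiguous block U+FDD0 to U+FDEF (32 code points).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each of the 17 planes:
    # the low 16 bits are FFFE or FFFF.
    return (cp & 0xFFFE) == 0xFFFE

noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # prints 66
```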

Note that the 16-bit values used in UTF-16 as surrogates are not
non-characters in the technical sense (they have a defined use in
Unicode), but since they do not individually represent characters,
they cannot appear individually in the input string.  If they appear
as a properly paired high and low surrogate in well-formed UTF-16
input, then for purposes of ixml processing what is present in the
input is a single character in one of the 16 supplementary planes of
the Unicode space.
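
To make the pairing concrete, here is a small Python sketch of the
UTF-16 arithmetic that turns a high and a low surrogate into one
supplementary-plane code point.  The function name combine_surrogates
is hypothetical and used only for illustration; an ixml processor
would normally rely on its platform's UTF-16 decoder instead.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %04X" % low)
    # Each surrogate contributes 10 bits; the result starts at U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The pair D83D DE00 stands for the single character U+1F600.
code_point = combine_surrogates(0xD83D, 0xDE00)
print("U+%04X" % code_point)

# Python's own decoder agrees, and rejects a lone surrogate.
paired = b"\x3d\xd8\x00\xde".decode("utf-16-le")
try:
    b"\x3d\xd8".decode("utf-16-le")
    lone_surrogate_rejected = False
except UnicodeDecodeError:
    lone_surrogate_rejected = True
```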


* Can I produce non-XML output?

Some processors may provide additional output formats, e.g. JSON, but
those formats and the behavior that produces them are not defined by
the ixml spec.

* Can I use invisible XML from inside XSLT?  XQuery?

There are ixml processors which can be called from within XSLT or
XQuery, either by using an extension function or by loading an XQuery
or XSLT library which defines an ixml parsing function.

There is currently (October 2023) work afoot to add ixml parsing
functions to the standard XPath 4.0 function library.

* What do 'implementation-defined' and 'implementation-dependent' mean?

A feature is *implementation-defined* if the precise specification of
the feature is provided by the documentation of the ixml
implementation.  For example, the version of Unicode used by the
processor is implementation-defined.

A behavior is *implementation-dependent* if it can vary from one
processor to another, but the precise behavior is not defined --
neither by the ixml specification nor by the documentation of the
processor.  Sometimes things are described as implementation-dependent
because it is impossible to predict what will happen in any given
case: the behavior of a processor that runs out of memory is an
example.  In other cases, something is described as
implementation-dependent (and not implementation-defined) in order to
signal that processor behavior should not be relied upon.  


-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Monday, 30 October 2023 17:33:38 UTC