Normalizing line endings from Norm Tovey-Walsh on 2023-10-23 (public-ixml@w3.org from October 2023)

From: Norm Tovey-Walsh <norm@saxonica.com>
Date: Mon, 23 Oct 2023 17:43:14 +0100
To: ixml <public-ixml@w3.org>
Message-ID: <m2edhlnv2a.fsf@saxonica.com>

Hello,

Pursuant to an action I took recently, here is a summary of issue #192
formulated as a menu of options from which the CG could choose.

Issue #192 is about normalizing line endings.

Invisible XML is designed to parse text. There’s nothing that prevents
an implementation from having a “binary” input method, but that’s not
what the specification anticipates. None of what follows would apply
if an implementation were operating in some sort of binary mode.

A text file is logically a sequence of lines of text. Leaving aside
punched cards, virtual punches, and other literally record-oriented
storage systems, most computer systems store text files as a sequence
of characters where some character or sequence of characters means “a
line break occurs here.”

Different computer systems use different conventions for how line
breaks are identified.

+ Unix (and Unix-derived systems) use a single newline, #A.
+ Pre-OS X MacOS systems used a single carriage return, #D.
+ Windows systems use a carriage return followed by a newline, #D #A.
+ Some mainframe systems use the next line character, #85.
+ Some systems may also use line separator, #2028.

One school of thought says that if a text file is logically a sequence
of lines, it’s irrelevant (from the user’s perspective at least) which
convention is used in any given file.

And it’s worth observing that none of the characters in question (#D,
#A, #85, or #2028) has any *other* reasonable interpretation.
Disregarding test suites, there are effectively no files that one
would expect to consist of lines delimited by one convention where one
of the other conventional characters occurred in the file.

At present, iXML does not treat the line ending characters any
differently than any other character. This has already led to actual,
documented interoperability problems. This grammar works fine on my
system:

  file = line++NL, NL*.
  line = ~[#A]+.
   -NL = -#A.

But it will probably produce unexpected results on a Windows system,
where each line element will end with a #D character.
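
As a rough illustration of that failure mode outside of iXML, the
following Python sketch splits text on #A only, much as the grammar
above does. The function name split_on_newline is hypothetical, and the
code is only an analogue of the grammar, not an iXML processor (for
example, it does not reject empty lines).

```python
def split_on_newline(text: str) -> list[str]:
    """Treat #A as the only line break, like the grammar above."""
    # Drop trailing newlines (the NL* part), then split on #A alone.
    return text.rstrip("\n").split("\n")


unix_text = "alpha\nbeta\n"
windows_text = "alpha\r\nbeta\r\n"

print(split_on_newline(unix_text))     # ['alpha', 'beta']
print(split_on_newline(windows_text))  # ['alpha\r', 'beta\r']
```

The stray carriage return at the end of each Windows line is the kind
of unexpected result described above.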

Conversely, this grammar will work fine on a Windows system:

  file = line++NL, NL*.
  line = ~[#A|#D]+.
   -NL = -#D, -#A.

But will entirely fail to parse text files created on other systems.

It is possible to write grammars that will work irrespective of the
convention:

  file = line++NL, NL*.
  line = ~[#A|#D|#85|#2028]+.
   -NL = -#D, -#A | -#A | -#85 | -#2028 .

But it’s tedious and error-prone. What’s more, it’s hard to test and
easy to forget.
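
One way to make the testing less painful might be to generate the same
logical input once per convention and run a grammar against each
variant. The sketch below is only a suggestion: the names CONVENTIONS
and variants are hypothetical, and how the resulting strings are fed to
an iXML processor is left open.

```python
# The five line-ending conventions listed earlier in this message.
CONVENTIONS = {
    "unix": "\n",               # #A
    "classic-mac": "\r",        # #D
    "windows": "\r\n",          # #D #A
    "next-line": "\x85",        # #85
    "line-separator": "\u2028", # #2028
}


def variants(lines: list[str]) -> dict[str, str]:
    """Render the same logical lines once per line-ending convention."""
    return {
        name: "".join(line + eol for line in lines)
        for name, eol in CONVENTIONS.items()
    }


for name, text in variants(["alpha", "beta"]).items():
    print(name, repr(text))
```

Each value can then be used as test input for the same grammar.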

It has been proposed that iXML should address this problem by defining
how line endings are handled by the processor. There appear to be three
practical options.

1. Do nothing. This is the status quo. Users who want to write
   grammars that will successfully parse text files created on the
   widest variety of systems can do so, but they will have to code
   defensively.

2. Do nothing normatively, but encourage implementations to provide an
   option that normalizes line endings on input.

3. Change the specification so that processors are required to
   normalize line endings on input. It seems reasonable to suggest
   that implementations provide an option that disables this behavior.

If we choose option 2 or 3, we have the additional question of what
the normalization should be. I strongly encourage the group to pick a
single, specific character.

I’m predisposed to prefer #A. It occurs in both Unix and Windows files,
which I expect covers the overwhelming majority of users, and it is a
single character rather than a character sequence.
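
To make the idea concrete, here is a minimal sketch of what
normalizing to #A on input could look like, with a switch to disable it
as suggested under option 3. This is an assumption about one possible
implementation, not something the specification defines; the function
name normalize_line_endings and its enabled parameter are hypothetical.

```python
import re

# Match #D #A first so that a Windows line ending becomes one #A, not two.
_LINE_ENDING = re.compile("\r\n|\r|\x85|\u2028")


def normalize_line_endings(text: str, enabled: bool = True) -> str:
    """Replace every recognized line ending with a single #A."""
    if not enabled:
        # Opt-out: pass the input through untouched.
        return text
    return _LINE_ENDING.sub("\n", text)


mixed = "a\r\nb\rc\x85d\u2028e\n"
print(repr(normalize_line_endings(mixed)))  # 'a\nb\nc\nd\ne\n'
```

With input normalized this way, the first grammar above (which only
knows about #A) would see the same line structure regardless of the
convention used in the file.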

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica
Received on Monday, 23 October 2023 16:49:11 UTC