- From: Norm Tovey-Walsh <norm@saxonica.com>
- Date: Mon, 23 Oct 2023 17:43:14 +0100
- To: ixml <public-ixml@w3.org>
- Message-ID: <m2edhlnv2a.fsf@saxonica.com>
Hello,

Pursuant to an action I took recently, here is a summary of issue #192 formulated as a menu of options from which the CG could choose.

Issue #192 is about normalizing line endings.

Invisible XML is designed to parse text. There’s nothing that prevents an implementation from having a “binary” input method, but that’s not what the specification anticipates. None of what follows would apply if an implementation were operating in some sort of binary mode.

A text file is logically a sequence of lines of text. Leaving aside punched cards, virtual punches, and other literally record-oriented storage systems, most computer systems store text files as a sequence of characters where some character or sequence of characters means “a line break occurs here.”

Different computer systems use different conventions for how line breaks are identified.

+ Unix (and Unix-derived systems) use a single newline, #A.
+ Pre-OS X MacOS systems used a single carriage return, #D.
+ Windows systems use a carriage return followed by a newline, #D #A.
+ Some mainframe systems use the next line character, #85.
+ Some systems may also use line separator, #2028.

One school of thought says that if a text file is logically a sequence of lines, it’s irrelevant (from the user’s perspective at least) which convention is used in any given file. And it’s worth observing that none of the characters in question (#D, #A, #85, or #2028) has any *other* reasonable interpretation. Disregarding test suites, there are effectively no files that one would expect to consist of lines delimited by one convention where one of the other conventional characters occurred in the file.

At present, iXML does not treat the line ending characters any differently than any other character. This has already led to actual, documented interoperability problems. This grammar works fine on my system:

  file = line++NL, NL*.
  line = ~[#A]+.
  -NL = -#A.
But it will (probably) produce unexpected results on a Windows system, where the line elements will end with a #D character. Conversely, this grammar will work fine on a Windows system:

  file = line++NL, NL*.
  line = ~[#A|#D]+.
  -NL = -#D, -#A.

But it will entirely fail to parse text files created on other systems. It is possible to write grammars that will work irrespective of the convention:

  file = line++NL, NL*.
  line = ~[#A|#D|#85|#2028]+.
  -NL = -#D, -#A | -#A | -#85 | -#2028 .

But it’s tedious and error-prone. What’s more, it’s hard to test and easy to forget.

It has been proposed that iXML should address this problem by defining how line endings are handled by the processor. There appear to be three practical options.

1. Do nothing. This is the status quo. Users who want to write grammars that will successfully parse text files created on the widest variety of systems can do so, but they will have to code defensively.

2. Do nothing normatively, but encourage implementations to provide an option that normalizes line endings on input.

3. Change the specification so that processors are required to normalize line endings on input. It seems reasonable to suggest that implementations provide an option that disables this behavior.

If we choose option 2 or 3, we have the additional question of what the normalization should be. I strongly encourage the group to pick a single, specific character. I’m predisposed to prefer #A. It occurs in both Unix and Windows files, which I expect covers the overwhelming majority of users, and it is a single character rather than a character sequence.

Be seeing you,
  norm

--
Norm Tovey-Walsh
Saxonica
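For illustration only, the input normalization described in options 2 and 3 might look roughly like the Python sketch below. It is not taken from any iXML implementation: the function name `normalize_line_endings` is hypothetical, the set of recognized line endings is simply the five conventions listed in the message, and #A is assumed as the target character. The last lines mimic, very loosely, what the Unix-only grammar sees when given Windows input.

```python
import re

# The five conventions listed in the message. The two-character #D #A
# sequence comes first so it is replaced as a unit, not as two line ends.
_LINE_END = re.compile("\r\n|\r|\n|\u0085|\u2028")


def normalize_line_endings(text):
    """Replace each recognized line ending with a single #A (hypothetical helper)."""
    return _LINE_END.sub("\n", text)


# Splitting Windows-style text on #A alone leaves a stray #D on the line,
# which is the interoperability problem the message describes.
windows_text = "one\r\ntwo"
raw_lines = windows_text.split("\n")
clean_lines = normalize_line_endings(windows_text).split("\n")
```

With normalization applied before parsing, a grammar written only in terms of #A would see the same lines regardless of which convention the file used; whether a processor should do this by default is exactly the question the options above leave to the group.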
Received on Monday, 23 October 2023 16:49:11 UTC