Re: Round-tripping ixml? from Steven Pemberton on 2022-10-18 (public-ixml@w3.org from October 2022)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Tue, 18 Oct 2022 18:52:00 +0000
To: Michal Měchura <michmech@lexiconista.com>, public-ixml@w3.org
Message-Id: <1666119036943.4149247338.607397384@cwi.nl>

This is exactly what I suggest in my paper: unparsing is just parsing in reverse, and in a case like this you would get an ambiguous parse from which one serialisation would get chosen.


Steven

On Tuesday 18 October 2022 18:45:48 (+02:00), Michal Měchura wrote:



> In the fully general case, the problem is intractable.
>
> Grammars can lose information. Consider:
>
> S = 'a', -'.'
>   | 'a', -'?'
>   | 'a', -'!' .
>
> Given <S>a</S>, it’s impossible to know what the input was.
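
To make the information loss concrete, here is a minimal Python sketch of the grammar above. It is not an ixml processor: the `ALTERNATIVES` table and the helpers `serialise`, `source_text`, `parse` and `unparse` are hypothetical names invented for illustration, and the "parser" simply matches whole strings against each alternative.

```python
# Each alternative of rule S is a list of (literal, deleted) pairs.
# deleted=True mirrors the '-' mark, i.e. the literal is matched
# on input but left out of the serialised XML.
ALTERNATIVES = [
    [("a", False), (".", True)],
    [("a", False), ("?", True)],
    [("a", False), ("!", True)],
]

def serialise(alt):
    """Build the XML for one alternative, dropping deleted literals."""
    kept = "".join(lit for lit, deleted in alt if not deleted)
    return f"<S>{kept}</S>"

def source_text(alt):
    """Build the input string that one alternative matches."""
    return "".join(lit for lit, _ in alt)

def parse(text):
    """Return the XML for text, or None if no alternative matches."""
    for alt in ALTERNATIVES:
        if source_text(alt) == text:
            return serialise(alt)
    return None

def unparse(xml):
    """Return every input string that serialises to xml."""
    return [source_text(alt) for alt in ALTERNATIVES if serialise(alt) == xml]

print({parse(s) for s in ("a.", "a?", "a!")})  # {'<S>a</S>'}
print(unparse("<S>a</S>"))                     # ['a.', 'a?', 'a!']
```

All three inputs collapse to the same XML, and running the rule in reverse yields three candidate inputs rather than one. Round-tripping therefore has to pick one of them by some policy.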

I think this would not be a problem. For the use cases I have in mind, it would be OK to round-trip into any one linearization, even if it isn’t exactly the one from which the XML had been parsed. For example, let’s say we have an iXML grammar which parses any one of these:

7 November
7 Nov
07 Nov


into this:

<date day="7" month="11"/>


and then linearizes it back into this:

7 November


That would be acceptable. We could say that this linearization is the “canonical” one, while the others are “tolerated” when parsing but never produced when linearizing. A heuristic could decide which linearization is canonical: for example, always the shortest one (the one with the fewest terminals), or always the first one listed in the rule, or some combination of the two.
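
The two heuristics can be sketched in Python as follows. This is only an illustration under stated assumptions: `LINEARIZERS` is a hypothetical stand-in for the alternatives of a date rule, listed in the order given in the example, and string length is used as a rough proxy for "number of terminals". None of this is ixml syntax or part of any ixml implementation.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# Hypothetical linearization templates, in the order the rule lists them.
LINEARIZERS = [
    lambda day, month: f"{day} {MONTHS[month - 1]}",           # 7 November
    lambda day, month: f"{day} {MONTHS[month - 1][:3]}",       # 7 Nov
    lambda day, month: f"{day:02d} {MONTHS[month - 1][:3]}",   # 07 Nov
]

def linearizations(day, month):
    """Return every string the grammar would accept for this date."""
    return [make(day, month) for make in LINEARIZERS]

def canonical_first(day, month):
    """Heuristic 1: the first alternative listed in the rule wins."""
    return linearizations(day, month)[0]

def canonical_shortest(day, month):
    """Heuristic 2: the shortest output wins.

    min() returns the first minimal element, so ties fall back
    to the order in which the alternatives are listed.
    """
    return min(linearizations(day, month), key=len)

print(linearizations(7, 11))      # ['7 November', '7 Nov', '07 Nov']
print(canonical_first(7, 11))     # 7 November
print(canonical_shortest(7, 11))  # 7 Nov
```

For this example the two heuristics disagree: "first listed" gives 7 November, while "shortest" gives 7 Nov. A real design would need to say which rule takes precedence, or let the grammar author mark the canonical alternative explicitly.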

Well, these are just suggestions from an outsider and a potential iXML user. Take it or leave it. :-)

M.

Received on Tuesday, 18 October 2022 18:52:18 UTC