RE: Noah Mendelsohn Comments on July 26 Draft of TAG Versioning Finding from noah_mendelsohn@us.ibm.com on 2006-09-05 (www-tag@w3.org from September 2006)

From: <noah_mendelsohn@us.ibm.com>
Date: Mon, 4 Sep 2006 22:09:33 -0400
To: "David Orchard" <dorchard@bea.com>
Cc: www-tag@w3.org
Message-ID: <OF018ED846.90B2EFA9-ON852571E0.0004D0C6-852571E0.000BDDB6@lotus.com>
Dave Orchard writes:

> I'm not persuaded that a language doesn't include constraints on the
> language.  I think the key part is that the set of texts may be
> determined by the constraints. 

I think you're still missing my point.   I very much want to turn that 
around to say:  it's the set of texts that's fundamental.  The constraints 
are just a convenient shorthand for letting you know what's in the set.  I 
hope the examples below will explain why.

========
Example: 

Here's an example to illustrate the difference in the two approaches. 
Let's say that the language is the set of prime numbers less than 1 
million, expressed in the obvious way as character strings: "2", "3", "5", 
"7", "11" and so on. 

Your definition of language is:  "A Language consists of a set of text, 
any syntactic constraints on the text, a set of information, any semantic 
constraints on the information, and the mapping between texts and 
information. "  Let's apply that to this prime number example.  I think 
you intend something like:

The Prime number language is (Based on Dave's approach):
* Set of texts: Strings (note that this is only a loose bound on the set 
of texts actually in the primes language)
* Syntactic constraints:  digits only 
* Set of information: an integer resulting from the mapping given below
* Semantic constraints:  each item in the information must have no 
divisors other than 1 and itself, and each item must be < 1,000,000
* The mapping:  the obvious atoi() mapping, such as defined by XML Schema 
for integer lexical to value mappings.

Note that what you're calling the set of texts isn't really the texts in 
the language;  it's just a starting point so that the syntactic 
constraints can be a bit smaller.  That seems messy.  If you want to talk 
about the set of texts that are really in the language, then you need to 
intersect all the constraints.  For reasons explained below, I think it's 
much cleaner and simpler to leave out all that mechanism, and just go 
with: 

The Prime number language is (Noah's preferred formulation):
* Set of texts: "2", "3", "5", "7", "11"  ... "999983"  (I.e. it's really a 
set, and the set contains exactly the strings in the language, no more and 
no less. Whether I can conveniently enumerate it is a separate question.)
* The mapping:  the obvious atoi() mapping, such as defined by XML Schema 
for integer lexical to value mappings (same as for Dave's)

In my preferred formulation, the intensional constraint ("no member of the 
language may be divisible by any number other than 1 or itself") is just a 
shorthand for setting out the legal set of texts.  The constraints are not 
a part of the language, but they may be used as a convenient shorthand for 
specifying which texts are in the language.
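
A minimal Python sketch of this formulation, assuming a sieve is used
only as a convenient way to write the set down. The names primes_below,
PRIME_TEXTS, interpret and in_language are illustrative and do not come
from the draft finding.

```python
def primes_below(limit):
    """Sieve of Eratosthenes: all primes p with 2 <= p < limit."""
    is_prime = [True] * limit
    is_prime[0:2] = [False, False]
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            for multiple in range(n * n, limit, n):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]


# The language: a set of texts plus a mapping from text to information.
# The sieve above is only the shorthand used to enumerate the set.
PRIME_TEXTS = frozenset(str(p) for p in primes_below(1_000_000))


def interpret(text):
    """The atoi()-style mapping from a text to the integer it denotes."""
    return int(text)


def in_language(text):
    """Membership is a plain set lookup; no constraint is evaluated here."""
    return text in PRIME_TEXTS
```

Once the set exists, nothing about how it was specified is needed to
answer membership questions: in_language("999983") is a lookup, not a
primality test.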

==========

Why my strong preference?  Given this formulation, the test for membership 
of a text is merely a set membership test.  With Dave's, you need to test 
set membership, then syntactic legality, then map, then test semantics. It 
has to be harder to reason about all that.

Similarly, in my approach, the test for language compatibility can be a 
simple superset relation on the texts, along with a test that the 
information mapping is consistent.   For example, let's say I want to test 
whether another language L2 is compatible with the Primes language.  L2 is 
the odd numbers between 2 and 8, again represented as character strings:

Dave's definition of L2 would likely be:
* Loose bound on set of texts:  character strings
* Syntactic constraint: digits only
* Set of information: an integer resulting from the mapping given below
* Semantic constraints: 2<x<8; (x mod 2) == 1
* Mapping: atoi() 

My definition of L2:
* Set of texts: "3", "5", "7"
* Mapping: atoi() 

With my definition, there's a clear answer:  L2 is a sublanguage of the 
Prime Number language and is in that sense compatible.  Every text in L2 
is in Primes, and each maps to the same information in both languages. 
Done.
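
As a rough sketch of that test in Python, assuming each language is
modelled as a set of strings plus a mapping function: is_sublanguage is
a hypothetical helper, and PRIMES is cut down to the primes below 20 to
keep the example short.

```python
def is_sublanguage(texts_a, map_a, texts_b, map_b):
    """True if every text of A is in B and maps to the same information."""
    return texts_a <= texts_b and all(map_a(t) == map_b(t) for t in texts_a)


# Small stand-in for the Primes language (primes below 20 only).
PRIMES = frozenset({"2", "3", "5", "7", "11", "13", "17", "19"})

# L2: the odd numbers between 2 and 8.
L2 = frozenset({"3", "5", "7"})

# Both languages use the same atoi()-style mapping, here Python's int().
print(is_sublanguage(L2, int, PRIMES, int))  # True
```

The check never looks at how either set was specified, so it does not
matter whether the constraints were written in English, in Java, or as
a schema.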

I'm not sure how we cleanly express compatibility using the formulation 
with constraints.  The constraints for the two languages are expressed in 
a totally different way.   Did I write "no divisors" in English?  Did 
I write a loop in Java?  Now I have to intersect that with (x mod 2 == 1). 
We have to start asking all kinds of complicated questions about what it 
means when we try to compare constraints expressed in this way.  I think 
it's a big mistake to tangle those specification layers into the core 
definition of what the language is.  We're trying to build a simple, firm 
foundation for determining which languages can safely interoperate.

> When a processor determines whether a text is in the language, it 
> doesn't generate all the texts "in hand" and then compare, it will 
> look at the constraints and evaluate without having all the texts in
> hand.  I think any constraints are fundamentally part of the language. 

You're making a big leap from the mathematical characteristics of 
languages, the intersection of their texts, etc. to how in practice a 
processor would be built.  I don't think we want to tangle those.  Whether 
two languages have texts in common and have compatible interpretations is 
not fundamentally a statement about how you build processing software. It's 
a characteristic of the languages, whether interpreted by software or not.

I think your point is that for certain languages, enumeration is not the 
most practical way of defining the membership.   Agreed.  Indeed, if I 
eliminated the <1,000,000 in the primes example, enumeration would be 
impossible.  So, a processor will indeed likely use some encoding of the 
constraint, and when I tell you in an email what the language is I will 
not list the infinite set of its members.  I don't think the core notion 
of language needs to be something that fits into a finite specification in 
a computer.  As noted above, I am perfectly happy to consider the language 
of primes to be the infinite set of strings that happen to meet the 
primeness test, along with their mappings to abstract integers.
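
For the unbounded case, a processor might encode the constraint along
the following lines. This is only a sketch: is_prime_text is a
hypothetical name, trial division is one arbitrary choice of test, and
rejecting leading zeros is an assumption about what "the obvious way"
of writing integers means.

```python
def is_prime_text(text):
    """Constraint-style test: ASCII digits, no leading zero, prime value."""
    if not (text.isascii() and text.isdigit()) or text[0] == "0":
        return False
    n = int(text)
    if n < 2:
        return False
    d = 2
    while d * d <= n:  # trial division up to sqrt(n)
        if n % d == 0:
            return False
        d += 1
    return True


# On any finite range the predicate picks out the same texts as a
# brute-force enumeration; they are two specifications of one set.
enumerated = {str(n) for n in range(2, 100) if all(n % d for d in range(2, n))}
tested = {str(n) for n in range(100) if is_prime_text(str(n))}
assert tested == enumerated
```

The predicate is a finite description a computer can hold; the language
it describes is still the infinite set of texts together with their
mapping to integers.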

My whole point is that we can get a lot of mileage from very clean 
mathematical abstractions like sets of texts and information mappings.  I 
do think we need a chapter on the sort of constraint languages that are 
used to define such languages for use by software, and the techniques used 
to evolve the specifications of languages using those constraint-based 
systems.  That's where one would talk about how tools use constraint 
languages like XML Schema to help tailor processing software, and to aid 
in checking instances that are to be processed.  That's where you'd talk 
about what a processor would likely "have in hand", I think.

> Now I could flip this around and suggest we should go the opposite way
> and suggest removing Text Set and Information Set: languages have
> semantics, syntax and texts are in a language if they meet the syntax.
> Languages also have a mapping between any individual text and an
> individual information "item". 

I'm sorry, but that doesn't parse for me.  What do "languages have" and 
what "are in a language"?  I will say that I think sets are a fine 
mathematical formalism for comparing which texts are in which languages, 
and also for considering which information is conveyed compatibly in two 
or more languages.  I would be reluctant to drop the set formalism.

Thanks for the careful response to my concerns!  I'll look at the rest of 
your comments later.

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------






        "David Orchard" <dorchard@bea.com>
        09/04/2006 08:17 PM 
                 To: <noah_mendelsohn@us.ibm.com>
                 cc: <www-tag@w3.org>
                 Subject: RE: Noah Mendelsohn Comments on July 26 Draft of 
TAG Versioning Finding


Noah,

Part 1 of ? Parts.

I've gone through your comments.  Thanks again for doing some extensive
reviewing.  The comments that you made did not substantially conflict
with much of the work that I had done, which is goodness. 

I'm going to respond by quoting sections that you wrote followed by my
comments.  I hope that's the best way to work through the comments.

NM>>>
The finding claims that constraints are part of the language. I'm not
convinced that's a good formulation, since the constraints are embodied
in the set of texts & mappings. Stated differently, I think we're
confusing a "language" with "the specification of a language", and those
are very different. So, I think a language should be a set of texts and
their interpretation as information, and I am very happy with the way
you present that much. 

I think we should have separate sections that talk about managing the
specifications for languages as they evolve, and certainly constraint
languages like XML Schema are among the good tools for writing
specifications. It's OK to talk about keeping a language and its
specification in sync, and to talk about constraint language features
that facilitate versioning. I don't think the constraints are the
language. I think they are emergent properties of the language that can
sometimes be usefully set down in mathematical and/or machine readable
notation such as regex's or XML Schemas. This is an important
distinction on which I disagree with the finding as drafted 
<<

I'm not persuaded that a language doesn't include constraints on the
language.  I think the key part is that the set of texts may be
determined by the constraints.  Using one of your favourite examples, if
I create a language that has Red,Green,Blue.  There we have listed the
texts. 

But one of my favourite examples is the Name language, which has given
and family, and those are simply strings.  Whether Aaaaa and Aaaa and
Aaa and Aa are part of the language didn't even occur to me until I
wrote this.  When a processor determines whether a text is in the
language, it doesn't generate all the texts "in hand" and then compare,
it will look at the constraints and evaluate without having all the
texts in hand.  I think any constraints are fundamentally part of the
language. 

It seems to me that for some languages, membership is determined by having
the set of texts, and in others the set of texts can be generated from
the constraints.  So, can we come up with a modelling mechanism that
allows a language to refer to one thing, rather than the 2 that it
currently does (texts and constraints).   Perhaps this was what the
"membership" bucket was an attempt to model.

Now I could flip this around and suggest we should go the opposite way
and suggest removing Text Set and Information Set: languages have
semantics, syntax and texts are in a language if they meet the syntax.
Languages also have a mapping between any individual text and an
individual information "item". 

I thought about breaking the relationship between language and syntax,
leaving syntax just connected to text set.  If I squint hard enough, I
can see that could work.  But I think that doesn't pass the common view
of language, which is that language is directly related to syntax rather
than indirectly via text set, see my name syntax example. 

NM>>
I think we can and should do better in telling a story about whether a
particular text is compatible as interpreted in L1 or L2, vs. the senses
in which languages L1 and L2 as a whole are compatible. I think the
story I would tell would be along the lines of: 

Of a particular text written per language L1 and interpreted per
language L2: "Let I1 be the information conveyed by Text T per language
L1. Text T is "fully compatible" with language L2 if and only if when
interpreted per language L2 to yield I2, I1 is the same as I2. Text T is
"incompatible" if any of the information in I2 is wrong (I.e. was not
present in I1 or replaces a value in I1 with a different one...this
rule disallows additional information, because only the information in
I1 is what the sender thought they were conveying, so anything else is
at best correct accidentally). There are also intermediate notions of
compatibility: e.g. it may be that all of the information in I2 is
correct, but that I2 is a subset of I1. [Not sure whether we should name
some of these intermediate flavors, but if we do, they should be defined
precisely.]

Of languages L1 and L2: We say that language L2 is "fully backward
compatible" with L1 if every text in L1 is fully compatible with L2. We
say that language L1 is "backwards incompatible" with L2 if any text in
L1 is incompatible with L2. We say that Language L1 is "fully forwards
compatible" with L2 if every text in L2 is fully compatible with L1. We
say that L2 is "forwards incompatible with L1" if any text in L2 is
incompatible with L1. As with texts, there may be intermediate notions
of language compatibility for which we do not [or maybe we should?]
provide names here.
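
One way to make these definitions concrete is the Python sketch below.
It assumes the information conveyed by a text can be modelled as a
dictionary of named items. The label "partial" for the intermediate
case, and the helpers classify_text, fully_backward_compatible, name_v1
and name_v2, are all hypothetical and not taken from the finding.

```python
def classify_text(i1, i2):
    """Compare I1 (what the sender meant) with I2 (what the receiver got)."""
    if i1 == i2:
        return "fully compatible"
    # Any item in I2 that is absent from I1, or differs from it, is wrong.
    if any(k not in i1 or i1[k] != v for k, v in i2.items()):
        return "incompatible"
    # Everything in I2 is correct, but some of I1 was lost.
    return "partial"


def fully_backward_compatible(texts_l1, interpret_l1, interpret_l2):
    """True if every text in L1 is fully compatible when read per L2."""
    return all(
        classify_text(interpret_l1(t), interpret_l2(t)) == "fully compatible"
        for t in texts_l1
    )


def name_v1(text):
    """Illustrative L1 reading: a given name and a family name."""
    given, family = text.split(" ")
    return {"given": given, "family": family}


def name_v2(text):
    """Illustrative L2 reading: only the given name is understood."""
    return {"given": text.split(" ")[0]}


print(classify_text(name_v1("Ada Lovelace"), name_v2("Ada Lovelace")))  # partial
```

Neither function consults a syntax or a schema; both work only from the
texts and the information each language assigns to them.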

That all seems pretty simple and clean to me, and I think it's a firm
foundation for much of the rest of the analysis. Notice that it seems
natural to leave out discussion of the constraints in this layer; the
story gets simpler without them. The current draft seems to me a bit
loose in both talking about and defining issues for languages as a whole
vs. for individual texts. 
<<

I have been moving towards this space as well in my examination of
partial understanding.  It is yet another example of this "class" vs
"instance" that seems to always come up in modeling and system design. 

But I still disagree with the removal of syntax.  I could easily suggest
somewhat alternate wording that makes use of syntax and makes sense to
me.  "Of languages L1 and L2: We say that language L2 is "fully backward
compatible" with L1 if every text valid under L1's constraints is fully
compatible with L2".  I could even push it further and define the
Syntactic constraints, S1 and S2, then rephrase as "We say that language
L2 is "fully backward compatible" with L1 if every text valid under S1
..."

NM>>
Clarify focus on texts vs. documents
...
<<

I agree.  I've inserted part of one of your paragraphs.

My comments on your comments end here.



> -----Original Message-----
> From: noah_mendelsohn@us.ibm.com [mailto:noah_mendelsohn@us.ibm.com] 
> Sent: Monday, August 28, 2006 4:22 PM
> To: David Orchard
> Cc: www-tag@w3.org
> Subject: Noah Mendelsohn Comments on July 26 Draft of TAG 
> Versioning Finding
> 
> First of all, thanks again to Dave for the truly heroic work 
> on the versioning finding.  This problem is as tough as they 
> get IMO, and I think the drafts are making really steady 
> progress.  Still, as I've mentioned on a number of 
> teleconferences, I have a number of concerns regarding the 
> conceptual layering in the draft versioning finding, and some 
> suggestions that I think will make it cleaner and more 
> effective.  Dan Connolly made the very good point that it is 
> really only appropriate to raise such concerns in the context 
> of a detailed review of what has already been 
> drafted.   So, I've tried to do that. 
> 
> A copy of my annotated version of the July 26 draft is 
> attached.  I've taken quite a bit of trouble over these 
> comments, which are quite extensive, and while I'm sure that 
> they will prove to be only partly on the right track, I hope 
> they will get a detailed review not just from Dave 
> but also from other concerned TAG members.   Anyway, what 
> I've done is to 
> take Dave's July 26th draft and add comments marked up using 
> CSS highlighting.  These are in two main groups:
> 
> 1) An introductory section sets out some of the main 
> architectural issues and ideas that I've been trying to 
> convey.  I don't expect these will seem entirely justified 
> until you read the rest of the comments (if then), but I 
> think it's important to collect the significant ideas, and to 
> separate them from the smaller editorial suggestions.
> 
> 2) I've gone through about the first third of Dave's draft, 
> inserting detailed comments.  Some of these are purely 
> editorial, but most of them are aimed at motivating and 
> highlighting the concerns that led me to propose the major 
> points in that introductory chapter.  Indeed, I've tried to 
> hyperlink back from the running comments to the larger 
> points, as I think that helps to motivate them.
> 
> No editor working on a large draft entirely welcomes 
> voluminous comments, especially ones that have structural 
> implications.  Dave:  I truly hope this is ultimately useful, 
> and I look forward to working with you on it. 
> Where possible, I've tried to suggest text fragments you can 
> steal if you like them.  I actually am fairly excited, 
> because working through Dave's draft has helped me to 
> crystalize a number of things about versioning in my own 
> mind.  I think we're well along to telling a story that's 
> very clean, very nicely layered, and perhaps a bit simpler 
> and shorter than the current draft suggests.  I don't think 
> it involves throwing out vast swaths of what Dave has 
> drafted, so much as cleaning up and very carefully relayering 
> some concepts. 
> 
> BTW: I will be around on and off until about Wed. afternoon, 
> then gone 
> until after US Labor Day weekend.   Thanks again, Dave. 
> Really nice work!
> 
> Noah
> 
> [1] http://www.w3.org/2001/tag/doc/versioning-20060726.html
> 
> 
> 
> --------------------------------------
> Noah Mendelsohn
> IBM Corporation
> One Rogers Street
> Cambridge, MA 02142
> 1-617-693-4036
> --------------------------------------
> 
> 
> 
>
Received on Tuesday, 5 September 2006 02:09:57 UTC