Re: 2001-09-07#5 - literal problem from Pat Hayes on 2001-09-13 (w3c-rdfcore-wg@w3.org from September 2001)

From: Pat Hayes <phayes@ai.uwf.edu>
Date: Thu, 13 Sep 2001 10:47:17 -0500
To: "Jeremy Carroll" <jjc@hplb.hpl.hp.com>
Cc: w3c-rdfcore-wg@w3.org
Message-Id: <p0510100ab7c68205f4a9@[205.160.76.210]>
>Here is a heads-up on where Bill and I have got to.
>ACTION: 2001-09-07#5: Jeremy Carroll
>    Collaborate with Bill dehOra to produce analysis of the literal
>    problem, options, pros/cons for WG consideration.
>
>We are expecting to continue this action into next week.
>
>
>
>The issues we are addressing are:
>
>1: The Representation of Literals
>       e.g. a pair of a Unicode String and an RFC 3066 language tag.
>
>2: Equality
>       a complete equality rule for literals, correcting para 217-219 and
>para 220, which currently leave parts of equality as undefined.
>
>3: Preferred Mappings under rdf:parseType="Literal"
>   (of xml document fragments into the representation of [1]).
>
>   - Currently para 203 and para 220 are explicitly vague about this mapping.
>   - We wish to specify one or more preferred mappings.
>   - We may wish to specify more than one conformance level
>       + motivation
>          there are two significantly different use cases
>             + quoting simple xhtml
>             + quoting arbitrary XML fragments
>
>4: Cases in which applications may expect problems.
>   - Older RDF parsers will not be using the preferred mappings,
>   we should indicate to document writers which XML fragments are likely to
>be problematic.
>
>
>A less ambitious approach would be to not specify a preferred mapping and
>simply specify a minimal requirement about some markup that should work
>(e.g. embedded xhtml without namespaces and without references). This would
>be consistent with M&S and put off the work M&S defers until RDF 2.0.
>
>
>Questions for WG for tomorrow:
>+ Does the WG agree that a Literal is a <Unicode String,RFC 3066> pair?
>+ Does the WG agree that Literal equality should be defined?
>+ Does the WG agree that the new specs should describe a specific Unicode
>string to be delivered by rdf:parseType="Literal"?
>
>Our current working text for the literal representation is:
>
>==========
>
>An RDF Literal is* a Unicode string, optionally** paired with a
>language tag (as defined in RFC3066).
>
>When comparing two RDF Literals, their Unicode strings must be
>equal for the RDF Literals to compare as equal. If both Literals
>have language tags, these tags must be equal for the Literals to
>be considered equal. If the strings of two Literals are equal but only
>one has a language tag, the Literals should not*** be considered
>equal.
>
>The equality of Unicode strings is specified by W3C I18N WG;
>see [fixme:url]. Language tag equality is defined by RFC3066
>and is case insensitive.
>
>===========
>* is represented as, or is a? Do we pass by reference or value :)
>
>** equivalently we could delete 'optionally' and allow the language tag to
>be null, or default to "und", the ISO-639-2 undetermined language. Note that
>the following, from RFC 3066, suggests omitting the tag.
>
>    5. You SHOULD NOT use the UND (Undetermined) code unless the protocol
>       in use forces you to give a value for the language tag, even if
>       the language is unknown.  Omitting the tag is preferred.
>
>***the purpose of 'should not' is to allow applications some flexibility
>in dealing with language tags. That is, when a literal is equal to
>another but only one has a language tag, they can be considered equivalent,
>which might be sufficient for some applications to make a match.

I find this odd. Why not let them be equal in this case? Omitting the 
language tag presumably means that no language information is being 
supplied. But in that case, there is no need to reject a match with 
an identical literal which does have a language tag, is there?

Or is the idea that omitting the language tag is a way of indicating 
that ANY language tag would be inappropriate?

(A rational way to make sense of this would be to assume that UND 
means 'not any known language' and a missing tag simply indicates no 
information; but that rational interpretation seems to be ruled out 
by the above quote from RFC 3066, unfortunately.)
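
For concreteness, the equality rule in the quoted working text can be sketched roughly as below. This is only an illustration under stated assumptions: the names Literal and strict_equal are hypothetical, a missing tag is modelled as None, and plain Python string comparison stands in for whatever Unicode string equality the I18N WG ends up specifying. The 'should not' case is treated here as a flat "not equal".

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """Hypothetical model of an RDF Literal: a string plus an optional tag."""
    string: str
    tag: Optional[str] = None  # None models an omitted language tag


def strict_equal(a: Literal, b: Literal) -> bool:
    """Equality as in the working text, reading 'should not' as 'not equal'."""
    # Assumption: code-point comparison stands in for the I18N WG definition.
    if a.string != b.string:
        return False
    # If either tag is omitted, the literals are equal only if both are omitted.
    if a.tag is None or b.tag is None:
        return a.tag is None and b.tag is None
    # Language tag comparison is case insensitive, per the working text.
    return a.tag.lower() == b.tag.lower()
```

Under this sketch, Literal("Smith", "en") and Literal("Smith", "EN") compare equal, while Literal("Smith") and Literal("Smith", "en") do not; the latter is the case being questioned here.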

>
>The truth table corresponding to that notion of equality is:
>
>Truth table for equality, where (s1,t1) == string1, tag1; f* means should
>not be true; assume s1!=s and t1!=t according to the specs in question.
>
>         (s,_)  (s,t)  (s1,_)  (s1,t1)
>--------------------------------------
>(s,_)     t      f*      f       f
>(s,t)     f*     t       f       f
>(s1,_)    f      f       t       f*
>(s1,t1)   f      f       f*      t   (s,t1)  (s1,t)
>                                      ---------------
>(s,t1)    f*     f       f       f      t      f
>(s1,t)    f      f       f*      f      f      t
>
>

Seems to me all the f* could be t. Why not have (Smith,English) 
match (Smith,_)? It has to be an English word or the first literal 
wouldn't be well-formed. If we reject this, why do we accept 
(Smith,_) matching (Smith,_)? After all, "Smith" might mean Sailor in 
Catalan.
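
The alternative suggested here, with every f* in the table read as t, could be sketched as follows. Again this is only an illustration: lenient_match is a hypothetical name, a literal is modelled as a (string, tag) tuple with None for an omitted tag, and plain string comparison is assumed for the Unicode strings.

```python
from typing import Optional, Tuple

# Hypothetical model: (string, tag), with tag None when it is omitted.
Literal = Tuple[str, Optional[str]]


def lenient_match(a: Literal, b: Literal) -> bool:
    """Match in which an omitted tag is compatible with any tag (f* read as t)."""
    (s_a, t_a), (s_b, t_b) = a, b
    # The strings must agree in every case.
    if s_a != s_b:
        return False
    # An omitted tag carries no information, so it never blocks a match.
    if t_a is None or t_b is None:
        return True
    # Two explicit tags are compared case insensitively.
    return t_a.lower() == t_b.lower()
```

With this sketch, ("Smith", "en") matches ("Smith", None), and ("Smith", None) matches ("Smith", "ca"), but ("Smith", "en") still does not match ("Smith", "ca"). So the relation behaves like a compatibility test between literals, not like a transitive equality.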

Pat
-- 
---------------------------------------------------------------------
IHMC					(850)434 8903   home
40 South Alcaniz St.			(850)202 4416   office
Pensacola,  FL 32501			(850)202 4440   fax
phayes@ai.uwf.edu 
http://www.coginst.uwf.edu/~phayes
Received on Thursday, 13 September 2001 11:47:23 UTC