- From: Jeremy Carroll <jjc@hpl.hp.com>
- Date: Thu, 05 Apr 2007 10:34:44 +0100
- To: public-rdf-dawg-comments@w3.org
This is a comment on:
http://www.w3.org/TR/2007/WD-rdf-sparql-query-20070326/#func-langMatches

specifically the text:
[[
matches language-tag (first argument) per Matching of Language Tags
[RFC4647] section 2.1.
]]

Contents of comment:
- issue statement
- suggested editorial textual change
- further analysis and options
  (the bulk of the message, which to large part can be ignored)

Issue
=====

Section 2.1 of RFC 4647 defines basic language ranges, without giving
any semantics, nor defining an algorithm for "matches". Hence the word
"matches" in the quoted text is unbound, and without clear meaning.

Sections 3.3.1 and 3.3.2 and 3.4 each describe different matching
algorithms that can be used with basic language ranges.

Suggested text:
===============

Replace
[[
Returns true if language-range (second argument) matches language-tag
(first argument) per Matching of Language Tags [RFC4647] section 2.1.
]]
with
[[
Returns true if language-range (second argument) matches language-tag
(first argument). language-range is a basic language range per Matching
of Language Tags [RFC4647] section 2.1. 'matches' is defined as basic
filtering in [RFC4647] section 3.3.1.
]]

Analysis
========

The algorithm of section 3.4 is not suitable since it is scoped as
[[select[ing] the single language tag that best matches the [...]
request]].
i.e. it always gives exactly one result, when matching against any
non-empty set of languages - it does not define a boolean function:
  lang-tag x lang-range => boolean,
but a selection function
  non-empty-list-of-lang-tags x lang-range => lang-tag

The algorithm of section 3.3.2 is designed for extended language ranges,
which are more appropriate for the new features of RFC 4646 (such as
script subtags). The reference to section 2.1 is indicative that SPARQL
is more interested in basic language ranges, which were already
specified in RFC 3066, and are suited to matching lang tags that conform
with RFC 3066 (and hence also with RFC 4646).
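
For illustration only, basic filtering as I read section 3.3.1 of
RFC 4647 can be sketched as below. The function name basic_filter_match
is hypothetical (it appears in neither the RFC nor the SPARQL WD), and
the argument order simply mirrors langMatches (tag first, range second).
The sketch follows the RFC in letting the range "*" match any tag.

```python
def basic_filter_match(language_tag: str, language_range: str) -> bool:
    """Sketch of RFC 4647 section 3.3.1 basic filtering (hypothetical helper).

    A basic language range matches a tag if, compared case-insensitively,
    it equals the tag or equals a prefix of the tag that is immediately
    followed by "-". The special range "*" matches any tag.
    """
    if language_range == "*":
        return True
    tag = language_tag.lower()
    rng = language_range.lower()
    # The trailing "-" ensures that "de-de" does not match "de-deva".
    return tag == rng or tag.startswith(rng + "-")


if __name__ == "__main__":
    for tag in ("de-DE", "de-DE-1996", "de-Latf-DE", "de-Deva"):
        print(tag, basic_filter_match(tag, "de-DE"))
```

Under this reading the result is a plain boolean for one tag and one
range, which is the shape langMatches needs, unlike the lookup
algorithm of section 3.4.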

The algorithm of section 3.3.1 is hence (IMO) currently the closest
'reading' of the SPARQL WD.

Technically, the choice is:
a) use basic language ranges (section 2.1) and basic filtering (3.3.1)
or
b) use extended language ranges (section 2.2) and extended filtering
   (3.3.2)

FYI, the extended language ranges are like language ranges except they
permit a "*" in any subtag position, e.g.
  de-*-DE
  de-DE-*
(but not de-DE*)

When used with extended filtering, any -*- is effectively ignored, and
treated as -, but note that an initial *- is significant.

Then (simplifying by ignoring private use and other extensions) a lang
range matches a lang tag if both
a) the first subtags match
b) (ignoring the *'s) the "-"-separated sequence of the language range
   is a subsequence (allowing arbitrary deletions) of the "-"-separated
   sequence of the language tag.

The reason this is more appropriate for new RFC 4646 style tags is that
RFC 4646 allows additional information, such as script subtags, to be
inserted in the appropriate place in a tag.

So, the example given in RFC 4647 is that
  de-DE
basic matches
  de-DE
  (i.e. German as spoken in Germany)
  de-DE
basic matches
  de-DE-1996
  (i.e. German as spoken in Germany, written with the orthography of
  1996)
  de-DE
does not basic match
  de-Latf-DE
  (i.e. German, as spoken in Germany, written in the Fraktur variant of
  the Latin script)

whereas both the basic matches are extended matches (indeed, any basic
match is an extended match), but also
  de-DE
extended matches
  de-Latf-DE
which is probably more consistent behaviour from the end user's point of
view when using such new features of RFC 4646 style tags.

It is plausible that some semantic web applications may well have a need
for using extended language ranges like "*-Latn", for example, to
populate some part of a web page, when no content exactly matching the
current language preferences has been found.
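
Again purely as a sketch, extended filtering as I read section 3.3.2 of
RFC 4647 might look like the following. The function name
extended_filter_match is hypothetical. Unlike the simplified
description above, the sketch also includes the RFC's rule that
skipping stops at a singleton subtag (such as "x"), so that private use
and extension sequences are not skipped over.

```python
def extended_filter_match(language_tag: str, language_range: str) -> bool:
    """Sketch of RFC 4647 section 3.3.2 extended filtering (hypothetical helper)."""
    tag = language_tag.lower().split("-")
    rng = language_range.lower().split("-")

    # The first subtags must match; an initial "*" in the range matches anything.
    if rng[0] != "*" and rng[0] != tag[0]:
        return False

    t = 1  # index of the current subtag in the tag
    r = 1  # index of the current subtag in the range
    while r < len(rng):
        if rng[r] == "*":
            r += 1            # a non-initial wildcard is simply ignored
        elif t >= len(tag):
            return False      # range subtags remain but the tag is exhausted
        elif rng[r] == tag[t]:
            r += 1
            t += 1
        elif len(tag[t]) == 1:
            return False      # never skip past a singleton such as "x"
        else:
            t += 1            # skip a tag subtag absent from the range
    return True


if __name__ == "__main__":
    for tag in ("de-DE", "de-DE-1996", "de-Latf-DE", "de-x-DE", "de"):
        print(tag, extended_filter_match(tag, "de-DE"))
    print("sr-Latn", extended_filter_match("sr-Latn", "*-Latn"))
```

With this sketch, de-DE matches de-Latf-DE, and a range such as
"*-Latn" matches any tag carrying the Latn script subtag, which is the
kind of query discussed above.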

Many users have a preference for text in a script they can read, even if
they don't understand it, over a perhaps intelligible word, written in a
script that is not intelligible.

This use case, however, depends on widespread use of RFC 4646 script
subtags, which, while possibly desirable, is not yet a reality.
Moreover, code that worked to end user satisfaction would also depend on
appropriate deployment of section 4.1 of RFC 4646 (choice of language
tag) either in the code or the processes of constructing the semantic
web data or both, so that script codes were used consistently.

Thus, I have suggested the more conservative change, but would be
equally satisfied if the SPARQL WG wanted to embrace extended language
ranges!

Jeremy

--
Hewlett-Packard Limited
registered Office: Cain Road, Bracknell, Berks RG12 1HN
Registered No: 690597 England
Received on Thursday, 5 April 2007 09:35:21 UTC