- From: Andy Seaborne <andy.seaborne@topquadrant.com>
- Date: Fri, 9 Sep 2016 10:35:05 +0100
- To: public-data-shapes-wg@w3.org
> * Constrain the valid language tags to a provided set, e.g. (@en, @de,
> @fr)
>
> See my email, sh:langShape [ sh:in ( "en" "de" "fr" ) ]

Do these match?

"EN", "en-GB", "en-US", "@de-Latn-DE-1996"

It seems easier to adopt RFC4647 matching (in which case they all match). To match "en" exactly, sh:not can be used for "not match en-*" or "not match *-*".

In RDF 1.1, language tags compare case insensitively. In the RDF world, force-to-lower-case is common and endorsed by the RDF 1.1 specs.

https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal

RFC4647 defines language matching; section 3.3.1 is basic filtering.

SPARQL has LANGMATCHES, which applies RFC4647.

A predicate for "language match" that uses RFC 4647 would be natural.

    sh:langShape [ sh:language ( "en" "de" "fr" ) ]

Matches "en", "en-gb", "EN-GB", "en-uk", "de", "de-de"

    sh:langShape [ sh:language ( "en-*") ]

Matches "en-gb", "en-us" but not "en"

http://www.ietf.org/rfc/rfc4647.txt
https://www.w3.org/TR/sparql11-query/#func-langMatches

The implementation burden of RFC4647 is not high. The algorithm is in the RFC if not using SPARQL.

(I have no strong opinion on the predicate name)

> * Require that all literals have/do not have a language tag
>
> Already exists: sh:datatype rdf:langString

True, though more natural to users to use a language-match of "*", which is defined in RFC4647.

    sh:langShape [ sh:language ("*") ]

> * Check that the language tag is 2-letter | 3-letter | does/does not
> have hyphens
>
> sh:langShape [ sh:minLength 2 ; sh:maxLength 2 ; or: sh:pattern "...
> regex ..." ]

2-letter, 3-letter is about the primary subtag? (the part up to the first "-")

> * Check that the 2 or 3-letter tag is valid

(I can't find the original use case for this on the issue log)

This is outside the RFC4647 algorithm and needs a regex.

> Assuming that the list of valid tags is stored somewhere, e.g. in an
> rdf:List iso:ValidLanguages:

"in" a list will need to be case insensitive.
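To illustrate how small the RFC 4647 basic filtering algorithm (section 3.3.1) is, here is a minimal sketch in Python. The function names `basic_filter_match` and `matches_any` are made up for this example and are not part of SHACL or any library. Only basic language ranges (a tag or "*") are handled; ranges with an embedded wildcard such as "en-*" are extended ranges in RFC 4647 and are not covered here.

```python
def basic_filter_match(language_range: str, tag: str) -> bool:
    """Basic filtering in the style of RFC 4647 section 3.3.1.

    Hypothetical helper: compares case-insensitively, as RDF 1.1 does
    for language tags.
    """
    rng = language_range.lower()
    tag = tag.lower()
    if rng == "*":
        # The wildcard range matches any non-empty language tag.
        return tag != ""
    # A range matches a tag if they are equal, or if the range is a
    # prefix of the tag and the next character in the tag is "-".
    return tag == rng or tag.startswith(rng + "-")


def matches_any(ranges, tag: str) -> bool:
    """True if the tag matches at least one range in the list."""
    return any(basic_filter_match(r, tag) for r in ranges)
```

With the ranges ("en" "de" "fr"), `matches_any` accepts "EN", "en-GB" and "en-uk", and rejects "it". The prefix rule requires a "-" boundary, so the range "en" does not match a tag such as "eng".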
In the real world, data can be a bit messy. To pick an example close to me, "en-uk" does not officially exist but it is not that uncommon and seems to be tolerated. It would be good to both be able to cause a violation for it and also be able to be lax about it.

    Andy

RFC3066: https://www.ietf.org/rfc/rfc3066.txt
section 2.1
[[
The syntax of this tag in ABNF [RFC 2234] is:

    Language-Tag = Primary-subtag *( "-" Subtag )
    Primary-subtag = 1*8ALPHA
    Subtag = 1*8(ALPHA / DIGIT)
]]
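The RFC 3066 ABNF quoted above translates fairly directly into a regular expression. The following is only a sketch, in Python; `LANGUAGE_TAG`, `is_wellformed`, `primary_subtag` and `has_short_primary` are names invented for this example. It checks syntax and the length of the primary subtag only, and does not check whether a subtag is registered.

```python
import re

# Language-Tag   = Primary-subtag *( "-" Subtag )
# Primary-subtag = 1*8ALPHA
# Subtag         = 1*8(ALPHA / DIGIT)
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_wellformed(tag: str) -> bool:
    """True if the whole string fits the RFC 3066 section 2.1 syntax."""
    return LANGUAGE_TAG.fullmatch(tag) is not None


def primary_subtag(tag: str) -> str:
    """The part up to the first "-", lower-cased for comparison."""
    return tag.split("-", 1)[0].lower()


def has_short_primary(tag: str) -> bool:
    """True if the tag is well-formed and its primary subtag has 2 or 3 letters."""
    return is_wellformed(tag) and len(primary_subtag(tag)) in (2, 3)
```

Note that "en-uk" is syntactically well-formed under this grammar, so flagging it as a violation would need a lookup against a list of valid subtags rather than a syntax check.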
Received on Friday, 9 September 2016 09:35:36 UTC