- From: Andy Seaborne <andy.seaborne@topquadrant.com>
- Date: Fri, 9 Sep 2016 10:35:05 +0100
- To: public-data-shapes-wg@w3.org
> * Constrain the valid language tags to a provided set, e.g. (@en, @de,
> @fr)
>
> See my email, sh:langShape [ sh:in ( "en" "de" "fr" ) ]
Do these match? "EN", "en-GB", "en-US", "@de-Latn-DE-1996"
It seems easier to adopt RFC4647 matching (in which case they all
match). To match "en" exactly, sh:not can be used for "not match en-*"
or "not match *-*".
In RDF 1.1, language tags compare case insensitively. In the RDF world,
force-to-lower-case is common and endorsed by the RDF 1.1 specs.
https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
RFC4647 defines language matching; section 3.3.1 is basic filtering.
SPARQL has LANGMATCHES, which applies RFC4647.
A predicate for "language match" that uses RFC 4647 would be natural.
sh:langShape [ sh:language ( "en" "de" "fr" ) ]
Matches "en", "en-gb", "EN-GB", "en-uk", "de", "de-de"
sh:langShape [ sh:language ( "en-*") ]
Matches "en-gb", "en-us" but not "en"
http://www.ietf.org/rfc/rfc4647.txt
https://www.w3.org/TR/sparql11-query/#func-langMatches
The implementation burden of RFC4647 is not high. The algorithm is in
the RFC if not using SPARQL.
(I have no strong opinion on the predicate name)
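As a rough illustration of how small the basic filtering algorithm (RFC4647 section 3.3.1) is, here is a minimal Python sketch. The function names are made up for this example and are not part of any SHACL or SPARQL API. It only handles basic language ranges (a tag or "*"); ranges such as "en-*" are extended language ranges (section 3.3.2) and are not handled here.

```python
def basic_filter_match(language_range: str, tag: str) -> bool:
    """Sketch of RFC 4647 basic filtering (section 3.3.1).

    A range matches a tag if, ignoring case, it equals the tag or is a
    prefix of the tag that is immediately followed by "-".
    The range "*" matches any non-empty tag.
    """
    rng = language_range.lower()
    t = tag.lower()
    if not t:
        # A literal with no language tag never matches.
        return False
    if rng == "*":
        return True
    return t == rng or t.startswith(rng + "-")


def matches_any(ranges, tag: str) -> bool:
    """True if the tag matches at least one range in the list (hypothetical helper)."""
    return any(basic_filter_match(r, tag) for r in ranges)


if __name__ == "__main__":
    for tag in ["EN", "en-GB", "en-US", "de-Latn-DE-1996", "it"]:
        print(tag, matches_any(["en", "de", "fr"], tag))
```

The prefix test requires a "-" right after the range, so "en" does not match a tag such as "english".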
> * Require that all literals have/do not have a language tag
>
> Already exists: sh:datatype rdf:langString
True, though it is more natural for users to use a language-match of "*",
which is defined in RFC4647.
sh:langShape [ sh:language ("*") ]
> * Check that the language tag is 2-letter | 3-letter | does/does not
> have hyphens
>
> sh:langShape [ sh:minLength 2 ; sh:maxLength 2 ; or: sh:pattern "...
> regex ..." ]
2-letter, 3-letter is about the primary subtag? (the part up to the
first "-")
> * Check that the 2 or 3-letter tag is valid
(I can't find the original use case for this on the issue log)
This is outside the RFC4647 algorithm and needs a regex.
> Assuming that the list of valid tags is stored somewhere, e.g. in an
> rdf:List iso:ValidLanguages:
"in" a list will need to be case insensitive.
In the real world, data can be a bit messy. To pick an example close to
me, "en-uk" does not officially exist but it is not that uncommon and
seems to be tolerated. It would be good to both be able to cause a
violation for it and also be able to be lax about it.
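A small Python sketch of the case-insensitive "in" check, with an optional way to be lax about tags like "en-uk". The function name, the VALID list (a stand-in for something like iso:ValidLanguages) and the "tolerated" parameter are all assumptions made for this illustration.

```python
def in_language_list(tag, valid, tolerated=frozenset()):
    """Case-insensitive membership test for a language tag.

    'valid' is the list of accepted tags; 'tolerated' is an optional,
    hypothetical set of unofficial tags that should not cause a violation.
    """
    folded = tag.lower()
    valid_folded = {v.lower() for v in valid}
    tolerated_folded = {t.lower() for t in tolerated}
    return folded in valid_folded or folded in tolerated_folded


# Illustrative data only, not a real registry.
VALID = ["en", "en-GB", "en-US", "de", "fr"]

if __name__ == "__main__":
    print(in_language_list("EN-gb", VALID))                       # strict check
    print(in_language_list("en-uk", VALID))                       # strict check
    print(in_language_list("en-UK", VALID, tolerated={"en-uk"}))  # lax check
```

Leaving "tolerated" empty gives the strict behaviour (a violation for "en-uk"); passing it gives the lax one.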
Andy
RFC3066:
https://www.ietf.org/rfc/rfc3066.txt
section 2.1
[[
The syntax of this tag in ABNF [RFC 2234] is:
Language-Tag = Primary-subtag *( "-" Subtag )
Primary-subtag = 1*8ALPHA
Subtag = 1*8(ALPHA / DIGIT)
]]
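The ABNF above translates fairly directly into a regular expression. A minimal Python sketch follows; the names are invented for this example. It checks only the RFC3066 syntax, not whether a subtag is actually registered.

```python
import re

# Language-Tag   = Primary-subtag *( "-" Subtag )
# Primary-subtag = 1*8ALPHA
# Subtag         = 1*8(ALPHA / DIGIT)
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_wellformed(tag: str) -> bool:
    """True if the whole string fits the RFC 3066 Language-Tag syntax."""
    return RFC3066_TAG.fullmatch(tag) is not None


def primary_subtag(tag: str) -> str:
    """The part of the tag up to the first "-"."""
    return tag.split("-", 1)[0]


if __name__ == "__main__":
    for tag in ["en", "en-GB", "de-Latn-DE-1996", "@de-Latn-DE-1996", "en-"]:
        print(tag, is_wellformed(tag))
    print(len(primary_subtag("en-GB")))
```

A 2-letter or 3-letter check on the primary subtag can then be done with len(primary_subtag(tag)).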
Received on Friday, 9 September 2016 09:35:36 UTC