Re: Language maps and undefined language from Gregg Kellogg on 2017-04-12 (public-linked-json@w3.org from April 2017)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Tue, 11 Apr 2017 19:29:13 -0700
To: Robert Sanderson <azaroth42@gmail.com>
Cc: Jakob Voß <jakob.voss@gbv.de>, Linked JSON <public-linked-json@w3.org>
Message-Id: <88289B43-2ECE-46D1-96F6-E8F639801BDE@greggkellogg.net>
> On Apr 11, 2017, at 5:37 PM, Robert Sanderson <azaroth42@gmail.com> wrote:
> 
> 
> Please consider the I18n group at the W3C on the topic:
> 
>    https://www.w3.org/International/questions/qa-no-language
> 
> To excerpt the document:
> 
> > Use the subtag zxx when the text is known to be not in any language.
> 
> > [...] use xml:lang="" <http://www.w3.org/TR/REC-xml/#sec-lang-tag>, otherwise use xml:lang="und". These values indicate that we cannot determine, for one reason or another, what the appropriate language information is, or whether the text is non-linguistic. 
> 
> Note that we cannot use "", as noted, because PHP does not support the empty string as the key of a dictionary... and thus we fall back to using "und".

This likely depends on your perspective. In RDF, if there is no language, the datatype is xsd:string, otherwise, rdf:langString. IMO, xsd:string pretty definitively says that there is no language, closer to “zxx”.

For JSON-LD, you might imagine that something expanding to `{"@value": "foo"}` is not, in fact, making any such claim, so that “und” is appropriate. Otherwise, it would be asserted as `{"@value": "foo", "@type": "xsd:string"}`. However, the RDF to JSON-LD transformation does not preserve “xsd:string” in this case (no RDF serializations really do).

Would we expand `{"@value": "foo", "@language": "und"}` into simply `{"@value": "foo"}`? That would remove the assertion that the language is unknown, whereas for `{"@value": "foo", "@language": "zxx"}` it is easier to see dropping the “@language”.

Of course, this might also depend on the range of the property: dc:title would seem to want something that has a language, whereas dc:format probably would not. But looking at the range of properties, other than as defined in the context, is out of scope for JSON-LD. Associating a language map with a term may be enough to indicate that any value should probably be considered to have a language. Round-tripping may become an issue, though.
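
To make the two proposals in this thread concrete, here is a rough Python sketch of what a compaction step might do when it fills a language map. It is not the JSON-LD API or the spec algorithm; `build_language_map` and its `fallback` parameter are made-up names for illustration only. With a fallback key, untagged strings land under “und”; with `fallback=None`, they are dropped with a warning.

```python
import warnings


def build_language_map(values, fallback="und"):
    """Group expanded value objects by language tag (illustrative only).

    values:   list of dicts such as {"@value": "foo", "@language": "en"}.
    fallback: key used for values without @language; pass None to drop
              such values with a warning instead (hypothetical parameter).
    """
    result = {}
    for value in values:
        lang = value.get("@language")
        if lang is None:
            if fallback is None:
                # The "MUST NOT include, SHOULD warn" variant.
                warnings.warn(f"dropping {value['@value']!r}: no language tag")
                continue
            # The "assign und as the key" variant.
            lang = fallback
        existing = result.get(lang)
        if existing is None:
            result[lang] = value["@value"]
        elif isinstance(existing, list):
            existing.append(value["@value"])
        else:
            # A second string for the same language turns the entry into a list.
            result[lang] = [existing, value["@value"]]
    return result


values = [
    {"@value": "The Queen", "@language": "en"},
    {"@value": "Die Königin", "@language": "de"},
    {"@value": "foo"},  # aggregated from a source without language tags
]

with_und = build_language_map(values)
```

Note that the first variant cannot tell an originally untagged string from one explicitly tagged “und”, which is the round-tripping concern above.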

Gregg

> Rob
> 
> 
> On Tue, Apr 11, 2017 at 5:28 PM, Robert Sanderson <azaroth42@gmail.com> wrote:
> 
> The use case is when you have data from multiple sources, some with language tags and some without. When you aggregate the triples at the moment, you get garbage in the JSON-LD representation. The "perfection or nothing" approach proposed seems to be against the spirit of JSON-LD's "make it work for the developer" ethos.
> 
> I prefer UND to ZXX because there is likely to be linguistic content; it's just that we don't know which language (if any) it's in. "Undetermined" seems to include the possibility of no language, whereas ZXX seems more explicitly not linguistic, and MIS/MUL are explicitly linguistic. I would say that the vast majority of the time, legacy data does not have per-string language associations ... and thus the case of "we just don't know but we think it's linguistic" is also the vast majority of the cases.
> 
> Rob
> 
> 
> On Mon, Apr 10, 2017 at 11:13 PM, Jakob Voß <jakob.voss@gbv.de> wrote:
> Hi,
> 
> Gregg Kellogg wrote:
> 
> > In CSVW, we coined “und” as the undefined/absent language.
> 
> "und" is a perfectly legal language tag, defined in the IANA language
> tag registry:
> 
> Type: language
> Subtag: und
> Description: Undetermined
> Added: 2005-10-16
> Scope: special
> 
> The other language tags in the "special" Scope are:
> 
> zxx: No linguistic content/Not applicable
> mis: Uncoded languages
> mul: Multiple languages
> 
> One might argue that "zxx" is actually equivalent to no language tag.
> Anyway, "und" is actually used for "unknown language" in contrast to "no
> language". If your data model expects strings to always have languages,
> "und" makes sense, but in this case there should not be literal strings
> without a language tag anyway (see the JSKOS JSON-LD profile for SKOS for
> an example).
> 
> Robert wrote:
> 
> > If compaction would result in an attempt to add a string without an
> > associated language into a LanguageMap, then the processor SHOULD
> > assign the undefined language code `UND` as the key in the array.
> 
> I'd prefer this:
> 
> If compaction would result in an attempt to add a string without an
> associated language into a LanguageMap, then the processor MUST NOT
> include this string. Instead, it SHOULD emit a warning that the data
> to compact does not fit the expected data model expressed by the
> definition of a LanguageMap.
> 
> In theory, any kind of RDF data should be expressible with any kind of
> JSON-LD context. In practice, each JSON-LD context defines a data model
> with implicit or explicit assumptions about what RDF data can be
> expressed in a meaningful way. I prefer meaningful data over hacks to
> express data that does not conform to expectations anyway.
> 
> What's the actual use case of having non-language strings in language maps?
> 
> Jakob
> 
> 
> 
> 
> -- 
> Rob Sanderson
> Semantic Architect
> The Getty Trust
> Los Angeles, CA 90049
> 
Received on Wednesday, 12 April 2017 02:29:49 UTC