Re: [SKOS] languages and scripts

From: Jakob Voss <jakob.voss@gbv.de>
Date: Mon, 05 Feb 2007 19:46:02 +0100
Message-ID: <45C77B6A.6070006@gbv.de>
To: SKOS <public-esw-thes@w3.org>

Hi,

Looks like we stumbled upon the issue that scripts and languages are not
predefined but concepts of their own. :-)

Miles, AJ (Alistair) wrote:

> Firstly, an application may wish to distinguish between labels in
> different scripts, for display purposes. How is the script of a label
> to be represented in an RDF graph?
> 
> I found a standard list of script names [2], I believe for Japanese
> the values are as follows ..
> 
> * Hani (Kanji) 
> * Hira (Hiragana)
> * Kana (Katakana)
> * Latn (Latin)

Another example is Serbian, which can be written in both Cyrillic and
Latin (the same applies to Greek, but for Serbian it is more usual to use
both scripts, as far as I know). See:

http://en.wikipedia.org/wiki/Serbian_Wikipedia
http://sr.wikipedia.org/sr-el/  (Latin)
http://sr.wikipedia.org/sr-ec/  (Cyrillic)

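There is only one ISO 639-1 two-letter code ("sr"), but there are two
ISO 639-2 three-letter codes ("scc" and "srp"). These are the
bibliographic and terminologic codes for the same language, so neither
of them identifies the script.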

The differences between Japanese scripts are not encoded in the current
ISO 639 but in ISO 15924:

500 Hani (Kanji)
410 Hira (Hiragana)
411 Kana (Katakana)
412 Hrkt (alias for Hiragana + Katakana)
413 Jpan (alias for Kanji + Hiragana + Katakana)

As you can see, this is not a complete solution either.
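
One way to read the limitation: the alias codes stand for combinations
of scripts, so a label tagged with an alias does not say which script it
is actually written in. A minimal Python sketch of that, using only the
five ISO 15924 entries listed above; the names ISO_15924, ALIASES and
covers are made up for illustration and are not part of any standard:

```python
# ISO 15924 entries from the list above: numeric code -> four-letter code.
ISO_15924 = {500: "Hani", 410: "Hira", 411: "Kana", 412: "Hrkt", 413: "Jpan"}

# The two alias codes stand for combinations of the basic scripts.
ALIASES = {
    "Hrkt": {"Hira", "Kana"},
    "Jpan": {"Hani", "Hira", "Kana"},
}

def covers(declared, actual):
    """Return True if a label tagged `declared` may be written in `actual`."""
    # A non-alias code only covers itself.
    return actual in ALIASES.get(declared, {declared})

print(covers(ISO_15924[413], "Hira"))  # Jpan may contain Hiragana
print(covers(ISO_15924[412], "Hani"))  # Hrkt does not include Kanji
```

An application that wants to show, say, only the Hiragana label cannot
pick it out if the labels are tagged Jpan.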

> I then had a look at RFC 3066 [3] to see if the script names could be
> used within language tags. To paraphrase, [3] says that a language
> tag can be built up from any number of subtags separated by "-"
> character. If I've understood it correctly, the first subtag is
> supposed to be the language code (from ISO 639-1 or ISO 639-2 e.g.
> "en"), the second subtag is supposed to be a country code (from ISO
> 3166), and the third subtag can be anything you want. So e.g. you can
> have "sgn-US-MA" for Martha's Vineyard Sign Language, which is found
> in the state of Massachusetts, US.

That's right, but by creating a notation (sic) for the concept
"Martha's Vineyard Sign Language" you just used an ambiguous,
non-Semantic-Web technique. Isn't that what we want to overcome with SKOS?
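
To illustrate how little such a notation says on its own, here is a
rough Python sketch that splits a tag into subtags purely by shape. It
assumes the conventions of RFC 4646 (the successor of RFC 3066), where a
four-letter subtag directly after the language is a script and a
two-letter or three-digit subtag is a region. It ignores extended
language subtags and does no registry lookup; split_tag is a
hypothetical helper, not an existing API:

```python
import re

def split_tag(tag):
    """Classify the subtags of a language tag by shape only."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None,
              "region": None, "other": []}
    for position, sub in enumerate(parts[1:], start=1):
        if position == 1 and re.fullmatch(r"[A-Za-z]{4}", sub):
            # Four letters right after the language: treat as a script.
            result["script"] = sub.title()
        elif (result["region"] is None and not result["other"]
              and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", sub)):
            # First two-letter or three-digit subtag: treat as a region.
            result["region"] = sub.upper()
        else:
            # Anything else stays uninterpreted.
            result["other"].append(sub)
    return result

print(split_tag("sr-Latn"))
print(split_tag("sgn-US-MA"))
```

For "sgn-US-MA" the trailing "MA" ends up in "other": nothing in the tag
itself says what it stands for.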

> If it is possible to embed script names in language tags, then the
> representational issue can be resolved.

Your irritation shows the drawback of using notations only: it is not
clear how they can be created and what they stand for. That's why RDF
was invented: don't use language codes but URIs! I must admit that there
is no official encoding of language and script codes in SKOS or any other
RDF vocabulary - looks like a chicken-and-egg problem to me ;-)

How about something like:

<!-- Definition of the language -->
<skos:Concept rdf:about="http://mylanguageregistry/#sgn-US-MA">
  <skos:notation>sgn-US-MA</skos:notation>
  <skos:label>Martha's Vineyard Sign Language</skos:label>
</skos:Concept>

<!-- Definition of the concept -->
<skos:Concept rdf:about="myconcept">
  <skos:label rdf:resource="#myLabel"/>
</skos:Concept>

<!-- Definition of the label -->
<skos:SymbolicLabel rdf:about="#myLabel">
  <skos:language rdf:resource="http://mylanguageregistry/#sgn-US-MA"/>
  <skos:media rdf:resource="http://videos/video-of-the-sign.avi"/>
</skos:SymbolicLabel>
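
The point of the sketch above is that the language of a label becomes
something an application can follow as a link instead of decoding a
string. A minimal Python illustration of that lookup, with the same
statements written as plain (subject, predicate, object) tuples. The
property names are the ones proposed above, the helper functions are
made up, and no RDF library is involved:

```python
LANG = "http://mylanguageregistry/#sgn-US-MA"

# The statements from the RDF/XML sketch, as simple triples.
triples = [
    (LANG, "skos:notation", "sgn-US-MA"),
    (LANG, "skos:label", "Martha's Vineyard Sign Language"),
    ("myconcept", "skos:label", "#myLabel"),
    ("#myLabel", "skos:language", LANG),
    ("#myLabel", "skos:media", "http://videos/video-of-the-sign.avi"),
]

def objects(subject, predicate):
    """Return all objects of triples with this subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def label_languages(concept):
    """Follow concept -> label resource -> language -> language name."""
    names = []
    for label in objects(concept, "skos:label"):
        for lang in objects(label, "skos:language"):
            names.extend(objects(lang, "skos:label"))
    return names

print(label_languages("myconcept"))
```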


> Secondly, this bears on the cardinality of the skos:prefLabel
> property. The SKOS Core Guide [5] currently says, "A concept should
> have no more than one preferred lexical label per language." However,
> when working with multiple scripts a concept scheme would need one
> preferred lexical label per language per script.

As I said before, the restriction on skos:prefLabel mixes the label's
role of identification with its role of display. This mixture of roles
should be avoided in the final SKOS standard.
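
If the restriction is kept for display purposes, the "one preferred
label per language per script" variant from the quoted paragraph could
be checked roughly as follows. This is only a sketch: the function name,
the tuple layout and the example labels are assumptions made for
illustration, not anything defined by SKOS:

```python
from collections import Counter

def duplicate_pref_labels(pref_labels):
    """Return the (language, script) keys that carry more than one prefLabel.

    `pref_labels` is a list of (text, language, script) tuples.
    """
    counts = Counter((lang, script) for _text, lang, script in pref_labels)
    return sorted(key for key, n in counts.items() if n > 1)

# Made-up example: one concept with labels in several scripts.
labels = [
    ("東京", "ja", "Hani"),
    ("とうきょう", "ja", "Hira"),
    ("Tōkyō", "ja", "Latn"),
    ("Tokyo", "ja", "Latn"),  # second Latin-script label for "ja"
]

print(duplicate_pref_labels(labels))
```

Keyed on the language alone, all four labels would collide; keyed on
language and script, only the two Latin ones do.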

> Thirdly, this is another scenario which gives rise to the need for
> expressing relationships between lexical labels. E.g. if a concept
> has both preferred and alternative labels in multiple scripts, an
> application might want to display equivalent labels from different
> scripts beside each other to aid with reading.

I'm afraid that if you introduce labels as concepts (like in my example
above), this will let users create a lot of bloated concept schemes, but
it looks like that cannot be avoided.

Greetings,
Jakob
Received on Monday, 5 February 2007 18:46:14 GMT
