[SKOS] languages and scripts

Hi all,

Just jotting down some notes prior to raising an issue ...

At DC2006 I spoke to Mitsuharu Nagamori from the University of Tsukuba (cc'd) about a SKOS encoding of the Japanese National Library classification scheme. Mitsuharu and I discussed design options for representing the features of the classification scheme. Mitsuharu also taught me about the various scripts that are used for written Japanese. I am still very ignorant about the Japanese language, so please forgive me if I make any errors in this email.

As I understand it, there are several different scripts available for writing Japanese [1]. These are the Kanji script (characters of Chinese origin), the Hiragana script (a syllabary), the Katakana script (also a syllabary) and the Latin alphabet.

In the JNL classification scheme, all four scripts may be used. 

The general situation in which a concept may be labelled using multiple scripts within the same language gives rise to a number of potential issues.

Firstly, an application may wish to distinguish between labels in different scripts, for display purposes. How is the script of a label to be represented in an RDF graph?

I found a standard list of script names [2] (ISO 15924). I believe the values relevant to Japanese are as follows:

 * Hani (Kanji)
 * Hira (Hiragana)
 * Kana (Katakana)
 * Latn (Latin)
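
As a rough illustration of how an application might tell these scripts apart, here is a minimal Python sketch. It is only a heuristic and is not taken from the JNL scheme or from [2]: the function name guess_script and the prefix table are made up for this example, the sample labels are illustrative spellings of a word meaning "library", and it assumes that the Unicode character name is a good enough hint of the script.

```python
import unicodedata

# Prefix of the Unicode character name -> script name from [2].
# This table is an assumption of the sketch and only covers the four
# scripts listed above.
NAME_PREFIX_TO_SCRIPT = {
    "CJK UNIFIED IDEOGRAPH": "Hani",
    "HIRAGANA": "Hira",
    "KATAKANA": "Kana",
    "LATIN": "Latn",
}


def guess_script(label):
    """Return the script name shared by every letter in label.

    Returns None if the label mixes scripts, contains letters from a
    script outside the table, or contains no letters at all.
    """
    found = set()
    for ch in label:
        if not ch.isalpha():
            continue  # skip spaces, digits and punctuation
        name = unicodedata.name(ch, "")
        for prefix, code in NAME_PREFIX_TO_SCRIPT.items():
            if name.startswith(prefix):
                found.add(code)
                break
        else:
            found.add(None)  # a letter from an unlisted script
    return found.pop() if len(found) == 1 else None


for label in ["図書館", "としょかん", "トショカン", "toshokan"]:
    print(label, guess_script(label))
```

A label that mixes scripts, which is common in ordinary Japanese text, comes back as None here, so a real application would need a better policy than this.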

I then had a look at RFC 3066 [3] to see if the script names could be used within language tags. To paraphrase, [3] says that a language tag can be built up from any number of subtags separated by the "-" character. If I've understood it correctly, the first subtag is supposed to be the language code (from ISO 639-1 or ISO 639-2, e.g. "en"), the second subtag is supposed to be a country code (from ISO 3166), and the third subtag can be anything you want. So, for example, you can have "sgn-US-MA" for Martha's Vineyard Sign Language, which is found in the state of Massachusetts, US.

So can you have, e.g., "ja-JP-Kana" for Japanese - Japan - Katakana script?

Then I found this email from Jeremy Carroll [4] that suggests you can put the script name and the country code the other way around, e.g. "zh-hant-TW". Does anyone know what the rules are for including script names in language tags, and where this is specified?
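
To compare the two orderings side by side, here is a small Python sketch that looks for a script name anywhere in a language tag. It does not answer the question of which ordering is allowed. It simply assumes that any four-letter alphabetic subtag after the first one is a script name, which is an assumption of this sketch and not a rule taken from [3]; the function name find_script_subtag is also invented for the example.

```python
def find_script_subtag(tag):
    """Return (position, script) for the first four-letter subtag, or None.

    position is the zero-based index of the subtag within the tag, and
    script is normalised to the capitalisation used in [2], e.g. "Hant".
    """
    subtags = tag.split("-")
    # Skip index 0, which is assumed to be the language code.
    for position, subtag in enumerate(subtags[1:], start=1):
        if len(subtag) == 4 and subtag.isalpha():
            return position, subtag.capitalize()
    return None


for tag in ["ja-JP-Kana", "zh-hant-TW", "sgn-US-MA"]:
    print(tag, find_script_subtag(tag))
```

With this reading, the script is found at position 2 in "ja-JP-Kana" and at position 1 in "zh-hant-TW", and "sgn-US-MA" has no script subtag at all.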

If it is possible to embed script names in language tags, then the representational issue can be resolved.

Secondly, this bears on the cardinality of the skos:prefLabel property. The SKOS Core Guide [5] currently says, "A concept should have no more than one preferred lexical label per language." However, when working with multiple scripts a concept scheme would need one preferred lexical label per language per script.
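
The cardinality point can be sketched in Python. The check below counts preferred labels per language tag, which is one possible reading of the sentence quoted from [5]. The tags "ja-Hani", "ja-Hira" and "ja-Kana" are hypothetical, since whether and how script names may appear in language tags is still the open question above; the function name and the sample labels are likewise invented for illustration.

```python
from collections import Counter


def duplicate_pref_labels(labels):
    """Return the language tags that carry more than one preferred label.

    labels is a list of (text, language_tag) pairs for a single concept.
    Tags are compared case-insensitively.
    """
    counts = Counter(tag.lower() for _text, tag in labels)
    return sorted(tag for tag, n in counts.items() if n > 1)


# One preferred label per script, all tagged with the bare language code.
plain = [("図書館", "ja"), ("としょかん", "ja"), ("トショカン", "ja")]

# The same labels with hypothetical script-qualified tags.
tagged = [("図書館", "ja-Hani"), ("としょかん", "ja-Hira"), ("トショカン", "ja-Kana")]

print(duplicate_pref_labels(plain))   # the bare "ja" tag is reported
print(duplicate_pref_labels(tagged))  # nothing is reported
```

If "per language" is instead read as "per primary language subtag", the second list would also be reported, which is why the wording in [5] matters here.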

Thirdly, this is another scenario which gives rise to the need for expressing relationships between lexical labels. E.g. if a concept has both preferred and alternative labels in multiple scripts, an application might want to display equivalent labels from different scripts beside each other to aid with reading. 

That's all I have for now.

Cheers,

Alistair.

[1] http://en.wikipedia.org/wiki/Japanese_writing_system
[2] http://www.unicode.org/iso15924/iso15924-codes.html
[3] http://www.ietf.org/rfc/rfc3066.txt
[4] http://www.alvestrand.no/pipermail/ietf-languages/2004-March/001809.html
[5] http://www.w3.org/TR/2005/WD-swbp-skos-core-guide-20051102/#secmulti
--
Alistair Miles
Research Associate
CCLRC - Rutherford Appleton Laboratory
Building R1 Room 1.60
Fermi Avenue
Chilton
Didcot
Oxfordshire OX11 0QX
United Kingdom
Web: http://purl.org/net/aliman
Email: a.j.miles@rl.ac.uk
Tel: +44 (0)1235 445440

Received on Monday, 5 February 2007 17:54:38 UTC