A comment about I18N proposal by Axel Polleres (ISSUE-126 and ISSUE-71)


At yesterday's teleconf, Ivan drew my attention to the proposal by Axel Polleres for owl:internationalizedString:


If I understood the idea correctly, Axel is proposing to have a datatype per language tag. Thus, you'd have something like a lang:en
datatype, which would contain all strings in English (lang: is a namespace prefix yet to be defined). Furthermore, you might have
the lang:en-US datatype, which would contain all strings in the US variant of English. The datatype lang:en-US would be a
subdatatype of lang:en; hence, if you asked for all strings in English, you would also obtain all strings in the US variant.
Please correct me if I summarized the proposal incorrectly -- I apologize in advance.
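
To make the containment concrete, here is a minimal Python sketch. It assumes (this is my reading, not something stated in Axel's e-mail) that a tag belongs to a lang:* datatype when it equals the datatype's tag or extends it with a hyphen-separated subtag, compared case-insensitively. The function name in_lang_datatype is hypothetical.

```python
def in_lang_datatype(tag: str, datatype_tag: str) -> bool:
    """Return True if a literal tagged `tag` would fall into lang:<datatype_tag>.

    Assumption: membership is hyphen-delimited prefix matching, so "en-US"
    falls into lang:en, but "eng" does not.
    """
    tag, datatype_tag = tag.lower(), datatype_tag.lower()
    return tag == datatype_tag or tag.startswith(datatype_tag + "-")


# lang:en-US is a subdatatype of lang:en, but not the other way round.
us_in_en = in_lang_datatype("en-US", "en")
en_in_us = in_lang_datatype("en", "en-US")
```

Under this reading, asking for lang:en returns the en-US strings as well, while asking for lang:en-US does not return strings tagged only with en.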

I'm not really sure what the value space of all these datatypes would be. If you want to make literals of the form "aaa"@en and
"aaa"@en-US be different things (i.e., if you want to give them different identity), then you need to have different objects in the
value space. Axel's e-mail is silent about the value spaces; however, I assume that each literal with a language tag is still mapped
to a pair of the form


If this were not the case -- for example, if you mapped "aaa"@en and "aaa"@en-US to the same object "aaa" -- then there would be no
way to distinguish different values in the interpretation of lang:en and lang:en-US. Hence, it seems reasonable to me to
assume that the value space of datatypes in Axel's proposal is identical to the value space of my proposal in

Furthermore, Axel's proposal is silent about the treatment of xsd:string. Since the value spaces in my proposal and his are the
same, however, I don't see any problem in mapping literals of the form "aaa"^^xsd:string into ("aaa","") -- that is, into pairs with
the empty language tag.
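
As a rough illustration, the mapping can be sketched in Python as follows. The sketch assumes that each value is a pair of a string and a language tag, with the empty tag standing for "no language"; the names LangString, from_plain_literal and from_xsd_string are hypothetical.

```python
from typing import NamedTuple


class LangString(NamedTuple):
    """One element of the assumed value space: a (string, language tag) pair."""
    text: str
    lang: str  # "" is used for literals without a language tag


def from_plain_literal(text: str, lang: str) -> LangString:
    # "aaa"@en is mapped to the pair ("aaa", "en")
    return LangString(text, lang)


def from_xsd_string(text: str) -> LangString:
    # "aaa"^^xsd:string is mapped to the pair ("aaa", "")
    return LangString(text, "")


en = from_plain_literal("aaa", "en")
en_us = from_plain_literal("aaa", "en-US")
plain = from_xsd_string("aaa")
```

Because the tag is part of the value, en, en_us and plain are three distinct objects even though they share the same string.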

In fact, it seems to me that Axel's proposal is more related to ISSUE-71, which asks for a mechanism for identifying all strings in
a particular language. My proposal hasn't so far addressed this issue at all. In fact, I believe that ISSUE-71 is orthogonal to the
problem of structuring the value space of internationalized strings (which is the main goal of ISSUE-126). To be more precise, I
believe that, if we addressed ISSUE-126 in the way I outlined earlier, there would be nothing preventing us from employing Axel's
proposal for addressing ISSUE-71. The only thing we need to do is define the value space of each of the different lang:* datatypes.
For example, the value space of lang:en would be defined as the set of pairs of the form


(I hope you understand my pidgin regular expressions). To summarize, I believe we can go forward with ISSUE-126 and come back to
ISSUE-71 later.

Regarding Axel's proposal for addressing ISSUE-71, it seems quite reasonable. I would like, however, to point out that ISSUE-71 can
also be addressed rather simply by adding another facet, langTagPattern. This facet would take a regular expression and
would restrict the value space of owl:internationalizedString to the set of pairs in which the language tag matches the regular
expression. For example, the datatype restriction

    DatatypeRestriction( owl:internationalizedString langTagPattern "en[-*]" )

would have as the value space the set of pairs of the form


and would thus select all strings written in some variant of English. In contrast,

    DatatypeRestriction( owl:internationalizedString langTagPattern "en" )

would have as the value space the set of pairs of the form


and would select only the strings that have no sublanguage specified. Regular expressions would thus provide us with quite a bit
of flexibility; in particular, they would allow us to explicitly distinguish between values with no language tag, only a language
tag, a language+sublanguage tag, and so on. I also believe the proposal would be really simple to implement: the extensions to the
datatype reasoning algorithm from my ISWC 2008 paper are rather trivial.
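
The effect of such a facet on the value space can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the reasoning algorithm itself: values are (string, language tag) pairs, the pattern must match the whole tag, and the pidgin pattern for "some variant of English" is rendered in Python syntax as en(-.*)?, which is my guess at the intended meaning. The function name restrict_by_lang_tag and the sample values are hypothetical.

```python
import re


def restrict_by_lang_tag(values, pattern):
    """Keep only the (string, tag) pairs whose tag fully matches `pattern`."""
    compiled = re.compile(pattern)
    return [(text, lang) for text, lang in values if compiled.fullmatch(lang)]


# Hypothetical sample of the value space.
values = [
    ("colour", "en-GB"),
    ("color", "en-US"),
    ("hello", "en"),
    ("bonjour", "fr"),
    ("aaa", ""),  # no language tag
]

english_variants = restrict_by_lang_tag(values, r"en(-.*)?")  # en, en-GB, en-US
plain_english = restrict_by_lang_tag(values, r"en")           # only the bare "en" tag
untagged = restrict_by_lang_tag(values, r"")                  # only the empty tag
```

Matching against the whole tag is what lets the pattern "en" exclude en-US, and the empty pattern pick out exactly the values without a language tag.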





Received on Thursday, 17 July 2008 08:56:00 UTC