RE: Issue with unbounded alphabets in datatypes from Boris Motik on 2008-10-20 (public-owl-dev@w3.org from October to December 2008)

From: Boris Motik <boris.motik@comlab.ox.ac.uk>
Date: Mon, 20 Oct 2008 23:04:40 +0100
To: "'Bijan Parsia'" <bparsia@cs.man.ac.uk>, "'Birte Glimm'" <birte.glimm@comlab.ox.ac.uk>
Cc: "'Owl Dev'" <public-owl-dev@w3.org>, "'Jie Bao'" <baojie@cs.rpi.edu>, <axel@polleres.net>, "'W3C OWL Working Group'" <public-owl-wg@w3.org>
Message-ID: <D19F134CBE204A18A339A3995C50A26E@wolf>

Hello,

The alphabet in rdf:text and the OWL 2 version of xsd:string is indeed unbounded. This has been done because fixing the semantics of
these datatypes to a particular version of Unicode would make it difficult to extend the alphabet in the future without affecting
the semantics of some OWL 2 ontologies. Consider the following axiom:

(1) ClassAssertion( MinCardinality( n DP DatatypeRestriction( xsd:string xsd:length 1 ) ) )

Let us assume that m is the number of characters in the alphabet of xsd:string, so that there are exactly m distinct strings of
length one. Then axiom (1) is satisfiable if and only if n <= m. This means that if we fix m today to some value m0, we won't be
able to go beyond m0 in the future without affecting the semantics of some OWL 2 ontologies.

A way out of this dilemma is to make the alphabet infinite. Thus, the data range DatatypeRestriction( xsd:string xsd:length 1 )
actually contains an infinite number of strings of length one, and this means that axiom (1) is always satisfiable.
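
As a rough illustration of the counting argument, here is a minimal sketch. The function name and the use of None to stand for an
infinite alphabet are assumptions made for this example only; nothing here comes from an OWL 2 specification or reasoner.

```python
from typing import Optional


def axiom_is_satisfiable(n: int, alphabet_size: Optional[int]) -> bool:
    """Check whether at least n distinct strings of length one can exist.

    alphabet_size plays the role of m in the text above.
    None is a hypothetical marker for an infinite alphabet.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if alphabet_size is None:
        # Infinite alphabet: there are always enough strings of length one.
        return True
    # Finite alphabet with m characters: satisfiable iff n <= m.
    return n <= alphabet_size
```

With a fixed m0, every n above m0 makes the axiom unsatisfiable, and raising m0 later flips that answer; with None the answer never
depends on a version number.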

Please don't be confused by the fact that, even though the alphabet is assumed to be infinite, the number of constants that you have
at any given point in time is finite. Thus, the number of strings of length one that you can write is finite; however, this has
nothing to do with the fact that the number of strings of length one in the value space of xsd:string is infinite.

While this might cause some minor difficulties with reusing existing regular expression libraries, I don't think this is a
conceptual problem: even if you have an infinite alphabet, as long as you are dealing with finite regular expressions you can
represent all automata finitely; furthermore, the operations on automata don't change much at all. I hope I'll be able to provide
more details about this in the next few days.
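
One way to see why a finite representation suffices: a finite regular expression mentions only finitely many characters, so each
transition can be labelled with either a finite set of characters or the complement of one. The sketch below is only an
illustration under that assumption; CharClass, TRANSITIONS and accepts are hypothetical names, and the sample automaton
(roughly the expression a[^b]*) is made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharClass:
    """A finite set of characters, or the complement of a finite set."""
    chars: frozenset
    negated: bool = False

    def contains(self, ch: str) -> bool:
        return (ch in self.chars) != self.negated

    def complement(self) -> "CharClass":
        return CharClass(self.chars, not self.negated)

    def intersect(self, other: "CharClass") -> "CharClass":
        if not self.negated and not other.negated:
            return CharClass(self.chars & other.chars)
        if not self.negated:
            return CharClass(self.chars - other.chars)
        if not other.negated:
            return CharClass(other.chars - self.chars)
        # Both are complements: the result excludes the union of both sets.
        return CharClass(self.chars | other.chars, negated=True)

    def is_empty(self) -> bool:
        # Over an infinite alphabet, a complement of a finite set is never empty.
        return not self.negated and not self.chars


NOT_B = CharClass(frozenset("b"), negated=True)

# Hypothetical automaton: state 0 --a--> state 1, state 1 --[^b]--> state 1.
START = 0
ACCEPTING = {1}
TRANSITIONS = {
    0: [(CharClass(frozenset("a")), 1)],
    1: [(NOT_B, 1)],
}


def accepts(word: str) -> bool:
    """Run the automaton; the labels leaving each state are disjoint."""
    state = START
    for ch in word:
        for label, target in TRANSITIONS.get(state, []):
            if label.contains(ch):
                state = target
                break
        else:
            return False  # no transition matches this character
    return state in ACCEPTING
```

The label NOT_B matches any character other than b, including characters that no current version of Unicode defines, yet it is
stored as a one-element set plus a flag. Intersection and complement stay within the same representation, which is why the usual
automata constructions carry over.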

There are other possibilities for addressing this issue.

1. We could fix OWL 2 to a particular version of Unicode. I consider this very bad because Unicode is revised very frequently.

2. We could pick m to be some very large number so that, for all practical intents and purposes, we never exceed this number. While
this has some appeal, it is a kind of hack. In fact, I believe that I can provide guidance to implementors for the current
solution that would be much easier to put into practice than building an automaton over a very large alphabet.

Regards,

	Boris

> -----Original Message-----
> From: public-owl-dev-request@w3.org [mailto:public-owl-dev-request@w3.org] On Behalf Of Bijan Parsia
> Sent: 20 October 2008 22:33
> To: Birte Glimm
> Cc: Owl Dev; Jie Bao; axel@polleres.net; W3C OWL Working Group
> Subject: Re: Issue with unbounded alphabets in datatypes
> 
> 
> [OWL WG, please see:
> 	<http://www.w3.org/mid/492f2b0b0810201028w4184475csa51d27429e05bc21@mail.gmail.com>]
> 
> Hi Birte,
> 
> On Oct 20, 2008, at 6:28 PM, Birte Glimm wrote:
> 
> > Hi all,
> >
> > I just wanted to raise a discussion about the currently proposed
> > assumption that the alphabet of the String-based datatypes is
> > unbounded.
> 
> I'm just wondering which text you think requires that. For
> xsd:string, it clearly seems that there is a finite alphabet:
> 	http://www.w3.org/TR/xmlschema-2/#string
> 
> Oh, but rdf:text:
> 	http://www.w3.org/2007/OWL/wiki/InternationalizedStringSpec#Definition_of_the_rdf:text_Datatype
> doesn't seem to have any constraint on alphabet.
> 
> Though in:
> 	http://www.w3.org/2005/rules/wiki/DTB#Symbol_Spaces
> 
> we read:
> """rdf:text (http://www.w3.org/2007/rif#text, for text strings with
> language tags attached).
> 
> This symbol space represents text strings with a language tag
> attached. The lexical space of rdf:text is the set of all Unicode
> strings of the form ...@LANG, i.e., strings that end with @LANG where
> LANG is a language identifier as defined in [BCP-47]"""
> 
> which seems to restrict it to Unicode (3.0?) strings.
> 
> I agree that allowing an unbounded alphabet is absurd. XML has had to
> face this as well. I think we should suck it up.
> 
> I've cc'ed the editors of the rdf:text document so they can take note
> of the issue.
> 
> Cheers,
> Bijan.

Received on Monday, 20 October 2008 22:05:34 UTC