- From: Christian Chiarcos <christian.chiarcos@web.de>
- Date: Fri, 23 Nov 2018 15:55:31 +0100
- To: andy@seaborne.org
- Cc: hugh@glasers.org, SW-forum <semantic-web@w3.org>, w.g.j.beek@vu.nl
- Message-ID: <CAC1YGdhZJLd8FeEjxqq1947r-kJBVjejB4WNuCFfy_8vnQ3jvw@mail.gmail.com>
Dear all,

language codes do matter, but are pretty inconvenient for multiple reasons:

- comparability with untyped/plain strings (most obviously, and counter-intuitively for RDF novices),
- complexity (BCP47 defines (a) complex selection rules among ISO 639 language tags, and (b) complex rules for composition, e.g., with script and region codes),
- confusability (having 2-letter codes alongside 3-letter codes for the same language can lead people used to working with 3-letter codes to choose 2-letter codes, which is an easy error to make, but can result in failure to compare, e.g., "cat"@eng and "cat"@en. Not sure what should happen when you compare "рука"@sr-Cyrl with "рука"@sr. Both are identical, the first is just more explicit in stating that this is Cyrillic.), and
- coverage (for many applications, ISO 639 simply isn't fine-grained or well-defined enough, and its extension is slow, bureaucratic and doubtful).

A much more convenient solution would be to identify the language by means of a URI. This can be an ISO 639 category (see under http://id.loc.gov/vocabulary/iso639-2.html and http://id.loc.gov/vocabulary/iso639-1.html; for ISO 639, cf. http://www.lexvo.org/), or provided by another authority (e.g., https://glottolog.org/). Other properties (e.g., xsd datatypes) could also be stated about a literal. Two strings could be considered identical if the values are the same and the properties of one are a proper subset of the properties of the other.

Not sure what the right data structure or representation should be. Maybe a kind of container structure for literal metadata (similar to the @ notation and the lang() properties that we have now).

Best,
Christian

On Fri, 23 Nov 2018 at 15:03, Andy Seaborne <andy@seaborne.org> wrote:

>
> On 23/11/2018 12:03, Hugh Glaser wrote:
> > Ah, good topic.
> >
> > So another thing I don't understand (:-)) is why we have to have language tags on strings at all, and indeed datatypes.
>
> As someone who works with a product that is used by users in different geographies, I can say that language tags matter.
>
> And I live near "cymru"@cy.
>
> > (OK, it's because of XML heritage or something, I guess.)
> > But we have a perfectly good way of representing knowledge about things.
> > It is a real pain to create these 3 component literals and to query for different languages and datatypes in SPARQL.
> > And worse still, if you want to query for strings that may or may not have language tags on, you need to do some real messing about.
>
> STR(?var) in SPARQL.
>
> xsd:string("abc"@en) if you are lucky.
>
> > I often end up adding @en to all the strings, or removing region tags etc., just so I can do things more easily, which is surely a Bad Thing.
> >
> > Surely languages and datatypes should simply be RDF properties of Literals, which are 1 component things?
> > Much easier to explain to developers, and for them to use.
> > (If indeed they want to use raw RDF.)
> > As in:
> > "chat" rdf:lang "en" .
> > ?
>
> That would make all occurrences of "chat" @en.
>
> They really are different literals.
>
>     Andy
>
> >
> >> On 23 Nov 2018, at 11:48, Andy Seaborne <andy@seaborne.org> wrote:
> >>
> >> The RDF 1.1 WG did spend some time on this - both on putting the langtag into the lexical space and putting the lang tag into the datatype. Both are not so easy; in the end the rdf:langString at least meant all literals had a datatype.
> >>
> >> With the lexical form being a pair (string, lang) and squeezing that into a single string, it gets a bit unintuitive when strlen("hello@en") is 8, not 5. SeeAlso rdf:PlainLiteral.
> >>
> >> For datatypes, language tags have their own structure and hierarchy (lang-script-region-...) for their requirements which does not really fit with datatype subtyping very well.
> >>
> >> I don't think changes would simplify.
> >>
> >> We have what we have and people have been explaining to the wider community (i.e. it's not just people on this list affected). So "technically better" isn't the criterion, it should be "unlocks potential that is currently, provably blocked".
> >>
> >> Andy
> >>
> >> On 23/11/2018 08:42, Wouter Beek wrote:
> >>> Dear David, others,
> >>> As another attempt at simplifying RDF, would it be possible to do away
> >>> with the special status of language-tagged strings?
> >>> In RDF 1.1 literals consist of 3 components: lexical form, datatype
> >>> IRI, and language tag. The last component is only used in
> >>> language-tagged strings. Would it be possible to define
> >>> `rdf:langString' as a regular datatype IRI and have literals consist
> >>> of 2 components instead?
> >>> RDF 1.1 Concepts and Abstract Syntax currently contains many caveats
> >>> to accommodate the idiosyncratic nature of language-tagged strings,
> >>> e.g.,:
> >>>> Language-tagged strings have the datatype IRI http://www.w3.org/1999/02/22-rdf-syntax-ns#langString. No datatype is formally defined for this IRI because the definition of datatypes does not accommodate language tags in the lexical space. The value space associated with this datatype IRI is the set of all pairs of strings and language tags.
> >>> Would it be possible to define a regular lexical space, e.g.,
> >>> containing "hello@en"^^rdf:langString, together with a value-2-lexical
> >>> and a lexical-2-value mapping?
> >>> The N3 and SPARQL notation "hello"@en will of course still be
> >>> available, and will be syntactic sugar for "hello@en"^^rdf:langString.
> >>> ---
> >>> Best regards,
> >>> Wouter Beek.
> >>> Email: w.g.j.beek@vu.nl
> >>> WWW: https://wouterbeek.org
> >>> Tel: +31647674624
> >>
> >
> >

--
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany
office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931
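
The comparison rule suggested in the message above (same value, and the properties of one literal contained in those of the other) could be sketched roughly as follows. This is only an illustration under assumptions not stated in the message: the names Literal and same_literal are hypothetical, properties are modelled as a set of (property, value) pairs, the "lang" and "script" keys are made up for the example, and plain subset-or-equal is used so that two literals with identical properties also match.

```python
from dataclasses import dataclass

# Hypothetical model: a literal is a bare string value plus a set of
# (property, value) pairs stated about it (language URI, script, datatype, ...).
@dataclass(frozen=True)
class Literal:
    value: str
    properties: frozenset = frozenset()


def same_literal(a: Literal, b: Literal) -> bool:
    """Sketch of the proposed rule: equal values, and the properties of one
    literal are a subset of the properties of the other.
    Assumption: subset-or-equal, so identical property sets also match."""
    if a.value != b.value:
        return False
    return a.properties <= b.properties or b.properties <= a.properties


# Illustrative language URIs in the Lexvo style; the property keys are made up.
SRP = ("lang", "http://lexvo.org/id/iso639-3/srp")
CYRL = ("script", "Cyrl")
ENG = ("lang", "http://lexvo.org/id/iso639-3/eng")
FRA = ("lang", "http://lexvo.org/id/iso639-3/fra")

ruka_sr = Literal("рука", frozenset({SRP}))
ruka_sr_cyrl = Literal("рука", frozenset({SRP, CYRL}))
chat_plain = Literal("chat")
chat_en = Literal("chat", frozenset({ENG}))
chat_fr = Literal("chat", frozenset({FRA}))

print(same_literal(ruka_sr, ruka_sr_cyrl))  # more explicit variant still matches
print(same_literal(chat_plain, chat_en))    # plain string matches a tagged one
print(same_literal(chat_en, chat_fr))       # conflicting languages do not match
```

Under this reading the relation is not transitive: the plain "chat" matches both the English and the French literal, while those two do not match each other.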
Received on Friday, 23 November 2018 14:56:05 UTC