Re: Fixing an omission in R2RML: syntax of blank node labels

On 26 Apr 2012, at 10:20, Ivan Herman wrote:
> why is that long note on the various syntaxes necessary here? That looks to me as an implementation dependent detail that is not for the Rec.

1. No one in the WG noticed that we didn't handle it correctly, so I think it requires some highlighting and explanation.

2. Without the explanation, implementers will have to chase down the blank node identifier grammars in *every single RDF syntax* that exists, because blank node identifier syntax is *not standardized* across syntaxes in RDF. It's unlikely that implementers would actually do that, so it's likely that we'd end up with buggy code that breaks when unusual characters show up in a blank node ID.

3. The WG has already done the work and trawled all the specs. Thanks, Eric! Why not present the results of that to save time for implementers?

But ok, let's try to make it shorter while keeping the gist. Replace this text in the current spec:

[[
If the term type is rr:BlankNode: Return a blank node whose blank node identifier is the natural RDF lexical form corresponding to value.
]]

with this:

[[
If the term type is rr:BlankNode: Return a blank node that is unique to the natural RDF lexical form corresponding to value.

NOTE: RDF syntaxes and RDF APIs generally represent blank nodes with blank node identifiers. But the characters allowed in blank node identifiers differ between syntaxes, and not all characters occurring in value may be allowed, so a bijective mapping function from values to valid blank node identifiers may be required. The details of this mapping function are implementation-dependent, and an R2RML processor may have to use different functions for different output syntaxes or access interfaces. Strings matching the regular expression [a-zA-Z_][a-zA-Z_0-9-]* are valid blank node identifiers in all W3C-recommended RDF syntaxes.
]]
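
Purely as an illustration, here is a minimal Python sketch of one possible mapping function: the hex-encoding scheme from my earlier message (quoted below). The names SAFE_BNODE_ID, bnode_id and decode_bnode_id are made up for this example and appear in no spec; a real processor is free to use any other scheme.

```python
import re

# Pattern from the NOTE above: strings matching it are described there as
# valid blank node identifiers in all W3C-recommended RDF syntaxes.
SAFE_BNODE_ID = re.compile(r"[a-zA-Z_][a-zA-Z_0-9-]*")


def bnode_id(value: str) -> str:
    """Hypothetical helper: map an arbitrary string to a safe identifier.

    1. UTF-8 encode the input.
    2. Turn each byte into a two-digit hexadecimal number.
    3. Concatenate the digits and prepend "blank".
    """
    return "blank" + value.encode("utf-8").hex().upper()


def decode_bnode_id(identifier: str) -> str:
    """Inverse of bnode_id; shows that distinct values get distinct identifiers."""
    return bytes.fromhex(identifier[len("blank"):]).decode("utf-8")


if __name__ == "__main__":
    label = bnode_id(":-)")
    print(label)  # blank3A2D29
    print(SAFE_BNODE_ID.fullmatch(label) is not None)
    print(decode_bnode_id(label) == ":-)")
```

Passing already-safe values through unchanged would give more readable identifiers, but then the value "blank3A" would collide with the encoding of ":", so the sketch simply encodes every value.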

The change can be summarised as: “Allow an arbitrary implementation-dependent bijective mapping before generating blank node identifiers, to account for characters that are disallowed in blank node identifiers in some RDF syntaxes”. It's not my call to make, but this doesn't sound like something that would require another LC.

Best,
Richard





> 
> Ivan
> 
> 
> On Apr 26, 2012, at 03:56 , Richard Cyganiak wrote:
> 
>> The test case reviews have highlighted an oversight in R2RML.
>> 
>> A change to the semantics will be necessary to fix this. It's essentially just a simple bugfix, although it requires a somewhat lengthy informative explanation.
>> 
>> Section 11.2 of R2RML defines the “term generation rules”, and has the following to say about generating blank nodes:
>> 
>> [[
>> If the term type is rr:BlankNode: Return a blank node whose blank node identifier is the natural RDF lexical form corresponding to value.
>> ]]
>> http://www.w3.org/TR/2012/CR-r2rml-20120223/#generated-rdf-term
>> 
>> There is a problem though. “Value” at this point could be an arbitrary SQL value containing any characters. Blank node identifiers, however, are syntactically restricted in the various syntaxes. Worse, the restrictions are different in different syntaxes. So, the spec as written asks implementations to do something that might generate illegal blank node identifiers.
>> 
>> The fix involves lots of handwaving because, given the different syntaxes, I don't think it's practical to specify a single escaping scheme that works everywhere. Since blank node labels are not semantically meaningful, we can leave the choice of escaping scheme up to the implementations. But this requires some explaining. So I'd like to change the phrasing above to:
>> 
>> [[
>> If the term type is rr:BlankNode: Return a blank node generated by applying the implementation-dependent blank node labelling function to the natural RDF lexical form corresponding to value.
>> ]]
>> 
>> “blank node labelling function” would then be defined like this, including a very long NOTE:
>> 
>> [[
>> The blank node labelling function is an arbitrary implementation-dependent function whose inputs are strings, and whose outputs are blank nodes. The function MUST be bijective, that is, the inputs and outputs are in a 1:1 correspondence.
>> 
>> NOTE: In the various syntaxes and access interfaces for RDF, blank nodes are generally represented by a blank node identifier. The precise syntax and allowed characters for blank node identifiers differ between syntaxes and interfaces. An R2RML processor must have the ability to generate valid blank node identifiers from arbitrary input strings. This is the task of the blank node labelling function. R2RML processors may have to use different blank node labelling functions for different output syntaxes or access interfaces.
>> 
>> A string matching the regular expression [a-zA-Z_](([a-zA-Z_0-9-])*[a-zA-Z_0-9.-])? is a valid blank node identifier in Turtle, SPARQL, N-Triples and RDF/XML. The following algorithm is a simple blank node labelling function that produces such valid blank node identifiers (but not very readable ones) from any input string:
>> 
>> 1. Turn the input string into a byte sequence by UTF-8 encoding.
>> 2. Turn each byte into a two-digit hexadecimal number.
>> 3. Concatenate all digits into a string, prepend “blank”, and generate a blank node with this blank node identifier.
>> 
>> For example, the string “:-)” would yield a blank node identifier “blank3A2D29”.
>> ]]
>> 
> 
> 
> ----
> Ivan Herman, W3C Semantic Web Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> FOAF: http://www.ivan-herman.net/foaf.rdf
> 
> 
> 
> 
> 

Received on Thursday, 26 April 2012 11:07:01 UTC