ITS > NIF conversion - at least one testing contributor needed

Hi all,

We have one feature in ITS2 that is not yet tested: the conversion to 
NIF. To fill this gap, I propose the following approach.

1) The conversion is tested with example files, each covering one data 
category. There is no need to have the conversion output of several 
data categories in one output file.


2) As in the definition of the ITS > NIF algorithm
http://www.w3.org/TR/2013/WD-its20-20130521/#conversion-to-nif
the conversion output represents the ITS information itself, but no 
information about how it was generated (local markup, global rules, 
inheritance, defaults).


3) RDF can be represented in various formats: RDF/XML, RDFa, Turtle, ... 
For testing purposes, having just one representation would be good, for 
ease of comparison.


4) It doesn't make sense to add the NIF conversion input / output to the 
test suite master file and to take it into account for comparison. The 
reason is that the test suite master file does a line-by-line comparison 
of test suite output. That doesn't provide useful information for NIF, 
since the same RDF triples can be serialized in many different line 
orders.


5) So how to compare output? The "meat" of the conversion to NIF is that 
for each node in a document that holds ITS information, a triple of the 
following form is generated:
SubjectURI ITS2DataCategorySpecificPredicate Value

"SubjectURI" is an URI that consists of the document base URI plus "#" 
plus character offsets. Example for "Dublin" in
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
this is
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#char=12,18

"ITS2DataCategorySpecificPredicate" is an RDF predicate that identifies 
the data category information in question, e.g. for "Dublin" it is 
"itsrdf:taIdentRef".

"Value" is the data category value. For "Dublin" in the above example, 
this is http://dbpedia.org/resource/Dublin
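The offsets are easy to verify; here is a quick check in Python, using 
the normalized text content of the example document:

text = " Welcome to Dublin in Ireland!"
start = text.index("Dublin")
end = start + len("Dublin")
print("#char=%d,%d" % (start, end))   # prints "#char=12,18"
print(":char=0,%d" % len(text))       # prints ":char=0,30", the context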

So we would need to compare whether two implementations create the 
same triples of the above form for a given input document. "The same" 
does not mean "the same offsets", e.g. "#char=12,18", since white space 
normalization for the NIF conversion is not defined in ITS2. "The same" 
means: for each node that contains ITS information there must be a 
triple like the above, with the same predicate and object.
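The predicate/object part of this check can be partially scripted. Here 
is a rough sketch, assuming Python with the rdflib library, that both 
outputs parse as Turtle (a strict parser may require the "=" and "," in 
the local names to be escaped or written as full IRIs), and placeholder 
file names. Note that it only checks that the same predicate/object 
pairs occur, not that each pair is anchored at the right node:

from collections import Counter
from rdflib import Graph

ITSRDF = "http://www.w3.org/2005/11/its/rdf#"

def its_pairs(path):
    g = Graph()
    g.parse(path, format="turtle")
    # drop the subject (the offsets), keep only ITS data category triples
    return Counter((str(p), str(o)) for s, p, o in g
                   if str(p).startswith(ITSRDF))

if its_pairs("implementation-a.ttl") == its_pairs("implementation-b.ttl"):
    print("same ITS predicates and objects")
else:
    print("outputs differ")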


6) Do we need automatic comparison?
With the above definition of "the same", a fully automatic comparison of 
the output is hard. But it is not needed IMO: NIF conversion is one 
feature like e.g. Translate "global"; so having 5-10 input files that we 
can check manually would be sufficient. Also, we don't need to cover 
several data categories in one NIF conversion test file, and we might 
restrict the testing even to one data category: "text analysis", 
"translate", ...


7) How to do this practically?

Below is a template for the output of the ITS2NIF conversion, in Turtle 
serialization.

[
@prefix : <XXX-base-uriXXX#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

:char=0,XXX-complete-length-XXX a nif:Context;
     nif:isString "XXX-complete-source-file-text-content-XXX";
     nif:sourceUrl <XXX-base-uriXXX> .

:char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;
     nif:anchorOf "XXX-annotated-string-XXX";
     nif:referenceContext :char=0,XXX-complete-length-XXX;
     XXX-annotation-predicate-XXX XXX-annotation-value-XXX .
]

The part starting with
":char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;"
and ending with
"XXX-annotation-predicate-XXX XXX-annotation-value-XXX ."
would be needed for each annotation.

Here is an example of how the template would be filled in for
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
Note that in the test file below I am only processing "text analysis" 
information; see 6) above for the rationale.

[
@prefix : <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

:char=0,30 a nif:Context;
     nif:isString " Welcome to Dublin in Ireland!";
     nif:sourceUrl <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html> .

:char=12,18 a nif:RFC5147String;
     nif:anchorOf "Dublin";
     nif:referenceContext :char=0,30;
     itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin> .

:xpath(/html/body%5B1%5D/h2%5B1%5D/span%5B1%5D/text()%5B1%5D) nif:convertedFrom :char=12,18 .
]
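As a side note, the percent-encoded XPath in the last triple can be 
produced mechanically; a small Python sketch (the choice of which 
characters to leave unescaped is my assumption):

from urllib.parse import quote

xpath = "/html/body[1]/h2[1]/span[1]/text()[1]"
# leave "/", "(" and ")" readable; "[" and "]" become %5B and %5D
print(":xpath(%s) nif:convertedFrom :char=12,18 ." % quote(xpath, safe="/()"))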


8) What effort is needed from test suite contributors?

For Leroy / TCD as the test suite owner, I'd say no effort is needed 
except creating directories for NIF input / output and documenting them 
on the test suite main page, saying that they are part of the normative 
conformance testing.

We need at least one test suite contributor who would - in addition to 
me - implement the conversion exemplified in 7). The implementation 
should be pretty straightforward (see the code sketch after this list):
0 Re-use the code that generates your test suite output.
1 Create the template under 7); fill in the file base URI "XXX-base-uriXXX", 
"XXX-complete-source-file-text-content-XXX" and "XXX-complete-length-XXX".
2 For each node that has an ITS 2 annotation:
2.1 create a subject URI. That is, instead of generating an XPath 
expression like /html/body[1]/h2[1]/span[1], you
2.1.1 generate XXX-ITS-Annotation-Start-XXX: count the characters in all 
text nodes preceding the current node
2.1.2 generate XXX-ITS-Annotation-End-XXX: take the string length of 
the current node and add it to XXX-ITS-Annotation-Start-XXX
2.2 Now you have the subject URI from the above offsets and the base 
URI, and you can create this triple:
:char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;
2.3 create the nif:anchorOf triple by putting in the string value of the 
current node, e.g. "Dublin";
2.4 create the nif:referenceContext triple by using the complete 
document string length, e.g. :char=0,30;
2.5 for each annotation, create the triples, e.g. itsrdf:taIdentRef 
<http://dbpedia.org/resource/Dublin>
3 After the last annotation, put the dot "." at the end instead of ";".
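To make these steps concrete, here is a minimal Python sketch, 
restricted to local Text Analysis markup in HTML5 (the 
"its-ta-ident-ref" attribute). The class and function names are mine, 
whitespace normalization (see 5 above) is not applied, and annotations 
on elements with nested markup are not handled:

from html.parser import HTMLParser

class ItsNifConverter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.offset = 0        # running character offset (step 2.1.1)
        self.chunks = []       # text node content, in document order
        self.pending = None    # annotation waiting for its text node
        self.annotations = []  # (start, end, anchor, predicate, value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "its-ta-ident-ref" in attrs:  # local Text Analysis annotation
            self.pending = ("itsrdf:taIdentRef", attrs["its-ta-ident-ref"])

    def handle_data(self, data):
        start = self.offset
        end = start + len(data)          # step 2.1.2
        if self.pending is not None:
            pred, value = self.pending
            self.annotations.append((start, end, data, pred, value))
            self.pending = None
        self.chunks.append(data)
        self.offset = end

def to_turtle(base_uri, html_source):
    parser = ItsNifConverter()
    parser.feed(html_source)
    text = "".join(parser.chunks)
    out = ["@prefix : <%s#> ." % base_uri,
           "@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .",
           "@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .",
           "",
           ":char=0,%d a nif:Context;" % len(text),               # step 1
           '     nif:isString "%s";' % text,
           "     nif:sourceUrl <%s> ." % base_uri]
    for start, end, anchor, pred, value in parser.annotations:
        out += ["",
                ":char=%d,%d a nif:RFC5147String;" % (start, end),   # step 2.2
                '     nif:anchorOf "%s";' % anchor,                  # step 2.3
                "     nif:referenceContext :char=0,%d;" % len(text), # step 2.4
                "     %s <%s> ." % (pred, value)]                    # steps 2.5 and 3
    return "\n".join(out)

Called with the example document under 7), to_turtle() should produce 
equivalent triples, though the exact offsets depend on the whitespace 
handling you choose.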

9) How much testing do we need?

I have implemented the conversion for local Text Analysis markup, see
http://www.w3.org/People/fsasaki/its20-general-processor/tools/its-ta-2-nif.xsl
and an example conversion in which the triples are shown with the RDF 
validator: http://tinyurl.com/qhhjgmb
So if we had one more implementer who could do 8), we would be done.
I'd be happy to contribute input files. I would also be happy to do this 
for any other data category. But before starting I'd like to know which 
data category to use, so that I don't need to redo the input files.


10) When do we need this?

We need this for finalizing ITS2, that is, the testing needs to be done 
within the next three weeks. I hope that with the test suite based 
description under 8) the actual conversion is straightforward and a 
matter of a few hours for those who are producing test suite output 
anyway.


11) How critical is this?

We need - in addition to me - one more volunteer; otherwise we can't 
finalize ITS2.


Best,

Felix
