- From: Felix Sasaki <fsasaki@w3.org>
- Date: Thu, 23 May 2013 09:16:52 +0200
- To: Multilingual Web LT-TESTS Public <public-multilingualweb-lt-tests@w3.org>
- Message-ID: <519DC264.4060400@w3.org>
Hi all,
we have one feature in ITS2 that is not yet tested: the conversion to
NIF. To fill this gap, I would propose the following approach.
1) The conversion is tested with example files for one data category. No
need to have the conversion output of several data categories in one
output file.
Like the definition of the ITS > NIF algorithm
http://www.w3.org/TR/2013/WD-its20-20130521/#conversion-to-nif
the conversion output represents ITS information, but no information
about how it was generated (local markup, global rules, inheritance, defaults).
3) RDF can be represented in various formats: RDF/XML, RDFa, Turtle, ...
For testing purposes, having just one representation would be good, for
ease of comparison.
4) It doesn't make sense to add the NIF conversion input / output to the
test suite master file and to take it into account for comparison. The
reason is that the test suite master file does a line by line comparison
of test suite output. That doesn't provide useful info for NIF.
5) So how to compare output? The "meat" of the conversion to NIF is that
for each node in a document that holds ITS information, a triple of the
following form is generated:
SubjectURI ITS2DataCategorySpecificPredicate Value
"SubjectURI" is an URI that consists of the document base URI plus "#"
plus character offsets. Example for "Dublin" in
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
this is
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#char=12,18
"ITS2DataCategorySpecificPredicate" is an RDF predicate that identifies
the data category information in question, e.g. for "Dublin" it is
"itsrdf:taIdentRef".
"Value" is the data category value. For "Dublin" in the above example,
this is http://dbpedia.org/resource/Dublin
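To make the subject URI construction concrete, here is a minimal Python sketch of it, using the "Dublin" example above. The function name subject_uri is an illustrative assumption, not anything from the spec.

```python
# Sketch of building an RFC 5147-style NIF subject URI from a base URI
# and character offsets: base#char=start,end.
# The helper name subject_uri is illustrative, not from the spec.

def subject_uri(base_uri: str, start: int, end: int) -> str:
    """Build a fragment URI of the form base#char=start,end."""
    return f"{base_uri}#char={start},{end}"

base = ("http://www.w3.org/International/multilingualweb/lt/drafts/its20/"
        "examples/html5/EX-HTML-whitespace-normalization.html")
text = " Welcome to Dublin in Ireland!"
start = text.index("Dublin")   # 12 (the leading space counts)
end = start + len("Dublin")    # 18
print(subject_uri(base, start, end))  # → ...html#char=12,18
```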
So we would need to compare whether two implementations create the
same triples of the above form for a given input document. "The same" does
not mean "the same offsets", e.g. "#char=12,18", since white space
normalization for the NIF conversion is not defined in ITS2. "The same"
means: for each node that contains ITS information there must be a
triple like above, with the same predicate and object.
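This comparison rule can be sketched in a few lines of Python. Triples are modeled here as plain (subject, predicate, object) tuples; parsing actual Turtle is out of scope, and the function names are illustrative assumptions.

```python
# Sketch of the comparison under 5): two NIF outputs count as "the same"
# if they carry the same predicate/object pairs, regardless of the
# character offsets in the subject URIs (offsets may differ because
# white space normalization is not defined for the NIF conversion).

from collections import Counter

def comparable(triples):
    """Reduce triples to a multiset of (predicate, object) pairs."""
    return Counter((p, o) for s, p, o in triples)

def same_its_info(triples_a, triples_b):
    return comparable(triples_a) == comparable(triples_b)

a = [(":char=12,18", "itsrdf:taIdentRef",
      "<http://dbpedia.org/resource/Dublin>")]
b = [(":char=11,17", "itsrdf:taIdentRef",       # different offsets ...
      "<http://dbpedia.org/resource/Dublin>")]  # ... same annotation
print(same_its_info(a, b))  # → True
```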
6) Do we need automatic comparison?
With the above definition of "the same", an automatic comparison of
the output is hard. But it is not needed IMO: NIF conversion is one
feature like e.g. Translate "global"; so having 5-10 input files that we
can check manually would be sufficient. Also, we don't need to cover
several data categories in one NIF conversion test file, and we might
restrict the testing even to one data category: "text analysis",
"translate", ...
7) How to do this practically?
Below is a template for the output of the ITS2NIF conversion, in Turtle
serialization.
[
@prefix : <XXX-base-uriXXX#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif:
<http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:char=0,XXX-complete-length-XXX a nif:Context;
nif:isString " XXX-complete-source-file-text-content-XXX";
nif:sourceUrl <XXX-base-uriXXX> .
:char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a
nif:RFC5147String;
nif:anchorOf "XXX-annotated-string-XXX";
nif:referenceContext :char=0,XXX-complete-length-XXX;
XXX-annotation-predicate-XXX XXX-annotation-value-XXX .
]
The part starting with
":char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a
nif:RFC5147String;"
and ending with
"XXX-annotation-predicate-XXX XXX-annotation-value-XXX ."
would be needed for each annotation.
Here is an example of how the template would be filled in for
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
Note that in the test file below, I am only processing "text analysis"
information; see 6) above for the rationale.
[
@prefix :
<http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#>
.
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif:
<http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:char=0,30 a nif:Context;
nif:isString " Welcome to Dublin in Ireland!";
nif:sourceUrl
<http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html>
.
:char=12,18 a nif:RFC5147String;
nif:anchorOf "Dublin";
nif:referenceContext :char=0,30;
itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin> .
:xpath(/html/body%5B1%5D/h2%5B1%5D/span%5B1%5D/text()%5B1%5D)
nif:convertedFrom :char=12,18 .
]
8) What effort is needed from test suite contributors?
For Leroy / TCD as the test suite owner, I'd say no effort is needed
except creating directories for NIF input / output and documenting them
on the test suite main page, saying that they are part of the normative
conformance testing.
We need at least one test suite contributor that would - in addition to
me - implement the conversion exemplified in 7). The implementation
should be pretty straightforward:
0 Re-use the code that generates your test suite output
1 Create the template under 7), fill in file base URI "XXX-base-uriXXX",
"XXX-complete-source-file-text-content-XXX" and "XXX-complete-length-XXX"
2 For each node that has ITS 2 annotation:
2.1 create a subject URI. That is, instead of generating XPath in a line
like /html/body[1]/h2[1]/span[1] , you
2.1.1 generate XXX-ITS-Annotation-Start-XXX: count the characters in all
element nodes preceding the current node
2.1.2 generate XXX-ITS-Annotation-End-XXX: take the string length of
the current node and add XXX-ITS-Annotation-Start-XXX
2.2 Now you have the subject URI, using the above offsets and the base
URI. And now you can create this triple:
:char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a
nif:RFC5147String;
2.3 create the nif:anchorOf triple by putting the string value of the
current node in, e.g. "Dublin";
2.4 create the nif:referenceContext triple by using the count of the
complete document string length, e.g. :char=0,30;
2.5 for each annotation, create the triples, e.g. "itsrdf:taIdentRef
<http://dbpedia.org/resource/Dublin>"
3 after the last annotation, put the dot "." at the end instead of ";".
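The steps above can be sketched in Python. This is a minimal sketch under the assumption that the annotated text nodes have already been extracted (e.g. by the code that produces your normal test suite output); the Annotation structure and the its_to_nif name are illustrative assumptions, not part of any spec.

```python
# Sketch of steps 1-3 above: compute character offsets per annotated
# node and emit the Turtle template. Input structure is an assumption.

from dataclasses import dataclass

@dataclass
class Annotation:
    preceding_text: str  # text content of all nodes before this one
    node_text: str       # string value of the annotated node
    predicate: str       # e.g. "itsrdf:taIdentRef"
    value: str           # e.g. "<http://dbpedia.org/resource/Dublin>"

def its_to_nif(base_uri: str, full_text: str, annotations) -> str:
    lines = [
        f"@prefix : <{base_uri}#> .",
        "@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .",
        "@prefix nif: "
        "<http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .",
        f":char=0,{len(full_text)} a nif:Context;",
        f'  nif:isString "{full_text}";',
        f"  nif:sourceUrl <{base_uri}> .",
    ]
    for ann in annotations:
        start = len(ann.preceding_text)   # step 2.1.1: chars before node
        end = start + len(ann.node_text)  # step 2.1.2: start + node length
        lines += [
            f":char={start},{end} a nif:RFC5147String;",          # 2.2
            f'  nif:anchorOf "{ann.node_text}";',                 # 2.3
            f"  nif:referenceContext :char=0,{len(full_text)};",  # 2.4
            f"  {ann.predicate} {ann.value} .",                   # 2.5, 3
        ]
    return "\n".join(lines)

base = ("http://www.w3.org/International/multilingualweb/lt/drafts/its20/"
        "examples/html5/EX-HTML-whitespace-normalization.html")
text = " Welcome to Dublin in Ireland!"
ann = Annotation(" Welcome to ", "Dublin", "itsrdf:taIdentRef",
                 "<http://dbpedia.org/resource/Dublin>")
out = its_to_nif(base, text, [ann])
print(out)
```

For the example file, this reproduces the :char=0,30 context and the :char=12,18 annotation shown under 7).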
9) How much testing do we need?
I have implemented the conversion for Text Analysis local, see
http://www.w3.org/People/fsasaki/its20-general-processor/tools/its-ta-2-nif.xsl
and an example conversion, in which the triples are shown with the RDF
validator http://tinyurl.com/qhhjgmb
So if we had one more implementer who could do 8), we would be done.
I'd be happy to contribute input files. I would also be happy to do this
for any other data category. But before starting I'd like to know which
data category to use, so that I don't need to redo the input files.
10) When do we need this?
We need this for finalizing ITS2, that is the testing needs to be done
within the next three weeks. I hope that with the test suite based
description under 8) the processing of the actual conversion is
straightforward and a question of a few hours for those who are
producing test suite output anyway.
11) How critical is this?
We need - in addition to me - one more volunteer, otherwise we can't
finalize ITS2.
Best,
Felix
Received on Thursday, 23 May 2013 07:17:23 UTC