- From: Felix Sasaki <fsasaki@w3.org>
- Date: Thu, 23 May 2013 09:16:52 +0200
- To: Multilingual Web LT-TESTS Public <public-multilingualweb-lt-tests@w3.org>
- Message-ID: <519DC264.4060400@w3.org>
Hi all,

we have one feature in ITS2 that is not yet tested: the conversion to NIF. To fill this gap, I would propose the following approach.

1) The conversion is tested with example files for one data category. There is no need to have the conversion output of several data categories in one output file.

2) Like the definition of the ITS > NIF algorithm
http://www.w3.org/TR/2013/WD-its20-20130521/#conversion-to-nif
the conversion output represents ITS information, but no information about how it was generated (local markup, global, inheritance, defaults).

3) RDF can be represented in various formats: RDF/XML, RDFa, Turtle, ... For testing purposes, having just one representation would be good, for ease of comparison.

4) It doesn't make sense to add the NIF conversion input / output to the test suite master file and to take it into account for comparison. The reason is that the test suite master file does a line-by-line comparison of test suite output. That doesn't provide useful info for NIF.

5) So how to compare output? The "meat" of the conversion to NIF is that for each node in a document that holds ITS information, a triple of the following form is generated:

SubjectURI ITS2DataCategorySpecificPredicate Value

"SubjectURI" is a URI that consists of the document base URI plus "#" plus character offsets. Example: for "Dublin" in
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
this is
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#char=12,18

"ITS2DataCategorySpecificPredicate" is an RDF predicate that identifies the data category information in question, e.g. for "Dublin" it is "itsrdf:taIdentRef".

"Value" is the data category value. For "Dublin" in the above example, this is
http://dbpedia.org/resource/Dublin

So we would need to compare whether two implementations create the same triples of the above form for a given input document.

"The same" does not mean "the same offsets", e.g. "#char=12,18", since white space normalization for the NIF conversion is not defined in ITS2. "The same" means: for each node that contains ITS information there must be a triple like the one above, with the same predicate and object.

6) Do we need automatic comparison? With the definition of "the same" given above, an automatic comparison of the output is hard. But it is not needed IMO: NIF conversion is one feature like e.g. Translate "global", so having 5-10 input files that we can check manually would be sufficient. Also, we don't need to cover several data categories in one NIF conversion test file, and we might even restrict the testing to one data category: "text analysis", "translate", ...

7) How to do this practically? Below is a template for the output of the ITS2NIF conversion, in Turtle serialization.

[
@prefix : <XXX-base-uriXXX#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

:char=0,XXX-complete-length-XXX
  a nif:Context;
  nif:isString " XXX-complete-source-file-text-content-XXX";
  nif:sourceUrl <XXX-base-uriXXX> .

:char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX
  a nif:RFC5147String;
  nif:anchorOf "XXX-annotated-string-XXX";
  nif:referenceContext :char=0,XXX-complete-length-XXX;
  XXX-annotation-predicate-XXX XXX-annotation-value-XXX .
]

The part starting with ":char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;" and ending with "XXX-annotation-predicate-XXX XXX-annotation-value-XXX ." would be needed for each annotation.

Here is an example of how the template would be filled in for
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
Note that in the test file below, I am only processing "text analysis" information, see 6) above for the rationale.
[
@prefix : <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

:char=0,30
  a nif:Context;
  nif:isString " Welcome to Dublin in Ireland!";
  nif:sourceUrl <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html> .

:char=12,18
  a nif:RFC5147String;
  nif:anchorOf "Dublin";
  nif:referenceContext :char=0,30;
  itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin> .

:xpath(/html/body%5B1%5D/h2%5B1%5D/span%5B1%5D/text()%5B1%5D)
  nif:convertedFrom :char=12,18 .
]

8) What effort is needed from test suite contributors? For Leroy / TCD as the test suite owner, I'd say no effort is needed except creating directories for NIF input / output and documenting them on the test suite main page, saying that they are part of the normative conformance testing.

We need at least one test suite contributor who would - in addition to me - implement the conversion exemplified in 7). The implementation should be pretty straightforward:

0 Re-use the code that generates your test suite output
1 Create the template under 7), fill in the file base URI "XXX-base-uriXXX", "XXX-complete-source-file-text-content-XXX" and "XXX-complete-length-XXX"
2 For each node that has ITS 2 annotation:
2.1 create a subject URI. That is, instead of generating XPath in a line like /html/body[1]/h2[1]/span[1] , you
2.1.1 generate XXX-ITS-Annotation-Start-XXX: count the characters in all element nodes preceding the current node
2.1.2 generate XXX-ITS-Annotation-End-XXX: take the string length of the current node and add XXX-ITS-Annotation-Start-XXX
2.2 Now you have the subject URI, built from the above offsets and the base URI.
And now you can create this triple:
:char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;
2.3 create the nif:anchorOf triple by putting in the string value of the current node, e.g. "Dublin";
2.4 create the nif:referenceContext triple by using the count of the complete document string length, e.g. :char=0,30;
2.5 for each annotation, create the triples, e.g. "itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin>"
3 after the last annotation, put the dot "." at the end instead of ";".

9) How much testing do we need? I have implemented the conversion for Text Analysis local, see
http://www.w3.org/People/fsasaki/its20-general-processor/tools/its-ta-2-nif.xsl
and an example conversion, in which the triples are shown with the RDF validator:
http://tinyurl.com/qhhjgmb
So if we had one more implementer who could do 8), we would be done. I'd be happy to contribute input files. I would also be happy to do this for any other data category. But before starting I'd like to know which data category to use, so that I don't need to redo the input files.

10) When do we need this? We need this for finalizing ITS2, that is, the testing needs to be done within the next three weeks. I hope that with the test suite based description under 8) the actual conversion is straightforward and a question of a few hours for those who are producing test suite output anyway.

11) How critical is this? We need - in addition to me - one more volunteer, otherwise we can't finalize ITS2.

Best,

Felix
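
For illustration, the steps under 8) can be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions, not the XSLT implementation linked under 9). The Annotation class, the quote() helper and the its_to_nif() function are hypothetical names. The sketch assumes the document text content and the start offset of each annotated node (step 2.1.1) are already available from the code that produces the regular test suite output, and it handles one predicate per annotated node. Unlike the template under 7), it writes the subjects as full IRIs in angle brackets rather than as ":char=..." prefixed names, on the assumption that strict Turtle parsers may reject unescaped "=" and "," in prefixed names.

```python
from dataclasses import dataclass

NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
ITSRDF = "http://www.w3.org/2005/11/its/rdf#"


@dataclass
class Annotation:
    """Hypothetical container for one annotated node (not defined by ITS or NIF)."""
    start: int      # step 2.1.1: number of characters preceding the node
    text: str       # string value of the node, e.g. "Dublin"
    predicate: str  # e.g. "itsrdf:taIdentRef"
    value: str      # object as a Turtle term, e.g. "<http://dbpedia.org/resource/Dublin>"


def quote(s: str) -> str:
    """Serialize a Python string as a double-quoted Turtle string literal."""
    escaped = s.replace("\\", "\\\\").replace('"', '\\"')
    escaped = escaped.replace("\n", "\\n").replace("\r", "\\r")
    return f'"{escaped}"'


def its_to_nif(base_uri: str, text: str, annotations: list[Annotation]) -> str:
    """Fill in the template from 7) for one document and its annotations."""

    def subject(start: int, end: int) -> str:
        # Step 2.2: base URI + "#char=" + offsets, written as a full IRI.
        return f"<{base_uri}#char={start},{end}>"

    # Step 1: the context resource covers the complete text content.
    context = subject(0, len(text))
    out = [
        f"@prefix itsrdf: <{ITSRDF}> .",
        f"@prefix nif: <{NIF}> .",
        "",
        f"{context} a nif:Context;",
        f"  nif:isString {quote(text)};",
        f"  nif:sourceUrl <{base_uri}> .",
    ]
    for ann in annotations:
        # Step 2.1.2: end offset = start offset + string length of the node.
        end = ann.start + len(ann.text)
        if text[ann.start:end] != ann.text:
            raise ValueError(f"offsets {ann.start},{end} do not select {ann.text!r}")
        out += [
            "",
            f"{subject(ann.start, end)} a nif:RFC5147String;",
            f"  nif:anchorOf {quote(ann.text)};",           # step 2.3
            f"  nif:referenceContext {context};",           # step 2.4
            f"  {ann.predicate} {ann.value} .",             # steps 2.5 and 3
        ]
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    base = ("http://www.w3.org/International/multilingualweb/lt/drafts/"
            "its20/examples/html5/EX-HTML-whitespace-normalization.html")
    dublin = Annotation(12, "Dublin", "itsrdf:taIdentRef",
                        "<http://dbpedia.org/resource/Dublin>")
    print(its_to_nif(base, " Welcome to Dublin in Ireland!", [dublin]))
```

Run on the "Dublin" example, this should yield the same context and annotation triples as the filled-in template under 7), apart from the subject notation. The nif:convertedFrom triple for the XPath is left out of the sketch.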
Received on Thursday, 23 May 2013 07:17:23 UTC