- From: Felix Sasaki <fsasaki@w3.org>
- Date: Thu, 23 May 2013 09:30:08 +0200
- To: Phil Ritchie <philr@vistatec.ie>
- CC: Multilingual Web LT-TESTS Public <public-multilingualweb-lt-tests@w3.org>
- Message-ID: <519DC580.6050404@w3.org>
Cool, thanks a lot, Phil, that was fast! I will then create a few input / output files for LQI by Monday & let's see who else will step up.

Best,

Felix

On 23.05.13 09:24, Phil Ritchie wrote:
> Felix
>
> I volunteer. If the single test category could be LQI all the better.
>
> Phil
>
>
>
> On 23 May 2013, at 08:17, "Felix Sasaki" <fsasaki@w3.org
> <mailto:fsasaki@w3.org>> wrote:
>
>> Hi all,
>>
>> we have one feature in ITS2 that is not yet tested: the conversion to
>> NIF. To fill this gap, I would propose the following approach.
>>
>> 1) The conversion is tested with example files for one data category.
>> No need to have the conversion output of several data categories in
>> one output file.
>>
>>
>> 2) Like the definition of the ITS > NIF algorithm
>> http://www.w3.org/TR/2013/WD-its20-20130521/#conversion-to-nif
>> the conversion output represents ITS information, but no information
>> about how it was generated (local markup, global, inheritance, defaults).
>>
>>
>> 3) RDF can be represented in various formats: RDF/XML, RDFa, Turtle,
>> ... For testing purposes, having just one representation would be good,
>> for ease of comparison.
>>
>>
>> 4) It doesn't make sense to add the NIF conversion input / output to
>> the test suite master file and to take it into account for
>> comparison. The reason is that the test suite master file does a line
>> by line comparison of test suite output. That doesn't provide useful
>> info for NIF.
>>
>>
>> 5) So how to compare output? The "meat" of the conversion to NIF is
>> that for each node in a document that holds ITS information, a triple
>> of the following form is generated:
>> SubjectURI ITS2DataCategorySpecificPredicate Value
>>
>> "SubjectURI" is a URI that consists of the document base URI plus
>> "#" plus character offsets.
>> Example for "Dublin" in
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
>> this is
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#char=12,18
>>
>> "ITS2DataCategorySpecificPredicate" is an RDF predicate that
>> identifies the data category information in question, e.g. for
>> "Dublin" it is "itsrdf:taIdentRef".
>>
>> "Value" is the data category value. For "Dublin" in the above
>> example, this is http://dbpedia.org/resource/Dublin
>>
>> So we would need to compare whether two implementations create the
>> same triples of the above form for a given input document. "The same"
>> does not mean "the same offsets", e.g. "#char=12,18", since white
>> space normalization for the NIF conversion is not defined in ITS2.
>> "The same" means: for each node that contains ITS information there
>> must be a triple like above, with the same predicate and object.
>>
>>
>> 6) Do we need automatic comparison?
>> With the definition of "the same" like above, an automatic comparison
>> of the output is hard. But it is not needed IMO: NIF conversion is
>> one feature like e.g. Translate "global"; so having 5-10 input files
>> that we can check manually would be sufficient. Also, we don't need
>> to cover several data categories in one NIF conversion test file, and
>> we might restrict the testing even to one data category: "text
>> analysis", "translate", ...
>>
>>
>> 7) How to do this practically?
>>
>> Below is a template for the output of the ITS2NIF conversion, in Turtle
>> serialization.
>>
>> [
>> @prefix : <XXX-base-uriXXX#> .
>> @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
>> @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>
>> :char=0,XXX-complete-length-XXX a nif:Context;
>>     nif:isString " XXX-complete-source-file-text-content-XXX";
>>     nif:sourceUrl <XXX-base-uriXXX> .
>>
>> :char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;
>>     nif:anchorOf "XXX-annotated-string-XXX";
>>     nif:referenceContext :char=0,XXX-complete-length-XXX;
>>     XXX-annotation-predicate-XXX XXX-annotation-value-XXX .
>> ]
>>
>> The part starting with
>> ":char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a
>> nif:RFC5147String;"
>> and ending with
>> "XXX-annotation-predicate-XXX XXX-annotation-value-XXX ."
>> would be needed for each annotation.
>>
>> Here is an example of how the template would be filled in for
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
>> Note that in the test file below, I am only processing "text analysis"
>> information, see 6) above for the rationale.
>>
>> [
>> @prefix : <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#> .
>> @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
>> @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>
>> :char=0,30 a nif:Context;
>>     nif:isString " Welcome to Dublin in Ireland!";
>>     nif:sourceUrl <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html> .
>>
>> :char=12,18 a nif:RFC5147String;
>>     nif:anchorOf "Dublin";
>>     nif:referenceContext :char=0,30;
>>     itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin> .
>>
>> :xpath(/html/body%5B1%5D/h2%5B1%5D/span%5B1%5D/text()%5B1%5D)
>>     nif:convertedFrom :char=12,18 .
>> ]
>>
>>
>> 8) What effort is needed from test suite contributors?
>>
>> For Leroy / TCD as the test suite owner, I'd say no effort is needed
>> except creating directories for NIF input / output and documenting
>> them on the test suite main page, saying that they are part of the
>> normative conformance testing.
>>
>> We need at least one test suite contributor that would - in addition
>> to me - implement the conversion exemplified in 7). The
>> implementation should be pretty straightforward:
>> 0 Re-use the code that generates your test suite output
>> 1 Create the template under 7), fill in the file base URI
>> "XXX-base-uriXXX", "XXX-complete-source-file-text-content-XXX" and
>> "XXX-complete-length-XXX"
>> 2 For each node that has ITS 2 annotation:
>> 2.1 create a subject URI. That is, instead of generating XPath in a
>> line like /html/body[1]/h2[1]/span[1] , you
>> 2.1.1 generate XXX-ITS-Annotation-Start-XXX: count the characters in
>> all element nodes preceding the current node
>> 2.1.2 generate XXX-ITS-Annotation-End-XXX: take the string length of
>> the current node and add XXX-ITS-Annotation-Start-XXX
>> 2.2 Now you have the subject URI, built from the above offsets and the
>> base URI, and you can create this triple:
>> :char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a
>> nif:RFC5147String;
>> 2.3 create the nif:anchorOf triple by putting the string value of the
>> current node in, e.g. "Dublin";
>> 2.4 create the nif:referenceContext triple by using the count of the
>> complete document string length, e.g. :char=0,30;
>> 2.5 for each annotation, create the triples, e.g. "itsrdf:taIdentRef
>> <http://dbpedia.org/resource/Dublin>"
>> 3 after the last annotation, put the dot "." at the end instead of ";".
>>
>> 9) How much testing do we need?
>>
>> I have implemented the conversion for Text Analysis local, see
>> http://www.w3.org/People/fsasaki/its20-general-processor/tools/its-ta-2-nif.xsl
>> and an example conversion, in which the triples are shown with the
>> RDF validator http://tinyurl.com/qhhjgmb
>> So if we had one more implementer who could do 8) we would be
>> done.
>> I'd be happy to contribute input files. I would also be happy to do
>> this for any other data category. But before starting I'd like to
>> know which data category to use, so that I don't need to redo the
>> input files.
>>
>>
>> 10) When do we need this?
>>
>> We need this for finalizing ITS2, that is, the testing needs to be
>> done within the next three weeks. I hope that with the test suite
>> based description under 8) the processing of the actual conversion
>> is straightforward and a question of a few hours for those who are
>> producing test suite output anyway.
>>
>>
>> 11) How critical is this?
>>
>> We need - in addition to me - one more volunteer, otherwise we can't
>> finalize ITS2.
>>
>>
>> Best,
>>
>> Felix
>
>
> ************************************************************
> VistaTEC Ltd. Registered in Ireland 268483.
> Registered Office, VistaTEC House, 700, South Circular Road,
> Kilmainham. Dublin 8. Ireland.
>
> The information contained in this message, including any accompanying
> documents, is confidential and is intended only for the addressee(s).
> The unauthorized use, disclosure, copying, or alteration of this
> message is strictly forbidden. If you have received this message in
> error please notify the sender immediately.
> ************************************************************
>
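
As an illustration of steps 1 to 3 under 8) and of the notion of "the same" under 5), here is a minimal Python sketch. It is not the XSLT implementation linked under 9). The names Annotation, its_to_nif and same_annotations are hypothetical, and the sketch assumes that the complete text content and each annotation's start offset have already been extracted, with one predicate / value pair per annotated node. The output mirrors the template under 7) literally; a strict Turtle parser may want the "=" and "," in the ":char=..." names escaped.

```python
from dataclasses import dataclass

# Prefix declarations copied from the template under 7).
PREFIXES = (
    "@prefix : <{base}#> .\n"
    "@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .\n"
    "@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .\n"
    "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n"
)


@dataclass
class Annotation:
    """One ITS annotation on one node (hypothetical helper type)."""
    start: int      # characters preceding the node (step 2.1.1)
    text: str       # string value of the node (step 2.3)
    predicate: str  # e.g. "itsrdf:taIdentRef"
    value: str      # already serialized term, e.g. "<http://...>"


def escape(s: str) -> str:
    """Escape backslashes, quotes and newlines for a Turtle string literal."""
    return s.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def its_to_nif(base_uri: str, text: str, annotations: list) -> str:
    """Fill in the template: prefixes, one nif:Context, one block per annotation."""
    context = f":char=0,{len(text)}"  # step 1: complete document length
    blocks = [
        PREFIXES.format(base=base_uri),
        f"{context} a nif:Context;\n"
        f'    nif:isString "{escape(text)}";\n'
        f"    nif:sourceUrl <{base_uri}> .\n",
    ]
    for ann in annotations:
        end = ann.start + len(ann.text)  # step 2.1.2: start plus node length
        blocks.append(
            f":char={ann.start},{end} a nif:RFC5147String;\n"
            f'    nif:anchorOf "{escape(ann.text)}";\n'
            f"    nif:referenceContext {context};\n"
            f"    {ann.predicate} {ann.value} .\n"  # step 3: final dot
        )
    return "\n".join(blocks)


def same_annotations(a: list, b: list) -> bool:
    """Compare two outputs as in 5): same predicates and objects, offsets ignored."""
    def key(anns):
        return sorted((x.predicate, x.value) for x in anns)
    return key(a) == key(b)
```

Called with the text " Welcome to Dublin in Ireland!" and a single annotation for "Dublin" starting at offset 12, its_to_nif produces the ":char=0,30" context and the ":char=12,18" block of the filled-in example under 7), without the nif:convertedFrom triple.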
Received on Thursday, 23 May 2013 07:30:43 UTC