- From: Phil Ritchie <philr@vistatec.ie>
- Date: Thu, 23 May 2013 20:26:53 +0100
- To: "Felix Sasaki" <fsasaki@w3.org>
- Cc: "Multilingual Web LT-TESTS Public" <public-multilingualweb-lt-tests@w3.org>
- Message-ID: <7CEC544F-70B5-4515-8959-B2AC6568F862@vistatec.ie>
Felix,

Do we need to do the conversion for both HTML and XML, or would XML suffice (my preference)?

Phil.

On 23 May 2013, at 08:17, "Felix Sasaki" <fsasaki@w3.org> wrote:

> Hi all,
>
> we have one feature in ITS2 that is not yet tested: the conversion to NIF. To fill this gap, I would propose the following approach.
>
> 1) The conversion is tested with example files for one data category. No need to have the conversion output of several data categories in one output file.
>
> 2) Like the definition of the NIF > ITS algorithm
> http://www.w3.org/TR/2013/WD-its20-20130521/#conversion-to-nif
> the conversion output represents ITS information, but no information about how it was generated (local markup, global, inheritance, defaults).
>
> 3) RDF can be represented in various formats: RDF/XML, RDFa, Turtle, ... For testing purposes, having just one representation would be good, for ease of comparison.
>
> 4) It doesn't make sense to add the NIF conversion input / output to the test suite master file and to take it into account for comparison. The reason is that the test suite master file does a line by line comparison of test suite output. That doesn't provide useful info for NIF.
>
> 5) So how to compare output? The "meat" of the conversion to NIF is that for each node in a document that holds ITS information, a triple of the following form is generated:
> SubjectURI ITS2DataCategorySpecificPredicate Value
>
> "SubjectURI" is a URI that consists of the document base URI plus "#" plus character offsets. Example for "Dublin" in
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
> this is
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#char=12,18
>
> "ITS2DataCategorySpecificPredicate" is an RDF predicate that identifies the data category information in question, e.g. for "Dublin" it is "itsrdf:taIdentRef".
>
> "Value" is the data category value. For "Dublin" in the above example, this is http://dbpedia.org/resource/Dublin
>
> So we would need to compare whether two implementations create the same triples of the above form for a given input document. "The same" does not mean "the same offsets", e.g. "#char=12,18", since white space normalization for the NIF conversion is not defined in ITS2. "The same" means: for each node that contains ITS information there must be a triple like above, with the same predicate and object.
>
> 6) Do we need automatic comparison?
> With the definition of "the same" like above, an automatic comparison of the output is hard. But it is not needed IMO: NIF conversion is one feature like e.g. Translate "global"; so having 5-10 input files that we can check manually would be sufficient. Also, we don't need to cover several data categories in one NIF conversion test file, and we might restrict the testing even to one data category: "text analysis", "translate", ...
>
> 7) How to do this practically?
>
> Below is a template for the output of the ITS2NIF conversion, in Turtle serialization.
>
> [
> @prefix : <XXX-base-uriXXX#> .
> @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
> @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>
> :char=0,XXX-complete-length-XXX a nif:Context;
>   nif:isString " XXX-complete-source-file-text-content-XXX";
>   nif:sourceUrl <XXX-base-uriXXX> .
>
> :char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;
>   nif:anchorOf "XXX-annotated-string-XXX";
>   nif:referenceContext :char=0,XXX-complete-length-XXX;
>   XXX-annotation-predicate-XXX XXX-annotation-value-XXX .
> ]
>
> The part starting with
> ":char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;"
> and ending with
> "XXX-annotation-predicate-XXX XXX-annotation-value-XXX ."
> would be needed for each annotation.
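
To make the template in 7) concrete, here is a minimal Python sketch of filling it in. It is not taken from any existing implementation: the names `nif_turtle` and `turtle_escape` and the tuple layout for annotations are assumptions made for illustration, and the `:char=...` notation is copied from the template as written rather than checked against a Turtle parser.

```python
def turtle_escape(value):
    """Escape a string for use inside a double-quoted Turtle literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def nif_turtle(base_uri, text, annotations):
    """Fill in the template from 7).

    annotations is a list of (start, end, predicate, value) tuples
    (a hypothetical layout); predicate and value are written out
    verbatim, e.g. "itsrdf:taIdentRef" and "<http://...>".
    """
    total = len(text)
    lines = [
        f"@prefix : <{base_uri}#> .",
        "@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .",
        "@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .",
        "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
        "",
        f":char=0,{total} a nif:Context;",
        f'  nif:isString "{turtle_escape(text)}";',
        f"  nif:sourceUrl <{base_uri}> .",
    ]
    # One block per annotation, as the template requires.
    for start, end, predicate, value in annotations:
        lines += [
            "",
            f":char={start},{end} a nif:RFC5147String;",
            f'  nif:anchorOf "{turtle_escape(text[start:end])}";',
            f"  nif:referenceContext :char=0,{total};",
            f"  {predicate} {value} .",
        ]
    return "\n".join(lines) + "\n"
```

Called with the text " Welcome to Dublin in Ireland!" and a single annotation (12, 18, "itsrdf:taIdentRef", "<http://dbpedia.org/resource/Dublin>"), this produces the context block and one annotation block in the shape of the template.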
>
> Here is an example of how the template would be filled in for
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
> Note that in the test file below, I am only processing "text analysis" information, see 6) above for the rationale.
>
> [
> @prefix : <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#> .
> @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
> @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>
> :char=0,30 a nif:Context;
>   nif:isString " Welcome to Dublin in Ireland!";
>   nif:sourceUrl <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html> .
>
> :char=12,18 a nif:RFC5147String;
>   nif:anchorOf "Dublin";
>   nif:referenceContext :char=0,30;
>   itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin> .
>
> :xpath(/html/body%5B1%5D/h2%5B1%5D/span%5B1%5D/text()%5B1%5D) nif:convertedFrom :char=12,18 .
> ]
>
> 8) What effort is needed from test suite contributors?
>
> For Leroy / TCD as the test suite owner, I'd say no effort is needed except creating directories for NIF input / output and documenting them on the test suite main page, saying that they are part of the normative conformance testing.
>
> We need at least one test suite contributor who would - in addition to me - implement the conversion exemplified in 7). The implementation should be pretty straightforward:
> 0 Re-use the code that generates your test suite output
> 1 Create the template under 7), fill in file base URI "XXX-base-uriXXX", "XXX-complete-source-file-text-content-XXX" and "XXX-complete-length-XXX"
> 2 For each node that has ITS 2 annotation:
> 2.1 create a subject URI.
> That is, instead of generating XPath in a line like /html/body[1]/h2[1]/span[1] , you
> 2.1.1 generate XXX-ITS-Annotation-Start-XXX: count the characters in all element nodes preceding the current node
> 2.1.2 generate XXX-ITS-Annotation-End-XXX: count the string length of the current node and add it to XXX-ITS-Annotation-Start-XXX
> 2.2 Now you have the subject URI, using the above offsets and the base URI. And now you can create this triple:
> :char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;
> 2.3 create the nif:anchorOf triple by putting the string value of the current node in, e.g. "Dublin";
> 2.4 create the nif:referenceContext triple by using the count of the complete document string length, e.g. :char=0,30;
> 2.5 for each annotation, create the triples, e.g. "itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin>"
> 3 after the last annotation, put the dot "." at the end instead of ";".
>
> 9) How much testing do we need?
>
> I have implemented the conversion for Text Analysis local, see
> http://www.w3.org/People/fsasaki/its20-general-processor/tools/its-ta-2-nif.xsl
> and an example conversion, in which the triples are shown with the RDF validator http://tinyurl.com/qhhjgmb
> So if we had one more implementer who could do 8) we would be done.
> I'd be happy to contribute input files. I would also be happy to do this for any other data category. But before starting I'd like to know which data category to use, so that I don't need to redo the input files.
>
> 10) When do we need this?
>
> We need this for finalizing ITS2, that is, the testing needs to be done within the next three weeks. I hope that with the test suite based description under 8) the processing of the actual conversion is straightforward and a question of a few hours for those who are producing test suite output anyway.
>
> 11) How critical is this?
>
> We need - in addition to me - one more volunteer, otherwise we can't finalize ITS2.
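
The offset counting in steps 2.1.1 and 2.1.2 under 8) can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is well-formed XML, only local `its-ta-ident-ref` attributes are read (no global rules, inheritance or defaults), no white space normalization is applied, and the function names `char_offsets` and `ta_annotations` are hypothetical.

```python
import xml.etree.ElementTree as ET


def char_offsets(root):
    """Return (full_text, offsets), where offsets maps each element to (start, end).

    start is the number of characters of text preceding the element in
    document order (step 2.1.1); end is start plus the length of the
    element's string value (step 2.1.2).
    """
    parts = []
    offsets = {}
    pos = 0

    def walk(elem):
        nonlocal pos
        start = pos
        if elem.text:
            parts.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            # Text after a child still belongs to the parent's string value.
            if child.tail:
                parts.append(child.tail)
                pos += len(child.tail)
        offsets[elem] = (start, pos)

    walk(root)
    return "".join(parts), offsets


def ta_annotations(xml_source, attr="its-ta-ident-ref"):
    """Collect (start, end, predicate, value) for locally annotated elements."""
    root = ET.fromstring(xml_source)
    text, offsets = char_offsets(root)
    found = []
    for elem in root.iter():
        ref = elem.get(attr)
        if ref:
            start, end = offsets[elem]
            found.append((start, end, "itsrdf:taIdentRef", f"<{ref}>"))
    return text, found
```

For a simplified, made-up input such as `<body><h2> Welcome to <span its-ta-ident-ref="http://dbpedia.org/resource/Dublin">Dublin</span> in Ireland!</h2></body>`, the span gets the offsets 12 and 18 and the complete text has length 30, matching the example in 7). A real HTML file would also contribute text from other elements, so the offsets depend on how text and white space are collected.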
>
> Best,
>
> Felix
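
For reference, the definition of "the same" under 5) could be sketched in Python like this, in case a semi-automatic check is ever wanted. It assumes each implementation's output has already been parsed into (subject, predicate, object) string tuples with prefixed names; the function names and the choice to look only at `itsrdf:` predicates are assumptions for illustration, not part of the proposal.

```python
from collections import Counter


def its_signature(triples, prefix="itsrdf:"):
    """Count (predicate, object) pairs of ITS triples, ignoring the subject.

    The subject carries the character offsets, which may legitimately
    differ between implementations, so it is dropped before comparing.
    """
    return Counter((p, o) for _s, p, o in triples if p.startswith(prefix))


def same_its_triples(triples_a, triples_b):
    """True if both outputs contain the same ITS predicate/object pairs."""
    return its_signature(triples_a) == its_signature(triples_b)
```

Two outputs that annotate "Dublin" at `:char=12,18` and `:char=11,17` respectively, both with `itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin>`, would count as the same under this check. It cannot tell whether the triples are attached to the same node, so it complements manual review rather than replacing it.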
Received on Thursday, 23 May 2013 19:27:27 UTC