- From: Felix Sasaki via cvs-syncmail <cvsmail@w3.org>
- Date: Fri, 12 Oct 2012 15:45:07 +0000
- To: public-multilingualweb-lt-commits@w3.org
Update of /w3ccvs/WWW/International/multilingualweb/lt/drafts/its20 In directory hutz:/tmp/cvs-serv4699 Modified Files: its20.html its20.odd Log Message: NIF conversion described Index: its20.odd =================================================================== RCS file: /w3ccvs/WWW/International/multilingualweb/lt/drafts/its20/its20.odd,v retrieving revision 1.180 retrieving revision 1.181 diff -u -d -r1.180 -r1.181 --- its20.odd 11 Oct 2012 19:57:54 -0000 1.180 +++ its20.odd 12 Oct 2012 15:45:05 -0000 1.181 @@ -70,7 +70,7 @@ 1.0</loc>; it is designed to foster the creation of multilingual Web content, focusing on HTML5, XML based formats in general, and to leverage localization workflows based on the XML Localization Interchange File Format (XLIFF). In addition - to HTML5 and XML, algorithms to convert ITS attributes to RDFa and NIF are + to HTML5 and XML, algorithms to convert ITS attributes to NIF are provided.</p> </abstract> <status> @@ -87,7 +87,7 @@ 1.0</loc>; it is designed to foster the creation of multilingual Web content, focusing on HTML5, XML based formats in general, and to leverage localization workflows based on the XML Localization Interchange File Format (XLIFF). In addition - to HTML5 and XML, algorithms to convert ITS attributes to RDFa and NIF are + to HTML5 and XML, algorithms to convert ITS attributes to NIF are provided.</p> <p>This document is an updated Public Working Draft published by the <loc @@ -157,7 +157,6 @@ of these concepts (termed “ITS data categories”) as a set of elements and attributes called the <emph>Internationalization Tag Set (ITS)</emph>. 
The document provides implementations for HTML5, serializations in <ref - target="http://www.w3.org/TR/xhtml-rdfa-primer/">RDFa</ref> and <ref target="http://nlp2rdf.org/nif-1-0">NIF</ref>, and the schema languages XML DTD <ptr target="#xml10spec" type="bibref"/>, XML Schema <ptr target="#xmlschema1" type="bibref"/> and RELAX NG <ptr target="#relaxng" @@ -209,10 +208,10 @@ 1.0:</p> <list type="unorderd"> <item>ITS 2.0 data categories are intended to be format neutral, with - support for XML, HTML5, RDFa, and NIF: a data category + support for XML, HTML5, and NIF: a data category implementation only needs to support a single content format mapping in order to support a claim of ITS 2.0 conformance.</item> - <item>ITS 2.0 provides algorithms to generate RDFa and NIF out of HTML5 + <item>ITS 2.0 provides algorithms to generate NIF out of HTML5 or XML with ITS 2.0 metadata.</item> <item>A global implementation of ITS 2.0 requires at least the XPath version 1.0. Other versions of XPath or other query languages (e.g., @@ -1147,12 +1146,15 @@ claims to process ITS markup implementing the conformance clauses 2-1, 2-2 and 2-3, it <ref target="#rfc-keywords">MUST</ref> process that markup with HTML5 or with XML documents.</p></item> + <item><p xml:id="its-conformance-2-4"><emph>2-4:</emph> After processing ITS information on the basis of conformance clauses <ref target="#its-conformance-2-1">2-1</ref> and <ref target="#its-conformance-2-2">2-2</ref>, an application <ref target="#rfc-keywords" + >MAY</ref> convert an XML or HTML document (or its DOM representation) to NIF, using the algorithm described in <ptr target="#conversion-to-nif" type="specref"/>.</p></item> </list> + <note><p>The conformance clause <ref target="#its-conformance-2-4">2-4</ref> essentially means that the conversion to NIF is an optional feature of ITS 2.0, and that the conversion is independent of whether ITS information has been made available via the global or local selection mechanisms, see 
conformance clause <ref target="#its-conformance-2-1-1">2-1-1</ref>.</p></note> <p xml:id="its-processing-conformance-claims">Statements related to this conformance type <ref target="#rfc-keywords">MUST</ref> list all <ref target="#def-datacat">data categories</ref> they implement, and for each <ref target="#def-datacat">data category</ref> which type of selection - they support, and whether they support processing of XML and / or HTML5.</p> + they support, whether they support processing of XML and / or HTML5. If the implementation provides the conversion to NIF (see conformance clause <ref target="#its-conformance-2-4">2-4</ref>), this <ref target="#rfc-keywords">MUST</ref> be stated.</p> <note><p>The above conformance clauses are directly reflected in the <ref target="http://phaedrus.scss.tcd.ie/its2.0/its-testsuite.html#">ITS @@ -1161,7 +1163,7 @@ local selection, or both; they require the processing of defaults and precedence of selections (clauses 2-1-2 and 2-1-3); for each data category there are tests with linked rules (2-2); and all types of tests - are given for XML and HTML5 content (clause 2-3). Implementors are + are given for XML and HTML5 content (clause 2-3). In addition, there are test cases for conversion to NIF (clause 2-4). 
Implementors are encouraged to organize their documentation in a similar way, so that users of ITS 2.0 easily can understand the processing capabilities availably.</p></note> @@ -1673,11 +1675,101 @@ </list> </div> - <div xml:id="conversion-to-nif-and-RDFa"> - <head>Conversion to NIF and RDFa</head> - <p>This section will be written in an updated version of this document.</p> - <note type="ed">Here the algorithm for the conversion and some examples (HTML5 - its- input < RDFa and NIF output) need to be added.</note> + <div xml:id="conversion-to-nif"> + <head>Conversion to NIF</head> + <p>This section defines an algorithm to convert XML or HTML documents (or their DOM representations) that contain ITS metadata to the RDF-based format <ref + target="http://nlp2rdf.org/nif-1-0">NIF</ref>. The conversion results in RDF triples that rely on the ITS 2.0 ontology, see tbd.</p> + <note type="ed">Add link to ontology once it is done; assure that the examples use the correct base URIs for the ontology.</note> + <note><p>The algorithm is intended to extract the text from the XML/HTML/DOM for an NLP tool and can produce a lot of <quote>phantom</quote> predicates from excessive whitespace, which 1) increases the size of the intermediate mapping and 2) extracts this whitespace as text. This might decrease NLP performance. It is recommended to normalize whitespace in the input XML/HTML/DOM in order to minimize such phantom predicates. A normalized example is given below. The whitespace normalization algorithm itself is format dependend, e.g. it differs for HTML compared to general XML. 
Hence no normative algorithm for whitespace normalization is given as part of this specification.</p></note> + <exemplum xml:id="EX-HTML-whitespace-normalization"> + <head>Example of an HTML document with whitespace nornalized, as a preparation for conversion to NIF</head> +<eg><![CDATA[<html><body><h2 translate="yes">Welcome to <span + its-disambig-ident-ref="http://dbpedia.org/resource/Dublin" + translate="no">Dublin</span> in <b translate="no">Ireland</b>!</h2></body></html>]]></eg> + </exemplum> + <p xml:id="its2nif-algorithm">The conversion algorithm to generate NIF consists of seven steps.</p> + <list type="unordered"> + <item><p xml:id="its2nif-algorithm-step1">STEP 1: Get an ordered list of all text nodes of the document.</p></item> + <item><p xml:id="its2nif-algorithm-step2">STEP 2: Generate an XPath expression for each non-empty text node of all leaf elements and remember them.</p></item> + <item><p xml:id="its2nif-algorithm-step3">STEP 3: Get the text for each node and make a tuple with the XPath expressions (X,T). Since the text nodes have a certain order we now have a list of ordered tuples ((x0,t0), (x1,t1), ..., (xn,tn)).</p></item> + <item><p xml:id="its2nif-algorithm-step4">STEP 4 (optional): Serialize as XML or as RDF. The list with the XPath-to-text mapping can also be kept in memory. Part of a serialization example is given below.</p></item> + </list> + <eg><![CDATA[@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> . +<http://example.com/exampledoc.html#xpath(x0)> + itsrdf:xpath2nif <http://example.com/exampledoc.html#offset_b0_e0> +<http://example.com/exampledoc.html#xpath(x1)> + itsrdf:xpath2nif <http://example.com/exampledoc.html#offset_b1_e1> +# ... +<http://example.com/exampledoc.html#xpath(xn)> + itsrdf:xpath2nif <http://example.com/exampledoc.html#offset_bn_en> +<mappings> + <mapping x="xpath(x0)" b="b0" e="e0" /> + <mapping x="xpath(x1)" b="b1" e="e1" /> + <!-- ... 
--> + <mapping x="xpath(xn)" b="bn" e="en" /> +</mappings>]]></eg> + <p>where</p> + <eg><![CDATA[b0 = 0 +e0 = b0 + (Number of characters of t0) +b1 = e0 +1 +e1 = b1 + (Number of characters of t1) +... +bn = e(n-1) +1 +en = bn + (Number of characters of tn) +]]></eg> + <p>Example (continued)</p> + <eg><![CDATA[@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> . +# "Welcome to " +<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/text()[1])> + itsrdf:nif <http://example.com/exampledoc.html#offset_0_11> . +# "Dublin" +<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/span[1]/text()[1])> + itsrdf:nif <http://example.com/exampledoc.html#offset_11_17> . +# " in " +<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/text()[2])> + itsrdf:nif <http://example.com/exampledoc.html#offset_17_21> . +# "Ireland" +<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/b[1]/text()[1])> + itsrdf:nif <http://example.com/exampledoc.html#offset_21_28> . +# "!" +<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/text()[3])> + itsrdf:nif <http://example.com/exampledoc.html#offset_28_29> . +# "Welcome to Dublin Ireland!" +<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/text())> + itsrdf:nif <http://example.com/exampledoc.html#offset_0_29> . 
+<mappings> + <mapping x="xpath(/html/body[1]/h2[1]/text()[1])" b="0" e="11" /> + <mapping x="xpath(/html/body[1]/h2[1]/span[1]/text()[1])" b="11" e="17" /> + <mapping x="xpath(/html/body[1]/h2[1]/text()[2])" b="17" e="21" /> + <mapping x="xpath(/html/body[1]/h2[1]/b[1]/text()[1])" b="21" e="28" /> + <mapping x="xpath(/html/body[1]/h2[1]/text()[3])" b="28" e="29" /> + <mapping x="xpath(/html/body[1]/h2[1])" b="0" e="29" /> +</mappings>]]></eg> + <note type="ed">Below needs a reference to the ITS ontology, once available.</note> + <list type="unordered"> + <item><p xml:id="its2nif-algorithm-step5">STEP 5: Create a context URI and attach the whole concatenated text of the document as reference.</p></item> + <item><p xml:id="its2nif-algorithm-step6">STEP 6: Now attach any ITS metadata items from the XML/HTML/DOM input to respective NIF URIs using the ITS/RDF ontology (TODO Name).</p></item> + <item><p xml:id="its2nif-algorithm-step7">STEP 7: Omit all irrelevant URIs (those that do not carry annotations, they will just bloat the data).</p></item> + </list> + <eg><![CDATA[@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> . +<http://example.com/exampledoc.html#offset_0_29> + rdf:type str:Context ; +# concatenate the whole text + str:isString "$(t0+t1+t2+...+tn)" ; + itsrdf:translate "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo> ; + str:occursIn <http://example.com/exampledoc.html> . +<http://example.com/exampledoc.html#offset_11_17> + rdf:type str:String ; + itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo> ; + itsrdf:disambigIdentRef <http://dbpedia.org/resource/Dublin> ; + str:referenceContext <http://example.com/exampledoc.html#offset_0_29> . +<http://example.com/exampledoc.html#offset_21_28> + rdf:type str:String ; + itsrdf:translate "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo> ; + str:referenceContext <http://example.com/exampledoc.html#offset_0_29> . 
+]]></eg> + <p>A complete sample output in RDF/XML format after step 7, given the input document <ptr target="#EX-HTML-whitespace-normalization" type="exref"/>, is available at <ref target="examples/nif/EX-nif-conversion-output.xml">examples/nif/EX-nif-conversion-output.xml</ref>.</p> + <note><p>The conversion to NIF is the basis for natural language processing (NLP) applications, creating for example named entity annotations. A non-normative algorithm to integrate these annotations into the original input document is given in <ptr target="#nif-backconversion" type="specref"/>. The algorithm in that appendix is non-normative since many choices depend on the actual NLP application.</p></note> </div> </div> <div xml:id="datacategory-description"> @@ -5237,6 +5329,64 @@ </item> </list> </div> + <div xml:id="nif-backconversion" type="inform"> + <head>Conversion NIF2ITS</head> + <p>The following algoritm relies on <ptr type="exref" target="#EX-HTML-whitespace-normalization"/>. It is assumed that the example has been converted to NIF, leading to the <ref target="examples/nif/EX-nif-conversion-output.xml">output</ref> exemplified for the <ref target="#its2nif-algorithm">ITS2NIF conversion algorithm</ref>.</p> + <p>As a natural language processing (NLP) tool, we choose <ref target="https://github.com/dbpedia-spotlight/dbpedia-spotlight#readme">DBpedia Spotlight</ref>. For this example let's assume DBpedia Spotlight linked "Ireland" to DBpedia:</p> + <eg><![CDATA[<http://example.com/exampledoc.html#offset_21_28> + rdf:type str:String ; + itsrdf:disambigIdentRef <http://dbpedia.org/resource/Ireland> . +<http://dbpedia.org/resource/Ireland> + rdf:type <http:/nerd.eurecom.fr/ontology#Country> . +]]></eg> + <p xml:id="nif2its-algorithm">The conversion algorithm to generate ITS out of NIF consists of two steps.</p> + <list type="unordered"> + <item><p xml:id="nif2its-algorithm-step1">STEP 1: Send the text to any NIF web service, which creates the NLP annotation. 
The output of the Web service will be a NIF representation that uses the itsrdf ontology directly.</p></item> + <item><p xml:id="nif2its-algorithm-step3">STEP 2: Use the mapping from ITS2NIF (available after <ref target="#its2nif-algorithm-step7">step 7</ref> of the ITS2NIF algorithm) to reintegrate annotations in the original ITS annotated document.</p> + </item> + </list> + <p>For step 2, three cases can occur.</p> + <note type="ed">Need to check that the annotations shown for case 1 and case 2 are conform to the latest definition of "disambiguation".</note> + <p>CASE 1: The NLP annotation created in NIF matches the text node. Solution: Attach the annotation to the parent element of the text node.</p> + <eg><![CDATA[# Based on: +<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/b[1]/text()[1])> + itsrdf:nif <http://example.com/exampledoc.html#offset_21_28> . +# and: +<http://example.com/exampledoc.html#offset_21_28> + itsrdf:disambigIdentRef <http://dbpedia.org/resource/Ireland> . +# we can attach the metadata to the parent node: +<b its-disambig-ident-ref="http://dbpedia.org/resource/Dublin” + translate="no">Ireland</b> +]]></eg> + <p>CASE 2: The NLP annotation created in NIF is a substring of the text node. Solution: Create a new element, e.g. for HTML5 "span". A different input example is given below as case 2 is not covered in the original example input.</p> + <eg><![CDATA[# Input: + +<html> + <body> + <h2>Welcome to Dublin in Ireland!</h2> + </body> +</html> + +# ITS2NIF + +<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/text()[1])> + itsrdf:nif <http://example.com/exampledoc.html#offset_0_29> + +# DBpedia Spotlight returns: + +<http://example.com/exampledoc.html#offset_21_28> + itsrdf:disambigIdentRef <http://dbpedia.org/resource/Ireland> . 
+ +# NIF2ITS + +<html> + <body> + <h2 >Welcome to Dublin in <span + its-disambig-ident-ref="http://dbpedia.org/resource/Ireland” >Ireland</span>!</h2> + </body> +</html>]]></eg> + <p>Case 3: The NLP annotation created in NIF starts in one region and ends in another. Solution: No straight mapping is possible; a mapping can be created if both regions have the same parent.</p> + </div> <div xml:id="revisionlog" type="inform"> <head>Revision Log</head> <p xml:id="changelog-since-20120829">The following log records major changes that @@ -5270,7 +5420,7 @@ <item><p>Added new kind of user to <ptr target="#potential-users" type="specref"/>.</p></item> <item><p>Added the algorithm to obtain the value of the <ref target="#domain">Domain</ref> data category.</p></item> <item><p>Updated the <ref target="#allowedchars">Allowed Characters</ref> data category for the empty string case and the way to define "allow any characters"..</p></item> - + <item><p>Added sections related to NIF conversion (<ptr type="specref" target="#conversion-to-nif"/> and <ptr type="specref" target="#nif-backconversion"/>) and a related conformance clause <ref target="#its-conformance-2-4">2-4</ref>.</p></item> </list> <p xml:id="changelog-since-20120731">The following log records major changes that have been made to this document since the <ref @@ -5341,7 +5491,7 @@ target="#EX-term-local-html-1" type="exref"/></item> <item>Added placeholders for new data categories to <ptr target="#datacategory-description" type="specref"/></item> - <item>Added a placeholder section <ptr target="#conversion-to-nif-and-RDFa" + <item>Added a placeholder section <ptr target="#conversion-to-nif" type="specref"/></item> </list> </div> Index: its20.html =================================================================== RCS file: /w3ccvs/WWW/International/multilingualweb/lt/drafts/its20/its20.html,v retrieving revision 1.183 retrieving revision 1.184 diff -u -d -r1.183 -r1.184 --- its20.html 11 Oct 2012 19:57:54 -0000 1.183 
+++ its20.html 12 Oct 2012 15:45:04 -0000 1.184 @@ -12,7 +12,7 @@ 1.0</a>; it is designed to foster the creation of multilingual Web content, focusing on HTML5, XML based formats in general, and to leverage localization workflows based on the XML Localization Interchange File Format (XLIFF). In addition - to HTML5 and XML, algorithms to convert ITS attributes to RDFa and NIF are + to HTML5 and XML, algorithms to convert ITS attributes to NIF are provided.</p></div><div> <h2><a name="status" shape="rect">Status of this Document</a></h2><p><strong>This document is an editors' copy that has no official standing.</strong></p></div><div class="toc"> @@ -53,15 +53,15 @@ <div class="toc3">5.2.3 <a href="#selection-local" shape="rect">Local Selection in an XML Document</a></div> </div> [...1080 lines suppressed...] @@ -3689,7 +3795,7 @@ about informative mappings of <a href="#lqissue-typevalues" shape="rect">Values for the Localization Quality Issue Type</a> to the <a href="http://www.w3.org/International/its/wiki/Tool_specific_mappings" shape="rect">ITS IG wiki</a>.</p></li><li><p>Added a <a href="#its-conformance-2-3" shape="rect">conformance clause</a> about HTML5 versus XML processing.</p></li><li><p>Added links to XML and HTML5 examples to the <a href="#datacategories-overview" shape="rect">data category overview - table</a>.</p></li><li><p>Added new kind of user to <a class="section-ref" href="#potential-users" shape="rect">Section 1.3.1: Potential Users of ITS</a>.</p></li><li><p>Added the algorithm to obtain the value of the <a href="#domain" shape="rect">Domain</a> data category.</p></li><li><p>Updated the <a href="#allowedchars" shape="rect">Allowed Characters</a> data category for the empty string case and the way to define "allow any characters"..</p></li></ol><p id="changelog-since-20120731">The following log records major changes that + table</a>.</p></li><li><p>Added new kind of user to <a class="section-ref" href="#potential-users" shape="rect">Section 
1.3.1: Potential Users of ITS</a>.</p></li><li><p>Added the algorithm to obtain the value of the <a href="#domain" shape="rect">Domain</a> data category.</p></li><li><p>Updated the <a href="#allowedchars" shape="rect">Allowed Characters</a> data category for the empty string case and the way to define "allow any characters"..</p></li><li><p>Added sections related to NIF conversion (<a class="section-ref" href="#conversion-to-nif" shape="rect">Section 5.7: Conversion to NIF</a> and <a class="section-ref" href="#nif-backconversion" shape="rect">Appendix G: Conversion NIF2ITS</a>) and a related conformance clause <a href="#its-conformance-2-4" shape="rect">2-4</a>.</p></li></ol><p id="changelog-since-20120731">The following log records major changes that have been made to this document since the <a href="http://www.w3.org/TR/2012/WD-its20-20120731/" shape="rect">ITS 2.0 Working Draft 31 July 2012</a>.</p><ol class="depth1"><li><p>Added <a class="section-ref" href="#Disambiguation" shape="rect">Section 6.10: Disambiguation</a>.</p></li><li><p>Added <a class="section-ref" href="#preservespace" shape="rect">Section 6.17: Preserve Space</a>.</p></li><li><p>Added <a class="section-ref" href="#idvalue" shape="rect">Section 6.16: Id Value</a>.</p></li><li><p>Added support for different query language and reworked whole XPath and CSS Selectors integration.</p></li><li><p>Added examples to <a class="section-ref" href="#externalresource" shape="rect">Section 6.14: External Resource</a>.</p></li><li><p>Simplified <a class="section-ref" href="#LocaleFilter" shape="rect">Section 6.11: Locale Filter</a>.</p></li><li><p>Added a note about HTML5 and the attributes <code>dir</code> and @@ -3704,8 +3810,8 @@ <a class="section-ref" href="#relation-to-its10" shape="rect">Section 1.1.1: Relation to ITS 1.0</a></p></li><li><p>Created HTML5 based declarations for various data categories, see e.g. 
HTML5 declarations for the Terminology data category and the summary for - local data categories in <a class="section-ref" href="#selection-local" shape="rect">Section 5.2.3: Local Selection in an XML Document</a></p></li><li><p>Created examples for these declarations, see e.g. <a href="#EX-term-local-html-1" shape="rect">Example 38</a></p></li><li><p>Added placeholders for new data categories to <a class="section-ref" href="#datacategory-description" shape="rect">Section 6: Description of Data Categories</a></p></li><li><p>Added a placeholder section <a class="section-ref" href="#conversion-to-nif-and-RDFa" shape="rect">Section 5.7: Conversion to NIF and RDFa</a></p></li></ol></div><div class="div1"> -<h2><a href="#contents" shape="rect"><img src="images/topOfPage.gif" align="right" height="26" width="26" title="Go to the table of contents." alt="Go to the table of contents."/></a><a name="acknowledgements" id="acknowledgements" shape="rect"/>H Acknowledgements (Non-Normative)</h2><p>This document has been developed with contributions by the MultilingualWeb-LT + local data categories in <a class="section-ref" href="#selection-local" shape="rect">Section 5.2.3: Local Selection in an XML Document</a></p></li><li><p>Created examples for these declarations, see e.g. <a href="#EX-term-local-html-1" shape="rect">Example 39</a></p></li><li><p>Added placeholders for new data categories to <a class="section-ref" href="#datacategory-description" shape="rect">Section 6: Description of Data Categories</a></p></li><li><p>Added a placeholder section <a class="section-ref" href="#conversion-to-nif" shape="rect">Section 5.7: Conversion to NIF</a></p></li></ol></div><div class="div1"> +<h2><a href="#contents" shape="rect"><img src="images/topOfPage.gif" align="right" height="26" width="26" title="Go to the table of contents." 
alt="Go to the table of contents."/></a><a name="acknowledgements" id="acknowledgements" shape="rect"/>I Acknowledgements (Non-Normative)</h2><p>This document has been developed with contributions by the MultilingualWeb-LT Working Group: Mihael Arcan (DERI Galway at the National University of Ireland, Galway, Ireland), Pablo Badía (Linguaserve), Aaron Beaton (Opera Software), Luis Bellido (Universidad Politécnica de Madrid), Aljoscha Burchardt (German Research
Received on Friday, 12 October 2012 15:45:10 UTC
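
Reader's note on the commit above: steps 1 to 3 of the ITS2NIF algorithm added in this revision (ordered text nodes, one XPath expression per text node, begin/end offsets) can be sketched as follows. This is only an illustrative sketch, not the Working Group's implementation. The function names `xpath_text_pairs` and `offset_mapping` are hypothetical, and the input is assumed to be well-formed XML so that Python's standard `xml.etree.ElementTree` can parse it (real HTML5 would need an HTML parser). The offsets follow the worked example in the diff, where each begin offset equals the previous end offset (0_11, 11_17, ...); the general "where" formulas in the same diff instead add 1 to each begin offset, and which of the two is intended is not settled here.

```python
import xml.etree.ElementTree as ET

# Whitespace-normalized example document from the commit (EX-HTML-whitespace-normalization).
DOC = (
    '<html><body><h2 translate="yes">Welcome to <span '
    'its-disambig-ident-ref="http://dbpedia.org/resource/Dublin" '
    'translate="no">Dublin</span> in <b translate="no">Ireland</b>!</h2></body></html>'
)


def xpath_text_pairs(elem, path):
    """Yield (xpath, text) for every non-empty text node below elem, in document order."""
    text_index = 0      # position of the text node among elem's text() children
    tag_counts = {}     # per-tag position of child elements, for the [n] predicates
    if elem.text:
        text_index += 1
        yield f"{path}/text()[{text_index}]", elem.text
    for child in elem:
        tag_counts[child.tag] = tag_counts.get(child.tag, 0) + 1
        child_path = f"{path}/{child.tag}[{tag_counts[child.tag]}]"
        yield from xpath_text_pairs(child, child_path)
        if child.tail:  # text that follows the child is a text node of elem
            text_index += 1
            yield f"{path}/text()[{text_index}]", child.tail


def offset_mapping(pairs):
    """Turn ordered (xpath, text) pairs into (xpath, begin, end) triples.

    Assumption: offsets are contiguous (b[i+1] == e[i]), as in the worked example.
    """
    mapping = []
    begin = 0
    for xpath, text in pairs:
        end = begin + len(text)
        mapping.append((xpath, begin, end))
        begin = end
    return mapping


root = ET.fromstring(DOC)
pairs = list(xpath_text_pairs(root, "/" + root.tag))
mapping = offset_mapping(pairs)
context_text = "".join(text for _, text in pairs)  # whole text, as attached in step 5

for xpath, begin, end in mapping:
    print(f"xpath({xpath}) -> offset_{begin}_{end}")
```

Keeping `mapping` in memory corresponds to the optional step 4; the same triples are what the NIF2ITS appendix relies on to map an annotated offset range back to a text node.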