Note:

The algorithm is intended to extract the text from the XML/HTML/DOM representation for use by an NLP tool. If the input contains many whitespace characters, the algorithm can produce a lot of "phantom" predicates: the whitespace 1) increases the size of the intermediate mapping representation, and 2) is itself extracted as text, which might decrease the overall NLP performance. To avoid this situation, it is strongly recommended to normalize the whitespace characters in the input XML/HTML/DOM representation; Example 25 shows a normalized HTML document. Because whitespace normalization is format dependent (for example, it differs between HTML and general XML), this specification cannot give a normative algorithm for it.
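As a non-normative illustration only, the following Python sketch shows one possible normalization for generic, data-oriented XML. The function names `normalize_whitespace` and `_collapse` are hypothetical and not part of this specification. The sketch assumes that whitespace-only text carries no meaning; it ignores `xml:space`, HTML elements such as `pre`, and the fact that a single space between inline elements can be significant, so it is not suitable for HTML as written.

```python
import re
import xml.etree.ElementTree as ET

_WS = re.compile(r"\s+")


def _collapse(text):
    """Collapse each whitespace run to one space; drop whitespace-only text."""
    if text is None:
        return None
    collapsed = _WS.sub(" ", text)
    # Assumption: whitespace-only text is insignificant and can be removed.
    return None if collapsed.strip() == "" else collapsed


def normalize_whitespace(root):
    """Normalize the text and tail of every element below (and including) root."""
    for elem in root.iter():
        elem.text = _collapse(elem.text)
        elem.tail = _collapse(elem.tail)
    return root
```

Applied to an indented document such as `<doc>\n  <p>Hello   \n world</p>\n</doc>`, this sketch removes the indentation between elements and reduces the content of `p` to `Hello world`.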

Example 25: Example of an HTML document with whitespace character normalization as preparation for the conversion to NIF

STEP 2: Generate an XPath expression for each non-empty text node of all leaf elements and memorize them.

STEP 6: Attach any ITS metadata annotations from the XML/HTML/DOM input to the respective NIF URIs.

STEP 7: Omit all URIs that do not carry annotations (they will just bloat the data).
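STEP 2 and STEP 7 can be sketched in Python as follows. This is a minimal, non-normative sketch: the function names `leaf_text_xpaths` and `prune_unannotated`, the exact shape of the generated XPath expressions, and the dictionary used to hold annotations are assumptions made for illustration. The sketch treats whitespace-only text as non-empty, which is why unnormalized input yields extra entries, and it does not handle namespaced element names.

```python
import xml.etree.ElementTree as ET


def leaf_text_xpaths(root):
    """STEP 2 sketch: (xpath, text) pairs for non-empty text nodes of leaf elements."""
    results = []

    def walk(elem, path):
        children = list(elem)
        if not children:
            # Leaf element: any text, even whitespace-only, counts as non-empty.
            if elem.text:
                results.append((path + "/text()[1]", elem.text))
            return
        seen = {}  # per-tag counter for positional predicates such as p[2]
        for child in children:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, "/" + root.tag)
    return results


def prune_unannotated(annotations_by_uri):
    """STEP 7 sketch: keep only the URIs that carry at least one annotation."""
    return {uri: ann for uri, ann in annotations_by_uri.items() if ann}
```

For example, with placeholder URIs and a placeholder annotation:

```python
doc = ET.fromstring("<html><body><h2>Welcome</h2><p>Dublin</p></body></html>")
print(leaf_text_xpaths(doc))

annotated = prune_unannotated({
    "http://example.com/doc#char=0,7": {},
    "http://example.com/doc#char=7,13": {"taIdentRef": "http://example.com/Dublin"},
})
print(annotated)  # only the second URI remains
```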

Note:

The conversion to NIF is a possible basis for a natural language processing (NLP) application that creates, for example, named entity annotations. A non-normative algorithm to integrate these annotations into the original input document is given in Appendix F: Conversion NIF2ITS. This algorithm is non-normative because many decisions depend on the NLP application actually employed.