
Re: rdfa parsing issue -- was: fixed https://foafssl.org/test/WebId

From: Jürgen Jakobitsch <j.jakobitsch@semantic-web.at>
Date: Fri, 06 Jan 2012 22:03:11 +0100 (CET)
To: Henry Story <henry.story@bblfish.net>
Cc: "public-xg-webid@w3.org XG" <public-xg-webid@w3.org>, Damian Steer <pldms@mac.com>
Message-ID: <55bbd84e-64e2-4679-b875-b83e1eec924e@zcs>

i had exactly the same problem with the rdfa parser from openrdf and the DTD.

what you want to do is:

1. create a catalog (a catalog.xml, plus local copies of all the DTDs)
2. add the file "CatalogManager.properties" to the classpath (in a maven project on netbeans you would simply put it in "other resources", so it gets jar'd)
3. modify the code so the xml reader uses that catalog.

my parser looks roughly like this:

   // fields, initialised once
   CatalogResolver catRes;     // org.apache.xml.resolver.tools.CatalogResolver
   Transformer transformer;

   // setup; XSLT is the classpath name of the rdfa stylesheet
   TransformerFactory transFact = TransformerFactory.newInstance();
   CatalogManager catMan = new CatalogManager("CatalogManager.properties");
   catRes = new CatalogResolver(catMan);
   ClassLoader cl = RDFaParser.class.getClassLoader();
   Templates cachedXSLT = transFact.newTemplates(new StreamSource(cl.getResourceAsStream(XSLT)));
   transformer = cachedXSLT.newTransformer();

   // parser method, called with a StreamSource "source"; "out" is the output stream
   XMLReader reader = XMLReaderFactory.createXMLReader();
   reader.setFeature("http://xml.org/sax/features/validation", false);
   reader.setEntityResolver(catRes);   // step 3: DTD lookups go through the catalog
   Source sXML = new SAXSource(reader, new InputSource(source.getInputStream()));
   transformer.transform(sXML, new StreamResult(out));

it took me some time to get this catalog thing up and running, so here are some links:

1. http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/  (see here for the overall trouble)
2. http://xml.apache.org/commons/components/resolver/resolver-article.html
3. http://nwalsh.com/docs/articles/xml2003/
4. http://xerces.apache.org/xerces2-j/faq-xcatalogs.html
5. http://www.sagehill.net/docbookxsl/WriteCatalog.html
6. http://xml.apache.org/mirrors.cgi (download apache's resolver)

to save you some time :

1. attached you'll find the catalog.xml and the catalog used by WebIDRealm (copy the contents of catalog.zip to /usr/share/catalogs/ and make sure the files are readable).
   if you copy the contents elsewhere, you need to change the path in CatalogManager.properties as well.
2. attached you'll also find the CatalogManager.properties used by WebIDRealm

the basic workflow is:

1. CatalogManager reads CatalogManager.properties and finds the path of catalog.xml
2. when resolving, CatalogResolver looks in catalog.xml to see whether the doctype (or module) is mapped there, and then tries to load the file from the
   path specified in catalog.xml.
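
the lookup in step 2 can be sketched without the resolver library. this is only an illustration of the idea, not what CatalogResolver does internally: the class LocalDtdDemo, the MapResolver stand-in and the one-line dummy DTD are made up for the example, and a plain map keyed on the public id takes the place of catalog.xml.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class LocalDtdDemo {

    /** Hypothetical stand-in for a catalog: public id -> local DTD text. */
    static class MapResolver implements EntityResolver {
        private final Map<String, String> byPublicId = new HashMap<>();
        int hits = 0; // how often a lookup was answered locally

        void map(String publicId, String dtdText) {
            byPublicId.put(publicId, dtdText);
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            String dtd = byPublicId.get(publicId);
            if (dtd == null) {
                return null; // not mapped: the parser falls back to the system id
            }
            hits++;
            InputSource src = new InputSource(new StringReader(dtd));
            src.setSystemId(systemId);
            return src;
        }
    }

    /** Parses xml with the given resolver and returns the root element name. */
    static String parseRootName(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(resolver);
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getNodeName();
    }

    public static void main(String[] args) throws Exception {
        MapResolver resolver = new MapResolver();
        // dummy DTD content, just enough to parse; a real setup would point at the downloaded DTD
        resolver.map("-//W3C//DTD XHTML+RDFa 1.0//EN", "<!ELEMENT html ANY>");
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML+RDFa 1.0//EN\" "
                + "\"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd\">"
                + "<html/>";
        System.out.println(parseRootName(xml, resolver) + ", local lookups: " + resolver.hits);
    }
}
```

the point is the same as with the real catalog: as long as the doctype is mapped, the parser never has to fetch the DTD from www.w3.org.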

if you have any questions regarding the catalog feel free to ask.

wkr j

----- Original Message -----
From: "Henry Story" <henry.story@bblfish.net>
To: "Damian Steer" <pldms@mac.com>
Cc: "Jürgen Jakobitsch" <j.jakobitsch@semantic-web.at>, "public-xg-webid@w3.org XG" <public-xg-webid@w3.org>
Sent: Friday, January 6, 2012 9:10:58 PM
Subject: Re: rdfa parsing issue -- was: fixed https://foafssl.org/test/WebId

Thanks Damian,

  that was very helpful. 

I have now fixed a couple of issues on my side, and I see that Jürgen has even updated his page to be closer to xhtml. So the foafssl.org tester should work with that resource in any case.

Btw, I get the following in the logs

 INF: [console logger] dispatch: 2sea.org GET /sea.jsp HTTP/1.1
ERROR [pool-3-thread-5] (RDFDefaultErrorHandler.java:40) - http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd(line 106 column 22): {E213} Unexpected end of file from server

It looks like the RDFa parser is following the DTDs. Is there a way to stop that? I guess the W3C does not serve those files.


On 6 Jan 2012, at 15:57, Damian Steer wrote:

> Hi Henry and Jürgen,
> On 06/01/12 12:49, Henry Story wrote:
>> Shellac's parser parses the xhtml correctly as xhtml in fact, but 
>> when the html parser is used it comes to a different conclusion.
> Yes, this is becoming a classic issue, and has nothing to do with RDFa
> (although RDFa obscures the issue horribly).
>> RDFA 1 is defined in xhtml only I understand, so it is true that we
>> are going beyond the spec by trying to parse html too. Perhaps
>> this will be a lot simplified with rdfa1.1 which can be made to work
>> with html5.
> Yes, RDFa 1.0 is only really defined for xhtml, although useful work was
> done on html 5 at the time (there are some html 5 tests). RDFa 1.1 does
> address html 5, but note that it doesn't change anything here.
> The problem is this:
>    <div rel="foaf:depiction" href="http://2sea.org/2sealogo.png"/>
>    <div rel="cert:key">
> 	...
>    </div>
> An xml parser sees a closed div, followed by another div. An html parser
> sees a broken div so repairs it as follows:
>    <div rel="foaf:depiction" href="http://2sea.org/2sealogo.png">
>      <div rel="cert:key">
>        ...
>      </div>
>    </div> <!-- close that div -->
> i.e. one div contains another now, and thus you find
> <http://2sea.org/2sealogo.png> cert:key ....
> I ought to add a utility to switch the parser based on content type,
> however in practice there's so much broken xhtml out there that tag soup
> parsing is much safer (although it does lead to issues like this).
> My advice would be to expect tag soup parsing in the wild and change the
> html:
>    <div rel="foaf:depiction" href="http://2sea.org/2sealogo.png"></div>
> Hope this makes sense,
> Damian
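
The XML half of Damian's explanation can be checked with the JDK's own DOM parser. This is a minimal sketch: the class SelfClosedDivDemo, the method parentOfCertKey and the wrapping body element are made up for the example, and the tag soup side is represented only by the already repaired markup he shows, since the JDK ships no HTML parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SelfClosedDivDemo {

    /** Returns the name of the element that directly contains the div with rel="cert:key". */
    static String parentOfCertKey(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        NodeList divs = root.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if ("cert:key".equals(div.getAttribute("rel"))) {
                return div.getParentNode().getNodeName();
            }
        }
        return null; // no cert:key div found
    }

    public static void main(String[] args) throws Exception {
        // as written in the page: the first div is self-closed
        String asXml = "<body>"
                + "<div rel=\"foaf:depiction\" href=\"http://2sea.org/2sealogo.png\"/>"
                + "<div rel=\"cert:key\">...</div>"
                + "</body>";
        // as repaired by an html parser, per the description above
        String repaired = "<body>"
                + "<div rel=\"foaf:depiction\" href=\"http://2sea.org/2sealogo.png\">"
                + "<div rel=\"cert:key\">...</div>"
                + "</div>"
                + "</body>";
        System.out.println(parentOfCertKey(asXml));    // prints "body": the divs are siblings
        System.out.println(parentOfCertKey(repaired)); // prints "div": cert:key is nested
    }
}
```

In the second case the cert:key div sits inside the foaf:depiction div, which is how the key ends up attached to the logo URL.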

Social Web Architect

| Jürgen Jakobitsch, 
| Software Developer
| Semantic Web Company GmbH
| Mariahilfer Straße 70 / Neubaugasse 1, Top 8
| A - 1070 Wien, Austria
| Mob +43 676 62 12 710 | Fax +43.1.402 12 35 - 22

| http://www.semantic-web.at/

| web   : http://www.turnguard.com
| foaf  : http://www.turnguard.com/turnguard
| skype : jakobitsch-punkt

Received on Friday, 6 January 2012 21:06:24 UTC
