- From: Valmik Desai <valmik@wayne.edu>
- Date: Sun, 20 Apr 2003 00:56:19 -0400
- To: html-tidy@w3.org
- Cc: valmik@wayne.edu
To html-tidy,

I am having a problem using Tidy. I am trying to convert HTML to XML. Tidy has been a great help with this, but it fails in some cases.

Tidy works fine when I save the HTML page from the browser: the HTML is converted to XML without any errors, with only some warnings. However, when I download it using a program and then run Tidy, it gives me errors and I cannot convert the HTML to XML.

This is the sample Java code I use to download the HTML:

    InputStream in = null;
    OutputStream out = null;
    URL url = new URL("http://" + siteurl);   // Create the URL
    in = url.openStream();                    // Open a stream to it
    out = new FileOutputStream(filename);
    // Now copy bytes from the URL to the output stream
    byte[] buffer = new byte[4096];
    int bytes_read = 0;
    while (true) {
        bytes_read = in.read(buffer);
        if (bytes_read == -1) break;
        out.write(buffer, 0, bytes_read);
    }

I use this command to convert the HTML to XML:

    tidy -asxml slash.html >slash.xml

The HTML used for my experiments is www.slashdot.com. The errors I got are:

    181 warnings, 11 errors were found! Not all warnings/errors were shown.

    This document has errors that must be fixed before
    using HTML Tidy to generate a tidied up version.

    URIs must be properly escaped, they must not contain unescaped
    characters below U+0021 including the space character and not
    above U+007E. Tidy escapes the URI for you as recommended by
    HTML 4.01 section B.2.1 and XML 1.0 section 4.2.2. Some user agents
    use another algorithm to escape such URIs and some server-sided
    scripts depend on that. If you want to depend on that, you must
    escape the URI by your own. For more information please refer to
    http://www.w3.org/International/O-URL-and-ident.html

    You may need to move one or both of the <form> and </form>
    tags. HTML elements should be properly nested and form elements
    are no exception. For instance you should not place the <form>
    in one table cell and the </form> in another. If the <form> is
    placed before a table, the </form> cannot be placed inside the
    table! Note that one form can't be nested inside another!

    The alt attribute should be used to give a short description
    of an image; longer descriptions should be given with the
    longdesc attribute which takes a URL linked to the description.
    These measures are needed for people using non-graphical browsers.

Can anyone help me with this?

Regards,
Valmik Desai.
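
For anyone trying to reproduce the download step, the loop quoted in the message can be written as a self-contained sketch along the following lines. The class name PageFetcher and the method names copy and download are hypothetical, and the try-with-resources form assumes Java 7 or later. The sketch only makes sure both streams are closed and the bytes are written unchanged; it is not claimed to explain or remove the Tidy errors.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

// Hypothetical helper class; the names are illustrative, not from the original post.
public class PageFetcher {

    // Copy every byte from in to out without decoding or altering it,
    // and return the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead);
            total += bytesRead;
        }
        return total;
    }

    // Download http://<siteUrl> into the named file.
    // try-with-resources closes both streams even if the copy fails.
    static long download(String siteUrl, String filename) throws IOException {
        URL url = new URL("http://" + siteUrl);
        try (InputStream in = url.openStream();
             OutputStream out = new FileOutputStream(filename)) {
            return copy(in, out);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java PageFetcher <host/path> <output file>");
            return;
        }
        long written = download(args[0], args[1]);
        System.out.println(written + " bytes written to " + args[1]);
    }
}
```

Run as, for example, "java PageFetcher www.slashdot.com slash.html" and then pass slash.html to the tidy command above. Comparing the resulting file with the copy saved from the browser would show whether the two inputs given to Tidy actually differ.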
Received on Sunday, 20 April 2003 00:56:21 UTC