
Problem converting to xml

From: Valmik Desai <valmik@wayne.edu>
Date: Sun, 20 Apr 2003 00:56:19 -0400
To: html-tidy@w3.org
Cc: valmik@wayne.edu
Message-Id: <e3aa8868.77582869.8195300@mirapointms2.wayne.edu>



To html-tidy,

I am facing a problem using Tidy.
I am trying to convert an HTML page to XML. Tidy has been a great
help to me for this, but I am running into problems in some cases.

Tidy works fine when I save the HTML page from the browser:
the HTML is converted to XML without any errors, with only some
warnings.
However, when I download the page using a program and then run Tidy,
it reports errors and I can't convert the HTML to XML.

This is the sample Java code I use to download an HTML page:
      InputStream in = null;
      OutputStream out = null;

      URL url = new URL("http://"+siteurl);   // Create the URL
      in = url.openStream();        // Open a stream to it
      out = new FileOutputStream(filename);

      // Now copy bytes from the URL to the output stream
      byte[] buffer = new byte[4096];
      int bytes_read = 0;
      while(true)
      {
          bytes_read = in.read(buffer);
          if(bytes_read == -1)
              break;
          out.write(buffer, 0, bytes_read);
      }
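
For reference, the download step above can be written as a self-contained sketch that also closes both streams. The class name StreamCopy and the methods copy and download are hypothetical names chosen for illustration; they are not part of the original program, and this is only one possible way to structure it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

// Hypothetical helper class; the name and methods are not from the original post.
final class StreamCopy {

    // Copies every byte from in to out and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead);
            total += bytesRead;
        }
        out.flush();
        return total;
    }

    // Downloads http://<siteurl> into filename, closing both streams even on failure.
    static long download(String siteurl, String filename) throws IOException {
        URL url = new URL("http://" + siteurl);
        try (InputStream in = url.openStream();
             OutputStream out = new FileOutputStream(filename)) {
            return copy(in, out);
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory check of the copy loop, so no network access is needed.
        byte[] page = "<html><body>test</body></html>".getBytes("UTF-8");
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(page), sink);
        System.out.println(copied + " bytes copied");
    }
}
```

Because the copy works on raw bytes, the saved file should contain exactly what the server sent to the program, which may differ from what a browser writes when saving the same page.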

I use this command to convert the HTML to XML:
tidy -asxml slash.html >slash.xml

The HTML page used for my experiments is www.slashdot.com.

The errors I got are: 
181 warnings, 11 errors were found! Not all warnings/errors
were shown.

This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

URIs must be properly escaped, they must not contain unescaped
characters below U+0021 including the space character and not
above U+007E. Tidy escapes the URI for you as recommended by
HTML 4.01 section B.2.1 and XML 1.0 section 4.2.2. Some user
agents use another algorithm to escape such URIs and some
server-sided scripts depend on that. If you want to depend on that,
you must escape the URI by your own. For more information please
refer to http://www.w3.org/International/O-URL-and-ident.html

You may need to move one or both of the <form> and </form>
tags. HTML elements should be properly nested and form elements
are no exception. For instance you should not place the <form>
in one table cell and the </form> in another. If the <form> is
placed before a table, the </form> cannot be placed inside the
table! Note that one form can't be nested inside another!

The alt attribute should be used to give a short description
of an image; longer descriptions should be given with the
longdesc attribute which takes a URL linked to the description.
These measures are needed for people using non-graphical browsers.



Can anyone help me with this?

Regards,
Valmik Desai.
Received on Sunday, 20 April 2003 00:56:21 GMT
