
Fwd: JTidy & Un-Tagged Text in HTML Doc

From: Kim <kime@atlantic.net>
Date: Tue, 6 Aug 2002 05:38:34 -0400
Message-ID: <1028626714.3d4f991a83f40@webmail.atlantic.net>
To: html-tidy@w3.org
Somehow this ended up in my mailbox. It is from jamesgc21@attbi.com

----- Forwarded message from Greg James <jamesgc21@attbi.com> -----
    Date: Mon, 5 Aug 2002 21:32:04 -0600
    From: Greg James <jamesgc21@attbi.com>
Reply-To: Greg James <jamesgc21@attbi.com>
 Subject: JTidy & Un-Tagged Text in HTML Doc
      To: html-tidy@w3.org

I'm trying to use JTidy to convert HTML pages to XML.  The HTML has several
'un-tagged' entries.  For example:

<P><A name=Hit3><B>3.</B></A> <A 
href="http://www.matrixscience.com/cgi/protein_view.pl?
file=../data/20020130/FaioSfs.dat&hit=4">gi|11528046</A>  <B>Mass:</B> 74711  
<B>Score:</B> 43     
(AF197556) coat protein [Beet necrotic yellow vein virus]
<B> Observed    Mr(expt)   Mr(calc)    Delta   Start     End  Miss  Peptide</B>
   564.70     564.70     565.25     -0.55     168 -   171    0   FEDR
   828.00     828.00     828.51     -0.51      44 -    51    0   AANLSIIK
  1032.30    1032.30    1032.56     -0.26     509 -   519    0   AAVAMTALASK
  2271.60    2271.60    2271.16      0.44     556 -   578    0   YVHTGIQGGAQLAGAMAVGAMLR
<B>No match to:</B> 1021.10, 3511.70

Is there an easy way to get JTidy to 'tag' the un-tagged text?  For example,
the text between the <B> elements?  I'd rather not write a Java program to tag
these lines prior to sending them to JTidy.

I'm setting the following params on JTidy:

tidy.setMakeClean(true);
tidy.setBreakBeforeBR(true);
tidy.setShowWarnings(false);
tidy.setOnlyErrors(true);
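
For reference, the pre-processing step I'd rather avoid could look roughly
like the sketch below. It uses only the Java standard library and does not
call JTidy at all. The class name UntaggedTextWrapper, the method
wrapUntaggedLines, and the choice of <span> as the wrapper tag are
hypothetical names made up for illustration. It assumes that a line counts as
'un-tagged' when it is non-blank and contains no '<', and that no tag is split
across lines, so a wrapped attribute such as the href in the sample above
would have to be rejoined first. Text that follows a closing tag on the same
line (for example, the numbers after "No match to:") is left untouched.

```java
import java.util.ArrayList;
import java.util.List;

class UntaggedTextWrapper {

    // Wraps each non-blank line that contains no '<' in <tag>...</tag>.
    // Lines that already contain markup, and blank lines, pass through unchanged.
    // Assumption: every tag starts and ends on a single line.
    static String wrapUntaggedLines(String html, String tag) {
        List<String> out = new ArrayList<>();
        // Limit -1 keeps trailing empty strings, so blank lines survive the round trip.
        for (String line : html.split("\n", -1)) {
            String trimmed = line.trim();
            boolean untagged = !trimmed.isEmpty() && line.indexOf('<') < 0;
            out.add(untagged ? "<" + tag + ">" + trimmed + "</" + tag + ">" : line);
        }
        return String.join("\n", out);
    }

    public static void main(String[] args) {
        // Shortened stand-in for the sample rows above.
        String sample = "<B>Observed  Peptide</B>\n"
                + "   564.70     FEDR\n"
                + "<B>No match to:</B> 1021.10";
        System.out.println(wrapUntaggedLines(sample, "span"));
    }
}
```

The result of wrapUntaggedLines would then be handed to JTidy in place of the
raw page, with the params above left as they are.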

Thanks.

----- End forwarded message -----




Received on Tuesday, 6 August 2002 05:38:36 GMT
