- From: Nilesh Chavan <mrnchavan@gmail.com>
- Date: Mon, 20 Jun 2011 22:54:59 -0400
- To: html-tidy@w3.org, Dave Raggett <dsr@w3.org>
- Message-ID: <BANLkTimGF8YxO3bmmUsabKsJcwFHVTf18g@mail.gmail.com>
Hello,

I’m using Tidy to create well-formed HTML output from a loosely organized HTML file. The HTML files have many closing tags missing. Here is my sample HTML input:

*HTML input:*

    <p class="0">A
    <p class="1"><em class="bf">ACCOUNTING BASIS</em>
    <p class="2">Taxation, <cite class="section">3.3.3
    <p class="1"><em class="bf">ACCRUAL BASIS ACCOUNTING,</em> <cite class="section">3.3.3
    <p class="1"><em class="bf">AFFILIATED SERVICES GROUPS</em>
    <p class="2">Taxation, <cite class="section">3.3.5
    <p class="1"><em class="bf">ANCILLARY SERVICES</em>
    <p class="2">Reimbursement
    <p class="3">Payment methodology
    <p class="4">Covered ancillary services, <cite class="section">5.1.2.2
    <p class="1"><em class="bf">ANESTHESIOLOGY</em>
    <p class="2">Anti-kickback statute
    <p class="3">Case law and other guidance, <cite class="section">2.4.6.4

I’ve defined the following parameters in the tidy.config file:

*Config file:*

    add-xml-decl:true
    #output-xhtml:true
    doctype:omit
    hide-comments:yes
    preserve-entities:yes
    uppercase-tags:0
    # DO NOT specify input encoding here unless it never,ever changes.
    output-encoding:utf8
    word-2000:false
    # bare: replaces nbsps with regular spaces as a side-effect
    # these nbsps are needed for clues so bare should be left false.
    bare:true
    enclose-text:yes
    numeric-entities:yes
    # clean: strips surplus tags from ms word originating docs.
    # clean consolidates similar styles and uses references to them.
    # trades document size for ease of parsing it -- leave this false.
    clean:true
    hide-comments:true
    # wrap: zero if you want to disable line wrapping
    wrap:0
    # quote-nbsp: output non-breaking space characters as entities
    quote-nbsp:false
    show-warnings:false
    #

*My output looks like this:*

    <p class="0">A</p>
    <p class="1"><em class="bf">ACCOUNTING BASIS</em></p>
    <p class="2">Taxation, <cite class="section">3.3.3</cite></p>
    <p class="1"><cite class="section"><em class="bf">ACCRUAL BASIS ACCOUNTING,</em> <cite class="section">3.3.3</cite></cite></p>
    <p class="1"><cite class="section"><em class="bf">AFFILIATED SERVICES GROUPS</em></cite></p>
    <p class="2"><cite class="section">Taxation, <cite class="section">3.3.5</cite></cite></p>
    <p class="1"><cite class="section"><em class="bf">ANCILLARY SERVICES</em></cite></p>
    <p class="2"><cite class="section">Reimbursement</cite></p>
    <p class="3"><cite class="section">Payment methodology</cite></p>
    <p class="4"><cite class="section">Covered ancillary services, <cite class="section">5.1.2.2</cite></cite></p>
    <p class="1"><cite class="section"><em class="bf">ANESTHESIOLOGY</em></cite></p>
    <p class="2"><cite class="section">Anti-kickback statute</cite></p>
    <p class="3"><cite class="section">Case law and other guidance, <cite class="section">2.4.6.4</cite></cite></p>

You can see that unwanted <cite> tags are being added to the data: every paragraph after the first unclosed <cite> is wrapped in an extra <cite class="section">.
I want the output to appear as follows:

*Required output:*

    <p class="0">A</p>
    <p class="1"><em class="bf">ACCOUNTING BASIS</em></p>
    <p class="2">Taxation, <cite class="section">3.3.3</cite></p>
    <p class="1"><em class="bf">ACCRUAL BASIS ACCOUNTING,</em> <cite class="section">3.3.3</cite></p>
    <p class="1"><em class="bf">AFFILIATED SERVICES GROUPS</em></p>
    <p class="2">Taxation, <cite class="section">3.3.5</cite></p>
    <p class="1"><em class="bf">ANCILLARY SERVICES</em></p>
    <p class="2">Reimbursement</p>
    <p class="3">Payment methodology</p>
    <p class="4">Covered ancillary services, <cite class="section">5.1.2.2</cite></p>
    <p class="1"><em class="bf">ANESTHESIOLOGY</em></p>
    <p class="2">Anti-kickback statute</p>
    <p class="3">Case law and other guidance, <cite class="section">2.4.6.4</cite></p>

Please advise what changes are needed in the config file to get the required output above.

Thanks in advance for your help!

Regards,
Nilesh Chavan.
Cell: +1 (937) 301 0575
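
Note on a possible workaround (a sketch, not a confirmed Tidy setting): the output quoted in this message suggests that each unclosed <cite> is treated as still open when the next <p> begins, so it is carried into the following paragraphs. Assuming the input never intends a <cite> to span paragraphs, one option is to insert the missing </cite> before handing the file to Tidy. The helper name `close_open_cites` below is hypothetical and is not part of Tidy; the regex tokenizer is a simplification that assumes no `>` characters inside attribute values.

```python
import re


def close_open_cites(html: str) -> str:
    """Insert a missing </cite> before each <p ...> that follows an unclosed <cite ...>.

    Assumption: a <cite> is never meant to span more than one paragraph.
    """
    out = []
    cite_open = False
    # Split into tags and text runs; the capturing group keeps the tags.
    for token in re.split(r"(<[^>]+>)", html):
        lower = token.lower()
        if cite_open and re.match(r"<p[\s>]", lower):
            # Move trailing whitespace of the preceding text outside the cite.
            trailing = ""
            if out and not out[-1].startswith("<"):
                stripped = out[-1].rstrip()
                trailing = out[-1][len(stripped):]
                out[-1] = stripped
            out.append("</cite>" + trailing)
            cite_open = False
        elif re.match(r"<cite[\s>]", lower):
            cite_open = True
        elif lower.startswith("</cite"):
            cite_open = False
        out.append(token)
    if cite_open:
        # The last cite in the input was never closed either.
        out.append("</cite>")
    return "".join(out)


if __name__ == "__main__":
    sample = (
        '<p class="2">Taxation, <cite class="section">3.3.3 '
        '<p class="1"><em class="bf">ACCRUAL BASIS ACCOUNTING,</em> '
        '<cite class="section">3.3.3'
    )
    print(close_open_cites(sample))
```

The pre-processed text can then be passed to Tidy with the existing config file, which would still be responsible for closing the <p> elements.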
Received on Tuesday, 21 June 2011 22:45:09 UTC