
Re: tidy -asxml fix

From: Dan Connolly <connolly@w3.org>
Date: Thu, 05 Nov 1998 12:50:57 -0600
Message-ID: <3641F391.6032@w3.org>
To: dsr@w3.org, www-html@w3.org

Dan Connolly wrote:
> 
> The tidy[1] -asxml feature is a pretty cool idea,
> but it's broken in the 1Sep release[2].
> 
> [1] http://www.w3.org/People/Raggett/tidy/
> [2] http://www.w3.org/People/Raggett/tidy01sep98.tgz
[...]
> Also, the XML declaration should be
>         -- nothing if the encoding is UTF-8 (or US-ASCII) or UTF-16
>         -- <?xml encoding="iso-8859-1" version="1.0">
>                 if the tidy output is -latin1
>                 and similar for -iso2022, but I don't know the
>                 details.
> 
> So FixDocType should take another argument for the encoding.
> I haven't hacked that up yet, but it should be easy.

OK... done. Patch attached. (Turns out the encoding is
a global variable, so I didn't have to add an argument.)

There are some limitations:

+       AddStringLiteral(lexer, "xml version=\"1.0\" encoding=\"");
+       /* @@FIXME: ISO2022 isn't any one character set
+          in the sense of
+          http://www.isi.edu/in-notes/iana/assignments/character-sets
+
+          And if it's raw, we really don't know
+       */
+       AddStringLiteral(lexer, CharEncoding==LATIN1 ? "iso-8859-1" : "???");
+
+       AddStringLiteral(lexer, "\"");
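
One way to avoid emitting the "???" placeholder is to report the
unknown case to the caller instead. The following is only a sketch
under stated assumptions: the enum values, xml_encoding_name, and
format_xml_decl are hypothetical names made up for illustration and
are not part of tidy, and the sketch builds the whole declaration
string including the <? and ?> delimiters rather than going through
AddStringLiteral.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical encoding codes for this sketch; not tidy's own constants. */
enum sketch_encoding { ENC_RAW, ENC_ASCII, ENC_LATIN1, ENC_UTF8, ENC_ISO2022 };

/* Return the charset name for encoding="...", or NULL when this sketch
   knows no single name for it (raw output, ISO 2022). */
static const char *xml_encoding_name(enum sketch_encoding enc)
{
    switch (enc) {
    case ENC_LATIN1: return "iso-8859-1";
    default:         return NULL;
    }
}

/* Build the XML declaration in buf.
   Returns  1 if a declaration was written,
            0 if none is needed (ASCII or UTF-8; buf is left empty),
           -1 if the encoding has no known name (buf is left empty),
           -2 if buf is too small (buf is left empty). */
static int format_xml_decl(char *buf, size_t size, enum sketch_encoding enc)
{
    static const char prefix[] = "<?xml version=\"1.0\" encoding=\"";
    static const char suffix[] = "\"?>";
    const char *name;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';

    if (enc == ENC_ASCII || enc == ENC_UTF8)
        return 0;

    name = xml_encoding_name(enc);
    if (name == NULL)
        return -1;

    /* +1 for the terminating NUL written by sprintf */
    if (strlen(prefix) + strlen(name) + strlen(suffix) + 1 > size)
        return -2;

    sprintf(buf, "%s%s%s", prefix, name, suffix);
    return 1;
}
```

With a 64-byte buffer, ENC_LATIN1 yields
<?xml version="1.0" encoding="iso-8859-1"?> and a return value of 1,
while ENC_ISO2022 and ENC_RAW return -1 so the caller can decide what
to do. The sketch sticks to sprintf and /* */ comments so that it
still compiles with gcc -ansi.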


The patch also adds -ansi to the gcc invocation;
I got some warnings about redeclaration of uint in platform.h
on linux, and this fixed it. But it made the // style comments
generate errors, so I wrapped them in #if 0/#endif.
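
For illustration, here is a minimal sketch of that workaround. The
function name next_indent is hypothetical; only the two commented-out
statements come from pprint.c. Changing the // lines into one
/* ... */ comment would not work directly, because the inner
/* kludge */ would close the outer comment early, whereas a skipped
#if 0 group is never compiled.

```c
/* Hypothetical stand-in for the code around the newline handling in
   pprint.c; it compiles under gcc -ansi because the // lines sit in
   a group the preprocessor skips. */
static int next_indent(int indent)
{
#if 0
    //indent = 0;  /* kludge */
    //InAttVal = true;
#endif
    return indent;
}
```

Since the disabled statements are skipped, next_indent returns its
argument unchanged.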


-- 
Dan Connolly
http://www.w3.org/People/Connolly/
phone:+1-512-310-2971 (office, mobile)

--- lexer.c	1998/11/05 18:44:46	1.1
+++ lexer.c	1998/11/05 17:57:19
@@ -34,6 +34,8 @@
 #include "platform.h"
 #include "html.h"
 
+extern int CharEncoding; /* from tidy.c */
+
 AttVal *ParseAttrs(Lexer *lexer, bool *isempty);  /* forward references */
 void CheckAttributes(Lexer *lexer, Node *node);
 Node *CommentToken(Lexer *lexer);
@@ -714,7 +716,12 @@
 	{
 		s = &lexer->lexbuf[root->content->start];
 
-		if (s[0] == 'X' && s[1] == 'M' && s[2] == 'L')
+		if (s[0] == 'x' && s[1] == 'm' && s[2] == 'l')
+			return true;
+	}
+
+	if( CharEncoding == ASCII ||
+	    CharEncoding == UTF8 ){
 			return true;
 	}
 
@@ -728,7 +735,16 @@
 	root->content = xml;
 
     lexer->txtstart = lexer->txtend = lexer->lexsize;
-	AddStringLiteral(lexer, "XML version=\"1.0\"");
+	AddStringLiteral(lexer, "xml version=\"1.0\" encoding=\"");
+	/* @@FIXME: ISO2022 isn't any one character set
+	   in the sense of
+	   http://www.isi.edu/in-notes/iana/assignments/character-sets
+
+	   And if it's raw, we really don't know
+	*/
+	AddStringLiteral(lexer, CharEncoding==LATIN1 ? "iso-8859-1" : "???");
+
+	AddStringLiteral(lexer, "\"");
     lexer->txtend = lexer->lexsize;
 
     xml->start = lexer->txtstart;
--- pprint.c	1998/11/05 18:44:09	1.1
+++ pprint.c	1998/11/05 18:42:50
@@ -394,7 +394,7 @@
     {
         if (c > 127 && CharEncoding == ASCII)
         {
-            sprintf(entity, "&#x%x;", c);
+            sprintf(entity, "&#%d;", c);
 
             for (p = entity; *p; ++p)
 				AddC(*p, linelen++);
@@ -407,7 +407,8 @@
 	/* default treatment for ASCII */
 	if (c > 126 || (c < ' ' && c != '\t'))
 	{
-		if ((p = EntityName(c)) != null)
+		if (((p = EntityName(c)) != null)
+		    && XmlOut == false) /* don't use named entities in XML */
 			sprintf(entity, "&%s;", p);
 		else if (c > 255)
 			sprintf(entity, "&#x%x;", c);
@@ -503,8 +504,10 @@
 		    if (c == '\n')
 		    {
 			    PFlushLine(fout, indent);
+#if 0
 				//indent = 0;  /* kludge */
 				//InAttVal = true;
+#endif
 			    continue;
 		    }
 
--- Makefile	1998/11/05 18:44:46	1.1
+++ Makefile	1998/11/05 17:56:31
@@ -1,6 +1,6 @@
 # Makefile - for tidy
 
-CC= gcc
+CC= gcc -ansi
 
 CFLAGS= -O
 
Received on Thursday, 5 November 1998 13:50:10 GMT
