- From: <www-html-request@w3.org>
- Date: Mon, 9 Nov 1998 09:24:00 -0500 (EST)
Message-ID: <3641F391.6032@w3.org>
Date: Thu, 05 Nov 1998 12:50:57 -0600
From: Dan Connolly <connolly@w3.org>
Organization: World Wide Web Consortium (http://www.w3.org/)
To: dsr@w3.org, www-html@w3.org
References: <3641D9EC.EB0@w3.org>
Subject: Re: tidy -asxml fix
X-Mailing-List: <www-html@w3.org> archive/latest/736

Dan Connolly wrote:
>
> The tidy[1] -asxml feature is a pretty cool idea,
> but it's broken in the 1Sep release[2].
>
> [1] http://www.w3.org/People/Raggett/tidy/
> [2] http://www.w3.org/People/Raggett/tidy01sep98.tgz
[...]
> Also, the XML declaration should be
> -- nothing if the encoding is UTF-8 (or US-ASCII) or UTF-16
> -- <?xml encoding="iso-8859-1" version="1.0">
>    if the tidy output is -latin1
> and similar for -iso2022, but I don't know the
> details.
>
> So FixDocType should take another argument for the encoding.
> I haven't hacked that up yet, but it should be easy.

OK... done. patch attached. (turns out the encoding is
a global variable, so I didn't have to add an argument.)

There are some limitations:

+    AddStringLiteral(lexer, "xml version=\"1.0\" encoding=\"");
+    /* @@FIXME: ISO2022 isn't any one character set
+       in the sense of
+       http://www.isi.edu/in-notes/iana/assignments/character-sets
+
+       And if it's raw, we really don't know
+    */
+    AddStringLiteral(lexer, CharEncoding==LATIN1 ? "iso-8859-1" : "???");
+
+    AddStringLiteral(lexer, "\"");

The patch also adds -ansi to the gcc invocation; I got some warnings
about redeclaration of uint in platform.h on linux, and this
fixed it. But it made the // style comments generate errors,
so I supplemented them with #if 0/#endif.

--
Dan Connolly
http://www.w3.org/People/Connolly/
phone:+1-512-310-2971 (office, mobile)

[Attachment: ",patch" (text/plain; charset=us-ascii)]

--- lexer.c	1998/11/05 18:44:46	1.1
+++ lexer.c	1998/11/05 17:57:19
@@ -34,6 +34,8 @@
 #include "platform.h"
 #include "html.h"
 
+extern int CharEncoding; /* from tidy.c */
+
 AttVal *ParseAttrs(Lexer *lexer, bool *isempty);  /* forward references */
 void CheckAttributes(Lexer *lexer, Node *node);
 Node *CommentToken(Lexer *lexer);
@@ -714,7 +716,12 @@
     {
         s = &lexer->lexbuf[root->content->start];
 
-        if (s[0] == 'X' && s[1] == 'M' && s[2] == 'L')
+        if (s[0] == 'x' && s[1] == 'm' && s[2] == 'l')
+            return true;
+    }
+
+    if( CharEncoding == ASCII ||
+        CharEncoding == UTF8 ){
             return true;
     }
 
@@ -728,7 +735,16 @@
     root->content = xml;
 
     lexer->txtstart = lexer->txtend = lexer->lexsize;
-    AddStringLiteral(lexer, "XML version=\"1.0\"");
+    AddStringLiteral(lexer, "xml version=\"1.0\" encoding=\"");
+    /* @@FIXME: ISO2022 isn't any one character set
+       in the sense of
+       http://www.isi.edu/in-notes/iana/assignments/character-sets
+
+       And if it's raw, we really don't know
+    */
+    AddStringLiteral(lexer, CharEncoding==LATIN1 ? "iso-8859-1" : "???");
+
+    AddStringLiteral(lexer, "\"");
     lexer->txtend = lexer->lexsize;
 
     xml->start = lexer->txtstart;
--- pprint.c	1998/11/05 18:44:09	1.1
+++ pprint.c	1998/11/05 18:42:50
@@ -394,7 +394,7 @@
         {
             if (c > 127 && CharEncoding == ASCII)
             {
-                sprintf(entity, "&#x%x;", c);
+                sprintf(entity, "&#%d;", c);
 
                 for (p = entity; *p; ++p)
                     AddC(*p, linelen++);
@@ -407,7 +407,8 @@
     /* default treatment for ASCII */
     if (c > 126 || (c < ' ' && c != '\t'))
     {
-        if ((p = EntityName(c)) != null)
+        if (((p = EntityName(c)) != null)
+            && XmlOut == false) /* don't use named entities in XML */
             sprintf(entity, "&%s;", p);
         else if (c > 255)
             sprintf(entity, "&#x%x;", c);
@@ -503,8 +504,10 @@
         if (c == '\n')
         {
             PFlushLine(fout, indent);
+#if 0
             //indent = 0; /* kludge */
             //InAttVal = true;
+#endif
             continue;
         }
 
--- Makefile	1998/11/05 18:44:46	1.1
+++ Makefile	1998/11/05 17:56:31
@@ -1,6 +1,6 @@
 # Makefile - for tidy
 
-CC= gcc
+CC= gcc -ansi
 
 CFLAGS= -O
 
Received on Monday, 9 November 1998 09:24:44 UTC
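
The XML declaration rule discussed in the message (no declaration for UTF-8 or US-ASCII output, an explicit encoding label for Latin-1, and no settled answer for ISO 2022 or raw output) can be sketched in C roughly as below. This is only an illustration, not tidy's code: the `enum encoding` constants and the functions `xml_encoding_label` and `format_xml_decl` are hypothetical names invented here, and the sketch reports unknown encodings as an error where the patch writes "???". It also assumes a C99 `snprintf`, and it writes the declaration in the form XML 1.0 requires, with `version` first and a closing `?>`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for tidy's encoding constants; the real
   definitions live in tidy's own sources and may differ. */
enum encoding { ENC_RAW, ENC_ASCII, ENC_LATIN1, ENC_UTF8, ENC_ISO2022 };

/* Return the charset label for the XML declaration, "" when no
   declaration is needed, or NULL when no single label is known. */
const char *xml_encoding_label(enum encoding enc)
{
    switch (enc) {
    case ENC_ASCII:
    case ENC_UTF8:
        return "";            /* covered by XML's default encoding */
    case ENC_LATIN1:
        return "iso-8859-1";
    default:
        return NULL;          /* raw and ISO 2022: left open in the patch too */
    }
}

/* Write the XML declaration for enc into buf (which may end up empty).
   Returns 0 on success, -1 if the encoding has no known label or the
   buffer is too small; buf is left empty on failure. */
int format_xml_decl(enum encoding enc, char *buf, size_t size)
{
    const char *label = xml_encoding_label(enc);
    int n;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';

    if (label == NULL)
        return -1;
    if (label[0] == '\0')
        return 0;             /* nothing to emit */

    n = snprintf(buf, size,
                 "<?xml version=\"1.0\" encoding=\"%s\"?>", label);
    if (n < 0 || (size_t)n >= size) {
        buf[0] = '\0';
        return -1;
    }
    return 0;
}
```

Calling `format_xml_decl(ENC_LATIN1, buf, sizeof buf)` with a 64-byte buffer fills it with the Latin-1 declaration, while `ENC_UTF8` and `ENC_ASCII` leave the buffer empty. Returning an error for the unresolved encodings makes the open FIXME visible to the caller instead of emitting a placeholder label into the output.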