HTML spec part 1

Dan Connolly (connolly@pixel.convex.com)
Fri, 20 Nov 92 23:14:35 -0600


Date: Fri, 20 Nov 92 23:14:35 -0600
From: connolly@pixel.convex.com (Dan Connolly)
Message-Id: <9211210514.AA15421@pixel.convex.com>
To: www-talk@nxoc01.cern.ch
Subject: HTML spec part 1
Cc: connolly@pixel.convex.com

> THIS IS A MESSAGE IN 'MIME' FORMAT.  Your mail reader does not support MIME.
> Some parts of this will be readable as plain text.
> To see the rest, you will need to upgrade your mail reader.

--PART.BOUNDARY.25512.15400.pixel.722322875.1
Content-type: text/plain
Content-Transfer-Encoding: quoted-printable

<TITLE>HyperText Mark-up Language</TITLE>
<H1>Text and Markup</H1>
This is an explanation of SGML syntax as it applies to HTML. It is
designed to take section 7, "Element Structure", and reduce it from the
abstract system that is SGML to a concrete language, HTML.<P>
<H2><A NAME=3D"Tags">Tags</A>
</H2>
The characters in an SGML document are organized into a hierarchy of
elements by the use of tags. An HTML tag has the form<P>
<DL><DT>start tag<DD>"&lt;" name (s+ attribute)* s* ">"
<DT>end tag<DD>"&lt;/" name s* ">"
<DT>name<DD>letter (letter | digit){0,7}
<DT>For example,</DL>
<XMP><Title>Here's the title of the Document</title></XMP>
<XMP><h1 >A Heading</h1></XMP>
<XMP><a NAME=3Dfoo>This text is the content of an A element.</a></XMP>
<XMP><A HREF=3D"http://info.cern.ch/hypertext/WWW/TheProject.html"></XMP>=

<XMP>This text is a link to the WWW documentation. </a></XMP>
Element names are not case sensitive. They are restricted to eight
characters or fewer.<P>
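The name production and the case rule above can be sketched in C. This
is only an illustration: html_name_length and html_name_equal are
hypothetical helpers, not part of any existing parser, and they assume
the eight character limit of the default SGML declaration.<P>

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of any HTML library.
 * Returns the length of the name at the start of s according to
 *   name ::= letter (letter | digit){0,7}
 * or 0 if s does not begin with a valid name. */
size_t html_name_length(const char *s)
{
    size_t n = 0;

    if (!isalpha((unsigned char) s[0]))
        return 0;                   /* a name must start with a letter */
    while (isalnum((unsigned char) s[n]))
        ++n;
    return n <= 8 ? n : 0;          /* more than eight characters: reject */
}

/* Case-insensitive comparison of two names; returns 1 when they match. */
int html_name_equal(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        ++a;
        ++b;
    }
    return *a == '\0' && *b == '\0';
}
```

With these, "Title" and "TITLE" compare as the same element name, and a
ten letter name is rejected under the default NAMELEN.<P>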
<H4>Open Issue: Length of Names</H4>
If we use the default SGML declaration, names are restricted to eight cha=
racters. Some SGML parsers don't support other SGML declarations.<P>
But most do; these days, most SGML applications use a declaration with a
larger value of NAMELEN.<P>
The length of an attribute value literal is similarly limited to 240 char=
acters. This might be a problem with long URLs. I think we should change =
it.<P>
<H2>Attributes</H2>
Some elements have associated attributes. The start tag specifies the val=
ues of the attributes for an element.<P>
<DL><DT>attribute<DD>name s* "=3D" s* (token|literal)<DT>token<DD>(letter=
|digit){1,8}<DT>letter<DD>[a-zA-Z]<DT>digit<DD>[0-9]<DT>literal<DD>'"' [^=
"]{0,240} '"' | "'" [^']{0,240} "'"</DL>
Each attribute of each element has a declared type. <P>
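As a rough sketch of the attribute productions above, the following C
function recognizes one attribute at the start of a string. The function
name parse_attribute is hypothetical, and the code assumes that "s" in
the grammar means white space as tested by isspace(). It checks the
8 and 240 character limits but does not replace references inside the
value.<P>

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: parse
 *   attribute ::= name s* "=" s* (token | literal)
 * at the start of s. On success the name and the (unquoted) value are
 * copied out and the number of characters consumed is returned;
 * on failure the result is 0. */
size_t parse_attribute(const char *s, char name[9], char value[241])
{
    size_t i = 0, n, start;

    if (!isalpha((unsigned char) s[0]))
        return 0;
    while (isalnum((unsigned char) s[i]))
        ++i;
    if (i > 8)
        return 0;                       /* name too long */
    memcpy(name, s, i);
    name[i] = '\0';
    while (isspace((unsigned char) s[i]))
        ++i;
    if (s[i] != '=')
        return 0;
    ++i;
    while (isspace((unsigned char) s[i]))
        ++i;
    if (s[i] == '"' || s[i] == '\'') {  /* literal */
        char quote = s[i++];
        start = i;
        while (s[i] != '\0' && s[i] != quote)
            ++i;
        n = i - start;
        if (s[i] != quote || n > 240)
            return 0;                   /* unterminated or too long */
        memcpy(value, s + start, n);
        value[n] = '\0';
        return i + 1;                   /* include the closing quote */
    }
    start = i;                          /* token */
    while (isalnum((unsigned char) s[i]))
        ++i;
    n = i - start;
    if (n < 1 || n > 8)
        return 0;
    memcpy(value, s + start, n);
    value[n] = '\0';
    return i;
}
```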
<H4>Open Issue: Anchor Names: NMTOKEN or ID?</H4>
The names of anchors within an HTML document should be unique. We can use=
 the SGML ID mechanism to specify this.<P>
But SGML IDs are names; that is, they start with a letter. Many HTML prod=
ucers use numbers for anchor names.<P>
<H4>Open Issue: Interpretation of Literals</H4>
Section 7.9.3 of the SGML standard states<P>
<UL><LI>An attribute value literal is interpreted as an attribute value b=
y replacing references within it, ignoring Ee and RS, and replacing RE or=
 SEPCHAR with SPACE.</UL>
For the SGML-impaired, Ee is Entity End (like EOF); RS is '\n'; RE is
'\r'; SEPCHAR is '\t' and SPACE is ' '.<P>
Since to date there are no HTML attributes containing newlines or spaces,=
 that is not much of an issue.<P>
But replacement of references within literals is. For one thing, this
creates an interaction between the syntax of URLs and SGML syntax. We
could resolve this issue by removing '&amp;' from <A
HREF=3D"http://info.cern.ch/hypertext/WWW/Addressing/BNF.html#xalpha">the
URL syntax</A>.<P>
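The white space part of the rule quoted from section 7.9.3 could look
roughly like this in C. The function normalize_literal is a hypothetical
name, and the sketch deliberately leaves out reference replacement and
Ee handling, which are the parts still under discussion.<P>

```c
#include <stddef.h>

/* Hypothetical helper: normalise the white space of an attribute value
 * literal in place. RS ('\n') is dropped; RE ('\r') and SEPCHAR ('\t')
 * become SPACE. Returns the new length of the string. */
size_t normalize_literal(char *s)
{
    char *out = s;
    const char *in;

    for (in = s; *in != '\0'; ++in) {
        if (*in == '\n')
            continue;               /* RS is ignored */
        if (*in == '\r' || *in == '\t')
            *out++ = ' ';           /* RE and SEPCHAR become SPACE */
        else
            *out++ = *in;
    }
    *out = '\0';
    return (size_t) (out - s);
}
```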
<H4>Historical Note</H4>
The NeXT implementation of the WWW browser, responsible for the creation =
of much of the existing HTML, does not surround attribute literals with q=
uotes. These productions describe the syntax produced by the NeXT:<P>
<DL><DT>NXattribute<DD>name "=3D" NXliteral<DT>NXliteral<DD> [^ >]+</DL>
<H2>Normal Text: #PCDATA</H2>
The symbol #PCDATA stands for parsed character data, the normal text char=
acters in an SGML document.<P>
The text consists of a stream of lines. The division into lines has no
significance apart from indicating a word end.<P>
The following character sequences are recognized as markup in #PCDATA:<P>=

<DL><DT>&lt;[a-zA-Z]<DD>"&lt;" serves as the Start Tag Open delimiter
when followed by a letter. It is used to introduce
<A HREF=3D"#Tags">tags</A> that start elements.
<DT>&lt;/[a-zA-Z]<DD>"&lt;/" serves as the End Tag Open delimiter when
followed by a letter. It is used to introduce tags that terminate
elements.
<DT>&lt;!(--|[A-Za-z]|\[)<DD>"&lt;!" serves as the Markup Declaration
Open delimiter when followed by a letter or "--" or "[". It has several
uses in SGML. The only purpose it serves in HTML is to introduce
comments.
<DT>&amp;[a-zA-Z]<DD>"&amp;" serves as the Entity Reference Open
delimiter when followed by a letter. It is used to introduce entities,
or "macros."
<DT>&amp;#[0-9A-Za-z]<DD>"&amp;#" followed by a letter or a digit is the
Character Reference Open delimiter. SGML idioms include things like
"&amp;#168;" and "&amp;#SPACE;". It is not used in HTML.
<DT>]]><DD>"]]" when followed by ">" is Marked Section Close. While
marked sections are not used in HTML, this sequence of characters is
recognized and reported as an error by conforming SGML parsers.</DL>
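The delimiter-in-context rules listed above can be illustrated with a
small C classifier. The enum and the function classify_markup are
hypothetical names chosen for this sketch; it only looks at the first
few characters and does not parse the markup that follows.<P>

```c
#include <ctype.h>
#include <string.h>

enum markup {
    MK_NONE,    /* ordinary data character */
    MK_STAGO,   /* "<"  + letter: Start Tag Open */
    MK_ETAGO,   /* "</" + letter: End Tag Open */
    MK_MDO,     /* "<!" + letter, "--" or "[": Markup Declaration Open */
    MK_ERO,     /* "&"  + letter: Entity Reference Open */
    MK_CRO,     /* "&#" + letter or digit: Character Reference Open */
    MK_MSC      /* "]]>": Marked Section Close */
};

/* Hypothetical helper: report which delimiter, if any, starts at s. */
enum markup classify_markup(const char *s)
{
    if (s[0] == '<') {
        if (isalpha((unsigned char) s[1]))
            return MK_STAGO;
        if (s[1] == '/' && isalpha((unsigned char) s[2]))
            return MK_ETAGO;
        if (s[1] == '!' && (isalpha((unsigned char) s[2]) || s[2] == '['
                            || (s[2] == '-' && s[3] == '-')))
            return MK_MDO;
    } else if (s[0] == '&') {
        if (isalpha((unsigned char) s[1]))
            return MK_ERO;
        if (s[1] == '#' && isalnum((unsigned char) s[2]))
            return MK_CRO;
    } else if (strncmp(s, "]]>", 3) == 0) {
        return MK_MSC;
    }
    return MK_NONE;     /* e.g. "<" or "&" not followed by a letter */
}
```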
<H4>Note to HTML Producers</H4>
Note that conforming SGML parsers will treat "&amp;", "&lt;", "&lt;/", an=
d "&lt;!" as normal text characters when they are not followed by a lette=
r. HTML producers are discouraged from taking advantage of this feature.<=
P>
All occurrences of the characters '&amp;' and '&lt;' should be represente=
d by <A HREF=3D"#Entities">entities</A>
=2E The marked section close delimiter can be avoided if all occurrences =
of '>' are represented by entities.<P>
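A producer following this advice might escape text along these lines.
This is a sketch under assumptions: escape_pcdata is a hypothetical
name, and while the entity names amp and lt are the ones this document
itself uses, gt is assumed here rather than taken from the HTML DTD.<P>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src to dst, replacing '&', '<' and '>' by
 * entity references (amp, lt, and the assumed name gt). Returns the
 * length written, or (size_t) -1 if dst, of size dstsize, is too small. */
size_t escape_pcdata(const char *src, char *dst, size_t dstsize)
{
    size_t used = 0;

    for (; *src != '\0'; ++src) {
        const char *rep;
        char one[2];
        size_t len;

        switch (*src) {
        case '&': rep = "&amp;"; break;
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        default:
            one[0] = *src;          /* ordinary character: copy as is */
            one[1] = '\0';
            rep = one;
            break;
        }
        len = strlen(rep);
        if (used + len + 1 > dstsize)
            return (size_t) -1;     /* no room for text plus terminator */
        memcpy(dst + used, rep, len);
        used += len;
    }
    if (used + 1 > dstsize)
        return (size_t) -1;
    dst[used] = '\0';
    return used;
}
```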
 While the division of the stream of characters into lines is arbitrary, =
the recommended line length is 72 characters in order to allow the text t=
o be passed through systems which can only handle text with a limited lin=
e length.<P>
<H2>Literal Text: #RCDATA</H2>
The symbol #RCDATA stands for replaceable character data, the text withou=
t tags in an SGML document. It is used in HTML for sections where line br=
eaks and character widths are significant.<P>
Only the entity reference and end tag open delimiters are recognized in #=
RCDATA.<P>
Replaceable character data should be displayed in a fixed width font, so =
that any formatting done by character spacing on successive lines will be=
 maintained.<P>
The ASCII Horizontal Tab (HT) character should be interpreted as the
smallest positive number of spaces which will leave the number of
characters so far on the line as a multiple of 8. Its use is not
recommended, however.<P>
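The tab rule above amounts to a one line computation. A minimal sketch
in C, where tab_width is a hypothetical name and column counts the
characters already on the line, starting from zero:<P>

```c
/* Hypothetical helper: number of spaces that a Horizontal Tab expands
 * to when `column` characters are already on the line. The result is
 * always between 1 and 8 and brings the column to the next multiple
 * of 8. */
int tab_width(int column)
{
    return 8 - (column % 8);
}
```

A tab at the start of a line therefore expands to eight spaces, and a
tab after seven characters expands to one.<P>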
<H4>Historical Note</H4>
The original definition of literal text is not representable in SGML. Fro=
m <A HREF=3D"http://info.cern.ch/hypertext/WWW/MarkUp/Tags.html">Tags use=
d in HTML</A>
:<P>
<UL><LI>The text may contain any ISO Latin printable characters, includin=
g the tag opener, so long as it does not contain the closing tag in full.=
 </UL>
But in section 7.6 of the SGML standard:<P>
<UL><LI>The content of an element declared to be character data or replac=
eable character data is terminated only by an etago delimiter-in-context =
(which need not open a valid end-tag) ... .</UL>
This definition is a compromise: it allows most markup to be ignored,
but where the string "&lt;/" is needed, it can be represented as
"&amp;lt;/". We will probably end up with some systems that display
"&amp;lt;/" rather than "&lt;/".<P>
<H2><A NAME=3D"Entities">Entities</A>
</H2>
In order to include characters that would otherwise be treated as markup,=
 SGML entity references refer to arbitrary sequences of characters. An HT=
ML entity reference has the form:<P>
<DL><DT>entity reference<DD>"&amp;" name ";"<DT>Entity names are case sen=
sitive.</DL>
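To illustrate the entity reference production, here is a small C sketch.
The function parse_entity_ref is a hypothetical name, and the code
assumes that entity names follow the same name production as tags,
including the eight character limit. The name is copied with its case
preserved, since entity names are case sensitive.<P>

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if s begins with an entity reference of the form
 *   "&" name ";"
 * copy the name (case preserved) into name[] and return the number of
 * characters the reference occupies; otherwise return 0. name[] must
 * hold at least 9 bytes: eight name characters plus the terminator. */
size_t parse_entity_ref(const char *s, char name[9])
{
    size_t n = 0;

    if (s[0] != '&' || !isalpha((unsigned char) s[1]))
        return 0;
    while (isalnum((unsigned char) s[1 + n]))
        ++n;
    if (n > 8 || s[1 + n] != ';')
        return 0;               /* too long, or no closing semicolon */
    memcpy(name, s + 1, n);
    name[n] = '\0';
    return n + 2;               /* '&' + name + ';' */
}
```

A client would then look the name up with an exact comparison such as
strcmp(), not a case-insensitive one.<P>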
<H4>Open Issue: Character Set</H4>
The default SGML declaration specifies ISO 646-1983 as the character
set. I believe it's straightforward to specify ISO Latin 1 in the SGML
declaration for HTML, but it's not clear that this is a good idea.<P>
The SGML standard includes a set of entities for ISO Latin 1 characters
as public text. For example, &amp;AElig; is the AE ligature. If we
include these entities in the HTML DTD, we could support Latin 1
characters while maintaining a 7 bit language. This would require a
table of the entity names in WWW clients.<P>


--PART.BOUNDARY.25512.15400.pixel.722322875.1
Content-type: text/plain

/*
Copyright (c) 1991 Bell Communications Research, Inc. (Bellcore)

Permission to use, copy, modify, and distribute this material 
for any purpose and without fee is hereby granted, provided 
that the above copyright notice and this permission notice 
appear in all copies, and that the name of Bellcore not be 
used in advertising or publicity pertaining to this 
material without the specific, prior written permission 
of an authorized representative of Bellcore.  BELLCORE 
MAKES NO REPRESENTATIONS ABOUT THE ACCURACY OR SUITABILITY 
OF THIS MATERIAL FOR ANY PURPOSE.  IT IS PROVIDED "AS IS", 
WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES.
*/
#include <stdio.h>
#include <stdlib.h>  /* exit */

#define BASE64 1
#define QP 2 /* quoted-printable */

main(argc, argv)
int argc;
char **argv;
{
    int encode = 1, which = BASE64, i;
    FILE *fp = stdin;

    for (i=1; i<argc; ++i) {
        if (argv[i][0] == '-') {
            switch (argv[i][1]) {
                case 'u':
                    encode = 0;
                    break;
                case 'q':
                    which = QP;
                    break;
                case 'b':
                    which = BASE64;
                    break;
                default:
                    fprintf(stderr,
                       "Usage: mmencode [-u] [-q] [-b] [file name]\n");
                    exit(-1);
            }
        } else {
            fp = fopen(argv[i], "r");
            if (!fp) {
                perror(argv[i]);
                exit(-1);
            }
        }
    }
    if (which == BASE64) {
        if (encode) to64(fp, stdout); else from64(fp, stdout, NULL, 0);
    } else {
        if (encode) toqp(fp, stdout); else fromqp(fp, stdout, NULL, 0);
    }
    return(0);
}


--PART.BOUNDARY.25512.15400.pixel.722322875.1
Content-type: text/plain

/*
Copyright (c) 1991 Bell Communications Research, Inc. (Bellcore)

Permission to use, copy, modify, and distribute this material 
for any purpose and without fee is hereby granted, provided 
that the above copyright notice and this permission notice 
appear in all copies, and that the name of Bellcore not be 
used in advertising or publicity pertaining to this 
material without the specific, prior written permission 
of an authorized representative of Bellcore.  BELLCORE 
MAKES NO REPRESENTATIONS ABOUT THE ACCURACY OR SUITABILITY 
OF THIS MATERIAL FOR ANY PURPOSE.  IT IS PROVIDED "AS IS", 
WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES.
*/
#include <stdio.h>
#include <ctype.h>
#include <string.h>  /* strncmp, strcmp, strcpy, strcat, strlen */
#include <config.h>

extern char *index();
static char basis_64[] =
   "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

static char index_64[128] = {
    -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
    -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
    -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,62, -1,-1,-1,63,
    52,53,54,55, 56,57,58,59, 60,61,-1,-1, -1,-1,-1,-1,
    -1, 0, 1, 2,  3, 4, 5, 6,  7, 8, 9,10, 11,12,13,14,
    15,16,17,18, 19,20,21,22, 23,24,25,-1, -1,-1,-1,-1,
    -1,26,27,28, 29,30,31,32, 33,34,35,36, 37,38,39,40,
    41,42,43,44, 45,46,47,48, 49,50,51,-1, -1,-1,-1,-1
};

#define char64(c)  (((c) < 0 || (c) > 127) ? -1 : index_64[(c)])

/*
char64(c)
char c;
{
    char *s = (char *) index(basis_64, c);
    if (s) return(s-basis_64);
    return(-1);
}
*/

to64(infile, outfile) 
FILE *infile, *outfile;
{
    int c1, c2, c3, ct=0;
    while ((c1 = getc(infile)) != EOF) {
        c2 = getc(infile);
        if (c2 == EOF) {
            output64chunk(c1, 0, 0, 2, outfile);
        } else {
            c3 = getc(infile);
            if (c3 == EOF) {
                output64chunk(c1, c2, 0, 1, outfile);
            } else {
                output64chunk(c1, c2, c3, 0, outfile);
            }
        }
        ct += 4;
        if (ct > 71) {
            putc('\n', outfile);
            ct = 0;
        }
    }
    if (ct) putc('\n', outfile);
    fflush(outfile);
}

output64chunk(c1, c2, c3, pads, outfile)
FILE *outfile;
{
    putc(basis_64[c1>>2], outfile);
    putc(basis_64[((c1 & 0x3)<< 4) | ((c2 & 0xF0) >> 4)], outfile);
    if (pads == 2) {
        putc('=', outfile);
        putc('=', outfile);
    } else if (pads) {
        putc(basis_64[((c2 & 0xF) << 2) | ((c3 & 0xC0) >>6)], outfile);
        putc('=', outfile);
    } else {
        putc(basis_64[((c2 & 0xF) << 2) | ((c3 & 0xC0) >>6)], outfile);
        putc(basis_64[c3 & 0x3F], outfile);
    }
}

PendingBoundary(s, Boundaries, BoundaryCt)
char *s;
char **Boundaries;
int *BoundaryCt;
{
    int i;

    for (i=0; i < *BoundaryCt; ++i) {
        if (!strncmp(s, Boundaries[i], strlen(Boundaries[i]))) {
            char Buf[2000];
            strcpy(Buf, Boundaries[i]);
            strcat(Buf, "--\n");
            if (!strcmp(Buf, s)) *BoundaryCt = i;
            return(1);
        }
    }
    return(0);
}

from64(infile, outfile, boundaries, boundaryct) 
FILE *infile, *outfile;
char **boundaries;
int *boundaryct;
{
    int c1, c2, c3, c4;
    int newline = 1, DataDone = 0;

    while ((c1 = getc(infile)) != EOF) {
        if (isspace(c1)) {
            if (c1 == '\n') {
                newline = 1;
            } else {
                newline = 0;
            }
            continue;
        }
        if (newline && boundaries && c1 == '-') {
            char Buf[200];
            /* a dash is NOT base 64, so all bets are off if NOT a boundary */
            ungetc(c1, infile);
            fgets(Buf, sizeof(Buf), infile);
            if (boundaries
                 && (Buf[0] == '-')
                 && (Buf[1] == '-')
                 && PendingBoundary(Buf, boundaries, boundaryct)) {
                return;
            }
            fprintf(stderr, "Ignoring unrecognized boundary line: %s\n", Buf);
            continue;
        }
        if (DataDone) continue;
        newline = 0;
        do {
            c2 = getc(infile);
        } while (c2 != EOF && isspace(c2));
        do {
            c3 = getc(infile);
        } while (c3 != EOF && isspace(c3));
        do {
            c4 = getc(infile);
        } while (c4 != EOF && isspace(c4));
        if (c2 == EOF || c3 == EOF || c4 == EOF) {
            fprintf(stderr, "Premature EOF!\n");
            return;
        }
        if (c1 == '=' || c2 == '=') {
            DataDone=1;
            continue;
        }
        c1 = char64(c1);
        c2 = char64(c2);
        putc(((c1<<2) | ((c2&0x30)>>4)), outfile);
        if (c3 == '=') {
            DataDone = 1;
        } else {
            c3 = char64(c3);
            putc((((c2&0XF) << 4) | ((c3&0x3C) >> 2)), outfile);
            if (c4 == '=') {
                DataDone = 1;
            } else {
                c4 = char64(c4);
                putc((((c3&0x03) <<6) | c4), outfile);
            }
        }
    }
}

static char basis_hex[] = "0123456789ABCDEF";
static char index_hex[128] = {
    -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
    -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
    -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
     0, 1, 2, 3,  4, 5, 6, 7,  8, 9,-1,-1, -1,-1,-1,-1,
    -1,10,11,12, 13,14,15,-1, -1,-1,-1,-1, -1,-1,-1,-1,
    -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
    -1,10,11,12, 13,14,15,-1, -1,-1,-1,-1, -1,-1,-1,-1,
    -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1
};

#define hexchar(c)  (((c) < 0 || (c) > 127) ? -1 : index_hex[(c)])

/*
hexchar(c)
char c;
{
    char *s;
    if (islower(c)) c = toupper(c);
    s = (char *) index(basis_hex, c);
    if (s) return(s-basis_hex);
    return(-1);
}
*/

toqp(infile, outfile) 
FILE *infile, *outfile;
{
    int c, ct=0, prevc=255;
    while ((c = getc(infile)) != EOF) {
        if ((c < 32 && (c != '\n' && c != '\t'))
             || (c == '=')
             || (c >= 127)
             /* Following line is to avoid single periods alone on lines,
               which messes up some dumb smtp implementations, sigh... */
             || (ct == 0 && c == '.')) {
            putc('=', outfile);
            putc(basis_hex[c>>4], outfile);
            putc(basis_hex[c&0xF], outfile);
            ct += 3;
            prevc = 'A'; /* close enough */
        } else if (c == '\n') {
            if (prevc == ' ' || prevc == '\t') {
                putc('=', outfile); /* soft & hard lines */
                putc(c, outfile);
            }
            putc(c, outfile);
            ct = 0;
            prevc = c;
        } else {
            putc(c, outfile);
            ++ct;
            prevc = c;
        }
        if (ct > 72) {
            putc('=', outfile);
            putc('\n', outfile);
            ct = 0;
            prevc = '\n';
        }
    }
    if (ct) {
        putc('=', outfile);
        putc('\n', outfile);
    }
}

fromqp(infile, outfile, boundaries, boundaryct) 
FILE *infile, *outfile;
char **boundaries;
int *boundaryct;
{
    int c1, c2, sawnewline = 1;

    while ((c1 = getc(infile)) != EOF) {
        if (sawnewline && boundaries && (c1 == '-')) {
            char Buf[200], *s;

            ungetc(c1, infile);
            fgets(Buf, sizeof(Buf), infile);
            if (boundaries
                 && (Buf[0] == '-')
                 && (Buf[1] == '-')
                 && PendingBoundary(Buf, boundaries, boundaryct)) {
                return;
            }
            /* Not a boundary, now we must treat THIS line as q-p, sigh */
            for (s=Buf; *s; ++s) {
                if (*s == '=') {
                    if (!*++s) break;
                    if (*s == '\n') {
                        /* ignore it */
                        sawnewline = 1;
                    } else {
                        c1 = hexchar(*s);
                        if (!*++s) break;
                        c2 = hexchar(*s);
                        putc(c1<<4 | c2, outfile);
                    }
                } else {
                    putc(*s, outfile);
                }
            }
        } else {
            sawnewline = (c1 == '\n') ? 1 : 0;
            if (c1 == '=') {
                c1 = getc(infile);
                if (c1 == '\n') {
                    /* ignore it */
                    sawnewline = 1;
                } else {
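                    c2 = getc(infile);
                    /* test the raw character: after hexchar() the value
                       10 would mean the hex digit 'A', not a newline */
                    if (c2 == '\n') sawnewline = 1;
                    c1 = hexchar(c1);
                    c2 = hexchar(c2);
                    putc(c1<<4 | c2, outfile);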
                }
            } else {
                putc(c1, outfile);
            }
        }
    }
}

--PART.BOUNDARY.25512.15400.pixel.722322875.1--