SGML features applied to HTML

SGML contains many features that allow a user to mark up a document in any editor in a simple and natural manner. I will describe how I made use of these features when preparing my assignment in mathematical logic.

The extra features that I add to HTML usually require modifying the declaration. For an HTML file, the declaration must appear as:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">

I add my modifications to this declaration, creating a declaration of the form:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd" [
...
Added declarations go here
...
]>

Simple SGML features

The simplest kind of shorthand I take advantage of is short tags (the SHORTTAG feature). This is actually legal HTML, and does not require modification of the declaration. Bits of text can be marked up using the slant (‘/’) character. For example, ‘Here is some <EM/emphasized text/.’ Other shorthand that can be used includes omitting the identifier in the end tag, so ‘Here is some <EM>emphasized text</>.’ is also acceptable. I make great use of this when marking up headings.
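As an illustration (the heading text is simply one taken from this document), the following three lines should be equivalent ways of marking up the same level 2 heading. The first is the fully spelled-out form, the second uses the empty end tag, and the third uses the slant form.

```sgml
<H2>Simple SGML features</H2>
<H2>Simple SGML features</>
<H2/Simple SGML features/
```

The slant form ends at the first ‘/’ in the content, so it is not suitable for text that itself contains a slant.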

Another legal HTML construct I use is marked sections. Marked sections are useful for removing large parts of the source by “commenting them out”. Begin the section to remove with ‘<![IGNORE[’, and end the section with ‘]]>’. Remember that marked sections cannot be nested, and cannot include ‘]]>’. Other marked section keywords include TEMP, INCLUDE, CDATA, and RCDATA. A CDATA marked section indicates that there is no mark-up in the section, so strings like ‘<EM>’ are taken literally. I use CDATA sections in the creation of this document so that I can type HTML examples without having to escape all the tags. RCDATA sections are similar to CDATA sections; however, entity references are still recognized and expanded, while tags are still taken literally.
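A minimal sketch of both uses follows; the paragraph text is made up for illustration. The first marked section is skipped entirely by the parser, and the second lets a tag appear as ordinary text.

```sgml
<![IGNORE[
<P>This paragraph is a draft and is skipped by the parser.</P>
]]>

<P>Emphasis is written as <![CDATA[<EM>like this</EM>]]>.</P>
```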

Almost all the documents I create use the same string for the title of the HTML file and for the level 1 heading (H1). Duplicating the string is error prone, so I often make a title entity for the document. For this document I would add ‘<!ENTITY title "SGML features applied to HTML">’ to the declaration. I can then use this entity for the TITLE and H1 elements as follows:

<TITLE/&title;/
<H1/&title;/

This requires modifying the declaration, so the resulting file is no longer legal HTML.

Not all the characters I used in my logic assignment are available as standard entities in HTML. For example, the double turnstile character (⊧) is character U+22A7 in Unicode. It is useful to add an entity for this character to the declaration.

<!ENTITY models CDATA "&#x22A7;">
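With that declaration in place, the character can be referred to by name in the document body. The line below is only an illustrative sketch; the variable names M and A are made up.

```sgml
<P>The structure satisfies the formula: <VAR>M</VAR> &models; <VAR>A</VAR>.</P>
```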

Many sites have pages that reuse the same headers and footers. I put the mark-up for my headers and footers in a separate file called sharedHtml.ent. I can include this entity right into my declaration as follows:

<!ENTITY % shared-html SYSTEM "sharedHtml.ent">
%shared-html;

The sharedHtml.ent file has definitions of entities for headers and footers. This allows me to use the &header; and &footer; entities in my HTML documents. If I want to change the headers and footers for my entire site, all I need to do is modify the sharedHtml.ent file.
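For illustration, a sharedHtml.ent file might look something like the sketch below. The entity names header and footer come from the text above, but the replacement text shown here is hypothetical and is not the content of my actual file.

```sgml
<!-- sharedHtml.ent: hypothetical contents, for illustration only -->
<!ENTITY header "<P><A HREF='index.html'>Home</A></P>">
<!ENTITY footer "<ADDRESS>Russell O'Connor</ADDRESS>">
```

A page body would then refer to them with &header; near the top and &footer; near the bottom.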

A more robust way of referring to an entity file is with an FPI (Formal Public Identifier). Instead of adding a system entity to the HTML, add a public entity.

<!ENTITY % shared-html PUBLIC "-//Russell O'Connor//ENTITIES Shared HTML//EN">
%shared-html;

The FPI will be resolved by adding the following to your catalog file.

PUBLIC "-//Russell O'Connor//ENTITIES Shared HTML//EN" sharedHtml.ent

Using only the features that do not modify the declaration of HTML produces legal HTML, but there are very few browsers that correctly handle HTML that uses such shorthand. If the HTML declaration is modified, then the resulting file is not even legal HTML. The solution to both these problems is to use a program called sgmlnorm. sgmlnorm is part of James Clark’s sp package and produces a normal form of an SGML file. It will expand all the shorthand used above into a canonically marked-up document. To use this program, execute the following command.

sgmlnorm -d -chtml-catalog-file source-file > destination-file

sgmlnorm will need a catalog file, and the DTD and entity files, all of which are available from the W3C. The -d option tells sgmlnorm to write the DOCTYPE in the output.
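As a rough sketch, a catalog file for this setup could look like the following. The file names are assumptions about where local copies of the W3C files are kept, and the three entity-set identifiers should be checked against the ones actually referenced in strict.dtd; only the DTD and Shared HTML identifiers are taken from this document.

```text
-- catalog sketch; file names are assumed local copies --
OVERRIDE YES
SGMLDECL "HTML4.decl"
PUBLIC "-//W3C//DTD HTML 4.01//EN" "strict.dtd"
PUBLIC "-//W3C//ENTITIES Latin1//EN//HTML" "HTMLlat1.ent"
PUBLIC "-//W3C//ENTITIES Symbols//EN//HTML" "HTMLsymbol.ent"
PUBLIC "-//W3C//ENTITIES Special//EN//HTML" "HTMLspecial.ent"
PUBLIC "-//Russell O'Connor//ENTITIES Shared HTML//EN" "sharedHtml.ent"
```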

More Advanced SGML features

In writing up a mathematical assignment, superscripts and subscripts are often used. For the variable x₀, writing x<SUB/0/ repeatedly is a bit tedious. One option is to use entities for variables. Another option is to use the short reference (SHORTREF) feature described in SGML and HTML Explained. This allows subscripts and superscripts to be delimited by the ‘_’ and ‘^’ characters, so x₀ can be written as x_0_.

Instead of restricting the use of these delimiters to the MATH element, I wanted to use them in all block level structures. Here the declaration needs access to the %block; entity of the HTML DTD. The usual method of extending a document declaration would not give access to internal HTML entities, so more direct access to the HTML DTD is needed. To access it I used the following declaration.

<!DOCTYPE HTML [
<!ENTITY BeginSup STARTTAG "SUP">
<!ENTITY BeginSub STARTTAG "SUB">
<!ENTITY EndSup ENDTAG "SUP">
<!ENTITY EndSub ENDTAG "SUB">
<!SHORTREF MapDefault "^" BeginSup
                      "_" BeginSub>
<!SHORTREF MapSup     "^" EndSup
                      "_" BeginSub>
<!SHORTREF MapSub     "^" BeginSup
                      "_" EndSub>
...
Other declarations go here
...

<!ENTITY % HtmlDtd PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
%HtmlDtd;

<!USEMAP MapDefault (%block;)>
<!USEMAP MapSup SUP>
<!USEMAP MapSub SUB>
]>

The HTML DTD is no longer specified in the top line of the declaration. Instead an external entity called %HtmlDtd; is declared, and instantiated to make reference to the DTD. Code after the DTD is instantiated can refer to entities inside the DTD such as %block;. A drawback to using the HTML DTD this way is that the -d option of sgmlnorm no longer works.
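To illustrate how the three maps interact, here is a made-up sentence written with the delimiters, followed by the fully tagged form I would expect it to be equivalent to. Inside P the default map is active, so ‘_’ opens a SUB; inside SUB the next ‘_’ closes it, and likewise for ‘^’ and SUP.

```sgml
<!-- source, using the short references -->
<P>The sequence x_0_, x_1_, x_2_ is bounded by y^2^.</P>

<!-- equivalent fully tagged form -->
<P>The sequence x<SUB>0</SUB>, x<SUB>1</SUB>, x<SUB>2</SUB> is bounded by y<SUP>2</SUP>.</P>
```

One consequence of this choice is that a literal underscore or caret in the content of a block level element will be treated as mark-up.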

I did not want line-breaks within the mathematical formulas I was writing. One method to achieve this is to use the non-breaking space entity (&nbsp;) everywhere I wanted a space. This process is tedious. Instead I added an NBSP element to HTML. This is simply done by adding it to the list of special inline elements and declaring an NBSP element.

...
<!ENTITY % special
   "A | IMG | APPLET | OBJECT | BR | SCRIPT |
    MAP | Q | SUB | SUP | SPAN | BDO | IFRAME | NBSP">
...
<!ENTITY % HtmlDtd PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
%HtmlDtd;
...
<!ELEMENT NBSP - - (%inline;)*>
<!ATTLIST NBSP
  %attrs;
>
...

The definition of the NBSP element must occur after the HTML DTD is instantiated since it refers to entities defined in the HTML DTD. The same short reference trick used for the SUP and SUB elements can be used to delimit NBSP sections. I chose the tilde character (~). I also specified that inside an NBSP element the space character will be a short reference to the non-breaking space character. The following is added to the declaration:

...
<!ENTITY BeginNbsp STARTTAG "NBSP">
<!ENTITY EndNbsp ENDTAG "NBSP">

<!SHORTREF MapDefault "^" BeginSup
                      "_" BeginSub
                      "~" BeginNbsp>
<!SHORTREF MapNbsp    "^" BeginSup
                      "_" BeginSub
                      "~" EndNbsp
                      " " nbsp>
<!SHORTREF MapSup     "^" EndSup
                      "_" BeginSub>
<!SHORTREF MapSub     "^" BeginSup
                      "_" EndSub>
<!USEMAP MapNbsp NBSP>
...

Now all spaces inside NBSP elements (which can be marked by tildes) will be automatically turned into non-breaking spaces. However, once the NBSP element is added to the document, it is not valid HTML, even after running it through sgmlnorm. To derive valid HTML from my source, I made use of architectural forms.
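As an illustration of the tilde notation, here is a made-up sentence together with the form I would expect it to be equivalent to at this stage, while the NBSP element is still present. Only the spaces between the tildes are affected.

```sgml
<!-- source, using tildes -->
<P>Hence ~for all x, P(x)~ holds.</P>

<!-- equivalent form, with the NBSP element still in place -->
<P>Hence <NBSP>for&nbsp;all&nbsp;x,&nbsp;P(x)</NBSP> holds.</P>
```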

In an ideal world, to add architectural support for HTML, all that I would need to do is add the following processing instruction before the declaration:

<?IS10744:arch
        name="HTML"
        public-id="-//Russell O'Connor//NOTATION HTML 4.01 Architecture//EN"
        dtd-public-id="-//W3C//DTD HTML 4.01//EN"
        doc-elm-form="HTML"
>
<!DOCTYPE HTML [
...

The problem with this is that sgmlnorm (and the rest of the programs in the sp package) currently do not recognize this form of an architecture declaration. So instead I added the following to the declaration.

...
<!ENTITY % HtmlDtd PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
%HtmlDtd;

<?IS10744 ArcBase HTML>
<!NOTATION HTML PUBLIC "-//Russell O'Connor//NOTATION HTML 4.01 Architecture//EN">
<!ATTLIST #NOTATION HTML
          ArcDTD CDATA #FIXED "%HtmlDtd"
>
...

The automatic bridging will attach all valid HTML elements to the HTML architecture. To extract the valid HTML file from within our modification, the sgmlnorm command is run with the -A HTML option. The command line is now:

sgmlnorm -d -A HTML -chtml-catalog-file source-file > destination-file

The output is perfectly valid HTML, with a declaration, and with the NBSP element removed. But the spaces that were changed to non-breaking spaces are still there, just as I wanted.

Some Results

The source file, and resulting file for my math assignment are available and can be compared. The source for this document can also be examined.

Russell O’Connor: contact me