SGML contains many features that allow users to mark up a document in any editor in a simple and natural manner. I will describe how I made use of these features when preparing my assignment in mathematical logic.
I usually add extra features to HTML by modifying the document type declaration. For an HTML file, the declaration must appear as:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
My modifications add to this declaration, creating a declaration of the form:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd" [
... Added declarations go here ...
]>
The simplest kind of shorthand I take advantage of is short
tags. This is actually legal HTML, and does not require
modification of the declaration. Bits of text can be marked up using
the slant (‘/’) character. For example,
‘Here is some <EM/emphasized text/.’
Other shorthand that can be used includes omitting the generic identifier
in the end tag, so ‘Here is some <EM>emphasized text</>.’ is also
acceptable. I make great use of this when marking up headings.
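As a rough illustration of the two short-tag forms above, the fragment below shows a heading and a paragraph typed in shorthand, followed by the fully tagged mark-up it stands for. The heading and paragraph text are made up for the example.

```sgml
<!-- Shorthand source (hypothetical content) -->
<H2>Marked sections</>
<P>Here is some <EM/emphasized text/ and some <STRONG>strong text</>.

<!-- Equivalent fully tagged mark-up -->
<H2>Marked sections</H2>
<P>Here is some <EM>emphasized text</EM> and some <STRONG>strong text</STRONG>.</P>
```

One limitation worth keeping in mind: the slant form ends at the first ‘/’ in the content, so text that itself contains a slash needs the ordinary end tag or the empty end tag instead.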
Another legal HTML construct I use is marked sections. Marked
sections are useful for removing large parts of the source by
“commenting them out”. Begin the section to remove with
‘<![IGNORE[’, and end the section with ‘]]>’. Remember that marked
sections cannot be nested, and cannot include ‘]]>’. Other marked
section keywords include TEMP, INCLUDE, CDATA, and RCDATA. CDATA
marked sections indicate that there is no mark-up in the section, so
strings like ‘<EM>’ are treated as literal text. I use CDATA sections
in the creation of this document so that I can type HTML examples
without having to escape all the tags. RCDATA sections are similar to
CDATA sections; however, entity references inside them are still
expanded, while tags are treated as literal text.
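The three kinds of marked section described above might look like this in a document body. This is only a sketch, and the paragraph text is invented for the example.

```sgml
<!-- IGNORE: everything up to the closing delimiter is skipped -->
<![IGNORE[
<P>This draft paragraph does not appear in the parsed document.
]]>

<!-- CDATA: the tags and the ampersand are taken as literal characters -->
<P>Write <CODE><![CDATA[<EM>text</EM> & more]]></CODE> to emphasize text.

<!-- RCDATA: the tag stays literal, but the entity reference is expanded -->
<P><![RCDATA[The <EM> tag marks emphasis &amp; is widely supported.]]>
```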
Almost all the documents I create use the same string for the title
of the HTML file and the level 1 header (H1). Duplicating the
string is error-prone, so I often make a title entity for the
document. For this document I would add
‘<!ENTITY title "SGML features applied to HTML">’ to the
declaration. I can then use this entity for the TITLE and H1
elements as follows:
<TITLE/&title;/
<H1/&title;/
This requires modifying the declaration, so the resulting file is no longer legal HTML.
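Putting the pieces together, a complete source file using the title entity could be laid out roughly as follows. The body paragraph is a placeholder; the declaration and entity are the ones described above.

```sgml
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd" [
  <!-- Declared once, referenced in both TITLE and H1 -->
  <!ENTITY title "SGML features applied to HTML">
]>
<HTML>
<HEAD>
<TITLE/&title;/
</HEAD>
<BODY>
<H1/&title;/
<P>Body text goes here.
</BODY>
</HTML>
```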
Not all the characters I used in my logic assignment are available as standard entities in HTML. For example, the double turnstile character (⊧) is character U+22A7 in Unicode. It is useful to add an entity for this character to the declaration.
<!ENTITY models CDATA "⊧">
Many sites have pages that reuse the same headers and footers. I put the mark-up for my headers and footers in a separate file called sharedHtml.ent. I can include this entity right into my declaration as follows:
<!ENTITY % shared-html SYSTEM "sharedHtml.ent"> %shared-html;
The sharedHtml.ent file has definitions of entities for headers and
footers. This allows me to use the &header; and &footer; entities in
my HTML documents. If I want to change the headers and footers for my
entire site, all I need to do is modify the sharedHtml.ent file.
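For illustration only, a sharedHtml.ent file could contain declarations along these lines, with a body that references them. The actual header and footer mark-up is not given here, so the DIV contents and class names below are hypothetical.

```sgml
<!-- sharedHtml.ent (hypothetical contents) -->
<!ENTITY header '<DIV class="header"><P>My site</P></DIV>'>
<!ENTITY footer '<DIV class="footer"><ADDRESS>Maintained by the site owner</ADDRESS></DIV>'>

<!-- In a document whose declaration pulls in %shared-html; -->
<BODY>
&header;
<H1/&title;/
<P>Page content goes here.
&footer;
</BODY>
```

Single quotes delimit the entity text so that the attribute values inside can keep their double quotes.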
A more robust way of referring to entity files is to refer to them with an FPI (Formal Public Identifier). Instead of adding a system entity to the declaration, add a public entity.
<!ENTITY % shared-html PUBLIC "-//Russell O'Connor//ENTITIES Shared HTML//EN"> %shared-html;
The FPI will be resolved by adding the following to your catalog file.
PUBLIC "-//Russell O'Connor//ENTITIES Shared HTML//EN" sharedHtml.ent
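A fuller catalog might look something like the sketch below. It assumes that the W3C's HTML 4.01 files (HTML4.decl, strict.dtd and the three entity sets) have been copied into the same directory as the catalog under their distributed names; adjust the paths if they live elsewhere.

```
-- catalog (sketch; assumes all files sit next to this one) --
OVERRIDE YES
SGMLDECL "HTML4.decl"
PUBLIC "-//W3C//DTD HTML 4.01//EN" strict.dtd
PUBLIC "-//W3C//ENTITIES Latin1//EN//HTML" HTMLlat1.ent
PUBLIC "-//W3C//ENTITIES Symbols//EN//HTML" HTMLsymbol.ent
PUBLIC "-//W3C//ENTITIES Special//EN//HTML" HTMLspecial.ent
PUBLIC "-//Russell O'Connor//ENTITIES Shared HTML//EN" sharedHtml.ent
```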
Using only the features that do not modify the declaration produces legal HTML, but very few browsers correctly handle HTML that uses such shorthand. If the declaration is modified, then the resulting file is not even legal HTML. The solution to both these problems is to use a program called sgmlnorm. sgmlnorm is part of James Clark’s sp package and produces a normal form of an SGML file. It will expand all the shorthand used above into a canonically marked-up document. To use this program, execute the following command.
sgmlnorm will need a catalog file, along with the DTD and entity files, which are available from the W3C. The -d option tells sgmlnorm to write the DOCTYPE in the output.
In writing up a mathematical assignment, superscripts and subscripts
are often used. For the variable x₀, writing x<SUB/0/ repeatedly is a
bit tedious. One option is to use entities for variables. Another
option is to use the short reference (SHORTREF) feature described in
SGML and HTML Explained. This allows subscripts and superscripts to be
delimited by the ‘_’ and ‘^’ characters, so that x₀ can be written
as x_0_.
Instead of restricting the use of these delimiters to the MATH
element, I wanted to use them in all block-level structures. Here the
declaration needs access to the %block; entity of the HTML DTD.
The usual method of extending a document declaration would not give access
to internal HTML entities, so more direct access to the HTML DTD is
needed. To get it I used the following declaration.
<!DOCTYPE HTML [
<!ENTITY BeginSup STARTTAG "SUP">
<!ENTITY BeginSub STARTTAG "SUB">
<!ENTITY EndSup ENDTAG "SUP">
<!ENTITY EndSub ENDTAG "SUB">
<!SHORTREF MapDefault "^" BeginSup "_" BeginSub>
<!SHORTREF MapSup "^" EndSup "_" BeginSub>
<!SHORTREF MapSub "^" BeginSup "_" EndSub>
... Other declarations go here ...
<!ENTITY % HtmlDtd PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
%HtmlDtd;
<!USEMAP MapDefault (%block;)>
<!USEMAP MapSup SUP>
<!USEMAP MapSub SUB>
]>
The HTML DTD is no longer specified in the top line of the
declaration. Instead an external entity called %HtmlDtd; is
declared, and instantiated to make reference to the DTD. Code after
the DTD is instantiated can refer to entities inside the DTD such as
%block;. A drawback to using the HTML DTD this way is that the
-d option of sgmlnorm no longer works.
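To see how the three maps interact, here is a sketch of a paragraph typed with the delimiters, followed by the mark-up the parser should see once the short references are replaced. The sentence itself is invented for the example.

```sgml
<!-- Source, typed with the short references -->
<P>Let x_0_ and x_1_ be variables, and consider y^2^ and z_i^2^_.

<!-- What the parser sees after the short references are replaced -->
<P>Let x<SUB>0</SUB> and x<SUB>1</SUB> be variables, and consider
y<SUP>2</SUP> and z<SUB>i<SUP>2</SUP></SUB>.
```

Inside a SUB element the MapSub map is in effect, so ‘_’ closes the subscript while ‘^’ opens a nested superscript; MapSup works the other way round. A side effect is that a literal ‘_’ or ‘^’ in block content is no longer taken as ordinary text, so it would presumably have to be entered as a character reference (&#95; or &#94;).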
I did not want line breaks within the mathematical formulas I was
writing. One method to achieve this is to use the non-breaking space
entity (&nbsp;) everywhere I wanted a space. This
process is tedious. Instead I added an NBSP element to HTML. This is
simply done by adding it to the list of special inline elements and
declaring an NBSP element.
...
<!ENTITY % special "A | IMG | APPLET | OBJECT | BR | SCRIPT | MAP | Q | SUB | SUP | SPAN | BDO | IFRAME | NBSP">
...
<!ENTITY % HtmlDtd PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
%HtmlDtd;
...
<!ELEMENT NBSP - - (%inline;)*>
<!ATTLIST NBSP %attrs; >
...
The definition of the NBSP element must occur after the HTML DTD is instantiated since it refers to entities defined in the HTML DTD. The same short reference trick used for the SUP and SUB elements can be used to delimit NBSP sections. I chose the tilde character (~). I also specified that inside an NBSP element the space character will be a short reference to the non-breaking space character. The following is added to the declaration:
...
<!ENTITY BeginNbsp STARTTAG "NBSP">
<!ENTITY EndNbsp ENDTAG "NBSP">
<!SHORTREF MapDefault "^" BeginSup "_" BeginSub "~" BeginNbsp>
<!SHORTREF MapNbsp "^" BeginSup "_" BeginSub "~" EndNbsp " " nbsp>
<!SHORTREF MapSup "^" EndSup "_" BeginSub>
<!SHORTREF MapSub "^" BeginSup "_" EndSub>
<!USEMAP MapNbsp NBSP>
...
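As a sketch of how these maps would be used, the fragment below shows a formula typed between tildes and the mark-up the parser should see after the short references are replaced. The sentence and formula are made up for the example.

```sgml
<!-- Source, typed with the short references -->
<P>Therefore ~x_0_ = y^2^ + 1~ holds for every model.

<!-- What the parser sees after the short references are replaced -->
<P>Therefore <NBSP>x<SUB>0</SUB>&nbsp;=&nbsp;y<SUP>2</SUP>&nbsp;+&nbsp;1</NBSP> holds for every model.
```

Note that the space mapping belongs to MapNbsp only. Since SUB and SUP switch to their own maps, a space typed inside a subscript or superscript within an NBSP element would, as far as I can tell, remain an ordinary space.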
Now all spaces inside NBSP elements (which can be marked by tildes) will be automatically turned into non-breaking spaces. However, because the NBSP element has been added, the document is not valid HTML, even after running it through sgmlnorm. To derive valid HTML from my source, I made use of architectural forms.
In an ideal world, to add architectural support for HTML, all that I would need to do is add the following processing instruction before the declaration:
<?IS10744:arch name="HTML"
public-id="-//Russell O'Connor//NOTATION HTML 4.01 Architecture//EN"
dtd-public-id="-//W3C//DTD HTML 4.01//EN"
doc-elm-form="HTML"
>
<!DOCTYPE HTML [
...
The problem with this is that sgmlnorm (and the rest of the programs in the sp package) does not currently recognize this form of architecture declaration. So instead I added the following to the declaration.
...
<!ENTITY % HtmlDtd PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
%HtmlDtd;
<?IS10744 ArcBase HTML>
<!NOTATION HTML PUBLIC "-//Russell O'Connor//NOTATION HTML 4.01 Architecture//EN">
<!ATTLIST #NOTATION HTML
ArcDTD CDATA #FIXED "%HtmlDtd"
>
...
The automatic bridging will attach all valid HTML elements to the HTML architecture. To extract the valid HTML file from within our modification, the sgmlnorm command is run with the -A HTML option. The command line is now
The output is perfectly valid HTML, with a declaration, and with the NBSP element removed. But the spaces that were changed to non-breaking spaces are still there, just as I wanted.
The source file, and resulting file for my math assignment are available and can be compared. The source for this document can also be examined.