Semantic web design

Semantic design is one of the many buzzwords floating carefree around the internet at the moment. Admittedly, it's not the biggest (that must be Ajax right now), but it is quite a good word. However, not many people know what it actually is.

The semantic web design I'm talking about is not part of Tim Berners-Lee's new semantic web, but it's related. The semantic web is an initiative to write new pages in a language based on RDF, which allows computers to extract the meaning from web pages. The semantic web design I'm talking about is the proper use of XHTML tags to structure a page sanely. However, this also allows computers to "read" web pages, to a certain extent.

All the original HTML tags were created and designed to structure information, not to format it. However, since the explosion of the internet, the border between the two has been blurred by masses of uneducated poops, and HTML turned into a pea-soup of a formatting language. The W3C decided that this wasn't good enough, so they created XHTML. It was designed to be similar to HTML, but (in its Strict form, at least) with all the extraneous formatting tags removed. The markup was semantic, and XML-compliant.

However, even when using XHTML, people somehow find ways to abuse the semantics, and use tags inappropriately. This article is trying to stamp that out.

Let's take a look at some markup for a sample page which has been written by a non-semantically-aware author. Alongside it, we'll look at the same page with good semantics. Both pages validate perfectly, yet the bad example is so much worse than the good one. Why?

Both examples start with the same valid markup - the DOCTYPE and the opening <html> and <head> tags:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-GB">
<head>

After this, it starts to go wrong. The bad example puts bad data into the <title> element:

<title>:) DrBob's amazing page :)</title>

This is not acceptable. Although nothing strictly semantically incorrect has been done, the two smiley faces (":)") should not be in the title: they have nothing to do with the page title, and are just confusing. Semantics doesn't just cover the usage of tags; it also covers the contents of said tags.

The code should be written as follows, with the smileys removed:

<title>DrBob's amazing page</title>

Next comes some bog-standard code. However, the bad example has still managed to mess it up; even the simplest of code can be semantically incorrect:


<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="keywords" content="bad semantics,web design,web development,XHTML,semantic web" />
<meta name="author" content="Philip Withnall" />
<meta name="description" content="A page to teach about how not to do semantics in web design. Woo! I'm such a cool web designer! Pick me! Woo! (@Google: This mad text is all in the name of education.)" />

You'll notice the crazy content in the meta description tag: it's semantically incorrect. In a meta description tag, you should only include data relevant to the description of the page, not (as in this case) a very modest description of self.

Cleaned up, the code is as follows:


<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="keywords" content="good semantics,web design,web development,XHTML,semantic web" />
<meta name="author" content="Philip Withnall" />
<meta name="description" content="A page to teach about how to do semantics in web design." />
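
To see why the contents matter, here is a minimal Python sketch of how an indexing program might read the head of a page. It is only an illustration: the HeadReader class and the read_head function are names made up for this example, built on the standard library's html.parser, and a real search engine is certainly more sophisticated.

```python
from html.parser import HTMLParser


class HeadReader(HTMLParser):
    """Collects the <title> text and every named <meta> value (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs:
                self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept; everything else is ignored.
        if self._in_title:
            self.title += data


def read_head(markup):
    reader = HeadReader()
    reader.feed(markup)
    reader.close()
    return reader


BAD_HEAD = "<head><title>:) DrBob's amazing page :)</title></head>"
GOOD_HEAD = """<head>
<title>DrBob's amazing page</title>
<meta name="author" content="Philip Withnall" />
<meta name="description" content="A page to teach about how to do semantics in web design." />
</head>"""

print(read_head(BAD_HEAD).title)   # :) DrBob's amazing page :)
good = read_head(GOOD_HEAD)
print(good.title)                  # DrBob's amazing page
print(good.meta["description"])    # A page to teach about how to do semantics in web design.
```

Whatever ends up in those elements is what the program reports as the page's title and description, smileys and boasting included.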

The next block of code in the bad example is quite interesting, and gives us an insight into why semantic web design is better:


<style type="text/css">
.title
{
font-size: x-large;
font-weight: bold;
}

h1,h2,h3,h4,h5,h6,address
{
/*Bah! I had to make these display inline to make them display right! Stupid W3C can't get anything correct!*/
display: inline;
}
</style>

We can see that the author's bad semantics have cost him in appearance: his incorrect use of heading tags (see below) meant that his page didn't display properly without CSS "enhancements". Any mildly competent CSS coder would know that if you're making heading tags display inline, there's something seriously wrong.

Just for comparison, what follows is the CSS for the good example. Most of it can be eliminated or put into inline style attributes, however, as it's just to make the good example render similarly to the bad example:


<style type="text/css">
.italic
{
font-style: italic;
}

.bold
{
font-weight: bold;
}

.large
{
font-size: large;
}

.strikethrough
{
text-decoration: line-through;
}

.float_left
{
float: left;
}

.float_right
{
float: right;
}

.indent
{
padding-left: 20px;
}
</style>

The next block of code is completely valid, and parses fine, but the author of the bad example has made one of the worst (and unfortunately, most common) mistakes related to semantics: he's used a heading tag incorrectly:


</head>
<body>
<div class="title">DrBob's amazing page <h1>lolzorz</h1></div>

Heading tags are supposed to be used to structure the document logically: short headings, arranged in a tree structure, with each level titling a section of the level above. Starting with the primary heading tag is of the utmost importance, but the whole title must be in the tag, not just some small (and irrelevant) part of it. Cleaned up, the code is as follows:


</head>
<body>
<h1>DrBob's amazing page</h1>
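
To show what a program makes of the two versions, here is a small Python sketch that builds a document outline from the heading elements alone. The OutlineBuilder class and outline_of function are hypothetical names for this illustration; they use only the standard library's html.parser, and real outline tools may apply more elaborate rules.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class OutlineBuilder(HTMLParser):
    """Records (level, text) for every heading element, in document order."""

    def __init__(self):
        super().__init__()
        self.outline = []     # list of (level, heading text)
        self._current = None  # text pieces of the heading being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._current is not None:
            text = " ".join("".join(self._current).split())
            self.outline.append((int(tag[1]), text))
            self._current = None


def outline_of(markup):
    builder = OutlineBuilder()
    builder.feed(markup)
    builder.close()
    return builder.outline


BAD = """<div class="title">DrBob's amazing page <h1>lolzorz</h1></div>"""
GOOD = "<h1>DrBob's amazing page</h1>"

print(outline_of(BAD))   # [(1, 'lolzorz')]
print(outline_of(GOOD))  # [(1, "DrBob's amazing page")]
```

As far as this reader is concerned, the bad page is a document called "lolzorz"; the real title never makes it into the outline.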

The next section of text doesn't look too bad, but there are some mistakes, including some more offences against headers:


<div>I live at <address>Buckingham Palace</address> in <strong>England</strong>. Aren't I <h4>lucky</h4>! Whee!</div>
<p>My name is DrBob, and <em>I</em> come from <em>behind you</em>.</p>

The first offence is the encapsulation of text in a <div>. This element is for logically grouping sections of a document for styling or other purposes. It is not designed to contain text directly; for that, you should use a paragraph element.

The next error is the use of the address element. Although its use is debatable, the official W3C HTML specification states that the address element should be used to encapsulate contact information for a web page, or a significant part of a web page. It is not meant for somebody's postal address, unless that is their preferred contact medium with regard to the web page.

Other errors in this section concern the emphasis and strong elements. These elements are not to be used with the mindset that "it's going to make the text italic", or "it'll make the text bold"! They should be used to signal emphasis on a section of text which would be emphasised if the text were read aloud. Browsers do usually render such emphasis as italic or bold, but here the author only wants the visual effect and not the emphasis, so the effect should be achieved using CSS.

The final mistake in this section is the incorrect use of a level-4 heading. It can't be stressed enough that headers are to be used for logical document section titling, and certainly not for text sizing and font choices!

Corrected, the section is much easier to read:


<p>I live at <span class="italic">Buckingham Palace</span> in <span class="bold">England</span>. Aren't I <span class="bold">lucky</span>! Whee!</p>
<p>My name is DrBob, and <span class="italic">I</span> come from <span class="italic">behind you</span>.</p>

The next section is incorrect from the outset, as the author has chosen to indent it all using a blockquote element, instead of styles:


<blockquote>
<!-- Bah! I had to change it from a <p> to a <div> to get it to validate! Oh well. They're the same to me, anyway, because I know best. -->
<div>Look at my wonderful indented text! Aren't I so <h2>good</h2> at <abbr title="HyperText Markup Language">HTML</abbr>!</div>
<p>I know this page is <a href="http://validator.w3.org/check?uri=referer">valid <abbr>XHTML</abbr></a> and uses <a href="http://jigsaw.w3.org/css-validator/?check=referer">valid <acronym>CSS</acronym></a>. Aren't I <strong>so good</strong>!</p>
</blockquote>

The use of the blockquote element is bad, as the blockquote element should be used to denote long quotations, not to indent things.

In this section, there are also more examples of header misuse, and <div> tag misuse. There is also something else, which hasn't been seen before: misuse of, and confusion over, the abbreviation and acronym elements. This article treats "HTML" as an acronym rather than an abbreviation, so the author's use of the abbreviation tag to explain what "HTML" means is the wrong choice (although the line between the two elements is a blurry one: the HTML 4.01 specification's own examples for the abbreviation element include initialisms such as WWW and HTTP). The other two uses of acronym and abbreviation elements are more clearly incorrect, as you should always specify a title attribute when using either element (and the author has used the abbreviation tag again, this time for XHTML).

Here's a clean version of this section:


<div class="indent">
<p>Look at my wonderful indented text! Aren't I so <span class="bold large">good</span> at <acronym title="HyperText Markup Language">HTML</acronym>!</p>
<p>I know this page is <a href="http://validator.w3.org/check?uri=referer">valid <acronym title="eXtensible HyperText Markup Language">XHTML</acronym></a> and uses <a href="http://jigsaw.w3.org/css-validator/?check=referer">valid <acronym title="Cascading Style Sheets">CSS</acronym></a>. Aren't I <span class="bold">so good</span>!</p>
</div>
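
The title attribute is what lets a program explain these terms to a reader. As a rough illustration, the following Python sketch gathers every abbreviation and acronym element with its expansion, and lists the ones it cannot explain. AbbreviationCollector and collect_abbreviations are made-up names for this example, and the input is a trimmed copy of the bad markup above.

```python
from html.parser import HTMLParser


class AbbreviationCollector(HTMLParser):
    """Maps the text of each <abbr>/<acronym> to its title (None when missing)."""

    TAGS = ("abbr", "acronym")

    def __init__(self):
        super().__init__()
        self.expansions = {}
        self._title = None
        self._text = None  # list of text pieces while inside one of TAGS

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self._title = dict(attrs).get("title")
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.TAGS and self._text is not None:
            self.expansions["".join(self._text).strip()] = self._title
            self._text = None


def collect_abbreviations(markup):
    collector = AbbreviationCollector()
    collector.feed(markup)
    collector.close()
    return collector.expansions


BAD = """<div>Aren't I so good at <abbr title="HyperText Markup Language">HTML</abbr>!</div>
<p>I know this page is valid <abbr>XHTML</abbr> and uses valid <acronym>CSS</acronym>.</p>"""

found = collect_abbreviations(BAD)
unexplained = [term for term, title in found.items() if title is None]
print(found)        # {'HTML': 'HyperText Markup Language', 'XHTML': None, 'CSS': None}
print(unexplained)  # ['XHTML', 'CSS']
```

Without a title, the element tells the program that something has been shortened, but not what it is short for.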

By now, you'll be getting sick of this author misusing headers, so we'll skip that error in the next section:


<h3>Did you know I'm so good at HTML, I can make <del>text strikethrough</del>? Isn't that cool?!</h3>

Apart from the header misuse, there is only one problem with this section, and that's the incorrect use of the <del> tag. It is supposed to be used to indicate that a web page has been changed, and that the encapsulated content has been removed in the latest version, but this author is incorrectly using it to add a strikethrough effect to his text. (The author has also forgotten to add an <acronym> tag, but this isn't as important.) This is simple to clean up:


<p class="large">Did you know I'm so good at <acronym title="HyperText Markup Language">HTML</acronym>, I can make <span class="strikethrough">text strikethrough</span>? Isn't that cool?!</p>
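
To illustrate the difference, here is a Python sketch of a reader that takes <del> at its word and leaves removed content out of the current version of the text. CurrentTextReader and current_text are hypothetical names for this example; nothing here is taken from a real browser or crawler.

```python
from html.parser import HTMLParser


class CurrentTextReader(HTMLParser):
    """Collects page text, skipping anything inside <del> (content marked as removed)."""

    def __init__(self):
        super().__init__()
        self._del_depth = 0
        self._pieces = []

    def handle_starttag(self, tag, attrs):
        if tag == "del":
            self._del_depth += 1

    def handle_endtag(self, tag):
        if tag == "del" and self._del_depth:
            self._del_depth -= 1

    def handle_data(self, data):
        if self._del_depth == 0:
            self._pieces.append(data)

    def text(self):
        # Collapse runs of whitespace so the result reads as one line.
        return " ".join("".join(self._pieces).split())


def current_text(markup):
    reader = CurrentTextReader()
    reader.feed(markup)
    reader.close()
    return reader.text()


BAD = "<h3>Did you know I'm so good at HTML, I can make <del>text strikethrough</del>? Isn't that cool?!</h3>"
GOOD = """<p class="large">Did you know I'm so good at <acronym title="HyperText Markup Language">HTML</acronym>, I can make <span class="strikethrough">text strikethrough</span>? Isn't that cool?!</p>"""

print(current_text(BAD))   # Did you know I'm so good at HTML, I can make ? Isn't that cool?!
print(current_text(GOOD))  # Did you know I'm so good at HTML, I can make text strikethrough? Isn't that cool?!
```

In the bad version, the very words the author wanted to show off are the ones this reader throws away.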

The final section in our bad example page is also the largest, but only contains a few mistakes repeated many times:


<table border="1" width="100%">
<tr>
<td>
<cite>Here's a recent conversation I had with a friend of mine:</cite>
</td>
<td>
<dl>
<dt>Mark said:</dt>
<dd>So, have you looked into this new <q>semantic markup</q> thing?</dd>
<dt>DrBob said:</dt>
<dd>Hell no! I'm the best coder in the world already!</dd>
<dt>Mark said:</dt>
<dd>Are you sure? It looks like it has benefits.</dd>
<dt>DrBob said:</dt>
<dd>Go away. I hate you.</dd>
</dl>
</td>
</tr>
</table>
</body>
</html>

As you can see, the section contains a conversation log of an interaction between the author and one of his (no longer) friends. The first mistake is a blindingly obvious misuse of a table. Many people have said it many times, and it can't hurt to say it again: tables should be used to display tabular data only, not to lay out a page!

The next mistake is the use of a citation tag in the wrong place. Citation tags should be used to show who or what a quote came from, and not at all in the manner shown in the example.

The remaining mistakes in this section are all misuses of definition elements. The HTML specification defines three elements for defining terms: <dl> for the list, <dt> for each term, and <dd> for each definition. Here, they have been used to format the responses in a chat log, and this is misleading: if a computer were to read this page as a list of definitions, it would record that "Mark said:" means "So, have you looked into this new semantic markup thing?", among other things. (In fairness, the HTML 4.01 specification does mention dialogues as another application of definition lists, but a speaker's name is not a term and a reply is not its definition, so a list of quotations is the clearer choice.)
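
To make that concrete, here is a minimal Python sketch of a program that takes the definition elements at face value and pairs each <dt> with the <dd> that follows it. DefinitionReader and read_definitions are names invented for this illustration, and the input is the first half of the chat log above.

```python
from html.parser import HTMLParser


class DefinitionReader(HTMLParser):
    """Pairs each <dt> with the following <dd>, as a term/definition reader would."""

    def __init__(self):
        super().__init__()
        self.pairs = []    # list of (term, definition)
        self._tag = None   # "dt" or "dd" while inside one
        self._text = []
        self._term = None

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # an inner element such as </q>, or a tag we do not track
        text = " ".join("".join(self._text).split())
        if tag == "dt":
            self._term = text
        else:
            self.pairs.append((self._term, text))
        self._tag = None


def read_definitions(markup):
    reader = DefinitionReader()
    reader.feed(markup)
    reader.close()
    return reader.pairs


CHAT_LOG = """<dl>
<dt>Mark said:</dt>
<dd>So, have you looked into this new <q>semantic markup</q> thing?</dd>
<dt>DrBob said:</dt>
<dd>Hell no! I'm the best coder in the world already!</dd>
</dl>"""

for term, definition in read_definitions(CHAT_LOG):
    print(f"{term!r} is defined as {definition!r}")
```

The program has no way of knowing that these are speakers and replies; all the markup tells it is that each first item is a term and each second item is what that term means.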

A final mistake in this section is the incorrect use of a quote tag. As nobody said "semantic markup" (as far as any reader of the page can tell, anyway), it shouldn't be inside a <q> tag. Instead, it should just be quoted normally, as in the cleaned version:


<div class="float_left">
<p>Here's a recent conversation I had with a friend of mine:</p>
</div>
<div class="float_right">
<ul>
<li><cite>Mark</cite> said:
<blockquote><p>So, have you looked into this new "semantic markup" thing?</p></blockquote></li>
<li><cite>DrBob</cite> said:
<blockquote><p>Hell no! I'm the best coder in the world already!</p></blockquote></li>
<li><cite>Mark</cite> said:
<blockquote><p>Are you sure? It looks like it has benefits.</p></blockquote></li>
<li><cite>DrBob</cite> said:
<blockquote><p>Go away. I hate you.</p></blockquote></li>
</ul>
</div>
</body>
</html>

That concludes our little waltz through a semantically incorrect page. It has by no means covered every conceivable mistake regarding semantic readability, but it has covered the main ones. The main thing to think about when coding a web page is the actual meaning of the elements you are using. Most of the time this is very obvious and self-explanatory, but if you're unsure at any time, it's always best to check with the W3C's specifications.

The future is sure to include more computers trawling through the internet looking for information. Whether it be the omnipresent Googlebot, or some university student's semantic web project, all computers need help in understanding web pages. Why make it harder for them, when it's so simple to make it easy?

For more information on common semantic mistakes, and the proper uses of elements involved in such mistakes, why not look at my semantics example page? It shows a table of common mistakes, and showcases some tricky (yet correct) semantics.