A text character is usually stored as an octet: a single byte, or 8 bits,
of data. Eight bits allow for 256 distinct character codes (0-255).
While the HTTP protocol allows the full 256-character range of the
ISO 8859-1 (ISO Latin-1) character set to be transported, not all operating
systems or applications natively support this range. To make this character
set more portable and viewable across browsers, HTML offers alternative
representations of all the ISO Latin characters using coded Character
Entities (see index below). These case-sensitive coded representations are
written using only characters from ASCII, a proper subset of the ISO Latin
character set.
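As a rough illustration of the idea above, the following Python sketch rewrites Latin-1 text so that only ASCII characters remain, with everything else expressed as a numbered entity. The function name to_ascii_entities is hypothetical; the 'xmlcharrefreplace' error handler is a standard part of Python's codec machinery.

```python
def to_ascii_entities(text: str) -> str:
    """Return text with every non-ASCII character replaced by a numbered entity."""
    # The 'xmlcharrefreplace' handler emits &#N; for each unencodable character.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "caf\u00e9"  # 'e' with acute accent is code 233 in ISO 8859-1

# In ISO 8859-1 every character occupies exactly one octet.
print(len(sample.encode("latin-1")))  # 4

# The entity form uses only ASCII characters.
print(to_ascii_entities(sample))  # caf&#233;
```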
Included in the Character Entity domain are both numbered and named entities:
Numbered Entity Syntax: &#charnumber;
Where charnumber is a distinct integer from 0-255.
Named Entity Syntax: &charname;
Where charname is a unique mnemonic shorthand for
the character to be represented.
Note: The trailing semi-colon character (';') is only
necessary if the character following the entity reference would
be recognized as part of the entity. Even so, it is probably wise to
always use this trailing termination character to be consistent.
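The two syntaxes can be sketched in Python as follows. This is only an illustration: the helper names numbered_entity and named_entity are made up here, and the name table comes from Python's standard html.entities module, which may list more names than a given HTML DTD does.

```python
from html.entities import codepoint2name


def numbered_entity(ch: str) -> str:
    """Build the &#charnumber; form for a character in the 0-255 range."""
    code = ord(ch)
    if not 0 <= code <= 255:
        raise ValueError("character is outside the ISO Latin range 0-255")
    return f"&#{code};"  # always emit the trailing semi-colon


def named_entity(ch: str) -> str:
    """Build the &charname; form, falling back to the numbered form."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else numbered_entity(ch)


print(numbered_entity("\u00e9"))  # &#233;
print(named_entity("\u00e9"))     # &eacute;
print(named_entity("A"))          # &#65; (no mnemonic name exists for 'A')
```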
Character entities can be used anywhere regular characters will be
displayed on screen.
In cases like IMG or INPUT, entities are used only for final display
purposes (ALT text for Images or VALUE for Input elements.)
Entities are not to be used in path names for URLs.
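A small sketch of this distinction, under some assumptions: the helper img_tag is hypothetical, the display text (ALT) is written with entities, and the URL path is percent-encoded instead, which is the usual URL mechanism rather than something this page prescribes. Python's urllib.parse.quote encodes as UTF-8 by default; a server from the ISO Latin era might expect a different encoding.

```python
from html import escape
from urllib.parse import quote


def img_tag(path: str, alt: str) -> str:
    """Build an IMG tag: entities in the ALT text, none in the URL path."""
    # URL paths use percent-encoding, not character entities.
    src = quote(path)
    # Display text: escape markup characters, then turn non-ASCII into entities.
    alt_text = escape(alt, quote=True)
    alt_text = alt_text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
    return f'<IMG SRC="{src}" ALT="{alt_text}">'


print(img_tag("/images/caf\u00e9.gif", "Caf\u00e9"))
# <IMG SRC="/images/caf%C3%A9.gif" ALT="Caf&#233;">
```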
DTD Note: The &quot; named character
entity was retracted from the HTML 3.2 DTD. There is still some confusion
as to WHY this was done, as this entity is in wide use, and exists in the
HTML 2.0, 3.0 and 4.0 DTDs. There are two differing stories as to why
it was deleted from the 3.2 DTD:
Dan Connolly (co-author of HTML 2.0) has said the omission was a mistake.
Dave Raggett (author of HTML 3.0, 3.2 and 4.0) has said that the
omission was intentional due to a disagreement in the HTML ERB over
which entities should be in HTML 3.2. Only the basic set of entities was
agreed upon. (Many thanks to a reader who sent me some mail clarifying this.)
Any documents using &quot; will generate validation errors under the HTML
3.2 DTD, but it should be safe to leave these entities in legacy documents
due to wide legacy and future browser/DTD support. The alternate form of this
entity ('&#34;') WILL validate and should be considered when
authoring new documents.
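For authors who want existing markup to validate against the HTML 3.2 DTD, the substitution can be sketched in a few lines of Python. The function name quot_to_numbered is hypothetical, and the plain string replacement assumes the named form is written in lowercase with its trailing semi-colon.

```python
import html


def quot_to_numbered(markup: str) -> str:
    """Swap the named quotation-mark entity for its numbered equivalent."""
    return markup.replace("&quot;", "&#34;")


legacy = "He said &quot;hello&quot; and left."
updated = quot_to_numbered(legacy)
print(updated)  # He said &#34;hello&#34; and left.

# Both forms decode to the same text, so the rendered result is unchanged.
print(html.unescape(legacy) == html.unescape(updated))  # True
```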
Browser Peculiarities
Internet Explorer 1.0-3.0 treated character entities case-insensitively,
such that "&EacuTE;" was treated the same as "&eacute;". In IE4.0+,
character entities are correctly case-sensitive.
IE seems to be VERY lenient in parsing character entities: it allows an
author to leave off the trailing semi-colon in every case that I have tried,
whereas the Netscape 4.x+ and Opera browsers I tried both choked, in the same
way, on about half of the same test cases (Netscape/Opera could handle
*&nbsp.test*, *  test*, and *  test*, but they
couldn't handle *&nbsptest*, *&nbsp1test*, *&nbspptest*,
*&nbsp99test* and *  test*.) IE handles ALL
of these cases just fine and renders all of the attempted non-breaking
spaces. I leave it to the reader to infer equivalence classes for this
behavior, but the gist of this item is: don't forget the semi-colon!
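To make the two peculiarities above concrete, here is a deliberately strict decoder sketched in Python. It does not model any of the browsers mentioned; it simply assumes the most conservative rules (names are case-sensitive and the trailing semi-colon is mandatory), so anything it decodes should be safe everywhere. The name strict_decode is hypothetical, and the name table is Python's html.entities.name2codepoint.

```python
import re
from html.entities import name2codepoint

# An entity reference here must end in a semi-colon: &#N; or &name;
_ENTITY = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def strict_decode(text: str) -> str:
    """Decode only well-formed, correctly cased, semi-colon-terminated entities."""

    def replace(match: re.Match) -> str:
        body = match.group(1)
        if body.startswith("#"):
            code = int(body[1:])
            # Numbered entities are limited to the 0-255 range.
            return chr(code) if code <= 255 else match.group(0)
        # Dictionary lookup is case-sensitive, so "EacuTE" is not "eacute".
        code = name2codepoint.get(body)
        return chr(code) if code is not None else match.group(0)

    return _ENTITY.sub(replace, text)


print(strict_decode("&nbsp;test") == "\u00a0test")  # True: terminated entity
print(strict_decode("&nbsptest"))                   # &nbsptest (left undecoded)
print(strict_decode("&EacuTE;"))                    # &EacuTE; (wrong case)
```

Running candidate markup through a check like this before publishing is one way to catch missing semi-colons that a lenient browser would silently accept.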