B.1 Notes on invalid documents


This specification does not define how conforming user agents handle general error conditions, including how user agents behave when they encounter elements, attributes, attribute values, or entities not specified in this document.

However, to facilitate experimentation and interoperability between implementations of various versions of HTML, we recommend the following behavior:

  • If a user agent encounters an element it does not recognize, it should try to render the element’s content.
  • If a user agent encounters an attribute it does not recognize, it should ignore the entire attribute specification (i.e., the attribute and its value).
  • If a user agent encounters an attribute value it doesn’t recognize, it should use the default attribute value.
  • If it encounters an undeclared entity, the entity should be treated as character data.

We also recommend that user agents provide support for notifying the user of such errors.

Since user agents may vary in how they handle error conditions, authors and users must not rely on specific error recovery behavior.

The HTML 2.0 specification ([RFC1866]) observes that many HTML 2.0 user agents assume that a document that does not begin with a document type declaration refers to the HTML 2.0 specification. As experience shows that this is a poor assumption, the current specification does not recommend this behavior.

For reasons of interoperability, authors must not “extend” HTML through the available SGML mechanisms (e.g., extending the DTD, adding a new set of entity definitions, etc.).

B.2 Special characters in URI attribute values

B.2.1 Non-ASCII characters in URI attribute values

Although URIs do not contain non-ASCII values (see [URI], section 2.1) authors sometimes specify them in attribute values expecting URIs (i.e., defined with %URI; in the DTD). For instance, the following href value is illegal:

<A href="http://foo.org/Håkon">...</A>

We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:

  1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
  2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).

This procedure results in a syntactically legal URI (as defined in [RFC1738], section 2.2 or [RFC2141], section 2) that is independent of the character encoding to which the HTML document carrying the URI may have been transcoded.

Note. Some older user agents trivially process URIs in HTML using the bytes of the character encoding in which the document was received. Some older HTML documents rely on this practice and break when transcoded. User agents that want to handle these older documents should, on receiving a URI containing characters outside the legal set, first use the conversion based on UTF-8. Only if the resulting URI does not resolve should they try constructing a URI based on the bytes of the character encoding in which the document was received.

Note. The same conversion based on UTF-8 should be applied to values of the name attribute for the A element.