Special Characters and Encoding
All text in an XML document is parsed by the parser. Only text inside a CDATA
section is passed through as plain character data without being scanned for
markup.
--------------------------------------------------------------------------------
XML Encoding
XML documents may contain non-ASCII characters, such as the Norwegian æ, ø, å
or the French ê, è, é. To let your XML parser understand these characters, save
your XML documents as Unicode.
If an XML document has no encoding declaration and no byte order mark, the
parser assumes UTF-8, an encoding of ISO 10646 (Unicode).
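A minimal sketch of this default, assuming Python's standard xml.etree.ElementTree module (the note element and its text are made up for illustration): bytes with no encoding declaration are read as UTF-8, while a file saved as UTF-16 starts with a byte order mark and can declare its encoding explicitly.

```python
import xml.etree.ElementTree as ET

# UTF-8 bytes with no XML declaration: the parser falls back to UTF-8.
utf8_bytes = "<note>æ ø å</note>".encode("utf-8")
note = ET.fromstring(utf8_bytes)
print(note.text == "æ ø å")

# The same kind of text saved as UTF-16. Python's "utf-16" codec writes a
# byte order mark, and the declaration names the encoding explicitly.
utf16_doc = '<?xml version="1.0" encoding="UTF-16"?><note>ê è é</note>'
utf16_bytes = utf16_doc.encode("utf-16")
note16 = ET.fromstring(utf16_bytes)
print(note16.text == "ê è é")
```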
Windows 2000 Notepad saves a file as UTF-16 when you choose "Unicode" as the
encoding. When no encoding is chosen, the file is normally saved in a
single-byte ISO Western encoding:
Note: Need to verify the following:
works all
will generate error NS6
works all
works all
Every XML processor must be able to read UTF-8 and UTF-16; support for other
encodings is optional. Commonly used character set names are:
UTF-8
UTF-16
ISO-10646-UCS-2
ISO-10646-UCS-4
ISO-8859-1 to -9
ISO-8859-11
TIS-620
ISO-2022-JP
Shift_JIS
EUC-JP
XML Encoding Error Messages
If you try to load an XML document into Internet Explorer, you can get two
different errors indicating encoding problems:
An invalid character was found in text content.
You will get this error message if a character in the XML document does not
match the encoding attribute. Normally this happens when your XML document
contains "foreign" characters, the file was saved by an editor that uses a
single-byte encoding (such as Notepad), and no encoding attribute was
specified.
Switch from current encoding to specified encoding not supported.
You will get this error message if your file was saved as Unicode/UTF-16 but
the encoding attribute specified a byte-oriented encoding like Windows-1252,
ISO-8859-1 or UTF-8. You can also get this error message if your document was
saved with a single-byte encoding, but the encoding attribute specified a
16-bit encoding like UTF-16.
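The first situation can be reproduced outside Internet Explorer. The sketch below assumes Python's standard xml.etree.ElementTree module, whose error text differs from the Internet Explorer messages quoted above; the note element is made up for illustration. Bytes saved in a single-byte encoding fail to parse when no encoding is declared, and parse once a matching declaration is added.

```python
import xml.etree.ElementTree as ET

text = "<note>æøå</note>"

# Saved in a single-byte encoding, but with no encoding declaration:
# the parser assumes UTF-8 and rejects the bytes.
latin1_bytes = text.encode("iso-8859-1")
try:
    ET.fromstring(latin1_bytes)
    parsed_without_declaration = True
except ET.ParseError as err:
    parsed_without_declaration = False
    print("parse failed:", err)

# The same bytes with a matching encoding declaration parse normally.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_bytes
parsed = ET.fromstring(declared)
print(parsed.text == "æøå")
```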
--------------------------------------------------------------------------------
Escape Characters
Illegal XML characters have to be replaced by entity references. If you place a
character like "<" inside an XML element, it will generate an error because the
parser interprets it as the start of a new element. You cannot write something
like this:
if salary < 1000 then
To avoid this, you have to replace the "<" character with an entity reference,
like this:
if salary &lt; 1000 then
There are 5 predefined entity references in XML:
&lt;     <   less than
&gt;     >   greater than
&amp;    &   ampersand
&apos;   '   apostrophe
&quot;   "   quotation mark
Only the characters "<" and "&" are strictly illegal in XML. Apostrophes,
quotation marks and greater than signs are legal, but it is a good habit to
replace them.
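In a program, this replacement is usually left to a library rather than done by hand. A minimal sketch, assuming Python's standard xml.sax.saxutils module: escape() replaces "&", "<" and ">" by default, and apostrophes and quotation marks only when an extra mapping is supplied (the sample strings are made up for illustration).

```python
from xml.sax.saxutils import escape, unescape

line = "if salary < 1000 then"
escaped = escape(line)  # replaces &, < and >
print(escaped)          # if salary &lt; 1000 then

# Apostrophes and quotation marks are only replaced on request.
extra = {"'": "&apos;", '"': "&quot;"}
quoted = escape('say "it\'s < 5 & > 2"', extra)
print(quoted)

# unescape() reverses the default replacements.
print(unescape(escaped) == line)
```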
--------------------------------------------------------------------------------
CDATA - Arbitrary Character Data
Everything inside a CDATA section is passed through by the parser as plain text
and is not interpreted as markup. If your text contains a lot of "<" or "&"
characters, as program code often does, it can be placed in a CDATA section.
A CDATA section is a section of element content that is marked for the parser to
interpret as only character data, not markup.
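A CDATA section starts with "<![CDATA[" and ends with "]]>":
<![CDATA[<sender>John Smith</sender>]]>
In the example above, everything inside the CDATA section is kept as plain text
instead of being parsed as markup. It is equivalent to writing:
&lt;sender&gt;John Smith&lt;/sender&gt;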
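To see how a parser treats a CDATA section, here is a minimal sketch assuming Python's standard xml.etree.ElementTree module; the enclosing message element is made up for illustration. The sender tags inside the CDATA section come back as text, not as a child element.

```python
import xml.etree.ElementTree as ET

doc = "<message><![CDATA[<sender>John Smith</sender>]]></message>"
message = ET.fromstring(doc)

# The CDATA content is returned as character data of <message> ...
print(message.text)   # <sender>John Smith</sender>
# ... and no <sender> child element was created.
print(len(message))   # 0
```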
A CDATA section cannot contain the string "]]>"; therefore, nested CDATA
sections are not allowed. Also make sure there are no spaces or line breaks
inside the "]]>" string.
CDATA sections are useful for including XML code as text data within an XML
document, that is, XML shown in plain reading format rather than as markup to
be interpreted. If you were explaining XML tags within an XML document, you
would use a CDATA section or the &lt; entity reference. If you wanted to explain
CDATA itself, however, you could not write the "]]>" delimiter inside a CDATA
section, because CDATA sections cannot be nested.
The preferred approach to using CDATA sections for encoding text that contains
the triad "]]>" is to use multiple CDATA sections by splitting each occurrence
of the triad just before the ">". For example, to encode "]]>" one would write:
<![CDATA[]]]]><![CDATA[>]]>
This means that to encode "]]>" in the middle of a CDATA section, replace all
occurrences with the following (this effectively stops and restarts the CDATA
section):
]]]]><![CDATA[>
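The replacement rule can be sketched as a small helper. This is a minimal illustration in Python, not a library function: to_cdata and the sample strings are hypothetical names, and the standard xml.etree.ElementTree module is used only to check that the text survives a round trip.

```python
import xml.etree.ElementTree as ET

def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every "]]>" just before the ">"."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

wrapped = to_cdata("]]>")
print(wrapped)  # <![CDATA[]]]]><![CDATA[>]]>

# Program text that happens to contain "]]>" plus other markup characters.
sample = "if (a[b[0]]>1 && x<y) { done(); }"
root = ET.fromstring("<code>" + to_cdata(sample) + "</code>")

# The parser joins the adjacent CDATA sections back into the original text.
print(root.text == sample)
```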