Extensible Markup Language (XML) - Encoding and Escaping

Special Characters and Encoding

All text in an XML document is parsed by the parser. Only text inside a CDATA section is not interpreted as markup.

--------------------------------------------------------------------------------

XML Encoding

XML documents may contain foreign characters, like Norwegian æ ø å, or French ê è é. To let your XML parser understand these characters, you should save your XML documents as Unicode.

If no encoding is specified on an XML document (and the file has no byte order mark), ISO 10646 UTF-8 is assumed. Windows 2000 Notepad files saved as Unicode use "UTF-16" encoding.

ISO Western is not assumed by default. A document saved in that encoding has to declare it:

    <?xml version="1.0" encoding="iso-8859-1"?>

Note: Need to verify the following:

    <?xml version="1.0" encoding="UTF-8"?>          works in all browsers
    <?xml version="1.0" encoding="UTF-16"?>         generates an error in NS6
    <?xml version="1.0" encoding="ISO-8859-1"?>     works in all browsers
    <?xml version="1.0" encoding="windows-1252"?>   works in all browsers

Every XML processor has to be able to read UTF-8 and UTF-16. Support for other encodings depends on the processor. The following character set names are commonly recognized:

    UTF-8
    UTF-16
    ISO-10646-UCS-2
    ISO-10646-UCS-4
    ISO-8859-1 to -9
    ISO-8859-11
    TIS-620
    ISO-2022-JP
    Shift_JIS
    EUC-JP

XML Encoding Error Messages

If you try to load an XML document into Internet Explorer, you can get two different errors indicating encoding problems:

    An invalid character was found in text content.

You will get this error message if a character in the XML document does not match the encoding attribute. Normally you will get this error message if your XML document contains "foreign" characters, the file was saved in a single-byte encoding with an editor like Notepad, and no encoding attribute was specified.

    Switch from current encoding to specified encoding not supported.

You will get this error message if your file was saved as Unicode/UTF-16 but the encoding attribute specified an 8-bit encoding like Windows-1252, ISO-8859-1 or UTF-8.
You can also get this error message if your document was saved with an 8-bit encoding, but the encoding attribute specified a 16-bit encoding like UTF-16.

--------------------------------------------------------------------------------

Escape Characters

Illegal XML characters have to be replaced by entity references.

If you place a character like "<" inside an XML element, it will generate an error because the parser interprets it as the start of a new element. You cannot write something like this:

    <message>if salary < 1000 then</message>

To avoid this, you have to replace the "<" character with an entity reference, like this:

    <message>if salary &lt; 1000 then</message>

There are 5 predefined entity references in XML:

    &lt;      <    less than
    &gt;      >    greater than
    &amp;     &    ampersand
    &apos;    '    apostrophe
    &quot;    "    quotation mark

Only the characters "<" and "&" are strictly illegal in element content. Apostrophes, quotation marks and greater-than signs are legal there, but it is a good habit to replace them. Inside an attribute value, the quote character that delimits the value has to be replaced as well.

--------------------------------------------------------------------------------

CDATA - Arbitrary Character Data

Everything inside a CDATA section is passed through by the parser as character data, without being interpreted as markup.

If your text contains a lot of "<" or "&" characters, as program code often does, the XML element can be defined as a CDATA section. A CDATA section is a section of element content that is marked for the parser to interpret as only character data, not markup.

A CDATA section starts with "<![CDATA[" and ends with "]]>":

    <script>
    <![CDATA[
    function matchwo(a,b)
    {
      if (a < b && a < 0)
      {
        return 1;
      }
      else
      {
        return 0;
      }
    }
    ]]>
    </script>

In the example above, nothing inside the CDATA section is interpreted as markup by the parser.

Writing "<" inside a CDATA section has the same effect as writing "&lt;" in ordinary text. For instance, the following two lines are treated the same way, as character data, not markup tags:

    <![CDATA[<sender>John Smith</sender>]]>
    &lt;sender&gt;John Smith&lt;/sender&gt;

A CDATA section cannot contain the string "]]>"; therefore, nested CDATA sections are not allowed.
Also make sure there are no spaces or line breaks inside the "]]>" string.

CDATA sections are useful for writing XML code as text data within an XML document, that is, XML meant to be read as plain text rather than interpreted as markup. If you were explaining XML tags within an XML document, you would use CDATA or the &lt; entity reference. If you wanted to explain CDATA itself, you could not write the "]]>" delimiter inside a CDATA section, because CDATA sections cannot be nested.

The preferred approach to using CDATA sections for text that contains the sequence "]]>" is to use multiple CDATA sections, splitting each occurrence of the sequence just before the ">". For example, to encode "]]>" one would write:

    <![CDATA[]]]]><![CDATA[>]]>

This means that to encode "]]>" in the middle of a CDATA section, replace all occurrences with the following (this effectively stops and restarts the CDATA section):

    ]]]]><![CDATA[>
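
The two techniques described above, entity references and splitting "]]>" across CDATA sections, can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation: the helper names escape_text and to_cdata are hypothetical, and the sample strings are made up for the demonstration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_text(text: str) -> str:
    """Replace the five predefined characters with entity references.

    escape() handles &, < and > itself; the two quote characters are
    passed in as extra replacements.
    """
    return escape(text, {"'": "&apos;", '"': "&quot;"})


def to_cdata(text: str) -> str:
    """Wrap text in a CDATA section (hypothetical helper).

    Every "]]>" is split just before the ">", which closes the current
    CDATA section and immediately opens a new one.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Build a small document and check that the parser returns the
# original strings unchanged.
condition = "if salary < 1000 then"
code = "a[b[0]]>c && d"
document = (
    "<root><message>" + escape_text(condition) + "</message>"
    "<script>" + to_cdata(code) + "</script></root>"
)
root = ET.fromstring(document)
print(root.find("message").text == condition)
print(root.find("script").text == code)
```

Parsing the generated document back is a simple way to confirm that neither the entity references nor the split CDATA sections changed the text the application sees.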