Advertisements
Home > XML > Query XML with different languages or different UNICODE encodings [UTF-8 & UTF-16]

Query XML with different languages or different UNICODE encodings [UTF-8 & UTF-16]


While creating XML documents we should take care of the UTF encoding if we are dealing with NON-US or international data. Non-English foreign languages fall in different ranges and takes more memory size compared to English language and thus require different encoding schemes, as follows:

>> UTF-8: represents an octet (8 bit) loosless encoding of Unicode characters. UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets, where the number of octets depends on the integer value assigned to the Unicode character. It is an efficient encoding of Unicode documents that use mostly US-ASCII characters because it represents each character in the range U+0000 through U+007F as a single octet. UTF-8 is the default encoding for XML.

In an XML prolog the encoding is represented as:
‘<?xml version=”1.0″ encoding=”UTF-8″ ?>’

>> UTF-16: encoding is the 16-bit encoding of Unicode. It use 2 bytes per character (and sometimes combines two pairs), it makes implementation easier, but looks a bit overkill for Western languages encoding.

In an XML prolog the encoding is represented as:
‘<?xml version=”1.0″ encoding=”UTF-16″ ?>’

Let’s check both the encodings and issues if they are not used correctly:

-- ===================================================
-- XML containing NON Unicode English characters only:
-- ===================================================
DECLARE @xml as XML
SET @xml =
'<?xml version="1.0" encoding="utf-8"?>
<root>
	<to>Gaurav</to>
	<from>Garvit</from>
	<subject>Happy Birthday</subject>
	<body>Wish you a very very Happy Birth Day!!!</body>
</root>'

SELECT
	T.C.value('to[1]', 'VARCHAR(10)') AS 'TO',
	T.C.value('from[1]', 'VARCHAR(10)') AS 'FROM',
	T.C.value('subject[1]', 'VARCHAR(50)') AS 'SUBJECT',
	T.C.value('body[1]', 'VARCHAR(200)') AS 'BODY'
FROM @xml.nodes('//root') T(C)
GO
 Correct expected output:-

TO	FROM	SUBJECT		BODY
Gaurav	Garvit	Happy Birthday	Wish you a very very Happy Birth Day!!!

Let’s see how this works with international languages. Let’s change the language to French with same UTF-8 encoding:

-- ============================================
-- When XML contains Unicode French characters:
-- ============================================
DECLARE @xml as XML
SET @xml =
'<?xml version="1.0" encoding="utf-8"?>
<root>
	<to>Gaurav</to>
	<from>Garvit</from>
	<subject>Joyeux anniversaire</subject>
	<body>Je vous souhaite une très très heureux anniversaire!</body>
</root>'

SELECT
	T.C.value('to[1]', 'VARCHAR(10)') AS 'TO',
	T.C.value('from[1]', 'VARCHAR(10)') AS 'FROM',
	T.C.value('subject[1]', 'VARCHAR(50)') AS 'SUBJECT',
	T.C.value('body[1]', 'VARCHAR(200)') AS 'BODY'
FROM @xml.nodes('//root') T(C)
GO

We get an Error Message:
Msg 9420, Level 16, State 1, Line 2
XML parsing: line 6, character 31, illegal xml character

As a NON-English language (French) is used to so we have to use the UTF-16 encoding:

-- Let's change the encoding to UTF-16, i.e. for UNICODE characters:
DECLARE @xml as XML
SET @xml =
'<?xml version="1.0" encoding="utf-16"?>
<root>
	<to>Gaurav</to>
	<from>Garvit</from>
	<subject>Joyeux anniversaire</subject>
	<body>Je vous souhaite une très très heureux anniversaire!</body>
</root>'

SELECT
	T.C.value('to[1]', 'VARCHAR(10)') AS 'TO',
	T.C.value('from[1]', 'VARCHAR(10)') AS 'FROM',
	T.C.value('subject[1]', 'VARCHAR(50)') AS 'SUBJECT',
	T.C.value('body[1]', 'VARCHAR(200)') AS 'BODY'
FROM @xml.nodes('//root') T(C)
GO

Still getting the Error Message:
Msg 9402, Level 16, State 1, Line 2
XML parsing: line 1, character 39, unable to switch the encoding

Now, as we are dealing with UNICODE data so the XML string should also be of UNICODE type, so I’ll prefix ‘N’ before the string starts in SET statement:

-- Add N for to make the xml string Non-Unicode:
DECLARE @xml as XML
SET @xml =
N'<?xml version="1.0" encoding="utf-16"?>
<root>
	<to>Gaurav</to>
	<from>Garvit</from>
	<subject>Joyeux anniversaire</subject>
	<body>Je vous souhaite une très très heureux anniversaire!</body>
</root>'

SELECT
	T.C.value('to[1]', 'VARCHAR(10)') AS 'TO',
	T.C.value('from[1]', 'VARCHAR(10)') AS 'FROM',
	T.C.value('subject[1]', 'VARCHAR(50)') AS 'SUBJECT',
	T.C.value('body[1]', 'VARCHAR(200)') AS 'BODY'
FROM @xml.nodes('//root') T(C)
GO
Correct expected output:-
TO	FROM	SUBJECT			BODY
Gaurav	Garvit	Joyeux anniversaire	Je vous souhaite une très très heureux anniversaire!

As per MS BOL: “SQL Server stores XML data using the UTF-16 encoding scheme. Because UTF-16 data is variable-width, it is processed according to a byte-oriented protocol. This means that UTF-16 data can be treated in a way that is independent of the byte ordering on different computers (little endian versus big endian). Therefore, UTF-16 is well-suited for traversing different computers that use different encodings and byte-ordering systems. Because XML data is typically shared widely across networks, it makes sense to maintain the default UTF-16 storage of XML data in your database, and when you export XML data to clients.”

More info on: http://msdn.microsoft.com/en-US/library/ms145866%28v=SQL.90%29.aspx

Advertisements
  1. October 2, 2015 at 5:40 pm

    Very nice article! Extreme useful to demonstrate the XML encoding restrictions to developers when working with mssql ;]

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: