Unicode Strings

The original ASCII table is based on the English language and does not account for a lot of other languages and symbols. Several designs were introduced to try to resolve this, and Unicode (http://www.unicode.org) has become the industry standard.

The Unicode standard addresses such topics as character encoding, character properties, visual representation, and more. Put very simply, Unicode supports characters and symbols from other languages by assigning every character (and there are tens of thousands of those, if not more) a unique integer number, called a code point, while maintaining compatibility with the ASCII table: the first 128 code points are the ASCII characters. Because code points run far beyond 255, a fixed-width representation needs up to 4 bytes per character, not 1 byte.

The problem starts when you want to write your Unicode string to a file. Writing 4 bytes instead of 1 every time wastes space. If the characters you use are simple English characters, they are all within the ASCII table and thus fit in 7 bits. To write them, you don't need 4 bytes; 1 byte will suffice. In this case, you can encode your Unicode string with UTF-8, an encoding built from 8-bit units that stores every ASCII character in a single byte. UTF stands for Unicode Transformation Format.

If the characters you use are not all from the English alphabet, you need more bytes to represent your Unicode string. In most cases, 2 bytes per character is enough, which means you can encode your Unicode string using UTF-16. However, you're also likely to use characters from the English alphabet (or other ASCII symbols), which need only 8 bits. For that reason the encodings are variable-width: UTF-8 uses between 1 and 4 bytes per character, and UTF-16 uses either 2 or 4.
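To make the size trade-off concrete, here is a minimal sketch that counts the bytes each encoding needs for a few characters. It assumes Python 3, where every str is already a Unicode string; the names samples and encoded_sizes are mine, chosen only for illustration.

```python
# Python 3 sketch: how many bytes does one character need in each encoding?
samples = {
    "LATIN SMALL LETTER A": "a",
    "COPYRIGHT SIGN": "\u00a9",
    "HEBREW LETTER ALEF": "\u05d0",
    "GRINNING FACE": "\U0001f600",
}


def encoded_sizes(char):
    """Return the byte counts of char in UTF-8, UTF-16, and UTF-32."""
    # The "-le" codec variants are used so that no byte order mark is added.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(char.encode(enc)) for enc in encodings)


for name, char in samples.items():
    print(name, encoded_sizes(char))
# LATIN SMALL LETTER A (1, 2, 4)
# COPYRIGHT SIGN (2, 2, 4)
# HEBREW LETTER ALEF (2, 2, 4)
# GRINNING FACE (4, 4, 4)
```

ASCII text is smallest in UTF-8, while characters outside the first 65,536 code points take 4 bytes in every one of these encodings.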

From our perspective, it is sufficient to know that Python natively supports Unicode strings. And furthermore, we can encode and decode Unicode strings using a host of encoding schemes, including UTF-8 and UTF-16. Unicode strings follow the notation u''.

Working with Unicode

Unicode strings behave similarly to regular strings. As mentioned previously, if a character in the Unicode string is an ASCII character, its ASCII value is used. So to construct a Unicode string made of ASCII characters, simply call the unicode() function with the string; for example, unicode('a string') returns u'a string'.

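However, if you are using only ASCII characters, what's the point of Unicode? Fair enough, let's add a non-ASCII character, that is, a character with an ordinal value above 127, and store the result in u, as in u = u'a string\xa9'. Evaluating u in the interpreter then displays u'a string\xa9'.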

The value 0xa9 corresponds to © and is pretty hard to type on most keyboards. So instead, I've used its ordinal value in Unicode. The interpreter keeps displaying the character as the escape \xa9 and does not print the symbol associated with it. If we dump the Unicode string to a file, it will be possible to view the special character in most editors or web browsers. In this case, the generated file is ../data/special.txt.

>>> open('../data/special.txt', 'wb').write(u.encode('utf-8'))

The write() method accepts only regular (byte) strings, not Unicode strings. So to write a Unicode string to a file, we have to call its encode() method, which accepts the name of the encoding to be used. In this case I've selected the UTF-8 encoding, which is widely used.
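For readers on Python 3, the same step can be sketched as follows. This is an assumption-laden translation rather than the code used in this chapter: Python 3 strings are always Unicode, open() accepts an encoding argument, and the temporary directory stands in for ../data.

```python
import os
import tempfile

u = "a string\u00a9"  # in Python 3 every str is a Unicode string

# Hypothetical output location standing in for ../data/special.txt.
path = os.path.join(tempfile.mkdtemp(), "special.txt")

# Option 1: encode by hand and write the resulting bytes.
with open(path, "wb") as f:
    f.write(u.encode("utf-8"))

# Option 2: let open() do the encoding while writing text.
with open(path, "w", encoding="utf-8") as f:
    f.write(u)

# Either way the file holds the same bytes.
with open(path, "rb") as f:
    raw = f.read()
print(raw)  # b'a string\xc2\xa9'
```

Both options produce identical files; the second simply moves the encode() call inside the file object.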

The function decode() complements encode() and returns the decoded Unicode string:

>>> u.encode('utf-8')
'a string\xc2\xa9'
>>> u.encode('utf-16')
'\xff\xfea\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00\xa9\x00'
>>> u.encode('utf-16').decode('utf-16')
u'a string\xa9'

For example, to read an entire text file encoded in UTF-16, issue the following: open('../data/somefile.txt', 'rb').read().decode('utf-16').
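A rough Python 3 equivalent of the round trip and of reading a UTF-16 file is sketched here. It assumes Python 3, and the file path is a temporary, hypothetical stand-in for ../data/somefile.txt.

```python
import os
import tempfile

u = "a string\u00a9"

# encode() produces bytes; the "utf-16" codec prepends a byte order mark.
utf16 = u.encode("utf-16")
roundtrip = utf16.decode("utf-16")  # decode() reverses encode()

# Hypothetical file standing in for ../data/somefile.txt.
path = os.path.join(tempfile.mkdtemp(), "somefile.txt")
with open(path, "wb") as f:
    f.write(utf16)

# open() decodes while reading and consumes the byte order mark.
with open(path, encoding="utf-16") as f:
    text = f.read()

print(roundtrip == u, text == u)  # True True
```

The two leading bytes of utf16 are the byte order mark, which is why the encoded form is 2 bytes longer than 2 bytes per character would suggest.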

Example: The Hebrew Alphabet

The purpose of this example is to generate a file with the Hebrew alphabet. If you don't have Hebrew installed on your system (which I suppose is the case for most folks out there), you can try other, more popular characters, as I'll show shortly.

■ Note The word alphabet comes from the first two Greek letters, alpha and beta, which in turn derive from the first two letters of the Semitic alphabets, Aleph and Bet in Hebrew.

The Hebrew alphabet, shown in Figure 5-1, starts with the letter Aleph mapping to the value 0x5D0 and ends at the letter Tav, mapped to value 0x5EA in Unicode. Unless you have the Hebrew keyboard installed on your system, manually typing Hebrew letters is not a trivial task. Therefore, you'll have to construct the alphabet using a Unicode string. Since we don't want (or can't) manually type the values, we'll construct a list of Unicode letters and then generate a string from the list. The function unichr() will be used to construct the characters from their ordinal value, similar to the function chr():

>>> letters = [unichr(letter) for letter in range(0x5d0, 0x5eb)]
>>> alephbet = ''.join(letters)
>>> open('../data/alephbet.txt', 'w').write(alephbet.encode('utf-16'))

Figure 5-1. The Hebrew alphabet

Once we have the Unicode string, all that's required is to write it to a file. Since the write() function only accepts regular strings, and not Unicode strings, we have to encode the string. The Hebrew code points do not fit in 8 bits, so I encode with UTF-16, which stores each of them in exactly 2 bytes (UTF-8 would also work and needs 2 bytes per Hebrew letter as well).
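The same construction can be sketched in Python 3, where chr() plays the role of unichr(). This is an illustrative translation under that assumption; the unicodedata lookups are my addition, included only to confirm that the range really starts at Aleph and ends at Tav.

```python
import unicodedata

# chr() in Python 3 covers the full Unicode range, like unichr() in Python 2.
letters = [chr(code) for code in range(0x5D0, 0x5EB)]
alephbet = "".join(letters)

print(len(alephbet))                   # 27: 22 letters plus 5 final forms
print(unicodedata.name(alephbet[0]))   # HEBREW LETTER ALEF
print(unicodedata.name(alephbet[-1]))  # HEBREW LETTER TAV

# Each Hebrew letter takes 2 bytes in UTF-16 and also 2 bytes in UTF-8.
print(len(alephbet.encode("utf-16-le")), len(alephbet.encode("utf-8")))  # 54 54
```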

The Latin alphabet has special characters, as shown in Figure 5-2: the accented letters, starting at value 0xC0 and ending at 0xFF. Therefore, you could modify the preceding script to generate the Latin special characters as follows:

>>> letters = [unichr(letter) for letter in range(0xc0, 0x100)]
>>> latin = ''.join(letters)
>>> open('../data/latin.txt', 'w').write(latin.encode('utf-16'))
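One detail worth checking is that the range 0xC0 to 0xFF is not made up of letters only. The following sketch, which assumes Python 3 and uses variable names of my own, builds the same string and picks out the two symbols that sit among the accented letters.

```python
# Build the same range of characters with Python 3's chr().
latin = "".join(chr(code) for code in range(0xC0, 0x100))

# str.isalpha() is False for the two math symbols inside the range.
non_letters = [c for c in latin if not c.isalpha()]

print(len(latin))    # 64
print(non_letters)   # ['×', '÷']
```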

Figure 5-2. Some interesting Latin characters

Example: Writing Today's Date in the Current Locale

The purpose of this example is to print the current date and time, in a specific locale, to a file. If you're using the English United States locale, this will be rather boring. So instead, I've decided to use the Hebrew locale again, since we're all familiar with it by now (see Listing 5-19). Since Python doesn't always print other character sets in the interpreter, it's best to write the results to a file and view them in a text viewer, editor, or web browser.

Listing 5-19. Today's Date in the Current Locale

import locale
from time import strftime, strptime, localtime
from sys import platform

# Pick the Hebrew locale name used on the current platform
if platform == 'linux2':
    locale.setlocale(locale.LC_ALL, 'hebrew')
elif platform == 'win32':
    locale.setlocale(locale.LC_ALL, 'Hebrew_Israel')
elif platform == 'cygwin':
    raise Exception('Cygwin not supported')
else:
    print "Untested platform:", platform

# Format the date in the locale, decode it from cp1255, and write it as UTF-16
today = strftime('%B %d, %Y', localtime())
todayU = unicode(today, 'cp1255')
open('../data/today.txt', 'w').write(todayU.encode('utf-16'))

At first, I pick the locale name for Hebrew based on the current platform, using the sys.platform value. As I've mentioned earlier, Linux, Windows, and Cygwin all have different locale abbreviations.

After I set the locale, I can query the preferred locale encoding with the function locale.getpreferredencoding(), which is quite useful in determining how to decode the string returned by strftime() into a Unicode string. Unfortunately, I have found that the encoding to use in the case of Hebrew should be cp1255 and not the one returned by the getpreferredencoding() function. Lastly, I encode the Unicode string and write it to a file using UTF-16.
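A more defensive variant of the same idea can be sketched for Python 3, where strftime() already returns a Unicode string and no manual cp1255 decoding is needed. Everything here is an assumption rather than the listing's method: the candidate locale names depend on what is installed on your system, and the helper date_in_locale is a name I made up for the sketch.

```python
import locale
from time import localtime, strftime

# Candidate Hebrew locale names; which of them exist is system dependent.
CANDIDATES = ("he_IL.UTF-8", "he_IL", "hebrew", "Hebrew_Israel")


def date_in_locale(candidates=CANDIDATES, fmt="%B %d, %Y"):
    """Format today's date in the first candidate locale that can be set.

    Returns (locale_name, text). locale_name is None when no candidate is
    available, in which case the date is formatted in the current locale.
    """
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    chosen = None
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_ALL, name)
                chosen = name
                break
            except locale.Error:
                continue  # this name is not installed; try the next one
        return chosen, strftime(fmt, localtime())
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the previous locale


name, text = date_in_locale()
print(name, text)
```

Trying several names avoids branching on sys.platform, and restoring the saved locale keeps the change from leaking into the rest of the program. To store the result, open the output file with encoding='utf-16' and write text directly.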

Figure 5-3 shows the results.

Figure 5-3. A date in Hebrew

More on Unicode

The topic of Unicode is vast, and numerous books are available that discuss it. As for online information, I have found the Python library reference a valuable resource. If you're looking for information regarding i18n and l10n in general, including code pages and locale information, the following might prove useful:

• The Unicode Consortium: http://www.unicode.org.

• Wikipedia: http://en.wikipedia.org/wiki/Locale. Be sure to follow the links to topics such as character encoding and Unicode.

• International Components for Unicode: http://www.icu-project.org/.
