Character set encoding

Character encoding is the low-level representation of the letters, numbers, and symbols we see in our daily interactions with computers. Common encodings for documents in English are ISO-8859-1 (a superset of ASCII), UTF-8 (an 8-bit Unicode encoding), and Windows-1252. There are a great number of character set encodings in use and a long and complicated history of how they came to be. This complexity often leads to problems. Typically, these problems arise when a document is written in one encoding but interpreted as another.

If you have never had to deal with character encoding issues, consider yourself fortunate; deciphering and correcting them in bulk can be a royal pain.

Why you might care

It is likely that you see character set encoding problems all the time. If you have ever opened an email, a web page, or a document and some of the letters looked wrong, there is a good chance the cause was a character set encoding mismatch. You are most likely to notice problems with curly quotes, bullets, and accented characters. If you are interested in learning more, there are some excellent sources at the end of this article.

Just to illustrate the extent of the problem, [A composite approach to language/encoding detection](http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html) is the original research paper by the Netscape employees who wrote the character detection algorithm that is still used in Firefox. The page is encoded as ISO-8859-1, but its meta tags declare UTF-8. In most browsers, you will see funny-looking characters as a result of the encoding mismatch. Email can have character set encoding problems as well: RFC 2047 defines MIME extensions for non-ASCII text, and HTML email has the same problems as web pages.

The best tools I have found are primarily open source command-line utilities. Specialized GUIs are hard to come by, although, as I will describe, a browser and some text editors will work for many basic tasks. I only tested the command-line tools under Mac OS X, Linux, and FreeBSD Unix variants, although most can be compiled under Windows with Cygwin or similar systems. Some of the tools are available as pre-compiled Windows binaries.

Detecting character set encodings

The absolute quickest way to check whether you have a character encoding problem is to open the web page or file in Firefox and go to the Character Encoding option under the View menu. You can experiment by switching to a different character encoding and seeing whether your document displays correctly.

If you are unsure which character set your document is encoded in, the file command is a good place to start. It is a standard utility on every modern Unix system I have used. The program attempts to determine many characteristics of a file, including the type of line endings and the text encoding.
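For example, recent versions of the utility can report just their best guess at the encoding (the exact flags vary between versions; on some systems file -i or file -I prints the full MIME type instead of just the charset):

file --mime-encoding mystery.txt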

If you need more sophisticated tests for character encoding than the file command offers, then chardet, the Universal Encoding Detector, is your best option. The software is a Python port of code from the Mozilla/Firefox code base that includes multiple character encoding auto-detection mechanisms. The most recent version has a limited command line interface; previously, it was only accessible to developers willing to wrap their own code around the library. rchardet is a Ruby variant.
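In recent releases, the command line interface is a small script (named chardetect in the versions I have seen) that prints a best-guess encoding and a confidence score for each file you pass it:

chardetect mystery.txt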

Converting between character set encodings

It is possible to use a text editor for many character encoding conversions, if you know or can guess the original encoding. Simply open your text file in your favorite editor, such as the built-in TextEdit or TextMate on Mac OS X, TextPad or E Text Editor on Windows, Yudit on Unix systems with X Windows, and GNU Emacs on most systems. Then select a different encoding in the editor and re-save the file.
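In GNU Emacs, for example, the whole conversion is a single command, assuming Emacs read the file with the correct original encoding: type C-x RET f (set-buffer-file-coding-system), enter utf-8, and save the buffer.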

Uni2ascii can perform both ends of the conversion between UTF-8 and a large number of encodings and formats, including many ASCII variants, quoted-printable, HTML, XML, and escapes for POSIX and many programming languages. I like its many options for decomposing UTF-8 into other encodings. The -B flag creates best-effort ASCII by decomposing UTF-8 characters into reasonable plain ASCII alternatives. For example, the copyright symbol becomes (C). In my experiments, there were minor problems where the following characters were not converted: middle dot (U+00B7), next line (U+0085), and line separator (U+2028). Aside from these, the program did a tremendous job.

iconv/libiconv is the standard for character set conversion. The application needs to be used as a filter, so it can be less convenient if you would prefer to operate on files directly.

I have used GNU Recode for a number of projects. Recode relies on libiconv and can process files directly. The release version of Recode has not been updated in many years; however, it is under active development, and a recent beta can be found on the author's site.

convmv converts the character encoding of filenames (not the contents of the files) and can work on entire directories of files.

The Commetdocs service (formerly known as iconv.com) allows you to convert between many character sets and file types. The service is currently free.

I have not tried either extensively, but Enca, the “Extremely Naive Charset Analyzer,” and UTRAC, the “Universal Text Recognizer and Converter,” both provide extensive support for conversion between non-Western character encodings.

Examples

Example: convert files to UTF-8:

iconv -f original_charset -t utf-8 oldfile.txt > newfile.txt

recode original_charset..UTF-8 file.txt

Example: convert UTF-8 into readable 7-bit ASCII. The -B option is equivalent to the flag combination -cdefx.

uni2ascii -B file.txt

find . -type f -exec recode utf8..ascii {} \;

Example: use convmv to convert the filenames in a directory from ISO-8859-1 to UTF-8. By default convmv only performs a dry run, which is very useful for testing; the --notest flag tells it to actually rename the files, and -r makes it recurse into subdirectories.

convmv -f iso-8859-1 -t utf8 -r --notest directory/

The Future

In general, I recommend using UTF-8 for all new documents. UTF-8 is capable of representing the vast majority of alphabets and is a mature, internationally accepted standard. More than a year ago, Google found that the majority of pages on the web used the UTF-8 character encoding.

References

If you want to learn more about character encoding, the following sources are good places to start.

* This article originally appeared as “Why Does My Text Look Funny? Character Set Encoding Detection and Conversion” in my Messaging News “On Message” column.