Why is Encoding So Confusing?

To embed strings in HTML you have to escape or encode characters such as <, >, and &. To embed strings in URLs you have to encode characters like / and ?.  JavaScript has different rules again if you want to embed content in string literals.  And if you want to embed JavaScript in HTML, you have to apply more than one level of encoding.
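
As a rough illustration, here is a minimal Python sketch of those rules using only the standard library. The sample strings and variable names are made up for the example, and using json.dumps to build a JavaScript string literal is an assumption that works for ordinary text but is not a complete answer for every script context.

```python
import html
import json
from urllib.parse import quote

text = 'Tom & Jerry <"best" friends>'

# HTML: escape <, >, & and quote characters.
as_html = html.escape(text)

# URL: percent-encode reserved characters such as / and ?.
as_url = quote("a/b?c=d", safe="")

# JavaScript string literal: add quotes and backslash escapes (via JSON here).
as_js = json.dumps(text)

# JavaScript inside an HTML attribute: two levels, JavaScript first, then HTML.
as_js_in_html = html.escape("alert(" + as_js + ")")

print(as_html)
print(as_url)
print(as_js)
print(as_js_in_html)
```

The same input ends up as four different strings, and none of them is interchangeable with another.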

On top of that you need to know whether the string is ASCII, Latin-1 (ISO-8859-1), or UTF-8.  If you get it wrong, the results will look correct for plain ASCII text, but start playing up as soon as you use characters outside the ASCII range.
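
A small Python sketch of that failure mode, assuming text that was written out as UTF-8 and then read back as Latin-1 (the misread helper is hypothetical, just for the demonstration):

```python
ascii_text = "Hello World!"
accented = "café"


def misread(s):
    """Encode as UTF-8, then wrongly decode the bytes as Latin-1."""
    return s.encode("utf-8").decode("latin-1")


print(misread(ascii_text))  # plain ASCII survives the mistake
print(misread(accented))    # the accented character comes back garbled
```

The bug stays invisible until the first non-ASCII character turns up.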

Why is it so hard to get right?

My simple answer is that most programming languages do not have separate string types for the different encoded forms of text.  The programmer is required to remember how each value is encoded. A programmer inspecting a variable may make assumptions about its encoding based on the value it happens to hold at the time. If the encoding is not clearly documented in the code, it is easy to make mistakes.  For example, if one string is plain text and another is HTML encoded, it's a common mistake to embed the plain text string in the HTML encoded content without escaping its HTML sensitive characters.
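
To make the idea of separate string types concrete, here is a minimal Python sketch. The Html class and the to_html and concat_html helpers are hypothetical names invented for this example; they are not part of any library.

```python
import html


class Html(str):
    """Marker type: a string that is already HTML encoded."""


def to_html(value):
    """Return value as Html, escaping it only if it is still plain text."""
    if isinstance(value, Html):
        return value
    return Html(html.escape(value))


def concat_html(*parts):
    """Join parts into one Html string, escaping any plain text parts."""
    return Html("".join(to_html(p) for p in parts))


page = concat_html(Html("<p>"), "Tom & Jerry <3", Html("</p>"))
print(page)
```

Because the type records what has already been encoded, the plain text part is escaped exactly once and the markup is left alone.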

The problem with HTML, URL, UTF-8, quoted-printable and similar encodings is that they leave most ASCII text as ASCII text.  Visually inspecting a string does not make it clear which encodings have already been applied.

ASCII:    Hello World!
Latin-1:  Hello World!
UTF-8:    Hello World!
HTML:     Hello World!
URL:      Hello%20World%21
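
A short Python sketch of why this matters, using a made-up sample value: for plain ASCII the forms are indistinguishable, so nothing stops you from encoding a string that was already encoded.

```python
import html

plain = "Hello World!"

# For plain ASCII the three byte encodings agree and HTML escaping is a no-op.
same_bytes = plain.encode("ascii") == plain.encode("latin-1") == plain.encode("utf-8")
same_html = html.escape(plain) == plain
print(same_bytes, same_html)

# With a sensitive character, escaping twice silently corrupts the text.
value = "Tom & Jerry"
once = html.escape(value)
twice = html.escape(once)
print(once)
print(twice)
```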

Compare that with hex or base64 encoding.  If you see such a value, it's pretty easy to guess that it has been encoded!

Hex:      48656c6c6f20576f726c6421
Base64:   SGVsbG8gV29ybGQh

There are advantages to the encodings that leave normal ASCII pretty well untouched.  They are easier to debug (you can read the text if it's mainly ASCII), and for plain ASCII text they are pretty storage efficient (hex doubles the size and base64 increases it by a third).
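
The size figures are easy to check. A minimal Python sketch using the same sample text as the tables above:

```python
import base64
import binascii

raw = b"Hello World!"

as_hex = binascii.hexlify(raw)
as_b64 = base64.b64encode(raw)

print(len(raw), raw.decode("ascii"))
print(len(as_hex), as_hex.decode("ascii"))  # two characters per byte
print(len(as_b64), as_b64.decode("ascii"))  # four characters per three bytes
```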

So what to do about it?  Since changing the existing programming languages is not likely, document your code!  Make it clear what the encoding of each string is (if any).  And always be mindful when combining strings to check their respective encodings.  Write test cases that use sensitive characters and strange encodings. I also try to avoid HTML encoded strings in business logic. I prefer to return a data structure and have the presentation layer worry about conversion to marked up output.  This also helps when a different presentation layer does not want HTML at all (e.g. a native mobile app).
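
Here is one way that advice might look in Python. It is only a sketch: get_greeting, render_html, render_text and the list of tricky names are all hypothetical, chosen to show plain data coming out of the business logic and escaping happening in exactly one place.

```python
import html


def get_greeting(name):
    """Business logic: returns plain data, never markup."""
    return {"greeting": "Hello", "name": name}


def render_html(data):
    """HTML presentation layer: the only place HTML escaping happens."""
    return "<p>{} {}!</p>".format(
        html.escape(data["greeting"]), html.escape(data["name"])
    )


def render_text(data):
    """Plain text presentation layer (e.g. for a native app): no escaping."""
    return "{} {}!".format(data["greeting"], data["name"])


# Test inputs with sensitive characters and non-ASCII text.
TRICKY_NAMES = ["<script>alert(1)</script>", "Tom & Jerry", 'O"Brien', "Zoë", "100% sure"]

for name in TRICKY_NAMES:
    data = get_greeting(name)
    page = render_html(data)
    inner = page[len("<p>"):-len("</p>")]
    # No raw markup characters leak through, and unescaping restores the input.
    assert "<" not in inner and ">" not in inner
    assert html.unescape(inner) == "Hello {}!".format(name)
    assert render_text(data) == "Hello {}!".format(name)
```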
