On top of that you need to know if the string is ASCII, Latin-1 (ISO-8859-1), or UTF-8. If you get it wrong, the results will look correct for plain ASCII text, but start playing up when you use characters outside this range.
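A minimal Python sketch of that failure mode. The sample strings are illustrative assumptions; the point is only that plain ASCII survives a wrong guess while an accented character does not.

```python
# Plain ASCII produces identical bytes in ASCII, Latin-1 and UTF-8,
# so guessing the wrong encoding goes unnoticed.
ascii_text = "Hello World!"
same_bytes = (
    ascii_text.encode("ascii")
    == ascii_text.encode("latin-1")
    == ascii_text.encode("utf-8")
)

# "caf" followed by e-acute (U+00E9), a character outside the ASCII range.
accented = "caf\u00e9"
utf8_bytes = accented.encode("utf-8")      # e-acute becomes two bytes: C3 A9

# Decoding UTF-8 bytes as Latin-1 does not raise an error; it silently
# yields two wrong characters (U+00C3 and U+00A9) in place of one.
misread = utf8_bytes.decode("latin-1")

print(same_bytes)                # True
print(len(accented), len(misread))   # 4 5
```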
Why is it so hard to get right?
My simple answer is that most programming languages do not have separate string types for the different encoded forms of text. The programmer is required to remember how each value is encoded. A programmer inspecting the value of a variable may make assumptions about the encoding based on whatever the variable happens to hold at that moment. If the encoding is not clearly documented in the code, it is easy to make mistakes. For example, if one string is plain text and another is HTML encoded, it's a common mistake to embed the plain-text string in the HTML-encoded content without escaping the HTML-sensitive characters.
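The HTML mistake can be sketched in Python as follows. The function name render_comment and the sample text are hypothetical, chosen only for illustration; html.escape is the standard library call.

```python
import html


def render_comment(comment_text: str) -> str:
    """Hypothetical helper: wrap a PLAIN-TEXT comment in an HTML paragraph."""
    # Escape the plain text before embedding it in HTML-encoded content.
    return "<p>" + html.escape(comment_text) + "</p>"


plain = "Fish & Chips <3"

# The common mistake: the plain text is pasted straight into the markup,
# so "&" and "<" are left for the browser to interpret as HTML.
unsafe = "<p>" + plain + "</p>"

# The fix: convert from plain text to HTML at the point of embedding.
safe = render_comment(plain)

print(unsafe)  # <p>Fish & Chips <3</p>
print(safe)    # <p>Fish &amp; Chips &lt;3</p>
```

Note that a string such as "Fish and Chips" would come out identical either way, which is exactly why the bug tends to go unnoticed.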
The problem with HTML, URL, UTF-8, quoted-printable and similar encodings is that they leave most ASCII text as ASCII text. Visually inspecting a string does not make it clear which encodings have already been applied.
ASCII: Hello World!
Latin-1: Hello World!
UTF-8: Hello World!
HTML: Hello World!
URL: Hello%20World%21
Now think about hex or base64 encoding. If you see such a value, it's pretty easy to guess that it has been encoded!
Hex: 48656c6c6f20576f726c6421
Base64: SGVsbG8gV29ybGQh
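For anyone who wants to reproduce values like these, here is a small Python sketch using only the standard library. One assumption to flag: urllib.parse.quote is just one URL-encoding routine, and it percent-encodes the space as well as the exclamation mark; other routines make different choices.

```python
import base64
import html
import urllib.parse

text = "Hello World!"
raw = text.encode("ascii")

# Encodings that leave plain ASCII (mostly) readable.
html_form = html.escape(text)
url_form = urllib.parse.quote(text)

# Encodings that are obviously "encoded" at a glance.
hex_form = raw.hex()
base64_form = base64.b64encode(raw).decode("ascii")

print(html_form)    # Hello World!
print(url_form)     # Hello%20World%21
print(hex_form)     # 48656c6c6f20576f726c6421
print(base64_form)  # SGVsbG8gV29ybGQh
```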
There are advantages to the encodings that leave normal ASCII text pretty well untouched. They are easier to debug (you can read the text if it's mainly ASCII), and for plain ASCII text they are storage efficient (hex doubles the size of the data and base64 increases it by a third).
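The size figures are easy to check. A quick sketch in Python, using the same 12-byte sample string as an assumption:

```python
import base64

raw = b"Hello World!"

sizes = {
    "raw": len(raw),                       # one byte per ASCII character
    "hex": len(raw.hex()),                 # two hex digits per byte
    "base64": len(base64.b64encode(raw)),  # four characters per three bytes
}

print(sizes)  # {'raw': 12, 'hex': 24, 'base64': 16}
```

Base64 output is padded to a multiple of four characters, so for inputs whose length is not a multiple of three the growth is slightly more than a third.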
So what to do about it? Since changing the existing programming languages is not likely, document your code! Make it clear what the encoding of each string is (if any). And always be mindful when combining strings to check their respective encodings. Write test cases that use sensitive characters and strange encodings. I also try to avoid HTML-encoded strings in business logic. I prefer to return a data structure and have the presentation layer worry about converting it to marked-up output. This also has the advantage that a different presentation layer may not want HTML at all (e.g. a native mobile app).
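One way to make the documentation hard to ignore, even without language support, is to wrap strings in small types that name their encoding. The following is only a sketch under that assumption: the PlainText and Html classes and the to_html and paragraph functions are hypothetical names invented for this example, not part of any library.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class PlainText:
    """Hypothetical wrapper: text with no encoding applied."""
    value: str


@dataclass(frozen=True)
class Html:
    """Hypothetical wrapper: text that is already HTML encoded."""
    value: str


def to_html(text: PlainText) -> Html:
    # The only place where plain text is converted to HTML.
    return Html(html.escape(text.value))


def paragraph(body: Html) -> Html:
    # Refuse anything that has not been explicitly converted.
    if not isinstance(body, Html):
        raise TypeError("paragraph() expects Html; call to_html() first")
    return Html("<p>" + body.value + "</p>")


# A test case that deliberately uses HTML-sensitive characters.
tricky = PlainText('a < b & "c"')
rendered = paragraph(to_html(tricky))
print(rendered.value)  # <p>a &lt; b &amp; &quot;c&quot;</p>
```

With this arrangement the business logic can pass PlainText values around freely, and only the presentation layer ever calls to_html.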