Broken font size when using UTF-8

Thu, Dec 1, 2011

Today I migrated one of my websites from the ISO-8859-1 charset to UTF-8. I started by converting all documents to UTF-8 and checked the website afterwards. As expected, the German umlauts turned out to be a mess. So the next thing was to change the charset definition in the HTML code:

<!-- old
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

After this, the whole layout moved about 20 px away from the top and the font size of many elements was way too big. Great. Sadly, I didn’t notice that all the text affected by this tragedy was located in tables. I was more like "my page is randomly fucked up". A common reason for a UTF-8 charset definition not having any effect is a differing charset sent by the server in the HTTP headers. So I also added the following to my PHP code right before any output, just to make sure:

<?php
header('Content-Type: text/html; charset=utf-8');
?>
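To check whether the HTTP header and the meta tag actually name the same charset, a rough comparison can be scripted. The following is only a sketch in Python, separate from the site's PHP: the function names are made up for illustration, and the regular expressions are a crude approximation, not a real HTML or header parser.

```python
import re


def charset_from_header(content_type):
    """Return the lowercase charset from a Content-Type header value, or None."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html):
    """Return the lowercase charset from a <meta> charset declaration, or None."""
    # Rough heuristic: look for "charset=" anywhere inside a <meta ...> tag.
    match = re.search(r'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)', html, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charsets_agree(content_type, html):
    """True only if both declarations exist and name the same charset."""
    header = charset_from_header(content_type)
    meta = charset_from_meta(html)
    return header is not None and header == meta
```

For example, `charsets_agree('text/html; charset=ISO-8859-1', page_source)` returns False for a page whose meta tag declares UTF-8, which is exactly the mismatch the explicit header is meant to rule out.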

Sending the header explicitly didn’t have any effect, but might have been necessary anyway. After that, I ran an HTML validator on my page, which led to a message like this:

Line 1: character "" not allowed in prolog

I assumed there was some whitespace character right before the doctype definition, but I didn’t expect this to be related to my font size problem at all. After trying various combinations of not the fuck printing anything, visible or invisible, before the doctype, and even echoing the doctype using PHP, I had a look at the page information box of Firefox. UTF-8 seemed okay, but the page was rendered in quirks mode.

According to Wikipedia, quirks mode (among other things) does not support font size inheritance in tables. This means that a body font size of 0.8em combined with a table font size of 0.8em will not result in the correct visible font size of 0.64em (0.8 × 0.8) for text in a table. Since I did not define any table font size, quirks mode probably used a default of 1.0em instead of inheriting the 0.8em from the body.

So why the hell was my page rendered in quirks mode? The HTML Validator extension for Firefox gave me the same error as validator.de.selfhtml.org did, but it also provided some useful information about what could cause that error. Among other things, it stated that using UTF-8 with a BOM (byte order mark) can result in whitespace or unexpected characters in a webpage’s source code. If you use Notepad++ to convert some document to UTF-8, this essentially means you are converting it to UTF-8 with BOM. And that little bastard byte sequence from hell (three bytes: EF BB BF) was visible neither in the Firefox source code viewer nor in Notepad++. A hex editor does the job though.

While HTML comments and whitespace are allowed to appear before the doctype definition, the BOM is not: read as ISO-8859-1, it shows up as the characters ï»¿, which are not allowed in the HTML prolog. Many browsers will then switch to quirks mode and mess up font sizes defined in relative units such as em. In this case it was also responsible for moving the layout away from the top.

So what did we learn?

Save your websites in UTF-8 without BOM.
Convert old documents to UTF-8 without BOM.
Define the charset both in the HTTP header and in the document itself.
Don’t use a BOM in UTF-8. It is only really needed for encodings such as UTF-16, where the byte order matters.
Use the W3C website: They provide really good information such as this detailed article (including a small summary) on how to use charsets in HTML and CSS.
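The first two rules can be checked without a hex editor. Below is a minimal Python sketch that detects and removes a leading UTF-8 BOM. The function names are hypothetical; the only library facts it relies on are `codecs.BOM_UTF8` (the bytes EF BB BF) and `pathlib.Path`. It rewrites files in place, so try it on a copy first.

```python
import codecs
from pathlib import Path


def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM; other input is unchanged."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


def strip_bom_in_file(path) -> bool:
    """Remove a leading BOM from the file in place. Returns True if it had one."""
    path = Path(path)
    data = path.read_bytes()
    if not has_utf8_bom(data):
        return False
    path.write_bytes(strip_utf8_bom(data))
    return True


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + b"<!DOCTYPE html>"
    print(has_utf8_bom(sample))    # True
    print(strip_utf8_bom(sample))  # b'<!DOCTYPE html>'
```

Decoding the three BOM bytes as ISO-8859-1, for instance with `codecs.BOM_UTF8.decode("latin-1")`, gives the ï»¿ sequence mentioned above.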