iconv misunderstands UTF-16 strings with no BOM

Reading Time: 2 minutes

I had a problem last week with converting UTF-16 encoded strings to UTF-8 using PHP’s iconv library on a Linux server. my code worked fine on my machine but the same code resulted in a rubbish unreadable characters on our production server.

Let me take you to the beginning of the problem. I had a Hexadecimal representation of a UTF-16 string like that

0635 0628 0627 062D 0020 0627 0644 062E 064A 0631

this is the equivalent of “Good morning” in Arabic “صباح الخير” . I had some lines of code that will convert this into a normal stream of UTF-16 bytes so I can be able to use iconv to convert the string to UTF-8. maybe you noticed that there is no BOM at the beginning of Hexadecimal representation of the string. so let me quote what is written on Unicode’s BOM FAQ page about this.

Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?

A: UTF-16 and UTF-32 use code units that are two and four bytes long respectively. For these UTFs, there are three sub-flavors: BE, LE and unmarked. The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first) and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.

so when there is no BOM, the string should be treated as big-endian. libiconv has a different opinion about this and will try to guess if it should use big-endian or little-endian depending on the operating system. so you will get different results on different machines.

The simple solution to this problem (after a long time trying to identify it), is to just tell iconv that I’m converting from UTF-16BE (big-endian) so it won’t try to guess the endianess of the bytes. so in php it will be like that

$result = iconv('UTF-16BE', 'UTF-8', $str);

or better, I can check the BOM before converting the Hexadecimal codes to a stream of bytes and taking the decision of converting from UTF-16BE or UTF-16LE depending on if it begins with FEFF or FFFE.

1 Comment iconv misunderstands UTF-16 strings with no BOM

  1. Álvaro G. Vicario

    Just found this issue when converting from UTF-8 to UCS-2 for a SMS gateway. The mb_convert_encoding() function seems to work properly, at least in my development and live environments.

    P.S. OpenID login does not seem to work. I don't know why I don't just drop it :(


Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.