PHP DOMDocument loadHTML not encoding UTF-8 correctly

By | December 16, 2017
Questions:

I’m trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).

$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile); 

$divs = $dom->getElementsByTagName('div');

foreach ($divs as $div) {
    echo $dom->saveHTML($div);
}

The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:

echo $profile;

it displays correctly. I’ve tried saveHTML and saveXML, and neither display correctly. I am using PHP 5.3.

What I see:

ã¤ãªãã¤å·ã·ã«ã´ã«ã¦ãã¢ã¤ã«ã©ã³ãç³»ã®å®¶åº­ã«ã9人åå¼ã®5çªç®ã¨ãã¦çã¾ãããå½¼ãå«ãã¦4人ã俳åªã«ãªã£ããç¶è¦ªã¯æ¨æã®ã»ã¼ã«ã¹ãã³ã§ãæ¯è¦ªã¯éµä¾¿å±ã®å®¢å®¤ä¿ã ã£ããé«æ ¡æ代ã¯ã­ã£ãã£ã®ã¢ã«ãã¤ãã«å¤ãã¿ãæè²è³éãåããªããã«ããªãã¯ç³»ã®é«æ ¡ã¸é²å­¦ã

What should be shown:

イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として生まれる。彼を含めて4人が俳優になった。父親は木材のセールスマンで、母親は郵便局の客室係だった。高校時代はキャディのアルバイトに勤しみ、教育資金を受けながらカトリック系の高校へ進学

EDIT: I’ve simplified the code down to five lines so you can test it yourself.

$profile = "<div lang=ja><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;

Here is the html that is returned:

<div lang="ja"><p>イリノイ州シカゴã«ã¦ã€ã‚¢ã‚¤ãƒ«ãƒ©ãƒ³ãƒ‰ç³»ã®å®¶åº­ã«ã€</p></div>
<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>
Answers:

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly. There’s a workaround in SmartDOMDocument which should help you:

$profile = '<div><p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p></div>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML($dom->getElementsByTagName('div')->item(0));

An alternative is to prepend the HTML with an XML encoding declaration to treat the string as UTF-8, provided that the document doesn’t already contain one:

$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);

Questions:
Answers:

The problem is with saveHTML() and saveXML(), both of them do not work correctly in Unix. They do not save UTF-8 characters correctly when used in Unix, but they work in Windows.

The workaround is very simple:

If you try the default, you will get the error you described

$str = $dom->saveHTML(); // saves incorrectly

All you have to do is save as follows:

$str = $dom->saveHTML($dom->documentElement); // saves correctly

This line of code will get your UTF-8 characters to be saved correctly (use the same workaround if you are using saveXML()).


Note

  1. English characters do not cause any problem when you use saveHTML() without parameters (because English characters are saved as single byte characters in UTF-8)

  2. The problem happens when you have multi-byte characters (such as Chinese, Russian, Arabic, Hebrew, …etc.)

I recommend reading this article: http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. You will understand how UTF-8 works and why you have this problem. It will take you about 30 minutes, but it is time well spent.

Questions:
Answers:

Make sure the real source file is saved as UTF-8 (You may even want to try the non-recommended BOM Chars with UTF-8 to make sure).

Also in case of HTML, make sure you have declared the correct encoding using meta tags:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If it’s a CMS (as you’ve tagged your question with Joomla) you may need to configure appropriate settings for the encoding.

Questions:
Answers:

You could prefix a line enforcing utf-8 encoding, like this:

@$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);

And you can then continue with the code you already have, like:

$doc->saveXML()

Questions:
Answers:

You must feed the DOMDocument a version of your HTML with a header that make sense.
Just like HTML5.

$profile ='<?xml version="1.0" encoding="'.$_encoding.'"?>'. $html;

maybe is a good idea to keep your html as valid as you can, so you don’t get into issues when you’ll start query… around 🙂 and stay away from htmlentities!!!! That’s an an necessary back and forth wasting resources.
keep your code insane!!!!

Questions:
Answers:

Works finde for me:

$dom = new \DOMDocument;
$dom->loadHTML(utf8_decode($html));
...
return  utf8_encode( $dom->saveHTML());

Questions:
Answers:

Problem is that when you add parameter to DOMDocument::saveHTML() function, you lose the encoding. In a few cases, you’ll need to avoid the use of the parameter and use old string function to find what your are looking for.

I think the previous answer works for you, but since this workaround didn’t work for me, I’m adding that answer to help ppl who may be in my case.

Questions:
Answers:

Use it for correct result

$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile);
echo $dom->saveHTML();
echo $profile;

This operation

mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');

It is bad way, because special symbols like &lt ; , &gt ; can be in $profile, and they will not convert twice after mb_convert_encoding. It is the hole for XSS and incorrect HTML.

Questions:
Answers:

This took me a while to figure out but here’s my answer.

Before using DomDocument I would use file_get_contents to retrieve urls and then process them with string functions. Perhaps not the best way but quick. After being convinced Dom was just as quick I first tried the following:

$dom = new DomDocument('1.0', 'UTF-8');
if ($dom->loadHTMLFile($url) == false) { // read the url
    // error message
}
else {
    // process
}

This failed spectacularly in preserving UTF-8 encoding despite the proper meta tags, php settings and all the rest of the remedies offered here and elsewhere. Here’s what works:

$dom = new DomDocument('1.0', 'UTF-8');
$str = file_get_contents($url);
if ($dom->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8')) == false) {
}

etc. Now everything’s right with the world. Hope this helps.

Questions:
Answers:

Try using utf8_encode

Leave a Reply

Your email address will not be published. Required fields are marked *