Using PHP, I parse a text file that contains Unicode characters like 😍
The file is read in without any further encoding or decoding, the smiley is parsed, and the result is passed to json_encode(); the output is \u00f0\u009f\u0098\u008d
A JavaScript file loads the .json data and renders the 4 escaped characters as ð
According to a Unicode table, the symbol is called "SMILING FACE WITH HEART-SHAPED EYES" and has the code point U+1F60D (decimal 128525).
Is there a way to convert the 4 code units to the Unicode code point or, ideally, to a proper HTML entity, in this case &#128525;?
Looking at conversions, the UTF-8 code units look similar (F0 9F 98 8D 0A 0A), but I can't reproduce the 4 escaped units I get, so I don't even know what I'm looking at.
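The four escapes are consistent with the UTF-8 bytes F0 9F 98 8D each being read as a separate ISO-8859-1 character and re-encoded as UTF-8 before json_encode() runs. The following is a minimal sketch of that idea, assuming the mbstring extension is loaded and that this is really what happens upstream; the variable names are made up for illustration.

```php
<?php
// Sketch: reproduce the mangled escapes, then undo them.
// Assumption: the damage is a UTF-8 -> "ISO-8859-1 read as UTF-8" double encoding.

$smiley = "\u{1F60D}"; // U+1F60D as a UTF-8 string: bytes F0 9F 98 8D

// Correct path: json_encode emits a UTF-16 surrogate pair.
$good = json_encode($smiley); // "\ud83d\ude0d"

// Mangled path: every UTF-8 byte is treated as one ISO-8859-1 character
// and re-encoded, so four bytes become four separate characters.
$mangled = mb_convert_encoding($smiley, 'UTF-8', 'ISO-8859-1');
$bad = json_encode($mangled); // "\u00f0\u009f\u0098\u008d"

// Reverse the damage: map the four characters back to single bytes.
$repaired = mb_convert_encoding($mangled, 'ISO-8859-1', 'UTF-8');

$codePoint = mb_ord($repaired, 'UTF-8'); // 128525
$entity = '&#' . $codePoint . ';';       // "&#128525;"
```

Reversing the conversion like this only works if nothing else touched the string in between, so fixing the step that mangles the data is usually the better option.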
Update: I made a mistake and edited the second paragraph: \u00f0\u009f\u0098\u008d is already the result of json_encode().
Here is the basic function that reads the data from the file. In the source file the smiley is "hardcoded", so you can actually see it.
    function readLocalFile() {
        $file_html = fopen('output.html', "r");
        if ($file_html === false) {
            return; // file could not be opened
        }
        $html = "";
        // fgets() returns false at end of file, which ends the loop
        while (($line = fgets($file_html)) !== false) {
            $html .= $line;
        }
        fclose($file_html);
        // here I use regex to filter for specific tags, the result is an array
        $cleanData = parseData($html);
        saveToFile(json_encode($cleanData));
    }
I just created a dummy.html with just 😍 as the content, and this returns the correct result \ud83d\ude0d. In the context of the whole data it is still mangled as described above. Weird.
I have to look at the way the data is saved to output.html; that's where the problem has to be. I've been looking at the wrong part of the problem the whole time, d'oh!
Last update: I finally found the error. It was in the parseData function, where loadHTML garbled the content. I found the solution here: PHP DOMDocument loadHTML not encoding UTF-8 correctly
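For readers who hit the same problem: DOMDocument::loadHTML does not assume UTF-8 unless the markup declares it. One commonly suggested workaround is to turn every non-ASCII character into a numeric entity before parsing. The sketch below assumes the mbstring and dom extensions; $html is a stand-in for the contents of output.html, and this is not necessarily the exact fix applied in parseData.

```php
<?php
// Sketch only: $html is a stand-in for the contents of output.html.
$html = "<p>\u{1F60D}</p>";

// Replace every non-ASCII character with a numeric entity such as &#128525;
// so that loadHTML receives pure ASCII and cannot misread the encoding.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii = mb_encode_numericentity($html, $convmap, 'UTF-8');

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($ascii);

// DOM strings come back as UTF-8, so json_encode sees the real character.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
$json = json_encode($text);
```

With the entities in place, $json should contain the surrogate pair \ud83d\ude0d rather than the four mangled escapes.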