
Re: [TV] Character set oddity on HTML pages



In list.comp.tv, I wrote:
> In list.comp.tv, Gerph wrote:
>>
>>   Pokémon
>>
>> which I /think/ is the correct sequence for the e with an accent over it
>> if you were displaying ISO 8859-1 plain version of the UTF-8 encoded
>> character (sorry that sounds complicated).

Actually it's what you get when you:

   * Take a UTF-8 character (in this case 'e' with an acute accent)
   * Encode each of the two bytes making it up as a separate UTF-8
     character (i.e. we've now got 4 bytes)
   * Read those Unicode characters back in
   * Convert them to HTML entities
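For anyone wanting to reproduce it, the steps above can be sketched in a few lines of Python. This is only an illustration, not the actual Perl from the listings scripts: the function name is made up, and I'm assuming the stray bytes get read as ISO 8859-1 and that named entities are used where one exists.

```python
from html.entities import codepoint2name


def double_encode_to_entities(text: str) -> str:
    """Hypothetical helper reproducing the double-encoding steps."""
    # Step 1: the character as UTF-8 (two bytes for e-acute: C3 A9).
    utf8_bytes = text.encode("utf-8")

    # Step 2: treat each byte as a character in its own right
    # (ISO 8859-1 maps byte N to code point N), then encode those
    # characters as UTF-8 again. e-acute is now four bytes.
    misread = utf8_bytes.decode("latin-1")
    double_encoded = misread.encode("utf-8")

    # Step 3: read the Unicode characters back in.
    chars = double_encoded.decode("utf-8")

    # Step 4: convert anything non-ASCII to an HTML entity, using a
    # named entity when one exists and a numeric one otherwise.
    out = []
    for ch in chars:
        if ord(ch) < 128:
            out.append(ch)
        else:
            name = codepoint2name.get(ord(ch))
            out.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(out)


if __name__ == "__main__":
    print(double_encode_to_entities("Pok\u00e9mon"))
```

Going the other way (encode the mangled text as ISO 8859-1, then decode the result as UTF-8) undoes the damage, provided it only happened once.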

> Yeah, I think Perl's either not being as clever as I thought it was, or 
> something's changed in a recent update.

I think it's the former (plus one of the websites changing its encoding 
to UTF-8, which Perl wasn't automagically handling). I also think it's 
now fixed, but the fix may have ended up breaking one of the other 
channels. Let me know if the XML is no longer valid UTF-8, or if it or 
the website is displaying the wrong characters.
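If you want to check the output yourself, something like the following would do as a rough test. Again it's a sketch in Python with made-up function names, and the double-encoding check is only a heuristic: text that genuinely contains a sequence like A-tilde followed by a copyright sign would be flagged wrongly.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def looks_double_encoded(data: bytes) -> bool:
    """Heuristic: does undoing one ISO 8859-1 misreading change the text?"""
    try:
        text = data.decode("utf-8")
        # Reverse the mistake: characters back to single bytes, then
        # decode those bytes as the UTF-8 they originally were.
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Not valid UTF-8 at all, or not reversible this way.
        return False
    return repaired != text


if __name__ == "__main__":
    good = b"Pok\xc3\xa9mon"          # e-acute as proper UTF-8
    bad = b"Pok\xc3\x83\xc2\xa9mon"   # the same text encoded twice
    print(is_valid_utf8(good), looks_double_encoded(good))
    print(is_valid_utf8(bad), looks_double_encoded(bad))
```

Note that the double-encoded data is still perfectly valid UTF-8, which is why a validity check on its own won't catch this particular problem.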

Cheers,

Andrew

-- 
Andrew Flegg -- mailto:andrew@xxxxxxxx  |  http://www.bleb.org/