From: Irrwahn
Date:
To: dng
Subject: Re: [DNG] grep handles ISO-8859 encoded text file as binary file.
On Thu, 28 Apr 2016 14:29:41 -0400, Hendrik Boom wrote:
> On Thu, Apr 28, 2016 at 07:49:58PM +0200, Irrwahn wrote:
>> On Thu, 28 Apr 2016 13:16:53 -0400, Hendrik Boom wrote:
>>> On Thu, Apr 28, 2016 at 06:53:35AM +0000, Noel Torres wrote:
>>>> Hughe Chung <janpenguin@???> wrote:
>> [...]
>>>>> $ grep tesselate dome_math.c
>>>>> Binary file dome_math.c matches
>> [...]
>>>> If I were to bet, I would say that the file dome_math.c is not
>>>> correctly formatted, or has an incorrect BOM at start, or so.
>>>
>>> I've occasionally had a program that accepted UTF-8 reject a file
>>> because it *had* a valid BOM at the start.
>> [...]
>>
>> That would be because the notion of a BOM does not make
>> much sense at all for UTF-8. There is no byte order issue with
>> UTF-8, yet some brilliant mind thought it would be a good
>> idea to define and allow one (EF BB BF) anyway. And, pray
>> tell, other brilliant minds decided to use it as a way to
>> tell UTF-8 from traditional single byte encodings. This is
>> absurd, as it is just as bad as any other heuristic one
>> may come up with to deduce text file character encoding.
>>
>> To add insult to injury, some poorly written text editing
>> tools insert a BOM without any need or even being asked to,
>> deliberately breaking otherwise perfectly fine 7-bit ASCII
>> files and rendering them incompatible with legacy software.
>
> Don't assume that ASCII is that fine.


I don't. I made a snide remark about bad software rendering
intact 7-bit ASCII files backwards incompatible by gratuitously
decorating them with a BOM.
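
To make that concrete, here is a minimal Python sketch of the effect,
assuming a made-up one-line C source as sample data. The helper names
is_7bit_ascii and strip_utf8_bom are hypothetical, chosen only for
this illustration:

```python
import codecs

# Sample data (an assumption for illustration): a pure 7-bit ASCII file body.
ascii_text = b"int main(void) { return 0; }\n"

# What an over-eager editor might write: the same bytes behind EF BB BF.
with_bom = codecs.BOM_UTF8 + ascii_text


def is_7bit_ascii(data: bytes) -> bool:
    """Return True if every byte has its high bit clear."""
    return all(b < 0x80 for b in data)


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM if present, otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(is_7bit_ascii(ascii_text))               # True
print(is_7bit_ascii(with_bom))                 # False
print(strip_utf8_bom(with_bom) == ascii_text)  # True
```

The three prepended bytes all have the high bit set, so anything that
insists on strict 7-bit input will choke on the very first byte.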

> The majority of the world uses
> languages that don't fit in ASCII.


Obviously.

[Snipped explanation of various aspects of the imperfectness of
character encodings.]

> Even UTF-8 is hated in Japan, because a
> lot of characters that used to take two bytes now take three, and
> Japanese uses a lot of these now space-wasting characters.


You could find examples like that for any conceivable encoding,
as not all characters in contemporary use can be represented
even in two bytes. Hence (along with other advantages, e.g. no
zero bytes in encoding) the high adoption rate of UTF-8. At the
very least it avoids the UTF-32 bloat for the more commonly used
character sets. (Yes, I know about UTF-16 surrogate pairs, but this is
IMNSHO asking for the worst of both worlds.)
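
A rough Python sketch of the size trade-off, using four sample
characters picked only for illustration. The helper encoded_sizes is a
hypothetical name, and the "-le" codec variants are used so that no
BOM gets counted:

```python
def encoded_sizes(ch: str) -> dict:
    """Byte length of one character in each of three Unicode encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}


samples = (
    "A",           # U+0041, Latin capital A
    "\u00e9",      # U+00E9, e with acute accent
    "\u3042",      # U+3042, Hiragana letter A
    "\U0001d11e",  # U+1D11E, musical G clef (outside the 16-bit range)
)

# Sizes as (utf-8, utf-16, utf-32): 1/2/4, 2/2/4, 3/2/4, 4/4/4
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# UTF-8 never produces a zero byte for a non-NUL character; UTF-16 does.
print(b"\x00" in "A".encode("utf-8"))      # False
print(b"\x00" in "A".encode("utf-16-le"))  # True
```

The Hiragana sample is the case complained about above (three bytes
in UTF-8 versus two in UTF-16), while the last sample needs four bytes
in every one of these encodings.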

My 1.41¢ worth. And almost completely OT for DNG, so I'll stop
here. Sorry for the noise.

Regards
Urban