Author: Hendrik Boom
Date:
To: dng
Subject: Re: [DNG] grep handles ISO-8859 encoded text file as binary file.
On Thu, Apr 28, 2016 at 07:49:58PM +0200, Irrwahn wrote:
> On Thu, 28 Apr 2016 13:16:53 -0400, Hendrik Boom wrote:
> > On Thu, Apr 28, 2016 at 06:53:35AM +0000, Noel Torres wrote:
> >> Hughe Chung <janpenguin@???> escribió:
> [...]
> >>> $ grep tesselate dome_math.c
> >>> Binary file dome_math.c matches
> [...]
> >> If I were to bet, I would say that the file dome_math.c is not
> >> correctly formatted, or has an incorrect BOM at start, or so.
> >
> > I've occasionally had a program that accepted UTF-8 reject a file
> > because it *had* a valid BOM at the start.
> [...]
>
> That would be because the notion of a BOM makes not much
> sense at all for UTF-8. There is no byte order issue with
> UTF-8, yet some brilliant mind thought it would be a good
> idea to define and allow one (EF BB BF) anyway. And, pray
> tell, other brilliant minds decided to use it as a way to
> tell UTF-8 from traditional single byte encodings. This is
> absurd, as it is just as bad as any other heuristic one
> may come up with to deduce text file character encoding.
>
> To add insult to injury, some poorly written text editing
> tools insert a BOM without any need or even being asked to,
> deliberately breaking otherwise perfectly fine 7-bit ASCII
> files and rendering them incompatible to legacy software.
Don't assume that ASCII is that fine. The majority of the world uses
languages that don't fit in ASCII.
Back in the 90's, when I was implementing C, the C standard specified
that a C program was made of characters, and it did *not* specify that
those were ASCII characters. Even beyond the various ISO national
variants on ASCII, many characters were represented using multiple
bytes. Strings in the source code were a sequence of characters, not
bytes, and some of those characters could be of the two-byte
persuasion, represented at run-time by a pair of bytes, of course.
Some characters would be represented with one byte, and some with two.
Now it just happened that one of the characters in Korean was
represented with a two-byte pair, and one of these bytes was a zero
byte. Such a zero byte was *not* a terminating byte; instead it
was part of a normal character in a normal string. If you use the
appropriate environment-sensitive string operations, it is not even
recognised as a string terminator.
Needless to say, I got involved in all this because I had to fix the
bug in the C parser, which converted the string notation to
a C string internally, and ended up chopping it off when
this character showed up. Which is what one of our Korean users was
complaining about.
My point is that it would be good if there were some reliable way to
determine the character set a file is written in. I've standardised
on UTF-8 myself. Even UTF-8 is hated in Japan, because a
lot of characters that used to take two bytes now take three, and
Japanese uses a lot of these now space-wasting characters.