From: Eric Voskuil
Date:
To: 'Amir Taaki', Libbitcoin
Subject: Re: [Libbitcoin] config file stuff @evoskuil
Amir,

Linux natively uses UTF-8 character encoding; other systems generally do
not. UTF-8 began as a computationally inefficient but compact serialization
format whose advantage was ASCII compatibility. The choices made by the
various systems are evolutionary, and Linux is the newcomer; OSX didn't have
full Unicode support until v10.2. Looking at a good number of purportedly
cross-platform projects, I see that quite commonly these differences are not
taken into account. Even though such code compiles, it will be littered with
cross-platform bugs. These bugs go unseen by people working primarily with
ASCII test data (since by design nearly all code pages are isomorphic with
ASCII in the first 7 bits), or with ANSI code page text that happens to
match the default process code page (luck).

Cross-platform text support requires more effort than using safe string and
path classes. I wish it were that easy, but boost string and path are not
designed to solve Unicode cross-platform problems. The changes I made do not
solve the problems either (as I documented) but they work under the test
assumptions mentioned above and were minimally disruptive. Ideally we should
solve the problems properly. Currently there is not yet much text
manipulation in any of the projects that requires anything more than ASCII,
which is perfectly safe as char/string (e.g. hash manipulation).
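To illustrate why ASCII-restricted text is safe as char/string, here is a
sketch of the kind of hash manipulation meant above; the function name is my
own, chosen for illustration:

```cpp
#include <cassert>
#include <string>

// Reversing the byte order of a hex-encoded hash (e.g. flipping a
// little-endian hash to display order) is safe with plain std::string,
// because every character is ASCII: exactly one byte per character.
// Assumes an even-length hex string.
std::string reverse_hex_hash(const std::string& hex)
{
    std::string result;
    for (auto it = hex.rbegin(); it != hex.rend(); it += 2)
        result.append({ *(it + 1), *it });
    return result;
}
```

The same positional indexing would silently corrupt multi-byte UTF-8 text.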

The problems arise from interaction with platform APIs and external
libraries. I'll use Windows as an example, but this is not limited to
Windows - it affects multiple flavors of Unix as well.

When you compile for Linux, the 8-bit char is used to store a fraction of a
character. It can take up to four chars (six under the original UTF-8
specification) to store a single Unicode character in UTF-8. This of course
makes a class like std::string necessary, since indexing through an array of
char is meaningless without decoding the encoding. We should always be
careful to never use array manipulation of strings unless we have restricted
the domain to ASCII.
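A minimal sketch of the byte-vs-character mismatch (the counting function is
illustrative and assumes valid UTF-8 input):

```cpp
#include <cassert>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). Sketch only; assumes valid UTF-8.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s)
        if ((byte & 0xC0) != 0x80) // lead or ASCII byte, not continuation
            ++count;
    return count;
}
```

For example, "é" encodes as the two bytes 0xC3 0xA9, so its std::string
size() is 2 while it contains one character; indexing s[0] yields half a
character.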

When you compile for Windows, the 8-bit char is also used to store a
fraction of a character, and technically there is no reason it couldn't be
UTF-8. However, no Windows APIs expect UTF-8 input (except for the UTF-8
conversion functions). Windows has long fully supported Unicode, but uses
the UTF-16 encoding, not UTF-8. Although UTF-16 is technically not fixed at
2 bytes, it was treated that way for a long time (much as early C/C++ code
treated ANSI code pages, assuming they were 1 byte). This made UTF-16
indexable and therefore much more efficient from a processing perspective.
Windows APIs come in two flavors, ANSI and Wide.

ANSI is legacy, meant to provide support for non-Unicode applications
(written using char). ANSI APIs accept char characters and char* strings and
interpret them in the context of the current code page of the current
thread. This defaults to the process code page, which defaults to the user
code page, which defaults to the OS code page. Mixing code pages cannot be
done in a single API call.

Wide APIs accept UTF-16. That means characters must be represented by the
2-byte wchar_t type (a C integral type). Modern Windows-only code uses
wchar_t for just about everything and calls only the Wide APIs.

To facilitate conditional compilation for both older ANSI platforms and
newer Unicode (Wide) platforms, the TCHAR macros were developed.
Additionally, all Windows APIs (and related types) that exist in both
flavors are represented by macros. Compilation is controlled by the _UNICODE
preprocessor definition. The ANSI APIs end in "A" and the Wide APIs end in
"W", for example:

CreateProcessA(char* name, ...)
CreateProcessW(wchar_t* name, ...)

int main(int argc, char* argv[])
int wmain(int argc, wchar_t* argv[])
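The TCHAR pattern itself can be sketched portably; the real macros live in
tchar.h and windows.h, so MY_UNICODE and MY_T below are hand-rolled
stand-ins for _UNICODE and _T() that compile on any platform:

```cpp
#include <cassert>
#include <string>

// Hand-rolled sketch of the TCHAR pattern. With MY_UNICODE defined,
// tchar/tstring are wide; otherwise they are narrow. Code written
// against these aliases compiles either way; passing char* to a "W"
// API (or wchar_t* to an "A" API) is what breaks the build.
#ifdef MY_UNICODE
typedef wchar_t tchar;
typedef std::wstring tstring;
#define MY_T(x) L##x
#else
typedef char tchar;
typedef std::string tstring;
#define MY_T(x) x
#endif

// A function whose literal and return type flip encoding with the macro.
tstring greet()
{
    return tstring(MY_T("hello"));
}
```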

Generally what I see in Linux-centric projects that purport to be
cross-platform is something like this:

int main(int argc, char* argv[])
{
    ...
    CreateProcess(argv[1], ... )
}


If _UNICODE is defined then this will not compile, since CreateProcess
compiles as CreateProcessW and char != wchar_t. This is the first problem
that I ran into with libbitcoin projects.

If _UNICODE is not defined then it will compile but fail localization tests
that assume Unicode support, since CreateProcess compiles as CreateProcessA
and CreateProcessA interprets char as ANSI in the current code page, not
UTF-8.
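What that misinterpretation looks like can be sketched portably; here
Latin-1 stands in for the thread's ANSI code page, and the function name is
my own:

```cpp
#include <cassert>
#include <string>

// Simulate an "A" API misreading UTF-8: each byte of a multi-byte UTF-8
// sequence is treated as a whole character in an 8-bit code page
// (Latin-1 here). The result is re-encoded as UTF-8 so it is printable,
// producing classic mojibake.
std::string misread_utf8_as_latin1(const std::string& utf8)
{
    std::string out;
    for (unsigned char byte : utf8)
    {
        if (byte < 0x80)
            out += static_cast<char>(byte);
        else
        {
            // Encode the Latin-1 code point (== the byte value) as UTF-8.
            out += static_cast<char>(0xC0 | (byte >> 6));
            out += static_cast<char>(0x80 | (byte & 0x3F));
        }
    }
    return out;
}
```

For example, UTF-8 "é" (0xC3 0xA9) comes out as the two characters "Ã©",
which is exactly the corruption seen when a UTF-8 path reaches an ANSI API.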

My workaround was to fix the compile issue, but the localization problem
remains. One of the annoying issues is that libconfig is not designed for
Unicode. It is documented to (sort of) work with UTF-8:

"1.5 Internationalization Issues

Libconfig does not natively support Unicode configuration files, but string
values may contain
Unicode text encoded in UTF-8; such strings will be treated as ordinary
8-bit ASCII
text by the library. It is the responsibility of the calling program to
perform the necessary
conversions to/from wide (wchar_t) strings using the wide string conversion
functions such
as mbsrtowcs() and wcsrtombs() or the iconv() function of the libiconv
library.

The textual representation of a floating point value varies by locale.
However, the
libconfig grammar specifies that floating point values are represented using
a period (‘.’)
as the radix symbol; this is consistent with the grammar of most programming
languages.
When a configuration is read in or written out, libconfig temporarily
changes the LC_NUMERIC
category of the locale of the calling thread to the “C” locale to ensure
consistent handling
of floating point values regardless of the locale(s) in use by the calling
program.

Note that the MinGW environment does not (as of this writing) provide
functions for
changing the locale of the calling thread. Therefore, when using libconfig
in that environment,
the calling program is responsible for changing the LC_NUMERIC category of
the locale
to the "C" locale before reading or writing a configuration."

http://www.hyperrealm.com/libconfig/libconfig.pdf

But I believe this means that it cannot properly support Unicode path names
on non-UTF-8 platforms (which is a common issue I've seen). In any case, the
translation mentioned above is required for the Unicode support that it does
provide. Personally I would much prefer XML-based config to libconfig, in
part because of strong support for encodings and type conversions.
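The to/from-wide conversion that libconfig delegates to the caller
(mbsrtowcs/iconv in practice) can be sketched with a hand-rolled decoder;
this is an illustration only, handling 1- to 3-byte sequences (the Basic
Multilingual Plane) with no validation:

```cpp
#include <cassert>
#include <string>

// Minimal UTF-8 -> wide-string decoder of the kind a caller must supply
// around libconfig string values. Sketch only: assumes valid UTF-8 and
// ignores 4-byte (supplementary-plane) sequences.
std::wstring utf8_to_wide(const std::string& utf8)
{
    std::wstring wide;
    for (std::size_t i = 0; i < utf8.size();)
    {
        const unsigned char lead = utf8[i];
        if (lead < 0x80)
        {
            wide += static_cast<wchar_t>(lead);
            i += 1;
        }
        else if ((lead & 0xE0) == 0xC0) // 2-byte sequence
        {
            wide += static_cast<wchar_t>(((lead & 0x1F) << 6) |
                (utf8[i + 1] & 0x3F));
            i += 2;
        }
        else // assume a 3-byte sequence
        {
            wide += static_cast<wchar_t>(((lead & 0x0F) << 12) |
                ((utf8[i + 1] & 0x3F) << 6) | (utf8[i + 2] & 0x3F));
            i += 3;
        }
    }
    return wide;
}
```

On Windows one would of course use MultiByteToWideChar instead; the point is
that every libconfig string crossing a Wide API needs a pass like this.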

Boost provides tstring and tpath to allow for conditional compilation in
support of cross-platform builds. I used this narrowly, but when working
with non-ASCII text cross-platform we need to decide either to represent
text internally as UTF-8 and translate at the edges to/from UTF-16, or to
compile natively to UTF-16. The translation is ugly and costly, and
certainly more error-prone due to the lack of compiler type checking.

e

-----Original Message-----
From: Libbitcoin [mailto:libbitcoin-bounces@lists.dyne.org] On Behalf Of
Amir Taaki
Sent: Tuesday, April 15, 2014 7:23 AM
To: Libbitcoin@???
Subject: [Libbitcoin] config file stuff @evoskuil

hey,

The tstring/config file stuff doesn't work on Linux. Can we not use boost
paths (which convert agnostically between platforms) on Windows?

https://github.com/spesmilo/sx/blob/master/src/config.cpp

btw I try to avoid using the ternary operator (like goto) because I like
code to be explicit and readable. It doesn't matter in SX, though it
would in libbitcoin.

Once I know some more, we can collapse the load_config() code so there
isn't that ifdef.

If I know some more then we can collapse the load_config() code so there
isn't that ifdef.

_______________________________________________
Libbitcoin mailing list
Libbitcoin@???
https://mailinglists.dyne.org/cgi-bin/mailman/listinfo/libbitcoin