Author: Amir Taaki
Date:
To: Libbitcoin
Subject: Re: [Libbitcoin] config file stuff @evoskuil
No worries, Eric. Sounds annoying, and that's a lot for me to digest for a
simple issue. Just trying to find an equitable resolution since that code
was failing on linux:

https://bitcointalk.org/index.php?topic=259999.msg6099554#msg6099554

Either way we can leave it as is. I don't mind at all. Thanks for your
response.

> Amir,
>
> Linux natively utilizes UTF-8 character encoding and other systems
> generally do not. UTF-8 began as a computationally inefficient but
> compact serialization format that had the advantage of ASCII
> isomorphism. Choices made by various systems are evolutionary and Linux
> is the newcomer. OSX didn't have full Unicode support until v10.2.
> Looking at a good number of purportedly cross-platform projects I see
> that quite commonly these differences are not taken into account.
> Despite the fact that it compiles, such code will be littered with
> cross-platform bugs. These bugs go unseen by people working primarily
> with ASCII test data (since by design nearly all code pages are
> isomorphic with ASCII at the first 7 bits), or with ANSI code page text
> that matches the default process code page (luck).
>
> Cross-platform text support requires more effort than using safe string
> and path classes. I wish it were that easy, but boost string and path
> are not designed to solve Unicode cross-platform problems. The changes I
> made do not solve the problems either (as I documented) but they work
> under the test assumptions mentioned above and were minimally
> disruptive. Ideally we should solve the problems properly. Currently
> there is not yet much text manipulation in any of the projects that
> requires anything more than ASCII, which is perfectly safe as
> char/string (e.g. hash manipulation).
>
> The problems arise from interaction with platform APIs and external
> libraries. I'll use Windows as an example, but this is not limited to
> Windows - it affects multiple flavors of Unix as well.
>
> When you compile for Linux, the 8 bit char is used to store a fraction
> of a character. It can take up to 4 chars to store a single Unicode
> character in UTF-8 (the original design allowed up to 6). This of course
> makes a class like std::string necessary, since indexing through an
> array of char is impossible without knowledge of the code page. We
> should always be careful to never use array manipulation of strings
> unless we have restricted the domain to ASCII.
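
A minimal sketch of the indexing problem described above, assuming valid UTF-8 input. The function name utf8_code_points is hypothetical and not part of any libbitcoin or Boost API; it only shows that the number of chars and the number of characters diverge as soon as the text leaves ASCII.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes (bit pattern 10xxxxxx).
// Assumption: the input is valid UTF-8; no validation is performed.
std::size_t utf8_code_points(const std::string& text)
{
    std::size_t count = 0;
    for (unsigned char byte : text)
    {
        // Every byte that is not a continuation byte starts a code point.
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string "h\xC3\xA9" "llo" (the word "hello" with an accented e), size() reports 6 chars while the helper reports 5 code points, and element [1] holds only the first byte of the accented character.
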
>
> When you compile for Windows the 8 bit char is also used to store a
> fraction of a character, and technically there is no reason it couldn't
> be UTF-8. However no Windows APIs expect UTF-8 input (except for UTF-8
> conversion functions). Windows has for a long time fully supported
> Unicode but uses the UTF-16 encoding, not UTF-8. Although UTF-16 is
> technically not fixed at 2 bytes it was treated that way for a long time
> (much like early C/C++ code treated ANSI code pages - assuming they were
> 1 byte). This made UTF-16 index-able and therefore much more efficient
> from a processing perspective. Windows APIs come in two flavors, ANSI
> and Wide.
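
To illustrate why UTF-16 is not really fixed at 2 bytes, here is a small sketch that counts code points in a UTF-16 string. It uses char16_t and std::u16string as a portable stand-in for the Windows wchar_t (an assumption made so the example behaves the same on Linux, where wchar_t is typically 4 bytes). The function name utf16_code_points is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string by counting
// every code unit except low surrogates (0xDC00..0xDFFF), which are the
// second half of a surrogate pair.
// Assumption: the input is well-formed UTF-16.
std::size_t utf16_code_points(const std::u16string& text)
{
    std::size_t count = 0;
    for (char16_t unit : text)
    {
        if (unit < 0xDC00 || unit > 0xDFFF)
            ++count;
    }
    return count;
}
```

A character outside the Basic Multilingual Plane, such as u"\U0001F600", occupies two code units, so code that indexes by code unit can land in the middle of a character just as it can with UTF-8.
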
>
> ANSI is legacy, meant to provide support for non-Unicode applications
> (written using char). ANSI APIs accept char characters and char* strings
> and interpret them in the context of the current code page of the
> current thread. This defaults to the process code page, which defaults
> to the user code page, which defaults to the OS code page. Mixing code
> pages cannot be done in a single API call.
>
> Wide APIs accept UTF-16. That means characters must be represented in
> the 2 byte wchar_t representation (a C integral type). Modern
> Windows-only code uses wchar_t for just about everything and only uses
> the Wide APIs.
>
> In order to facilitate conditional compilation for both older ANSI
> platforms and newer Unicode (Wide) platforms, the TCHAR macros were
> developed. Additionally, all Windows APIs (and related types) that exist
> in both flavors are represented by macros. Compilation is controlled by
> the UNICODE and _UNICODE preprocessor definitions (the former for the
> Windows headers, the latter for the C runtime's TCHAR macros; they are
> normally defined together). The ANSI APIs generally end in "A" and the
> Wide APIs end in "W", for example:
>
> CreateProcessA(char* name, ... )
> CreateProcessW(wchar_t* name, ... )
>
> int main(int argc, char* argv[])
> int wmain(int argc, wchar_t* argv[])
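
The macro mechanism can be sketched portably as follows. Everything prefixed SKETCH_ or sketch_ is a hypothetical stand-in for the real Windows names (TCHAR, the _T()/TEXT() literal macros and the UNICODE/_UNICODE switches), and the file name "sx.cfg" is only an illustrative value. The point is that the character type and the literal prefix switch together, so the same source compiles in both modes.

```cpp
#include <string>

// Hypothetical stand-ins for the Windows TCHAR machinery.
// Defining SKETCH_UNICODE plays the role of defining UNICODE/_UNICODE.
#ifdef SKETCH_UNICODE
typedef wchar_t sketch_char;
#define SKETCH_TEXT(literal) L##literal   // "x" becomes L"x"
#else
typedef char sketch_char;
#define SKETCH_TEXT(literal) literal      // "x" stays "x"
#endif

// String type that follows the selected character type.
typedef std::basic_string<sketch_char> sketch_string;

// Compiles in either mode because the literal and the string type are
// chosen by the same preprocessor switch.
sketch_string default_config_name()
{
    return SKETCH_TEXT("sx.cfg");
}
```

Code that hard-codes char where the macro-selected type is expected is what breaks once the Wide mode is switched on.
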
>
> Generally what I see in Linux-centric projects that purport to be
> cross-platform is something like this:
>
> int main(int argc, char* argv[])
> {
>     ...
>     CreateProcess(argv[1], ... )
> }
>
> If _UNICODE is defined then this will not compile, since CreateProcess
> compiles as CreateProcessW and char != wchar_t. This is the first problem
> that I ran into with libbitcoin projects.
>
> If _UNICODE is not defined then it will compile but fail localization
> tests that assume Unicode support, since CreateProcess compiles as
> CreateProcessA and CreateProcessA interprets char as ANSI in the current
> code page, not UTF-8.
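
A rough illustration of that failure mode, assuming ISO-8859-1 as a stand-in for an ANSI code page (Windows-1252 agrees with it for bytes 0xA0 to 0xFF). The function decode_latin1 is hypothetical; it mimics what a single-byte code page interpretation does to bytes that were written as UTF-8.

```cpp
#include <string>

// Hypothetical helper: interprets each byte as one ISO-8859-1 character,
// which maps byte value N directly to code point U+00NN.
std::u32string decode_latin1(const std::string& bytes)
{
    std::u32string code_points;
    for (unsigned char byte : bytes)
        code_points.push_back(static_cast<char32_t>(byte));
    return code_points;
}
```

The UTF-8 bytes "\xC3\xA9" encode the single character U+00E9, but under this interpretation they become the two characters U+00C3 and U+00A9, so a path or argument containing them no longer names the same thing.
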
>
> My workaround was to fix the compile issue, but the localization problem
> remains. One of the annoying issues is that libconfig is not designed
> for Unicode. It is documented to (sort of) work with UTF-8:
>
> "1.5 Internationalization Issues
>
> Libconfig does not natively support Unicode configuration files, but
> string values may contain Unicode text encoded in UTF-8; such strings
> will be treated as ordinary 8-bit ASCII text by the library. It is the
> responsibility of the calling program to perform the necessary
> conversions to/from wide (wchar_t) strings using the wide string
> conversion functions such as mbsrtowcs() and wcsrtombs() or the iconv()
> function of the libiconv library.
>
> The textual representation of a floating point value varies by locale.
> However, the libconfig grammar specifies that floating point values are
> represented using a period (‘.’) as the radix symbol; this is consistent
> with the grammar of most programming languages. When a configuration is
> read in or written out, libconfig temporarily changes the LC_NUMERIC
> category of the locale of the calling thread to the “C” locale to ensure
> consistent handling of floating point values regardless of the locale(s)
> in use by the calling program.
>
> Note that the MinGW environment does not (as of this writing) provide
> functions for changing the locale of the calling thread. Therefore, when
> using libconfig in that environment, the calling program is responsible
> for changing the LC_NUMERIC category of the locale to the "C" locale
> before reading or writing a configuration."
>
> http://www.hyperrealm.com/libconfig/libconfig.pdf
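
For the MinGW case in the quoted passage, the caller-side step might look like the sketch below. The class numeric_c_locale and the function parse_config_double are hypothetical names, and std::strtod stands in for the actual libconfig read or write call, which is not shown. Note that std::setlocale changes the locale for the whole process rather than for the calling thread, so this is only safe if no other thread depends on LC_NUMERIC at the same time.

```cpp
#include <clocale>
#include <cstdlib>
#include <string>

// Hypothetical RAII guard: switches LC_NUMERIC to "C" for its lifetime and
// restores the previous setting afterwards.
// Assumption: process-wide locale changes are acceptable to the caller.
class numeric_c_locale
{
public:
    numeric_c_locale()
    {
        // Query the current setting first so it can be restored later.
        const char* current = std::setlocale(LC_NUMERIC, nullptr);
        if (current != nullptr)
            saved_ = current;
        std::setlocale(LC_NUMERIC, "C");
    }

    ~numeric_c_locale()
    {
        if (!saved_.empty())
            std::setlocale(LC_NUMERIC, saved_.c_str());
    }

    numeric_c_locale(const numeric_c_locale&) = delete;
    numeric_c_locale& operator=(const numeric_c_locale&) = delete;

private:
    std::string saved_;
};

// Stand-in for "reading a configuration": parses a value that uses '.' as
// the radix symbol regardless of the locale in effect before the call.
double parse_config_double(const std::string& text)
{
    numeric_c_locale guard;
    return std::strtod(text.c_str(), nullptr);
}
```

In real use the guard would wrap the libconfig read or write instead of the strtod call.
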
>
> But I believe this means that it cannot properly support Unicode path
> names on non-UTF-8 platforms (which is a common issue I've seen). In any
> case, the translation mentioned above is required for the Unicode
> support that it does provide. Personally I would much prefer XML-based
> config to libconfig, in part because of strong support for encodings and
> type conversions.
>
> Boost provides tstring and tpath to allow for conditional compilation in
> support of cross-platform code. I used this narrowly but when working
> with non-ASCII text in cross-platform code we need to either make the
> decision to represent internally as UTF-8 and translate on the edges
> to/from UTF-16 or to natively compile to UTF-16. The translation is ugly
> and costly, and certainly more error-prone due to lack of compiler type
> checking.
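
The "UTF-8 internally, translate on the edges" option could be sketched like this. The function utf8_to_utf16 is a hypothetical, hand-written converter that returns std::u16string (assumed here to be layout-compatible with the 2 byte Windows wchar_t strings the Wide APIs expect); it is not a libbitcoin or Boost function, and it deliberately skips some validation (overlong encodings and encoded surrogates are not rejected).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical edge conversion: UTF-8 in, UTF-16 out.
// Throws std::invalid_argument on malformed or truncated input.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::u16string utf16;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t code_point = 0;
        std::size_t trailing = 0;

        // The lead byte determines how many continuation bytes follow.
        if (lead < 0x80) { code_point = lead; }
        else if ((lead & 0xE0) == 0xC0) { code_point = lead & 0x1F; trailing = 1; }
        else if ((lead & 0xF0) == 0xE0) { code_point = lead & 0x0F; trailing = 2; }
        else if ((lead & 0xF8) == 0xF0) { code_point = lead & 0x07; trailing = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (trailing > utf8.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k <= trailing; ++k)
        {
            const unsigned char next = static_cast<unsigned char>(utf8[i + k]);
            if ((next & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            code_point = (code_point << 6) | (next & 0x3F);
        }
        i += trailing + 1;

        if (code_point > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (code_point < 0x10000)
        {
            utf16.push_back(static_cast<char16_t>(code_point));
        }
        else
        {
            // Characters above U+FFFF become a surrogate pair.
            code_point -= 0x10000;
            utf16.push_back(static_cast<char16_t>(0xD800 + (code_point >> 10)));
            utf16.push_back(static_cast<char16_t>(0xDC00 + (code_point & 0x3FF)));
        }
    }
    return utf16;
}
```

Even this small converter shows the cost mentioned above: the compiler cannot tell a UTF-8 std::string from an ANSI one, so every call site at the boundary has to be converted by hand.
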
>
> e
>
> -----Original Message-----
> From: Libbitcoin [mailto:libbitcoin-bounces@lists.dyne.org] On Behalf Of
> Amir Taaki
> Sent: Tuesday, April 15, 2014 7:23 AM
> To: Libbitcoin@???
> Subject: [Libbitcoin] config file stuff @evoskuil
>
> hey,
>
> The tstring/config file stuff doesn't work on Linux. Can we not use boost
> paths (which convert agnostically between platforms) on Windows?
>
> https://github.com/spesmilo/sx/blob/master/src/config.cpp
>
> btw I try to avoid using the ternary operator (like goto) because I like
> code to be explicit and readable. Anyway it doesn't matter in SX though
> (but would in libbitcoin).
>
> If I know some more then we can collapse the load_config() code so there
> isn't that ifdef.
>
> _______________________________________________
> Libbitcoin mailing list
> Libbitcoin@???
> https://mailinglists.dyne.org/cgi-bin/mailman/listinfo/libbitcoin