:: Re: [Libbitcoin] config file stuff …
Author: Eric Voskuil
Date:  
To: 'Amir Taaki', Libbitcoin
Subject: Re: [Libbitcoin] config file stuff @evoskuil
Nice, I'm on it. Also, I'm almost done with summarizing our meeting notes and will place them on the unsystem wiki.

e

-----Original Message-----
From: Libbitcoin [mailto:libbitcoin-bounces@lists.dyne.org] On Behalf Of Amir Taaki
Sent: Tuesday, April 15, 2014 12:54 PM
To: Libbitcoin@???
Subject: Re: [Libbitcoin] config file stuff @evoskuil

btw andreas wants to have a libbitcoin section + use the sx tools for
examples throughout the book about bitcoin.

if we can make a single binary it can be distributed with the cd.

http://shop.oreilly.com/product/0636920032281.do

> No worries, Eric. Sounds annoying, and that's a lot for me to digest for a
> simple issue. Just trying to find an equitable resolution since that code
> was failing on linux:
>
> https://bitcointalk.org/index.php?topic=259999.msg6099554#msg6099554
>
> Either way we can leave it as is. I don't mind at all. Thanks for your
> response.
>
>> Amir,
>>
>> Linux natively uses UTF-8 character encoding and other systems generally
>> do not. UTF-8 began as a computationally inefficient but compact
>> serialization format that had the advantage of ASCII isomorphism. Choices
>> made by various systems are evolutionary and Linux is the newcomer. OSX
>> didn't have full Unicode support until v10.2. Looking at a good number of
>> purportedly cross-platform projects, I see that quite commonly these
>> differences are not taken into account. Even though such code compiles, it
>> will be littered with cross-platform bugs. These bugs go unseen by people
>> working primarily with ASCII test data (since by design nearly all code
>> pages are isomorphic with ASCII in the first 7 bits), or with ANSI code
>> page text that happens to match the default process code page (luck).
>>
>> Cross-platform text support requires more effort than using safe string
>> and path classes. I wish it were that easy, but boost string and path are
>> not designed to solve Unicode cross-platform problems. The changes I made
>> do not solve the problems either (as I documented) but they work under the
>> test assumptions mentioned above and were minimally disruptive. Ideally we
>> should solve the problems properly. Currently there is not yet much text
>> manipulation in any of the projects that requires anything more than
>> ASCII, which is perfectly safe as char/string (e.g. hash manipulation).
>>
>> The problems arise from interaction with platform APIs and external
>> libraries. I'll use Windows as an example, but this is not limited to
>> Windows - it affects multiple flavors of Unix as well.
>>
>> When you compile for Linux, the 8 bit char is used to store a fraction of
>> a character. It can take up to 4 chars to store a single Unicode character
>> in UTF-8 (the original design allowed up to 6, but the encoding has since
>> been restricted to 4). This of course makes a class like boost::string
>> necessary, since indexing through an array of char is impossible without
>> knowledge of the code page. We should always be careful to never use array
>> manipulation of strings unless we have restricted the domain to ASCII.
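
To make the byte versus character distinction above concrete, here is a minimal sketch. The function names are hypothetical and the input is assumed to be well-formed UTF-8; nothing here comes from the sx or libbitcoin sources.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold well-formed UTF-8.
// Continuation bytes look like 10xxxxxx, so every other byte starts a character.
std::size_t utf8_length(const std::string& text)
{
    std::size_t count = 0;
    for (unsigned char byte : text)
        if ((byte & 0xC0) != 0x80)
            ++count;
    return count;
}

// Usage: byte count and character count differ once text leaves ASCII.
void utf8_length_demo()
{
    const std::string text = "h\xC3\xA9llo"; // "héllo" encoded as UTF-8
    assert(text.size() == 6);                // bytes
    assert(utf8_length(text) == 5);          // code points
}
```

Indexing text[1] in that example yields half of a character, which is why array manipulation is only safe once the domain is restricted to ASCII.
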
>>
>> When you compile for Windows the 8 bit char is also used to store a
>> fraction of a character, and technically there is no reason it couldn't be
>> UTF-8. However no Windows APIs expect UTF-8 input (except for UTF-8
>> conversion functions). Windows has for a long time fully supported Unicode
>> but uses the UTF-16 encoding, not UTF-8. Although UTF-16 is technically not
>> fixed at 2 bytes it was treated that way for a long time (much like early
>> C/C++ code treated ANSI code pages - assuming they were 1 byte). This made
>> UTF-16 index-able and therefore much more efficient from a processing
>> perspective. Windows APIs come in two flavors, ANSI and Wide.
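
The point that UTF-16 is not really fixed at 2 bytes can be sketched as follows. This is an illustration only, with hypothetical names, and it assumes well-formed UTF-16 (every surrogate is properly paired).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold well-formed UTF-16.
// Characters above U+FFFF occupy two code units (a surrogate pair), so the
// low (trailing) surrogate of each pair is skipped.
std::size_t utf16_length(const std::u16string& text)
{
    std::size_t count = 0;
    for (char16_t unit : text)
        if (unit < 0xDC00 || unit > 0xDFFF)
            ++count;
    return count;
}

// Usage: 'a' followed by U+1F600, which needs the pair D83D DE00.
void utf16_length_demo()
{
    const std::u16string text = { u'a', char16_t(0xD83D), char16_t(0xDE00) };
    assert(text.size() == 3);         // code units
    assert(utf16_length(text) == 2);  // code points
}
```

Code that treats each wchar_t as one character works until text outside the Basic Multilingual Plane shows up, the same kind of luck as ASCII-only test data.
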
>>
>> ANSI is legacy, meant to provide support for non-Unicode applications
>> (written using char). ANSI APIs accept char characters and char* strings
>> and interpret them in the context of the current code page of the current
>> thread. This defaults to the process code page, which defaults to the user
>> code page, which defaults to the OS code page. Mixing code pages cannot be
>> done in a single API call.
>>
>> Wide APIs accept UTF-16. That means characters must be represented in the
>> 2 byte wchar_t representation (a C integral type). Modern Windows-only code
>> uses wchar_t for just about everything and only uses the Wide APIs.
>>
>> In order to facilitate conditional compilation for both older ANSI
>> platforms and newer Unicode (Wide) platforms, the TCHAR macros were
>> developed. Additionally, all Windows APIs (and related types) that exist in
>> both flavors are represented by macros. Compilation is controlled by the
>> UNICODE (Windows headers) and _UNICODE (C runtime) preprocessor
>> definitions, which are normally defined together. The ANSI APIs generally
>> end in "A" and the Wide APIs end in "W", for example:
>>
>> CreateProcessA(char* name, ... )
>> CreateProcessW(wchar_t* name, ... )
>>
>> int main(int argc, char* argv[])
>> int wmain(int argc, wchar_t* argv[])
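
The macro dispatch described above can be imitated in portable code. This is only a sketch of the mechanism: DEMO_UNICODE, demo_char, DEMO_TEXT and the name_length functions are hypothetical stand-ins for the real switches, TCHAR, the text macro and an A/W API pair. They are not Windows names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical API pair, named in the A/W style.
std::size_t name_length_a(const char* name) { return std::strlen(name); }
std::size_t name_length_w(const wchar_t* name) { return std::wcslen(name); }

// One switch selects the character type, the literal prefix and the API flavor.
#ifdef DEMO_UNICODE
typedef wchar_t demo_char;
#define DEMO_TEXT(s) L##s
#define name_length name_length_w
#else
typedef char demo_char;
#define DEMO_TEXT(s) s
#define name_length name_length_a
#endif

// Source written against the macros compiles either way.
std::size_t config_name_length()
{
    const demo_char* name = DEMO_TEXT("sx.cfg");
    return name_length(name);
}

void dispatch_demo()
{
    assert(config_name_length() == 6);
}
```

Code that hard-codes char while calling the undecorated macro name only compiles for one setting of the switch.
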
>>
>> Generally what I see in Linux-centric projects that purport to be
>> cross-platform is something like this:
>>
>> int main(int argc, char* argv[])
>> {
>>     ...
>>     CreateProcess(argv[1], ... );
>> }
>>
>> If _UNICODE is defined then this will not compile, since CreateProcess
>> compiles as CreateProcessW and char != wchar_t. This is the first problem
>> that I ran into with libbitcoin projects.
>>
>> If _UNICODE is not defined then it will compile but fail localization
>> tests that assume Unicode support, since CreateProcess compiles as
>> CreateProcessA and CreateProcessA interprets char as ANSI in the current
>> code page, not UTF-8.
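
The localization failure comes down to the same bytes meaning different text under different interpretations. Here is a minimal sketch with no Windows API involved. It assumes ISO-8859-1 as a stand-in for an ANSI code page (chosen because its byte values equal Unicode code points), and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Interprets each byte as one character, as a single-byte code page does.
std::u32string decode_latin1(const std::string& bytes)
{
    std::u32string out;
    for (unsigned char byte : bytes)
        out.push_back(static_cast<char32_t>(byte));
    return out;
}

// Decodes UTF-8 sequences of one or two bytes only; enough for this example.
std::u32string decode_utf8_short(const std::string& bytes)
{
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        const unsigned char lead = bytes[i];
        if (lead < 0x80)
        {
            out.push_back(static_cast<char32_t>(lead));
            continue;
        }

        // Anything else must be a two byte sequence in this simplified sketch.
        assert((lead & 0xE0) == 0xC0 && i + 1 < bytes.size());
        const unsigned char next = bytes[++i];
        out.push_back(static_cast<char32_t>(((lead & 0x1F) << 6) | (next & 0x3F)));
    }
    return out;
}
```

Given the two bytes C3 A9, decode_utf8_short returns the single character U+00E9 (é) while decode_latin1 returns U+00C3 U+00A9 (Ã©). A path containing é would therefore name a different file.
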
>>
>> My workaround was to fix the compile issue, but the localization problem
>> remains. One of the annoying issues is that libconfig is not designed for
>> Unicode. It is documented to (sort of) work with UTF-8:
>>
>> "1.5 Internationalization Issues
>>
>> Libconfig does not natively support Unicode configuration files, but
>> string values may contain Unicode text encoded in UTF-8; such strings will
>> be treated as ordinary 8-bit ASCII text by the library. It is the
>> responsibility of the calling program to perform the necessary conversions
>> to/from wide (wchar_t) strings using the wide string conversion functions
>> such as mbsrtowcs() and wcsrtombs() or the iconv() function of the
>> libiconv library.
>>
>> The textual representation of a floating point value varies by locale.
>> However, the libconfig grammar specifies that floating point values are
>> represented using a period (‘.’) as the radix symbol; this is consistent
>> with the grammar of most programming languages. When a configuration is
>> read in or written out, libconfig temporarily changes the LC_NUMERIC
>> category of the locale of the calling thread to the “C” locale to ensure
>> consistent handling of floating point values regardless of the locale(s)
>> in use by the calling program.
>>
>> Note that the MinGW environment does not (as of this writing) provide
>> functions for changing the locale of the calling thread. Therefore, when
>> using libconfig in that environment, the calling program is responsible
>> for changing the LC_NUMERIC category of the locale to the "C" locale
>> before reading or writing a configuration."
>>
>> http://www.hyperrealm.com/libconfig/libconfig.pdf
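
For the MinGW case in that excerpt, the caller-side LC_NUMERIC switch might look roughly like this. It is a sketch using only the standard C library, with a strtod call standing in for the libconfig read. The function names are hypothetical, and std::setlocale changes the locale for the whole process rather than one thread.

```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>
#include <string>

// Parses a floating point value with '.' as the radix symbol, whatever
// LC_NUMERIC setting the program is currently using.
double parse_double_c_locale(const std::string& text)
{
    // Copy the current setting; the returned pointer may be invalidated by
    // the next setlocale call.
    const char* current = std::setlocale(LC_NUMERIC, nullptr);
    const std::string saved = current ? current : "C";

    std::setlocale(LC_NUMERIC, "C");
    const double value = std::strtod(text.c_str(), nullptr);

    // Restore whatever the caller had in effect.
    std::setlocale(LC_NUMERIC, saved.c_str());
    return value;
}

void parse_demo()
{
    assert(parse_double_c_locale("2.5") == 2.5);
}
```
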
>>
>> But I believe this means that it cannot properly support Unicode path
>> names on non-UTF-8 platforms (which is a common issue I've seen). In any
>> case, the translation mentioned above is required for the Unicode support
>> that it does provide. Personally I would much prefer XML-based config to
>> libconfig, in part because of strong support for encodings and type
>> conversions.
>>
>> Boost provides tstring and tpath to allow for conditional compilation in
>> support of cross-platform builds. I used this narrowly, but when working
>> with non-ASCII text across platforms we need to decide either to represent
>> text internally as UTF-8 and translate at the edges to/from UTF-16, or to
>> compile natively to UTF-16. The translation is ugly and costly, and
>> certainly more error-prone due to lack of compiler type checking.
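
For the first option (UTF-8 internally, translated at the edges), the edge conversion could be sketched as below. This is a hand-rolled illustration under stated simplifications, with a hypothetical function name. On Windows the usual route would be MultiByteToWideChar with CP_UTF8 rather than code like this.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Converts UTF-8 to UTF-16. Throws on truncated or malformed sequences.
// Simplification: overlong forms and encoded surrogates are not rejected.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::u16string out;
    for (std::size_t i = 0; i < utf8.size();)
    {
        const unsigned char lead = utf8[i++];
        char32_t code_point;
        std::size_t trailing;

        if (lead < 0x80)                { code_point = lead;        trailing = 0; }
        else if ((lead & 0xE0) == 0xC0) { code_point = lead & 0x1F; trailing = 1; }
        else if ((lead & 0xF0) == 0xE0) { code_point = lead & 0x0F; trailing = 2; }
        else if ((lead & 0xF8) == 0xF0) { code_point = lead & 0x07; trailing = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        for (; trailing > 0; --trailing)
        {
            if (i >= utf8.size())
                throw std::invalid_argument("truncated UTF-8 sequence");
            const unsigned char next = utf8[i++];
            if ((next & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            code_point = (code_point << 6) | (next & 0x3F);
        }

        if (code_point > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (code_point < 0x10000)
        {
            out.push_back(static_cast<char16_t>(code_point));
            continue;
        }

        // Characters above U+FFFF become a surrogate pair.
        const char32_t offset = code_point - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
    }
    return out;
}

void conversion_demo()
{
    const std::u16string text = utf8_to_utf16("h\xC3\xA9"); // "hé"
    assert(text.size() == 2);
    assert(text[1] == 0xE9);
}
```

The type checking concern shows up here too: both std::string and char* carry no indication of encoding, so nothing stops a caller from passing code page text into utf8_to_utf16.
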
>>
>> e
>>
>> -----Original Message-----
>> From: Libbitcoin [mailto:libbitcoin-bounces@lists.dyne.org] On Behalf Of
>> Amir Taaki
>> Sent: Tuesday, April 15, 2014 7:23 AM
>> To: Libbitcoin@???
>> Subject: [Libbitcoin] config file stuff @evoskuil
>>
>> hey,
>>
>> The tstring/config file stuff doesn't work on Linux. Can we not use boost
>> paths (which convert agnostically between platforms) on Windows?
>>
>> https://github.com/spesmilo/sx/blob/master/src/config.cpp
>>
>> btw I try to avoid using the ternary operator (like goto) because I like
>> code to be explicit and readable. Anyway it doesn't matter in SX though
>> (but would in libbitcoin).
>>
>> If I know some more then we can collapse the load_config() code so there
>> isn't that ifdef.
>>
>> _______________________________________________
>> Libbitcoin mailing list
>> Libbitcoin@???
>> https://mailinglists.dyne.org/cgi-bin/mailman/listinfo/libbitcoin
>>
>>
>
>


