This is an update on libbitcoin-server work. The 2.0 release went out,
but as I noted we still had one bug. Several of us have been working to
clearly define and resolve the issue.
I’ve defined a pending set of 2.1 releases (server/node/blockchain) to
coincide with a libbitcoin 2.9 release. There are already a number of
changes merged for this update. This note covers the reasons for those
changes and the current status of the issue.
Without a resolution I can’t say for certain what the issue is in
libbitcoin-server. However, I'll give you my best assessment based on what
I've seen and what others have reported.
First, for context, there have been issues stretching back for a few
months. As you know, Obelisk was built on LevelDB, and libbitcoin-server
was the first release of Amir's blockchain replacement. This change
coincided with a major refactoring of the libbitcoin + Obelisk codebase
into libbitcoin, libbitcoin-blockchain, libbitcoin-node and
libbitcoin-server.
Following the completion of the blockchain work and the first ability to
sync the server, we were seeing two issues: a periodic segfault of the
server, and sync stalls that often corresponded to large memory and/or CPU
resource consumption.
Phillip and I spent a good amount of time reviewing the blockchain
implementation. Given that the blockchain is essentially sequential in
its operations, it’s fairly straightforward to debug through it.
Eventually we found an "off-by-one" error (off by one word) in one of the
indexes. Correcting it seems to have eliminated the segfault and the other
unpredictable behavior resulting from the error:
https://github.com/libbitcoin/libbitcoin-blockchain/commit/081472609da207d07f9f5e2c14af4094018ca9fe
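To illustrate the class of bug (this is a made-up sketch, not the actual
index code in the commit above), an index offset that is off by one word
looks roughly like this:

// Hypothetical illustration only; not the actual blockchain index code.
#include <cstddef>
#include <cstdint>

static const size_t word_size = sizeof(uint32_t);

// Correct: the nth record begins n * record_size bytes past the header.
size_t record_offset(size_t header_size, size_t record_size, size_t n)
{
    return header_size + n * record_size;
}

// Off by one word: every lookup reads into the wrong part of a record,
// producing exactly the kind of unpredictable failures described above.
size_t broken_record_offset(size_t header_size, size_t record_size, size_t n)
{
    return header_size + n * record_size + word_size;
}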
William located and fixed an issue precipitated by a change in the
satoshi client 0.10.0 release:
https://github.com/libbitcoin/libbitcoin/commit/bb430e2f210a165c410c4a948a7626ff3a5a6497
Subsequently I integrated libbitcoin-consensus as a default build
option. This started causing problems when the sync progressed beyond a
certain point. Since it was only happening with the consensus lib, this
was easy to track down. I had misparameterized the call to the library:
https://github.com/libbitcoin/libbitcoin-blockchain/commit/4e41ba7a5c884c0e3268d90ea4f2a838e7fe7cb4#diff-163d5cf7d56a2dee6a59490f0e0da0a2R81
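For context, the integration boils down to handing the serialized
transaction, the previous output script, the input index and the
verification flags to the consensus library's verify_script entry point.
The sketch below is from memory and only approximates the
libbitcoin-consensus interface, so treat the exact names and signature as
assumptions:

#include <vector>
#include <bitcoin/consensus.hpp>

// Approximate sketch of a script verification call (signature from memory).
// tx_data: wire-serialized transaction; prevout_script: script of the output
// being spent; input_index: the input under validation; flags: consensus
// verification flags (such as P2SH enforcement).
bool check_input(const std::vector<unsigned char>& tx_data,
    const std::vector<unsigned char>& prevout_script,
    unsigned int input_index, unsigned int flags)
{
    const auto result = libbitcoin::consensus::verify_script(
        tx_data.data(), tx_data.size(),
        prevout_script.data(), prevout_script.size(),
        input_index, flags);

    return result == libbitcoin::consensus::verify_result_eval_true;
}

Getting any one of those parameters wrong only shows up once the sync
reaches scripts that exercise the affected path, which is presumably why
the problem surfaced only beyond a certain height.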
Following the resolution of these three issues (which as you can see
were tiny in terms of code), the stall issue remained. I have recently
made its resolution my top priority, and made some changes to improve
our ability to troubleshoot it.
I’ve introduced thread priority management, which has resolved the CPU
consumption issue. This was really necessary even without the stall, but
it hadn’t yet been a priority. The implementation is simple and works
well. The common code is in libbitcoin and has been applied to both node
and server console apps:
https://github.com/evoskuil/libbitcoin/commit/9e02a98f5e430e79be8cb579dc853d19ad615de1
https://github.com/libbitcoin/libbitcoin-node/commit/cb6fcd717e140cb7e76fb2e1a1e4a12bec11e9b9
https://github.com/libbitcoin/libbitcoin-server/commit/c62f0cac5c86148ac1deb7db6dd293343ebb4d50#diff-56554a1b07b84a375a5340c28dcd9e00R80
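For anyone curious, the idea is simply to drop the priority of the heavy
worker threads so validation and disk work yield to the rest of the system.
A minimal sketch of that approach follows; it is not the libbitcoin code
itself, only the platform calls involved:

// Sketch of lowering the calling thread's priority; illustrative only.
#ifdef _WIN32
    #include <windows.h>
#else
    #include <sys/resource.h>
#endif

void lower_current_thread_priority()
{
#ifdef _WIN32
    // Affects only the calling thread.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
#else
    // On Linux the nice value is per-thread; a 'who' of zero targets the caller.
    setpriority(PRIO_PROCESS, 0, 10);
#endif
}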
In order to more narrowly define the problem, I also spent some time
updating libbitcoin-node. The node binary is now ‘bn’ and the bx, bn,
and bs build configurations are rationalized. bn accepts a few command
line args, including --initchain, --version and --help. The node's settings
are currently hard-wired and there is no configuration file. However,
full_node.hpp now accepts the full range of node-related configuration
parameters (the values below reflect the original server defaults):
#define BN_P2P_CONNECTIONS 8
#define BN_P2P_HOSTS 1000
#define BN_P2P_ORPHAN_POOL 20
#define BN_P2P_TX_POOL 2000
#define BN_THREADS_DISK 6
#define BN_THREADS_MEMORY 1
#define BN_THREADS_NETWORK 1
#define BN_HISTORY_START 0
#define BN_HOSTS_FILENAME "hosts"
#define BN_DIRECTORY "blockchain"
https://github.com/libbitcoin/libbitcoin-node/blob/master/include/bitcoin/node/full_node.hpp#L34
bn removes the potential interaction of the Obelisk protocol implementation
from the problem; however, it reproduces exactly the same stall issue as
bs, so I recommend the node for sync testing at this point. I've also
modified libbitcoin-server so that it builds more cleanly on
libbitcoin-node (shared logging, for example). At some point I'll derive
server_node from full_node, pass a derived configuration class from server
to node, and integrate command line and configuration settings into node,
so that the server is truly a layer over the node. However, at this point
bn is every bit as robust as bs while it is running. The bn console also
allows you to type in a bitcoin address, for which it will fetch history
from the blockchain (bs doesn't do this).
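The planned layering looks roughly like this; full_node exists today, while
server_node and the configuration type shown are only a sketch of the
intent:

// Sketch of the intended server-over-node layering; names other than
// full_node are hypothetical.
#include <bitcoin/node.hpp>

// Server-specific settings would extend the node settings.
struct server_configuration
{
    // Query/subscription endpoints, certificates, and so on.
};

// The server becomes a thin layer over the node.
class server_node
  : public libbitcoin::node::full_node
{
public:
    explicit server_node(const server_configuration& configuration);

    // Adds the query/subscribe services on top of the node's sync behavior.
};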
In order to ensure we weren't just looking at a core performance issue
that affects the mmap on HDDs, I purchased a couple of SSDs and installed
them on my Windows and Linux test platforms. This significantly improves
disk performance, and makes testing much faster, as the startup and
shutdown times are also much improved.
I also theorized that the parameterization of the various services used
to build the server may not be optimal. There is an outstanding issue on
sync flood resulting from the filling of the orphan pool, where the pool
is hard-wired to a 20-block circular buffer. So I refactored libbitcoin,
libbitcoin-blockchain, libbitcoin-node and libbitcoin-server to allow
all parameterization to be injected via construction, all the way up to
the full_node class. Eventually I’ll pull the additional options out to
the config file as well. But this made it easy to vary the parameters
that control resource allocation.
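To show the shape of the change (the class and parameter list below are
abbreviated and hypothetical, not the real full_node signature), the
resource-related settings now flow down through constructors instead of
being read from constants deep in the stack:

// Hypothetical, abbreviated sketch of constructor injection.
#include <cstddef>
#include <string>

class node_sketch
{
public:
    node_sketch(size_t connections, size_t orphan_pool_capacity,
        size_t tx_pool_capacity, size_t disk_threads,
        const std::string& blockchain_directory)
      : connections_(connections),
        orphan_pool_capacity_(orphan_pool_capacity),
        tx_pool_capacity_(tx_pool_capacity),
        disk_threads_(disk_threads),
        blockchain_directory_(blockchain_directory)
    {
        // Each subordinate service (network, pools, blockchain) would be
        // constructed from these values, so an experiment only means changing
        // the caller, e.g. node_sketch node(8, 1000, 10000, 6, "blockchain");
    }

private:
    size_t connections_;
    size_t orphan_pool_capacity_;
    size_t tx_pool_capacity_;
    size_t disk_threads_;
    std::string blockchain_directory_;
};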
After varying each parameter independently, and several together, it’s
clear that each has an impact, but interestingly the resource usage does
not vary significantly when the numbers are raised to very high levels
(such as a 1,000-block orphan buffer and a 10,000-transaction memory
pool). I also varied
thread count, which does interesting but expected things to console
output, and has some impact on performance. Varying max peer connections
also matters. But the bottom line is that nothing configurable prevents
the stall. Stopping and restarting the service clears the stall immediately
in many cases, but in others it builds up again right away. I've found this
to be the case on both my Linux and Windows platforms on mainnet, at around
block 337,500.
I’ve seen memory consumption rise almost linearly, which looks a lot
like an infinite loop or recursion. Usually, on its own, this stops and
sometimes restarts. So it feels like there is a feedback loop across
threads that is terminating based on a race condition (so it can
eventually clear itself). That’s some real speculation, but this is the
best I have without better diagnostics. I've also seen the process get
terminated by the OS once, before it had a chance to recover.
It's also possible that we are behind on the network protocol, which could
be complicating the issue. Presently Neill is reviewing the protocol so
we can resolve any issues we may have communicating with satoshi nodes.
Our logs report quite a bit, but they don’t indicate any issues, so
again diagnostics would help.
Finally, we have no unit tests against the network stack. This includes
libbitcoin::network, libbitcoin-node and libbitcoin-server.
libbitcoin-blockchain has over 80% coverage, though I do not believe this
is a blockchain issue, and while libbitcoin is also over 80%, none of that
coverage hits the 'network' namespace. So working in the networking code is
quite hazardous: a regression can be introduced whose only means of
detection might be synchronizing the blockchain, and since testnet does
sync, even that is clearly insufficient.
As such I’ve recently spent some time to integrate test coverage
execution and reporting into the builds via libbitcoin-build. We now
have one of the Travis builds generating coverage reports and Coveralls
reporting integrated with GitHub for all repos.
https://github.com/libbitcoin/libbitcoin-build
I’ve added a few trivial tests in libbitcoin-node and will add some in
libbitcoin-server, just to get things started. This required the recent
refactoring of libbitcoin-server into library/console/test outputs. The
above numbers are actually quite good apart from the network stack.
libbitcoin-client appears deceptively low; it is, however, well covered
indirectly through the libbitcoin-explorer network tests.
libbitcoin-explorer is well covered, though thorough testing of its
primitives would get it well above 80%. So node and server are outliers,
and testing libbitcoin::network should get libbitcoin close to 90%
coverage. Help with test coverage is always welcome. From this point
forward we should reject any commits that lower the test bar, which I
have configured in the Coveralls-GitHub integration.
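For reference, the trivial tests I mentioned are just enough to stand up
the harness. Libbitcoin projects use Boost.Test, so they look roughly like
this (the suite and case names here are only examples):

// Placeholder test to establish the suite and coverage reporting; the real
// node/server tests will exercise actual behavior.
#define BOOST_TEST_MODULE node_tests
#include <boost/test/included/unit_test.hpp>

BOOST_AUTO_TEST_SUITE(full_node_tests)

BOOST_AUTO_TEST_CASE(full_node__construct__defaults__does_not_throw)
{
    // Trivial assertion so the suite builds, runs, and reports.
    BOOST_REQUIRE(true);
}

BOOST_AUTO_TEST_SUITE_END()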
So while Neill is working on the Bitcoin protocol review, I'm now working
on instrumenting the network stack so that we can get a view into
concurrency issues. Unit tests aren't good for this, and it's an essential
aspect of ongoing maintenance and performance tuning, so it is needed in
any case. I do believe that this will surface a concurrency/race issue, and
that we will develop a better process regarding protocol changes.
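The sort of instrumentation I have in mind is simple: tag log lines in the
network handlers with the thread id and elapsed time so interleavings and
stalls across threads become visible. A sketch of the idea, not the actual
libbitcoin logging code:

#include <chrono>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

// Logs entry and exit of a scope with the thread id and elapsed microseconds.
class scope_trace
{
public:
    explicit scope_trace(const std::string& name)
      : name_(name), start_(std::chrono::steady_clock::now())
    {
        log("enter");
    }

    ~scope_trace()
    {
        log("exit");
    }

private:
    void log(const char* event) const
    {
        const auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - start_).count();
        std::ostringstream line;
        line << "[" << std::this_thread::get_id() << "] " << name_ << " "
             << event << " +" << elapsed << "us\n";
        std::clog << line.str();
    }

    const std::string name_;
    const std::chrono::steady_clock::time_point start_;
};

// Usage inside a handler, for example:
// void handle_block(...) { scope_trace trace("handle_block"); ... }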