I've recently merged a change to libbitcoin-node that cleans up
logging. There were a few misleading log messages pertaining to blocks.
There are actually no errors and the behavior I see is normal, so I've
changed the descriptions and locations of the messages so they are more
obvious and appropriate. The only actual errors I ever see pertain to
connection failures, but nothing that appears unusual.
I think at this point that the stall is due to not getting a necessary
block, and the memory consumption results from accumulating subsequent
blocks into the orphan pool. Sometimes this is steep and other times
minimal. When the block is obtained the stall clears, but high memory
consumption can severely degrade performance and the process can get
killed before the stall clears. I'm going to add some instrumentation
now so I can monitor the orphan pool and block requests.
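
As a rough sketch of the kind of instrumentation I have in mind (the
names here are hypothetical, not actual libbitcoin-node interfaces), a
periodic snapshot of the orphan pool depth and outstanding block
requests should be enough to confirm or refute this theory:

    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <thread>

    // Hypothetical counters, updated by the block handling code.
    std::atomic<size_t> orphan_pool_size(0);
    std::atomic<size_t> pending_block_requests(0);

    // Log a snapshot every few seconds from a dedicated monitor thread.
    void monitor_pools()
    {
        while (true)
        {
            std::cout << "orphans: " << orphan_pool_size.load()
                << " pending block requests: "
                << pending_block_requests.load() << std::endl;
            std::this_thread::sleep_for(std::chrono::seconds(5));
        }
    }

    int main()
    {
        std::thread(monitor_pools).detach();

        // The node would run here; simulate some activity instead.
        orphan_pool_size = 12;
        pending_block_requests = 3;
        std::this_thread::sleep_for(std::chrono::seconds(6));
        return 0;
    }
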
e
On 05/24/2015 06:22 PM, Eric Voskuil wrote:
> This is an update on libbitcoin-server work. The 2.0 release went out,
> but as I noted we still had one bug. Several of us have been working to
> clearly define and resolve the issue.
>
> I’ve defined a pending set of 2.1 releases (server/node/blockchain) to
> coincide with a libbitcoin 2.9 release. There are already a number of
> changes merged for this update. This note is to update you on the
> reasons for the changes and the current status of the issue.
>
> Without a resolution I can’t say for certain what the issue is in
> libbitcoin-server. However, I’ll give you my best idea based on what I’ve
> seen and what others have reported.
>
> First, for context, there have been issues stretching back for a few
> months. As you know Obelisk was built on LevelDB and libbitcoin-server
> was the first release of Amir’s blockchain replacement. This change
> coincided with a major refactoring of the libbitcoin + Obelisk codebase
> into libbitcoin, libbitcoin-blockchain, libbitcoin-node and
> libbitcoin-server.
>
> Following the completion of the blockchain work and the first ability to
> sync the server, we were seeing two issues. There was a periodic segfault
> of the server and there were sync stalls that often corresponded to
> large memory and/or CPU resource consumption.
>
> Phillip and I spent a good amount of time reviewing the blockchain
> implementation. Given that the blockchain is essentially sequential in
> its operations, it’s fairly straightforward to debug through it.
> Eventually we found an “off-by-one” (word) error in one of the indexes.
> Correcting this issue seems to have eliminated the segfault and other
> unpredictable issues resulting from the error.
>
> https://github.com/libbitcoin/libbitcoin-blockchain/commit/081472609da207d07f9f5e2c14af4094018ca9fe
>
>
> William located and fixed an issue precipitated by a change in the
> satoshi client 0.10.0 release:
>
> https://github.com/libbitcoin/libbitcoin/commit/bb430e2f210a165c410c4a948a7626ff3a5a6497
>
> Subsequently I integrated libbitcoin-consensus as a default build
> option. This started causing problems when the sync progressed beyond a
> certain point. Since it was only happening with the consensus lib, this
> was easy to track down. I had misparameterized the call to the library:
>
> https://github.com/libbitcoin/libbitcoin-blockchain/commit/4e41ba7a5c884c0e3268d90ea4f2a838e7fe7cb4#diff-163d5cf7d56a2dee6a59490f0e0da0a2R81
>
>
> Following the resolution of these three issues (which as you can see
> were tiny in terms of code), the stall issue remained. I have recently
> made its resolution my top priority, and made some changes to improve
> our ability to troubleshoot it.
>
> I’ve introduced thread priority management, which has resolved the CPU
> consumption issue. This was really necessary even without the stall, but
> it hadn’t yet been a priority. The implementation is simple and works
> well. The common code is in libbitcoin and has been applied to both node
> and server console apps:
>
> https://github.com/evoskuil/libbitcoin/commit/9e02a98f5e430e79be8cb579dc853d19ad615de1
>
>
> https://github.com/libbitcoin/libbitcoin-node/commit/cb6fcd717e140cb7e76fb2e1a1e4a12bec11e9b9
>
>
> https://github.com/libbitcoin/libbitcoin-server/commit/c62f0cac5c86148ac1deb7db6dd293343ebb4d50#diff-56554a1b07b84a375a5340c28dcd9e00R80
>
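> The following is just a minimal sketch of the general approach, assuming
> a Linux target where per-thread niceness can be set with setpriority()
> (it is not the libbitcoin code itself):
>
>     #include <sys/resource.h>
>     #include <sys/syscall.h>
>     #include <unistd.h>
>     #include <thread>
>
>     // Lower the calling thread's priority so background block import
>     // yields CPU to network and query threads (Linux-specific).
>     static void set_background_priority()
>     {
>         const pid_t tid = static_cast<pid_t>(::syscall(SYS_gettid));
>         ::setpriority(PRIO_PROCESS, static_cast<id_t>(tid), 10);
>     }
>
>     int main()
>     {
>         std::thread worker([]()
>         {
>             set_background_priority();
>             // ... block validation / import work would run here ...
>         });
>
>         worker.join();
>         return 0;
>     }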
>
> In order to more narrowly define the problem, I also spent some time
> updating libbitcoin-node. The node binary is now ‘bn’ and the bx, bn,
> and bs build configurations are rationalized. bn accepts a few command
> line args, including --initchain, --version and --help. Settings are
> currently hard-wired and there is no configuration file. However,
> full_node.hpp now accepts the full range of node-related configuration
> parameters (below reflects the original server defaults):
>
> #define BN_P2P_CONNECTIONS 8
> #define BN_P2P_HOSTS 1000
> #define BN_P2P_ORPHAN_POOL 20
> #define BN_P2P_TX_POOL 2000
> #define BN_THREADS_DISK 6
> #define BN_THREADS_MEMORY 1
> #define BN_THREADS_NETWORK 1
> #define BN_HISTORY_START 0
> #define BN_HOSTS_FILENAME "hosts"
> #define BN_DIRECTORY "blockchain"
>
> https://github.com/libbitcoin/libbitcoin-node/blob/master/include/bitcoin/node/full_node.hpp#L34
>
>
> bn eliminates the potential interaction of the Obelisk protocol
> implementation from the problem; however, it reproduces the exact stall
> issue that bs does. So I recommend the node for sync testing at this
> point. I've also modified libbitcoin-server so that it builds more
> cleanly on libbitcoin-node (shared logging, for example). At some point
> I'll derive server_node from full_node, pass a derived configuration
> class from server to node, and integrate command line and configuration
> settings into node, so that server is truly a layer over node. However,
> at this point bn is every bit as robust as bs while it is running. The
> bn console also allows you to type in a bitcoin address for which it
> will fetch history from the blockchain (bs doesn't do this).
>
> In order to ensure we weren’t just looking at a core performance issue
> that affects the mmap on HDDs, I purchased a couple of SSDs and installed
> them on my Windows and Linux test platforms. This significantly improves
> disk performance, and makes testing much faster, as the startup and
> shutdown times are also much improved.
>
> I also theorized that the parameterization of the various services used
> to build the server might not be optimal. There is an outstanding issue on
> sync flood resulting from the filling of the orphan pool, where the pool
> is hard-wired to a 20-block circular buffer. So I refactored libbitcoin,
> libbitcoin-blockchain, libbitcoin-node and libbitcoin-server to allow
> all parameterization to be injected via construction, all the way up to
> the full_node class. Eventually I’ll pull the additional options out to
> the config file as well. But this made it easy to vary the parameters
> that control resource allocation.
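>
> As a purely hypothetical illustration of the constructor-injection idea
> (the real full_node interface may differ), the shape is roughly:
>
>     #include <cstddef>
>
>     // Hypothetical settings bag; names mirror the defines above.
>     struct node_settings
>     {
>         std::size_t p2p_connections = 8;
>         std::size_t orphan_pool_capacity = 20;
>         std::size_t tx_pool_capacity = 2000;
>         std::size_t disk_threads = 6;
>     };
>
>     // Everything is injected at construction, so an experiment only
>     // requires changing the values passed in.
>     class node_example
>     {
>     public:
>         explicit node_example(const node_settings& settings)
>           : settings_(settings)
>         {
>         }
>
>     private:
>         node_settings settings_;
>     };
>
>     int main()
>     {
>         node_settings experiment;
>         experiment.orphan_pool_capacity = 1000;
>         experiment.tx_pool_capacity = 10000;
>         node_example node(experiment);
>         return 0;
>     }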
>
> After varying each parameter independently, and several together, it’s
> clear that each has an impact, but interestingly the resource usage does
> not vary significantly when the numbers are raised to very high levels
> (such as a 1000-block orphan buffer and a 10,000-tx mempool). I also varied
> thread count, which does interesting but expected things to console
> output, and has some impact on performance. Varying max peer connections
> also matters. But the bottom line is that nothing configurable prevents
> the stall. Stopping and restarting the service clears the stall
> immediately in many cases, but in some cases it builds up again right
> away. I’ve found this to be the case on both my Linux and Windows
> platforms on mainnet at around block 337,500.
>
> I’ve seen memory consumption rise almost linearly, which looks a lot
> like an infinite loop or recursion. Usually, on its own, this stops and
> sometimes restarts. So it feels like there is a feedback loop across
> threads that is terminating based on a race condition (so it can
> eventually clear itself). That’s some real speculation, but this is the
> best I have without better diagnostics. I've also once seen the process
> get terminated by the OS before it had a chance to recover.
>
> It’s also possible that we are behind on the network protocol, which could
> be complicating the issue. Presently Neill is reviewing the protocol so
> we can resolve any issues we may have communicating with satoshi nodes.
> Our logs report quite a bit, but they don’t indicate any issues, so
> again diagnostics would help.
>
> Finally, we have no unit tests against the network stack. This includes
> libbitcoin::network, libbitcoin-node and libbitcoin-server.
> libbitcoin-blockchain has over 80% coverage, though I do not believe
> this is a blockchain issue, and even though libbitcoin is also over 80%,
> none of it hits the ‘network’ namespace. So working in the networking
> code is quite hazardous, as a regression can be introduced where the
> only means of detection might be synchronizing the blockchain. And since
> testnet does sync, even that is clearly insufficient.
>
> As such I’ve recently spent some time to integrate test coverage
> execution and reporting into the builds via libbitcoin-build. We now
> have one of the Travis builds generating coverage reports and Coveralls
> reporting integrated with GitHub for all repos.
>
> https://github.com/libbitcoin/libbitcoin-build
>
> I’ve added a few trivial tests in libbitcoin-node and will add some in
> libbitcoin-server, just to get things started. This required the recent
> refactoring of libbitcoin-server into library/console/test outputs. The
> above numbers are actually quite good apart from the network stack.
> libbitcoin-client appears deceptively low. It is however well-covered
> indirectly through the libbitcoin-explorer network tests.
> libbitcoin-explorer is well covered, though thorough testing of its
> primitives would get it well above 80%. So node and server are outliers,
> and testing libbitcoin::network should get libbitcoin close to 90%
> coverage. Help with test coverage is always welcome. From this point
> forward we should reject any commits that lower the test bar, which I
> have configured in the Coveralls-GitHub integration.
>
> So while Neill is working on the Bitcoin protocol review, I'm now
> working on instrumenting the network stack so that we can get a view
> into concurrency issues. Unit tests aren't good for this, and it's an
> essential aspect of ongoing maintenance and performance tuning, so it's
> needed in any case. I do believe that this will surface a
> concurrency/race issue and that we will develop a better process
> regarding protocol changes.
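>
> To give a flavor of what I mean by instrumentation (hypothetical names,
> not the real libbitcoin::network interfaces), even simple atomic
> counters wrapped around the asynchronous operations would show where
> work piles up and from which threads:
>
>     #include <atomic>
>     #include <iostream>
>     #include <string>
>     #include <thread>
>
>     // Hypothetical counter tracking in-flight asynchronous operations.
>     class inflight_counter
>     {
>     public:
>         explicit inflight_counter(const std::string& name)
>           : name_(name), count_(0)
>         {
>         }
>
>         void enter()
>         {
>             log(++count_);
>         }
>
>         void exit()
>         {
>             log(--count_);
>         }
>
>     private:
>         void log(int value)
>         {
>             std::cout << name_ << " in-flight: " << value
>                 << " [thread " << std::this_thread::get_id() << "]"
>                 << std::endl;
>         }
>
>         std::string name_;
>         std::atomic<int> count_;
>     };
>
>     int main()
>     {
>         inflight_counter block_requests("block_request");
>         block_requests.enter();  // wrap the start of an async call
>         block_requests.exit();   // and its completion handler
>         return 0;
>     }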
>
> e
>