Author: Eric Voskuil
Date:
To: libbitcoin
Subject: [Libbitcoin] V3 database reliability
One of the most significant problems with maintaining a libbitcoin node/server has been hard shutdown recovery. Inevitably a host suffers a power failure or an uncontrolled maintenance restart. I've advised operators that database corruption *must* be assumed following any uncontrolled shutdown. Given long rebuild times and, under some conditions, the difficulty of detecting such a restart, this has made it very difficult to maintain quality of service.
The nature of the problem is a tradeoff between performance/simplicity and reliability. The database is a set of memory-mapped files that index blocks, txs, spends and addresses as hash tables, and block height as an array. Both hash tables and arrays offer constant-time lookup. With sufficient memory the entire structure becomes memory-resident; otherwise it is paged by the operating system as necessary.
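As a rough sketch of how the constant-time lookup works (illustrative layout and names only, not libbitcoin's actual code), a mapped hash table file can be read in place while the OS faults in only the pages that are touched:

    // Sketch only: fixed-size buckets; the real store chains collisions.
    #include <cstdint>
    #include <cstring>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct bucket { uint8_t key[32]; uint64_t value; };

    int main()
    {
        const int file = open("table.map", O_RDONLY);
        struct stat info{};
        if (file < 0 || fstat(file, &info) != 0) return 1;

        // Map the whole file; pages are loaded only when first touched.
        auto map = mmap(nullptr, info.st_size, PROT_READ, MAP_SHARED, file, 0);
        if (map == MAP_FAILED) return 1;
        const auto table = static_cast<const bucket*>(map);
        const auto buckets = info.st_size / sizeof(bucket);

        // Constant-time lookup: hash the key to a bucket index and read it.
        uint8_t key[32]{};                          // e.g. a tx hash
        uint64_t index;
        std::memcpy(&index, key, sizeof(index));    // toy hash for the sketch
        const bucket& hit = table[index % buckets];
        (void)hit;

        munmap(map, info.st_size);
        close(file);
    }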
This is near-perfect scaling, as more hardware means more speed and more data does not reduce speed. However there is a cost for moving updates from volatile RAM to "disk". Doing so for every update would defeat some of the performance advantage, and would still be subject to failure during the write. Currently writes are only guaranteed to be flushed to disk by controlled shutdown, at which time the memory maps are explicitly flushed. This is the source of the slight shutdown pause when the store is large. If this does not happen the store is corrupt and must be rebuilt.
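For reference, the controlled-shutdown flush amounts to an explicit msync of the dirty pages before the maps are released, roughly along these lines (illustrative helper, not the actual code):

    #include <cstddef>
    #include <sys/mman.h>

    // Push dirty pages to disk and release the map. MS_SYNC blocks until the
    // kernel has written them, which is the brief pause seen on shutdown when
    // the store is large.
    bool flush_and_close(void* map, std::size_t size)
    {
        if (msync(map, size, MS_SYNC) != 0)
            return false;

        return munmap(map, size) == 0;
    }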
The resolution is not to sacrifice performance but to manage the tradeoff. Some servers do not need microsecond block commit and others (miners) may need it. With a redundant server the potential loss of the store is immaterial. And with sufficient RAM v3 can rebuild from the network in 15 minutes. So a miner would care only to know if the store was corrupt, but not to trade performance in order to limit that possibility. On the other hand a merchant or wallet would readily accept millisecond block commit in exchange for a much lower chance of shutdown corruption.
In light of these considerations I've implemented a new config setting and related behaviors that optimize for these types of scenarios. The central concept is to provide a persistent indication (sentinel file) of an uncontrolled shutdown while the database is inconsistent (i.e. not flushed to disk). When the server/node starts up it will treat this sentinel as an indication of database corruption. The simple implementation would be to write this at startup and clear it at shutdown. This is essentially the default behavior (it's actually a little narrower than that), optimized for performance.
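The sentinel mechanics are roughly as follows (the file name and helpers are illustrative, not the actual implementation):

    #include <filesystem>
    #include <fstream>

    namespace fs = std::filesystem;
    const fs::path sentinel = "database/crash_marker";   // illustrative name

    // At startup: a leftover marker means the last run did not flush cleanly,
    // so the store must be assumed corrupt and rebuilt.
    bool store_corrupt() { return fs::exists(sentinel); }

    // Before the store becomes inconsistent (i.e. before unflushed writes).
    void mark_unsafe() { std::ofstream marker(sentinel); }

    // After the maps have been flushed (controlled shutdown, or after each
    // flushed write when the reliability setting is enabled).
    void mark_safe() { fs::remove(sentinel); }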
The configuration option affects the second phase of block acceptance (blocks are the only data written to the database). Initial block download (IBD) is extremely fast, with very little time between writes. There is no benefit to flushing between writes, as a new write immediately invalidates the last. So a hard shutdown during IBD will still be a corruption, but there is no performance hit. On the other hand catch-up sync (CUS) is fast, though slow in comparison to IBD, and as the chain becomes complete there are extremely long delays (~10 min) between writes.
The Boolean config option enables flush-to-disk following each CUS block write. So the sentinel file exists only during the write, which is typically sub-millisecond. The flush itself, however, takes on the order of 10 milliseconds. Even for mining operations this can be acceptable, since the block notification fires after validation and is not delayed for the flush (or even the write).
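Put together with the sentinel helpers sketched above, the per-block behavior with the option enabled looks roughly like this (illustrative types, not the actual API):

    void mark_unsafe();   // from the sentinel sketch above
    void mark_safe();

    struct store
    {
        bool flush_each_write = true;  // the new Boolean config option
        bool write_block() { /* append to the memory maps */ return true; }
        bool flush() { /* msync the maps, ~10 ms */ return true; }
    };

    // The sentinel exists only across the write and optional flush, so a hard
    // stop at any other moment leaves the store consistent on restart.
    bool commit_block(store& database)
    {
        mark_unsafe();

        const auto ok = database.write_block();   // typically sub-millisecond

        if (ok && database.flush_each_write)
            database.flush();

        mark_safe();
        return ok;
    }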
With this setting enabled it's very difficult to (even intentionally) hard shutdown while leaving the sentinel file. The downside is that CUS will be noticeably slower until fully synced. This can be mitigated by either using a high checkpoint or by restarting with the setting enabled after getting caught up. This is implemented in latest master and can be used with CUS by setting a single checkpoint (block 0), given that IBD is currently disabled. Node (vs. server) is currently recommended.
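In the config file this would look something like the following; the exact section and setting names shown here are assumptions (check the packaged config for the real names), and the checkpoint is the mainnet genesis block:

    [database]
    # Flush the memory maps to disk after each CUS block write (assumed name).
    flush_writes = true

    [blockchain]
    # A single checkpoint at block 0 (mainnet genesis).
    checkpoint = 000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f:0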
In any configuration it will now be impossible to unknowingly run against a corrupted database unless one manually purges the sentinel file or there is an insidious hardware fault that trashes data. [I'm planning to add a validate-on-startup option (mainly for benchmarking validation and store performance independent of the network/peers) that could be used to confirm a store, but I really don't see any reason for doing so.]