Re: [Libbitcoin] Proper handling after unexpected shutdown?

Author: Eric Voskuil
Date:
To: mlmikael
CC: libbitcoin
Subject: Re: [Libbitcoin] Proper handling after unexpected shutdown?

> Now, what if the database is found to be [in]correct at runtime

The database could be determined to be *potentially* corrupted at
startup. This is not too hard, but not yet implemented. I've been
considering it for version3. In this implementation the node would
simply state that the store is corrupt and shut down.

It would be up to the administrator either to override this warning (by
deleting the sentinel file that indicates the corruption) or to delete
the store and (1) re-synchronize or (2) restore from backup.
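
A minimal sketch of the sentinel idea described above, written in Python
for brevity. It is an illustration only and not libbitcoin code: the file
name and the function names are hypothetical. The assumption is that the
sentinel is created before any store write and removed only on clean
shutdown, so finding it at startup means the previous run may have
stopped during a write.

```python
import os
import tempfile

SENTINEL_NAME = "write_in_progress"  # hypothetical name, not libbitcoin's


class StoreCorruptError(Exception):
    """Raised when the previous run did not shut down cleanly."""


def sentinel_path(store_dir):
    return os.path.join(store_dir, SENTINEL_NAME)


def open_store(store_dir):
    """Refuse to start if a sentinel from a previous run is still present."""
    path = sentinel_path(store_dir)
    if os.path.exists(path):
        raise StoreCorruptError(
            "store may be corrupt; delete %s to override, or rebuild" % path)
    # Create the sentinel before any store write. A real implementation
    # would also need to make the directory entry itself durable.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.fsync(fd)
    os.close(fd)


def close_store(store_dir):
    """Clean shutdown: all writes are done, so remove the sentinel."""
    os.remove(sentinel_path(store_dir))


def demo():
    """Return (clean_restart_ok, hard_shutdown_detected)."""
    with tempfile.TemporaryDirectory() as store_dir:
        open_store(store_dir)
        close_store(store_dir)   # clean shutdown removes the sentinel
        open_store(store_dir)    # so the next start succeeds
        clean_restart_ok = True
        # Simulate a hard shutdown: close_store() is never called.
        try:
            open_store(store_dir)
            detected = False
        except StoreCorruptError:
            detected = True
        return clean_restart_ok, detected
```

In this sketch the administrator's override is simply deleting the
sentinel file by hand, after which open_store() proceeds as usual.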

Once the node is running there is no potential for corruption short of
hardware failures (including hard shutdown) or software bugs. These
aren't things that can reasonably be guarded against at runtime.
Hardware faults should be monitored at a lower level, and software
quality is not a runtime issue.

> what logics and exceptions are in place to handle that, are there
> guarantees that LibBitcoin will not SIGSEGV, go into undefined behavior,
> etc.?

No software or hardware can implement such guarantees. Someone selling
software could financially compensate customers for such failures, but I
don't know anyone who does except possibly under MILSPEC contracts.

e

On 05/16/2016 09:25 PM, mlmikael wrote:
> Ah, that approach makes all sense.
>
> Now, what if the database is found to be [in]correct at runtime, what logics
> and exceptions are in place to handle that, are there guarantees that
> LibBitcoin will not SIGSEGV, go into undefined behavior, etc.?
>
> On 2016-05-17 00:23, Eric Voskuil wrote:
>> This would not be sufficient. It requires atomicity of the write of the
>> checksum and the data that has been summed. There is no facility to
>> guarantee that atomicity, which is the original problem. Furthermore
>> there can be inconsistency between two tables, so the atomicity needs to
>> span files at the same time as protecting writes to a single file.
>>
>> Implementing this sort of guarantee requires a significant amount of
>> overhead:
>>
>> https://en.wikipedia.org/wiki/Atomicity_(database_systems)#Implementation
>>
>> Given that the blockchain is merely a cache of public data, there is no
>> reason to suffer this overhead. Corruption can be detected and the cache
>> rebuilt. Optimization consists in preventing and reliably detecting the
>> corruption, deploying with redundancy, and optimizing the cache rebuild.
>>
>> Along with hash table indexing, this design decision is fundamental to
>> the version2 blockchain and material to its performance benefits.
>>
>> e
>>
>> On 05/16/2016 07:51 AM, mlmikael wrote:
>>> A thought -
>>>
>>> Additional robustness could be achieved by storing checksums of the
>>> involved data in the database files *and their location*, at even
>>> intervals, together with some kind of overarching checksum information
>>> that is written at "checkpoints" where the database is known to be in a
>>> consistent state.
>>>
>>> That way it would be possible to get a guarantee that the storage media
>>> has integrity in relation with LibBitcoin's logics (presuming ECC RAM
>>> and a watertight CPU), by just reading the whole database file to check
>>> that the overarching checksum is correct (and to be safe, individual
>>> checksums too).
>>>
>>> Would that be of value?
>>>
>>> On 2016-05-16 03:11, Eric Voskuil wrote:
>>>> It is not presently possible to know whether there is corruption when a
>>>> hard shutdown has occurred. On the other hand I'm not aware of any case
>>>> where a corruption has occurred apart from a hard shutdown.
>>>>
>>>> Validating the data would require hash validation against each
>>>> transaction and block. As for the indexes, it would probably be faster
>>>> to rebuild them than to validate them. In version2 it will probably be
>>>> as fast to rebuild as it would be to validate, assuming bandwidth is
>>>> not constrained and you have a checkpoint near the top. This is because
>>>> the cost for reading the entire store is basically the same as for
>>>> writing the entire store.
>>>>
>>>> For this reason I haven't been planning to implement store
>>>> validation/repair. On the other hand, it is very fast and easy to
>>>> detect at startup that a shutdown previously occurred during a write.
>>>> I have been planning to implement this detection in version2. The fix
>>>> would be to rebuild the store, which again shouldn't be slower than a
>>>> full validation.
>>>>
>>>> The store is very reliable if it is shut down properly. So I would
>>>> recommend the following precautions in a production environment:
>>>>
>>>> 1) As your chain grows, periodically add checkpoints to your
>>>> configuration settings file. Don't pick points too close to the top or
>>>> they could get reorganized out. If your block pool is 50 then 51 blocks
>>>> deep is entirely safe, since you can't reorganize deeper than that
>>>> anyway. These additions will significantly speed a rebuild from the
>>>> network. You could also rely on public sources, but this creates a
>>>> centralization risk.
>>>>
>>>> 2) Periodically shut down a server and copy the store files to another
>>>> directory on the same drive (or elsewhere). If you have a hard
>>>> shutdown, change settings to use the saved location. The updated
>>>> checkpoints from #1 will get you back to the top pretty quickly.
>>>>
>>>> 3) Maintain a second server on an independent device, using the same
>>>> procedures. Configure each to exclude the other as a peer so that any
>>>> corruption on one cannot affect the other. Having a second server will
>>>> allow you to keep running while performing #2.
>>>>
>>>> Step three is recommended in a production environment apart from
>>>> recovery purposes. When you post a tx from a client to a server you
>>>> will not know for sure if the network has "accepted" the transaction
>>>> until it's mined (and sufficiently deep in the chain). However if you
>>>> want some confidence that it is being distributed to miners you should
>>>> query for the tx using the other server. Given they are mutually
>>>> excluded as peers the presence of the tx will prove that it has moved
>>>> through at least one external node.
>>>>
>>>> Using the above technique requires two servers always up and the
>>>> ability to shut one down periodically. So maintaining a robust
>>>> production environment requires at least three servers (and the
>>>> ability to shift traffic away from the down server). I recommend four
>>>> servers, with clients configured to send transactions to either of two
>>>> and to retrieve from the mempool of either of the other two (single
>>>> redundant failover). Other queries can be balanced across all four.
>>>> This allows you to bring down one server in either of the two pools.
>>>> The pools of course must be configured to exclude each other's members
>>>> as peers.
>>>>
>>>> e
>>>>
>>>>
>>>> On 05/15/2016 07:20 AM, mlmikael wrote:
>>>>> Hi Eric,
>>>>>
>>>>> Say that my machine shuts down unexpectedly. Perhaps at startup I
>>>>> won't even know that it did shut down unexpectedly so the LibBitcoin
>>>>> database could be in an inconsistent state.
>>>>>
>>>>> To mitigate that it would be great to do some kind of read
>>>>> operation for the whole database, that provides a verification deep
>>>>> enough to prove that the probability of an inconsistency is smaller
>>>>> than 1 in ~10^10-10^20.
>>>>>
>>>>> I.e., is there any cheaper way than doing a full local sync to a new
>>>>> directory?
>>>>>
>>>>> What's in the box now and what do you suggest?
>>>>>
>>>>> Mlmikael
>>>>>
>>>
>>>
>
