This resolves https://github.com/ledgerwatch/erigon/issues/10135
All enums are constrained by their owning type, which forces package
inclusion and hence type registration.
Added tests for each type to check the construction cycle.
Implementation of db and snapshot storage for additional synced heimdall
waypoint types:
* Checkpoint
* Milestones
This is targeted at the Astrid downloader which uses waypoints to verify
headers during syncing and fork choice selection.
Since the introduction of milestones in Heimdall, these types are
downloaded by erigon but not persisted locally. This change adds
persistence for them.
In addition to the pure persistence changes, this PR also contains a
refactor step which is part of the process of extracting polygon-related
types from erigon core into a separate package, which may eventually be
extracted into a separate module and possibly its own repo.
The aim is that, rather than the core `turbo/snapshotsync/freezeblocks`
having to know about the types it manages and how to extract and index
their contents, it can concern itself with a set of macro shard-management
actions.
This process is partially completed by this PR; a final step will be to
remove BorSnapshots and to simplify the places in the code that have to
remember to deal with them. That requires further testing, so it has been
left out of this PR to avoid delaying delivery of the base types.
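As a rough illustration of that direction, the sketch below shows how each snapshot type could register its own extract and index steps so the core code only drives generic shard management; all names here are hypothetical and are not the actual erigon interfaces.

```go
package snaptypes

// Hypothetical registration model: each snapshot type (headers, bodies, bor
// checkpoints, milestones, ...) describes how to extract and index its own
// data, and the core freezeblocks code only iterates over the registry to
// run the generic dump/index/merge/prune steps.
type ExtractFunc func(blockFrom, blockTo uint64, collect func(k, v []byte) error) error

type IndexFunc func(segmentPath string) error

type Type struct {
	Name    string
	Extract ExtractFunc
	Index   IndexFunc
}

var registry []Type

// Register is called from the owning package, e.g. a polygon/heimdall
// package registering its waypoint types.
func Register(t Type) { registry = append(registry, t) }

// All returns the registered types, so the core code can drive shard
// management without knowing anything type-specific.
func All() []Type { return registry }
```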
# Status
* Waypoint types and storage are complete and integrated into the
BorHeimdall stage. The code has been tested to check that types are
inserted into mdbx, extracted and merged correctly
* I have verified that when produced from block 0 the new snapshots
correctly follow the merging strategy of existing snapshots
* The functionality is enabled by the **--bor.waypoints=true** flag; it is
false by default.
# Testing
This has been tested as follows:
* Run a Mumbai instance to the tip and check that milestones and
checkpoints are processed correctly
# Post merge steps
* Produce and release snapshots for mumbai and bor mainnet
* Check existing node upgrades
* Remove --bor.waypoints flags
In Go 1.21, the version currently used in the project, the `slices`
package is no longer an experimental feature and has entered the standard library.
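For example, code that previously imported `golang.org/x/exp/slices` can now use the standard library package directly:

```go
package main

import (
	"fmt"
	"slices" // standard library as of Go 1.21
)

func main() {
	ids := []int{3, 1, 2}
	slices.Sort(ids)
	fmt.Println(slices.Contains(ids, 2), ids) // true [1 2 3]
}
```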
Co-authored-by: alex.sharov <AskAlexSharov@gmail.com>
### Change ###
Adds a `disableBlockDownload` boolean flag to the current implementation of
the sentry multi client to disable its built-in header and body download
functionality.
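A minimal sketch of the idea; only the `disableBlockDownload` name comes from this change, while the surrounding type and handler are illustrative stand-ins, not the real multi client code:

```go
package main

import "fmt"

// multiClient is a simplified stand-in for the sentry multi client.
type multiClient struct {
	disableBlockDownload bool
}

// onBlockHeaders sketches how an inbound headers message might be handled.
func (cs *multiClient) onBlockHeaders(payload []byte) {
	if cs.disableBlockDownload {
		// Astrid runs its own header/body download, so the built-in
		// HeaderDownload/BodyDownload machinery is bypassed entirely.
		return
	}
	fmt.Printf("feeding %d bytes to the built-in header download\n", len(payload))
}

func main() {
	cs := &multiClient{disableBlockDownload: true}
	cs.onBlockHeaders([]byte{0x01, 0x02, 0x03})
}
```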
### Long Term ###
Long term we are planning to refactor sentry multi client and de-couple
it from custom header and body download logic.
### Context ###
Astrid uses its own body download logic which is de-coupled from sentry
multi client.
When both are used at the same time (using `--polygon.sync=true`) there
are 2 problematic scenarios:
- restarting Astrid takes a very long time due to the init logic of the
sentry multi client. It calls `HeaderDownload.RecoverFromDb`, which is
coupled to the Headers stage in the stage loop. So if Astrid has fetched
1 million headers but hasn't committed execution yet, start-up is very
slow since all 1 million headers have to be read from the DB. Example
logs:
```
[INFO] [04-16|12:55:42.254] [downloader] recover headers from db left=65536
...
[INFO] [04-16|13:03:42.254] [downloader] recover headers from db left=65536
```
- debug log messages warning that sentry consuming is slow, since
Astrid does not use `HeaderDownload` and `BodyDownload` so there is
nothing consuming the headers and bodies from these data structures.
This has no logical impact, however it wastes resources. Example logs:
```
[DBUG] [04-16|14:03:15.311] [sentry] consuming is slow, drop 50% of old messages msgID=BLOCK_HEADERS_66
[DBUG] [04-16|14:03:15.311] [sentry] consuming is slow, drop 50% of old messages msgID=BLOCK_HEADERS_66
```
**Problematic situation:** `runPeer` blocks on `rw.ReadMsg()`, however in
the meantime the peer gets penalised.
**Expected behaviour:** the peer gets disconnected and sentry generates a
Disconnect event.
**Actual behaviour:** no disconnect event gets generated; the peer is
stuck in `rw.ReadMsg()`.
**Fix:** call `pi.peer.Disconnect(reason)` as part of
`peerInfo.Remove(reason)` during `Penalize`.
1. `Disconnect` sends a disc reason to the `p.disc` channel
2. the `p.disc` channel is read in `Peer.run` -
https://github.com/ledgerwatch/erigon/blob/devel/p2p/peer.go#L279
3. this causes the function to exit and, in its defer, close the
`p.closed` channel
4. the `p.closed` channel is used as the closing channel
(`protoRW.closed`) in both `ReadMsg` and `WriteMsg`, so once it is closed
those functions exit
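A self-contained sketch of this shutdown chain; the names loosely mirror the p2p code, but the implementation below is illustrative only:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type peer struct {
	disc   chan string   // step 1: Disconnect pushes a reason here
	closed chan struct{} // step 3: closed when run exits
	in     chan []byte
}

func (p *peer) Disconnect(reason string) { p.disc <- reason }

func (p *peer) run() {
	defer close(p.closed) // unblocks pending ReadMsg/WriteMsg (step 4)
	reason := <-p.disc    // step 2: wait for a disconnect reason
	fmt.Println("disconnecting:", reason)
}

// ReadMsg blocks until a message arrives or the peer is closed.
func (p *peer) ReadMsg() ([]byte, error) {
	select {
	case msg := <-p.in:
		return msg, nil
	case <-p.closed:
		return nil, errors.New("peer closed")
	}
}

func main() {
	p := &peer{disc: make(chan string, 1), closed: make(chan struct{}), in: make(chan []byte)}
	go p.run()
	go func() {
		time.Sleep(10 * time.Millisecond)
		p.Disconnect("penalized") // e.g. called from Penalize via peerInfo.Remove
	}()
	_, err := p.ReadMsg() // no longer stuck: returns once p.closed is closed
	fmt.Println(err)
}
```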
P2P fails on restart because `rawdb.ReadCurrentHeader` returns a nil
header. It looks like `ReadHeadHeaderHash` fails to find the current
header hash. However the correct hash is returned by `ReadHeadBlockHash`.
Let's use `ReadHeadBlockHash`, because the status needs to report a header for which we have a full block body.
This salt is used for creating RecSplit indices. All nodes will have a
different salt, but one node will use the same salt for all files. This
allows the value of `murmur3(key)` to be computed once and used to read
from all files (important for non-existing keys, which require checking
all indices).
- add `snapshots/salt-blocks.txt`
- this PR doesn't require re-indexing
- it's step 1; in future releases we will add a data_migration script which
will "check if all indices use the same salt or re-index" (but at that time
most users will not be affected)
- new indices will use `salt-blocks.txt`
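A hedged sketch of the per-node salt idea; the actual format of `snapshots/salt-blocks.txt` and the way erigon wires the value into RecSplit are not shown here, and the 4-byte layout below is an assumption:

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
	"os"
	"path/filepath"
)

// readOrCreateSalt generates a random per-node salt once and reuses it for
// every index built on this node, so a single seeded murmur3(key) value can
// be checked against all files.
func readOrCreateSalt(snapDir string) (uint32, error) {
	path := filepath.Join(snapDir, "salt-blocks.txt")
	if data, err := os.ReadFile(path); err == nil && len(data) >= 4 {
		return binary.BigEndian.Uint32(data), nil
	}
	var buf [4]byte
	if _, err := rand.Read(buf[:]); err != nil {
		return 0, err
	}
	if err := os.WriteFile(path, buf[:], 0o644); err != nil {
		return 0, err
	}
	return binary.BigEndian.Uint32(buf[:]), nil
}

func main() {
	salt, err := readOrCreateSalt(os.TempDir())
	if err != nil {
		panic(err)
	}
	fmt.Printf("node salt: %d (reused for every index on this node)\n", salt)
}
```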
The responsibility to maintain the status data is moved from the
stageloop Hook and MultiClient to the new StatusDataProvider. It reads
the latest data from a RoDB when asked. That happens at the end of each
stage loop iteration, and sometimes when any sentry stream loop
reconnects a sentry client.
`sync.Service` and `MultiClient` now require an instance of the
StatusDataProvider. The MessageListener is updated to depend on an
external statusDataFactory.
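A rough sketch of the resulting shape; all type and method names below are illustrative stand-ins for the real StatusDataProvider and statusDataFactory:

```go
package main

import (
	"context"
	"fmt"
)

// StatusData is a placeholder for the head/fork data sentry needs to report.
type StatusData struct {
	HeadHash   [32]byte
	HeadHeight uint64
}

// StatusDataProvider reads the latest status data from a read-only DB when
// asked, e.g. at the end of a stage loop iteration or when a sentry stream
// loop reconnects a sentry client.
type StatusDataProvider interface {
	GetStatusData(ctx context.Context) (StatusData, error)
}

// statusDataFactory is the function shape a message listener could depend on.
type statusDataFactory func(ctx context.Context) (StatusData, error)

type roDBProvider struct{}

func (roDBProvider) GetStatusData(context.Context) (StatusData, error) {
	// The real implementation would open a read-only transaction and read
	// the current head; this just returns a placeholder.
	return StatusData{HeadHeight: 42}, nil
}

func main() {
	var provider StatusDataProvider = roDBProvider{}
	factory := statusDataFactory(provider.GetStatusData)
	s, _ := factory(context.Background())
	fmt.Println("head height:", s.HeadHeight)
}
```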
This PR contains a couple of changes related to bor snapshots.
It's bigger than intended as I used it to produce patch bor snapshots,
and the changes are difficult to untangle, so I want to merge them as
a set.
1. It has some downloader changes which add the following features:
- Added snapshot-lock.json, which contains a list of the files/hashes
downloaded and can be used to manage local state (see the sketch after
this list)
- Removed the version flag and added it to the snapshot type - it has been
used for testing v2 downloads but is set to v1 for this PR (see below for
details)
- Manage the state of downloads in the download db - this optimises
metadata look-ups on restart during/after download. For mumbai, retrieving
torrent info can take up to 15 minutes even after the download is completed.
2. It has a rationalization of the snapshot processing code to remove
duplicate code between snapshot types and standardize the interfaces to
extract blocks (Dump...) and Index blocks.
- This enables the removal of a separate BorSnapshot and probably
CaplinSnapshot type as the base snapshot code can handle the addition of
new snapshot types.
- Simplifies the addition of new snapshot types (I want to add bor
checkpoints and bor milestones)
- Removes the double iteration from retire blocks
- Aids the insertion of bor validation code on indexing, as the common
insertion point is now well defined.
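A hypothetical illustration of the snapshot-lock.json idea from point 1; the real file produced by the downloader may be laid out differently:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// snapshotLock sketches a "chain plus downloaded file/hash list" layout that
// could be used to manage local state across restarts.
type snapshotLock struct {
	Chain     string            `json:"chain"`
	Downloads map[string]string `json:"downloads"` // file name -> downloaded hash
}

func main() {
	lock := snapshotLock{
		Chain: "mumbai",
		Downloads: map[string]string{
			"v1-000000-000500-headers.seg": "<hash>",
			"v1-000000-000500-bodies.seg":  "<hash>",
		},
	}
	out, _ := json.MarshalIndent(lock, "", "  ")
	fmt.Println(string(out))
}
```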
I have tested these changes by syncing mumbai from scratch and by using
it to produce a bor-mainnet patch - which starts sync in the middle
of the chain by downloading a previously existing snapshot segment.
I have identified the following issues that I think need to be resolved
before we can use v2 .segs for polygon:
1. For both mumbai and mainnet, downloads are very slow. This looks
like it's because the lack of peers means that we're hitting the web server
with many small requests for pieces, which I think the server interprets
as some form of DoS and stops giving us data.
2. Because of the lack of torrents we can't get metadata, and thus don't
start downloading, even if a web peer is available.
I'll look to resolve these in the next week or so at which point I can
update the .toml files to include v2 and retest a sync from scratch.
If any DB method is called while Close() is waiting for db.kv.Close()
(it waits for ongoing method calls/transactions to finish)
a panic: "WaitGroup is reused before previous Wait has returned" might
happen.
Use context cancellation to ensure that new method calls immediately
return during db.kv.Close().
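A minimal sketch of the pattern (not the actual mdbx wrapper code): cancel a context before waiting, and take the WaitGroup counter only while the context is still live, so late callers return immediately instead of racing Close.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

type db struct {
	ctx    context.Context
	cancel context.CancelFunc
	mu     sync.Mutex
	wg     sync.WaitGroup
}

func newDB() *db {
	ctx, cancel := context.WithCancel(context.Background())
	return &db{ctx: ctx, cancel: cancel}
}

// begin registers an in-flight call, or fails fast once Close has started.
func (d *db) begin() error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.ctx.Err() != nil {
		return errors.New("db closed")
	}
	d.wg.Add(1) // safe: Close cancels under the same mutex before it Waits
	return nil
}

func (d *db) View(fn func(ctx context.Context) error) error {
	if err := d.begin(); err != nil {
		return err
	}
	defer d.wg.Done()
	return fn(d.ctx)
}

func (d *db) Close() {
	d.mu.Lock()
	d.cancel() // new method calls now return immediately
	d.mu.Unlock()
	d.wg.Wait() // wait only for calls that were already in flight
}

func main() {
	d := newDB()
	_ = d.View(func(context.Context) error { return nil })
	d.Close()
	fmt.Println(d.View(func(context.Context) error { return nil })) // db closed
}
```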
Mdbx now takes a logger, but this has not been pushed to all callers,
meaning mdbx could end up with an invalid logger.
This fixes the log propagation.
It also fixes a start-up issue for `http.enabled` and `txpool.disable`
created by a previous merge.
This change introduces additional processes to manage snapshot uploading
for E2 snapshots:
## erigon snapshots upload
The `snapshots uploader` command starts a version of erigon customized
for uploading snapshot files to
a remote location.
It breaks the stage execution process after the senders stage and then
uses the snapshot stage to send headers, bodies and (in the case of
polygon) bor spans and events to snapshot files. Because this process
avoids execution, it runs significantly faster than a standard erigon
configuration.
The uploader uses rclone to send seedable files (100K or 500K blocks) to a
remote storage location specified
in the rclone config file.
The **uploader** is configured to minimize disk usage by doing the
following:
* It removes snapshots once they are loaded
* It aggressively prunes the database once entities are transferred to
snapshots
In addition to this, it has the following performance-related features:
* Maximizes the workers allocated to snapshot processing to improve
throughput
* Can be started from scratch by downloading the latest snapshots from
the remote location to seed processing
## snapshots command
This is a standalone command for managing remote snapshots. It has the
following sub-commands:
* **cmp** - compare snapshots
* **copy** - copy snapshots
* **verify** - verify snapshots
* **manifest** - manage the manifest file in the root of remote snapshot
locations
* **torrent** - manage snapshot torrent files
This adds a simulator object which implements the SentryServer API but
takes objects from a pre-existing snapshot file.
If the snapshot is not available locally it will download and index the
.seg file for the header range being asked for.
It is created as follows:
```go
sim, err := simulator.NewSentry(ctx, "mumbai", dataDir, 1, logger)
```
Where the arguments are:
* ctx - a cancellable context; cancel will close the simulator's torrent
and file connections (it also has a Close method)
* chain - the name of the chain to take the snapshots from
* datadir - a directory potentially containing snapshot .seg files. If
no files exist in this directory they will be downloaded
* num peers - the number of peers the simulator should create
* logger - the logger to log actions to
It can be attached to a client as follows:
```go
simClient := direct.NewSentryClientDirect(66, sim)
```
At the moment only very basic functionality is implemented:
* get headers will return headers by range or hash (hash assumes a
pre-downloaded .seg as it needs an index)
* the header replay semantics need to be confirmed
* eth 65 and 66(+) messaging is supported
* For details see: `simulator_test.go`
More advanced peer behavior (e.g. header rewriting) can be added
Bodies/Transactions handling can be added
The current value of 16 was added by me a year ago and didn't mean
anything. I've never seen this field holding much data, so it can
probably be increased.
Currently I see logs like this (and 10x like it):
[DBUG] [11-24|06:59:38.353] slow peer or too many requests, dropping its old requests name=erigon/v2.54.0-aeec5...
# Background
Erigon currently uses a combination of Victoria Metrics and Prometheus
client for providing metrics.
We want to rationalize this and use only the Prometheus client library,
but we want to maintain the simplified Victoria Metrics methods for
constructing metrics.
This task is currently partly complete and needs to be finished to a
stage where we can remove the Victoria Metrics module from the Erigon
code base.
# Summary of changes
- Adds missing `NewCounter`, `NewSummary`, `NewHistogram`,
`GetOrCreateHistogram` functions to `erigon-lib/metrics` similar to the
interface VictoriaMetrics lib provides
- Minor tidy up for consistency inside `erigon-lib/metrics/set.go`
around return types (panic vs err consistency for funcs inside the
file), error messages, comments
- Replace all remaining usages of `github.com/VictoriaMetrics/metrics`
with `github.com/ledgerwatch/erigon-lib/metrics` - seamless (only import
changes) since interfaces match
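A hedged usage sketch of the added constructors; the calling convention is assumed to mirror the VictoriaMetrics style described above, the metric names are made up, and the exact method set on the returned types may differ in `erigon-lib/metrics`:

```go
package main

import "github.com/ledgerwatch/erigon-lib/metrics"

// Metric names here are purely illustrative.
var (
	headersProcessed = metrics.NewCounter(`headers_processed`)
	stageDuration    = metrics.NewHistogram(`stage_duration_seconds`)
)

func main() {
	// Counters follow the usual Inc/Add style.
	headersProcessed.Inc()
	_ = stageDuration // histogram updates omitted; see the package for its observe/update API
}
```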
Making the addReplyMatcher channel unbuffered sometimes makes the loop
too slow to serve parallel requests.
This is an alternative fix that keeps the channel buffered.
Problem:
Some goroutines are blocked on shutdown:
1. table close <-tab.closed // because table loop pending
1. table loop <-refreshDone // because lookup shutdown blocks doRefresh
1. lookup shutdown <-it.replyCh // because it.queryfunc (findnode -
ensureBond) is blocked, and not returning errClosed (if it returns and
pushes to it.replyCh, then shutdown() will unblock)
1. findnode - ensureBond <-rm.errc // because the related replyMatcher
was added after loop() exited, so there's nothing to push errClosed and
unlock it
If the addReplyMatcher channel is buffered, it is possible that
UDPv4.pending() adds a new reply matcher after closeCtx.Done().
Such a reply matcher's errc result channel will never be updated, because
UDPv4.loop() has exited at this point. Subsequent discovery
operations will deadlock.
Solution:
Revert to an unbuffered channel.
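A self-contained sketch of why the unbuffered send avoids the stuck matcher; the names loosely follow the discovery code, and the real loop does much more:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type replyMatcher struct{ errc chan error }

type udp struct {
	closeCtx        context.Context
	addReplyMatcher chan *replyMatcher // unbuffered on purpose
}

// pending either hands the matcher to the loop or, if shutdown has begun,
// answers it with errClosed itself - it can never park a matcher that the
// exited loop will no longer see.
func (t *udp) pending() *replyMatcher {
	m := &replyMatcher{errc: make(chan error, 1)}
	select {
	case t.addReplyMatcher <- m:
		// the loop goroutine owns m now and will push a result to m.errc
	case <-t.closeCtx.Done():
		m.errc <- errors.New("errClosed")
	}
	return m
}

func (t *udp) loop(cancel context.CancelFunc) {
	defer cancel() // after this, no new matcher can be parked unanswered
	m := <-t.addReplyMatcher
	m.errc <- nil // normally set when a reply arrives or a timeout fires
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	t := &udp{closeCtx: ctx, addReplyMatcher: make(chan *replyMatcher)}
	go t.loop(cancel)
	fmt.Println(<-t.pending().errc) // answered by the loop: <nil>
	fmt.Println(<-t.pending().errc) // loop exited: errClosed instead of a deadlock
}
```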
This fixes an issue where mumbai testnet nodes struggle to find
peers. Before this fix, test peer numbers are typically around
20 in total between eth66, eth67 and eth68, and some new nodes can
struggle to find even a single peer after days of operation.
These are the numbers after 12 hours of running on a node which
previously could not find any peers: eth66=13, eth67=76, eth68=91.
The root cause of this issue is the following:
- A significant number of mumbai peers around the boot node return
network ids which are different from those currently available in the
DHT
- The available nodes are all consequently busy and return 'too many
peers' for long periods
These issues cause a significant number of discovery timeouts; some of
the queries will never receive a response.
This causes the discovery read loop to enter a channel deadlock, which
means that no responses are processed and no timeouts are fired. This
causes the discovery process in the node to stop. From then on it just
re-requests handshakes from a relatively small number of peers.
This check-in fixes the situation with the following changes:
- Remove the deadlock by running the timer in a separate go-routine so
it can run independently of the main request processing (see the sketch
after this list).
- Allow the discovery process matcher to match on port if no id match
can be established on the initial ping. This allows subsequent node
validation to proceed, and if the node proves to be valid via the
remainder of the look-up and handshake process it is used as a valid
peer.
- Completely unsolicited responses, i.e. those which come from a
completely unknown ip:port combination continue to be ignored.
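A hedged sketch of the first change, with per-request timers firing from their own goroutines so timeouts no longer depend on the main processing loop being free; all names are illustrative:

```go
package main

import (
	"fmt"
	"time"
)

type timeoutEvent struct{ id uint64 }

// runTimeouts arms an independent timer per pending request; firing never
// depends on the reply-processing loop being able to rearm a shared timer.
func runTimeouts(pending <-chan uint64, timeouts chan<- timeoutEvent, d time.Duration) {
	for id := range pending {
		id := id
		time.AfterFunc(d, func() { timeouts <- timeoutEvent{id: id} })
	}
}

func main() {
	pending := make(chan uint64, 8)
	timeouts := make(chan timeoutEvent, 8)
	go runTimeouts(pending, timeouts, 50*time.Millisecond)

	pending <- 1
	pending <- 2
	for i := 0; i < 2; i++ {
		ev := <-timeouts
		fmt.Println("request timed out:", ev.id)
	}
}
```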
* call getEnode before NodeStarted to make sure it is ready for RPC
calls
* fix connection error detection on macOS
* use a non-default p2p port to avoid conflicts
* disable bor milestones on local heimdall
* generate node keys for static peers config
Problem:
"Started P2P networking" log message contains port zero on startup,
e.g.: 127.0.0.1:0 because of the outdated localnodeAddrCache.
Solution:
Call updateLocalNodeStaticAddrCache after updating the port.