Commit Graph

757 Commits

Author SHA1 Message Date
Mark Holt
1558fc7835
Refactored types to force runtime registrations to be type dependent (#10147)
This resolves https://github.com/ledgerwatch/erigon/issues/10135

All enums are constrained by their owning type, which forces package
inclusion and hence type registration.

Added tests for each type to check the construction cycle.
2024-05-01 06:41:19 +07:00
Mark Holt
714c259acc
Bor waypoint storage (#9793)
Implementation of db and snapshot storage for additional synced Heimdall
waypoint types:

* Checkpoint
* Milestones

This is targeted at the Astrid downloader which uses waypoints to verify
headers during syncing and fork choice selection.

Since the introduction of milestones in Heimdall, these types have been
downloaded by erigon but not persisted locally. This change adds
persistence for these types.

In addition to the pure persistence changes this PR also contains a
refactor step which is part of the process of extracting polygon related
types from erigon core into a separate package which may eventually be
extracted to a separate module and possibly repo.

The aim is that rather than the core `turbo/snapshotsync/freezeblocks` having
to know about the types it manages and how to extract and index their
contents, it can concern itself with a set of macro shard management
actions.

This process is partially completed by this PR; a final step will be to
remove BorSnapshots and to simplify the places in the code which have to
remember to deal with them. This requires further testing so has been
left out of this PR to avoid delays in delivering the base types.

# Status

* Waypoint types and storage are complete and integrated into the
BorHeimdall stage. The code has been tested to check that types are
inserted into mdbx, extracted and merged correctly
* I have verified that when produced from block 0 the new snapshots
correctly follow the merging strategy of existing snapshots
* The functionality is enabled by **--bor.waypoints=true**; it is
false by default.

# Testing

This has been tested as follows:

* Run a Mumbai instance to the tip and check current processing for
milestones and checkpoints

# Post merge steps

* Produce and release snapshots for mumbai and bor mainnet
* Check existing node upgrades
* Remove --bor.waypoints flags
2024-04-29 18:31:51 +01:00
luchenhan
06dfaea457
chore: fix some function names (#10117)
Signed-off-by: luchenhan <hanluchen@aliyun.com>
2024-04-29 12:48:26 +00:00
Alex Sharov
3ad651e286
nodedb: UpdateNode method to create 1 rwtx instead of 2 (#10109) 2024-04-29 09:47:51 +07:00
carehabit
9001183632
all: use the built-in slices library (#9842)
In the current Go 1.21 version used in the project, slices are no longer
an experimental feature and have entered the standard library.

Co-authored-by: alex.sharov <AskAlexSharov@gmail.com>
2024-04-26 03:21:25 +00:00
Alex Sharov
ab361e4747
move temporal package to erigon-lib (#10015)
Co-authored-by: awskii <artem.tsskiy@gmail.com>
2024-04-22 15:29:25 +01:00
milen
bca27f3b70
p2p/sentry/sentry_multi_client: flag to disable block download code (#9957)
### Change ### 
Adds a `disableBlockDownload` boolean flag to the current implementation of
sentry multi client to disable its built-in header and body download
functionality.

### Long Term ###
Long term we are planning to refactor sentry multi client and de-couple
it from custom header and body download logic.

### Context ### 
Astrid uses its own body download logic which is de-coupled from sentry
multi client.

When both are used at the same time (using `--polygon.sync=true`) there
are 2 problematic scenarios:
- restarting Astrid takes a very long time due to the init logic of
sentry multi client. It calls `HeaderDownload.RecoverFromDb` which is
coupled to the Headers stage in the stage loop. So if Astrid has fetched
1 million headers but hasn't committed execution yet then this will
result in very slow start up since all 1 million blocks have to be read
from the DB. Example logs:
```
[INFO] [04-16|12:55:42.254] [downloader] recover headers from db     left=65536
...
[INFO] [04-16|13:03:42.254] [downloader] recover headers from db     left=65536
```

- debug log messages warning about sentry consuming being slow since
Astrid does not use `HeaderDownload` and `BodyDownload` so there is
nothing consuming the headers and bodies from these data structures.
This has no logical impact but clogs resources. Example logs:
```
[DBUG] [04-16|14:03:15.311] [sentry] consuming is slow, drop 50% of old messages msgID=BLOCK_HEADERS_66
[DBUG] [04-16|14:03:15.311] [sentry] consuming is slow, drop 50% of old messages msgID=BLOCK_HEADERS_66
```
2024-04-17 14:57:02 +03:00
milen
890b8b52c0
p2p/sentry: fix missing disconnect events after penalising a peer (#9928)
**Problematic situation**
`runPeer` blocks on `rw.ReadMsg()`, however in the meantime the peer
gets penalised.

**Expected behaviour**
the peer gets disconnected and sentry generates a Disconnect
event

**Actual behaviour**
no disconnect event gets generated; the peer is stuck in `rw.ReadMsg()`

**Fix**
call `pi.peer.Disconnect(reason)` as part of `peerInfo.Remove(reason)`
during `Penalize`

1. `Disconnect` sends a disc reason to `p.disc` channel
2. `p.disc` channel is read in `Peer.run` -
https://github.com/ledgerwatch/erigon/blob/devel/p2p/peer.go#L279
3. it causes the function to exit and in its defer call close `p.closed`
channel
4. `p.closed` channel is used as a closing channel in the
`protoRW.closed` in both `ReadMsg` and `WriteMsg` so once it is closed
those functions exit
2024-04-15 10:40:15 +00:00
battlmonstr
fcad3a0328
data races running TestMiningBenchmark (#8704) (#9926)
* fix GrpcServer.p2pServer race
* fix HeaderDownload.latestMinedBlockNumber race

see https://github.com/ledgerwatch/erigon/issues/8704
2024-04-15 06:14:30 +00:00
awskii
43b20c1b40
fix race conditions (#9906)
fixes two minor race conditions
2024-04-12 17:58:10 +07:00
battlmonstr
63c71181b2
p2p/sentry: StatusDataProvider ReadCurrentHeader error (#9890)
P2P fails on restart because rawdb.ReadCurrentHeader returns a nil
header. It looks like ReadHeadHeaderHash fails to find the current
header hash. However the correct hash is returned by ReadHeadBlockHash.

Let's use ReadHeadBlockHash, because the status needs to report a header for which we have a full block body.
2024-04-11 07:35:26 +00:00
milen
48592ea7ad
p2p/sentry: allow SendMessageById(GetBlockBodiesMsg) (#9825) 2024-04-09 09:45:06 +02:00
Shoham Chakraborty
95c8e37be4
tests: Remove torrent simulator (#9845) 2024-04-02 22:42:49 +08:00
Shoham Chakraborty
1d04dc52b7
Heimdall simulator (#9819) 2024-03-30 08:39:36 +00:00
Alex Sharov
ff709f9474
1 seed for all block indices (#9719)
This salt is used for creating RecSplit indices. All nodes will have
a different salt, but 1 node will use the same salt for all files. This allows
using the value of `murmur3(key)` to read from all files (important for
non-existing keys - which require checking all indices).

- add `snapshots/salt-blocks.txt`
- this PR doesn't require re-index
- it's step 1. In future releases we will add a data_migration script which
will "check if all indices use same salt or re-index" (but at this time
most users will not be affected).
- new indices will use `salt-blocks.txt`
2024-03-28 14:52:11 +02:00
battlmonstr
f3f4756471
p2p/sentry: status data provider refactoring (#9747)
The responsibility to maintain the status data is moved from the
stageloop Hook and MultiClient to the new StatusDataProvider. It reads
the latest data from a RoDB when asked. That happens at the end of each
stage loop iteration, and sometimes when any sentry stream loop
reconnects a sentry client.

sync.Service and MultiClient require an instance of the
StatusDataProvider now. The MessageListener is updated to depend on an
external statusDataFactory.
2024-03-27 12:35:23 +00:00
Jacek Glen
af429b8527
forkvalidator: remove unused references (#9760)
Remove unused references to `forkValidator` and simplify parameters. No
change to any logic
2024-03-20 12:29:06 +01:00
pavedroad
89954cd562
chore: remove repetitive words (#9685)
Signed-off-by: pavedroad <qcqs@outlook.com>
2024-03-13 09:50:19 +00:00
Dmytro
4608add377
Dvovk/sentry peer refactor (#9633) 2024-03-09 09:43:41 +07:00
Alex Sharov
b61da39cf0
Less crowded trace logs (#9620) 2024-03-08 09:09:54 +00:00
milen
e4d37a8c53
polygon/p2p: message listener to penalize for invalid rlp (#9581) 2024-03-05 15:44:35 +00:00
Alex Sharov
01ab4be532
don't filter out e3 files (#9423)
backport of  #9422
2024-02-28 08:51:44 +00:00
milen
6b39197f0b
polygon/p2p: implement download headers (#9399) 2024-02-19 12:29:17 +01:00
Mark Holt
413a931acc
Bor related snapshot changes (#9311)
This PR contains a couple of changes related to bor snapshots:

It's bigger than intended as I used it to produce patch bor snapshots -
and the changes are not easy to untangle, so I want to merge them as
a set.

1. It has some downloader changes which add the following features:

- Added snapshot-lock.json which contains a list of the files/hashes
downloaded - which can be used to manage local state
- Removed the version flag and added this to a snapshot type - it has been
used for testing v2 download but is set at v1 for this PR (see below for
details)
- Manage the state of downloads in the download db - this optimises metadata
look-ups on restart during/after download. For mumbai, retrieving
torrent info can take up to 15 mins even after download is completed.

2. It has a rationalization of the snapshot processing code to remove
duplicate code between snapshot types and standardize the interfaces to
extract blocks (Dump...) and Index blocks.

- This enables the removal of the separate BorSnapshot and probably
CaplinSnapshot types as the base snapshot code can handle the addition of
new snapshot types.
- Simplifies the addition of new snapshot types (I want to add
borcheckpoints and bormilestones)
- Removes the double iteration from retire blocks
- Aids the insertion of bor validation code on indexing as the common
insertion point is now well defined.

I have tested these changes by syncing mumbai from scratch and by using
it for producing a bor-mainnet patch - which starts sync in the middle
of the chain by downloading a previously existing snapshot segment.

I have identified the following issues that I think need to be resolved
before we can use v2 .segs for polygon:

1. For both mumbai and mainnet, downloads are very slow. This looks
like it's because a lack of peers means that we're hitting the web server
with many small requests for pieces, which I think the server interprets
as some form of DoS and stops giving us data.
2. Because of the lack of torrents we can't get metadata, and thus don't
start downloading, even if a web peer is available.

I'll look to resolve these in the next week or so at which point I can
update the .toml files to include v2 and retest a sync from scratch.
2024-02-03 08:43:56 +00:00
milen
f8ca251a61
p2p/sentry/simulator: skip TestSimulatorStart - manual runs only for now (#9292) 2024-01-23 20:32:43 +02:00
battlmonstr
e979d79c08
p2p: panic in enode DB Close on shutdown (#9237) (#9240)
If any DB method is called while Close() is waiting for db.kv.Close()
(it waits for ongoing method calls/transactions to finish)
a panic: "WaitGroup is reused before previous Wait has returned" might
happen.

Use context cancellation to ensure that new method calls immediately
return during db.kv.Close().
2024-01-16 15:34:31 +07:00
ddl
79499b5cac
refactor(p2p/dnsdisc): replace strings.IndexByte with strings.Cut (#9236)
similar to https://github.com/ledgerwatch/erigon/pull/9202
2024-01-15 18:46:26 +00:00
battlmonstr
04498180dc
p2p/discv4: revert gotreply handler change from #8661 (#9119) (#9195)
The handler had race conditions in the candidates processing goroutine.
2024-01-11 15:04:46 +00:00
Mark Holt
19bc328a07
Added db loggers to all db callers and fixed flag settings (#9099)
Mdbx now takes a logger - but this has not been pushed to all callers,
meaning it had an invalid logger.

This fixes the log propagation.

It also fixed a start-up issue for http.enabled and txpool.disable
created by a previous merge
2023-12-31 17:10:08 +07:00
Mark Holt
79ed8cad35
E2 snapshot uploading (#9056)
This change introduces additional processes to manage snapshot uploading
for E2 snapshots:

## erigon snapshots upload

The `snapshots uploader` command starts a version of erigon customized
for uploading snapshot files to
a remote location.  

It breaks the stage execution process after the senders stage and then
uses the snapshot stage to send
uploaded headers, bodies and (in the case of polygon) bor spans and
events to snapshot files. Because
this process avoids execution it runs significantly faster than a
standard erigon configuration.

The uploader uses rclone to send seedable files (100K or 500K blocks) to a
remote storage location specified
in the rclone config file.

The **uploader** is configured to minimize disk usage by doing the
following:

* It removes snapshots once they are loaded
* It aggressively prunes the database once entities are transferred to
snapshots

In addition to this it has the following performance-related features:

* maximizes the workers allocated to snapshot processing to improve
throughput
* Can be started from scratch by downloading the latest snapshots from
the remote location to seed processing

## snapshots command

Is a standalone command for managing remote snapshots. It has the
following sub-commands:

* **cmp** - compare snapshots
* **copy** - copy snapshots
* **verify** - verify snapshots
* **manifest** - manage the manifest file in the root of remote snapshot
locations
* **torrent** - manage snapshot torrent files
2023-12-27 22:05:09 +00:00
Mark Holt
df0699a12b
Added sentry simulator implementation (#9087)
This adds a simulator object which implements the SentryServer API but
takes objects from a pre-existing snapshot file.

If the snapshot is not available locally it will download and index the
.seg file for the header range being asked for.

It is created as follows: 

```go
sim, err := simulator.NewSentry(ctx, "mumbai", dataDir, 1, logger)
```

Where the arguments are:

* ctx - a callable context where cancel will close the simulator torrent
and file connections (it also has a Close method)
* chain - the name of the chain to take the snapshots from
* datadir - a directory potentially containing snapshot .seg files. If no
files exist in this directory they will be downloaded
* num peers - the number of peers the simulator should create
* logger - the logger to log actions to

It can be attached to a client as follows:

```go
simClient := direct.NewSentryClientDirect(66, sim)
```

At the moment only very basic functionality is implemented:

* get headers will return headers by range or hash (hash assumes a
pre-downloaded .seg as it needs an index)
* the header replay semantics need to be confirmed
* eth 65 and 66(+) messaging is supported
* For details see: `simulator_test.go`

More advanced peer behavior (e.g. header rewriting) can be added
Bodies/Transactions handling can be added
2023-12-27 14:56:57 +00:00
battlmonstr
c1146bda49
p2p: skip TestUDPv4_smallNetConvergence on Linux (#8731) (#8962) 2023-12-12 17:06:48 +07:00
Alex Sharov
427f2637d2
mdbx: hard-limit of small db's dirty_space (#8850)
It didn't cause problems yet, but it seems a good idea in general.
2023-11-29 15:09:55 +01:00
milen
230b013096
metrics: separate usage of prometheus counter and gauge interfaces (#8793) 2023-11-24 16:15:12 +01:00
Alex Sharov
3db9467c94
increase peer tasks queue size (#8825)
The current value of 16 was added by me 1 year ago and didn't mean anything.
I've never seen this field holding much data, so it can probably be increased.

Currently I see logs like this (and 10x like this):
[DBUG] [11-24|06:59:38.353] slow peer or too many requests, dropping its old requests name=erigon/v2.54.0-aeec5...
2023-11-24 12:42:08 +01:00
Alex Sharov
23f23bc971
disable disc tests on Mac (#8822)
TestUDPv4_smallNetConvergence tests often time out on macOS - disabling
these tests in the macOS CI
2023-11-23 16:00:42 +07:00
milen
34c0fe29ad
metrics: swap remaining VictoriaMetrics usages with erigon-lib/metrics (#8762)
# Background

Erigon currently uses a combination of Victoria Metrics and Prometheus
client for providing metrics.

We want to rationalize this and use only the Prometheus client library,
but we want to maintain the simplified Victoria Metrics methods for
constructing metrics.

This task is currently partly complete and needs to be finished to a
stage where we can remove the Victoria Metrics module from the Erigon
code base.

# Summary of changes

- Adds missing `NewCounter`, `NewSummary`, `NewHistogram`,
`GetOrCreateHistogram` functions to `erigon-lib/metrics` similar to the
interface VictoriaMetrics lib provides
- Minor tidy up for consistency inside `erigon-lib/metrics/set.go`
around return types (panic vs err consistency for funcs inside the
file), error messages, comments
- Replace all remaining usages of `github.com/VictoriaMetrics/metrics`
with `github.com/ledgerwatch/erigon-lib/metrics` - seamless (only import
changes) since interfaces match
2023-11-20 12:23:23 +00:00
battlmonstr
a5ff524740
p2p: fix discovery shutdown (#8725) - alternative fix (#8757)
Making the addReplyMatcher channel unbuffered makes the loop
go too slow sometimes for serving parallel requests.
This is an alternative fix which keeps the channel buffered.
2023-11-17 11:02:28 +01:00
battlmonstr
3ca7fdf7e9
p2p: fix discovery shutdown (#8725) (#8735)
Problem:
Some goroutines are blocked on shutdown:
1. table close <-tab.closed // because table loop pending
1. table loop <-refreshDone // because lookup shutdown blocks doRefresh
1. lookup shutdown <-it.replyCh // because it.queryfunc (findnode -
ensureBond) is blocked, and not returning errClosed (if it returns and
pushes to it.replyCh, then shutdown() will unblock)
1. findnode - ensureBond <-rm.errc // because the related replyMatcher
was added after loop() exited, so there's nothing to push errClosed and
unlock it

If addReplyMatcher channel is buffered, it is possible that
UDPv4.pending() adds a new reply matcher after closeCtx.Done().
Such reply matcher's errc result channel will never be updated, because
the UDPv4.loop() has exited at this point. Subsequent discovery
operations will deadlock.

Solution:
Revert to an unbuffered channel.
2023-11-17 09:13:44 +07:00
Giulio rebuffo
274f84598c
Automation tool to automatically upload caplin's snapshot files to R2 (#8747)
Upload beacon snapshots to R2 every week by default
2023-11-16 20:59:43 +01:00
Alex Sharov
35bfffd621
sys deps up (#8695) 2023-11-11 15:04:18 +03:00
Mark Holt
509a7af26a
Discovery zero refresh timer (#8661)
This fixes an issue where mumbai testnet nodes struggle to find
peers. Before this fix, test peer numbers were typically around
20 in total between eth66, eth67 and eth68, and some new nodes could
struggle to find even a single peer after days of operation.

These are the numbers after 12 hours of running on a node which
previously could not find any peers: eth66=13, eth67=76, eth68=91.

The root cause of this issue is the following:

- A significant number of mumbai peers around the boot node return
network ids which are different from those currently available in the
DHT
- The available nodes are all consequently busy and return 'too many
peers' for long periods

These issues cause a significant number of discovery timeouts; some of
the queries will never receive a response.

This causes the discovery read loop to enter a channel deadlock - which
means that no responses are processed, nor timeouts fired. This causes
the discovery process in the node to stop. From then on it just
re-requests handshakes from a relatively small number of peers.

This check-in fixes this situation with the following changes:

- Remove the deadlock by running the timer in a separate go-routine so
it can run independently of the main request processing.
- Allow the discovery process matcher to match on port if no id match
can be established on initial ping. This allows subsequent node
validation to proceed, and if the node proves to be valid via the
remainder of the look-up and handshake process it is used as a valid
peer.
- Completely unsolicited responses, i.e. those which come from a
completely unknown ip:port combination continue to be ignored.
2023-11-07 08:48:58 +00:00
battlmonstr
d92898a508
p2p: silkworm sentry (#8527) 2023-11-02 08:35:13 +07:00
Dmytro
9adf31b8eb
bytes transferred separated by capability and category (#8568)
Co-authored-by: Mark Holt <mark@distributed.vision>
2023-10-27 22:30:28 +03:00
battlmonstr
f1c81dc14e
devnet: fix node startup on macOS (#8569)
* call getEnode before NodeStarted to make sure it is ready for RPC
calls
* fix connection error detection on macOS
* use a non-default p2p port to avoid conflicts
* disable bor milestones on local heimdall
* generate node keys for static peers config
2023-10-26 12:58:01 +07:00
Dmytro
ec59be2261
Dvovk/sentinel and sentry peers data collect (#8533) 2023-10-23 17:33:08 +03:00
a
436493350e
Sentinel refactor (#8296)
1. changes sentinel to use an http-like interface

2. moves hexutil, crypto/blake2b, metrics packages to erigon-lib
2023-10-22 01:17:18 +02:00
battlmonstr
e04dee12fd
p2p: bad p2p server port in the log (#8493)
Problem:
"Started P2P networking" log message contains port zero on startup,
e.g.: 127.0.0.1:0 because of the outdated localnodeAddrCache.

Solution:
Call updateLocalNodeStaticAddrCache after updating the port.
2023-10-17 10:40:02 +07:00
Alex Sharov
6d9a4f4d94
rpcdaemon: must not create db - because doesn't know right parameters (#8445) 2023-10-12 14:11:46 +07:00
Alex Sharov
404719c292
Medbx: add label to error messages, UpdateForkChoice: add ctx to erorrs, MemDb: increase db-limit from 512Mb to 512Gb (#8434) 2023-10-11 12:53:34 +07:00