Eth 2.0 Dev Update #50 — “Multiclient Testnet Restart + Slasher Improvements”

https://medium.com/prysmatic-labs/eth-2-0-dev-update-50-multiclient-testnet-restart-slasher-improvements-f1e9339b1922?source=rss----33260c5177c6---4


This is our 50th update! Thank you everyone for all the support so far!

Schlesi Multiclient Testnet Sunsetted, Witti Testnet To Succeed

The Schlesi multiclient testnet has been absolutely phenomenal for getting the implementers to focus on getting multiclient working. Teku, Lighthouse, Nimbus, Lodestar and Prysm have all been actively syncing on the network, with Lighthouse, Teku and Prysm having active validators!

However, around 11 AM UTC on May 15th, Schlesi was observed to be nearly 50 epochs away from finality. While the initial cause of the finality delay has not been ascertained, such delays are not rare in a test network where participants are not obligated or incentivized to keep their validators running. The lack of finality eventually led to large increases in memory usage in Teku’s beacon nodes, which resulted in Teku’s validators having difficulty producing blocks and attestations. This eventually led to Teku nodes forking off from the chain that the Prysm and Lighthouse nodes were on.

The initial chain split: the bottom-most chain is Teku, and the middle chain (above Teku) is Lighthouse and Prysm.

After the initial split and almost 300 epochs of no finality, there was a second chain split which irreversibly split the chain further at slot 150496. A submitted block contained an attester slashing, which revealed differences in how the clients tally rewards and penalties for slashed validators. This exposed 2 different penalty calculation bugs in Prysm! After this chain split, it was decided that the best course of action would be to restart with a clean slate. While the chain could have been recovered at this point, we decided it was best to immediately use what was learned from this event and get started on a new multiclient test network! Efforts are already under way to launch the Witti Eth2.0 Testnet! It’s not ready just yet, but it will be very soon, so stay tuned!

Low Participation/Finality Incident on the Topaz Testnet Now Resolved

We experienced a finality incident due to a bug and a dip in participation on May 19th, as evidenced by the public charts on beaconcha.in and etherscan. These incidents led to all validators accumulating penalties for the entire period of no finality. We have good evidence that the issue started when a pull request affecting how validators compute their duties was merged into master.

The problem was a small discrepancy in clock time when requesting duties. Even a small discrepancy could change the epoch that duties were computed for, so validators would receive duties different from the ones they expected. To fix the problem, we reverted the use of clock time and went back to computing duties from the epoch specified in the validator’s RPC request.
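To illustrate why a clock-derived epoch is fragile, here is a minimal sketch. It assumes 12-second slots and 32 slots per epoch, and the names `epochFromClock` and `dutiesEpoch` are hypothetical, not Prysm's actual functions. A node whose clock is one second behind an epoch boundary computes the previous epoch, while trusting the epoch in the request avoids the mismatch.

```go
package main

// Assumed timing parameters for illustration only; not read from Prysm's config.
const (
	secondsPerSlot = 12
	slotsPerEpoch  = 32
)

// epochFromClock derives the epoch from wall-clock time (Unix seconds).
// Hypothetical helper: a one-second error near a boundary changes the result.
func epochFromClock(genesisUnix, nowUnix uint64) uint64 {
	if nowUnix < genesisUnix {
		return 0
	}
	return (nowUnix - genesisUnix) / secondsPerSlot / slotsPerEpoch
}

// dutiesEpoch chooses which epoch to compute duties for. With useClock set to
// false it simply trusts the epoch carried in the validator's request.
func dutiesEpoch(requested, genesisUnix, nowUnix uint64, useClock bool) uint64 {
	if useClock {
		return epochFromClock(genesisUnix, nowUnix)
	}
	return requested
}
```

With these assumed parameters an epoch lasts 384 seconds, so a validator asking for epoch 1 at second 384 would be answered with epoch 0 duties by a node whose clock reads 383.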

Along with the above issue, participation was also low as one of our teammates who holds 15% of validators turned his validators off due to his system performing an automatic update. The chain has since recovered perfectly and the bug has been resolved.

Preston also tagged all our community members to make sure they were doing their part. Thanks everyone!

📝 Merged Code, Pull Requests, and Issues

Big Improvements to Advanced State Management in Beacon Nodes
Multiple improvements have been made to the new state management functionality to make it more user friendly. One deficiency we fixed: during migration, if an archived slot happened to be a skipped slot (a slot without a block), no archived state was saved for it. This was fixed in #5950, so every archived slot now has an archived state. We also improved the new state management to better handle genesis cases and the edge cases that were reported while supporting the slasher. Another great change was making `SlotsPerArchivedPoint` configurable. The default `SlotsPerArchivedPoint` has been raised from 256 slots to 2048 slots in response to community feedback. Also based on user feedback, we added a new flag, `skip-regen-historical-states`, which lets the node skip regenerating and re-saving all the historical states. This makes it easy to switch over and experiment with the new state management, and it reduces concurrent memory usage.
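As a rough sketch of what the archived point interval controls, the helpers below assume that archived points fall on slots that are exact multiples of `SlotsPerArchivedPoint`, with slot 0 counted as the first one. The function names are hypothetical and do not mirror Prysm's code.

```go
package main

// isArchivedPoint reports whether a slot is an archived point, assuming
// archived points sit on exact multiples of the interval.
func isArchivedPoint(slot, slotsPerArchivedPoint uint64) bool {
	return slotsPerArchivedPoint != 0 && slot%slotsPerArchivedPoint == 0
}

// archivedPointCount returns how many archived states exist up to and
// including headSlot. The +1 accounts for slot 0.
func archivedPointCount(headSlot, slotsPerArchivedPoint uint64) uint64 {
	if slotsPerArchivedPoint == 0 {
		return 0
	}
	return headSlot/slotsPerArchivedPoint + 1
}

// lastArchivedPoint returns the closest archived slot at or before slot,
// which is where regenerating a historical state would have to start from.
func lastArchivedPoint(slot, slotsPerArchivedPoint uint64) uint64 {
	if slotsPerArchivedPoint == 0 {
		return 0
	}
	return slot - slot%slotsPerArchivedPoint
}
```

Under these assumptions, a chain at slot 150496 holds 588 archived states with an interval of 256 but only 74 with an interval of 2048. The trade-off is that a historical state may need to be replayed from up to 2047 slots back instead of 255.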

P2P Networking Improvements
With Prysm participating in Schlesi and Topaz, we have been able to test our networking code against other clients. Previously, all peer communication was Prysm to Prysm, which made it easier to run and test but also hid many subtle bugs and ways in which Prysm diverged from the current networking spec. For example, Prysm was eagerly connecting to and handshaking with other clients, whereas the spec requires that only the dialing peer initiate the handshake. We were also incorrectly handling status messages from forked peers and abruptly disconnecting from them.
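The handshake rule can be reduced to a check on connection direction. This is a minimal sketch with hypothetical names (`direction`, `shouldInitiateHandshake`); it does not use the libp2p API and is not how Prysm structures its handlers.

```go
package main

// direction records which side opened the connection.
type direction int

const (
	inbound  direction = iota // the remote peer dialed us
	outbound                  // we dialed the remote peer
)

// shouldInitiateHandshake reports whether the local node should send the
// first status message: only the dialing side does, and the dialed side
// waits for the remote status and then responds.
func shouldInitiateHandshake(d direction) bool {
	return d == outbound
}
```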

These are just a few of the improvements we have been carrying out over the past couple of weeks. Stay tuned for more improvements as we upgrade to the v0.12 version of the spec!

Slasher Runtime Improvements and Middleware Progress
Our team member Ivan has been focusing on improving the slasher client’s performance to encourage more people to run it. While performing historical slashing detection on the network, the slasher detected over 1400 unique slashable validators! However, our latest alpha8 release has a processing bug that prevents historical attestations from being received. This will be fixed in the next release! We sent out a few of these slashings, with one proposer earning almost a full GöETH for proposing a block that included the attester slashing.

Validator 7104 hit the jackpot!

Many of these great improvements will be available in the next release, so be ready to update!

Consensys’s fantastic “Ethereum 2.0 Staking Report” revealed that a feature most stakers care about is slashing protection, and we agree! We understand that being slashed while running an honest validator is a real concern for our users.

Local database slashing protection is already in place in our validator client, but local protection has its limits: it only knows the client’s local history. An all-knowing node that can reply with certainty that an attestation or block is not slashable is very important for protecting validators.

Thankfully, we already have the “Hash slinging slasher”, which serves this purpose. Until now, the slasher was only used to find slashings on the network and send them to the beacon node for block inclusion. It can do this because it stores all blocks and attestations that have ever been published to the network, and the same data can be used to protect a validator from being slashed. You can already try the new feature from our master branch using the `--enable-external-slasher-protection` flag.
More enhancements for this feature will be coming soon.
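For readers unfamiliar with what “slashable” means for attestations, here is a minimal sketch of the two conditions (double vote and surround vote) applied as a pre-signing check against a known history. The types and function names are hypothetical and simplified; this is not the slasher's actual API or storage model.

```go
package main

// attestation is a simplified stand-in holding only the fields that matter
// for slashing checks: source epoch, target epoch, and a data root.
type attestation struct {
	source, target uint64
	root           [32]byte
}

// isSlashablePair reports whether a and b together form a slashable offense:
// a double vote (different data, same target) or a surrounding b.
func isSlashablePair(a, b attestation) bool {
	doubleVote := a != b && a.target == b.target
	surrounds := a.source < b.source && b.target < a.target
	return doubleVote || surrounds
}

// isSafeToSign checks a candidate against every attestation in history,
// in both directions, and returns false if signing it would be slashable.
func isSafeToSign(history []attestation, candidate attestation) bool {
	for _, prev := range history {
		if isSlashablePair(prev, candidate) || isSlashablePair(candidate, prev) {
			return false
		}
	}
	return true
}
```

A local database only holds the history this validator client has seen, whereas a slasher-backed check could run the same comparison against everything published to the network.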

Validator Client Improvements from Contributors
The number of contributors to Prysm has been growing, and we appreciate any help provided in this massive effort! In this update, we’re giving a shout out to michaelhly and rkapka for their awesome contributions to the validator client. Michael has added a new validator client command that lets users print the public keys and statuses of any validator keys they have locally. He also made another contribution that allows an endpoint of our API to accept a list of validator indices instead of only public keys. Thanks Michael!

Huge thanks to rkapka for adding a great new command that lets users change the passwords for their validator keys, a convenient feature if you ever wish to change your validator keystore passwords! Speaking of passwords, our manual password entry is now tested thanks to connerwstein. Thank you Conner!

If anyone is interested in contributing, feel free to join our Discord and reach out!

Optimizations to Initial Synchronization Process
Initial synchronization has received yet another round of updates, where our main aim was to speed up the overall process without compromising on robustness. Our team member Faraz made several important optimization pull requests which have already been merged into our main repo: #5853, #5881, #5887.

These PRs allow us to use network bandwidth more efficiently by relying on weighted round robin when polling peers (so that no peer is unfairly overloaded), all while staying within each peer’s rate limiter requirements.
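The idea can be sketched with the well-known smooth weighted round robin algorithm plus a per-peer request budget standing in for the rate limiter. This is an illustration under those assumptions, with hypothetical names (`peer`, `nextPeer`, `budget`); it is not the scheduler implemented in the PRs above.

```go
package main

// peer is a hypothetical record of a sync peer.
type peer struct {
	id      string
	weight  int // relative share of requests this peer should receive
	current int // running score used by smooth weighted round robin
	budget  int // requests left before we would exceed the peer's rate limit
}

// nextPeer returns the index of the peer to poll next, or -1 when no peer
// has budget left. Peers without budget or weight are skipped entirely.
func nextPeer(peers []peer) int {
	total, best := 0, -1
	for i := range peers {
		if peers[i].budget <= 0 || peers[i].weight <= 0 {
			continue
		}
		// Every eligible peer gains its weight each round.
		peers[i].current += peers[i].weight
		total += peers[i].weight
		if best == -1 || peers[i].current > peers[best].current {
			best = i
		}
	}
	if best == -1 {
		return -1
	}
	// The chosen peer pays back the total, which spreads picks out evenly.
	peers[best].current -= total
	peers[best].budget--
	return best
}
```

With weights 2 and 1, six consecutive calls pick the first peer four times and the second twice, interleaved rather than in bursts, and a peer whose budget reaches zero is no longer selected.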

The synchronization speedup is considerable: in tests we’ve been conducting this week, overall performance was around 90–100 blocks/sec. While our current unit tests do pass, we still need a bit more time and more detailed tests before the final optimization pull request is merged into the master branch. If you want to track its progress, please follow PR #5798.

🔜 Upcoming Work

Spec update to v0.12.0
The EF researchers have released the latest version of the spec, v0.12.0! This release includes important BLS standard updates and several improvements to the spec that resulted from discoveries on the Schlesi testnet. The update is mainly composed of penalty/rewards fixes and improvements, including one that ensures optimally performing validators do not decrease in balance during times of delayed finality. This has been a highly requested improvement for a while, so it’s great to see it made! We have already made great progress on the required changes.

Custom chain configuration stress testing
Production readiness is a difficult topic to address perfectly for a mainnet release. However, there are various metrics we can measure clearly and aim to optimize in order to have more confidence in Prysm’s robustness on mainnet. In particular, being able to run testnets with short slot durations for a significant amount of time is important for understanding how the node recovers under stress. Although official metrics and expectations for multiclient have yet to be published by the EF, some of the items we foresee being important for Prysm are:

  • Able to recover from N epochs since finality
  • Able to recover from a chain split
  • Able to recover from a majority of nodes and validators going offline
  • Able to handle high periods of forking in short slot duration configurations
  • Able to handle N = {4000, 16000, 100000} validators with various node configurations for a decent period of time

We have set up our infrastructure to be able to test these targets locally and gather evidence for what our bottlenecks are. Most of the problems we see are in attestation aggregation and block processing, especially in the short slot-time testnets. So far, however, Prysm’s resource usage looks on par with our regular Topaz testnet when running with 1-second slot times.

We will be updating the community on Prysm’s readiness regarding these metrics via our biweekly updates over the coming weeks.

Interested in Contributing?

We are always looking for devs interested in helping us out. If you know Go or Solidity and want to contribute to the forefront of research on Ethereum, please drop us a line and we’d be more than happy to help onboard you :).

Check out our contributing guidelines and our open projects on Github. Each task and issue is grouped into the Phase 0 milestone along with a specific project it belongs to.

As always, follow us on Twitter or join our Discord server and let us know what you want to help with.

Official, Prysmatic Labs Ether Donation Address

0x9B984D5a03980D8dc0a24506c968465424c81DbE

Official, Prysmatic Labs ENS Name

prysmatic.eth


Eth 2.0 Dev Update #50 — “Multiclient Testnet Restart + Slasher Improvements” was originally published in Prysmatic Labs on Medium, where people are continuing the conversation by highlighting and responding to this story.