
fix: Make neqo pass amplificationlimit QNS test #1875

Open · wants to merge 23 commits into main from fix-1183

Conversation

larseggert
Collaborator

Fixes #1183

Resolved review threads:
neqo-transport/src/connection/mod.rs
neqo-transport/src/connection/tests/handshake.rs
neqo-transport/src/recovery.rs
qns/interop.sh
@larseggert larseggert marked this pull request as ready for review May 6, 2024 09:24
@larseggert
Collaborator Author

One test not passing, see #1490 (comment)

@larseggert larseggert linked an issue May 6, 2024 that may be closed by this pull request

github-actions bot commented May 7, 2024

Failed Interop Tests
QUIC Interop Runner, client vs. server: neqo-latest as client, neqo-latest as server (all results)

Succeeded Interop Tests
QUIC Interop Runner, client vs. server: neqo-latest as client, neqo-latest as server

Unsupported Interop Tests
QUIC Interop Runner, client vs. server: neqo-latest as client, neqo-latest as server

@larseggert
Collaborator Author

@martinthomson @KershawChang if either of you understands the timer code, I'd appreciate some help here. The idle_timeout_crazy_rtt test persistently crashes with assertion failed: earliest > now, which seems to be a long-standing issue.

It's straightforward to reproduce:

# RUST_LOG=trace cargo nextest run --no-capture idle_timeout_crazy_rtt
...
0s 35ms DEBUG [idle_timeout_crazy_rtt] processing Server c11688b68f20d6a8b4f512e565e90a
0s 35ms TRACE [Server c11688b68f20d6a8b4f512e565e90a] process_output Handshaking Instant { tv_sec: 5648807, tv_nsec: 122490916 }
0s 35ms TRACE IdleTimeout::expiry@Instant { tv_sec: 5648807, tv_nsec: 122490916 } pto=325ms, ka=false => Instant { tv_sec: 610448805, tv_nsec: 622490916 }
0s 35ms TRACE [LossRecovery] timeout Instant { tv_sec: 5648807, tv_nsec: 122490916 }
0s 35ms TRACE detect lost ap: now=Instant { tv_sec: 5648807, tv_nsec: 122490916 } delay=112.5ms
0s 35ms TRACE detect lost hs: now=Instant { tv_sec: 5648807, tv_nsec: 122490916 } delay=112.5ms
0s 35ms TRACE lost=11, time sent Instant { tv_sec: 5648806, tv_nsec: 822490916 } is before lost_delay 112.5ms
0s 35ms DEBUG packet_lost this=0x600002f741e0, pn=11, ps=675
0s 35ms DEBUG [NewReno CongCtrl 662/2674 ssthresh 2674] Cong event -> recovery; cwnd 2674, ssthresh 2674
0s 35ms DEBUG [NewReno CongCtrl 662/2674 ssthresh 2674] state -> RecoveryStart
0s 35ms DEBUG on_packets_lost this=0x600002f741e0, bytes_in_flight=662, cwnd=2674, state=RecoveryStart
0s 35ms TRACE detect lost in: now=Instant { tv_sec: 5648807, tv_nsec: 122490916 } delay=112.5ms
0s 35ms TRACE lost=11, time sent Instant { tv_sec: 5648806, tv_nsec: 822490916 } is before lost_delay 112.5ms
0s 35ms DEBUG packet_lost this=0x600002f741e0, pn=11, ps=662
0s 35ms DEBUG on_packets_lost this=0x600002f741e0, bytes_in_flight=0, cwnd=2674, state=RecoveryStart
0s 35ms DEBUG [Server c11688b68f20d6a8b4f512e565e90a] Lost: Crypto(CryptoRecoveryToken { space: Handshake, offset: 0, length: 628 })
0s 35ms INFO Lost crypto frame space=hs offset=0 length=628
0s 35ms DEBUG [Server c11688b68f20d6a8b4f512e565e90a] Lost: Crypto(CryptoRecoveryToken { space: Initial, offset: 0, length: 90 })
0s 35ms INFO Lost crypto frame space=in offset=0 length=90
0s 35ms TRACE [Server c11688b68f20d6a8b4f512e565e90a] output Instant { tv_sec: 5648807, tv_nsec: 122490916 }
0s 35ms DEBUG [LossRecovery] get send profile Instant { tv_sec: 5648807, tv_nsec: 122490916 }
0s 35ms DEBUG [Server c11688b68f20d6a8b4f512e565e90a] output_path send_profile SendProfile { limit: 255, pto: None, probe: (), paced: false }
0s 35ms DEBUG Building Initial dcid 0900000000c1f2c64a scid 0800000000c32312
0s 35ms DEBUG Building Handshake dcid 0900000000c1f2c64a scid 0800000000c32312
0s 35ms DEBUG Building Short dcid 0900000000c1f2c64a
0s 35ms DEBUG TX blocked, profile=SendProfile { limit: 255, pto: None, probe: (), paced: false }
0s 35ms TRACE [Server c11688b68f20d6a8b4f512e565e90a] Get callback delay Instant { tv_sec: 5648807, tv_nsec: 122490916 }
0s 35ms TRACE ack_time for ap = None
0s 35ms TRACE ack_time for hs = None
0s 35ms TRACE ack_time for in = None
0s 35ms TRACE IdleTimeout::expiry@Instant { tv_sec: 5648807, tv_nsec: 122490916 } pto=325ms, ka=false => Instant { tv_sec: 610448805, tv_nsec: 622490916 }
0s 35ms TRACE [Server c11688b68f20d6a8b4f512e565e90a] Idle/keepalive timer Instant { tv_sec: 610448805, tv_nsec: 622490916 }
0s 35ms TRACE [LossRecovery] next_timeout loss=None pto=Some(Instant { tv_sec: 5648807, tv_nsec: 122490916 })
0s 35ms TRACE [Server c11688b68f20d6a8b4f512e565e90a] Loss recovery timer Instant { tv_sec: 5648807, tv_nsec: 122490916 }
thread 'idle_timeout_crazy_rtt' panicked at neqo-transport/src/connection/mod.rs:1068:9:
assertion failed: earliest > now
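
For context, here is a minimal, self-contained sketch (assumed names, not neqo's actual implementation) of the kind of check that fires here: the connection gathers its candidate timers, picks the earliest, and debug-asserts that it still lies in the future before turning it into a callback delay. With the huge simulated RTT, the loss-recovery timer can already be at or before now, which trips the assertion.

```rust
use std::time::{Duration, Instant};

// Sketch only: `next_callback_delay` is a hypothetical stand-in for the logic
// around neqo-transport/src/connection/mod.rs:1068, where the panic occurs.
fn next_callback_delay(timers: &[Option<Instant>], now: Instant) -> Duration {
    let earliest = timers
        .iter()
        .flatten()
        .min()
        .copied()
        .expect("at least one timer should be armed");
    // If a timer (e.g. the loss-recovery timer in the trace above) is already
    // at or before `now`, this is the `assertion failed: earliest > now` panic.
    debug_assert!(earliest > now);
    earliest.saturating_duration_since(now)
}

fn main() {
    let now = Instant::now();
    let idle = Some(now + Duration::from_millis(325));
    let loss_recovery = Some(now + Duration::from_millis(113));
    // Both timers are in the future here, so this returns ~113ms; a timer at
    // or before `now` would trip the debug assertion instead.
    println!("{:?}", next_callback_delay(&[idle, loss_recovery], now));
}
```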
@KershawChang
Collaborator

KershawChang commented May 7, 2024

@martinthomson @KershawChang if any of you understands the timer code, I'd appreciate some help here. The idle_timeout_crazy_rtt persistently crashes with assertion failed: earliest > now, which seems to be a long-standing issue.

I think we need to implement what Martin suggested in this comment.
I'll give it a shot.
In order to unblock this, maybe we can remove that debug assertion first?

@larseggert
Collaborator Author

larseggert commented May 7, 2024

In order to unblock this, maybe we can remove that debug assertion first?

Then it dies here:

thread 'idle_timeout_crazy_rtt' panicked at test-fixture/src/sim/mod.rs:202:25:
assertion `left != right` failed
  left: 0ns
 right: 0ns

And when I take that out, too, it spins forever.
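
To illustrate why the zero delay leads to a spin, here is a simplified sketch with assumed types (not the actual test-fixture simulator): the simulator advances virtual time by whatever callback delay the connection requests, so a 0ns delay never makes progress, which is what the assertion in sim/mod.rs guards against.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for the fixture's driver loop; `Output` mimics the
// idea of a connection asking to be called back after some delay.
enum Output {
    Callback(Duration),
    Done,
}

fn drive(mut step: impl FnMut(Instant) -> Output) {
    let mut now = Instant::now();
    loop {
        match step(now) {
            Output::Callback(delay) => {
                // The fixture asserts the delay is non-zero; without that
                // check, `now` stops advancing and this loop spins forever.
                assert_ne!(delay, Duration::from_nanos(0));
                now += delay;
            }
            Output::Done => break,
        }
    }
}

fn main() {
    let mut steps = 3;
    drive(|_now| {
        if steps == 0 {
            Output::Done
        } else {
            steps -= 1;
            Output::Callback(Duration::from_millis(10))
        }
    });
    println!("simulation finished");
}
```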


github-actions bot commented May 7, 2024

Firefox builds for this PR

The following builds are available for testing. Crossed-out builds did not succeed.

@larseggert
Collaborator Author

Making this a draft until the test failure is fixed.

@larseggert larseggert marked this pull request as draft May 31, 2024 12:50

codecov bot commented Jun 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.99%. Comparing base (2e419cc) to head (bf0b698).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1875   +/-   ##
=======================================
  Coverage   94.99%   94.99%           
=======================================
  Files         112      112           
  Lines       36366    36371    +5     
=======================================
+ Hits        34546    34552    +6     
+ Misses       1820     1819    -1     


@larseggert larseggert marked this pull request as ready for review June 18, 2024 11:00

github-actions bot commented Jun 18, 2024

Benchmark results

Performance differences relative to 2e419cc.

coalesce_acked_from_zero 1+1 entries: Change within noise threshold.
       time:   [189.65 ns 190.16 ns 190.73 ns]
       change: [-1.3971% -0.9570% -0.5150%] (p = 0.00 < 0.05)

Found 7 outliers among 100 measurements (7.00%)
2 (2.00%) high mild
5 (5.00%) high severe

coalesce_acked_from_zero 3+1 entries: Change within noise threshold.
       time:   [232.27 ns 232.84 ns 233.47 ns]
       change: [-0.9548% -0.5922% -0.2226%] (p = 0.00 < 0.05)

Found 12 outliers among 100 measurements (12.00%)
4 (4.00%) high mild
8 (8.00%) high severe

coalesce_acked_from_zero 10+1 entries: Change within noise threshold.
       time:   [232.27 ns 232.85 ns 233.58 ns]
       change: [-1.0816% -0.6282% -0.1029%] (p = 0.01 < 0.05)

Found 7 outliers among 100 measurements (7.00%)
2 (2.00%) low mild
5 (5.00%) high severe

coalesce_acked_from_zero 1000+1 entries: Change within noise threshold.
       time:   [216.18 ns 216.43 ns 216.68 ns]
       change: [-1.3019% -0.6518% -0.0209%] (p = 0.04 < 0.05)

Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) high mild
6 (6.00%) high severe

RxStreamOrderer::inbound_frame(): No change in performance detected.
       time:   [119.57 ms 119.74 ms 119.99 ms]
       change: [-0.2703% -0.0082% +0.2759%] (p = 0.96 > 0.05)

Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe

transfer/Run multiple transfers with varying seeds: Change within noise threshold.
       time:   [55.861 ms 58.819 ms 61.812 ms]
       thrpt:  [64.713 MiB/s 68.005 MiB/s 71.607 MiB/s]
change:
       time:   [+0.8767% +8.8633% +17.466%] (p = 0.03 < 0.05)
       thrpt:  [-14.869% -8.1417% -0.8691%]

Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild

transfer/Run multiple transfers with the same seed: No change in performance detected.
       time:   [63.853 ms 70.430 ms 76.973 ms]
       thrpt:  [51.966 MiB/s 56.794 MiB/s 62.644 MiB/s]
change:
       time:   [-19.554% -8.8575% +3.1440%] (p = 0.14 > 0.05)
       thrpt:  [-3.0481% +9.7183% +24.307%]
1-conn/1-100mb-resp (aka. Download)/client: No change in performance detected.
       time:   [287.17 ms 293.54 ms 299.50 ms]
       thrpt:  [333.89 MiB/s 340.67 MiB/s 348.22 MiB/s]
change:
       time:   [-1.6402% +0.7881% +3.3389%] (p = 0.56 > 0.05)
       thrpt:  [-3.2310% -0.7820% +1.6675%]

Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) low mild
1 (10.00%) high mild

1-conn/10_000-parallel-1b-resp (aka. RPS)/client: No change in performance detected.
       time:   [406.71 ms 410.20 ms 413.77 ms]
       thrpt:  [24.168 Kelem/s 24.378 Kelem/s 24.588 Kelem/s]
change:
       time:   [-2.2923% -1.0746% +0.1154%] (p = 0.08 > 0.05)
       thrpt:  [-0.1153% +1.0863% +2.3461%]

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

1-conn/1-1b-resp (aka. HPS)/client: No change in performance detected.
       time:   [67.949 ms 68.263 ms 68.623 ms]
       thrpt:  [14.572  elem/s 14.649  elem/s 14.717  elem/s]
change:
       time:   [-0.7625% -0.0222% +0.6877%] (p = 0.95 > 0.05)
       thrpt:  [-0.6830% +0.0223% +0.7683%]

Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) low mild
1 (1.00%) high mild
6 (6.00%) high severe

Client/server transfer results

Transfer of 33554432 bytes over loopback.

| Client | Server | CC    | Pacing | Mean [ms]     | Min [ms] | Max [ms] | Relative |
|--------|--------|-------|--------|---------------|----------|----------|----------|
| msquic | msquic |       |        | 194.4 ± 90.0  | 107.6    | 358.7    | 1.00     |
| neqo   | msquic | reno  | on     | 267.1 ± 7.6   | 254.4    | 279.8    | 1.00     |
| neqo   | msquic | reno  |        | 267.9 ± 20.9  | 248.7    | 322.7    | 1.00     |
| neqo   | msquic | cubic | on     | 269.1 ± 11.9  | 252.4    | 285.5    | 1.00     |
| neqo   | msquic | cubic |        | 268.0 ± 10.4  | 254.0    | 287.4    | 1.00     |
| msquic | neqo   | reno  | on     | 204.1 ± 85.6  | 151.3    | 368.2    | 1.00     |
| msquic | neqo   | reno  |        | 159.4 ± 61.3  | 111.4    | 391.4    | 1.00     |
| msquic | neqo   | cubic | on     | 145.5 ± 14.6  | 115.2    | 170.7    | 1.00     |
| msquic | neqo   | cubic |        | 215.3 ± 68.5  | 160.3    | 393.4    | 1.00     |
| neqo   | neqo   | reno  | on     | 214.5 ± 57.9  | 164.2    | 390.1    | 1.00     |
| neqo   | neqo   | reno  |        | 235.8 ± 102.9 | 167.7    | 520.8    | 1.00     |
| neqo   | neqo   | cubic | on     | 262.7 ± 107.7 | 179.7    | 530.9    | 1.00     |
| neqo   | neqo   | cubic |        | 223.8 ± 70.0  | 155.1    | 379.1    | 1.00     |

⬇️ Download logs

@mozilla mozilla deleted a comment from github-actions bot Jun 18, 2024
@larseggert
Collaborator Author

Removing the early return makes it fail the amplificationlimit test again, though. Sigh.

@larseggert
Collaborator Author

This is again waiting for a fix for the idle_timeout_crazy_rtt test failure.

@KershawChang
Collaborator

I've given this another look, and I think the problem might be about when the function discard_packet is called.

pub fn discard_packet(&mut self, sent: &SentPacket, now: Instant, stats: &mut Stats) {

Before this change, the server was able to update the RTT to the value set here

0s326ms INFO [pri-unv-path:090000000037faf420 [fe80::1]:443->[fe80::1]:443] discarding a packet without an RTT estimate; guessing RTT=2999654.700000001s

With this change, the server can only update the RTT to a small value:

0s202ms INFO [pri-unv-path:0800000000b3b9a6 [fe80::1]:443->[fe80::1]:443] discarding a packet without an RTT estimate; guessing RTT=300.000001ms

Unfortunately, I still can't figure out the correct way to fix this.
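
A hypothetical sketch of why the timing of that call matters (assumed names, not neqo's actual discard_packet): when a packet is discarded before any RTT sample exists, the elapsed time since it was sent serves as a crude RTT guess, so calling it early yields a guess around 300ms while calling it only after the simulated multi-week RTT yields the enormous value above.

```rust
use std::time::{Duration, Instant};

// Assumed, simplified stand-in for the RTT-guessing path hinted at by the
// "discarding a packet without an RTT estimate; guessing RTT=..." log line.
struct RttEstimate {
    estimate: Option<Duration>,
}

impl RttEstimate {
    fn on_packet_discarded(&mut self, time_sent: Instant, now: Instant) {
        if self.estimate.is_none() {
            // The guess is simply how long ago the packet was sent, so it
            // depends entirely on when discard_packet runs.
            let guess = now.saturating_duration_since(time_sent);
            println!("discarding a packet without an RTT estimate; guessing RTT={guess:?}");
            self.estimate = Some(guess);
        }
    }
}

fn main() {
    let mut rtt = RttEstimate { estimate: None };
    let sent = Instant::now();
    // Discarding shortly after sending produces a small guess (~300ms here).
    rtt.on_packet_discarded(sent, sent + Duration::from_millis(300));
}
```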

@larseggert
Collaborator Author

With this change, the server can only update the RTT to a small value:

0s202ms INFO [pri-unv-path:0800000000b3b9a6 [fe80::1]:443->[fe80::1]:443] discarding a packet without an RTT estimate; guessing RTT=300.000001ms

I don't see this log line when I run the test for this PR? (I do see the one for the main branch.)

Unfortunately, I still can't figure out the correct way to fix this.

What seems to happen is that set_valid() is never called on the server side for the initial path, because this patch moves that to after the path is actually validated. So the server stays amplification-limited (see the sketch below). I'm not sure how that would affect the timers though.
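
As background, a minimal sketch of the anti-amplification rule involved (assumed names, not neqo's Path API): per RFC 9000 Section 8.1, a server may send at most three times the bytes it has received on an unvalidated path, and marking the path valid lifts that cap.

```rust
// Sketch with assumed names, not neqo's actual Path type.
struct Path {
    validated: bool,
    bytes_received: usize,
    bytes_sent: usize,
}

impl Path {
    // How many more bytes the server may send right now: unlimited once the
    // path is validated, otherwise capped at 3x the bytes received so far.
    fn amplification_budget(&self) -> usize {
        if self.validated {
            usize::MAX
        } else {
            (3 * self.bytes_received).saturating_sub(self.bytes_sent)
        }
    }

    // This PR moves the call to set_valid() to after path validation actually
    // completes, so the server stays under the 3x cap for longer.
    fn set_valid(&mut self) {
        self.validated = true;
    }
}

fn main() {
    let mut path = Path { validated: false, bytes_received: 1200, bytes_sent: 3000 };
    assert_eq!(path.amplification_budget(), 600);
    path.set_valid();
    assert_eq!(path.amplification_budget(), usize::MAX);
}
```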

@larseggert larseggert force-pushed the fix-1183 branch 2 times, most recently from 2fb43fc to f574643 on July 17, 2024 05:42
@larseggert
Collaborator Author

Some QNS L1/C1 tests are failing because #1981 is needed.

@larseggert larseggert added the needs-review Needs a review by a CODEOWNER. label Jul 17, 2024
Collaborator

@mxinden mxinden left a comment


Change makes sense to me. One minor comment.

Resolved review thread: neqo-transport/src/recovery/mod.rs
larseggert added a commit to larseggert/neqo that referenced this pull request Jul 19, 2024
* Cancel runs after 20 min, and report that that happened
* Build NSS from source, to make SSLKEYLOGFILE work
* Shorten retention for some artifacts
* Run interop server at `debug` log level

Factored out of mozilla#1875, which was a convenient test case.
@larseggert
Collaborator Author

So this seems to work, in the sense that there are no more A failures when we are the server.

But maybe there are also more L1 and C1 failures now, and possibly more frequent deadlocks (= runs that get cancelled by the timeout)?

It's difficult to say something definitive here, since we have had these occur randomly all along, and they may also not be due to bugs on our end.

Labels
needs-review Needs a review by a CODEOWNER.
4 participants