
Refactor Error handling to fix bugs #271

Open
wants to merge 89 commits into base: master
Conversation

Ruben2424
Contributor

@Ruben2424 Ruben2424 commented Dec 30, 2024

See #270
This PR implements some of the ideas from #177

Summary

Summary of the most important changes

Error Types

This PR changes how the Error handling in H3 works.
Errors can originate in two places:

  • Errors from the peer's h3 implementation or from the QUIC layer arrive via the quic traits.
  • When h3 itself encounters a protocol error, that error also needs to be handled.

Errors from the QUIC Layer

Before this PR, errors from the QUIC layer were reported via a generic error type that implemented this trait:

/// Trait that represents an error from the transport layer
pub trait Error: std::error::Error + Send + Sync {
    /// Check if the current error is a transport timeout
    fn is_timeout(&self) -> bool;

    /// Get the QUIC error code from connection close or stream stop
    fn err_code(&self) -> Option<u64>;
}
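To make the ambiguity concrete, here is a minimal, self-contained sketch of how a transport might have implemented the old trait. `ToyTransportError` is a hypothetical stand-in, not a real h3 or quinn type:

```rust
use std::fmt;

/// The old trait, reproduced from the description above.
pub trait Error: std::error::Error + Send + Sync {
    /// Check if the current error is a transport timeout
    fn is_timeout(&self) -> bool;
    /// Get the QUIC error code from connection close or stream stop
    fn err_code(&self) -> Option<u64>;
}

/// Hypothetical transport error standing in for a real quic
/// implementation's error type.
#[derive(Debug)]
struct ToyTransportError {
    timed_out: bool,
    code: Option<u64>,
}

impl fmt::Display for ToyTransportError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "toy transport error (code: {:?})", self.code)
    }
}

impl std::error::Error for ToyTransportError {}

impl Error for ToyTransportError {
    fn is_timeout(&self) -> bool {
        self.timed_out
    }
    fn err_code(&self) -> Option<u64> {
        self.code
    }
}

fn main() {
    let e = ToyTransportError { timed_out: false, code: Some(0x0101) };
    // Note the ambiguity: the same accessor covers both connection close
    // and stream stop, so the caller cannot tell which scope the error has.
    assert_eq!(e.err_code(), Some(0x0101));
    assert!(!e.is_timeout());
}
```

Because `err_code` covers both connection close and stream stop, the caller could not tell the scope of the error apart, which is exactly what the new types below fix.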

With this PR, the error types used in the quic traits are defined by h3 itself.
For connection-level traits:

/// Error type to communicate that the quic connection was closed
///
/// This is used to implement the quic abstraction traits
#[derive(Clone)]
pub enum ConnectionErrorIncoming {
    /// Error from the http3 layer
    ApplicationClose {
        /// http3 error code
        error_code: u64,
    },
    /// Quic connection timeout
    Timeout,
    /// This variant can be used to signal that an internal error occurred within the trait implementations
    /// h3 will close the connection with H3_INTERNAL_ERROR
    InternalError(String),
    /// An unknown error occurred (not relevant to h3)
    ///
    /// For example when the quic implementation errors because of a protocol violation
    Undefined(Arc<dyn std::error::Error + Send + Sync>),
}
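As a hedged sketch of how a quic trait implementation might map its own close events onto this type: the `ToyQuicClose` enum and its variants are invented for illustration, and the `ConnectionErrorIncoming` definition is reproduced from above with a `Debug` derive added so the asserts work.

```rust
use std::sync::Arc;

/// Reproduced from the definition above, with `Debug` added for the asserts.
#[derive(Clone, Debug)]
pub enum ConnectionErrorIncoming {
    ApplicationClose { error_code: u64 },
    Timeout,
    InternalError(String),
    Undefined(Arc<dyn std::error::Error + Send + Sync>),
}

/// Hypothetical close events of some quic implementation; these names are
/// invented for illustration, not a real API.
enum ToyQuicClose {
    Application { code: u64 },
    TimedOut,
    ProtocolViolation { description: String },
}

/// How a trait implementation might translate its own events into the
/// type h3 now expects.
fn to_incoming(ev: ToyQuicClose) -> ConnectionErrorIncoming {
    match ev {
        ToyQuicClose::Application { code } => {
            ConnectionErrorIncoming::ApplicationClose { error_code: code }
        }
        ToyQuicClose::TimedOut => ConnectionErrorIncoming::Timeout,
        // A transport-level protocol violation is not relevant to h3
        // itself, so it goes into `Undefined`.
        ToyQuicClose::ProtocolViolation { description } => {
            ConnectionErrorIncoming::Undefined(Arc::new(std::io::Error::new(
                std::io::ErrorKind::Other,
                description,
            )))
        }
    }
}

fn main() {
    let mapped = to_incoming(ToyQuicClose::Application { code: 0x0100 });
    assert!(matches!(
        mapped,
        ConnectionErrorIncoming::ApplicationClose { error_code: 0x0100 }
    ));
    assert!(matches!(
        to_incoming(ToyQuicClose::TimedOut),
        ConnectionErrorIncoming::Timeout
    ));
}
```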

And for stream-level traits:

/// Error type to communicate that the stream was closed
///
/// This is used to implement the quic abstraction traits
/// When an error occurs within the quic trait implementation, use the InternalError variant of ConnectionErrorIncoming
#[derive(Debug, Clone)]
pub enum StreamErrorIncoming {
    /// Stream is closed because the whole connection is closed
    ConnectionErrorIncoming {
        /// Connection error
        connection_error: ConnectionErrorIncoming,
    },
    /// Stream was closed by the peer
    StreamReset {
        /// Error code sent by the peer
        error_code: u64,
    },
    /// An unknown error occurred (not relevant to h3)
    ///
    /// h3 will handle this exactly like a StreamReset,
    /// e.g. closing the connection with an error if http3 forbids the stream to end, as with the control stream
    Unknown(Arc<dyn std::error::Error + Send + Sync>),
}

These types make it easier to separate stream errors from connection errors, and they rule out invalid states.
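A minimal sketch of that separation, using trimmed stand-ins for the two enums above (variants reduced and `Debug` added so the snippet compiles on its own):

```rust
/// Trimmed stand-in for `ConnectionErrorIncoming` from above.
#[derive(Clone, Debug)]
enum ConnectionErrorIncoming {
    Timeout,
}

/// Trimmed stand-in for `StreamErrorIncoming` from above.
#[derive(Clone, Debug)]
enum StreamErrorIncoming {
    ConnectionErrorIncoming { connection_error: ConnectionErrorIncoming },
    StreamReset { error_code: u64 },
}

/// One match is enough to tell whether only this stream failed or the
/// whole connection is gone; no ambiguous in-between state exists.
fn connection_is_dead(err: &StreamErrorIncoming) -> bool {
    matches!(err, StreamErrorIncoming::ConnectionErrorIncoming { .. })
}

fn main() {
    let reset = StreamErrorIncoming::StreamReset { error_code: 0x010c };
    let closed = StreamErrorIncoming::ConnectionErrorIncoming {
        connection_error: ConnectionErrorIncoming::Timeout,
    };
    assert!(!connection_is_dead(&reset));
    assert!(connection_is_dead(&closed));
}
```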

Return Error types

The user-facing error type changed from:

/// The error kind.
#[derive(Clone)]
pub(crate) struct ErrorImpl {
    pub(crate) kind: Kind,
    cause: Option<Arc<dyn std::error::Error + Send + Sync>>,
}

/// Some errors affect the whole connection, others only one Request or Stream.
/// See [errors](https://www.rfc-editor.org/rfc/rfc9114.html#errors) for more details.
#[derive(PartialEq, Eq, Hash, Clone, Copy, Debug)]
pub enum ErrorLevel {
    /// Error that will close the whole connection
    ConnectionError,
    /// Error scoped to a single stream
    StreamError,
}

// Warning: this enum is public only for testing purposes. Do not use it in
// downstream code or be prepared to refactor as changes happen.
#[doc(hidden)]
#[non_exhaustive]
#[derive(Clone, Debug)]
pub enum Kind {
    #[non_exhaustive]
    Application {
        code: Code,
        reason: Option<Box<str>>,
        level: ErrorLevel,
    },
    #[non_exhaustive]
    HeaderTooBig {
        actual_size: u64,
        max_size: u64,
    },
    // Error from QUIC layer
    #[non_exhaustive]
    Transport(Arc<dyn quic::Error>),
    // Connection has been closed with `Code::NO_ERROR`
    Closed,
    // Currently in a graceful shutdown procedure
    Closing,
    Timeout,
}

to

/// This enum represents the closure of a connection because the quic connection was closed
/// This can happen either on this endpoint, because of a protocol violation, or on the remote endpoint
///
/// When the code [`Code::H3_NO_ERROR`] is used by this peer or the remote peer, the connection is closed without an error
/// according to the [h3 spec](https://www.rfc-editor.org/rfc/rfc9114.html#name-http-3-error-codes)
#[derive(Debug, Clone)]
#[non_exhaustive]
pub enum ConnectionError {
    /// The error occurred on the local side of the connection
    #[non_exhaustive]
    Local {
        /// The error
        error: LocalError,
    },
    /// Error returned by the quic layer
    /// It might be a quic error, or the remote h3 endpoint closed the connection with an error
    #[non_exhaustive]
    Remote(ConnectionErrorIncoming),
    /// Timeout occurred
    #[non_exhaustive]
    Timeout,
}

/// This enum represents a local error
#[derive(Debug, Clone, Hash)]
#[non_exhaustive]
pub enum LocalError {
    #[non_exhaustive]
    /// The application closed the connection
    Application {
        /// The error code
        code: Code,
        /// The error reason
        reason: String,
    },
    #[non_exhaustive]
    /// Graceful closing of the connection initiated by the local peer
    Closing,
}

/// This enum represents a stream error
#[derive(Debug, Clone)]
#[non_exhaustive]
pub enum StreamError {
    /// The error occurred on the stream
    #[non_exhaustive]
    StreamError {
        /// The error code
        code: Code,
        /// The error reason
        reason: String,
    },
    /// Stream was Reset by the peer
    RemoteReset {
        /// Reset code received from the peer
        code: Code,
    },
    /// The error occurred on the connection
    #[non_exhaustive]
    ConnectionError(ConnectionError),
    /// This error is used when the MAX_FIELD_SECTION_SIZE is violated
    ///
    /// This can mean different things depending on the context
    /// When sending a request, it means that the request cannot be sent because the header is larger than permitted by the server
    /// When receiving, it means that the peer sent a header block exceeding the locally permitted maximum
    ///
    HeaderTooBig {
        /// The actual size of the header block
        actual_size: u64,
        /// The maximum size of the header block
        max_size: u64,
    },
    /// Received a GoAway frame from the remote
    ///
    /// Stream operations cannot be performed
    RemoteClosing,
    /// Undefined error propagated by the quic layer
    Undefined(Arc<dyn std::error::Error + Send + Sync>),
}

The pros of the new error types:

  • Better separation of stream and connection errors
  • It is now clear on which peer the error occurred (locally or remotely)
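Both pros can be seen in a small sketch with trimmed stand-ins for the new types (fields simplified to plain integers; these are not the real h3 definitions):

```rust
/// Trimmed stand-in for the new `ConnectionError`.
#[derive(Debug, Clone, PartialEq)]
enum ConnectionError {
    Local { code: u64 },
    Remote { error_code: u64 },
    Timeout,
}

/// Trimmed stand-in for the new `StreamError`.
#[derive(Debug, Clone, PartialEq)]
enum StreamError {
    RemoteReset { code: u64 },
    ConnectionError(ConnectionError),
}

/// A stream-scoped error leaves the connection usable; a nested
/// connection error does not. The nesting makes this a single match.
fn connection_still_usable(err: &StreamError) -> bool {
    !matches!(err, StreamError::ConnectionError(_))
}

/// The local/remote split answers "who caused it?" directly.
fn caused_locally(err: &ConnectionError) -> bool {
    matches!(err, ConnectionError::Local { .. })
}

fn main() {
    let reset = StreamError::RemoteReset { code: 0x010c };
    assert!(connection_still_usable(&reset));

    let conn_err = ConnectionError::Remote { error_code: 0x0100 };
    assert!(!caused_locally(&conn_err));
    assert!(!connection_still_usable(&StreamError::ConnectionError(conn_err)));
}
```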

Server incoming Request streams

The accept method signature changed from:

pub async fn accept(
        &mut self,
    ) -> Result<Option<(Request<()>, RequestStream<C::BidiStream, B>)>, Error>

to

pub async fn accept(&mut self) -> Result<Option<RequestResolver<C, B>>, ConnectionError>

The new method returns a RequestResolver struct, which provides a method to resolve the request, yielding the Request itself and the request stream.
Pros:

  • No head-of-line blocking: new incoming requests are no longer blocked while the headers of the first one are still being received
  • Better separation of error levels: accept returns a ConnectionError and resolve_request returns a StreamError
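A rough, synchronous model of this flow, assuming simplified stand-in types: the real API is async, and everything here except the names `accept`, `RequestResolver`, and `resolve_request` is invented for illustration.

```rust
/// Connection-level error stand-in.
#[derive(Debug)]
enum ConnectionError {
    Timeout,
}

/// Stream-level error stand-in.
#[derive(Debug)]
enum StreamError {
    RemoteReset { code: u64 },
}

struct RequestResolver {
    headers_arrived: bool,
}

impl RequestResolver {
    /// Waits for the request headers on this one stream only, so a slow
    /// stream affects nobody else. Fails with a stream-level error.
    fn resolve_request(self) -> Result<&'static str, StreamError> {
        if self.headers_arrived {
            Ok("GET /")
        } else {
            Err(StreamError::RemoteReset { code: 0x010c })
        }
    }
}

struct Connection {
    pending: Vec<RequestResolver>,
}

impl Connection {
    /// Returns as soon as a request *stream* arrives, before any headers
    /// are read; errors here are connection-level.
    fn accept(&mut self) -> Result<Option<RequestResolver>, ConnectionError> {
        Ok(self.pending.pop())
    }
}

fn main() {
    let mut conn = Connection {
        pending: vec![
            RequestResolver { headers_arrived: true },
            RequestResolver { headers_arrived: false }, // never sends headers
        ],
    };
    // Both streams are accepted immediately; the stalled one does not
    // block the other (no head-of-line blocking on accept).
    let stalled = conn.accept().unwrap().unwrap();
    let ready = conn.accept().unwrap().unwrap();
    assert!(ready.resolve_request().is_ok());
    assert!(stalled.resolve_request().is_err());
    assert!(conn.accept().unwrap().is_none());
}
```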

Internal Error Handling

The shared state of a Connection and its Streams changed from:

#[doc(hidden)]
#[non_exhaustive]
pub struct SharedState {
    // Peer settings
    pub peer_config: Settings,
    // connection-wide error, concerns all RequestStreams and drivers
    pub error: Option<Error>,
    // Has a GOAWAY frame been sent or received?
    pub closing: bool,
}

#[derive(Clone)]
#[doc(hidden)]
pub struct SharedStateRef(Arc<RwLock<SharedState>>);

to

#[derive(Debug)]
/// This struct represents the shared state of the h3 connection and the stream structs
pub struct SharedState2 {
    /// The settings, sent by the peer
    settings: OnceLock<Settings>,
    /// The connection error
    connection_error: OnceLock<ErrorOrigin>,
    /// The connection is closing
    closing: AtomicBool,
    /// Waker for the connection
    waker: AtomicWaker,
}

which is stored in an Arc within the structs.
Pros:

  • Leverages the type system (settings are sent once and the error is only set once)
  • The connection can be woken after an error is set
  • When accessed, only the relevant field is locked, not the whole struct

If an error occurs on a RequestStream, it sets the error and wakes the Connection.
The Connection, which is always being polled, then closes the connection if needed.
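The set-once behavior can be sketched with `std::sync::OnceLock` alone. This is a trimmed stand-in for `SharedState2`; the real struct also stores the peer settings and an `AtomicWaker` to wake the connection task.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::OnceLock;

/// Trimmed stand-in for `SharedState2`.
struct ToySharedState {
    connection_error: OnceLock<String>,
    closing: AtomicBool,
}

impl ToySharedState {
    fn new() -> Self {
        Self {
            connection_error: OnceLock::new(),
            closing: AtomicBool::new(false),
        }
    }

    /// The first error wins; later calls are no-ops, which is what
    /// "the error is only set once" means here. Returns whether this
    /// call was the one that set the error. (The real struct would
    /// also wake the connection task at this point.)
    fn set_error(&self, e: String) -> bool {
        self.closing.store(true, Ordering::Relaxed);
        self.connection_error.set(e).is_ok()
    }
}

fn main() {
    let state = ToySharedState::new();
    assert!(state.set_error("stream reset on request stream 4".into()));
    assert!(!state.set_error("a later, ignored error".into()));
    assert_eq!(
        state.connection_error.get().unwrap(),
        "stream reset on request stream 4"
    );
    assert!(state.closing.load(Ordering::Relaxed));
}
```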

Datagram traits

In order to adapt the new error handling to h3-webtransport and h3-datagram, a few things were cleaned up and the datagram traits were changed to move more code into h3-datagram that was previously implemented in h3-webtransport or h3-quinn. This also allows polling for incoming datagrams, or sending them, while still polling the h3 Connection.

Open Todos

  • Fix h3-webtransport and h3-datagram
  • cleanup

@Ruben2424
Contributor Author

Connection errors in RequestStream now also close the connection, and the tests are fixed.
The current error types are a bit confusing. I will see if I can fix this. Maybe we can make this kind of error impossible by forcing the connection to be closed in order to create the user-facing error struct.
IMO it would also be good to make it clear to the user whether the error occurred locally or at the peer.
After this PR is merged, I think we could make a new release.

@Ruben2424
Contributor Author

@seanmonstar I noticed another problem. The async accept method of the server implementation waits for headers on the first bidirectional stream it receives:

#[cfg_attr(feature = "tracing", instrument(skip_all, level = "trace"))]
pub async fn accept(
    &mut self,
) -> Result<Option<(Request<()>, RequestStream<C::BidiStream, B>)>, Error> {
    // Accept the incoming stream
    let mut stream = match poll_fn(|cx| self.poll_accept_request_stream(cx)).await {
        Ok(Some(s)) => FrameStream::new(BufRecvStream::new(s)),
        Ok(None) => {
            // We always send a last GoAway frame to the client, so it knows which was the last
            // non-rejected request.
            self.shutdown(0).await?;
            return Ok(None);
        }
        Err(err) => {
            match err.inner.kind {
                crate::error::Kind::Closed => return Ok(None),
                crate::error::Kind::Application {
                    code,
                    reason,
                    level: ErrorLevel::ConnectionError,
                } => return Err(self.inner.close(code, reason.unwrap_or_default())),
                _ => return Err(err),
            };
        }
    };
    let frame = poll_fn(|cx| stream.poll_next(cx)).await;
    let req = self.accept_with_frame(stream, frame)?;
    if let Some(req) = req {
        Ok(Some(req.resolve().await?))
    } else {
        Ok(None)
    }
}

This line waits for a bidirectional stream and also drives the connection forward by listening on the control stream; with the changes from this PR, it additionally gets notified over mpsc when a connection error occurs in a different request stream.

let mut stream = match poll_fn(|cx| self.poll_accept_request_stream(cx)).await {

These lines from the accept method wait for headers on the accepted bidirectional stream.

let frame = poll_fn(|cx| stream.poll_next(cx)).await;
let req = self.accept_with_frame(stream, frame)?;

When a client starts a bidirectional stream but sends no headers, the accept method of this connection will block, receiving neither other requests nor any data on the control stream.

This also introduces a kind of head-of-line blocking, because the server cannot process any other requests while the first one is slow.

I see two possible solutions:

  1. Change the accept functionality to poll for new bidirectional streams and also keep track of all bidirectional streams that have not yet received headers. With this, a few async functions will probably have to become poll functions.
  2. Introduce a new struct which holds the bidirectional stream. This new type will be returned by the accept method and has a public function to await the request headers and return the Request together with the stream, as the accept method does at the moment.

I would prefer the second option, because this is a low-level crate and this gives the user more control. Also, the maximum number of concurrent streams is managed by the QUIC layer, so there is no need to hold this information in h3.

https://www.rfc-editor.org/rfc/rfc9114#name-streams

In contrast to HTTP/2, stream concurrency in HTTP/3 is managed by QUIC. QUIC considers a stream closed when all data has been received and sent data has been acknowledged by the peer. HTTP/2 considers a stream closed when the frame containing the END_STREAM bit has been committed to the transport. As a result, the stream for an equivalent exchange could remain "active" for a longer period of time. HTTP/3 servers might choose to permit a larger number of concurrent client-initiated bidirectional streams to achieve equivalent concurrency to HTTP/2, depending on the expected usage patterns.

What do you think?

@seanmonstar
Member

I think it'd probably be best to tackle that in a follow-up PR. Good work!

@Ruben2424
Contributor Author

I can do it in a "follow-up" PR. But I will probably do the "follow-up" first, because this PR will get easier with that problem fixed.

@Ruben2424
Contributor Author

I think it'd probably be best to tackle that in a follow-up PR. Good work!

I noticed that I need to address this here after all, because I need changes already made in this PR, and to continue this PR I need this fixed.
I hope it will not become too difficult to review.

@seanmonstar
Member

No problem, let me know when you think it's ready and I'll give it a review!

@Ruben2424
Contributor Author

I think I am almost done with this PR. I left a summary in the first comment and a list of open todos.

@Ruben2424 Ruben2424 changed the title Fix bug in RequestStream Refactor Error handling to fix bugs Mar 16, 2025
@Ruben2424 Ruben2424 marked this pull request as ready for review March 23, 2025 18:16
@Ruben2424
Contributor Author

No problem, let me know when you think it's ready and I'll give it a review!

@seanmonstar I think this PR is ready for a first review. Let me know what you think. I summarized the changes in the initial PR description (see above).

@Ruben2424 Ruben2424 linked an issue Mar 24, 2025 that may be closed by this pull request
@Ruben2424 Ruben2424 requested a review from Copilot March 26, 2025 17:57
Development

Successfully merging this pull request may close these issues.

Protocol violation in RequestStream