Game Servers on Linux: Part 1

Efficient Networking with io_uring

November 24, 2025


A quick introduction to this series, then I’ll get on with the main content. A lot of game server engineers come from a game client background. This is usually a good thing: most of the work on a game server benefits from a more engine-oriented perspective (generating the server-authoritative simulation, optimizing physics, generating rollbacks, etc.). It also usually means, however, that many of these engineers don’t come from a strong Linux background (traditionally, games have been a Windows, mobile, or console affair; those headwinds might be changing, but it’s still early days). I’ve done game server work coming from the other side: a more Linux-y, “traditional backend services” background. The point of this series is to highlight a number of Linux features that I’ve found useful for making game servers more efficient, reliable, and secure. Also, apologies, but I’m going to have most code examples (where applicable) written in Rust. I know C++ would be more helpful, and I’ll translate them eventually. Alright, on with our regularly scheduled programming…


When it comes to modern dedicated game servers, CPU usage is a big issue. The main reason is that while normal backend services can serve thousands of clients with a single instance, game servers generally host only a few dozen clients at most. We pay a fixed cost for running a server, but we don’t get to amortize that cost over a large number of clients. While I was at Rec Room, for example, we packed tens of servers per CPU core to keep costs manageable. Add on top of that the fact that we’re generally sending and receiving a lot more data per client than a typical backend service, and costs can add up quickly.

A good rule of thumb for most Linux networking services is that you spend most of your time making syscalls. This is partly because you generally need syscalls to meaningfully interact with a network, but also because syscalls are slow: each one involves a switch from user mode to kernel mode, which is orders of magnitude slower than a normal function call.
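To make that gap concrete, here’s a rough micro-benchmark sketch. The iteration count and `plain_function` body are arbitrary, and it assumes `std::process::id()` actually crosses into the kernel on every call (true on modern glibc, where getpid is no longer cached):

```rust
use std::time::{Duration, Instant};

#[inline(never)]
fn plain_function(x: u64) -> u64 {
    x.wrapping_add(1)
}

// Time `iters` plain function calls vs `iters` getpid(2) syscalls.
fn time_calls(iters: u64) -> (Duration, Duration) {
    let start = Instant::now();
    let mut acc = 0u64;
    for i in 0..iters {
        acc = plain_function(acc ^ i);
    }
    let fn_time = start.elapsed();

    let start = Instant::now();
    let mut pid = 0u64;
    for _ in 0..iters {
        // std::process::id() wraps getpid(2): a user -> kernel mode switch
        pid = pid.wrapping_add(std::process::id() as u64);
    }
    let sys_time = start.elapsed();

    // Keep the results observable so the optimizer can't drop the loops
    println!("fn: {fn_time:?}, syscall: {sys_time:?} (acc={acc}, pid={pid})");
    (fn_time, sys_time)
}

fn main() {
    time_calls(100_000);
}
```

On my machines the syscall loop is consistently one to two orders of magnitude slower; your exact numbers will vary.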

For backend services these syscalls generally consist of send and recv (for a good chunk of this post I’ll say send or recv, but these stand in for the family of related networking calls, e.g. sendto, recvfrom, etc.), among a few others. Game servers may be one of the few exceptions to the syscall rule of thumb since we spend a lot of our time doing physics, in-memory game state operations, and other work that doesn’t require much in the way of syscalls. Even so, if you get enough clients connected, networking syscalls can still account for a decent portion of our overall CPU time. Let’s start by demonstrating just how much time that can be.

The cost of networking syscalls

Let’s set up the simplest representation of a game server we can: a UDP broadcast server. All we’ll do is take all packets and send them to all connected clients. (“UDP is connectionless!” I hear you say. By “connected” in this context, I merely mean that we have received at least one packet from that client in the lifetime of the server.) Let’s first add a server with a fixed tick rate:

// Create the socket and the channel for passing
// received packets back to the main thread
let (socket, rx) = setup_socket(port);

let time_per_tick = Duration::from_millis(20); // 50 ticks per second
let mut last_tick = Instant::now();
let mut addr_set: HashSet<SocketAddr> = HashSet::new();

loop {
    let now = Instant::now();
    // Check if it's time for the next tick
    if now.duration_since(last_tick) >= time_per_tick {
        // Run our "per-tick" game logic (e.g. broadcasting packets)
        game_tick(&rx, &socket, &mut addr_set);
        last_tick = now;
    } else {
        // Sleep until the next tick
        let time_to_sleep = time_per_tick - now.duration_since(last_tick);
        if time_to_sleep > Duration::from_millis(0) {
            thread::sleep(time_to_sleep);
        }
    }
}

Then let’s add our functions for creating the socket and handling incoming packets:

Apologies for the excessive use of expect and unwrap in these examples. I’m sacrificing some robustness for clarity.

fn setup_socket(port: u16) -> (UdpSocket, Receiver<(SocketAddr, Vec<u8>)>) {
    let socket = UdpSocket::bind(("0.0.0.0", port)).expect("Failed to bind socket");
    let (tx, rx) = mpsc::channel();
    let cloned_socket = socket.try_clone().expect("Failed to clone socket");
    thread::spawn(move || {
        // Use a 1500-byte buffer: the standard Ethernet MTU (the maximum payload of a typical Ethernet frame)
        let mut buf = [0u8; 1500];
        loop {
            match cloned_socket.recv_from(&mut buf) {
                Ok((size, src)) => {
                    // Create a copy of the packet that we can pass to the main thread
                    let packet = buf[..size].to_vec();
                    tx.send((src, packet)).expect("Failed to send packet to main thread");
                }
                Err(e) => {
                    // Do nothing on timeout errors
                    if e.kind() != std::io::ErrorKind::WouldBlock
                        && e.kind() != std::io::ErrorKind::TimedOut
                    {
                        eprintln!("Failed to receive packet: {}", e);
                    }
                }
            }
        }
    });
    (socket, rx)
}
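One detail glossed over above: the WouldBlock/TimedOut branch in the recv loop only ever fires if the socket has a read timeout set (or is non-blocking). A sketch of setting one up, in case you want the recv thread to wake periodically; the 50ms value is arbitrary:

```rust
use std::net::UdpSocket;
use std::time::Duration;

fn bind_with_timeout(port: u16) -> std::io::Result<UdpSocket> {
    let socket = UdpSocket::bind(("0.0.0.0", port))?;
    // With a timeout set, recv_from returns a WouldBlock/TimedOut error
    // instead of blocking forever, giving the loop a chance to check for
    // shutdown signals or do other periodic work.
    socket.set_read_timeout(Some(Duration::from_millis(50)))?;
    Ok(socket)
}
```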

fn game_tick(
    rx: &Receiver<(SocketAddr, Vec<u8>)>,
    socket: &UdpSocket,
    addr_set: &mut HashSet<SocketAddr>,
) {
    while let Ok((src, packet)) = rx.try_recv() {
        addr_set.insert(src);
        for addr in addr_set.iter() {
            socket
                .send_to(&packet, addr)
                .expect("Failed to send packet");
        }
    }
}

Right off the bat, there are a few ergonomic issues with this approach. Chief among them: networking is inherently event-based, while our main game loop is schedule-based. This forces us to spin up a second thread that does all of the listening and passes packets back to the main thread.

To benchmark this base implementation, I ran a simple load test against this server, consisting of 10 clients, each sending 1024-byte packets every 10ms. (I ran this on a single-core “Premium AMD” machine from DigitalOcean. The performance of these machines can vary greatly depending on the specific machine you are allocated, so all of the benchmarks in this post are from the same machine.)

Implementation   Packets Lost (%)   Avg Latency (ms)   P99 Latency (ms)   CPU Usage (%)
Basic Recv       0.024              10                 21                 7.16

Running the server with strace, we see the following breakdown of syscalls:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 57.72   16.807322         168    100009         9 recvfrom
 38.14   11.103956          11    999955           sendto
  4.13    1.203268         218      5504         1 clock_nanosleep
  0.00    0.000461          19        24           write
  ...
------ ----------- ----------- --------- --------- ------------------
100.00   29.116518          26   1105625        11 total

This is pretty unsurprising: we’re spending most of our time in recvfrom and sendto. One of the bits that may be surprising is just how much slower recvfrom is compared to sendto. This comes down to a few factors, the most important of which is that recvfrom is a blocking call. When we call recvfrom, if there is no data available, the kernel has to put our thread to sleep and then wake it up when data arrives. That overhead adds up quickly. sendto, on the other hand, is non-blocking in this case since the kernel can usually just copy the data into its networking buffers and return immediately.
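For contrast, the usual way to sidestep blocking wakeups without a dedicated thread is a non-blocking socket drained once per tick. A sketch (not what the benchmarked server does, just a reference point); it assumes the caller has already put the socket in non-blocking mode:

```rust
use std::io::ErrorKind;
use std::net::UdpSocket;

// Drain every packet currently queued on a non-blocking socket.
// Assumes `socket.set_nonblocking(true)` has already been called.
fn drain_socket(socket: &UdpSocket) -> std::io::Result<Vec<Vec<u8>>> {
    let mut packets = Vec::new();
    let mut buf = [0u8; 1500];
    loop {
        match socket.recv_from(&mut buf) {
            Ok((size, _src)) => packets.push(buf[..size].to_vec()),
            // Nothing left in the kernel's receive queue: stop until next tick
            Err(e) if e.kind() == ErrorKind::WouldBlock => break,
            Err(e) => return Err(e),
        }
    }
    Ok(packets)
}
```

The tradeoff: every empty recv_from still costs a syscall, so you’ve traded blocking overhead for polling overhead rather than eliminating syscalls.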

Since this program is basically all syscalls, it stands to reason that if we can reduce the number of syscalls we make, we can reduce our overall CPU usage. Unfortunately, traditional networking syscalls are pretty limited in this regard. For example, each call to recvfrom can only receive a single packet. This limitation is common enough that Linux introduced a new I/O interface specifically to address it: io_uring.

What is io_uring?

Introduced in 2019 in Linux kernel 5.1, io_uring is a fancy way to do asynchronous I/O on Linux. It does this by setting up two ring buffers in shared memory between the kernel and user space: a submission queue and a completion queue. The submission queue is where user space code submits I/O requests to the kernel, while the completion queue is where the kernel notifies user space code of completed I/O operations.

At a high level the flow of a typical io_uring operation looks like this:

  1. We set up an io_uring instance (io_uring_setup syscall) which creates the shared memory ring buffers.
  2. We register any resources we want to use with io_uring (e.g. socket file descriptors, buffer rings, etc.) with their respective registration syscalls.
  3. We queue up I/O operations in the submission queue (e.g. recv, send, read, write, etc.) with the appropriate io_uring_prep_* functions.
  4. We submit the queued I/O operations (submission queue entries, or SQEs) to the kernel with the io_uring_enter syscall. Optionally, we can wait for a certain number of completions to be available in the completion queue before returning, set a timeout for that wait, or not wait at all.
  5. We read completed I/O operations (completion queue entries, or CQEs) from the completion queue.

In simpler terms, io_uring gives us input and output queues, plus an API to notify the kernel that we’ve added new work to the input queue and to check for completed work in the output queue.
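As a toy model of that two-queue shape (plain VecDeques, nothing like the real shared-memory rings; the “kernel work” is inlined as a doubling operation purely for illustration):

```rust
use std::collections::VecDeque;

// Hypothetical request/completion types standing in for SQEs and CQEs.
struct Sqe { user_data: u64, payload: u32 }
struct Cqe { user_data: u64, result: u32 }

struct ToyRing {
    submission: VecDeque<Sqe>,
    completion: VecDeque<Cqe>,
}

impl ToyRing {
    fn new() -> Self {
        Self { submission: VecDeque::new(), completion: VecDeque::new() }
    }

    // Analogous to pushing an SQE: no syscall, just a queue write.
    fn push(&mut self, sqe: Sqe) {
        self.submission.push_back(sqe);
    }

    // Analogous to io_uring_enter: one "kernel crossing" processes every
    // queued request in a single batch. Returns how many were submitted.
    fn submit(&mut self) -> usize {
        let n = self.submission.len();
        while let Some(sqe) = self.submission.pop_front() {
            // Stand-in for the kernel doing the actual I/O work
            self.completion.push_back(Cqe { user_data: sqe.user_data, result: sqe.payload * 2 });
        }
        n
    }

    // Analogous to reaping CQEs: again just a queue read, no syscall.
    fn next_completion(&mut self) -> Option<Cqe> {
        self.completion.pop_front()
    }
}
```

The key property this models: N requests cost one submit, and the user_data field is how completions are matched back to the requests that produced them.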

Great! What does this mean for us? Well, it means that we can batch up a bunch of I/O requests and submit them all at once to the kernel (ideally at the end of a networking tick, once we’ve queued up all our I/O operations). This reduces the number of syscalls we need to make, and thus the amount of time we spend in syscalls. Additionally, since the kernel can notify us of completed I/O operations via the completion queue, we can avoid blocking calls altogether. If we don’t need to block on recvfrom, we can handle both sending and receiving in a single thread. (With more advanced usage of io_uring, we can even control when we allow the kernel to do work on the submission queue, which can be helpful if we want to reduce context switching while running game server logic.)

This fits really well with our game server model. We can batch up all of our send calls at the end of the game tick, and we can pick up received packets at the beginning of the tick. (Batching sends at the end of the tick is already necessary for a number of reasons, the largest of which is that we’re often sending multiple messages to a given client in a given frame. To save on bandwidth, we coalesce all of those messages into a single larger packet, which means that, unless we hit the MTU, we have to wait until the end of any logic that could result in us sending a packet.)
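That coalescing step can be sketched like this: pack queued messages into MTU-sized packets, flushing whenever the next message would not fit. The 1200-byte budget and the 2-byte length-prefix framing here are assumptions for illustration, not the post’s actual wire format:

```rust
// Conservative per-packet payload budget (an assumption, not a real MTU probe)
const MTU: usize = 1200;

// Pack messages into as few packets as possible, each message framed
// with a hypothetical 2-byte little-endian length prefix.
fn coalesce(messages: &[Vec<u8>]) -> Vec<Vec<u8>> {
    let mut packets = Vec::new();
    let mut current: Vec<u8> = Vec::new();
    for msg in messages {
        let framed_len = 2 + msg.len();
        // Flush the current packet if this message would push it past the MTU
        if !current.is_empty() && current.len() + framed_len > MTU {
            packets.push(std::mem::take(&mut current));
        }
        current.extend_from_slice(&(msg.len() as u16).to_le_bytes());
        current.extend_from_slice(msg);
    }
    if !current.is_empty() {
        packets.push(current);
    }
    packets
}
```

For example, three 500-byte messages frame to 502 bytes each, so two fit in the first packet and the third spills into a second: two sends instead of three.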

Implementing our server with io_uring

To implement our server with io_uring, we’ll use the io-uring crate (for C/C++, liburing is generally the way to go). First, let’s set up our socket and io_uring instance:

// This is zero in this case since we're only using one socket
const SOCKET_IDENTIFIER: Fixed = Fixed(0);

// This is an arbitrary identifier for our buffer group
const BUFFER_GROUP_ID: u16 = 0xbead;

enum TokenType {
    Recv,
    Send(Vec<u8>),
}

fn setup_iouring_socket(port: u16) -> (UdpSocket, IoUring, FixedSizeBufRing, Slab<TokenType>) {
    // Build our io_uring instance
    let mut io_uring = IoUring::builder()
        // Signal to the kernel that we will only be submitting entries from a single thread
        .setup_single_issuer()
        // Setup a large completion queue to allow for many in-flight operations
        .setup_cqsize(8192)
        // Don't stop submitting if we hit an error on one entry
        .setup_submit_all()
        // Build with a submission queue size of 8192 entries
        .build(8192)
        .expect("Failed to create IoUring");

    // There is some complexity here that I'm glossing over for the sake of brevity:
    // You can think of this as a large buffer that will be broken into smaller,
    // uniform chunks for use with recv operations.
    // See: https://man7.org/linux/man-pages/man3/io_uring_register_buf_ring.3.html
    let buf_ring = register_buf_ring::Builder::new(BUFFER_GROUP_ID)
        .buf_cnt(1024)
        .buf_len(MAX_PACKET_SIZE)
        .build()
        .expect("Failed to create buffer ring");

    // Register the buffer ring with io_uring. This allows us to use the buffer
    // ring for recv operations (necessary for multishot recv).
    buf_ring.rc.register(&mut io_uring).expect("Failed to register buffer ring");

    let socket = UdpSocket::bind(("0.0.0.0", port)).expect("Failed to bind socket");

    // This is used to track when it's safe to reuse/drop buffers. At a high level,
    // this gives us a usize that we can pass with the io_uring entry to identify
    // the operation later.
    let mut slab = Slab::with_capacity(256);
    let recv_token_index = slab.insert(TokenType::Recv);

    // Register the socket's file descriptor with io_uring.
    io_uring.submitter()
        .register_files(&[socket.as_raw_fd()])
        .expect("Failed to register socket fd");

    // Prepare the multishot recv entry
    let recv = prep_multi_recv(recv_token_index, &MSG_HDR as *const _);

    unsafe {
        // Push the multishot recv entry to the submission queue.
        io_uring
            .submission()
            .push(&recv)
            .expect("Failed to push multi-recv entry");
    }

    // Submit the io_uring entries to the kernel
    // This should generate a single syscall to io_uring_enter
    io_uring
        .submit()
        .expect("Failed to submit io_uring entries");

    (socket, io_uring, buf_ring, slab)
}

// This prepares a "multishot" recv operation. These allow us to continuously receive
// packets until the operation returns an entry without the IORING_CQE_F_MORE flag set.
fn prep_multi_recv(
    token_index: usize,
    msghdr: *const libc::msghdr,
) -> io_uring::squeue::Entry {
    opcode::RecvMsgMulti::new(
        SOCKET_IDENTIFIER,
        msghdr,
        BUFFER_GROUP_ID,
    )
    .build()
    .user_data(token_index as _)
}

The most interesting part of this setup is the use of a “multishot” recv operation (introduced in Linux kernel 6.0). This is a special type of recv that allows us to continuously receive packets without needing to re-submit a new recv operation each time, which significantly reduces syscall churn.

Next, let’s implement the game tick function:

fn iouring_recv_game_tick(
    socket: &UdpSocket,
    io_uring: &mut IoUring,
    buf_ring: &register_buf_ring::FixedSizeBufRing,
    slab: &mut Slab<TokenType>,
    addr_set: &mut HashSet<SocketAddr>,
    addr_map: &mut HashMap<SocketAddr, Box<libc::sockaddr_in>>,
) {
    // Process the completion queue
    let (submitter, mut sq, cq) = io_uring.split();
    let mut count = sq.len();

    // Iterate through all the completion queue entries
    for cqe in cq {
        // Get the token index we stored in the user_data field
        let token_index = cqe.user_data() as usize;
        match slab.get(token_index) {
            Some(TokenType::Recv) => {
                let result = cqe.result();
                if result > 1500 || result < 0 {
                    eprintln!("Multi-recv returned unexpected result: {}", result);
                    continue;
                }

                // Once again, I'm going to gloss over the details of buffer rings here.
                // For the purposes of this example, just know that we can use the flags
                // on the cqe to get the buffer we used for this recv operation.
                let bufs = buf_ring.rc.get_bufs(buf_ring, result as u32, cqe.flags());
                if bufs.len() != 1 {
                    continue;
                }
                let buf_entry = &bufs[0];
                let data = buf_entry.as_slice();

                // Create empty msghdr to parse the `from` address
                let mut msghdr: libc::msghdr = unsafe { mem::zeroed() };
                msghdr.msg_namelen = 32;
                msghdr.msg_controllen = 0;
                let msg = RecvMsgOut::parse(data, &msghdr).expect("Failed to parse RecvMsgOut");
                let name_data = unsafe { *(msg.name_data().as_ptr() as *const libc::sockaddr_in) };

                let addr = SocketAddr::V4(SocketAddrV4::new(
                    u32::from_be(name_data.sin_addr.s_addr).into(),
                    u16::from_be(name_data.sin_port),
                ));

                addr_set.insert(addr);
                for addr in addr_set.iter() {
                    // Get or create a cached sockaddr_in for this address
                    let sockaddr = addr_map.entry(*addr).or_insert_with(|| Box::new(libc::sockaddr_in {
                        sin_family: libc::AF_INET as u16,
                        sin_port: u16::to_be(addr.port()),
                        sin_addr: libc::in_addr {
                            s_addr: u32::to_be(match addr {
                                SocketAddr::V4(a) => a.ip().to_owned().into(),
                                _ => 0,
                            }),
                        },
                        sin_zero: [0; 8],
                    }));

                    // Get a pointer to the sockaddr_in
                    let addr_ptr: *const libc::sockaddr_in = &**sockaddr;

                    // Create an owned copy of the payload to send
                    let payload = msg.payload_data().to_vec();
                    let payload_ptr = payload.as_ptr();
                    let payload_len = payload.len();

                    let index = slab.insert(TokenType::Send(payload));

                    // Use a zero-copy send operation to send the packet
                    let send = opcode::SendZc::new(
                                    SOCKET_IDENTIFIER,
                                    payload_ptr,
                                    payload_len as u32,
                                )
                                .dest_addr(addr_ptr as *const _)
                                .dest_addr_len(size_of::<libc::sockaddr_in>() as u32)
                                .build()
                                .user_data(index as _);
                    unsafe {
                        sq.push(&send).expect("Failed to push send entry");
                    }

                    // If we have enough entries queued up, submit them to the kernel,
                    // otherwise we may run out of space in the submission queue.
                    if sq.len() - count >= 1024 {
                        submitter.submit().expect("Failed to submit io_uring entries");
                        count = sq.len();
                    }
                }

                // If the MORE flag is not set, we need to re-submit a new multishot recv
                if cqe.flags() & IORING_CQE_F_MORE == 0 {
                    let recv = prep_multi_recv(token_index, &MSG_HDR as *const _);
                    unsafe {
                        sq.push(&recv).expect("Failed to push multi-recv entry");
                    }
                }
            }
            Some(TokenType::Send(_)) => {
                // SendZc can produce multiple completions for a single send operation.
                // We only care about the notification completion (has IORING_CQE_F_NOTIF
                // flag set).
                if !cqueue::notif(cqe.flags()) {
                    continue;
                }
                slab.remove(token_index);
            },
            None => {
                eprintln!("Received unknown token index: {}", token_index);
            }
        }
    }
    // Sync the submission queue before submitting
    sq.sync();

    submitter.submit().expect("Failed to submit io_uring entries");
}

Some interesting things to note here. First, we handle both receiving and sending in the same function/thread. Second, you may have noticed that we’re passing these token_index values into the user_data field of the submission queue entries. user_data can be any arbitrary u64 value; in our case, we use it to store the index of the operation in our slab, so that we can associate arbitrary data with each operation (in our TokenType enum) and look it up when the completion arrives.
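That slab pattern boils down to: insert per-operation state, get back a small integer, round-trip that integer through the kernel as user_data, and look the state back up on completion. The server above uses the slab crate; this hand-rolled miniature is just to show the idea:

```rust
// A tiny slab: stores values in a Vec, reusing freed slots so indices stay small.
struct MiniSlab<T> {
    entries: Vec<Option<T>>,
    free: Vec<usize>,
}

impl<T> MiniSlab<T> {
    fn new() -> Self {
        Self { entries: Vec::new(), free: Vec::new() }
    }

    // Returns an index suitable for stuffing into a u64 user_data field.
    fn insert(&mut self, value: T) -> usize {
        if let Some(i) = self.free.pop() {
            self.entries[i] = Some(value);
            i
        } else {
            self.entries.push(Some(value));
            self.entries.len() - 1
        }
    }

    // Called when the matching completion arrives: reclaim the slot and
    // hand back the state (e.g. the send buffer that is now safe to drop).
    fn remove(&mut self, index: usize) -> Option<T> {
        let value = self.entries.get_mut(index)?.take();
        if value.is_some() {
            self.free.push(index);
        }
        value
    }
}
```

The point of reusing slots is that indices stay dense and small, so the map from user_data back to operation state is a cheap array lookup rather than a hash.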

All this is well and good, but how does it perform? Running the same load test as before, we get the following results:

Implementation   Packets Lost (%)   Avg Latency (ms)   P99 Latency (ms)   CPU Usage (%)
Basic Recv       0.024              10                 21                 7.16
io_uring Recv    0.0                11                 21                 6.41

That’s looking better! We’ve gotten a 10.5% reduction in CPU usage, and packet loss has dropped to zero. Looking at strace:

You can mostly ignore the restart_syscall entries here: these are just the kernel’s way of handling interrupted syscalls (in this case, clock_nanosleep).

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 82.87    4.685087         824      5680       635 io_uring_enter
 13.65    0.771613          22     34957     29938 restart_syscall
  3.46    0.195726          38      5027      5019 clock_nanosleep
  0.01    0.000474          18        25         1 write
...
------ ----------- ----------- --------- --------- ------------------
100.00    5.653862         123     45824     35594 total

We’ve reduced our syscalls from over a million to just under 46k! Most of our time is now spent in io_uring_enter, the syscall that submits our batched I/O requests to the kernel. This is a huge improvement, and it shows the power of io_uring for reducing syscall overhead.

Now this is hardly a magic bullet: 10% CPU savings is nothing to scoff at (savings of 10% would allow us to host about 11% more game servers per core), but it’s also the best-case scenario. A real game server running non-trivial game logic will likely see less dramatic improvements. That said, every little bit helps, and one of the nice things about io_uring is that it scales well with more traffic: as you handle more load, the benefits of batching I/O requests become more pronounced.

The exact code I used for benchmarking can be found here.

Further reading


Game Servers on Linux: Part 1 - November 24, 2025 - MixmasterFresh