Monday, August 19, 2013

This blog post is mostly so I don't forget this kind of stuff. mentions "% execution stalled". This is the core i7 document rather than the Sandy Bridge document, but bear with me.

The formula is:


However, there's no UOPS_EXECUTED.CORE_STALL_CYCLES in the PMC documentation, nor is it in the Intel SDM chapter on performance counters.

But wait! It kind of is there. There /is/ UOPS_EXECUTED.THREAD, which is "Counts the total number of uops to be executed per thread each cycle." In the same block, it says that to count stall cycles, set CMASK=1, INV=1. Ok, so how does one do that with PMC?

# pmcstat -S UOPS_EXECUTED.THREAD,inv,cmask=1 -T -w 5

Now, it seems to be showing me the ACPI wait and MWAIT functions as high sample events - which is odd, as I didn't think this particular PMC measured C1 and MWAIT states. I'll chase this up.

For Sandy Bridge it's UOPS_DISPATCHED.THREAD - this counts dispatched micro-operatons per-thread each cycle. CMASK=1,INV=1 counts the number of stall cycles.

Tuesday, August 13, 2013

Profiling on superscalar architectures, or "no, instruction counts don't necessarily matter nowdays.."

I could spend a lot of time brain dumping various things to do with profiling on anything more recent than Pentium 1 (or, well, anything remotely to do with MIPS of that era, but I digress.) In any case, there's plenty of useful writings about it so I'm not going to - just jump over to .

However, I thought I'd provide a basic example of where "instructions" doesn't actually matter, as well as a shortcoming of the current FreeBSD tools.

My network testing stack does a whole lot of read() and write() syscalls to achieve its goal. For those who know what's going on, I hope you know where I'm taking this. Anyway..

Firstly, the standard. "pmcstat -S instructions -T". This prints a "top" like output counting instructions retired.
Figure 1. # pmcstat -S instructions -T -w 5

This looks like the contention is in the mutexes protecting socket receive and the TCP output path. Sure, but why is it contending?

The problem with doing it based on instructions retired is that it hides any issues to do with stalls. There's a bunch of sources of potential stalls - memory reads, memory writes, stuff not being in cache but being needed for instructions that are running. They're generally either side-effects of operations not being able to complete in time (eg if you have a whole lot of completed operations that need to push stuff out to memory to continue, but there's no free bandwidth to queue memory writes) but sometimes it's just from straight bulk memory copies.

If you're interested about the Intel microarchitecture and how all of these pieces fit together to process an instruction stream in parallel, with all of the memory controller paths coming in and out, have a read of this: .

Ok, so let's look at general stalls. There's a bunch of L1, L2, LLC (last level cache, think "L3" here) operations that can be looked at, as well as stuff that FreeBSD's PMC framework doesn't support  - notably some of the stuff on the "uncore" - the shared cache and pipelines between all cores on a socket. It supports the events implemented using MSRs, but not events implemented using the PCIe configuration space.

So, without further ado:

Figure 2. # pmcstat -S RESOURCE_STALLS.ANY -T -w 5
Yup. This looks much more like what I'd expect. The CPU is stalled doing copyout(). This is a mostly-read() workload, so that's what I'd expect. mb_free_ext() is interesting; I'll go look at that.

Now, copyout() is doing a bulk copy. So, yes - I'd expect that to be hurting. mb_free_ext() shouldn't be doing very much work though - I'll do some digging to see what's going on there.

The final output is from the Intel performance tuning overview tools. You can find them here - . There's a nice overview tool (pcm.x) which will output the basic system overview. I like this; it gives a very simple overview of how efficient things are.
Figure 3. "pmc.x 1" running on FreeBSD-10.

Now, this isn't a stock version of pcm.x - I've hacked it up to look slightly saner when doing live reporting - but it still provides exactly the same output in that format. Note the instructions per CPU cycle and the amount of cache misses. It's .. very inefficient. Tsk.

So in summary - don't just do instruction count based profiling. You want to establish whether there are any memory and cache bottlenecks. If you're doing HPC, you want to also check to see if you're hitting SSE, FPU, divider unit and other kinds of math processing stalls.

Now - what would I like to see in FreeBSD?

  • The hwpmc framework needs to grow support for more of the socket and system events - right now it's very "core" focused.
  • Some better top-level tools to provide a system summary like Intel's pcm.x tool would be nice.
  • Some better documentation (read: better than just this wiki page!) looking at how to actually profile and understand system behaviour would be desirable.
  • .. and tying this into dtrace would be great too.
  • Once we get some (more) NUMA awareness, it would be great to have the uncore stuff reporting on things like QPI traffic, cache line and memory accesses from remote sockets/cores, and other stuff to do with NUMA allocation and awareness.
Mostly, however, I'd like to get this stuff into the hands of developers sooner rather than later so they can start running this as a normal day to day thing.

Wednesday, August 7, 2013

Why, oh why am I seeing RST frames from FreeBSD when I have a high connection rate?

I started seeing something odd in my testing. I was only getting around 120-odd new connections a second being accepted by the test server. I know FreeBSD needs some tuning to make it perform at high request rates, but .. hell. The odd thing? The other requests were sometimes getting RST frames (and the client would error out with a "connection reset by peer"; sometimes not.)

After doing some digging, I discovered that.. I was doing something a little odd in my testing framework and it (surprise!) elicited some very negative behaviour from FreeBSD. Said behaviour is actually valid - it's to avoid denial of service attacks. But it's worth talking about.

My test client was bursting 'n' connections per thread each second. So, I would do a test of say, 128 new connections back to back, each second, from each thread. This is definitely odd (but easy to implement!)

Here's what the server was doing.

Firstly - there's a "syncache". The syncache handles incoming embryonic requests (ie, the SYN from a remote peer.) It's separate from the rest of the TCP stack so a large flood of new connections (valid or otherwise) doesn't need to grab TCP stack locks in order to process these frames, or waste RAM with PCB (protocol control block) entries for these embryonic requests. It also makes it easier to time out half-completed requests - the PCB will only have completed or closing connections.

If the handshake succeeds but there's a failure in allocating a new PCB or socket for the connection, the TCP stack can return an RST to the peer.

If the syncache fills up, it should be sending syncookies. (google "SYN cookies" for more information.) The point of using SYN cookies is that it doesn't fill the syncache up with embryonic connections - there's a cookie that the client will reflect back to the server that validates the connection.

If the syncookie exchange suceeds but the application can't create new sockets fast enough (ie, servicing the accept() socket queue quickly enough), the TCP stack will throw an RST back at the client.

Now, for the fun bits.

  • The RST responses back to the server are rate limited - via net.inet.icmp.icmplim. Yes, it's not just for rate limiting ICMP responses.
  • So the client would see some connections hit an RST and fail immediately; others just wouldn't get the ACK and would try again, so..
  • .. over time, there'd be a burst of new connections every second from the client (causing the issue) as well as the connection retransmits for embryonic-but-not-yet-finished connections
When I staggered the new connections over smaller, quicker bursts (so instead of 128 connections a second per thread,  I'd do 12 connections every 100mS) then the problem went away. This is better behaviour (I can connect thousands of new connections a second here!) but I still expect to see this problem in the real world. As I approach my intended TCP connection rate (100,000 connections a second - which isn't specifically a Netflix requirement, but an "Adrian proxy load" requirement! - I'm going to start seeing microbursts of new connections that will temporarily look like back-to-back new connections, thus triggering this bug.

So, to work around this for now, one just has to bump up the accept queue depth (sysctl kern.ip.somaxconn) to something much higher than the default of 128.

Now - why is this happening? My theory is this:
  • We're getting this burst of frames coming in the NIC;
  • The syncache / cookie code is being run in the NIC RX path;
  • The new connection path gets run and quickly overflows the syncache and new connection queue handling in the TCP stack, as the userland code doesn't get a notification in time
  • .. so the accept queue overflows before userland gets a chance to run, and we start sending rate limited RSTs.

Tuesday, August 6, 2013

Hacking on the Intel 10GE driver (ixgbe) for fun and .. not-crashing

My job at Netflix can be summed up as "make the Open Connect Platform better." This involves, among other things, improving FreeBSD in ways that make it more useful for content delivery.

I interpret that as including "find ways to break things."

So, I've written a crappy little multi-threaded network library ( which is absolutely, positively crappy and FreeBSD specific. Right now all it does is TCP and UDP network smashing using read() / write() for TCP, and recvfrom() / sendto() for UDP.

The aim with this is to stress test things and find where they break. So, the first thing I've written is a very simple TCP client/server - the client connects to the server and just write()s a lot of data.

.. except, that the clients are light-weight, in C, and multi-threaded.

So, I end up 'n' threads, with 'm' TCP sockets each, all doing write(). Right now I'm watching 4 threads with 12,288 sockets each sending data.

The test hardware is a pair of 1ru supermicro boxes with Intel E3-1260L CPUs in them, 32GB of RAM and dual-port Intel 82599EB 10GE NICs. The NICs are channel-bonded (using LACP) through a Cisco ASR9k switch.

I initially tested this on FreeBSD-9. I was rudely reminded of how utterly crappy the default mbuf sizing is. I constantly ran out of mbufs. Then, since FreeBSD-10 is on the cards, I just updated everything to the latest development branch and ran with it.

The result? The test ran for about 90 seconds before things got plainly pissed. The client (sender) would immediately hang. I'd get short packet errors, the LACP session would get unstable... everything was just plain screwed. The server (receiver) never saw any issues. I also saw lots of RX stalls, where one ring would seemingly fill up - and the whole RX path just ground to a halt. In addition, I'd also see a whole lot of out of order TCP segments on the server (receiver) side. Grr.

So, cue some driver hacking to see what was going on, reading the Intel 82599EB datasheet (that's freely available, by the way!) as well as discussions with Intel, Verisign and a few other companies that are using Intel 10GE hardware quite heavily, and here's what was discovered.

There's a feature called "RX_COPY" where small packets that are received are copied into a small, new mbuf - and the existing receive buffer is left in the RX ring. This improves performance - there's less churn of the mbuf allocator for those larger buffers. However. there was some dangling pointers around the management of that, leading so some stuff being DMAed where it shouldn't .. which, since ACKs and LACP frames are "small", would be triggered by this. Since the sender (client) is sending lots of segments, it's going to be receiving a lot of ACKs and this explains why the receiver (server) didn't hit this bug.

Next, the RX stalls. By default, if one of the RX rings fills up, the whole RX engine stalls. This is apparently configurable (read the data sheet!) but it's not on by default in FreeBSD/Linux. One of the verisign guys found the problem - in the general MSIX interrupt handler path, it was acknowledging all of the interrupts that were currently pending, rather than only the ones that were activated. The TX/RX interrupts are routed to other MSIX messages and thus should be handled by those interrupt threads. So, under sufficient load - and if you had any link status flaps - you may hit a situation where the non-packet MSIX interrupt thread runs, ACKs all the interrupts, and you immediately end up filling up the RX ring. You won't generate a subsequent interrupt as you've already hit the limit and the hardware won't generate anything further.. so you're stuck. That's been fixed. The annoying bit? It was fixed in the Linux driver but not the FreeBSD driver. Growl.

So, now the driver behaves much, much better. I can smash it with 20 gigabits a second of TCP traffic across 50,000 odd sockets and nary a crash/hang. But what bugs me is the out-of-order TCP packets on the receiver side of things.

The reason - it's highly likely due to the driver architecture. The driver will schedule deferred packet processing using the taskqueue if the interrupt handler ends up with too many packets to work with. Now, this taskqueue is totally separate to the interrupt thread - which means, you can have both of them running at the same time, and on separate CPUs.

So I'm going to hack the driver up to not schedule the taskqueue and instead just poke the hardware to post another interrupt to do further processing. I hope this will resolve the out of order TCP frames being received.