Introduction
While working on some completely unrelated piece of code, I discovered what seemed to be impossible: FreeBSD was much faster than my 2.6 Linux system on certain system calls. And it was quite noticeable, an order of magnitude or more. This clearly couldn't be right, so I ended up doing some tests and wrote a blog entry on it. Well, this continues to be an issue, so I decided to expand on the topic.
Test setup
The primary test system is a Pentium 4 2.8GHz with HT, so all kernels are built with SMP support. For comparison, I've run the same tests on a few other platforms, including FreeBSD. I don't claim any of these tests to be scientifically correct in any way, but something is definitely wrong here.
Benchmarks
I checked the same code on my older system, running a 2.4 kernel on an old Celeron CPU. To my satisfaction, it was indeed very fast, even faster than FreeBSD. I double-checked my PIII machine running a 2.6.6 kernel, and it was also reasonably fast.
So, why was my main Linux system so damn slow? Well, as it turns out, not all kernels are created (i.e. built) equal. I did a number of benchmarks on my system, and discovered that pretty much all my pre-built kernels (from Fedora Core 2 "rawhide") were slow, but only on a Pentium 4 system. My custom-built kernel was fast.
The numbers below are clock cycles per system call (smaller is better):
| Kernel | gettimeofday() | uname() | chdir() | open() |
|---|---|---|---|---|
| FreeBSD 4.x on P4 | 1591 | 29446 | 8977 | 2583 |
| RedHat 2.4 on Celeron | 461 | 592 | 1198 | 864 |
| FC2 2.6.6 on PIII | 1076 | 1286 | 2547 | 1425 |
| FC2 2.6.3 i686 8k stack on P4 | 8373 | 1681 | 33115 | 29725 |
| FC2 2.6.7 i686 4k stack on P4 | 8453 | 8680 | 41695 | 37885 |
| FC3 2.6.9 i686 on P4 | 8720 | 9058 | 14154 | 9887 |
| custom 2.6.7 8k stack on P4 | 890 | 1031 | 4601 | 1396 |
| custom 2.6.7 4k stack on P4 | 828 | 1002 | 4295 | 1377 |
| custom 2.6.9 on P4 | 814 | 990 | 4514 | 1315 |
My custom kernel configuration is "optimized" for my particular system; the configs are available here. The simple program that I used to benchmark this is available here. This code only works on x86 platforms, since it counts CPU clock cycles.
Observations
First of all, this is obviously only a problem with P4 systems, so I did a bit of research (but not a lot). Apparently P4 CPUs have a new instruction for handling system calls, SYSENTER, while older Pentium systems use the 0x80 interrupt. As far as I can tell, there's a new virtual system call layer called vsyscall that is supposed to handle this. There's some info on this in this Kerneltrap.org article. I guess I assumed that vsyscall would handle all combinations of CPUs and builds, but maybe it doesn't?
"Solution"
The solution for me was to recompile the "stock" Linux kernel, based on the FC3 kernel configuration file. I made quite a few changes to the config, and I also didn't apply any of the RedHat/Fedora patches to the kernel source.
The latter is the key to the solution: after trying numerous configurations and options, I narrowed it down to the set of patches Fedora applies. Dave Jones kindly confirmed this, and pointed to the exact set of patches: exec-shield. If your kernel applies the same or a similar patch, you should definitely be aware of this problem.
I haven't yet found an easy way to build the Fedora kernel with only the exec-shield patches left out, but I'm still working on that. Unfortunately the patches depend on each other a bit, so it's not as simple as just leaving out the patch.
I've filed a bug against Fedora Core for this, see Bugzilla bug #139318.
Effects
Obviously system call performance isn't a huge factor in overall system performance, although I can see certain applications suffering more than others due to this problem. To get a somewhat interesting comparison, I compiled the kernel source on a "slow" kernel vs. my custom-built one. Here are the results; the output is from the time command:
"Slow" kernel:
# time gmake
real 21m10.495s
user 19m0.013s
sys 3m6.801s
"Fast" kernel:
# time gmake
real 19m17.596s
user 17m55.808s
sys 1m51.115s