Python performance through time

I recently compiled every Python version from v2.2 to 3.0b to see how their performance compares. Rather than use pybench, I took some of the benchmarks from the [http://shootout.alioth.debian.org/gp4/|Computer Language Benchmarks Game] instead, hoping they would be slightly more "real use" realistic. I compiled all versions of Python identically, using the same compiler (GCC 4.3.0) and the same optimization options ("-O3 -march=core2 -mtune=core2"). Every benchmark was run 20 times under each Python version, and the fastest run for each benchmark/interpreter pair was picked. This obviously gives a "best case" scenario; the alternative would be to take a median or average, but I wanted to avoid any unfairness due to system/OS activity.
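The harness amounted to something like the following minimal sketch: run each benchmark 20 times per interpreter and keep the fastest wall-clock time. The interpreter paths and benchmark filenames below are made up for illustration, not my exact setup.

{CODE()}
import subprocess
import time

# Illustrative paths and filenames -- not the exact setup used here.
INTERPRETERS = ["/opt/python-2.2.3/bin/python", "/opt/python-3.0b3/bin/python"]
BENCHMARKS = ["binary-trees.py", "spectral-norm.py"]
RUNS = 20

def timed_run(interpreter, script):
    """Run one benchmark once and return its wall-clock time in seconds."""
    start = time.time()
    subprocess.call([interpreter, script])
    return time.time() - start

results = {}
for py in INTERPRETERS:
    for bench in BENCHMARKS:
        # Keep the fastest of 20 runs to minimize noise from OS activity.
        results[(py, bench)] = min(timed_run(py, bench) for _ in range(RUNS))
{CODE}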

The benchmarks had to be ported to support Python 3000 (v3.0b3), but the changes were mostly trivial (__print__'s and __xrange__'s), so I don't think they affect the results. My test system (a Core2 Duo box with plenty of RAM) was otherwise idle during the entire test run (which took over 6 hours to complete). Alright, so what are the results? The most interesting data is the relative performance index: the average, across all tests, of each interpreter's performance relative to Python v2.2.3, which therefore has an index of 1.0. This also means that each test has equal weight in the total index calculation (a higher index is better).
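In code terms, the index works out to roughly this sketch (the function and dictionary names are made up; each dictionary maps a benchmark name to its best time in seconds):

{CODE()}
def performance_index(times, ref_times):
    # For each benchmark, divide the v2.2.3 time by this interpreter's
    # time (faster means higher), then average the ratios so that every
    # test carries equal weight in the final index.
    ratios = [ref_times[b] / times[b] for b in times]
    return sum(ratios) / len(ratios)

# By construction, performance_index(times_223, times_223) == 1.0
{CODE}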

::{img src=http://www.ogre.com/files/ogre.com/py-performance-index.png}::

I'm also including the results for each individual benchmark, in the following graph (times in seconds, lower is better):

::{img src=http://www.ogre.com/files/ogre.com/py-performance-bench.png}::

__Update__: At a friend's request, I tried compiling with "-Os" instead of "-O3", and, not surprisingly, compiling for size is not advantageous on my Core2 box. This is in line with the results from the Firefox tests I did before; again, the 4MB L2 cache probably negates any benefit of compiling for size.

I'm not going to comment on what might have happened after v2.4.x, but it's good to see Python 3000 getting very promising results.


Comments

> the hope was that some of these benchmarks would be more "realistic"

The funny thing is that you don't seem to have used the ones that might be considered more realistic :-)

fasta
k-nucleotide
n-body
reverse-complement
regex-dna

And note that the current benchmarks game has moved to a quad-core machine http://shootout.alioth.debian.org/

Yeah, I was hoping to run ...

Yeah, I was hoping to run them all, but almost all of them only run on "modern" Pythons, and none run on 3.0 without porting. I fixed a few where I knew it wouldn't make a difference to the results, but many of them use/depend on list comprehensions, for example, which earlier versions of Python do not support. Maybe I should take another stab at making them "universal" for all Python versions... :)
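For what it's worth, most of the port boils down to something like this sketch: a tiny __xrange__ shim, plus sticking to single-argument parenthesized __print__ calls, which parse the same under the 2.x print statement and the 3.0 print function.

{CODE()}
# Compatibility shim: on Python 3.0 xrange is gone (range is lazy),
# so alias it; on 2.x this leaves the builtin untouched.
try:
    xrange
except NameError:
    xrange = range

# print("...") with one argument behaves identically in 2.x and 3.0:
# in 2.x the parentheses are just grouping around the print statement.
for i in xrange(3):
    print("iteration %d" % i)
{CODE}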

Re: If you kept the numbers...

If you kept the numbers, I'm sure some people would appreciate it if you posted them. Boxplots and all that fun.

performance index

This is a good comparison. It seems that each test has equal weight in the total index calculation.