It is common practice in the Python world to write C/C++ extensions to improve performance, but what do you do when that is not enough? How do you find the bottlenecks inside your extensions? Use Callgrind, Valgrind's call-graph profiler, of course!