Summarizing performance numbers

How should we summarize performance numbers? In a recent benchmark run, I had some interesting speedup numbers that I wasn’t certain how to report. While it’s easy to make charts that are illuminating, I’m not certain what I should say in, e.g., an abstract.

Here’s the raw data (also available as a spreadsheet), noting that I’ve made everything as abstract as I can:

In the data, I’ve recorded the runtime of 2 tools (tool1 and tool2) on 40 tests. The tests are lettered by theme, with a number to distinguish tests that are somehow related. Each runtime in the table is in seconds, and is the arithmetic mean of three runs exhibiting nominal variation. I run tool1 in two configurations: tool1 simply solves the problem, while tool1.min tries to solve the problem “minimally” in some sense. I run tool2 in only one configuration. In the spreadsheet, I’ve calculated a few summary statistics for each column. Here are the summary statistics for tool1 vs. tool2:

Arithmetic mean156.84
Geometric mean12.64
Harmonic mean4.49
Summary statistics of tool1’s speedup compared to tool2

Doing some cursory analysis in R, it’s easy to generate charts that give a pretty good feel for the data. (It’s all in mgree/summarizing-perf on GitHub.) Here’s a box plot of times:

boxplots showing runtimes for tool1, tool1.min, and tool2. tool1 is the tightest, lowest box; tool1.min is a little higher but has the same median; tool2 is substantially higher (worse) than the other two.

And here’s a violin plot of times:

a violin plot shows that tool1 and tool1.min are chonkiest around 0.05s, while tool2 has wide variation

I would summarize these charts in text as, “tool1 is an order of magnitude faster than tool2; minimization closes some of the gap, but tool1.min is still substantially faster than tool2”. A bar chart tells the same story:

a bar chart comparing tool1, tool1.min, and tool2 across all tests. tool2 is only rarely competitive with tool1 (i.e., withing half an order of magnitude). tool1.min does worse than tool1, but still typically beats tool2.

With the bar chart, it’s possible to see that sometimes tool2 is in the same league as tool1, but not usually. We have tool2 beating tool1.min only once (test w); it never beats tool1, and typically loses by 1-2 orders of magnitude. Some cases are catastrophically bad.

Plotting speedup lets us justify some other comments. Here’s a scatter plot:

a scatter plot showing speedups for tool1 and tool1.min. a single point for tool1 is on the 1; all others are above. a single point for tool1.min is below the 1; all others are above.

And here’s a boxplot of speedups in aggregate:

a boxplot summarizing speedups of tool1 and tool1.min compared to tool2. the whisker for tool1 stops at 1; the whisker for tool2 goes just below. the medians and quartiles are more or less comparable, with tool1 doing a touch better than tool1.min

Looking at these speedups, I’d feel comfortable saying that “tool1 is typically an order of magnitude faster than tool2, never slower, and sometimes much faster; tool1.min behaves similarly, though it can sometimes be slower”.

This post comes out of a Twitter thread, which is a goldmine of insight and resources. Please check it out, and chime here or in the thread with your thoughts!

Special thanks to Noam Ross for help with some of the fancier log-scale stuff on the plots.