Suppose you were trying to run some experiments about L1 D-caches. (You may also suppose that this is a homework problem, but that’s life.) You’re given a trace of loads and stores at certain addresses. These addresses are 32-bits wide, and the trace is in a textual format:
1A2B3C4D L
represents a load to 0x1a2b3c4d, followed by a store to 0xdeadbeef, followed by a load to 0x1b2b3c4d. (You might notice the two loads may be in conflict, depending on the block and cache size and the degree associativity. In that case, you might be in my computer architecture class…)
DEADBEEF S
1B2B3C4D L
This is problematic. Naturally, you’d like to process the trace in OCaml. But did I mention that the trace is rather large — some 600MB uncompressed? And that some of the addresses require all 32 bits? And some of the statistics you need to collect require 32 bits (or more)? OCaml could process the entire trace in under a minute, but the boxing and unboxing of int32s and int64s adds more than twenty minutes (even with -unsafe). I felt bad about this until a classmate using Haskell had a runtime of about two and a half hours. Yeesh. C can do this in a minute or less. And apparently the traces that real architecture researchers use are gigabytes in size. Writing the simulator in OCaml was a joy; testing and running it was not.
There were some optimizations I didn’t do. I was reading memop-by-memop rather than in blocks of memops. I ran all of my simulations in parallel: read in a memop, simulate the memop in each type of cache, repeat. I could have improved cache locality by reading in a block of memops and then simulating in sequence; I’m not sure how the compiler laid out my statistics structures. I could’ve also written the statistics functions in C on unboxed unsigned longs, but didn’t have the patience. I’d still have to pay for boxing and unboxing the C structure every time, though. Still: one lazy summer week, I may give the code generation for boxed integers a glance.