Optimizing Enzo
Optimizing Enzo with the Intel compiler
Overview
Below I document my efforts to optimize Enzo using only the command-line options of the Intel v8.1 compiler. A few options give gains beyond -O3, the most aggressive general optimization level. I benchmarked each build with a short, simple simulation: a 64^3 topgrid with dark matter (DM) and 9-species cooling on 4 processors. Because the benchmark is short, the absolute time savings are small. The percentage changes are what matter, since a 15% reduction in run time shortens a week-long simulation by about a day.
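The week-long estimate is simple arithmetic. A minimal sketch, assuming the 15% figure is a reduction in wall-clock time; the function name days_saved is only illustrative.

```python
def days_saved(run_days, fractional_decrease):
    """Wall-clock days saved when the run time drops by the given fraction."""
    return run_days * fractional_decrease


# A 15% shorter run time on a 7-day simulation saves roughly one day.
saved = days_saved(7.0, 0.15)
print(f"{saved:.2f} days saved")
```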
Baserun
- Options: -O2
- Time: 1 minute 54 seconds
Aggressive Optimization
- Options: -O3
- Time: 1 minute 43 seconds (9.6% decrease)
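The percentages quoted on this page are changes in run time relative to the -O2 baserun. A small sketch of that calculation, using the two timings reported so far; the helper names to_seconds and percent_change are hypothetical.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds timing to seconds."""
    return 60 * minutes + seconds


def percent_change(baseline_s, new_s):
    """Signed percent change in run time relative to the baseline.

    Negative values mean the new build is faster.
    """
    return 100.0 * (new_s - baseline_s) / baseline_s


baseline = to_seconds(1, 54)  # -O2 baserun: 114 s
o3 = to_seconds(1, 43)        # -O3 run: 103 s
print(f"{percent_change(baseline, o3):.1f}%")
```

The same function applies to the later runs on this page: pass the baserun time first and the new timing second.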
-fast
Enzo will not compile with -fast, which is shorthand for the options -O3 -ipo -static. The -static option causes problems with the MPI libraries that I don't fully understand. In any case, we don't want to use -ipo, for the reasons described next.
Aggressive Optimization plus IPO
The Intel compiler can optimize a program as a whole rather than one routine at a time. Intel calls this Interprocedural Optimization (IPO). To use static libraries (i.e. foo.a), the user must specify -ipo_obj in addition to -ipo. In theory, IPO should increase performance. In practice, I observed a significant degradation.
- Options: -O3 -ipo -ipo_obj
- Time: 2 minutes 37 seconds (37.7% increase)
- Note: When using IPO, the linked libraries must be specified in the correct order. If not, IPO complains that there are unknown symbols. In my HDF4 version, the correct order is
-lmpi -lmpi++ -limf -lm -lmfhdf -ldf -ljpeg -lz -lstdc++ -lcxa -lunwind -lifcore -lifport.
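Because the order matters, it can help to keep the library list in one place and generate the link flags from it. The sketch below is only an illustration: REFERENCE_ORDER copies the order quoted above for my HDF4 build, and the helper names are hypothetical. Other builds may need a different list.

```python
# Library order quoted above for the HDF4 build (an assumption for other builds).
REFERENCE_ORDER = [
    "mpi", "mpi++", "imf", "m", "mfhdf", "df", "jpeg", "z",
    "stdc++", "cxa", "unwind", "ifcore", "ifport",
]


def follows_reference_order(libs):
    """True if libs appear in the same relative order as REFERENCE_ORDER.

    Raises ValueError for a library that is not in the reference list.
    """
    positions = [REFERENCE_ORDER.index(name) for name in libs]
    return positions == sorted(positions)


def link_flags(libs):
    """Turn library names into a string of -l options, preserving order."""
    return " ".join(f"-l{name}" for name in libs)


print(link_flags(REFERENCE_ORDER))
```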
Aggressive Optimization plus Floating Point Optimizations
We can take -O3 a step further and optimize the floating-point operations. I used two options here. -IPF_fma lets the compiler combine a multiply with a following addition or subtraction into a single fused operation. -IPF_fp_speculation_modefast allows the Itanium 2 processor to "speculate" on upcoming floating-point operations.
- Options: -O3 -IPF_fma -IPF_fp_speculation_modefast
- Time: 1 minute 40 seconds (12.3% decrease)
Above plus PGO
The final step I took in this brief excursion into optimization was Intel's profile-guided optimization (PGO). This is a three-step procedure for producing optimized code.
Step 1
The user compiles the code with the additional option -prof_gen to generate profiles while running the executable.
Step 2
The user runs the executable produced in Step 1 to generate the performance profiles. It is important to run the code on a simulation that resembles a typical one. It can be smaller and shorter, but it should exercise the same routines as a typical simulation (i.e. cooling, rebuild hierarchy, communication, etc.).
Step 3
Finally, the user recompiles the code with -prof_use in place of -prof_gen. The compiler uses the profiles to optimize the most heavily travelled parts of the code. This build must be done in serial: you cannot use make -j.
- Options: -O3 -IPF_fma -IPF_fp_speculation_modefast -prof_use
- Time: 1 minute 30 seconds (21.1% decrease)
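The three PGO steps can be summarized as a small plan. This is a sketch under stated assumptions: the option spellings are copied from this page, I assume the Step 1 build uses the same base options as the final build, and the names BASE_FLAGS, pgo_flags, and pgo_plan are hypothetical. Nothing here invokes the compiler; it only assembles the option lists.

```python
# Base options from the floating-point section, spelled as on this page.
BASE_FLAGS = ["-O3", "-IPF_fma", "-IPF_fp_speculation_modefast"]

# Extra option for each PGO build phase.
PGO_OPTION = {"generate": "-prof_gen", "use": "-prof_use"}


def pgo_flags(phase):
    """Compiler options for the instrumented ("generate") or final ("use") build."""
    if phase not in PGO_OPTION:
        raise ValueError(f"unknown PGO phase: {phase!r}")
    return BASE_FLAGS + [PGO_OPTION[phase]]


def pgo_plan():
    """The three steps in order, as (description, options) pairs."""
    return [
        ("Step 1: compile with profiling enabled", pgo_flags("generate")),
        ("Step 2: run a representative simulation", []),
        ("Step 3: recompile in serial (no make -j)", pgo_flags("use")),
    ]


for description, options in pgo_plan():
    print(description, " ".join(options))
```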