Recently, Kumar and I were trying to run a grid case for our meshfree solver to test out an experimental objective function. While testing the new grid, we found that the solver was generating a lot of NaN values.
Initially, we thought something was wrong with our code. We spent more than 4 days going through it line by line, verifying the correctness of our algorithm. To throw a monkey wrench into the works, the code worked for every other grid except the new one we wanted to test. So we spent even more time inspecting the grid to see if anything was wrong with it.
In the end, we back-traced our algorithm and found exactly where in the code the NaNs were being computed.
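The trail ended at the expression `sum_delx_sqr*sum_dely_sqr - sum_delx_dely*sum_delx_dely`, which is assembled from the differences between a point and its neighbors. Roughly, the pattern looks like the following minimal C sketch (hypothetical values, not the solver's actual code, and assuming the expression ends up in a denominator, which would explain the NaNs we were seeing):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical neighbor differences (delx_i, dely_i) of one grid point */
    double delx[] = {0.5, -0.25};
    double dely[] = {0.25, 0.5};
    int nbhs = 2;

    double sum_delx_sqr = 0.0, sum_dely_sqr = 0.0, sum_delx_dely = 0.0;
    for (int i = 0; i < nbhs; ++i) {
        sum_delx_sqr  += delx[i] * delx[i];
        sum_dely_sqr  += dely[i] * dely[i];
        sum_delx_dely += delx[i] * dely[i];
    }

    /* The quantity that "can never be 0" -- and that blows up when it is */
    double det = sum_delx_sqr * sum_dely_sqr - sum_delx_dely * sum_delx_dely;
    printf("det     = %g\n", det);
    printf("1 / det = %g\n", 1.0 / det);  /* Inf when det == 0, and NaNs soon after */
    return 0;
}
```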
At first glance, it didn't make much sense. The values we were computing can never be 0.
So we checked the `FFLAGS` we were using in the Makefile. It turned out we were not using the `-march=native` flag, like we did for our other applications. Not really a problem in itself: it just means a small performance loss from producing a generic x86-64 build instead of targeting the native CPU architecture. So we set the `-march` flag to `native`.
We thought the compiler we were using might be the problem, so we changed `cc` and `fc` to point to another compiler. We rebuilt the application and, lo and behold, no more NaNs. So the problem was with the compiler, right? Well, we pointed `cc` and `fc` back at the old compiler, rebuilt the application, and the NaNs… didn't appear. Wait, what? Wasn't the compiler the reason we were getting NaNs?

We checked the Makefile again, but this time we removed the `-march=native` flag we had added. To our surprise, the NaNs reappeared.
I quickly went to the gcc documentation for the `-m` flags and checked the additional flags which are set by `-march=native`. I found out that on `haswell` (the system on which we were compiling) the code compiles with 64-bit extensions and MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI, BMI2, and F16C instruction set support.
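You can also ask the compiler directly what it enabled. GCC and Clang predefine macros such as `__FMA__` and `__AVX2__` when the corresponding instruction sets are turned on (for example by `-mfma`, or by `-march=native` on an FMA-capable CPU), so a tiny C program can report what a given set of flags actually does:

```c
#include <stdio.h>

int main(void) {
    /* These macros are predefined by GCC/Clang when the instruction sets are enabled */
#ifdef __FMA__
    puts("FMA instructions are enabled for this build");
#else
    puts("FMA instructions are NOT enabled for this build");
#endif
#ifdef __AVX2__
    puts("AVX2 instructions are enabled for this build");
#endif
    return 0;
}
```

Built with `-march=native` on a Haswell machine, this reports FMA as enabled; built as a plain generic x86-64 binary, it does not.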
We then checked which of these instruction sets was relevant to the code we were having issues with above. We narrowed it down to FMA, also known as fused multiply–add.
What is FMA?
According to Wikipedia:
A fused multiply–add (FMA or fmadd) is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product b×c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire expression a+b×c to its full precision before rounding the final result down to N significant bits.
In our case, instead of rounding the products `sum_delx_sqr * sum_dely_sqr` and `sum_delx_dely * sum_delx_dely` separately and then subtracting, the compiler keeps one of the two products at full precision inside a fused multiply–subtract, so the expression `sum_delx_sqr*sum_dely_sqr - sum_delx_dely*sum_delx_dely` is evaluated with one less rounding. This leads to a more accurate result, as the round-off error decreases.
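You can see the effect of that saved rounding directly with the `fma()` function from C's `math.h`. This is only an illustration of the rounding behaviour, not the solver's code (compile with, e.g., `cc fma_demo.c -lm`):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* x = 2^27 + 1 is exactly representable, but its square (2^54 + 2^28 + 1)
       needs 55 significand bits, so the plain product has to be rounded.      */
    double x = 134217729.0;

    double prod  = x * x;             /* rounded product: 2^54 + 2^28          */
    double fused = fma(x, x, -prod);  /* exact x*x minus the rounded x*x,
                                         rounded only once at the very end     */

    printf("x*x (rounded)   : %.17g\n", prod);   /* 18014398777917440 */
    printf("fma(x, x, -x*x) : %.17g\n", fused);  /* 1 -- the rounding error the
                                                    plain product threw away   */
    return 0;
}
```

The fused call preserves exactly the information that the separately rounded product throws away.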
While FMA fixed the immediate issue, our solution was losing accuracy over time as we increased the number of cores in the MPI variant of the solver. We went back to the drawing board: something else was causing the divergence. Compiling with FMA had technically “solved” the NaN problem, but in doing so it was masking something that should have made the code unrunnable.
What really happened
We rechecked the code, this time printing the intermediate values, and found the reason. For some of the points in the grid, the differences between the point and its neighbors were so small (of the order of 1e-18) that the resulting values collapsed to zero. This was not happening with FMA: a value computed as `3.4567541256061335E-018`, which you would think could safely be rounded off to 0, was still being preserved.
How does this matter?
Before we zeroed in on the faulty code, we made a statement:
The values computed for `sum_delx_sqr`, `sum_dely_sqr` and `sum_delx_dely` can never be 0.
FMA or not, the values were getting very close to 0. This meant the bug was not in our code! We checked the grid file and, lo and behold, we found the actual problem. (No kidding! This one is the real deal.)
Each point in the grid file has a set of neighbor points; we call this its connectivity set. Now, the only case where `sum_delx_sqr`, `sum_dely_sqr` and `sum_delx_dely` can combine to give exactly 0 is when this connectivity set somehow has only one neighbor. Guess what? Some points had just a single neighbor.
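To see why a single neighbor is fatal: with only one `(delx, dely)` pair, the three sums reduce to `delx*delx`, `dely*dely` and `delx*dely`, so `sum_delx_sqr*sum_dely_sqr - sum_delx_dely*sum_delx_dely` collapses to `delx*delx*dely*dely - (delx*dely)*(delx*dely)`, which is identically zero. A small illustration with hypothetical values, in the same C-sketch style as before:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical point with a single neighbor: just one (delx, dely) pair */
    double delx = 0.5, dely = 0.25;   /* exactly representable, so every product is exact */

    double sum_delx_sqr  = delx * delx;   /* 0.25   */
    double sum_dely_sqr  = dely * dely;   /* 0.0625 */
    double sum_delx_dely = delx * dely;   /* 0.125  */

    /* delx^2 * dely^2 - (delx*dely)^2 is identically zero: the stencil is degenerate */
    double det = sum_delx_sqr * sum_dely_sqr - sum_delx_dely * sum_delx_dely;

    printf("det     = %g\n", det);        /* 0                                        */
    printf("1 / det = %g\n", 1.0 / det);  /* inf, and NaNs follow in later arithmetic */
    return 0;
}
```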
We spent more than 5 days debugging the code when the entire fault was with the grid file being fed to the solver. Apparently, during grid generation, the complexity of the grid meant that some of the safety checks that were supposed to validate it did not go through. Yikes!
We quickly rectified the issue and tried the grid again, this time with a proper connectivity set. It works!
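For what it's worth, even a very small sanity check over the connectivity sets would have flagged this grid in seconds. A hypothetical sketch (the real grid format and the solver's minimum-stencil requirement will differ):

```c
#include <stdio.h>

/* Hypothetical layout: nbhs[i] holds the neighbor count of point i */
int validate_connectivity(const int *nbhs, int npoints, int min_nbhs) {
    int bad = 0;
    for (int i = 0; i < npoints; ++i) {
        if (nbhs[i] < min_nbhs) {
            printf("point %d has only %d neighbor(s)\n", i, nbhs[i]);
            bad++;
        }
    }
    return bad;   /* 0 means the connectivity passed the check */
}

int main(void) {
    int nbhs[] = {5, 6, 1, 4};                    /* point 2 is the degenerate kind we hit */
    int bad = validate_connectivity(nbhs, 4, 2);  /* a real solver likely requires more    */
    printf("%d degenerate point(s) found\n", bad);
    return bad != 0;
}
```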
Conclusion
If your code works for every test case except one, the bug is not in your code but in the test case. Just kidding! That may hold for meshfree solvers, but go through the code once anyway, just in case.
Next time you read a StackOverflow answer saying “use these build flags to optimize the code”, stop and ponder for a second. Think about why those flags speed things up and what exactly they do.
It helps to have another person sitting with you and debugging the code. I sincerely thank Kumar Prasun for his time and effort. Do check out his blog; he is a cool guy.