Use sse_load_f32/64 for scalar FMA3 intrinsic patterns instead of 128-bit loads to...