AMDGPU: Make SIInsertWaits about 4x faster