[X86] Make a bunch of merge masked binops commutable for load folding.
This primarily affects add/fadd/mul/fmul/and/or/xor/pmuludq/pmuldq/max/min/fmaxc/fminc/pmaddwd/pavg.
We already commuted the unmasked and zero masked versions.
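
As a rough illustration of why commuting is safe here, the scalar model below assumes merge masking means "inactive lanes keep the passthru value" and uses a 16-lane wrapping 32-bit add as the example binop. The names Vec16, mergeMaskedAdd and commuteIsSafe are hypothetical and are not taken from the LLVM sources. Swapping the two sources leaves the passthru operand untouched, so a load can be folded from either source position.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of a 16-lane vector; not LLVM code.
using Vec16 = std::array<std::uint32_t, 16>;

// Merge masked add: active lanes get a[i] + b[i] (wrapping),
// inactive lanes keep the passthru value.
Vec16 mergeMaskedAdd(const Vec16 &passthru, std::uint16_t mask,
                     const Vec16 &a, const Vec16 &b) {
  Vec16 result{};
  for (std::size_t i = 0; i < result.size(); ++i) {
    const bool laneActive = ((mask >> i) & 1u) != 0;
    result[i] = laneActive ? a[i] + b[i] : passthru[i];
  }
  return result;
}

// Swapping the two sources does not touch the passthru operand, so for a
// commutative operation both orders must produce the same vector.
bool commuteIsSafe(const Vec16 &passthru, std::uint16_t mask,
                   const Vec16 &a, const Vec16 &b) {
  return mergeMaskedAdd(passthru, mask, a, b) ==
         mergeMaskedAdd(passthru, mask, b, a);
}
```

Under these assumptions, commuteIsSafe holds for any mask and any inputs, which mirrors the property the commuted instructions rely on.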
I've added 512-bit stack folding tests for most of the affected
instructions. The tests cover both the operand order that needs
commuting and the one that does not, across unmasked, merge masked,
and zero masked forms. The 128/256-bit instructions should behave
similarly.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@362746 91177308-0d34-0410-b5e6-96231b3b80d8