dkhudia 2 hours ago

> It's quite common in machine learning operations to multiply a matrix of unsigned byte by a matrix of signed byte. Don't ask me why, but that's the case.

Overflow is the reason. Intel's vpmaddubsw takes uint8_t and int8_t operands and gives you results in int16_t. If both operands were unsigned, 255 * 255 = 65025 would be out of range for int16_t (−32,768 to +32,767), so the instruction is likely designed to take one int8_t and one uint8_t. With one signed and one unsigned operand, the extremes −128 * 255 = −32,640 and 127 * 255 = 32,385 both stay within int16_t range. Overflow (or rather saturation, with this instruction) can still occur because it sums adjacent multiplications. See my comment in PyTorch: https://github.com/pytorch/pytorch/blob/a37db5ae3978010e1bb7...
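
For illustration, here is a minimal sketch of that saturation in terms of the AVX2 intrinsic for vpmaddubsw (_mm256_maddubs_epi16); the values are just an example chosen to trigger it, not code from the article or from PyTorch:

    // vpmaddubsw via its AVX2 intrinsic: multiplies unsigned bytes of `a`
    // by signed bytes of `b`, then adds adjacent product pairs into int16
    // lanes with signed saturation.
    #include <immintrin.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
        __m256i a = _mm256_set1_epi8((char)255); // unsigned operand: all 255
        __m256i b = _mm256_set1_epi8(127);       // signed operand: all 127
        __m256i r = _mm256_maddubs_epi16(a, b);  // 255*127 + 255*127 = 64770

        int16_t out[16];
        _mm256_storeu_si256((__m256i *)out, r);
        printf("%d\n", out[0]);                  // prints 32767 (saturated)
        return 0;
    }

Each individual product fits in int16; only the pairwise sum can saturate, which is the case the linked PyTorch comment describes.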

  • atq2119 8 minutes ago

    This doesn't feel like a convincing argument. If you wanted to multiply uint8 * uint8, you'd naturally use an unsigned multiply with a uint16 result. That doesn't overflow either.

    I believe a better argument appeals to the structure of neural networks. The activation inputs to a matrix multiply come out of a non-linear function, and ReLU is a popular choice whose outputs are non-negative, which makes the quantized activations naturally unsigned. Weights then need to be signed so that the matrix multiplication can produce negative outputs -- without negative outputs, the following ReLU would just be the identity and you would lose its non-linearity.
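
    As a rough sketch of how that pairing falls out of common quantization conventions (the helper names below are made up for illustration): post-ReLU activations live in [0, max], so a uint8 mapping with zero point 0 covers their whole range, while roughly zero-centered weights get a symmetric int8 mapping.

      #include <algorithm>
      #include <cmath>
      #include <cstdint>

      // Post-ReLU activations are non-negative, so uint8 with zero point 0
      // covers their full range; weights straddle zero, so int8 fits them.
      uint8_t quantize_activation(float x, float act_max) {
          float scale = act_max / 255.0f;
          return (uint8_t)std::clamp(std::lround(x / scale), 0L, 255L);
      }

      int8_t quantize_weight(float w, float w_max) {
          float scale = w_max / 127.0f;
          return (int8_t)std::clamp(std::lround(w / scale), -128L, 127L);
      }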

gok 3 hours ago

Curious how this compares with, say, the implementation of gemm_s8s8s32 in Intel's MKL / oneAPI.