Reordered the weights so that each accumulated sum lands directly in its output element. Weights are grouped in blocks of four because four int8 values (the weight type) occupy the same space as one int32 (the output type). This removes the need for horizontal additions. The AVX512, AVX2 and SSSE3 implementations are now grouped together, and repeated code was removed.
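The blocked layout described above can be illustrated with a scalar C++ sketch. It is not the code from this change: the names `blockedIndex`, `reorderWeights` and `affineBlocked` are hypothetical, and it assumes the original weights are row-major (one row per output), the inputs are uint8, and the input count is already padded to a multiple of four. Each iteration of the loop over `out` corresponds to one int32 lane of a SIMD register, so every lane accumulates its own output and nothing has to be summed across lanes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: position of weight (out, in) in the blocked layout.
// Layout: [block of 4 inputs][output][4 weights].
std::size_t blockedIndex(std::size_t out, std::size_t in, std::size_t numOutputs) {
    return (in / 4) * numOutputs * 4 + out * 4 + in % 4;
}

// Convert row-major weights (index = out * numInputs + in) to the blocked layout.
// Assumes numInputs is a multiple of 4.
std::vector<std::int8_t> reorderWeights(const std::vector<std::int8_t>& rowMajor,
                                        std::size_t numInputs,
                                        std::size_t numOutputs) {
    std::vector<std::int8_t> blocked(rowMajor.size());
    for (std::size_t out = 0; out < numOutputs; ++out)
        for (std::size_t in = 0; in < numInputs; ++in)
            blocked[blockedIndex(out, in, numOutputs)] = rowMajor[out * numInputs + in];
    return blocked;
}

// Scalar model of the affine transform on blocked weights.
// Assumes input.size() is a multiple of 4 and inputs are uint8.
std::vector<std::int32_t> affineBlocked(const std::vector<std::int8_t>& blocked,
                                        const std::vector<std::uint8_t>& input,
                                        const std::vector<std::int32_t>& biases) {
    const std::size_t numOutputs = biases.size();
    std::vector<std::int32_t> output(biases);  // accumulators start at the biases
    for (std::size_t block = 0; block < input.size() / 4; ++block) {
        // Four input bytes; a SIMD version would broadcast them as one 32-bit value.
        const std::uint8_t* in4 = input.data() + block * 4;
        // Weights of this input block for all outputs, four int8 per output.
        const std::int8_t* w = blocked.data() + block * numOutputs * 4;
        for (std::size_t out = 0; out < numOutputs; ++out)  // one int32 lane per output
            for (std::size_t k = 0; k < 4; ++k)
                output[out] += std::int32_t(in4[k]) * std::int32_t(w[out * 4 + k]);
    }
    return output;
}
```

Calling `reorderWeights` once at load time and `affineBlocked` per evaluation gives the same result as a plain row-major dot product per output; only the memory order of the weights differs.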