Author: Tomasz Sobczyk
Date: Fri Aug 20 08:50:25 2021 +0200
Timestamp: 1629442225
Optimize and tidy up affine transform code.
The new network caused some issues initially due to the very narrow neuron set between the first two FC layers. Necessary changes were hacked together to make it work. This patch is a mature approach to make the affine transform code faster, more readable, and easier to maintain should the layer sizes change again.
The following changes were made:
* ClippedReLU always produces a multiple of 32 outputs. This is about as good a solution for AffineTransform's SIMD requirements as it can get without a bigger rewrite (a sketch of the padding idea follows this list).
* All self-contained SIMD helpers are moved to a separate file (simd.h). Inline asm is utilized to work around GCC's issues with code generation and register assignment; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101693 and https://godbolt.org/z/da76fY1n7 (a sketch of the inline-asm pattern follows this list).
* AffineTransform has 2 specializations. While it's more lines of code due to the boilerplate, the logic in both is significantly reduced, as these two are impossible to nicely combine into one.
1) The first specialization is for cases when there are >=128 inputs. It uses a different approach to perform the affine transform and can make full use of AVX512 without any edge cases. Furthermore, it has higher theoretical throughput because fewer loads are needed in the hot path, and only a fixed number of instructions is spent on the horizontal additions at the end, which are amortized by the large number of inputs (the accumulation idea is sketched after this list).
2) The second specialization is made to handle smaller layers where performance is still necessary but edge cases need to be handled. The AVX512 implementation for this was omitted by mistake, a remnant from the temporary implementation for the new... This could easily be reintroduced if needed. A slightly more detailed description of both implementations is in the code.
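To illustrate the multiple-of-32 padding, here is a minimal C++ sketch of the idea. The helper and constant names (ceil_to_multiple, PaddedOutputDimensions) and the shift amount are illustrative, not necessarily the identifiers used in the actual patch:

    #include <cstddef>
    #include <cstdint>

    // Round a dimension up to the next multiple of 'base' (illustrative helper).
    constexpr std::size_t ceil_to_multiple(std::size_t n, std::size_t base) {
        return (n + base - 1) / base * base;
    }

    template <std::size_t InputDimensions>
    struct ClippedReLUSketch {
        // The padded size is what the following AffineTransform consumes,
        // so its SIMD kernels never have to deal with a ragged tail.
        static constexpr std::size_t OutputDimensions       = InputDimensions;
        static constexpr std::size_t PaddedOutputDimensions = ceil_to_multiple(OutputDimensions, 32);

        void propagate(const std::int32_t* input, std::uint8_t* output) const {
            for (std::size_t i = 0; i < OutputDimensions; ++i) {
                std::int32_t v = input[i] >> 6;  // scale down (shift amount illustrative)
                output[i] = static_cast<std::uint8_t>(v < 0 ? 0 : v > 127 ? 127 : v);
            }
            // Zero the padding so the next layer can safely read a full multiple of 32.
            for (std::size_t i = OutputDimensions; i < PaddedOutputDimensions; ++i)
                output[i] = 0;
        }
    };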
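For the simd.h point, here is a hedged sketch of the kind of asm-wrapped accumulate helper involved, assuming an AVX512(-VNNI) build. The function name, constraints, and fallbacks reflect my reading of the linked GCC issue, not a verbatim copy of simd.h:

    #include <immintrin.h>

    // acc += dot products of unsigned bytes in 'a' with signed bytes in 'b',
    // accumulated into 32-bit lanes (vpdpbusd semantics). Requires AVX-512 flags.
    inline void add_dpbusd_epi32(__m512i& acc, __m512i a, __m512i b) {
    #if defined(USE_VNNI) && defined(__GNUC__)
        // Inline asm keeps the accumulator in place instead of letting GCC emit
        // extra register moves (see the bugzilla link above).
        __asm__(
            "vpdpbusd %[b], %[a], %[acc]\n\t"
            : [acc] "+v" (acc)
            : [a] "v" (a), [b] "vm" (b)
        );
    #elif defined(USE_VNNI)
        acc = _mm512_dpbusd_epi32(acc, a, b);
    #else
        // Fallback without VNNI: widen via maddubs/madd, then add.
        __m512i product = _mm512_maddubs_epi16(a, b);
        product = _mm512_madd_epi16(product, _mm512_set1_epi16(1));
        acc = _mm512_add_epi32(acc, product);
    #endif
    }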
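For the first specialization, a rough AVX2 illustration of the accumulation idea: stream all inputs for one output row, keep the running dot product in vector registers, and pay for the horizontal reduction only once at the end. This is only a sketch of the approach; the actual kernel in the patch is more involved (weight layout, multiple outputs per pass):

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // One output of an affine transform over a padded input (multiple of 32 bytes).
    inline std::int32_t dot_row_avx2(const std::uint8_t* input,
                                     const std::int8_t*  weights,
                                     std::size_t         paddedInputs,
                                     std::int32_t        bias) {
        __m256i acc = _mm256_setzero_si256();
        const __m256i ones = _mm256_set1_epi16(1);

        for (std::size_t i = 0; i < paddedInputs; i += 32) {
            __m256i in = _mm256_load_si256(reinterpret_cast<const __m256i*>(input + i));
            __m256i w  = _mm256_load_si256(reinterpret_cast<const __m256i*>(weights + i));
            __m256i p  = _mm256_maddubs_epi16(in, w);               // u8 * i8 -> i16 pairs
            acc = _mm256_add_epi32(acc, _mm256_madd_epi16(p, ones)); // i16 pairs -> i32 lanes
        }

        // Fixed-cost horizontal addition, amortized over the many inputs above.
        __m128i sum128 = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                       _mm256_extracti128_si256(acc, 1));
        sum128 = _mm_add_epi32(sum128, _mm_shuffle_epi32(sum128, _MM_SHUFFLE(1, 0, 3, 2)));
        sum128 = _mm_add_epi32(sum128, _mm_shuffle_epi32(sum128, _MM_SHUFFLE(2, 3, 0, 1)));
        return bias + _mm_cvtsi128_si32(sum128);
    }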
Overall it should be a minor speedup, as shown on fishtest:
passed STC:
LLR: 2.96 (-2.94,2.94) <-0.50,2.50>
Total: 51520 W: 4074 L: 3888 D: 43558
Elo +1.25
Ptnml(0-2): 111, 3136, 19097, 3288, 128
and various tests shown in the pull request
closes https://github.com/official-stockfish/Stockfish/pull/3663

No functional change