Author: Tomasz Sobczyk
Date: Fri Aug 20 08:50:25 2021 +0200
Timestamp: 1629442225
Optimize and tidy up affine transform code.
The new network caused some issues initially due to the very narrow neuron set between the first two FC layers. Necessary changes were hacked together to make it work. This patch is a mature approach to make the affine transform code faster, more readable, and easier to maintain should the layer sizes change again.
The following changes were made:
* ClippedReLU always produces a multiple of 32 outputs. This is about as good a solution for AffineTransform's SIMD requirements as it can get without a bigger rewrite (a sketch of the padding idea follows this list).
* All self-contained SIMD helpers are moved to a separate file (simd.h). Inline asm is utilized to work around GCC's issues with code generation and register assignment; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101693 and https://godbolt.org/z/da76fY1n7 (a sketch of the inline-asm pattern follows this list).
* AffineTransform has 2 specializations. While it's more lines of code due to the boilerplate, the logic in both is significantly reduced, as these two are impossible to nicely combine into one.
1) The first specialization is for cases when there are >=128 inputs. It uses a different approach to perform the affine transform and can make full use of AVX512 without any edge cases. Furthermore, it has higher theoretical throughput because fewer loads are needed in the hot path, and only a fixed number of instructions is spent on the horizontal additions at the end, which are amortized by the large number of inputs (the accumulation idea is sketched after this list).
2) The second specialization is made to handle smaller layers where performance is still necessary but edge cases need to be handled. The AVX512 implementation for this was omitted by mistake, a remnant from the temporary implementation for the new... This could easily be reintroduced if needed. A slightly more detailed description of both implementations is in the code.
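To illustrate the multiple-of-32 padding, here is a minimal C++ sketch of the idea. The helper and constant names (ceil_to_multiple, PaddedOutputDimensions) and the shift amount are illustrative, not necessarily the identifiers used in the actual patch:

    #include <cstddef>
    #include <cstdint>

    // Round a dimension up to the next multiple of 'base' (illustrative helper).
    constexpr std::size_t ceil_to_multiple(std::size_t n, std::size_t base) {
        return (n + base - 1) / base * base;
    }

    template <std::size_t InputDimensions>
    struct ClippedReLUSketch {
        // The padded size is what the following AffineTransform consumes,
        // so its SIMD kernels never have to deal with a ragged tail.
        static constexpr std::size_t OutputDimensions       = InputDimensions;
        static constexpr std::size_t PaddedOutputDimensions = ceil_to_multiple(OutputDimensions, 32);

        void propagate(const std::int32_t* input, std::uint8_t* output) const {
            for (std::size_t i = 0; i < OutputDimensions; ++i) {
                std::int32_t v = input[i] >> 6;  // scale down (shift amount illustrative)
                output[i] = static_cast<std::uint8_t>(v < 0 ? 0 : v > 127 ? 127 : v);
            }
            // Zero the padding so the next layer can safely read a full multiple of 32.
            for (std::size_t i = OutputDimensions; i < PaddedOutputDimensions; ++i)
                output[i] = 0;
        }
    };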
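For the simd.h point, here is a hedged sketch of the kind of asm-wrapped accumulate helper involved, assuming an AVX512(-VNNI) build. The function name, constraints, and fallbacks reflect my reading of the linked GCC issue, not a verbatim copy of simd.h:

    #include <immintrin.h>

    // acc += dot products of unsigned bytes in 'a' with signed bytes in 'b',
    // accumulated into 32-bit lanes (vpdpbusd semantics). Requires AVX-512 flags.
    inline void add_dpbusd_epi32(__m512i& acc, __m512i a, __m512i b) {
    #if defined(USE_VNNI) && defined(__GNUC__)
        // Inline asm keeps the accumulator in place instead of letting GCC emit
        // extra register moves (see the bugzilla link above).
        __asm__(
            "vpdpbusd %[b], %[a], %[acc]\n\t"
            : [acc] "+v" (acc)
            : [a] "v" (a), [b] "vm" (b)
        );
    #elif defined(USE_VNNI)
        acc = _mm512_dpbusd_epi32(acc, a, b);
    #else
        // Fallback without VNNI: widen via maddubs/madd, then add.
        __m512i product = _mm512_maddubs_epi16(a, b);
        product = _mm512_madd_epi16(product, _mm512_set1_epi16(1));
        acc = _mm512_add_epi32(acc, product);
    #endif
    }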
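For the first specialization, a rough AVX2 illustration of the accumulation idea: stream all inputs for one output row, keep the running dot product in vector registers, and pay for the horizontal reduction only once at the end. This is only a sketch of the approach; the actual kernel in the patch is more involved (weight layout, multiple outputs per pass):

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // One output of an affine transform over a padded input (multiple of 32 bytes).
    inline std::int32_t dot_row_avx2(const std::uint8_t* input,
                                     const std::int8_t*  weights,
                                     std::size_t         paddedInputs,
                                     std::int32_t        bias) {
        __m256i acc = _mm256_setzero_si256();
        const __m256i ones = _mm256_set1_epi16(1);

        for (std::size_t i = 0; i < paddedInputs; i += 32) {
            __m256i in = _mm256_load_si256(reinterpret_cast<const __m256i*>(input + i));
            __m256i w  = _mm256_load_si256(reinterpret_cast<const __m256i*>(weights + i));
            __m256i p  = _mm256_maddubs_epi16(in, w);               // u8 * i8 -> i16 pairs
            acc = _mm256_add_epi32(acc, _mm256_madd_epi16(p, ones)); // i16 pairs -> i32 lanes
        }

        // Fixed-cost horizontal addition, amortized over the many inputs above.
        __m128i sum128 = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                       _mm256_extracti128_si256(acc, 1));
        sum128 = _mm_add_epi32(sum128, _mm_shuffle_epi32(sum128, _MM_SHUFFLE(1, 0, 3, 2)));
        sum128 = _mm_add_epi32(sum128, _mm_shuffle_epi32(sum128, _MM_SHUFFLE(2, 3, 0, 1)));
        return bias + _mm_cvtsi128_si32(sum128);
    }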
Overall it should be a minor speedup, as shown on fishtest:
passed STC:
LLR: 2.96 (-2.94,2.94) <-0.50,2.50>
Total: 51520 W: 4074 L: 3888 D: 43558
Elo +1.25
Ptnml(0-2): 111, 3136, 19097, 3288, 128
and various tests shown in the pull request
closes https://github.com/official-stockfish/Stockfish/pull/3663

No functional change