qt6-bb10/src/corelib
Thiago Macieira 55959aefab qHash: implement an AES hasher for QLatin1StringView
It's the same aeshash() as before, except we're passing a template
parameter to indicate whether to read half and then zero-extend the
data. That is, it will perform a conversion from Latin1 on the fly.

When running in zero-extending mode, the length parameters are actually
doubled (counting the number of UTF-16 code units) and we then divide
again by 2 when advancing.
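A minimal sketch of the zero-extending mode described above, written in portable C++ rather than with the SSE4.1/AVX2 instructions the real code uses. The names `Widen`, `loadBlock`, `advance` and `toyHash` are hypothetical, and the FNV-style mixing step is only a stand-in for the AES rounds. It illustrates how a template parameter can select "read half, then zero-extend", and why the length is doubled on entry and halved again when the pointer advances:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical mode tag; the commit only says "a template parameter".
enum class Widen { No, ByteToWord };

// Load one 16-byte block of hash input (8 UTF-16 code units).
// In ByteToWord mode only 8 Latin-1 bytes are read and each is zero-extended,
// which is what pmovzxbw does in the SSE4.1 loop.
template <Widen W>
void loadBlock(const unsigned char *src, std::uint16_t (&out)[8])
{
    if constexpr (W == Widen::ByteToWord) {
        for (std::size_t i = 0; i < 8; ++i)
            out[i] = src[i];                 // Latin-1 -> UTF-16 on the fly
    } else {
        std::memcpy(out, src, sizeof(out));  // already UTF-16, copy as-is
    }
}

// Advance by `len` bytes of hash input; in ByteToWord mode the source
// only moves by half that many bytes.
template <Widen W>
const unsigned char *advance(const unsigned char *src, std::size_t len)
{
    if constexpr (W == Widen::ByteToWord)
        return src + len / 2;
    else
        return src + len;
}

// Toy hasher (NOT aeshash): `len` is the number of bytes at `src`.
template <Widen W>
std::uint64_t toyHash(const unsigned char *src, std::size_t len)
{
    if constexpr (W == Widen::ByteToWord)
        len *= 2;                            // count bytes of UTF-16 instead
    assert(len % 16 == 0);                   // sketch handles whole blocks only
    std::uint64_t h = 0xcbf29ce484222325ull;
    for (std::size_t done = 0; done < len; done += 16) {
        std::uint16_t block[8];
        loadBlock<W>(src, block);
        for (std::uint16_t unit : block)
            h = (h ^ unit) * 0x100000001b3ull;
        src = advance<W>(src, 16);
    }
    return h;
}

// The Latin-1 and UTF-16 forms of the same text feed identical blocks
// to the mixing step, so they hash to the same value.
inline bool sameHashForBothEncodings()
{
    const char latin1[] = "0123456789abcdef";       // 16 characters
    const char16_t utf16[] = u"0123456789abcdef";
    return toyHash<Widen::ByteToWord>(reinterpret_cast<const unsigned char *>(latin1), 16)
        == toyHash<Widen::No>(reinterpret_cast<const unsigned char *>(utf16), 32);
}
```

Calling `sameHashForBothEncodings()` compares the two paths on the same 16-character text. The only per-mode differences are the load and the pointer step, which mirrors the pmovzxbw/add differences in the listings below.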

The implementation should have the following performance
characteristics:
* QLatin1StringView will now be roughly half as fast as in Qt 6.7
* QLatin1StringView will now be roughly as fast as QStringView

For aeshash128() in default builds of QtCore (which use SSE4.1), the
long loop (32 characters or more) is:

      QStringView                             QLatin1StringView
    movdqu -0x20(%rax),%xmm4       |        pmovzxbw -0x10(%rdx),%xmm2
    movdqu -0x10(%rax),%xmm5       |        pmovzxbw -0x8(%rdx),%xmm3
    add    $0x20,%rax              |        add    $0x10,%rdx
    pxor   %xmm4,%xmm0             |        pxor   %xmm2,%xmm0
    pxor   %xmm5,%xmm1             |        pxor   %xmm3,%xmm1
    aesenc %xmm0,%xmm0                      aesenc %xmm0,%xmm0
    aesenc %xmm1,%xmm1                      aesenc %xmm1,%xmm1
    aesenc %xmm0,%xmm0                      aesenc %xmm0,%xmm0
    aesenc %xmm1,%xmm1                      aesenc %xmm1,%xmm1

The number of instructions is identical, but there are actually 2 more
uops per iteration. LLVM-MCA simulation shows this should execute in the
same number of cycles on older CPUs that do not have support for VAES
(see <https://analysis.godbolt.org/z/x95Mrfrf7>).

For the VAES version in aeshash256() and the AVX10 version in
aeshash256_256():

      QStringView                             QLatin1StringView
    vpxor  -0x40(%rax),%ymm1,%ym   |        vpmovzxbw -0x20(%rax),%ymm3
    vpxor  -0x20(%rax),%ymm0,%ym   |        vpmovzxbw -0x10(%rax),%ymm2
    add    $0x40,%rax              |        add    $0x20,%rax
                                   |        vpxor  %ymm3,%ymm0,%ymm0
                                   |        vpxor  %ymm2,%ymm1,%ymm1
    vaesenc %ymm1,%ymm1,%ymm1      <
    vaesenc %ymm0,%ymm0,%ymm0               vaesenc %ymm0,%ymm0,%ymm0
    vaesenc %ymm1,%ymm1,%ymm1               vaesenc %ymm1,%ymm1,%ymm1
    vaesenc %ymm0,%ymm0,%ymm0               vaesenc %ymm0,%ymm0,%ymm0
                                   >        vaesenc %ymm1,%ymm1,%ymm1

In this case, the increase in number of instructions matches the
increase in number of uops. The LLVM-MCA simulation says that the
QLatin1StringView version is faster at 11 cycles/iteration vs 14 cyc/it
(see <https://analysis.godbolt.org/z/1Gv1coz13>), but that can't be
right.

Measured CPU cycles on an Intel Core i9-7940X (Skylake, no VAES support),
normalized to the QString result (QByteArray is used as a stand-in for
the QLatin1StringView performance in Qt 6.7):

                        aeshash              |  siphash
                QByteArray  QL1SV   QString     QByteArray  QString
dictionary      94.5%       79.7%   100.0%      150.5%*     159.8%
paths-small     90.2%       93.2%   100.0%      202.8%      290.3%
uuids           81.8%       100.7%  100.0%      215.2%      350.7%
longstrings     42.5%       100.8%  100.0%      185.7%      353.2%
numbers         95.5%       77.9%   100.0%      155.3%*     164.5%

On an Intel Core i7-1165G7 (Tiger Lake, capable of VAES and AVX512VL):

                        aeshash              |  siphash
                QByteArray  QL1SV   QString     QByteArray  QString
dictionary      90.0%       91.1%   100.0%      103.3%*     157.1%
paths-small     99.4%       104.8%  100.0%      237.5%      358.0%
uuids           88.5%       117.6%  100.0%      274.5%      461.7%
longstrings     57.4%       111.2%  100.0%      503.0%      974.3%
numbers         90.6%       89.7%   100.0%      98.7%*      149.9%

On an Intel 4th Generation Xeon Scalable Platinum (Sapphire Rapids, same
Golden Cove core as Alder Lake):

                        aeshash              |  siphash
                QByteArray  QL1SV   QString     QByteArray  QString
dictionary      89.9%       102.1%  100.0%      158.1%*     172.7%
paths-small     78.0%       89.4%   100.0%      159.4%      258.0%
uuids           109.1%      107.9%  100.0%      279.0%      496.3%
longstrings     52.1%       112.4%  100.0%      564.4%      1078.3%
numbers         85.8%       98.9%   100.0%      152.6%*     190.4%

* dictionary contains very short entries (6 characters)
* paths-small contains strings of varying length, but very few over 32
* uuids-list contains fixed-length strings (38 characters)
* longstrings also contains fixed-length strings, but of 304 characters
* numbers also contains a lot of very short strings (1 to 6 chars)

What this shows:
* For short strings, the performance difference is negligible between
  all three
* For longer strings, QLatin1StringView now costs between 7 and 17% more
  than QString on the tested machines instead of up to ~50% less, except on
  the older machine (where I think the main QString hashing is suffering
  from memory bandwidth limitations)
* The AES hash implementation is anywhere from 1.6 to 11x faster than
  Siphash
* Murmurhash (marked with an asterisk) is much faster than Siphash, but
  it only managed to beat the AES hash in one test
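
As a quick check of the "between 7 and 17%" figure, assuming it refers to the uuids and longstrings rows on the two newer machines (Tiger Lake and Sapphire Rapids), the range can be recomputed from the QL1SV column of the tables above. The helper name and the constant are hypothetical; the four values are copied from the tables:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <utility>

// QL1SV column (QString = 100%) for uuids and longstrings:
// Tiger Lake 117.6 / 111.2, Sapphire Rapids 107.9 / 112.4.
constexpr double kQl1svLongRows[] = { 117.6, 111.2, 107.9, 112.4 };

// Returns {smallest, largest} extra cost over QString, in percent.
inline std::pair<double, double> ql1svOverheadRange()
{
    const auto mm = std::minmax_element(std::begin(kQl1svLongRows),
                                        std::end(kQl1svLongRows));
    return { *mm.first - 100.0, *mm.second - 100.0 };
}
```

The result is about 7.9% and 17.6%, in line with the range quoted in the list. The Skylake rows (100.7% and 100.8%) are left out because the list treats that machine as the exception.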

Change-Id: I664b9f014ffc48cbb49bfffd17b045c1811ac0ed
Reviewed-by: Qt CI Bot <qt_ci_bot@qt-project.org>
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
2024-03-12 18:23:09 -07:00
animation Q*Animation: s/QPair/std::pair/ 2023-12-14 20:29:45 +01:00
compat QUrl: Use new comparison helper macros 2024-03-06 11:09:21 +01:00
doc Mention QChronoTimer in API docs 2024-03-03 19:56:55 +02:00
global QT_NO_QEXCHANGE: add docs 2024-03-09 10:34:08 +01:00
io QProcess/Unix: fix improper restoration of signal mask and cancel state 2024-03-12 04:38:57 -07:00
ipc Recreate posix QSystemSemaphore on release for VxWorks 2023-12-15 17:12:38 +00:00
itemmodels QIdentityProxyModel: add setHandleSourceLayoutChanges(bool) 2023-12-15 16:52:10 +03:00
kernel QVariant: use comparison helper macros 2024-03-12 21:51:43 +01:00
mimetypes Correct license for tools files 2024-03-05 12:59:21 +01:00
platform wasm: Export haveJspi() for dynamic linking 2024-02-29 14:02:32 +01:00
plugin QCborStreamReader: rename toStringish() -> readAllStringish() 2024-03-12 05:36:54 +01:00
serialization QDataStream: make the public-ish private members smaller in Qt 7 2024-03-12 12:51:43 -08:00
text QString/QByteArray: add explicit constructors for Q{String,ByteArray}View 2024-03-12 17:23:03 -08:00
thread QThread/Win: set the thread name on non-MSVC also 2024-03-08 16:44:18 +00:00
time Correct documentation of QDateTime's comparisons 2024-03-08 18:31:47 +01:00
tools qHash: implement an AES hasher for QLatin1StringView 2024-03-12 18:23:09 -07:00
tracing Fix CTF with static build 2023-10-30 18:59:21 +03:00
CMakeLists.txt Remove qbytearray_p.h 2024-03-04 14:42:32 +01:00
Qt6AndroidMacros.cmake Disable depfile support for the external projects when building for Android 2024-02-29 16:30:31 +01:00
Qt6CTestMacros.cmake Fix running CMake test projects in prefix builds 2023-08-19 11:03:36 +02:00
Qt6CoreConfigExtras.cmake.in Remove the commented legacy code from Qt6CoreConfigExtras.cmake.in 2024-02-07 23:23:40 +01:00
Qt6CoreConfigureFileTemplate.in CMake: Fix unnecessary rebuilding upon reconfiguration 2020-10-30 17:19:27 +01:00
Qt6CoreDeploySupport.cmake CMake: Fix custom QT_DEPLOY_TRANSLATIONS_DIR on Windows 2024-02-14 18:24:39 +01:00
Qt6CoreMacros.cmake CMake: Warn when examples are not added via qt_internal_add_example 2024-03-11 13:48:23 +01:00
Qt6CoreResourceInit.in.cpp Remove ; after QT_DECLARE_EXTERN_RESOURCE 2024-02-22 19:09:36 +01:00
Qt6WasmMacros.cmake wasm: set MAXIMUM_MEMORY to 4GB 2023-12-20 00:44:28 +00:00
QtCompressMimeDatabase.cmake Replace the scripting-based mime types compression mechanism with CMake 2022-12-01 02:23:51 +01:00
QtCore.dynlist Remove old pre-6.0 hooks 2021-10-22 19:12:07 -07:00
QtInstallPaths.cmake.in Move install paths from CoreConfigExtras.cmake to a separate file 2022-10-25 16:05:11 +02:00
configure.cmake Fix build when systemsemaphore is disabled 2024-02-07 14:02:19 +00:00
debug_script.py Correct license for tools files 2024-03-05 12:59:21 +01:00
qt_cmdline.cmake CMake: remove test for eventfd, replace with __has_include 2023-08-29 07:41:11 -07:00