22 Commits

Author SHA1 Message Date
David Sterba 03f41ac508 btrfs-progs: detect PCLMUL CPU support for accelerated crc32c
The accelerated crc32c needs to check for two CPU features: the crc32c
instruction is part of SSE 4.2 and 'pclmulqdq' is a separate feature.
There is still old hardware in use that does not have the PCLMUL
instructions. Detect it and make it the condition.

The pclmul is not supported on old compilers so also add a
configure-time detection and leave the SSE 4.2 only implementation as
the accelerated one if possible.
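
The two-feature check described above can be sketched as follows. This is
only an illustration, not the btrfs-progs code: the enum and function names
are hypothetical, and the runtime probe assumes the GCC/Clang builtins
__builtin_cpu_init() and __builtin_cpu_supports() on x86.

```c
#include <stdbool.h>

/* Hypothetical names; the real btrfs-progs identifiers may differ. */
enum crc32c_impl {
	CRC32C_IMPL_REF,	/* portable reference code */
	CRC32C_IMPL_SSE42,	/* crc32 instruction only */
	CRC32C_IMPL_PCL,	/* crc32 instruction plus pclmulqdq */
};

/* Pure selection logic: PCL needs both features, SSE 4.2 alone still helps. */
static enum crc32c_impl crc32c_pick(bool has_sse42, bool has_pclmul)
{
	if (has_sse42 && has_pclmul)
		return CRC32C_IMPL_PCL;
	if (has_sse42)
		return CRC32C_IMPL_SSE42;
	return CRC32C_IMPL_REF;
}

/* Runtime probe via compiler builtins; other targets fall back to REF. */
static enum crc32c_impl crc32c_detect(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
	__builtin_cpu_init();
	return crc32c_pick(__builtin_cpu_supports("sse4.2") != 0,
			   __builtin_cpu_supports("pclmul") != 0);
#else
	return CRC32C_IMPL_REF;
#endif
}
```

Keeping the decision in crc32c_pick() separate from the probe makes the
fallback order testable on any machine, including ones without PCLMUL.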

Issue: #676
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-13 00:38:50 +02:00
David Sterba 83ac6e0a72 btrfs-progs: crypto: make the PCL implementation default for crc32c
Drop the old native intel implementation and use the PCL one. Remove the
artificial CPU flags.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-28 17:24:24 +02:00
David Sterba 992be8b50a btrfs-progs: crypto: add PCL based implementation for crc32c
Copy the faster implementation of crc32c from the Linux kernel as of
6.5-rc7 (x86_64, arch/x86/crypto/crc32c-pcl-intel-asm_64.S). This needs
assembler build support, so detect the target architecture to keep
cross-compilation working.

Add a special CPU flag so the old and new implementations can be
benchmarked and verified separately.
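
Verifying an accelerated version usually means comparing it against a
plain reference. A minimal sketch, assuming the standard CRC32C definition
(reflected Castagnoli polynomial 0x82F63B78, initial value and final XOR of
all ones); the function names are hypothetical and the seed convention used
inside btrfs-progs may differ:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C: slow, but easy to audit. */
static uint32_t crc32c_ref(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

typedef uint32_t (*crc32c_fn)(uint32_t crc, const void *buf, size_t len);

/* Returns 1 if impl agrees with the reference for every prefix of buf. */
static int crc32c_matches_ref(crc32c_fn impl, const void *buf, size_t len)
{
	for (size_t n = 0; n <= len; n++)
		if (impl(0, buf, n) != crc32c_ref(0, buf, n))
			return 0;
	return 1;
}
```

With this convention, crc32c_ref(0, "123456789", 9) gives the well-known
check value 0xE3069283.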

Sample benchmark:

CPU flags: 0x1ff
CPU features: SSE2 SSSE3 SSE41 SSE42 SHA AVX AVX2 CRC32C_PCL
Block size:     4096
Iterations:     1000000
Implementation: builtin
Units:          CPU cycles

      NULL-NOP: cycles:     77177218, cycles/i       77
   NULL-MEMCPY: cycles:    226313072, cycles/i      226,    62133.395 MiB/s
    CRC32C-ref: cycles:  24418596066, cycles/i    24418,      575.859 MiB/s
     CRC32C-NI: cycles:   1188335920, cycles/i     1188,    11833.073 MiB/s
    CRC32C-PCL: cycles:    463193456, cycles/i      463,    30358.037 MiB/s
        XXHASH: cycles:    851606646, cycles/i      851,    16511.916 MiB/s
    SHA256-ref: cycles:  74476234956, cycles/i    74476,      188.808 MiB/s
     SHA256-NI: cycles:  34198637428, cycles/i    34198,      411.177 MiB/s
    BLAKE2-ref: cycles:  14761411664, cycles/i    14761,      952.597 MiB/s
   BLAKE2-SSE2: cycles:  18101896796, cycles/i    18101,      776.807 MiB/s
  BLAKE2-SSE41: cycles:  12599091062, cycles/i    12599,     1116.087 MiB/s
   BLAKE2-AVX2: cycles:   9668247506, cycles/i     9668,     1454.418 MiB/s

The new implementation is about 2.5x faster.
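
The ratio can be checked from the cycle counts in the table above. A small
helper (hypothetical, not part of hash-speedtest) that divides two totals
measured over the same amount of work:

```c
/* Speedup of a new implementation, from total cycles for equal work. */
static double speedup(unsigned long long cycles_old,
		      unsigned long long cycles_new)
{
	return (double)cycles_old / (double)cycles_new;
}
```

speedup(1188335920ULL, 463193456ULL) compares CRC32C-NI with CRC32C-PCL and
comes out at roughly 2.57.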

Note: the new version does not work on musl because of linkage
problems (relocations in .rodata), so it's still using the old
implementation.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-28 17:24:24 +02:00
David Sterba bbcd599062 btrfs-progs: hash-speedtest: benchmark the configured backend
Change what hash-speedtest benchmarks according to the
--with-crypto=backend option. Until now it would run the same version
under different names inherited from the builtin.

At configure time detect availability of all backends and define all
macros.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-01 15:53:26 +01:00
David Sterba 4fc291a465 btrfs-progs: fix detection of accelerated implementation.
The build fails with crypto backends other than builtin because the
initializers cannot be reached: they are ifdef-ed out. Move
hash_init_accel under the right condition and delete the
algorithm-specific initializers, as they are used only by the hash test,
which can simply call hash_init_accel to set the implementation.

All the -m flags need to be detected at configure time and the resulting
macro (HAVE_CFLAG_m*) used for the ifdef, not the feature macro defined
by the compiler, because the dispatcher function is not built with the
-m flags.

The uname check for x86_64 must be dropped so that the accelerated
version can still be built on i386/i586.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-01 15:10:21 +01:00
David Sterba 140234dc0d btrfs-progs: move include from toplevel directory to include/
In order to reduce number of files in the toplevel directory,

Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-28 20:11:23 +01:00
David Sterba 3d9217f7ab btrfs-progs: hash-speedtest: select implementation by features
Now put all the recent changes into action. Add a callback that will
reinitialize the implementation pointers according to the desired
feature. Reference implementations use the NONE CPU flag to distinguish
them from the rest.
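
The idea of re-pointing the implementation from a feature flag can be
sketched like this. All names and flag values below are hypothetical, and
the two "hash" functions are stand-ins that only need to be distinguishable;
the real callback in hash-speedtest may look different.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; NONE selects the reference code. */
enum cpu_flag {
	CPU_FLAG_NONE  = 0,
	CPU_FLAG_SSE42 = 1 << 0,
};

typedef uint32_t (*hash_fn)(const uint8_t *buf, size_t len);

/* Stand-in for a reference implementation. */
static uint32_t sum_ref(const uint8_t *buf, size_t len)
{
	uint32_t s = 0;

	while (len--)
		s += *buf++;
	return s;
}

/* Stand-in for an accelerated implementation with the same result. */
static uint32_t sum_unrolled(const uint8_t *buf, size_t len)
{
	uint32_t s = 0;
	size_t i = 0;

	for (; i + 4 <= len; i += 4)
		s += buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];
	for (; i < len; i++)
		s += buf[i];
	return s;
}

/* The pointer every caller goes through. */
static hash_fn hash_impl = sum_ref;

/* Re-point the implementation according to the requested feature mask. */
static void hash_select_by_flags(unsigned int flags)
{
	hash_impl = (flags & CPU_FLAG_SSE42) ? sum_unrolled : sum_ref;
}
```

A benchmark loop can then call hash_select_by_flags() before each test row
and time hash_impl without knowing which variant is behind it.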

Example results:

$ hash-speedtest
CPU flags: 0xff
CPU features: SSE2 SSSE3 SSE41 SSE42 SHA AVX AVX2
Block size:     4096
Iterations:     1000000
Implementation: builtin
Units:          CPU cycles

    NULL-NOP: cycles:     67129026, cycles/i       67
 NULL-MEMCPY: cycles:    231303654, cycles/i      231,    60792.500 MiB/s
  CRC32C-ref: cycles:  23982698042, cycles/i    23982,      586.322 MiB/s
   CRC32C-NI: cycles:   1168017624, cycles/i     1168,    12038.828 MiB/s
      XXHASH: cycles:    838434468, cycles/i      838,    16771.152 MiB/s
  SHA256-ref: cycles:  68296865380, cycles/i    68296,      205.889 MiB/s
   SHA256-NI: cycles:  29748853920, cycles/i    29748,      472.676 MiB/s
  BLAKE2-ref: cycles:  14532177414, cycles/i    14532,      967.617 MiB/s
 BLAKE2-SSE2: cycles:  17762215810, cycles/i    17762,      791.657 MiB/s
BLAKE2-SSE41: cycles:  12370044656, cycles/i    12370,     1136.744 MiB/s
 BLAKE2-AVX2: cycles:   9472823338, cycles/i     9472,     1484.412 MiB/s

Previously:

Block size:     4096
Iterations:     1000000
Implementation: builtin
Units:          CPU cycles

    NULL-NOP: cycles:     67714016, cycles/i       67
 NULL-MEMCPY: cycles:    234140818, cycles/i      234,    60055.762 MiB/s
      CRC32C: cycles:   1187358432, cycles/i     1187,    11842.733 MiB/s
      XXHASH: cycles:   1897530684, cycles/i     1897,     7410.448 MiB/s
      SHA256: cycles:  69855340702, cycles/i    69855,      201.296 MiB/s
      BLAKE2: cycles:  14713130972, cycles/i    14713,      955.716 MiB/s

The CPU is i7-11700 3.60GHz and not the same as previous results
mentioned in changelogs so the results are incomparable. Otherwise, the
updated xxhash implementation is twice as fast, no significant changes
for the rest.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-28 20:11:22 +01:00
David Sterba 24ec095295 btrfs-progs: crypto: add common function for accelerated initialization
Prepare a single location that will detect or set accelerated versions
of hash algorithms. Right now blake2 and sha256 use an if-else switch
while crc32c sets a function pointer.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-28 19:49:31 +01:00
David Sterba 6a7a0d8af8 btrfs-progs: crypto: add accelerated SHA256 implementation
Copy sha256-x86.c from https://github.com/noloader/SHA-Intrinsics, that
uses the compiler intrinsics to implement the update step with the
native x86_64 instructions.

To avoid dependencies of the reference code and the x86 version, check
runtime support only if the compiler also supports -msha.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-28 19:49:30 +01:00
David Sterba 7d1353fa01 btrfs-progs: hash-speedtest: add accelerated BLAKE2 implementations
Benchmark all accelerated implementations if the CPU supports them. Set
the level before each test, expecting that the hash code switches
the implementation dynamically.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-28 19:49:30 +01:00
David Sterba 0e38e1c4f2 btrfs-progs: use error helper for messages in non-kernel code
Lots of code still uses fprintf(stderr, "...") where it should use the
error() helper. The kernel-shared code is left out of the conversion for
now.

Signed-off-by: David Sterba <dsterba@suse.com>
2022-10-11 09:08:07 +02:00
David Sterba c3ee6a8a09 btrfs-progs: unify GPL header comments
Add the GPL v2 header to files where it was missing and that do not come
from an external source, and update it to the most recent version with
the address.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 13:58:44 +02:00
David Sterba 9527bc0649 btrfs-progs: crypto: add perf support to speed test
Use perf events to read the cycle count; this should work on all
architectures. It is enabled by the option --perf, and the sysctl
kernel.perf_event_paranoid must be 0 or 1.

The results are roughly the same as for raw cycles on x86_64 but worse
because of the additional overhead (read, context switch):

Block size:     4096
Iterations:     100000
Implementation: builtin
Units:          CPU cycles

    NULL-NOP: cycles:     42719688, cycles/i      427
 NULL-MEMCPY: cycles:     72941208, cycles/i      729,    18670.314 MiB/s
      CRC32C: cycles:    183709926, cycles/i     1837,     7413.009 MiB/s
      XXHASH: cycles:    136727614, cycles/i     1367,     9960.264 MiB/s
      SHA256: cycles:  10711594532, cycles/i   107115,      127.137 MiB/s
      BLAKE2: cycles:   2256957529, cycles/i    22569,      603.398 MiB/s

Block size:     4096
Iterations:     100000
Implementation: builtin
Units:          perf event: CPU cycles

    NULL-NOP: perf_c:     29649530, perf_c/i      296
 NULL-MEMCPY: perf_c:     59954062, perf_c/i      599,    15137.464 MiB/s
      CRC32C: perf_c:    179009071, perf_c/i     1790,     6929.460 MiB/s
      XXHASH: perf_c:    136413509, perf_c/i     1364,     9982.950 MiB/s
      SHA256: perf_c:  10997356664, perf_c/i   109973,      127.046 MiB/s
      BLAKE2: perf_c:   2379077576, perf_c/i    23790,      588.780 MiB/s
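
For reference, reading a cycle count through perf events on Linux can be
sketched as below. This is an assumption about the general technique
(perf_event_open(2) with PERF_COUNT_HW_CPU_CYCLES), not a copy of the
hash-speedtest code; the helper names are hypothetical. Opening the counter
fails when the sysctl setting or the environment does not permit it, and the
sketch reports that as -1.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a CPU cycle counter for the calling thread; -1 on failure. */
static int perf_cycles_open(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.disabled = 1;

	/* pid 0, cpu -1: this thread on any CPU; no group, no flags. */
	return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Count cycles spent in fn(arg); returns -1 if perf is unavailable. */
static int perf_cycles_measure(void (*fn)(void *), void *arg,
			       uint64_t *cycles)
{
	int fd = perf_cycles_open();
	int ret;

	if (fd < 0)
		return -1;
	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	fn(arg);
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	/* The default read format is a single 64-bit counter value. */
	ret = read(fd, cycles, sizeof(*cycles)) == (ssize_t)sizeof(*cycles)
		? 0 : -1;
	close(fd);
	return ret;
}

/* Demo workload: count a value down to zero. */
static void busy_loop(void *arg)
{
	volatile unsigned int *n = arg;

	while (*n)
		(*n)--;
}
```

The extra read() and ioctl() calls are part of the overhead the commit
message mentions when comparing against raw cycle counts.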

Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-01 22:19:38 +02:00
David Sterba 1d4bab875a btrfs-progs: drop "2b" from blake2 in speed test
Internally it's blake2b but for the user facing output or other command
line interfaces let's call it just BLAKE2.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-01 21:17:50 +02:00
David Sterba 5734073b15 btrfs-progs: crypto: fix printf warnings in hash-speedtest
With an explicit width the default alignment is to the right; using the
space flag is a GNU extension. Fix the following warnings:

  crypto/hash-speedtest.c: In function ‘main’:
  crypto/hash-speedtest.c:152:15: warning: ' ' flag used with ‘%s’ gnu_printf format [-Wformat=]
    152 |   printf("% 12s: ", c->name);
	|               ^
  crypto/hash-speedtest.c:172:21: warning: ' ' flag used with ‘%u’ gnu_printf format [-Wformat=]
    172 |   printf("%s: % 12llu, %s/i % 8llu",
	|                     ^
  crypto/hash-speedtest.c:172:34: warning: ' ' flag used with ‘%u’ gnu_printf format [-Wformat=]
    172 |   printf("%s: % 12llu, %s/i % 8llu",
	|                                  ^
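
A sketch of the same row formatted with the width alone, which right-aligns
by default and needs no space flag. The helper name and the exact row layout
are assumptions modeled on the sample output, not the actual fix:

```c
#include <stdio.h>

/* Format one result row; a plain width already pads on the left. */
static int format_row(char *out, size_t size, const char *name,
		      unsigned long long total, unsigned long long per_iter)
{
	return snprintf(out, size, "%12s: cycles: %12llu, cycles/i %8llu",
			name, total, per_iter);
}
```

Calling format_row(buf, sizeof(buf), "CRC32C", 182293290ULL, 1822ULL)
produces "      CRC32C: cycles:    182293290, cycles/i     1822".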

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-27 11:00:17 +02:00
David Sterba 133dd6c6c3 btrfs-progs: crypto: print throughput in hash-speedtest
Calculate the estimated throughput as a number that's comparable across
machines.

  $ ./hash-speedtest --cycles
  Block size:     4096
  Iterations:     100000
  Implementation: builtin
  Units:          cycles

      NULL-NOP: cycles:     42928902, cycles/i      429
   NULL-MEMCPY: cycles:     73014868, cycles/i      730,    18651.186 MiB/s
        CRC32C: cycles:    182293290, cycles/i     1822,     7470.579 MiB/s
        XXHASH: cycles:    138085981, cycles/i     1380,     9862.272 MiB/s
        SHA256: cycles:  10576270837, cycles/i   105762,      128.764 MiB/s
       BLAKE2b: cycles:   2263761293, cycles/i    22637,      601.585 MiB/s

  $ ./hash-speedtest --time
  Block size:     4096
  Iterations:     100000
  Implementation: builtin
  Units:          nsecs

      NULL-NOP: nsecs:     12164607, nsecs/i      121
   NULL-MEMCPY: nsecs:     20423641, nsecs/i      204,    19095.518 MiB/s
        CRC32C: nsecs:     51972794, nsecs/i      519,     7503.926 MiB/s
        XXHASH: nsecs:     38935164, nsecs/i      389,    10016.651 MiB/s
        SHA256: nsecs:   3030944497, nsecs/i    30309,      128.673 MiB/s
       BLAKE2b: nsecs:    648489262, nsecs/i     6484,      601.398 MiB/s
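
The commit message does not spell out the formula. A plausible sketch,
assuming throughput is simply bytes processed divided by elapsed wall time
(the function name is hypothetical):

```c
#include <stdint.h>

/* Throughput in MiB/s for `iterations` blocks of `block_size` bytes. */
static double throughput_mib_s(uint64_t block_size, uint64_t iterations,
			       uint64_t elapsed_ns)
{
	double mib = (double)(block_size * iterations) / (1024.0 * 1024.0);
	double seconds = (double)elapsed_ns / 1e9;

	return seconds > 0.0 ? mib / seconds : 0.0;
}
```

For the NULL-MEMCPY row of the --time run, throughput_mib_s(4096, 100000,
20423641) gives about 19126 MiB/s, close to but not identical with the
reported 19095.518 MiB/s, so the tool presumably times the loop slightly
differently.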

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-26 23:09:52 +02:00
David Sterba 55bf9b749d btrfs-progs: crypto: add time-based measurement to hash-speedtest
People are interested in measuring the hash performance on non-x86_64
architectures. Add option to do time-based measurements (in nanoseconds)
in case there's no support for clock-based measurements.
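
A portable time-based measurement can be sketched with POSIX
clock_gettime(). Whether hash-speedtest uses CLOCK_MONOTONIC or another
clock is not stated here, so treat the clock choice and the helper names as
assumptions:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Run fn(arg) `iterations` times and return the elapsed nanoseconds. */
static uint64_t time_loop_ns(void (*fn)(void *), void *arg,
			     unsigned int iterations)
{
	uint64_t start = now_ns();

	for (unsigned int i = 0; i < iterations; i++)
		fn(arg);
	return now_ns() - start;
}

/* Demo workload: bump a counter through a volatile pointer. */
static void bump(void *arg)
{
	(*(volatile unsigned int *)arg)++;
}
```

Dividing the result of time_loop_ns() by the iteration count gives the
nsecs/i column shown in the sample output below.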

  $ ./hash-speedtest --cycles
  Block size:     4096
  Iterations:     100000
  Implementation: builtin
  Units:          cycles

      NULL-NOP: cycles:     43035633, cycles/i      430
   NULL-MEMCPY: cycles:     72478624, cycles/i      724
        CRC32C: cycles:    181712982, cycles/i     1817
        XXHASH: cycles:    136251305, cycles/i     1362
        SHA256: cycles:  10758567410, cycles/i   107585
       BLAKE2b: cycles:   2249704806, cycles/i    22497

  $ ./hash-speedtest --time
  Block size:     4096
  Iterations:     100000
  Implementation: builtin
  Units:          nsecs

      NULL-NOP:  nsecs:     12459033, nsecs/i      124
   NULL-MEMCPY:  nsecs:     20687845, nsecs/i      206
        CRC32C:  nsecs:     52648264, nsecs/i      526
        XXHASH:  nsecs:     39591766, nsecs/i      395
        SHA256:  nsecs:   3079668837, nsecs/i    30796
       BLAKE2b:  nsecs:    644766582, nsecs/i     6447

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-26 22:42:59 +02:00
Adam Borowski c615287cc0 btrfs-progs: a bunch of typo fixes
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-13 22:33:10 +01:00
David Sterba 82464a03e7 btrfs-progs: add more hash implementation providers
For environments that require certified implementations of cryptographic
primitives, allow selecting a library that provides them. The
requirements are SHA256 and BLAKE2 (with the 2b variant and 256 bit
digest).

For now there are two: libgcrypt and libsodium (openssl does not
provide the BLAKE2b-256). Accelerated versions are typically provided
and automatically selected.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-04 20:48:40 +02:00
David Sterba 34ef695a81 btrfs-progs: add BLAKE2 to hash-speedtest
Sample results, Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz

Block size: 4096
Iterations: 1000000

    NULL-NOP: cycles:    314296257, c/i      314
 NULL-MEMCPY: cycles:    582807266, c/i      582
      CRC32C: cycles:   1738544130, c/i     1738
      XXHASH: cycles:   1449519934, c/i     1449
      SHA256: cycles: 110648340548, c/i   110648
     BLAKE2b: cycles:  29743769472, c/i    29743

Note that this is the unoptimized reference implementation.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 19:21:06 +01:00
David Sterba f5e952b13d btrfs-progs: add SHA256 to hash-speedtest
Sample results, Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz

Block size: 4096
Iterations: 1000000

    NULL-NOP: cycles:    314296257, c/i      314
 NULL-MEMCPY: cycles:    582807266, c/i      582
      CRC32C: cycles:   1738544130, c/i     1738
      XXHASH: cycles:   1449519934, c/i     1449
      SHA256: cycles: 110648340548, c/i   110648

Note that this is the unoptimized reference implementation.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 19:21:06 +01:00
David Sterba 785073f658 btrfs-progs: crypto: add hash speedtest utility
A simple tool to microbenchmark performance of the hashes. Uses rdtsc
for timing, so works only on x86_64.
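
Cycle counting with rdtsc can be sketched as below, using the __rdtsc()
intrinsic from x86intrin.h (GCC/Clang). The helper names are hypothetical,
and the fallback that returns 0 on other architectures is only there so the
sketch compiles everywhere:

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Read the time stamp counter; returns 0 where rdtsc is unavailable. */
static uint64_t read_cycles(void)
{
#if defined(__x86_64__) || defined(__i386__)
	return __rdtsc();
#else
	return 0;
#endif
}

/* Average cycles per call of fn(arg); 0 if nothing was run. */
static uint64_t cycles_per_iter(void (*fn)(void *), void *arg,
				unsigned int iterations)
{
	uint64_t start, end;

	if (iterations == 0)
		return 0;
	start = read_cycles();
	for (unsigned int i = 0; i < iterations; i++)
		fn(arg);
	end = read_cycles();
	return (end - start) / iterations;
}

/* Demo workload: bump a counter through a volatile pointer. */
static void bump(void *arg)
{
	(*(volatile unsigned int *)arg)++;
}
```

The NULL-NOP row in the output below plays the role of the empty workload
here: it shows the cost of the measurement loop itself.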

 $ make hash-speedtest
 $ ./hash-speedtest [iterations]

   Block size: 4096
   Iterations: 100000

       NULL-NOP: cycles:     56061823, c/i      560
    NULL-MEMCPY: cycles:     61296469, c/i      612
         CRC32C: cycles:    179961796, c/i     1799
         XXHASH: cycles:    138434590, c/i     1384

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 19:20:03 +01:00