/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
 * Copyright (C) 2014 Sergey Senozhatsky.
 */
#ifndef _ZCOMP_H_
#define _ZCOMP_H_
#include <linux/local_lock.h>
struct zcomp_strm {
	/* The members ->buffer and ->tfm are protected by ->lock. */
	local_lock_t lock;
	/* compression/decompression buffer */
	void *buffer;
	struct crypto_comp *tfm;
};
/* dynamic per-device compression frontend */
struct zcomp {
	struct zcomp_strm __percpu *stream;
const char *name;
struct hlist_node node;
};
int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node);
int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node);
ssize_t zcomp_available_show(const char *comp, char *buf);
bool zcomp_available_algorithm(const char *comp);
zram: preparation for multi-zcomp support

Patch series "zram: Support multiple compression streams", v5.

This series adds support for multiple compression streams. The main idea is that different compression algorithms have different characteristics and zram may benefit when it uses a combination of algorithms: a default algorithm that is faster but have lower compression rate and a secondary algorithm that can use higher compression rate at a price of slower compression/decompression.

There are several use-case for this functionality:

- huge pages re-compression: zstd or deflate can successfully compress huge pages (~50% of huge pages on my synthetic ChromeOS tests), IOW pages that lzo was not able to compress.

- idle pages re-compression: idle/cold pages sit in the memory and we may reduce zsmalloc memory usage if we recompress those idle pages.

Userspace has a number of ways to control the behavior and impact of zram recompression: what type of pages should be recompressed, size watermarks, etc. Please refer to documentation patch.

This patch (of 13):

The patch turns compression streams and compressor algorithm name struct zram members into arrays, so that we can have multiple compression streams support (in the next patches).

The patch uses a rather explicit API for compressor selection:

- Get primary (default) compression stream
	zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP])
- Get secondary compression stream
	zcomp_stream_get(zram->comps[ZRAM_SECONDARY_COMP])

We use similar API for compression streams put().

At this point we always have just one compression stream, since CONFIG_ZRAM_MULTI_COMP is not yet defined.

Link: https://lkml.kernel.org/r/20221109115047.2921851-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20221109115047.2921851-2-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-09 20:50:35 +09:00
struct zcomp *zcomp_create(const char *alg);
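The explicit slot selection described in the "preparation for multi-zcomp support" message can be pictured with a small userspace sketch. Everything here is a stand-in rather than the kernel code: the `struct zcomp` and `struct zram` layouts, the `ZRAM_MAX_COMPS` bound, the NULL-on-failure mock of `zcomp_create()`, and the `zram_pick_comp()` helper with its fall-back to the primary slot are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-ins, not the kernel definitions. */
enum zram_comp_prio {
	ZRAM_PRIMARY_COMP = 0,
	ZRAM_SECONDARY_COMP,
	ZRAM_MAX_COMPS,		/* assumed array bound */
};

struct zcomp {
	char name[16];		/* algorithm name only; no real backend */
};

struct zram {
	struct zcomp *comps[ZRAM_MAX_COMPS];
};

/* Mock: records the algorithm name; returns NULL on failure. */
static struct zcomp *zcomp_create(const char *alg)
{
	struct zcomp *comp;

	if (!alg || strlen(alg) >= sizeof(comp->name))
		return NULL;
	comp = calloc(1, sizeof(*comp));
	if (!comp)
		return NULL;
	strcpy(comp->name, alg);
	return comp;
}

static void zcomp_destroy(struct zcomp *comp)
{
	free(comp);
}

/*
 * Hypothetical helper: select a compressor by slot and fall back to
 * the primary one while the requested slot is still empty.
 */
static struct zcomp *zram_pick_comp(struct zram *zram,
				    enum zram_comp_prio prio)
{
	if (prio < ZRAM_MAX_COMPS && zram->comps[prio])
		return zram->comps[prio];
	return zram->comps[ZRAM_PRIMARY_COMP];
}
```

With only the primary slot populated, `zram_pick_comp(&zram, ZRAM_SECONDARY_COMP)` returns the primary compressor, which mirrors the single-stream state the message describes before `CONFIG_ZRAM_MULTI_COMP` exists.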
void zcomp_destroy(struct zcomp *comp);
zram: rename zstrm find-release functions

This has started as a 'add zlib support' work, but after some thinking I saw no blockers for a bigger change -- a switch to crypto API.

We don't have an idle zstreams list anymore and our write path now works absolutely differently, preventing preemption during compression. This removes possibilities of read paths preempting writes at wrong places and opens the door for a move from custom LZO/LZ4 compression backends implementation to a more generic one, using crypto compress API.

This patch set also eliminates the need of a new context-less crypto API interface, which was quite hard to sell, so we can move along faster.

benchmarks:

(x86_64, 4GB, zram-perf script)

perf reported run-time fio (max jobs=3). I performed fio test with the increasing number of parallel jobs (max to 3) on a 3G zram device, using `static' data and the following crypto comp algorithms:

	842, deflate, lz4, lz4hc, lzo

the output was:

 - test running time (which can tell us what algorithms performs faster)

and

 - zram mm_stat (which tells the compressed memory size, max used memory, etc).

It's just for information. for example, LZ4HC has twice the running time of LZO, but the compressed memory size is: 23592960 vs 34603008 bytes.

test-fio-zram-842
    197.907655282 seconds time elapsed
    201.623142884 seconds time elapsed
    226.854291345 seconds time elapsed
test-fio-zram-DEFLATE
    253.259516155 seconds time elapsed
    258.148563401 seconds time elapsed
    290.251909365 seconds time elapsed
test-fio-zram-LZ4
     27.022598717 seconds time elapsed
     29.580522717 seconds time elapsed
     33.293463430 seconds time elapsed
test-fio-zram-LZ4HC
     56.393954615 seconds time elapsed
     74.904659747 seconds time elapsed
    101.940998564 seconds time elapsed
test-fio-zram-LZO
     28.155948075 seconds time elapsed
     30.390036330 seconds time elapsed
     34.455773159 seconds time elapsed

zram mm_stat-s (max fio jobs=3)

test-fio-zram-842
mm_stat (jobs1): 3221225472 673185792 690266112 0 690266112 0 0
mm_stat (jobs2): 3221225472 673185792 690266112 0 690266112 0 0
mm_stat (jobs3): 3221225472 673185792 690266112 0 690266112 0 0
test-fio-zram-DEFLATE
mm_stat (jobs1): 3221225472 24379392 37761024 0 37761024 0 0
mm_stat (jobs2): 3221225472 24379392 37761024 0 37761024 0 0
mm_stat (jobs3): 3221225472 24379392 37761024 0 37761024 0 0
test-fio-zram-LZ4
mm_stat (jobs1): 3221225472 23592960 37761024 0 37761024 0 0
mm_stat (jobs2): 3221225472 23592960 37761024 0 37761024 0 0
mm_stat (jobs3): 3221225472 23592960 37761024 0 37761024 0 0
test-fio-zram-LZ4HC
mm_stat (jobs1): 3221225472 23592960 37761024 0 37761024 0 0
mm_stat (jobs2): 3221225472 23592960 37761024 0 37761024 0 0
mm_stat (jobs3): 3221225472 23592960 37761024 0 37761024 0 0
test-fio-zram-LZO
mm_stat (jobs1): 3221225472 34603008 50335744 0 50335744 0 0
mm_stat (jobs2): 3221225472 34603008 50335744 0 50335744 0 0
mm_stat (jobs3): 3221225472 34603008 50335744 0 50339840 0 0

This patch (of 8):

We don't perform any zstream idle list lookup anymore, so zcomp_strm_find()/zcomp_strm_release() names are not representative. Rename to zcomp_stream_get()/zcomp_stream_put().

Link: http://lkml.kernel.org/r/20160531122017.2878-2-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 15:22:42 -07:00
struct zcomp_strm *zcomp_stream_get(struct zcomp *comp);
void zcomp_stream_put(struct zcomp *comp);
int zcomp_compress(struct zcomp_strm *zstrm,
const void *src, unsigned int *dst_len);
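The get/compress/put pairing declared above can be sketched in userspace as follows. This is only an illustration under stated assumptions: the struct layouts, the `busy` flag (standing in for the preemption-disabled section the commit messages describe), the toy run-length encoder, `SKETCH_PAGE_SIZE`, and the `store_page()` helper are all hypothetical; only the three function signatures follow the declarations in this header. The point it shows is the ordering the commit messages insist on: the compressed bytes are copied out of the stream buffer before the stream is released.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/* Hypothetical userspace stand-ins, not the kernel structures. */
struct zcomp_strm {
	/* worst case for the toy encoder: two bytes per input byte */
	unsigned char buffer[2 * SKETCH_PAGE_SIZE];
};

struct zcomp {
	struct zcomp_strm strm;	/* one stream here; per-CPU in the kernel */
	int busy;		/* stands in for "preemption disabled" */
};

static struct zcomp_strm *zcomp_stream_get(struct zcomp *comp)
{
	assert(!comp->busy);	/* exclusive use until the matching put */
	comp->busy = 1;
	return &comp->strm;
}

static void zcomp_stream_put(struct zcomp *comp)
{
	comp->busy = 0;
}

/* Toy run-length encoder standing in for a real backend. */
static int zcomp_compress(struct zcomp_strm *zstrm,
			  const void *src, unsigned int *dst_len)
{
	const unsigned char *in = src;
	unsigned int i = 0, out = 0;

	while (i < SKETCH_PAGE_SIZE) {
		unsigned char byte = in[i];
		unsigned int run = 1;

		while (i + run < SKETCH_PAGE_SIZE &&
		       in[i + run] == byte && run < 255)
			run++;
		zstrm->buffer[out++] = (unsigned char)run;
		zstrm->buffer[out++] = byte;
		i += run;
	}
	*dst_len = out;
	return 0;
}

/* Write path: copy the result out before releasing the stream. */
static int store_page(struct zcomp *comp, const void *page,
		      unsigned char *dst, unsigned int dst_cap,
		      unsigned int *stored_len)
{
	struct zcomp_strm *zstrm = zcomp_stream_get(comp);
	unsigned int clen = 0;
	int ret = zcomp_compress(zstrm, page, &clen);

	if (!ret && clen <= dst_cap) {
		memcpy(dst, zstrm->buffer, clen);
		*stored_len = clen;
	} else if (!ret) {
		ret = -1;	/* destination too small */
	}
	zcomp_stream_put(comp);
	return ret;
}
```

Releasing the stream on every exit path of `store_page()`, including the error path, keeps each get paired with exactly one put.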
zram: introduce compressing backend abstraction ZRAM performs direct LZO compression algorithm calls, making it the one and only option. While LZO is generally performs well, LZ4 algorithm tends to have a faster decompression (see http://code.google.com/p/lz4/ for full report) Name Ratio C.speed D.speed MB/s MB/s LZ4 (r101) 2.084 422 1820 LZO 2.06 2.106 414 600 Thus, users who have mostly read (decompress) usage scenarious or mixed workflow (writes with relatively high read ops number) will benefit from using LZ4 compression backend. Introduce compressing backend abstraction zcomp in order to support multiple compression algorithms with the following set of operations: .create .destroy .compress .decompress Schematically zram write() usually contains the following steps: 0) preparation (decompression of partioal IO, etc.) 1) lock buffer_lock mutex (protects meta compress buffers) 2) compress (using meta compress buffers) 3) alloc and map zs_pool object 4) copy compressed data (from meta compress buffers) to object allocated by 3) 5) free previous pool page, assign a new one 6) unlock buffer_lock mutex As we can see, compressing buffers must remain untouched from 1) to 4), because, otherwise, concurrent write() can overwrite data. At the same time, zram_meta must be aware of a) specific compression algorithm memory requirements and b) necessary locking to protect compression buffers. To remove requirement a) new struct zcomp_strm introduced, which contains a compress/decompress `buffer' and compression algorithm `private' part. While struct zcomp implements zcomp_strm stream handling and locking and removes requirement b) from zram meta. zcomp ->create() and ->destroy(), respectively, allocate and deallocate algorithm specific zcomp_strm `private' part. Every zcomp has zcomp stream and mutex to protect its compression stream. Stream usage semantics remains the same -- only one write can hold stream lock and use its buffers. 
zcomp_strm_find() turns caller into exclusive user of a stream (holding stream mutex until zram release stream), and zcomp_strm_release() makes zcomp stream available (unlock the stream mutex). Hence no concurrent write (compression) operations possible at the moment. iozone -t 3 -R -r 16K -s 60M -I +Z test base patched -------------------------------------------------- Initial write 597992.91 591660.58 Rewrite 609674.34 616054.97 Read 2404771.75 2452909.12 Re-read 2459216.81 2470074.44 Reverse Read 1652769.66 1589128.66 Stride read 2202441.81 2202173.31 Random read 2236311.47 2276565.31 Mixed workload 1423760.41 1709760.06 Random write 579584.08 615933.86 Pwrite 597550.02 594933.70 Pread 1703672.53 1718126.72 Fwrite 1330497.06 1461054.00 Fread 3922851.00 3957242.62 Usage examples: comp = zcomp_create(NAME) /* NAME e.g. "lzo" */ which initialises compressing backend if requested algorithm is supported. Compress: zstrm = zcomp_strm_find(comp) zcomp_compress(comp, zstrm, src, &dst_len) [..] /* copy compressed data */ zcomp_strm_release(comp, zstrm) Decompress: zcomp_decompress(comp, src, src_len, dst); Free compessing backend and its zcomp stream: zcomp_destroy(comp) Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 15:38:11 -07:00
zram: switch to crypto compress API We don't have an idle zstreams list anymore and our write path now works absolutely differently, preventing preemption during compression. This removes possibilities of read paths preempting writes at wrong places (which could badly affect the performance of both paths) and at the same time opens the door for a move from custom LZO/LZ4 compression backends implementation to a more generic one, using crypto compress API. Joonsoo Kim [1] attempted to do this a while ago, but faced with the need of introducing a new crypto API interface. The root cause was the fact that crypto API compression algorithms require a compression stream structure (in zram terminology) for both compression and decompression ops, while in reality only several of compression algorithms really need it. This resulted in a concept of context-less crypto API compression backends [2]. Both write and read paths, though, would have been executed with the preemption enabled, which in the worst case could have resulted in a decreased worst-case performance, e.g. consider the following case: CPU0 zram_write() spin_lock() take the last idle stream spin_unlock() << preempted >> zram_read() spin_lock() no idle streams spin_unlock() schedule() resuming zram_write compression() but it took me some time to realize that, and it took even longer to evolve zram and to make it ready for crypto API. The key turned out to be -- drop the idle streams list entirely. Without the idle streams list we are free to use compression algorithms that require compression stream for decompression (read), because streams are now placed in per-cpu data and each write path has to disable preemption for compression op, almost completely eliminating the aforementioned case (technically, we still have a small chance, because write path has a fast and a slow paths and the slow path is executed with the preemption enabled; but the frequency of failed fast path is too low). 
TEST
====

- 4 CPUs, x86_64 system
- 3G zram, lzo
- fio tests: read, randread, write, randwrite, rw, randrw

test script [3] command:
 ZRAM_SIZE=3G LOG_SUFFIX=XXXX FIO_LOOPS=5 ./zram-fio-test.sh

                BASE            PATCHED
jobs1
READ:           2527.2MB/s      2482.7MB/s
READ:           2102.7MB/s      2045.0MB/s
WRITE:          1284.3MB/s      1324.3MB/s
WRITE:          1080.7MB/s      1101.9MB/s
READ:           430125KB/s      437498KB/s
WRITE:          430538KB/s      437919KB/s
READ:           399593KB/s      403987KB/s
WRITE:          399910KB/s      404308KB/s
jobs2
READ:           8133.5MB/s      7854.8MB/s
READ:           7086.6MB/s      6912.8MB/s
WRITE:          3177.2MB/s      3298.3MB/s
WRITE:          2810.2MB/s      2871.4MB/s
READ:           1017.6MB/s      1023.4MB/s
WRITE:          1018.2MB/s      1023.1MB/s
READ:           977836KB/s      984205KB/s
WRITE:          979435KB/s      985814KB/s
jobs3
READ:           13557MB/s       13391MB/s
READ:           11876MB/s       11752MB/s
WRITE:          4641.5MB/s      4682.1MB/s
WRITE:          4164.9MB/s      4179.3MB/s
READ:           1453.8MB/s      1455.1MB/s
WRITE:          1455.1MB/s      1458.2MB/s
READ:           1387.7MB/s      1395.7MB/s
WRITE:          1386.1MB/s      1394.9MB/s
jobs4
READ:           20271MB/s       20078MB/s
READ:           18033MB/s       17928MB/s
WRITE:          6176.8MB/s      6180.5MB/s
WRITE:          5686.3MB/s      5705.3MB/s
READ:           2009.4MB/s      2006.7MB/s
WRITE:          2007.5MB/s      2004.9MB/s
READ:           1929.7MB/s      1935.6MB/s
WRITE:          1926.8MB/s      1932.6MB/s
jobs5
READ:           18823MB/s       19024MB/s
READ:           18968MB/s       19071MB/s
WRITE:          6191.6MB/s      6372.1MB/s
WRITE:          5818.7MB/s      5787.1MB/s
READ:           2011.7MB/s      1981.3MB/s
WRITE:          2011.4MB/s      1980.1MB/s
READ:           1949.3MB/s      1935.7MB/s
WRITE:          1940.4MB/s      1926.1MB/s
jobs6
READ:           21870MB/s       21715MB/s
READ:           19957MB/s       19879MB/s
WRITE:          6528.4MB/s      6537.6MB/s
WRITE:          6098.9MB/s      6073.6MB/s
READ:           2048.6MB/s      2049.9MB/s
WRITE:          2041.7MB/s      2042.9MB/s
READ:           2013.4MB/s      1990.4MB/s
WRITE:          2009.4MB/s      1986.5MB/s
jobs7
READ:           21359MB/s       21124MB/s
READ:           19746MB/s       19293MB/s
WRITE:          6660.4MB/s      6518.8MB/s
WRITE:          6211.6MB/s      6193.1MB/s
READ:           2089.7MB/s      2080.6MB/s
WRITE:          2085.8MB/s      2076.5MB/s
READ:           2041.2MB/s      2052.5MB/s
WRITE:          2037.5MB/s      2048.8MB/s
jobs8
READ:           20477MB/s       19974MB/s
READ:           18922MB/s       18576MB/s
WRITE:          6851.9MB/s      6788.3MB/s
WRITE:          6407.7MB/s      6347.5MB/s
READ:           2134.8MB/s      2136.1MB/s
WRITE:          2132.8MB/s      2134.4MB/s
READ:           2074.2MB/s      2069.6MB/s
WRITE:          2087.3MB/s      2082.4MB/s
jobs9
READ:           19797MB/s       19994MB/s
READ:           18806MB/s       18581MB/s
WRITE:          6878.7MB/s      6822.7MB/s
WRITE:          6456.8MB/s      6447.2MB/s
READ:           2141.1MB/s      2154.7MB/s
WRITE:          2144.4MB/s      2157.3MB/s
READ:           2084.1MB/s      2085.1MB/s
WRITE:          2091.5MB/s      2092.5MB/s
jobs10
READ:           19794MB/s       19784MB/s
READ:           18794MB/s       18745MB/s
WRITE:          6984.4MB/s      6676.3MB/s
WRITE:          6532.3MB/s      6342.7MB/s
READ:           2150.6MB/s      2155.4MB/s
WRITE:          2156.8MB/s      2161.5MB/s
READ:           2106.4MB/s      2095.6MB/s
WRITE:          2109.7MB/s      2098.4MB/s

                                    BASE                           PATCHED
jobs1                              perfstat
stalled-cycles-frontend     102,480,595,419 (  41.53%)     114,508,864,804 (  46.92%)
stalled-cycles-backend       51,941,417,832 (  21.05%)      46,836,112,388 (  19.19%)
instructions                283,612,054,215 (    1.15)     283,918,134,959 (    1.16)
branches                     56,372,560,385 ( 724.923)      56,449,814,753 ( 733.766)
branch-misses                   374,826,000 (   0.66%)         326,935,859 (   0.58%)
jobs2                              perfstat
stalled-cycles-frontend     155,142,745,777 (  40.99%)     164,170,979,198 (  43.82%)
stalled-cycles-backend       70,813,866,387 (  18.71%)      66,456,858,165 (  17.74%)
instructions                463,436,648,173 (    1.22)     464,221,890,191 (    1.24)
branches                     91,088,733,902 ( 760.088)      91,278,144,546 ( 769.133)
branch-misses                   504,460,363 (   0.55%)         394,033,842 (   0.43%)
jobs3                              perfstat
stalled-cycles-frontend     201,300,397,212 (  39.84%)     223,969,902,257 (  44.44%)
stalled-cycles-backend       87,712,593,974 (  17.36%)      81,618,888,712 (  16.19%)
instructions                642,869,545,023 (    1.27)     644,677,354,132 (    1.28)
branches                    125,724,560,594 ( 690.682)     126,133,159,521 ( 694.542)
branch-misses                   527,941,798 (   0.42%)         444,782,220 (   0.35%)
jobs4                              perfstat
stalled-cycles-frontend     246,701,197,429 (  38.12%)     280,076,030,886 (  43.29%)
stalled-cycles-backend      119,050,341,112 (  18.40%)     110,955,641,671 (  17.15%)
instructions                822,716,962,127 (    1.27)     825,536,969,320 (    1.28)
branches                    160,590,028,545 ( 688.614)     161,152,996,915 ( 691.068)
branch-misses                   650,295,287 (   0.40%)         550,229,113 (   0.34%)
jobs5                              perfstat
stalled-cycles-frontend     298,958,462,516 (  38.30%)     344,852,200,358 (  44.16%)
stalled-cycles-backend      137,558,742,122 (  17.62%)     129,465,067,102 (  16.58%)
instructions              1,005,714,688,752 (    1.29)   1,007,657,999,432 (    1.29)
branches                    195,988,773,962 ( 697.730)     196,446,873,984 ( 700.319)
branch-misses                   695,818,940 (   0.36%)         624,823,263 (   0.32%)
jobs6                              perfstat
stalled-cycles-frontend     334,497,602,856 (  36.71%)     387,590,419,779 (  42.38%)
stalled-cycles-backend      163,539,365,335 (  17.95%)     152,640,193,639 (  16.69%)
instructions              1,184,738,177,851 (    1.30)   1,187,396,281,677 (    1.30)
branches                    230,592,915,640 ( 702.902)     231,253,802,882 ( 702.356)
branch-misses                   747,934,786 (   0.32%)         643,902,424 (   0.28%)
jobs7                              perfstat
stalled-cycles-frontend     396,724,684,187 (  37.71%)     460,705,858,952 (  43.84%)
stalled-cycles-backend      188,096,616,496 (  17.88%)     175,785,787,036 (  16.73%)
instructions              1,364,041,136,608 (    1.30)   1,366,689,075,112 (    1.30)
branches                    265,253,096,936 ( 700.078)     265,890,524,883 ( 702.839)
branch-misses                   784,991,589 (   0.30%)         729,196,689 (   0.27%)
jobs8                              perfstat
stalled-cycles-frontend     440,248,299,870 (  36.92%)     509,554,793,816 (  42.46%)
stalled-cycles-backend      222,575,930,616 (  18.67%)     213,401,248,432 (  17.78%)
instructions              1,542,262,045,114 (    1.29)   1,545,233,932,257 (    1.29)
branches                    299,775,178,439 ( 697.666)     300,528,458,505 ( 694.769)
branch-misses                   847,496,084 (   0.28%)         748,794,308 (   0.25%)
jobs9                              perfstat
stalled-cycles-frontend     506,269,882,480 (  37.86%)     592,798,032,820 (  44.43%)
stalled-cycles-backend      253,192,498,861 (  18.93%)     233,727,666,185 (  17.52%)
instructions              1,721,985,080,913 (    1.29)   1,724,666,236,005 (    1.29)
branches                    334,517,360,255 ( 694.134)     335,199,758,164 ( 697.131)
branch-misses                   873,496,730 (   0.26%)         815,379,236 (   0.24%)
jobs10                             perfstat
stalled-cycles-frontend     549,063,363,749 (  37.18%)     651,302,376,662 (  43.61%)
stalled-cycles-backend      281,680,986,810 (  19.07%)     277,005,235,582 (  18.55%)
instructions              1,901,859,271,180 (    1.29)   1,906,311,064,230 (    1.28)
branches                    369,398,536,153 ( 694.004)     370,527,696,358 ( 688.409)
branch-misses                   967,929,335 (   0.26%)         890,125,056 (   0.24%)

                            BASE             PATCHED
seconds elapsed         79.421641008     78.735285546
seconds elapsed         61.471246133     60.869085949
seconds elapsed         62.317058173     62.224188495
seconds elapsed         60.030739363     60.081102518
seconds elapsed         74.070398362     74.317582865
seconds elapsed         84.985953007     85.414364176
seconds elapsed         97.724553255     98.173311344
seconds elapsed        109.488066758    110.268399318
seconds elapsed        122.768189405    122.967164498
seconds elapsed        135.130035105    136.934770801

On my other system (8 x86_64 CPUs, short version of test results):

                            BASE             PATCHED
seconds elapsed         19.518065994     19.806320662
seconds elapsed         15.172772749     15.594718291
seconds elapsed         13.820925970     13.821708564
seconds elapsed         13.293097816     14.585206405
seconds elapsed         16.207284118     16.064431606
seconds elapsed         17.958376158     17.771825767
seconds elapsed         19.478009164     19.602961508
seconds elapsed         21.347152811     21.352318709
seconds elapsed         24.478121126     24.171088735
seconds elapsed         26.865057442     26.767327618

So performance-wise the numbers are quite similar.

Also update zcomp interface to be more aligned with the crypto API.

[1] http://marc.info/?l=linux-kernel&m=144480832108927&w=2
[2] http://marc.info/?l=linux-kernel&m=145379613507518&w=2
[3] https://github.com/sergey-senozhatsky/zram-perf-test

Link: http://lkml.kernel.org/r/20160531122017.2878-3-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 15:22:45 -07:00
int zcomp_decompress(struct zcomp_strm *zstrm,
const void *src, unsigned int src_len, void *dst);
#endif /* _ZCOMP_H_ */
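The zcomp_decompress() declaration above takes a stream, a source buffer with its length, and a destination, but no destination length. The following is a minimal userspace sketch of that calling convention, not the kernel implementation: the struct layout, the identity "codec" (a plain memcpy), DEMO_PAGE_SIZE, demo_read_page() and demo_roundtrip() are all hypothetical names introduced for illustration, and the assumption that the destination is a page-sized buffer is mine rather than something this header states.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct zcomp_strm. */
struct zcomp_strm {
	void *buffer;		/* scratch buffer owned by the stream */
	size_t buffer_len;
};

/* Assumed destination size; the real code would use the kernel page size. */
#define DEMO_PAGE_SIZE 4096u

/*
 * Mock with the same parameter list as the declaration in this header.
 * It performs no real decompression: the "codec" is an identity copy,
 * which is enough to show how the arguments are used.
 */
int zcomp_decompress(struct zcomp_strm *zstrm,
		const void *src, unsigned int src_len, void *dst)
{
	if (!zstrm || !src || !dst)
		return -1;
	memcpy(dst, src, src_len);
	return 0;
}

/* Hypothetical read-side caller: decompress one stored object into a page. */
int demo_read_page(struct zcomp_strm *zstrm, const void *stored,
		unsigned int stored_len, unsigned char *page)
{
	/* The caller guards the size, since no dst length is passed down. */
	if (stored_len > DEMO_PAGE_SIZE)
		return -1;
	return zcomp_decompress(zstrm, stored, stored_len, page);
}

/* Returns 0 when the page holds exactly the bytes that were stored. */
int demo_roundtrip(void)
{
	static unsigned char scratch[2 * DEMO_PAGE_SIZE];
	unsigned char page[DEMO_PAGE_SIZE] = { 0 };
	const char stored[] = "zram";
	struct zcomp_strm strm = {
		.buffer = scratch,
		.buffer_len = sizeof(scratch),
	};
	int ret;

	ret = demo_read_page(&strm, stored, sizeof(stored), page);
	assert(ret == 0);
	return memcmp(page, stored, sizeof(stored)) ? -1 : 0;
}
```

Calling demo_roundtrip() from a small main() exercises the path end to end. In the sketch the size check lives in the caller because the four-argument signature leaves no room to pass a destination length.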