mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-01-07 13:53:24 +00:00
77db7bb56b
Add a new flag to zram block state that shows if the page is incompressible: that none of the algorithm (including secondary ones) could compress it. Link: https://lkml.kernel.org/r/20221109115047.2921851-14-senozhatsky@chromium.org Suggested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
533 lines
19 KiB
ReStructuredText
533 lines
19 KiB
ReStructuredText
========================================
|
|
zram: Compressed RAM-based block devices
|
|
========================================
|
|
|
|
Introduction
|
|
============
|
|
|
|
The zram module creates RAM-based block devices named /dev/zram<id>
|
|
(<id> = 0, 1, ...). Pages written to these disks are compressed and stored
|
|
in memory itself. These disks allow very fast I/O and compression provides
|
|
good amounts of memory savings. Some of the use cases include /tmp storage,
|
|
use as swap disks, various caches under /var and maybe many more. :)
|
|
|
|
Statistics for individual zram devices are exported through sysfs nodes at
|
|
/sys/block/zram<id>/
|
|
|
|
Usage
|
|
=====
|
|
|
|
There are several ways to configure and manage zram device(-s):
|
|
|
|
a) using zram and zram_control sysfs attributes
|
|
b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org).
|
|
|
|
In this document we will describe only 'manual' zram configuration steps,
|
|
IOW, zram and zram_control sysfs attributes.
|
|
|
|
In order to get a better idea about zramctl please consult util-linux
|
|
documentation, zramctl man-page or `zramctl --help`. Please be informed
|
|
that zram maintainers do not develop/maintain util-linux or zramctl, should
|
|
you have any questions please contact util-linux@vger.kernel.org
|
|
|
|
Following shows a typical sequence of steps for using zram.
|
|
|
|
WARNING
|
|
=======
|
|
|
|
For the sake of simplicity we skip error checking parts in most of the
|
|
examples below. However, it is your sole responsibility to handle errors.
|
|
|
|
zram sysfs attributes always return negative values in case of errors.
|
|
The list of possible return codes:
|
|
|
|
======== =============================================================
|
|
-EBUSY an attempt to modify an attribute that cannot be changed once
|
|
the device has been initialised. Please reset device first.
|
|
-ENOMEM zram was not able to allocate enough memory to fulfil your
|
|
needs.
|
|
-EINVAL invalid input has been provided.
|
|
======== =============================================================
|
|
|
|
If you use 'echo', the returned value is set by the 'echo' utility,
|
|
and, in general case, something like::
|
|
|
|
echo 3 > /sys/block/zram0/max_comp_streams
|
|
if [ $? -ne 0 ]; then
|
|
handle_error
|
|
fi
|
|
|
|
should suffice.
|
|
|
|
1) Load Module
|
|
==============
|
|
|
|
::
|
|
|
|
modprobe zram num_devices=4
|
|
|
|
This creates 4 devices: /dev/zram{0,1,2,3}
|
|
|
|
num_devices parameter is optional and tells zram how many devices should be
|
|
pre-created. Default: 1.
|
|
|
|
2) Set max number of compression streams
|
|
========================================
|
|
|
|
Regardless of the value passed to this attribute, ZRAM will always
|
|
allocate multiple compression streams - one per online CPU - thus
|
|
allowing several concurrent compression operations. The number of
|
|
allocated compression streams goes down when some of the CPUs
|
|
become offline. There is no single-compression-stream mode anymore,
|
|
unless you are running a UP system or have only 1 CPU online.
|
|
|
|
To find out how many streams are currently available::
|
|
|
|
cat /sys/block/zram0/max_comp_streams
|
|
|
|
3) Select compression algorithm
|
|
===============================
|
|
|
|
Using comp_algorithm device attribute one can see available and
|
|
currently selected (shown in square brackets) compression algorithms,
|
|
or change the selected compression algorithm (once the device is initialised
|
|
there is no way to change compression algorithm).
|
|
|
|
Examples::
|
|
|
|
#show supported compression algorithms
|
|
cat /sys/block/zram0/comp_algorithm
|
|
lzo [lz4]
|
|
|
|
#select lzo compression algorithm
|
|
echo lzo > /sys/block/zram0/comp_algorithm
|
|
|
|
For the time being, the `comp_algorithm` content does not necessarily
|
|
show every compression algorithm supported by the kernel. We keep this
|
|
list primarily to simplify device configuration and one can configure
|
|
a new device with a compression algorithm that is not listed in
|
|
`comp_algorithm`. The thing is that, internally, ZRAM uses Crypto API
|
|
and, if some of the algorithms were built as modules, it's impossible
|
|
to list all of them using, for instance, /proc/crypto or any other
|
|
method. This, however, has an advantage of permitting the usage of
|
|
custom crypto compression modules (implementing S/W or H/W compression).
|
|
|
|
4) Set Disksize
|
|
===============
|
|
|
|
Set disk size by writing the value to sysfs node 'disksize'.
|
|
The value can be either in bytes or you can use mem suffixes.
|
|
Examples::
|
|
|
|
# Initialize /dev/zram0 with 50MB disksize
|
|
echo $((50*1024*1024)) > /sys/block/zram0/disksize
|
|
|
|
# Using mem suffixes
|
|
echo 256K > /sys/block/zram0/disksize
|
|
echo 512M > /sys/block/zram0/disksize
|
|
echo 1G > /sys/block/zram0/disksize
|
|
|
|
Note:
|
|
There is little point creating a zram of greater than twice the size of memory
|
|
since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
|
|
size of the disk when not in use so a huge zram is wasteful.
|
|
|
|
5) Set memory limit: Optional
|
|
=============================
|
|
|
|
Set memory limit by writing the value to sysfs node 'mem_limit'.
|
|
The value can be either in bytes or you can use mem suffixes.
|
|
In addition, you could change the value in runtime.
|
|
Examples::
|
|
|
|
# limit /dev/zram0 with 50MB memory
|
|
echo $((50*1024*1024)) > /sys/block/zram0/mem_limit
|
|
|
|
# Using mem suffixes
|
|
echo 256K > /sys/block/zram0/mem_limit
|
|
echo 512M > /sys/block/zram0/mem_limit
|
|
echo 1G > /sys/block/zram0/mem_limit
|
|
|
|
# To disable memory limit
|
|
echo 0 > /sys/block/zram0/mem_limit
|
|
|
|
6) Activate
|
|
===========
|
|
|
|
::
|
|
|
|
mkswap /dev/zram0
|
|
swapon /dev/zram0
|
|
|
|
mkfs.ext4 /dev/zram1
|
|
mount /dev/zram1 /tmp
|
|
|
|
7) Add/remove zram devices
|
|
==========================
|
|
|
|
zram provides a control interface, which enables dynamic (on-demand) device
|
|
addition and removal.
|
|
|
|
In order to add a new /dev/zramX device, perform a read operation on the hot_add
|
|
attribute. This will return either the new device's device id (meaning that you
|
|
can use /dev/zram<id>) or an error code.
|
|
|
|
Example::
|
|
|
|
cat /sys/class/zram-control/hot_add
|
|
1
|
|
|
|
To remove the existing /dev/zramX device (where X is a device id)
|
|
execute::
|
|
|
|
echo X > /sys/class/zram-control/hot_remove
|
|
|
|
8) Stats
|
|
========
|
|
|
|
Per-device statistics are exported as various nodes under /sys/block/zram<id>/
|
|
|
|
A brief description of exported device attributes follows. For more details
|
|
please read Documentation/ABI/testing/sysfs-block-zram.
|
|
|
|
====================== ====== ===============================================
|
|
Name access description
|
|
====================== ====== ===============================================
|
|
disksize RW show and set the device's disk size
|
|
initstate RO shows the initialization state of the device
|
|
reset WO trigger device reset
|
|
mem_used_max WO reset the `mem_used_max` counter (see later)
|
|
mem_limit WO specifies the maximum amount of memory ZRAM can
|
|
use to store the compressed data
|
|
writeback_limit WO specifies the maximum amount of write IO zram
|
|
can write out to backing device as 4KB unit
|
|
writeback_limit_enable RW show and set writeback_limit feature
|
|
max_comp_streams RW the number of possible concurrent compress
|
|
operations
|
|
comp_algorithm RW show and change the compression algorithm
|
|
compact WO trigger memory compaction
|
|
debug_stat RO this file is used for zram debugging purposes
|
|
backing_dev RW set up backend storage for zram to write out
|
|
idle WO mark allocated slot as idle
|
|
====================== ====== ===============================================
|
|
|
|
|
|
User space is advised to use the following files to read the device statistics.
|
|
|
|
File /sys/block/zram<id>/stat
|
|
|
|
Represents block layer statistics. Read Documentation/block/stat.rst for
|
|
details.
|
|
|
|
File /sys/block/zram<id>/io_stat
|
|
|
|
The stat file represents device's I/O statistics not accounted by block
|
|
layer and, thus, not available in zram<id>/stat file. It consists of a
|
|
single line of text and contains the following stats separated by
|
|
whitespace:
|
|
|
|
============= =============================================================
|
|
failed_reads The number of failed reads
|
|
failed_writes The number of failed writes
|
|
invalid_io The number of non-page-size-aligned I/O requests
|
|
notify_free Depending on device usage scenario it may account
|
|
|
|
a) the number of pages freed because of swap slot free
|
|
notifications
|
|
b) the number of pages freed because of
|
|
REQ_OP_DISCARD requests sent by bio. The former ones are
|
|
sent to a swap block device when a swap slot is freed,
|
|
which implies that this disk is being used as a swap disk.
|
|
|
|
The latter ones are sent by filesystem mounted with
|
|
discard option, whenever some data blocks are getting
|
|
discarded.
|
|
============= =============================================================
|
|
|
|
File /sys/block/zram<id>/mm_stat
|
|
|
|
The mm_stat file represents the device's mm statistics. It consists of a single
|
|
line of text and contains the following stats separated by whitespace:
|
|
|
|
================ =============================================================
|
|
orig_data_size uncompressed size of data stored in this disk.
|
|
Unit: bytes
|
|
compr_data_size compressed size of data stored in this disk
|
|
mem_used_total the amount of memory allocated for this disk. This
|
|
includes allocator fragmentation and metadata overhead,
|
|
allocated for this disk. So, allocator space efficiency
|
|
can be calculated using compr_data_size and this statistic.
|
|
Unit: bytes
|
|
mem_limit the maximum amount of memory ZRAM can use to store
|
|
the compressed data
|
|
mem_used_max the maximum amount of memory zram has consumed to
|
|
store the data
|
|
same_pages the number of same element filled pages written to this disk.
|
|
No memory is allocated for such pages.
|
|
pages_compacted the number of pages freed during compaction
|
|
huge_pages the number of incompressible pages
|
|
huge_pages_since the number of incompressible pages since zram set up
|
|
================ =============================================================
|
|
|
|
File /sys/block/zram<id>/bd_stat
|
|
|
|
The bd_stat file represents a device's backing device statistics. It consists of
|
|
a single line of text and contains the following stats separated by whitespace:
|
|
|
|
============== =============================================================
|
|
bd_count size of data written in backing device.
|
|
Unit: 4K bytes
|
|
bd_reads the number of reads from backing device
|
|
Unit: 4K bytes
|
|
bd_writes the number of writes to backing device
|
|
Unit: 4K bytes
|
|
============== =============================================================
|
|
|
|
9) Deactivate
|
|
=============
|
|
|
|
::
|
|
|
|
swapoff /dev/zram0
|
|
umount /dev/zram1
|
|
|
|
10) Reset
|
|
=========
|
|
|
|
Write any positive value to 'reset' sysfs node::
|
|
|
|
echo 1 > /sys/block/zram0/reset
|
|
echo 1 > /sys/block/zram1/reset
|
|
|
|
This frees all the memory allocated for the given device and
|
|
resets the disksize to zero. You must set the disksize again
|
|
before reusing the device.
|
|
|
|
Optional Feature
|
|
================
|
|
|
|
writeback
|
|
---------
|
|
|
|
With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
|
|
to backing storage rather than keeping it in memory.
|
|
To use the feature, admin should set up backing device via::
|
|
|
|
echo /dev/sda5 > /sys/block/zramX/backing_dev
|
|
|
|
before disksize setting. It supports only partitions at this moment.
|
|
If admin wants to use incompressible page writeback, they could do it via::
|
|
|
|
echo huge > /sys/block/zramX/writeback
|
|
|
|
To use idle page writeback, first, user need to declare zram pages
|
|
as idle::
|
|
|
|
echo all > /sys/block/zramX/idle
|
|
|
|
From now on, any pages on zram are idle pages. The idle mark
|
|
will be removed until someone requests access of the block.
|
|
IOW, unless there is access request, those pages are still idle pages.
|
|
Additionally, when CONFIG_ZRAM_MEMORY_TRACKING is enabled pages can be
|
|
marked as idle based on how long (in seconds) it's been since they were
|
|
last accessed::
|
|
|
|
echo 86400 > /sys/block/zramX/idle
|
|
|
|
In this example all pages which haven't been accessed in more than 86400
|
|
seconds (one day) will be marked idle.
|
|
|
|
Admin can request writeback of those idle pages at right timing via::
|
|
|
|
echo idle > /sys/block/zramX/writeback
|
|
|
|
With the command, zram will writeback idle pages from memory to the storage.
|
|
|
|
Additionally, if a user choose to writeback only huge and idle pages
|
|
this can be accomplished with::
|
|
|
|
echo huge_idle > /sys/block/zramX/writeback
|
|
|
|
If a user chooses to writeback only incompressible pages (pages that none of
|
|
algorithms can compress) this can be accomplished with::
|
|
|
|
echo incompressible > /sys/block/zramX/writeback
|
|
|
|
If an admin wants to write a specific page in zram device to the backing device,
|
|
they could write a page index into the interface::
|
|
|
|
echo "page_index=1251" > /sys/block/zramX/writeback
|
|
|
|
If there are lots of write IO with flash device, potentially, it has
|
|
flash wearout problem so that admin needs to design write limitation
|
|
to guarantee storage health for entire product life.
|
|
|
|
To overcome the concern, zram supports "writeback_limit" feature.
|
|
The "writeback_limit_enable"'s default value is 0 so that it doesn't limit
|
|
any writeback. IOW, if admin wants to apply writeback budget, they should
|
|
enable writeback_limit_enable via::
|
|
|
|
$ echo 1 > /sys/block/zramX/writeback_limit_enable
|
|
|
|
Once writeback_limit_enable is set, zram doesn't allow any writeback
|
|
until admin sets the budget via /sys/block/zramX/writeback_limit.
|
|
|
|
(If admin doesn't enable writeback_limit_enable, writeback_limit's value
|
|
assigned via /sys/block/zramX/writeback_limit is meaningless.)
|
|
|
|
If admin wants to limit writeback as per-day 400M, they could do it
|
|
like below::
|
|
|
|
$ MB_SHIFT=20
|
|
$ 4K_SHIFT=12
|
|
$ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
|
|
/sys/block/zram0/writeback_limit.
|
|
$ echo 1 > /sys/block/zram0/writeback_limit_enable
|
|
|
|
If admins want to allow further write again once the budget is exhausted,
|
|
they could do it like below::
|
|
|
|
$ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
|
|
/sys/block/zram0/writeback_limit
|
|
|
|
If an admin wants to see the remaining writeback budget since last set::
|
|
|
|
$ cat /sys/block/zramX/writeback_limit
|
|
|
|
If an admin wants to disable writeback limit, they could do::
|
|
|
|
$ echo 0 > /sys/block/zramX/writeback_limit_enable
|
|
|
|
The writeback_limit count will reset whenever you reset zram (e.g.,
|
|
system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of
|
|
writeback happened until you reset the zram to allocate extra writeback
|
|
budget in next setting is user's job.
|
|
|
|
If admin wants to measure writeback count in a certain period, they could
|
|
know it via /sys/block/zram0/bd_stat's 3rd column.
|
|
|
|
recompression
|
|
-------------
|
|
|
|
With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative
|
|
(secondary) compression algorithms. The basic idea is that alternative
|
|
compression algorithm can provide better compression ratio at a price of
|
|
(potentially) slower compression/decompression speeds. Alternative compression
|
|
algorithm can, for example, be more successful compressing huge pages (those
|
|
that default algorithm failed to compress). Another application is idle pages
|
|
recompression - pages that are cold and sit in the memory can be recompressed
|
|
using more effective algorithm and, hence, reduce zsmalloc memory usage.
|
|
|
|
With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms:
|
|
one primary and up to 3 secondary ones. Primary zram compressor is explained
|
|
in "3) Select compression algorithm", secondary algorithms are configured
|
|
using recomp_algorithm device attribute.
|
|
|
|
Example:::
|
|
|
|
#show supported recompression algorithms
|
|
cat /sys/block/zramX/recomp_algorithm
|
|
#1: lzo lzo-rle lz4 lz4hc [zstd]
|
|
#2: lzo lzo-rle lz4 [lz4hc] zstd
|
|
|
|
Alternative compression algorithms are sorted by priority. In the example
|
|
above, zstd is used as the first alternative algorithm, which has priority
|
|
of 1, while lz4hc is configured as a compression algorithm with priority 2.
|
|
Alternative compression algorithm's priority is provided during algorithms
|
|
configuration:::
|
|
|
|
#select zstd recompression algorithm, priority 1
|
|
echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
|
|
|
|
#select deflate recompression algorithm, priority 2
|
|
echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm
|
|
|
|
Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress,
|
|
which controls recompression.
|
|
|
|
Examples:::
|
|
|
|
#IDLE pages recompression is activated by `idle` mode
|
|
echo "type=idle" > /sys/block/zramX/recompress
|
|
|
|
#HUGE pages recompression is activated by `huge` mode
|
|
echo "type=huge" > /sys/block/zram0/recompress
|
|
|
|
#HUGE_IDLE pages recompression is activated by `huge_idle` mode
|
|
echo "type=huge_idle" > /sys/block/zramX/recompress
|
|
|
|
The number of idle pages can be significant, so user-space can pass a size
|
|
threshold (in bytes) to the recompress knob: zram will recompress only pages
|
|
of equal or greater size:::
|
|
|
|
#recompress all pages larger than 3000 bytes
|
|
echo "threshold=3000" > /sys/block/zramX/recompress
|
|
|
|
#recompress idle pages larger than 2000 bytes
|
|
echo "type=idle threshold=2000" > /sys/block/zramX/recompress
|
|
|
|
Recompression of idle pages requires memory tracking.
|
|
|
|
During re-compression for every page, that matches re-compression criteria,
|
|
ZRAM iterates the list of registered alternative compression algorithms in
|
|
order of their priorities. ZRAM stops either when re-compression was
|
|
successful (re-compressed object is smaller in size than the original one)
|
|
and matches re-compression criteria (e.g. size threshold) or when there are
|
|
no secondary algorithms left to try. If none of the secondary algorithms can
|
|
successfully re-compressed the page such a page is marked as incompressible,
|
|
so ZRAM will not attempt to re-compress it in the future.
|
|
|
|
This re-compression behaviour, when it iterates through the list of
|
|
registered compression algorithms, increases our chances of finding the
|
|
algorithm that successfully compresses a particular page. Sometimes, however,
|
|
it is convenient (and sometimes even necessary) to limit recompression to
|
|
only one particular algorithm so that it will not try any other algorithms.
|
|
This can be achieved by providing a algo=NAME parameter:::
|
|
|
|
#use zstd algorithm only (if registered)
|
|
echo "type=huge algo=zstd" > /sys/block/zramX/recompress
|
|
|
|
memory tracking
|
|
===============
|
|
|
|
With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the
|
|
zram block. It could be useful to catch cold or incompressible
|
|
pages of the process with*pagemap.
|
|
|
|
If you enable the feature, you could see block state via
|
|
/sys/kernel/debug/zram/zram0/block_state". The output is as follows::
|
|
|
|
300 75.033841 .wh...
|
|
301 63.806904 s.....
|
|
302 63.806919 ..hi..
|
|
303 62.801919 ....r.
|
|
304 146.781902 ..hi.n
|
|
|
|
First column
|
|
zram's block index.
|
|
Second column
|
|
access time since the system was booted
|
|
Third column
|
|
state of the block:
|
|
|
|
s:
|
|
same page
|
|
w:
|
|
written page to backing store
|
|
h:
|
|
huge page
|
|
i:
|
|
idle page
|
|
r:
|
|
recompressed page (secondary compression algorithm)
|
|
n:
|
|
none (including secondary) of algorithms could compress it
|
|
|
|
First line of above example says 300th block is accessed at 75.033841sec
|
|
and the block's state is huge so it is written back to the backing
|
|
storage. It's a debugging feature so anyone shouldn't rely on it to work
|
|
properly.
|
|
|
|
Nitin Gupta
|
|
ngupta@vflare.org
|