56369 Commits

Author SHA1 Message Date
Sheng Yong
26b5a07919 f2fs: cleanup dirty pages if recover failed
During recover, we will try to create new dentries for inodes with
dentry_mark. But if the parent is missing (e.g. killed by fsck),
recover will break. But those recovered dirty pages are not cleanup.
This will hit f2fs_bug_on:

[   53.519566] F2FS-fs (loop0): Found nat_bits in checkpoint
[   53.539354] F2FS-fs (loop0): recover_inode: ino = 5, name = file, inline = 3
[   53.539402] F2FS-fs (loop0): recover_dentry: ino = 5, name = file, dir = 0, err = -2
[   53.545760] F2FS-fs (loop0): Cannot recover all fsync data errno=-2
[   53.546105] F2FS-fs (loop0): access invalid blkaddr:4294967295
[   53.546171] WARNING: CPU: 1 PID: 1798 at fs/f2fs/checkpoint.c:163 f2fs_is_valid_blkaddr+0x26c/0x320
[   53.546174] Modules linked in:
[   53.546183] CPU: 1 PID: 1798 Comm: mount Not tainted 4.19.0-rc2+ #1
[   53.546186] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[   53.546191] RIP: 0010:f2fs_is_valid_blkaddr+0x26c/0x320
[   53.546195] Code: 85 bb 00 00 00 48 89 df 88 44 24 07 e8 ad a8 db ff 48 8b 3b 44 89 e1 48 c7 c2 40 03 72 a9 48 c7 c6 e0 01 72 a9 e8 84 3c ff ff <0f> 0b 0f b6 44 24 07 e9 8a 00 00 00 48 8d bf 38 01 00 00 e8 7c a8
[   53.546201] RSP: 0018:ffff88006c067768 EFLAGS: 00010282
[   53.546208] RAX: 0000000000000000 RBX: ffff880068844200 RCX: ffffffffa83e1a33
[   53.546211] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88006d51e590
[   53.546215] RBP: 0000000000000005 R08: ffffed000daa3cb3 R09: ffffed000daa3cb3
[   53.546218] R10: 0000000000000001 R11: ffffed000daa3cb2 R12: 00000000ffffffff
[   53.546221] R13: ffff88006a1f8000 R14: 0000000000000200 R15: 0000000000000009
[   53.546226] FS:  00007fb2f3646840(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
[   53.546229] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   53.546234] CR2: 00007f0fd77f0008 CR3: 00000000687e6002 CR4: 00000000000206e0
[   53.546237] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   53.546240] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   53.546242] Call Trace:
[   53.546248]  f2fs_submit_page_bio+0x95/0x740
[   53.546253]  read_node_page+0x161/0x1e0
[   53.546271]  ? truncate_node+0x650/0x650
[   53.546283]  ? add_to_page_cache_lru+0x12c/0x170
[   53.546288]  ? pagecache_get_page+0x262/0x2d0
[   53.546292]  __get_node_page+0x200/0x660
[   53.546302]  f2fs_update_inode_page+0x4a/0x160
[   53.546306]  f2fs_write_inode+0x86/0xb0
[   53.546317]  __writeback_single_inode+0x49c/0x620
[   53.546322]  writeback_single_inode+0xe4/0x1e0
[   53.546326]  sync_inode_metadata+0x93/0xd0
[   53.546330]  ? sync_inode+0x10/0x10
[   53.546342]  ? do_raw_spin_unlock+0xed/0x100
[   53.546347]  f2fs_sync_inode_meta+0xe0/0x130
[   53.546351]  f2fs_fill_super+0x287d/0x2d10
[   53.546367]  ? vsnprintf+0x742/0x7a0
[   53.546372]  ? f2fs_commit_super+0x180/0x180
[   53.546379]  ? up_write+0x20/0x40
[   53.546385]  ? set_blocksize+0x5f/0x140
[   53.546391]  ? f2fs_commit_super+0x180/0x180
[   53.546402]  mount_bdev+0x181/0x200
[   53.546406]  mount_fs+0x94/0x180
[   53.546411]  vfs_kern_mount+0x6c/0x1e0
[   53.546415]  do_mount+0xe5e/0x1510
[   53.546420]  ? fs_reclaim_release+0x9/0x30
[   53.546424]  ? copy_mount_string+0x20/0x20
[   53.546428]  ? fs_reclaim_acquire+0xd/0x30
[   53.546435]  ? __might_sleep+0x2c/0xc0
[   53.546440]  ? ___might_sleep+0x53/0x170
[   53.546453]  ? __might_fault+0x4c/0x60
[   53.546468]  ? _copy_from_user+0x95/0xa0
[   53.546474]  ? memdup_user+0x39/0x60
[   53.546478]  ksys_mount+0x88/0xb0
[   53.546482]  __x64_sys_mount+0x5d/0x70
[   53.546495]  do_syscall_64+0x65/0x130
[   53.546503]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   53.547639] ---[ end trace b804d1ea2fec893e ]---

So if recover fails, we need to drop all recovered data.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:47 -07:00
Sahitya Tummala
1e78e8bd9d f2fs: fix data corruption issue with hardware encryption
Direct IO can be used in case of hardware encryption. The following
scenario results into data corruption issue in this path -

Thread A -                          Thread B-
-> write file#1 in direct IO
                                    -> GC gets kicked in
                                    -> GC submitted bio on meta mapping
				       for file#1, but pending completion
-> write file#1 again with new data
   in direct IO
                                    -> GC bio gets completed now
                                    -> GC writes old data to the new
                                       location and thus file#1 is
				       corrupted.

Fix this by submitting and waiting for pending io on meta mapping
for direct IO case in f2fs_map_blocks().

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:47 -07:00
Chao Yu
0c093b590e f2fs: fix to recover inode->i_flags of inode block during POR
Testcase to reproduce this bug:
1. mkfs.f2fs /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. sync
5. chattr +a /mnt/f2fs/file
6. xfs_io -a /mnt/f2fs/file -c "fsync"
7. godown /mnt/f2fs
8. umount /mnt/f2fs
9. mount -t f2fs /dev/sdd /mnt/f2fs
10. xfs_io /mnt/f2fs/file

There is no error when opening this file w/o O_APPEND, but actually,
we expect the correct result should be:

/mnt/f2fs/file: Operation not permitted

The root cause is, in recover_inode(), we recover inode->i_flags more
than F2FS_I(inode)->i_flags, so fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:47 -07:00
Chao Yu
9149a5eb60 f2fs: spread f2fs_set_inode_flags()
This patch changes codes as below:
- use f2fs_set_inode_flags() to update i_flags atomically to avoid
potential race.
- synchronize F2FS_I(inode)->i_flags to inode->i_flags in
f2fs_new_inode().
- use f2fs_set_inode_flags() to simply codes in f2fs_quota_{on,off}.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:47 -07:00
Chao Yu
2baf078185 f2fs: fix to spread clear_cold_data()
We need to drop PG_checked flag on page as well when we clear PG_uptodate
flag, in order to avoid treating the page as GCing one later.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:46 -07:00
Jaegeuk Kim
164a63fa6b Revert "f2fs: fix to clear PG_checked flag in set_page_dirty()"
This reverts commit 66110abc4c931f879d70e83e1281f891699364bf.

If we clear the cold data flag out of the writeback flow, we can miscount
-1 by end_io, which incurs a deadlock caused by all I/Os being blocked during
heavy GC.

Balancing F2FS Async:
 - IO (CP:    1, Data:   -1, Flush: (   0    0    1), Discard: (   ...

GC thread:                              IRQ
- move_data_page()
 - set_page_dirty()
  - clear_cold_data()
                                        - f2fs_write_end_io()
                                         - type = WB_DATA_TYPE(page);
                                           here, we get wrong type
                                         - dec_page_count(sbi, type);
 - f2fs_wait_on_page_writeback()

Cc: <stable@vger.kernel.org>
Reported-and-Tested-by: Park Ju Hyung <qkrwngud825@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:46 -07:00
Jaegeuk Kim
5f9abab42b f2fs: account read IOs and use IO counts for is_idle
This patch adds issued read IO counts which is under block layer.

Chao modified a bit, since:

Below race can cause reversed reference on F2FS_RD_DATA, there is
the same issue in f2fs_submit_page_bio(), fix them by relocate
__submit_bio() and inc_page_count.

Thread A			Thread B
- f2fs_write_begin
 - f2fs_submit_page_read
 - __submit_bio
				- f2fs_read_end_io
				 - __read_end_io
				 - dec_page_count(, F2FS_RD_DATA)
 - inc_page_count(, F2FS_RD_DATA)

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:46 -07:00
Chao Yu
78efac537d f2fs: fix to account IO correctly for cgroup writeback
Now, we have supported cgroup writeback, it depends on correctly IO
account of specified filesystem.

But in commit d1b3e72d5490 ("f2fs: submit bio of in-place-update pages"),
we split write paths from f2fs_submit_page_mbio() to two:
- f2fs_submit_page_bio() for IPU path
- f2fs_submit_page_bio() for OPU path

But still we account write IO only in f2fs_submit_page_mbio(), result in
incorrect IO account, fix it by adding missing IO account in IPU path.

Fixes: d1b3e72d5490 ("f2fs: submit bio of in-place-update pages")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:39 -07:00
Chao Yu
4c58ed0768 f2fs: fix to account IO correctly
Below race can cause reversed reference on dirty count, fix it by
relocating __submit_bio() and inc_page_count().

Thread A				Thread B
- f2fs_inplace_write_data
 - f2fs_submit_page_bio
  - __submit_bio
					- f2fs_write_end_io
					 - dec_page_count
  - inc_page_count

Cc: <stable@vger.kernel.org>
Fixes: d1b3e72d5490 ("f2fs: submit bio of in-place-update pages")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:53:48 -07:00
Linus Torvalds
a36cf68651 SPI NOR changes:
Core changes:
   * Support non-uniform erase size
   * Support controllers with limited TX fifo size
 
  Driver changes:
   * m25p80: Re-issue a WREN command after each write access
   * cadence: Pass a proper dir value to dma_[un]map_single()
   * fsl-qspi: Check fsl_qspi_get_seqid() return val make sure 4B
     addressing opcodes are properly handled
   * intel-spi: Add a new PCI entry for Ice Lake
 
 NAND changes:
  Raw NAND core changes:
  - Two batchs of cleanups of the NAND API, including:
    * Deprecating a lot of interfaces (now replaced by ->exec_op()).
    * Moving code in separate drivers (JEDEC, ONFI), in private files
      (internals), in platform drivers, etc.
    * Functions/structures reordering.
    * Exclusive use of the nand_chip structure instead of the MTD one
      all across the subsystem.
  - Addition of the nand_wait_readrdy/rdy_op() helpers.
 
  Raw NAND controllers drivers changes:
  - Various coccinelle patches.
  - Marvell:
    * Use regmap_update_bits() for syscon access.
    * More documentation.
    * BCH failure path rework.
    * More layouts to be supported.
    * IRQ handler complete() condition fixed.
  - Fsl_ifc:
    * SRAM initialization fixed for newer controller versions.
  - Denali:
    * Fix licenses mismatch and use a SPDX tag.
    * Set SPARE_AREA_SKIP_BYTES register to 8 if unset.
  - Qualcomm:
    * Do not include dma-direct.h.
  - Docg4:
    * Removed.
  - Ams-delta:
    * Use of a GPIO lookup table
    * Internal machinery changes.
 
  Raw NAND chip drivers changes:
  - Toshiba:
    * Add support for Toshiba memory BENAND
    * Pass a single nand_chip object to the status helper.
  - ESMT:
    * New driver to retrieve the ECC requirements from the 5th ID byte.
 
 MTD changes:
  * physmap cleanups/fixe
  * gpio-addr-flash cleanups/fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJQBAABCgA6FiEEKmCqpbOU668PNA69Ze02AX4ItwAFAlvNmJocHGJvcmlzLmJy
 ZXppbGxvbkBib290bGluLmNvbQAKCRBl7TYBfgi3AKr+D/4pH0gYSGDUstZsqUHW
 gZsPRU4jmOA+OCRbSxE03bOZOYR0UPgdoUYLfAKhZ5qxc7wHd3b47wykP0kUEviu
 lC8QhSs4DUA+EuMVPDVS4FlXRT0e3dMV7jh/IK6nInshD2YkaovyCqa6GvgwFEcM
 7BCizxRhtHV8+fo7GVQWuMLb9ZfbEvz42D0sYOu6hIsF1SnRDvHOvfdDUFEXpJoF
 a2mC9ove65ChEzc2iZ/r260iZ4aoJYb9XJRJKWxmYeITZSmLNmcrUyGqAaNQ7NRc
 AIuPXASbeHGjrIuEfXpKYW07AE5MV1nJFSk3v4u5FjgoohOoobPp7npk+Lz/qwe3
 y8/uW9qOQ/iEOsRnkvMGNu4Yjhw41L1a+J5wVvUvzmwHy1xMCrRHYB/8gXoZzemR
 A7XmCPwjAFVWv1WeKV2cs5MLW9yZq9QNMtGLlNs5OgFR6CccLk67/tHIkYLsJ/l9
 IDXFhd/YKhTeF361u6Iimmgb427TwM71P9N+OMpv4nk46DxurpxkgGle3nslzKNU
 DOFNFlMGiPIx3h4X96AKER6u7cxiOXnLGq+XeHa/y8tIrziy3jg03YoTh0RSwKxV
 J3EXwh1sFLaeebwWEgpE3Ag1LOxpRCqJ2ED71SPzS/DR/938HBVsEoPqUEHNf1in
 jwsUB3cRzNwf+XX68DRCVT5GFA==
 =7w4w
 -----END PGP SIGNATURE-----

Merge tag 'mtd/for-4.20' of git://git.infradead.org/linux-mtd

Pull mtd updates from Boris Brezillon:
 "SPI NOR core changes:
   - Support non-uniform erase size
   - Support controllers with limited TX fifo size

 Driver changes:
   - m25p80: Re-issue a WREN command after each write access
   - cadence: Pass a proper dir value to dma_[un]map_single()
   - fsl-qspi: Check fsl_qspi_get_seqid() return val make sure 4B
     addressing opcodes are properly handled
   - intel-spi: Add a new PCI entry for Ice Lake

 Raw NAND core changes:
   - Two batchs of cleanups of the NAND API, including:
      * Deprecating a lot of interfaces (now replaced by ->exec_op()).
      * Moving code in separate drivers (JEDEC, ONFI), in private files
        (internals), in platform drivers, etc.
      * Functions/structures reordering.
      * Exclusive use of the nand_chip structure instead of the MTD one
        all across the subsystem.
   - Addition of the nand_wait_readrdy/rdy_op() helpers.

 Raw NAND controllers drivers changes:
   - Various coccinelle patches.
   - Marvell:
      * Use regmap_update_bits() for syscon access.
      * More documentation.
      * BCH failure path rework.
      * More layouts to be supported.
      * IRQ handler complete() condition fixed.
   - Fsl_ifc:
      * SRAM initialization fixed for newer controller versions.
   - Denali:
      * Fix licenses mismatch and use a SPDX tag.
      * Set SPARE_AREA_SKIP_BYTES register to 8 if unset.
   - Qualcomm:
      * Do not include dma-direct.h.
   - Docg4:
      * Removed.
   - Ams-delta:
      * Use of a GPIO lookup table
      * Internal machinery changes.

 Raw NAND chip drivers changes:
   - Toshiba:
      * Add support for Toshiba memory BENAND
      * Pass a single nand_chip object to the status helper.
   - ESMT:
      * New driver to retrieve the ECC requirements from the 5th ID
        byte.

  MTD changes:
   - physmap cleanups/fixe
   - gpio-addr-flash cleanups/fixes"

* tag 'mtd/for-4.20' of git://git.infradead.org/linux-mtd: (93 commits)
  jffs2: free jffs2_sb_info through jffs2_kill_sb()
  mtd: spi-nor: fsl-quadspi: fix read error for flash size larger than 16MB
  mtd: spi-nor: intel-spi: Add support for Intel Ice Lake SPI serial flash
  mtd: maps: gpio-addr-flash: Convert to gpiod
  mtd: maps: gpio-addr-flash: Replace array with an integer
  mtd: maps: gpio-addr-flash: Use order instead of size
  mtd: spi-nor: fsl-quadspi: Don't let -EINVAL on the bus
  mtd: devices: m25p80: Make sure WRITE_EN is issued before each write
  mtd: spi-nor: Support controllers with limited TX FIFO size
  mtd: spi-nor: cadence-quadspi: Use proper enum for dma_[un]map_single
  mtd: spi-nor: parse SFDP Sector Map Parameter Table
  mtd: spi-nor: add support to non-uniform SFDP SPI NOR flash memories
  mtd: rawnand: marvell: fix the IRQ handler complete() condition
  mtd: rawnand: denali: set SPARE_AREA_SKIP_BYTES register to 8 if unset
  mtd: rawnand: r852: fix spelling mistake "card_registred" -> "card_registered"
  mtd: rawnand: toshiba: Pass a single nand_chip object to the status helper
  mtd: maps: gpio-addr-flash: Use devm_* functions
  mtd: maps: gpio-addr-flash: Fix ioremapped size
  mtd: maps: gpio-addr-flash: Replace custom printk
  mtd: physmap_of: Release resources on error
  ...
2018-10-23 01:09:22 +01:00
Filipe Manana
9084cb6a24 Btrfs: fix use-after-free when dumping free space
We were iterating a block group's free space cache rbtree without locking
first the lock that protects it (the free_space_ctl->free_space_offset
rbtree is protected by the free_space_ctl->tree_lock spinlock).

KASAN reported an use-after-free problem when iterating such a rbtree due
to a concurrent rbtree delete:

[ 9520.359168] ==================================================================
[ 9520.359656] BUG: KASAN: use-after-free in rb_next+0x13/0x90
[ 9520.359949] Read of size 8 at addr ffff8800b7ada500 by task btrfs-transacti/1721
[ 9520.360357]
[ 9520.360530] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G             L    4.19.0-rc8-nbor #555
[ 9520.360990] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 9520.362682] Call Trace:
[ 9520.362887]  dump_stack+0xa4/0xf5
[ 9520.363146]  print_address_description+0x78/0x280
[ 9520.363412]  kasan_report+0x263/0x390
[ 9520.363650]  ? rb_next+0x13/0x90
[ 9520.363873]  __asan_load8+0x54/0x90
[ 9520.364102]  rb_next+0x13/0x90
[ 9520.364380]  btrfs_dump_free_space+0x146/0x160 [btrfs]
[ 9520.364697]  dump_space_info+0x2cd/0x310 [btrfs]
[ 9520.364997]  btrfs_reserve_extent+0x1ee/0x1f0 [btrfs]
[ 9520.365310]  __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs]
[ 9520.365646]  ? btrfs_update_time+0x180/0x180 [btrfs]
[ 9520.365923]  ? _raw_spin_unlock+0x27/0x40
[ 9520.366204]  ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs]
[ 9520.366549]  btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs]
[ 9520.366880]  cache_save_setup+0x42e/0x580 [btrfs]
[ 9520.367220]  ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs]
[ 9520.367518]  ? lock_downgrade+0x2f0/0x2f0
[ 9520.367799]  ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs]
[ 9520.368104]  ? kasan_check_read+0x11/0x20
[ 9520.368349]  ? do_raw_spin_unlock+0xa8/0x140
[ 9520.368638]  btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs]
[ 9520.368978]  ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs]
[ 9520.369282]  ? do_raw_spin_unlock+0xa8/0x140
[ 9520.369534]  ? _raw_spin_unlock+0x27/0x40
[ 9520.369811]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.370137]  commit_cowonly_roots+0x4b9/0x610 [btrfs]
[ 9520.370560]  ? commit_fs_roots+0x350/0x350 [btrfs]
[ 9520.370926]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.371285]  btrfs_commit_transaction+0x5e5/0x10e0 [btrfs]
[ 9520.371612]  ? btrfs_apply_pending_changes+0x90/0x90 [btrfs]
[ 9520.371943]  ? start_transaction+0x168/0x6c0 [btrfs]
[ 9520.372257]  transaction_kthread+0x21c/0x240 [btrfs]
[ 9520.372537]  kthread+0x1d2/0x1f0
[ 9520.372793]  ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs]
[ 9520.373090]  ? kthread_park+0xb0/0xb0
[ 9520.373329]  ret_from_fork+0x3a/0x50
[ 9520.373567]
[ 9520.373738] Allocated by task 1804:
[ 9520.373974]  kasan_kmalloc+0xff/0x180
[ 9520.374208]  kasan_slab_alloc+0x11/0x20
[ 9520.374447]  kmem_cache_alloc+0xfc/0x2d0
[ 9520.374731]  __btrfs_add_free_space+0x40/0x580 [btrfs]
[ 9520.375044]  unpin_extent_range+0x4f7/0x7a0 [btrfs]
[ 9520.375383]  btrfs_finish_extent_commit+0x15f/0x4d0 [btrfs]
[ 9520.375707]  btrfs_commit_transaction+0xb06/0x10e0 [btrfs]
[ 9520.376027]  btrfs_alloc_data_chunk_ondemand+0x237/0x5c0 [btrfs]
[ 9520.376365]  btrfs_check_data_free_space+0x81/0xd0 [btrfs]
[ 9520.376689]  btrfs_delalloc_reserve_space+0x25/0x80 [btrfs]
[ 9520.377018]  btrfs_direct_IO+0x42e/0x6d0 [btrfs]
[ 9520.377284]  generic_file_direct_write+0x11e/0x220
[ 9520.377587]  btrfs_file_write_iter+0x472/0xac0 [btrfs]
[ 9520.377875]  aio_write+0x25c/0x360
[ 9520.378106]  io_submit_one+0xaa0/0xdc0
[ 9520.378343]  __se_sys_io_submit+0xfa/0x2f0
[ 9520.378589]  __x64_sys_io_submit+0x43/0x50
[ 9520.378840]  do_syscall_64+0x7d/0x240
[ 9520.379081]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 9520.379387]
[ 9520.379557] Freed by task 1802:
[ 9520.379782]  __kasan_slab_free+0x173/0x260
[ 9520.380028]  kasan_slab_free+0xe/0x10
[ 9520.380262]  kmem_cache_free+0xc1/0x2c0
[ 9520.380544]  btrfs_find_space_for_alloc+0x4cd/0x4e0 [btrfs]
[ 9520.380866]  find_free_extent+0xa99/0x17e0 [btrfs]
[ 9520.381166]  btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
[ 9520.381474]  btrfs_get_blocks_direct+0x60b/0xbd0 [btrfs]
[ 9520.381761]  __blockdev_direct_IO+0x10ee/0x58a1
[ 9520.382059]  btrfs_direct_IO+0x25a/0x6d0 [btrfs]
[ 9520.382321]  generic_file_direct_write+0x11e/0x220
[ 9520.382623]  btrfs_file_write_iter+0x472/0xac0 [btrfs]
[ 9520.382904]  aio_write+0x25c/0x360
[ 9520.383172]  io_submit_one+0xaa0/0xdc0
[ 9520.383416]  __se_sys_io_submit+0xfa/0x2f0
[ 9520.383678]  __x64_sys_io_submit+0x43/0x50
[ 9520.383927]  do_syscall_64+0x7d/0x240
[ 9520.384165]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 9520.384439]
[ 9520.384610] The buggy address belongs to the object at ffff8800b7ada500
                which belongs to the cache btrfs_free_space of size 72
[ 9520.385175] The buggy address is located 0 bytes inside of
                72-byte region [ffff8800b7ada500, ffff8800b7ada548)
[ 9520.385691] The buggy address belongs to the page:
[ 9520.385957] page:ffffea0002deb680 count:1 mapcount:0 mapping:ffff880108a1d700 index:0x0 compound_mapcount: 0
[ 9520.388030] flags: 0x8100(slab|head)
[ 9520.388281] raw: 0000000000008100 ffffea0002deb608 ffffea0002728808 ffff880108a1d700
[ 9520.388722] raw: 0000000000000000 0000000000130013 00000001ffffffff 0000000000000000
[ 9520.389169] page dumped because: kasan: bad access detected
[ 9520.389473]
[ 9520.389658] Memory state around the buggy address:
[ 9520.389943]  ffff8800b7ada400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9520.390368]  ffff8800b7ada480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9520.390796] >ffff8800b7ada500: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
[ 9520.391223]                    ^
[ 9520.391461]  ffff8800b7ada580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9520.391885]  ffff8800b7ada600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9520.392313] ==================================================================
[ 9520.392772] BTRFS critical (device vdc): entry offset 2258497536, bytes 131072, bitmap no
[ 9520.393247] BUG: unable to handle kernel NULL pointer dereference at 0000000000000011
[ 9520.393705] PGD 800000010dbab067 P4D 800000010dbab067 PUD 107551067 PMD 0
[ 9520.394059] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 9520.394378] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G    B        L    4.19.0-rc8-nbor #555
[ 9520.394858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 9520.395350] RIP: 0010:rb_next+0x3c/0x90
[ 9520.396461] RSP: 0018:ffff8801074ff780 EFLAGS: 00010292
[ 9520.396762] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81b5ac4c
[ 9520.397115] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000011
[ 9520.397468] RBP: ffff8801074ff7a0 R08: ffffed0021d64ccc R09: ffffed0021d64ccc
[ 9520.397821] R10: 0000000000000001 R11: ffffed0021d64ccb R12: ffff8800b91e0000
[ 9520.398188] R13: ffff8800a3ceba48 R14: ffff8800b627bf80 R15: 0000000000020000
[ 9520.398555] FS:  0000000000000000(0000) GS:ffff88010eb00000(0000) knlGS:0000000000000000
[ 9520.399007] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9520.399335] CR2: 0000000000000011 CR3: 0000000106b52000 CR4: 00000000000006a0
[ 9520.399679] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9520.400023] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9520.400400] Call Trace:
[ 9520.400648]  btrfs_dump_free_space+0x146/0x160 [btrfs]
[ 9520.400974]  dump_space_info+0x2cd/0x310 [btrfs]
[ 9520.401287]  btrfs_reserve_extent+0x1ee/0x1f0 [btrfs]
[ 9520.401609]  __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs]
[ 9520.401952]  ? btrfs_update_time+0x180/0x180 [btrfs]
[ 9520.402232]  ? _raw_spin_unlock+0x27/0x40
[ 9520.402522]  ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs]
[ 9520.402882]  btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs]
[ 9520.403261]  cache_save_setup+0x42e/0x580 [btrfs]
[ 9520.403570]  ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs]
[ 9520.403871]  ? lock_downgrade+0x2f0/0x2f0
[ 9520.404161]  ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs]
[ 9520.404481]  ? kasan_check_read+0x11/0x20
[ 9520.404732]  ? do_raw_spin_unlock+0xa8/0x140
[ 9520.405026]  btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs]
[ 9520.405375]  ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs]
[ 9520.405694]  ? do_raw_spin_unlock+0xa8/0x140
[ 9520.405958]  ? _raw_spin_unlock+0x27/0x40
[ 9520.406243]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.406574]  commit_cowonly_roots+0x4b9/0x610 [btrfs]
[ 9520.406899]  ? commit_fs_roots+0x350/0x350 [btrfs]
[ 9520.407253]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.407589]  btrfs_commit_transaction+0x5e5/0x10e0 [btrfs]
[ 9520.407925]  ? btrfs_apply_pending_changes+0x90/0x90 [btrfs]
[ 9520.408262]  ? start_transaction+0x168/0x6c0 [btrfs]
[ 9520.408582]  transaction_kthread+0x21c/0x240 [btrfs]
[ 9520.408870]  kthread+0x1d2/0x1f0
[ 9520.409138]  ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs]
[ 9520.409440]  ? kthread_park+0xb0/0xb0
[ 9520.409682]  ret_from_fork+0x3a/0x50
[ 9520.410508] Dumping ftrace buffer:
[ 9520.410764]    (ftrace buffer empty)
[ 9520.411007] CR2: 0000000000000011
[ 9520.411297] ---[ end trace 01a0863445cf360a ]---
[ 9520.411568] RIP: 0010:rb_next+0x3c/0x90
[ 9520.412644] RSP: 0018:ffff8801074ff780 EFLAGS: 00010292
[ 9520.412932] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81b5ac4c
[ 9520.413274] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000011
[ 9520.413616] RBP: ffff8801074ff7a0 R08: ffffed0021d64ccc R09: ffffed0021d64ccc
[ 9520.414007] R10: 0000000000000001 R11: ffffed0021d64ccb R12: ffff8800b91e0000
[ 9520.414349] R13: ffff8800a3ceba48 R14: ffff8800b627bf80 R15: 0000000000020000
[ 9520.416074] FS:  0000000000000000(0000) GS:ffff88010eb00000(0000) knlGS:0000000000000000
[ 9520.416536] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9520.416848] CR2: 0000000000000011 CR3: 0000000106b52000 CR4: 00000000000006a0
[ 9520.418477] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9520.418846] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9520.419204] Kernel panic - not syncing: Fatal exception
[ 9520.419666] Dumping ftrace buffer:
[ 9520.419930]    (ftrace buffer empty)
[ 9520.420168] Kernel Offset: disabled
[ 9520.420406] ---[ end Kernel panic - not syncing: Fatal exception ]---

Fix this by acquiring the respective lock before iterating the rbtree.

Reported-by: Nikolay Borisov <nborisov@suse.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-22 20:31:22 +02:00
Linus Torvalds
6ab9e09238 for-4.20/block-20181021
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvNQKgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgps+8D/9Iy6YIeoPwN10gYsqIh0P2fS3wKzL3kiww
 3vFsWO78PzgLxUlNmB7teLtNFc/R5mi8becZmAdvs9za5YFZk56o3Ifv1x+e+z00
 VY1/gxhiJD6suLeJ6lECnERGDaiWOZVRMo2TE17vxYGW6GGaa0Ts6PUUXmpla1u5
 WKctgt0Qv9WVNyiIdLdeHqzKJwsSSwNTt8fK7eFhy3x8e0CwJr+GtXckbbW3LFkY
 lug0npsTli3EmEPMovZhd25SjZmTk5GTM+ADZQ7Tnv5KXoDWB9jn6TcCSAi3G+5d
 5WUVwfnDyYJiH8qvlg5tRJ690muIy3xMOmpr7QBQ0YnR/LQ3EW+1CVfqD+qimgLH
 TXzlREXQpBP3YlxSDS5nddz4o5z84GZmC9B/43ujPaZKIQ6eBXYdkmQH7tPtSugm
 C6VGomR5tHotjxIiAsexh/5hAus+wW8bObKGTPTyINT0ub3XNclwCKLh26CgI9ie
 WvbS9g3j/KPvu/7s6weZpgD+cks0YdWe/XdXXxiHwsGI9h3J2aJna5RQt1rKWDm5
 wGCgbc/B8eSwiWx+GXlqdB9/Dy/bGXOnSTDnKpEVl1f5zNjeLwUKXbjvkMefWs4m
 jEIcquuDETORY+ZYEfa5YbmS4Lhskr0kzMVTVkZ++81tAWpSCU9Xh3IHrR8TNpt+
 J0oh0FHBDg==
 =LRTT
 -----END PGP SIGNATURE-----

Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "This is the main pull request for block changes for 4.20. This
  contains:

   - Series enabling runtime PM for blk-mq (Bart).

   - Two pull requests from Christoph for NVMe, with items such as;
      - Better AEN tracking
      - Multipath improvements
      - RDMA fixes
      - Rework of FC for target removal
      - Fixes for issues identified by static checkers
      - Fabric cleanups, as prep for TCP transport
      - Various cleanups and bug fixes

   - Block merging cleanups (Christoph)

   - Conversion of drivers to generic DMA mapping API (Christoph)

   - Series fixing ref count issues with blkcg (Dennis)

   - Series improving BFQ heuristics (Paolo, et al)

   - Series improving heuristics for the Kyber IO scheduler (Omar)

   - Removal of dangerous bio_rewind_iter() API (Ming)

   - Apply single queue IPI redirection logic to blk-mq (Ming)

   - Set of fixes and improvements for bcache (Coly et al)

   - Series closing a hotplug race with sysfs group attributes (Hannes)

   - Set of patches for lightnvm:
      - pblk trace support (Hans)
      - SPDX license header update (Javier)
      - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
        specs behind a common core interface. (Javier, Matias)
      - Enable pblk to use a common interface to retrieve chunk metadata
        (Matias)
      - Bug fixes (Various)

   - Set of fixes and updates to the blk IO latency target (Josef)

   - blk-mq queue number updates fixes (Jianchao)

   - Convert a bunch of drivers from the old legacy IO interface to
     blk-mq. This will conclude with the removal of the legacy IO
     interface itself in 4.21, with the rest of the drivers (me, Omar)

   - Removal of the DAC960 driver. The SCSI tree will introduce two
     replacement drivers for this (Hannes)"

* tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
  block: setup bounce bio_sets properly
  blkcg: reassociate bios when make_request() is called recursively
  blkcg: fix edge case for blk_get_rl() under memory pressure
  nvme-fabrics: move controller options matching to fabrics
  nvme-rdma: always have a valid trsvcid
  mtip32xx: fully switch to the generic DMA API
  rsxx: switch to the generic DMA API
  umem: switch to the generic DMA API
  sx8: switch to the generic DMA API
  sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
  skd: switch to the generic DMA API
  ubd: remove use of blk_rq_map_sg
  nvme-pci: remove duplicate check
  drivers/block: Remove DAC960 driver
  nvme-pci: fix hot removal during error handling
  nvmet-fcloop: suppress a compiler warning
  nvme-core: make implicit seed truncation explicit
  nvmet-fc: fix kernel-doc headers
  nvme-fc: rework the request initialization code
  nvme-fc: introduce struct nvme_fcp_op_w_sgl
  ...
2018-10-22 17:46:08 +01:00
Kees Cook
1227daa43b pstore/ram: Clarify resource reservation labels
When ramoops reserved a memory region in the kernel, it had an unhelpful
label of "persistent_memory". When reading /proc/iomem, it would be
repeated many times, did not hint that it was ramoops in particular,
and didn't clarify very much about what each was used for:

400000000-407ffffff : Persistent Memory (legacy)
  400000000-400000fff : persistent_memory
  400001000-400001fff : persistent_memory
...
  4000ff000-4000fffff : persistent_memory

Instead, this adds meaningful labels for how the various regions are
being used:

400000000-407ffffff : Persistent Memory (legacy)
  400000000-400000fff : ramoops:dump(0/252)
  400001000-400001fff : ramoops:dump(1/252)
...
  4000fc000-4000fcfff : ramoops:dump(252/252)
  4000fd000-4000fdfff : ramoops:console
  4000fe000-4000fe3ff : ramoops:ftrace(0/3)
  4000fe400-4000fe7ff : ramoops:ftrace(1/3)
  4000fe800-4000febff : ramoops:ftrace(2/3)
  4000fec00-4000fefff : ramoops:ftrace(3/3)
  4000ff000-4000fffff : ramoops:pmsg

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
2018-10-22 07:11:58 -07:00
Kees Cook
95047b0519 pstore: Refactor compression initialization
This refactors compression initialization slightly to better handle
getting potentially called twice (via early pstore_register() calls
and later pstore_init()) and improves the comments and reporting to be
more verbose.

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
2018-10-22 07:11:58 -07:00
Joel Fernandes (Google)
416031653e pstore: Allocate compression during late_initcall()
ramoops's call of pstore_register() was recently moved to run during
late_initcall() because the crypto backend may not have been ready during
postcore_initcall(). This meant early-boot crash dumps were not getting
caught by pstore any more.

Instead, lets allow calls to pstore_register() earlier, and once crypto
is ready we can initialize the compression.

Reported-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Fixes: cb3bee0369bc ("pstore: Use crypto compress API")
[kees: trivial rebase]
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
2018-10-22 07:11:58 -07:00
Kees Cook
cb095afd44 pstore: Centralize init/exit routines
In preparation for having additional actions during init/exit, this moves
the init/exit into platform.c, centralizing the logic to make call outs
to the fs init/exit.

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
2018-10-22 07:11:58 -07:00
Luis Henriques
ea4cdc548e ceph: new mount option to disable usage of copy-from op
Add a new mount option 'nocopyfrom' that will prevent the usage of the
RADOS 'copy-from' operation in cephfs.  This could be useful, for example,
for an administrator to temporarily mitigate any possible bugs in the
'copy-from' implementation.

Currently, only copy_file_range uses this RADOS operation.  Setting this
mount option will result in this syscall reverting to the default VFS
implementation, i.e. to perform the copies locally instead of doing remote
object copies.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:24 +02:00
Luis Henriques
503f82a993 ceph: support copy_file_range file operation
This commit implements support for the copy_file_range syscall in cephfs.
It is implemented using the RADOS 'copy-from' operation, which allows to
do a remote object copy, without the need to download/upload data from/to
the OSDs.

Some manual copy may however be required if the source/destination file
offsets aren't object aligned or if the copy length is smaller than the
object size.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:24 +02:00
Luis Henriques
2ee9dd958d ceph: add non-blocking parameter to ceph_try_get_caps()
ceph_try_get_caps currently calls try_get_cap_refs with the nonblock
parameter always set to 'true'.  This change adds a new parameter that
allows to set it's value.  This will be useful for a follow-up patch that
will need to get two sets of capabilities for two different inodes without
risking a deadlock.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:23 +02:00
Ilya Dryomov
0d9c1ab3be libceph: preallocate message data items
Currently message data items are allocated with ceph_msg_data_create()
in setup_request_data() inside send_request().  send_request() has never
been allowed to fail, so each allocation is followed by a BUG_ON:

  data = ceph_msg_data_create(...);
  BUG_ON(!data);

It's been this way since support for multiple message data items was
added in commit 6644ed7b7e04 ("libceph: make message data be a pointer")
in 3.10.

There is no reason to delay the allocation of message data items until
the last possible moment and we certainly don't need a linked list of
them as they are only ever appended to the end and never erased.  Make
ceph_msg_new2() take max_data_items and adapt the rest of the code.

Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:22 +02:00
Ilya Dryomov
26f887e0a3 libceph, rbd, ceph: move ceph_osdc_alloc_messages() calls
The current requirement is that ceph_osdc_alloc_messages() should be
called after oid and oloc are known.  In preparation for preallocating
message data items, move ceph_osdc_alloc_messages() further down, so
that it is called when OSD op codes are known.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:22 +02:00
Ilya Dryomov
61d2f85504 ceph: num_ops is off by one in ceph_aio_retry_work()
Two OSD op slots are allocated, but only one is ever used.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:21 +02:00
Xuehan Xu
6680288441 ceph: set timeout conditionally in __cap_delay_requeue
__cap_delay_requeue could be invoked through ceph_check_caps when there
exists caps that needs to be sent and are delayed by "i_hold_caps_min"
or "i_hold_caps_max". If __cap_delay_requeue sets timeout unconditionally,
there could be a chance that some "wanted" caps can not be release for a
long since their timeouts are reset every time they get delayed.

Fixes: http://tracker.ceph.com/issues/36369
Signed-off-by: Xuehan Xu <xuxuehan@360.cn>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:21 +02:00
Ilya Dryomov
894868330a libceph: don't consume a ref on pagelist in ceph_msg_data_add_pagelist()
Because send_mds_reconnect() wants to send a message with a pagelist
and pass the ownership to the messenger, ceph_msg_data_add_pagelist()
consumes a ref which is then put in ceph_msg_data_destroy().  This
makes managing pagelists in the OSD client (where they are wrapped in
ceph_osd_data) unnecessarily hard because the handoff only happens in
ceph_osdc_start_request() instead of when the pagelist is passed to
ceph_osd_data_pagelist_init().  I counted several memory leaks on
various error paths.

Fix up ceph_msg_data_add_pagelist() and carry a pagelist ref in
ceph_osd_data.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:21 +02:00
Ilya Dryomov
33165d4723 libceph: introduce ceph_pagelist_alloc()
struct ceph_pagelist cannot be embedded into anything else because it
has its own refcount.  Merge allocation and initialization together.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:21 +02:00
Luis Henriques
bddff633ab ceph: only allow punch hole mode in fallocate
Current implementation of cephfs fallocate isn't correct as it doesn't
really reserve the space in the cluster, which means that a subsequent
call to a write may actually fail due to lack of space.  In fact, it is
currently possible to fallocate an amount space that is larger than the
free space in the cluster.  It has behaved this way since the initial
commit ad7a60de882a ("ceph: punch hole support").

Since there's no easy solution to fix this at the moment, this patch
simply removes support for all fallocate operations but
FALLOC_FL_PUNCH_HOLE (which implies FALLOC_FL_KEEP_SIZE).

Link: https://tracker.ceph.com/issues/36317
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Yan, Zheng
fce7a9744b ceph: refactor ceph_sync_read()
Avoid allocating memory for the entire user request: striped_read()
does a synchronous OSD request per object, so it doesn't need more than
object size worth of pages at a time.

[ Preserve the comment, changelog. ]

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Yan, Zheng
74c9e6bf4c ceph: check if LOOKUPNAME request was aborted when filling trace
d_lookup()/d_alloc() require parent inode locked. Parent inode is
not locked if request is aborted.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Yan, Zheng
c58f450bd6 ceph: fix dentry leak in ceph_readdir_prepopulate
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Yan, Zheng
efe328230d Revert "ceph: fix dentry leak in splice_dentry()"
This reverts commit 8b8f53af1ed9df88a4c0fbfdf3db58f62060edf3.

splice_dentry() is used by three places. For two places, req->r_dentry
is passed to splice_dentry(). In the case of error, req->r_dentry does
not get updated. So splice_dentry() should not drop reference.

Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Chengguang Xu
5da207993e ceph: check snap first in ceph_set_acl()
Do the snap check first in ceph_set_acl(), so we can avoid
unnecessary operations when the inode has snap.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Chengguang Xu
3167893ae6 ceph: reset cap hold timeout only for requeued inode
__cap_delay_requeue() only requeue inode which does not
have CEPH_I_FLUSH flag, so avoid reset cap hold timeout
for that inode.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:19 +02:00
Matthew Wilcox
a283348629 page cache: Finish XArray conversion
With no more radix tree API users left, we can drop the GFP flags
and use xa_init() instead of INIT_RADIX_TREE().

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:44 -04:00
Matthew Wilcox
b15cd80068 dax: Convert page fault handlers to XArray
This is the last part of DAX to be converted to the XArray so
remove all the old helper functions.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:44 -04:00
Matthew Wilcox
9f32d22130 dax: Convert dax_lock_mapping_entry to XArray
Instead of always retrying when we slept, only retry if the page has
moved.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:44 -04:00
Matthew Wilcox
9fc747f68d dax: Convert dax writeback to XArray
Use XArray iteration instead of a pagevec.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:43 -04:00
Matthew Wilcox
07f2d89cc2 dax: Convert __dax_invalidate_entry to XArray
Avoids walking the radix tree multiple times looking for tags.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:43 -04:00
Matthew Wilcox
084a899008 dax: Convert dax_layout_busy_page to XArray
Instead of using a pagevec, just use the XArray iterators.  Add a
conditional rescheduling point which probably should have been there in
the original.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:43 -04:00
Matthew Wilcox
cfc93c6c6c dax: Convert dax_insert_pfn_mkwrite to XArray
Add some XArray-based helper functions to replace the radix tree based
metaphors currently in use.  The biggest change is that converted code
doesn't see its own lock bit; get_unlocked_entry() always returns an
entry with the lock bit clear.  So we don't have to mess around loading
the current entry and clearing the lock bit; we can just store the
unlocked entry that we already have.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:43 -04:00
Matthew Wilcox
ec4907ff69 dax: Hash on XArray instead of mapping
Since the XArray is embedded in the struct address_space, its address
contains exactly as much entropy as the address of the mapping.  This
patch is purely preparatory for later patches which will simplify the
wait/wake interfaces.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:43 -04:00
Matthew Wilcox
a77d19f46a dax: Rename some functions
Remove mentions of 'radix' and 'radix tree'.  Simplify some names by
dropping the word 'mapping'.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:42 -04:00
Matthew Wilcox
5ec2d99de7 f2fs: Convert to XArray
This is a straightforward conversion.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:42 -04:00
Matthew Wilcox
f611ff6375 nilfs2: Convert to XArray
This is close to a 1:1 replacement of radix tree APIs with their XArray
equivalents.  It would be possible to optimise nilfs_copy_back_pages(),
but that doesn't seem to be in the performance path.  Also, I think
it has a pre-existing bug, and I've added a note to that effect in the
source code.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:42 -04:00
Matthew Wilcox
04edf02cdd fs: Convert writeback to XArray
A couple of short loops.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:42 -04:00
Matthew Wilcox
ec82e1c1c8 fs: Convert buffer to XArray
Mostly comment fixes, but one use of __xa_set_mark.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:41 -04:00
Matthew Wilcox
0a943c65e7 btrfs: Convert page cache to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: David Sterba <dsterba@suse.com>
2018-10-21 10:46:41 -04:00
Matthew Wilcox
10bbd23585 pagevec: Use xa_mark_t
Removes sparse warnings.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:39 -04:00
Matthew Wilcox
0d3f929666 page cache: Convert hole search to XArray
The page cache offers the ability to search for a miss in the previous or
next N locations.  Rather than teach the XArray about the page cache's
definition of a miss, use xas_prev() and xas_next() to search the page
array.  This should be more efficient as it does not have to start the
lookup from the top for each index.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:33 -04:00
David S. Miller
2e2d6f0342 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
net/sched/cls_api.c has overlapping changes to a call to
nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
to the 5th argument, and another (from 'net-next') added cb->extack
instead of NULL to the 6th argument.

net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
code which moved (to mr_table_dump)) in 'net-next'.  Thanks to David
Ahern for the heads up.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-19 11:03:06 -07:00
Bob Peterson
8e31582a9a gfs2: Fix minor typo: couln't versus couldn't.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-19 11:34:04 -05:00