Skip to content

ARM: tegra: Add device-tree for ASUS Transformer Prime TF201#4

Closed
clamor-s wants to merge 2 commits intograte-driver:masterfrom
clamor-s:clamor95-patch-1
Closed

ARM: tegra: Add device-tree for ASUS Transformer Prime TF201#4
clamor-s wants to merge 2 commits intograte-driver:masterfrom
clamor-s:clamor95-patch-1

Conversation

@clamor-s
Copy link
Member

@clamor-s clamor-s commented Jun 2, 2020

No description provided.

@clamor-s clamor-s closed this Jun 2, 2020
@clamor-s clamor-s deleted the clamor95-patch-1 branch June 2, 2020 08:00
@clamor-s
Copy link
Member Author

clamor-s commented Jun 2, 2020

Wrong stuff

digetx pushed a commit that referenced this pull request Jun 8, 2020
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A shattered l.nr_items could affect the shrinker behaviour due to a data
race. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and up to word-size, assume those are safe from data races to
avoid readability issues of writing WRITE_ONCE(var, var + val).

Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Jun 9, 2020
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A shattered l.nr_items could affect the shrinker behaviour due to a data
race. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and up to word-size, assume those are safe from data races to
avoid readability issues of writing WRITE_ONCE(var, var + val).

Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Jun 10, 2020
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A shattered l.nr_items could affect the shrinker behaviour due to a data
race. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and up to word-size, assume those are safe from data races to
avoid readability issues of writing WRITE_ONCE(var, var + val).

Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Jun 13, 2020
The code in decode_config4() of arch/mips/kernel/cpu-probe.c

        asid_mask = MIPS_ENTRYHI_ASID;
        if (config4 & MIPS_CONF4_AE)
                asid_mask |= MIPS_ENTRYHI_ASIDX;
        set_cpu_asid_mask(c, asid_mask);

set asid_mask to cpuinfo->asid_mask.

So in order to support variable ASID_MASK, KVM_ENTRYHI_ASID should also
be changed to cpu_asid_mask(&boot_cpu_data).

Cc: Stable <stable@vger.kernel.org>  #4.9+
Reviewed-by: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
Signed-off-by: Xing Li <lixing@loongson.cn>
[Huacai: Change current_cpu_data to boot_cpu_data for optimization]
Signed-off-by: Huacai Chen <chenhc@lemote.com>
Message-Id: <1590220602-3547-2-git-send-email-chenhc@lemote.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
digetx pushed a commit that referenced this pull request Jun 13, 2020
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A shattered l.nr_items could affect the shrinker behaviour due to a data
race. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and up to word-size, assume those are safe from data races to
avoid readability issues of writing WRITE_ONCE(var, var + val).

Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Jun 17, 2020
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A shattered l.nr_items could affect the shrinker behaviour due to a data
race. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and up to word-size, assume those are safe from data races to
avoid readability issues of writing WRITE_ONCE(var, var + val).

Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Jun 28, 2020
Suppose that, for unrelated reasons, FSF requests on behalf of recovery are
very slow and can run into the ERP timeout.

In the case at hand, we did adapter recovery to a large degree.  However
due to the slowness a LUN open is pending so the corresponding fc_rport
remains blocked.  After fast_io_fail_tmo we trigger close physical port
recovery for the port under which the LUN should have been opened.  The new
higher order port recovery dismisses the pending LUN open ERP action and
dismisses the pending LUN open FSF request.  Such dismissal decouples the
ERP action from the pending corresponding FSF request by setting
zfcp_fsf_req->erp_action to NULL (among other things)
[zfcp_erp_strategy_check_fsfreq()].

If now the ERP timeout for the pending open LUN request runs out, we must
not use zfcp_fsf_req->erp_action in the ERP timeout handler.  This is a
problem since v4.15 commit 75492a5 ("s390/scsi: Convert timers to use
timer_setup()"). Before that we intentionally only passed zfcp_erp_action
as context argument to zfcp_erp_timeout_handler().

Note: The lifetime of the corresponding zfcp_fsf_req object continues until
a (late) response or an (unrelated) adapter recovery.

Just like the regular response path ignores dismissed requests
[zfcp_fsf_req_complete() => zfcp_fsf_protstatus_eval() => return early] the
ERP timeout handler now needs to ignore dismissed requests.  So simply
return early in the ERP timeout handler if the FSF request is marked as
dismissed in its status flags.  To protect against the race where
zfcp_erp_strategy_check_fsfreq() dismisses and sets
zfcp_fsf_req->erp_action to NULL after our previous status flag check,
return early if zfcp_fsf_req->erp_action is NULL.  After all, the former
ERP action does not need to be woken up as that was already done as part of
the dismissal above [zfcp_erp_action_dismiss()].

This fixes the following panic due to kernel page fault in IRQ context:

Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0000000000000000 TEID: 0000000000000483
Fault in home space mode while using kernel ASCE.
AS:000009859238c00b R2:00000e3e7ffd000b R3:00000e3e7ffcc007 S:00000e3e7ffd7000 P:000000000000013d
Oops: 0004 ilc:2 [#1] SMP
Modules linked in: ...
CPU: 82 PID: 311273 Comm: stress Kdump: loaded Tainted: G            E  X   ...
Hardware name: IBM 8561 T01 701 (LPAR)
Krnl PSW : 0404c00180000000 001fffff80549be0 (zfcp_erp_notify+0x40/0xc0 [zfcp])
           R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000080 00000e3d00000000 00000000000000f0 0000000000030000
           000000010028e700 000000000400a39c 000000010028e700 00000e3e7cf87e02
           0000000010000000 0700098591cb67f0 0000000000000000 0000000000000000
           0000033840e9a000 0000000000000000 001fffe008d6bc18 001fffe008d6bbc8
Krnl Code: 001fffff80549bd4: a7180000            lhi     %r1,0
           001fffff80549bd8: 4120a0f0            la      %r2,240(%r10)
          #001fffff80549bdc: a53e0003            llilh   %r3,3
          >001fffff80549be0: ba132000            cs      %r1,%r3,0(%r2)
           001fffff80549be4: a7740037            brc     7,1fffff80549c52
           001fffff80549be8: e320b0180004        lg      %r2,24(%r11)
           001fffff80549bee: e31020e00004        lg      %r1,224(%r2)
           001fffff80549bf4: 412020e0            la      %r2,224(%r2)
Call Trace:
 [<001fffff80549be0>] zfcp_erp_notify+0x40/0xc0 [zfcp]
 [<00000985915e26f0>] call_timer_fn+0x38/0x190
 [<00000985915e2944>] expire_timers+0xfc/0x190
 [<00000985915e2ac4>] run_timer_softirq+0xec/0x218
 [<0000098591ca7c4c>] __do_softirq+0x144/0x398
 [<00000985915110aa>] do_softirq_own_stack+0x72/0x88
 [<0000098591551b58>] irq_exit+0xb0/0xb8
 [<0000098591510c6a>] do_IRQ+0x82/0xb0
 [<0000098591ca7140>] ext_int_handler+0x128/0x12c
 [<0000098591722d98>] clear_subpage.constprop.13+0x38/0x60
([<000009859172ae4c>] clear_huge_page+0xec/0x250)
 [<000009859177e7a2>] do_huge_pmd_anonymous_page+0x32a/0x768
 [<000009859172a712>] __handle_mm_fault+0x88a/0x900
 [<000009859172a860>] handle_mm_fault+0xd8/0x1b0
 [<0000098591529ef6>] do_dat_exception+0x136/0x3e8
 [<0000098591ca6d34>] pgm_check_handler+0x1c8/0x220
Last Breaking-Event-Address:
 [<001fffff80549c88>] zfcp_erp_timeout_handler+0x10/0x18 [zfcp]
Kernel panic - not syncing: Fatal exception in interrupt

Link: https://lore.kernel.org/r/20200623140242.98864-1-maier@linux.ibm.com
Fixes: 75492a5 ("s390/scsi: Convert timers to use timer_setup()")
Cc: <stable@vger.kernel.org> #4.15+
Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Steffen Maier <maier@linux.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
digetx pushed a commit that referenced this pull request Jun 28, 2020
NULL pointer exception happens occasionally on serial output initiated
by login timeout.  This was reproduced only if kernel was built with
significant debugging options and EDMA driver is used with serial
console.

    col-vf50 login: root
    Password:
    Login timed out after 60 seconds.
    Unable to handle kernel NULL pointer dereference at virtual address 00000044
    Internal error: Oops: 5 [#1] ARM
    CPU: 0 PID: 157 Comm: login Not tainted 5.7.0-next-20200610-dirty #4
    Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
      (fsl_edma_tx_handler) from [<8016eb10>] (__handle_irq_event_percpu+0x64/0x304)
      (__handle_irq_event_percpu) from [<8016eddc>] (handle_irq_event_percpu+0x2c/0x7c)
      (handle_irq_event_percpu) from [<8016ee64>] (handle_irq_event+0x38/0x5c)
      (handle_irq_event) from [<801729e4>] (handle_fasteoi_irq+0xa4/0x160)
      (handle_fasteoi_irq) from [<8016ddcc>] (generic_handle_irq+0x34/0x44)
      (generic_handle_irq) from [<8016e40c>] (__handle_domain_irq+0x54/0xa8)
      (__handle_domain_irq) from [<80508bc8>] (gic_handle_irq+0x4c/0x80)
      (gic_handle_irq) from [<80100af0>] (__irq_svc+0x70/0x98)
    Exception stack(0x8459fe80 to 0x8459fec8)
    fe80: 72286b00 e3359f64 00000001 0000412d a0070013 85c98840 85c98840 a0070013
    fea0: 8054e0d4 00000000 00000002 00000000 00000002 8459fed0 8081fbe8 8081fbec
    fec0: 60070013 ffffffff
      (__irq_svc) from [<8081fbec>] (_raw_spin_unlock_irqrestore+0x30/0x58)
      (_raw_spin_unlock_irqrestore) from [<8056cb48>] (uart_flush_buffer+0x88/0xf8)
      (uart_flush_buffer) from [<80554e60>] (tty_ldisc_hangup+0x38/0x1ac)
      (tty_ldisc_hangup) from [<8054c7f4>] (__tty_hangup+0x158/0x2bc)
      (__tty_hangup) from [<80557b90>] (disassociate_ctty.part.1+0x30/0x23c)
      (disassociate_ctty.part.1) from [<8011fc18>] (do_exit+0x580/0xba0)
      (do_exit) from [<801214f8>] (do_group_exit+0x3c/0xb4)
      (do_group_exit) from [<80121580>] (__wake_up_parent+0x0/0x14)

Issue looks like race condition between interrupt handler fsl_edma_tx_handler()
(called as result of fsl_edma_xfer_desc()) and terminating the transfer with
fsl_edma_terminate_all().

The fsl_edma_tx_handler() handles interrupt for a transfer with already freed
edesc and idle==true.

Fixes: d6be34f ("dma: Add Freescale eDMA engine driver support")
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Robin Gong <yibin.gong@nxp.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/1591877861-28156-2-git-send-email-krzk@kernel.org
Signed-off-by: Vinod Koul <vkoul@kernel.org>
digetx pushed a commit that referenced this pull request Jun 28, 2020
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A shattered l.nr_items could affect the shrinker behaviour due to a data
race. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and up to word-size, assume those are safe from data races to
avoid readability issues of writing WRITE_ONCE(var, var + val).

Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Jul 1, 2020
devm_gpiod_get_index() doesn't return NULL but -ENOENT when the
requested GPIO doesn't exist,  leading to the following messages:

[    2.742468] gpiod_direction_input: invalid GPIO (errorpointer)
[    2.748147] can't set direction for gpio #2: -2
[    2.753081] gpiod_direction_input: invalid GPIO (errorpointer)
[    2.758724] can't set direction for gpio #3: -2
[    2.763666] gpiod_direction_output: invalid GPIO (errorpointer)
[    2.769394] can't set direction for gpio #4: -2
[    2.774341] gpiod_direction_input: invalid GPIO (errorpointer)
[    2.779981] can't set direction for gpio #5: -2
[    2.784545] ff000a20.serial: ttyCPM1 at MMIO 0xfff00a20 (irq = 39, base_baud = 8250000) is a CPM UART

Use devm_gpiod_get_index_optional() instead.

At the same time, handle the error case and properly exit
with an error.

Fixes: 97cbaf2 ("tty: serial: cpm_uart: Convert to use GPIO descriptors")
Cc: stable@vger.kernel.org
Cc: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/694a25fdce548c5ee8b060ef6a4b02746b8f25c0.1591986307.git.christophe.leroy@csgroup.eu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
digetx pushed a commit that referenced this pull request Jul 1, 2020
Luo bin says:

====================
hinic: add some ethtool ops support

patch #1: support to set and get pause params with
          "ethtool -A/a" cmd
patch #2: support to set and get irq coalesce params with
          "ethtool -C/c" cmd
patch #3: support to do self test with "ethtool -t" cmd
patch #4: support to identify physical device with "ethtool -p" cmd
patch #5: support to get eeprom information with "ethtool -m" cmd
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
digetx pushed a commit that referenced this pull request Jul 1, 2020
Petr Machata says:

====================
TC: Introduce qevents

The Spectrum hardware allows execution of one of several actions as a
result of queue management decisions: tail-dropping, early-dropping,
marking a packet, or passing a configured latency threshold or buffer
size. Such packets can be mirrored, trapped, or sampled.

Modeling the action to be taken as simply a TC action is very attractive,
but it is not obvious where to put these actions. At least with ECN marking
one could imagine a tree of qdiscs and classifiers that effectively
accomplishes this task, albeit in an impractically complex manner. But
there is just no way to match on dropped-ness of a packet, let alone
dropped-ness due to a particular reason.

To allow configuring user-defined actions as a result of inner workings of
a qdisc, this patch set introduces a concept of qevents. Those are attach
points for TC blocks, where filters can be put that are executed as the
packet hits well-defined points in the qdisc algorithms. The attached
blocks can be shared, in a manner similar to clsact ingress and egress
blocks, arbitrary classifiers with arbitrary actions can be put on them,
etc.

For example:

	red limit 500K avpkt 1K qevent early_drop block 10
	matchall action mirred egress mirror dev eth1

The central patch #2 introduces several helpers to allow easy and uniform
addition of qevents to qdiscs: initialization, destruction, qevent block
number change validation, and qevent handling, i.e. dispatch of the filters
attached to the block bound to a qevent.

Patch #1 adds root_lock argument to qdisc enqueue op. The problem this is
tackling is that if a qevent filter pushes packets to the same qdisc tree
that holds the qevent in the first place, attempt to take qdisc root lock
for the second time will lead to a deadlock. To solve the issue, qevent
handler needs to unlock and relock the root lock around the filter
processing. Passing root_lock around makes it possible to get the lock
where it is needed, and visibly so, such that it is obvious the lock will
be used when invoking a qevent.

The following two patches, #3 and #4, then add two qevents to the RED
qdisc: "early_drop" qevent fires when a packet is early-dropped; "mark"
qevent, when it is ECN-marked.

Patch #5 contains a selftest. I have mentioned this test when pushing the
RED ECN nodrop mode and said that "I have no confidence in its portability
to [...] different configurations". That still holds. The backlog and
packet size are tuned to make the test deterministic. But it is better than
nothing, and on the boxes that I ran it on it does work and shows that
qevents work the way they are supposed to, and that their addition has not
broken the other tested features.

This patch set does not deal with offloading. The idea there is that a
driver will be able to figure out that a given block is used in qevent
context by looking at binder type. A future patch-set will add a qdisc
pointer to struct flow_block_offload, which a driver will be able to
consult to glean the TC or other relevant attributes.

Changes from RFC to v1:
- Move a "q = qdisc_priv(sch)" from patch #3 to patch #4
- Fix deadlock caused by mirroring packet back to the same qdisc tree.
- Rename "tail" qevent to "tail_drop".
- Adapt to the new 100-column standard.
- Add a selftest
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
digetx pushed a commit that referenced this pull request Jul 1, 2020
when a MPTCP client tries to connect to itself, tcp_finish_connect() is
never reached. Because of this, depending on the socket current state,
multiple faulty behaviours can be observed:

1) a WARN_ON() in subflow_data_ready() is hit
 WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
 [...]
 CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
 [...]
 RIP: 0010:subflow_data_ready+0x18b/0x230
 [...]
 Call Trace:
  tcp_data_queue+0xd2f/0x4250
  tcp_rcv_state_process+0xb1c/0x49d3
  tcp_v4_do_rcv+0x2bc/0x790
  __release_sock+0x153/0x2d0
  release_sock+0x4f/0x170
  mptcp_shutdown+0x167/0x4e0
  __sys_shutdown+0xe6/0x180
  __x64_sys_shutdown+0x50/0x70
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

2) client is stuck forever in mptcp_sendmsg() because the socket is not
   TCP_ESTABLISHED

 crash> bt 4847
 PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
  #0 [ffff8881376ff680] __schedule at ffffffff97248da4
  #1 [ffff8881376ff778] schedule at ffffffff9724a34f
  #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
  #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
  #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
  #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
  #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
  #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
  #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
  #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
 #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
 #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
     RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
     RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
     RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
     RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
     R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
     R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
   didn't complete.

 $ tcpdump -tnnr bad.pcap
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

force a fallback to TCP in these cases, and adjust the main socket
state to avoid hanging in mptcp_sendmsg().

Closes: multipath-tcp/mptcp_net-next#35
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
digetx pushed a commit that referenced this pull request Jul 1, 2020
Ido Schimmel says:

====================
Add ethtool extended link state

Amit says:

Currently, device drivers can only indicate to user space if the network
link is up or down, without additional information.

This patch set provides an infrastructure that allows these drivers to
expose more information to user space about the link state. The
information can save users' time when trying to understand why a link is
not operationally up, for example.

The above is achieved by extending the existing ethtool LINKSTATE_GET
command with attributes that carry the extended state.

For example, no link due to missing cable:

$ ethtool ethX
...
Link detected: no (No cable)

Beside the general extended state, drivers can pass additional
information about the link state using the sub-state field. For example:

$ ethtool ethX
...
Link detected: no (Autoneg, No partner detected)

In the future the infrastructure can be extended - for example - to
allow PHY drivers to report whether a downshift to a lower speed
occurred. Something like:

$ ethtool ethX
...
Link detected: yes (downshifted)

Patch set overview:

Patches #1-#3 move mlxsw ethtool code to a separate file
Patches #4-#5 add the ethtool infrastructure for extended link state
Patches #6-#7 add support of extended link state in the mlxsw driver
Patches #8-#10 add test cases

Changes since v1:

* In documentation, show ETHTOOL_LINK_EXT_STATE_* and
  ETHTOOL_LINK_EXT_SUBSTATE_* constants instead of user-space strings
* Add `_CI_` to cable_issue substates to be consistent with
  other substates
* Keep the commit messages within 75 columns
* Use u8 variable for __link_ext_substate
* Document the meaning of -ENODATA in get_link_ext_state() callback
  description
* Do not zero data->link_ext_state_provided after getting an error
* Use `ret` variable for error value

Changes since RFC:

* Move documentation patch before ethtool patch
* Add nla_total_size() instead of sizeof() directly
* Return an error code from linkstate_get_ext_state()
* Remove SHORTED_CABLE, add CABLE_TEST_FAILURE instead
* Check if the interface is administratively up before setting ext_state
* Document all sub-states
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
digetx pushed a commit that referenced this pull request Jul 1, 2020
This patch is to fix a crash:

 #3 [ffffb6580689f898] oops_end at ffffffffa2835bc2
 #4 [ffffb6580689f8b8] no_context at ffffffffa28766e7
 #5 [ffffb6580689f920] async_page_fault at ffffffffa320135e
    [exception RIP: f2fs_is_compressed_page+34]
    RIP: ffffffffa2ba83a2  RSP: ffffb6580689f9d8  RFLAGS: 00010213
    RAX: 0000000000000001  RBX: fffffc0f50b34bc0  RCX: 0000000000002122
    RDX: 0000000000002123  RSI: 0000000000000c00  RDI: fffffc0f50b34bc0
    RBP: ffff97e815a40178   R8: 0000000000000000   R9: ffff97e83ffc9000
    R10: 0000000000032300  R11: 0000000000032380  R12: ffffb6580689fa38
    R13: fffffc0f50b34bc0  R14: ffff97e825cbd000  R15: 0000000000000c00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffffb6580689f9d8] __is_cp_guaranteed at ffffffffa2b7ea98
 #7 [ffffb6580689f9f0] f2fs_submit_page_write at ffffffffa2b81a69
 #8 [ffffb6580689fa30] f2fs_do_write_meta_page at ffffffffa2b99777
 #9 [ffffb6580689fae0] __f2fs_write_meta_page at ffffffffa2b75f1a
 #10 [ffffb6580689fb18] f2fs_sync_meta_pages at ffffffffa2b77466
 #11 [ffffb6580689fc98] do_checkpoint at ffffffffa2b78e46
 #12 [ffffb6580689fd88] f2fs_write_checkpoint at ffffffffa2b79c29
 #13 [ffffb6580689fdd0] f2fs_sync_fs at ffffffffa2b69d95
 #14 [ffffb6580689fe20] sync_filesystem at ffffffffa2ad2574
 #15 [ffffb6580689fe30] generic_shutdown_super at ffffffffa2a9b582
 #16 [ffffb6580689fe48] kill_block_super at ffffffffa2a9b6d1
 #17 [ffffb6580689fe60] kill_f2fs_super at ffffffffa2b6abe1
 #18 [ffffb6580689fea0] deactivate_locked_super at ffffffffa2a9afb6
 #19 [ffffb6580689feb8] cleanup_mnt at ffffffffa2abcad4
 #20 [ffffb6580689fee0] task_work_run at ffffffffa28bca28
 #21 [ffffb6580689ff00] exit_to_usermode_loop at ffffffffa28050b7
 #22 [ffffb6580689ff38] do_syscall_64 at ffffffffa280560e
 #23 [ffffb6580689ff50] entry_SYSCALL_64_after_hwframe at ffffffffa320008c

This occurred when umount f2fs if enable F2FS_FS_COMPRESSION
with F2FS_IO_TRACE. Fixes it by adding IS_IO_TRACED_PAGE to check
validity of pid for page_private.

Signed-off-by: Yu Changchun <yuchangchun1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
digetx pushed a commit that referenced this pull request Jul 1, 2020
Jakub Sitnicki says:

====================
This patch set prepares ground for link-based multi-prog attachment for
future netns attach types, with BPF_SK_LOOKUP attach type in mind [0].

Two changes are needed in order to attach and run a series of BPF programs:

  1) an bpf_prog_array of programs to run (patch #2), and
  2) a list of attached links to keep track of attachments (patch #3).

Nothing changes for BPF flow_dissector. Just as before only one program can
be attached to netns.

In v3 I've simplified patch #2 that introduces bpf_prog_array to take
advantage of the fact that it will hold at most one program for now.

In particular, I'm no longer using bpf_prog_array_copy. It turned out to be
less suitable for link operations than I thought as it fails to append the
same BPF program.

bpf_prog_array_replace_item is also gone, because we know we always want to
replace the first element in prog_array.

Naturally the code that handles bpf_prog_array will need change once
more when there is a program type that allows multi-prog attachment. But I
feel it will be better to do it gradually and present it together with
tests that actually exercise multi-prog code paths.

[0] https://lore.kernel.org/bpf/20200511185218.1422406-1-jakub@cloudflare.com/

v2 -> v3:
- Don't check if run_array is null in link update callback. (Martin)
- Allow updating the link with the same BPF program. (Andrii)
- Add patch #4 with a test for the above case.
- Kill bpf_prog_array_replace_item. Access the run_array directly.
- Switch from bpf_prog_array_copy() to bpf_prog_array_alloc(1, ...).
- Replace rcu_deref_protected & RCU_INIT_POINTER with rcu_replace_pointer.
- Drop Andrii's Ack from patch #2. Code changed.

v1 -> v2:

- Show with a (void) cast that bpf_prog_array_replace_item() return value
  is ignored on purpose. (Andrii)
- Explain why bpf-cgroup cannot replace programs in bpf_prog_array based
  on bpf_prog pointer comparison in patch #2 description. (Andrii)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
digetx pushed a commit that referenced this pull request Jul 1, 2020
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A shattered l.nr_items could affect the shrinker behaviour due to a data
race. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and up to word-size, assume those are safe from data races to
avoid readability issues of writing WRITE_ONCE(var, var + val).

Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Jul 3, 2020
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A shattered l.nr_items could affect the shrinker behaviour due to a data
race. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and up to word-size, assume those are safe from data races to
avoid readability issues of writing WRITE_ONCE(var, var + val).

Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Jul 8, 2020
Enable promisc mode of PF, set VF link state to enable, and
run iperf of the VF, then do self test of the PF. The self test
will fail with a low frequency, and may cause a use-after-free
problem.

[   87.142126] selftest:000004a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   87.159722] ==================================================================
[   87.174187] BUG: KASAN: use-after-free in hex_dump_to_buffer+0x140/0x608
[   87.187600] Read of size 1 at addr ffff003b22828000 by task ethtool/1186
[   87.201012]
[   87.203978] CPU: 7 PID: 1186 Comm: ethtool Not tainted 5.5.0-rc4-gfd51c473-dirty #4
[   87.219306] Hardware name: Huawei TaiShan 2280 V2/BC82AMDA, BIOS TA BIOS 2280-A CS V2.B160.01 01/15/2020
[   87.238292] Call trace:
[   87.243173]  dump_backtrace+0x0/0x280
[   87.250491]  show_stack+0x24/0x30
[   87.257114]  dump_stack+0xe8/0x140
[   87.263911]  print_address_description.isra.8+0x70/0x380
[   87.274538]  __kasan_report+0x12c/0x230
[   87.282203]  kasan_report+0xc/0x18
[   87.288999]  __asan_load1+0x60/0x68
[   87.295969]  hex_dump_to_buffer+0x140/0x608
[   87.304332]  print_hex_dump+0x140/0x1e0
[   87.312000]  hns3_lb_check_skb_data+0x168/0x170
[   87.321060]  hns3_clean_rx_ring+0xa94/0xfe0
[   87.329422]  hns3_self_test+0x708/0x8c0

The length of packet sent by the selftest process is only
128 + 14 bytes, and the min buffer size of a BD is 256 bytes,
and the receive process will make sure the packet sent by
the selftest process is in the linear part, so only check
the linear part in hns3_lb_check_skb_data().

So fix this use-after-free by using skb_headlen() to dump
skb->data instead of skb->len.

Fixes: c39c4d9 ("net: hns3: Add mac loopback selftest support in hns3 driver")
Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
digetx pushed a commit that referenced this pull request Jul 8, 2020
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A shattered l.nr_items could affect the shrinker behaviour due to a data
race. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and up to word-size, assume those are safe from data races to
avoid readability issues of writing WRITE_ONCE(var, var + val).

Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Jul 8, 2020
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A shattered l.nr_items could affect the shrinker behaviour due to a data
race. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and up to word-size, assume those are safe from data races to
avoid readability issues of writing WRITE_ONCE(var, var + val).

Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Jul 9, 2020
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A shattered l.nr_items could affect the shrinker behaviour due to a data
race. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and up to word-size, assume those are safe from data races to
avoid readability issues of writing WRITE_ONCE(var, var + val).

Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Jul 22, 2020
https://bugzilla.kernel.org/show_bug.cgi?id=208565

PID: 257    TASK: ecdd0000  CPU: 0   COMMAND: "init"
  #0 [<c0b420ec>] (__schedule) from [<c0b423c8>]
  #1 [<c0b423c8>] (schedule) from [<c0b459d4>]
  #2 [<c0b459d4>] (rwsem_down_read_failed) from [<c0b44fa0>]
  #3 [<c0b44fa0>] (down_read) from [<c044233c>]
  #4 [<c044233c>] (f2fs_truncate_blocks) from [<c0442890>]
  #5 [<c0442890>] (f2fs_truncate) from [<c044d408>]
  #6 [<c044d408>] (f2fs_evict_inode) from [<c030be18>]
  #7 [<c030be18>] (evict) from [<c030a558>]
  #8 [<c030a558>] (iput) from [<c047c600>]
  #9 [<c047c600>] (f2fs_sync_node_pages) from [<c0465414>]
 #10 [<c0465414>] (f2fs_write_checkpoint) from [<c04575f4>]
 #11 [<c04575f4>] (f2fs_sync_fs) from [<c0441918>]
 #12 [<c0441918>] (f2fs_do_sync_file) from [<c0441098>]
 #13 [<c0441098>] (f2fs_sync_file) from [<c0323fa0>]
 #14 [<c0323fa0>] (vfs_fsync_range) from [<c0324294>]
 #15 [<c0324294>] (do_fsync) from [<c0324014>]
 #16 [<c0324014>] (sys_fsync) from [<c0108bc0>]

This can be caused by flush_dirty_inode() in f2fs_sync_node_pages() where
iput() requires f2fs_lock_op() again resulting in livelock.

Reported-by: Zhiguo Niu <Zhiguo.Niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
digetx pushed a commit that referenced this pull request Jul 22, 2020
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A shattered l.nr_items could affect the shrinker behaviour due to a data
race. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and up to word-size, assume those are safe from data races to
avoid readability issues of writing WRITE_ONCE(var, var + val).

Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Aug 8, 2020
Ido Schimmel says:

====================
mlxsw: Add support for buffer drop traps

Petr says:

A recent patch set added the ability to mirror buffer related drops
(e.g., early drops) through a netdev. This patch set adds the ability to
trap such packets to the local CPU for analysis.

The trapping towards the CPU is configured by using tc-trap action
instead of tc-mirred as was done when the packets were mirrored through
a netdev. A future patch set will also add the ability to sample the
dropped packets using tc-sample action.

The buffer related drop traps are added to devlink, which means that the
dropped packets can be reported to user space via the kernel's
drop_monitor module.

Patch set overview:

Patch #1 adds the early_drop trap to devlink

Patch #2 adds extack to a few devlink operations to facilitate better
error reporting to user space. This is necessary - among other things -
because the action of buffer drop traps cannot be changed in mlxsw

Patch #3 performs a small refactoring in mlxsw, patch #4 fixes a bug that
this patchset would trigger.

Patches #5-#6 add the infrastructure required to support different traps
/ trap groups in mlxsw per-ASIC. This is required because buffer drop
traps are not supported by Spectrum-1

Patch #7 extends mlxsw to register the early_drop trap

Patch #8 adds the offload logic for the "trap" action at a qevent block.

Patch #9 adds a mlxsw-specific selftest.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
digetx pushed a commit that referenced this pull request Sep 9, 2020
At memory hot-remove time we can retrieve an LMB's nid from its
corresponding memory_block.  There is no need to store the nid
in multiple locations.

Note that lmb_to_memblock() uses find_memory_block() to get the
corresponding memory_block.  As find_memory_block() runs in sub-linear
time this approach is negligibly slower than what we do at present.

In exchange for this lookup at hot-remove time we no longer need to
call memory_add_physaddr_to_nid() during drmem_init() for each LMB.
On powerpc, memory_add_physaddr_to_nid() is a linear search, so this
spares us an O(n^2) initialization during boot.

On systems with many LMBs that initialization overhead is palpable and
disruptive.  For example, on a box with 249854 LMBs we're seeing
drmem_init() take upwards of 30 seconds to complete:

[   53.721639] drmem: initializing drmem v2
[   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
[   80.604377] Modules linked in:
[   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
[   80.604397] NIP:  c0000000000a4980 LR: c0000000000a4940 CTR: 0000000000000000
[   80.604407] REGS: c0002dbff8493830 TRAP: 0901   Not tainted  (5.6.0-rc2+)
[   80.604412] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44000248  XER: 0000000d
[   80.604431] CFAR: c0000000000a4a38 IRQMASK: 0
[   80.604431] GPR00: c0000000000a4940 c0002dbff8493ac0 c000000001904400 c0003cfffffede30
[   80.604431] GPR04: 0000000000000000 c000000000f4095a 000000000000002f 0000000010000000
[   80.604431] GPR08: c0000bf7ecdb7fb8 c0000bf7ecc2d3c8 0000000000000008 c00c0002fdfb2001
[   80.604431] GPR12: 0000000000000000 c00000001e8ec200
[   80.604477] NIP [c0000000000a4980] hot_add_scn_to_nid+0xa0/0x3e0
[   80.604486] LR [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0
[   80.604492] Call Trace:
[   80.604498] [c0002dbff8493ac0] [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
[   80.604509] [c0002dbff8493b20] [c000000000087c10] memory_add_physaddr_to_nid+0x20/0x60
[   80.604521] [c0002dbff8493b40] [c0000000010d4880] drmem_init+0x25c/0x2f0
[   80.604530] [c0002dbff8493c10] [c000000000010154] do_one_initcall+0x64/0x2c0
[   80.604540] [c0002dbff8493ce0] [c0000000010c4aa0] kernel_init_freeable+0x2d8/0x3a0
[   80.604550] [c0002dbff8493db0] [c000000000010824] kernel_init+0x2c/0x148
[   80.604560] [c0002dbff8493e20] [c00000000000b648] ret_from_kernel_thread+0x5c/0x74
[   80.604567] Instruction dump:
[   80.604574] 392918e8 e9490000 e90a000a e92a0000 80ea000c 1d080018 3908ffe8 7d094214
[   80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e9490000 7fbe5040
[   89.047390] drmem: 249854 LMB(s)

With a patched kernel on the same machine we're no longer seeing the
soft lockup.  drmem_init() now completes in negligible time, even when
the LMB count is large.

Fixes: b2d3b5e ("powerpc/pseries: Track LMB nid instead of using device tree")
Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200811015115.63677-1-cheloha@linux.ibm.com
digetx pushed a commit that referenced this pull request Sep 12, 2020
Krzysztof Kozlowski says:

====================
nfc: s3fwrn5: Few cleanups

Changes since v2:
1. Fix dtschema ID after rename (patch 1/8).
2. Apply patch 9/9 (defconfig change).

Changes since v1:
1. Rename dtschema file and add additionalProperties:false, as Rob
   suggested,
2. Add Marek's tested-by,
3. New patches: #4, #5, #6, #7 and #9.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
digetx pushed a commit that referenced this pull request Sep 28, 2020
When executing the LOCK06 locktorture scenario featuring percpu-rwsem,
the RCU callback rcu_sync_func() may still be pending after locktorture
module is removed.  This can in turn lead to the following Oops:

  BUG: unable to handle page fault for address: ffffffffc00eb920
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 6500a067 P4D 6500a067 PUD 6500c067 PMD 13a36c067 PTE 800000013691c163
  Oops: 0000 [#1] PREEMPT SMP
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  RIP: 0010:rcu_cblist_dequeue+0x12/0x30
  Call Trace:
   <IRQ>
   rcu_core+0x1b1/0x860
   __do_softirq+0xfe/0x326
   asm_call_on_stack+0x12/0x20
   </IRQ>
   do_softirq_own_stack+0x5f/0x80
   irq_exit_rcu+0xaf/0xc0
   sysvec_apic_timer_interrupt+0x2e/0xb0
   asm_sysvec_apic_timer_interrupt+0x12/0x20

This commit avoids tis problem by adding an exit hook in lock_torture_ops
and using it to call percpu_free_rwsem() for percpu rwsem torture during
the module-cleanup function, thus ensuring that rcu_sync_func() completes
before module exits.

It is also necessary to call the exit hook if lock_torture_init()
fails half-way, so this commit also adds an ->init_called field in
lock_torture_cxt to indicate that exit hook, if present, must be called.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
digetx pushed a commit that referenced this pull request Sep 28, 2020
WARNING: Unknown commit id '63a0895d960a', maybe rebased or not pulled?
#4:
The uninitialized_var() macro was removed recently via commit 63a0895

WARNING: 'assigments' may be misspelled - perhaps 'assignments'?
#30: FILE: scripts/checkpatch.pl:3904:
+# check for self assigments used to avoid compiler warnings

total: 0 errors, 2 warnings, 17 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/checkpatch-warn-on-self-assignments.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Sep 28, 2020
Ido Schimmel says:

====================
mlxsw: Expose transceiver overheat counter

Amit says:

An overheated transceiver can be the root cause of various network
problems such as link flapping. Counting the number of times a
transceiver's temperature was higher than its configured threshold can
therefore help in debugging such issues.

This patch set exposes a transceiver overheat counter via ethtool. This
is achieved by configuring the Spectrum ASIC to generate events whenever
a transceiver is overheated. The temperature thresholds are queried from
the transceiver (if available) and set to the default otherwise.

Example:

...
transceiver_overheat: 2

Patch set overview:

Patches #1-#3 add required device registers
Patches #4-#5 add required infrastructure in mlxsw to configure and
count overheat events
Patches #6-#9 gradually add support for the transceiver overheat counter
Patch #10 exposes the transceiver overheat counter via ethtool
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
digetx pushed a commit that referenced this pull request Oct 15, 2020
The following lockdep splat

======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted
------------------------------------------------------
fsstress/8739 is trying to acquire lock:
ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70

but task is already holding lock:
ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #10 (sb_pagefaults){.+.+}-{0:0}:
       __sb_start_write+0x129/0x210
       btrfs_page_mkwrite+0x6a/0x4a0
       do_page_mkwrite+0x4d/0xc0
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       asm_exc_page_fault+0x1e/0x30

-> #9 (&mm->mmap_lock#2){++++}-{3:3}:
       __might_fault+0x68/0x90
       _copy_to_user+0x1e/0x80
       perf_read+0x141/0x2c0
       vfs_read+0xad/0x1b0
       ksys_read+0x5f/0xe0
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #8 (&cpuctx_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       perf_event_init_cpu+0x88/0x150
       perf_event_init+0x1db/0x20b
       start_kernel+0x3ae/0x53c
       secondary_startup_64+0xa4/0xb0

-> #7 (pmus_lock){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       perf_event_init_cpu+0x4f/0x150
       cpuhp_invoke_callback+0xb1/0x900
       _cpu_up.constprop.26+0x9f/0x130
       cpu_up+0x7b/0xc0
       bringup_nonboot_cpus+0x4f/0x60
       smp_init+0x26/0x71
       kernel_init_freeable+0x110/0x258
       kernel_init+0xa/0x103
       ret_from_fork+0x1f/0x30

-> #6 (cpu_hotplug_lock){++++}-{0:0}:
       cpus_read_lock+0x39/0xb0
       kmem_cache_create_usercopy+0x28/0x230
       kmem_cache_create+0x12/0x20
       bioset_init+0x15e/0x2b0
       init_bio+0xa3/0xaa
       do_one_initcall+0x5a/0x2e0
       kernel_init_freeable+0x1f4/0x258
       kernel_init+0xa/0x103
       ret_from_fork+0x1f/0x30

-> #5 (bio_slab_lock){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       bioset_init+0xbc/0x2b0
       __blk_alloc_queue+0x6f/0x2d0
       blk_mq_init_queue_data+0x1b/0x70
       loop_add+0x110/0x290 [loop]
       fq_codel_tcf_block+0x12/0x20 [sch_fq_codel]
       do_one_initcall+0x5a/0x2e0
       do_init_module+0x5a/0x220
       load_module+0x2459/0x26e0
       __do_sys_finit_module+0xba/0xe0
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #4 (loop_ctl_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       lo_open+0x18/0x50 [loop]
       __blkdev_get+0xec/0x570
       blkdev_get+0xe8/0x150
       do_dentry_open+0x167/0x410
       path_openat+0x7c9/0xa80
       do_filp_open+0x93/0x100
       do_sys_openat2+0x22a/0x2e0
       do_sys_open+0x4b/0x80
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       blkdev_put+0x1d/0x120
       close_fs_devices.part.31+0x84/0x130
       btrfs_close_devices+0x44/0xb0
       close_ctree+0x296/0x2b2
       generic_shutdown_super+0x69/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       btrfs_run_dev_stats+0x49/0x480
       commit_cowonly_roots+0xb5/0x2a0
       btrfs_commit_transaction+0x516/0xa60
       sync_filesystem+0x6b/0x90
       generic_shutdown_super+0x22/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       btrfs_commit_transaction+0x4bb/0xa60
       sync_filesystem+0x6b/0x90
       generic_shutdown_super+0x22/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
       __lock_acquire+0x1272/0x2310
       lock_acquire+0x9e/0x360
       __mutex_lock+0x9f/0x930
       btrfs_record_root_in_trans+0x43/0x70
       start_transaction+0xd1/0x5d0
       btrfs_dirty_inode+0x42/0xd0
       file_update_time+0xc8/0x110
       btrfs_page_mkwrite+0x10c/0x4a0
       do_page_mkwrite+0x4d/0xc0
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       asm_exc_page_fault+0x1e/0x30

other info that might help us debug this:

Chain exists of:
  &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sb_pagefaults);
                               lock(&mm->mmap_lock#2);
                               lock(sb_pagefaults);
  lock(&fs_info->reloc_mutex);

 *** DEADLOCK ***

3 locks held by fsstress/8739:
 #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660
 #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
 #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0

stack backtrace:
CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
 dump_stack+0x78/0xa0
 check_noncircular+0x165/0x180
 __lock_acquire+0x1272/0x2310
 ? btrfs_get_alloc_profile+0x150/0x210
 lock_acquire+0x9e/0x360
 ? btrfs_record_root_in_trans+0x43/0x70
 __mutex_lock+0x9f/0x930
 ? btrfs_record_root_in_trans+0x43/0x70
 ? lock_acquire+0x9e/0x360
 ? join_transaction+0x5d/0x450
 ? find_held_lock+0x2d/0x90
 ? btrfs_record_root_in_trans+0x43/0x70
 ? join_transaction+0x3d5/0x450
 ? btrfs_record_root_in_trans+0x43/0x70
 btrfs_record_root_in_trans+0x43/0x70
 start_transaction+0xd1/0x5d0
 btrfs_dirty_inode+0x42/0xd0
 file_update_time+0xc8/0x110
 btrfs_page_mkwrite+0x10c/0x4a0
 ? handle_mm_fault+0x5e/0x1730
 do_page_mkwrite+0x4d/0xc0
 ? __do_fault+0x32/0x150
 handle_mm_fault+0x103c/0x1730
 exc_page_fault+0x340/0x660
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7faa6c9969c4

Was seen in testing.  The fix is similar to that of

  btrfs: open device without device_list_mutex

where we're holding the device_list_mutex and then grab the bd_mutex,
which pulls in a bunch of dependencies under the bd_mutex.  We only ever
call btrfs_close_devices() on mount failure or unmount, so we're save to
not have the device_list_mutex here.  We're already holding the
uuid_mutex which keeps us safe from any external modification of the
fs_devices.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
digetx pushed a commit that referenced this pull request Oct 15, 2020
While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
following lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-rc3+ #4 Not tainted
  ------------------------------------------------------
  kswapd0/100 is trying to acquire lock:
  ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330

  but task is already holding lock:
  ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (fs_reclaim){+.+.}-{0:0}:
	 fs_reclaim_acquire+0x65/0x80
	 slab_pre_alloc_hook.constprop.0+0x20/0x200
	 kmem_cache_alloc+0x37/0x270
	 alloc_inode+0x82/0xb0
	 iget_locked+0x10d/0x2c0
	 kernfs_get_inode+0x1b/0x130
	 kernfs_get_tree+0x136/0x240
	 sysfs_get_tree+0x16/0x40
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x434/0xc00
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (kernfs_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7e0
	 kernfs_add_one+0x23/0x150
	 kernfs_create_link+0x63/0xa0
	 sysfs_do_create_link_sd+0x5e/0xd0
	 btrfs_sysfs_add_devices_dir+0x81/0x130
	 btrfs_init_new_device+0x67f/0x1250
	 btrfs_ioctl+0x1ef/0x2e20
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7e0
	 btrfs_chunk_alloc+0x125/0x3a0
	 find_free_extent+0xdf6/0x1210
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb0/0x310
	 alloc_tree_block_no_bg_flush+0x4a/0x60
	 __btrfs_cow_block+0x11a/0x530
	 btrfs_cow_block+0x104/0x220
	 btrfs_search_slot+0x52e/0x9d0
	 btrfs_insert_empty_items+0x64/0xb0
	 btrfs_insert_delayed_items+0x90/0x4f0
	 btrfs_commit_inode_delayed_items+0x93/0x140
	 btrfs_log_inode+0x5de/0x2020
	 btrfs_log_inode_parent+0x429/0xc90
	 btrfs_log_new_name+0x95/0x9b
	 btrfs_rename2+0xbb9/0x1800
	 vfs_rename+0x64f/0x9f0
	 do_renameat2+0x320/0x4e0
	 __x64_sys_rename+0x1f/0x30
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
	 __lock_acquire+0x119c/0x1fc0
	 lock_acquire+0xa7/0x3d0
	 __mutex_lock+0x7e/0x7e0
	 __btrfs_release_delayed_node.part.0+0x3f/0x330
	 btrfs_evict_inode+0x24c/0x500
	 evict+0xcf/0x1f0
	 dispose_list+0x48/0x70
	 prune_icache_sb+0x44/0x50
	 super_cache_scan+0x161/0x1e0
	 do_shrink_slab+0x178/0x3c0
	 shrink_slab+0x17c/0x290
	 shrink_node+0x2b2/0x6d0
	 balance_pgdat+0x30a/0x670
	 kswapd+0x213/0x4c0
	 kthread+0x138/0x160
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    &delayed_node->mutex --> kernfs_mutex --> fs_reclaim

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(kernfs_mutex);
				 lock(fs_reclaim);
    lock(&delayed_node->mutex);

   *** DEADLOCK ***

  3 locks held by kswapd0/100:
   #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
   #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
   #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0

  stack backtrace:
  CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb8
   check_noncircular+0x12d/0x150
   __lock_acquire+0x119c/0x1fc0
   lock_acquire+0xa7/0x3d0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   __mutex_lock+0x7e/0x7e0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? lock_acquire+0xa7/0x3d0
   ? find_held_lock+0x2b/0x80
   __btrfs_release_delayed_node.part.0+0x3f/0x330
   btrfs_evict_inode+0x24c/0x500
   evict+0xcf/0x1f0
   dispose_list+0x48/0x70
   prune_icache_sb+0x44/0x50
   super_cache_scan+0x161/0x1e0
   do_shrink_slab+0x178/0x3c0
   shrink_slab+0x17c/0x290
   shrink_node+0x2b2/0x6d0
   balance_pgdat+0x30a/0x670
   kswapd+0x213/0x4c0
   ? _raw_spin_unlock_irqrestore+0x41/0x50
   ? add_wait_queue_exclusive+0x70/0x70
   ? balance_pgdat+0x670/0x670
   kthread+0x138/0x160
   ? kthread_create_worker_on_cpu+0x40/0x40
   ret_from_fork+0x1f/0x30

This happens because we are holding the chunk_mutex at the time of
adding in a new device.  However we only need to hold the
device_list_mutex, as we're going to iterate over the fs_devices
devices.  Move the sysfs init stuff outside of the chunk_mutex to get
rid of this lockdep splat.

CC: stable@vger.kernel.org # 4.4.x: f3cd2c5: btrfs: sysfs, rename device_link add/remove functions
CC: stable@vger.kernel.org # 4.4.x
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
digetx pushed a commit that referenced this pull request Oct 15, 2020
WARNING: Unknown commit id '63a0895d960a', maybe rebased or not pulled?
#4:
The uninitialized_var() macro was removed recently via commit 63a0895

WARNING: 'assigments' may be misspelled - perhaps 'assignments'?
#30: FILE: scripts/checkpatch.pl:3904:
+# check for self assigments used to avoid compiler warnings

total: 0 errors, 2 warnings, 17 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/checkpatch-warn-on-self-assignments.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Oct 16, 2020
Commit 4fc427e ("ipv6_route_seq_next should increase position index")
tried to fix the issue where seq_file pos is not increased
if a NULL element is returned with seq_ops->next(). See bug
  https://bugzilla.kernel.org/show_bug.cgi?id=206283
The commit effectively does:
  - increase pos for all seq_ops->start()
  - increase pos for all seq_ops->next()

For ipv6_route, increasing pos for all seq_ops->next() is correct.
But increasing pos for seq_ops->start() is not correct
since pos is used to determine how many items to skip during
seq_ops->start():
  iter->skip = *pos;
seq_ops->start() just fetches the *current* pos item.
The item can be skipped only after seq_ops->show() which essentially
is the beginning of seq_ops->next().

For example, I have 7 ipv6 route entries,
  root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=4096
  00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001     eth0
  fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
  00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001       lo
  fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
  ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000004 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
  0+1 records in
  0+1 records out
  1050 bytes (1.0 kB, 1.0 KiB) copied, 0.00707908 s, 148 kB/s
  root@arch-fb-vm1:~/net-next

In the above, I specify buffer size 4096, so all records can be returned
to user space with a single trip to the kernel.

If I use buffer size 128, since each record size is 149, internally
kernel seq_read() will read 149 into its internal buffer and return the data
to user space in two read() syscalls. Then user read() syscall will trigger
next seq_ops->start(). Since the current implementation increased pos even
for seq_ops->start(), it will skip record #2, #4 and #6, assuming the first
record is #1.

  root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=128
  00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
  fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
4+1 records in
4+1 records out
600 bytes copied, 0.00127758 s, 470 kB/s

To fix the problem, create a fake pos pointer so seq_ops->start()
won't actually increase seq_file pos. With this fix, the
above `dd` command with `bs=128` will show correct result.

Fixes: 4fc427e ("ipv6_route_seq_next should increase position index")
Cc: Alexei Starovoitov <ast@kernel.org>
Suggested-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
digetx pushed a commit that referenced this pull request Oct 21, 2020
Like evlist cpu map, evsel's cpu map should have a proper refcount.

As it's created with a refcount, we don't need to get an extra count.
Thanks to Arnaldo for the simpler suggestion.

This, together with the following patch, fixes the following ASAN
report:

  Direct leak of 840 byte(s) in 70 object(s) allocated from:
    #0 0x7fe36703f628 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x107628)
    #1 0x559fbbf611ca in cpu_map__trim_new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:79
    #2 0x559fbbf6229c in perf_cpu_map__new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:237
    #3 0x559fbbcc6c6d in __add_event util/parse-events.c:357
    #4 0x559fbbcc6c6d in add_event_tool util/parse-events.c:408
    #5 0x559fbbcc6c6d in parse_events_add_tool util/parse-events.c:1414
    #6 0x559fbbd8474d in parse_events_parse util/parse-events.y:439
    #7 0x559fbbcc95da in parse_events__scanner util/parse-events.c:2096
    #8 0x559fbbcc95da in __parse_events util/parse-events.c:2141
    #9 0x559fbbc2788b in check_parse_id tests/pmu-events.c:406
    #10 0x559fbbc2788b in check_parse_id tests/pmu-events.c:393
    #11 0x559fbbc2788b in check_parse_fake tests/pmu-events.c:436
    #12 0x559fbbc2788b in metric_parse_fake tests/pmu-events.c:553
    #13 0x559fbbc27e2d in test_parsing_fake tests/pmu-events.c:599
    #14 0x559fbbc27e2d in test_parsing_fake tests/pmu-events.c:574
    #15 0x559fbbc0109b in run_test tests/builtin-test.c:410
    #16 0x559fbbc0109b in test_and_print tests/builtin-test.c:440
    #17 0x559fbbc03e69 in __cmd_test tests/builtin-test.c:695
    #18 0x559fbbc03e69 in cmd_test tests/builtin-test.c:807
    #19 0x559fbbc691f4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #20 0x559fbbb071a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #21 0x559fbbb071a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #22 0x559fbbb071a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #23 0x7fe366b68cc9 in __libc_start_main ../csu/libc-start.c:308

And I've failed which commit introduced this bug as the code was
heavily changed since then. ;-/

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200917060219.1287863-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
digetx pushed a commit that referenced this pull request Oct 21, 2020
Ensure 'st' is initialized before an error branch is taken.
Fixes test "67: Parse and process metrics" with LLVM msan:

  ==6757==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x5570edae947d in rblist__exit tools/perf/util/rblist.c:114:2
    #1 0x5570edb1c6e8 in runtime_stat__exit tools/perf/util/stat-shadow.c:141:2
    #2 0x5570ed92cfae in __compute_metric tools/perf/tests/parse-metric.c:187:2
    #3 0x5570ed92cb74 in compute_metric tools/perf/tests/parse-metric.c:196:9
    #4 0x5570ed92c6d8 in test_recursion_fail tools/perf/tests/parse-metric.c:318:2
    #5 0x5570ed92b8c8 in test__parse_metric tools/perf/tests/parse-metric.c:356:2
    #6 0x5570ed8de8c1 in run_test tools/perf/tests/builtin-test.c:410:9
    #7 0x5570ed8ddadf in test_and_print tools/perf/tests/builtin-test.c:440:9
    #8 0x5570ed8dca04 in __cmd_test tools/perf/tests/builtin-test.c:661:4
    #9 0x5570ed8dbc07 in cmd_test tools/perf/tests/builtin-test.c:807:9
    #10 0x5570ed7326cc in run_builtin tools/perf/perf.c:313:11
    #11 0x5570ed731639 in handle_internal_command tools/perf/perf.c:365:8
    #12 0x5570ed7323cd in run_argv tools/perf/perf.c:409:2
    #13 0x5570ed731076 in main tools/perf/perf.c:539:3

Fixes: commit f5a5657 ("perf test: Fix memory leaks in parse-metric test")
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: clang-built-linux@googlegroups.com
Link: http://lore.kernel.org/lkml/20200923210655.4143682-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
digetx pushed a commit that referenced this pull request Oct 21, 2020
…vents

It was reported that 'perf stat' crashed when using with armv8_pmu (CPU)
events with the task mode.  As 'perf stat' uses an empty cpu map for
task mode but armv8_pmu has its own cpu mask, it has confused which map
it should use when accessing file descriptors and this causes segfaults:

  (gdb) bt
  #0  0x0000000000603fc8 in perf_evsel__close_fd_cpu (evsel=<optimized out>,
      cpu=<optimized out>) at evsel.c:122
  #1  perf_evsel__close_cpu (evsel=evsel@entry=0x716e950, cpu=7) at evsel.c:156
  #2  0x00000000004d4718 in evlist__close (evlist=0x70a7cb0) at util/evlist.c:1242
  #3  0x0000000000453404 in __run_perf_stat (argc=3, argc@entry=1, argv=0x30,
      argv@entry=0xfffffaea2f90, run_idx=119, run_idx@entry=1701998435)
      at builtin-stat.c:929
  #4  0x0000000000455058 in run_perf_stat (run_idx=1701998435, argv=0xfffffaea2f90,
      argc=1) at builtin-stat.c:947
  #5  cmd_stat (argc=1, argv=0xfffffaea2f90) at builtin-stat.c:2357
  #6  0x00000000004bb888 in run_builtin (p=p@entry=0x9764b8 <commands+288>,
      argc=argc@entry=4, argv=argv@entry=0xfffffaea2f90) at perf.c:312
  #7  0x00000000004bbb54 in handle_internal_command (argc=argc@entry=4,
      argv=argv@entry=0xfffffaea2f90) at perf.c:364
  #8  0x0000000000435378 in run_argv (argcp=<synthetic pointer>,
      argv=<synthetic pointer>) at perf.c:408
  #9  main (argc=4, argv=0xfffffaea2f90) at perf.c:538

To fix this, I simply used the given cpu map unless the evsel actually
is not a system-wide event (like uncore events).

Fixes: 7736627 ("perf stat: Use affinity for closing file descriptors")
Reported-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Barry Song <song.bao.hua@hisilicon.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20201007081311.1831003-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
digetx pushed a commit that referenced this pull request Oct 21, 2020
Dave reported a problem with my rwsem conversion patch where we got the
following lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-default+ #1297 Not tainted
  ------------------------------------------------------
  kswapd0/76 is trying to acquire lock:
  ffff9d5d25df2530 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]

  but task is already holding lock:
  ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #4 (fs_reclaim){+.+.}-{0:0}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 fs_reclaim_acquire.part.0+0x25/0x30
	 kmem_cache_alloc+0x30/0x9c0
	 alloc_inode+0x81/0x90
	 iget_locked+0xcd/0x1a0
	 kernfs_get_inode+0x1b/0x130
	 kernfs_get_tree+0x136/0x210
	 sysfs_get_tree+0x1a/0x50
	 vfs_get_tree+0x1d/0xb0
	 path_mount+0x70f/0xa80
	 do_mount+0x75/0x90
	 __x64_sys_mount+0x8e/0xd0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #3 (kernfs_mutex){+.+.}-{3:3}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 __mutex_lock+0xa0/0xaf0
	 kernfs_add_one+0x23/0x150
	 kernfs_create_dir_ns+0x58/0x80
	 sysfs_create_dir_ns+0x70/0xd0
	 kobject_add_internal+0xbb/0x2d0
	 kobject_add+0x7a/0xd0
	 btrfs_sysfs_add_block_group_type+0x141/0x1d0 [btrfs]
	 btrfs_read_block_groups+0x1f1/0x8c0 [btrfs]
	 open_ctree+0x981/0x1108 [btrfs]
	 btrfs_mount_root.cold+0xe/0xb0 [btrfs]
	 legacy_get_tree+0x2d/0x60
	 vfs_get_tree+0x1d/0xb0
	 fc_mount+0xe/0x40
	 vfs_kern_mount.part.0+0x71/0x90
	 btrfs_mount+0x13b/0x3e0 [btrfs]
	 legacy_get_tree+0x2d/0x60
	 vfs_get_tree+0x1d/0xb0
	 path_mount+0x70f/0xa80
	 do_mount+0x75/0x90
	 __x64_sys_mount+0x8e/0xd0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (btrfs-extent-00){++++}-{3:3}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 down_read_nested+0x45/0x220
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
	 check_committed_ref+0x69/0x200 [btrfs]
	 btrfs_cross_ref_exist+0x65/0xb0 [btrfs]
	 run_delalloc_nocow+0x446/0x9b0 [btrfs]
	 btrfs_run_delalloc_range+0x61/0x6a0 [btrfs]
	 writepage_delalloc+0xae/0x160 [btrfs]
	 __extent_writepage+0x262/0x420 [btrfs]
	 extent_write_cache_pages+0x2b6/0x510 [btrfs]
	 extent_writepages+0x43/0x90 [btrfs]
	 do_writepages+0x40/0xe0
	 __writeback_single_inode+0x62/0x610
	 writeback_sb_inodes+0x20f/0x500
	 wb_writeback+0xef/0x4a0
	 wb_do_writeback+0x49/0x2e0
	 wb_workfn+0x81/0x340
	 process_one_work+0x233/0x5d0
	 worker_thread+0x50/0x3b0
	 kthread+0x137/0x150
	 ret_from_fork+0x1f/0x30

  -> #1 (btrfs-fs-00){++++}-{3:3}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 down_read_nested+0x45/0x220
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
	 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
	 __btrfs_update_delayed_inode+0x93/0x2c0 [btrfs]
	 __btrfs_commit_inode_delayed_items+0x7de/0x850 [btrfs]
	 __btrfs_run_delayed_items+0x8e/0x140 [btrfs]
	 btrfs_commit_transaction+0x367/0xbc0 [btrfs]
	 btrfs_mksubvol+0x2db/0x470 [btrfs]
	 btrfs_mksnapshot+0x7b/0xb0 [btrfs]
	 __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
	 btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
	 btrfs_ioctl+0xd0b/0x2690 [btrfs]
	 __x64_sys_ioctl+0x6f/0xa0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
	 check_prev_add+0x91/0xc60
	 validate_chain+0xa6e/0x2a20
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 __mutex_lock+0xa0/0xaf0
	 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
	 btrfs_evict_inode+0x3cc/0x560 [btrfs]
	 evict+0xd6/0x1c0
	 dispose_list+0x48/0x70
	 prune_icache_sb+0x54/0x80
	 super_cache_scan+0x121/0x1a0
	 do_shrink_slab+0x16d/0x3b0
	 shrink_slab+0xb1/0x2e0
	 shrink_node+0x230/0x6a0
	 balance_pgdat+0x325/0x750
	 kswapd+0x206/0x4d0
	 kthread+0x137/0x150
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    &delayed_node->mutex --> kernfs_mutex --> fs_reclaim

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(kernfs_mutex);
				 lock(fs_reclaim);
    lock(&delayed_node->mutex);

   *** DEADLOCK ***

  3 locks held by kswapd0/76:
   #0: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
   #1: ffffffffa40b8b58 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
   #2: ffff9d5d322390e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50

  stack backtrace:
  CPU: 2 PID: 76 Comm: kswapd0 Not tainted 5.9.0-default+ #1297
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
  Call Trace:
   dump_stack+0x77/0x97
   check_noncircular+0xff/0x110
   ? save_trace+0x50/0x470
   check_prev_add+0x91/0xc60
   validate_chain+0xa6e/0x2a20
   ? save_trace+0x50/0x470
   __lock_acquire+0x582/0xac0
   lock_acquire+0xca/0x430
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   __mutex_lock+0xa0/0xaf0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   ? __lock_acquire+0x582/0xac0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   ? btrfs_evict_inode+0x30b/0x560 [btrfs]
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   btrfs_evict_inode+0x3cc/0x560 [btrfs]
   evict+0xd6/0x1c0
   dispose_list+0x48/0x70
   prune_icache_sb+0x54/0x80
   super_cache_scan+0x121/0x1a0
   do_shrink_slab+0x16d/0x3b0
   shrink_slab+0xb1/0x2e0
   shrink_node+0x230/0x6a0
   balance_pgdat+0x325/0x750
   kswapd+0x206/0x4d0
   ? finish_wait+0x90/0x90
   ? balance_pgdat+0x750/0x750
   kthread+0x137/0x150
   ? kthread_mod_delayed_work+0xc0/0xc0
   ret_from_fork+0x1f/0x30

This happens because we are still holding the path open when we start
adding the sysfs files for the block groups, which creates a dependency
on fs_reclaim via the tree lock.  Fix this by dropping the path before
we start doing anything with sysfs.

Reported-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
digetx pushed a commit that referenced this pull request Oct 29, 2020
Dave reported a problem with my rwsem conversion patch where we got the
following lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-default+ #1297 Not tainted
  ------------------------------------------------------
  kswapd0/76 is trying to acquire lock:
  ffff9d5d25df2530 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]

  but task is already holding lock:
  ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #4 (fs_reclaim){+.+.}-{0:0}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 fs_reclaim_acquire.part.0+0x25/0x30
	 kmem_cache_alloc+0x30/0x9c0
	 alloc_inode+0x81/0x90
	 iget_locked+0xcd/0x1a0
	 kernfs_get_inode+0x1b/0x130
	 kernfs_get_tree+0x136/0x210
	 sysfs_get_tree+0x1a/0x50
	 vfs_get_tree+0x1d/0xb0
	 path_mount+0x70f/0xa80
	 do_mount+0x75/0x90
	 __x64_sys_mount+0x8e/0xd0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #3 (kernfs_mutex){+.+.}-{3:3}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 __mutex_lock+0xa0/0xaf0
	 kernfs_add_one+0x23/0x150
	 kernfs_create_dir_ns+0x58/0x80
	 sysfs_create_dir_ns+0x70/0xd0
	 kobject_add_internal+0xbb/0x2d0
	 kobject_add+0x7a/0xd0
	 btrfs_sysfs_add_block_group_type+0x141/0x1d0 [btrfs]
	 btrfs_read_block_groups+0x1f1/0x8c0 [btrfs]
	 open_ctree+0x981/0x1108 [btrfs]
	 btrfs_mount_root.cold+0xe/0xb0 [btrfs]
	 legacy_get_tree+0x2d/0x60
	 vfs_get_tree+0x1d/0xb0
	 fc_mount+0xe/0x40
	 vfs_kern_mount.part.0+0x71/0x90
	 btrfs_mount+0x13b/0x3e0 [btrfs]
	 legacy_get_tree+0x2d/0x60
	 vfs_get_tree+0x1d/0xb0
	 path_mount+0x70f/0xa80
	 do_mount+0x75/0x90
	 __x64_sys_mount+0x8e/0xd0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (btrfs-extent-00){++++}-{3:3}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 down_read_nested+0x45/0x220
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
	 check_committed_ref+0x69/0x200 [btrfs]
	 btrfs_cross_ref_exist+0x65/0xb0 [btrfs]
	 run_delalloc_nocow+0x446/0x9b0 [btrfs]
	 btrfs_run_delalloc_range+0x61/0x6a0 [btrfs]
	 writepage_delalloc+0xae/0x160 [btrfs]
	 __extent_writepage+0x262/0x420 [btrfs]
	 extent_write_cache_pages+0x2b6/0x510 [btrfs]
	 extent_writepages+0x43/0x90 [btrfs]
	 do_writepages+0x40/0xe0
	 __writeback_single_inode+0x62/0x610
	 writeback_sb_inodes+0x20f/0x500
	 wb_writeback+0xef/0x4a0
	 wb_do_writeback+0x49/0x2e0
	 wb_workfn+0x81/0x340
	 process_one_work+0x233/0x5d0
	 worker_thread+0x50/0x3b0
	 kthread+0x137/0x150
	 ret_from_fork+0x1f/0x30

  -> #1 (btrfs-fs-00){++++}-{3:3}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 down_read_nested+0x45/0x220
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
	 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
	 __btrfs_update_delayed_inode+0x93/0x2c0 [btrfs]
	 __btrfs_commit_inode_delayed_items+0x7de/0x850 [btrfs]
	 __btrfs_run_delayed_items+0x8e/0x140 [btrfs]
	 btrfs_commit_transaction+0x367/0xbc0 [btrfs]
	 btrfs_mksubvol+0x2db/0x470 [btrfs]
	 btrfs_mksnapshot+0x7b/0xb0 [btrfs]
	 __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
	 btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
	 btrfs_ioctl+0xd0b/0x2690 [btrfs]
	 __x64_sys_ioctl+0x6f/0xa0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
	 check_prev_add+0x91/0xc60
	 validate_chain+0xa6e/0x2a20
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 __mutex_lock+0xa0/0xaf0
	 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
	 btrfs_evict_inode+0x3cc/0x560 [btrfs]
	 evict+0xd6/0x1c0
	 dispose_list+0x48/0x70
	 prune_icache_sb+0x54/0x80
	 super_cache_scan+0x121/0x1a0
	 do_shrink_slab+0x16d/0x3b0
	 shrink_slab+0xb1/0x2e0
	 shrink_node+0x230/0x6a0
	 balance_pgdat+0x325/0x750
	 kswapd+0x206/0x4d0
	 kthread+0x137/0x150
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    &delayed_node->mutex --> kernfs_mutex --> fs_reclaim

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(kernfs_mutex);
				 lock(fs_reclaim);
    lock(&delayed_node->mutex);

   *** DEADLOCK ***

  3 locks held by kswapd0/76:
   #0: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
   #1: ffffffffa40b8b58 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
   #2: ffff9d5d322390e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50

  stack backtrace:
  CPU: 2 PID: 76 Comm: kswapd0 Not tainted 5.9.0-default+ #1297
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
  Call Trace:
   dump_stack+0x77/0x97
   check_noncircular+0xff/0x110
   ? save_trace+0x50/0x470
   check_prev_add+0x91/0xc60
   validate_chain+0xa6e/0x2a20
   ? save_trace+0x50/0x470
   __lock_acquire+0x582/0xac0
   lock_acquire+0xca/0x430
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   __mutex_lock+0xa0/0xaf0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   ? __lock_acquire+0x582/0xac0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   ? btrfs_evict_inode+0x30b/0x560 [btrfs]
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   btrfs_evict_inode+0x3cc/0x560 [btrfs]
   evict+0xd6/0x1c0
   dispose_list+0x48/0x70
   prune_icache_sb+0x54/0x80
   super_cache_scan+0x121/0x1a0
   do_shrink_slab+0x16d/0x3b0
   shrink_slab+0xb1/0x2e0
   shrink_node+0x230/0x6a0
   balance_pgdat+0x325/0x750
   kswapd+0x206/0x4d0
   ? finish_wait+0x90/0x90
   ? balance_pgdat+0x750/0x750
   kthread+0x137/0x150
   ? kthread_mod_delayed_work+0xc0/0xc0
   ret_from_fork+0x1f/0x30

This happens because we are still holding the path open when we start
adding the sysfs files for the block groups, which creates a dependency
on fs_reclaim via the tree lock.  Fix this by dropping the path before
we start doing anything with sysfs.

Reported-by: David Sterba <dsterba@suse.com>
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
digetx pushed a commit that referenced this pull request Oct 29, 2020
Very sporadically I had test case btrfs/069 from fstests hanging (for
years, it is not a recent regression), with the following traces in
dmesg/syslog:

  [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
  [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
  [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
  [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
  [162513.514318]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.514747] task:btrfs-transacti state:D stack:    0 pid:1356167 ppid:     2 flags:0x00004000
  [162513.514751] Call Trace:
  [162513.514761]  __schedule+0x5ce/0xd00
  [162513.514765]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.514771]  schedule+0x46/0xf0
  [162513.514844]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.514850]  ? finish_wait+0x90/0x90
  [162513.514864]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.514879]  transaction_kthread+0xa4/0x170 [btrfs]
  [162513.514891]  ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
  [162513.514894]  kthread+0x153/0x170
  [162513.514897]  ? kthread_stop+0x2c0/0x2c0
  [162513.514902]  ret_from_fork+0x22/0x30
  [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
  [162513.515192]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.515680] task:fsstress        state:D stack:    0 pid:1356184 ppid:1356177 flags:0x00004000
  [162513.515682] Call Trace:
  [162513.515688]  __schedule+0x5ce/0xd00
  [162513.515691]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.515697]  schedule+0x46/0xf0
  [162513.515712]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.515716]  ? finish_wait+0x90/0x90
  [162513.515729]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.515743]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
  [162513.515753]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
  [162513.515758]  ? __ia32_sys_fdatasync+0x20/0x20
  [162513.515761]  iterate_supers+0x87/0xf0
  [162513.515765]  ksys_sync+0x60/0xb0
  [162513.515768]  __do_sys_sync+0xa/0x10
  [162513.515771]  do_syscall_64+0x33/0x80
  [162513.515774]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.515781] RIP: 0033:0x7f5238f50bd7
  [162513.515782] Code: Bad RIP value.
  [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
  [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
  [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
  [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
  [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
  [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
  [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
  [162513.516064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.516617] task:fsstress        state:D stack:    0 pid:1356185 ppid:1356177 flags:0x00000000
  [162513.516620] Call Trace:
  [162513.516625]  __schedule+0x5ce/0xd00
  [162513.516628]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.516634]  schedule+0x46/0xf0
  [162513.516647]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.516650]  ? finish_wait+0x90/0x90
  [162513.516662]  start_transaction+0x4d7/0x5f0 [btrfs]
  [162513.516679]  btrfs_setxattr_trans+0x3c/0x100 [btrfs]
  [162513.516686]  __vfs_setxattr+0x66/0x80
  [162513.516691]  __vfs_setxattr_noperm+0x70/0x200
  [162513.516697]  vfs_setxattr+0x6b/0x120
  [162513.516703]  setxattr+0x125/0x240
  [162513.516709]  ? lock_acquire+0xb1/0x480
  [162513.516712]  ? mnt_want_write+0x20/0x50
  [162513.516721]  ? rcu_read_lock_any_held+0x8e/0xb0
  [162513.516723]  ? preempt_count_add+0x49/0xa0
  [162513.516725]  ? __sb_start_write+0x19b/0x290
  [162513.516727]  ? preempt_count_add+0x49/0xa0
  [162513.516732]  path_setxattr+0xba/0xd0
  [162513.516739]  __x64_sys_setxattr+0x27/0x30
  [162513.516741]  do_syscall_64+0x33/0x80
  [162513.516743]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.516745] RIP: 0033:0x7f5238f56d5a
  [162513.516746] Code: Bad RIP value.
  [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
  [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
  [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
  [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
  [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
  [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
  [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
  [162513.517064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.517763] task:fsstress        state:D stack:    0 pid:1356196 ppid:1356177 flags:0x00004000
  [162513.517780] Call Trace:
  [162513.517786]  __schedule+0x5ce/0xd00
  [162513.517789]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.517796]  schedule+0x46/0xf0
  [162513.517810]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.517814]  ? finish_wait+0x90/0x90
  [162513.517829]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.517845]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
  [162513.517857]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
  [162513.517862]  ? __ia32_sys_fdatasync+0x20/0x20
  [162513.517865]  iterate_supers+0x87/0xf0
  [162513.517869]  ksys_sync+0x60/0xb0
  [162513.517872]  __do_sys_sync+0xa/0x10
  [162513.517875]  do_syscall_64+0x33/0x80
  [162513.517878]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.517881] RIP: 0033:0x7f5238f50bd7
  [162513.517883] Code: Bad RIP value.
  [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
  [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
  [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
  [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
  [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
  [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
  [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
  [162513.518298]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.519157] task:fsstress        state:D stack:    0 pid:1356197 ppid:1356177 flags:0x00000000
  [162513.519160] Call Trace:
  [162513.519165]  __schedule+0x5ce/0xd00
  [162513.519168]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.519174]  schedule+0x46/0xf0
  [162513.519190]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.519193]  ? finish_wait+0x90/0x90
  [162513.519206]  start_transaction+0x4d7/0x5f0 [btrfs]
  [162513.519222]  btrfs_create+0x57/0x200 [btrfs]
  [162513.519230]  lookup_open+0x522/0x650
  [162513.519246]  path_openat+0x2b8/0xa50
  [162513.519270]  do_filp_open+0x91/0x100
  [162513.519275]  ? find_held_lock+0x32/0x90
  [162513.519280]  ? lock_acquired+0x33b/0x470
  [162513.519285]  ? do_raw_spin_unlock+0x4b/0xc0
  [162513.519287]  ? _raw_spin_unlock+0x29/0x40
  [162513.519295]  do_sys_openat2+0x20d/0x2d0
  [162513.519300]  do_sys_open+0x44/0x80
  [162513.519304]  do_syscall_64+0x33/0x80
  [162513.519307]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.519309] RIP: 0033:0x7f5238f4a903
  [162513.519310] Code: Bad RIP value.
  [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
  [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
  [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
  [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
  [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
  [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
  [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
  [162513.519727]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.520508] task:btrfs           state:D stack:    0 pid:1356211 ppid:1356178 flags:0x00004002
  [162513.520511] Call Trace:
  [162513.520516]  __schedule+0x5ce/0xd00
  [162513.520519]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.520525]  schedule+0x46/0xf0
  [162513.520544]  btrfs_scrub_pause+0x11f/0x180 [btrfs]
  [162513.520548]  ? finish_wait+0x90/0x90
  [162513.520562]  btrfs_commit_transaction+0x45a/0xc30 [btrfs]
  [162513.520574]  ? start_transaction+0xe0/0x5f0 [btrfs]
  [162513.520596]  btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
  [162513.520619]  btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
  [162513.520639]  btrfs_ioctl+0x2a25/0x36f0 [btrfs]
  [162513.520643]  ? do_sigaction+0xf3/0x240
  [162513.520645]  ? find_held_lock+0x32/0x90
  [162513.520648]  ? do_sigaction+0xf3/0x240
  [162513.520651]  ? lock_acquired+0x33b/0x470
  [162513.520655]  ? _raw_spin_unlock_irq+0x24/0x50
  [162513.520657]  ? lockdep_hardirqs_on+0x7d/0x100
  [162513.520660]  ? _raw_spin_unlock_irq+0x35/0x50
  [162513.520662]  ? do_sigaction+0xf3/0x240
  [162513.520671]  ? __x64_sys_ioctl+0x83/0xb0
  [162513.520672]  __x64_sys_ioctl+0x83/0xb0
  [162513.520677]  do_syscall_64+0x33/0x80
  [162513.520679]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.520681] RIP: 0033:0x7fc3cd307d87
  [162513.520682] Code: Bad RIP value.
  [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
  [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
  [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
  [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
  [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
  [162513.520703]
		  Showing all locks held in the system:
  [162513.520712] 1 lock held by khungtaskd/54:
  [162513.520713]  #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
  [162513.520728] 1 lock held by in:imklog/596:
  [162513.520729]  #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
  [162513.520782] 1 lock held by btrfs-transacti/1356167:
  [162513.520784]  #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
  [162513.520798] 1 lock held by btrfs/1356190:
  [162513.520800]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
  [162513.520805] 1 lock held by fsstress/1356184:
  [162513.520806]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
  [162513.520811] 3 locks held by fsstress/1356185:
  [162513.520812]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
  [162513.520815]  #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
  [162513.520820]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
  [162513.520833] 1 lock held by fsstress/1356196:
  [162513.520834]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
  [162513.520838] 3 locks held by fsstress/1356197:
  [162513.520839]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
  [162513.520843]  #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
  [162513.520846]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
  [162513.520858] 2 locks held by btrfs/1356211:
  [162513.520859]  #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
  [162513.520877]  #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]

This was weird because the stack traces show that a transaction commit,
triggered by a device replace operation, is blocking trying to pause any
running scrubs but there are no stack traces of blocked tasks doing a
scrub.

After poking around with drgn, I noticed there was a scrub task that was
constantly running and blocking for shorts periods of time:

  >>> t = find_task(prog, 1356190)
  >>> prog.stack_trace(t)
  #0  __schedule+0x5ce/0xcfc
  #1  schedule+0x46/0xe4
  #2  schedule_timeout+0x1df/0x475
  #3  btrfs_reada_wait+0xda/0x132
  #4  scrub_stripe+0x2a8/0x112f
  #5  scrub_chunk+0xcd/0x134
  #6  scrub_enumerate_chunks+0x29e/0x5ee
  #7  btrfs_scrub_dev+0x2d5/0x91b
  #8  btrfs_ioctl+0x7f5/0x36e7
  #9  __x64_sys_ioctl+0x83/0xb0
  #10 do_syscall_64+0x33/0x77
  #11 entry_SYSCALL_64+0x7c/0x156

Which corresponds to:

int btrfs_reada_wait(void *handle)
{
    struct reada_control *rc = handle;
    struct btrfs_fs_info *fs_info = rc->fs_info;

    while (atomic_read(&rc->elems)) {
        if (!atomic_read(&fs_info->reada_works_cnt))
            reada_start_machine(fs_info);
        wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
                          (HZ + 9) / 10);
    }
(...)

So the counter "rc->elems" was set to 1 and never decreased to 0, causing
the scrub task to loop forever in that function. Then I used the following
script for drgn to check the readahead requests:

  $ cat dump_reada.py
  import sys
  import drgn
  from drgn import NULL, Object, cast, container_of, execscript, \
      reinterpret, sizeof
  from drgn.helpers.linux import *

  mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

  mnt = None
  for mnt in for_each_mount(prog, dst = mnt_path):
      pass

  if mnt is None:
      sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
      sys.exit(1)

  fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

  def dump_re(re):
      nzones = re.nzones.value_()
      print(f're at {hex(re.value_())}')
      print(f'\t logical {re.logical.value_()}')
      print(f'\t refcnt {re.refcnt.value_()}')
      print(f'\t nzones {nzones}')
      for i in range(nzones):
          dev = re.zones[i].device
          name = dev.name.str.string_()
          print(f'\t\t dev id {dev.devid.value_()} name {name}')
      print()

  for _, e in radix_tree_for_each(fs_info.reada_tree):
      re = cast('struct reada_extent *', e)
      dump_re(re)

  $ drgn dump_reada.py
  re at 0xffff8f3da9d25ad8
          logical 38928384
          refcnt 1
          nzones 1
                 dev id 0 name b'/dev/sdd'
  $

So there was one readahead extent with a single zone corresponding to the
source device of that last device replace operation logged in dmesg/syslog.
Also the ID of that zone's device was 0 which is a special value set in
the source device of a device replace operation when the operation finishes
(constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
confirming again that device /dev/sdd was the source of a device replace
operation.

Normally there should be as many zones in the readahead extent as there are
devices, and I wasn't expecting the extent to be in a block group with a
'single' profile, so I went and confirmed with the following drgn script
that there weren't any single profile block groups:

  $ cat dump_block_groups.py
  import sys
  import drgn
  from drgn import NULL, Object, cast, container_of, execscript, \
      reinterpret, sizeof
  from drgn.helpers.linux import *

  mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

  mnt = None
  for mnt in for_each_mount(prog, dst = mnt_path):
      pass

  if mnt is None:
      sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
      sys.exit(1)

  fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

  BTRFS_BLOCK_GROUP_DATA = (1 << 0)
  BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
  BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
  BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
  BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
  BTRFS_BLOCK_GROUP_DUP = (1 << 5)
  BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
  BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
  BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
  BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
  BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)

  def bg_flags_string(bg):
      flags = bg.flags.value_()
      ret = ''
      if flags & BTRFS_BLOCK_GROUP_DATA:
          ret = 'data'
      if flags & BTRFS_BLOCK_GROUP_METADATA:
          if len(ret) > 0:
              ret += '|'
          ret += 'meta'
      if flags & BTRFS_BLOCK_GROUP_SYSTEM:
          if len(ret) > 0:
              ret += '|'
          ret += 'system'
      if flags & BTRFS_BLOCK_GROUP_RAID0:
          ret += ' raid0'
      elif flags & BTRFS_BLOCK_GROUP_RAID1:
          ret += ' raid1'
      elif flags & BTRFS_BLOCK_GROUP_DUP:
          ret += ' dup'
      elif flags & BTRFS_BLOCK_GROUP_RAID10:
          ret += ' raid10'
      elif flags & BTRFS_BLOCK_GROUP_RAID5:
          ret += ' raid5'
      elif flags & BTRFS_BLOCK_GROUP_RAID6:
          ret += ' raid6'
      elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
          ret += ' raid1c3'
      elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
          ret += ' raid1c4'
      else:
          ret += ' single'

      return ret

  def dump_bg(bg):
      print()
      print(f'block group at {hex(bg.value_())}')
      print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
      print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')

  bg_root = fs_info.block_group_cache_tree.address_of_()
  for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
      dump_bg(bg)

  $ drgn dump_block_groups.py

  block group at 0xffff8f3d673b0400
         start 22020096 length 16777216
         flags 258 - system raid6

  block group at 0xffff8f3d53ddb400
         start 38797312 length 536870912
         flags 260 - meta raid6

  block group at 0xffff8f3d5f4d9c00
         start 575668224 length 2147483648
         flags 257 - data raid6

  block group at 0xffff8f3d08189000
         start 2723151872 length 67108864
         flags 258 - system raid6

  block group at 0xffff8f3db70ff000
         start 2790260736 length 1073741824
         flags 260 - meta raid6

  block group at 0xffff8f3d5f4dd800
         start 3864002560 length 67108864
         flags 258 - system raid6

  block group at 0xffff8f3d67037000
         start 3931111424 length 2147483648
         flags 257 - data raid6
  $

So there were only 2 reasons left for having a readahead extent with a
single zone: reada_find_zone(), called when creating a readahead extent,
returned NULL either because we failed to find the corresponding block
group or because a memory allocation failed. With some additional and
custom tracing I figured out that on every further ocurrence of the
problem the block group had just been deleted when we were looping to
create the zones for the readahead extent (at reada_find_extent()), so we
ended up with only one zone in the readahead extent, corresponding to a
device that ends up getting replaced.

So after figuring that out it became obvious why the hang happens:

1) Task A starts a scrub on any device of the filesystem, except for
   device /dev/sdd;

2) Task B starts a device replace with /dev/sdd as the source device;

3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
   starting to scrub a stripe from block group X. This call to
   btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
   calls reada_add_block(), it passes the logical address of the extent
   tree's root node as its 'logical' argument - a value of 38928384;

4) Task A then enters reada_find_extent(), called from reada_add_block().
   It finds there isn't any existing readahead extent for the logical
   address 38928384, so it proceeds to the path of creating a new one.

   It calls btrfs_map_block() to find out which stripes exist for the block
   group X. On the first iteration of the for loop that iterates over the
   stripes, it finds the stripe for device /dev/sdd, so it creates one
   zone for that device and adds it to the readahead extent. Before getting
   into the second iteration of the loop, the cleanup kthread deletes block
   group X because it was empty. So in the iterations for the remaining
   stripes it does not add more zones to the readahead extent, because the
   calls to reada_find_zone() returned NULL because they couldn't find
   block group X anymore.

   As a result the new readahead extent has a single zone, corresponding to
   the device /dev/sdd;

4) Before task A returns to btrfs_reada_add() and queues the readahead job
   for the readahead work queue, task B finishes the device replace and at
   btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
   device /dev/sdg;

5) Task A returns to reada_add_block(), which increments the counter
   "->elems" of the reada_control structure allocated at btrfs_reada_add().

   Then it returns back to btrfs_reada_add() and calls
   reada_start_machine(). This queues a job in the readahead work queue to
   run the function reada_start_machine_worker(), which calls
   __reada_start_machine().

   At __reada_start_machine() we take the device list mutex and for each
   device found in the current device list, we call
   reada_start_machine_dev() to start the readahead work. However at this
   point the device /dev/sdd was already freed and is not in the device
   list anymore.

   This means the corresponding readahead for the extent at 38928384 is
   never started, and therefore the "->elems" counter of the reada_control
   structure allocated at btrfs_reada_add() never goes down to 0, causing
   the call to btrfs_reada_wait(), done by the scrub task, to wait forever.

Note that the readahead request can be made either after the device replace
started or before it started, however in pratice it is very unlikely that a
device replace is able to start after a readahead request is made and is
able to complete before the readahead request completes - maybe only on a
very small and nearly empty filesystem.

This hang however is not the only problem we can have with readahead and
device removals. When the readahead extent has other zones other than the
one corresponding to the device that is being removed (either by a device
replace or a device remove operation), we risk having a use-after-free on
the device when dropping the last reference of the readahead extent.

For example if we create a readahead extent with two zones, one for the
device /dev/sdd and one for the device /dev/sde:

1) Before the readahead worker starts, the device /dev/sdd is removed,
   and the corresponding btrfs_device structure is freed. However the
   readahead extent still has the zone pointing to the device structure;

2) When the readahead worker starts, it only finds device /dev/sde in the
   current device list of the filesystem;

3) It starts the readahead work, at reada_start_machine_dev(), using the
   device /dev/sde;

4) Then when it finishes reading the extent from device /dev/sde, it calls
   __readahead_hook() which ends up dropping the last reference on the
   readahead extent through the last call to reada_extent_put();

5) At reada_extent_put() it iterates over each zone of the readahead extent
   and attempts to delete an element from the device's 'reada_extents'
   radix tree, resulting in a use-after-free, as the device pointer of the
   zone for /dev/sdd is now stale. We can also access the device after
   dropping the last reference of a zone, through reada_zone_release(),
   also called by reada_extent_put().

And a device remove suffers the same problem, however since it shrinks the
device size down to zero before removing the device, it is very unlikely to
still have readahead requests not completed by the time we free the device,
the only possibility is if the device has a very little space allocated.

While the hang problem is exclusive to scrub, since it is currently the
only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
problem affects any path that triggers readhead, which includes
btree_readahead_hook() and __readahead_hook() (a readahead worker can
trigger readahed for the children of a node) for example - any path that
ends up calling reada_add_block() can trigger the use-after-free after a
device is removed.

So fix this by waiting for any readahead requests for a device to complete
before removing a device, ensuring that while waiting for existing ones no
new ones can be made.

This problem has been around for a very long time - the readahead code was
added in 2011, device remove exists since 2008 and device replace was
introduced in 2013, hard to pick a specific commit for a git Fixes tag.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
digetx pushed a commit that referenced this pull request Oct 29, 2020
When enabling qgroups we walk the tree_root and then add a qgroup item
for every root that we have.  This creates a lock dependency on the
tree_root and qgroup_root, which results in the following lockdep splat
(with tree locks using rwsem), eg. in tests btrfs/017 or btrfs/022:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-default+ #1299 Not tainted
  ------------------------------------------------------
  btrfs/24552 is trying to acquire lock:
  ffff9142dfc5f630 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

  but task is already holding lock:
  ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (btrfs-root-00){++++}-{3:3}:
	 __lock_acquire+0x3fb/0x730
	 lock_acquire.part.0+0x6a/0x130
	 down_read_nested+0x46/0x130
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
	 btrfs_search_slot+0xc3/0x9f0 [btrfs]
	 btrfs_insert_item+0x6e/0x140 [btrfs]
	 btrfs_create_tree+0x1cb/0x240 [btrfs]
	 btrfs_quota_enable+0xcd/0x790 [btrfs]
	 btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
	 __x64_sys_ioctl+0x83/0xa0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (btrfs-quota-00){++++}-{3:3}:
	 check_prev_add+0x91/0xc30
	 validate_chain+0x491/0x750
	 __lock_acquire+0x3fb/0x730
	 lock_acquire.part.0+0x6a/0x130
	 down_read_nested+0x46/0x130
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
	 btrfs_search_slot+0xc3/0x9f0 [btrfs]
	 btrfs_insert_empty_items+0x58/0xa0 [btrfs]
	 add_qgroup_item.part.0+0x72/0x210 [btrfs]
	 btrfs_quota_enable+0x3bb/0x790 [btrfs]
	 btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
	 __x64_sys_ioctl+0x83/0xa0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-root-00);
				 lock(btrfs-quota-00);
				 lock(btrfs-root-00);
    lock(btrfs-quota-00);

   *** DEADLOCK ***

  5 locks held by btrfs/24552:
   #0: ffff9142df431478 (sb_writers#10){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0xa0
   #1: ffff9142f9b10cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl_quota_ctl+0x7b/0xe0 [btrfs]
   #2: ffff9142f9b11a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0x790 [btrfs]
   #3: ffff9142df431698 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x406/0x510 [btrfs]
   #4: ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

  stack backtrace:
  CPU: 1 PID: 24552 Comm: btrfs Not tainted 5.9.0-default+ #1299
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
  Call Trace:
   dump_stack+0x77/0x97
   check_noncircular+0xf3/0x110
   check_prev_add+0x91/0xc30
   validate_chain+0x491/0x750
   __lock_acquire+0x3fb/0x730
   lock_acquire.part.0+0x6a/0x130
   ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
   ? lock_acquire+0xc4/0x140
   ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
   down_read_nested+0x46/0x130
   ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
   __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
   ? btrfs_root_node+0xd9/0x200 [btrfs]
   __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
   btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
   btrfs_search_slot+0xc3/0x9f0 [btrfs]
   btrfs_insert_empty_items+0x58/0xa0 [btrfs]
   add_qgroup_item.part.0+0x72/0x210 [btrfs]
   btrfs_quota_enable+0x3bb/0x790 [btrfs]
   btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
   __x64_sys_ioctl+0x83/0xa0
   do_syscall_64+0x2d/0x70
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by dropping the path whenever we find a root item, add the
qgroup item, and then re-lookup the root item we found and continue
processing roots.

Reported-by: David Sterba <dsterba@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
digetx pushed a commit that referenced this pull request Oct 29, 2020
There are no races while allocating an OPP table, except while adding it
to the opp_tables list. And taking the opp_table_lock for the entire
duration causes circular dependency issues like the one mentioned below.
This patch takes another approach to reduce the size of the critical
section to avoid such issues, the details of that are present within the
patch.

 ======================================================
 WARNING: possible circular locking dependency detected
 5.4.72 #14 Not tainted
 ------------------------------------------------------
 chrome/1865 is trying to acquire lock:
 ffffffdd34921750 (opp_table_lock){+.+.}, at: _find_opp_table+0x34/0x74

 but task is already holding lock:
 ffffff81f0fc71a8 (reservation_ww_class_mutex){+.+.}, at: submit_lock_objects+0x70/0x1ec

 fscrypt: AES-256-CTS-CBC using implementation "cts-cbc-aes-ce"
 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:
 -> #4 (reservation_ww_class_mutex){+.+.}:
        __mutex_lock_common+0xec/0xc0c
        ww_mutex_lock_interruptible+0x5c/0xc4
        msm_gem_fault+0x2c/0x124
        __do_fault+0x40/0x16c
        handle_mm_fault+0x7cc/0xd98
        do_page_fault+0x230/0x3b4
        do_translation_fault+0x5c/0x78
        do_mem_abort+0x4c/0xb4
        el0_da+0x1c/0x20
 -> #3 (&mm->mmap_sem){++++}:
        __might_fault+0x70/0x98
        compat_filldir+0xf8/0x48c
        dcache_readdir+0x70/0x1dc
        iterate_dir+0xd4/0x180
        __arm64_compat_sys_getdents+0xa0/0x19c
        el0_svc_common+0xa8/0x178
        el0_svc_compat_handler+0x2c/0x40
        el0_svc_compat+0x8/0x10
 -> #2 (&sb->s_type->i_mutex_key#3){++++}:
        down_write+0x54/0x16c
        start_creating+0x68/0x128
        debugfs_create_dir+0x28/0x114
        opp_debug_register+0x8c/0xc0
        _add_opp_dev_unlocked+0x5c/0x70
        _add_opp_dev+0x38/0x58
        _opp_get_opp_table+0xdc/0x1ac
        dev_pm_opp_get_opp_table_indexed+0x24/0x30
        dev_pm_opp_of_add_table_indexed+0x48/0x84
        of_genpd_add_provider_onecell+0xc0/0x1b8
        rpmhpd_probe+0x240/0x268
        platform_drv_probe+0x90/0xb0
        really_probe+0x134/0x2ec
        driver_probe_device+0x64/0xfc
        __device_attach_driver+0x8c/0xa4
        bus_for_each_drv+0x90/0xd8
        __device_attach+0xc0/0x148
        device_initial_probe+0x20/0x2c
        bus_probe_device+0x34/0x94
        device_add+0x1fc/0x3b0
        of_device_add+0x3c/0x4c
        of_platform_device_create_pdata+0xb8/0xfc
        of_platform_bus_create+0x1e4/0x368
        of_platform_populate+0x70/0xbc
        devm_of_platform_populate+0x58/0xa0
        rpmh_rsc_probe+0x36c/0x3cc
        platform_drv_probe+0x90/0xb0
        really_probe+0x134/0x2ec
        driver_probe_device+0x64/0xfc
        __device_attach_driver+0x8c/0xa4
        bus_for_each_drv+0x90/0xd8
        __device_attach+0xc0/0x148
        device_initial_probe+0x20/0x2c
        bus_probe_device+0x34/0x94
        device_add+0x1fc/0x3b0
        of_device_add+0x3c/0x4c
        of_platform_device_create_pdata+0xb8/0xfc
        of_platform_bus_create+0x1e4/0x368
        of_platform_bus_create+0x230/0x368
        of_platform_populate+0x70/0xbc
        of_platform_default_populate_init+0xa8/0xc0
        do_one_initcall+0x1c8/0x3fc
        do_initcall_level+0xb4/0x10c
        do_basic_setup+0x30/0x48
        kernel_init_freeable+0x124/0x1a4
        kernel_init+0x14/0x104
        ret_from_fork+0x10/0x18
 -> #1 (&opp_table->lock){+.+.}:
        __mutex_lock_common+0xec/0xc0c
        mutex_lock_nested+0x40/0x50
        _add_opp_dev+0x2c/0x58
        _opp_get_opp_table+0xdc/0x1ac
        dev_pm_opp_get_opp_table_indexed+0x24/0x30
        dev_pm_opp_of_add_table_indexed+0x48/0x84
        of_genpd_add_provider_onecell+0xc0/0x1b8
        rpmhpd_probe+0x240/0x268
        platform_drv_probe+0x90/0xb0
        really_probe+0x134/0x2ec
        driver_probe_device+0x64/0xfc
        __device_attach_driver+0x8c/0xa4
        bus_for_each_drv+0x90/0xd8
        __device_attach+0xc0/0x148
        device_initial_probe+0x20/0x2c
        bus_probe_device+0x34/0x94
        device_add+0x1fc/0x3b0
        of_device_add+0x3c/0x4c
        of_platform_device_create_pdata+0xb8/0xfc
        of_platform_bus_create+0x1e4/0x368
        of_platform_populate+0x70/0xbc
        devm_of_platform_populate+0x58/0xa0
        rpmh_rsc_probe+0x36c/0x3cc
        platform_drv_probe+0x90/0xb0
        really_probe+0x134/0x2ec
        driver_probe_device+0x64/0xfc
        __device_attach_driver+0x8c/0xa4
        bus_for_each_drv+0x90/0xd8
        __device_attach+0xc0/0x148
        device_initial_probe+0x20/0x2c
        bus_probe_device+0x34/0x94
        device_add+0x1fc/0x3b0
        of_device_add+0x3c/0x4c
        of_platform_device_create_pdata+0xb8/0xfc
        of_platform_bus_create+0x1e4/0x368
        of_platform_populate+0x70/0xbc
        devm_of_platform_populate+0x58/0xa0
        rpmh_rsc_probe+0x36c/0x3cc
        platform_drv_probe+0x90/0xb0
        really_probe+0x134/0x2ec
        driver_probe_device+0x64/0xfc
        __device_attach_driver+0x8c/0xa4
        bus_for_each_drv+0x90/0xd8
        __device_attach+0xc0/0x148
        device_initial_probe+0x20/0x2c
        bus_probe_device+0x34/0x94
        device_add+0x1fc/0x3b0
        of_device_add+0x3c/0x4c
        of_platform_device_create_pdata+0xb8/0xfc
        of_platform_bus_create+0x1e4/0x368
        of_platform_bus_create+0x230/0x368
        of_platform_populate+0x70/0xbc
        of_platform_default_populate_init+0xa8/0xc0
        do_one_initcall+0x1c8/0x3fc
        do_initcall_level+0xb4/0x10c
        do_basic_setup+0x30/0x48
        kernel_init_freeable+0x124/0x1a4
        kernel_init+0x14/0x104
        ret_from_fork+0x10/0x18
 -> #0 (opp_table_lock){+.+.}:
        __lock_acquire+0xee4/0x2450
        lock_acquire+0x1cc/0x210
        __mutex_lock_common+0xec/0xc0c
        mutex_lock_nested+0x40/0x50
        _find_opp_table+0x34/0x74
        dev_pm_opp_find_freq_exact+0x2c/0xdc
        a6xx_gmu_resume+0xc8/0xecc
        a6xx_pm_resume+0x148/0x200
        adreno_resume+0x28/0x34
        pm_generic_runtime_resume+0x34/0x48
        __rpm_callback+0x70/0x10c
        rpm_callback+0x34/0x8c
        rpm_resume+0x414/0x550
        __pm_runtime_resume+0x7c/0xa0
        msm_gpu_submit+0x60/0x1c0
        msm_ioctl_gem_submit+0xadc/0xb60
        drm_ioctl_kernel+0x9c/0x118
        drm_ioctl+0x27c/0x408
        drm_compat_ioctl+0xcc/0xdc
        __se_compat_sys_ioctl+0x100/0x206c
        __arm64_compat_sys_ioctl+0x20/0x2c
        el0_svc_common+0xa8/0x178
        el0_svc_compat_handler+0x2c/0x40
        el0_svc_compat+0x8/0x10
 other info that might help us debug this:
 Chain exists of: opp_table_lock --> &mm->mmap_sem --> reservation_ww_class_mutex
  Possible unsafe locking scenario:
        CPU0                    CPU1
        ----                    ----
   lock(reservation_ww_class_mutex);
                                lock(&mm->mmap_sem);
                                lock(reservation_ww_class_mutex);
   lock(opp_table_lock);

 *** DEADLOCK ***

 3 locks held by chrome/1865:
  #0: ffffff81edecc0d8 (&dev->struct_mutex){+.+.}, at: msm_ioctl_gem_submit+0x264/0xb60
  #1: ffffff81d0000870 (reservation_ww_class_acquire){+.+.}, at: msm_ioctl_gem_submit+0x8e8/0xb60
  #2: ffffff81f0fc71a8 (reservation_ww_class_mutex){+.+.}, at: submit_lock_objects+0x70/0x1ec
 stack backtrace:
 CPU: 0 PID: 1865 Comm: chrome Not tainted 5.4.72 #14
 Hardware name: Google Lazor (rev1+) with LTE (DT)
 Call trace:
  dump_backtrace+0x0/0x158
  show_stack+0x20/0x2c
  dump_stack+0xc8/0x160
  print_circular_bug+0x2c4/0x2c8
  check_noncircular+0x1a8/0x1b0
  __lock_acquire+0xee4/0x2450
  lock_acquire+0x1cc/0x210
  __mutex_lock_common+0xec/0xc0c
  mutex_lock_nested+0x40/0x50
  _find_opp_table+0x34/0x74
  dev_pm_opp_find_freq_exact+0x2c/0xdc
  a6xx_gmu_resume+0xc8/0xecc
  a6xx_pm_resume+0x148/0x200
  adreno_resume+0x28/0x34
  pm_generic_runtime_resume+0x34/0x48
  __rpm_callback+0x70/0x10c
  rpm_callback+0x34/0x8c
  rpm_resume+0x414/0x550
  __pm_runtime_resume+0x7c/0xa0
  msm_gpu_submit+0x60/0x1c0
  msm_ioctl_gem_submit+0xadc/0xb60
  drm_ioctl_kernel+0x9c/0x118
  drm_ioctl+0x27c/0x408
  drm_compat_ioctl+0xcc/0xdc
  __se_compat_sys_ioctl+0x100/0x206c
  __arm64_compat_sys_ioctl+0x20/0x2c
  el0_svc_common+0xa8/0x178
  el0_svc_compat_handler+0x2c/0x40
  el0_svc_compat+0x8/0x10

Reported-by: Rob Clark <robdclark@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
digetx pushed a commit that referenced this pull request Nov 2, 2020
When executing the LOCK06 locktorture scenario featuring percpu-rwsem,
the RCU callback rcu_sync_func() may still be pending after locktorture
module is removed.  This can in turn lead to the following Oops:

  BUG: unable to handle page fault for address: ffffffffc00eb920
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 6500a067 P4D 6500a067 PUD 6500c067 PMD 13a36c067 PTE 800000013691c163
  Oops: 0000 [#1] PREEMPT SMP
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  RIP: 0010:rcu_cblist_dequeue+0x12/0x30
  Call Trace:
   <IRQ>
   rcu_core+0x1b1/0x860
   __do_softirq+0xfe/0x326
   asm_call_on_stack+0x12/0x20
   </IRQ>
   do_softirq_own_stack+0x5f/0x80
   irq_exit_rcu+0xaf/0xc0
   sysvec_apic_timer_interrupt+0x2e/0xb0
   asm_sysvec_apic_timer_interrupt+0x12/0x20

This commit avoids tis problem by adding an exit hook in lock_torture_ops
and using it to call percpu_free_rwsem() for percpu rwsem torture during
the module-cleanup function, thus ensuring that rcu_sync_func() completes
before module exits.

It is also necessary to call the exit hook if lock_torture_init()
fails half-way, so this commit also adds an ->init_called field in
lock_torture_cxt to indicate that exit hook, if present, must be called.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
digetx pushed a commit that referenced this pull request Nov 3, 2020
The purpose of srv_mutex is to protect srv_list as in put_srv, so no need
to hold it when allocate memory for srv since it could be time consuming.

Otherwise if one machine has limited memory, rsrv_close_work could be
blocked for a longer time due to the mutex is held by get_or_create_srv
since it can't get memory in time.

  INFO: task kworker/1:1:27478 blocked for more than 120 seconds.
        Tainted: G           O    4.14.171-1-storage #4.14.171-1.3~deb9
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/1:1     D    0 27478      2 0x80000000
  Workqueue: rtrs_server_wq rtrs_srv_close_work [rtrs_server]
  Call Trace:
   ? __schedule+0x38c/0x7e0
   schedule+0x32/0x80
   schedule_preempt_disabled+0xa/0x10
   __mutex_lock.isra.2+0x25e/0x4d0
   ? put_srv+0x44/0x100 [rtrs_server]
   put_srv+0x44/0x100 [rtrs_server]
   rtrs_srv_close_work+0x16c/0x280 [rtrs_server]
   process_one_work+0x1c5/0x3c0
   worker_thread+0x47/0x3e0
   kthread+0xfc/0x130
   ? trace_event_raw_event_workqueue_execute_start+0xa0/0xa0
   ? kthread_create_on_node+0x70/0x70
   ret_from_fork+0x1f/0x30

Let's move all the logics from __find_srv_and_get and __alloc_srv to
get_or_create_srv, and remove the two functions. Then it should be safe
for multiple processes to access the same srv since it is protected with
srv_mutex.

And since we don't want to allocate chunks with srv_mutex held, let's
check the srv->refcount after get srv because the chunks could not be
allocated yet.

Fixes: 9cb8374 ("RDMA/rtrs: server: main functionality")
Link: https://lore.kernel.org/r/20201023074353.21946-6-jinpu.wang@cloud.ionos.com
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
digetx pushed a commit that referenced this pull request Nov 16, 2020
Ido Schimmel says:

====================
mlxsw: Preparations for nexthop objects support - part 1/2

This patch set contains small and non-functional changes aimed at making
it easier to support nexthop objects in mlxsw. Follow up patches can be
found here [1].

Patches #1-#4 add a type field to the nexthop group struct instead of
the existing protocol field. This will be used later on to add a nexthop
object type, which can contain both IPv4 and IPv6 nexthops.

Patches #5-#7 move the IPv4 FIB info pointer (i.e., 'struct fib_info')
from the nexthop group struct to the route. The pointer will not be
available when the nexthop group is a nexthop object, but it needs to be
accessible to routes regardless.

Patch #8 is the biggest change, but it is an entirely cosmetic change
and should therefore be easy to review. The motivation and the change
itself are explained in detail in the commit message.

Patches #9-#12 perform small changes so that two functions that are
currently split between IPv4 and IPv6 could be consolidated in patches

Patch #15 removes an outdated comment.

[1] https://github.com/idosch/linux/tree/submit/nexthop_objects
====================

Link: https://lore.kernel.org/r/20201113160559.22148-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
digetx pushed a commit that referenced this pull request Nov 16, 2020
WARNING: Unknown commit id '70e806e4e645', maybe rebased or not pulled?
#4:
Since commit 70e806e ("mm: Do early cow for pinned pages during

WARNING: Unknown commit id 'f3c64eda3e50', maybe rebased or not pulled?
#34:
problems (and was missing a barrrier), see commit f3c64ed ("mm: avoid

WARNING: Unknown commit id 'f3c64eda3e50', maybe rebased or not pulled?
#46:
Fixes: f3c64ed ("mm: avoid early COW write protect games during fork()")

WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
#136: FILE: mm/gup.c:2687:
+	unsigned seq;

WARNING: Block comments should align the * on each line
#191: FILE: mm/memory.c:1180:
+		 * lockdep complaining about preemptibility.
+		*/

total: 0 errors, 5 warnings, 109 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-prevent-gup_fast-from-racing-with-cow-during-fork.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Nov 30, 2020
…/kernel/git/kvmarm/kvmarm into kvm-master

KVM/arm64 fixes for v5.10, take #4

- Fix alignment of the new HYP sections
- Fix GICR_TYPER access from userspace
digetx pushed a commit that referenced this pull request Nov 30, 2020
A lock dependency loop exists between the root tree lock, the extent tree
lock, and the free space tree lock.

The root tree lock depends on the free space tree lock because
btrfs_create_tree holds the new tree's lock while adding it to the root
tree.

The extent tree lock depends on the root tree lock because during umount,
we write out space cache v1, which writes inodes in the root tree, which
results in holding the root tree lock while doing a lookup in the extent tree.

Finally, the free space tree depends on the extent tree because
populate_free_space_tree holds a locked path in the extent tree and then
does a lookup in the free space tree to add the new item.

The simplest of the three to break is the one during tree creation: we
unlock the leaf before inserting the tree node into the root tree, which
fixes the lockdep.

Full contents of lockdep error for reference:

[   30.480136] ======================================================
[   30.480830] WARNING: possible circular locking dependency detected
[   30.481457] 5.9.0-rc8+ #76 Not tainted
[   30.481897] ------------------------------------------------------
[   30.482500] mount/520 is trying to acquire lock:
[   30.483064] ffff9babebe03908 (btrfs-free-space-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[   30.484054]
               but task is already holding lock:
[   30.484637] ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[   30.485581]
               which lock already depends on the new lock.

[   30.486397]
               the existing dependency chain (in reverse order) is:
[   30.487205]
               -> #2 (btrfs-extent-01#2){++++}-{3:3}:
[   30.487825]        down_read_nested+0x43/0x150
[   30.488306]        __btrfs_tree_read_lock+0x39/0x180
[   30.488868]        __btrfs_read_lock_root_node+0x3a/0x50
[   30.489477]        btrfs_search_slot+0x464/0x9b0
[   30.490009]        check_committed_ref+0x59/0x1d0
[   30.490603]        btrfs_cross_ref_exist+0x65/0xb0
[   30.491108]        run_delalloc_nocow+0x405/0x930
[   30.491651]        btrfs_run_delalloc_range+0x60/0x6b0
[   30.492203]        writepage_delalloc+0xd4/0x150
[   30.492688]        __extent_writepage+0x18d/0x3a0
[   30.493199]        extent_write_cache_pages+0x2af/0x450
[   30.493743]        extent_writepages+0x34/0x70
[   30.494231]        do_writepages+0x31/0xd0
[   30.494642]        __filemap_fdatawrite_range+0xad/0xe0
[   30.495194]        btrfs_fdatawrite_range+0x1b/0x50
[   30.495677]        __btrfs_write_out_cache+0x40d/0x460
[   30.496227]        btrfs_write_out_cache+0x8b/0x110
[   30.496716]        btrfs_start_dirty_block_groups+0x211/0x4e0
[   30.497317]        btrfs_commit_transaction+0xc0/0xba0
[   30.497861]        sync_filesystem+0x71/0x90
[   30.498303]        btrfs_remount+0x81/0x433
[   30.498767]        reconfigure_super+0x9f/0x210
[   30.499261]        path_mount+0x9d1/0xa30
[   30.499722]        do_mount+0x55/0x70
[   30.500158]        __x64_sys_mount+0xc4/0xe0
[   30.500616]        do_syscall_64+0x33/0x40
[   30.501091]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.501629]
               -> #1 (btrfs-root-00){++++}-{3:3}:
[   30.502241]        down_read_nested+0x43/0x150
[   30.502727]        __btrfs_tree_read_lock+0x39/0x180
[   30.503291]        __btrfs_read_lock_root_node+0x3a/0x50
[   30.503903]        btrfs_search_slot+0x464/0x9b0
[   30.504405]        btrfs_insert_empty_items+0x60/0xa0
[   30.504973]        btrfs_insert_item+0x60/0xd0
[   30.505412]        btrfs_create_tree+0x1b6/0x210
[   30.505913]        btrfs_create_free_space_tree+0x54/0x110
[   30.506460]        btrfs_mount_rw+0x15d/0x20f
[   30.506937]        btrfs_remount+0x356/0x433
[   30.507369]        reconfigure_super+0x9f/0x210
[   30.507868]        path_mount+0x9d1/0xa30
[   30.508264]        do_mount+0x55/0x70
[   30.508668]        __x64_sys_mount+0xc4/0xe0
[   30.509186]        do_syscall_64+0x33/0x40
[   30.509652]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.510271]
               -> #0 (btrfs-free-space-00){++++}-{3:3}:
[   30.510972]        __lock_acquire+0x11ad/0x1b60
[   30.511432]        lock_acquire+0xa2/0x360
[   30.511917]        down_read_nested+0x43/0x150
[   30.512383]        __btrfs_tree_read_lock+0x39/0x180
[   30.512947]        __btrfs_read_lock_root_node+0x3a/0x50
[   30.513455]        btrfs_search_slot+0x464/0x9b0
[   30.513947]        search_free_space_info+0x45/0x90
[   30.514465]        __add_to_free_space_tree+0x92/0x39d
[   30.515010]        btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
[   30.515639]        btrfs_mount_rw+0x15d/0x20f
[   30.516142]        btrfs_remount+0x356/0x433
[   30.516538]        reconfigure_super+0x9f/0x210
[   30.517065]        path_mount+0x9d1/0xa30
[   30.517438]        do_mount+0x55/0x70
[   30.517824]        __x64_sys_mount+0xc4/0xe0
[   30.518293]        do_syscall_64+0x33/0x40
[   30.518776]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.519335]
               other info that might help us debug this:

[   30.520210] Chain exists of:
                 btrfs-free-space-00 --> btrfs-root-00 --> btrfs-extent-01#2

[   30.521407]  Possible unsafe locking scenario:

[   30.522037]        CPU0                    CPU1
[   30.522456]        ----                    ----
[   30.522941]   lock(btrfs-extent-01#2);
[   30.523311]                                lock(btrfs-root-00);
[   30.523952]                                lock(btrfs-extent-01#2);
[   30.524620]   lock(btrfs-free-space-00);
[   30.525068]
                *** DEADLOCK ***

[   30.525669] 5 locks held by mount/520:
[   30.526116]  #0: ffff9babebc520e0 (&type->s_umount_key#37){+.+.}-{3:3}, at: path_mount+0x7ef/0xa30
[   30.527056]  #1: ffff9babebc52640 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x3d5/0x5c0
[   30.527960]  #2: ffff9babeae8f2e8 (&cache->free_space_lock#2){+.+.}-{3:3}, at: btrfs_create_free_space_tree.cold.22+0x101/0x45d
[   30.529118]  #3: ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[   30.530113]  #4: ffff9babebd52eb8 (btrfs-extent-00){++++}-{3:3}, at: btrfs_try_tree_read_lock+0x16/0x100
[   30.531124]
               stack backtrace:
[   30.531528] CPU: 0 PID: 520 Comm: mount Not tainted 5.9.0-rc8+ #76
[   30.532166] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module_el8.1.0+248+298dec18 04/01/2014
[   30.533215] Call Trace:
[   30.533452]  dump_stack+0x8d/0xc0
[   30.533797]  check_noncircular+0x13c/0x150
[   30.534233]  __lock_acquire+0x11ad/0x1b60
[   30.534667]  lock_acquire+0xa2/0x360
[   30.535063]  ? __btrfs_tree_read_lock+0x39/0x180
[   30.535525]  down_read_nested+0x43/0x150
[   30.535939]  ? __btrfs_tree_read_lock+0x39/0x180
[   30.536400]  __btrfs_tree_read_lock+0x39/0x180
[   30.536862]  __btrfs_read_lock_root_node+0x3a/0x50
[   30.537304]  btrfs_search_slot+0x464/0x9b0
[   30.537713]  ? trace_hardirqs_on+0x1c/0xf0
[   30.538148]  search_free_space_info+0x45/0x90
[   30.538572]  __add_to_free_space_tree+0x92/0x39d
[   30.539071]  ? printk+0x48/0x4a
[   30.539367]  btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
[   30.539972]  btrfs_mount_rw+0x15d/0x20f
[   30.540350]  btrfs_remount+0x356/0x433
[   30.540773]  ? shrink_dcache_sb+0xd9/0x100
[   30.541203]  reconfigure_super+0x9f/0x210
[   30.541642]  path_mount+0x9d1/0xa30
[   30.542040]  do_mount+0x55/0x70
[   30.542366]  __x64_sys_mount+0xc4/0xe0
[   30.542822]  do_syscall_64+0x33/0x40
[   30.543197]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.543691] RIP: 0033:0x7f109f7ab93a
[   30.544053] Code: 48 8b 0d 49 e5 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 16 e5 0b 00 f7 d8 64 89 01 48
[   30.546042] RSP: 002b:00007ffc47c4f858 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[   30.546770] RAX: ffffffffffffffda RBX: 00007f109f8cf264 RCX: 00007f109f7ab93a
[   30.547485] RDX: 0000557e6fc10770 RSI: 0000557e6fc19cf0 RDI: 0000557e6fc19cd0
[   30.548185] RBP: 0000557e6fc10520 R08: 0000557e6fc18e30 R09: 0000557e6fc18cb0
[   30.548911] R10: 0000000000200020 R11: 0000000000000246 R12: 0000000000000000
[   30.549606] R13: 0000557e6fc19cd0 R14: 0000557e6fc10770 R15: 0000557e6fc10520

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
digetx pushed a commit that referenced this pull request Nov 30, 2020
Ido Schimmel says:

====================
mlxsw: Update adjacency index more efficiently

The device supports an operation that allows the driver to issue one
request to update the adjacency index for all the routes in a given
virtual router (VR) from old index and size to new ones. This is useful
in case the configuration of a certain nexthop group is updated and its
adjacency index changes.

Currently, the driver does not use this operation in an efficient
manner. It iterates over all the routes using the nexthop group and
issues an update request for the VR if it is not the same as the
previous VR.

Instead, this patch set tracks the VRs in which the nexthop group is
used and issues one request for each VR.

Example:

8k IPv6 routes were added in an alternating manner to two VRFs. All the
routes are using the same nexthop object ('nhid 1').

Before:

 Performance counter stats for 'ip nexthop replace id 1 via 2001:db8:1::2 dev swp3':

            16,385      devlink:devlink_hwmsg

       4.255933213 seconds time elapsed

       0.000000000 seconds user
       0.666923000 seconds sys

Number of EMAD transactions corresponds to number of routes using the
nexthop group.

After:

 Performance counter stats for 'ip nexthop replace id 1 via 2001:db8:1::2 dev swp3':

                 3      devlink:devlink_hwmsg

       0.077655094 seconds time elapsed

       0.000000000 seconds user
       0.076698000 seconds sys

Number of EMAD transactions corresponds to number of VRFs / VRs.

Patch set overview:

Patch #1 is a fix for a bug introduced in previous submission. Detected
by Coverity.

Patches #2 and #3 are preparations.

Patch #4 tracks the VRs a nexthop group is member of.

Patch #5 uses the membership tracking from the previous patch to issue
one update request per each VR.
====================

Link: https://lore.kernel.org/r/20201125193505.1052466-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
digetx pushed a commit that referenced this pull request Dec 4, 2020
Mismatch in probe platform_set_drvdata set's and method's that call
dev_get_platdata will result in "Unable to handle kernel NULL pointer
dereference", let's use according method for getting driver data after
platform_set_drvdata.

8<--- cut here ---
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = (ptrval)
[00000000] *pgd=00000000
Internal error: Oops: 5 [#1] ARM
Modules linked in:
CPU: 0 PID: 1 Comm: swapper Not tainted 5.9.10-00003-g723e101e0037-dirty #4
Hardware name: Technologic Systems TS-72xx SBC
PC is at ep93xx_rtc_read_time+0xc/0x2c
LR is at __rtc_read_time+0x4c/0x8c
[...]
[<c02b01c8>] (ep93xx_rtc_read_time) from [<c02ac38c>] (__rtc_read_time+0x4c/0x8c)
[<c02ac38c>] (__rtc_read_time) from [<c02ac3f8>] (rtc_read_time+0x2c/0x4c)
[<c02ac3f8>] (rtc_read_time) from [<c02acc54>] (__rtc_read_alarm+0x28/0x358)
[<c02acc54>] (__rtc_read_alarm) from [<c02abd80>] (__rtc_register_device+0x124/0x2ec)
[<c02abd80>] (__rtc_register_device) from [<c02b028c>] (ep93xx_rtc_probe+0xa4/0xac)
[<c02b028c>] (ep93xx_rtc_probe) from [<c026424c>] (platform_drv_probe+0x24/0x5c)
[<c026424c>] (platform_drv_probe) from [<c0262918>] (really_probe+0x218/0x374)
[<c0262918>] (really_probe) from [<c0262da0>] (device_driver_attach+0x44/0x60)
[<c0262da0>] (device_driver_attach) from [<c0262e70>] (__driver_attach+0xb4/0xc0)
[<c0262e70>] (__driver_attach) from [<c0260d44>] (bus_for_each_dev+0x68/0xac)
[<c0260d44>] (bus_for_each_dev) from [<c026223c>] (driver_attach+0x18/0x24)
[<c026223c>] (driver_attach) from [<c0261dd8>] (bus_add_driver+0x150/0x1b4)
[<c0261dd8>] (bus_add_driver) from [<c026342c>] (driver_register+0xb0/0xf4)
[<c026342c>] (driver_register) from [<c0264210>] (__platform_driver_register+0x30/0x48)
[<c0264210>] (__platform_driver_register) from [<c04cb9ac>] (ep93xx_rtc_driver_init+0x10/0x1c)
[<c04cb9ac>] (ep93xx_rtc_driver_init) from [<c000973c>] (do_one_initcall+0x7c/0x1c0)
[<c000973c>] (do_one_initcall) from [<c04b9ecc>] (kernel_init_freeable+0x168/0x1ac)
[<c04b9ecc>] (kernel_init_freeable) from [<c03b2228>] (kernel_init+0x8/0xf4)
[<c03b2228>] (kernel_init) from [<c00082c0>] (ret_from_fork+0x14/0x34)
Exception stack(0xc441dfb0 to 0xc441dff8)
dfa0:                                     00000000 00000000 00000000 00000000
dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
Code: e12fff1e e92d4010 e590303c e1a02001 (e5933000)
---[ end trace c914d6030eaa95c8 ]---

Fixes: b809d19 ("rtc: ep93xx: stop setting platform_data")
Signed-off-by: Nikita Shubin <nikita.shubin@maquefel.me>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201201095507.10317-1-nikita.shubin@maquefel.me
digetx pushed a commit that referenced this pull request Dec 4, 2020
WARNING: Unknown commit id '70e806e4e645', maybe rebased or not pulled?
#4:
Since commit 70e806e ("mm: Do early cow for pinned pages during

WARNING: Unknown commit id 'f3c64eda3e50', maybe rebased or not pulled?
#34:
problems (and was missing a barrrier), see commit f3c64ed ("mm: avoid

WARNING: Unknown commit id 'f3c64eda3e50', maybe rebased or not pulled?
#46:
Fixes: f3c64ed ("mm: avoid early COW write protect games during fork()")

WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
#136: FILE: mm/gup.c:2687:
+	unsigned seq;

WARNING: Block comments should align the * on each line
#191: FILE: mm/memory.c:1180:
+		 * lockdep complaining about preemptibility.
+		*/

total: 0 errors, 5 warnings, 109 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-prevent-gup_fast-from-racing-with-cow-during-fork.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Dec 4, 2020
A current "lazy drain" model suffers from at least two issues.

First one is related to the unsorted list of vmap areas, thus in order to
identify the [min:max] range of areas to be drained, it requires a full
list scan.  What is a time consuming if the list is too long.

Second one and as a next step is about merging all fragments with a free
space.  What is also a time consuming because it has to iterate over
entire list which holds outstanding lazy areas.

See below the "preemptirqsoff" tracer that illustrates a high latency.  It
is ~24676us.  Our workloads like audio and video are effected by such long
latency:

<snip>
  tracer: preemptirqsoff

  preemptirqsoff latency trace v1.1.5 on 4.9.186-perf+
  --------------------------------------------------------------------
  latency: 24676 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 P:8)
     -----------------
     | task: crtc_commit:112-261 (uid:0 nice:0 policy:1 rt_prio:16)
     -----------------
   => started at: __purge_vmap_area_lazy
   => ended at:   __purge_vmap_area_lazy

                   _------=> CPU#
                  / _-----=> irqs-off
                 | / _----=> need-resched
                 || / _---=> hardirq/softirq
                 ||| / _--=> preempt-depth
                 |||| /     delay
   cmd     pid   ||||| time  |   caller
      \   /      |||||  \    |   /
crtc_com-261     1...1    1us*: _raw_spin_lock <-__purge_vmap_area_lazy
[...]
crtc_com-261     1...1 24675us : _raw_spin_unlock <-__purge_vmap_area_lazy
crtc_com-261     1...1 24677us : trace_preempt_on <-__purge_vmap_area_lazy
crtc_com-261     1...1 24683us : <stack trace>
 => free_vmap_area_noflush
 => remove_vm_area
 => __vunmap
 => vfree
 => drm_property_free_blob
 => drm_mode_object_unreference
 => drm_property_unreference_blob
 => __drm_atomic_helper_crtc_destroy_state
 => sde_crtc_destroy_state
 => drm_atomic_state_default_clear
 => drm_atomic_state_clear
 => drm_atomic_state_free
 => complete_commit
 => _msm_drm_commit_work_cb
 => kthread_worker_fn
 => kthread
 => ret_from_fork
<snip>

To address those two issues we can redesign a purging of the outstanding
lazy areas.  Instead of queuing vmap areas to the list, we replace it by
the separate rb-tree.  In hat case an area is located in the tree/list in
ascending order.  It will give us below advantages:

a) Outstanding vmap areas are merged creating bigger coalesced blocks,
   thus it becomes less fragmented.

b) It is possible to calculate a flush range [min:max] without scanning
   all elements.  It is O(1) access time or complexity;

c) The final merge of areas with the rb-tree that represents a free
   space is faster because of (a).  As a result the lock contention is
   also reduced.

Link: https://lkml.kernel.org/r/20201116220033.1837-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: huang ying <huang.ying.caritas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Dec 7, 2020
WARNING: Unknown commit id '70e806e4e645', maybe rebased or not pulled?
#4:
Since commit 70e806e ("mm: Do early cow for pinned pages during

WARNING: Unknown commit id 'f3c64eda3e50', maybe rebased or not pulled?
#34:
problems (and was missing a barrrier), see commit f3c64ed ("mm: avoid

WARNING: Unknown commit id 'f3c64eda3e50', maybe rebased or not pulled?
#46:
Fixes: f3c64ed ("mm: avoid early COW write protect games during fork()")

WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
#136: FILE: mm/gup.c:2687:
+	unsigned seq;

WARNING: Block comments should align the * on each line
#191: FILE: mm/memory.c:1180:
+		 * lockdep complaining about preemptibility.
+		*/

total: 0 errors, 5 warnings, 109 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-prevent-gup_fast-from-racing-with-cow-during-fork.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Dec 7, 2020
A current "lazy drain" model suffers from at least two issues.

First one is related to the unsorted list of vmap areas, thus in order to
identify the [min:max] range of areas to be drained, it requires a full
list scan.  What is a time consuming if the list is too long.

Second one and as a next step is about merging all fragments with a free
space.  What is also a time consuming because it has to iterate over
entire list which holds outstanding lazy areas.

See below the "preemptirqsoff" tracer that illustrates a high latency.  It
is ~24676us.  Our workloads like audio and video are effected by such long
latency:

<snip>
  tracer: preemptirqsoff

  preemptirqsoff latency trace v1.1.5 on 4.9.186-perf+
  --------------------------------------------------------------------
  latency: 24676 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 P:8)
     -----------------
     | task: crtc_commit:112-261 (uid:0 nice:0 policy:1 rt_prio:16)
     -----------------
   => started at: __purge_vmap_area_lazy
   => ended at:   __purge_vmap_area_lazy

                   _------=> CPU#
                  / _-----=> irqs-off
                 | / _----=> need-resched
                 || / _---=> hardirq/softirq
                 ||| / _--=> preempt-depth
                 |||| /     delay
   cmd     pid   ||||| time  |   caller
      \   /      |||||  \    |   /
crtc_com-261     1...1    1us*: _raw_spin_lock <-__purge_vmap_area_lazy
[...]
crtc_com-261     1...1 24675us : _raw_spin_unlock <-__purge_vmap_area_lazy
crtc_com-261     1...1 24677us : trace_preempt_on <-__purge_vmap_area_lazy
crtc_com-261     1...1 24683us : <stack trace>
 => free_vmap_area_noflush
 => remove_vm_area
 => __vunmap
 => vfree
 => drm_property_free_blob
 => drm_mode_object_unreference
 => drm_property_unreference_blob
 => __drm_atomic_helper_crtc_destroy_state
 => sde_crtc_destroy_state
 => drm_atomic_state_default_clear
 => drm_atomic_state_clear
 => drm_atomic_state_free
 => complete_commit
 => _msm_drm_commit_work_cb
 => kthread_worker_fn
 => kthread
 => ret_from_fork
<snip>

To address those two issues we can redesign a purging of the outstanding
lazy areas.  Instead of queuing vmap areas to the list, we replace it by
the separate rb-tree.  In hat case an area is located in the tree/list in
ascending order.  It will give us below advantages:

a) Outstanding vmap areas are merged creating bigger coalesced blocks,
   thus it becomes less fragmented.

b) It is possible to calculate a flush range [min:max] without scanning
   all elements.  It is O(1) access time or complexity;

c) The final merge of areas with the rb-tree that represents a free
   space is faster because of (a).  As a result the lock contention is
   also reduced.

Link: https://lkml.kernel.org/r/20201116220033.1837-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: huang ying <huang.ying.caritas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
digetx pushed a commit that referenced this pull request Dec 8, 2020
Ido Schimmel says:

====================
mlxsw: Misc updates

This patchset contains miscellaneous patches we gathered in our queue.
Some of them are dependencies of larger patchsets that I will submit
later this cycle.

Patches #1-#3 perform small non-functional changes in mlxsw.

Patch #4 adds more extended ack messages in mlxsw.

Patch #5 adds devlink parameters documentation for mlxsw. To be extended
with more parameters this cycle.

Patches #6-#7 perform small changes in forwarding selftests
infrastructure.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant