Previously, the ps command will iterate over all threads which
have the same tgid, to accumulate their rss value, in order to
get a thread/process's final rss value as part of the final output.
For non-live systems, the rss accumulation values are identical for
threads which have the same tgid, so there is no need to do the
iteration and accumulation repeatly, thus a lot of readmem calls are
skipped. Otherwise it will be the performance bottleneck if the
vmcores have a large number of threads.
In this patch, the rss accumulation value will be stored in a cache,
next time a thread with the same tgid will take it directly without
the iteration.
For example, we can monitor the performance issue when a vmcore has
~65k processes, most of which are threads for several specific
processes. Without the patch, it will take ~7h for ps command
to finish. With the patch, ps command will finish in 1min.
Signed-off-by: Tao Liu <ltao@redhat.com>
With some vmlinux e.g. RHEL9 ones, the first execution of the gdb ptype
command heavily consumes memory and time. The "ps -t" option uses it in
start_time_timespec(), and it can be replaced with the crash macros.
This can reduce about 1.4 GB memory and 6 seconds time comsumption in
the following test:
$ echo "ps -t" | time crash vmlinux vmcore
Without the patch:
11.60user 0.43system 0:11.94elapsed 100%CPU (0avgtext+0avgdata 1837964maxresident)k
0inputs+400outputs (0major+413636minor)pagefaults 0swaps
With the patch:
5.40user 0.16system 0:05.46elapsed 101%CPU (0avgtext+0avgdata 417896maxresident)k
0inputs+384outputs (0major+41528minor)pagefaults 0swaps
Although the ptype command and similar ones cannot be fully removed,
but removing some of them will make the use of crash safer, especially
for an automatic crash reporter.
Signed-off-by: Kazuhito Hagio <k-hagio-ab@nec.com>
The "boot_date" is initialized conditionally in the cmd_log(), which may
display incorrect "boot_date" value with the following command before
running the "log -T" command:
crash> help -k | grep date
date: Wed Dec 22 13:39:29 IST 2021
boot_date: Thu Jan 1 05:30:00 IST 1970
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The calculation of "boot_date" depends on the HZ value, and the HZ will
be calculated in task_init() at the latest, so let's move it here.
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Kernel commit 3e9a99eba058 ("block/mq-deadline: Rename dd_init_queue()
and dd_exit_queue()") renamed dd_init_queue to dd_init_sched. Without
the patch, the 'help -m' may print incorrect hz value as follows:
crash> help -m | grep hz
hz: 1000 <---The correct hz value on ppc64le machine is 100.
^^^^
Fixes: b93027ce5c ("Add alternate HZ calculation using write_expire")
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
The "bt -v" command prints incorrect stack-end address when the
"CONFIG_THREAD_INFO_IN_TASK=y" is enabled in kernel, the "bt -v"
command output shows that the value stored at 0xffff8dee0312c198
is 0xffffffffc076400a, however, the value stored actually at
0xffff8dee0312c198 is NULL(0x0000000000000000), the stack-end
address is incorrect.
Without the patch:
crash> bt -v
PID: 28642 TASK: ffff8dee0312c180 CPU: 0 COMMAND: "insmod"
possible stack overflow: ffff8dee0312c198: ffffffffc076400a != STACK_END_MAGIC
^^^^^^^^^^^^^^^^
crash> rd 0xffff8dee0312c198
ffff8dee0312c198: 0000000000000000 ........
^^^^^^^^^^^^^^^^
With the patch:
crash> bt -v
PID: 28642 TASK: ffff8dee0312c180 CPU: 0 COMMAND: "insmod"
possible stack overflow: ffff991340bc0000: ffffffffc076400a != STACK_END_MAGIC
crash> rd 0xffff991340bc0000
ffff991340bc0000: ffffffffc076400a .@v.....
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Kernel commit bcf9033e5449bdcaa9bed46467a7141a8049dadb
("sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y")
moved the member cpu of task_struct back into thread_info.
Without the patch, crash fails with the following error message
during session initialization:
crash: invalid structure member offset: task_struct_cpu
FILE: task.c LINE: 2904 FUNCTION: add_context()
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Kazuhito Hagio <k-hagio-ab@nec.com>
Kernel commit 2f064a59a11ff9bc22e52e9678bc601404c7cb34 ("sched: Change
task_struct::state") renamed the member state of task_struct to __state
and its type changed from long to unsigned int. Without the patch,
crash fails to start up with the following error:
crash: invalid structure member offset: task_struct_state
FILE: task.c LINE: 5929 FUNCTION: task_state()
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Linux 4.8 and later kernels that contain kernel commit 9a7f38c42c2b
("cfq-iosched: Convert from jiffies to nanoseconds") changed the
definition of cfq_slice_async, and it cannot be used to calculate
the HZ value.
Add alternate HZ calculation using write_expire, which depends on
CONFIG_MQ_IOSCHED_DEADLINE or CONFIG_IOSCHED_DEADLINE.
Signed-off-by: Kazuhito Hagio <k-hagio-ab@nec.com>
In task_init() there is a calculation of machdep->hz based on the
value of cfq_slice_async if the symbol exists.
However, kernel commit 9a7f38c42c2b ("cfq-iosched: Convert from
jiffies to nanoseconds") changed the definition of cfq_slice_async
from (HZ / 25) to (NSEC_PER_SEC / 25). As such, the HZ calculation
will result in a value of 1000000000 for machdep->hz. In vmcores
where the symbol exists and has this definition, this causes incorrect
results for some calculations. For a couple of 4.12 vmcores in this
situation crash shows the uptime as 3 seconds, which also throws off
the timestamps in "log -T".
Fix this by skipping the code block for Linux 4.8 and above.
Signed-off-by: Martin Moore <martin.moore@hpe.com>
While stkptr_to_task does the job of trying to match a stack pointer
to a task, it runs through each task's stack to find whether the given
SP falls into its range. This can be a very expensive operation, if
the vmcore is from a system running too many tasks. It can get even
worse when the total number of CPUs on the system is in the order of
thousands. Given the expensive nature of the operation, it must be
optimized as much as possible. Possible options to optimize:
1) Get min & max of the stack range in first pass and use these
values against the given SP to decide whether or not to proceed
with stack lookup.
2) Use multithreading to parallely update irq_tasks.
3) Skip stkptr_to_task() when SP is 0
Though option 3 is a low hanging fruit, it significantly improved the
time taken between starting crash utility & reaching crash prompt.
Implement option 3 to optimize while listing the other two options
as TODO items for follow-up.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
determined by the normal means, scan the kernel log buffer for panic
keywords, and if found, generate the panic task from the CPU number
that is specified following the panic message.
(chenqiwu@xiaomi.com)
facility while allowing subsystems to continue to use radix trees.
Without the patch, the crash session fails during initialization
with the message "crash: xarray facility does not exist or has
changed its format".
(anderson@redhat.com)
"XArray: Remove radix tree compatibility", changes the definition
of "radix_tree_root" back to be a struct. However, the content of
the new structure differs from the original structure, so without
the patch, current linux-next kernels fail during initialization
with the error message "radix trees do not exist or have changed
their format". Because the new "radix_tree_root" and "xarray"
structures have nearly the same layout, the existing functionality
for XArrays can be reused.
(prudo@linux.ibm.com)
2000. Without the patch, the command ultimately fails with a
dump of the internal buffer allocation stats, followed by the
message "ps: cannot allocate any more memory!".
(anderson@redhat.com)
the radix_tree_pair and xarray_pair structures into a unified
list_pair structure that is used by both facilities, and fixes the
"bpf" command. Without the patch, the command fails with the error
message "bpf: radix trees do not exist or have changed their format".
(anderson@redhat.com)
switch-over of PID handling from a radix tree to an XArray in Linux
4.20 and later kernels. Without the patch, the crash session fails
during session initialization with the message "crash: radix trees
do not exist or have changed their format".
(asmadeus@codewreck.org, anderson@redhat.com)
similar to that of radix trees, but introduces completely separate
functions, structures and #defines. None of the applicable radix
tree users in the crash utility have been switched over, so this
phase does not introduce any functional changes.
(asmadeus@codewreck.org, anderson@redhat.com)
2c4704756cab7cfa031ada4dab361562f0e357c0, titled "pids: Move the pgrp
and session pid pointers from task_struct to signal_struct". Without
the patch, the crash session fails during initialization with the
message "crash: invalid structure member offset: task_struct_pids".
(anderson@redhat.com)
new process states, "ID" for the TASK_IDLE macro introduced in
Linux 4.2, and "NE" for the TASK_NEW bit introduced in Linux 4.8.
(k-hagio@ab.jp.nec.com)
that contain commit d8bff643d81a58181356c0aa3ab771ac10da6894,
titled "[x86] asm: Make sure verify_cpu() has a good stack", which
inadvertently breaks the ppc64/ppc64le kernel stack size calculation
when running with crash-7.2.2 or later. Without the patch, "bt" may
fail with a filtered kdump dumpfile with the two error messages
"bt: page excluded: kernel virtual address: <address> type: stack
contents" and "bt: read of stack at <address> failed".
(anderson@redhat.com)
"prctl(PR_SET_MM, ...)" to self-modify its memory map such
that the stack locations of its command line arguments and
environment variables such are not contiguous. Without the
patch, the command may fail with a dump of the crash utility's
internal buffer usage statistics followed by "ps: cannot allocate
any more memory!".
(k-hagio@ab.jp.nec.com)
when building with "make warn". The warnings are all false alarm
messages of type [-Wformat-overflow=], [-Wformat-truncation=] and
[-Wstringop-truncation]; the affected files are extensions.c, task.c,
kernel.c, memory.c, remote.c, symbols.c, filesys.c and xen_hyper.c.
(anderson@redhat.com)
task_struct.rlim or signal_struct.rlim array in the internal
array_table[]. Without the patch, the length of the array
is determined by a call to the embedded gdb module for each
task, and as a result, the command takes a minute or more
per 1000 tasks. With the patch applied, it only takes about
0.5 seconds per 1000 tasks.
(k-hagio@ab.jp.nec.com)
time when analyzing dumpfiles/systems with extremely large task
counts. For example, running with a dumpfile containing over a
million tasks, startup time and "ps" processing time was reduced
from 45 minutes to less then 40 seconds.
(gthelen@google.com)
"thread_union" data structure is not contained in the vmlinux file's
debuginfo data. Without the patch, the kernel stack size is not
calculated correctly, and defaults to 8K. As a result "bt" fails
with the message "bt: invalid RSP: <address> bt->stackbase/stacktop:
<address>/<address> cpu: <number>".
(efault@gmx.de)
whose kernel stacks have been freed/detached. Without the patch,
the "bt" command indicates "bt: invalid kernel virtual address: 0
type: stack contents" and "bt: read of stack at 0 failed"; it will
be changed to display "(no stack)". The "ps -s" option would fail
prematurely upon reaching such a task, indicating "ps: invalid kernel
virtual address: 0 type: stack contents" and "ps: read of stack at 0
failed".
(anderson@redhat.com)
commit e8cfbc245e24887e3c30235f71e9e9405e0cfc39, titled "pid: remove
pidhash". The kernel's traditional usage of a pid_hash[] array to
store PIDs has been replaced by an IDR radix tree, requiring a new
crash plug-in function to gather the system's task set. Without the
patch, the crash session fails during initialization with the error
message "crash: cannot resolve init_task_union".
(anderson@redhat.com)
introduced above. Without the patch, the -l option generates a
segmentation violation if not accompanied by a -C cpu specifier
option.
(vinayakm.list@gmail.com)
contain commit cd9e61ed1eebbcd5dfad59475d41ec58d9b64b6a, titled
"rbtree: cache leftmost node internally". Without the patch,
the command fails with the error message "runq: invalid structure
member offset: cfs_rq_rb_leftmost".
(anderson@redhat.com)
where the task_struct contains a "randomized_struct_fields_start" to
"randomized_struct_fields_end" section. Without the patch, a member
argument that is inside the randomized section is not found.
(anderson@redhat.com)
in the situation where the task was running on its soft IRQ stack,
took a hard IRQ, and then the system crashed while it was running on
its hard IRQ stack.
(hirofumi@mail.parknet.co.jp)
commit 198d208df4371734ac4728f69cb585c284d20a15, titled "x86: Keep
thread_info on thread stack in x86_32". Without the patch, incorrect
addresses of each per-cpu hardirq_stack and softirq_stack were saved
for usage by the "bt" command.
(hirofumi@mail.parknet.co.jp, anderson@redhat.com)
and c65eacbe290b8141554c71b2c94489e73ade8c8d, which have introduced a
new CONFIG_THREAD_INFO_IN_TASK configuration. This configuration
moves each task's thread_info structure from the base of its kernel
stack into its task_struct. Without the patch, the crash session
fails during initialization with the error "crash: invalid structure
member offset: thread_info_cpu".
(anderson@redhat.com)
that are not declared as "char *" types. Change two prior direct
callers of resizebuf() to use RESIZEBUF(), and fix two prior users of
RESIZEBUF() to correctly calculate the need to resize their buffers.
(anderson@redhat.com)
"bt" option, which can be accessed using "bt -o", and where "bt -O"
will toggle the original and optional methods as the default. The
original backtrace method has adopted two changes/features from
the optional method:
(1) ORIG_X0 and SYSCALLNO registers are not displayed in kernel
exception frames.
(2) stackframe entry text locations are modified to be the PC
address of the branch instruction instead of the subsequent
"return" PC address contained in the stackframe link register.
Accordingly, these are the essential differences between the original
and optional methods:
(1) optional: the backtrace will start with the IPI exception frame
located on the process stack.
(2) original: the starting point of backtraces for the active,
non-crashing, tasks, will continue to have crash_save_cpu()
on the IRQ stack as the starting point.
(3) optional: the exception entry stackframe adjusted to be located
farther down in the IRQ stack.
(4) optional: bt -f does not display IRQ stack memory above the
adjusted exception entry stackframe.
(5) optional: may display "(Next exception frame might be wrong)".
(takahiro.akashi@linaro.org, anderson@redhat.com)
attached to it. While kernel threads typically do not have an
mm_struct referencing a user-space virtual address space, they can
either temporarily reference one for a user-space copy operation, or
in the case of KVM "vhost" kernel threads, keep a reference to the
user space of the "quem-kvm" task that created them. Without the
patch, they will be mistaken for user tasks; the "bt" command will
display an invalid kernel-entry exception frame that indicates
"[exception RIP: unknown or invalid address]", the "ps" command
will not enclose the command name with brackets, and the "ps -[uk]"
and "foreach [user|kernel]" options will show the kernel thread as
a user task.
(anderson@redhat.com)
all tasks for evidence of stack overflows. It does so by verifying
the thread_info.task address, ensuring the thread_info.cpu value is
a valid cpu number, and checking the end of the stack for the
STACK_END_MAGIC value.
(anderson@redhat.com)
are specified by the QEMU mem-path argument of a memory-backend-file
object. This allows the running of a live crash session against a
QEMU guest from the host machine. In this example, the /tmp/MEM file
on a QEMU host represents the guest's physical memory:
$ qemu-kvm ...other-options... \
-object memory-backend-file,id=MEM,size=128m,mem-path=/tmp/MEM,share=on \
-numa node,memdev=MEM -m 128
and a live session run can be run against the guest kernel like so:
$ crash <path-to-guest-vmlinux> live:/tmp/MEM@0
By prepending the ramdump image name with "live:", the crash session will
act as if it were running a normal live session.
(oleg@redhat.com)
version supports running against a live kernel. Compressed kdump
support is also here, but the crash dump support for the kernel,
kexec-tools, and makedumpfile is still pending. Initial work was
done by Karl Volz with help from Bob Picco.
(dave.kleikamp@oracle.com)
commit ccbf62d8a284cf181ac28c8e8407dd077d90dd4b, which changed the
task_struct.start_time member from a struct timespec to a u64.
Without the patch, the "RUN TIME" value is nonsensical.
(anderson@redhat.com)
the kernel has dynamically downsized from the size indicated by the
debuginfo data. At this time, only "kmem_cache" and "task_struct"
structures that have been downsized are registered, but others may be
added in the future. If a downsized data structure is passed to gdb
for display, gdb will request a read of the "full" data structure,
which may flow into a memory region that was either filtered by
makedumpfile(8), or perhaps into non-existent memory, thereby killing
the generating command immediately due to a partial read. With this
patch, commands such as "struct" and "task" that reference downsized
data structures will have their reads flagged to return successfully
if partial read error occurs.
(anderson@redhat.com)
Linux 4.2 and later kernels, which contain these commits:
commit 5aaeb5c01c5b6c0be7b7aadbf3ace9f3a4458c3d
x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and
use it on x86
commit 0c8c0f03e3a292e031596484275c14cf39c0ab7a
x86/fpu, sched: Dynamically allocate 'struct fpu'
Without the patch, when running on a filtered kdump dumpfile, it is
possible that error messages like this will be seen when gathering
the tasks running on a system: "crash: page excluded: kernel virtual
address: <task_struct address> type: "fill_task_struct".
(ats-kumagai@wm.jp.nec.com)
"live" s390x dumpfiles created by the VMDUMP, stand-alone dump, or
"virsh dump" facilities, none of which explicitly mark the dumpfile
as a "live dump", run a standard "bt" backtrace on each kernel stack
instead of the text-address-only "bt -t". Without the patch, an
invalid text reference may be found in a task's kernel stack due to
the common zero-based user and kernel virtual address space ranges of
the s390x, causing the task to be mistakenly set as the "PANIC" task.
(holzheu@linux.vnet.ibm.com)
Do not search for a panic task in s390x dumpfiles that are marked
as a "live dump"...
The first part prevented a search of the active tasks; this part
prevents the last-ditch search of all tasks.
(anderson@redhat.com)