Kernel Tracing (ebpf)
Skill Level: Advanced
1. Overview
This exercise requires the SECONDARY terminal and conflicts with any other exercise that uses the secondary terminal. Do not proceed if the secondary terminal is already in use with another exercise. |
This unit deals with system performance observability.
eBPF (The Extended Berkeley Packet Filter) is an in-kernel virtual machine that allows code execution in the kernel space, in a restricted sandbox environment with access to a limited set of functions. There are numerous components shipped by Red Hat that utilize the eBPF virtual machine. Each component is in a different development phase, and thus not all components are currently fully supported. Those that are not supported are available as a Technology Preview and are not intended for production use. For information on Red Hat scope of support for Technology Preview features, see: Technology Preview Features Support Scope.
In this unit, we will be focusing on the kernel tracing offering shipped as bcc-tools (BPF Compiler Collection Tools). We will also touch on bpftrace which is a tool to compile scripts to BPF-bytecode.
2. Getting Started
For these exercises, you will primarily be using the host node2
as user root
. However, these exercises do REQUIRE a second terminal session to run commands on other hosts. Please pay careful attention to what commands to run on different hosts.
Now is a great time to use the multi session capability of tmux. Use Ctrl-b " to create another session with a split screen. Cycle the active session back and forth with CTRL-b n (next) and CTRL-b p (previous).
|
Primary Terminal (node2)
From host bastion
, ssh to node2
.
ssh node2
Use sudo
to elevate your privileges.
[[ "$UID" == 0 ]] || sudo -i
Verify that you are on the right host for these exercises.
workshop-ebpf-checkhost.sh
Secondary Terminal (bastion)
You’ll need your second terminal for these exercises. So just keep your eyes on the instructions for what to run where.
You are now ready to proceed with these exercises.
3. Core Concepts
There are just over 100 tools shipped in the bcc-tools package. A few things of note:
-
All of these tools live in
/usr/share/bcc/tools
. -
These tools must run as the root user as any eBPF program can read kernel data. As such, injecting eBPF bytecode as a regular user is not allowed in RHEL 10.
-
Each tool has a man page. To view the man page, run
man <tool name>
. These man pages include descriptions of the tools, provide the options that can be called and have information on the expected overhead of the specific tool.
3.1. Installation
Primary Terminal (node2)
Start by installing the BPF Compiler Collection (bcc-tools) and kernel-devel packages for the installed kernel:
dnf install bcc-tools bpftrace strace kernel-devel-$(uname -r) -y
Now, we have a lot of interesting tools installed in /usr/share/bcc/tools along with accompanying man pages.
4. Exercise: execsnoop
The execsnoop
script monitors all calls to execve() and catches all processes that follow the fork→exec sequence, as well as processes that re-exec() themselves. Processes that fork() but do not exec() won’t be caught by this script.
Primary Terminal (node2)
Continuing on host node2
, to run execsnoop
as follows:
/usr/share/bcc/tools/execsnoop
PCOMM PID PPID RET ARGS
It is important to wait until the tool is setup. Do not proceed until you see the output headers (displayed above) appear. |
Secondary Terminal (bastion)
In your second terminal having established yourself as root, run the following command:
ssh node2 "sudo workshop-ebpf-rootkit.sh"
Again, this command should run for about 10 seconds and then exit. Before porceeding to the results you can
either hit Ctrl-C
in the primary terminal OR send a sigkill to the process to terminate execsnoop.
ssh node2 "sudo pkill execsnoop"
Results
In the primary terminal, you should see output similar to:
PCOMM PID PPID RET ARGS sshd 28512 749 0 /usr/sbin/sshd -D -oCiphers=aes256-gcm@openssh.com,chacha20-poly1305@openssh.com,aes256-ctr,aes256-cbc,aes128-gcm@openssh.com,aes128-ctr,aes128-cb -oMACs=hmac-sha2-256-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha2- -oGSSAPIKexAlgorithms=gss-gex-sha1-,gss-group14-sha1- -oKexAlgorithms=curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-excha -oHostKeyAlgorithms=rsa-sha2-256,ecdsa-sha2-nistp256,ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384,ecdsa-sha2-nis -oPubkeyAcceptedKeyTypes=rsa-sha2-256,ecdsa-sha2-nistp256,ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384,ecdsa-sha -R unix_chkpwd 28514 28512 0 /usr/sbin/unix_chkpwd root chkexpiry bash 28516 28515 0 /bin/bash -c workshop-ebpf-rootkit.sh grepconf.sh 28517 28516 0 /usr/libexec/grepconf.sh -c grep 28518 28517 0 /usr/bin/grep -qsi ^COLOR.*none /etc/GREP_COLORS grepconf.sh 28519 28516 0 /usr/libexec/grepconf.sh -c grep 28520 28519 0 /usr/bin/grep -qsi ^COLOR.*none /etc/GREP_COLORS grepconf.sh 28521 28516 0 /usr/libexec/grepconf.sh -c grep 28522 28521 0 /usr/bin/grep -qsi ^COLOR.*none /etc/GREP_COLORS sed 28524 28523 0 /usr/bin/sed -r -e s/^[[:blank:]]*([[:upper:]_]+)=([[:print:][:digit:]\._-]+|"[[:print:][:digit:]\._-]+")/export \1=\2/;t;d /etc/locale.conf uname 28525 28516 0 /usr/bin/uname -a sleep 28526 28516 0 /usr/bin/sleep 1 who 28527 28516 0 /usr/bin/who sleep 28528 28516 0 /usr/bin/sleep 1 grep 28530 28516 0 /usr/bin/grep root /etc/passwd sleep 28531 28516 0 /usr/bin/sleep 1 grep 28532 28516 0 /usr/bin/grep root /etc/shadow sleep 28533 28516 0 /usr/bin/sleep 1 cat 28534 28516 0 /usr/bin/cat /etc/fstab sleep 28535 28516 0 /usr/bin/sleep 1 ps 28536 28516 0 /usr/bin/ps -ef sleep 28537 28516 0 /usr/bin/sleep 1 netstat 28538 28516 0 /usr/bin/netstat -tulpn sleep 28539 28516 0 /usr/bin/sleep 1 getenforce 28540 28516 0 /usr/sbin/getenforce sleep 28541 28516 0 /usr/bin/sleep 1 firewall-cmd 28542 28516 0 /usr/bin/firewall-cmd --state
This shows you all the processes that ran exec() during that ssh login, their PID, their parent PID, their return code, and the arguments that were sent to the process. You could keep monitoring this for quite some time to catch potential bad actors on the system.
5. Exercise: mountsnoop
Similar in nature to execsnoop
, mountsnoop
traces the mount() and umount() syscalls which show processes that are attempting to mount (or unmount) filesystems.
Primary Terminal (node2)
To run this tool, execute the following and give it a couple of seconds to get set up:
/usr/share/bcc/tools/mountsnoop
COMM PID TID MNT_NS CALL
It is important to wait until the tool is setup. Do not proceed until you see the output headers (displayed above) appear. |
Secondary Terminal (bastion)
In your second terminal, having established yourself as root, let’s try to unmount something we know can NOT be unmounted. For this, we’ll pick the root filesystem '/'.
ssh node2 "sudo workshop-ebpf-unmountroot.sh"
umount: /: target is busy.
Results
Taking a look at the terminal running mountsnoop
, we see:
umount 20001 20001 4026531840 umount("/", 0x0) = -EBUSY
This shows us that the mount is busy and cannot be unmounted.
Secondary Terminal (bastion)
Now let’s try to unmount a filesystem that we should be able to unmount. But before doing so, look at the mount options to ensure we can restore it correctly. On node2
run the following:
ssh node2 "grep /dev/shm /proc/mounts"
tmpfs /dev/shm tmpfs rw,seclabel,nosuid,nodev,relatime 0 0
Now proceed to umount /dev/shm
on node2
ssh node2 "sudo workshop-ebpf-unmountshm.sh"
Results
Back to the primary terminal and you should see the following:
umount 20003 20003 4026531840 umount("/dev/shm", 0x0) = 0
The umount command succeeded.
Secondary Terminal (bastion)
Proceed to restore the /dev/shm mount as follows:
ssh node2 "sudo workshop-ebpf-mountshm.sh"
This is the last step related to mountsnoop. Before proceeding to the results you can
either hit Ctrl-C
in the primary terminal OR send a sigkill to the process to terminate mountsnoop.
ssh node2 "sudo pkill mountsnoop"
Results
Finally, back to the primary terminal you should see the following:
mount 48024 48024 1832832 mount("tmpfs", "/dev/shm", "tmpfs", MS_NOSUID|MS_NODEV, "seclabel,inode64") = 0
This shows us that the mount succeeded and all the options that were passed into the system call.
As you can see, the mountsnoop
tool is very useful for seeing what processes are calling the mount and umount system calls and what the results of those calls are.
6. Exercise: cachestat
The cachestat
tool traces kernel page cache functions and prints a summary every five seconds to aid you in workload characterization.
Primary Terminal (node2)
To run this tool, execute the following and give it a couple of seconds to get set up:
/usr/share/bcc/tools/cachestat 5
TOTAL MISSES HITS DIRTIES BUFFERS_MB CACHED_MB
It is important to wait until the tool is setup. Do not proceed until you see the output headers (displayed above) appear. |
Secondary Terminal (bastion)
Let’s now run a little exercise that will generate some i/o. This script
flushes the cache and then runs a series of dd
commands.
ssh node2 "sudo workshop-ebpf-cachestat.sh"
This command should run for a few seconds and then exit. Before porceeding to the results you can
either hit Ctrl-C
in the primary terminal OR send a sigkill to the process to terminate cachestat.
ssh node2 "sudo pkill cachestat"
Results
In the cachestat
window, you should output similar to:
HITS MISSES DIRTIES HITRATIO BUFFERS_MB CACHED_MB 16016 1216 0 92.94% 0 266 8891 0 0 100.00% 0 322 0 0 0 0.00% 0 322
This shows that we had 1216 page cache misses during a five second period while running the above loop but, during that same period, there were 16016 hits, indicating great performance from the page cache.
7. Exercise: bpftrace
This tool is a swiss army knife allowing you to specify functions to trace and messages to be printed when certain conditions are met.
bpftrace makes use of BCC for interacting with the Linux BPF system as well as existing Linux tracing capabilities:
-
kernel dynamic tracing (kprobes)
-
user-level dynamic tracing (uprobes)
-
tracepoints
Let us start this educational journey by examining what system calls a simple command executes.
Primary Terminal (node2)
To begin, we need a test file to use as our target.
touch /tmp/workshop.tmp
The command we will evaluate is cat /tmp/workshop.tmp
and we will use strace to list the system calls
made and with what frequency (count).
strace -c cat /tmp/workshop.tmp
% time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 28.86 0.000267 267 1 execve 28.76 0.000266 8 31 13 openat 12.97 0.000120 5 22 mmap 9.19 0.000085 4 20 close 7.46 0.000069 3 19 newfstatat 3.24 0.000030 7 4 mprotect 2.38 0.000022 11 2 munmap 1.51 0.000014 3 4 read 1.30 0.000012 3 4 pread64 0.97 0.000009 3 3 brk 0.86 0.000008 8 1 1 access 0.54 0.000005 2 2 1 arch_prctl 0.54 0.000005 5 1 getrandom 0.32 0.000003 3 1 futex 0.32 0.000003 3 1 fadvise64 0.32 0.000003 3 1 prlimit64 0.22 0.000002 2 1 set_tid_address 0.22 0.000002 2 1 set_robust_list ------ ----------- ----------- --------- --------- ---------------- 100.00 0.000925 7 119 15 total
What we see is that our command used the 'openat' systemcall 31 times. Among those calls, 13 resulted in errors but are likely harmless attempts to open files that do not exist.
7.1. bpftrace: count all syscalls
Next, let’s use bpftrace to create a similar output of counted system calls. We will use a wildcarded tracepoint 'sys_enter_*' to get all the relevant syscalls.
Primary Terminal (node2)
bpftrace -e 'tracepoint:syscalls:sys_enter_* { if (comm == "cat") { @[probe] = count(); } }' -c "cat /tmp/workshop.tmp"
Attaching 337 probes... @[tracepoint:syscalls:sys_enter_access]: 1 @[tracepoint:syscalls:sys_enter_futex]: 1 @[tracepoint:syscalls:sys_enter_set_tid_address]: 1 @[tracepoint:syscalls:sys_enter_set_robust_list]: 1 @[tracepoint:syscalls:sys_enter_prlimit64]: 1 @[tracepoint:syscalls:sys_enter_fadvise64]: 1 @[tracepoint:syscalls:sys_enter_getrandom]: 1 @[tracepoint:syscalls:sys_enter_exit_group]: 1 @[tracepoint:syscalls:sys_enter_munmap]: 2 @[tracepoint:syscalls:sys_enter_arch_prctl]: 2 @[tracepoint:syscalls:sys_enter_brk]: 3 @[tracepoint:syscalls:sys_enter_mprotect]: 4 @[tracepoint:syscalls:sys_enter_pread64]: 4 @[tracepoint:syscalls:sys_enter_read]: 4 @[tracepoint:syscalls:sys_enter_newfstatat]: 19 @[tracepoint:syscalls:sys_enter_close]: 20 @[tracepoint:syscalls:sys_enter_mmap]: 22 @[tracepoint:syscalls:sys_enter_openat]: 31
This should line up almost 1-to-1 with the previous output from strace, note that 'sys_enter_openat' is 31.
7.2. bpftrace: count specific syscall
Now we are going to limit the probe to only show 'sys_enter_openat' thereby only setting 1 probe and collecting a count.
Primary Terminal (node2)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { if (comm == "cat") { @[probe] = count(); } }' -c "cat /tmp/workshop.tmp"
Attaching 1 probe... @[tracepoint:syscalls:sys_enter_openat]: 31
7.3. bpftrace: set probe on specific file
For this last exercise we will utilize both terminal sessions again. |
The objective is to set a probe and identify any process that opens our targeted file.
-
-e [PROBE CODE]
-
tracepoint:syscalls:sys_enter_openat: is the tracepoint probe type (kernel static tracing)
-
conditional if() to evaluate args→filename for a match
-
print() the desired output
-
typical pid, uid, gid fields
-
comm is the process name
-
args→filename is the filename being accessed
-
-
-
-c [COMMAND] to run command after probes are set and exit
Primary Terminal (node2)
Proceed to set the probe in the primary terminal.
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { if (str(args->filename) == "/tmp/workshop.tmp") { printf( "%d %d %d %s %s\n", pid, uid, gid, comm, str(args->filename)); }}'
Secondary Terminal (bastion)
To generate some output, you will need to utilize the second terminal and open the target with a couple of different tools. From the bastion host, run the following:
ssh node2 "cat /tmp/workshop.tmp"
ssh node2 "grep something /tmp/workshop.tmp"
ssh node2 "echo < /tmp/workshop.tmp"
This is the last step related to bpftrace. Before porceeding to the results you can
either hit Ctrl-C
in the primary terminal OR send a sigkill to the process to terminate bpftrace.
ssh node2 "sudo pkill bpftrace"
Results
Finally back to the primary terminal you should have seen some output like this:
Attaching 1 probe... 2741 0 0 cat /tmp/workshop.tmp 2757 0 0 grep /tmp/workshop.tmp 2772 0 0 bash /tmp/workshop.tmp
This displays the process-id along with the user and group ids of programs that tried to access our file.
8. Conclusion
This concludes the exercises related to EBPF.
Time to finish this unit and return the shell to its home position.
workshop-finish-exercise.sh