Kernel Tracing (ebpf)

Skill Level: Advanced

1. Overview

This exercise requires the SECONDARY terminal and conflicts with any other exercise that uses the secondary terminal. Do not proceed if the secondary terminal is already in use with another exercise.

This unit deals with system performance observability.

eBPF (The Extended Berkeley Packet Filter) is an in-kernel virtual machine that allows code execution in the kernel space, in a restricted sandbox environment with access to a limited set of functions. There are numerous components shipped by Red Hat that utilize the eBPF virtual machine. Each component is in a different development phase, and thus not all components are currently fully supported. Those that are not supported are available as a Technology Preview and are not intended for production use. For information on Red Hat scope of support for Technology Preview features, see: Technology Preview Features Support Scope.

In this unit, we will be focusing on the kernel tracing offering shipped as bcc-tools (BPF Compiler Collection Tools). We will also touch on bpftrace which is a tool to compile scripts to BPF-bytecode.

2. Getting Started

For these exercises, you will primarily be using the host node2 as user root. However, these exercises do REQUIRE a second terminal session to run commands on other hosts. Please pay careful attention to what commands to run on different hosts.

Now is a great time to use the multi session capability of tmux. Use Ctrl-b " to create another session with a split screen. Cycle the active session back and forth with CTRL-b n (next) and CTRL-b p (previous).

Primary Terminal (node2)

From host bastion, ssh to node2.

ssh node2

Use sudo to elevate your privileges.

[[ "$UID" == 0 ]] || sudo -i

Verify that you are on the right host for these exercises.

workshop-ebpf-checkhost.sh

Secondary Terminal (bastion)

You’ll need your second terminal for these exercises. So just keep your eyes on the instructions for what to run where.

You are now ready to proceed with these exercises.

3. Core Concepts

There are just over 100 tools shipped in the bcc-tools package. A few things of note:

All of these tools live in /usr/share/bcc/tools.
These tools must run as the root user as any eBPF program can read kernel data. As such, injecting eBPF bytecode as a regular user is not allowed in RHEL 10.
Each tool has a man page. To view the man page, run man <tool name>. These man pages include descriptions of the tools, provide the options that can be called and have information on the expected overhead of the specific tool.

3.1. Installation

Primary Terminal (node2)

Start by installing the BPF Compiler Collection (bcc-tools) and kernel-devel packages for the installed kernel:

terminal-1

dnf install bcc-tools bpftrace strace kernel-devel-$(uname -r) -y

Now, we have a lot of interesting tools installed in /usr/share/bcc/tools along with accompanying man pages.

4. Exercise: execsnoop

The execsnoop script monitors all calls to execve() and catches all processes that follow the fork→exec sequence, as well as processes that re-exec() themselves. Processes that fork() but do not exec() won’t be caught by this script.

Primary Terminal (node2)

Continuing on host node2, to run execsnoop as follows:

terminal-1

/usr/share/bcc/tools/execsnoop

PCOMM            PID    PPID   RET ARGS

It is important to wait until the tool is setup. Do not proceed until you see the output headers (displayed above) appear.

Secondary Terminal (bastion)

In your second terminal having established yourself as root, run the following command:

terminal-2

ssh node2 "sudo workshop-ebpf-rootkit.sh"

Again, this command should run for about 10 seconds and then exit. Before porceeding to the results you can either hit Ctrl-C in the primary terminal OR send a sigkill to the process to terminate execsnoop.

terminal-2

ssh node2 "sudo pkill execsnoop"

Results

In the primary terminal, you should see output similar to:

PCOMM            PID    PPID   RET ARGS
sshd             28512  749      0 /usr/sbin/sshd -D -oCiphers=aes256-gcm@openssh.com,chacha20-poly1305@openssh.com,aes256-ctr,aes256-cbc,aes128-gcm@openssh.com,aes128-ctr,aes128-cb -oMACs=hmac-sha2-256-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha2- -oGSSAPIKexAlgorithms=gss-gex-sha1-,gss-group14-sha1- -oKexAlgorithms=curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-excha -oHostKeyAlgorithms=rsa-sha2-256,ecdsa-sha2-nistp256,ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384,ecdsa-sha2-nis -oPubkeyAcceptedKeyTypes=rsa-sha2-256,ecdsa-sha2-nistp256,ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384,ecdsa-sha -R
unix_chkpwd      28514  28512    0 /usr/sbin/unix_chkpwd root chkexpiry
bash             28516  28515    0 /bin/bash -c workshop-ebpf-rootkit.sh
grepconf.sh      28517  28516    0 /usr/libexec/grepconf.sh -c
grep             28518  28517    0 /usr/bin/grep -qsi ^COLOR.*none /etc/GREP_COLORS
grepconf.sh      28519  28516    0 /usr/libexec/grepconf.sh -c
grep             28520  28519    0 /usr/bin/grep -qsi ^COLOR.*none /etc/GREP_COLORS
grepconf.sh      28521  28516    0 /usr/libexec/grepconf.sh -c
grep             28522  28521    0 /usr/bin/grep -qsi ^COLOR.*none /etc/GREP_COLORS
sed              28524  28523    0 /usr/bin/sed -r -e s/^[[:blank:]]*([[:upper:]_]+)=([[:print:][:digit:]\._-]+|"[[:print:][:digit:]\._-]+")/export \1=\2/;t;d /etc/locale.conf
uname            28525  28516    0 /usr/bin/uname -a
sleep            28526  28516    0 /usr/bin/sleep 1
who              28527  28516    0 /usr/bin/who
sleep            28528  28516    0 /usr/bin/sleep 1
grep             28530  28516    0 /usr/bin/grep root /etc/passwd
sleep            28531  28516    0 /usr/bin/sleep 1
grep             28532  28516    0 /usr/bin/grep root /etc/shadow
sleep            28533  28516    0 /usr/bin/sleep 1
cat              28534  28516    0 /usr/bin/cat /etc/fstab
sleep            28535  28516    0 /usr/bin/sleep 1
ps               28536  28516    0 /usr/bin/ps -ef
sleep            28537  28516    0 /usr/bin/sleep 1
netstat          28538  28516    0 /usr/bin/netstat -tulpn
sleep            28539  28516    0 /usr/bin/sleep 1
getenforce       28540  28516    0 /usr/sbin/getenforce
sleep            28541  28516    0 /usr/bin/sleep 1
firewall-cmd     28542  28516    0 /usr/bin/firewall-cmd --state

This shows you all the processes that ran exec() during that ssh login, their PID, their parent PID, their return code, and the arguments that were sent to the process. You could keep monitoring this for quite some time to catch potential bad actors on the system.

5. Exercise: mountsnoop

Similar in nature to execsnoop, mountsnoop traces the mount() and umount() syscalls which show processes that are attempting to mount (or unmount) filesystems.

Primary Terminal (node2)

To run this tool, execute the following and give it a couple of seconds to get set up:

terminal-1

/usr/share/bcc/tools/mountsnoop

COMM             PID     TID     MNT_NS      CALL

It is important to wait until the tool is setup. Do not proceed until you see the output headers (displayed above) appear.

Secondary Terminal (bastion)

In your second terminal, having established yourself as root, let’s try to unmount something we know can NOT be unmounted. For this, we’ll pick the root filesystem '/'.

terminal-2

ssh node2 "sudo workshop-ebpf-unmountroot.sh"

umount: /: target is busy.

Results

Taking a look at the terminal running mountsnoop, we see:

umount           20001   20001   4026531840  umount("/", 0x0) = -EBUSY

This shows us that the mount is busy and cannot be unmounted.

Secondary Terminal (bastion)

Now let’s try to unmount a filesystem that we should be able to unmount. But before doing so, look at the mount options to ensure we can restore it correctly. On node2 run the following:

terminal-2

ssh node2 "grep /dev/shm /proc/mounts"

tmpfs /dev/shm tmpfs rw,seclabel,nosuid,nodev,relatime 0 0

Now proceed to umount /dev/shm on node2

terminal-2

ssh node2 "sudo workshop-ebpf-unmountshm.sh"

Results

Back to the primary terminal and you should see the following:

umount           20003   20003   4026531840  umount("/dev/shm", 0x0) = 0

The umount command succeeded.

Secondary Terminal (bastion)

Proceed to restore the /dev/shm mount as follows:

terminal-2

ssh node2 "sudo workshop-ebpf-mountshm.sh"

This is the last step related to mountsnoop. Before proceeding to the results you can either hit Ctrl-C in the primary terminal OR send a sigkill to the process to terminate mountsnoop.

terminal-2

ssh node2 "sudo pkill mountsnoop"

Results

Finally, back to the primary terminal you should see the following:

mount            48024   48024   1832832     mount("tmpfs", "/dev/shm", "tmpfs", MS_NOSUID|MS_NODEV, "seclabel,inode64") = 0

This shows us that the mount succeeded and all the options that were passed into the system call.

As you can see, the mountsnoop tool is very useful for seeing what processes are calling the mount and umount system calls and what the results of those calls are.

6. Exercise: cachestat

The cachestat tool traces kernel page cache functions and prints a summary every five seconds to aid you in workload characterization.

Primary Terminal (node2)

To run this tool, execute the following and give it a couple of seconds to get set up:

terminal-1

/usr/share/bcc/tools/cachestat 5

   TOTAL   MISSES     HITS  DIRTIES   BUFFERS_MB  CACHED_MB

It is important to wait until the tool is setup. Do not proceed until you see the output headers (displayed above) appear.

Secondary Terminal (bastion)

Let’s now run a little exercise that will generate some i/o. This script flushes the cache and then runs a series of dd commands.

terminal-2

ssh node2 "sudo workshop-ebpf-cachestat.sh"

This command should run for a few seconds and then exit. Before porceeding to the results you can either hit Ctrl-C in the primary terminal OR send a sigkill to the process to terminate cachestat.

terminal-2

ssh node2 "sudo pkill cachestat"

Results

In the cachestat window, you should output similar to:

    HITS   MISSES  DIRTIES HITRATIO   BUFFERS_MB  CACHED_MB
   16016     1216        0   92.94%            0        266
    8891        0        0  100.00%            0        322
       0        0        0    0.00%            0        322

This shows that we had 1216 page cache misses during a five second period while running the above loop but, during that same period, there were 16016 hits, indicating great performance from the page cache.

7. Exercise: bpftrace

This tool is a swiss army knife allowing you to specify functions to trace and messages to be printed when certain conditions are met.

bpftrace makes use of BCC for interacting with the Linux BPF system as well as existing Linux tracing capabilities:

kernel dynamic tracing (kprobes)
user-level dynamic tracing (uprobes)
tracepoints

Let us start this educational journey by examining what system calls a simple command executes.

Primary Terminal (node2)

To begin, we need a test file to use as our target.

terminal-1

touch /tmp/workshop.tmp

The command we will evaluate is cat /tmp/workshop.tmp and we will use strace to list the system calls made and with what frequency (count).

terminal-1

strace -c cat /tmp/workshop.tmp

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 28.86    0.000267         267         1           execve
 28.76    0.000266           8        31        13 openat
 12.97    0.000120           5        22           mmap
  9.19    0.000085           4        20           close
  7.46    0.000069           3        19           newfstatat
  3.24    0.000030           7         4           mprotect
  2.38    0.000022          11         2           munmap
  1.51    0.000014           3         4           read
  1.30    0.000012           3         4           pread64
  0.97    0.000009           3         3           brk
  0.86    0.000008           8         1         1 access
  0.54    0.000005           2         2         1 arch_prctl
  0.54    0.000005           5         1           getrandom
  0.32    0.000003           3         1           futex
  0.32    0.000003           3         1           fadvise64
  0.32    0.000003           3         1           prlimit64
  0.22    0.000002           2         1           set_tid_address
  0.22    0.000002           2         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.000925           7       119        15 total

What we see is that our command used the 'openat' systemcall 31 times. Among those calls, 13 resulted in errors but are likely harmless attempts to open files that do not exist.

7.1. bpftrace: count all syscalls

Next, let’s use bpftrace to create a similar output of counted system calls. We will use a wildcarded tracepoint 'sys_enter_*' to get all the relevant syscalls.

Primary Terminal (node2)

terminal-1

bpftrace -e 'tracepoint:syscalls:sys_enter_* { if (comm == "cat") { @[probe] = count(); } }' -c "cat /tmp/workshop.tmp"

Attaching 337 probes...


@[tracepoint:syscalls:sys_enter_access]: 1
@[tracepoint:syscalls:sys_enter_futex]: 1
@[tracepoint:syscalls:sys_enter_set_tid_address]: 1
@[tracepoint:syscalls:sys_enter_set_robust_list]: 1
@[tracepoint:syscalls:sys_enter_prlimit64]: 1
@[tracepoint:syscalls:sys_enter_fadvise64]: 1
@[tracepoint:syscalls:sys_enter_getrandom]: 1
@[tracepoint:syscalls:sys_enter_exit_group]: 1
@[tracepoint:syscalls:sys_enter_munmap]: 2
@[tracepoint:syscalls:sys_enter_arch_prctl]: 2
@[tracepoint:syscalls:sys_enter_brk]: 3
@[tracepoint:syscalls:sys_enter_mprotect]: 4
@[tracepoint:syscalls:sys_enter_pread64]: 4
@[tracepoint:syscalls:sys_enter_read]: 4
@[tracepoint:syscalls:sys_enter_newfstatat]: 19
@[tracepoint:syscalls:sys_enter_close]: 20
@[tracepoint:syscalls:sys_enter_mmap]: 22
@[tracepoint:syscalls:sys_enter_openat]: 31

This should line up almost 1-to-1 with the previous output from strace, note that 'sys_enter_openat' is 31.

7.2. bpftrace: count specific syscall

Now we are going to limit the probe to only show 'sys_enter_openat' thereby only setting 1 probe and collecting a count.

Primary Terminal (node2)

terminal-1

bpftrace -e 'tracepoint:syscalls:sys_enter_openat { if (comm == "cat") { @[probe] = count(); } }' -c "cat /tmp/workshop.tmp"

Attaching 1 probe...


@[tracepoint:syscalls:sys_enter_openat]: 31

7.3. bpftrace: set probe on specific file

For this last exercise we will utilize both terminal sessions again.

The objective is to set a probe and identify any process that opens our targeted file.

-e [PROBE CODE]
- tracepoint:syscalls:sys_enter_openat: is the tracepoint probe type (kernel static tracing)
- conditional if() to evaluate args→filename for a match
- print() the desired output
  - typical pid, uid, gid fields
  - comm is the process name
  - args→filename is the filename being accessed
-c [COMMAND] to run command after probes are set and exit

Primary Terminal (node2)

Proceed to set the probe in the primary terminal.

terminal-1

bpftrace -e 'tracepoint:syscalls:sys_enter_openat { if (str(args->filename) == "/tmp/workshop.tmp") { printf( "%d %d %d %s %s\n", pid, uid, gid, comm, str(args->filename)); }}'

Secondary Terminal (bastion)

To generate some output, you will need to utilize the second terminal and open the target with a couple of different tools. From the bastion host, run the following:

terminal-2

ssh node2 "cat /tmp/workshop.tmp"

terminal-2

ssh node2 "grep something /tmp/workshop.tmp"

terminal-2

ssh node2 "echo < /tmp/workshop.tmp"

This is the last step related to bpftrace. Before porceeding to the results you can either hit Ctrl-C in the primary terminal OR send a sigkill to the process to terminate bpftrace.

terminal-2

ssh node2 "sudo pkill bpftrace"

Results

Finally back to the primary terminal you should have seen some output like this:

Attaching 1 probe...
2741 0 0 cat /tmp/workshop.tmp
2757 0 0 grep /tmp/workshop.tmp
2772 0 0 bash /tmp/workshop.tmp

This displays the process-id along with the user and group ids of programs that tried to access our file.

8. Conclusion

This concludes the exercises related to EBPF.

Time to finish this unit and return the shell to its home position.

terminal-1

workshop-finish-exercise.sh

9. Additional Resources

You can find more information:

Kernel Tracing (ebpf)

Skill Level: Advanced

1. Overview

2. Getting Started

Primary Terminal (node2)

Secondary Terminal (bastion)

3. Core Concepts

3.1. Installation

Primary Terminal (node2)

4. Exercise: execsnoop

Primary Terminal (node2)

Secondary Terminal (bastion)

Results

5. Exercise: mountsnoop

Primary Terminal (node2)

Secondary Terminal (bastion)

Results

Secondary Terminal (bastion)

Results

Secondary Terminal (bastion)

Results

6. Exercise: cachestat

Primary Terminal (node2)

Secondary Terminal (bastion)

Results

7. Exercise: bpftrace

Primary Terminal (node2)

7.1. bpftrace: count all syscalls

Primary Terminal (node2)

7.2. bpftrace: count specific syscall

Primary Terminal (node2)

7.3. bpftrace: set probe on specific file

Primary Terminal (node2)

Secondary Terminal (bastion)

Results

8. Conclusion

9. Additional Resources

End of Unit