As the kind-of-sort-of maintainer of Linux's syscall infrastructure on x86, I have a public service announcement: the syscall auditing infrastructure is awful.
It is inherently buggy in numerous ways. It hardcodes the number of arguments a syscall has incorrectly. It screws up compat handling. It doesn't robustly match entries to returns. It has an utterly broken approach to handling x32 syscalls. It has terrifying code that does bizarre things involving path names (!). It doesn't handle containerization sensibly at all. I wouldn't be at all surprised if it contains major root holes. And last, but certainly not least, it's eminently clear that no one stress tests it.
If you really want to use it for production, invest the effort to fix it, please. (And cc me.) Otherwise do yourself a favor and stay away from it. Use the syscall tracing infrastructure instead.
eBPF is great, and we plan to support it as our collection mechanism, but not today and maybe not soon. When we wrote go-audit, there were few, (if any?), distros that had eBPF kernel support. Now Ubuntu 16.04 has support, as do many others, and that is great!
But..
Releasing a tool that will work for most people today is very different from releasing one that will work for most people in a couple of years. We are sharing this imperfect tool that uses an imperfect syscall monitoring mechanism, both of which were written by imperfect humans.
We have found plenty of problems with auditd during development (some real facepalm stuff, to be sure), but it also does many useful things. We have worked around limitations and very definitely stress tested it in our environment. A lot. A lot lot.
If you don't trust auditd, that's a totally valid opinion to hold. I welcome criticism of our approach, but a PSA that doesn't involve having tested our tool is rather unfair.
I know basically nothing about auditd. It's the kernel code I don't trust. go-audit may well be fantastic, but treating the kernel part as a reliable black box seems unwise to me.
Edit: you might not need eBPF to get something better. Plain old "perf script" and the underlying ringbuffer API should work decently well on older kernels. There's a performance hit, but Steven Rostedt has a fix, and it should get backported to RHEL at least.
> When we wrote go-audit, there were few, (if any?), distros that had eBPF kernel
> support. Now Ubuntu 16.04 has support, as do many others, and that is great!
This! I'm rather tired of people preaching about eBPF when in reality many of us
have to run long-term-supported kernels/distros like RedHat or Ubuntu LTS
releases.
What is said is "If Slack really want to use syscall auditing for production, invest the effort to fix it, please. (And cc me.)", which seems totally reasonable to me.
Slack's answer is "We have worked around limitations and very definitely stress tested it in our environment. A lot. A lot lot." As I understand, this means "No, Slack will not invest any effort to fix Linux syscall auditing in upstream kernel, because we have already worked around limitations." Which is kind of sad and expected.
True that, I've been writing [libaudit-go](https://github.com/mozilla/libaudit-go/), which aims to provide a replacement for the C version of the auditd libraries and is in constant development. During this period I looked very closely at the auditd source code, and I can say it looks like a bunch of things patched together, without much prior thought, to make it work.
Yes, more or less. There are a bunch of different "tracing" mechanisms in the kernel, and perf trace is the common way to use them. Syscalls trigger tracepoints, and anything that can see tracepoints can see them. Using eBPF to trace syscalls is probably quite useful.
Offtopic (and forgive me if I mess up some details):
I was reading about eBPF and how its instruction set is easily JITable across several architectures. I was wondering if it would be good as a VM for retro games. Something like Pico-8 [1], but where one could write games in this restricted C or eBPF-ASM.
As I understand it the VM calls specific functions with limits of calling convention (10 64-bit registers shared between arguments, returns etc.). VM can also expose helper functions. It seems quite capable for such a purpose.
Where could I find more materials to explore this side of eBPF? I found a user space eBPF VM [2] - probably a good start.
If nothing else, you'll need to rework the verifier. The time complexity of verification is super-linear right now. A while back I proposed a linear-time algorithm that was a bit stricter, but it didn't get implemented.
Also, from memory, there aren't real stack frames right now. (This is a limitation of the implementation and the verifier, not a fundamental issue.)
> Do you mean it fails to match the call/return (annoying), or that it may mismatch them? (pretty bad...)
__audit_syscall_entry() and __audit_syscall_exit() have some interesting checks in them, and I was never convinced that every entry would get paired with the corresponding exit and logged correctly.
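To make the pairing concern concrete, here is a minimal user-space sketch of what "every entry gets paired with the corresponding exit" means. It is not the kernel's logic: the `Event` type, its fields, and `pair_events` are hypothetical names invented for illustration, and it assumes at most one syscall in flight per thread. Real traces also contain syscalls that legitimately never return in the usual way, so a real checker would need more cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str      # "entry" or "exit"
    tid: int       # thread the event belongs to
    syscall: str   # syscall name, e.g. "openat"


def pair_events(events):
    """Match each exit to the pending entry on the same thread.

    Returns (pairs, anomalies). An anomaly is a (reason, event) tuple for
    anything that could not be paired cleanly.
    """
    pending = {}     # tid -> entry still waiting for its exit
    pairs = []
    anomalies = []
    for ev in events:
        if ev.kind == "entry":
            if ev.tid in pending:
                # A second entry arrived before the first one exited.
                anomalies.append(("entry_without_exit", pending[ev.tid]))
            pending[ev.tid] = ev
        elif ev.kind == "exit":
            entry = pending.pop(ev.tid, None)
            if entry is None:
                anomalies.append(("exit_without_entry", ev))
            elif entry.syscall != ev.syscall:
                anomalies.append(("mismatched_exit", ev))
            else:
                pairs.append((entry, ev))
    # Entries that never saw an exit by the end of the trace.
    anomalies.extend(("entry_without_exit", e) for e in pending.values())
    return pairs, anomalies
```

Running something like this over a log would only tell you whether the records you received are self-consistent; it cannot show that the kernel emitted every record it should have.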
>> It doesn't handle containerization sensibly at all.
> Do you mean it's just oblivious to namespaces (but works as far as I know) or something actually not working?
As far as I know, there is one global audit daemon and audit log, and you have to be globally privileged to use it.
We (unintentionally) stress test it. Our Linux admins vociferously insist on using it to audit everything that is happening on a 120-core box that spawns tens of heavy processes every second.
Great idea. I always thought that it's essential to log events in realtime to a remote system that is secure and harder to compromise to modify the logs post-intrusion. Way back in the day it was suggested to do this to an entirely offline system by cutting the rx pins on a parallel cable, thereby only allowing the one-way transmission of logs to the log server. I don't know if anyone ever did that in practice though.
Anyways this invites the question, are you allowing your production servers to make outbound internet connections? Generally, I would proxy outbound connections and/or use internal mirrors and repos for the installation of software.
> it was suggested to do this to an entirely offline system by cutting the rx pins on a parallel cable, thereby only allowing the one-way transmission of logs to the log server.
I contributed to this kind of development (it was still a prototype when I left) and AFAIR you can use some Ethernet to optic fiber converters. Those devices will spit out (or ingest on the other side) one fiber for RX and one fiber for TX, so it makes the creation of the gap very easy. I don't exactly remember the device name though..
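The one-way logging idea discussed above has a rough software analogue: ship each log line as a UDP datagram and never read anything back. This is only a sketch under stated assumptions. The function name `send_log_line` and the host and port are hypothetical, UDP gives no delivery guarantee, and a software-only sender is not a substitute for a physical gap, since a compromised host can still open a socket in the other direction.

```python
import socket


def send_log_line(line, host, port):
    """Fire-and-forget: send one log line as a UDP datagram.

    The sender never calls recv(), so nothing the log server says can
    influence this process. Delivery is not guaranteed.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
```

A caller would invoke it once per event, for example `send_log_line("type=SYSCALL ...", "logs.internal.example", 5514)`, where both the hostname and the port are placeholders.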
Looks like a nice tool, and it's great to see syscalls getting more attention.
I don't fully get the argument for why on-host filtering is undesirable. Of course naively filtering for curl-originated connections isn't a solid detection scheme for rootkit-installs! That's just a naive filter, which a naive user could mis-use in a centralized way or in a distributed way.
As for event correlation (#2 of the pros), it can be done on-host too. And back-testing (#3) of new rules is indeed a highly valuable feature! But you certainly don't have to log everything centrally to get that capability. E.g. in the case of Falco, you can capture trace files and re-run any number of rules/filters on them.
I do agree with the point on rules being exposed to an attacker.
[Disclaimer: author of the initial version of Sysdig Falco]
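The back-testing idea mentioned above (re-running rules over previously captured events) can be sketched in a few lines. This is not Falco's trace format or rule language: the JSON-lines capture, the field names `syscall`, `comm`, and `pid`, and both function names are assumptions made up for illustration. The rule is deliberately the naive curl filter from the discussion.

```python
import io
import json


def curl_connect_rule(event):
    """Naive example rule: flag any connect() issued by a process named curl."""
    return event.get("syscall") == "connect" and event.get("comm") == "curl"


def backtest(capture, rule):
    """Replay a JSON-lines capture through a rule and return the matching events."""
    matches = []
    for line in capture:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the capture
        event = json.loads(line)
        if rule(event):
            matches.append(event)
    return matches


# A tiny in-memory "capture"; on a real host this would be a file on disk.
SAMPLE = io.StringIO(
    '{"syscall": "connect", "comm": "curl", "pid": 101}\n'
    '{"syscall": "openat", "comm": "nginx", "pid": 202}\n'
    '{"syscall": "connect", "comm": "sshd", "pid": 303}\n'
)
hits = backtest(SAMPLE, curl_connect_rule)
```

Because the capture is just data, a new rule can be evaluated against old events without touching the hosts again, whether the capture lives centrally or on the host.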
The issue with enabling syscall auditing is the overhead it introduces, iirc around two orders of magnitude, as in 200000/s -> 3000/s. I would just use seccomp-bpf filters on a per-program basis, as the overhead there is much lower according to benchmarks.
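For scale, the quoted numbers can be worked through directly. The two rates are the ones recalled in the comment above, not measurements, and the comment does not say what operation is being counted.

```python
import math

baseline_per_sec = 200_000  # rate without auditing, as recalled in the comment
audited_per_sec = 3_000     # rate with auditing enabled, as recalled in the comment

# How many times slower the audited case is.
slowdown = baseline_per_sec / audited_per_sec

# The same ratio expressed in orders of magnitude (powers of ten).
orders_of_magnitude = math.log10(slowdown)
```

That ratio is a slowdown of roughly 67x, which is a little under two full orders of magnitude.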
seccomp-bpf is awesome, but it has to be really baked into the app to be useful. Not being able to filter by deref memory is pretty limiting.
But not all syscalls need to be logged. You don't need to audit all read()/write() calls for example. open/connect/exec/setid should give you plenty of information already. I know at least Fedora had audit included, but inactive, and I haven't heard of any terrible performance degradation there.
This is seriously cool, in my opinion. In fact, I've been working on something kind of similar in Go with more support for network monitoring and vulnerability management that would feed into GrayLog.
If you're wondering why this is useful, or if you need it at all, ask yourself if you would want to get an alert when nginx execs a shell, or opens /etc/shadow. Syscall auditing gives you the lowest possible interface to capture these events.
It's not for the faint of heart - the volume of events to filter through requires some serious infrastructure - but it's an important component of a mature secops program.
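The "nginx execs a shell, or opens /etc/shadow" alert described above might look like this once events have been parsed. It is a sketch only: the event dictionary and its keys (`comm`, `syscall`, `path`) are hypothetical and do not reflect go-audit's actual output format, and the shell list is an arbitrary sample.

```python
# Illustrative lists; a real deployment would maintain its own.
SHELLS = {"/bin/sh", "/bin/bash", "/bin/dash", "/bin/zsh"}
SENSITIVE_FILES = {"/etc/shadow"}


def alert_reason(event):
    """Return a short alert string for a suspicious nginx event, else None."""
    if event.get("comm") != "nginx":
        return None
    syscall = event.get("syscall")
    path = event.get("path")
    if syscall == "execve" and path in SHELLS:
        return "nginx executed a shell"
    if syscall in {"open", "openat"} and path in SENSITIVE_FILES:
        return "nginx opened a sensitive file"
    return None
```

Matching on the process name alone is easy to evade, so this shows the shape of a rule rather than a robust detection.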
If you want similar functionality to the first questions without audit, try netfilter. It's still shitty logs, unfortunately, but so is most monitoring.
Couldn't you just do this in user-land with `PTRACE_SYSEMU`? Then your tracing process also has to make 1 system call to unlock and allow the other process to run?
I started tooling up a basic version of this but I need to change PTRACE flags, and change from reading /proc/[PID]/mem to `process_vm_readv(2)`
You know that ptrace messes up the child parent relationship (this is why you can't strace a strace'd or gdb'd process already. It goes 1 level deep only) and has a serious performance impact, whereas syscall tracing doesn't, right?
I'd argue no. While you still receive all the syscalls that are made within each container, there is no way to differentiate those calls from ones made directly on the host. So no filtering or analytics can be made for containers. This is a known issue of the audit framework in Linux.
True, and of https://news.ycombinator.com/item?id=13007726 before it. But there seems to be interest, so we've moved the comments to the post currently ranked highest (that would be this one).