Syscall Auditing at Scale

amluto · on Nov 22, 2016

As the kind-of-sort-of maintainer of Linux's syscall infrastructure on x86, I have a public service announcement: the syscall auditing infrastructure is awful.

It is inherently buggy in numerous ways. It hardcodes the number of arguments a syscall has incorrectly. It screws up compat handling. It doesn't robustly match entries to returns. It has an utterly broken approach to handling x32 syscalls. It has terrifying code that does bizarre things involving path names (!). It doesn't handle containerization sensibly at all. I wouldn't be at all surprised if it contains major root holes. And last, but certainly not least, it's eminently clear that no one stress tests it.

If you really want to use it for production, invest the effort to fix it, please. (And cc me.) Otherwise do yourself a favor and stay away from it. Use the syscall tracing infrastructure instead.

rhuber · on Nov 22, 2016

Hi - Ryan from the blog post here.

eBPF is great, and we plan to support it as our collection mechanism, but not today and maybe not soon. When we wrote go-audit, there were few, (if any?), distros that had eBPF kernel support. Now Ubuntu 16.04 has support, as do many others, and that is great!

But..

Releasing a tool that will work for most people today is very different to releasing one that will work for most people in a couple of years. We are sharing this imperfect tool that uses an imperfect syscall monitoring mechanism, both of which were written by imperfect humans.

We have found plenty of problems with auditd during development (some real facepalm stuff, to be sure), but it also does many useful things. We have worked around limitations and very definitely stress tested it in our environment. A lot. A lot lot.

If you don't trust auditd, that's a totally valid opinion to hold. I welcome criticism of our approach, but a PSA that doesn't involve having tested our tool is rather unfair.

amluto · on Nov 22, 2016

> If you don't trust auditd

I know basically nothing about auditd. It's the kernel code I don't trust. go-audit may well be fantastic, but treating the kernel part as a reliable black box seems unwise to me.

Edit: you might not need eBPF to get something better. Plain old "perf script" and the underlying ringbuffer API should work decently well on older kernels. There's a performance hit, but Steven Rostedt has a fix, and it should get backported to RHEL at least.

g0xA52A2A · on Nov 22, 2016

> When we wrote go-audit, there were few, (if any?), distros that had eBPF kernel support. Now Ubuntu 16.04 has support, as do many others, and that is great!

This! I'm rather tired of people preaching about eBPF when in reality many of us have to run long-term-supported kernels/distros like Red Hat or Ubuntu LTS releases.

sanxiyn · on Nov 22, 2016

Where did amluto "preach" eBPF?

What was said is "If Slack really wants to use syscall auditing for production, invest the effort to fix it, please. (And cc me.)", which seems totally reasonable to me.

Slack's answer is "We have worked around limitations and very definitely stress tested it in our environment. A lot. A lot lot." As I understand, this means "No, Slack will not invest any effort to fix Linux syscall auditing in upstream kernel, because we have already worked around limitations." Which is kind of sad and expected.

nialv7 · on Nov 22, 2016

> We have worked around limitations

Why not try to fix those limitations and push to upstream?

arunk-s · on Nov 22, 2016

True that. I've been writing [libaudit-go](https://github.com/mozilla/libaudit-go/), which aims to provide a replacement for the C version of the auditd libraries and is in constant development. During this period I looked very closely at the auditd source code, and I can say it looks like a bunch of things patched together without much prior thought to make it work.

base698 · on Nov 22, 2016

Is syscall tracing == perf trace? I've been doing the same sort of thing with perf. I hadn't heard of the auditing tools.

amluto · on Nov 22, 2016

Yes, more or less. There are a bunch of different "tracing" mechanisms in the kernel, and perf trace is the common way to use them. Syscalls trigger tracepoints, and anything that can see tracepoints can see them. Using eBPF to trace syscalls is probably quite useful.

hawski · on Nov 22, 2016

Offtopic (and forgive me if I mess up some details):

I was reading about eBPF and how its instruction set is easily JITable across several architectures. I was wondering if it would be good as a VM for retro games. Something like Pico-8 [1], but where one could write games in this restricted C or eBPF-ASM.

As I understand it the VM calls specific functions with limits of calling convention (10 64-bit registers shared between arguments, returns etc.). VM can also expose helper functions. It seems quite capable for such a purpose.

Where could I find more materials to explore this side of eBPF? I found a user space eBPF VM [2] - probably a good start.

EDIT: In case anyone is wondering: Why? For fun!

[1] http://www.lexaloffle.com/pico-8.php

[2] https://github.com/rlane/ubpf

amluto · on Nov 22, 2016

If nothing else, you'll need to rework the verifier. The time complexity of verification is super-linear right now. A while back I proposed a linear-time algorithm that was a bit stricter, but it didn't get implemented.

Also, from memory, there aren't real stack frames right now. (This is a limitation of the implementation and the verifier, not a fundamental issue.)

viraptor · on Nov 22, 2016

Thanks for this information! I have a few questions though:

> It doesn't robustly match entries to returns

Do you mean it fails to match the call/return (annoying), or that it may mismatch them? (pretty bad...)

> It doesn't handle containerization sensibly at all.

Do you mean it's just oblivious to namespaces (but works as far as I know) or something actually not working?

amluto · on Nov 22, 2016

>> It doesn't robustly match entries to returns

> Do you mean it fails to match the call/return (annoying), or that it may mismatch them? (pretty bad...)

__audit_syscall_entry() and __audit_syscall_exit() have some interesting checks in them, and I was never convinced that every entry would get paired with the corresponding exit and logged correctly.
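To make "robustly match entries to returns" concrete, here is a minimal user-space sketch of what a pairing check could look like. It is not the kernel's logic: the `Event` shape, the `pair_events` helper, and the anomaly labels are all hypothetical names invented for illustration.

```python
from collections import namedtuple

# Hypothetical event shape for illustration; not the kernel's data structures.
Event = namedtuple("Event", "kind tid syscall")


def pair_events(events):
    """Pair each 'entry' with the next 'exit' on the same thread.

    Returns (pairs, anomalies). An anomaly is an exit with no pending entry,
    an exit whose syscall differs from the pending entry, a second entry that
    arrives before the first has exited, or an entry still pending at the end.
    """
    pending = {}    # tid -> entry event still waiting for its exit
    pairs = []
    anomalies = []
    for ev in events:
        if ev.kind == "entry":
            if ev.tid in pending:
                # The previous entry on this thread never saw an exit.
                anomalies.append(("entry_without_exit", pending[ev.tid]))
            pending[ev.tid] = ev
        elif ev.kind == "exit":
            entry = pending.pop(ev.tid, None)
            if entry is None:
                anomalies.append(("exit_without_entry", ev))
            elif entry.syscall != ev.syscall:
                anomalies.append(("mismatched_exit", ev))
            else:
                pairs.append((entry, ev))
    # Anything left over was entered but never exited in this stream.
    for entry in pending.values():
        anomalies.append(("entry_without_exit", entry))
    return pairs, anomalies
```

A logger that silently drops or mispairs any of these anomaly cases would produce exactly the kind of unreliable record being worried about here.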

>> It doesn't handle containerization sensibly at all.

> Do you mean it's just oblivious to namespaces (but works as far as I know) or something actually not working?

As far as I know, there is one global audit daemon and audit log, and you have to be globally privileged to use it.

coredog64 · on Nov 22, 2016

We (unintentionally) stress test it. Our Linux admins vociferously insist on using it to audit everything that is happening on a 120-core box that spawns tens of heavy processes every second.

jtakkala · on Nov 22, 2016

Great idea. I always thought that it's essential to log events in realtime to a remote system that is secure and harder to compromise, so the logs can't be modified post-intrusion. Way back in the day it was suggested to do this to an entirely offline system by cutting the rx pins on a parallel cable, thereby only allowing the one-way transmission of logs to the log server. I don't know if anyone ever did that in practice though.

Anyways this invites the question, are you allowing your production servers to make outbound internet connections? Generally, I would proxy outbound connections and/or use internal mirrors and repos for the installation of software.

henrygrew · on Nov 22, 2016

> it was suggested to do this to an entirely offline system by cutting the rx pins on a parallel cable, thereby only allowing the one-way transmission of logs to the log server.

Sounds like overkill, but pretty cool, I must say.

Manozco · on Nov 22, 2016

I contributed to this kind of development (it was still a prototype when I left) and AFAIR you can use some Ethernet-to-optical-fiber converters. Those devices will spit out (or ingest on the other side) one fiber for RX and one fiber for TX, so it makes the creation of the gap very easy. I don't exactly remember the device name though.

_joel · on Nov 22, 2016

I think the term you are looking for is 'data diode' or uni-directional network link - https://en.wikipedia.org/wiki/Unidirectional_network
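On the software side, a sender behind a one-way link can only fire and forget, since no acknowledgement can ever come back. A minimal sketch, assuming UDP as the transport (nobody in the thread names one) and using hypothetical helper names:

```python
import socket


def open_one_way_socket():
    """Create a UDP socket that this sender only ever writes to."""
    return socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def send_log(sock, collector, line):
    """Fire-and-forget one log line; no reply is read or expected.

    `collector` is a (host, port) tuple for the log server. Delivery is not
    confirmed, which is the trade-off of any one-way channel.
    """
    payload = line.rstrip("\n").encode("utf-8")
    return sock.sendto(payload, collector)
```

Usage would look like `send_log(open_one_way_socket(), ("192.0.2.10", 514), "sshd: accepted key")`, where the address is a documentation placeholder rather than a real collector.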

henridf · on Nov 22, 2016

Looks like a nice tool, and it's great to see syscalls getting more attention.

I don't fully get the argument for why on-host filtering is undesirable. Of course naively filtering for curl-originated connections isn't a solid detection scheme for rootkit-installs! That's just a naive filter, which a naive user could mis-use in a centralized way or in a distributed way.

As for event correlation (#2 of the pros), it can be done on-host too. And back-testing (#3) of new rules is indeed a highly valuable feature! But you certainly don't have to log everything centrally to get that capability. E.g. in the case of Falco, you can capture trace files and re-run any number of rules/filters on them.

I do agree with the point on rules being exposed to an attacker.

[Disclaimer: author of the initial version of Sysdig Falco]

akadien · on Nov 22, 2016

Regarding on-host filtering (edge analytics), my experience has been it's because of performance, and I agree with the security angle, too.

nwmcsween · on Nov 21, 2016

The issue with enabling syscall auditing is the overhead it introduces: IIRC around two orders of magnitude, as in 200000/s -> 3000/s. I would just use seccomp-bpf filters on a per-program basis, as the overhead there is much lower according to benchmarks.
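For scale, the arithmetic behind those two recalled rates can be worked out directly. This sketch assumes a single thread issuing syscalls back to back, so that per-call cost is simply the inverse of the rate; that is an assumption, not something stated above.

```python
import math

baseline_rate = 200_000   # syscalls per second without auditing (figure quoted above)
audited_rate = 3_000      # syscalls per second with auditing (figure quoted above)

slowdown = baseline_rate / audited_rate
orders_of_magnitude = math.log10(slowdown)

# Per-call cost implied by each rate, in microseconds.
baseline_us = 1e6 / baseline_rate
audited_us = 1e6 / audited_rate
added_us = audited_us - baseline_us

print(f"slowdown: {slowdown:.1f}x ({orders_of_magnitude:.2f} orders of magnitude)")
print(f"per-call cost: {baseline_us:.1f} us -> {audited_us:.1f} us (+{added_us:.1f} us)")
```

So the quoted numbers work out to roughly a 67x slowdown, a little under two orders of magnitude, or about 328 microseconds added per call under that assumption.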

viraptor · on Nov 22, 2016

seccomp-bpf is awesome, but it has to be really baked into the app to be useful. Not being able to filter on dereferenced memory is pretty limiting.

But not all syscalls need to be logged. You don't need to audit all read()/write() calls for example. open/connect/exec/setid should give you plenty of information already. I know at least Fedora had audit included, but inactive, and I haven't heard of any terrible performance degradation there.

(it looks like the penalty was ~40ns in 2014 http://permalink.gmane.org/gmane.linux.kernel/1639528)
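The selection idea can be sketched as a filter over audit-style `key=value` records. In practice the selection would live in the audit rules themselves; this only illustrates the logic. The syscall numbers are the x86-64 ones, and `parse_record` and `interesting_syscall` are hypothetical helper names.

```python
import re

# x86-64 syscall numbers for the families named above (open/connect/exec/setid).
INTERESTING = {
    2: "open",
    42: "connect",
    59: "execve",
    105: "setuid",
    106: "setgid",
    257: "openat",
}

# Matches key=value, where the value is either double-quoted or a bare token.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_record(line):
    """Split an audit-style record into a dict of key=value fields."""
    return {key: value.strip('"') for key, value in FIELD_RE.findall(line)}


def interesting_syscall(record):
    """Return the syscall name if the record is worth keeping, else None."""
    if record.get("type") != "SYSCALL":
        return None
    try:
        number = int(record["syscall"])
    except (KeyError, ValueError):
        return None
    return INTERESTING.get(number)
```

With this table, a record carrying `syscall=59` is kept as `execve`, while a `read()` record (`syscall=0`) is dropped.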

amluto · on Nov 22, 2016

It was an order of magnitude higher at one point. These days it's not that bad under most workloads.

akadien · on Nov 22, 2016

This is seriously cool, in my opinion. In fact, I've been working on something kind of similar in Go with more support for network monitoring and vulnerability management that would feed into GrayLog.

Are you guys hiring?

jvehent · on Nov 22, 2016

If you're wondering why this is useful, or if you need it at all, ask yourself if you would want to get an alert when nginx execs a shell, or opens /etc/shadow. Syscall auditing gives you the lowest possible interface to capture these events.

It's not for the faint of the heart - the volume of events to filter through requires some serious infrastructure - but it's an important component of a mature secops program.
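As a rough illustration of the two alerts mentioned, here is a sketch that checks already-parsed events. The event fields (`syscall`, `caller_exe`, `path`) and the shell list are assumptions made up for this example; they are not the format of any particular tool.

```python
# Hypothetical list of shell binaries to watch for; adjust to the environment.
SHELLS = {"/bin/sh", "/bin/bash", "/bin/dash"}


def alerts_for(event):
    """Return a list of alert strings raised by one syscall event.

    `event` is a dict with hypothetical keys:
      syscall    - syscall name, e.g. "execve" or "open"
      caller_exe - path of the binary that made the call
      path       - path argument of the call
    """
    found = []
    is_nginx = event["caller_exe"].endswith("/nginx")
    if event["syscall"] == "execve" and is_nginx and event["path"] in SHELLS:
        found.append("nginx executed a shell: " + event["path"])
    if event["syscall"] in ("open", "openat") and event["path"] == "/etc/shadow":
        found.append("/etc/shadow opened by " + event["caller_exe"])
    return found
```

The rules themselves are trivial; the hard part, as noted, is running them over the full event volume.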

amluto · on Nov 22, 2016

The even more mature secops programs might wonder whether the kernel syscall auditing infrastructure itself is a problematic attack surface.

0xbadcafebee · on Nov 22, 2016

If you want similar functionality to the first questions without audit, try netfilter. It's still shitty logs, unfortunately, but so is most monitoring.

valarauca1 · on Nov 22, 2016

Couldn't you just do this in user-land with `PTRACE_SYSEMU`? Then your tracing process also has to make 1 system call to unlock and allow the other process to run?

I started tooling up a basic version of this but I need to change PTRACE flags, and change from reading /proc/[PID]/mem to `process_vm_readv(2)`

SEJeff · on Nov 22, 2016

You know that ptrace messes up the child-parent relationship (this is why you can't strace an already strace'd or gdb'd process; it only goes one level deep) and has a serious performance impact, whereas syscall tracing doesn't, right?

packetized · on Nov 22, 2016

Similar to https://github.com/gdestuynder/audisp-json and https://github.com/mozilla/audit-go

aliakhtar · on Nov 22, 2016

Would this be useful in a containerized architecture where everything is run as a container?

viraptor · on Nov 22, 2016

At the host level - yes. All the applications still make the same requests to the kernel. Different namespace doesn't matter.

rkv · on Nov 22, 2016

I'd argue no. While you still receive all the syscalls that are made within each container, there is no way to differentiate those calls from ones made directly on the host. So no filtering or analytics can be made for containers. This is a known issue of the audit framework in Linux.

ejholmes · on Nov 22, 2016

Great article. I'd also recommend people take a look at ThreatStack for aggregating syscall events: https://www.threatstack.com/.

mburns · on Nov 22, 2016

[duplicate of https://news.ycombinator.com/item?id=13009796]

dang · on Nov 22, 2016

True, and of https://news.ycombinator.com/item?id=13007726 before it. But there seems to be interest, so we've moved the comments to the post currently ranked highest (that would be this one).

mistertrotsky · on Nov 21, 2016

This is super great.

ms4720 · on Nov 21, 2016

Yes it is