I filter out false positives with prompt language like this:
For each bug you find, write a failing test. Run the test to make sure it fails. If it passes, try 1-3 times to fix the test. If you can't get it to work, delete the test and move on to the next bug.
It's not perfect: you still get some non-bugs where the test fails because its premises are wrong. E.g., I recently tossed out some tests that asserted they could index a list at `foo.len()` instead of `foo.len() - 1`. But I've found a bunch of bugs this way too.
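To make the `foo.len()` case concrete, here's a minimal sketch of why a test like that fails even though the code under test is fine. I'm assuming Rust because of the `.len()` syntax, and the function names `last_element` and `element_at_len` are made up for illustration, not taken from any real codebase.

```rust
/// Hypothetical helper: returns the last element, or None for an empty slice.
fn last_element(foo: &[i32]) -> Option<i32> {
    // The last valid index is len() - 1; checked_sub avoids underflow on an empty slice.
    let last_index = foo.len().checked_sub(1)?;
    Some(foo[last_index])
}

/// Hypothetical helper: tries to read the element at index `foo.len()`.
/// That index is one past the end, so this is None for every slice.
fn element_at_len(foo: &[i32]) -> Option<i32> {
    // `get` returns None instead of panicking the way `foo[foo.len()]` would.
    foo.get(foo.len()).copied()
}
```

A generated test that expects `foo[foo.len()]` to return the last element will fail (in Rust it panics with an out-of-bounds error), but the failure comes from the test's off-by-one premise, not from a bug in the code. That's the kind of "failing test" that still has to be thrown out by hand.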