You can't apply this stuff mindlessly. Consider this suggestion from the article: "Differences between people: (Height, Weight, Age)". If you use the formula, you end up adding inch^2 + pound^2 + year^2. Mixing units is bad enough; now imagine how the results change if you switch to the metric system.
I think the missing element is this: you need to establish that the domain that you are trying to measure is a metric space before trying to measure distance using this formula.
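To make the unit problem concrete, here's a small sketch (the people and their measurements are made up) showing that the same pair of people ends up at a different "distance" depending on whether you measure in imperial or metric units:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical people as (height, weight, age) in imperial units.
alice_imperial = (65, 130, 30)   # inches, pounds, years
bob_imperial   = (70, 135, 31)

# The same two people in metric units.
alice_metric = (165.1, 59.0, 30)  # cm, kg, years
bob_metric   = (177.8, 61.2, 31)

# The "distance" between the same pair of people depends on the units:
print(euclidean(alice_imperial, bob_imperial))
print(euclidean(alice_metric, bob_metric))
```

Normalizing each dimension first (e.g. to z-scores across your population) is one common way to make the result unit-independent.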
Exactly! It is a big mistake to assume that you can express relations between the entities you study with a single number. However, I've never heard of non-metric topological spaces being used in applications.
The use of the color distance measure turns out to be important in spam filtering because of the 'Camouflage' trick used by some spammers where similar, but not identical colors are used to mask chunks of 'good' text inserted to try to fool spam filters.
Camouflage (GWI!Camouflage!HTML)
What: Like Invisible Ink, but instead of using identical colors (e.g. white on white) use very similar colors.
Date added: June 2, 2003
Example from the wild:
(The colors 1133333, 123939, and 423939 are chosen to be very similar without being the same.)
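As a sketch of why a distance measure helps here: treat each color as a point in RGB space and compare the distance between the camouflage pair with a visually obvious pair. The helper names are mine, not from any particular spam filter:

```python
import math

def hex_to_rgb(hex_color):
    """Parse a 6-digit hex color like '123939' into an (r, g, b) tuple."""
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))

def color_distance(c1, c2):
    """Euclidean distance between two colors treated as points in RGB space."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(hex_to_rgb(c1), hex_to_rgb(c2))))

# Two of the camouflage colors from the example above:
print(color_distance("123939", "423939"))  # small: differ only in the red channel
# A visually obvious pair for comparison:
print(color_distance("000000", "ffffff"))  # large: black vs. white
```

A filter can then flag text whose foreground/background color distance falls below some small threshold, catching "similar but not identical" pairs that an exact-match check would miss.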
But finding the distance between two N-D points isn't really hard at all.
What is hard is finding the distance between a point and a set of points. Or a set and a set.
Doing exhaustive search is wasteful.
If you're interested, look into K-D trees as a real solution. Best-bin-first modified K-D trees are the basis of the feature matching in the SIFT object instance recognition algorithm. Break an image into a set of N-D features; matching a geometrically consistent subset of those features to a previously seen object works extremely well.
The algorithm is general: change or add features to make it more powerful or faster. But the idea of using a set of N-D linear features to describe an object should last for years.
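For anyone curious, here's a minimal pure-Python K-D tree sketch (not the best-bin-first variant used in SIFT) showing how nearest-neighbor search can prune whole subtrees instead of exhaustively scanning every point:

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a K-D tree, splitting on alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, depth=0, best=None):
    """Return the tree point closest to target, pruning far branches."""
    if node is None:
        return best
    dist = lambda p: math.dist(p, target)
    if best is None or dist(node["point"]) < dist(best):
        best = node["point"]
    axis = depth % len(target)
    diff = target[axis] - node["point"][axis]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, depth + 1, best)
    # Only descend the far side if the splitting plane is closer than the
    # best match found so far -- this is where the savings come from.
    if abs(diff) < dist(best):
        best = nearest(far, target, depth + 1, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(nearest(tree, (9, 2)))  # → (8, 1)
```

For real workloads you'd reach for an existing implementation (e.g. scipy.spatial.KDTree) rather than rolling your own, but the pruning logic is the whole idea.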
Not to snipe at anyone, but skills like this seem pretty essential in solving problems in general. Little tricks from linear algebra like magnitude (what this post concerns) and projections show up in interesting and useful places.
Additionally, while they /are/ harder to access, reading and understanding various proofs in Math can be an even more beautiful and enlightening experience than just seeing the practical result.
Maybe it's the difference between realizing that setf can be used on all the generalized variables and actually reading the source and seeing why. Math is full of clever hacks.
Pythagoras' theorem only holds in Euclidean geometry (or so Wikipedia says). The computer world has no notion of space, so there's no a priori reason for choosing Pythagoras over other norms. Has anyone here experimented with alternatives?
I've used the 1-norm, aka "Manhattan", distance (with weights of course) for visual image search. I think the 1-norm is always a better choice than the Euclidean distance unless there is some clear geometric context.
The 1-norm tends to also be less sensitive to outliers, and in machine learning, 1-norm regularization leads to sparse solutions. The real reason 2-norm is popular is that it is easy to minimize (differentiable).
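A quick illustration of that outlier sensitivity, with made-up points: one large difference in a single dimension and small differences spread across every dimension sum to the same 1-norm, but the 2-norm is dominated by the single outlier:

```python
import math

def manhattan(a, b):
    """1-norm ("Manhattan") distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """2-norm (Euclidean) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

base = (0, 0, 0, 0)
small_everywhere = (2, 2, 2, 2)  # small difference in every dimension
one_outlier = (8, 0, 0, 0)       # one big difference, rest identical

# The 1-norm treats these the same (both sum to 8):
print(manhattan(base, small_everywhere), manhattan(base, one_outlier))  # 8 8
# The 2-norm is dominated by the outlier coordinate:
print(euclidean(base, small_everywhere), euclidean(base, one_outlier))  # 4.0 8.0
```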
Very simple, but there are some good examples here of how you might use this to quantify similarity between users based on their expressed preferences. Simple techniques are often best.
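For instance, here's a minimal sketch of preference-based similarity (the users, items, and ratings are all invented): compute the Euclidean distance over the items two users have both rated, then map it into (0, 1] so that identical tastes score 1 and the score shrinks as tastes diverge:

```python
import math

# Hypothetical 1-5 ratings from three users on the same items.
ratings = {
    "alice": {"film_a": 4, "film_b": 1, "film_c": 5},
    "bob":   {"film_a": 5, "film_b": 2, "film_c": 4},
    "carol": {"film_a": 1, "film_b": 5, "film_c": 2},
}

def similarity(u, v):
    """Similarity in (0, 1]: 1 for identical ratings, lower as they diverge."""
    shared = set(ratings[u]) & set(ratings[v])
    d = math.sqrt(sum((ratings[u][i] - ratings[v][i]) ** 2 for i in shared))
    return 1 / (1 + d)

print(similarity("alice", "bob"))    # close tastes -> higher score
print(similarity("alice", "carol"))  # opposite tastes -> lower score
```

Restricting the sum to items both users have rated matters in practice, since real preference data is sparse.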