I ran the Python script through Shedskin, a partial Python implementation that compiles Python to machine code via an intermediate layer of C++. I did not have to declare any types for any variables, or add declarations for speed/safety; Shedskin automatically determined the types.
The finished program went down in runtime from 533.7s on my machine, to 24.42 seconds. If we adjust for machine speed (since the author's Python ran the python code in 474s), the comparative optimized result would be 21.7 seconds compared to author's hand-optimized lisp code at 30.2.
Note that this is probably an ideal case for Shedskin (which is a relatively new product: a 1-man job that's just a few years old compared to, I guess, hundreds if not thousands of man-years of research that have gone in LISP implementations).
There may be hundreds of man years of research gone into LISP, but a lot of that research found its way into shedskin too. And the Self research is in there too. And his own research (Shedskin was originilly his Masters).
the main techniques that I use for type inference were invented by ole agesen and john plevyak, who worked on self and concurrent aggregates, respectively (iirc). it may be they were influenced by research on lisp, but I doubt it.
It's better to use the actual JDK (i.e. the one everyone uses) rather than the implementation you get from debian apt-get due to debian's licensing hangups.
SBCL:
* (test)
Evaluation took:
45.268 seconds of real time
Sun JDK
Runtime was: 51086 Milliseconds
That's less than 15%, a far, far cry from '4 times'
The result of the distance() function is unused and can be optimized away by the compiler. Runs in 0.04s with LuaJIT 2.0. No type declarations needed. :-)
local radius = 6371
local function distance(latA, lngA, latB, lngB)
local latAr = math.rad(latA)
local lngAr = math.rad(lngA)
local latBr = math.rad(latB)
local lngBr = math.rad(lngB)
local deltaLat = latBr - latAr
local deltaLng = lngBr - lngAr
return radius * 2 * math.asin(math.sqrt(
math.sin(deltaLat/2)^2 + math.cos(latAr)
* math.cos(latBr) * (math.sin(deltaLng/2)^2)))
end
local start = os.clock()
for latA=-90,90,2.5 do
for lngA=-90,90,2.5 do
for latB=-90,90,2.5 do
for lngB=-90,90,2.5 do
distance(latA, lngA, latB, lngB)
end
end
end
end
local stop = os.clock()
print(stop-start)
Ha.. of course, the real problem is that Java and Python are actually doing all of the work to compute their trigonometric functions, while C libraries that Lisp uses probably have them mapped into a table.
"I just performed some simple benchmarks of my own, and noticed that summing e.g. sqrt, sin and tan for small values are just as fast as their C equivalents. This was a great relief to me.
However, when computing sin in the range of pi/2 ... pi and 10 ... 10+pi/6 Java became 3-5 times slower, while the C program was just as fast. This is explained above by lack of precision, but for example sin(10.0) yields bit-to-bit exactly the same result in Java and C! This comparison was performed on a rather old 1GHz Mobile Intel Pentium, so I would assume that the precision has been fixed in virtually all modern processors, and therefore these operations could be intrinsified for a much larger value range (possibly the entire double range)."
Interesting.. well, I'd think the trig functions would be by far the most expensive part of that program. Is Python really that much slower for control structures? That's literally the only other computation there.
Well, every instruction in a Python program requires running through the interpreter loop. Then you have to decode, get the opcode arguments, then actually do the operation. For calling C functions from Python, you also have to extract the C types, call the C function, and wrap the result back up. Its not hugely slow, but yeah, it adds up, and its fairly easy to see why it compares poorly to compiled versions.
The most common way to compute your distance from a deal is to assume the earth is a perfect sphere
I thought you would only really be finding deals at distances small enough to just assume the earth is flat. Or at least, aren't those the only ones people would actually be interested in? If it's beyond my physical reach, I don't care how far it is. I'll get it shipped.
You're absolutely right - in my experience the pythagorean theorem has less than ~50 meters difference from the Haversine formula for distances less than 20 miles or so. In the actual code I use a simple heuristic to decide when to use which calculation.
Like I mentioned in the post, the distance function presented is a bit of a simplification I invented for pedagogical purposes (my goal was to examine languages, not geometry).
Fair enough. But then that raises a more important issue: you are showing Lisp is faster than more popular languages in a way that probably doesn't matter much. When I read the title, I was hoping this isn't some benchmarking or academic exercise, which does not have relevance for real applications. So I was delighted to read that this is something you use in your webapp, only to be disappointed again now.
I don't mean to discourage you or be cynical. I love the fact that you are writing interesting articles on the practicalities of using Lisp with actual data behind it. So, keep up the good work. I look forward to more posts.
The distance the swallow flies doesn't seem very interesting for many cases, where the terrain forces you to take a much longer route due to bridges, train track crossings, etc. Wouldn't you be better of using the Google maps API to return the length of an actual route to the target?
I just iterate through each latitude from -90 to 90 and each longitude from -180 to 180 in increments of 2.5 (for both the start and end). Check out the source code for any of the 5 examples (included in the post).
If you'd like, post the code up somewhere and I'll be happy to run it on the same machine and post the results along with the rest.
I also wonder why your java benchmark is slow. Java seems to be around 2-3x slower than C on average, this would make your optimized CL to be around 1.5x faster than C.
Define "slower". Languages don't have a fixed "speed" relation between them. Performance varies between machine, language, problem, and implementation technique.
Is any of that so remarkable that we should presume catch23 doesn't know?
A charitable response would help catch23 understand what's different about this one problem from what we can sensibly presume as common experience - not Sun Java, JVM trig functions.
Even if you assume a local mercator projection (centered on your business or whatever) you still need to know that the earth is spherical. Otherwise, you're likely to assume that an 0.1 degree change in latitude is the same distance as an 0.1 degree change in longitude. To correct this erroneous assumption (for small distances) divide your longitude differences by the sine of your latitude.
Are people still calling Lisp slow in 2010? That was somewhat true when most programming was done in C. Plenty of languages slower than Lisp have risen to popularity since then.
That said, I will stick with python as I am not a programmer by profession and can't deal with a plethora of languages when I have a good swiss army knife as it is.
What does it mean to call a language "slow", though? It's all relative, and it depends on the problem set.
I'm wary of analogies, because they can be tortured into "proving" anything you like. But what's faster, a car or a bike? In a big city during rush hour, a bike could very well be faster than driving. Clearly on the freeway, the car wins out.
One interesting source of data is www.spoj.pl, since they keep records of the runtime of submissions to their site.
For example, take a look at the solutions to PRIME 1; the task is to generate a list of prime numbers: http://www.spoj.pl/ranks/PRIME1/
The top 20 C solutions all take less than 0.05 seconds. The top 20 Java solutions all take less than 0.5 second. Python has several solutions that took 0.55 seconds. But for Lisp, only one user was able to get it to less than 1 second, and then only barely.
There's not enough data to draw any real conclusions, but it does make me skeptical of claims that Lisp is near in speed to Java or C (any Lisp gurus who want to prove me wrong, feel free to code up a fast solution and submit it)
There are two Lisp implementations on www.spoj.pl. One is Clisp which is bytecode compiled and rather slow and the other is SBCL which compiles to machine code and is the implementation used by the OP.
It looks as if you only looked at the former as the later has an entry at 0.52 (as mentioned by another commenter.)
A note about Common Lisp implementations is that there are many of them, all with different performance specs ideal use cases. Some are considered very fast, such as SBCL and CCL, others, not so much. Most of the data I've seen benchmarking Lisp against other languages uses one of the open source implementations. I'd like to see how a commercial Lisp, such as Allegro or Lispworks, performed, especially since those are the ones that have actually been worked on since the 80s/
I like the Lisp approach to optimize the program by adding hints to the runtime about types and the ability to tune the optimizer's aggressiveness from the code. It really makes possible to write the correct algorithms first, then later add appropriate optimizations simply by changing the optimized and unoptimized functions.
That said I think these kinds of benchmarks (and article titles) are misleading, because one who has deep knowledge in one language does not necessarily write good code in other languages. Even the article notes that "they are all just literal line-by-line translations of each other".
Thanks, that's interesting. My own main optimization problem is always reducing the memory usage of large in memory data sets. I found that languages like Java and Python, which don't support structured value types and use references/pointers everywhere, are basically unusable for my my purposes.
Now, I wonder whether I could use Lisp for what I do. One basic necessity would be to have a way to define a structure like this
struct s {
int32_t x;
int32_t y;
};
and then create an array of that type as struct s a[size], which uses exactly size * sizeof(s) bytes in memory. A further requirement would be to have a library of data structures (balanced trees, lists, hash tables) that support storing such structs directly without using pointers everywhere.
Do you have any idea whether that could work with SBCL?
I think you can do that - although I'm not entirely certain how much overhead a struct (defstruct) has. Array allocation has no overhead, to the best of my knowledge (if you specify the element-type, etc).
I usually don't code at that low level, but you might want to check out www.lrde.epita.fr/~didier/research/verna.06.imecs.pdf
Let me know if you're on Win32 and I will package for you my setup, with a double-clickable installer, and your choice of Emacs or Win32 friendly IDE :-)
The python benchmark is contaminated. Lots of <= checks, increasing variables, etc. Those things need to be left out of the timing. How about using itertools.product instead?
for latA, lngA, latB, lngB in itertools.product(*[xrange(-90,91), xrange(-180,181)]*2):
distance(latA, lngA, latB, lngB)
Not sure it's faster, but certainly more idiomatic.
Also, the 2nd python example doesn't work; you need to drop all the 'math.' and use radians, sqrt, etc directly.
My understanding is that you have to rewrite your code to take advantage of NumPy. Unfortunately, I've never used NumPy before, so I don't know how to do so.
However, if you'd like to modify my code to do so I'd be happy to rerun the test and publish the results.
Isn't it a much better idea to pick a distance comparison algorithm that's not so staggeringly slow to begin with? Why not convert (i.e. pre-compute) the lat/lons to 3-vectors (origin center of the earth) and then compare distances by pythagoras, no fuss, muss or trig and you can even skip the square root. If you need to display an actual accurate distance you can do the (significantly smaller amount of) trig to get something presentable.
His problem is 'i have a bunch of existing points, find and compare distances to new point'. Doing this by lat,lon and trig is known as one of the slowest ways to do it.
"I thought I'd demonstrate this functionality, with a bit of real world code. Postabon recommends deals to users based on a variety of factors such as the age of the deal, other people's votes, your preferences, and (relevant for this example) your distance from the deal. Most of the other factors can be computed asyncronously and just cached, but your distance from deals is computed a lot, and can't really be pre-computed since I have no way of knowing your location a priori (well, that's not strictly true, and we do do some memoization, but it has a relatively low hit rate)."
Yes. He has a bunch of existing points (the deals) and and a new point (user location). Deals are the known points, which can be represented any way you like. lat lon, vectors, cylindrical coords, you name it. The thing not known in advance is the user location. And don't umm... me. It's rude and stupid, especially when you don't seem to know what you're talking about.
Any guesses as to why the Java implementation is relatively slow? JNI boundary crossings for floating point arithmetic, or lack of use of the math coprocessor?
I don't know, but it's much slower than I expected.
It's the latest Java installable by apt on Debian (all these tests are run in Debian running under VMWare).
$ java -version
java version "1.6.0_0"
OpenJDK Runtime Environment (build 1.6.0_0-b11)
OpenJDK 64-Bit Server VM (build 1.6.0_0-b11, mixed mode)
In more depth : the sun JDK probably does a lot better job of JIT compilation, it's the difference between running interpreted bytecode and compiled native code.
On the Hadoop project, checksumming was a major CPU constraint, because you have to assume every file everywhere is possibly on faulty hardware, and checksum everything. They found that it was actually faster to implement CRC32 in pure Java and let the JIT compilation do it's magic than it was to make native calls to zlibc.
But that may not apply to floating point if the Java implementation is bad enough.
I coded it in haskell, and it is faster, if it's a correct translation. Without parallelizing, it clocks at 1.8 seconds, compiled with $ ghc --make. If anyone plans to compile this you'll need to $ cabal install geodetic, before $ ghc --make filename.hs
It is. You never request the actual calculation to be done on each point, you only request that the last value be calculated. It takes 1.8 seconds to get to the end of the 4-variable cartesian product.
This was my theory, and I tested it by importing Debug.Trace, and then changing the line in question to:
let result = [traceShow [j,k,l,m] (distance j k l m) | j <- rangeLat, k <- rangeLng, l <- rangeLat, m <- rangeLng]
This will print "show [j,k,l,m]" when the actual distance is required. When I run the program, it prints what you'd expect:
The stuff in brackets is the traceShow output, and it's mixed in with the printing of each result. Lazy evaluation is; it calculates only what is necessary. In this case, it's every (j,k,l,m) quad (so it knows which one is last), and then the 5 values you request.
FWIW, your program runs in 14.28 seconds on my (old) machine, and the strict version runs in 48.94. (This includes a foldl' over the list that ensures every value is evaluated. It may add a bit of overhead.)
I don't think that "Not always" is a valid answer to "Haskell can be faster"; or rather, in the sense that it is a valid answer, I don't think your link proves it.
Incorrect code is silly. It is not unlikely that the different languages are performing a different number of iterations in these tests, correcting that by being precise about the range to be iterated over is not silly. The integer indices here produce the behavior that seems to be intended by the `<=` comparison in the originals.
The speed difference is negligible since all but the inner multiplication is hoisted out, and each inner loop takes over 250 cycles.
The speed difference is hardly negligible especially when talking about benchmarks. It's actually about 15% on my OS X laptop, less but still measurable on a Linux server. Your change actually has a greater impact on execution time than off-by-ones in the iteration counts which don't really take place.
15% is a bit higher than I benchmark, but here is a version that is guaranteed to do the correct number of iterations, increments in floating point instead of performing the slow cast, and is faster than performing the comparison in floating point math (because it occupies a different execution unit). Note that it is also less accurate than the cast for most definitions of what this function might be attempting to measure.
I really don't know, but I would guess Google's search engine is almost all C/C++. When you are operating at that scale, it makes sense to invest huge programmer time for very small speed increases.
As to "normal" people, I do a lot of numerical computing using python+C. I end up with probably 90% of the code in python, and 10% inner loops in C. That's typically enough that the C code is taking, say, 50% of the time, at which it isn't worth it (for me) to move more python into C, since I could never buy myself more than a factor of 2 speedup.
1.) A simple square macro will be faster than the expt function, because expt does special calculations for dealing with decimal exponents. All you really need to do is multiply it by itself. I think something like this:
And avoid a couple of function calls and some multiplication.
This way you also don't need to duplicate 'the single-float'...
On my machine, running clozure cl, this made it about 1/8th faster than the original optimized version.
3.)
I was kind of wondering, could you do this sort of computation ahead of time for varying longitudes and latitudes, then store them in a table and do a simple look up and interpolation, instead of doing a lot of trigonometry (is this technically trigonometry?) at run time?
I'm not clear on the math exactly (whether it would work in this situation), but that sort of thing is done using pre-computed tables with fairly complex polynomial equations all of the time. In the case of a web app, it might be better to do this sort of thing (ram is cheap anyway).
I ran all tests 5 times each, and averaged the results. Each of the results for each of the tests were within 2% of each other, and the machine wasn't used for anything else while the tests were running.
I know it's not perfect, but I don't see any obvious sources of error ... I did provide all the source, so you're welcome to verify my results.
It shows, you are comparing optomized LISP with unoptomized Java and C. Java has a long startup process so it is relativly faster on 20 second benchmarks than sub second ones. As to C, it does what you tell it do do, so the compiler needs a little more handholding. I suspect LISP's compiler is not acctually doing the computations because they are never used, granted I have not really looked into it.
Wow, that is slow. Well java's pow function uses floats on the exponent so it would be a lot faster if you used a temporary variable, and then multiplied it by its self. Other than that I don't know, most of these math functions should translate to a single ASM function so I don't know why it's that slow.
PS: I would probably inline the C function, but that's not a huge deal. My point was each language has its own optimizations; the only way to compare them is to write reasonably optimized programs in each of them and compare that.
Edit: I think Java's sin functions are slower than the HW implementation but more accurate, which is irrelevant because you don't care but it's something to look into. I dislike Java for other reasons, but it's easy to make code slow which code that looks very similar runs a lot faster.
2. Inlining in an actual C implementation doesn't matter because there is no register pressure and each trig function generates a real function call. Function calls to known addresses are really fast (~2 cycles), even indirect calls are fast compared to transcendental functions (~ 6 cycles). For comparison, the C implementation takes ~260 cycles each distance computation. This can't be improved much without vectorizing the trig functions.
> but a tight inner loop with a function call is begging for an inline.
The loop body makes several other function calls, you would be very hard pressed to measure the difference. And if it was a C function in a larger project, you probably wouldn't want to inline it because that would slow your compile times and cause you to lose modularity (i.e. you'd have to rebuild all client code rather than just update the DSO when you optimize the distance function). This function does enough work that it doesn't deserve to be inlined.
The finished program went down in runtime from 533.7s on my machine, to 24.42 seconds. If we adjust for machine speed (since the author's Python ran the python code in 474s), the comparative optimized result would be 21.7 seconds compared to author's hand-optimized lisp code at 30.2.
Note that this is probably an ideal case for Shedskin (which is a relatively new product: a 1-man job that's just a few years old compared to, I guess, hundreds if not thousands of man-years of research that have gone in LISP implementations).
Still, an interesting result.