Alright. I think the spirit of what our ancestor poster was saying is that it's mostly just powered by an API (by which most of the "work" is being done). In a previous discussion I asked if you were going to do full-scale Web crawls, and you said:
>We never stopped crawling; we just scaled it back for specific purposes, mainly now for spam detection and zero-click info
So if you've scaled back the crawling and are mostly using it just for spam detection and zero-click info (forgive my ignorance, but I'm not sure what zero-click info is) then does that mean that you are getting the majority of your result data from outside APIs like BOSS?
Edit: And does that mean that you store those pages in your database and use your own ranking algorithm, or are you using their ranked results for a query and then rearranging them on the fly?
No, the spirit of the above comment was condescending, dismissive and only mentioned one source: They're Bing with some twiddles on top. (I believe he is a Google employee.)
About 50% of queries show information from my index. I've concentrated on the fat head of the search space, where I get a lot of volume (1 and 2 word queries). For long-tail stuff, yes, the majority of results come from external APIs, but even there it is inaccurate to say they are all derived from a direct call to one API. It really varies by query. Some will look like Bing for sure, but others will look very different. I'm not going to disclose everything I do of course.
>No, the spirit of the above comment was condescending, dismissive
Well, I didn't even think of it that way. Naturally you're probably looking at these types of criticisms under a magnifying glass since it's your project.
>About 50% of queries show information from my index. I've concentrated on the fat head of the search space, where I get a lot of volume (1 and 2 word queries).
See now that's actually good to know! From many people's point of view (including mine), DDG is Bing with some twiddles on top. Maybe out of ignorance, but what can you expect? If you aren't already enamored of the idea of DDG then you probably don't care enough to find out what you're really doing on the back-end, which is what really matters most to me in these discussions. (Not the results, since unless I'm told otherwise I just assume it's Bing or Yahoo.)
Edit: It's quite humorous to be down-voted by the fan boys for expressing an honest and IMO helpful opinion to the owner of DDG. If all he gets is blind love he won't know how to get the rest of us on board. Believe it or not, the down vote button isn't the disagree button. If you have something constructive to add, say it.
See, and you were doing so well until the edit. I was about to upvote your rather interesting reply, but Reddit has left an unforgivably bad taste in my mouth for people who say "not sure why the downvotes" or "I'll probably get downvoted, but..."
I'm not going to downvote you, but next time just say what you have to say. If people like it and upvote it, great. If not, it's a meaningless metric on a fairly small social news site. There are far more important things to worry about.
I understand your point, but I'm not worried about the actual points being added or subtracted from my karma. People here tend to continue up-voting something that's been up-voted or continue down-voting something that's been down-voted. I'm attempting to curb any further lack of real consideration for my comments due to the blind nature of many of the users here regarding voting etiquette. If you've seen me around here before you'll know that I often post comments that attempt to uphold what I consider to be the HN standard. Which is why I ended it with "if you have something constructive to add, say it." I realize some users may find this grating, but perhaps those users aren't the target audience for the remark to begin with. Thanks for the heads up though.
To add to that, if I were trolling or being rude or vicious I would not have edited to bring up the down-voting. I hate to see this site slip into the bad habit of down-voting legitimate and helpful points due to simple disagreement with the commenter. With so many new users coming to HN daily it's important to occasionally remind people that this isn't Reddit, it's not Digg, and it's not Slashdot.
I actually have a lot of respect for DDG and what you've accomplished (well, until the recent ad campaign, but that's a different issue...)
My impression from the FAQ and your comments here is that you start with Yahoo! BOSS results, merge in other results that you suspect may be useful from your own crawl, filter out known spam pages, and possibly re-rank things afterwards according to whatever algorithms you have. Hence, "Bing with a bunch of twiddles on top." I don't really get a different impression after reading this comment - "50% of queries show information from my index" could mean as little as reranking one result, and query distribution is known to be heavily biased towards the fat head.
Although, I just looked at one not-quite-head-but-not-quite-long-tail query ([modern family episode list]):
And the top 3 are identical, but the rest of the top 10 are about as different from Bing as Bing is from Google. Twiddles can be fairly powerful.
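For anyone who wants to put a number on "about as different", here is a minimal Python sketch of that kind of comparison. Everything in it is an assumption for illustration: the function name `compare_rankings`, the choice of metrics (length of the identical prefix plus Jaccard overlap of the top k), and the placeholder URLs, which are made up and are not the actual results for that query.

```python
def compare_rankings(a, b, k=10):
    """Compare the top-k of two ranked URL lists.

    Returns (shared_prefix, jaccard):
      shared_prefix - how many leading positions are identical in both lists
      jaccard       - overlap of the two top-k sets, ignoring order (0.0 to 1.0)
    """
    a, b = a[:k], b[:k]

    # Count identical results from the top until the first mismatch.
    shared_prefix = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared_prefix += 1

    # Order-insensitive overlap of the two result sets.
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    jaccard = len(set_a & set_b) / len(union) if union else 1.0
    return shared_prefix, jaccard


# Placeholder data only: same top 3, different results after that.
engine_one = ["u1", "u2", "u3", "u4", "u5"]
engine_two = ["u1", "u2", "u3", "u6", "u7"]
prefix, overlap = compare_rankings(engine_one, engine_two)
```

Running the same function on two pairs of engines would show whether one pair really is "about as different" as the other, at least by these two crude measures.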
You obviously don't have to disclose everything you do in response to criticism - Google and Bing certainly don't. But that was my perception from your FAQ and posts, and certainly seems to be the perception of many other HN users as well.
> And does that mean that you store those pages in your database and use your own ranking algorithm, or are you using their ranked results for a query and then rearranging them on the fly?
I have looked at those search APIs and they don't let you write your own ranking formula. They just give you back the top 10, 20, or 30 results. There is no info on term frequency, inverse document frequency, PageRank, or any other component of a decent ranking formula. So you can rank the pages from your own crawls yourself, but you can't merge them cleanly with the API results.
If I were Gabriel, I would merge them in some crude way, basically assuming my results are better, and add those URLs to the list for my crawler. The top URLs for all one- and two-word queries aren't that many.
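A crude merge like that could be sketched roughly as follows. This is only my guess at the shape of it, not how DDG actually does it: the function name `crude_merge`, the `limit` parameter, and the example URLs are all hypothetical. It assumes my own results are simply better, so they go first, and any API URL I haven't indexed yet gets queued for the crawler.

```python
def crude_merge(own_results, api_results, limit=10):
    """Merge two ranked URL lists with no scoring information.

    Own results are assumed to be better, so they come first; API results
    fill the remaining slots. Duplicates are dropped, keeping the first
    occurrence.

    Returns (merged, to_crawl):
      merged   - at most `limit` URLs to show for the query
      to_crawl - API URLs missing from our own results, in API order,
                 to be added to the crawler's list
    """
    own = list(own_results)
    api = list(api_results)

    merged = []
    seen = set()
    for url in own + api:
        if url not in seen:
            seen.add(url)
            merged.append(url)

    own_set = set(own)
    # dict.fromkeys removes duplicate API URLs while preserving order.
    to_crawl = [url for url in dict.fromkeys(api) if url not in own_set]

    return merged[:limit], to_crawl


# Hypothetical example data.
own = ["a.example", "b.example"]
api = ["b.example", "c.example", "d.example", "c.example"]
merged, to_crawl = crude_merge(own, api, limit=3)
```

Note that nothing here is an actual ranking formula. Without term frequencies or link scores from the API, position in each list is the only signal available, which is why the merge can't be much smarter than this.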