When our startup was just getting going, search spiders were an unexpected problem for us.
Our project was a bit odd, in that we had millions of documents available to browse. We wanted them all to be available to search engines, but it was an operational headache to keep that much data moving.
Our normal daily traffic might have been 10,000 users with perhaps 50K page requests. We built our first small web server to handle this load.
We were blindsided by Googlebot, which from day one demanded hundreds of thousands of pages per day. Googlebot offers no option to slow it down, and if you try to slow it down on purpose your search ranking will suffer.
So 90% of our early operational burden was devoted to keeping Googlebot happy. This was an unexpected burden, but it was worth it to be in the Google index, because they sent a lot of traffic our way and kept our fledgling startup alive.
But it was definitely not worth it to spend equivalent operational expense to be in marginal search engines that don't refer much traffic. To a site operator, they don't offer any value in exchange for what they demand.
It was very interesting to see the relative volume of requests that we got from Googlebot versus the other crawlers. While Googlebot found us on day one, it took over 3 months before we appeared in Microsoft's index, and even then, they had perhaps only 10% of the coverage of our site when compared to Google.
I know at minimum GoogleBot and Yahoo! Slurp respect this value. I believe Google AdsBot (Landing page URL verification for AdWords) also supports this value, but GoogleBot and AdsBot each will crawl at the rate you specify (so if you specify a Crawl-Delay of 1, each bot will crawl at ~1qps). I don't know if it is in the spec, but fractional crawl delays appeared to be respected (Crawl-Delay: 0.25 would result in ~4qps, for example).
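To make the arithmetic above concrete, here is a minimal sketch that reads Crawl-delay values out of a robots.txt body and converts them to a request rate. Crawl-delay is not part of the original robots.txt convention and support varies by crawler, so treat this as illustration only. The function names `crawl_delays` and `max_qps` and the sample file are made up for this example, and the parser is a simplified hand-rolled one, not any crawler's actual logic.

```python
from typing import Dict, List


def crawl_delays(robots_txt: str) -> Dict[str, float]:
    """Map each User-agent (lowercased) to its Crawl-delay in seconds."""
    delays: Dict[str, float] = {}
    agents: List[str] = []
    in_rules = False  # True once the current group has seen a non-agent line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a new group starts after the previous group's rules
                agents = []
                in_rules = False
            agents.append(value.lower())
        else:
            in_rules = True
            if field == "crawl-delay":
                for agent in agents:
                    delays[agent] = float(value)  # fractional values allowed
    return delays


def max_qps(delay_seconds: float) -> float:
    """One request per delay interval, so the rate is the reciprocal."""
    return 1.0 / delay_seconds


# Hypothetical robots.txt body, not taken from any real site.
EXAMPLE = """\
User-agent: Googlebot
Crawl-delay: 1

User-agent: Slurp
Crawl-delay: 0.25
"""

delays = crawl_delays(EXAMPLE)
for agent, delay in delays.items():
    print(agent, "at most ~%.0f qps" % max_qps(delay))
```

Note that Python's standard `urllib.robotparser` only picks up whole-number Crawl-delay values, which is why this sketch parses the field by hand.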
I too have fought with this problem - at times more than 90% of the capacity of the site I was running was devoted to serving up content for bots. I sympathize with small (and large) companies who don't want to add capacity so that some new/random bot can add another <x> QPS to the daily baseline load.
Then deny everything until you are ready, but don't deny access to everything but the big boys (who are the ones with the real resources to hammer your site anyway).
I'm not saying you were in the wrong, but that is the stance I would (and do, with my current project) take. And of course, when we are ready, we will open the door to all indexing.
Just because you do this doesn't mean it makes sense for everyone. Allowing Google more access because it gives more reward makes perfect business sense. Why should anyone and everyone be allowed to crawl and scrape my website or nobody allowed? The world isn't a series of black and white decisions.
I think they just mean that it's a whole site/domain action; that you can't change crawl rate for a sub-directory based site à la Wordpress MU/Multisite with directories.
Ironically I believe that this is to protect the users' privacy on the Quora site. They have an option to hide your username from your answers, so if someone were to Google your name your Quora answers wouldn't appear.
If one wants to be an idiot or an asshole on the Internet, then one is free to do so. If one wants to attach their real name, address or some other identifiable bit of information to such outbursts, then once again, one is free to do so.
The same goes for life off of the Internet, which of course could be recorded or reported, and subsequently placed on the Internet.
The real problem is people mistakenly believe they can secretly operate with impunity in an age of increasing transparency.
You've successfully identified a fundamental problem with society, and it is by no means limited to Quora.
I think the search engine privacy feature is a fantastic innovation from Quora to help avoid potential problems in the future. (eg: you google my name and see my Quora answers on Sex)
If you wrote public answers on questions of sex, why should they be hidden? --You made your answers public intentionally and also provided the answers intentionally so others would hopefully learn from them. Additionally, you did both while knowing your name is attached to them. In other words, you want to be known both for and by your contributions.
If you did not wish to be known for and by your public contributions, then the only answer is to either not make public contributions, or go to rather extreme lengths to only make public contributions anonymously. The latter is fraught with plenty of caveats to maintain anonymity correctly, so the only real answer is to only make public contributions you would want attributed to you.
Back in the 80's when Chico State was the number 1 party school in the US, recruiters would show up in droves looking to get new employees. The unstated reason was very simple; if someone could graduate from Chico, it pretty much proved the person could not only get things done but also have a really fun time doing it.
Who would you rather work with? --A fiercely academic but socially limited graduate, or the guy who can get stuff done and also do keg stands?
If I searched on your name and found your public opinions, I'd think far more of you for having the stones to make your own contributions than I would be concerned about your predilection for midget porn.
Of course, the trouble with trying to be non-judgmental and unbiased is assuming others will do the same.
What about the rule that matches "*" at the end? That's there for all the rest - the others are just fine-tuned for mutual benefit.
There is no rule in there that denies other search engines, just some rules that set some slightly different rules for different ones.
EDIT: I see now - the rule at the end is rather specific, and wouldn't allow crawling the entire site. Still - without actual complaints from someone who contacted them and got no response, I see no evil.
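For anyone who wants to check how a catch-all "*" group interacts with the per-crawler groups, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt body, the `example.com` URLs and the bot name `SomeNewBot` are all invented for illustration. This is not Quora's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler gets a light rule,
# everyone else falls through to a restrictive catch-all group.
ROBOTS = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

# The named crawler is governed by its own group only.
google_page = parser.can_fetch("Googlebot", "https://example.com/page")
google_private = parser.can_fetch("Googlebot", "https://example.com/private/x")

# A crawler with no group of its own is governed by the "*" group.
other_page = parser.can_fetch("SomeNewBot", "https://example.com/page")

print(google_page, google_private, other_page)
```

So whether a "*" group amounts to "everyone else is welcome" or "everyone else is locked out" depends entirely on the Disallow lines inside it.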
Your job as a startup is to sit around all day coming up with seemingly important news stories that just happen to revolve around your company. Get some press contacts to feed those stories to, and you're sorted for marketing.
Bonus points for attacking the Big Competitor in a public way, thus forcing major news outlets to cover the "story" and ask Big Competitor for a quote about you.
>why has every article about search engines recently ended up talking about DDG?
Because they're the new darling, and tech types like them. It's always nice to see someone come in as a substantial underdog and make a real effort to succeed. Also, their results seem to be at least as good as Google's which is pretty impressive.
"Also, their results seem to be at least as good as Google's which is pretty impressive."
They're Bing with some twiddles on top. DDG uses Yahoo! BOSS, which is Yahoo's API to its search service, which is now powered by Bing. Your DDG results all originally come from Bing's index, they just get filtered and possibly re-ranked by some custom code Gabriel's written.
Alright. I think the spirit of what our ancestor poster was saying is that it's mostly just powered by an API (by which most of the "work" is being done). In a previous discussion I asked if you were going to do full-scale Web crawls, and you said:
>We never stopped crawling; we just scaled it back for specific purposes, mainly now for spam detection and zero-click info
So if you've scaled back the crawling and are mostly using it just for spam detection and zero-click info (forgive my ignorance, but I'm not sure what zero-click info is) then does that mean that you are getting the majority of your result data from outside APIs like BOSS?
Edit: And does that mean that you store those pages in your database and use your own ranking algorithm, or are you using their ranked results for a query and then rearranging them on the fly?
No, the spirit of the above comment was condescending, dismissive and only mentioned one source: They're Bing with some twiddles on top. (I believe he is a Google employee.)
About 50% of queries show information from my index. I've concentrated on the fat head of the search space, where I get a lot of volume (1 and 2 word queries). For long-tail stuff, yes, the majority of results come from external APIs, but even there it is inaccurate to say they are all derived from a direct call to one API. It really varies by query. Some will look like Bing for sure, but others will look very different. I'm not going to disclose everything I do of course.
>No, the spirit of the above comment was condescending, dismissive
Well, I didn't even think of it that way. Naturally you're probably looking at these types of criticisms under a magnifying glass since it's your project.
>About 50% of queries show information from my index. I've concentrated on the fat head of the search space, where I get a lot of volume (1 and 2 word queries).
See now that's actually good to know! From many people's point of view (including mine), DDG is Bing with some twiddles on top. Maybe out of ignorance, but what can you expect? If you aren't already enamored by the idea of DDG then you probably don't care enough to find out what you're really doing on the back-end, which is what really matters most to me in these discussions. (Not the results since unless I'm told otherwise I just assume it's Bing or Yahoo.)
Edit: It's quite humorous to be down-voted by the fan boys for expressing an honest and IMO helpful opinion to the owner of DDG. If all he gets is blind love he won't know how to get the rest of us on board. Believe it or not, the down vote button isn't the disagree button. If you have something constructive to add, say it.
See, and you were doing so well until the edit. I was about to upvote your rather interesting reply, but reddit has left an unforgivably bad taste in my mouth for people who say "not sure why the downvotes" or "I'll probably get downvoted, but..."
I'm not going to downvote you, but next time just say what you have to say. If people like it and upvote it, great. If not, it's a meaningless metric on a fairly small social news site. There are far more important things to worry about.
I understand your point, but I'm not worried about the actual points being added or subtracted from my karma. People here tend to continue up-voting something that's been up-voted or continue down-voting something that's been down-voted. I'm attempting to curb any further lack of real consideration for my comments due to the blind nature of many of the users here regarding voting etiquette. If you've seen me around here before you'll know that I often post comments that attempt to uphold what I consider to be the HN standard. Which is why I ended it with "if you have something constructive to add, say it." I realize some users may find this grating, but perhaps those users aren't the target audience for the remark to begin with. Thanks for the heads up though.
To add to that, if I were trolling or being rude or vicious I would not have edited to bring up the down-voting. I hate to see this site slip into the bad habit of down-voting legitimate and helpful points due to simple disagreement with the commenter. With so many new users coming to HN daily it's important to occasionally remind people that this isn't Reddit, it's not Digg, and it's not Slashdot.
I actually have a lot of respect for DDG and what you've accomplished (well, until the recent ad campaign, but that's a different issue...)
My impression from the FAQ and your comments here is that you start with Yahoo! BOSS results, merge in other results that you suspect may be useful from your own crawl, filter out known spam pages, and possibly re-rank things afterwards according to whatever algorithms you have. Hence, "Bing with a bunch of twiddles on top." I don't really get a different impression after reading this comment - "50% of queries show information from my index" could mean as little as reranking one result, and query distribution is known to be heavily biased towards the fat head.
Although, I just looked at one not-quite-head-but-not-quite-long-tail query ([modern family episode list]):
And the top 3 are identical, but the rest of the top 10 are about as different from Bing as Bing is from Google. Twiddles can be fairly powerful.
You obviously don't have to disclose everything you do in response to criticism - Google and Bing certainly don't. But that was my perception from your FAQ and posts, and certainly seems to be the perception of many other HN users as well.
> And does that mean that you store those pages in your database and use your own ranking algorithm, or are you using their ranked results for a query and then rearranging them on the fly?
I have looked at those search APIs and they don't let you write your own ranking formula. They just give you back 10-20-30 top results. There is no info on term frequency, inverted document frequency, pagerank or any other component of a decent ranking formula. So you can rank yourself the pages from your own crawls, but you can't merge this nicely with the API results.
If I were Gabriel, I would merge them in some crude way basically assuming my results are better, and add those URLs to the list for my crawler. The top URLs for all one and two word queries aren't that many.
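A rough sketch of what such a crude merge could look like, purely as a guess: put your own crawl's results first, append the API's results behind them, and drop duplicates. The function name `crude_merge` and the sample URLs are hypothetical, and nothing here reflects how DDG actually combines results.

```python
from typing import Iterable, List


def crude_merge(own_results: Iterable[str],
                api_results: Iterable[str],
                limit: int = 10) -> List[str]:
    """Own results first (assumed better), then API results, de-duplicated."""
    merged: List[str] = []
    seen = set()
    for url in list(own_results) + list(api_results):
        if url not in seen:  # keep only the first occurrence of each URL
            seen.add(url)
            merged.append(url)
    return merged[:limit]


# Hypothetical result lists for one query.
own = ["https://example.com/a", "https://example.com/b"]
api = ["https://example.com/b", "https://example.com/c"]
merged = crude_merge(own, api)
print(merged)
```

No scores are compared because, as noted above, the API results come back without the ranking signals you would need to interleave them properly.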
This is like saying a Mac is just an Intel PC. If DDG re-ranks the results, or provides a better interface for reading the results, or provides some privacy when obtaining the results...
Those are all big wins by themselves. And speaking as a user, if it's doing all three I am only mildly curious about how it gets those results.
I would expect an article about search engines to mention all the credibly useful search engines. Though I usually use Google, if we remove that as an option my next choice is DDG.
I think he meant in terms of traffic and popularity. If you're measuring hardware, "large" is so very relative. Someone's going to beat you with a bigger stick, unless you're the NSA.
"[Quora's] robots.txt file explicitly grants Google, Bing, Blekko and other big players access"
Blekko is a big player? It launched two months ago and it's still in public beta. According to Compete.com [1], it attracts only 120,000 visitors a month.
I'd say that if Blekko is a big player, DuckDuckGo is one too.
I agree wholeheartedly. For comparison, I currently see ~95% of search engine referral traffic from Google, 2.8% from Yahoo, 1.4% from Bing, and <1% from all other search engines combined. It makes no sense to let X, Y, Z random search engines index content like Google, when they send nowhere near the traffic. As others have mentioned, blocking by robots.txt often doesn't work as some bots don't obey it. I currently use an extensive rewrite rule to block those bots' User-Agents, and block specific IPs as a last resort.
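As an illustration of the User-Agent and IP blocking idea, here is a minimal sketch written as Python WSGI middleware rather than a web server rewrite rule. It is a stand-in for the technique, not the rule set described above. The bot names in `BLOCKED_AGENTS` are placeholders, the address in `BLOCKED_IPS` is from a documentation-only range, and `block_bots`, `site` and `status_for` are names invented for this example.

```python
import re

# Hypothetical blocklists: placeholder bot names and a documentation-range IP.
BLOCKED_AGENTS = re.compile(r"(examplebot|randomcrawler)", re.IGNORECASE)
BLOCKED_IPS = {"203.0.113.7"}


def block_bots(app):
    """Wrap a WSGI app and answer 403 for blocked User-Agents or IPs."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        ip = environ.get("REMOTE_ADDR", "")
        if BLOCKED_AGENTS.search(agent) or ip in BLOCKED_IPS:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Stand-in for the real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(agent, ip="198.51.100.1"):
    """Call the wrapped app once and return the status line it produced."""
    seen = []

    def start_response(status, headers):
        seen.append(status)

    block_bots(site)({"HTTP_USER_AGENT": agent, "REMOTE_ADDR": ip}, start_response)
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; ExampleBot/1.0)"))
print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"))
```

The User-Agent check is cheap but trivially spoofed, which is why the IP set sits behind it as the last resort.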
Problem is, that by disallowing those smaller players, you are preventing them from getting bigger because they would need to be able to index sites to actually become good enough to attract visitors.
Also, when disallowing a spider you can be sure that you won't get ANY visitors from them, as you are not in their index to start with.
I mean: if I disallow any bot but Googlebot, I will have 100% Google referrals by design.
I can understand that in an emergency situation you might want to block some bots, but blocking all of them because they are currently small feels a) a bit unfair due to not giving them the chance to grow and b) shortsighted due to never knowing if they might be getting big enough or the visitors they deliver might be better ad-clickers.
Speaking from experience, this is why robots.txt files don't work; they've become anti-competitive in nature. If you actually build code to obey them you get locked out of actual sites, while the real scrapers you don't want don't listen to robots.txt anyways.
If a site really doesn't want you there, they can block you at the IP/User-Agent level, which is what Quora will end up doing.
Curious if you have any citations for this remark. I have always been under the impression that robots.txt is entirely optional, both to implement and to respect.
Essentially, the existence of robots.txt and meta noindex has made courts more comfortable ruling that the index of search engines constitutes fair use.
But this still doesn't specifically answer the question, "What is the penalty for ignoring 'User-agent:* Disallow: /' in robots.txt, or meta tag no-index", only that without a robots.txt or no-index tag, crawlers can't be found liable for accessing information that is freely available to them.
Interesting link, however. I was unaware of that case.
5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.
It looks to me like it doesn't block anything - it does have some specific settings tailored for various big-boy search engines, with a catch-all rule at the end for those not otherwise defined.
It also says right there at the top:
"# If you operate a search engine and would like to crawl Quora, please
# email info@quora.com. Thanks."
So they're trying to get the right data to the right search engines the right way - makes sense to me.
Search engines can be a real PITA for a site. They can take a site down. When it's my site - I don't have to let you, or anyone else, index it. We can do it nicely through robots.txt, or cold-war style through firewalls and script and tarpits and who knows what else.
So - assuming Quora took the time to put what they did in their robots.txt, in a world where many sites still don't bother at all - one can assume they are paying close attention to the business value they are extracting from search engine driven traffic.
Craigslist's policy of specifically disallowing only classified search engines should look kinder in retrospect.
It is reasonable to block sites which aim to merely re-frame your content within a "search" parameter. But that kind of thing can be dealt with through terms of service.
The problem is that it takes time to actively filter against offenders.
If they are blocking to save resources, then it might be worth noting that startup search engines will not bomb their servers the way Google and Bing can. Are there any published numbers as to how the crawling load varies from startups to Googlebot?
Just about everybody effectively blocks (or throttles to near zero) all but the top 3 or so crawlers. For a popular commerce site a significant fraction (like 1/3rd) of the infrastructure capacity is just responding to crawler traffic.