Hacker News
Quora Blocks Startup Search Engines (readwriteweb.com)
136 points by bjonathan on Jan 27, 2011 | 78 comments


When our startup was just getting going, search spiders were an unexpected problem for us.

Our project was a bit odd, in that we had millions of documents available to browse. We wanted them all to be available to search engines, but it was an operational headache to keep that much data moving.

Our normal daily traffic might have been 10,000 users with perhaps 50K page requests. We built our first small web server to handle this load.

We were blindsided by Googlebot, which from day one demanded hundreds of thousands of pages per day. Googlebot offers no option to slow it down, and if you try to slow it down on purpose your search ranking will suffer.

So 90% of our early operational burden was devoted to keeping Googlebot happy. This was an unexpected burden, but it was worth it to be in the Google index, because they sent a lot of traffic our way and kept our fledgling startup alive.

But it was definitely not worth it to spend equivalent operational expense to be in marginal search engines that don't refer much traffic. To a site operator, they don't offer any value in exchange for what they demand.

It was very interesting to see the relative volume of requests that we got from Googlebot versus the other crawlers. While Googlebot found us on day one, it took over 3 months before we appeared in Microsoft's index, and even then, they had perhaps only 10% of the coverage of our site when compared to Google.


FWIW, most of the "Major Players" support the Crawl-Delay directive in robots.txt, see: http://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl...

I know at minimum GoogleBot and Yahoo! Slurp respect this value. I believe Google AdsBot (Landing page URL verification for AdWords) also supports this value, but GoogleBot and AdsBot each will crawl at the rate you specify (so if you specify a Crawl-Delay of 1, each bot will crawl at ~1qps). I don't know if it is in the spec, but fractional crawl delays appeared to be respected (Crawl-Delay: 0.25 would result in ~4qps, for example).
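To make the arithmetic above concrete, here is a minimal sketch of how a Crawl-Delay value translates into request rate, assuming each bot simply waits the stated number of seconds between requests. The function names are made up for illustration and are not part of any crawler's API.

```python
SECONDS_PER_DAY = 86400


def crawl_rate_qps(crawl_delay_seconds: float) -> float:
    """Approximate queries per second for one bot honoring the delay."""
    if crawl_delay_seconds <= 0:
        raise ValueError("crawl delay must be positive")
    return 1.0 / crawl_delay_seconds


def daily_requests(crawl_delay_seconds: float, bots: int = 1) -> int:
    """Upper bound on requests per day if each bot crawls at the full rate."""
    return int(bots * SECONDS_PER_DAY * crawl_rate_qps(crawl_delay_seconds))


if __name__ == "__main__":
    # Crawl-Delay: 0.25 corresponds to roughly 4 qps per bot.
    print(crawl_rate_qps(0.25))
    # Two bots (e.g. a search bot and an ads bot) each honoring Crawl-Delay: 1.
    print(daily_requests(1, bots=2))
```

Because each bot applies the delay independently, the load on the site scales with the number of bots that crawl it, which is why two bots at a delay of 1 add up to about 2 qps.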

I too have fought with this problem - at times more than 90% of the capacity of the site I was running was devoted to serving up content for bots. I sympathize with small (and large) companies who don't want to add capacity so that new/random bot can add another <x> QPS to the daily baseline load.


You can slow down Googlebot via the Google Webmaster Tools


At the time, I recall that our choices were between "too fast" or "way, way too fast".


Then deny everything until you are ready, but don't deny access to everything but the big boys (who are the ones with the real resources to hammer your site anyway).

I'm not saying you were in the wrong, but that is the stance I would (and do, with my current project) take. And of course, when we are ready, we will open the door to all indexing.


Just because you do this doesn't mean it makes sense for everyone. Allowing Google more access because it gives more reward makes perfect business sense. Why should anyone and everyone be allowed to crawl and scrape my website or nobody allowed? The world isn't a series of black and white decisions.



"You can't change the crawl rate for sites that are not at the root level—for example, www.example.com/folder."

What if Googlebot starts at a non-root location (via a link from another site)?

Will it always start at root?


I think they just mean that it's a whole site/domain action; that you can't change the crawl rate for a sub-directory based site à la WordPress MU/Multisite with directories.


Did anyone actually go through the robots.txt file? http://www.quora.com/robots.txt

The first two lines are:

    # If you operate a search engine and would like to crawl Quora, please
    # email info@quora.com. Thanks.
Looks sensible to me.


Yeah, because mailing every two-cent site to get permission to crawl is a sensible way to build a search engine.

Quora should just get over themselves.


Ironically, I believe that this is to protect the users' privacy on the Quora site. They have an option to hide your username from your answers, so if someone were to Google your name your Quora answers wouldn't appear.


If one wants to be an idiot or an asshole on the Internet, then one is free to do so. If one wants to attach their real name, address or some other identifiable bit of information to such outbursts, then once again, one is free to do so.

The same goes for life off of the Internet, which of course could be recorded or reported, and subsequently placed on the Internet.

The real problem is people mistakenly believe they can secretly operate with impunity in an age of increasing transparency.


You've successfully identified a fundamental problem with society, and it is by no means limited to Quora.

I think the search engine privacy feature is a fantastic innovation from Quora to help absolve potential problems in the future. (eg: you google my name and see my Quora answers on Sex)


The supposed "potential problems" are fictitious.

If you wrote public answers on questions of sex, why should they be hidden? --You made your answers public intentionally and also provided the answers intentionally so others would hopefully learn from them. Additionally, you did both while knowing your name is attached to them. In other words, you want to be known both for and by your contributions.

If you did not wish to be known for and by your public contributions, then the only answer is to either not make public contributions, or go to rather extreme lengths to only make public contributions anonymously. The latter is fraught with plenty of caveats to maintain anonymity correctly, so the only real answer is to only make public contributions you would want attributed to you.

Back in the 80's when Chico State was the number 1 party school in the US, recruiters would show up in droves looking to get new employees. The unstated reason was very simple; if someone could graduate from Chico, it pretty much proved the person could not only get things done but also have a really fun time doing it.

Who would you rather work with? --A fiercely academic but socially limited graduate, or the guy who can get stuff done and also do keg stands?

If I searched on your name and found your public opinions, I'd think far more of you for having the stones to make your own contributions than I would be concerned about your predilection for midget porn.

Of course, the trouble with trying to be non-judgmental and unbiased is assuming others will do the same.


If Quora wanted privacy for their users, wouldn't they encourage them to sign up under a nick instead of with their real name?


Ah, but do you know what the results of that emailing have been?

Five bots are listed. Perhaps they're the only ones who bothered to email. Or perhaps not.


What about the rule that matches "*" at the end? That's there for all the rest - the others are just fine-tuned for mutual benefit.

There is no rule in there that denies other search engines, just some rules that set some slightly different rules for different ones.

EDIT: I see now - the rule at the end is rather specific, and wouldn't allow crawling the entire site. Still - without actual complaints from someone who contacted them and got no response, I see no evil.


Could you clarify what the line:

    Disallow: /
means?


It's indicating the root of the site. Documentation for further reading at http://www.robotstxt.org/robotstxt.html
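As a rough illustration of what `Disallow: /` does in practice, the sketch below feeds a made-up robots.txt to Python's standard `urllib.robotparser`. The file contents, the bot name `SmallStartupBot`, and the example.com URL are assumptions for the example, not Quora's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is allowed everywhere,
# every other crawler is denied everything under the site root.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    url = "http://example.com/some/page"
    print(parser.can_fetch("Googlebot", url))
    print(parser.can_fetch("SmallStartupBot", url))
```

Since every path on a site starts with `/`, a `Disallow: /` under `User-agent: *` excludes the whole site for any crawler that has no more specific section of its own.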


Indeed,

It seemed self-explanatory here.

I was being perhaps a bit sarcastic...


Did you actually read the article?

EDIT: my bad. I got confused between Facebook and Quora


Running a scraping site pretending to be a search engine is an oldie, but a goodie.

http://www.google.com/#sclient=psy&hl=en&q=site:duck...

Way to play the openness card Gabriel.


"...Google, Bing, Blekko and other big players access..."

Wait, since when was Blekko big? Also, why has every article about search engines recently ended up talking about DDG? :p


That's how news works. This story is a story because DDG made it one:

http://www.paulgraham.com/submarine.html

Your job as a startup is to sit around all day coming up with seemingly important news stories that just happen to revolve around your company. Get some press contacts to feed those stories to, and you're sorted for marketing.

Bonus points for attacking the Big Competitor in a public way, thus forcing major news outlets to cover the "story" and ask Big Competitor for a quote about you.


I didn't make this story. I replied to a comment on twitter that was @replied to me and then Pete came to me for a quote.


>why has every article about search engines recently ended up talking about DDG?

Because they're the new darling, and tech types like them. It's always nice to see someone come in as a substantial underdog and make a real effort to succeed. Also, their results seem to be at least as good as Google's which is pretty impressive.


"Also, their results seem to be at least as good as Google's which is pretty impressive."

They're Bing with some twiddles on top. DDG uses Yahoo! BOSS, which is Yahoo's API to its search service, which is now powered by Bing. Your DDG results all originally come from Bing's index, they just get filtered and possibly re-ranked by some custom code Gabriel's written.


Actually, that's not true and it really isn't hard to verify so I'm not sure why you keep saying that.


Where are you getting your results?


From the FAQ:

How do you get your results?

From over 30 sources, including DuckDuckBot (our own crawler), crowd-sourced sites, Yahoo! BOSS, embed.ly, WolframAlpha, EntireWeb, Bing & Blekko.

Just do some searches and compare to Bing, etc., e.g. http://duckduckgo.com/?q=hacker+news vs http://www.bing.com/search?q=hacker%20news vs http://www.google.com/search?hl=en&q=hacker%20news


Alright. I think the spirit of what our ancestor poster was saying is that it's mostly just powered by an API (by which most of the "work" is being done). In a previous discussion I asked if you were going to do full-scale Web crawls, and you said:

>We never stopped crawling; we just scaled it back for specific purposes, mainly now for spam detection and zero-click info

So if you've scaled back the crawling and are mostly using it just for spam detection and zero-click info (forgive my ignorance, but I'm not sure what zero-click info is) then does that mean that you are getting the majority of your result data from outside APIs like BOSS?

Edit: And does that mean that you store those pages in your database and use your own ranking algorithm, or are you using their ranked results for a query and then rearranging them on the fly?


No, the spirit of the above comment was condescending, dismissive and only mentioned one source: They're Bing with some twiddles on top. (I believe he is a Google employee.)

About 50% of queries show information from my index. I've concentrated on the fat head of the search space, where I get a lot of volume (1 and 2 word queries). For long-tail stuff, yes, the majority of results come from external APIs, but even there it is inaccurate to say they are all derived from a direct call to one API. It really varies by query. Some will look like Bing for sure, but others will look very different. I'm not going to disclose everything I do of course.


>No, the spirit of the above comment was condescending, dismissive

Well, I didn't even think of it that way. Naturally you're probably looking at these types of criticisms under a magnifying glass since it's your project.

>About 50% of queries show information from my index. I've concentrated on the fat head of the search space, where I get a lot of volume (1 and 2 word queries).

See now that's actually good to know! From many people's point of view (including mine), DDG is Bing with some twiddles on top. Maybe out of ignorance, but what can you expect? If you aren't already enamored by the idea of DDG then you probably don't care enough to find out what you're really doing on the back-end, which is what really matters most to me in these discussions. (Not the results since unless I'm told otherwise I just assume it's Bing or Yahoo.)

Edit: It's quite humorous to be down-voted by the fan boys for expressing an honest and IMO helpful opinion to the owner of DDG. If all he gets is blind love he won't know how to get the rest of us on board. Believe it or not, the down vote button isn't the disagree button. If you have something constructive to add, say it.


See, and you were doing so well until the edit. I was about to upvote your rather interesting reply, but reddit has left an unforgivably bad taste in my mouth for people who say "not sure why the downvotes" or "I'll probably get downvoted, but..."

I'm not going to downvote you, but next time just say what you have to say. If people like it and upvote it, great. if not, it's a meaningless metric on a fairly small social news site. There are far more important things to worry about.


I understand your point, but I'm not worried about the actual points being added or subtracted from my karma. People here tend to continue up-voting something that's been up-voted or continue down-voting something that's been down-voted. I'm attempting to curb any further lack of real consideration for my comments due to the blind nature of many of the users here regarding voting etiquette. If you've seen me around here before you'll know that I often post comments that attempt to uphold what I consider to be the HN standard. Which is why I ended it with "if you have something constructive to add, say it." I realize some users may find this grating, but perhaps those users aren't the target audience for the remark to begin with. Thanks for the heads up though.

To add to that, if I were trolling or being rude or vicious I would not have edited to bring up the down-voting. I hate to see this site slip into the bad habit of down-voting legitimate and helpful points due to simple disagreement with the commenter. With so many new users coming to HN daily it's important to occasionally remind people that this isn't Reddit, it's not Digg, and it's not Slashdot.


I actually have a lot of respect for DDG and what you've accomplished (well, until the recent ad campaign, but that's a different issue...)

My impression from the FAQ and your comments here is that you start with Yahoo! BOSS results, merge in other results that you suspect may be useful from your own crawl, filter out known spam pages, and possibly re-rank things afterwards according to whatever algorithms you have. Hence, "Bing with a bunch of twiddles on top." I don't really get a different impression after reading this comment - "50% of queries show information from my index" could mean as little as reranking one result, and query distribution is known to be heavily biased towards the fat head.

Although, I just looked at one not-quite-head-but-not-quite-long-tail query ([modern family episode list]):

http://duckduckgo.com/?q=modern+family+episode+list

http://www.bing.com/search?q=modern+family+episode+list

And the top 3 are identical, but the rest of the top 10 are about as different from Bing as Bing is from Google. Twiddles can be fairly powerful.

You obviously don't have to disclose everything you do in response to criticism - Google and Bing certainly don't. But that was my perception from your FAQ and posts, and certainly seems to be the perception of many other HN users as well.


> And does that mean that you store those pages in your database and use your own ranking algorithm, or are you using their ranked results for a query and then rearranging them on the fly?

I have looked at those search APIs and they don't let you write your own ranking formula. They just give you back the top 10, 20 or 30 results. There is no info on term frequency, inverse document frequency, PageRank or any other component of a decent ranking formula. So you can rank the pages from your own crawls yourself, but you can't merge this nicely with the API results.

If I were Gabriel, I would merge them in some crude way basically assuming my results are better, and add those URLs to the list for my crawler. The top URLs for all one and two word queries aren't that many.
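The crude merge described above could look something like the following sketch. It is purely a guess at one possible approach, not how DDG works: own-index results are assumed to be better and go first, API results fill the remaining slots, and any API URL not yet in the own index is queued for crawling. All names here are hypothetical.

```python
from typing import Iterable, List


def merge_results(own: Iterable[str], api: Iterable[str], limit: int = 10) -> List[str]:
    """Concatenate own results ahead of API results, dropping duplicate URLs."""
    seen = set()
    merged = []
    for url in list(own) + list(api):
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged[:limit]


def crawl_candidates(own: Iterable[str], api: Iterable[str]) -> List[str]:
    """API URLs missing from the own index, in API rank order."""
    known = set(own)
    return [url for url in api if url not in known]


if __name__ == "__main__":
    own_hits = ["http://a.example/", "http://b.example/"]
    api_hits = ["http://b.example/", "http://c.example/", "http://d.example/"]
    print(merge_results(own_hits, api_hits, limit=3))
    print(crawl_candidates(own_hits, api_hits))
```

No score blending is attempted, because the API returns only an ordered list with no ranking signals to blend.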


Considering it's basically just a syndicated Bing search, I don't see what the fuss over results is.


This is like saying a Mac is just an Intel PC. If DDG re-ranks the results, or provides a better interface for reading the results, or provides some privacy when obtaining the results...

Those are all big wins by themselves. And speaking as a user, if it's doing all three I am only mildly curious about how it gets those results.


We were driving into SF today, saw their billboard, and it made me happy.


I would expect an article about search engines to mention all the credibly useful search engines. Though I usually use Google, if we remove that as an option my next choice is DDG.


Blekko have around 800 servers in their data center, each with 64 GB RAM and eight SATA drives.

http://www.readwriteweb.com/hack/2010/12/the-secrets-behind-...

That's pretty big in my book.


I think he meant in terms of traffic and popularity. If you're measuring hardware, "large" is so very relative. Someone's going to beat you with a bigger stick, unless you're the NSA.


From the article:

"[Quora's] robots.txt file explicitly grants Google, Bing, Blekko and other big players access"

Blekko is a big player? It launched two months ago and it's still in public beta. According to Compete.com [1], it attracts only 120,000 visitors a month.

I'd say that if Blekko is a big player, DuckDuckGo is one too.

[1] http://siteanalytics.compete.com/blekko.com/?metric=uv


Blekko has around 800 servers, each with 64 GB RAM and 8 TB of storage across SATA drives.

http://www.readwriteweb.com/hack/2010/12/the-secrets-behind-...

Has DuckDuckGo published the details of their data-center anywhere?

Also, Blekko may have launched recently, but it was founded in mid-2007.


I agree wholeheartedly. For comparison, I currently see ~95% of search engine referral traffic from Google, 2.8% from Yahoo, 1.4% from Bing, and <1% from all other search engines combined. It makes no sense to let X, Y, Z random search engines index content like Google, when they send nowhere near the traffic. As others have mentioned, blocking by robots.txt often doesn't work as some bots don't obey it. I currently use an extensive rewrite rule to block those bots' User-Agents, and block specific IPs as a last resort.
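The comment above uses web server rewrite rules; the same decision can be sketched in application code. This is a minimal illustration, assuming a blocklist of User-Agent substrings and IP ranges. The bot tokens and the address range (a reserved documentation range) are placeholders, not a list of real offenders.

```python
import ipaddress

# Hypothetical blocklists for illustration only.
BLOCKED_AGENT_TOKENS = ("examplescraper", "badbot")
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def should_block(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request matches a blocked User-Agent or IP range."""
    agent = user_agent.lower()
    if any(token in agent for token in BLOCKED_AGENT_TOKENS):
        return True
    # IP blocking is the last resort, for bots that spoof their User-Agent.
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    print(should_block("Mozilla/5.0 (compatible; BadBot/1.0)", "198.51.100.7"))
    print(should_block("Mozilla/5.0 (X11; Linux x86_64)", "203.0.113.42"))
    print(should_block("Mozilla/5.0 (X11; Linux x86_64)", "198.51.100.7"))
```

User-Agent matching is cheap but trivially evaded, which is why the IP check exists as a fallback.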


Problem is, that by disallowing those smaller players, you are preventing them from getting bigger because they would need to be able to index sites to actually become good enough to attract visitors.

Also, when disallowing a spider you can be sure that you won't get ANY visitors from them, as you are not in their index to start with.

I mean: if I disallow any bot but Googlebot, I will have 100% Google referrals by design.

I can understand that in an emergency situation you might want to block some bots, but blocking all of them because they are currently small feels a) a bit unfair, as it gives them no chance to grow, and b) shortsighted, as you never know whether they might get big enough or whether the visitors they deliver might be better ad-clickers.


Speaking from experience, this is why robots.txt files don't work; they've become anti-competitive in nature. If you actually build code to obey them you get locked out of actual sites, while the real scrapers you don't want don't listen to robots.txt anyways.

If a site really doesn't want you there, they can block you at the IP/User-Agent level, which is what Quora will end up doing.


What's the penalty (if any) for ignoring robots.txt and crawling anyway? Any citations of it being legally enforced via TOS or ASP?


robots.txt in the past has been the legal defense against claims of copyright infringement.


Curious if you have any citations for this remark. I have always been under the impression that robots.txt is entirely optional, both to implement and to respect.


IANAL and looking for other sources, but it's discussed along with many other factors here: http://www.benedict.com/Digital/Internet/Field/Field.aspx

Essentially, the existence of robots.txt and meta noindex has made courts more comfortable ruling that the index of search engines constitutes fair use.


But this still doesn't specifically answer the question, "What is the penalty for ignoring 'User-agent:* Disallow: /' in robots.txt, or meta tag no-index", only that without a robots.txt or no-index tag, crawlers can't be found liable for accessing information that is freely available to them.

Interesting link, however. I was unaware of that case.


There is no penalty, but they can always block the crawler or serve rubbish.


I don't blame them for doing this. They are trying to protect their assets (questions and answers). However, this is not a good move for two reasons:

1. Honoring robots.txt is optional, so this doesn't actually offer any form of asset protection.

2. They are losing serious amounts of exposure that they would reap if they'd lift this restriction so that smaller engines can find them.


Is robots.txt against net neutrality? net neutrality for bots.


I think the answer is no, because it's on an endpoint. Net neutrality is about making sure that middlemen don't abuse their position.


It has as much to do with net neutrality as banning someone from an IRC server.


Question: Are smaller search companies allowed to build off of the results of other search providers, or is this blocked by the TOS?


Google in particular absolutely does not allow automated search queries.


Not quite true.

They do allow automated queries through their APIs for at least some of their search verticals.

http://code.google.com/more/#google-search


Why does http://lmgtfy.com/ still exist then?


lmgtfy just issues a redirect, it doesn't perform a query and return results of its own


You don't get your results on their site, they redirect you to Google.


I didn't know that, can you point to a ref doc for this?

thanks!


The Google TOS: http://www.google.com/accounts/TOS

5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.



see the above comment from jws that cites the appropriate text.

here ya go: http://lmgtfy.com/?q=how+to+be+helpful


Yahoo has an API to let other companies do exactly that. DuckDuckGo uses it.

http://developer.yahoo.com/search/boss/


Did anyone look at the robots.txt?

It looks to me like it doesn't block anything - it does have some specific settings tailored for various big-boy search engines, with a catch-all rule at the end for those not otherwise defined.

It also says right there at the top "# If you operate a search engine and would like to crawl Quora, please # email info@quora.com. Thanks. "

So they're trying to get the right data to the right search engines the right way - makes sense to me.

    User-agent: *
    Allow: /$
    Allow: /about$
    Allow: /about/
    Allow: /jobs$
    Allow: /challenges$
    Allow: /press$
    Allow: /login$
    Allow: /login/
    Allow: /signup$


The important part is Disallow: / right after the allows.
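To see how a handful of `Allow` lines followed by `Disallow: /` plays out, here is a small sketch of a rule matcher. It assumes first-match-wins evaluation in file order and the common `$` (end of path) and `*` (wildcard) extensions; real crawlers differ in their precedence rules, and the rule list below is an abbreviated stand-in for the catch-all section, not a complete copy.

```python
import re

# Abbreviated rules in file order: (verdict, path pattern).
RULES = [
    ("allow", "/$"),
    ("allow", "/about$"),
    ("allow", "/about/"),
    ("allow", "/login$"),
    ("disallow", "/"),
]


def rule_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into an anchored regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def is_allowed(path: str, rules=RULES) -> bool:
    """Apply the first rule whose pattern matches the path (assumed semantics)."""
    for verdict, pattern in rules:
        if rule_to_regex(pattern).match(path):
            return verdict == "allow"
    return True  # no rule matched: crawling is allowed by default


if __name__ == "__main__":
    for path in ("/", "/about", "/about/team", "/some-question-page"):
        print(path, is_allowed(path))
```

Under these assumptions only the home page and a few static pages are crawlable for unlisted bots, while every question page falls through to the final `Disallow: /`.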


Search engines can be a real PITA for a site. They can take a site down. When it's my site - I don't have to let you, or anyone else, index it. We can do it nicely through robots.txt, or cold-war style through firewalls and script and tarpits and who knows what else. So - assuming Quora took the time to put what they did in their robots.txt, in a world where many sites still don't bother at all - one can assume they are paying close attention to the business value they are extracting from search engine driven traffic.


This is indeed a disturbing, oligopolistic trend.

Craigslist's policy of specifically disallowing only classified search engines should look kinder in retrospect.

It is reasonable to block sites which aim to merely re-frame your content within a "search" parameter. But that kind of thing can be dealt with through terms of service.

The problem is that it takes time to actively filter against offenders.


If they are blocking to save resources, then it might be worth noting that startup search engines will not bomb their servers the way Google and Bing can. Are there any published numbers as to how the crawling load varies from startups to Googlebot?


Just about everybody effectively blocks (or throttles to near zero) all but the top 3 or so crawlers. For a popular commerce site a significant fraction (like 1/3rd) of the infrastructure capacity is just responding to crawler traffic.
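One common way to throttle a crawler to "near zero" is a token bucket per crawler. The sketch below is an illustration under assumed numbers, not a description of any particular site's setup: the per-crawler rates are invented, and the clock is passed in explicitly so the behaviour is easy to reason about.

```python
class TokenBucket:
    """Allow up to `rate_per_sec` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical per-crawler rates in requests per second.
CRAWLER_RATES = {"googlebot": 5.0, "bingbot": 2.0}
DEFAULT_RATE = 0.01  # everyone else: roughly one request every 100 seconds


def rate_for(user_agent: str) -> float:
    """Pick the rate for the first known crawler token found in the User-Agent."""
    agent = user_agent.lower()
    for token, rate in CRAWLER_RATES.items():
        if token in agent:
            return rate
    return DEFAULT_RATE


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=rate_for("SmallStartupBot/0.1"), capacity=1)
    print([bucket.allow(now=t) for t in (0, 1, 50, 100)])
```

A rejected request would typically get an HTTP 429 or 503 response, so that well-behaved crawlers back off rather than retry immediately.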


Aggressive spiders are a pain in the ass. I'm with them on this.


Search engine spiders, too.



