This is a subject that really irks the engineering side of me. It's utterly ridiculous that engineering and efficiency decisions are so deeply affected by whether or not the largest search engine will properly index your content.
Why is it that Google doesn't get flak for not discovering content that's engineered to send the absolute minimum over the wire, cache intelligently in localStorage and IndexedDB, and scale well by distributing the appropriate amount of rendering work to the client agent? Why can't I expose a (JSON/)REST-API-to-deep-link mapping and have Google just crawl my JSON data and understand (perhaps verifying programmatically some percent of the time) that the links they show in search will deep link appropriately to the structured JSON content they crawled?
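To make the mapping idea concrete, here is a minimal sketch of what an API-to-deep-link table could look like. Everything in it is hypothetical: the route patterns, the `RouteMapping` shape, and the `deepLinkFor` helper are made up for illustration, and no crawler is known to consume anything like this.

```typescript
// Hypothetical mapping between JSON API paths and the deep links users see.
// None of these routes come from a real site or a real crawler protocol.
interface RouteMapping {
  apiPattern: string; // e.g. "/api/articles/:id"
  deepLink: string;   // e.g. "/#/articles/:id"
}

const mappings: RouteMapping[] = [
  { apiPattern: "/api/articles/:id", deepLink: "/#/articles/:id" },
  { apiPattern: "/api/users/:name/posts/:id", deepLink: "/#/u/:name/p/:id" },
];

// Match a concrete API path against a pattern, returning captured params or null.
function matchPattern(pattern: string, path: string): Record<string, string> | null {
  const patternParts = pattern.split("/");
  const pathParts = path.split("/");
  if (patternParts.length !== pathParts.length) return null;
  const params: Record<string, string> = {};
  for (let i = 0; i < patternParts.length; i++) {
    const expected = patternParts[i];
    const actual = pathParts[i];
    if (expected.startsWith(":")) {
      if (actual === "") return null; // a placeholder must capture something
      params[expected.slice(1)] = actual;
    } else if (expected !== actual) {
      return null;
    }
  }
  return params;
}

// Translate a crawled API path into the deep link a search result would show.
// Assumes every placeholder in deepLink also appears in apiPattern.
function deepLinkFor(apiPath: string): string | null {
  for (const mapping of mappings) {
    const params = matchPattern(mapping.apiPattern, apiPath);
    if (params !== null) {
      return mapping.deepLink.replace(/:([A-Za-z]+)/g, (_, name: string) =>
        encodeURIComponent(params[name]));
    }
  }
  return null;
}
```

A crawler that fetched `/api/articles/42` could then list `/#/articles/42` in its results, and spot-check some fraction of those links by actually rendering them.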
It's such a waste of talent and resources to force server-side rendering. There's obviously the resource cost of transmitting more repetitive content over the wire, and requiring servers to do more work that the client could do. (Yes, even with compression this will still be a higher cost, because more repeated sequences reduces the value of variable-length encoding). But more than that, what bothers me is that there's this false truth that server-side rendering is a requirement for modern architectures, which must result in hundreds of thousands of wasted engineering hours trying to enable the idea of server-side and client-side rendering with the same code.
This is not about time-to-first-byte either. Yes, the user-perceived latency matters, but the idea that server rendering even solves this problem is again utterly false. Sure, the time to very first byte ever may be faster, but that's not a winning long-term strategy unless you never expect your client to request the same content twice (or come back to your site at all). When properly cached and synchronized, the client-side-only app has many orders of magnitude faster TTFB, because it's coming from disk or even memory, and can be shown immediately. The only thing left to do is ask the server "what's new since my last timestamp?"
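As a rough sketch of the "what's new since my last timestamp?" flow, assuming a server that can return only the items changed after a given time: the `Item` shape, `DeltaCache`, and the `fetchChanges` callback are all hypothetical names, and an in-memory `Map` stands in for localStorage or IndexedDB.

```typescript
// Illustrative only: the Item shape and the fetchChanges callback are assumptions.
interface Item { id: string; body: string; updatedAt: number; }

class DeltaCache {
  private items = new Map<string, Item>();
  private lastSync = 0;

  // The timestamp to send with the next "what's new?" request.
  since(): number {
    return this.lastSync;
  }

  // Available immediately, without touching the network (oldest change first).
  snapshot(): Item[] {
    return Array.from(this.items.values()).sort((a, b) => a.updatedAt - b.updatedAt);
  }

  // Merge a delta from the server. The sync point advances using the server's
  // own timestamps, so the client clock never enters into it.
  apply(changes: Item[]): void {
    for (const item of changes) {
      this.items.set(item.id, item);
      this.lastSync = Math.max(this.lastSync, item.updatedAt);
    }
  }
}

// Show cache.snapshot() right away, then call this in the background.
// fetchChanges is a hypothetical stand-in for something like GET /items?since=...
async function refresh(
  cache: DeltaCache,
  fetchChanges: (since: number) => Promise<Item[]>,
): Promise<Item[]> {
  cache.apply(await fetchChanges(cache.since()));
  return cache.snapshot();
}
```

The point of the split is that `snapshot()` never waits on the network; only `refresh()` does, and it transfers just the delta.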
All of these benefits seem to be completely disregarded 99% of the time because the golden "SEO" handcuffs are already on. I really hope we can get away from this mindset as a community and rather let the better-engineered sites, the ones with the best and fastest UX over time, start driving search engine technology, instead of the other way around.
> result in hundreds of thousands of wasted engineering hours trying to enable the idea of server-side and client-side rendering with the same code
Is this problem really that difficult? Why?
Why should your code care if it is running on my computer or yours?
Isomorphic JS has been around for years. Build your product on a bloated tech stack relying on an increasingly poorly planned web of dependencies, and I'll agree it could be challenging.
> Why can't I expose a REST API to deep link mapping and have Google just crawl my REST API
They do crawl REST APIs. Specifically, ones using Hypertext Markup Language.
> cache intelligently in localStorage and IndexedDB
> utterly ridiculous that engineering and efficiency decisions are so deeply affected by whether or not the largest search engine will properly index your content
Actually, it makes a lot of sense. Content needs to be discoverable. Hosting a complex language in a VM where the slightest deviation from the 600-page specification (and that's just for the core language...not the browser APIs) causes failure -- that's not "discoverable". It's like putting up a billboard with one giant QR code, just because that makes it easier to develop the content.
Isomorphic. Thank you, I was searching my brain for that word for like half an hour. :-)
> Why should your code care if it is running on my computer or yours?
It shouldn't. But my users already care about perceived latency, and that is directly limited by the speed of light. My users want feedback as quickly as possible that their input has been received, and that something is happening in response. Thanks to the speed of light, this would ideally take place instantly right in front of their eyeballs. That can't happen yet, so as much as can realistically happen on my user's CPU, memory, and storage is the next best thing.
> They do crawl REST APIs. Specifically, ones using Hypertext Markup Language.
What I meant to say was JSON, so I'm contributing to my own pet peeve of saying "REST" and meaning "JSON." :-p
HTML is awesome and does a wonderful job of letting me mark content in a way that it can be efficiently rendered, semantically, and be both human-readable and marginally machine readable. There are two problems, though. The first is that full documents (since the article points out AJAX is not performed by Google) are incredibly repetitive and wasteful, especially when retrieving the same content fragments multiple times.
The second is that it is strongly coupling content and presentation, two orthogonal concepts, much earlier than is optimal. Sure, you can cache full documents and display them when requested again, but the more common case is that a large subset of what I just displayed to my user will be displayed again, with one new item, but has still invalidated my cache because the granularity is at the full-page presentation level, and not the business domain object level. If, instead, I cache and render business objects on the client side, I can be more intelligent and granular with my caching strategy, react much more quickly to my users' feedback, and have a much smaller impact on their constrained devices. Not only that, but transmitting structured business objects instead of presentation-structured content lets me more efficiently reuse that data across devices for which HTML may not be the most effective way to present the data to them.
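Here's a small sketch of what I mean by caching at the business-object level instead of the page level. The `Article` type and `ObjectCache` class are made up for illustration; a real app would persist the map rather than keep it in memory.

```typescript
// Sketch of caching at the business-object level rather than per page.
// Article and the ID lists are hypothetical.
interface Article { id: number; title: string; }

class ObjectCache {
  private byId = new Map<number, Article>();

  put(articles: Article[]): void {
    for (const article of articles) this.byId.set(article.id, article);
  }

  // Which of the requested IDs would still need a network round trip?
  missing(ids: number[]): number[] {
    return ids.filter((id) => !this.byId.has(id));
  }

  // Build a view from cached objects. Presentation is decided here, on the
  // client; plain text is used so the sketch doesn't need HTML escaping.
  render(ids: number[]): string {
    return ids
      .map((id) => this.byId.get(id))
      .filter((article): article is Article => article !== undefined)
      .map((article) => `- ${article.title}`)
      .join("\n");
  }
}
```

If a list view shows articles 1 and 2 and a new article 3 arrives, `missing([1, 2, 3])` reports only the one new ID. A page-level cache would have thrown away the whole document for the same change.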
My personal architectural bents aside, the truth remains that content discovery agents (e.g. indexers) should not be treated as content delivery agents with such a huge influence on content format. This ends up creating (IMO) too much influence over external engineering decisions, rather than allowing engineers to think critically about the right architecture that gives users the best possible experience.
Most importantly, I'm not saying that all the engineering effort should be placed upon the discovery agents. Of course there are limits on how much they can discover on their own, and (as always in matters involving many parties) there need to be good conversations about the state of things, and what we think is the right direction to go to support each other and our users. It's just been my opinion lately that this is not so much a conversation anymore as a unidirectional stream of "best practices" coming from a single group.
Yeah, I understand the server-side rendering vs. client-side updating, and the design benefits of API-driven development. And unfortunately, a lot of popular JS frameworks haven't done a great job of helping with these.
Closure Library/Templates was meant to render server-side and bind JS functions after render, or create client-side dynamically. (Interestingly, the historic reasons were performance, not SEO.)
React and Meteor have good server-side stories. Angular 2 is getting one.
I would say there is a lot of low hanging fruit in just avoiding most client-side JS. Take http://wiki.c2.com/ -- the "original" wiki. That should all be static. Same with blogs, documentation, and lots of other public, indexable content.
[Disclosure: I work at Google but don't work on anything related to the crawler]
All this anger aside, I'm actually pretty impressed with the world we live in and proud of my company. Think about how far we've come that merely crawling and indexing the vastness of the internet is so mundane now. Now we should expect the whole internet to be downloaded and executed. That's got to be a great security and integrity problem. Surely someone has tried to break out of the sandbox. Can that be abused to affect the SEO of other sites? The easy answer is "spin up a new VM for each page" but that would slow the indexing process down by orders of magnitude.
I'm not sure where you're sensing anger. The thread so far is a pretty great example of the discourse I've come to really appreciate on HN. Sure, disagreement may be uncomfortable or feel awkward to read at times, but I think it's easily for the best. I'd much rather have somebody disagree with me and give good reasons than just blindly agree.
Yeah, I think I'm probably thinking much more heavily of the heavily-data-driven, dynamic web application use case since that's the kind of thing I've been working on for 5+ years now. I imagine that the vast majority of the internet content actually consists of much more long-form prose that doesn't benefit quite so much from a deferred-rendering approach since it varies little if any from user to user. In fact, that would probably be an overall systemic loss since now the same work is being done many times to render the same content, when it could be done once and cached for all.
And I don't expect Google or anyone to be able to support every edge case, either. I really would just like some sort of better solution that involves a global minimum of effort to achieve the same thing -- indexing what the user actually sees (non-private info, at least), and helping users discover sites that will give them a great experience and not just sites that give indexers a great experience.
If you're developing websites which are inoperative if the user agent does not support JavaScript, your development practices are broken.
I browse without JavaScript by default, and if a page doesn't load properly because someone decided to implement not a web page but a web page viewing client-side web application, I usually just leave. Then there are terminal browsers like lynx.
Moreover, reimplementing a web browser's navigation logic for a specific site is silly. It will in all likelihood be less reliable than a web browser's navigation logic. Moreover, it will always be slower for the initial page load than just serving a normal web page.
Yes, maybe you'll make things slightly faster for subsequent page loads. But consider that initial page loads from search engine referrals may well be the most important case, latency-wise. And if you do server-side rendering with progressive enhancement, you can have your cake and eat it if you really want to implement your own navigation logic with pushState, etc.; serve the static page and enhance it with an async script.
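A minimal sketch of the decision such an enhancement script has to make before it takes over navigation. The `LinkClick` shape is a hypothetical plain-data summary of a DOM click event, used so the logic can be read on its own; the browser wiring is described in the trailing comment rather than implemented.

```typescript
// Decide whether an enhancement script should handle a link click itself or
// let the browser do a normal page load. LinkClick is a hypothetical summary
// of the relevant parts of a DOM click event.
interface LinkClick {
  linkOrigin: string;    // origin of the clicked link's href
  pageOrigin: string;    // origin of the current page
  button: number;        // 0 is the primary mouse button
  modifierHeld: boolean; // ctrl/meta/shift/alt, e.g. "open in new tab"
  target: string | null; // the link's target attribute, if any
}

function shouldEnhance(click: LinkClick, pushStateSupported: boolean): boolean {
  if (!pushStateSupported) return false; // fall back to plain links
  if (click.button !== 0 || click.modifierHeld) return false;
  if (click.target !== null && click.target !== "_self") return false;
  return click.linkOrigin === click.pageOrigin; // never hijack external links
}

// In a browser, an async script would call this from a click handler and, when
// it returns true, call event.preventDefault(), fetch the new content, and
// record the URL with history.pushState(). When it returns false, or when the
// script never loads, the static page's ordinary links keep working.
```

Every `false` branch falls through to the browser's own navigation, which is what keeps the page usable without JavaScript.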
Uhh, the answer to all your "why"s is pretty much "because that would be an open door to abuse"? For better or worse, Google has figured that the best way for them to have the most accurate indexing is to get things the exact same way browsers do, and figure it out from there.
On the point of indexing what the user sees, we agree completely. On the statement that Google does this now, that's not actually true. The current fact is, obviously from the article content, that my browser can do things that Google won't. Namely, AJAX, which is critical to truly scalable pages that cache and perform minimal delta requests. Even getting the JavaScript indexed required using Google-specific webmaster tools.
It's fairly clear that just by developing those webmaster tools, Google is effectively (and understandably to a point) saying that they won't try to create engineering solutions to certain problems. Or that at this point it's not a sound financial investment because, after all, people will come to them because they're the biggest game in town.
If you're referring to my alternate crawling strategy suggestion of mapping a deep link structure into a structured REST URL structure, that's just an optimization I'd love to see. Really, I just think that search indexers should index what my users see, regardless of my engineering decisions.
> It's utterly ridiculous that engineering and efficiency decisions are so deeply affected by whether or not the largest search engine will properly index your content.
There being more sites that can only be crawled and indexed well if you run js is probably good for Google relative to competitors. Anything that makes crawling the web harder increases the barrier to entry, making it harder for sites like DuckDuckGo to serve search results from their own crawl.
(Disclosure: I work for Google, on unrelated stuff.)
Google brings something that no amount of fancy engineering and elegant solutions can provide: New Users. Because of that it is necessary to dance to their SEO tune.
Isn't it ironic, though? Google exists because they had the best solution in 1998 for helping users discover the content that already existed.
Of course, now that they're gigantic, they are the primary force that's adding unneeded engineering complexity to keep content in a format they can already read.
Offtopic, but Google's propensity to migrate anything popular to their own hosted content is a better example of this. They find ways to present the good stuff without the end user ever visiting your site. At some point, this starves off the sources.
"That's some nice data you got there, guys. It'd be a shame if... somebody scraped it and kept that page view and ad revenue for themselves."
Again ironically, this will put Google out of business if they keep it up, unless they can start to collect all that data on their own or otherwise incentivize content producers to allow them access.
> It's utterly ridiculous that engineering and efficiency decisions are so deeply affected by whether or not the largest search engine will properly index your content.
I would argue that it is nothing to do with Google.
Google's official line is that they will index your content, even if your entire frontend is a big client-side-only SPA. They explicitly tell you to focus on having good content that people actually want to read, and little else.
As others have suggested in this thread, it is likely that they developed Chrome for this reason. My understanding is that they basically crawl the web using a headless version of Chrome, and take a snapshot of the augmented HTML straight away, after 5 seconds, and after 10 seconds.
Other search engines aren't so clever, and so the work to have your important content available even when Javascript is disabled is done so you will show up on Bing, Baidu, Yahoo, etc also.
To go further, I reckon Google would love it if you didn't do this, as it would give them an edge over their competition which has less advanced capability to crawl the web.
Not everyone uses JavaScript, and as a user I prefer basic HTML websites over JavaScript bloat every day. Also note from the article that the conclusion seems to be that Google does in fact index JavaScript content, but only for trusted websites. I very much like this decision, because JavaScript (and Ajax in particular) is so easy to abuse, and it's never a win for anybody when you're redirected to a bloated website that downloads 100s of MB over 100s of requests and from multiple domains. Also, they're a billion dollar company, I'm sure they've already A/B tested enough to decide that indexing JavaScript is not good :)