This is a subject that really irks the engineering side of me. It's utterly ridi...

paulddraper · on Jan 1, 2017

> result in hundreds of thousands of wasted engineering hours trying to enable the idea of server-side and client-side rendering with the same code

Is this problem really that difficult? Why?

Why should your code care if it is running on my computer or yours?

Isomorphic JS has been around for years. Build your product on a bloated tech stack relying on an increasingly poorly planned web of dependencies, and I'll agree it could be challenging.

> Why can't I expose a REST API to deep link mapping and have Google just crawl my REST API

They do crawl REST APIs. Specifically, ones using Hypertext Markup Language.

> cache intelligently in localStorage and IndexedDB

Speaking of hundreds of thousands of wasted engineering hours... HTTP caches are simple and straightforward. IndexedDB leaks memory in Chrome so badly that Emscripten had to disable it (https://bugs.chromium.org/p/chromium/issues/detail?id=533648 https://github.com/kripken/emscripten/pull/3867/files). Mozilla advised developers not to adopt Local Storage due to the inherent performance issues. (https://blog.mozilla.org/tglek/2012/02/22/psa-dom-local-stor...) And how many wasted hours went into WebSQL?

> utterly ridiculous that engineering and efficiency decisions are so deeply affected by whether or not the largest search engine will properly index your content

Actually, it makes a lot of sense. Content needs to be discoverable. Hosting a complex language in a VM where the slightest deviation from the 600-page specification (and that's just for the core language...not the browser APIs) causes failure -- that's not "discoverable". It's like putting up a billboard with one giant QR code, just because that makes it easier to develop the content.

andrewstuart2 · on Jan 1, 2017

Isomorphic. Thank you, I was searching my brain for that word for like half an hour. :-)

> Why should your code care if it is running on my computer or yours?

It shouldn't. But my users already care about perceived latency, and that is directly limited by the speed of light. My users want feedback as quickly as possible that their input has been received, and that something is happening in response. Thanks to the speed of light, this would ideally take place instantly right in front of their eyeballs. That can't happen yet, so as much as can realistically happen on my user's CPU, memory, and storage is the next best thing.

> They do crawl REST APIs. Specifically, ones using Hypertext Markup Language.

What I meant to say was JSON, so I'm contributing to my own pet peeve of saying "REST" and meaning "JSON." :-p

HTML is awesome and does a wonderful job of letting me mark content in a way that it can be efficiently rendered, semantically, and be both human-readable and marginally machine readable. There are two problems, though. The first is that full documents (since the article points out AJAX is not performed by Google) are incredibly repetitive and wasteful, especially when retrieving the same content fragments multiple times.

The second is that it is strongly coupling content and presentation, two orthogonal concepts, much earlier than is optimal. Sure, you can cache full documents and display them when requested again, but the more common case is that a large subset of what I just displayed to my user will be displayed again, with one new item, but has still invalidated my cache because the granularity is at the full-page presentation level, and not the business domain object level. If, instead, I cache and render business objects on the client side, I can be more intelligent and granular with my caching strategy, react much more quickly to my users' feedback, and have a much smaller impact on their constrained devices. Not only that, but transmitting structured business objects instead of presentation-structured content lets me more efficiently reuse that data across devices for which HTML may not be the most effective way to present the data to them.
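A minimal sketch of what caching at the business-object level could look like, in TypeScript. The `Item` type and the `ItemCache` class are hypothetical names made up for illustration, and the actual network fetch is left to the caller; the point is only that a view with one new item costs one new object, not a whole page.

```typescript
// Hypothetical domain object; not taken from any real API.
interface Item {
  id: string;
  title: string;
}

class ItemCache {
  private readonly items = new Map<string, Item>();

  // Ids from the requested view that are not cached yet and still
  // need a network round trip.
  missing(ids: string[]): string[] {
    return ids.filter((id) => !this.items.has(id));
  }

  // Store freshly fetched objects, each under its own id.
  store(fetched: Item[]): void {
    for (const item of fetched) {
      this.items.set(item.id, item);
    }
  }

  // Assemble a view from cached objects, in the requested order,
  // skipping ids that are not cached.
  view(ids: string[]): Item[] {
    const out: Item[] = [];
    for (const id of ids) {
      const item = this.items.get(id);
      if (item !== undefined) {
        out.push(item);
      }
    }
    return out;
  }
}

// Two items were shown earlier; the next view adds a third.
const cache = new ItemCache();
cache.store([
  { id: "a", title: "First" },
  { id: "b", title: "Second" },
]);
const toFetch = cache.missing(["a", "b", "c"]); // only "c" is requested
```

With a full-page cache the same change would invalidate the whole document, even though two of the three items were already on the device.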

My personal architectural bents aside, the truth remains that content discovery agents (e.g. indexers) should not be treated as content delivery agents with such a huge influence on content format. This ends up creating (IMO) too much influence over external engineering decisions, rather than allowing engineers to think critically about the right architecture that gives users the best possible experience.

Most importantly, I'm not saying that all the engineering effort should be placed upon the discovery agents. Of course there are limits on how much they can discover on their own, and (as always in matters involving many parties) there need to be good conversations about the state of things, and what we think is the right direction to go to support each other and our users. It's just been my opinion lately that this is not so much a conversation anymore as a unidirectional stream of "best practices" coming from a single group.

paulddraper · on Jan 1, 2017

Yeah, I understand the server-side rendering vs. client-side updating, and the design benefits of API-driven development. And unfortunately, a lot of popular JS frameworks haven't done a great job of helping with these.

Closure Library/Templates was meant to render server-side and bind JS functions after render, or create client-side dynamically. (Interestingly, the historic reasons were performance, not SEO.)
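As a rough illustration of that render-anywhere idea (this is not Closure Templates' actual API; `Article`, `escapeHtml`, and `renderArticle` are hypothetical names), a template can be a pure function from data to markup, which makes it indifferent to where it runs:

```typescript
// Hypothetical data shape, for illustration only.
interface Article {
  title: string;
  body: string;
}

// Escape text before interpolating it into markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Pure function: it touches neither the DOM nor any server API,
// so the same code can run on the server or in the browser.
function renderArticle(article: Article): string {
  return (
    "<article><h1>" +
    escapeHtml(article.title) +
    "</h1><p>" +
    escapeHtml(article.body) +
    "</p></article>"
  );
}

// Server: send the string in the HTTP response body.
// Client: assign the string to a container's innerHTML, then bind
// event handlers to the rendered nodes.
const html = renderArticle({ title: "Tom & Jerry", body: "a < b" });
```

Anything environment-specific (routing, event binding, HTTP) stays outside the template, which is what keeps the shared part small.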

React and Meteor have good server-side stories. Angular 2 is getting one.

I would say there is a lot of low hanging fruit in just avoiding most client-side JS. Take http://wiki.c2.com/ -- the "original" wiki. That should all be static. Same with blogs, documentation, and lots of other public, indexable content.

inlined · on Jan 1, 2017

[Disclosure: I work at Google but don't work on anything related to the crawler]

All this anger aside, I'm actually pretty impressed with the world we live in and proud of my company. Think about how far we've come that merely crawling and indexing the vastness of the internet is so mundane now. Now we should expect the whole internet to be downloaded and executed. That's got to be a great security and integrity problem. Surely someone has tried to break out of the sandbox. Can that be abused to affect the SEO of other sites? The easy answer is "spin up a new VM for each page" but that would slow the indexing process down by orders of magnitude.

andrewstuart2 · on Jan 1, 2017

I'm not sure where you're sensing anger. The thread so far is a pretty great example of the discourse I've come to really appreciate on HN. Sure, disagreement may be uncomfortable or feel awkward to read at times, but I think it's easily for the best. I'd much rather have somebody disagree with me and give good reasons than just blindly agree.

andrewstuart2 · on Jan 1, 2017

Yeah, I think I'm probably thinking mostly of the heavily data-driven, dynamic web application use case, since that's the kind of thing I've been working on for 5+ years now. I imagine that the vast majority of the internet content actually consists of much more long-form prose that doesn't benefit quite so much from a deferred-rendering approach since it varies little if any from user to user. In fact, that would probably be an overall systemic loss since now the same work is being done many times to render the same content, when it could be done once and cached for all.

And I don't expect Google or anyone to be able to support every edge case, either. I really would just like some sort of better solution that involves a global minimum of effort to achieve the same thing -- indexing what the user actually sees (non-private info, at least), and helping users discover sites that will give them a great experience and not just sites that give indexers a great experience.

hlandau · on Jan 2, 2017

If you're developing websites which are inoperative if the user agent does not support JavaScript, your development practices are broken.

I browse without JavaScript by default, and if a page doesn't load properly because someone decided to implement not a web page but a web page viewing client-side web application, I usually just leave. Then there are terminal browsers like lynx.

Moreover, reimplementing a web browser's navigation logic for a specific site is silly. It will in all likelihood be less reliable than a web browser's navigation logic. Moreover, it will always be slower for the initial page load than just serving a normal web page.

Yes, maybe you'll make things slightly faster for subsequent page loads. But consider that initial page loads from search engine referrals may well be the most important case, latency-wise. And if you do server-side rendering with progressive enhancement, you can have your cake and eat it if you really want to implement your own navigation logic with pushState, etc.; serve the static page and enhance it with an async script.

nolok · on Jan 1, 2017

Uhh, the answer to all your "why"s is pretty much "because that would be an open door to abuse"? For better or worse, Google has figured that the best way for them to have the most accurate indexing is to get things the exact same way browsers do, and figure it out from there.

andrewstuart2 · on Jan 1, 2017

On the point of indexing what the user sees, we agree completely. On the statement that Google does this now, that's not actually true. The current fact is, obviously from the article content, that my browser can do things that Google won't. Namely, AJAX, which is critical to truly scalable pages that cache and perform minimal delta requests. Even getting JavaScript rendered required using Google-specific webmaster tools.

It's fairly clear that just by developing those webmaster tools, Google is effectively (and understandably to a point) saying that they won't try to create engineering solutions to certain problems. Or that at this point it's not a sound financial investment because, after all, people will come to them because they're the biggest game in town.

If you're referring to my alternate crawling strategy suggestion of mapping a deep link structure into a structured REST URL structure, that's just an optimization I'd love to see. Really, I just think that search indexers should index what my users see, regardless of my engineering decisions.

jefftk · on Jan 1, 2017

    It's utterly ridiculous that engineering and efficiency
    decisions are so deeply affected by whether or not the
    largest search engine will properly index your content.

There being more sites that can only be crawled and indexed well if you run js is probably good for Google relative to competitors. Anything that makes crawling the web harder increases the barrier to entry, making it harder for sites like DuckDuckGo to serve search results from their own crawl.

(Disclosure: I work for Google, on unrelated stuff.)

qyv · on Jan 1, 2017

Google brings something that no amount of fancy engineering and elegant solutions can provide: New Users. Because of that it is necessary to dance to their SEO tune.

andrewstuart2 · on Jan 1, 2017

Isn't it ironic, though? Google exists because they had the best solution in 1998 for helping users discover the content that already existed.

Of course, now that they're gigantic, they are the primary force that's adding unneeded engineering complexity to keep content in a format they can already read.

So who's really helping whom?

tyingq · on Jan 1, 2017

>So who's really helping whom?

Offtopic, but Google's propensity to migrate anything popular to their own hosted content is a better example of this. They find ways to present the good stuff without the end user ever visiting your site. At some point, this starves the sources.

andrewstuart2 · on Jan 1, 2017

"That's some nice data you got there, guys. It'd be a shame if... somebody scraped it and kept that page view and ad revenue for themselves."

Again ironically, this will put Google out of business if they keep it up, unless they can start to collect all that data on their own or otherwise incentivize content producers to allow them access.

Senji · on Jan 1, 2017

You do know you can give Google an XML sitemap of your site?
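For reference, a minimal sketch of generating such a file in TypeScript, assuming the sitemaps.org 0.9 format. `escapeXml` and `buildSitemap` are hypothetical helper names and the example.com URLs are placeholders.

```typescript
// Escape characters that are not allowed raw inside XML text.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build a sitemap document with one <url><loc> entry per URL.
function buildSitemap(urls: string[]): string {
  const entries = urls
    .map((url) => "  <url><loc>" + escapeXml(url) + "</loc></url>")
    .join("\n");
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    entries,
    "</urlset>",
  ].join("\n");
}

// Placeholder URLs; deep links into an app would be listed the same way.
const sitemap = buildSitemap([
  "https://example.com/",
  "https://example.com/items?id=1&view=full",
]);
```

A sitemap only tells the crawler which URLs exist; what it finds at each URL still depends on how the page is rendered.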

jmkni · on Jan 2, 2017

> It's utterly ridiculous that engineering and efficiency decisions are so deeply affected by whether or not the largest search engine will properly index your content.

I would argue that it is nothing to do with Google.

Google's official line is that they will index your content, even if your entire frontend is a big client-side-only SPA. They explicitly tell you to focus on having good content that people actually want to read, and little else.

As others have suggested in this thread, it is likely that they developed Chrome for this reason. My understanding is that they basically crawl the web using a headless version of Chrome, and take a snapshot of the augmented HTML straight away, after 5 seconds, and after 10 seconds.

Other search engines aren't so clever, so the work to keep your important content available even when JavaScript is disabled is done so that you also show up on Bing, Baidu, Yahoo, etc.

To go further, I reckon Google would love it if you didn't do this, as it would give them an edge over their competition which has less advanced capability to crawl the web.

mopper51 · on Jan 1, 2017

Not everyone uses JavaScript, and as a user I prefer basic HTML websites over JavaScript bloat every day. Also note from the article that the conclusion seems to be that Google does in fact index JavaScript content, but only for trusted websites. I very much like this decision, because JavaScript (and Ajax in particular) is so easy to abuse, and it's never a win for anybody when you're redirected to a bloated website that downloads 100s of MB over 100s of requests and from multiple domains. Also, they're a billion dollar company, I'm sure they've already A/B tested enough to decide that indexing JavaScript is not good :)