\* PDFs are self-contained and offlineable HTML can easily be offline-able. Base...

LeifCarrotson · on July 19, 2021

When you find a page - inherently a document-oriented term - like an article, blog post, how-to, or project writeup that's interesting or useful, and you want to make sure it's available to you later, what do you do?

Do you save the HTML, CSS, and Javascript, and hope that it works offline? I used to use the "Save page as..." tool back in the early 2000s, but it's become less and less useful, with too many dysfunctional disappointments.

No, I cut out some junk I don't need with the Printliminator [1] bookmarklet, then I do a *print-to-PDF.* This gives me a file. I can save the file, back it up to my NAS, search for it later, keep it with other files from a project where it was useful, and otherwise hang onto it. This is so common, in fact, that it's gone from being an obscure thing you could do with a Postscript-to-PDF converter or (before the adware/Ask toolbar scandal) by installing the CutePDF virtual printer, to a standard feature. Modern OSes bundle a PDF printer, and print dialogs understand that you want to "Save as PDF". Google Docs and Office 365 editors allow downloading a document as a PDF.

I totally agree that a dynamic, interactive page or a comment section is not compatible with this model of usage. There's a lot of consumption of endless feeds, and a lot of one-time video views that also don't make sense to save as offline files. However, the web for creators, where people write articles that are worth hanging onto, has a definite place for PDFs.

[1]: http://css-tricks.github.io/The-Printliminator/

derefr · on July 19, 2021

> When you find a page [...] and you want to make sure it's available to you later, what do you do?

Instead of doing a bad and lossy job of archiving the page myself, I notify† our friendly neighbourhood archivists at the Internet Archive of the page; and they then do the best, most lossless job of preserving the page that they're able, given their cumulative experience.

† http://blog.archive.org/2017/01/25/see-something-save-someth...

As a side-benefit, they also then take care of keeping the archive they've made around and available online in perpetuity, with no additional marginal effort on my part. The same can't be said for something in my own "private collection."

daggersandscars · on July 19, 2021

This may not be well-known, but archive.org can and does remove pages / sites from the archive. Authors can request this, site owners (separate from the authors) can request this. There may be others who can request this.

Just an FYI. If there are critical sites you want copies of, I'd recommend making your own copy. I've lost access to important pages / sites twice before taking this to heart.

Edited for clarity

Santosh83 · on July 19, 2021

There is value in having a personally curated, offline collection of documents. You can search, annotate or otherwise manipulate it to your heart's content, all without having to be connected.

Of course the Internet Archive serves other purposes for which it is (currently) irreplaceable.

cxr · on July 19, 2021

Zotero is much better for this than the too-fiddly print-to-PDF workflow described in the earlier comment.

admax88q · on July 19, 2021

There's also opportunity cost in spending time maintaining, indexing, annotating your own archive of documents.

tenebrisalietum · on July 19, 2021

> in perpetuity

Hopefully it really is around a very long time, but the world is unpredictable and things change. It's great to enhance the Internet Archive, but you can bet I'm keeping my local copy too. Just in case.

htek · on July 19, 2021

That's suboptimal as well. The site could come out with a new robots.txt file which is just "User-agent: *" followed by "Disallow: /", and everything already indexed by the Internet Archive is now inaccessible to you.
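
For reference, here is a minimal sketch of what that two-line robots.txt means to a crawler that honours it, using Python's standard urllib.robotparser. The user-agent string and URL are placeholders, and this says nothing about how the Internet Archive itself treats pages it has already captured; it only shows how the rule is read.

```python
from urllib.robotparser import RobotFileParser

# The blanket-deny robots.txt described above, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-respecting crawler may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical crawler name and URL, for illustration only.
blocked = not is_allowed(ROBOTS_TXT, "example-archiver", "https://example.com/article.html")
```

With this file in place, every path on the site is off limits to any crawler that follows the rules, whatever its name.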

turtlebits · on July 19, 2021

Do you never get online receipts that you need to keep a copy of?

derefr · on July 19, 2021

I don't think I've ever had such a thing that only appeared as a web page, without being emailed to me. To me, the email is the primary-source document in that arrangement.

gregsadetsky · on July 19, 2021

There was an interesting discussion about this a year ago:

https://news.ycombinator.com/item?id=23228098

——

This is still not as powerful as my one, simple trick to handle all bookmarks, ever: Print to PDF. I've been doing it since last century, and I have 10's of thousands of PDF's of every single web page I've ever found interesting, sitting right there in a directory on my computer

——

Including the suggestion that was brought up to use ripgrep to search in the pdf text content.

anigbrowl · on July 19, 2021

Sometimes if I'm researching a topic I'll dig up a big number of newspaper articles and want to print them and read them away from the screen while scribbling notes etc, but on a lot of websites banner ads or footers with copyright statements can really mess it up.

apotheon · on July 19, 2021

I actually dislike HTML per se, but the only two benefits I see for PDFs in the general case are:

- In my experience, it's a little harder and rarer to make PDFs utterly incompatible with different means of viewing them, and it generally requires more overt (if perhaps slightly unintentional, at times) sadism to make that happen.

- PDFs can do some things HTML can't (easily, at least) with document design -- though those things are generally things that would be disallowed in our new "deurbanized" PDF-based web replacement.

Everything else that comes to mind goes the other way, including the fact that the viewing-mechanism incompatibility thing can be even worse with PDFs, even if it's more rare for that to happen at present, and if PDFs became the new standard for the web I'm pretty sure that relative rarity would evaporate anyway. Let's also not forget that HTML can also do some things PDFs can't (as easily, at least) do.

jhgb · on July 19, 2021

> Do you save the HTML, CSS, and Javascript, and hope that it works offline? I used to use the "Save page as..." tool back in the early 2000s, but it's become less and less useful, with too many dysfunctional disappointments.

I'm too lazy, so I just tend to use SingleFile these days...

blooalien · on July 19, 2021

Also useful: https://pypi.org/project/html2text/
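
For a sense of what an HTML-to-text pass does, here is a rough standard-library sketch. It is not html2text's API or output format (html2text produces Markdown); the class and function names are made up for illustration, and it only strips tags and skips script/style content.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    """Return the text fragments of an HTML document, one per line."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.parts)
```

The result is plain text that is easy to grep or diff, at the cost of losing links, headings and layout.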

Bjartr · on July 19, 2021

I've used chrome's ability to save a single .mhtml file that contains all the resources for this purpose in the past.

camgunz · on July 19, 2021

You got nerd sniped by the HTML vs. PDF format thing and missed the entire point of TA:

> Isn’t it a good thing that we enjoy rapid progress? To the extent that we get to enjoy things like YouTube and sandspiel, yes! But to the extent that we want the internet to be a place where we can work and live and think and communicate free of malware, surveillance, dark patterns and the insidious influence of advertising, the answer is, empirically, sadly, no. The web has become ad-corrupted hand-in-hand with growth in technological capability, and the symbiotic relationship between web and browser means they feed on each others’ churn. Ads demand new sources of novelty to put themselves on, so the web expands continually, the specs grow in complexity, the browsers grow in sophistication, the barrier to entry grows ever higher, the vast cost of it all demands more ad revenue to fund it... and thus the perpetual motion machine is complete.

cxr · on July 19, 2021

The author does identify a problem, and so you want to focus on that. That's fine. There is the issue of triviality, however.

The problem described is widely felt, and also widely discussed. We already know this stuff to be a problem. For the piece to be worthwhile, then, it should do something that is not present in the other instances where the topic has been raised. It should articulate (or at the very least exhibit, without necessarily articulating) a solution for us. It doesn't. A bad remedy to a genuine problem does not yield a solved problem.

camgunz · on July 19, 2021

The article is called "Deurbanising the Web", and its thesis is:

- Publish in static file formats.

- Date and hash your work.

- Stop spying on your users.

HN is a discussion forum, not project planning software. Not everything has to "yield a solved problem". Are you really setting the bar at "design a technology stack for replacing HTML/CSS/JS"? That's way, way too high.
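
The "date and hash your work" item can be sketched with the standard library. The record layout and function names below are assumptions for illustration; nothing in this thread says which format the article prescribes.

```python
import hashlib
from datetime import date

def stamp(content: bytes, published: date) -> dict:
    """Return a small record tying a document's bytes to a date and a SHA-256 digest."""
    return {
        "published": published.isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
    }

def verify(content: bytes, record: dict) -> bool:
    """Check that `content` still matches the digest in a previously made record."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]

# Placeholder document body and date, for illustration only.
record = stamp(b"example document body", date(2021, 7, 19))
```

This works the same way for a static HTML file as for a PDF, since only the bytes are hashed.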

bccdee · on July 19, 2021

You say that its thesis is (in part) to generally publish in static file formats, but that's not quite accurate. The piece specifically touts PDF/A as the best format and makes several arguments against the use of html/css. I agree that they're making a broader point than just "use pdf," but "use pdf" is definitely a large part of it.

apotheon · on July 19, 2021

Those points can be trivially met with static HTML and something like IPFS, and you can still download HTML for local storage and viewing. You can even print to PDF if you really want to do so. Meanwhile, PDFs also allow dynamic files, don't require dating and hashing, and can be used to spy on users or deliver malware.

EDIT: Oh, yeah, and static file formats doesn't necessarily have to mean static document formatting when viewing -- unless you're using PDFs, which tends to break useful stuff like reflowing for paginated documents (one of the worst things about even simple PDFs).

xialvjun · on July 21, 2021

IPFS solves this well.

slashdot2008 · on July 19, 2021

The author brings a solution, it is to publish documents in PDF instead of HTML.

apotheon · on July 19, 2021

"A bad remedy to a genuine problem does not yield a solved problem."

slashdot2008 · on July 20, 2021

PDF is a great way to publish documents, which is what the web originally was for.

The web has become a bad remedy to some distributed software problems.

tovej · on July 20, 2021

Why do you feel PDFs are a bad remedy? PDFs are the usual way I absorb information.

prophesi · on July 19, 2021

No, the entire point of the article is to convince people to use PDF/A. Which I find comical since you have to go out of your way to check if a PDF is PDF/A compliant. If the web was run by PDF's, there's no reason why any big corporations would abide by those rules, and it'd be just as messy as HTML is today.

camgunz · on July 19, 2021

You've also been nerd sniped. TA goes on and on about surveillance capitalism and the attention economy. Weird, for an article that's supposedly convincing engineers of the merits of one file format over another.

prophesi · on July 19, 2021

Did you read beyond the "How did it come to this?" section? TA goes on and on about web standards and the need for PDF/A.

Edit: If the article _was_ all about surveillance capitalism, then it wouldn't be worth upvoting as actionable solutions are much more valuable than preaching to the choir.

camgunz · on July 19, 2021

If you don't think it's clear that the author's advocacy of PDF is a means to an end, subservient to their desire to dismantle surveillance capitalism and the duopoly that Google/Apple have on the web, I don't know where to go from here.

anigbrowl · on July 19, 2021

why don't we have both?

prophesi · on July 19, 2021

I think you're the one who got nerd-sniped here. 1.5 of the 13 pages in this PDF are about surveillance capitalism. The rest's about web standards.

Aeolun · on July 19, 2021

What in the nine hells is nerd sniping?

camgunz · on July 22, 2021

It's when you trick a technically-minded person into jumping down a rabbit hole of a technical problem/controversy. Here it's PDF vs. HTML, but other classic nerd snipes are UTF-8 vs. anything else, "fixing" election tech, etc.

monkeynotes · on July 19, 2021

I tackled the premise. I think addressing the premise is the logical place to dismantle an argument.

camgunz · on July 19, 2021

But, again, the premise is not that "as a file format, PDF is better than HTML". The premise is: because HTML is two-way, it enables surveillance capitalism and allows bad actors to monopolize the attention economy. The author wrote it thus:

> Sure, you can write good HTML. I won’t argue with that. And if you’re writing good HTML, good for you. But HTML is a dual-use technology, the bad guys are dual-using it an awful lot, and I feel that the stone age still has a part to play in the progression of the information age.

The part where you engage with this is where you write:

> I'm sorry, the more I think about this the dumber I feel. The web is useful because it's 2-way. I am excited by the web because I can interact with other people. I come to hacker news to engage with thinkers, not to just read a published article from one single author. I want to read ad-hoc opinions and user submitted content. PDF web, really?

Which is interesting! Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?

apotheon · on July 19, 2021

> > Sure, you can write good HTML.

A key here is that it's easier to write good HTML docs than good PDF docs, and much harder to deal with the harmful aspects of PDF docs given present technology.

> Which is interesting! Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?

I don't know about the other person's ideas, but decentralization plus better anonymization and pseudonymization, with always-on strongest-reasonably-possible encryption, seems like the direction to go.

camgunz · on July 19, 2021

> A key here is that it's easier to write good HTML docs than good PDF docs, and much harder to deal with the harmful aspects of PDF docs given present technology.

Oh, yeah I'm not on the PDF train. That's wild. I'm more of a Markdown or Gemtext advocate, or even LaTeX.

> I don't know about the other person's ideas, but decentralization plus better anonymization and pseudonymization, with always-on strongest-reasonably-possible encryption, seems like the direction to go.

Yeah, projects like IPFS (which you reference above) are working towards this, but JavaScript still works over IPFS. Plus, fingerprinting techniques are pretty bonkers. Most of it comes down to JS and various state you keep on your local machine (cookies, flash cookies, etc.), but I think you need that. How do you maintain a session with a peer without some kind of token/cookie?

monkeynotes · on July 20, 2021

> Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?

Yes, it's called Tor. However, legislation is where we should start. Crippling/abandoning an incredibly useful technology which works very well just because it's often used nefariously seems to be a bit of an overreaction.

Until then, stop using social platforms, use an ad blocker, and use VPN if you really care about "surveillance capitalism".

6510 · on July 19, 2021

The classic mistaking the example for the topic.

hyperpape · on July 19, 2021

Saying HTML can be offlineable is like saying C can be provably terminating. There's a subset of programs where that's true, but it's not inherent to the form. A PDF is inherently self-contained; standard web technologies are not. When you open the page and it's a PDF, it gives you certain guarantees; when you open it and it's HTML, you have to do further investigation.

lucideer · on July 19, 2021

Firstly, C being provably terminating is a problem dealing with the full body of C programs written in the world. The OP is dealing with their own self-published content. That's a different problem: if your analogy held it would need to be limited to proving that a subset of C programs written by the author terminate.

Secondly, making HTML offlineable is many orders of magnitude simpler than the problem in your C analogy: there's really no comparison. For the OP we only need to make HTML documents that they have authored themselves offlineable, and yet people have written general-purpose tools to do this automatically for most webpages. This is not a hard problem.

TL;DR your analogy is absurd.

hyperpape · on July 19, 2021

This is a helpful post because it gets to the heart of the difference. Many people are saying "if you do HTML in a particular way, you get the same benefits." I'm asking "what's inherent to the form?" That's exactly the point about C--you can write it in a way that's provably terminating, but it's not guaranteed. Consider the consumer's perspective.

When I land on a page that's a PDF, I know certain things--I can easily save it and read it later. How do I know that? Not because I have read the PDF spec, or know that much about it, but because of my experience as a consumer of the web.

When I land on an arbitrary web-page, do I know the same thing? No. I don't know what the page is doing, I don't know what my browser will do when I try to save the page. When I save this page, I have the option to save HTML only, or a complete web page. Will the complete page actually work? I go into the source, and there's a link to the javascript (which is saved locally). Does rendering the page rely on that javascript? Does that javascript do xhr or fetch calls? Since it's Hacker News, I suspect the answer is no. However that's not inherent to the medium.

There are better ways to archive the content of even dynamic JS heavy pages, but they are not things that you learn as an average user of the web.
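
One way to investigate that question for a saved page is to list the resources it still expects to load from the network. This is a minimal standard-library sketch; the class name is made up, it only inspects src and href attributes on a few tags, and it cannot see requests made later by scripts (the xhr/fetch case).

```python
from html.parser import HTMLParser

class ExternalResourceFinder(HTMLParser):
    """Record src/href values on resource-loading tags that point at the network."""

    # <a href> is a navigation link, not a dependency, so it is not listed here.
    RESOURCE_ATTRS = {"script": "src", "img": "src", "iframe": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            # Absolute and protocol-relative URLs need a network connection.
            if name == wanted and value and value.startswith(("http://", "https://", "//")):
                self.external.append((tag, value))

def external_resources(html: str) -> list:
    """Return (tag, url) pairs for resources the page would fetch remotely."""
    finder = ExternalResourceFinder()
    finder.feed(html)
    finder.close()
    return finder.external
```

An empty result is a good sign but not a guarantee, which is the point being made above: with HTML the answer has to be checked rather than assumed.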

apotheon · on July 19, 2021

It's possible to write PDFs that don't "work" (for some useful definition of "work" similar to the case with HTML) offline. Please stop pretending that's not true.

The reason offline utility tends to be true more often for PDFs is that PDFs are not generally regarded as the preferred online-default format of choice, which is in turn a matter of social effects rather than technical capacity. Reverse the socially accepted roles of the two document formats and watch the same complaints get made against PDFs as you're making against HTML. I'd bet money the "normal" state of affairs would remain the same in terms of the perceived benefit/detriment allocation between online/offline formats; only which format was considered which would have changed.

. . . but then all the web would be even heavier documents, and even less customizable for local viewing, thanks in part to that pagination and strict formatting situation.

anigbrowl · on July 19, 2021

It's possible, but it takes work. I can't remember the last time a pdf did something unreadably weird, usually my only gripe is with something that's a scan of an old document but whoever turned it into PDF didn't do OCR.

lucideer · on July 19, 2021

I don't really follow. How does this author converting their entire site to PDF help readers/visitors/users?

The original HTML site[0] was printable as PDF, and save-able as both HTML and "Web page, complete", all of which result in a well-formatted & readable offline experience. (It was also responsive: very readable on mobile, but that's an aside).

The new PDF site is not accessible to some, difficult to read on mobile, and interacts poorly with all of the norms web users are accustomed to (back navigation, anchors, etc.)

[0] https://web.archive.org/web/20130127175816/http://www.lab6.c...

hyperpape · on July 19, 2021

It's the difference between "this thing has X property" (termination or able to save for offline reading) and "this thing _obviously_ has X property, in a way that you can tell without any expertise, or doing any investigation".

How important this is to users, or whether it is worth it is something I've not commented on, but it is a difference.

ksec · on July 21, 2021

Yes. The sort of discussion happening every day between a product manager and engineers.

chalst · on July 19, 2021

hyperpape's analogy would work if the property was "avoids undefined behaviour", rather than "avoids nontermination". When we encounter a webpage, we are being expected to execute potentially complex, well-being threatening code whose behaviour is about as easy to predict as obfuscated C.

lucideer · on July 19, 2021

True but again only if we're talking about parsing the web. This is about HTML files the author is producing themselves.

apotheon · on July 19, 2021

PDFs are capable of the same issues.

JadeNB · on July 19, 2021

> When you open the page and it's a PDF, it gives you certain guarantees ….

I think that this is a lot less true than we're used to thinking. The PDF spec contains a lot more interactive capabilities than I think most people realise. (It supports JavaScript!) We're not used to seeing those capabilities abused, because there's no point; it is so much easier to abuse HTML. But, if people want to abuse PDF—and, if we somehow convinced the world to move to it, then they would—then they easily can.

(I'm not conversant enough in the spec to know, but I do know that Postscript is Turing complete, and I don't know that PDF isn't. At least HTML on its own certainly isn't—no recursion!—although all bets go out the window once you start layering other tech on top of it.)

monkeynotes · on July 19, 2021

I don't buy that the problem with the web is that HTML is not inherently offlineable. HTML may not be inherently offlineable but it can be. PDF isn't inherently a web friendly format, but it can be. There really isn't any good argument for PDFing the web.

pajko · on July 19, 2021

Print the page to PDF.

tablespoon · on July 19, 2021

> Print the page to PDF.

Even that usually sucks nowadays, because web developers don't care anymore. Probably 75% of the time before I do that, I have to go into the dev console to delete overlay elements that obscure content and garbage that will waste 10 pages (e.g. grossly oversized images, related article recommendations, etc.).

There was a time when most websites had a print view that gave you a simplified html page that worked well, but I think most of those are gone now. Now it's all some print "media-type" CSS that no one ever put the time in to do properly or keep up to date.

stjohnswarts · on July 19, 2021

I agree, I don't see how anyone can call publishing in PDF "dumb". The author of the material gets to choose his medium. If "you" don't like it then move along or convert it to your preferred format. In other words "why not both?"

hypertele-Xii · on July 20, 2021

I bet HTML to PDF is a lot easier conversion than PDF to HTML.

Formats matter.

EugeneOZ · on July 19, 2021

> A PDF is inherently self-contained, standard web technologies are not

What technologies exactly? You can have absolutely everything you need inside the HTML. You can inline css, js, svg and images. What technologies can’t you inline?

aenigma · on July 19, 2021

You are correct that you CAN - but who does? That's no longer considered best practice. The argument these days is that it's a lot easier to manage css if it's in a separate file, same with js, etc. So none of the serious web developers actually do anything inline anymore. The time it would take to convert a "best practice" website with separate files for html, css, js, etc. is just not worth it. The point he's making is still valid - why not have the option for something static?

EugeneOZ · on July 19, 2021

But with the same (and even much bigger) success you can declare “I’m switching to self-contained HTML! No more external resources!” instead of “I’m switching to PDF, saying farewell to interactivity and mobile devices”.

It's just the declaration of ONE person, switching ONE site.

apotheon · on July 19, 2021

> why not have the option for something static

You have the same option with either HTML or PDF:

- PDF files can be dynamic or static, depending on how you write them.

- HTML files can be dynamic or static, depending on how you write them.

tablespoon · on July 19, 2021

>> * PDFs are self-contained and offlineable

> HTML can easily be offline-able. Base64 your images or use SVG, put your CSS in the HTML page, remove all 2-way data interaction, basically reduce HTML to the same performance as PDF and allow it to be downloaded.

You're missing the point. Even a relatively computer-illiterate person can easily save a PDF to their hard drive, and it's significantly more difficult with HTML. At a minimum you're probably going to get an HTML file with a sidecar directory (or, I believe, sometimes a browser-specific archive; it's been a long time since I tried, since it works so poorly), and even that may not have the content you want due to dynamic sites.

monkeynotes · on July 19, 2021

As I explained, if the author wants to make HTML easily offlineable then inline CSS and Base64 images. Or, you know, make your website printable. If authors actually thought about the print to PDF "problem" it could be solved with traditional CSS and HTML. As someone else said, we used to do this. It used to be part of my every day web design job to make sure the page printed nicely.
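
As a rough sketch of that inlining step, assuming the image bytes and stylesheet text are already in hand (the helper names are hypothetical, and the inputs are inserted as-is without escaping):

```python
import base64

def data_uri(payload: bytes, mime_type: str) -> str:
    """Encode raw bytes as a data: URI usable in an <img src=...> attribute."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def self_contained_page(title: str, body_html: str, css: str,
                        image: bytes, image_mime: str) -> str:
    """Build one HTML document with its CSS and a single image embedded inline."""
    # Nothing here is escaped; run untrusted text through html.escape first.
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{title}</title>\n'
        f"<style>{css}</style></head>\n"
        f"<body>{body_html}\n"
        f'<img alt="" src="{data_uri(image, image_mime)}">\n'
        "</body></html>\n"
    )
```

The resulting file has no external references, so "Save as" yields a single file with no sidecar directory. Base64 makes the embedded image roughly a third larger than the original bytes.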

The idea that the whole web is going to pander to edge case archivers is asinine. This whole conversation is about supporting the needs of the very, very few and romanticizing about the time when only interesting people used the internet. It's kind of elitist and self serving.

enumjorge · on July 19, 2021

I guess I don’t really understand the point being made. Does it matter that much that saving a page creates a single file on your hard drive? If you really want a static rendering of a site, why not just print it to a PDF? Why does that have to dictate the file format you use for distribution? With PDFs you don’t have to worry about conversion but they are also comparatively larger over the wire.

> even that may not have the content you want to due to dynamic sites

But PDFs also don’t give you dynamic content. Nothing is stopping people from using HTML to serve static, JS-less content. In fact that’s what it was originally designed to do. All this web app stuff was bolted on afterwards, and it’s optional.

What do we accomplish by having some people switch over to PDFs? The people who don’t care about bloat will continue to not care about it. It’s not like thin content will become more discoverable or more common. It doesn’t really change incentives. The author says using PDFs makes it so you’re not tempted to add cruft to your sites but that’s not really a compelling argument.

Getting content creators to produce content without bloat is not really a technical problem. It’s a cultural and economic one. I don’t see how a file format addresses that.

fjtktkgnfnr · on July 19, 2021

> Does it matter that much that the artifact of saving a page be a single file in your hard drive?

Yes, it matters a lot. Word/Excel files are actually a zip archive containing many files and sub-directories. Can you imagine people working with exploded Word files, sending over mail and WhatsApp complete directory trees?

spion · on July 19, 2021

The file format restricts the possibilities. You know what to expect when you see a PDF - static, JS-less content. With HTML on the other hand, it depends on what the author decided.

JadeNB · on July 19, 2021

> You know what to expect when you see a PDF - static, JS-less content.

You know to expect that, but there's no guarantee that's what you get. PDF supports JavaScript too.

MisterBastahrd · on July 19, 2021

Or I could just make sure that my page prints reasonably well (we used to do this) and use the print-to-pdf functionality available in modern browsers.

apotheon · on July 19, 2021

You can write HTML pages to be self-contained and offline-friendly.

You can write PDFs to include resources that are not part of a single, self-contained file, and to be quite unfriendly with offline use.

justusthane · on July 19, 2021

But if you want a page in PDF, you can print it to PDF. Sure, non-computer-savvy users might not know how to do it off-the-bat, but browsers make it pretty easy.

tablespoon · on July 19, 2021

> But if you want a page in PDF, you can print it to PDF...

Printing a page to PDF usually sucks: See https://news.ycombinator.com/item?id=27883028

justusthane · on July 20, 2021

Oh, I know that. I just meant that if your goal is for the website to be easily archivable, rather than publishing the website as PDF you could use simple HTML which wouldn't suck when printed to PDF.

stzups · on July 19, 2021

>> it's significantly more difficult with HTML

Right Click > Save as

Try it with this page!

tablespoon · on July 19, 2021

> Right Click > Save as

> Try it with this page!

Say hello to your new sidecar directory (or broken CSS/images/God knows what else)!

I tried to save an NY Times article, and it 1) needed JS to display anything, 2) even with the sidecar stuff was broken, 3) it was so plastered with ads and other junk I thought it was incomplete (it wasn't, I just had to scroll waaay down past something that looked like a footer and some voids after that).

If you save a PDF, you get that exact PDF on your hard drive, and when you open it (even in 10 years) it will look exactly the same as it did on the site.

With PDF WYSIWYS: What you see is what you save.

trey-jones · on July 19, 2021

This is of course the point of the article - that the web is a giant steaming pile of shit for the most part, plagued by JS and external resource requirements, all of which contribute to massive total page size.

I'll preface by saying I have some expertise in HTML, but none in PDF (the format).

Most commenters who suggest that HTML is still a better alternative than PDF (I agree) are assuming that, if this is an important issue to you, you would craft your page in a simpler style compared to most of what we see on the web, making Print to PDF or Save As... more viable.

  > PDFs and a PDF tool ecosystem  exist today. No need for another ghost town   GitHub   repo   with   a   promising   README   and   v0.1   in progress.

This is news to me. I'm not sure that I buy it. PDFs have always been a pain in the ass to work with in my opinion. Maybe there are tools, but in my experience they aren't very good.

In general, we know that HTML is going to be much more compact (and compressible!) than PDF and that's the biggest advantage I see on a web where bandwidth still matters. Another downside shows itself when trying to copy and paste the above quote: PDF formatting seems to be weird.

chalst · on July 19, 2021

> In general, we know that HTML is going to be much more compact (and compressible!) than PDF and that's the biggest advantage I see on a web where bandwidth still matters.

PDFs can be tiny if they do not embed fonts. Serving fonts is very much a complex technology in HTML world.

Browsing the web is a pain in the ass if you don't use a browser compliant with up-to-date standards, but the whole "HTML can be lightweight" argument pretty much depends on avoiding much of today's standardisation. As an objection to the original argument, it is not comparing like with like.

tablespoon · on July 19, 2021

> This is news to me. I'm not sure that I buy it. PDFs have always been a pain in the ass to work with in my opinion. Maybe there are tools, but in my experience they aren't very good.

> In general, we know that HTML is going to be much more compact (and compressible!) than PDF and that's the biggest advantage I see on a web where bandwidth still matters. Another downside shows itself when trying to copy and paste the above quote: PDF formatting seems to be weird.

PDF is a display format. I once worked on a project parallel to a guy who was parsing PDF to extract text content. IIRC, Text in PDFs is stored in a way that works fine for printing/rendering but not so well for manipulation (e.g. it's a bunch of commands to render line Z at position X,Y with font W). Those commands don't have to be in reading order, nor do they have the semantic meaning you can get from markup like HTML (e.g. superscript can just be nothing more than a different line rendered with a smaller font).
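
To illustrate why that makes extraction awkward, here is a toy sketch. The `Fragment` type is a made-up stand-in for "draw this text at X,Y", not any real PDF library's data model; it assumes a single column, a y axis that grows upward, and fragments on one line sharing nearly the same y.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    x: float   # horizontal position of the fragment's start
    y: float   # baseline height; larger means higher on the page
    text: str

def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by similar y, then read top to bottom, left to right."""
    lines = []  # each entry: [representative_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return [
        " ".join(f.text for f in sorted(members, key=lambda f: f.x))
        for _, members in lines
    ]

# Fragments deliberately listed out of reading order, as a PDF may store them.
scrambled = [
    Fragment(120, 700, "world"),
    Fragment(72, 650, "Second line"),
    Fragment(72, 700, "Hello"),
]
ordered = reading_order(scrambled)
```

Even this toy version has to guess: a superscript drawn a few units higher than its line lands on a "line" of its own unless the tolerance is widened, which is the loss of semantic meaning described above.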

IMHO, PDF is actually less optimal than HTML for what this guy is advocating, except that it's precisely those limitations that have prevented PDF from becoming the mess that Web HTML has. Though, that's probably in large part because the bloaters have been too distracted by the easier target that is HTML to bother.

romwell · on July 19, 2021

Yeah, no. Try it with any other page, and see why nobody would be inclined to even try "Save As.." a web page anymore.

biztos · on July 19, 2021

I actually did this pretty recently, in an attempt to get some magazine articles onto my Kobo e-book reader since Pocket couldn’t fetch the paywalled ones (I do pay).

I figured I could just save the page, automate a few edits to get around dynamic stuff, and then use it as, you know, an HTML document.

Even with a nice friendly mostly-text literary magazine, after about five hours I gave up and just copy-pasted the rendered text.

JadeNB · on July 19, 2021

> >> it's significantly more difficult with HTML

> Right Click > Save as

> Try it with this page!

HN is not a good site to illustrate the unpleasantnesses of navigating the modern web. As you'd hope for a hacker news site, it is very friendly to this sort of thing. Most sites aren't.

naravara · on July 19, 2021

> You're missing the point. Even a relatively computer-illiterate person can easily save a PDF to my hard drive, and it's significantly more difficult with HTML. At a minimum you're probably going to get an HTML file with a sidecar directory (or I believe a sometimes browser-specific archive, it's been a long time since I tried since it works so poorly), and even that may not have the content you want due to dynamic sites.

Ctrl+P -> Save as PDF

You don't need the page to be a PDF to save it as a PDF.

playpause · on July 19, 2021

These all seem like technical quibbles that miss the point.

monkeynotes · on July 19, 2021

The guy outlines his whole case based on those exact points which are, as you have observed, technical quibbles and not a basis for abandoning HTML.

Under the hood it seems apparent to me that the real premise is an emotional one, not a technical one.

The internet is plastic not because of HTML, but because of money and people. When you have teens driving content it's going to feel plastic. When Walmart uses the internet to sell you crap it's gonna be plastic. Gossip / social platforms are trash, no matter the medium.

It could be argued that TV is an incredible learning platform ruined by HD. Back in the standard definition days we had proper news, documentaries that were substantial, and no reality TV. We need to go back to black and white standard definition.

Sorry, but the PDF web is not a solution to societal rot.

tablespoon · on July 19, 2021

> The guy outlines his whole case based on those exact points which are, as you have observed, technical quibbles and not a basis for abandoning HTML.

His is actually more of a social observation: it doesn't matter what the technology can do, what matters is how the developers of that technology actually use it.

People who use PDF almost never use 3D graphics and heavy dynamic JS, so PDFs almost always have many of the qualities he's seeking.

Web developers almost never inline anything, and do all kinds of things that are arguably deal-breakers except for a few lowest-common-denominator use cases.

> Under the hood it seems apparent to me that the real premise is an emotional one, not a technical one.

The premise is that the web has failed in important and clear ways, it's impossible to fix so we should give up, so many use cases should abandon it for something else, and PDFs are unexpectedly well suited for that.

On a related note, part of me wishes Java Applets never died. Getting rid of them seems to have caused the Web to turn into them, and maybe if they'd remained some kind of separation could have been maintained.

apotheon · on July 19, 2021

Turning PDFs into the replacement for HTML would change the incentives around PDF authoring, and PDFs would then acquire the same problems identified with HTML.

The solution to the identified problems is not to switch to PDFs. Stop reshuffling the chairs on the deck of your sinking ship, and start figuring out how to design, implement, and incentivize the use of, some means of conveyance other than iceberg-vulnerable ships.

> On a related note, part of me wishes Java Applets never died. Getting rid of them seems to have caused the Web to turn into them, and maybe if they'd remained some kind of separation could have been maintained.

Java Applets were killed by Flash.

chalst · on July 19, 2021

> PDFs are unexpectedly well suited for that.

Not so surprising, really: the PDF standard evolved in parallel with Adobe's Flash between 2005 and 2010, which was then the key technology in Adobe's effort to keep a strategic toehold on the web. If Flash had not been a security clusterfuck, it might still be around. The PDF standard was always meant to be a complementary standard, and Adobe's attempted successor technologies have followed an even closer technological path.

The PDF standard has benefited from the fact that, unlike the W3C and WHATWG, surveillance capitalists have not been in the driving seat of its standardisation effort. Adobe's interests are not identical to those of the public, but they are not as essentially adversarial to them as the web standards bodies have been.

adolph · on July 19, 2021

Is the medium the message? Does style have substance? Is form also a function?

leetcrew · on July 19, 2021

I'm not exactly sure what point you're trying to make here, but I don't think two different formats for encoding formatted text with images constitute different "mediums".

megameter · on July 19, 2021

Of course they are, and we run into it constantly in computing. You can encode text with images as a bitmap, as vector graphics, as symbolic content that references bitmaps or vectors, as an algorithm that procedurally generates any of the above...

While you can produce identical outputs from the different methods, it's not hair-splitting to say that the authoring process and hence the nature of the medium to shape expression is affected by choosing one. When you opt towards maximizing generality your production cycle can grow without bound because everything is possible by layering different media, even if all of it is unnecessary. That's how you end up with creative projects that take multiple years to decades to accomplish.

runawaybottle · on July 19, 2021

Well, you seem to get the gist of the hot take the author put out. This article is not about PDFs. There is something wrong with the world and we can sense it.

This is close to it: When you have teens driving content it's going to feel plastic.

Youth is the ultimate quality destroyer. They just fucking suck. I’m quite sick of their drivel honestly, and yet, we let them dictate the world (watch my childish cartoons, even in old age).

And the little shits complicate code bases. All you little rascals under 30, scram, I’m on to you.

And all you little adults acting like children, with your stupid motivational posts on LinkedIn, and your garbage bragging on there, I see you too.

Stop.

wlesieutre · on July 19, 2021

Unless I'm on a paper-sized tablet I would definitely rather have an offline HTML file than a PDF. Nobody likes to pan back and forth on lines of text to read something.

Robotbeat · on July 19, 2021

I had the exact opposite reaction. I’m reading this on an iPhone SE2020, and I MUCH appreciate reading this in pdf form. I didn’t have to pan back and forth or even put the phone in landscape orientation. This is one of the smallest smartphones you can still buy, and the experience of PDF is WAY better than the user-hostile auto-flow text forced down mobile users’ throats.

I was skeptical at first, but I think the author made the point fantastically well.

cunthorpe · on July 19, 2021

What.

Your browser has a zoom functionality that lets you make the text smaller, essentially replicating the PDF site above. Only the opposite of what you say is correct: I can’t read that PDF’s text without turning my phone into landscape and picking up my glasses.

wlesieutre · on July 19, 2021

To get equally small text on my desktop I have to turn the font size all the way down to 7. God forbid you have readers with less than stellar eyesight.

I get what they're going for but the PDF is not exactly an accessible reading experience.

nemetroid · on July 19, 2021

I’m using a 2016 iPhone SE, and it’s largely unreadable without being very up close.

apotheon · on July 19, 2021

EPUB would beat the shit out of PDF for that.

(EPUB is basically a subset of HTML with client-oriented context.)

pseingatl · on July 19, 2021

PDF is size-agnostic. There's nothing to stop you from creating documents the size of a phone screen.

wlesieutre · on July 19, 2021

I’m commenting here as a user reading a PDF. The fact that someone else could have laid it out differently doesn’t change the fixed layout of the PDF that I’m trying to read.

There’s a reason responsive design has been a big deal for the last 10+ years and I don’t think the benefits of PDF are worth throwing it out.

JohnFen · on July 19, 2021

As someone who really detests responsive design, the lack of it in a PDF strikes me as a feature, not a bug.

quietbritishjim · on July 19, 2021

> These all seem like technical quibbles that miss the point.

If these all "miss the point", what is the point?

It seems to me that the article's point is that PDF as a format has attributes that satisfy the author's goal, whereas HTML does not. The parent comment says that HTML does have those attributes after all (if you choose to use HTML that way). That is very directly addressing the article's point, as I understand it.

JohnFen · on July 19, 2021

Perhaps I misunderstood, but I believe the author's point was to highlight what a steaming mess the modern web is. The PDF aspect strikes me as illustrating a point, not a seriously proposed solution.

jedimastert · on July 19, 2021

This statement could be for both the comment you're replying to and the original article.

Frost1x · on July 19, 2021

>PDFs don't have any dynamic interaction...

Just a caveat to that statement, you can literally do interactive and dynamic 3D graphics rendering in PDFs: https://helpx.adobe.com/acrobat/using/enable-3d-content-pdf....

You can also embed JS in PDFs: https://helpx.adobe.com/acrobat/using/applying-actions-scrip...

dathinab · on July 19, 2021

Yes, and many of these things are "in general" not well supported by anything but Adobe's own PDF reader.

Even the simplest interactive things can easily fail to work correctly, even in the more widely used PDF readers.

IMHO PDF is in many ways worse than HTML; it's just that these ways are less commonly used. But if you start a PDF-instead-of-HTML trend, it's just a matter of time until these "not so compatible" aspects of PDF become widely used by some people.

monkeynotes · on July 19, 2021

JS in a PDF? You can do that in HTML, why not use the tools you already have that work together by design?

This guy is arguing that removing JS is what makes the web better. Having published, static, paper-like content is the way forward.

Frost1x · on July 19, 2021

Just caveating a technical statement I knew wasn't quite true, not making any sort of assessment either way.

As someone who has had to extract data from large sets of PDFs and modern web presentation formats, I'm not a fan of either, really. Even verifying that a visibly presented string exists in a PDF document programmatically can be a non-trivial task, as with a given website as well. That to me says a lot.

chalst · on July 19, 2021

monkeynotes seems to take the line that technical defects in claims others make fatally undermine their case, but technical defects in his/her arguments are irrelevancies.

For what it's worth, the same objection occurred to me. The use of scripting I've seen in PDFs has been use-supporting and consistent with their book-like feel.

rexreed · on July 19, 2021

Also - how are PDFs exactly "discoverable"? I have petabytes of PDFs and making them easily "discoverable" for any mass use, such as analytics, search, or data analysis is a massive pain. I'd rather have them in a non-PDF format.

relaxing · on July 19, 2021

The author is calling for new content to be authored as PDF, which can easily be made discoverable.

I’m guessing your data set is made of scans with poor or no OCR.

rexreed · on July 19, 2021

Not a single researcher or data analyst I know of would prefer "discoverable" content to be in PDF format, regardless of just how awesome the OCR is (which it often isn't, especially for tabular data). Even for all-text, non-tabular documents, OCR does not provide the metadata needed to make sense of the documents. Why PDF is claimed to have superior "discoverability" in the OP essay is a mystery to me. For the sake of "discoverability", PDF is definitely not the way to go.

relaxing · on July 19, 2021

The essay claimed

> PDFs are discoverable. Search engines index them as easily as any other format.

What you’re talking about has nothing to do with that.

noduerme · on July 19, 2021

Honestly, if you're going to put out a manifesto as a PDF, at least take some time "layouting" your design. The one advantage of that format is that you control the aspect ratio. Every font is permissible, everything is absolutely positioned. Using a generator to create it is cringey. Show the art that's possible. Really sell the format.

FWIW I deliver PDFs daily as an art director; not ideal, but they work in most cases. There's certainly nothing rebellious or non-commercial about them.

EugeneOZ · on July 19, 2021

...and difficult to read on the small screens of mobile devices.

noduerme · on July 19, 2021

Yeah. That's why they're only used for print.

chowderman · on July 19, 2021

> HTML can easily be offline-able. Base64 your images or use SVG, put your CSS in the HTML page, remove all 2-way data interaction, basically reduce HTML to the same performance as PDF and allow it to be downloaded.

I built a tool for this exact purpose[0] since the HTML specification and modern browsers have a lot of nice features for creating and reading documents compared to PDF (reflow and responsive page scaling, accessibility, easily sharable, a lot of styling options that are easy to use, ability for the user to easily modify the document or change the style, integration with existing web technologies, etc.). In general I would rather read an HTML document than a PDF document since I like to modify the styling in various ways (dark theme extensions in the browser for example) which may be hard to do with a PDF, but it's more of a personal preference. Some people will prefer that the document adjusts to the screen size of the device (many HTML pages), and others will prefer the exact same or similar rendering regardless of the screen size (PDF).

Either way, kind of a fun idea making a website using just PDFs. Not the most practical choice, but fun nonetheless.

[0] https://github.com/chowderman/hyperfiler
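For anyone curious what the quoted recipe ("Base64 your images", "put your CSS in the HTML page") looks like in its simplest form, here is a rough Python sketch. It is not how the tool linked above works; the function names `to_data_uri` and `inline_assets` are made up for this example, and the regex matching is deliberately naive (double-quoted attributes, local files only), so treat it as an illustration rather than something to run on arbitrary pages.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Naive patterns: assume double-quoted attributes and well-formed tags.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)
CSS_LINK = re.compile(r'<link\b[^>]*?\bhref="([^"]+\.css)"[^>]*>', re.IGNORECASE)

REMOTE_PREFIXES = ("data:", "http:", "https:", "//")


def to_data_uri(path: Path) -> str:
    """Encode a local file as a base64 data: URI."""
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def inline_assets(html: str, base_dir: Path) -> str:
    """Inline local images and stylesheets so the page is a single file."""

    def replace_img(match):
        src = match.group(2)
        if src.startswith(REMOTE_PREFIXES):
            return match.group(0)  # leave remote or already-inlined images alone
        return match.group(1) + to_data_uri(base_dir / src) + match.group(3)

    def replace_css(match):
        href = match.group(1)
        if href.startswith(REMOTE_PREFIXES):
            return match.group(0)  # leave remote stylesheets alone
        css = (base_dir / href).read_text(encoding="utf-8")
        return f"<style>\n{css}\n</style>"

    html = IMG_SRC.sub(replace_img, html)
    return CSS_LINK.sub(replace_css, html)
```

Calling `inline_assets(page_source, Path("site"))` would return the page with local `<img>` sources and `.css` links replaced. Fonts, scripts, and `url(...)` references inside the CSS are not handled, which is part of why a dedicated tool is the saner choice.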

supperburg · on July 19, 2021

This reminds me of the guy who said Dropbox was stupid because he could set up an FTP server. It’s the exact same argument.

People understand PDFs, they are extremely common in the academic and business world as “digital paper” standalone documents. Hypothetically, anything in memory can be made into a file but in this scenario what matters is the practical goal of people actually using these files.

I think it makes sense for the web to be made up of discrete primitives not only so that the web can be browsed in an intuitive and frictionless way but also because it lends itself to being backed up and easily re-hosted.

pajko · on July 19, 2021

This. Also, who else hates the huge double margins? The slow rendering? The unnatural break-up of text? Meaningless headers and footers? And the whole page-based layout? PDF is not meant for the web. Period.

goodpoint · on July 19, 2021

You seem to miss the point of the post:

----

Call to action

Publish in static file formats

Date and hash your work

Stop spying on your users

----

All this cannot be GUARANTEED by HTML/pdf/epub and requires active cooperation from the author. This is bad.
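As an illustration of the "date and hash your work" item, here is a minimal Python sketch of what that author-side cooperation could look like. The `stamp` and `verify` names and the one-line "date sha256:digest" layout are assumptions made up for this example, not anything the post prescribes.

```python
import hashlib
from datetime import date


def stamp(data: bytes, published: date) -> str:
    """Return a one-line record: ISO publication date plus SHA-256 of the file bytes."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{published.isoformat()} sha256:{digest}"


def verify(data: bytes, record: str) -> bool:
    """Check that the bytes still match the digest in a record made by stamp()."""
    expected = record.rsplit("sha256:", 1)[-1]
    return hashlib.sha256(data).hexdigest() == expected


record = stamp(b"the bytes of the published file", date(2021, 7, 19))
print(record)
print(verify(b"the bytes of the published file", record))
```

Note that this only lets a reader detect changes if they got the record through some channel the author cannot silently rewrite, which is the cooperation problem again.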

Koshkin · on July 19, 2021

All true. Incidentally, I do not see pagination as necessary or in most cases even desirable; rather, I see it as a vestige of the printing technology, while the need for printing has shrunk dramatically over the past 20 years.

marcosdumay · on July 19, 2021

> PDFs don't have any dynamic interaction

Oh, you are set for a world of surprises. Nearly every single one bad, but running our current web over PDFs is well within the specs.

majkinetor · on July 19, 2021

PDF

- does not reflow, major suck

- is binary format, another major suck

So no thx, PDF is outdated tech, while HTML and friends are just abused.

anigbrowl · on July 19, 2021

What I like best about pdf files is that I can just give them to someone and be almost certain that any questions will be about the content rather than the format of the file.

gunapologist99 · on July 19, 2021

agreed.

and, ancient HTML can still be easily read by modern browsers, so that's not exactly a special attribute of PDF either.

anigbrowl · on July 19, 2021

> HTML can easily be offline-able.

Sure - if the publisher cares. From the user's standpoint, the safe assumption is that they don't. Of course PDF is No Good for many contexts, but for any sort of long-form document that is primarily meant to be read, it's so often better.

Also, if something is available in pdf, I can be moderately sure that someone else took the time to make sure it would be formatted correctly and print out OK.* If it only exists in HTML it's more of a roulette wheel experience.

* Unless some graphic designer thought 'gee this report would look so cool if the cover pages were black or some other highly saturated block of solid color.'

baybal2 · on July 19, 2021

HTML used to be a very nice format in the age of XHTML 1.1: very formally specified, and a tie with the DOM was assured by the very strictly standardised DOM v3. And Acid3 was giving you pixel-for-pixel repeatability during rendering.

HTML+JS today... now it's effectively a standard in name only, and Chrome is the new IE6. The standard is now "what has worked in the last stable release"

Now go to http://acid3.acidtests.org/ and see how the latest stable Chrome release can't render a decade-old CSS testcase.

ChrisMarshallNY · on July 19, 2021

> Simply build your website with pagination.

My experience is that browsers are terrible with CSS pagination support in their display and printing directly.

The only place it seems to actually work is...saving as a PDF...

grishka · on July 19, 2021

PDFs aren't really meant to be read off a screen, they're much better suited for stuff that's meant to be printed out.

And you can have a single self-contained file with a webpage, it's called a "web archive", with .mhtml extension.

Tomte · on July 19, 2021

> Base64 your images […], put your CSS in the HTML page

Is there a tool that does those two things (or at least the first one) and that can be used by non-programmers (command line use is fine, a Python library would not be)?

gildas · on July 19, 2021

You can use SingleFile for this, see https://github.com/gildas-lormeau/SingleFile/

1vuio0pswjnm7 · on July 19, 2021

"I come to hacker news to engage with thinkers, not just read a published article from a single author."

And how many websites today are anything like HN, in terms of relative simplicity, e.g., no images^2, 3rd party requests or ads, only a tiny bit of (gratuitous)^1 JS?

1. I do not participate in the voting scheme but I could vote from the command line if I wanted to. I use a text-only browser so the grey, fading text gimmick is irrelevant. I see all comments and treat them according to the thinking not the voting.

2. If we exclude the .ico and a .gif

There seems to be a double-standard, for lack of a better term, where many HN commenters and voters appear to work for companies that make websites with tracking and ads and various gimmicks targeted at "non-thinkers" which are nothing at all like HN. Whatever these commenters and voters see and appreciate in HN they are not working to bring it to the rest of the web. I seriously doubt they comment and vote on HN out of fear of so-called "power users" or a belief that the HN type of simplicity could become more popular and threaten their jobs that depend on surveillance, online ads and a non-thinking audience of "powerless" users. Rather, a more rational explanation might be that they see some value in a website that shows no ads and generally uses no gimmicks; that's something to think about.

"PDF web" may not make sense to many folks who have invested heavily in JS and Big Tech web browsers, but Postscript is arguably more elegant than Javascript. "Thinkers" usually like FORTH.

https://en.m.wikipedia.org/wiki/Display_PostScript

The tracking section mentions the Abe Vigoda status page.

http://www.abevigoda.com/

kemitche · on July 19, 2021

PDFs are also horrible to view on mobile, as the text doesn't reflow.

novok · on July 19, 2021

Sounds a lot like epub.

stjohnswarts · on July 19, 2021

So because someone chooses to publish their website in an open format that they prefer, "it's dumb" because they don't agree with you?