Hacker News | venki80's comments

Wondering if this is basically what all data lakes will look like in the future. All data stored in these table formats…


Consider the whole data sector as:

    Generation ->
    Ingestion ->
    Transform (and possible looping back as derived data is created) ->
    Resting place ->
    Final Useful Product
Not saying that's a perfect model, just something to hang the terms I use in this post on.

And then consider that in order to get from the beginning of that process to the end, there is a certain amount of "Data Cleanup" to be done, ranging from merely validating that the data is sensible to, in the limit, literally handing huge blobs of text and unstructured data to humans and making them input something useful into the system out of it.

My assessment of the whole data community right now (and please, by all means react to this with your own opinions, I'm curious about them) is that the entire flush of fads going back and forth right now amounts to an argument about how exactly to distribute the necessary Data Cleanup work across that pipeline. The theoretical ideal is for everything to just be super awesome at the generation phase and nothing else has to worry about it, but it was rapidly discovered that making the generation part so expensive inhibits the data from ever being generated. With clean data, downstream could do all sorts of database-y storage technologies and do all sorts of clever things with the clean data, but the data is never clean.

The natural overreaction is to flip entirely in the other direction and just get it in and push the validation as far down the pipeline as possible. Here you get the "big piles of vaguely organized files". You get more data this way because you lower the costs of generation and to some extent ingestion, but you complicate everything downstream.

It seems to me we're currently in a phase where everyone is just sort of hoping somebody else will do it, and we're flailing around a bit.

Very opinionated: where we're going to settle, and where you can already see the shape forming, is a little mix & match at each level. Do what's easiest at each level, and you end up with the cheapest and most effective result across the pipeline considered as a whole, even though nobody working in any one part of it will be 100% happy. There won't be a magic solution. But if, for instance, Ingestion demands that Generation at least be amenable to some tabular view, even with some escape hatches for generic JSON bits, Ingestion can start operating with sensible tools (something SQL-ish like ClickHouse) instead of just having a pile of opaque nothingness. The next levels down won't be able to count on data quality or coherence 100%, but you can start layering in cleanliness and coherence as you go, etc. There just isn't a magic solution that fits into bullet points cleanly.
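
To make the "tabular view with escape hatches" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the column names (`event_id`, `ts`, `source`), the `Row` type, and the `ingest` helper are hypothetical, not any particular tool's API. Ingestion insists on a few fixed, typed columns and parks whatever else arrives in a generic JSON bag.

```python
import json
from dataclasses import dataclass, field

# Hypothetical column contract that Ingestion demands of Generation.
REQUIRED_COLUMNS = {"event_id": str, "ts": float, "source": str}


@dataclass
class Row:
    event_id: str
    ts: float
    source: str
    # Escape hatch: any keys outside the fixed columns are kept here untouched.
    extra: dict = field(default_factory=dict)


def ingest(line: str) -> Row:
    """Parse one JSON line into a Row, rejecting records without the fixed columns."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    fixed = {}
    for name, typ in REQUIRED_COLUMNS.items():
        if name not in record:
            raise ValueError(f"missing column: {name}")
        value = record.pop(name)
        # Accept integer timestamps, but not booleans, as floats.
        if typ is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, typ):
            raise ValueError(f"column {name} should be {typ.__name__}")
        fixed[name] = value
    # Whatever is left over goes into the JSON escape hatch.
    return Row(**fixed, extra=record)
```

Downstream stages can then query the fixed columns with ordinary tabular tools and only dig into `extra` when they have a reason to, which is roughly the "layer in cleanliness as you go" trade-off described above.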

(There's this "bronze/silver/gold" thing going on, which I think is silly because there's really not much benefit to trying to force an arbitrarily-deep and complicated pipeline into such classifications, but the idea is there.)

Or, in short, yes I expect to see more tabular data. It just won't be tabular for the same reason that relational DBs use tables. It'll be tables even fairly early just because you need some sort of handle on the data to do any sort of useful manipulation on it. If relational DBs use tables as an emphasis on tables qua tables of data, data lakes will use tables as defined handles on individual pieces of data to be able to manipulate them as opposed to pure unstructured piles of "something".

It reminds me of the 20+ year, still ongoing argument about where in the "Browser -> Server -> Backend Services (including DB)" stack the work needs to be done. There's a certain amount of work that has to be done. You've got a bajillion choices about where to do it, and it's been sloshing back and forth across the entire time the web has existed ("do it all in SQL procedures! Do it all on the client!") because there is no simple hard & fast correct answer that everyone can follow for every case.

Just as with that world, this reality won't stop a pile of vendors from promising they can somehow make this problem go away, but they really can't. They can reduce the accidental complexity, and that's cool and may be worth paying for, but there's essential complexity that isn't going anywhere.

(Stretching even more abstractly, I'm writing a bit about how to do stream processing with io.Reader in Go, and it reminds me a bit of that, too. Stream processing is too complicated to write a single-shot conversion from "whatever's coming in" to the golden data you're looking for in many cases, and the solution is to fold in several transforms at a time, each comprehensible and testable, until you get what you need. The whole composed stream would be impossible to understand at once, but each piece can make sense. Trying by ideological fiat to jam it all into one piece or forcing the wrong place to do something is a recipe for disaster. You have to let the problem guide along its solution, or you'll end up wasting effort fighting to impose your beliefs on a system that doesn't care about them at all.)
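
A rough illustration of that "several small transforms" shape, sketched here with Python generators rather than Go's io.Reader (a deliberate swap for brevity). The stage names and the `key=value` input format are made up for the example. Each stage is small enough to understand and test alone, and the pipeline is just their composition.

```python
from typing import Iterable, Iterator, Tuple


def strip_blank(lines: Iterable[str]) -> Iterator[str]:
    """Stage 1: trim whitespace and drop empty lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def parse_kv(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Stage 2: split 'key=value' lines, skipping lines without '='."""
    for line in lines:
        key, sep, value = line.partition("=")
        if sep:
            yield key.strip(), value.strip()


def only_ints(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, int]]:
    """Stage 3: keep only pairs whose value parses as an integer."""
    for key, value in pairs:
        try:
            yield key, int(value)
        except ValueError:
            continue


def pipeline(raw: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Compose the stages; data flows lazily, one item at a time."""
    return only_ints(parse_kv(strip_blank(raw)))
```

For example, `list(pipeline(["a = 1", "", "  b=two ", "garbage", "c=3\n"]))` yields `[("a", 1), ("c", 3)]`. The blank line, the line without `=`, and the non-integer value are each dropped by the one stage responsible for them.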


I definitely think that some kind of semi-structured storage is the future. "Here's a giant heap of files" was always kinda a hack for when you outscaled your RDBMS but didn't have time to build something better.


The big tech companies like Apple and Netflix all use Iceberg.



Apple has named committers and tech talks on Iceberg. Like all big companies, they use multiple technologies.


Does this work with AOL dial-up?


We use Dremio for queries on S3. Years ahead of Athena.


Is there a scenario in which all the major airlines collapse?


Bankruptcy is certainly possible, but those are something of a fact of life for the airline industry.


i think the major airlines all exist under the "too big to fail" umbrella, and will be bailed out by the government in some fashion before they actually collapse.

however, as long as they can cut expenses by laying off staff with no consequences, they're probably safe. previous tough times for airlines haven't been in quite the same situation where they can just cut expenses by cutting flights. i'd be more worried about the airports than the airlines - at some point, they're going to have a big backlog of fixed non-staff expenses that are no longer being covered by ramp fees.


We're not that lucky.

IIRC American has some serious debt issues but Delta and United are sufficiently solvent that they'd get bailed out.


Nope. See also: Amtrak.


On the contrary, Amtrak represents the exact scenario that OP was asking about. It was created following the complete collapse of all the major passenger railroads.


I think the point was "there'll always be passenger service", not "the current companies will survive unscathed".


The worst case is they could all merge into "Amwings" and those who choose to AGTOW would go bankrupt. I doubt it. Consolidation down to fewer major carriers per country would happen first because it's not like HYPErloop is going to put airlines or air cargo out of biz anytime soon.


How do I know if I have this problem?


This appears to be a blood test for the "Type-I IFN" factor referenced in the study:

https://dxterity.com/interferon-type-1/#:~:text=The%20IFN%2D....

A quick search indicates there might be something comparable from the usual mass-market retail labs (Quest, LabCorp).


How interesting. I have lupus and I’m in a very small lupus-related subreddit (<5,000 members) and we’ve had quite a few posts from people who are claiming they had COVID-19 and now have lupus. I kind of assumed they were just having the lingering COVID symptoms that have been in the news rather than actual lupus. Perhaps there is a relation after all?


Yikes. Use Outlook Web Access (OWA)!!


Amazon is the new Walmart. Low quality products. Increasingly the top brands aren't even available on the site and all you have is Chinese no-name crap.


For me it's even worse than Walmart. No matter how shitty it is, at least I know what I'm getting there.

With Amazon it's pretty much a gamble whether I'm getting what I actually ordered or some cheap knock-off crap.

For the entirety of quarantine I've been ordering directly from companies and it's saved a lot of headache. Sure, there's no 2-day shipping, but I'm no longer receiving a fake Anker charger four times in a row.


Quarantine has also taught me Amazon is just terrible at handling some stuff. I ordered a new French press from them, which came broken both times. Ordered from Crate & Barrel and it came wrapped and packaged beautifully. Special bubble wrap with creased edges for folding to the contours of the box, and a box that actually fit the product (as opposed to Amazon: "let's toss this in a box 3 times the size of the item, with 3 squares of inflated plastic. Surely nothing bad will happen as it bangs around in there.") I'll never order any glassware or glass products from Amazon again.

And before I tried Crate & Barrel I actually was making excuses for Amazon. "Well, shipping glass and ceramic is hard. I knew there was a chance it would break in shipping." Still valid concerns, but easier to stomach a bad experience when it shows they put forth effort and thought.


Yeah, I'll use them as a last resort, but going to B&M retailers or even Walmart online is working real well for me. A lot of the time you get free shipping online if you spend more than a certain amount.


I've been shopping online directly lately and the service and shipping times have been phenomenal. It's almost as though Amazon and small ecommerce shops have swapped places when it comes to quality assurance, service and shipping times.


What you’re describing has been Amazon for at least a decade and it’s transparently what they offer and how they market themselves, so the Walmart comparison is not even necessary.

If you are using Amazon to find top premium brands / products, you’re shopping in the wrong place to begin with.


I think it's worse than Walmart. At Walmart you get cheap garbage for low prices. At Amazon you have cheap garbage for high prices because masquerading as a premium brand with high prices and drop shipping from (ex:) AliExpress is super lucrative if you can corner a market.


Amazon knowingly sells tons of products that maim children, and refuses to take them down even when reported. I literally saw an Amazon recommended product the other day where the #1 review was that it gave the person's kid lead poisoning. Even at its worst Wal-Mart wasn't that bad.

Example: https://www.amazon.com/Star-Right-Flash-Cards-Set/dp/B073GBD...


At this point brand names seem to be very weak indicators of value, with a few exceptions in specific niches with exclusive technology.

Most brands that get any traction seem to abuse that halo to also shove mediocre product to the middle/lower market. In the end we still need to do a ton of research to see which product is individually good, and looking at reviews is a way better indicator than brand marketing.

I hit this issue when trying to buy a vacuum cleaner. Perhaps it would be different if I was willing to pay a grand for the top of the line, but otherwise household-name brands had at most mediocre reviews, often with documented cheaping out on important parts. We had to go through video reviews of each specific model in our target price range to get a decent idea of what we were buying.


I've been trying to find a decent webcam on Amazon for a while now. If you want Prime and nothing ridiculously overpriced, it's all no-name off-brands. Really frustrating.


TFW the brand name you want is Chinese and Amazon offers you Chinese knockoffs of the Chinese thing you want.

(I was shopping for a replacement power brick for a ThinkPad.)

