
> invariably all of those startups moved from MongoDB

Why? Especially after point #1 and assuming the document-store was a good fit for the data model.



Because NoSQL is hype.

Much of NoSQL essentially takes us back to the 60s, before Codd came up with the relational model[1]. These ideas tend to come back once in a while [2][3], but so far nothing is better than the relational model.

NoSQL still makes sense in many cases (generally when your specific use case doesn't need all the guarantees of ACID); in return you can get higher performance or horizontal scalability.

MongoDB aims to be a generic database but rides the wave of NoSQL "coolness". It was created by people who had no experience with databases and were learning as they went[4]. As they started adding things that are essential for a database, they realized it's not that simple.

Currently MongoDB is outperformed by Postgres. In fact there is an application called ToroDB, which provides MongoDB protocol compatibility on top of Postgres, and even that outperforms Mongo. Mongo also doesn't scale well horizontally, so you don't really get any significant advantage.

[1] https://en.wikipedia.org/wiki/Hierarchical_database_model

[2] https://en.wikipedia.org/wiki/Object_database

[3] https://en.wikipedia.org/wiki/XML_database

[4] For example, they started with mmap'ed memory regions to store the data and did not initially use fsync() to make sure data was saved to disk. In a way it reminds me of MySQL several years ago. It's much better now than it was in the past, but it has a lot of warts left over from the past.


MongoDB is really, really bad. I've never come across another product that was so horrible and yet so widely used.

MongoDB: for when you don't need consistency, availability, or partition tolerance.

There are some really good NoSQL products out there. I seriously think RethinkDB is on par with Postgres. I've also used Cassandra and BerkeleyDB and they're both decent. But unless some core part of your business logic is pathological to implement in SQL (like Reddit's comment trees) you should go with Postgres.


No no no, you don't understand at all. Mongo DB is webscale. /slashdot-comment


Right, absolutely true. Mongo is just such a bad piece of software.


> so far nothing is better than relational model

my favorite quote: "the relational model is rooted deep in set theory, and math is hard to fool"

it was in a post on how all NoSQL databases end up having to reinvent HAVING, GROUP BY and JOIN clauses in their APIs, because that's what apps do with data


A recent candidate said basically this during his interview; he got the job.


Neither HAVING nor GROUP BY are part of the relational model, though. Really, aggregation isn't part of the traditional relational model. People just tend to associate it with RDBMSes.


> Currently MongoDB is outperformed by Postgres

I hate hearing absolutist dogma like this. Some things are faster in Postgres, some things are faster in Mongo. We are migrating our analytics db from Mongo to PG, but we hit a wall because SELECT COUNT(DISTINCT x) performs abysmally in Postgres. Mongo's aggregation framework is an immature PITA, but it performs this little trick well enough that we're pretty much stuck keeping Mongo around unless we want to use MSSQL or Oracle or something else with a better (and much more expensive) query planner.

We still love (and prefer) Postgres but it is not a pareto improvement. There are always tradeoffs, and this kind of fanboyism just speaks to inexperience.


There's no reason for SELECT COUNT(DISTINCT x) to perform badly in Postgres, as long as you have an appropriate table design and indexes.

Also, MSSQL's query planner isn't better than Postgres'; I work with both. Postgres does have its quirks though, especially with MVCC row expiry.


> There's no reason for SELECT COUNT(DISTINCT x) to perform badly in Postgres, as long as you have an appropriate table design and indexes.

Meh. Postgres' planner doesn't know how to generate a skip-scan/loose index scan for DISTINCT. You can write it yourself, but it's a bit painful: https://wiki.postgresql.org/wiki/Loose_indexscan
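For reference, the wiki's workaround can be sketched as a recursive CTE. This is only a sketch, against a hypothetical table `t` with a btree index on column `x`:

```sql
-- Emulated loose index scan (sketch; assumes table t with a btree index
-- on x). Each recursion step jumps straight to the next distinct value
-- via the index instead of reading every duplicate row.
WITH RECURSIVE dv AS (
    (SELECT x FROM t ORDER BY x LIMIT 1)
  UNION ALL
    SELECT (SELECT t.x FROM t WHERE t.x > dv.x ORDER BY t.x LIMIT 1)
    FROM dv
    WHERE dv.x IS NOT NULL
)
SELECT count(x) AS distinct_x FROM dv;  -- count(x) skips the final NULL
```

With low cardinality this does one index probe per distinct value rather than one per row, which is where the efficiency difference comes from.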

If you have a low cardinality that can be a huge efficiency difference.


Actually that's wrong:

    EXPLAIN ANALYZE SELECT COUNT(DISTINCT(calc)), calc FROM price_history GROUP BY calc;
                                                          QUERY PLAN                                                           
    -------------------------------------------------------------------------------------------------------------------------------
     GroupAggregate  (cost=10713.04..11381.06 rows=10 width=5) (actual time=1010.383..1073.263 rows=11 loops=1)
       Group Key: calc
       ->  Sort  (cost=10713.04..10935.68 rows=89056 width=5) (actual time=1010.321..1049.189 rows=89041 loops=1)
             Sort Key: calc
             Sort Method: external merge  Disk: 1392kB
             ->  Seq Scan on price_history  (cost=0.00..3391.56 rows=89056 width=5) (actual time=0.007..20.516 rows=89041 loops=1)
     Planning time: 0.074 ms
     Execution time: 1076.521 ms
    (8 rows)

With Index:

    EXPLAIN ANALYZE SELECT COUNT(DISTINCT(calc)), calc FROM price_history GROUP BY calc;
                                                                         QUERY PLAN                                                                          
    -------------------------------------------------------------------------------------------------------------------------------------------------------------
     GroupAggregate  (cost=0.29..2804.82 rows=10 width=5) (actual time=0.117..47.381 rows=11 loops=1)
       Group Key: calc
       ->  Index Only Scan using price_history_calc_idx on price_history  (cost=0.29..2359.52 rows=89041 width=5) (actual time=0.054..18.579 rows=89041 loops=1)
             Heap Fetches: 83
     Planning time: 0.208 ms
     Execution time: 47.416 ms
    (6 rows)

Actually that is called an index-only scan, and it happens when all the columns you need are in the index. Which means that if you need an aggregate, you can try indexing everything the query touches. Usually an aggregate only touches a few columns of a row, so such an index is usually not a problem.


I didn't say an index couldn't be used at all, just that it can't actually make the query fast. This plan reads every duplicate of a value from the index before moving on to the next value. If each value has a couple thousand or more duplicates to count, that makes the query rather slow.


An index only scan isn't the same as a loose index scan. They're orthogonal tricks.


You would need to add support for skip scans to the query planner, and adding feature to the query planner is rarely simple.


> We are migrating our analytics db from Mongo to PG but we hit a wall because SELECT COUNT(DISTINCT x) performs abysmally in Postgres.

An alternative is PipelineDB, which is built on Postgres (and drop-in compatible, supposedly) and should provide an efficient implementation for those kinds of queries: https://www.pipelinedb.com/

I've never used it, though.


"We are migrating our analytics db from Mongo to PG but we hit a wall because SELECT COUNT(DISTINCT x) performs abysmally in Postgres."

https://github.com/aggregateknowledge/postgresql-hll ?

or even

https://www.periscopedata.com/blog/hyperloglog-in-pure-sql.h... ?
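If an approximate count is acceptable, the postgresql-hll extension makes this a one-liner. A sketch, assuming the extension is installed and `x` is an integer column on a hypothetical table `t`:

```sql
-- Approximate COUNT(DISTINCT x) via HyperLogLog (sketch; assumes the
-- postgresql-hll extension is available and t.x is an integer column).
CREATE EXTENSION IF NOT EXISTS hll;

SELECT hll_cardinality(hll_add_agg(hll_hash_integer(x))) FROM t;
```

The trade-off is a small, bounded error in exchange for constant memory and a single pass over the data.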


Try:

    SELECT COUNT(*) FROM (SELECT DISTINCT x FROM table) AS temp;

The

    SELECT COUNT(DISTINCT x)

is slow because it performs a sort.


> It was created by people who had no experience in databases and are learning as they go[4]. As they started adding things that are essential for database they realized it's not that simple.

This sounds exactly like what MySQL development looked like to me early in its rise to popularity.


I conjecture that it's rare that someone just needs a document DB.

The main problem someone is thinking about may be solved by a document DB, but then they also have 1000 other problems they didn't think about that are a horrible fit for a document DB.

And due to the nature of databases, where data has greater value when combined, using a new database system for every application isn't a great option.

SQL is great for business data, and passable-to-good for everything else. So the only way to beat it is by being way better at some specific thing, and finding users where that specific thing is so important that it's OK if the data is on an island.


Can't speak for the original poster, but at this point Postgres is a very capable and performant document-store database as well, with the addition of the JSONB type and its associated indexing and (in 9.5) modification functions. It's also more mature, has generally better durability than MongoDB, and it is easy to fall back/extend into relational queries on your document-store data if need be.

IMHO, unless you have a situation such that Mongo's horizontal scaling is actually required, it's better to use Postgres even if you're doing document-store stuff.
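The document-store usage mentioned above looks roughly like this. A sketch with made-up table and document contents:

```sql
-- Postgres as a document store (sketch; table name and documents are
-- made up). A JSONB column with a GIN index supports containment
-- queries similar in spirit to Mongo's find().
CREATE TABLE docs (id serial PRIMARY KEY, doc jsonb NOT NULL);
CREATE INDEX docs_doc_idx ON docs USING gin (doc);

INSERT INTO docs (doc) VALUES ('{"user": "alice", "tags": ["admin"]}');

-- The @> containment operator can use the GIN index:
SELECT doc FROM docs WHERE doc @> '{"user": "alice"}';
```

And because it's still Postgres, you can JOIN that table against relational data whenever you outgrow the pure document model.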


http://instagram-engineering.tumblr.com/post/10853187575/sha... It certainly can be done with Postgres, but it isn't as simple.


Because MongoDB is chasing a moving target. Hype, built-in sharding and built-in automatic failover are what it has now. The PostgreSQL project was already strong when Mongo was conceived and has been accelerating. PostgreSQL has already gobbled up and improved on other MongoDB features like BSON, and it is where you go when you grow out of MongoDB's more primitive document model. Easy sharding and failover will fall too, years before MongoDB catches up with other must-have features like transactions (you will need them eventually) or a half-decent query language, so you can actually maintain your code rather than having to plan your own queries and encode them as a sequence of JSON documents.



I don't like this one too much, because Mongo is the wrong choice here domain-wise, but this one was really convincing for me:

http://cryto.net/~joepie91/blog/2015/07/19/why-you-should-ne...


diaspora was such an interesting project, I dream about resurrecting it often.


Postgres is a better document store than Mongo.

The only time I would recommend Mongo is for storing geospatial data, as it has several built-ins that make it much easier to work with. Even then I find it a lot more convenient to keep that data in Postgres and replicate it to Mongo.


Why not use PostGIS for geo-spatial data?


For processing, 100% agree.

However, if the write load is very high then Mongo is better suited as the initial store. I then replicate to other dbs.


Mongo actually has worse performance than Postgres, including when inserting data.

You can of course improve Mongo's speed by using a weaker write concern, but you can likewise relax durability in Postgres (e.g. unlogged tables, or turning off synchronous_commit).
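The Postgres durability knobs look roughly like this (a sketch; the table name is made up):

```sql
-- Trading durability for speed in Postgres (sketch), roughly analogous
-- to Mongo's weaker write concerns.

-- Per session: don't wait for the WAL flush on COMMIT. A crash can lose
-- the last few transactions, but never corrupts the database.
SET synchronous_commit = off;

-- Per table: skip WAL entirely. Much faster writes, but the table is
-- truncated after a crash and is not replicated.
CREATE UNLOGGED TABLE events (id bigserial PRIMARY KEY, payload jsonb);
```

Unlike fully disabling journaling, synchronous_commit = off only risks recent transactions, not on-disk consistency, which makes for a fairer comparison with Mongo's write concerns.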


For a single server, I believe you are correct.

However I found scaling write load past a single server's limits more difficult using Postgres than MongoDB.

Fortunately that is part of what the OP addresses in his post.


Apologies, when I was talking about horizontal scaling I meant compared to other NoSQL databases[1].

The biggest selling point of NoSQL databases really is increased performance and horizontal scalability, obtained by dropping some of the guarantees an RDBMS provides. With MongoDB you get neither advantage.

Generally you can't scale Postgres easily, unless you only do reads, but Postgres outperforms MongoDB. You can get some marginal improvement by running multiple MongoDB nodes, but a single Postgres instance will still outperform them.

Also, with benchmarks like this it is often overlooked that in a relational database you can often store data in a smarter way and get an extra boost of performance.

For example, at my previous job we had 3 MongoDB instances used for mapping IP addresses to zip codes. Those databases were 12GB each. They ran on beefy AWS instances because they wanted to make sure all the data could fit in RAM.

I did a POC and put the same data in Postgres; it came out to just 600MB. All queries were sub-millisecond, on the smallest instance.

How come? Well, in Postgres I stored ranges of IPs using the ip4r extension and put a GiST index on them, which works with ranges. Mongo did not understand IP addresses, so what they essentially did was compute every possible IPv4 address, store each one as a 64-bit integer, and then place an index on it, which was probably bigger than the data itself.

Their solution also won't scale with IPv6.
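The ip4r approach described above can be sketched like this (table and column names are made up; assumes the ip4r extension is installed):

```sql
-- IP-range-to-zip lookup with the ip4r extension (sketch; names are
-- hypothetical). One row per range instead of one row per address.
CREATE EXTENSION IF NOT EXISTS ip4r;

CREATE TABLE ip_zip (range ip4r NOT NULL, zip text NOT NULL);
CREATE INDEX ip_zip_range_idx ON ip_zip USING gist (range);

-- ">>=" is ip4r's "contains" operator; the GiST index makes the
-- point-in-range lookup fast.
SELECT zip FROM ip_zip WHERE range >>= '203.0.113.7';
```

Storing ranges instead of enumerating every address is what collapses 12GB into a few hundred MB, and the same schema extends to IPv6 via ip4r's iprange type.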

[1] http://www.datastax.com/wp-content/themes/datastax-2014-08/f...


Hm, fast bulk write is an area where PostgreSQL generally handily beats MongoDB. I do not see why it would be different for geodata. MongoDB beats PostgreSQL for some things, but not bulk load.


Seems like the poster is talking about high write load and not necessarily bulk load.



