Relational databases don't scale horizontally. That's still as true today as it ...

workhere-io · on April 25, 2014

AdWords was run on MySQL up until two years ago. The vast majority of developers working on "web-scale" projects won't get to handle projects with larger requirements than AdWords. In that perspective the whole NoSQL trend makes very little sense, especially given the fact that e.g. PostgreSQL and MySQL have a lot of needed features that many NoSQL databases don't, and that the most "hip" NoSQL database a couple of years ago was a database that doesn't even scale that well (MongoDB).

h1karu · on April 25, 2014

AdWords was designed with sharding built into it's application layer. This essentially means that the developers were forced to implement a custom app-specific persistence layer that rides on top of MySQL.

This limits your ability to write SQL queries because you can only query within a particular shard and that decreases the usefulness of mysql and makes it feel more like a nosql data-store. At that point you find yourself asking why didn't we just use a NoSQL store ? And often times the answer is "because we wanted to use something we were already familiar with". Sometimes people are willing to add a lot more complexity to their application just to allow themselves to avoid having to learn something new.

balfirevic · on April 25, 2014

"At that point you find yourself asking why didn't we just use a NoSQL store"

Because you still have the full ability to write general SQL within the shard. For many types of application this is useful.

h1karu · on April 25, 2014

That's a fair point. If you need highly relational queries within a given shard then using a relational database there can make sense, but for many apps the kinds of queries that are used are not highly relational in nature so using a NoSQL store could have reduced the need for all of that app-specific sharding engineering while still allowing for all the queries that were needed by the app.

I personally find that SQL relational databases are not flexible enough for my application so I go with a polyglot persistence architecture that ties together a nosql key-value store with a graph database technology. This way I can use the cypher query language to access data relationally across all my shards while scaling writes and Big-data horizontally.

To each his own.

balfirevic · on April 25, 2014

"This way I can use the cypher query language to access data relationally across all my shards while scaling writes and Big-data horizontally"

That sounds pretty powerful. Do you mind sharing more information about the data storage/querying stack you use?

h1karu · on April 25, 2014

NDA and all of that, but if you pick up this book:

http://pragprog.com/book/rwdata/seven-databases-in-seven-wee...

and grep for the phrase "polyglot persistence" you'll find some great information to get you started.

a video promo for the book: https://www.youtube.com/watch?v=bSAc56YCOaE

twic · on April 25, 2014

Can you talk about what sort of queries you do in Neo4j?

Graphs are perhaps the only area where NoSQL may beat relational databases for expressiveness, rather than alleged scalability or ease of use. I've read the documentation for Neo4j and Cypher, but never used them in anger, so i'd be really interested to know more about how they are actually used.

In particular, i would love to be able to make a comparison between a graph database and a relational database queried using recursive common table expressions, in the context of a real-world problem.

h1karu · on April 25, 2014

The approach I like is to use a key/value document store as the system of record (SOR). If any data conflicts arise between Redis, the document store, and the graph database, then the SOR wins. A good SOR should contain all of the data necessary to rebuild any other data source in its domain.

Graph databases can scale write throughput fine when those writes are being fed to it from a single source so it's best to have a service who's sole purpose is to keep the graph database in sync with the SOR. The graph should only store the data you actually need in order to get the queries that you want.

The specific queries in my domain are exactly the kind of thing I'm not supposed to talk about, but I will say you can do things like:

Give me 50 users who who live near me (lat/long bounding box), who I'm not already following, who I have not already sent a message to, who have at least two friends who also live near me, and who have matching tags, order by number of followers.

there are a bunch of great examples in this awesome book: http://www.amazon.com/Graph-Databases-Ian-Robinson/dp/144935...

twic · on April 26, 2014

Thanks. Your example sounds like something it would be pretty easy to express with a RCTE in a relational database. How the performance would compare, i have no idea.

h1karu · on April 26, 2014

Yea performance is the thing. You're correct that most of these queries can also be expressed in a relational DB.

The underlying data structures involved in Neo4J do help with performance on these kinds of queries, but in addition to that I find that for me the Cypher query language feels like a more concise and elegant way to ask for the data.

With RCTE it feels like I'm first asking the DB to construct a custom data structure, then I'm asking the DB to query that custom data-structure.

With Cypher it feels like I'm simply asking for the exact data I need by specifying the relationships and attribute characteristics that I care about.

This might just be a matter of taste, but if you start playing around with Cypher it just might grow on you.

threeseed · on April 25, 2014

Sure. But only if ALL the data within that query exists on the shard. It is rare that this would happen if you have anything resembling a normalised schema. In which case you would still be doing a lot of joins in your application layer.

IMHO Sharded MySQL very much belongs in the NoSQL camp.

twic · on April 25, 2014

I don't know that this is true.

Imagine you're LeanKit, or Fog Creek, and you run a kanban board as a service. Or a bug tracker, CMS, whatever. You have many customers, each of whom has no more than thousands of users and millions of items. There are many relationships between objects belonging to a given customer, but precisely zero relationships between objects belonging to different customers.

Shard using the customer identity as a key, and you have nicely spread-out data and the ability to do any query the application might need to, while still having a normalised schema.

There are plenty of other application whose schemas have this property, or almost have it. In my company, we make financial applications, and a lot of the data has very similar siloed ownership structure.

The one thing you can't do is reporting queries across your customers. That doesn't seem like a killer, though - it's normal to farm that stuff out to an offline reporting database even in single-server environments.

threeseed · on April 25, 2014

You do understand how MySQL et al are used in those cases right ? They are treated as dumb key value stores and sharded horizontally with joins done in the application layer. They are NOT your typical SQL deployment and the features you talk about are often meaningless.

And you are 100% wrong about MongoDB not scaling well. The stories you hear of people switching are never going back to PostgreSQL or MySQL they are going to the next level in scalability e.g. HBase or Cassandra.

h1karu · on April 25, 2014

Like I said because the relational database doesn't scale horizontally you are forced to build a sharding layer into your application which introduces complexity into the application layer and limits your ability to use the relational features of the relational database. At that point you might as well just be using a key-value store that does the sharding for you and offers greater flexibility.

g8oz · on April 25, 2014

It's only a problem if your app needs cross-shard reporting. And things like Gearman can help you out with the application layer complexity of horizontal scaling.

baibaichen · on April 25, 2014

there is still transaction in one shard beyond SQL. I believe that ACID is the main benefit comparing with NoSQL store

h1karu · on April 25, 2014

My NoSQL store of choices gives me document-level ACID semantics along with eventual consistency, MapReduce, and replication.

Here's an example that explains how to get transaction-like guarantees from this kind of NoSQL data-store:

http://guide.couchdb.org/editions/1/en/recipes.html https://en.wikipedia.org/wiki/BigCouch

baibaichen · on April 25, 2014

Interesting, that isn't transaction, it is just a workaround, and I don't think you can design your app like that which treat document as a transaction log. And then using view to generate the real information.

h1karu · on April 25, 2014

that's exactly how I design the parts of my app that need a transactional nature. Map/reduce views make it a breeze to query for a consistent aggregate view of the world.

This is not something new it's a well established technique from the relational world called "event sourcing"

http://www.martinfowler.com/eaaDev/EventSourcing.html

when everything is a write you don't have to worry about conflicts or locks, so that's nice, plus couch is really good at scaling write thoughput with small documents.

twic · on April 25, 2014

Urban Airship switched from MongoDB to PostgreSQL.

I see they have since switched from PostgreSQL to Cassandra, though.

rsynnott · on April 25, 2014

The usual thing, and I believe what was done with AdWords, was application-level sharding. This is in effect implementing your own database using MySQL as a glorified key-value store (you will NOT be doing any interesting joins, and you will not have cross-shard transactions unless you do those yourself too), and is thus not really a great argument for relational databases.

ithkuil · on April 25, 2014

Sharding and SQL are orthogonal.

However SQL doesn't make the problem of handling distributed databases magically disappear.

An example of distributed, horizontally scalable database supporting strong consistency and offering an SQL interface:

http://static.googleusercontent.com/media/research.google.co... and a layer above: http://static.googleusercontent.com/media/research.google.co...

It still requires the users to carefully organize their data, according to a hierarchical data model provided by the database.

There is a mapping between a hierarchical relational model and a column store model. (EDIT: see one possible mapping in http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf)

The article skims over this aspect, as if all that matters is the syntax of the query language.

wanderingmarker · on April 25, 2014

Oracle RAC does scale horizontally. Sort of: it requires putting your data in a SAN, but that is mostly horizontally-scalable as well.