Postgres has terrible indexing with json. It doesn’t keep statistics so simple queries sometimes take much longer than expected due to query planner not knowing much about the data.
Postgres does actually keep statistics on json columns, but if you've got a functional index on the table and the query uses it then it doesn't matter if there is one "jane" and a million "johns". You're looking up a key in a btree index.
Hmm. Looks like it does though. Not that it makes a damn bit of difference because if you haven't got a functional index (i.e the stats are next to useless) then you're doing a full table scan, and in that case it sounds like you “expect full table scans to always be fast” :)
And sure, the statistics don't help with the query planner, unless you've got a computed column, but again see "I expect full table scans to always be fast" and re-consider the statement "postgres doesn't keep statistics on json columns" given the fact that it actually does, just like any other column.
I’ve seen that as well, the default estimate for jsonb can seriously confuse the query planner. There is a patch in PG13 that addresses this as far as I understand, but I’m not familiar enough with PG internals to be sure I’m reading that right. I’ll be playing with this when PG13 is out, the jsonb feature is really useful, though I wouldn’t recommend to shove relational data into it. Many things are much, much harder to query inside jsonb than regular columns.
There are ways around the statistics issue in some cases, e.g. defining a functional index on a jsonb property will collect proper statistics.
* Extract the attributes you're interested in into their own columns, index these. With the extraction happening outside the database, this is the most flexible option.
* Similar to above, use a trigger to automatically extract these attributes.
* Also similar to above, used a generated column[0] to automatically extract these attributes.
* Create an index on the expression[1] you use to extract the attributes.
My use a JSON in PostgreSQL tends towards the first option. This works well enough for cases where documents are ingested and queried, but not updated. The last three options are automatic - add/change the JSON document and the extracted/indexed values are automatically updated.
You could, of course. But that would mean that you are effectively not using json anymore. You need to pull the data out of your json on each write, update in two places, and so on. And if you need to delete a json column, what do you do with the other one? You need to delete it also. You are then managing two things.
There is always a trade off. If the column is important enough, then you are right, it should stand on its own, but then you lose the json flexibility. I personally almost always only use jsonb if I know I only care about that overall object as a whole, and rarely need to poke around to find an exact value. As a the grandparent comment mentions, if you do need a particular value, then it might be slower if your JSON records are too different (if you think about it, how can you calculate selectivity stats on a value if you have no idea how wide or different JSON records are?).