Postgres has terrible indexing with json. It doesn’t keep statistics so simple q...

orf · on May 23, 2020

“It doesn’t keep statistics” is a weird way to say “I expect full table scans to always be fast”.

Create a functional index.

aeonsky · on May 24, 2020

I’m not sure if I can even understand what you are talking about or why you said that.

If you have pid,{name:’val’, others...}. And an index of name with a million John and one Jane. Good luck getting fast results.

orf · on May 24, 2020

> If you have pid,{name:’val’, others...}. And an index of name with a million John and one Jane. Good luck getting fast results.

Fast results: https://dbfiddle.uk/?rdbms=postgres_10&fiddle=dd9370966b1528....

Postgres does actually keep statistics on json columns, but if you've got a functional index on the table and the query uses it then it doesn't matter if there is one "jane" and a million "johns". You're looking up a key in a btree index.

aeonsky · on May 24, 2020

No it doesn’t. But what do I know running several clusters in production for a $100 million business? Please read up on the subject before arguing.

orf · on May 24, 2020

> SELECT most_common_vals

> FROM pg_stats

> WHERE tablename = 'json_test'

> AND attname = 'json_column';

> {"{\"name\": \"john\"}"}

Hmm. Looks like it does though. Not that it makes a damn bit of difference because if you haven't got a functional index (i.e the stats are next to useless) then you're doing a full table scan, and in that case it sounds like you “expect full table scans to always be fast” :)

And sure, the statistics don't help with the query planner, unless you've got a computed column, but again see "I expect full table scans to always be fast" and re-consider the statement "postgres doesn't keep statistics on json columns" given the fact that it actually does, just like any other column.

Read up indeed!

fabian2k · on May 23, 2020

I’ve seen that as well, the default estimate for jsonb can seriously confuse the query planner. There is a patch in PG13 that addresses this as far as I understand, but I’m not familiar enough with PG internals to be sure I’m reading that right. I’ll be playing with this when PG13 is out, the jsonb feature is really useful, though I wouldn’t recommend to shove relational data into it. Many things are much, much harder to query inside jsonb than regular columns.

There are ways around the statistics issue in some cases, e.g. defining a functional index on a jsonb property will collect proper statistics.

pletnes · on May 23, 2020

DB noob question: if you know that you should be indexing on a json attribute, can’t you put it into a «proper column» and index there?

Mister_Snuggles · on May 23, 2020

There are a number of ways to do this:

* Extract the attributes you're interested in into their own columns, index these. With the extraction happening outside the database, this is the most flexible option.

* Similar to above, use a trigger to automatically extract these attributes.

* Also similar to above, used a generated column[0] to automatically extract these attributes.

* Create an index on the expression[1] you use to extract the attributes.

My use a JSON in PostgreSQL tends towards the first option. This works well enough for cases where documents are ingested and queried, but not updated. The last three options are automatic - add/change the JSON document and the extracted/indexed values are automatically updated.

[0] https://www.postgresql.org/docs/12/ddl-generated-columns.htm...

[1] https://www.postgresql.org/docs/12/indexes-expressional.html

why-el · on May 23, 2020

You could, of course. But that would mean that you are effectively not using json anymore. You need to pull the data out of your json on each write, update in two places, and so on. And if you need to delete a json column, what do you do with the other one? You need to delete it also. You are then managing two things.

There is always a trade off. If the column is important enough, then you are right, it should stand on its own, but then you lose the json flexibility. I personally almost always only use jsonb if I know I only care about that overall object as a whole, and rarely need to poke around to find an exact value. As a the grandparent comment mentions, if you do need a particular value, then it might be slower if your JSON records are too different (if you think about it, how can you calculate selectivity stats on a value if you have no idea how wide or different JSON records are?).