Posted on 16-02-2021 by Hans Juergen Schoenig

ANALYZE and optimizer statistics

Our 24x7 PostgreSQL support team recently received a request from one of our customers who was experiencing a performance issue. The solution to the problem lay in the way PostgreSQL handles query optimization (statistics in particular). So I thought it would be cool to share some of this knowledge with my dear readers. The topic of this post is therefore: what kinds of statistics does PostgreSQL store, and where can they be found? Let's dive in and find out.

## Purpose of optimizer statistics

Before diving into PostgreSQL optimization and statistics, it's helpful to understand how PostgreSQL executes a query. The typical process works as follows:

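First, PostgreSQL parses the query. The traffic cop then separates the utility commands (ALTER, CREATE, DROP, GRANT, etc.) from the rest. After that, the query passes through the rewrite system, which handles rules and views.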

Next comes the optimizer, which must produce the best possible plan. The plan can then be executed by the executor. The main question now is: what does the optimizer do to find the best possible plan? In addition to many mathematical transformations, it uses statistics to estimate the number of rows involved in a query. Let's take a look and see:

    test=# CREATE TABLE t_test AS SELECT *, 'hans'::text AS name
           FROM generate_series(1, 1000000) AS id;
    SELECT 1000000
    test=# ALTER TABLE t_test ALTER COLUMN id SET STATISTICS 10;
    ALTER TABLE
    test=# ANALYZE;
    ANALYZE

I created 1 million rows and told the system to calculate statistics for that data. To make things fit on my website, I also told PostgreSQL to reduce the precision of the statistics. By default, the statistics target is 100. I've chosen 10 here to keep things readable, but more on that later.
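As a rough sketch of how to check which statistics target is in effect, the following queries can be run against the example table. They assume the `t_test` table created above. Note that `attstattarget` reports the default as `-1` on older releases and as NULL on newer ones.

```sql
-- Global default used for columns without an explicit per-column setting
SHOW default_statistics_target;

-- Per-column setting for t_test (id was set to 10 above)
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 't_test'::regclass
  AND attnum > 0
  AND NOT attisdropped;

-- To return the column to the default and refresh the statistics:
-- ALTER TABLE t_test ALTER COLUMN id SET STATISTICS -1;
-- ANALYZE t_test;
```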

Now let's run a simple query:

    test=# explain SELECT * FROM t_test WHERE id < 150000;
                              QUERY PLAN
    ---------------------------------------------------------------
     Seq Scan on t_test  (cost=0.00..17906.00 rows=145969 width=9)
       Filter: (id < 150000)
    (2 rows)

What we see here is that PostgreSQL expected the sequential scan to return roughly 145,000 rows. This information is very important, because if the system knows what to expect, it can adjust its strategy accordingly (index, no index, etc.). In my case there are only two options:

- Sequential scan
- Parallel sequential scan

    test=# explain SELECT * FROM t_test WHERE id < 1;
                                    QUERY PLAN
    --------------------------------------------------------------------------
     Gather  (cost=1000.00..11714.33 rows=1000 width=9)
       Workers Planned: 2
       ->  Parallel Seq Scan on t_test  (cost=0.00..10614.33 rows=417 width=9)
             Filter: (id < 1)
    (4 rows)

All I did was change the number in the WHERE clause, and suddenly the plan changed. The first query expected a fairly large result set, so firing up parallelism did not make sense: collecting all those rows in the Gather node would have been too expensive. In the second example, the sequential scan rarely yields rows, so a parallel query makes sense.
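To see what the optimizer would have done without parallelism, one option is to switch it off for the session and compare the two plans. This is only a sketch against the example table; the costs on your system will differ from the ones shown above.

```sql
-- Disable parallel query for this session only
SET max_parallel_workers_per_gather = 0;
EXPLAIN SELECT * FROM t_test WHERE id < 1;

-- Restore the configured value and look at the plan again
RESET max_parallel_workers_per_gather;
EXPLAIN SELECT * FROM t_test WHERE id < 1;
```

Comparing the total cost of both plans shows why the optimizer preferred one over the other.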

To find the best strategy, PostgreSQL relies on statistics to give the optimizer an indication of what to expect. The better the statistics, the better PostgreSQL can optimize the query.
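A simple way to judge how good the statistics are is to compare the estimated row count with the actual one. `EXPLAIN ANALYZE` really executes the query and prints both numbers, so the sketch below is best tried on a test system:

```sql
-- "rows=" in the cost section is the optimizer's estimate;
-- "rows=" in the "actual time" section is what really came back
EXPLAIN ANALYZE SELECT * FROM t_test WHERE id < 150000;
```

If the two numbers are far apart, the statistics are likely stale or too coarse for the data.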

## Examine optimizer statistics

If you want to see what kind of data PostgreSQL uses, you can take a look at `pg_stats`. This is a view that displays the statistics to the user. Here is the content of the view:

    test=# \d pg_stats
                         View "pg_catalog.pg_stats"
             Column         |   Type   | Collation | Nullable | Default
    ------------------------+----------+-----------+----------+---------
     schemaname             | name     |           |          |
     tablename              | name     |           |          |
     attname                | name     |           |          |
     inherited              | boolean  |           |          |
     null_frac              | real     |           |          |
     avg_width              | integer  |           |          |
     n_distinct             | real     |           |          |
     most_common_vals       | anyarray |           |          |
     most_common_freqs      | real[]   |           |          |
     histogram_bounds       | anyarray |           |          |
     correlation            | real     |           |          |
     most_common_elems      | anyarray |           |          |
     most_common_elem_freqs | real[]   |           |          |
     elem_count_histogram   | real[]   |           |          |

Let's walk through it step by step and discuss what kind of data the planner can use:

- **schemaname + tablename + attname:** For every column in every table in every schema, PostgreSQL stores one row of data.
- **inherited:** Are we looking at an inherited/partitioned table or not?
- **null_frac:** What percentage of the column contains NULL values? This is important when you're doing something like `WHERE col IS NULL` or `WHERE col IS NOT NULL`.
- **avg_width:** What is the expected average width of the column?
- **n_distinct:** The expected number of distinct entries in the column.
- **most_common_vals:** We have more precise information for the most common values. This is particularly important when table entries are not evenly distributed.
- **most_common_freqs:** How frequent are these common values? PostgreSQL stores a percentage here. Example: "male" is a frequent entry and 54.32% of all entries are male.
- **histogram_bounds:** PostgreSQL uses a histogram to store the data distribution. If the statistics target is 100, the database will store 101 entries to indicate bounds within the data (1% increments).
- **correlation:** The optimizer also wants to know about the physical order of the data on disk. It makes a difference whether the data is stored in order (1, 2, 3, 4, 5, 6, ...) or randomly (6, 1, 2, 3, 5, 4, ...). If we look for ranges, fewer blocks have to be read when the data is sorted. This is especially important if you want to use BRIN indexes.
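To look at just the scalar columns described above without the wide array fields, a narrower query can help. This is a minimal sketch that assumes the example table lives in the `public` schema:

```sql
-- One row per column of t_test, restricted to the compact statistics
SELECT attname, null_frac, avg_width, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 't_test'
ORDER BY attname;
```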

Finally, there are some entries related to arrays, but we won't worry about those for now. Instead, let's look at some sample content:

    test=# \x
    Expanded display is on.
    test=# SELECT * FROM pg_stats WHERE tablename = 't_test';
    -[ RECORD 1 ]----------+---------------------------------------------------------------------------
    schemaname             | public
    tablename              | t_test
    attname                | id
    inherited              | f
    null_frac              | 0
    avg_width              | 4
    n_distinct             | -1
    most_common_vals       |
    most_common_freqs      |
    histogram_bounds       | {47,102906,205351,301006,402747,503156,603102,700866,802387,901069,999982}
    correlation            | 1
    most_common_elems      |
    most_common_elem_freqs |
    elem_count_histogram   |
    -[ RECORD 2 ]----------+---------------------------------------------------------------------------
    schemaname             | public
    tablename              | t_test
    attname                | name
    inherited              | f
    null_frac              | 0
    avg_width              | 5
    n_distinct             | 1
    most_common_vals       | {hans}
    most_common_freqs      | {1}
    histogram_bounds       |
    correlation            | 1
    most_common_elems      |
    most_common_elem_freqs |
    elem_count_histogram   |

In this listing you can see what PostgreSQL knows about our table. For the `id` column, the histogram is the most important part: `{47,102906,205351,301006,402747,503156,603102,700866,802387,901069,999982}`. PostgreSQL assumes that the smallest value is 47, that 10% of the values are smaller than 102906, that 20% are smaller than 205351, and so on. Also interesting here is `n_distinct`: -1 basically means that all values are different. This is important when using `GROUP BY`, where the optimizer wants to know how many groups to expect. `n_distinct` is used in many cases to provide that estimate.
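The histogram can be turned into a back-of-the-envelope estimate for the earlier filter `id < 150000`. The sketch below assumes simple linear interpolation inside a bucket and the 1,000,000 rows from the example. The real planner applies further corrections, so the result should be close to the 145969 rows shown in the plan, not identical to it. The second query illustrates that a negative `n_distinct` is a fraction of the row count and converts it to an absolute number; the join on `pg_class` and `pg_namespace` is just one way to fetch `reltuples`.

```sql
-- 10 buckets, each holding about 10% of the rows.
-- Bucket 1 (47 .. 102906) lies entirely below 150000.
-- Bucket 2 (102906 .. 205351) is only partly covered.
SELECT round(
         (1 + (150000 - 102906)::numeric / (205351 - 102906))  -- buckets covered
         / 10                                                  -- number of buckets
         * 1000000                                             -- rows in the table
       ) AS estimated_rows;

-- Negative n_distinct values are relative to the number of rows
SELECT s.attname,
       CASE WHEN s.n_distinct < 0
            THEN -s.n_distinct * c.reltuples
            ELSE s.n_distinct
       END AS estimated_distinct
FROM pg_stats s
JOIN pg_namespace n ON n.nspname = s.schemaname
JOIN pg_class c ON c.relname = s.tablename
               AND c.relnamespace = n.oid
WHERE s.tablename = 't_test';
```

The first query comes out at about 145970 rows, within a row or so of the planner's figure.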

In the `name` column, we see that "hans" is the most common value (100%). That's why we don't get a histogram.
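The effect of the most-common-values list can be observed with two equality filters. In this sketch, `'paul'` is simply a made-up value that does not occur in the table:

```sql
-- 'hans' is in most_common_vals with frequency 1, so the estimate
-- should be close to the full row count
EXPLAIN SELECT * FROM t_test WHERE name = 'hans';

-- A value that is not in most_common_vals gets a far lower estimate
EXPLAIN SELECT * FROM t_test WHERE name = 'paul';
```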

Of course, there's a lot more to say about how the PostgreSQL optimizer works. However, to get started, it helps a lot to have a basic understanding of how Postgres uses statistics.

## Autovacuum and statistics

In general, PostgreSQL generates statistics automatically. The autovacuum daemon ensures that statistics are updated regularly. Statistics are the fuel needed to properly optimize queries. That's why they are super important.

If you want to generate statistics manually, you can run `ANALYZE` at any time. However, for most use cases, autovacuum is adequate.
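As a small sketch of the manual route, statistics can be refreshed for a single table or column, and the cumulative statistics views show when this last happened. The table name is the one from the example above:

```sql
-- Refresh statistics for one table only
ANALYZE t_test;

-- Restrict the run to a single column and print progress messages
ANALYZE VERBOSE t_test (id);

-- When were statistics last gathered, manually and by autovacuum?
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 't_test';
```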

## Finally …

If you want to learn more about query optimization in general, see my blog post about GROUP BY. It includes some valuable tips for running analytic queries faster.

If you have any comments, feel free to share them in the Disqus section below. If there's a topic you're particularly interested in, let us know. We can certainly cover it in a future article.