# How reliable is DBMS_STATS.AUTO_SAMPLE_SIZE for gathering table statistics?

27/03/2013 6 Comments

Statistics collection in Oracle is managed through the PL/SQL package DBMS_STATS. It provides several procedures to gather statistics for a table, a schema, or an entire database. These procedures accept an estimate_percent parameter, which specifies the sampling percentage — in other words, how much of the object's data Oracle has to scan to arrive at its estimates of the different statistics. Users can specify any number between 0 and 100 for this parameter.

For example, suppose you have a table ABCD and you want to specify a 10% sampling percentage. You can do this:

exec dbms_stats.gather_table_stats(null, 'ABCD', estimate_percent => 10);

For more details on this procedure: http://docs.oracle.com/cd/B28359_01/appdev.111/b28419/d_stats.htm

It is not always easy for users to pick the appropriate sampling percentage. If the specified sampling percentage is too high, gathering statistics can take a long time (I've seen hours on some really big tables). Conversely, if the data is skewed and the specified sampling percentage is too low, the resulting statistics can be inaccurate, which is an even bigger problem.
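To make the skew problem concrete, here is a small, hypothetical simulation (plain Python, not Oracle code): a column of one million rows where a single value dominates and 100,000 values each appear exactly once. A 1% row sample captures only about 1% of those rare values, so any statistic derived from the sample starts from a badly distorted picture of the column.

```python
import random

random.seed(42)

# Hypothetical skewed column: one hot value plus 100,000 singletons.
column = [0] * 900_000 + list(range(1, 100_001))

actual_ndv = len(set(column))            # 100,001 distinct values

# A 1% row sample sees the hot value but misses ~99% of the singletons.
sample = random.sample(column, len(column) // 100)
sampled_ndv = len(set(sample))

print(actual_ndv, sampled_ndv)           # sampled_ndv lands near 1,000
```

With a sample this small, the raw distinct count in the sample is roughly a hundred times lower than the truth — and no fixed percentage chosen in advance stays safe once the distribution shifts.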

For this reason, Oracle introduced the AUTO value for the estimate_percent parameter.

For example, you can gather statistics on ABCD as follows:

exec dbms_stats.gather_table_stats(null, 'ABCD', estimate_percent => dbms_stats.auto_sample_size);

There are two primary advantages of using AUTO sample size over a fixed number.

- When AUTO sample size is specified, the system automatically determines the appropriate sampling percentage.
- AUTO sample size is more flexible than a fixed sampling percentage. A fixed percentage that was good at some point in time may no longer be appropriate after the data distribution in the table has changed. When the AUTO value is used, Oracle adjusts the sample size as the data distribution changes.

When AUTO is used, Oracle picks a sample size at which the statistics quality is good enough; however, it does not work very well in cases of extreme skew in the data. To illustrate the accuracy of AUTO sample size, I compare it against fixed sampling percentages in an Oracle 11g database.

There's a table called CUSTOMERSCSM which contains 23 columns and 6,216,000 rows. It is derived from Oracle's sample tables in the SH schema (the tables that come with the installation).

SQL> exec dbms_stats.gather_table_stats(null, 'CUSTOMERSCSM', **estimate_percent => 1**);

PL/SQL procedure successfully completed.

**Elapsed: 00:00:11.25**

and

SQL> exec dbms_stats.gather_table_stats(null, 'CUSTOMERSCSM', **estimate_percent => 100**);

PL/SQL procedure successfully completed.

**Elapsed: 00:02:52.37**

and

SQL> exec dbms_stats.gather_table_stats(null, 'CUSTOMERSCSM', **estimate_percent => dbms_stats.auto_sample_size**);

PL/SQL procedure successfully completed.

**Elapsed: 00:00:29.47**

You can see there's a huge difference between the 1% and 100% samples, whereas the 1% and AUTO sample size runs finish in almost the same time (an 18-second difference, to be precise).

Timing-wise, then, we have significant gains with AUTO sample size; now let's look at the accuracy part. Among all the statistics of a column, the number of distinct values (NDV) is the one whose accuracy has historically been an issue.
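One reason AUTO can afford to look at all the rows and still finish quickly is that, starting with 11g, the NDV under AUTO_SAMPLE_SIZE is computed in a single pass using a compact hash synopsis rather than by row sampling. The sketch below is my own illustrative implementation of that general idea (adaptive distinct sampling), not Oracle's actual code; the function name and the 1024-entry synopsis size are made up for the example.

```python
import hashlib

def approx_ndv(values, max_synopsis=1024):
    """One-pass approximate NDV: keep only hashes that fall into a
    shrinking slice of the hash space, then scale the count back up."""
    level = 0            # keep a hash h only if h % 2**level == 0
    synopsis = set()
    for v in values:
        h = int.from_bytes(
            hashlib.blake2b(str(v).encode(), digest_size=8).digest(), "big")
        if h % (1 << level) == 0:
            synopsis.add(h)
            while len(synopsis) > max_synopsis:
                # Synopsis is full: halve the slice, drop non-qualifying hashes.
                level += 1
                synopsis = {x for x in synopsis if x % (1 << level) == 0}
    return len(synopsis) * (1 << level)

# 51,000 distinct values, each repeated a few times, echoing the column below.
estimate = approx_ndv(v for v in range(51_000) for _ in range(3))
print(estimate)          # within a few percent of 51,000
```

The point of the bounded synopsis is that the expensive part of an exact NDV is not reading the rows but sorting or hashing all of them to deduplicate exactly; a structure capped at a few thousand entries sidesteps that, which is consistent with AUTO's timing above sitting far closer to 1% than to 100%.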

The actual number of distinct values in the column CUST_MAIN_PHONE_NUMBER of the CUSTOMERSCSM table is:

SQL> select count(distinct(CUST_MAIN_PHONE_NUMBER)) from CUSTOMERSCSM;

COUNT(DISTINCT(CUST_MAIN_PHONE_NUMBER))

—————————————

**51000**

Case 1: **With 1%** sample size

SQL> select column_name,NUM_DISTINCT from DBA_TAB_COL_STATISTICS where TABLE_NAME='CUSTOMERSCSM' and COLUMN_NAME='CUST_MAIN_PHONE_NUMBER';

COLUMN_NAME NUM_DISTINCT

—————————— ————

CUST_MAIN_PHONE_NUMBER **49362**

Case 2: **With 100%** sample size

SQL> select column_name,NUM_DISTINCT from DBA_TAB_COL_STATISTICS where TABLE_NAME='CUSTOMERSCSM' and COLUMN_NAME='CUST_MAIN_PHONE_NUMBER';

COLUMN_NAME NUM_DISTINCT

—————————— ————

CUST_MAIN_PHONE_NUMBER **51000**

Case 3: **With AUTO** sample size

SQL> select column_name,NUM_DISTINCT from DBA_TAB_COL_STATISTICS where TABLE_NAME='CUSTOMERSCSM' and COLUMN_NAME='CUST_MAIN_PHONE_NUMBER';

COLUMN_NAME NUM_DISTINCT

—————————— ————

CUST_MAIN_PHONE_NUMBER **51344**

| Estimate Percentage | Actual value | Value from stats | % deviation |
| --- | --- | --- | --- |
| 1% | 51000 | 49362 | 3.21% |
| 100% | 51000 | 51000 | 0% |
| AUTO | 51000 | 51344 | 0.67% |
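For reference, the % deviation column is simply the absolute difference from the actual NDV, expressed as a percentage of it. A quick check in plain Python (the helper name is mine; the figures are copied from the table above):

```python
def pct_deviation(actual: int, estimated: int) -> float:
    """Absolute deviation of the estimated NDV from the actual NDV, in percent."""
    return abs(estimated - actual) / actual * 100

print(round(pct_deviation(51000, 49362), 2))   # 1% sample   -> 3.21
print(round(pct_deviation(51000, 51000), 2))   # 100% sample -> 0.0
print(round(pct_deviation(51000, 51344), 2))   # AUTO        -> 0.67
```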

So the percentage deviation is much smaller with AUTO sample size than with the 1% sample.

Now we’ll do the same exercise on another column called CUST_ID.

SQL> select count(distinct(CUST_ID)) from CUSTOMERSCSM;

COUNT(DISTINCT(CUST_ID))

————————

**55500**

Case 1: With 1% sample size

SQL> select column_name,NUM_DISTINCT from DBA_TAB_COL_STATISTICS where TABLE_NAME='CUSTOMERSCSM' and COLUMN_NAME='CUST_ID';

COLUMN_NAME NUM_DISTINCT

—————————— ————

CUST_ID 55704

Case 2: With 100% sample size

SQL> select column_name,NUM_DISTINCT from DBA_TAB_COL_STATISTICS where TABLE_NAME='CUSTOMERSCSM' and COLUMN_NAME='CUST_ID';

COLUMN_NAME NUM_DISTINCT

—————————— ————

CUST_ID 55500

Case 3: With AUTO sample size

SQL> select column_name,NUM_DISTINCT from DBA_TAB_COL_STATISTICS where TABLE_NAME='CUSTOMERSCSM' and COLUMN_NAME='CUST_ID';

COLUMN_NAME NUM_DISTINCT

—————————— ————

CUST_ID 55992

| Estimate Percentage | Actual value | Value from stats | % deviation |
| --- | --- | --- | --- |
| 1% | 55500 | 55704 | 0.37% |
| 100% | 55500 | 55500 | 0% |
| AUTO | 55500 | 55992 | 0.89% |

Here, if you look closely, the AUTO sample size deviates more than the 1% sample does.

So we've now seen two different outcomes with the 1% and AUTO sample sizes. Is it reliable, then, to depend on auto sampling?

I'm still trying to find the answers and will post my observations soon in another article. I'll be happy if anyone can clear the clouds even before I do it for myself.

Have a good day guys. Keep Contemplating.

regards,

CSM

good post 🙂 It was very helpful

Thanks for visiting here and replying Sravya 🙂

I am also in the process of assessing whether using auto_sample_size makes sense from the quality of the statistics perspective. From what I have gathered, using auto_sample_size results in frequency histograms missing buckets (when comparing to the number of distinct values), which I don’t think is very good. These missing buckets then lead to inaccurate cardinality estimations.

On the other hand, I also see quite a few histograms being created on columns with a single distinct value and I am not sure how useful a single-bucket histogram would be.

Yeah, absolutely agree. You've hit the nail on the head.

Sweet post, dude. Cannot wait for your next one.

Engaging reading. Excellent ideas, well made.

I reckon you’re right there.