How reliable it is using DBMS_STATS.AUTO_SAMPLE_SIZE for gathering table statistics??

Statistics collection in Oracle is managed through a pl/sql package “dbms_stats”. It provides several procedures to gather statistics for a table, schema, or a database. This procedure has an estimate_percent parameter, which specifies the sampling percentage or in other words the sample of object’s data Oracle have to scan to come to an estimation on different statistics. The users can specify any number between 0 and 100 for this parameter.
For example, suppose you have a table ABCD, and you want to specify a 10% sampling percentage, you can do this:

exec dbms_stats.gather_table_stats(null, ‘ABCD’, estimate_percent => 10);

For more details on this procedure: http://docs.oracle.com/cd/B28359_01/appdev.111/b28419/d_stats.htm

It is not always easy for users to pick the appropriate sampling percentage. If the specified sampling percentage is too high, it can take longer to gather statistics (I’ve seen hours on some really big tables). On the contrary, if the data is skewed and the specified sampling percentage is too low, the resulting statistics can be inaccurate which is even a bigger problem.

For this reason, Oracle introduced the AUTO value for the estimate_percent parameter.

For example, you can gather statistics on ABCD as follows:

exec dbms_stats.gather_table_stats(null, ‘ABCD’,

estimate_percent => dbms_stats.auto_sample_size);

There are 2  primary advantage of using AUTO sample size over a fixed number.

  1. When AUTO sample size is specified, the system automatically determines the appropriate sampling percentage.
  2. AUTO sample size is more flexible than a fixed sampling percentage. A fixed sampling percentage size that was good at some point in time may not be appropriate after the data distribution in the table has changed. On the other hand when AUTO value is used Oracle will adjust the sample size when the data distribution changes.

When AUTO is used Oracle picks a sample size where the statistics quality is good enough. However, it does not work very well under cases of extreme skew in the data. To illustrate the accuracy of auto sample size , I compare the performance of using a fixed sampling percentage, AUTO sample size in a Oracle 11g database.

There’s a table called CUSTOMERSCSM which contains 23 columns and 6,216,000 rows. This is a table I’ve taken from the Oracle’s sample tables in SH schema (tables that comes with installation).

SQL> exec dbms_stats.gather_table_stats(null, ‘CUSTOMERSCSM’,estimate_percent => 1);

PL/SQL procedure successfully completed.

Elapsed: 00:00:11.25

and

SQL> exec dbms_stats.gather_table_stats(null, ‘CUSTOMERSCSM’,estimate_percent => 100);

PL/SQL procedure successfully completed.

Elapsed: 00:02:52.37

and

SQL> exec dbms_stats.gather_table_stats(null, ‘CUSTOMERSCSM’,estimate_percent => dbms_stats.auto_sample_size);

PL/SQL procedure successfully completed.

Elapsed: 00:00:29.47

 

You can see there’s a huge difference between 1 and 100 percent samples whereas 1 and auto sample size are finished almost in the same time (A 18 seconds delay, to be accurate).

As timing wise we have significant gains with auto sample size, now we’ll see the accuracy part. Among all the statistics of a column, number of distinct values (NDV) is the one whose accuracy used to be an issue.

the actual distinct values in the column CUST_MAIN_PHONE_NUMBER of CUSTOMERCSM table is

SQL> select count(distinct(CUST_MAIN_PHONE_NUMBER)) from CUSTOMERSCSM;

COUNT(DISTINCT(CUST_MAIN_PHONE_NUMBER))

—————————————

                                 51000

 

Case1: With 1% sample size

SQL> select column_name,NUM_DISTINCT from DBA_TAB_COL_STATISTICS where TABLE_NAME=’CUSTOMERSCSM’ and COLUMN_NAME=’CUST_MAIN_PHONE_NUMBER';

COLUMN_NAME                                              NUM_DISTINCT

——————————                                    ————

CUST_MAIN_PHONE_NUMBER                  49362

Case2: With100% sample size

SQL> select column_name,NUM_DISTINCT from DBA_TAB_COL_STATISTICS where TABLE_NAME=’CUSTOMERSCSM’ and COLUMN_NAME=’CUST_MAIN_PHONE_NUMBER';

COLUMN_NAME                                               NUM_DISTINCT

——————————                                      ————

CUST_MAIN_PHONE_NUMBER                    51000

Case3: With AUTO sample size

SQL> select column_name,NUM_DISTINCT from DBA_TAB_COL_STATISTICS where TABLE_NAME=’CUSTOMERSCSM’ and COLUMN_NAME=’CUST_MAIN_PHONE_NUMBER';

COLUMN_NAME                                           NUM_DISTINCT

——————————                                  ————

CUST_MAIN_PHONE_NUMBER                51344

 

Estimate Percentage Actual value Value from stats % deviation
1% 51000 49362 3.21%
100% 51000 51000 0%
AUTO 51000 51344 0.67%

So percentage deviation is very less with auto sample compared to 1%

Now we’ll do the same exercise on another column called CUST_ID.

SQL> select count(distinct(CUST_ID)) from CUSTOMERSCSM;

COUNT(DISTINCT(CUST_ID))

————————

55500

Case1: With 1% sample size

SQL> select column_name,NUM_DISTINCT from DBA_TAB_COL_STATISTICS where TABLE_NAME=’CUSTOMERSCSM’ and COLUMN_NAME=’CUST_ID';

COLUMN_NAME                    NUM_DISTINCT

—————————— ————

CUST_ID                               55704

Case2: With100% sample size

SQL> select column_name,NUM_DISTINCT from DBA_TAB_COL_STATISTICS where TABLE_NAME=’CUSTOMERSCSM’ and COLUMN_NAME=’CUST_ID';

COLUMN_NAME                    NUM_DISTINCT

—————————— ————

CUST_ID                               55500

 

Case3: With AUTO sample size

SQL> select column_name,NUM_DISTINCT from DBA_TAB_COL_STATISTICS where TABLE_NAME=’CUSTOMERSCSM’ and COLUMN_NAME=’CUST_ID';

COLUMN_NAME                    NUM_DISTINCT

—————————— ————

CUST_ID                               55992

Estimate Percentage Actual value Value from stats % deviation
1% 55500 55704 0.37%
100% 55500 55500 0%
AUTO 55500 55992 0.89%

Here if you see, the AUTO sample size is much deviated from the 1% sample.

So, just now we’ve seen two different scenarios with 1% and auto sample sizes. So it’s reliable to depend on the auto sampling??

I’m still trying to find the answers and will post my observations soon through another article. I’ll be happy if anyone could clear the clouds even before I do it for myself Smile

Have a good day guys. Keep Contemplating.

 

regards,

CSM

About these ads

4 Responses to How reliable it is using DBMS_STATS.AUTO_SAMPLE_SIZE for gathering table statistics??

  1. Sravya says:

    good post :-) It was very helpful

  2. Roberto says:

    I am also in the process of assessing whether using auto_sample_size makes sense from the quality of the statistics perspective. From what I have gathered, using auto_sample_size results in frequency histograms missing buckets (when comparing to the number of distinct values), which I don’t think is very good. These missing buckets then lead to inaccurate cardinality estimations.

    On the other hand, I also see quite a few histograms being created on columns with a single distinct value and I am not sure how useful a single-bucket histogram would be.

  3. Yeah, absolutely agree. You could have struck the nail on the head completely

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 332 other followers

%d bloggers like this: