Presentation by: Joe Geller
Author of: DB2 Performance and Development Guide and IMS Administration, Programming and Data Base Design
Date: November 2016, IDUG Db2 EMEA Tech Conference, Brussels, Belgium
Slides: The ABCs of Filter Factors (aka Selectivity)
Actual Presentation: Youtube : IDUG Tech Talk: The ABCs of Filter Factors (aka Selectivity)
Word filter means to remove something. Air filter removes dust. What Db2 is filtering, are rows.
1SELECT ... FROM CUST WHERE LASTNAME ='PURCELL'
Is the above selection criteria or filtering criteria
Filter factor is the fraction of rows that satisfy a predicate
Filter factor and Selectivity mean the same thing.
1SELECT ... FROM CUST WHERE LASTNAME = 'PURCELL'2-- Total rows = 1,000,0003-- There are 100 rows with lastname purcell45-- 100/1,000,000 = .0001 of rows (ie., 1/10,000)6-- Filter factor is .0001
1Employee table = 1M rows23salary_grade has 20 distinct values(COLCARDF)4-- salary_grade = ? ; FF = 1/20 = 0.05 = 50K rows56hire_date has 10k values7-- hire_date = ? ; FF = 1/10K = 0.0001 = 100 rows89sex has 2 values10-- sex = ? ; FF = 1/2 = 0.5 = 500K rows1112sex = 'M' AND salary_grade = 2313-- FF = 1/2 * 1/20 = 1/40 = 0.025 = 25K rows
Its, number of rows selected(cardinality) which affects the cost of an accesspath. The filter factor determines the cardinality
If the predicates are independent, you multiple the filter factors.
1sex = 'M' AND salary_grade = 232-- 1/2 * 1/20 = 1/40 = 0.025 = 25k rows
If they are dependent(such as city and state)
If Db2 knows they are dependent, then it does not multiply. Db2 will know that columns are dependent, if the predicate columns are the leading columns of an index or if you have collected colgroup statisitcs on those columns.
LUW – index stats have firstkeycard, first2keycard, first3keycard, first4keycard, fullkeycard
z/OS – index stats have firstkeycardfand fullkeycardf, but all intermediate combinations are gathered too and stored in SYSCOLDIST
BETWEEN is treated as dependent
If the predicates are exclusive, then the filter factors are added
1Salary_grade= 23 OR Salary_grade= 172-- 1/20 + 1/20 = .1 = 100,000 rows
If the predicates are inclusive, then the filter factor is less than the sum of the individual FFs.
1Sex = ‘M’ OR Salary_grade= 232-- 1/2 + 1/20 = .55 = 550,000 rows
But the actual FF will be somewhat less than .55 and the estimated cardinality will be between 500,000 - 550,000
11/2 + 1/20 – (1/2*1/20) = 0.525
When host variables are used, Db2 cannot use distribution statistics.
Assumption made by Db2 is, more distinct values you have smaller the range you are asking for. This assumption is very bad for timeststamps.
WHERE CUST_NBR BETWEEN ? AND ? AND LASTNAME BETWEEN ? AND ?
WHERE CUST_NBR BETWEEN 100 AND 119 AND LASTNAME BETWEEN ? AND ?
Visual Explain is processed - left to right, bottom to top
Matchcols(start keys in LUW) are the number of leading columns of an index that are used to position within the index
In short how Merge and NL works,
Does this matter ?
1Select * from customer c join orders o2on c.custid= o.custid3where c.level= ‘GOLD’ and o.order_date= current date
Problem: If statement is generated with literals for key fields (such as IDs), then every execution will likely be different. No reuse and the cache will be flooded –forcing out other statements that could have been reused.
Solution: Statement Concentrator
If turned on, Db2 will be look for exact statement in the cache, if its not found. It will replace the literal with parameter marker and search for it again. If found, it will reuse the statement in cache and doesn't have to do full prepare.
Problem: Program uses variables for static SQL or parameter markers for dynamic SQL because different values are passed in. But, some columns have highly skewed data
In general you should write dynamic SQL using parameter markers or host variables.
Use literals where it will make a difference –for low cardinality columns with a highly skewed distribution
For software packages, check with vendor whether there is any option for parameter markers or literals. Otherwise look to use statement concentration.
Need more stats – Runstats with:
Column statistics on a subset of columns
RUNSTATS TABLESPACE dbname.tsname
Distribution stats have to be asked for on a column or colgroupbasis
HISTOGRAM option for ranges
Distribution stats for indexes are specified by using NUMCOLS option preceding FREQVAL and HISTOGRAM
Column List – If you list specific columns, these are added to existing stats for other columns
Referential Integrity helps the Optimizer. With RI, the Optimizer knows that each child row will match a single parent row if the join is on the FK columns
Sometimes DB2 just doesn’t have enough information
Two sets of tables are used
|SYSIBM tables||User tables to feed into APR tables|
The process described in slide 58 & 59