Sybase: Data skew and query plans

In databases, data skew means that many values occupy a small number of rows each and a few values occupy many rows each. So basically it is an asymmetry in the distribution of the data values.

The distribution of data would look like this:
Data Skew

Here a short example to illustrate it:

We have an employee table containing all our employees world-wide (and since we’re in this example a quick-ass multinational corporation, we have millions of employees). Now assume we also want to track in this table employees of external companies we are contracting. Additionally we want to also track for which company they are actually working. So we add a table called "company" and reference the ID of this company in our employee table.

The table structure would look like this:

create table employee(
id varchar(32),
name varchar(32),
company_id varchar(32))
go
create unique index employee_id_index on employee(id)
go
create index employee_company_index on employee(company_id)
go
create table company(
id varchar(32),
name varchar(32))
go
create unique index company_id_index on company(id)
go
create index company_name_index on company(name)
go

Now let’s assume we want to get the list of all employees working for company “ext2”:

select e.name from employee e, company c where e.company_id=c.id and c.name="ext2"

This statement would result in the following query plan:

    STEP 1
        The type of query is SELECT.

  FROM TABLE
  company
  c
        Nested iteration.
  Table Scan.
  Forward Scan.
  Positioning at start of table.
  Using I/O Size 4 Kbytes for data pages.
  With LRU Buffer Replacement Strategy for data pages.

  FROM TABLE
  employee
  e
        Nested iteration.
  Table Scan.
  Forward Scan.
  Positioning at start of table.
  Using I/O Size 32 Kbytes for data pages.
  With LRU Buffer Replacement Strategy for data pages.

So basically ASE decides to perform a table scan on both tables. The density information for the company_id column leads the optimizer into thinking that it will probably have to through all data pages anyway when using the index on company_id and that it’s then actually cheaper to skip the index and directly read the data pages.

The solution to this issue is to modify the statistics in order to change the total density of a column to be equal to the range density:

sp_modifystats employee, company_id, REMOVE_SKEW_FROM_DENSITY

After that the statement above results in this query plan:

    STEP 1
        The type of query is SELECT.

  FROM TABLE
  company
  c
        Nested iteration.
  Table Scan.
  Forward Scan.
  Positioning at start of table.
  Using I/O Size 4 Kbytes for data pages.
  With LRU Buffer Replacement Strategy for data pages.

  FROM TABLE
  employee
  e
        Nested iteration.
  Index : employee_company_index
  Forward Scan.
  Positioning by key.
  Keys are:
    company_id ASC
  Using I/O Size 4 Kbytes for index leaf pages.
  With LRU Buffer Replacement Strategy for index leaf pages.
  Using I/O Size 4 Kbytes for data pages.
  With LRU Buffer Replacement Strategy for data pages.

Which is much faster. Of course if the company name we search for is “benohead corp.” (our company) instead of “ext2”, this query plan would actually be slower than going directly for a table scan. So the optimizer is actually kind of right. You just need to know how what kind of query you perform and if you see that you only check employees for the external companies then it makes sense to modify the density information.

You will have to repeat this operation (modifying the statistics) after every update statistics since updating the statistics will undo the REMOVE_SKEW_FROM_DENSITY.

Note that this problem doesn’t occur anymore (or less often ?) with ASE 15. So it’s mostly a problem when using ASE 12.5.4 or lower or ASE 15 with the compatibility mode switched ON (e.g. because you have a legacy application which performs really bad with the new optimizer…). With the ASE 15 optimizer, you get the following query plan even though without modifying the density information:

    STEP 1
        The type of query is SELECT.



       |ROOT:EMIT Operator (VA = 3)
       |
       |   |NESTED LOOP JOIN Operator (VA = 2) (Join Type: Inner Join)
       |   |
       |   |   |SCAN Operator (VA = 0)
       |   |   |  FROM TABLE
       |   |   |  company
       |   |   |  c
       |   |   |  Table Scan.
       |   |   |  Forward Scan.
       |   |   |  Positioning at start of table.
       |   |   |  Using I/O Size 4 Kbytes for data pages.
       |   |   |  With LRU Buffer Replacement Strategy for data pages.
       |   |
       |   |   |SCAN Operator (VA = 1)
       |   |   |  FROM TABLE
       |   |   |  employee
       |   |   |  e
       |   |   |  Index : employee_company_index
       |   |   |  Forward Scan.
       |   |   |  Positioning by key.
       |   |   |  Keys are:
       |   |   |    company_id ASC
       |   |   |  Using I/O Size 32 Kbytes for index leaf pages.
       |   |   |  With LRU Buffer Replacement Strategy for index leaf pages.
       |   |   |  Using I/O Size 32 Kbytes for data pages.
       |   |   |  With LRU Buffer Replacement Strategy for data pages.

But the ASE 15 optimizer also has other problems of its own…

One thought on “Sybase: Data skew and query plans

Leave a Reply

Your email address will not be published. Required fields are marked *