Does adding DISTINCT in the following example have any impact on the query running time?
Is it wise to use it as a hint sometimes?

SELECT *
FROM   A
WHERE  A.SomeColumn IN (SELECT DISTINCT B.SomeColumn FROM B) 

1 Answer

When wondering about things like this, you should compare the execution plans for your queries.

The shape of the execution plan will of course depend on how many rows the tables hold and which indexes are defined.
One scenario that shows no difference in performance is when A has substantially more rows than B. The optimizer then chooses B as the driving table in a nested loops join against A. To return a correct result, it has to apply a Stream Aggregate to B in both queries so that only the distinct values from B are used. So in this case the DISTINCT keyword has no impact on performance.
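
If you want to try this scenario yourself, something along these lines should work as a starting point. It is only a sketch: the column type, the row counts, and the use of sys.all_objects as a row source are my assumptions, not the setup used for the plans discussed here, and the plan you get will depend on your indexes and SQL Server version.

```sql
-- Hypothetical test tables: the int column and the row counts are assumptions.
CREATE TABLE dbo.A (SomeColumn int NOT NULL);
CREATE TABLE dbo.B (SomeColumn int NOT NULL);

-- Many rows in A (values 0..999, repeated).
INSERT INTO dbo.A (SomeColumn)
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 1000
FROM   sys.all_objects AS o1
CROSS JOIN sys.all_objects AS o2;

-- Few rows in B (values 1..100).
INSERT INTO dbo.B (SomeColumn)
SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM   sys.all_objects;

-- Return the actual execution plan as XML after each statement.
SET STATISTICS XML ON;

SELECT *
FROM   dbo.A AS A
WHERE  A.SomeColumn IN (SELECT DISTINCT B.SomeColumn FROM dbo.B AS B);

SELECT *
FROM   dbo.A AS A
WHERE  A.SomeColumn IN (SELECT B.SomeColumn FROM dbo.B AS B);

SET STATISTICS XML OFF;
```

Change the TOP values to test the other row-count combinations, and compare the two plans side by side after each change.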

The two other obvious cases to test, more rows in B than in A and an equal number of rows in both tables, also produce identical execution plans for the two queries.

Update

Before query optimization takes place, the query goes through a simplification phase. You can see what the logical tree looks like using trace flag 8606.

The input trees for the two queries are clearly different, but after simplification they are the same.

Ref: More Undocumented Query Optimizer Trace Flags and Query Optimizer Deep Dive – Part 2
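
For reference, this is roughly how the trees can be produced. Treat it as a sketch: trace flag 8606 is undocumented and its output can change between versions, QUERYTRACEON normally requires sysadmin rights, and this belongs on a test instance only. Trace flag 3604 is there to send the diagnostic output to the client, so it shows up on the Messages tab in SSMS.

```sql
-- 3604: route trace output to the client.
-- 8606: print the logical trees (input tree, simplified tree, and others).
-- RECOMPILE forces a fresh compilation so the trees are actually generated.
SELECT *
FROM   A
WHERE  A.SomeColumn IN (SELECT DISTINCT B.SomeColumn FROM B)
OPTION (RECOMPILE, QUERYTRACEON 3604, QUERYTRACEON 8606);
```

Run it once with DISTINCT and once without, then compare the "Simplified Tree" sections of the two outputs.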

Input tree and simplified tree for query using distinct:

*** Input Tree: ***
        LogOp_Project QCOL: [xx].[dbo].[A].SomeColumn
            LogOp_Select
                LogOp_Get TBL: A A TableID=213679909 TableReferenceID=0 IsRow: COL: IsBaseRow1002 
                ScaOp_SomeComp 2
                    ScaOp_Identifier QCOL: [xx].[dbo].[A].SomeColumn
                    LogOp_GbAgg OUT(QCOL: [xx].[dbo].[B].SomeColumn,) BY(QCOL: [xx].[dbo].[B].SomeColumn,)
                        LogOp_Project
                            LogOp_Project
                                LogOp_Get TBL: B B TableID=229679966 TableReferenceID=0 IsRow: COL: IsBaseRow1006 
                                AncOp_PrjList 
                            AncOp_PrjList 
                        AncOp_PrjList 
            AncOp_PrjList 
*******************
*** Simplified Tree: ***
        LogOp_LeftSemiJoin
            LogOp_Get TBL: A A TableID=213679909 TableReferenceID=0 IsRow: COL: IsBaseRow1002 
            LogOp_Get TBL: B B TableID=229679966 TableReferenceID=0 IsRow: COL: IsBaseRow1006 
            ScaOp_Comp x_cmpEq
                ScaOp_Identifier QCOL: [xx].[dbo].[A].SomeColumn
                ScaOp_Identifier QCOL: [xx].[dbo].[B].SomeColumn
*******************

Input tree and simplified tree for query not using distinct:

*** Input Tree: ***
        LogOp_Project QCOL: [xx].[dbo].[A].SomeColumn
            LogOp_Select
                LogOp_Get TBL: A A TableID=213679909 TableReferenceID=0 IsRow: COL: IsBaseRow1002 
                ScaOp_SomeComp 2
                    ScaOp_Identifier QCOL: [xx].[dbo].[A].SomeColumn
                    LogOp_Project
                        LogOp_Get TBL: B B TableID=229679966 TableReferenceID=0 IsRow: COL: IsBaseRow1006 
                        AncOp_PrjList 
            AncOp_PrjList 
*******************
*** Simplified Tree: ***
        LogOp_LeftSemiJoin
            LogOp_Get TBL: A A TableID=213679909 TableReferenceID=0 IsRow: COL: IsBaseRow1002 
            LogOp_Get TBL: B B TableID=229679966 TableReferenceID=0 IsRow: COL: IsBaseRow1006 
            ScaOp_Comp x_cmpEq
                ScaOp_Identifier QCOL: [xx].[dbo].[A].SomeColumn
                ScaOp_Identifier QCOL: [xx].[dbo].[B].SomeColumn
*******************
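
Both simplified trees are a left semi join from A to B on an equality comparison of SomeColumn. As an illustration only (this form was not part of the test above), the same logical request can be written with EXISTS, which may make it easier to see why duplicates in B do not matter: each row of A is returned at most once as soon as any match is found.

```sql
-- Illustrative rewrite, not taken from the original test.
-- A semi join returns each A row once if at least one matching B row exists.
SELECT *
FROM   A
WHERE  EXISTS (SELECT 1
               FROM   B
               WHERE  B.SomeColumn = A.SomeColumn);
```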