Does adding DISTINCT
in the following example have any impact on the query running time?
Is it wise to use it as a hint sometimes?
SELECT *
FROM A
WHERE A.SomeColumn IN (SELECT DISTINCT B.SomeColumn FROM B)
When wondering about things like this you should compare the execution plans for your queries.
The shape of the execution plan for your query will of course differ depending on how many rows you have in your tables and what indexes is defined.
One scenario that shows there is no difference in performance is when there are substantially more rows in A
than there are in B
. The optimizer will then choose B
as the driving table in a nested loop join against A
. In order to get a correct result back it has to use a Stream Aggregate on table B
in both queries to get only the distinct rows from B
. So in this case the distinct keyword has no impact on performance.
The execution plan for two other obvious cases to test, more rows in B than A and equal number of rows in the tables, also shows the exact same execution plan for the queries.
Update
Before query optimization takes place the query goes through a simplification phase. You can see what the logical tree looks like using the trace flag 8606.
The Input Tree for the queries are clearly different but after simplification they are the same.
Ref: More Undocumented Query Optimizer Trace Flags and Query Optimizer Deep Dive – Part 2
Input tree and simplified tree for query using distinct:
*** Input Tree: ***
LogOp_Project QCOL: [xx].[dbo].[A].SomeColumn
LogOp_Select
LogOp_Get TBL: A A TableID=213679909 TableReferenceID=0 IsRow: COL: IsBaseRow1002
ScaOp_SomeComp 2
ScaOp_Identifier QCOL: [xx].[dbo].[A].SomeColumn
LogOp_GbAgg OUT(QCOL: [xx].[dbo].[B].SomeColumn,) BY(QCOL: [xx].[dbo].[B].SomeColumn,)
LogOp_Project
LogOp_Project
LogOp_Get TBL: B B TableID=229679966 TableReferenceID=0 IsRow: COL: IsBaseRow1006
AncOp_PrjList
AncOp_PrjList
AncOp_PrjList
AncOp_PrjList
*******************
*** Simplified Tree: ***
LogOp_LeftSemiJoin
LogOp_Get TBL: A A TableID=213679909 TableReferenceID=0 IsRow: COL: IsBaseRow1002
LogOp_Get TBL: B B TableID=229679966 TableReferenceID=0 IsRow: COL: IsBaseRow1006
ScaOp_Comp x_cmpEq
ScaOp_Identifier QCOL: [xx].[dbo].[A].SomeColumn
ScaOp_Identifier QCOL: [xx].[dbo].[B].SomeColumn
*******************
Input tree and simplified tree for query not using distinct:
*** Input Tree: ***
LogOp_Project QCOL: [xx].[dbo].[A].SomeColumn
LogOp_Select
LogOp_Get TBL: A A TableID=213679909 TableReferenceID=0 IsRow: COL: IsBaseRow1002
ScaOp_SomeComp 2
ScaOp_Identifier QCOL: [xx].[dbo].[A].SomeColumn
LogOp_Project
LogOp_Get TBL: B B TableID=229679966 TableReferenceID=0 IsRow: COL: IsBaseRow1006
AncOp_PrjList
AncOp_PrjList
*******************
*** Simplified Tree: ***
LogOp_LeftSemiJoin
LogOp_Get TBL: A A TableID=213679909 TableReferenceID=0 IsRow: COL: IsBaseRow1002
LogOp_Get TBL: B B TableID=229679966 TableReferenceID=0 IsRow: COL: IsBaseRow1006
ScaOp_Comp x_cmpEq
ScaOp_Identifier QCOL: [xx].[dbo].[A].SomeColumn
ScaOp_Identifier QCOL: [xx].[dbo].[B].SomeColumn
*******************