I've got a query like the following:
DELETE FROM tblFEStatsBrowsers WHERE BrowserID NOT IN (
SELECT DISTINCT BrowserID FROM tblFEStatsPaperHits WITH (NOLOCK) WHERE BrowserID IS NOT NULL
)
tblFEStatsBrowsers has 553 rows.
tblFEStatsPaperHits has 47,974,301 rows.
tblFEStatsBrowsers:
CREATE TABLE [dbo].[tblFEStatsBrowsers](
[BrowserID] [smallint] IDENTITY(1,1) NOT NULL,
[Browser] [varchar](50) NOT NULL,
[Name] [varchar](40) NOT NULL,
[Version] [varchar](10) NOT NULL,
CONSTRAINT [PK_tblFEStatsBrowsers] PRIMARY KEY CLUSTERED ([BrowserID] ASC)
)
tblFEStatsPaperHits:
CREATE TABLE [dbo].[tblFEStatsPaperHits](
[PaperID] [int] NOT NULL,
[Created] [smalldatetime] NOT NULL,
[IP] [binary](4) NULL,
[PlatformID] [tinyint] NULL,
[BrowserID] [smallint] NULL,
[ReferrerID] [int] NULL,
[UserLanguage] [char](2) NULL
)
There's a clustered index on tblFEStatsPaperHits that does not include BrowserID. Performing the inner query will thus require a full table scan of tblFEStatsPaperHits - which is totally OK.
Currently, a full scan is executed for each row in tblFEStatsBrowsers, meaning I've got 553 full table scans of tblFEStatsPaperHits.
Rewriting it to use WHERE NOT EXISTS doesn't change the plan:
DELETE FROM tblFEStatsBrowsers WHERE NOT EXISTS (
SELECT * FROM tblFEStatsPaperHits WITH (NOLOCK) WHERE BrowserID = tblFEStatsBrowsers.BrowserID
)
However, as suggested by Adam Machanic, adding a HASH JOIN option does result in the optimal execution plan (just a single scan of tblFEStatsPaperHits):
DELETE FROM tblFEStatsBrowsers WHERE NOT EXISTS (
SELECT * FROM tblFEStatsPaperHits WITH (NOLOCK) WHERE BrowserID = tblFEStatsBrowsers.BrowserID
) OPTION (HASH JOIN)
Now, this isn't so much a question of how to fix this - I can either use OPTION (HASH JOIN) or create a temp table manually. I'm wondering why the query optimizer would ever choose the plan it currently does.
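For reference, the temp table alternative mentioned above could look roughly like the following. This is only a sketch: #UsedBrowsers is a hypothetical name, and it assumes the distinct BrowserID values fit comfortably in tempdb (the column is a smallint, so there can be at most 65,536 of them).

```sql
-- Scan the big table once and keep only the distinct, non-NULL browser IDs.
-- #UsedBrowsers is a hypothetical temp table name.
SELECT DISTINCT BrowserID
INTO #UsedBrowsers
FROM tblFEStatsPaperHits WITH (NOLOCK)
WHERE BrowserID IS NOT NULL;

-- Delete the browsers that never appear in the hits table.
DELETE b
FROM tblFEStatsBrowsers AS b
WHERE NOT EXISTS (
    SELECT 1 FROM #UsedBrowsers AS u WHERE u.BrowserID = b.BrowserID
);

DROP TABLE #UsedBrowsers;
```

The large table is read only once here, whatever join strategy the optimizer picks for the DELETE.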
Since the QO doesn't have any stats on the BrowserID column, I'm guessing it assumes the worst - 50 million distinct values, which would require quite a large in-memory/tempdb worktable. On that assumption, the safest approach is to perform a scan for each row in tblFEStatsBrowsers. There is no foreign key relationship between the BrowserID columns in the two tables, so the QO can't deduce any information from tblFEStatsBrowsers.
Is the reason really as simple as that?
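One way to test the statistics guess is to check whether any statistics object covers BrowserID and, if none does, to create one and compare plans. This is a sketch: the statistics name is hypothetical, and whether it changes the plan chosen here is untested.

```sql
-- List statistics objects on tblFEStatsPaperHits that cover BrowserID.
SELECT s.name AS stats_name,
       s.auto_created,
       s.user_created,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
  ON sc.object_id = s.object_id AND sc.stats_id = s.stats_id
JOIN sys.columns AS c
  ON c.object_id = sc.object_id AND c.column_id = sc.column_id
WHERE s.object_id = OBJECT_ID(N'dbo.tblFEStatsPaperHits')
  AND c.name = N'BrowserID';

-- If nothing comes back, create column statistics (hypothetical name).
-- FULLSCAN reads the whole table, so expect this to take a while.
CREATE STATISTICS st_tblFEStatsPaperHits_BrowserID
ON dbo.tblFEStatsPaperHits (BrowserID)
WITH FULLSCAN;
```

Note that if the database has automatic statistics creation enabled, the first query may already return an auto-created entry, in which case missing statistics would not be the explanation.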
Update 1
To give a couple of stats:
OPTION (HASH JOIN):
208,711 logical reads (12 scans)
OPTION (LOOP JOIN, HASH GROUP):
11,008,698 logical reads (roughly one scan per BrowserID, 339 scans)
No options:
11,008,775 logical reads (roughly one scan per BrowserID, 339 scans)
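Figures like these can be collected with SET STATISTICS IO, which reports scan counts and logical reads per table in the Messages output. A sketch, assuming you want to measure without actually deleting anything (hence the rollback):

```sql
SET STATISTICS IO ON;

BEGIN TRANSACTION;

-- Swap the hint in or out to compare variants.
DELETE FROM tblFEStatsBrowsers WHERE NOT EXISTS (
    SELECT * FROM tblFEStatsPaperHits WITH (NOLOCK)
    WHERE BrowserID = tblFEStatsBrowsers.BrowserID
) OPTION (HASH JOIN);

-- Undo the delete so each variant runs against the same data.
ROLLBACK TRANSACTION;

SET STATISTICS IO OFF;
```

The DELETE still takes locks on tblFEStatsBrowsers until the rollback, so this is better run on a test copy or during a quiet period.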
Update 2
Excellent answers, all of you - thanks! Tough to pick just one. Though Martin was first and Remus provides an excellent solution, I have to give it to the Kiwi for going mental on the details :)