declare @rows int
select @rows = count(1) from t
-- Other issues arise if row counts are in the bigint range.
-- This is also not 'true random', although that is likely not required.
declare @skip int = convert(int, @rows * rand())
select t.*
from t
order by t.id -- Make sure this is the clustered PK or an IX/UCL axis!
offset (@skip) rows
fetch first 1 row only
- It is relatively fast over huge data sets and can be used efficiently in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, when aligned with the rest of the query, the overhead is often minimal (see the pre-filtered sketch after this list).
- Does not suffer from `CHECKSUM(*)` / `BINARY_CHECKSUM(*)` issues with runs of data. When using the `CHECKSUM(*)` approach, rows can be selected in "chunks" and not be "random" at all! This is because `CHECKSUM` prefers speed over distribution.
- Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions (see the seeded sketch after this list). Approaches that use `NEWID()` can never be stable/repeatable.
- Does not use `ORDER BY NEWID()` over the entire input set, as ordering can become a significant bottleneck with large input sets. Avoiding the unnecessary sort also reduces memory and tempdb usage.
- Does not use `TABLESAMPLE` and thus works with a `WHERE` pre-filter.
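
As referenced above, here is a minimal sketch of the same offset technique applied to a pre-filtered set, with an optional fixed seed for repeatability. The filter (`t.category = 'x'`) and `@seed` are illustrative assumptions, not part of the original query:

declare @seed int = 12345 -- Fixed seed => repeatable selection; use rand() instead for varying rows.
declare @rows int
select @rows = count(1) from t where t.category = 'x' -- The same pre-filter must be used in both queries!
declare @skip int = convert(int, @rows * rand(@seed))
select t.*
from t
where t.category = 'x'
order by t.id -- Clustered PK or an index aligned with the filter.
offset (@skip) rows
fetch first 1 row only

`RAND(@seed)` with the same seed yields the same `@skip` and therefore the same row; dropping the seed restores per-execution variation.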
Compared to other answers here:
- Unlike the basic `SELECT TOP n .. ORDER BY NEWID()`, this is not guaranteed to return "exactly N" rows. Instead, it returns a percentage of rows, where that percentage is pre-determined. For very small sample sizes this could result in 0 rows selected (a guard for this case is sketched after this list). This limitation is shared with the `CHECKSUM(*)` approaches.
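
If the 0-row outcome matters, one possible guard is to capture the sample and fall back when nothing was selected. This is a sketch, not part of the original answer: `@picked`, `t.id`, and the plain `NEWID()` fallback are assumptions, and `@sample_percent` is reused from the gist below.

declare @picked table (id int)
insert @picked (id)
select top 1 t.id
from t
where abs(
        convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
      ) % (1000 * 100) < (1000 * @sample_percent)
order by newid()
if @@rowcount = 0 -- The probabilistic filter can select nothing for tiny samples.
    insert @picked (id)
    select top 1 t.id from t order by newid()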
Here is the gist. See this answer for additional details and notes.
declare @sample_percent decimal(7, 4)
-- Sample "approximately 1000 rows" from the table.
select @sample_percent = 100.0 * (1000.0 / count(1)) from t
-- There is a very high chance that a limited-yet-non-zero number of rows will be sampled.
-- The sampled rows are then sorted randomly before the first is selected.
select top 1
t.*
from t
where 1=1
and ( -- sample
@sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * @sample_percent)
)
-- ONLY the sampled rows are ordered, which improves scalability.
order by newid()
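
To draw different rows on subsequent executions while keeping the selection logic identical, the hashed input can be salted. A minimal sketch; the `@salt` variable is an assumption (a fixed salt value makes the selection repeatable again):

declare @salt varbinary(16) = convert(varbinary(16), newid()) -- New salt per run => new sample.
select top 1
    t.*
from t
where 1=1
and ( -- sample
    @sample_percent = 100
    or abs(
        convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid) + @salt))
    ) % (1000 * 100) < (1000 * @sample_percent)
)
order by newid()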