-- Sample "approximately 1000 rows" from the table,
-- dealing with some edge-cases.
declare @rows int
select @rows = count(1) from t

declare @sample_size int = 1000
declare @sample_percent decimal(7, 4) = case
    when @rows <= 1000 then 100                              -- not enough rows
    when (100.0 * @sample_size / @rows) < 0.0001 then 0.0001 -- min sample percent
    else 100.0 * @sample_size / @rows                        -- everything else
    end

-- There is a statistical "guarantee" of having sampled a limited-yet-non-zero number of rows.
-- The limited rows are then sorted randomly before the first is selected.
select top 1
    t.*
from t
where 1=1
    and ( -- sample
        @sample_percent = 100
        or abs(
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @sample_percent)
    )
-- ONLY the sampled rows are ordered, which improves scalability.
order by newid()
Note: This is an adaptation of the answer as found on a SQL Server specific question about fetching a sample of rows. It has been tailored for context.

-- Pick a single random row by skipping to a random offset within the clustering key.
declare @rows int
select @rows = count(1) from t

-- Other issues arise if row counts are in the bigint range.
-- This is also not 'true random', although such is likely not required.
declare @skip int = convert(int, @rows * rand())

select t.*
from t
order by t.id -- Make sure this is the clustered PK or an IX/UCL axis!
offset (@skip) rows
fetch first 1 row only

The hash-based sampling approach shown in the gist below:

  • It is relatively fast over huge data sets and can be efficiently used in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal.

  • Does not suffer from CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. When using the CHECKSUM(*) approach, the rows can be selected in "chunks" and not be "random" at all! This is because CHECKSUM prefers speed over distribution.

  • Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions (a seeded variant is sketched after the gist below). Approaches that use NEWID() can never be stable/repeatable.

  • Does not use ORDER BY NEWID() over the entire input set, as ordering can become a significant bottleneck with large input sets. Avoiding unnecessary sorting also reduces memory and tempdb usage.

  • Does not use TABLESAMPLE and thus works with a WHERE pre-filter (also sketched after the gist below).

Compared to other answers here:

  • Unlike the basic SELECT TOP n .. ORDER BY NEWID(), this is not guaranteed to return "exactly N" rows. Instead, it returns a pre-determined percentage of rows. For very small sample sizes this could result in 0 rows being selected. This limitation is shared with the CHECKSUM(*) approaches.

Here is the gist. See this answer for additional details and notes.

-- Sample "approximately 1000 rows" from the table,
-- dealing with some edge-cases.
declare @rows int
select @rows = count(1) from t

declare @sample_size int = 1000
declare @sample_percent decimal(7, 4) = case
    when @rows <= @sample_size then 100                      -- not enough rows
    when (100.0 * @sample_size / @rows) < 0.0001 then 0.0001 -- min sample percent
    else 100.0 * @sample_size / @rows                        -- everything else
    end

-- There is a statistical "guarantee" of having sampled a limited-yet-non-zero number of rows.
-- The limited rows are then sorted randomly before the first is selected.
select top 1
    t.*
from t
where 1=1
    and ( -- sample
        @sample_percent = 100
        or abs(
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @sample_percent)
    )
-- ONLY the sampled rows are ordered, which improves scalability.
order by newid()
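
As a concrete illustration of the "stable yet trivially changed" point above, here is a minimal sketch (not part of the original gist) that mixes a salt into the hashed bytes. It assumes it runs in the same batch as the gist, so @sample_percent and t.rowguid are still available; @salt is a hypothetical seed value, kept fixed for a repeatable sampled set or changed per run for a fresh one. The final TOP 1 .. ORDER BY NEWID() still picks randomly within the sampled set; only the set itself is stable.

-- Sketch only: seeded variant of the gist (assumes @sample_percent from the gist
-- is still in scope). @salt is a hypothetical value, not part of the original answer.
declare @salt varbinary(64) = convert(varbinary(64), 'draw-1')  -- change per run for new rows

select top 1
    t.*
from t
where 1=1
    and ( -- sample, keyed on rowguid + salt
        @sample_percent = 100
        or abs(
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid) + @salt))
        ) % (1000 * 100) < (1000 * @sample_percent)
    )
-- Still only the sampled rows are ordered.
order by newid()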
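
Likewise, for the TABLESAMPLE point: since the sampling is an ordinary predicate, a WHERE pre-filter composes with it directly. The sketch below is illustrative only; t.status = 'active' stands in for a real filter (a hypothetical column), and the percentage is recomputed against the filtered count so the sample size stays near the target.

-- Sketch only: t.status is a hypothetical column standing in for a real pre-filter.
declare @filtered_rows int
select @filtered_rows = count(1) from t where t.status = 'active'

-- (For very large filtered sets, also apply the gist's minimum-percent branch.)
declare @filtered_percent decimal(7, 4) = case
    when @filtered_rows <= 1000 then 100
    else 100.0 * 1000 / @filtered_rows
    end

select top 1
    t.*
from t
where 1=1
    and t.status = 'active' -- ordinary pre-filter, evaluated alongside the sample predicate
    and ( -- sample
        @filtered_percent = 100
        or abs(
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @filtered_percent)
    )
order by newid()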