-- Sample "approximately 1000 rows" from the table,
-- dealing with some edge-cases.
declare @rows int
select @rows = count(1) from t

declare @sample_size int = 1000
declare @sample_percent decimal(7, 4) = case
    when @rows <= 1000 then 100                              -- not enough rows
    when (100.0 * @sample_size / @rows) < 0.0001 then 0.0001 -- min sample percent
    else 100.0 * @sample_size / @rows                        -- everything else
    end

-- There is a statistical "guarantee" of having sampled a limited-yet-non-zero number of rows.
-- The limited rows are then sorted randomly before the first is selected.
select top 1
    t.*
from t
where 1=1
    and ( -- sample
        @sample_percent = 100
        or abs(
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @sample_percent)
    )
-- ONLY the sampled rows are ordered, which improves scalability.
order by newid()
Note: This is an adaptation of the answer as found on a SQL Server specific question about fetching a sample of rows. It has been tailored for context.

-- Pick a single random row by skipping to a random offset within the clustering key.
declare @rows int
select @rows = count(1) from t

-- Other issues arise if row counts are in the bigint range.
-- This is also not 'true random', although such is likely not required.
declare @skip int = convert(int, @rows * rand())

select t.*
from t
order by t.id -- Make sure this is the clustered PK or an IX/UCL axis!
offset (@skip) rows
fetch first 1 row only

The hash-based sampling approach shown in the gist below:

  • It is relatively fast over huge data sets and can be efficiently used in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal.

  • Does not suffer from CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. When using the CHECKSUM(*) approach, the rows can be selected in "chunks" and not be "random" at all! This is because CHECKSUM prefers speed over distribution.

  • Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions (a seeded variant is sketched after the gist below). Approaches that use NEWID() can never be stable/repeatable.

  • Does not use ORDER BY NEWID() over the entire input set, as ordering can become a significant bottleneck with large input sets. Avoiding unnecessary sorting also reduces memory and tempdb usage.

  • Does not use TABLESAMPLE and thus works with a WHERE pre-filter (also sketched after the gist below).

Compared to other answers here:

  • Unlike the basic SELECT TOP n .. ORDER BY NEWID(), this is not guaranteed to return "exactly N" rows. Instead, it returns a pre-determined percentage of rows. For very small sample sizes this could result in 0 rows being selected. This limitation is shared with the CHECKSUM(*) approaches.

Here is the gist. See this answer for additional details and notes.

-- Sample "approximately 1000 rows" from the table,
-- dealing with some edge-cases.
declare @rows int
select @rows = count(1) from t

declare @sample_size int = 1000
declare @sample_percent decimal(7, 4) = case
    when @rows <= @sample_size then 100                      -- not enough rows
    when (100.0 * @sample_size / @rows) < 0.0001 then 0.0001 -- min sample percent
    else 100.0 * @sample_size / @rows                        -- everything else
    end

-- There is a statistical "guarantee" of having sampled a limited-yet-non-zero number of rows.
-- The limited rows are then sorted randomly before the first is selected.
select top 1
    t.*
from t
where 1=1
    and ( -- sample
        @sample_percent = 100
        or abs(
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @sample_percent)
    )
-- ONLY the sampled rows are ordered, which improves scalability.
order by newid()
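
As a concrete illustration of the "stable yet trivially changed" point above, here is a minimal sketch (not part of the original gist) that mixes a salt into the hashed bytes. It assumes it runs in the same batch as the gist, so @sample_percent and t.rowguid are still available; @salt is a hypothetical seed value, kept fixed for a repeatable sampled set or changed per run for a fresh one. The final TOP 1 .. ORDER BY NEWID() still picks randomly within the sampled set; only the set itself is stable.

-- Sketch only: seeded variant of the gist (assumes @sample_percent from the gist
-- is still in scope). @salt is a hypothetical value, not part of the original answer.
declare @salt varbinary(64) = convert(varbinary(64), 'draw-1')  -- change per run for new rows

select top 1
    t.*
from t
where 1=1
    and ( -- sample, keyed on rowguid + salt
        @sample_percent = 100
        or abs(
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid) + @salt))
        ) % (1000 * 100) < (1000 * @sample_percent)
    )
-- Still only the sampled rows are ordered.
order by newid()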
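
Likewise, for the TABLESAMPLE point: since the sampling is an ordinary predicate, a WHERE pre-filter composes with it directly. The sketch below is illustrative only; t.status = 'active' stands in for a real filter (a hypothetical column), and the percentage is recomputed against the filtered count so the sample size stays near the target.

-- Sketch only: t.status is a hypothetical column standing in for a real pre-filter.
declare @filtered_rows int
select @filtered_rows = count(1) from t where t.status = 'active'

-- (For very large filtered sets, also apply the gist's minimum-percent branch.)
declare @filtered_percent decimal(7, 4) = case
    when @filtered_rows <= 1000 then 100
    else 100.0 * 1000 / @filtered_rows
    end

select top 1
    t.*
from t
where 1=1
    and t.status = 'active' -- ordinary pre-filter, evaluated alongside the sample predicate
    and ( -- sample
        @filtered_percent = 100
        or abs(
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @filtered_percent)
    )
order by newid()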