Split intervals

Question

I have two tables. Each holds some attributes for a business entity and the date range for which those attributes were valid. I want to combine these tables into one, matching rows on the common business key and splitting the time ranges.

The real-world example is two source temporal tables feeding a type-2 dimension table in the data warehouse.

The entity can be present in neither, one or both of the source systems at any point in time. Once an entity is recorded in a source system the intervals are well-behaved - no gaps, duplicates or other monkey business. Membership in the sources can end at different dates.

The business rules state we only want to return intervals where the entity is present in both sources simultaneously.

What query will give this result?

This illustrates the situation:

Month          J     F     M     A     M     J     J
Source A:  <--><----------><----------><---->
Source B:            <----><----><----------------><-->
               
Result:              <----><----><----><---->

Sample Data

For simplicity I've used closed date intervals; likely any solution could be extended to half-open intervals with a little typing.

drop table if exists dbo.SourceA;
drop table if exists dbo.SourceB;
go

create table dbo.SourceA
(
    BusinessKey int,
    StartDate   date,
    EndDate     date,
    Attribute   char(9)
);

create table dbo.SourceB
(
    BusinessKey int,
    StartDate   date,
    EndDate     date,
    Attribute   char(9)
);
GO


insert dbo.SourceA(BusinessKey, StartDate, EndDate, Attribute)
values
    (1, '19990101', '19990113', 'black'),
    (1, '19990114', '19990313', 'red'),
    (1, '19990314', '19990513', 'blue'),
    (1, '19990514', '19990613', 'green'),
    (2, '20110714', '20110913', 'pink'),
    (2, '20110914', '20111113', 'white'),
    (2, '20111114', '20111213', 'gray');

insert dbo.SourceB(BusinessKey, StartDate, EndDate, Attribute)
values
    (1, '19990214', '19990313', 'left'),
    (1, '19990314', '19990413', 'right'),
    (1, '19990414', '19990713', 'centre'),
    (1, '19990714', '19990730', 'back'),
    (2, '20110814', '20110913', 'top'),
    (2, '20110914', '20111013', 'middle'),
    (2, '20111014', '20120113', 'bottom');

Desired output

BusinessKey StartDate   EndDate     a_Colour  b_Placement
----------- ----------  ----------  --------- -----------
1           1999-02-14  1999-03-13  red       left     
1           1999-03-14  1999-04-13  blue      right    
1           1999-04-14  1999-05-13  blue      centre   
1           1999-05-14  1999-06-13  green     centre   
2           2011-08-14  2011-09-13  pink      top      
2           2011-09-14  2011-10-13  white     middle   
2           2011-10-14  2011-11-13  white     bottom   
2           2011-11-14  2011-12-13  gray      bottom

So I'm a little confused as to why the EndDates are so premature - for BusinessKey 1, Source A, the greatest end date is 1999-06-13, but in Source B, the greatest StartDate occurs after that. Are there rows we're not seeing (hence the cutoff)? Generally, using intervals (as opposed to points in time) is bad for exactly this reason. — user212533, Commented Jan 5, 2021 at 16:05
@bbaird SourceA has one interval that ends before any B starts and vice versa at the end. This is on purpose, as test data, to exercise the rule that output only includes times when both sources have input. — Michael Green, Commented Jan 6, 2021 at 10:17
There's a much cleaner solution, but it requires a most recent row to ensure the proper value is returned. So I think I'm asking if this is more a case of "the data is incomplete on purpose" or "the data really does have hard end dates that conflict"? — user212533, Commented Jan 6, 2021 at 15:00

Lennart - Slava Ukraini · Accepted Answer · 2021-01-05 14:36:52Z

I may have misunderstood your question, but this query seems to return the results you asked for:

select a.businesskey
     -- greatest(a.startdate, b.startdate)
     , case when a.startdate > b.startdate 
            then a.startdate 
            else b.startdate 
       end as startdate
     -- least(a.enddate, b.enddate)
     , case when a.enddate < b.enddate 
            then a.enddate 
            else b.enddate 
       end as enddate
     , a.attribute as a_color
     , b.attribute as b_placement
from dbo.SourceA a 
join dbo.SourceB b 
        on a.businesskey = b.businesskey
       and (a.startdate between b.startdate and b.enddate 
          or b.startdate between a.startdate and a.enddate)
order by 1,2

Since the intervals need to overlap, most of the work can be done by a join with the overlap test as its predicate. Then it's just a matter of taking the intersection of the two intervals.
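To see both steps on a single pair of rows, here is a minimal sketch using the blue row from SourceA and the centre row from SourceB in the sample data. The inline `values` constructors and the `IsOverlap` column are only for illustration. The overlap test is written in the shorter form `a.StartDate <= b.EndDate and b.StartDate <= a.EndDate`, which should be equivalent to the join predicate above as long as every StartDate is no later than its EndDate.

```sql
-- Illustrative only: one row from each source, copied from the sample data.
select
    -- closed intervals overlap when each one starts on or before the other ends
    IsOverlap = case when a.StartDate <= b.EndDate
                      and b.StartDate <= a.EndDate
                     then 1 else 0 end,
    -- the intersection starts at the later of the two start dates
    StartDate = case when a.StartDate > b.StartDate
                     then a.StartDate else b.StartDate end,
    -- and ends at the earlier of the two end dates
    EndDate   = case when a.EndDate < b.EndDate
                     then a.EndDate else b.EndDate end
from (values (cast('19990314' as date), cast('19990513' as date)))
         as a (StartDate, EndDate)
cross join
     (values (cast('19990414' as date), cast('19990713' as date)))
         as b (StartDate, EndDate);
-- IsOverlap = 1, StartDate = 1999-04-14, EndDate = 1999-05-13
```

This matches the blue/centre row of the desired output. For half-open intervals, where EndDate is the first day not covered, the same sketch should work with the two `<=` comparisons changed to `<`; the intersection expressions stay as they are.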

LEAST and GREATEST seem to be missing as functions, so I used a case expression instead.
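For readers on a platform that does provide them (SQL Server 2022 and Azure SQL Database have GREATEST and LEAST; whether your own platform does is an assumption to check against its documentation), the case expressions collapse to the commented-out forms. A sketch of the same query under that assumption, run against the sample tables from the question:

```sql
-- Assumes GREATEST and LEAST are available on the target platform.
select a.businesskey
     , greatest(a.startdate, b.startdate) as startdate  -- later start
     , least(a.enddate, b.enddate)        as enddate    -- earlier end
     , a.attribute as a_color
     , b.attribute as b_placement
from dbo.SourceA a
join dbo.SourceB b
        on a.businesskey = b.businesskey
       and (a.startdate between b.startdate and b.enddate
          or b.startdate between a.startdate and a.enddate)
order by 1, 2;
```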

Fiddle

Michael Green · Accepted Answer · 2021-01-05 13:40:40Z

This solution deconstructs the source intervals to just their starting dates. By combining these two lists, a set of output interval start dates is obtained. From these, the corresponding output end dates are calculated by a window function. As the final output interval must end when either of the two input intervals ends, there is special processing to determine this value.

;with Dates as
(
    select BusinessKey, StartDate
    from dbo.SourceA

    union

    select BusinessKey, StartDate
    from dbo.SourceB

    union

    select x.BusinessKey, DATEADD(DAY, 1, MIN(x.EndDate))
    from
    (
        select BusinessKey, EndDate = MAX(EndDate) 
        from dbo.SourceA
        group by BusinessKey

        union all

        select BusinessKey, EndDate = MAX(EndDate) 
        from dbo.SourceB
        group by BusinessKey
    ) as x
    group by x.BusinessKey
),
Intervals as
(
    select
        dt.BusinessKey,
        dt.StartDate,
        EndDate = lead (DATEADD(DAY, -1, dt.StartDate), 1)
                  over (partition by dt.BusinessKey order by dt.StartDate)
    from Dates as dt
)
select
    i.BusinessKey,
    i.StartDate,
    i.EndDate, 
    a_Colour = a.Attribute,
    b_Placement = b.Attribute
from Intervals as i
inner join dbo.SourceA as a
    on i.BusinessKey = a.BusinessKey
    and i.StartDate between a.StartDate and a.EndDate
inner join dbo.SourceB as b
    on i.BusinessKey = b.BusinessKey
    and i.StartDate between b.StartDate and b.EndDate
where i.EndDate is not NULL
order by
    i.BusinessKey,
    i.StartDate;

The "Dates" CTE uses UNION rather than UNION ALL to eliminate duplicates. If both sources change on the same date we want only one corresponding output row.

As we want to close output when either source closes, the third query in "Dates" adds the earliest end date, i.e. the MIN of the MAX of EndDates. As it is an EndDate masquerading as a StartDate, it must have another day added to it. Its purpose is to allow the window function to calculate the end of the preceding interval. The row it starts never reaches the output: the source that ended first has no interval covering that date, so the inner joins reject it, and if it is the last date for the key the final predicate rejects it as well.
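The effect of that extra date can be seen in isolation. The sketch below hard-codes the distinct dates that "Dates" should yield for BusinessKey 1 in the sample data (the inline `values` list is illustrative, worked out by hand from the inserts above) and applies the same LEAD expression as the "Intervals" CTE.

```sql
-- Illustrative only: start dates for BusinessKey 1 from both sources,
-- plus the extra date 1999-06-14 (SourceA's last EndDate, 1999-06-13, plus a day).
with Dates as
(
    select StartDate = cast(v.StartDate as date)
    from (values ('19990101'), ('19990114'), ('19990214'), ('19990314'),
                 ('19990414'), ('19990514'), ('19990614'), ('19990714')
         ) as v (StartDate)
)
select
    StartDate,
    -- each interval ends the day before the next one starts
    EndDate = lead(dateadd(day, -1, StartDate), 1)
              over (order by StartDate)
from Dates
order by StartDate;
-- 1999-05-14 is paired with EndDate 1999-06-13 because 1999-06-14 follows it;
-- the last row, 1999-07-14, has no following date and gets a NULL EndDate.
```

Without the 1999-06-14 row, the interval starting 1999-05-14 would run to 1999-07-13, past the point where SourceA stops supplying values.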

Using inner joins for the final query eliminates those source intervals for which there is no corresponding value in the other source.

Michael Green · Accepted Answer · 2022-05-19 03:37:30Z

There are a lot of interesting solutions to this problem (stated in different terms) here and its preceding pages. There it is presented as matching supply and demand in an auction. The units supplied or demanded are directly analogous to the days in an interval from this question, so the solution translates. I've left it in the terms used in the linked site, though.

Sample data.

DROP TABLE IF EXISTS dbo.Auctions;
 
CREATE TABLE dbo.Auctions
(
  ID INT NOT NULL IDENTITY(1, 1)
    CONSTRAINT pk_Auctions PRIMARY KEY CLUSTERED,
  Code CHAR(1) NOT NULL
    CONSTRAINT ck_Auctions_Code CHECK (Code = 'D' OR Code = 'S'),
  Quantity DECIMAL(19, 6) NOT NULL
    CONSTRAINT ck_Auctions_Quantity CHECK (Quantity > 0)
);
 
SET NOCOUNT ON;
 
DELETE FROM dbo.Auctions;
 
SET IDENTITY_INSERT dbo.Auctions ON;
 
INSERT INTO dbo.Auctions(ID, Code, Quantity) VALUES
  (1, 'D', 5.0),
  (2, 'D', 3.0),
  (3, 'D', 8.0),
  (5, 'D', 2.0),
  (6, 'D', 8.0),
  (7, 'D', 4.0),
  (8, 'D', 2.0),
  (1000, 'S', 8.0),
  (2000, 'S', 6.0),
  (3000, 'S', 2.0),
  (4000, 'S', 2.0),
  (5000, 'S', 4.0),
  (6000, 'S', 3.0),
  (7000, 'S', 2.0);

The solutions expounded there reduce the elapsed time for the author's 400k row sample data from a naive 11 seconds to 0.4 seconds. The fastest is by Paul White (of this parish), shown here.

DROP TABLE IF EXISTS #MyPairings;
 
CREATE TABLE #MyPairings
(
  DemandID integer NOT NULL,
  SupplyID integer NOT NULL,
  TradeQuantity decimal(19, 6) NOT NULL
);
GO
 
INSERT #MyPairings 
    WITH (TABLOCK)
(
    DemandID,
    SupplyID,
    TradeQuantity
)
SELECT 
    Q3.DemandID,
    Q3.SupplyID,
    Q3.TradeQuantity
FROM 
(
    SELECT
        Q2.DemandID,
        Q2.SupplyID,
        TradeQuantity =
            -- Interval overlap
            CASE
                WHEN Q2.Code = 'S' THEN
                    CASE
                        WHEN Q2.CumDemand >= Q2.IntEnd THEN Q2.IntLength
                        WHEN Q2.CumDemand > Q2.IntStart THEN Q2.CumDemand - Q2.IntStart
                        ELSE 0.0
                    END
                WHEN Q2.Code = 'D' THEN
                    CASE
                        WHEN Q2.CumSupply >= Q2.IntEnd THEN Q2.IntLength
                        WHEN Q2.CumSupply > Q2.IntStart THEN Q2.CumSupply - Q2.IntStart
                        ELSE 0.0
                    END
            END
    FROM
    (
        SELECT 
            Q1.Code, 
            Q1.IntStart, 
            Q1.IntEnd, 
            Q1.IntLength, 
            DemandID = MAX(IIF(Q1.Code = 'D', Q1.ID, 0)) OVER (
                    ORDER BY Q1.IntStart, Q1.ID 
                    ROWS UNBOUNDED PRECEDING),
            SupplyID = MAX(IIF(Q1.Code = 'S', Q1.ID, 0)) OVER (
                    ORDER BY Q1.IntStart, Q1.ID 
                    ROWS UNBOUNDED PRECEDING),
            CumSupply = SUM(IIF(Q1.Code = 'S', Q1.IntLength, 0)) OVER (
                    ORDER BY Q1.IntStart, Q1.ID 
                    ROWS UNBOUNDED PRECEDING),
            CumDemand = SUM(IIF(Q1.Code = 'D', Q1.IntLength, 0)) OVER (
                    ORDER BY Q1.IntStart, Q1.ID 
                    ROWS UNBOUNDED PRECEDING)
        FROM 
        (
            -- Demand intervals
            SELECT 
                A.ID, 
                A.Code, 
                IntStart = SUM(A.Quantity) OVER (
                    ORDER BY A.ID 
                    ROWS UNBOUNDED PRECEDING) - A.Quantity,
                IntEnd = SUM(A.Quantity) OVER (
                    ORDER BY A.ID 
                    ROWS UNBOUNDED PRECEDING),
                IntLength = A.Quantity
            FROM dbo.Auctions AS A
            WHERE 
                A.Code = 'D'
 
            UNION ALL 
 
            -- Supply intervals
            SELECT 
                A.ID, 
                A.Code, 
                IntStart = SUM(A.Quantity) OVER (
                    ORDER BY A.ID 
                    ROWS UNBOUNDED PRECEDING) - A.Quantity,
                IntEnd = SUM(A.Quantity) OVER (
                    ORDER BY A.ID 
                    ROWS UNBOUNDED PRECEDING),
                IntLength = A.Quantity
            FROM dbo.Auctions AS A
            WHERE 
                A.Code = 'S'
        ) AS Q1
    ) AS Q2
) AS Q3
WHERE
    Q3.TradeQuantity > 0;
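The innermost derived table (Q1) is where quantities become intervals: a running total gives each row's end point, and subtracting the row's own quantity gives its start. A minimal sketch of just that step, using the first three demand rows of the sample data inline (the `values` list and the alias `d` are illustrative only):

```sql
-- Illustrative only: turn per-row quantities into consecutive ranges.
select
    d.ID,
    -- running total up to, but not including, this row
    IntStart = sum(d.Quantity) over (order by d.ID
                                     rows unbounded preceding) - d.Quantity,
    -- running total including this row
    IntEnd   = sum(d.Quantity) over (order by d.ID
                                     rows unbounded preceding)
from (values (1, 5.0), (2, 3.0), (3, 8.0)) as d (ID, Quantity)
order by d.ID;
-- ID 1: 0.0 to 5.0;  ID 2: 5.0 to 8.0;  ID 3: 8.0 to 16.0
```

Demand and supply each get their own set of consecutive, non-overlapping ranges this way, which is what makes the problem look like the two date-ranged sources in the question.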

Tags: sql-server, interval, azure-synapse-analytics