
I came to ask this question after coding in a script component for some time, mainly testing how to loop over the columns of the Row object, see Looping Through Columns in SSIS Script Component - Stack Overflow and the comment below it that asked me to open a new question. I also checked the even older question on the same topic, Apply row transformation for multiple input columns in Script Transformation, which is how I found out how to loop through the columns. That part is therefore not the question here.

I am migrating some SQL Agent jobs to SSIS. The Agent jobs run SQL with a lot of nested selects or CTEs. My aim is to take each given piece of SQL and split it by nested query level, so that each level is visible at first sight when looking at the SSIS Data Flow Task. Step by step is easier to read than one large nested SQL query.

I am not the first to ask about staying inside the DFT and changing the data source, see SSIS data flow to update source table rows after copying to destination, though that question is about changing the source table itself, while this question is about putting changes on top of the source table and making that the new output within the DFT.

Take this pattern:

select abc.*, abc.column1 + abc.column2 as column3 from (select * from xyz) abc

Or with a CTE:

with abc as (
select * from xyz
)
select abc.*, abc.column1 + abc.column2 as column3 from abc

The two are equivalent, and I want to split them into two steps in the Data Flow Task (DFT), one after the other, so that ideally each step holds just one level of the nested query:

OLE DB Source "DataSourceAbc":

select * from xyz

Script component:

create table ##tmpDataSourceAbc (...); -- filled with the rows coming from "DataSourceAbc"
select abc.*, abc.column1 + abc.column2 as column3 from ##tmpDataSourceAbc abc;

It will often just be about adding a few columns, changing the columns a bit, doing a lookup or the like, and for this, SSIS or third parties give you tools like the Derived Column or Lookup components. But I want to avoid rewriting the SQL code that I have at hand in the SSIS expression language with its own data types. Call me lazy or afraid, but I would like to take over the SQL code as it is instead of turning it into SSIS code.

What is more, I would like to use the built-in tools that allow SQL and avoid script components where I can:

  • That is why I use the "OLE DB Source" at the beginning to fetch data with SQL from the server instead of just selecting into a temporary table inside the script component.
  • And what does not work is to create a temp table in a Data Source and pass it further downstream, since it gets lost right after leaving the Data Source item, see Is it possible to use a temp table in a data flow source?.

And since I want to work further downstream on the output, I cannot just attach a Data Destination, since that item does not have an output arrow. Instead, I need a Script Component that has both an input and an output arrow so that it can pass all of the columns further downstream. I want to stay inside the DFT the whole time, since the query that I put into SSIS is itself just one SQL data flow step in a big stored procedure with many steps.

Therefore, the plan is to build a script component that takes the data from an OLE DB Source (for example), then creates a temporary table and fills it with that data.

The aim after this, which is no longer part of the question, is that afterwards I want to add a new column or change something on top of the temporary table, and since it is a temporary table, that can be done with mere SQL inside the same DFT, as the sketch below shows.
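
For example, a minimal sketch of what I mean by that, assuming the temp table from above is called ##tmpDataSourceAbc, the connection that created it is still open, and the column names are placeholders:

using System.Data.SqlClient;

static class AddColumnOnTempTable
{
    public static void Run(SqlConnection conn)
    {
        // Two separate batches: the UPDATE cannot reference a column
        // that is added in the same batch.
        using (var add = new SqlCommand(
            "ALTER TABLE ##tmpDataSourceAbc ADD column3 int NULL;", conn))
        {
            add.ExecuteNonQuery();
        }
        using (var fill = new SqlCommand(
            "UPDATE ##tmpDataSourceAbc SET column3 = column1 + column2;", conn))
        {
            fill.ExecuteNonQuery();
        }
    }
}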

What I know but still do not want to do:

  • I know how to select data into a temporary table with SQL in C#, see Select Into Explanation / Temporary Tables (a short sketch follows this list). But that would mean having a SQL query that fills the temporary table, while I want to fill it with the imported data coming from a component upstream in the data flow.

  • I also know that I can create the temporary table in the control flow, see How to create a temporary table in SSIS control flow task and then use it in data flow task?. Yet, I want the temporary table to be created on the fly in the DFT right when I need it, like a nested query or CTE that is simply replaced by a temporary table query. The package is then more readable if I keep the whole data flow in one DFT without going back to the Control Flow.
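
For reference, this is a minimal sketch of the "SQL in C#" approach from the first bullet that I do not want to use; the connection string, table and column names are placeholders:

using System.Data.SqlClient;

static class SelectIntoTempTable
{
    public static void Run(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT * INTO ##tmpDataSourceAbc FROM xyz;", conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
            // The global temp table only survives as long as this connection
            // (or another session referencing it) stays open.
        }
    }
}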

Thus, the plan is:

  • A DataTable stores the data from the input rows with all its original columns.
  • In the PreExecute method, dynamically create columns in the DataTable based on the input columns coming in from the Data Source.
  • In the Input0_ProcessInputRow method, fill the DataTable with each input row.
  • In the PostExecute method, insert all rows from the DataTable into the temporary table.

And after this, which is not in the question anymore:

  • Add a column in that temporary table.
  • Change something in that temporary table.
  • Save the temporary table to a steady (permanent) table to persist the work.

This would all be one script component that adds a column to the data source.

But I struggle with mapping the Row object columns to the DataTable columns. Thus, the question:

Has anybody found out how the columns of the Row object can be mapped to a DataTable that I take to fill the temporary table? Something like:

DataTable dt = new DataTable();
DataRow dr = dt.NewRow();

...

foreach (var c in columns)
{
    dr[c] = c;
}
dt.Rows.Add(dr);

For this code I get: "Column 'myCol' does not belong to table ." Thus, c, which was a column name of the Row object, was not found in the DataRow dr of the DataTable dt.

The Row object column names do not, or not always, match the column names of the DataTable. How do I put all columns of the Row object into the DataTable for each row so that, in the end, I can copy the full data from the DataTable to the temporary table?

2 Answers


First, I commend you on opening the question, adding some detail, and attempting a thoughtful answer. Honestly, your answer has made me think a bit more, and it could work as long as public class ScriptMain is created once for the whole data set, which I believe is likely, but I haven't tried it.

However, there are a few design points you may want to consider in your final solution.

  • SSIS is designed to pass tables from source(s) through a series of transformation tasks to destination(s). So there are components to query a subquery (the original source) and to add columns, calculate, and transform, all without script for most use cases.
  • Performance. Using a CTE/subquery in the original source definition could be far more performant, as it reduces the number of rows and the amount of data that is actually pulled to the SSIS server into memory/disk to be processed in the rest of the steps.
  • Input0_ProcessInputRow(Input0Buffer Row) is designed to be a per-row transformation whose output is consumable by the next steps. To say this differently, it is designed to add new columns or transform the contents of a column directly in the original row, which then modifies the tabular dataset available to the next steps in the package.
  • The use of the DataTable and temp table can also lead to memory and performance issues with large data sets.

So, in deciding whether to use a per-row script component like this, I would ask:

  • Can it be done in a Derived Column transformation, or a Lookup, Grouping, Conditional Split, etc.?
  • Can it be done with a SQL formula in the original source query or another transformation?
  • Will C# allow me to do something that I cannot do in SQL?
  • Is there a better tool for this particular transformation?

I have used scripts, though it has definitely been a few years, as I have moved away from SSIS in favor of other tools given modern API and cloud architectures. However, using a script in SSIS was usually (not exclusively) for getting a source from somewhere that required scraping a website, interacting with Microsoft Exchange, or doing other things that really are less than ideal. For pure SQL ETL and archival processes, SSIS should be able to handle it within its native components.

  • It can be done in a Derived Column transformation or a Lookup, Grouping, Conditional Split, etc., but none of these are in SQL. I want to avoid rewriting a pile of SQL as "SSIS code". Since I do not have a large dataset, I do not need to worry about performance. I just want to avoid the work and also see an advantage in having everything readable in good old SQL. And yes, since the queries have CTEs and the like, they could all be put into one SQL statement. But I do not want that; I would like to have each layer of the nested query as one SSIS step. Commented Mar 30 at 15:11

After checking this for long hours, I found a way to do it. The main thing to understand is that the column names of the Row object are changed so that only letters and digits are kept; for example, my_column-1 b becomes mycolumn1b. If you map on these shortened column names instead, that is, if you shorten the original data source column names in the same way on the other side, it works.
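
A minimal sketch of the mapping idea; the sanitizing rule is only what I observed for my columns (underscores, spaces and hyphens are stripped), so take it as an assumption:

using System;

static class ColumnNameMap
{
    // "my_column-1 b" -> "mycolumn1b", matching Row.mycolumn1b / Row.mycolumn1b_IsNull
    public static string ToBufferName(string originalColumnName)
    {
        return originalColumnName.Replace("_", "").Replace(" ", "").Replace("-", "");
    }

    static void Main()
    {
        Console.WriteLine(ToBufferName("my_column-1 b")); // prints: mycolumn1b
    }
}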

Code

There is no need to hardcode the connection string, as the code uses the connection manager instead. But then you have to select it in the main menu of the script component, see How do I embed the Connection Manager inside the C# Code of the SSIS Script Component? - Stack Overflow.
For those who want to hardcode the connection string, it is still there, commented out, in the code.

#region About me
/* We use a DataTable to store the data from the input rows.
 * In the PreExecute method, we dynamically create columns in the DataTable based on the input columns.
 * In the Input0_ProcessInputRow method, we add each input row to the DataTable.
 * In the PostExecute method, we use SqlBulkCopy to efficiently insert all rows from the DataTable into the temporary table.
 * Once that is done, we can add columns with `select` queries from the temporary table in mere SQL.
 * The last step is not shown here; it could be either in the PostExecute or in another Script Component. 

(Mind: For it to be in another Script Component, the data would have to be saved in a steady table, or the data would already have to be changed before or while the output rows are looped over in the `Input0_ProcessInputRow(Input0Buffer Row)`, and I do not know how to get the data source input into a temp table without looping over the `Input0_ProcessInputRow(Input0Buffer Row)`.)

One main aim is reached with this code: running SQL on a new temp table.
*/
#endregion

#region Namespaces
using System;
using System.Data;
using System.Data.SqlClient;
using System.Reflection; // needed for PropertyInfo
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using System.Diagnostics;
#endregion

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private SqlConnection conn = null;
    //private static string connectionString => @"myconnectionstring";
    // Set in PreExecute: a field initializer cannot reference the instance member Connections.
    private string connectionString;
    private DataTable dt = new DataTable();
    //private SqlCommand command = null;

    public override void PreExecute()
    {
        base.PreExecute();
        // Take the connection string from the connection manager chosen in the script component.
        connectionString = this.Connections.Connection.ConnectionString;
        conn = new SqlConnection() { ConnectionString = connectionString };
        conn.Open();
        // create temp table, make sure it matches your input
        using (SqlCommand cmd = new SqlCommand(@"CREATE TABLE ##TempTable (
my_column1 varchar(50));", conn))
        {
            cmd.ExecuteNonQuery();
        }
        // Iterate over input columns to get original column names for the dt
        foreach (IDTSInputColumn100 inputColumn in this.ComponentMetaData.InputCollection[0].InputColumnCollection)
        {
            // Get the original column name
            string originalColumnName = inputColumn.Name;

            // Add column to DataTable with the original column name
            dt.Columns.Add(originalColumnName, typeof(object));
        }
    }
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Create DataRow to store values
        DataRow dr = dt.NewRow();

        // Iterate over input columns to get original and then modified column names for the chosen Row
        foreach (IDTSInputColumn100 inputColumn in this.ComponentMetaData.InputCollection[0].InputColumnCollection)
        {
            // Get the original column name
            string originalColumnName = inputColumn.Name;

            // Get modified column name for checking null values in Row object
            string modifiedColumnName = originalColumnName.Replace("_", "").Replace(" ", "").Replace("-", "");

            // Check whether the column value is null for this row
            // (the buffer exposes a companion <column>_IsNull property).
            PropertyInfo isNullProperty = typeof(Input0Buffer).GetProperty(modifiedColumnName + "_IsNull");

            if (isNullProperty != null && (bool)isNullProperty.GetValue(Row))
            {
                // Handle null value as needed
                // For example, you can set it to DBNull.Value
                dr[originalColumnName] = DBNull.Value;
                continue; // Skip to the next column
            }

            // Get the property with the matching (shortened) name from the Row object
            PropertyInfo property = typeof(Input0Buffer).GetProperty(modifiedColumnName);

            // Read the value of this column for the current row
            object value = property.GetValue(Row);

            // Add value to DataRow
            dr[originalColumnName] = value;
        }
        // Add populated DataRow to DataTable
        dt.Rows.Add(dr);
    }

    public override void PostExecute()
    {
        base.PostExecute();
        // here we would bulk copy data into the temp table
        // you may need to batch this operation or handle it differently based on your situation
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(conn))
        {
            // Now write the DataTable to the global temp table created in PreExecute
            bulkCopy.DestinationTableName = "##TempTable";
            bulkCopy.WriteToServer(dt);
        }
        conn.Close();
    }
}
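
And for completeness, a minimal sketch of the last step that is mentioned in the comment block but not shown above: a hypothetical helper that could be called from PostExecute after bulkCopy.WriteToServer(dt) and before conn.Close(). It assumes the same usings as the code above; dbo.MySteadyTable and the derived column are placeholders.

// Hypothetical helper for ScriptMain; call it from PostExecute() after the
// bulk copy and before conn.Close(). Table and column names are placeholders.
private void SaveTempTableToSteadyTable(SqlConnection conn)
{
    // Add a derived column and persist the result with plain SQL on the temp table.
    using (SqlCommand cmd = new SqlCommand(
        "SELECT t.*, t.my_column1 + '_suffix' AS my_column2 INTO dbo.MySteadyTable FROM ##TempTable t;",
        conn))
    {
        cmd.ExecuteNonQuery();
    }
}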

Output

The output, the global temporary table, is stored in the "tempdb" database of your server. It is not stored in the database that you work in.

See How to see temp table created by code in sql server.
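
If you want to check from C# whether the global temp table is (still) there, here is a minimal sketch; it only works while the creating connection, or another session referencing the table, is still open, and the connection string is a placeholder:

using System;
using System.Data.SqlClient;

class CheckTempTable
{
    static void Main()
    {
        // Placeholder connection string; use your own server and credentials.
        string connectionString = @"Data Source=.;Initial Catalog=tempdb;Integrated Security=SSPI;";
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT OBJECT_ID('tempdb..##TempTable')", conn))
        {
            conn.Open();
            object id = cmd.ExecuteScalar();
            Console.WriteLine(id == DBNull.Value ? "##TempTable not found" : "##TempTable exists in tempdb");
        }
    }
}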
