Monday, December 15, 2008

Determining Delimiters & Qualifiers at Runtime with SSIS 2008

I was recently tasked with the objective of creating an SSIS package to perform some standard ETL types of functions. Big deal, right? What made this particular go around unique for me, was that the input file used as a data source can have different column/row delimiters and text qualifiers for each file and we do not know what the formatting is going to look like until we try to load the file. The number of columns stays the same, as do the data types of each column. Here is a simplified example:

File 1:
Rick Jones\t 123 Main Street\t Saint Paul\t MN\t 55144\r\n
Bill Smith\t 789 Blueberry Court\t Wayzata\t MN\t 55391\r\n

File 2:
"Rick Jones", "123 Main Street", "Saint Paul", "MN", 55144\r\n
"Bill Smith", "789 Blueberry Court", "Wayzata", "MN", 55391\r\n

As you can see in File 1 above, the columns are tab delimited with a carriage return and line feed at the end of the row. In File 2, the columns are comma delimited. Also, text is qualified with double quotes.

From an ETL perspective, I had two choices: I could write an SSIS package for each permutation and create a maintenance and support nightmare for myself, or find a way to dynamically set the Flat File Connection properties at runtime. I decided to opt for the latter. In earlier versions of DTS and SSIS, this was not as easy as it sounds. In SSIS 2008, the task became relatively simple.

The first thing I did was to create a Flat File Connection and point it to a sample file that represented an instance of the input format that I would need to load data from. This creates metadata behind the scenes in SSIS to represent the Flat File Connection. The trick is to modify the metadata at the runtime of the SSIS package and replace the delimiters and qualifiers with new values. But how do you get the new values? I am glad you asked.

I created a script task and added references to the script project:
Microsoft.SqlServer.DTS.RuntimeWrap
Microsoft.SqlServer.DTS.ManagedDTS

Once I could reference the SSIS object models I needed, the next thing I did was to create some variables to hold the values of the delimiters and qualifiers, once I knew what they were. The variables are set by actually opening the file and reading the first row (I know, I know. There are A LOT of assumptions in doing it this way) . I called two different methods, that loop through arrays of values, until I find what I am looking for. Once I have a good idea of the input file structure, I modify the Flat File Connection with the new delimiters and qualifiers. Here is an example of the code I used in the Script Task:

using Microsoft.SqlServer.Dts.Runtime;
using RuntimeWrapper = Microsoft.SqlServer.Dts.Runtime.Wrapper;

private string _RowDelimiter = "\r\n";
private string _ColumnDelimiter = "\t";
private string _TextQualifier = string.Empty;
private bool _IsTextQualified = false;
private int _ColumnCount = 0;

public void Main()
{

string delimiter = string.Empty;

//create the connection to the flatfile connection to dynamically set the properties
RuntimeWrapper.IDTSConnectionManagerFlatFile100 flatFileConnection
= Dts.Connections["InputFileFlatFileConnection"].InnerObject as RuntimeWrapper.IDTSConnectionManagerFlatFile100;
RuntimeWrapper.IDTSConnectionManagerFlatFileColumns100 columns = flatFileConnection.Columns;

//establish the count of columns based on the flatfile connection
_ColumnCount = columns.Count;

//determine the column delimiters, and text qualifiers
interrogateFileContents("C:\InputFile.txt");

//iterate throught he columns and set the delimiters based on the file interrogation
foreach (RuntimeWrapper.IDTSConnectionManagerFlatFileColumn100 column in flatFileConnection.Columns)
{
//if we are at the last column, set the row delimiter
if (column == flatFileConnection.Columns[flatFileConnection.Columns.Count - 1])
{
delimiter = _RowDelimiter;
}
//we are not at the last column, set the column delimiter
else
{
delimiter = _ColumnDelimiter;
}

//set the current column's delimiter
flatFileConnection.Columns[column].ColumnDelimiter = delimiter;
//set the current column's text qualified property
flatFileConnection.Columns[column].TextQualified = _IsTextQualified;
}

if (_IsTextQualified)
{
flatFileConnection.TextQualifier = _TextQualifier;
}

Dts.TaskResult = (int)ScriptResults.Success;
}


///
/// Reads the first line of the file to determine the number of columns, column delimiters, and text qualifier
///

/// file to interrogate
/// bool indicating whether or not the interrogation was successful
private void interrogateFileContents(string filePath)
{
string contents = string.Empty;
System.IO.TextReader objReader = new System.IO.StreamReader(filePath);

//read the first line of the input file (assuming the first line is data)
contents = objReader.ReadLine();
objReader.Close();

//establish the column delimiter
setColumnDelimiter(contents);
//establish the text qualifiers
setTextQualifier(contents);
}

///
/// Sets the private modular level variables that controls the columns delimiters
///

/// string that represents the first row of data in the file
private void setColumnDelimiter(string row)
{
char[] delimiterChars = { '\t', ',', ''};
int columnCount = 0;

//iterate through the collection of delimiters until we find the one that is used in the file
for (int i = 0; i <= delimiterChars.GetLength(0) - 1; i++) { columnCount = row.Split(Convert.ToChar(delimiterChars[i])).GetLength(0); if (columnCount == _ColumnCount) { _ColumnDelimiter = delimiterChars[i].ToString(); break; } } } ///
/// Sets the private modular level variables that controls the text qualifier
///

/// string that represents the first row of data in the file
private void setTextQualifier(string row)
{
char[] qualifierChars = { '\"', '\'' };
string column = row.Split(Convert.ToChar(_ColumnDelimiter)).GetValue(0).ToString();


//iterate through the collection of qualifiers to see if the text is qualified or not
for (int i = 0; i <= qualifierChars.GetLength(0) - 1; i++) { if (column.StartsWith(qualifierChars[i].ToString()) && column.EndsWith(qualifierChars[i].ToString())) { _TextQualifier = qualifierChars[i].ToString(); _IsTextQualified = true; break; } } } }

}

So, as you can see from the code above, I am wading through the data with character arrays and setting the Flat File Connection properties when I find a match. This code can be extended to do a lot more, but I will leave it in someone else's capable hands to implement factory patterns and the such.The point of this code is this: you do not have to create separate packages for each combo of delimiters and qualifiers.

No comments:

Post a Comment