Tuesday, September 29, 2009

SQL Performance Test Harness

SQL performance test harness.

Occasionally, I like to be able to test concepts in SQL server and be able to tell if a particular method is faster or slower. For example, a question might be "are table variables or temp tables more efficient and under what circumstances?" What I have done is to create a small testing harness for the purposes of testing various concepts. In the example below, the test I wished to perform involved insert triggers vs default field values. I had run across a colleague’s code where he had created an insert/update trigger to set a last modified field value. The trigger looked something like this.



ALTER TRIGGER [dbo].[tr_UpdateDateTime]
ON [dbo].[tbl_DataTable]
FOR INSERT,UPDATE
AS
BEGIN
UPDATE dbo.tbl_DataTable SET LastUpdated = GETDATE()
FROM dbo.tbl_DataTable DT
INNER JOIN Inserted I ON DT.int_ID = I.int_ID
END


This particular table actually doesn't see much in the way of updates. Most of the activity on this table is through inserts. In my opinion, we could speed up the operations on this table by making the field in question have a default value of GetDate() and then making this trigger be an UPDATE trigger only. In order to backup my assertion, I had to have some concrete numbers. Here are the steps that I used.

1. Create a test table to mimic the production table.
2. Setup an identical trigger on the test table.
3. Run a test of X number of inserts and record how long it took.
4. Remove the INSERT portion of the trigger.
5. Add a default value of GetDate() to the table.
6. Run the test again and record the results.
7. Compare and see if I was right.

In order to gather accurate results, I was going to have to measure the time taken using the SQL environment itself. One other thing to consider is the SQL caching mechanism. In other words, the first run of a piece of code will take longer than subsequent runs. In order to account for this, I had to setup a mechanism to repeat the test over and over and then gather an average of the results. The SQL code below will allow for me to execute my test and gather the needed statistics. Here is the result.



Set Nocount On

-- Holds the time we started a given run
Declare @StartTime As DateTime
-- One of the variables in our test.
Declare @RecordCount as Integer
-- Holds the stats for each run.
Declare @TimingTable as Table
(Run int, -- Records which test run we are on.
Tme BigInt, -- Time for that run in MS
Iter Int, -- Iterations of the target statement.
TmePerCall Decimal(18,10)) -- Avg MS per Run

-- This is how many times we are going to
-- execute the test and collect stats
Declare @RunCount int
Set @RunCount = 1 -- Start at 1

While @RunCount < = 5 -- End at 5 tests.
Begin
Set @StartTime = GetDate()
Set @RecordCount = 1

-- For my test, I want to start with an
-- empty table on each iteration
Truncate Table PerfTest
While @RecordCount <= 10000
Begin

/********************************************/
/* This is the actual Test code that I want */
/* to execute multiple times */
/********************************************/
Insert Into PerfTest (F2,F3,F4,F5)
Values ('Some Text','Some Longer Text',
@RecordCount, NEWID())

-- Increment Counter
Set @RecordCount = @RecordCount + 1
End

-- Record stats for this run.
Insert Into @TimingTable (Run,Tme,Iter)
-- Difference to be recorded in Milliseconds.
Values (@RunCount, DateDiff(ms,@StartTime,getDate()), @RecordCount -1)

-- Increment Counter
Set @RunCount = @RunCount + 1
End

Update @TimingTable
Set TmePercall = convert(Decimal(18,10),Tme) /
convert(Decimal(18,10),Iter)
Select * From @TimingTable
Select COUNT(*) Runs, AVG(Tme) AverageTime,
AVG(TmePercall) AvgTimePerCall From @TimingTable


Let's break down the script. to begin with, we declare the @StartTime. This will be a time date variable that will record the start of any given test. The @RecordCount variable will the counter for the number of records we will insert on any given run. The heart of the statistics gathering is the @TimingTable. This table variable will hold the results of the runs so that we can query them later for information about our test. The outer loop based on @RunCount is the number of times we wish to execute our test. This can usually be a low number (under 10) and is there to counteract the effects of SQL caching. Within the RunCount loop, I truncate my table and then enter a second loop where I can perform inserts (which is what I am testing in this case). This loop may be much larger because I want to simulate bulk loads of the system. As I exit the @RecordCount loop, that is when I gather the statistics in the @TimingTable. Finally, I increment the run count and do it all again.

At the end of the test, I have now gathered information in the @TimingTable. Here is what each field represents.

Run = Which run we are dealing with. This maps back to the @RunCount variable.
Tme = The amount of time, in Milliseconds, the run took.
Iter = The number of iterations in this particular run. This maps back to the @RecordCount variable.
TmePerCall = This is manufactured at the end of the test to give an average of how long each call took in MS.

Once all the stats are gathered, I can take averages of all the fields and get the final results. Using this tool, I found out that the trigger method takes almost twice as long as the default field value. This corraborated my theory and gave me some hard numbers to go back to my co-worker and make the case for making the change. In the end, the final goal is to produce fast, efficient code that meets the specifications. It was a quick change that sped up the system and everyone was able to see a better way to do it.

You can take the above routine and modify it to fit your own needs. Hope this helps.

Tuesday, July 7, 2009

Counting records between records in SQL server 2005

I recently worked my through this problem and thought I would share the solution. Here is the problem as was presented.

I have a table that stores Calls data for each customer... and whether the Sales Calls were Successful:

Table1
ID_field TheDate Successful
100 07/03/09 N
100 07/01/09 N
100 06/26/09 Y
100 06/21/09 N
100 06/18/09 N
100 06/12/09 N
100 06/10/09 Y
100 06/06/09 N
100 06/02/09 N
100 05/31/09 Y
100 05/28/09 N

Question: For the each Success(Y), I need to know how many unsuccessful (N) calls there were before the Success occurred. So for ID 100, he had a Y on 6/26/09 ... and it took 3 N's (6/21, 6,18, and 6/12) before it happened. The result would look like this:

ID_Field Success Attempts
100 06/26/09 3
100 06/10/09 2
100 05/31/09 1

So What we have to do is to build a list of the successes, and then find out how many unsuccessful records we had between this success and the last one. Here is the code to do just that.


Select ID_Field, TheDate,
(
Select Count(*) From Table1 T3
Where TheDate <>
(
Select Coalesce(Max(TheDate),

(Select DateAdd(d,-1,Min(TheDate)) From Table1))
From Table1 T1
Where (T1.TheDate < id="SPELLING_ERROR_9" class="blsp-spelling-error">TheDate
)
And T1.Successful = 'Y'
)
) Attempts
From Table1 T2
Where Successful = 'Y'
Order by TheDate

The use of sub queries is very handy here to be able to use the data from any given record to look up data on other records. The important thing to remember here is that if you are using a sub query on the same table as the main query, you will need to Alias the table so that the query processor does not get confused as to which Table1 you are referring to at any given point.

Tuesday, June 23, 2009

Generate sequential numbers within a grouping in SQL Server

This article will discuss how to generate sequential numbers within a grouping in a SQL server table. I will say right up front, I know that set based operations are best and that any kind of cursor or looping should be avoided if possible. I have not been able to come up with a solution based purely on sets. This solution does not involve a cursor but it does involve a loop. Also, the data is being stored in a table variable so that limits what you can do. Having said all that, here is what I set out to accomplish. Suppose you have the following table structure.


Declare @Tab Table
(
ID UniqueIdentifier not null default NewID(),
Parent int not null,
Description char(20),
GrpRowCounter int
)


The ID uniquely identifies eac record. The Parent field is the field we are interested in grouping. The Description is there for flavor and the GrpRowCounter should be the value 1,2,3, etc... within each Parent grouping. Now that we have the table, let's load it up with sample data.


Insert Into @Tab (Parent, Description) Values (10,'ABC')
Insert Into @Tab (Parent, Description) Values (10,'DEF')
Insert Into @Tab (Parent, Description) Values (30,'GHI')
Insert Into @Tab (Parent, Description) Values (30,'JKL')
Insert Into @Tab (Parent, Description) Values (22,'MNO')
Insert Into @Tab (Parent, Description) Values (30,'PQR')
Insert Into @Tab (Parent, Description) Values (30,'STU')
Insert Into @Tab (Parent, Description) Values (10,'VWX')
Insert Into @Tab (Parent, Description) Values (22,'YZ1')
Insert Into @Tab (Parent, Description) Values (30,'234')


Notice that we did not populate the GrpRowCounter field just yet. The next step is to populate the table with that information. Since we are using a table variable, I selected to go with a loop structure. This loop structure will visit each of the rows in order and add the GrpRowCounter. For the code, we are going to follow this basic flow.

1. Visit each of the rows in order.
2. If the Parent code changes, reset our GrpRowCounter variable.
3. If the Parent code does not change, increment our GrpRowCounter.
4. Update the table.

Here is the code to do it.


Declare @ID uniqueidentifier -- ID we are currently working with.
Declare @Parent int -- parent we are currently working with.

Declare @TotalRows int -- This holds the total number of rows in the table.
Declare @RowCounter int -- This is my counter variable.
Declare @LastParent int -- This holds the last looked up parent. When a parent
-- changes, the GroupRowCounter Resets
Declare @GrpRowCounter int -- this holds the 1,2,3 per group.

-- this gets our maximum number of records so we know when to stop.
Select @TotalRows = COUNT(*) From @Tab
-- Counter is initialized to 1.
Set @RowCounter = 1

While @RowCounter <= @TotalRows
Begin
-- Get the current record. We are using the Row_Number function to match the corect
-- row back to the appropriate ID. Also, notice we order on Parent first and then
-- on ID.
Select @ID = T.ID, @Parent = T.Parent From @Tab T
Inner Join (Select ID, ROW_NUMBER() Over (Order by Parent, ID) RowNum From @Tab) SubT
On T.ID = SubT.ID
Where SubT.RowNum = @RowCounter
Order by Parent, T.ID

-- Check to see if the Parent has changed. On the first record, @LastParent
-- is null so it should fail and set the grp counter to 1
IF @Parent = @LastParent
Set @GrpRowCounter = @GrpRowCounter + 1
Else
Set @GrpRowCounter = 1

-- Now update the Record
Update @Tab Set GrpRowCounter = @GrpRowCounter
Where ID = @ID

-- Now update the last parent
Set @LastParent = @Parent

-- Finally update our overall counter.
Set @RowCounter = @RowCounter + 1
End

-- See the results
Select * From @Tab Order by Parent, ID

Wednesday, April 22, 2009

Extend SqlMembershipProvider and access hidden information.

This article describes how to override some of the basic functionality in the SqlMembershipProvider class. Of course every article like this started out as a problem. This article is a discussion of the problem and what I did to solve it.

The Problem:
I inherited a website which used the MembershipProvider functionality of the Microsoft Enterprise Library. One of the pages was a "Forgot Password" screen. Now this particular site used a hashed password format so retrieval of the newly generated password was not possible. This was desired. However, the client did want to store the newly generated passwords as a means of telling whether or not a user had logged in and changed the password. Unfortunately, the site had implemented this functionality in the form of a PasswordRecovery object. All that functionality, including mailing the user their password, was behind the scenes. So the question is, how do I get that generated password on the server side.

Solution:
The solution I came up with was actually quite easier than I thought it would be. The short answer is to create my own membership provider class and override the GeneratePassword functionality. Now I didn't want to change that behavior, only capture the newly generated password for that user. This password is to eventually be stored in the comment field of the aspnet_Membership table. Now, on to code.

First thing I did was to create my own provider class by inheriting from the SqlMembershipProvider class. The code looks something like this.
public class MyMembershipProvider :

System.Web.Security.SqlMembershipProvider

{

public string GeneratedPassword;

public MyMembershipProvider()

: base()

{

}

public override string GeneratePassword()

{

GeneratedPassword = base.GeneratePassword();

return GeneratedPassword;

}

}
Note that I am inheriting from the SqlMembershipProvider class and overriding the GeneratePassword function. Within that function, I still call the base.GeneratePassword to do the actual generation but I save it off in an attribute. We will come back to this later. The next step is how do I use my new class?

In the markup of the page, I located the PasswordRecovery control and the MembershipProvider attribute. That control looks something like this:
<asp:PasswordRecovery ID="PasswordRecovery1" runat="server" MembershipProvider="AspNetSqlMembershipProvider" OnSendingMail="PasswordRecovery1_SendingMail"

OnVerifyingUser="PasswordRecovery1_VerifyingUser">
This value is then located in the Web.Config class:
        <membership>

<providers>

<remove name="AspNetSqlMembershipProvider" />

<add name="AspNetSqlMembershipProvider" type="MyMembershipProvider" enablePasswordRetrieval="false" enablePasswordReset="true" requiresQuestionAndAnswer="false" passwordFormat="Hashed" applicationName="/" maxInvalidPasswordAttempts="10000" />

</providers>

</membership>

Note the Type attribute is the name of my class. At this point, your class is fully hooked in and is being used transparently by the password recovery control. Within the page code, I would like to now access the generated password before it is emailed. Withing the On_SendingEmail function, I have the following code.

protected void PasswordRecovery1_SendingMail(object sender, MailMessageEventArgs e)

{

PasswordRecovery passwordRecovery = (PasswordRecovery)sender;

string user = passwordRecovery.UserName;



_loginUser = Membership.Provider.GetUser(user, false);

_loginUser.Comment = ((MyMembershipProvider)Membership.Provider).GeneratedPassword;

Membership.UpdateUser(_loginUser);

}

Within this code I use the password recovery object to get the user name which is then used to lookup the user. In order to access the class attribute GeneratedPassword, I have to cast the Membership.Provider to ((MyMembershipProvider)Membership.Provider). Now, I can get the generated password, put it in the comment field, and update the user record.

Friday, January 30, 2009

More Fun with Recursion: The Triangle Game

Quite often, I like to dig a little deeper into some problems. That is why I love what I do for a living, write computer software. In this profession, you can simulate just about anything. The cool part is, change your parameters, re-run the process and get new answers from a different perspective. Recently, I was with my family at Cracker Barrel and my son was playing with their famous triangle game. Basically, the way this game works is you start with a triangle board with pegs that looks something like this.

    O
   O O
  O O O
 O O O O
O O O O O


Now there are different ways to play the game but one of the most common ways is to start with the top most peg empty and rest of the pegs with a golf tee in them. Jumps are made in sequence, removing the peg that was just jumped over. the object, remove all but one peg. For example, suppose the board looks like this where O represents an empty spot.

  +-O
  |B X
  A X X
 X X X X
X X X X X


You can make a jump over on the left side which would result in a board that looks like this.

    A
   o X
  o X X
 X X X X
X X X X X


That would be a valid jump. Now that the basic rules have been explained, what does this have to do with computer programming. I started asking questions like
"How many ways are there to win this game?"
"How many ways ar there to lose it?"
"Where are all the places you can leave just one peg?"
"What are the most pegs you can leave on the board without having any jumps left?"
"Is there a way to do just 3 jumps and leave pegs all around the outside with the middle empty, thus leaving you with no jumps?"

With questions like these, I sat down to write the triangle game simulation. When doing something like this, I have found that you need about 3 things.

1. The basic set of rules on how to play and win the game.
2. A model to play the game on.
3. A good approach to finding all the possible combinations.

The basic rules are explained above. Now onto a model. I decided to model the game by representing the board as an array of boolean values. The top most position would be position 0 then incrementing by 1 with the next row, left to right and the next row until all the spaces were represented, 15 in all. The next part of the model was to determine what were all the potential valid moves (pegs not withstanding). For example if the positions looked like this.

    0
   1 2
  3 4 5
 6 7 8 9
A B C D E

an example of some potential moves would be A-6-3 or 5-8-C or 1-4-8. By examining all of the possible moves on a given board, we can determine which ones would work and which ones wouldn't. If we "perform" the move, that would put the board into a different configuration. On each new configuration, the board analyzed. There are two possible outcomes of this analysis.

1) More moves are possible.
2) The game is over as no moves are available.

One each evaluation, once the game is over, that configuration is recorded. Also, a history of the moves that led to that configuration can also be recorded so that you can have a sequence of moves leading up to that win. This can be used to learn to solve the puzzle.

The final piece of the solution was how to implement this. This is where the concept of recursion shines. You can develop the routines to analyze a board and apply the moves once. On each new iteration of a board, you can take that new configuration and recursively call the analysis function. As each new configuration is generated, it is evaluated and recorded. Then the new configuration is run through the same function and the process starts all over for the new configuration. What the recursion does is to provide us with an automatic mechanism to save the state so that when we exit the evaluation process, we are taken back to where that eval was started so that the code can pick back up where it left off and continue. To demonstrate this, let's create a very small sample. Let's say our board looks like this.

  0
 1 2
3 4 5

Pegs are in positions 1-5. Potential jumps are (1) 3-1-0, (2) 0-1-3, (3) 5-2-0, (4) 0-2-5, (5) 3-4-5, (6) 5-4-3. If our board is represented as an array of bools, then the starting configuration would look like this.

0 1 2 3 4 5
F T T T T T



The evaluation would look something like this.

Start - Recursion level 0
0 1 2 3 4 5 - Apply Rule 1 - 0 1 2 3 4 5 
F T T T T T                  T F T F T T 
        Recursion Level 1
        0 1 2 3 4 5 Apply Rule 1 - Invalid
        T F T F T T Apply Rule 2 - Invalid
                    Apply Rule 3 - Invalid
                    Apply Rule 4 - Invalid
                    Apply Rule 5 - Invalid
                    Apply Rule 6 - 0 1 2 3 4 5 
                                   T F T T F F
                Recursion Level 2
                0 1 2 3 4 5 Apply Rule 1 - Invalid 
                T F T T F F Apply Rule 2
                            Apply Rule 3
                            Apply Rule 4 - 0 1 2 3 4 5 
                                           T F T T F F
                        Recursion Level 3
                        0 1 2 3 4 5 Apply Rule 1 - Invalid
                        T F T T F F Apply Rule 2 - Invalid
                                    Apply Rule 3 - Invalid
                                    Apply Rule 4 - 0 1 2 3 4 5
                                                   F F F T F T
                                Recursion Level 4
                                0 1 2 3 4 5 Apply rule 1 - Invalid
                                F F F T F T Apply rule 2 - Invalid
                                            Apply rule 3 - Invalid
                                            Apply rule 4 - Invalid
                                            Apply rule 5 - Invalid
                                            Apply rule 6 - Invalid
                                            All rules invalid. Losing combination.
                        Recursion Level 3
                        0 1 2 3 4 5 Apply Rule 5 - Invalid
                        T F T T F F Apply Rule 6 - Invalid

                Recursion Level 2
                0 1 2 3 4 5 Apply Rule 5 - Invalid 
                T F T T F F Apply Rule 6 - Invalid

        Recursion Level 1
        0 1 2 3 4 5 All Rules applied already
        T F T F T T 
0 1 2 3 4 5 - Apply Rule 2 - Invalid
F T T T T T - Apply Rule 3 - 0 1 2 3 4 5 
                             T T F T T F
. . . . . . . .


As you can see from the sample run, each time a move is valid, that move is applied and a new level of recursion is entered into. When all moves for that configuration have been tried, the previous level of recursion is returned to and the rules pick up where they left off. Of course with a bigger playing board, the iterations will be more and deeper. Once this system is in place, you could create different board configurations and rule congfigurations and play a wide variety of virtual boards.

Now that we have the concept out of the way, let's see how to implement it in code.
First we start with a Triangle object. This represents the playing board.

class Triangle
{
bool[] Board = { false, false, false, false, false,
false, false, false, false, false,
false, false, false, false, false };
string status;
int MoveApplied = -1;

public Triangle NextTriangle;
public Triangle PrevTriangle;

public Triangle(Triangle starting)
{
// Using the last triangle to initialize this triangle.
for (int x = 0; x < starting.Board.Length; x++)
Board[x] = starting.Board[x];
// Setting the pointer to the last triangle.
PrevTriangle = starting;
}
public Triangle(bool[] values)
{
// Init through an existing array.
for (int x = 0; x < values.Length; x++)
Board[x] = values[x];
}


We have the ability to create the board with either an existing triangle or if we are creating for the first time, an array of bools that indicate what the starting positions are. Now, I will switch over to the class that defines the legal moves. This is implemented as a static class.

static class LegalMoves
{
static public frmMain TheMainForm;
static public int SolutionsFound = 0;

static public string GetString(int ndx)
{
if (ndx < 0)
return "";
return "[" + moves[ndx, 0] + "," + moves[ndx, 1] + "," + moves[ndx, 2] + "]";
}
static public int[] Solutions = new int[] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
static public int[,] moves = new int[,]
{
{0,1,9},{0,2,5},{1,3,6},{1,4,8},{2,4,7},{2,5,9},
{3,1,0},{3,4,5},{3,7,12},{3,6,10},{4,7,11},
{4,8,13},{5,2,0},{5,4,3},{5,8,12},{5,9,14},
{6,3,1},{6,7,8},{7,4,2},{7,8,9},{8,4,1},{8,7,6},
{9,5,2},{9,8,7},{10,6,3},{10,11,12},{11,7,4},
{11,12,13},{12,7,3},{12,8,5},{12,11,10},{12,13,14},
{13,12,11},{13,8,4},{14,13,12},{14,9,5}
};

}


This defines a class that holds all possible legal moves that can be applied to the board. Of course the move is valid if the pegs are in the correct positions. Now back to the triangle class. The recursive procedure is called EvalLegalMoves. This will process any given triangle for legal moves. Once a legal move is found, the move is applied and the EvalLegalMoves procedure is called again. Look below..


public void EvalLegalMoves()
{
int lmi; // Legal Move Index
bool LegalMovesFound = false;
for (lmi = 0; lmi < 36; lmi++)
if (Board[LegalMoves.moves[lmi, 0]] == true &&
Board[LegalMoves.moves[lmi, 1]] == true &&
Board[LegalMoves.moves[lmi, 2]] == false)
{
LegalMovesFound = true;
// Create the next triangle.
NextTriangle = new Triangle(this);
NextTriangle.ApplyMove(lmi);
NextTriangle.EvalLegalMoves();
NextTriangle = null;
}

// Now that we have done evaluating legal moves, let's see if this
// one is one of our end cases.
if (LegalMovesFound == false)
{
int PegsLeft=0;
for (int x = 0; x < Board.Length; x++)
if (Board[x])
PegsLeft++;
// One thing we are looking for is how many solutions leave
// one peg.
if (PegsLeft == 1)
DocumentWinner();
}
}


You can define DocumentWinner any way you wish. You can output the results or whatever you want to do. Play with the code from here. Hope you enjoyed.

Scott.

Wednesday, January 21, 2009

Generating String Permutations Recursively in C#

This article will demonstrate a simple routine that uses recursion in C# to generate
a list of string permutations. I originally wrote this routine because I was
interested in a little bit larger project in which I thought a routine like this
would be useful. I had become interested in an online game where, given a
string of X number of characters, you see how many way you can combine the letters
in that string to make words. The program would go something like this.

  1. Generate the permutations.
  2. Break each permutation down into 3, 4, 5 ... X letter strings.
  3. Check each break down against a dictionary, if in the dictionary, keep the word

So the first order of business was to generate every possible permutation of an
X character string. After working with a recursive routine, I managed to get
it down to a very simple few lines of code.

First, before going over the code, let's understand the concept. Let's
look at the permutations for a 2 character string. Let's say our string
is 'AB'. The possible permutations are

  1. 'AB'
  2. 'BA'

Very easy, we just switch the last two characters. Now if we want to get the
permutations for a 3 character string, Say 'ABC', that process is very simple
too.

  1. Start with 'ABC'.
  2. Swap the last two characters to get ACB
  3. Now take the original string and move the A into the second to last position giving
    you BAC
  4. Again Swap the last two characters giving you BCA
  5. Now come back out and swap the first two characters giving you CBA
  6. And finally swap the last two characters giving you CAB

If you want to do four characters, the process is the same. Permutate the
right most 2 characters, then swap in character 3 and do the permutations, then
swap in character 4 and do the permutations again. The beauty of the recursion
is that upon leaving one swapping operation, you can get back to an original string
(stored on the stack) and keep the permutations going. Since the calls use
the string lenght to determine how to proceed, you should be able to pass a string
of any length in and get the permutations. The only caveat is that the number
of permutations generated for any given string is exponential as the string get's
larger. The number of permutation is X ! given that X is the number of characters
in the original string. As you can see, it can get quite large pretty quick.

String SizeNumber of PermutationsTime to Process
11 (1!)Negligable
2
2 (2!)Negligable
36 (3!)Negligable
424 (4!)Negligable
..........
103,628,800 (10!)7.27 Seconds
12479,001,600 (12!)Out of Memory error thrown.


The time to process above was done on a dual processor test machine running at 3
GHz with 2 Gig of internal memory. As you can see, with a short string, you
can generate many permutations. Get much more over 10 and you will start to
see some serious performance degradation. If you really need to generate permutations
on strings that large, consider disk storage.

Now that we have discussed the concept, let's look at the code. I have
encapsulated all of the code into one class. Here is the listing for that
class.

using System;
using System.Collections;

namespace Permutations
{
// To save space, I have not used proper techniques of
// hiding attributes behind properties. There are
// probably other things that haven't been done as well
// but I am just trying to show the concept.
class PermutateString
{
// Holds the final array of permutations
public ArrayList AL = new ArrayList();

public PermutateString(string Source)
{
Perm("", Source);
}

private void Perm(string Beginning, string Shuffle)
{
// By the time the Shuffle length get's down to 1,
// we have another permutation to store.
if (Shuffle.Length <= 1)
AL.Add(Beginning + Shuffle);
else
{
for (int Ndx = 0; Ndx < Shuffle.Length; Ndx++)
{
// here, we move the character at the ndx position
// to the front of the line, then copy in the first
// ndx characters of the string and then the characters
// after that. so if NDX is at 3 in the following string,

// then the string ABCDEF would become DABCEF. This loop
// pulls each postion out to the front of the line and then
// calls the Perm operation again.
Shuffle = Shuffle.Substring(Ndx, 1) +
Shuffle.Substring(0, Ndx) +
Shuffle.Substring(Ndx + 1);

// Now that we have moved the shuffled characters around a bit,
// we call the perm function again moving one of the characters back

// to the beginning string.
Perm(Beginning + Shuffle.Substring(0, 1),
Shuffle.Substring(1));
}
}
}
}
}

It may be a little difficult to follow so the best thing to do is to copy the
code into your project and run it through the debugger. As you trace the
values of Beginning and Shuffle, you should see the logic unfold. Hope you
enjoyed this.

Friday, November 21, 2008

Brute Force Data Find Method for SQL Server

On several occasions, I have been tasked with finding some miscellaneous piece of information in a database. Usually, it is some Third party database that I am not that familiar with. For example, someone says, “I need to know where all the occurrences of a particular person’s email appears in this database.” My first approach was to try and find all the fields that might have EMAIL or MAIL or something like that in the title. I quickly found out that not all database designers are created equal. Standards are lacking and consistency is foreign word. To that end, I decided to create a brute force method for finding data. I may not be the quickest but it let’s someone else do the work; namely the database engine.

I am going to be using the Information_Schema views for this exercise. Since they are ANSI standard views, they should be compatible with future versions of SQL server as well. I try to stay away from system tables as there is no guarantee that they will remain consistent between versions. The way I decided to approach this was to query the Information_Schema.Columns view in order to retrieve a complete listing of tables and columns in the database. I use the results of this view to build another script which will actually do the work. This script I will call the “build” script. We will use the build script to build the “query” script.

The build script will consist of 3 parts; 1) create the table variable to hold the results, 2) populate that table with the results and 3) retrieve the values from the variable. Part 1 looks like this.

Build Script Part One
Set NoCount ON
Print 'Set NoCount On'
Print ''
Print 'Declare @Results Table '
Print '('
Print ' TableName varchar(50),'
Print ' FieldName varChar(50),'
Print ' FieldLength int,'
Print ' DataValue varChar(2000)'
Print ')'
Print ''

If we set the results screen in query analyzer to TEXT, then when we run this, it will generate the declaration much like you see it here. This will build the first part of the query script to declare the table value we will then populate. Part 2 will generate the statements to populate this. It looks like this.

Build Script Part Two
Select 'Insert Into @Results Select ''' + Table_Name + ''' as TableName, ''' +
Column_Name + ''' AS FieldName, ' +
Convert(varchar,Character_maximum_length) + ' as FieldLength, [' +
Column_Name + '] as DataValue ' +
'From [' + Table_Schema + '].[' + Table_Name + '] ' +
'Where Upper([' + Column_Name + ']) Like Upper(''%Prestwick%'')'
From Information_Schema.Columns

Where Data_Type in ('char', 'varchar','nvarchar','nchar') And
Character_maximum_length >= 9
and Table_Schema = 'dbo'

Let’s examine this code bit by bit. We are going to generate Insert statements. According to our @Results table from above, we are going to need 4 pieces of information. We will need the Table Name, the Field Name, the Field Length, and the actual data value where we found what we are looking for. As you can see from the code above, that is exactly what we are extracting from the Information_Schema.Columns view. We are using the meta data in the database to build insert statements that will return to query the data. On line 6, we specify the selection criteria for the “query” script. To eliminate and CASE issues, we will convert the value of the column as well as the value of the TextToFind to Upper case. You can massage this part to better fit your needs. If you know the value is exact, you can change the ‘LIKE’ to an ‘=’. In this case, I am looking for an address that contains the word Prestwick. I need to know the table and column. In the last part of the build query , I use the meta data in the columns view to eliminate some of the field. If I know that I am looking for text, I can immediately eliminate all the numeric, bit and date fields. If I know the text is at least 12 characters, there is no need to query any Char(6) fields. When this portion is run agains AdventureWorksDW database, here are some of the results.

Insert Into @Results Select 'DimCustomer' as TableName, 'CustomerAlternateKey' AS FieldName, 15 as FieldLength, [CustomerAlternateKey] as DataValue From [dbo].[DimCustomer] Where Upper([CustomerAlternateKey]) Like Upper('%Prestwick%')

Insert Into @Results Select 'DimCustomer' as TableName, 'FirstName' AS FieldName, 50 as FieldLength, [FirstName] as DataValue From [dbo].[DimCustomer] Where Upper([FirstName]) Like Upper('%Prestwick%')

Insert Into @Results Select 'DimCustomer' as TableName, 'MiddleName' AS FieldName, 50 as FieldLength, [MiddleName] as DataValue From [dbo].[DimCustomer] Where Upper([MiddleName]) Like Upper('%Prestwick%')

Insert Into @Results Select 'DimCustomer' as TableName, 'LastName' AS FieldName, 50 as FieldLength, [LastName] as DataValue From [dbo].[DimCustomer] Where Upper([LastName]) Like Upper('%Prestwick%')

Insert Into @Results Select 'DimCustomer' as TableName, 'Suffix' AS FieldName, 10 as FieldLength, [Suffix] as DataValue From [dbo].[DimCustomer] Where Upper([Suffix]) Like Upper('%Prestwick%')

Of course there are more records that fit this criteria but for now, I am only showing a few for example. Now that we have the declaration and the insert statements generated, the last part of the script is the retrieval of those records. That looks like this.

Build Script Part Three
Print ''
Print 'Select Distinct * From @Results'

If we take all three sections of the Build script and combine them together and run it, we can generate a script that will query each potential field in the database for the value you are looking for. Remember, to generate the “query” script, you will need to set the query analyzer to return text results, not grid results. The next step is to copy the contents of the results of the Build script and paste them into a new query window. If I run the entire script on the AdventureWorksDW database, I can now find out which tables and fields contain this value. The actual Query script that this generates will look at over 140 fields and let you know what it finds. The results of my script are as follows.


TableName FieldName FieldLength DataValue
----------------- ------------- ----------- ---------------------
DimCustomer AddressLine1 120 1588 Prestwick Drive
DimCustomer AddressLine1 120 1841 Prestwick Drive
DimCustomer AddressLine1 120 4152 Prestwick Drive
DimCustomer AddressLine1 120 4998 Prestwick Ave.
DimCustomer AddressLine1 120 5259 Prestwick Dr.
DimCustomer AddressLine1 120 6208 Prestwick Dr.
DimCustomer AddressLine1 120 6709 Prestwick Ave
DimCustomer AddressLine1 120 7568 Prestwick Court
DimCustomer AddressLine1 120 8079 Prestwick Drive
DimCustomer AddressLine1 120 9271 Prestwick Ave.
DimCustomer AddressLine1 120 9875 Prestwick Court
ProspectiveBuyer AddressLine1 120 1841 Prestwick Drive
ProspectiveBuyer AddressLine1 120 8079 Prestwick Drive
ProspectiveBuyer AddressLine1 120 9875 Prestwick Court
vTargetMail AddressLine1 120 1588 Prestwick Drive
vTargetMail AddressLine1 120 1841 Prestwick Drive
vTargetMail AddressLine1 120 4152 Prestwick Drive
vTargetMail AddressLine1 120 4998 Prestwick Ave.
vTargetMail AddressLine1 120 5259 Prestwick Dr.
vTargetMail AddressLine1 120 6208 Prestwick Dr.
vTargetMail AddressLine1 120 6709 Prestwick Ave
vTargetMail AddressLine1 120 7568 Prestwick Court
vTargetMail AddressLine1 120 8079 Prestwick Drive
vTargetMail AddressLine1 120 9271 Prestwick Ave.
vTargetMail AddressLine1 120 9875 Prestwick Court


From here I can see that the value ‘Prestwick’ appears in 3 different tables. This gives me a headstart on doing the rest of my research.