So I'm working with DTS at the moment. Pulling data from files and Access databases to staging databases to warehouse databases to mart databases to cube databases.
I need to do lots of cleaning and validation along the way.
You might cry if I filled you in on some of the details, so I'll spare you from them.
Suffice to say, nothing is ever simple.
Anyway, in order to validate my data load scripts, I spent some time a while back picking a representative sample.
My sample ended up needing to be relatively large, so as to contain all the permutations and combinations of bad data that are likely to appear in the full set. That said, the test data set is still well under one percent of the full initial set.
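The sampling itself can be sketched in T-SQL along these lines (table and column names here are hypothetical, purely for illustration -- the idea is to keep every row already known to be problematic, plus a random slice of everything else):

```sql
-- Hypothetical sketch: take all rows flagged with known issues, plus
-- roughly 1 in 100 random rows, into a sample table for test runs.
SELECT src.*
INTO SampleCustomers
FROM RawCustomers src
WHERE src.HasKnownIssue = 1
   OR ABS(CHECKSUM(NEWID())) % 100 = 0  -- ~1% random selection per row
```

NEWID() per row gives a different pseudo-random value each time, which is about the simplest way to get a repeat-free random sample on SQL Server 2000.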
In order to clean and validate my data, which is a relatively complicated task, I've created a host of views, temp tables, UDFs, sprocs and DTS tasks.
The problem with views calling UDFs to cleanse and return data is that performance starts to suck. Especially when you have views aggregating, unioning, and joining other views. So I try to move data from a view to a temp table, and sort of edge my way towards clean data in the staging database, so that I can export it to the warehouse.
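The view-to-temp-table move is basically this (names are made up for the sketch; the payoff is that each expensive UDF runs exactly once per row, and everything downstream joins against real indexed data instead of a stack of views):

```sql
-- Hypothetical sketch: materialise the expensive view once into a temp
-- table, then index it so later joins and aggregates don't re-evaluate
-- the UDFs buried in the view stack.
SELECT CustomerID,
       dbo.fn_CleanPhone(RawPhone) AS Phone
INTO #CleanCustomers
FROM vwRawCustomers

CREATE CLUSTERED INDEX IX_CleanCustomers ON #CleanCustomers (CustomerID)

-- Subsequent steps query #CleanCustomers instead of vwRawCustomers.
```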
Blah, di blah.
About 20 minutes ago I kicked off a DTS package that runs through the entire process. It takes about an hour and a half to run, and I've run it from beginning to end about 3 times so far this week. I need to run it occasionally to verify that what I'm doing is working properly.
Each of the machines here has a maintenance script that is scheduled to run every evening. They start one hour after each other, and while exactly what a script does varies from machine to machine depending on its purpose, it basically just reports on unchecked-in files, defrags the hard disks, copies backups around, etc.
I made my scripts run from 6:37pm to ~9pm because I figure that's the time of day I'm least likely to be at my desk.
So, naturally, just shortly after I'd kicked off the DTS transform -- which, by the way, absolutely nails the living shit out of my development database server, thrashing the CPU, and network, disk and RAM I/O to the max -- I decided that I really needed an extra level of caching for some of the data I was using. So while DTS was running a package, I added some new tables to my server from Query Analyser (with SELECT INTO). Then I switched over to an instance of Enterprise Manager to fix up the details of those tables. I changed some of a table's properties and clicked save. The poor server, whose HDD LED had been a constant red glow for several minutes, didn't respond to my ALTER TABLE in a hurry, and Enterprise Manager just hung on my workstation.
So I toggled over to the server, just to see how badly it was thrashing, and to get an indication of the log file size. The log file size is about the best indicator I have of how long a task will take to complete, because I know that it tends to grow to about 16GB before the task finishes. Just as I toggled over to my development server, the scheduled daily maintenance script started running. The script, no less, compares MD5 checksums of ~330,000 files, defrags 4 different IDE partitions, backs up ~20 development databases comprising several GB storage, etc.
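For what it's worth, you don't have to eyeball the file on disk -- SQL Server will report log sizes itself, which makes the "how far along is it" guess a bit less manual:

```sql
-- Reports log file size (MB) and percent used for every database
-- on the server; handy as a rough progress gauge for a long load.
DBCC SQLPERF(LOGSPACE)
```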
Piggy ain't coming back.
Thanks for letting me share.
Just as I'm typing now, the big task completed! Cool.
And get this. It took 37 mins and 12 secs to run.
See that! 37 minutes! Haha!
It took about 1.5 hours yesterday.
Good work me!
Anyway.. must go.. I've got some referential integrity to enforce.