I was presented a problem which entailed removing 100s of millions of records from several tables in order to recover disk space and improve query performance.
There were several options to choose from of which I explored:
I could do the most basic “DELETE FROM…” all at once. This would have been very time consuming. The query alone for one table would fill up the log like mad (each deleted row is logged and stays there until the query is finished). You could imagine how much space that would take up and how slow it would be. So, I didn’t even bother.
A little more advanced method would be to loop through each table and delete sets of data, 50K records at a time for example. This seems to be a favorite of a lot of non TSQL programmers 😉 Nonetheless, I gave it a shot. Two days later, that thing was still running and wasn’t close to doing all the tables. Abort mission!!!
I tried to get a little more creative and came up with an idea that I wanted to pat myself on the back for. Rather than selecting the records I wanted to delete, why not select the ones I wanted to keep? Initially, I did some research (scoping the project out) of what needed to be done (the number of records in each table, what needed to be deleted, the space that would be recovered, etc). Research, by the way, is always a good idea. From it, I discovered that 80% or higher of the records needed to be removed. I could do a “SELECT * INTO….”, truncate the table and then insert back into the original table with 50K records at a time. The truncate would be much faster, I would have a much smaller data set to work with and the original table would stay intact. This mirrored the previous method in a sense, inserting rows in batches rather than deleting them. The problem with both of these is that you have to know which ones to select each time and the tables are askew during the process. Each time you loop through, you need to query a million or so records to pick out 50K in order to know which ones are good (what to delete or not to insert). Each query took several minutes compounded by how many times it needed to loop through to get the job done. It took FOR-EV-ER. Good idea at first but didn’t quite work out.
I piggy backed off that idea yet it was more complicated. Rather than making the “SELECT * INTO…..” tables my temp tables, I figured I would make them my replacement tables. Why spend time to put them in there only to put them back and run numerous queries over just one per table, right? In order to do this though, I had to mirror the original table’s objects, indexes and keys. Management studio allows you to script these out but they couldn’t be the same name. SQL Server doesn’t allow it. So, I had to rename them myself. (Thank programmers for regular expressions and notepad++ or I would still be working on it). I went ahead and named all of these <original_name>_NEW. All I had to do then was to rename the original ones <original_name>_OLD and remove the _NEW. This was the culmination of my search for the Holy Grail. Not only did it take about 1 hour vs. 2 days to complete all the tables, I still had the old tables that just needed to be renamed if it wasn’t right and there was almost no impact on the system. Behold the power of set operations, ah-ee-ahhhh.