PowerShell Script to Remove SharePoint 2010 or 2013 List Duplicates

<Updated 2013-05-11>

   Recently Derek, a fellow Premier Field Engineer (PFE), had a customer request on how to remove duplicates from a SharePoint 2010 list.  I have written some PowerShell scripts that remove list items, but never to check for duplicates.  My main concern for this was performance as looping through a list searching for duplicates could be an expensive operation.

 

Problem

   As it turns out the customer had a list with more than 85,000 items and many of the list items were duplicates.  The original solution the custom attempted took 5 hours to run and threw out of memory exception most times it was run.  My assumptions on the resource intensive list queries and looping were correct.

 

Solution

   I first wrote my own version of a looping structure to iterate over the items and remove duplicates while trying to be as efficient with memory and CPU usage as possible.  After about 10 minutes of scripting I wasn’t making good progress so I decided to switch up my approach.

   A quick Bing search pulled up the following article on checking for duplicates items in a SharePoint list.  The key piece of info from that article was not to loop through the items one at a time, but instead to convert the list into a DataTable using the method SPListItemCollection.GetDataTable() and then group the items on the column to compare for duplicates.  In this case the Title column was used for grouping.  With a little trickery I was then able to find the IDs of the duplicate items to then be used to delete the individual duplicate items.

   Two added added bonuses are included in the below script.  The first is that the script writes out the progress of the deletion process using Write-Progress.  The second is that the commented portion at the bottom will generate list items with random single letter titles to test out the duplicate deletions.

<Update 2013-05-11>

   Thank you to reader Santosh for pointing out an error in the script I published.  I had mistakenly published a version that included an invalid reference.  On line 10 in the foreach block (%) the call to $list.DeleteByItemID() will fail because that method doesn’t exist on that object.  I’ve updated the downloadable script and below sample to the original I used for $list.GetItemById($_.ID).Delete().  Santosh also got it to work by calling $list.Items.DeleteItemById().  I prefer not to call the Items member on an SPList object because of the performance impact which you can read about in my previous post.

</Update 2013-05-11>

Add-PSSnapin microsoft.sharepoint.powershell 
$web = Get-SPWeb -Identity “<URL of Site>” 
$list = $web.Lists[“DuplicatesList”] 

$AllDuplicates = $list.Items.GetDataTable() | Group-Object title | where {$_.count -gt 1} 
$count = 1 
$max = $AllDuplicates.Count 
foreach($duplicate in $AllDuplicates) 
{ 
$duplicate.group | Select-Object -Skip 1 | % {$list.GetItemById($_.ID).Delete()} 
Write-Progress -PercentComplete ($count / $max * 100) -Activity “$count duplicates removed” -Status “In Progress” 
$count++ 
} 

   Be sure to modify the site URL and list name to work in your environment.  In my example the list is named DuplicatesList and must already exist.

 

Conclusion

   With my current solution I was able to remove 75,000 duplicates in a list with a single column (Title) in roughly 45 minutes.  Derek made some tweaks and was able to remove 45,000 duplicates in about 3.5 hours on a list with dozens of columns.  It appears that a list with additional columns takes longer to process.  If I had additional time I would investigate the SPList.GetDataTable() method which allows passing in a query to only retrieve specific columns (i.e. the column we are using to determine duplicates).  I believe this might speed up the process but don’t know for sure.

   As it stands I was very happy with the solution I came up even if it did take a long time to run.  The important part was that it didn’t throw out of memory exceptions while running and did have increased performance vs. the customer’s original solution.  Thanks to Derek for collaborating on this item with me.  It really is great to work with other PFEs who are passionate about helping customers solve real problems in their environments.  I hope this script and / or post helps someone else in their environment.

 

      -Frog Out

 

Links

Check for duplicate items in Sharepoint lists with Powershell

http://jespermchristensen.wordpress.com/2008/09/22/check-for-duplicate-items-in-sharepoint-lists-with-powershell/

 

SPListItemCollection.GetDataTable() method

http://msdn.microsoft.com/en-us/library/microsoft.sharepoint.splistitemcollection.getdatatable.aspx

 

SPList.GetDataTable() method

http://msdn.microsoft.com/en-us/library/microsoft.sharepoint.splist.getdatatable(v=office.14).aspx

6 thoughts on “PowerShell Script to Remove SharePoint 2010 or 2013 List Duplicates

  1. Originally posted on: https://briantjackett.com/archive/2013/03/19/powershell-script-to-remove-sharepoint-2010-or-2013-list-duplicates.aspx#626828This below line of code gives error saying the DeleteItemByID doesnt exist for list$duplicate.group | Select-Object -Skip 1 | % {$list.DeleteItemById($_.ID)} I have replaced with $list.DeleteItemById($_.ID) with $list.Items.DeleteItemById($_.ID)it worked for me.This script takes long time to delete items. It takes one minute to delete a record. I have got 84 fields on in the list.Is there are better solution which resolves this much faster.

    Like

  2. Originally posted on: https://briantjackett.com/archive/2013/03/19/powershell-script-to-remove-sharepoint-2010-or-2013-list-duplicates.aspx#627903Santosh,Unfortunately I made the switch to $list.deleteitembyid() late in the script development and didn’t fully test it out. Originally I used $list.getitembyid(<id>).delete(). I fear I may have posted the wrong script. I will test this out when I get a chance and update if needed. Thanks for catching this.

    Like

  3. Originally posted on: https://briantjackett.com/archive/2013/03/19/powershell-script-to-remove-sharepoint-2010-or-2013-list-duplicates.aspx#645357This runs amazingly fast – iterating through a list with >100k entries and ~15 columns, grouping by multi-line columns as duplicate criteria, the GetDataTable took roughly 10 minutes, and with 181 duplicate candidates with ~500-1000 clones each, the deletion process took ~1h. And due to the beautiful Write-Progress view, it was a pleasure watching it run. :DThank you so much for this codebit, it saves me one hell of a hassle in developing a resource-optimized solution myself!

    Like

Leave a comment