SharePoint WSP deployment jobs that fail because of a faulty timer job can be difficult to track down. I’ll walk you through a scenario I ran into at a client recently, along with the issues caused by the SharePoint Timer Service repeatedly crashing.
At this client we have close to 20 custom and 3rd party WSPs deployed to the primary farm. During a routine rebuild one day we noticed that 3 deployment jobs were consistently failing or listing as “not deployed” (see below). After quickly digging into the event viewer on the 3 web front end (WFE) servers we noticed that one of them had the error below.
WSP deployment jobs failing
“Faulting application owstimer.exe” error
The error suggested that the owstimer.exe process (the SharePoint Timer Service, responsible for running deployment jobs on each server) was crashing on that server, but with no clear explanation. We restarted the Windows SharePoint Timer Service on that server and retried the deployment. Thankfully the deployment succeeded on the troubled server. Unfortunately it then failed on a different WFE. The cause: owstimer.exe had now crashed on that other server. Hmm, one step forward and one step back, not a good sign.
Fast forward a few more failed deployment attempts, with owstimer.exe crashing on 1 or 2 of the servers each time. We then decided to fully remove the faulty deployment jobs that lingered on the failing servers. This can be accomplished a few different ways. The first is running the “stsadm -o canceldeployment -id <guid>” command, with the <guid> value coming from the output of “stsadm -o enumdeployments.” Another (and I feel slightly more risky) option is deleting the timer jobs through the Central Admin interface as seen below. I say more risky because Central Admin makes it much easier to click the wrong button or choose the wrong option (just my personal opinion). You can view deployment job status through “Operations -> Timer Job Status” and cancel jobs through “Operations -> Timer Job Definitions.” Inside Timer Job Definitions, click the specific job and then Delete on the following screen.
Timer Job Definitions
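If you have more than a couple of stuck jobs, the stsadm route can be scripted. Below is a minimal sketch in Python, not something we ran at the client. It assumes the enumdeployments output is XML with one Deployment element per job carrying a JobId attribute; check that against what your farm actually prints. The sample output, the GUIDs, and the function names are all made up for illustration. The script only builds the canceldeployment commands so you can review them before running anything.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of "stsadm -o enumdeployments" output.
# ASSUMPTION: one <Deployment> element per job with a JobId attribute;
# the GUIDs and titles here are invented for illustration.
SAMPLE_OUTPUT = """<Deployments Count="2">
  <Deployment JobId="11111111-2222-3333-4444-555555555555">
    <Title>Hypothetical deployment job A</Title>
  </Deployment>
  <Deployment JobId="66666666-7777-8888-9999-000000000000">
    <Title>Hypothetical deployment job B</Title>
  </Deployment>
</Deployments>"""


def deployment_job_ids(enum_xml):
    """Return the JobId of every Deployment element in the given XML text."""
    root = ET.fromstring(enum_xml)
    return [d.get("JobId") for d in root.iter("Deployment") if d.get("JobId")]


def cancel_commands(enum_xml):
    """Build one 'stsadm -o canceldeployment -id <guid>' command per job."""
    return [
        ["stsadm", "-o", "canceldeployment", "-id", job_id]
        for job_id in deployment_job_ids(enum_xml)
    ]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in cancel_commands(SAMPLE_OUTPUT):
        print(" ".join(cmd))
```

On a real server you would feed the function the captured output of enumdeployments rather than the sample string, and only pass the commands to something like subprocess.run once you are happy with the list.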
Once we had completely cancelled, retracted, and removed all of the failing solutions and then restarted the timer service on all WFEs, we were able to successfully deploy all solutions on all WFEs. There are times when working with SharePoint that you just need to back out of your process and start over from a clean slate to get successful results. Knowing when to back out completely and when to forge ahead in your current direction is something I keep learning the more I work with SharePoint. The biggest take-away from this experience is that when a timer job fails on one server, there is a chance it is a farm-wide issue and not isolated to that server. I can’t say that is always the case, but know that the chance exists. Until next time, boys and girls, happy SharePointing.