A Tale of the SharePoint Timer Service Repeatedly Crashing

SharePoint WSP deployment jobs that fail because of a faulty timer job can be difficult to track down.  I’ll walk you through a scenario I faced at a client recently and the issues that came from the SharePoint Timer Service repeatedly crashing.

At this client we have close to 20 custom and third-party WSPs deployed to the primary farm.  During a routine rebuild one day we noticed that 3 deployment jobs were consistently failing or listing as “not deployed” (see below).  After a quick dig through the event viewer on the 3 web front end (WFE) servers, we noticed that one of them had the error shown below.

[Screenshot: WSP deployment jobs failing]

[Screenshot: “Faulting application owstimer.exe” error]

The error suggested that the owstimer.exe process (the SharePoint Timer Service, which is responsible for running deployment jobs on each server) was crashing on that server, but it gave no clear explanation.  We restarted the Windows SharePoint Services Timer service on that server and retried the deployment.  Thankfully the deployment succeeded on the troubled server.  Unfortunately the deployment then failed on a different WFE.  The cause: owstimer.exe had now crashed on that other server.  Hmm, one step forward and one step back.  Not a good sign.

Fast forward a few more failed deployment attempts, and owstimer.exe was crashing on 1 or 2 of the servers on every attempt.  We then decided to fully remove the faulty deployment jobs that lingered on the failing servers.  This can be accomplished a few different ways.  The first is running the “stsadm -o canceldeployment -id <guid>” command, with the <guid> value coming from the output of “stsadm -o enumdeployments”.  Another (and I feel slightly more risky) option is deleting the timer jobs through the Central Admin interface as seen below.  I say more risky because Central Admin makes it much easier to click the wrong button or choose the wrong option (just my personal opinion).  You can view deployment job status through “Operations -> Timer Job Status” and cancel jobs through “Operations -> Timer Job Definitions”.  Inside Timer Job Definitions, click the specific job and then click Delete on the following screen.

[Screenshot: Timer Job Definitions]
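For the stsadm route, here is a rough Python sketch that turns the output of “stsadm -o enumdeployments” into the matching canceldeployment commands.  It only builds the command strings, so you can review them before running anything on the farm.  The sample XML is made up for illustration, and the element and attribute names in it (Deployment, JobId, File) are my assumption about the output shape, so check them against what your own farm prints.  The cancel_commands helper is a hypothetical name, not part of any SharePoint tooling.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for `stsadm -o enumdeployments` output.
# Assumption: each pending job is a <Deployment> element carrying a JobId
# attribute. Verify this against real output from your own farm.
SAMPLE_OUTPUT = """
<Deployments Count="2">
  <Deployment JobId="11111111-1111-1111-1111-111111111111">
    <File>example-one.wsp</File>
  </Deployment>
  <Deployment JobId="22222222-2222-2222-2222-222222222222">
    <File>example-two.wsp</File>
  </Deployment>
</Deployments>
"""


def cancel_commands(enum_xml):
    """Return one canceldeployment command string per job found in the XML."""
    root = ET.fromstring(enum_xml.strip())
    commands = []
    for job in root.iter("Deployment"):
        job_id = job.get("JobId")
        if job_id:  # skip entries that carry no job id
            commands.append("stsadm -o canceldeployment -id " + job_id)
    return commands


if __name__ == "__main__":
    # Print the commands for review; nothing is executed here.
    for command in cancel_commands(SAMPLE_OUTPUT):
        print(command)
```

Printing the commands instead of running them is deliberate: cancelling the wrong job is easy to do, and a quick read-through costs nothing.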

Once we had completely cancelled, retracted, and removed all of the failing solutions, and then restarted the timer service on all WFEs, we were able to successfully deploy all solutions on all WFEs.  There are times when working with SharePoint that you just need to back out of your process and start over from a clean slate to get successful results.  Knowing when to back out completely and when to forge ahead in your current direction is something I keep learning the more I work with SharePoint.  The biggest takeaway from this experience is that when the timer job fails on one server, there is a chance it is a farm-wide issue and not isolated to that server.  I can’t say that is always the case, but know the chance exists.  Until next time, boys and girls, happy SharePointing.
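For anyone who wants the clean-slate sequence spelled out, here is a rough Python sketch that lists the commands for one solution in the order described above: retract, remove, restart the timer service, then add and deploy again.  It is not the exact script we ran.  The clean_slate_plan helper and the example names are hypothetical, and SPTimerV3 as the timer service name assumes a WSS 3.0 / MOSS 2007 farm.  Depending on the solution you may also need switches such as -allcontenturls or -url (web application scoped solutions) or -allowgacdeployment (solutions with GAC assemblies), which I have left out.

```python
def clean_slate_plan(solution_name, wsp_path):
    """Return the ordered command strings for a clean redeploy of one solution.

    Nothing is executed; the list is meant to be reviewed and then run by hand
    on a farm server. The timer service restart has to be repeated on every
    WFE, not just the server you are logged on to.
    """
    return [
        # 1. Retract the solution and flush pending administrative jobs.
        "stsadm -o retractsolution -name {0} -immediate".format(solution_name),
        "stsadm -o execadmsvcjobs",
        # 2. Remove the solution from the solution store.
        "stsadm -o deletesolution -name {0}".format(solution_name),
        # 3. Restart the timer service (assumed service name: SPTimerV3).
        "net stop SPTimerV3",
        "net start SPTimerV3",
        # 4. Add the WSP again and deploy it.
        'stsadm -o addsolution -filename "{0}"'.format(wsp_path),
        "stsadm -o deploysolution -name {0} -immediate".format(solution_name),
        "stsadm -o execadmsvcjobs",
    ]


if __name__ == "__main__":
    # Hypothetical solution name and path, for illustration only.
    for step in clean_slate_plan("example.wsp", r"C:\deploy\example.wsp"):
        print(step)
```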

 

-Frog Out
