Premise Job Queue stalls requires manual port reset

doczaius · Jul 24, 2013

I'm having an issue where every 12-24 hours, jobs stop processing and won't restart until I manually reset the com port (by changing it to an unused port and then back to the VRC0P port). The VRC0P itself does not need to be restarted.

I read through the forum and found some posts regarding firmware issues and the VRC0P not responding after X amount of commands but I don't think this issue is related as the VRC0P does not require restarting, just the com port. Has anyone experienced this issue?

etc6849 · Jul 25, 2013

I've never had this problem... It could be a serial port issue. Can you unplug the VRC0P and paste what is shown in port spy when you plug it back in? This will show you the firmware version...

The worst case is we'll have to set up a script that runs every 12 hours to reset the thing automatically. However, this should not be necessary. I've run the thing for over a year with no issues or reboot needed.

doczaius · Jul 25, 2013

It's a +3 model:

Leviton(C) 2008 V2.36S/Z-Wave 3.4

I guess it could be a serial port issue, but the odd thing is these are legit serial ports, no USB adapter -- I purposefully dug out & set up an old WinXP era motherboard to eliminate any compatibility issues.

chucklyons · Jul 26, 2013

I haven't had any problems with the VRC0P, either. Have you tried using another device on that port? Or have you tried using a different port (like a USB port)?

doczaius · Jul 26, 2013

I'll try with a USB adapter today and see what happens.

etc6849 · Jul 27, 2013

Mine shows "$Leviton(C) 2008 V2.36S/Z-Wave 3.42." If you want, you could try updating the firmware (not without risk though). I would suggest using a non-generic USB cable to do this as I had issues updating until I tried an old Digi 8 port usb to serial adapter.

Initially, there were a lot of issues with the VRC0P +3's firmware. That's why there's the add delay after next job feature (that is not needed now). It was put in place because an older version of the firmware would lock up if you sent too many commands to the VRC0P in a short time frame.

doczaius · Aug 1, 2013

So I'm running the same firmware.. I think my previous post was a bad copy paste job omitting the 2 at the end.

No change trying with an iogear usb serial adapter. It just stops working, almost like there is a break in the code (though there isn't), usually with commands left in the queue, occasionally with it empty.

chucklyons · Aug 1, 2013

I apologize, I've been following this on my phone... What are the problems you're seeing, what steps have you taken, and what are the current results?

I initially had a lot of problems with the Vizia module...ultimately it turned out to be a combination of my wife's errors ^_^ and my configuration...after I figured it out with the help of some fine folks here (ETC!), I haven't had a problem since...actually, if I do have a problem, I know it was caused by me... I mean my wife...
It's solid. Let's figure out what we can do to help you...

etc6849 · Aug 1, 2013

Here are some things I would try:
1. Run the burn in test for a few hours to toggle all of your lights. Does this cause the freeze up? If so, run it again to verify. If the burn in test is failing, you may have an intermittent hardware issue of some kind.

2. Check and make sure you have no other program trying to use the COM port periodically.

3. Determine if the issue occurs at a set time interval or after a set number of commands.

4. Examine the windows logs and look for any clues.

Be sure to let us know what you find.

doczaius · Aug 1, 2013

OK -- Thank you Chuck & ETC for your help...

Here are the summaries:

System:

Windows XP Install
2 old-school com ports on system board, 1 usb->serial
I've tried with the VRC0P v3 plugged into all three.
No issues with X10 module plugged into com ports (though I realize it's much simpler tech)
Premise is the only software installed (besides standard system software, firefox and debug tools that require manual launches)
Premise modules installed: Default, Leviton, Automation Browser, Minibrowser, Occupancy

Issue:

Manual polling of 1 device (Thermostat)
Once or twice per day, Premise stops processing the queue.
The queue never clears, nor do commands back up.
The process simply halts, with or without items in the queue.
There are no code breaks or error popups. As far as I can tell, nothing abnormal in debug either.

Troubleshooting:

Resetting the VRC0P by unplugging from A/C and plugging back in does nothing.
Rebooting the computer, or temporarily changing the COM port for the ViziaRF module to an unused port and then back again gets everything running again.
Increasing the polling delay has no effect... I've tried 120, 300, 600 and 1200.
I only have one device set to One-Way, so reducing the number of devices to poll isn't really an option.
No difference re-flashing the VRC0P (latest firmware).
No difference between integrated COM ports and USB adapter.

I'm going to try a burn in test now... currently before I clear the job count, I have about 12K successful jobs and only 340 failed. Of course this doesn't count all the unprocessed jobs when it "halts."

doczaius · Aug 1, 2013

Ok well the Burn In test caused a code break pretty quickly:

In RetryJob:

Error: Object doesn't support this property or method: oJob.Runs

Code:

	if oJob.Runs < this.MaxRuns then ' Run the same Job again

etc6849 · Aug 1, 2013

What script is referenced with the error? Also, clear the job queue and the job count (go to JobQueue and toggle the boolean properties), reset the port, and please try the burn in test again. Maybe disable polling, then retry the burn in test too.

What polling interval are you using and for how many devices? It sounds like you are polling a lot and the queue fills up because there is not enough time to empty it. I'd recommend trying things with the polling interval set to 0 (i.e. disabled) for a few days. If that fixes things, try polling every 1800 seconds.

If you never see the queue empty (even after resetting the port), things will not work right.

I'd question why you need to use polling... None of my devices require polling and give instant two-way feedback. Have you tried to associate the thermostat with the VRC0P (do this manually from Premise)?

doczaius · Aug 1, 2013

sys://Devices/CustomDevices/ViziaRF -> JobQueue -> Retry Job.

Only 1 device polling (the thermostat), at 180 now. The queue isn't filling up. It just stops going. I think there are 5 or so polling commands sent for the thermostat... at any given "halt", there may be anywhere from none to 5 in the queue, but never more than 1 round. The polling and queue work just fine... when it's working.

No difference clearing the queue, job count & turning off polling. Burn in test fails in the same place.

The 2GIG CT-30 (my thermostat) does not support the association class, so polling is the only option.

etc6849 · Aug 2, 2013

Thanks for that: I now know what's causing the error during burn in!

Just comment out these three lines and commit the code for sys://Schema/Modules/Leviton/Classes/JobQueue/RetryJob:

'if this.Parent.StopOnFirstJobTimeout = true and this.Parent.TestingInProgress = true then
'this.Parent.Stop = true
'end if

After you do that, rerun burn in and watch port spy. Try to see which device is having the job failures and start there.

Normally, you should definitely see no job failures (receipt of X002) with Z-Wave.

The problem is that you have job failures, and this can be hard to debug. It could be anything from RF interference, device placement, a device going bad, a neighbor losing power, etc...

These job failures are a problem and must be corrected, especially if you are going to poll the device giving the failures. If you have three consecutive job failures (defined by sys://Schema/Modules/Leviton/Classes/JobQueue/MaxConsecutiveFailedJobs), the VRC0P port will be reset. However, once you reach a maximum number of port resets (defined by sys://Schema/Modules/Leviton/Classes/ViziaRF/MaxInterfaceFailures), the port will be set to nothing and you'll have to manually reset the com port. This sounds like it may be your issue.

Code:

'
' Either no response was received and the JobTimer expired or
' the received response indicated a failure, so retry the Job
' provided that it has not been executed more times than MaxRuns.

dim oJob
'debugout "<RetryJob>"

this.DeleteJobTimer

if this.CurrentJob <> "" then
	set oJob = this.GetObject(this.CurrentJob)
	if not oJob is nothing then
			
		' if running burn in test, stop on first job failure
		'if this.Parent.StopOnFirstJobTimeout = true and this.Parent.TestingInProgress = true then
		'	this.Parent.Stop = true
		'end if
			

		if oJob.Runs < this.MaxRuns then ' Run the same Job again
			
			if gViziaIsDebugOn(1) then debugout "RetryJob(): <" & now & "> RETRY JOB, command=" & oJob.Description
			this.ProcessCurrentJob
		else ' Job failed to execute even after repeated attempts
			
			' Get the node ID for object that had the failed job
			iNodeID = oJob.NodeID
			
			' Record the failed job to the specific device
			this.Parent.Devices.SetFailedJob iNodeID
			
			this.TotalFailedJobs = this.TotalFailedJobs + 1
			if this.ConsecutiveFailedJobs < this.MaxConsecutiveFailedJobs then
				this.ConsecutiveFailedJobs = this.ConsecutiveFailedJobs + 1
			end if
			this.GetNextJob
		end if
	else
		' CurrentJob not found.
		' Something has gone terribly wrong.
		' Flush the Job Queue.
		this.PurgeAllJobs
	end if
end if

set oJob = nothing
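
For anyone following along, here is a rough standalone sketch of the escalation described in this post: repeated failed jobs lead to a port reset, and enough port resets leave the port set to nothing. It is plain VBScript for Windows Script Host (run it with cscript), not code from the Leviton module. The FailureTracker class, its method names, zeroing the consecutive counter after a reset, and using >= for both limits are all assumptions made for illustration. Only the four counter names come from this thread.

```vbscript
Option Explicit

' Standalone model of the failure escalation. It does NOT use the Premise
' object model; the class and its methods are hypothetical.
Class FailureTracker
    Public MaxConsecutiveFailedJobs   ' limit named in the thread
    Public MaxInterfaceFailures       ' limit named in the thread
    Public ConsecutiveFailedJobs
    Public InterfaceFailures
    Public PortAssigned               ' False models "port set to nothing"

    Private Sub Class_Initialize()
        MaxConsecutiveFailedJobs = 3
        MaxInterfaceFailures = 50
        ConsecutiveFailedJobs = 0
        InterfaceFailures = 0
        PortAssigned = True
    End Sub

    ' Call when a job has failed on every retry.
    Public Sub JobFailed()
        If Not PortAssigned Then Exit Sub
        ConsecutiveFailedJobs = ConsecutiveFailedJobs + 1
        If ConsecutiveFailedJobs >= MaxConsecutiveFailedJobs Then
            ' Assumed behaviour: a port reset counts as one interface
            ' failure and starts the consecutive count over.
            ConsecutiveFailedJobs = 0
            InterfaceFailures = InterfaceFailures + 1
            If InterfaceFailures >= MaxInterfaceFailures Then
                PortAssigned = False   ' manual COM port reset needed
            End If
        End If
    End Sub

    ' Assumed behaviour: any successful job clears the consecutive count.
    Public Sub JobSucceeded()
        ConsecutiveFailedJobs = 0
    End Sub
End Class

Dim tracker, i
Set tracker = New FailureTracker
tracker.MaxInterfaceFailures = 2      ' lowered so the demo finishes quickly
For i = 1 To 6
    tracker.JobFailed
Next
WScript.Echo "InterfaceFailures=" & tracker.InterfaceFailures & _
             ", PortAssigned=" & tracker.PortAssigned
```

With MaxInterfaceFailures lowered to 2 for the demo, six failed jobs in a row are enough to leave the port unassigned in this model, which is the state that would need a manual com port reset.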

doczaius · Aug 2, 2013

OK --

With that I was able to identify a light switch that went bad and was dropped by my primary controller. Anyway, I replaced the light switch and all was good. I ran the burn-in test for a little over an hour until... it stopped working.

There are 4 jobs sitting in the queue.

InterfaceFailureTime: 7/31/2013 11:46:10 AM
InterfaceFailures: 2 (Neither of which occurred today)
MaxInterfaceFailures: 50

TotalJobs: 2264
TotalFailedJobs: 0
MaxConsecutiveFailedJobs: 3
ConsecutiveFailedJobs: 0
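
As a quick sanity check, the counters above can be compared with their limits. This is a small standalone VBScript sketch for Windows Script Host (cscript), not Premise code. LimitReached is a made-up helper, and treating "reached" as >= is an assumption.

```vbscript
Option Explicit

' Hypothetical helper: True only if the counter has reached its limit.
Function LimitReached(count, limit)
    LimitReached = (count >= limit)
End Function

' Values copied from the status dump above.
WScript.Echo "Port-reset limit reached: " & LimitReached(2, 50)
WScript.Echo "Consecutive-failure limit reached: " & LimitReached(0, 3)
```

Both checks come out false for these numbers, so if the lockout works the way etc6849 described, it would not account for this particular stall.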
