WC32 "goes to sleep"

rossw

I have an instrument built based on a WC32.
It has a pyranometer, a low-level light sensor and a 240-360nm UV sensor all driving analog inputs, an I2C sensor for humidity and temperature, and three DS18B20s (two for air temperature, one for internal case temperature).
 
Over the first year, it's operated adequately, requiring the occasional powercycle (once a week or so), but since the weather started warming up a couple of months ago, it's been stopping dead in its tracks anything up to 8 times a day!
Usually starts around 13:00 local, frequently peaks around 16:00 or so, then improves and rarely fails after about 19:00.
The temperatures outside during this time are frequently from 30 to 40 degrees C.
 
The device, when built, was powered via PoE. 48V coming up the ethernet to an end-span PoE terminator. That was supplying 12V DC to the WC32. WC32 switching a low-current (100mA) 12V fan via open-drain FET.
 
The internal case temperature was reading about 2 deg C higher than ambient overnight and up to 10 deg C higher than ambient when the sun was striking the widest face of the case.
 
Thinking perhaps the PoE device was failing under the temperature, I replaced it with a passive PoE injector/splitter (midspan) and 12V supply. Absolutely no change.
 
I added a solar shield - gloss white stainless steel sheet, about 20mm off the case. This dropped the case temperature to about 3 deg above ambient in sun, but the problem persisted unchanged.
 
I double-checked all my firewall rules, spent ages doing TCPDUMP and ensuring there was absolutely no odd traffic to or from it. I reconfigured the WC32 to use my local NTP timeserver and DNS server, then completely blocked all outside traffic from being able to get to the WC32. The problem persisted.
 
It became evident that it would usually stop responding "soon after" it did a DHCP request. Not exactly coincident, but the DHCP transaction was almost always the last thing that happened before it stopped responding.
 
When the WC32 is "dead", it doesn't respond to ICMP echo requests or to TCP connection requests on port 80.
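For anyone wanting to timestamp these dropouts from another machine on the LAN, here is a minimal Python sketch that polls TCP port 80 and logs each up/down transition. It is only an illustration: the function names, the polling interval and the example address are assumptions, not anything from the WC32 itself.

```python
import socket
import time


def tcp_port_open(host, port=80, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def watch(host, port=80, interval=5.0, probe=tcp_port_open,
          sleep=time.sleep, max_checks=None):
    """Poll the device and record a timestamped line on every state change.

    probe and sleep are injectable so the loop can be exercised offline.
    max_checks=None means run until interrupted.
    """
    last = None
    transitions = []
    checks = 0
    while max_checks is None or checks < max_checks:
        up = probe(host, port)
        if up != last:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} {host}:{port} {'UP' if up else 'DOWN'}")
            transitions.append((stamp, up))
            last = up
        checks += 1
        sleep(interval)
    return transitions
```

Running something like `watch("192.168.1.50")` (a hypothetical address) would give a log of failure times that could be lined up against the temperature readings and the DHCP server log.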
 
On a whim, I changed it from DHCP to static IP. It still had the same address, just that it was set in its internal configuration rather than always being given the same address by the DHCP server. Interestingly, it now ran for 3 days straight, without failing! I thought I was onto a winner!
 
But then this afternoon, it died at 12:30, and again at 14:30, and again 5 more times! So that wasn't it.
 
When it stops talking, the network LINK light stays on at the switch, so it's not like it's lost power...
 
Anyone got any similar experience, clues, hints or cures?!?
 
CAI_Support said:
Did you check the power supply to make sure the problem is not caused by intermittent failure in power?
 
Yes, that was part of replacing the 48V PoE (endspan) injector and terminator with a passive 12V injector.
The same power outlet is also running another WC32 and three WC8 boards, none of which are misbehaving.
 
If the other 3 boards are all working fine, then maybe measure the voltage after the regulator on the problem board, to see if the problem is caused by power regulation on the board.
 
When I measured the voltage early on, it was stable and about what I expected.
It's difficult for me to measure at the moment - when the instrument goes off-line is pretty much right when it's MOST required, and I can't really afford the time to gather tools and meter and go to where it is. (It takes several minutes to get to it, and longer again to get the box open). Removing power, waiting 2 seconds and re-applying power takes me about 10 seconds. The intermittent nature of the beast is the most frustrating thing!
 
I agree, it is very hard to diagnose an intermittent problem. It might be better to remove that board and put it on the bench to find out what is wrong. We know for sure the CPU and firmware never sleep, so it is something else failing.
 
An update.
It's still doing it. Some days it doesn't happen at all. Other days, like today, it's done it 6 times in the last 4 hours!
A new observation. The ethernet "LINK" light on the switch goes off for a couple of seconds, then comes on for about 9 seconds, then goes off... and repeats the cycle.

I remember seeing this very early on, when I had a network switch that the WC32 board didn't like.

I've pulled out the switch and replaced it with 3 others. They all must use the chipset that the WC32 doesn't like, as this board hasn't worked with any of them. I've returned to the one it was working with, or at least "mostly" working with.
 
I've now modified my code a little, to use the PING command to test for network connectivity. Each time it gets a reply, it clears a counter. Every time through the main loop, it increments the counter and tests. If the board has been unable to get an ICMP ECHO REPLY in 20 seconds, I issue a SREBOOT command. On start-up, I'm setting an output high so I have an indication that it has reset. It won't tell me WHY it reset, but at least it's an indicator. I'll wait for a few days before I decide if this has made any difference. While it won't be a cure, it might save the constant interruptions and having to go powercycle it.
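The counter logic described above can be sketched as follows. This is Python rather than the WC32's own PLC language, purely to show the logic; the class name, the `passes_for` helper and the loop period are assumptions for illustration.

```python
import math


def passes_for(timeout_s, loop_period_s):
    """Number of main-loop passes that roughly span timeout_s seconds."""
    return max(1, math.ceil(timeout_s / loop_period_s))


class PingWatchdog:
    """Counter that is bumped on every loop pass and cleared by a ping reply."""

    def __init__(self, limit):
        self.limit = limit  # passes without a reply before giving up
        self.count = 0

    def tick(self, got_reply):
        """Call once per main-loop pass. Returns True when a reboot is due."""
        self.count += 1
        if got_reply:
            self.count = 0
        return self.count >= self.limit
```

With a main loop that runs about once a second, `PingWatchdog(passes_for(20, 1.0))` would ask for a reboot after 20 consecutive passes with no echo reply. The sketch only decides when to reboot; it says nothing about whether a soft reboot actually recovers the network interface.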
 
 
Edit: Well, that was a dismal failure. Already barfed. So the "SREBOOT" doesn't completely re-initialize things like a powercycle does. :(
 
Do you monitor the temperature inside the box?
 
I had a router that would crap out at 42C every time.
As you know many regulators have self shut-down with temperature. When you open the box and reduce the heat build-up it all works fine.
 
SREBOOT does completely reinitialize everything the firmware controls. However, it cannot reinitialize anything attached to the board, since it has no ability to cycle power.
 
LarrylLix said:
Do you monitor the temperature inside the box?
 
Yes - this was detailed in the first post of this thread.
With the extra radiation shield on, yesterday was not hot - when it first failed the ambient temperature was just over 30 deg C and internal case temperature was 38C.
Over the next 6 hours the ambient temperature remained between 29 and 35 C, the internal temperature varied from 36 to 42 C.
In the last 10 days it has operated quite happily and without locking up with temperatures of 42C ambient and 49C internal.
These temperatures are not what I'd call "excessive".
 
Ross,
 
You have multiple WC32 boards. Could you please replace it with another one, to isolate whether the problem is caused by this board or not?
 
Y'know, it'd be really nice if we could somehow pre-populate the DS18B20 device IDs before we put a board in place.
Right now, I've set up everything else except for the temperature sensors. Tomorrow, I'll have to go up, pull the old one out, put the new one in, then have garbage data until I can get back to a web browser, scan the bus, identify and select the sensors...
 
It would be a complete PITA sending a "replacement" board to someone far away for an embedded controller without internet access!
I always record the sensor serial numbers, and could plug them in before sending a board away - but there's no mechanism to do it :(
 
(Similarly, it'd be very handy to be able to download a "config backup" from a board that could just be "uploaded" to restore its settings, or perhaps to fully configure a replacement board!)
 
That is a great suggestion. From the /api/status.xml web interface, one can get the full list of bound temperature sensors in detail. However, matching them up during configuration still has to be done manually. I will check with the developer to see if there is an easy way to upload the bound sensor list to a board; I'm not sure that is easy to do. Right now, getting a configuration to match the original bound sensor configuration still needs a human to identify the sensor IDs from the status.xml list.
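Until something like that exists, the manual matching step could be partly scripted. The Python sketch below is a rough illustration only: the layout of status.xml is not shown in this thread, so it does not rely on any element names and simply scans every element's text and attributes for values that look like a 64-bit 1-Wire ROM code, assumed here to be written as 16 contiguous hex digits. The function names and sample IDs are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a ROM code appears as exactly 16 hex digits, no separators.
ROM_ID = re.compile(r"^[0-9A-Fa-f]{16}$")


def find_rom_ids(xml_text):
    """Return {ROM ID in upper case: tag of the element where it was found}."""
    found = {}
    for elem in ET.fromstring(xml_text).iter():
        values = [elem.text or ""] + list(elem.attrib.values())
        for value in values:
            value = value.strip()
            if ROM_ID.match(value):
                found.setdefault(value.upper(), elem.tag)
    return found


def compare(recorded, found):
    """Return (recorded IDs not seen on the board, seen IDs not recorded)."""
    recorded = {r.upper() for r in recorded}
    seen = set(found)
    return sorted(recorded - seen), sorted(seen - recorded)
```

Feeding it the text of a board's /api/status.xml plus the list of serial numbers recorded at build time would show at a glance which sensors are missing or unexpected, though the final binding would still have to be done by hand in the web interface.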
 
rossw said:
 
Yes - this was detailed in the first post of this thread.
With the extra radiation shield on, yesterday was not hot - when it first failed the ambient temperature was just over 30 deg C and internal case temperature was 38C.
Over the next 6 hours the ambient temperature remained between 29 and 35 C, the internal temperature varied from 36 to 42 C.
In the last 10 days it has operated quite happily and without locking up with temperatures of 42C ambient and 49C internal.
These temperatures are not what I'd call "excessive".

No, I wouldn't consider them excessive either, if you could actually measure the internal core temperature of the regulators, assuming they are the hottest.

But... as I posted in my previous post, I had problems with my router at anything over 41C. Strange thing is, once it happened it became sensitive to that temperature from then on. It went for years in the same environment until then. Junction breakdown?

My usual test is the back of the finger right on the chip. If you can't hold your skin against it for more than 1 second, then it's excessive. :)

In the ole' days the 7805 line had internal shutdown from too much heat. You could see it if you measured the load voltage with a hair dryer on it to exaggerate it.
 
Copied the code, settings and config exactly from this board to a new one, same IP, everything.
Went up top, pulled the old board out, put *everything* back exactly as it was - pin for pin. Same cables, same sensors, same WC32IO board, even the same screws to mount it. No changes to power supply, cables, network switch or port... and it hasn't missed a beat in a week.... which includes days with ambient of 44.5C and above.
 
So it looks like it's the old board, for sure. Funny how it would run fine for days, then die 6 or 8 times in a few hours!
 