12

I'm building a system with a Raspberry Pi located in a very remote area, connected to the internet with an internet stick. The tests are promising so far, but the Pi freezes every now and then and I can no longer connect to it. Because I don't want to take a 2-hour drive every time it freezes, I want to build a redundant system in which a second Pi checks the first.
In the worst case, the working Pi should cut power to the frozen one to force a reboot.

Now the question, from a total noob when it comes to building electronics:

I checked out the ATXRaspi R3 but I'm not sure how to "digitally" fire off the 6sec press on that power controller to cut the power by the other pi...

What would be the easiest way to cut power by another pi? Any hints are greatly welcomed.

tlfong01
Jurudocs
  • Not sure anyone is going to design this circuit for you. But one additional thing to consider: Whatever causes the first Pi to freeze might have a common failure mode to the second Pi. For example, if it's freezing because of a power fluctuation, you might end up with two frozen Pis instead of the independent redundancy that you want. Might be worth trying to understand why that first Pi freezes first. – Brick Jun 14 '19 at 12:33
  • How quickly do you need the pi to come back online? A simple holiday light timer could cycle the power every X hours, as long as you don't mind waiting until the reset interval to have it back online again. – Tim Jun 14 '19 at 17:30
  • @Jurudocs, I followed @berto's watchdog timer tutorial and found everything good. I don't quite understand what the watchdog is doing, but I am 90% sure that the watchdog timer method should solve your problem, and it is much cleaner than my proposed hardware solution. – tlfong01 Jun 17 '19 at 06:14

5 Answers

14

Before you go looking into additional hardware, please read up on what's called a "watchdog timer". The Raspberry Pi has a hardware watchdog built in that will power cycle it if the chip is not refreshed within a certain interval.

I have set up the watchdog on a Raspberry Pi 3 running a newish version of Raspbian with very little configuration. The first thing to check is that the hardware watchdog is available (I checked my system and it looks like the version of Raspbian I have installed compiles watchdog support right into the kernel; no need to load a kernel module):

pi@unicornpi:~ $ ls -al /dev/watchdog*
crw------- 1 root root  10, 130 Nov  3  2016 /dev/watchdog
crw------- 1 root root 252,   0 Nov  3  2016 /dev/watchdog0

If you see /dev/watchdog you're all set. All you have to do is configure the watchdog facility built into Systemd.

In the file /etc/systemd/system.conf, set the following lines:

pi@unicornpi:~ $ grep Watchdog /etc/systemd/system.conf
RuntimeWatchdogSec=10
ShutdownWatchdogSec=10min

What the lines above say is:

  • program the hardware watchdog with a 10-second timeout. systemd refreshes it at least once per half interval (about every 5 seconds); if the refreshes stop and the 10 seconds run out, the hardware resets the system

  • on shutdown, if the system takes more than 10 minutes to reboot, power cycle the system
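
If you want to double-check the settings from a script, here is a minimal sketch. It assumes the two lines live in /etc/systemd/system.conf as shown above; the sample text, the function names and the regular expression are my own illustration, not part of systemd. As far as I understand the systemd documentation, the manager contacts the watchdog at least once per half of the configured timeout, which is what refresh_interval() expresses.

```python
import re

# Hypothetical sample text; on a real Pi it would be read from
# /etc/systemd/system.conf.
SAMPLE_CONF = """\
[Manager]
#LogLevel=info
RuntimeWatchdogSec=10
ShutdownWatchdogSec=10min
"""


def watchdog_settings(conf_text):
    """Return the uncommented *WatchdogSec settings as a dict of strings."""
    settings = {}
    for line in conf_text.splitlines():
        # Commented lines start with '#', so they never match this pattern.
        match = re.match(r"\s*(\w*WatchdogSec)\s*=\s*(\S+)", line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def refresh_interval(timeout_s):
    """Assumed refresh period: half of the hardware timeout, in seconds."""
    return timeout_s / 2


settings = watchdog_settings(SAMPLE_CONF)
print(settings)
print(refresh_interval(float(settings["RuntimeWatchdogSec"])))
```

The values are kept as strings because systemd accepts time spans such as 10min, which this sketch does not try to convert.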

Once you have this configured and reboot, you will see something like this in the dmesg logs:

pi@orangepi:~ $ dmesg | grep -i watchdog
[    0.763148] bcm2835-wdt 3f100000.watchdog: Broadcom BCM2835 watchdog timer
[    1.997557] systemd[1]: Hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0
[    2.000728] systemd[1]: Set hardware watchdog to 10s.

If you see Set hardware watchdog to 10s you're all set.

The best way I've found to verify that the watchdog works is to overload the system. I've done this with a "fork bomb", which will completely saturate the system with garbage process forks. If you run this the Pi will become unresponsive and the watchdog should kick in. Your system should be up and running again after about a minute:

:(){ :|:& };:

Paste that into a shell and your system will be taken down. You've been warned.

More info on the watchdog system built into Systemd is on the author's website.

berto
  • Many thanks for the advice. I have heard of the watchdog for a long time but never tried it, because there was no need until now, when I am building a smart rooftop garden away from home (actually 50 feet above home). Another reason I did not try is that the tutorials are not newbie friendly. When I started with the Rpi1 years ago, I found terminal commands very scary (it took me more than three hours to download a zip (a tar, actually) and extract it, and then I did not know where to find the extracted files!) Now I find terminal commands not that scary, and sometimes very efficient, though I still love Win PowerShell terminal commands, ... – tlfong01 Jun 15 '19 at 04:03
  • And the advice at the beginning of your answer, to first read up on what a watchdog is, is very good. I did not know that "watchdog" is actually short for "watchdog TIMER". This is important because if I know beforehand that it is a timer, I can understand things better. As usual, I started with Wiki, which is always a good read for newbies. Now I know that the watchdog is actually some sort of hardware sitting alongside the Rpi. So even if the Rpi messes things up, the outside guy can come to the rescue (or "kick in"?). Reading Wiki let me know that "kick in" is not slang, but a technical term. – tlfong01 Jun 15 '19 at 05:10
  • I also didn't know what a "daemon" is. When I was a child, I read in the Bible that a daemon is a bad guy, so righteous programmers like me should not use daemons, otherwise I might go to Hell. But then Wiki told me how the MIT/UNIX guys coined the name and why it is spelled "daemon", not "demon". It also clarifies that daemons can be good, and even the righteous Socrates had a daemon. Anyway, I have finished reading the Wikis, and am now ready to start your tutorial, :) – tlfong01 Jun 15 '19 at 05:16
  • So I have followed your very detailed watchdog tutorial and found everything OK to the point of setting the watchdog to 10 seconds. Next step is to try a fork bomb, perhaps late this evening or tomorrow. – tlfong01 Jun 15 '19 at 09:17
  • Thank you for suggesting to call it a “watchdog timer”. I’ve made the edit – berto Jun 15 '19 at 13:17
  • I believe you are referring to “demons” which are considered bad/evil. On the other hand, “daemon” — with an `a` — is benevolent and stems from Greek mythology. Processes on a system that run in the background are referred to as daemons because they are like ghosts. They work without being seen. https://en.wikipedia.org/wiki/Daemon_(classical_mythology) – berto Jun 15 '19 at 14:39
  • Yes, "Watch Dog Timer" tells the story in more detail. But the "watch dog" idea is still puzzling to me. The following article tells me more details, but I still find the idea of "kicking the dog" puzzling. I have never heard about this before. I guess I missed some point somewhere. I think it is like recursion, which is simple ONLY after you understand the "trick". https://www.microcontrollertips.com/whats-watch-dog-timer-wdt-faq/ – tlfong01 Jun 17 '19 at 12:26
  • I'm not sure where the term originates, but saying that something "kicks in" is another way of saying that something is triggered, or that an event is happening. In the watchdog case, when it kicks in, that means the power cycle is taking effect, generally because the timer has not been refreshed, which suggests there is a problem. – berto Jun 27 '19 at 17:39
  • Thanks a lot for your detailed explanation of the phrase "kicking in". I was once confused because I read another article mentioning the approach of "keep kicking the dog, or it will bite you back." This keep-kicking-the-dog approach is actually used in the Rpi watchdog algorithm, in that if the watchdog timer is set to 10 seconds, say, the Rpi should keep restarting the timer (kicking the dog before its timer runs out), perhaps every 8 seconds. Another confusion for newbies like me is the following: / to continue, ... – tlfong01 Jul 27 '19 at 05:55
  • Hi @berto, Another confusion is the following. Rpi does restart the timer, perhaps every 8 seconds, but it does not monitor the timer to see if 10 seconds time runs out. It is the watchdog which 'monitors' the timer, and when 10 seconds timer runs out, the watchdog hardware will reset the Rpi. In other words, it is not the Rpi which "triggers" a reset, but the hardware reset circuit of the watchdog "kicks in" (or triggers itself, if you like) to reset the Rpi. – tlfong01 Jul 27 '19 at 06:00
  • Hi @berto, I just upgraded my Rpi3B+ stretch to Rpi4B buster. I followed your very detailed instructions again to set the 10-second watchdog timer and used the fork bomb to verify everything works. In other words, your instructions are good for both Rpi3 stretch and Rpi4 buster. Many thanks again for your help. – tlfong01 Jul 27 '19 at 06:04
  • I'm virtually certain setting a watchdog timer on my Pi corrupted the SD card and forced me to reinstall Raspbian – readyready15728 Feb 06 '21 at 04:42
  • @readyready15728 sorry to hear your SD card corrupted. The watchdog itself is simply a packet that is periodically sent to the system's main processor, and doesn't really interact with the SD card. Do you have a particular reason to believe it was the watchdog that caused the problem? I have a system running well over a year with the watchdog enabled and haven't had issues with the SD card. – berto Feb 08 '21 at 04:14
  • I can't think of any other explanation and have read elsewhere that watchdog timers can lead to this happening. – readyready15728 Feb 08 '21 at 23:18
  • I can see an issue where the watchdog reset, for some reason, triggered. For instance, if the system is under heavy enough load that the watchdog doesn't send the heartbeat packet, or if the watchdog refresh process gets killed (like an OOM condition) and is no longer running. In these cases, perhaps the card is being written to and the watchdog reset triggers when the card is not in a sync'd state. I'm sure it's possible the watchdog played a part in the corruption, though it still feels like a symptom, not the direct cause. – berto Feb 09 '21 at 19:39
  • Well, I'll see what happens now that I don't use the watchdog timer approach anymore but I think, regardless of what may have or have not happened, I think it's a very good idea to issue any advice about using said approach with the caveat that it could have a downside, and to suggest backing up all important data. – readyready15728 Feb 12 '21 at 04:37
  • That's fair. I suppose "backup your data" is still something that has to be stated over and over. My general experience with SD cards is that they are flaky and will die, so I don't ever expect them to be the source of truth for any data. – berto Feb 12 '21 at 18:44
6

Cutting power is a brute force method and has risks.

The conventional solution to lock-up problems is to use a watchdog.

There is a BCM hardware watchdog; if you want to start the hardware watchdog, include dtparam=watchdog=on in /boot/config.txt

In and of itself this does little, although it should restart the system if not "kicked" regularly. You can write code which opens /dev/watchdog to start the timer and then writes to it regularly to kick it.
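
A minimal sketch of such code follows. The function name, the 5-second interval and the fixed number of kicks are my own assumptions for illustration; a real monitor would loop for as long as its own health check passes. It relies on the standard Linux watchdog device behaviour: opening the device arms the timer, any write refreshes it, and writing the character V just before closing ("magic close") asks the driver to disarm it, if the driver supports that.

```python
import time


def kick_watchdog(path="/dev/watchdog", interval_s=5.0, kicks=3):
    """Open the watchdog device, kick it `kicks` times, then close cleanly."""
    # Opening the device starts the timer; buffering=0 makes every write
    # reach the driver immediately instead of sitting in a Python buffer.
    with open(path, "wb", buffering=0) as wd:
        for _ in range(kicks):
            wd.write(b"\0")  # any write refreshes the timer
            time.sleep(interval_s)
        # "Magic close": ask the driver to disarm the watchdog before the
        # file is closed, so a deliberate exit does not reboot the Pi.
        wd.write(b"V")
```

This needs root, and the device can only be opened by one process at a time, so it will not work while systemd or the watchdog daemon already holds it.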

There is also a watchdog daemon which you can configure to activate the watchdog; you should be able to start with sudo systemctl enable watchdog

PS Incidentally, if you want to pursue the brute force approach - don't bother cutting power - just pull the Reset pin (labeled RUN) low. This is equivalent to powering off then on again.
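
A rough sketch of how a second Pi could do this, under several assumptions that are mine and not tested here: one GPIO of the monitoring Pi is wired to the RUN pin of the frozen Pi, the two boards share a ground, and the gpio argument behaves like the RPi.GPIO module. The pin number used in the usage note and the 0.2-second hold time are placeholders.

```python
import time


def pulse_run_pin(gpio, pin, hold_s=0.2):
    """Pull the other Pi's RUN pin low for hold_s seconds, then release it.

    `gpio` is expected to behave like the RPi.GPIO module.
    """
    gpio.setmode(gpio.BCM)
    # Drive the line low: this holds the target Pi in reset.
    gpio.setup(pin, gpio.OUT, initial=gpio.LOW)
    time.sleep(hold_s)
    # Switch back to input (high impedance) so the target's own pull-up
    # raises RUN again and the board boots.
    gpio.setup(pin, gpio.IN)
    gpio.cleanup(pin)
```

On the monitoring Pi this would be called as pulse_run_pin(GPIO, 17) after import RPi.GPIO as GPIO, where 17 is a hypothetical BCM pin number. The line is released to an input rather than driven high, so the monitoring Pi never fights the pull-up on the other board.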

Milliways
3

Question

Remote Rpi's freeze from time to time. How to wake them up?

Answer

Update 2019jul27hkt1406

I recently upgraded my Rpi3B+ stretch to Rpi4B buster and again I followed @berto's tutorial to set the watchdog timer. I found everything works as smoothly as before. In other words, no changes to @berto's tutorial are needed when upgrading to Rpi4.

Last time I knew nothing about the watchdog timer thing. So it took me more than 3 hours to google to understand everything inside out (well, almost inside out). This time I know what is going on, and all the linux tricks, so it took me only a couple of minutes to complete @berto's tutorial.

2019jun18 Updates

After more thought, I concluded that my answer is coming to an end. My conclusion is that @berto's watchdog tutorial and experiment suggestion are good, and his answer is the real answer to the OP's question.

I did his suggested experiment successfully, verified results by the forkbomb program, and after a lot of googling and reading for more than 10 hours, I think I finally understood thoroughly the idea of watchdog timer.

Earlier I wrongly thought that I still needed to learn how to set the timer to 10 seconds or more. But as @berto says, 10 seconds is all that needs to be set. I also read that I can set the timer to as long as 16 seconds, and that the linux watchdog default is even one minute. But that is not critical.

I have removed all the long winded reading notes in the appendices, to make the answer shorter. I would suggest newbies not to try to understand all the details of watchdog, not to mention the much more complicated daemon SystemD, because our life is short, and those system things are too complicated for non professionals.

I would like to add two points to end my answer.

(1) There are many reasons for an Rpi to hang in a couple of days (but usually not months). Often it is not the application program's fault, but because of the drivers or library functions creating too much garbage, eg. sockets created, used but not properly disposed. If it is the application program itself making garbage, the program can do "garbage collection" and problem solved. But it is hard to remove garbage sockets which are not generated by the application program. So a watchdog timer is useful here.

(2) Other ways to stop garbage from using up resources include rebooting every now and then, by software or hardware. I think rebooting every morning, and also using a software switchable power supply to reset the system, adds another layer of protection. Using only one Rpi is not very safe. Using two Rpi's as each other's watchdog (passing messages over UART, for example) adds one more layer of protection. Another method I have not explored is using ESP8266 WiFi sockets. I hope I can try that later.
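
The two-Rpi idea boils down to each Rpi remembering when it last heard from the other. Here is a small sketch of just that bookkeeping; the class name, the 30-second timeout in the usage note and the explicit now arguments are hypothetical, and how the heartbeat message actually travels (UART, network, a GPIO line) is deliberately left out.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat from a peer and report when it goes stale."""

    def __init__(self, timeout_s, now=None):
        self.timeout_s = timeout_s
        self.last_beat = time.monotonic() if now is None else now

    def beat(self, now=None):
        """Record that a heartbeat message arrived from the peer."""
        self.last_beat = time.monotonic() if now is None else now

    def peer_is_stale(self, now=None):
        """True when no heartbeat has arrived within timeout_s seconds."""
        now = time.monotonic() if now is None else now
        return now - self.last_beat > self.timeout_s
```

The watching Rpi would create HeartbeatMonitor(30), call beat() whenever a message arrives, and reset or power cycle the other Rpi once peer_is_stale() returns True. The monotonic clock is used so that a clock adjustment on the watching Rpi does not cause a false alarm.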

This is the end of my answer. Cheers.

2019jun17 Updates

So I tried the fork bomb. The system rebooted after executing the program, in about 15 seconds.

[Image: fork bomb test results]

2019jun16 Updates

I found @berto's fork bomb program is a bit newbie scary. So I am learning Bash to find out what that fork bomb is doing. Basically it is just a function named ":", which is defined as a function calling itself two times, thus forking indefinitely, as fast as rabbits growing exponentially, using up all the resources, and crashing linux.
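
To get a feeling for how fast this grows without running a real fork bomb, here is a harmless arithmetic sketch. It only counts doublings and forks nothing; the limit of 32768 processes is just an illustrative number I picked, not a value measured on my Rpi.

```python
def generations_until_limit(process_limit):
    """Count doublings needed for 1 process to exceed process_limit."""
    processes, generations = 1, 0
    while processes <= process_limit:
        processes *= 2  # each copy is replaced by two new copies
        generations += 1
    return generations


print(generations_until_limit(32768))
```

So under this simple doubling model only a handful of generations are needed before any realistic process limit is hit, which is why the system stops responding almost immediately.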

[Image: fork bomb]

I have also found the following interesting version of forkbomb using Unicode symbols:

( ) { | & } ;

2019jun14/15 Updates

@thesnow suggests a very nice layered approach using a smart plug. I think the smart plug or smart IoT stuff is the way to go. However, I am a not so smart newbie in smart stuff, though I am keen to learn. So I am going to buy a smart plug, do some research, and improve my answer afterwards. For now, I have added some related learning resources in the reference section below.

I found @berto's suggestion of using the Rpi's hardware watchdog timer also very good. I have not played with any watchdog stuff before, so I am going to try it now. @berto's instructions are very detailed, but still a bit hard for me, because I don't know the commands "grep" and "dmesg" very well. So I googled and made some reading notes in the appendices below. Then I followed @berto's suggestion, and struggled a bit to complete part 1. I have not yet rebooted, because I need to take a break to digest things. Anyway, here is the screen capture.

[Image: watchdog_test_2019jun1501]

I rebooted and got the following dmesg:

[Image: watchdog 3]

I think I am going too fast and now need to take a break to first study more linux things, like systemd, before coming back to carry on the test on watchdog.

[Image: systemd architecture]

/ to continue, ...

The Answer

I have the same problem. I am building a rooftop garden with a couple of Rpi's each of which connects to various wireless stuff (BlueTooth, Wifi) sensors, relays, and solenoids. There are two huge motors near by, controlling big water tanks and lifts. The motors generate EMI and from time to time freeze nearby electronics things.

My plan is to use software switchable PSUs (Power Supply Units) to switch frozen Rpi's and other devices off and on. (Bluetooth devices freeze most often. The Bluetooth and other little devices do not have any software reset command or hardware reset pin, so powering their 5V Vcc off and on is a quick and dirty, but still safe, workaround.) In short, the Rpi's regularly watch each other and their devices, and POR (Power On Reset) any guy that has fallen asleep.

Of course I can also use a GPIO pin to trigger the Rpi's on-board hardware reset pin. But I am too lazy to do extra wiring, and too poor a hobbyist to afford professional/industrial grade non-stop system devices such as the SwitchDoc Labs Dual WatchDog Timer (see reference below).

I modify ordinary DC-DC (12V to 5V) PSUs so that any Rpi or MCP23x17 GPIO pin can power on/off the LM2596/LM2941 voltage regulator chip of the PSU. (The LM2941 can be used for 1A current switches, the LM2596 for a 5V 3A PSU. The on/off pin is also connected to a push button, for manual power on/off testing.)

Actually each of my 7 Rpi3B+'s is connected to a cheapy DS3231 Real Time Clock Module which has a hardware interrupt pin to reset PSU, Rpi, or other devices.

Whenever possible and practical I tie up all the devices' reset pins together (removing some of the pull up resistors, so not to overload the GPIO pin).

Now the external DS3231 RTC wakes up everybody in the morning, and switches off lights at midnight, so everybody goes to bed.

[Image: software switchable PSU]

[Image: software switch PSU]

[Image: software switch]

References

LM2596/LM2941 Based Software Resettable PSU / Current Switches - Rpi StkEx Discussion

Rpi Hardware watchdog Discussion

SwitchDoc Labs Dual WatchDog Timer

ATXRaspi R3 - LowPowerLab US$14.95

A hackable ESP8266 inside a smart plug Want to play with ESP8266 without worrying about the hardware? - Mat 2017aug06

Reverse Engineering 101 of the Xiaomi IoT ecosystem HITCON Community 2018 – Dennis Giese

Xiaomi WiFi socket + MiHome app

espHome [ESP8266/ESP32]

AliExpress WiFi Smart Plug

Smart device - Wikipedia

WiFi Garage Door Opener using ESP8266 - Ray Wang 2016may13

Appendices

Appendix A - WatchDog Timer Reading Notes

Watchdog timer - Wikipedia

Linux WatchDog Man Page

Linux Watchdog - General Tests

Appendix B - Linux commands grep and dmesg reading notes

Appendix C - systemd references

systemd System and Service Manager - FreeDeskTop

systemd - Wikipedia

Appendix D - Fork and Fork Bomb References

Fork (system call) Wikipedia

Appendix E - Bash Learning Notes

tlfong01
  • Such a great answer! Thanks also for the pictures. Glad that you didn't do it just for this question :-D So I guess what I need is the LM2596S PSU to connect to the GPIO as you said. I will try!!! Good that I still have my old soldering iron... – Jurudocs Jun 14 '19 at 08:55
  • @Jurudocs Thank you for your nice words. I cut and pasted, and modified my old answers for your question, so it did not take me much time. I am a PSU hobbyist, and I DIYed PSUs using LM2596 chips and inductor coils etc. But nowadays everything goes SMD and assembled modules are dirt cheap, so I have been lazy to "make" things. By the way, to mess around with the LM2596 PSU, you don't need to test by using Rpi GPIO. You can just test by hand! :) Good luck! – tlfong01 Jun 14 '19 at 09:15
  • I noticed you mentioned reading up on Systemd. While I definitely recommend you do that because it's a significant component to the way modern Linux systems work, fully understanding it is going to take a long time and not necessary to try out the watchdog. :) – berto Jun 15 '19 at 14:43
  • @berto, I agree it might take me a very long time to understand the complicated SystemD. As Poettering says: "[systemd] never finished, never complete, but tracking progress of technology". I remember Oliver Heaviside, saying: "Am I to refuse to eat because I do not fully understand the mechanism of digestion?" - https://en.wikiquote.org/wiki/Oliver_Heaviside So I will forget systemd now and come back to watchdog. Actually I need to learn Bash first, before I can understand the weird Bash script of Fork Bomb. – tlfong01 Jun 16 '19 at 06:30
  • The fork bomb line is pretty simple once you understand what you are looking at. It’s a function named `:` that calls itself recursively and puts a copy of itself in the background which also calls itself recursively. The Wikipedia page you have in your notes explains this further. – berto Jun 17 '19 at 02:01
  • Well, I was not aware that the symbol ":" can be a function name. In the beginning, I wrongly thought that the function has no name, a "lambda", in other words. I guess over 90% of the visitors in this forum don't understand what is the idea of recursion, not to mention double recursion used here. Recursion in mathematics is an algorithm that would come to an end and problem solved. In this case, there is no end. IT IS INCORRECT AND MISLEADING to call the function recursive. Function calling itself, is not recursion in full sense, or according to the rigorous mathematical definition. – tlfong01 Jun 17 '19 at 03:19
  • I found your older answer (6 years ago!) here: https://raspberrypi.stackexchange.com/questions/3732/watchdog-daemon-not-restarting-pi-after-fork-bomb Things are more complicated than I thought, so I will spend more time before defusing the bomb. :) – tlfong01 Jun 17 '19 at 03:43
  • I checked that everything was OK. So I executed the fork bomb. As you expected, the system rebooted in about 15 seconds. To summarize, I followed your nice tutorials and found everything good, though I don't quite understand what is going on. I need to spend more time to understand your commands, before I know how to set the watchdog timer. – tlfong01 Jun 17 '19 at 06:10
1

I have quite a few Pis. All of them except one ran flawlessly. The problem child would crash periodically and would never recover after a power outage without being power cycled again. I had it reboot itself every night via cron and that helped somewhat.

What fixed it though was taking the SD card and sensor hardware and putting them into another Pi. It has run without error ever since. Maybe you too have a hardware issue.

Wildbill
  • I didn't catch your second paragraph about the hardware problem. Did you mean that the SD card and sensor caused all the trouble, and replacing them solved the problem? – tlfong01 Jun 15 '19 at 02:44
  • No, The Pi itself was the problem. I had a spare one, so I transferred the SD card and the sensors to the spare and used it instead of the original. No problems since. – Wildbill Jun 16 '19 at 11:42
  • I see. So it is always a good idea to have a spare Rpi for swap troubleshooting. Perhaps the OP should also consider this. – tlfong01 Jun 16 '19 at 13:02
0

If you have wi-fi and just need to power off / power on, you could also consider using a smart plug. Amazon makes one for ~$25, you can power it on / off remotely and also set up timer routines if that's preferable. I've had a few for several months and they're quite reliable. You don't actually need an Echo or any other dedicated device. I use my smart phone. Amazon Smart Plug

Edit: I realize this doesn't provide a solution to the first part of the question, but if I had the prospect of a 2 hour drive if something went wrong I'd consider a layered approach.

thesnow
  • @thesnow, I appreciate very much your suggestion of a layered approach, with a smart plug at the top layer. Actually, for some months I have been trying to DIY a smart plug based on the ESP8266 WiFi controller. However, I found the ESP8266 with NodeMCU Lua has a very steep learning curve. It took the newbie, i.e. me, over 100 hours just to blink an LED (compared to less than one hour writing an Arduino or Rpi blinky program). So I sadly gave up and have now decided to cheat by buying an ESP8266 XiaoMi smart plug and modifying it. I am going to add your suggestion to my answer soon. Many thanks again! :) – tlfong01 Jun 15 '19 at 02:17