By heartbeat I mean the computing sense, not someone's body.
I want to create a device that monitors another program running on a Linux server. If this Linux program fails, the monitoring device will turn red, blow a horn or otherwise make the fact known. The device and server are within two metres of one another. The Linux machine's correct operation is important from a marketing perspective. There are no lives at stake though.
Clearly some form of communication will have to be established, and thus a protocol used/created. Something like RS422, Ethernet or USB running over twisted pairs or other suitable cabling. UDP or TCP perhaps? I think that the semantic content that needs to be transferred over the protocol is simply "yes" or "no". Reliability is the most important aspect of this project.
I've looked at the wiki page, and I've read the Aguilera, Chen & Toueg paper. I just wondered if there was any other practical advice I could get from here. Is there something already available off the shelf?
From an Arduino point of view the problem is rather simple: attach the device to the server (USB is probably the easiest way) and write a simple sketch that waits for a simple key to arrive over the serial interface every few seconds (this might be an increasing decimal number together with the word "OK"). If that key doesn't arrive within twice the expected time, blow the horn or turn on the light or whatever you think is appropriate.
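To make the timeout rule concrete, here is a minimal sketch of the receiving logic, written in Python rather than as an Arduino sketch so it can be tried on a PC before porting. The class name `HeartbeatWatchdog`, the exact `<number> OK` line format and the 5 second default interval are assumptions for illustration, not anything fixed by the posts above.

```python
import re
import time

# Assumed line format: a decimal counter, one space, then "OK", e.g. "42 OK".
LINE_RE = re.compile(r"^(\d+) OK$")


class HeartbeatWatchdog:
    """Raises an alarm when no valid heartbeat has arrived for twice the expected interval."""

    def __init__(self, expected_interval_s=5.0, clock=time.monotonic):
        self.timeout_s = 2 * expected_interval_s
        self.clock = clock            # injectable so the logic can be tested without waiting
        self.last_counter = None
        self.last_seen = clock()      # start the timer at power-up

    def feed(self, line):
        """Process one received line; return True if it counts as a fresh heartbeat."""
        match = LINE_RE.match(line.strip())
        if match is None:
            return False              # garbage on the line does not reset the timer
        counter = int(match.group(1))
        if counter == self.last_counter:
            return False              # same number again: sender may be stuck repeating itself
        self.last_counter = counter
        self.last_seen = self.clock()
        return True

    def alarm(self):
        """True when the horn or light should be on."""
        return self.clock() - self.last_seen > self.timeout_s
```

The counter is only required to change, not to grow, so a server restart that resets it to zero does not leave the alarm stuck on. Whether that is the right trade-off depends on how much you distrust the sender.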
The whole task of checking the health status of that "program" is definitely a job to be done on the Linux server and not on the alarming device (that's what the Arduino actually is). That way you also monitor the monitoring application because if that process dies for whatever reason it will be detected too.
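A rough sketch of that server-side half might look like the following. The health check here is only "does a process with this PID still exist", which is an assumption on my part; a real check would need to know what "working" means for the program. The function names are made up for illustration.

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID exists (Linux/POSIX only)."""
    try:
        os.kill(pid, 0)   # signal 0 delivers nothing; it only checks that the PID exists
    except ProcessLookupError:
        return False
    except PermissionError:
        return True       # the process exists but belongs to another user
    return True


def heartbeat_step(counter, is_healthy, write):
    """Run one monitoring cycle and return the counter to use next time.

    If the check passes, one "<counter> OK" line is written. If it fails,
    nothing is sent, so the alarming device times out on its own.
    """
    if is_healthy():
        write(f"{counter} OK\n")
        return counter + 1
    return counter
```

In use, `write` would be whatever sends text to the device (for example the `write` method of an open serial port, after encoding), and `heartbeat_step` would be called every few seconds from a loop. If this monitoring script itself dies, the heartbeats stop as well, which is the property described above.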
Is USB reliable in the Linux + Arduino configuration? I don't find it 100% so in my humble experience. Not every IDE upload to the boards works every time, the port numbers swap about on restarts and the output monitor can get confused occasionally. I'm not sure that I'd trust Linux + Arduino USB to hold up my trousers if I was going commando.
Might hardwired and dedicated serial be more trustworthy (RS232 or RS422)?
If it's true that I'm not the first to want such a monitoring arrangement, there must be existing protocols and perhaps even libraries? No?
When you delve into embedded watchdog design and multiple task monitoring, and read some of the NASA stuff, it's a bit more complicated than querying the server. And isn't that the wrong way around anyway? The server monitors itself and polls the alarming device (Arduino), no?
The "heartbeat" I have used in the past is to periodically send the date and time to all remote devices. The time period is up to you. The remote devices could actually use the time for some other purpose, but if the message does not appear within some time frame, the server is down.
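As an illustration of a date-and-time heartbeat, here is a small sketch that formats the current UTC time as a message, sends it as a UDP datagram, and parses it on the receiving side. UDP, the timestamp format and all function names are assumptions chosen for the example; the post above does not say which transport or format was used.

```python
import socket
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"   # assumed wire format, e.g. 2024-01-02T03:04:05Z


def time_message(now=None):
    """Build the heartbeat payload from the given (or current) UTC time."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now.strftime(TIME_FORMAT).encode("ascii")


def parse_time_message(data):
    """Return the timestamp carried by a heartbeat, or None if it is malformed."""
    try:
        parsed = datetime.strptime(data.decode("ascii"), TIME_FORMAT)
    except (UnicodeDecodeError, ValueError):
        return None
    return parsed.replace(tzinfo=timezone.utc)


def send_time(sock, addr, now=None):
    """Send one heartbeat datagram to addr, a (host, port) tuple."""
    sock.sendto(time_message(now), addr)


def open_udp_socket():
    """Create the UDP socket that send_time() expects."""
    return socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
```

The receiver would treat the server as down when no message that parses correctly has arrived within its chosen time frame. Since UDP can drop datagrams, that time frame should span several sending periods.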
cossoft:
If it's true that I'm not the first to want such a monitoring arrangement, there must be existing protocols and perhaps even libraries? No?
I suspect that many servers are successfully monitored without any Arduinos. Do some research.
The server monitors itself and polls the alarming device (Arduino), no?
It can't do that if the server has failed.
I can't see any difference between the server sending "I'm alright" messages to an Arduino and the Arduino (or any other computer, including another program running on the server) sending "Are you alright" messages to the server and expecting a response.
Based on your experience, over what 'stuff' would you send the time? Would you be happy to use Arduino's USB, or would a commercial outfit opt for something more robust like Ethernet or RS232/422?
pylon:
Does your server still have an available RS232 interface?
Well, there are various options. The motherboard has a comm port connector. Many do, they're just not visible externally. Or I can plug in a PCI RS232/422/485 card. They're < £100. Or there's the Integrated Lights-Out (iLO) Ethernet port for a network link to the Arduino + Ethernet shield, or a combined Arduino Eth board.
It's just that I'm not experienced with the reliability of long term Arduino USB linkage...
What sort of failure are you trying to monitor?
A hardware failure (motherboard or PSU) that takes down the entire server.
Network connections (various causes).
A specific program/service has not crashed?
If you want to monitor a specific program/service then how can you tell if it has stopped working or not?
cossoft:
It's just that I'm not experienced with the reliability of long term Arduino USB linkage...
You are facing the commonplace problem that more potential points of failure are created by adding systems to monitor the reliability of a process.
You need answers to these questions:
Without any monitoring in place what are the consequences of the main process failing and what is the probability of that?
With monitoring in place by how much will the duration of a typical main system failure be reduced?
With monitoring in place what are the consequences (and probabilities) of either {A} an undetected main system failure and {B} a false report of a main system failure?
Efforts to reduce the probability of main system failure - for example an automatic switch to a parallel system - may be a better use of your resources if a failure has significant consequences. This is why I suggested earlier that you research how other high reliability servers are managed.
I wrote a lengthy reply, but the site kindly logged me out & threw it away >:(
The takeaway was - decide how many nines you want and realize that you'll need to pay for them.
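To put numbers on "how many nines", the arithmetic is simple enough to sketch. The function name is made up; the only assumption is a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24   # 8760, ignoring leap years


def allowed_downtime_hours(nines):
    """Yearly downtime budget for an availability with this many nines (3 means 99.9%)."""
    unavailability = 10.0 ** (-nines)
    return unavailability * HOURS_PER_YEAR
```

Three nines (99.9%) allows roughly 8.76 hours of downtime a year, while five nines (99.999%) allows only about 5.3 minutes, which is why each extra nine tends to cost so much more than the last.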
See whether you can run in multiple Docker containers and oversee them with Kubernetes. For best results, nodes on multiple machines would give more reliability. Or just use a conventional load balancer against multiple physical boxes.
Run monit on the target server and it can restart processes that crash. For example, if httpd disappears, monit can restart it. It can also send alerts and warnings on excessive CPU utilization and/or load average, low free disk space, missing critical files, etc.
monit also runs on small Linux computers such as a Raspberry Pi. If the target server is running an important network service such as http(s), the monit server can periodically attempt to get index.html from the target server and send alerts if this fails. A Pi also has GPIO, I2C, and SPI so it can flash LEDs when alerts occur, monitor ambient temperature to detect air conditioner failures, etc.
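For anyone who wants to see what such a remote check amounts to, here is a minimal Python sketch of "try to fetch index.html and report success or failure". It is not how monit implements its check; the function name and the 5 second timeout are assumptions.

```python
import http.client
import urllib.error
import urllib.request


def http_ok(url, timeout_s=5.0):
    """Return True only if the URL answers with HTTP status 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Refused connections, timeouts, DNS failures, HTTP error statuses
        # and malformed URLs all count as "service not healthy".
        return False
```

A monitoring box would call something like `http_ok("http://target-server/index.html")` once per cycle (the host name is a placeholder) and raise the alert after one or more consecutive failures.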
On alerts, monit can execute a program instead of sending email, so people have figured out using curl/wget to interface to services with HTTP/REST interfaces such as slack and pushover.
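As a sketch of what such an alert program could do without curl, the following builds and sends a message to a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names are made up, and the webhook URL would be whatever Slack issued for your workspace.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Prepare a POST request carrying the alert text as a Slack webhook JSON payload."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(webhook_url, text, timeout_s=10.0):
    """Send the alert; returns True if the webhook answered with HTTP 200."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return resp.status == 200
```

Bear in mind that an alert sent this way depends on the network and on a third-party service, so it complements a local horn or light rather than replacing it.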