A Long Shot - Anyone Have Experience With *nix Signals in a Multi-Threaded App?

I seem to have run up against the limits of my (admittedly limited) knowledge in handling signals in a multi-threaded Linux app. I've found no help on the Linux forums I've tried, and I'm getting desperate. I'm sure I'm doing something stupid, but have exhausted everything I can think of to try to find the problem... Might there be one of the senior folk here with experience in this area? If so, PLEASE PM me...

Regards,
Ray L.

Sorry, no help from this quarter.

Maybe if you post your problem, you'll have a better chance :wink: It has been 15 years since I did anything with Linux (and signals), so I'm not sure I can help.

RayLivingston:
handling signals in a multi-threaded Linux app.

What does that mean?

I use Linux all the time but I would certainly not consider myself an expert. It seems to me that phrase could cover 17 different situations.

And what has it got to do with an Arduino?

...R

Robin2:
And what has it got to do with an Arduino?

That's why OP said it was a long shot and asked for PM :wink:

Hmmmm.... Posted a response, but it never showed up...

The problem I'm having is getting my application to reliably handle signals. The app is multi-threaded, using pthreads, so the "old-timey" signal handling methods don't work, as they are reliable only in a single-thread context. I've tried several other methods, all of which work fine in a stand-alone test program, but none of which work in the actual application. The current approach uses a dedicated thread for signal handling, which uses sigwait. It works.... sometimes. If I repeatedly send a signal to the app, typically 2-5 of them are captured by the OS, which prints the standard "Got unhandled signal XX" message. Eventually, my handler thread WILL capture the signal, but won't fully handle it. For example, sending repeated SIGTERMs will get several "Got unhandled signal" messages, then finally my handler will catch one, print its message, and SHOULD close the app, but does not. The very NEXT SIGTERM will correctly close the app! Very confusing...

I have tried to mask signals to all but the signal handler thread, but seem to be unable to make that work properly either.

Regards,
Ray L.

Robin2:
What does that mean?

I use Linux all the time but I would certainly not consider myself an expert. It seems to me that phrase could cover 17 different situations.

And what has it got to do with an Arduino?

...R

The term "signal", in a *nix context, has a very specific, well-defined meaning, as it has since the early days of Unix in the '70s. It is the mechanism by which all *nix OSes deal with asynchronous events, like memory faults, access errors, loss of communications, loss of the controlling terminal, and many other events. Signals are roughly the equivalent of "exceptions" in modern C/C++/C#/etc. languages, but processed at the OS level.
Regards,
Ray L.

RayLivingston:
The term “signal”, in a *nix context, has a very specific, well-defined meaning,

Thanks for clarifying.

I did not know that was the context in which you were using the phrase.

And that stuff is well above my pay grade.

…R

Robin2:
And that stuff is well above my pay grade.

Apparently, mine too! :slight_smile:
Regards,
Ray L.

Since you posted code on the RaspberryPi forum why didn't you post it here or at least link to it?

I've used pthreads before (some time ago), but probably not so signal heavy and not on Linux.

Looking at the code you posted at Fri Aug 10, 2018 3:03 am, the first thing that jumps out at me is that you are sharing unprotected variables between threads. What you have might work on a single-core machine, but on a multi-core machine with multiple caches you are begging for trouble.

You need to add a memory barrier to ensure cache coherence; you can do so by use of a mutex. Maybe even declare volatile as well, but I'm not sure if it will add anything more.
That might fix the unusual termination behaviour.
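
A minimal sketch of the mutex suggestion, assuming a flag like Terminate that one thread sets and another polls. The names terminate_flag, request_terminate and terminate_requested are made up for illustration and are not from the posted code. Locking and unlocking a pthread mutex is one of the operations POSIX defines as synchronising memory between threads, so the reader should see the write once it has taken the lock after the writer released it.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the "Terminate" variable discussed above. */
static bool terminate_flag = false;
static pthread_mutex_t terminate_lock = PTHREAD_MUTEX_INITIALIZER;

/* Would be called by the signal-handling thread when SIGTERM arrives. */
void request_terminate(void)
{
    pthread_mutex_lock(&terminate_lock);
    terminate_flag = true;
    pthread_mutex_unlock(&terminate_lock);
}

/* Would be called by the main thread on each pass through its loop. */
bool terminate_requested(void)
{
    pthread_mutex_lock(&terminate_lock);
    bool value = terminate_flag;
    pthread_mutex_unlock(&terminate_lock);
    return value;
}
```

The main loop would then test terminate_requested() instead of reading the variable directly.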

Since you posted code on the RaspberryPi forum why didn't you post it here or at least link to it?

Because I was only looking for someone who might be able/willing to help, without creating an RPi thread on the Arduino forum. I figured I was pushing my luck far enough by even making the request...

Looking at the code you posted at Fri Aug 10, 2018 3:03 am, the first thing that jumps out at me is that you are sharing unprotected variables between threads. What you have might work on a single-core machine, but on a multi-core machine with multiple caches you are begging for trouble.

The only "shared variables" are the one variable "Terminate", which is only EVER written (to 1), by the signal handler thread, only in response to a SIGTERM, and is only read by the main thread. I don't see the problem there. It is declared volatile on the version I actually tested. The test program works perfectly, every time, but in the full app context, NONE of the signals work more than occasionally, including the ones that access no data whatsoever.
Regards,
Ray L.

Have you run this example from the man page: pthread_sigmask(3) - Linux manual page

RayLivingston:
The only "shared variables" are the one variable "Terminate", which is only EVER written (to 1), by the signal handler thread, only in response to a SIGTERM, and is only read by the main thread. I don't see the problem there. It is declared volatile on the version I actually tested. The test program works perfectly, every time, but in the full app context, NONE of the signals work more than occasionally, including the ones that access no data whatsoever.

You also share signal_mask but, since it is initialised before the threads are created and since pthread_create also triggers a memory barrier, it should be OK.

Look, the fact remains, multi-threaded stuff is tricky and so you have to be scrupulous. Casually sharing even an int between threads is risky. In your problem statement you said: My handler finally gets the signal, prints its message, and sets Terminate. BUT, it does NOT terminate until I send one MORE SIGTERM! And that is exactly what could happen if Terminate is set, but sits in the cache on one core, until the cache is at some point later synchronised and made visible to the main thread.

Are you running this on a multi-core device?

At the end of the day, if you are adamant that your test program is flawless, then the problem must be in the full app's code which we haven't seen yet.

arduino_new:
Have you run this example from the man page: pthread_sigmask(3) - Linux manual page

That is virtually identical to my test program, and, I believe, is the code I started with.
Regards,
Ray L.

arduarn:
You also share signal_mask but, since it is initialised before the threads are created and since pthread_create also triggers a memory barrier, it should be OK.

Look, the fact remains, multi-threaded stuff is tricky and so you have to be scrupulous. Casually sharing even an int between threads is risky. In your problem statement you said: My handler finally gets the signal, prints its message, and sets Terminate. BUT, it does NOT terminate until I send one MORE SIGTERM! And that is exactly what could happen if Terminate is set, but sits in the cache on one core, until the cache is at some point later synchronised and made visible to the main thread.

Are you running this on a multi-core device?

At the end of the day, if you are adamant that your test program is flawless, then the problem must be in the full app's code which we haven't seen yet.

Unless I understand a lot less than I think I do (very likely).... sharing a variable, multi-core processor or not, is perfectly safe provided:

  1. The variable is declared as volatile
  2. Accesses are atomic
  3. Only one thread ever writes it

Volatile should negate any cache effects. I am running an RPi3 which is a 4-core ARM, so even 32-bit accesses are atomic (assuming the compiler is not stupid enough to mis-align them). I have used the above rules, and they have worked flawlessly, for many years on all kinds of processors.

The app in question kicks off lots of threads, and they're often sharing data, with no problems, and there are only perhaps 3 mutexes in the entire thing, for those places where the rules cannot be enforced (in particular when using the accept() call to respond to socket requests, each of which kicks off a new handler thread).

I would not be at all surprised to find I am doing something wrong, but I don't even know where to look, when the signal handling thread is not even being called! Posting the whole program would be pointless, even if I could do it, because it is a gigantic, and complex, application spread across multiple RPis.

What seems to be a likely problem, and something that is not at all clear to me, is the blocking of signals to the various threads. When a signal occurs, the OS will send it, more or less randomly, to any one of the threads for which that signal is masked. Right now, I suspect that is ALL threads. I've tried un-blocking the signals in all but the handler thread, but that was unsuccessful, likely because I mis-understand the process and terminology. It appears to me that "blocking" refers to blocking the OS from handling the signal, in which case I would want to un-block in all threads EXCEPT the handler thread, as follows:

	static sigset_t		my_mask;
	int			rc;

	sigemptyset(&my_mask);
	sigaddset(&my_mask, SIGINT);
	sigaddset(&my_mask, SIGTERM);
	sigaddset(&my_mask, SIGCONT);
	sigaddset(&my_mask, SIGSEGV);
	rc = pthread_sigmask(SIG_UNBLOCK, &my_mask, NULL);
	if (rc != 0)
	{
		logprintf("Failed in pthread_sigmask in signal_thread!\n");
		exit(1);
	}

then block in the signal handler thread, as follows:

	static sigset_t		my_mask;
	int			rc;

	sigemptyset(&my_mask);
	sigaddset(&my_mask, SIGINT);
	sigaddset(&my_mask, SIGTERM);
	sigaddset(&my_mask, SIGCONT);
	sigaddset(&my_mask, SIGSEGV);
	rc = pthread_sigmask(SIG_BLOCK, &my_mask, NULL);
	if (rc != 0)
	{
		logprintf("Failed in pthread_sigmask in signal_thread!\n");
		exit(1);
	}

Do you know if that is correct?

It's also not at all clear to me if/how/from whom threads "inherit" their signal handling. I've read a lot of conflicting information (one of the joys of the Internet - you can find ANY answer you want!).

Regards,
Ray L.

If the example from the official source does not run correctly for you, then you have problems somewhere else.


Your 1) and 3) assumptions are wrong.

  1. The volatile keyword has no effect on concurrent access to data.
  3. You get a data race whenever multiple threads operate on the same data without synchronisation. Having only one writer does not make it safe if there are one or more readers.

From: Blocking Signal

Blocking a signal means telling the operating system to hold it and deliver it later. Generally, a program does not block signals indefinitely—it might as well ignore them by setting their actions to SIG_IGN. But it is useful to block signals briefly, to prevent them from interrupting sensitive operations.


By default, a sub-thread inherits its signal mask from the thread that creates it. So, if A is blocking SIGINT and A creates B, then B will also block SIGINT.

I'll maybe take another look tomorrow evening, but just to quickly answer:

You were right with your test program: you want to block signals in ALL the threads (by setting the signal mask in main since subthreads inherit) to disable the normal signal delivery. Then you use sigwait to poll what I think amounts to a queue of pending signals.

Your 1) and 3) assumptions are wrong.

  1. The volatile keyword has no effect on concurrent access to data.
  3. You get a data race whenever multiple threads operate on the same data without synchronisation. Having only one writer does not make it safe if there are one or more readers.

IF the data in question is declared volatile, so the compiler will ALWAYS read the current value, NOT the cached value…
IF the data in question can be read/written atomically (an 8-bit quantity, for example)
IF only ONE AND ONLY ONE thread EVER changes the value of that byte…
There is NO chance of data corruption, no matter how many threads READ that data. No race conditions are possible. If you believe otherwise, what situation do you believe COULD cause a problem?

By default, a sub-thread inherits its signal mask from the thread that creates it. So, if A is blocking SIGINT and A creates B, then B will also block SIGINT.

Looking at the example you linked to earlier, which is the model for my current attempt:

       #include <pthread.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <signal.h>
       #include <errno.h>

       /* Simple error handling functions */

       #define handle_error_en(en, msg) \
               do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)

       static void *
       sig_thread(void *arg)
       {
           sigset_t *set = arg;
           int s, sig;

           for (;;) {
               s = sigwait(set, &sig);
               if (s != 0)
                   handle_error_en(s, "sigwait");
               printf("Signal handling thread got signal %d\n", sig);
           }
       }

       int
       main(int argc, char *argv[])
       {
           pthread_t thread;
           sigset_t set;
           int s;

           /* Block SIGQUIT and SIGUSR1; other threads created by main()
              will inherit a copy of the signal mask. */

           sigemptyset(&set);
           sigaddset(&set, SIGQUIT);
           sigaddset(&set, SIGUSR1);
           s = pthread_sigmask(SIG_BLOCK, &set, NULL);
           if (s != 0)
               handle_error_en(s, "pthread_sigmask");

           s = pthread_create(&thread, NULL, &sig_thread, (void *) &set);
           if (s != 0)
               handle_error_en(s, "pthread_create");

           /* Main thread carries on to create other threads and/or do
              other work */

           pause();            /* Dummy pause so we can test program */
       }

Note the signals in question are BLOCKED in main, THEN the signal handler thread is launched. So, the signals are BLOCKED for ALL threads that are part of the application, and the ONLY place they should be getting processed is in the signal_thread function, which uses sigwait to read the queue. All works just fine in the above simple test program.
When I do that in my application, MOST signals get handled by the OS, but they do NOT exhibit their default behaviors. For example, a SIGTERM should always terminate the main process, killing the entire program. Instead, most of the time, I get only “Got unhandled signal 15” on the console, and the application continues to run as if nothing happened. After several SIGTERMs, one eventually makes it to signal_thread, and it prints its message, but the execution continues, until a SECOND one makes it to the handler.
If the signals are being BLOCKED, as specified by the pthread_sigmask call in main(), HOW is this happening? I do not set or change signal masks anywhere else in any of my code. Could it be happening in some library? I am using only standard libraries for networking, console access, math, etc.
Regards,
Ray L.

RayLivingston:
IF the data in question is declared volatile, so the compiler will ALWAYS read the current value, NOT the cached value...
IF the data in question can be read/written atomically (an 8-bit quantity, for example)
IF only ONE AND ONLY ONE thread EVER changes the value of that byte...
There is NO chance of data corruption, no matter how many threads READ that data. No race conditions are possible. If you believe otherwise, what situation do you believe COULD cause a problem?

Nope.
It seems like you misunderstand race condition. Look at this simple code:

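#include <pthread.h>
#include <stdio.h>

volatile int val = 1;

/* Both threads modify the same variable with no synchronisation. */
void *threadA(void *arg)
{
   val++;
   return NULL;
}

void *threadB(void *arg)
{
   val++;
   printf("%d\n", val);
   return NULL;
}

int main(void)
{
   pthread_t a, b;

   pthread_create(&a, NULL, threadA, NULL);
   pthread_create(&b, NULL, threadB, NULL);
   pthread_join(a, NULL);
   pthread_join(b, NULL);
   return 0;
}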

Could you guess what value will be printed when threadB is executed?
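
For contrast, a sketch of the same kind of increment done with a C11 atomic, where the outcome is predictable. The point being illustrated is that val++ is a read, an add and a write, and volatile does not make those three steps indivisible, whereas atomic_fetch_add does. The names counter, increment_many and run_atomic_counter_demo are invented for this example.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define INCREMENTS_PER_THREAD 100000

static atomic_int counter = 0;

static void *increment_many(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++)
        atomic_fetch_add(&counter, 1);  /* one indivisible read-modify-write */
    return NULL;
}

/* Hypothetical demo: returns the final count after two threads finish. */
int run_atomic_counter_demo(void)
{
    pthread_t a, b;

    atomic_store(&counter, 0);
    if (pthread_create(&a, NULL, increment_many, NULL) != 0)
        return -1;
    if (pthread_create(&b, NULL, increment_many, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&counter);
}
```

With two threads and 100000 increments each, no update can be lost, so the function returns 200000. Swapping atomic_int for volatile int and atomic_fetch_add for ++ removes that guarantee.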

RayLivingston:
Note the signals in question are BLOCKED in main, THEN the signal handler thread is launched. So, the signals are BLOCKED for ALL threads that are part of the application, and the ONLY place they should be getting processed is in the signal_thread function, which uses sigwait to read the queue. All works just fine in the above simple test program.
When I do that in my application, MOST signals get handled by the OS, but they do NOT exhibit their default behaviors. For example, a SIGTERM should always terminate the main process, killing the entire program. Instead, most of the time, I get only "Got unhandled signal 15" on the console, and the application continues to run as if nothing happened. After several SIGTERMs, one eventually makes it to signal_thread, and it prints its message, but the execution continues, until a SECOND one makes it to the handler.
If the signals are being BLOCKED, as specified by the pthread_sigmask call in main(), HOW is this happening? I do not set or change signal masks anywhere else in any of my code. Could it be happening in some library? I am using only standard libraries for networking, console access, math, etc.
Regards,
Ray L.

I don't understand this part. Are signal_thread and the handler two different functions?
SIGTERM should terminate your program unless it is blocked, or ignored, or handled.

Could you guess what value will be printed when threadB is executed?

Of course not, the two threads are running asynchronously. But it does not matter, if the writer thread only ever writes the value to 1, and never to 0. The reader thread will always read the latest valid value, and the writer thread can ONLY change the value when the reader thread is NOT reading it, because all accesses are atomic.

I don't understand this part. Are signal_thread and the handler two different functions?

Yes, signal_thread IS the one and only signal handler.

SIGTERM should terminate your program unless it is blocked, or ignored, or handled.

Should, but doesn’t in this case. I block all signals of interest in main, before any threads are created. I never block or unblock anywhere else in my code. The ONLY place where the signals are “read” is the sigwait call in the signal handler thread. Yet, most, but not all, signals are trapped by someone/something else, and only occasionally does one get to my signal handler.

Regards,
Ray L.