Multiple SoftwareSerial issue

Hi folks,

I've given myself quite a challenge: I want multiple Arduinos to talk to as many as 3 other Arduinos each, at the same time, both ways, over 1 cable.
The 1 cable isn't an issue: if you switch the TX pin to input mode whenever it's not transmitting, everything seems to work as usual, without requiring extra hardware components.
Most people will say that multiple simultaneous connections aren't possible, because an ATmega only has hardware support for one, but it's actually coming along quite well using SoftwareSerial; it's just slow. A big part of why it's slow is a problem I can't figure out, so I need some fresh input.

The reason I'm trying to do this with simple Arduinos is cost: I want to build a big network.

Concept
I'll briefly try to explain what I'm doing at the moment. As you probably know, only one SoftwareSerial can be listening at a time, so I switch the listening connection at a slightly random interval. Because I can't predict the timing of the other Arduinos, the randomness makes sure that all the connections overlap from time to time.
This means an Arduino is talking to a "wall" from time to time, so I wrote a little library class that handles a connection, a kind of layer on top of SoftwareSerial. Just like SoftwareSerial keeps a buffer for incoming messages, this class keeps a buffer (queue) for outgoing messages. When a message is sent, it stays in the buffer until a confirmation message arrives; until then the class tries to resend it at a random interval.
Every cycle an update method is called, which tries to send a message and checks for incoming messages. When there is a new message and it's not a confirmation, it sends a confirmation back. These confirmations contain the checksum of the original message so that they can be checked.
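
To make the outgoing buffer idea concrete, here is a minimal sketch of a queue that keeps a message until a confirmation with the matching checksum comes in. This is not the code from the attachment: the class name `OutgoingQueue`, the capacity of 8 and the `checksumOf` rule are all assumptions made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical checksum; the real one lives in the attached library.
uint8_t checksumOf(uint8_t msg) {
    return static_cast<uint8_t>((msg ^ (msg >> 4)) & 0x0F);
}

// Fixed-size ring buffer of unconfirmed outgoing messages (hypothetical class).
class OutgoingQueue {
public:
    bool push(uint8_t msg) {
        if (count_ == kCapacity) return false;  // queue full, message refused
        buf_[(head_ + count_) % kCapacity] = msg;
        ++count_;
        return true;
    }
    bool empty() const { return count_ == 0; }
    // The message to (re)send; only call this when the queue is not empty.
    uint8_t pending() const { return buf_[head_]; }
    // Called when a confirmation arrives; removes the message only on a match.
    bool confirm(uint8_t receivedChecksum) {
        if (empty() || receivedChecksum != checksumOf(pending())) return false;
        head_ = (head_ + 1) % kCapacity;
        --count_;
        return true;
    }
private:
    static const size_t kCapacity = 8;  // assumed size
    uint8_t buf_[kCapacity] = {};
    size_t head_ = 0;
    size_t count_ = 0;
};
```

The update method would resend `pending()` at its random interval and call `confirm()` for every incoming confirmation, so a wrong or corrupted confirmation simply leaves the message in the queue.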

There are a few extra tricks, but this is the gist. For example, an Arduino will "wait" after sending a message (meaning it doesn't send other messages and doesn't switch to another connection) until there is a confirmation or the maximum waiting time expires. A message, consisting of a type, a value and a checksum, is packed into one byte.
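
For readers who want to picture the one-byte message, here is a minimal sketch of packing a type, a value and a checksum into a single byte. The bit layout (2 bits type, 4 bits value, 2 bits checksum) and the XOR-fold checksum are assumptions for illustration only; the attached code may divide the byte differently.

```cpp
#include <cstdint>

// Hypothetical layout: [type:2][value:4][checksum:2].
struct Message {
    uint8_t type;   // 0..3
    uint8_t value;  // 0..15
};

// Hypothetical 2-bit checksum: XOR-fold the 6 payload bits into 2 bits.
uint8_t checksum2(uint8_t type, uint8_t value) {
    uint8_t payload = static_cast<uint8_t>(((type & 0x03) << 4) | (value & 0x0F));
    return static_cast<uint8_t>((payload ^ (payload >> 2) ^ (payload >> 4)) & 0x03);
}

uint8_t encode(const Message& m) {
    return static_cast<uint8_t>(((m.type & 0x03) << 6) |
                                ((m.value & 0x0F) << 2) |
                                checksum2(m.type, m.value));
}

// Returns false when the checksum does not match the type and value.
bool decode(uint8_t raw, Message& out) {
    out.type = (raw >> 6) & 0x03;
    out.value = (raw >> 2) & 0x0F;
    return (raw & 0x03) == checksum2(out.type, out.value);
}
```

With this particular fold, any single flipped bit in the byte makes `decode` return false, but two flipped bits can cancel out, so a 2-bit checksum only catches some line errors.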

Main problem
The main problem at the moment is that sometimes the confirmation message isn't received or transmitted properly. When I hook up two computers I can see that the main message is transmitted and received and that the confirmation message is sent, but the confirmation is not received. And I'm pretty sure both parties have enough time to send or receive this confirmation. This would be normal for the normal messages, but the confirmation messages should almost always arrive because the sender waits for the confirmation.

Test setup
I've tried to simplify the setup for testing purposes. Two Arduinos are each connected to a potentiometer; when you turn one, it sends a message and the brightness of the other Arduino's LED should change. I reduced the potentiometer range to 4 positions.
I ran a test and counted how many times Arduino A sent a message and how many times Arduino B received it. The first column is the value that was sent.
Value   A sent   B received
1       25x      9x
2       1x       1x
3       1x       1x
2       1x       1x
3       1x       1x
2       42x      12x
1       1x       1x
0       10x      1x
1       50x      20x

I've added the code. Below is an illustration of my simple hardware setup.

(I'm actually using Ottantottos instead of Arduinos)

More information on this project:
http://peteruithoven.nl/resilient-network

Thanks in advance for your input!
Please let me know if something is unclear.

attachment.zip (302 KB)

The terms "resilient network" and "software serial" should never appear in the same sentence (unless of course that sentence is saying that the terms "resilient network" and "software serial" should never appear in the same sentence :))

So is this a serious attempt to make a resilient network or just for the table lamp installation art?

Also, if the nodes simply blurt bytes out at random intervals and the line is not open-collector or something equivalent, how on earth do you avoid contention between the Tx pins?


Rob

Hi Rob,

The second, it's an art installation that tries to tell people something about resilience. :wink:
How it works technically should stay hidden as much as possible.

I don't really avoid contention, but by using a lot of randomness (for switching between connections, for resending a message, etc.) it should work out often enough. This seems to work mostly, although there seems to be an issue with returning confirmations, even though both parties should handle this correctly.

The second, it's an art installation

That's OK, if this was a control network for a factory it would be a different story.

I don't really avoid contention,

That's not OK. It means that you regularly have one chip driving the line HIGH and another driving it LOW. That's a bad thing that will eventually blow something up.

If you want to do this you really have to use a physical level that can tolerate such things. For example if you use open-collector outputs with a single pullup resistor then a LOW overrides a HIGH with no harm done.

There are many ways to do this but connecting push-pull outputs together with no way to sync them is not one.


Rob

I want to build a big network.

This idea will not scale up very well because the clash detection is destructive, meaning that when you get a clash BOTH sides fail. You will reach a point where most packets fail and the whole thing will bog down.
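
As a rough illustration of how quickly clashes pile up, here is a small sketch using a simplified model that is my assumption, not a measurement of this network: every node that wants to send picks one of a fixed number of time slots at random, and a clash happens whenever two nodes pick the same slot. The function name `clashProbability` is made up for this example.

```cpp
// Probability that at least two of `nodes` senders pick the same slot
// when each chooses uniformly from `slots` slots (assumed model).
double clashProbability(int nodes, int slots) {
    if (nodes > slots) return 1.0;  // more senders than slots: a clash is certain
    double allDistinct = 1.0;
    for (int i = 0; i < nodes; ++i) {
        allDistinct *= static_cast<double>(slots - i) / slots;
    }
    return 1.0 - allDistinct;
}
```

Under this model, with 10 slots, two senders clash about 10% of the time and four senders about 50% of the time, and because each clash is destructive every sender involved has to retry.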

I still haven't looked at the code, just working with general principles here.


Rob

Hi Rob,

Graynomad:
If you want to do this you really have to use a physical level that can tolerate such things. For example if you use open-collector outputs with a single pullup resistor then a LOW overrides a HIGH with no harm done.

There are many ways to do this but connecting push-pull outputs together with no way to sync them is not one.

Thanks, good point. I'll look into my hardware to try to fix that, after googling some of those terms :wink:

Graynomad:

I want to build a big network.

This idea will not scale up very well because the clash detection is destructive, meaning that when you get a clash BOTH sides fail. You will reach a point where most packets fail and the whole thing will bog down.

Can you explain why you think it will bog down? I see in my tests that once the first communication goes wrong, it takes at least 5 times for it to work again. Is that what you mean? I don't really understand why this happens, why it won't just fail once and work the second time. Do you?

I have another "protocol" in mind that might work better:

I keep the random communication phases so that connections will overlap to communicate. Before it will start to listen it will transmit that it will be listening. When a unit gets this message they will start transmitting a part of their messages. Units will then switch again when they didn't receive messages for a set amount of time.

Would this be a better approach?

once the first communication goes wrong, it takes at least 5 times for it to work again. Is that what you mean?

Yes, that's the sort of thing.

As I understand it you wait until the end of the packet to detect the clash then wait a random time and retry. Depending on the length of the packets and the frequency at which they are being sent the likelihood of a clash may be very high. The more nodes the higher the chances of a clash until you spend all your time getting errors.

why it won't just fail once and work the second time.

Possibly because the delays aren't enough. How long do you delay after a clash and how long are the packets?

The delays not only have to be different but different by at least the length of a packet or you will get another clash.

This is a real problem with only detecting clashes at the packet level. Detecting at the byte level is MUCH better and at the bit level is the best. Having destructive clashing makes it even worse because you lose all packets being transmitted.
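
The point about delays needing to differ by at least a packet length can be sketched as follows. The baud rate is an assumption (the thread doesn't state it), and the helper names are made up; the 10 bits per byte follow from one start bit, 8 data bits and one stop bit.

```cpp
#include <cstdint>

// Time on the wire for one byte: start bit + 8 data bits + stop bit.
uint32_t byteTimeMicros(uint32_t baud) {
    return static_cast<uint32_t>(10000000UL / baud);
}

// Two retries clash again when their start times are closer than one packet.
bool retriesClash(uint32_t delayA, uint32_t delayB, uint32_t packetMicros) {
    uint32_t diff = delayA > delayB ? delayA - delayB : delayB - delayA;
    return diff < packetMicros;
}

// Slotted backoff: delays are whole multiples of a slot that is at least one
// packet long, so two nodes either pick the same slot or miss each other.
uint32_t backoffMicros(uint32_t slot, uint32_t slotMicros) {
    return slot * slotMicros;
}
```

For example, at an assumed 9600 baud a one-byte packet takes about 1041 microseconds, so two nodes that both wait exactly 80 ms clash again, while nodes that wait 2 and 3 slots of 1041 microseconds do not. On an Arduino the slot number could come from `random(maxSlots)`.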

I don't follow your second idea right now but it's 2AM here and the old grey matter is not working that well. I'll read it again tomorrow.


Rob

Graynomad:

once the first communication goes wrong, it takes at least 5 times for it to work again. Is that what you mean?

Yes, that's the sort of thing.

As I understand it you wait until the end of the packet to detect the clash then wait a random time and retry. Depending on the length of the packets and the frequency at which they are being sent the likelihood of a clash may be very high. The more nodes the higher the chances of a clash until you spend all your time getting errors.

I see that that part of my code is a bit messed up. I used to use a random resendTime to retry sending the message as long as there was no confirmation, but now I just wait a fixed period after sending something and then retry. I'll make this random again.

why it won't just fail once and work the second time.

Possibly because the delays aren't enough. How long do you delay after a clash and how long are the packets?

The delays not only have to be different but different by at least the length of a packet or you will get another clash.

This is a real problem with only detecting clashes at the packet level. Detecting at the byte level is MUCH better and at the bit level is the best. Having destructive clashing makes it even worse because you lose all packets being transmitted.

I don't follow your second idea right now but it's 2AM here and the old grey matter is not working that well. I'll read it again tomorrow.

At the moment it will wait 80 milliseconds after sending a message before it will retry; this should be more than enough, but I agree it should be random.
My messages are always one byte, so I hope that helps. Checking at the bit level is a bit out of my reach, knowledge-wise.

I'll make sure it resends randomly again this evening and I'll check how long it should take for a message to get confirmed.

Thanks for your input so far. I'm curious what you think of the second possible approach. Let me know if I need to explain it better.

At the moment it will wait 80 milliseconds after sending a message before it will retry,

Waiting 80 ms before sending a single byte should be more than enough, but only if the nodes stagger the delay. If they all use 80 ms it will never work until they drift apart due to small timing differences.

I'll make this random again.

Yep.

it's 2AM here and the old grey matter is not working that well.

My brain's working OK now and I still don't understand plan B.

Before it will start to listen it will transmit that it will be listening.

You don't see the problem here? You are going to transmit to tell everyone you are going to listen. The damage is done once you transmit.

Checking at the bit level is a bit out of my reach, knowledge-wise.

It would require some very tight and low-level code, even then there are traps.

I think if you change the physical level to something that can handle the clashing and simply read back the byte you sent to make sure it's the same you are 90% of the way there.

There are still potential problems with two or more transmissions being out of phase at the bit level. A while back I designed something that I think would have got around this, but it would have required some serious coding and I never implemented it.
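
A minimal sketch of the read-back idea, under two assumptions: the line is open-collector (a 0 from any node wins) and the clashing bytes are perfectly bit-aligned, which sidesteps the phase problem mentioned above. The function names are made up for illustration.

```cpp
#include <cstdint>

// On an open-collector line a 0 from any node wins, so bit-aligned bytes
// sent at the same time combine with a bitwise AND.
uint8_t busByte(const uint8_t* sent, int count) {
    uint8_t line = 0xFF;  // idle level: pulled high
    for (int i = 0; i < count; ++i) {
        line &= sent[i];
    }
    return line;
}

// A sender treats its byte as delivered only when the read-back matches.
bool sentCleanly(uint8_t mine, uint8_t readBack) {
    return mine == readBack;
}
```

A node that sends alone reads back its own byte, while two nodes sending 0xA5 and 0x3C at once both read back 0x24 and know they clashed. As far as I know SoftwareSerial cannot receive while it is transmitting, so on an Arduino the read-back would probably need separate code.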


Rob

Hi Rob,

Doing some tests with my new code, the confirmation takes somewhere between 15 and 40 milliseconds.

Graynomad:

Before it will start to listen it will transmit that it will be listening.

You don't see the problem here? You are going to transmit to tell everyone you are going to listen. The damage is done once you transmit.

One aspect of this plan B is that every unit (lamp) is connected to a maximum of 3 other units, each through its own SoftwareSerial. In software it switches the connection it's listening to at a random interval (because only one SoftwareSerial can be listening at a time; random so that all the connections overlap once in a while). When a unit starts listening, it sends a message over that connection (to that one neighbor) saying that it's listening. It then waits a set time for messages, or a set time after the last received message, before it switches away. When a unit receives the message that the other side is listening, it sends its queued messages (probably limited to a certain number).
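
The "switch away after a quiet period" rule could look roughly like this. It is only a sketch: the struct, its field names and the timeout value are assumptions, and times are meant as `millis()`-style 32-bit counters.

```cpp
#include <cstdint>

// Hypothetical bookkeeping for the connection that is currently listening.
struct ListenWindow {
    uint32_t lastActivityAt;  // when listening started, or the last message arrived
    uint32_t idleTimeout;     // assumed: how long to wait without traffic
};

// True when no message has arrived for idleTimeout milliseconds.
// Unsigned subtraction keeps this correct when the millisecond counter wraps.
bool shouldSwitch(const ListenWindow& w, uint32_t now) {
    return static_cast<uint32_t>(now - w.lastActivityAt) >= w.idleTimeout;
}
```

Each received message would set `lastActivityAt` to the current time, so a neighbor that keeps sending holds the window open and a silent one is dropped after the timeout.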

Does this solve that problem? I would really appreciate it if you would explain the problem a bit further.

Graynomad:
I think if you change the physical level to something that can handle the clashing and simply read back the byte you sent to make sure it's the same you are 90% of the way there.

I'm very curious what you mean with this physical level, do you have some links / articles I can look into?

attachment.zip (302 KB)

I've implemented the second protocol idea. It seems to work quite nicely.
The only problem is that messages sometimes aren't sent correctly (the checksum doesn't match the type and value). In the previous protocol this would mean that no confirmation is sent back and the unit tries to resend. In this case the message is really lost. Not sure what to do about this. Tips, anyone?

Now I'll try to test this with more units.

test01Binary01.zip (51.6 KB)

I implemented one fallback mechanism. When there is a read error, a unit sends an error message back. The unit receiving an error message then resends the batch of messages it previously sent.
Not sure how to test it though, because I don't know how to force errors.

test01Binary01.zip (51.8 KB)

When there is a read error, a unit sends an error message back. The unit receiving an error message then resends the batch of messages it previously sent.
Not sure how to test it though, because I don't know how to force errors.

Pretending that an error occurred, even when one didn't, doesn't seem that hard to do...
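
One way to pretend is to corrupt a byte on purpose just before it is handed to the receiving code. The sketch below is a host-side illustration: the function name `maybeCorrupt` and the error rate parameter are made up, and on an Arduino you would use `random()` instead of the `<random>` header.

```cpp
#include <cstdint>
#include <random>

// Flips one randomly chosen bit with probability `errorRate`,
// otherwise returns the byte unchanged.
uint8_t maybeCorrupt(uint8_t raw, double errorRate, std::mt19937& rng) {
    std::bernoulli_distribution fail(errorRate);
    if (!fail(rng)) return raw;
    std::uniform_int_distribution<int> bit(0, 7);
    return static_cast<uint8_t>(raw ^ (1u << bit(rng)));
}
```

Running every received byte through this with a small error rate exercises the error message and resend path, although it does not reproduce whatever causes the real errors on the line.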

Sure, but the error is caused by something and I'm not sure whether that cause will also influence a resend. That's why I'd rather have an actual error.

I'm very curious what you mean with this physical level, do you have some links / articles I can look into?

'Fraid not, try searching for "open collector" or "open drain".

What you need is a setup that actively pulls the line low for a 0 and just releases it to HI-Z for 1. The 1 is then handled by a pullup resistor.

That way any number of transmitters can be 0 when others are 1 and there's no harm done to them, it's just the signal that gets corrupted.

This can also be implemented by wiring RS-485 transceivers in a different manner to the norm, and you can use tri-state buffers like the 74xx125/6.
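
The difference between the two kinds of output can be modelled in a few lines. This is only a logical model with made-up names, not driver code: it shows which combinations put one pin driving high against another driving low.

```cpp
// How a node's pin affects the line. Released = high impedance (input mode).
enum class Drive { Low, High, Released };

// Open-collector style output: a 0 pulls the line low, a 1 just lets go.
Drive openCollector(bool bit) { return bit ? Drive::Released : Drive::Low; }

// Push-pull output: actively drives both levels.
Drive pushPull(bool bit) { return bit ? Drive::High : Drive::Low; }

// True when one node drives high while another drives low.
bool contention(const Drive* nodes, int count) {
    bool anyLow = false;
    bool anyHigh = false;
    for (int i = 0; i < count; ++i) {
        if (nodes[i] == Drive::Low) anyLow = true;
        if (nodes[i] == Drive::High) anyHigh = true;
    }
    return anyLow && anyHigh;
}

// Level on a line with a pull-up resistor; only meaningful without contention.
bool lineLevel(const Drive* nodes, int count) {
    for (int i = 0; i < count; ++i) {
        if (nodes[i] == Drive::Low) return false;
    }
    return true;
}
```

Two push-pull nodes sending 1 and 0 at the same time are in contention, while two open-collector nodes doing the same simply produce a low line. A common way to get this behaviour from an ordinary Arduino pin is to keep the output value low and switch the pin between output mode (drive 0) and input mode (release), with an external pull-up on the line.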

Until this is dealt with there's little point in worrying about protocols IMO.


Rob

It turned out my new approach wasn't reliable enough: too many messages were disappearing. So in the end I tweaked the MessengerBinary and this seemed to work better. Tweaking the switch time (between listening to connections) helped a great deal. After a few tests I ended up with between 50 and 100 milliseconds.

I added the latest version of the script. Please understand that this works very slowly and is not very reliable. For my installation it's almost good enough.

MessengerBinary.zip (46.6 KB)