LoRa SPI corruption

So, the LoRa chip has a bug where the SPI transfer gets corrupted, and there is a workaround to be done in software. The workaround is to reset and re-initialise the chip, which is a heavy operation for a battery-powered device.

The bug is not documented in the LoRa errata document, but the libraries do implement the workaround, so some people know about it.

See lines 1391 to 1417 in https://github.com/Lora-net/LoRaMac-node/blob/develop/src/radio/sx1276/sx1276.c
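
For readers who do not want to open the file, the general shape of such a workaround can be sketched as below. This is a minimal host-side simulation under stated assumptions, not the LoRaMac-node code: the register file, the init table values, and every name (`radio_hw_reset`, `radio_write_reg`, `radio_init`, `radio_on_tx_timeout`) are hypothetical stand-ins for whatever your own driver provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulated register file standing in for the chip (hypothetical size). */
#define RADIO_REG_COUNT 8u

typedef struct {
    uint8_t addr;
    uint8_t value;
} reg_init_t;

/* Illustrative defaults only; real values come from your own init code. */
static const reg_init_t radio_init_table[] = {
    {0x01u, 0x80u},
    {0x02u, 0x1Au},
    {0x05u, 0x4Fu},
};

typedef enum { RADIO_IDLE, RADIO_TX_RUNNING } radio_state_t;

static uint8_t radio_regs[RADIO_REG_COUNT];
static radio_state_t radio_state = RADIO_IDLE;
static unsigned radio_reset_count;

/* On hardware this would pulse the reset pin and wait for the chip. */
static void radio_hw_reset(void)
{
    memset(radio_regs, 0, sizeof radio_regs);
    radio_reset_count++;
}

/* On hardware this would be an SPI register write. */
static void radio_write_reg(uint8_t addr, uint8_t value)
{
    assert(addr < RADIO_REG_COUNT);
    radio_regs[addr] = value;
}

/* Full reset followed by a rewrite of every configuration register. */
static void radio_init(void)
{
    size_t n = sizeof radio_init_table / sizeof radio_init_table[0];

    radio_hw_reset();
    for (size_t i = 0; i < n; i++) {
        radio_write_reg(radio_init_table[i].addr, radio_init_table[i].value);
    }
    radio_state = RADIO_IDLE;
}

/* Called when the TX watchdog expires without a TxDone interrupt. */
static void radio_on_tx_timeout(void)
{
    if (radio_state != RADIO_TX_RUNNING) {
        return; /* nothing was in flight, leave the chip alone */
    }
    radio_init(); /* register state is no longer trusted, so rebuild it */
}
```

The cost is visible in the sketch: every timeout pays for a hardware reset plus a rewrite of the whole init table, which is why it hurts on a battery-powered node.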

Any reason why it's not mentioned in the errata?
And the comments say "it depends on the platform design". What platform design? It's a flaw in the chip.

The workaround does work, but it does not inspire confidence in the chip.

Have you asked Semtech?

What "bug"?

This is the comment. It does not suggest a fault in the LoRa chip, or that the action taken is the only possible recovery.

// Tx timeout shouldn't happen.
// But it has been observed that when it happens it is a result of a corrupted SPI transfer
// it depends on the platform design.

srnet:
Have you asked Semtech?

Not yet, I thought some users may already know.

jremington:
What "bug"?

The corruption is inside the LoRa chip, and it corrupts a lot of registers. It's a bug.
Wondering why the dependency on platform.

Wondering why the dependency on platform.

Because that is where the real problem lies.

Myself, I would not pay that much attention to a comment in a library; the library writer might be mistaken.

That an undetermined issue could cause a TX timeout must be clear to anyone writing a library. The TX timeout is there, so a library writer needs to assume that there is a potential problem to deal with.

I have been playing with LoRa devices extensively since 2014/2015. I don't recall seeing a lockup during transmit, apart from the obvious issue of the transmit causing a supply current spike, but that tends to cause a processor reset anyway.

There are other issues with the LoRa device, caused by inappropriate library routines, whereby phantom packets consisting of noise are accepted as valid. That's not in the device errata either.

srnet:
Myself, I would not pay that much attention to a comment in a library; the library writer might be mistaken.

I have faced it. I took the workaround from that code, implemented it in my own code, and now the problem has gone. It is a real problem. We are not debating whether there is a problem or not; I am saying that the problem exists, so take it as given.

srnet:
That an undetermined issue could cause a TX timeout must be clear to anyone writing a library. The TX timeout is there, so a library writer needs to assume that there is a potential problem to deal with.

We are not talking about a TX timeout. We are talking about the entire register state being lost. Not a trivial issue, I believe you will agree.

srnet:
I have been playing with LoRa devices extensively since 2014/2015. I don't recall seeing a lockup during transmit, apart from the obvious issue of the transmit causing a supply current spike, but that tends to cause a processor reset anyway.

So it does not happen on your platform/implementation/timing/settings/use case.
It happens on my (and many other users') platform/implementation/timing/settings/use case.
What's the point if yours works fine?

wonderfuliot:
So it does not happen on your platform/implementation/timing/settings/use case.

It happens on my (and many other users') platform/implementation/timing/settings/use case.

So what platform are you using?

Can you provide links to others apparently getting the problem?

I have not seen it on Atmel-based Arduinos; I mainly use the ATmega328, with occasional use of the ATmega1284.

We are not talking about a TX timeout. We are talking about the entire register state being lost. Not a trivial issue, I believe you will agree.

The code you quoted says:

case RF_TX_RUNNING:
// Tx timeout shouldn't happen.

srnet:
So what platform are you using?

Can you provide links to others apparently getting the problem?

Nano, with Ra-01 modules.

Others facing hangs (which is the symptom)

They may not know that the register set (and hence the chip's internal entire state machine) is corrupted.
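
Whether a hang comes with register corruption can be checked rather than assumed. A minimal sketch of one way to do it, assuming you keep a shadow copy of every configuration register you write and can read registers back over SPI. The names `shadow_t`, `shadow_record`, `shadow_count_mismatches` and the simulated `sim_read_reg` are hypothetical, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHADOW_REG_COUNT 16u /* hypothetical; size it to your register map */

/* Hook for reading one register; on hardware this is an SPI read. */
typedef uint8_t (*reg_read_fn)(uint8_t addr);

typedef struct {
    uint8_t value[SHADOW_REG_COUNT];   /* last value written by the host */
    uint8_t tracked[SHADOW_REG_COUNT]; /* 1 if the host wrote this register */
} shadow_t;

/* Call this next to every configuration write. */
static void shadow_record(shadow_t *s, uint8_t addr, uint8_t value)
{
    assert(addr < SHADOW_REG_COUNT);
    s->value[addr] = value;
    s->tracked[addr] = 1u;
}

/* Count tracked registers whose read-back differs from the shadow copy. */
static size_t shadow_count_mismatches(const shadow_t *s, reg_read_fn read_reg)
{
    size_t mismatches = 0;

    for (uint8_t addr = 0; addr < SHADOW_REG_COUNT; addr++) {
        if (s->tracked[addr] && read_reg(addr) != s->value[addr]) {
            mismatches++;
        }
    }
    return mismatches;
}

/* Simulated chip so the idea can be tried on a PC. */
static uint8_t sim_regs[SHADOW_REG_COUNT];

static uint8_t sim_read_reg(uint8_t addr)
{
    assert(addr < SHADOW_REG_COUNT);
    return sim_regs[addr];
}
```

Only registers the host itself configures should be tracked. Status registers that the chip updates on its own would show up as false mismatches.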

wonderfuliot:
FORUM - Search the FAQ for answers to the most frequently asked questions and participate in the Forum to connect with the community.

Of the links you posted, the only one that directly suggests register corruption, by any cause, is the above.

However, the post the user made is interesting:

The module runs for 3, 4 days and then suddenly the register value changes. When I touch any pin of the module, the register value changes. I tried powering it from a benchtop Keithley 2231A power supply [3.3 V]; again, if I touch even the +3.3 V or GND pins of the module, the corruption occurs.

So if this were a random issue with SPI corruption, why does it not sometimes occur within a minute, an hour, or the same day? Does the LoRa module know how long it has been running?

Also note the point about touching the pins of the module, which then directly causes 'corruption'. To me that suggests a wiring or layout problem of some type.

srnet:
However, the post the user made is interesting:

So if this were a random issue with SPI corruption, why does it not sometimes occur within a minute, an hour, or the same day? Does the LoRa module know how long it has been running?

I had to recall many units after customer complaints of missing communication. We kept the devices in our lab and traced it to a TX timeout with no apparent cause. After the TX timeout, the registers were corrupted. The issue is random: sometimes we were lucky and it appeared within 5 minutes of reset, sometimes the device worked for hours. There were no loose hardware connections, and it was the same on several setups.

wonderfuliot:
I had to recall many units after customer complaints of missing communication. We kept the devices in our lab and traced it to a TX timeout with no apparent cause. After the TX timeout, the registers were corrupted. The issue is random: sometimes we were lucky and it appeared within 5 minutes of reset, sometimes the device worked for hours. There were no loose hardware connections, and it was the same on several setups.

Well, the issue does not seem to be widespread. If it were, you might expect lots of comments over in The Things Network forums, which I don't see. TTN is of course a very widespread user of LoRa.

I never use LoRa devices on 5V Arduinos like the Nano you are using, as I don't trust the necessary logic level converters.

The SPI signals look well dodgy when going through the average level converter.

srnet:
The SPI signals look well dodgy when going through the average level converter.

Do you protect LoRa SPI transactions by disabling interrupts?

wonderfuliot:
Do you protect LoRa SPI transactions by disabling interrupts?

Not directly.

You would want to avoid using software serial when writing to the LoRa device, though; that can cause problems. But self-evidently the millis() interrupt does not cause an issue, otherwise all those Arduino-powered TTN nodes out there would be failing on a regular basis.

By dodgy I meant that the signal levels, rise times, etc. through a logic level converter are, for fast SPI signals, close to the failure point in my view.

It would not occur to me to use a 5V Arduino in a commercial product either. Which logic level converters are you using?

It would be helpful if you told us what form of logic level conversion you are using on the SPI bus.

After all, the problem with your setup seems to be SPI corruption, so the type of logic level conversion you are using might be significant.

The reported issue of SPI data corruption leading to a TX TIMEOUT is NOT related to a bug in the Semtech LoRa transceiver. It has been demonstrated by Semtech and independent parties that the corruption only occurs when the SPI lines are not properly routed on the PCB and/or when the SPI specifications in the Semtech datasheet are not followed. Examples of common causes are:
• SPI SCK line placed too close to the MOSI and MISO lines.
• SPI interface lines that are too long and too close to other interface signal lines.
• No proper ground plane under the SPI lines.
• SPI clock and timing driven by the master that do not comply with the Semtech datasheet.
Any of the practices above could corrupt one or more bits in the SPI data, which may put the Semtech transceiver into an unknown state and consequently result in a TX TIMEOUT.
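
To see why one corrupted bit can be enough, consider the first byte of an SPI access. The sketch below assumes the SX127x layout in which the top bit of that byte selects write (1) or read (0) and the lower seven bits carry the register address; check this against the datasheet for your part. The helper names (`spi_header`, `spi_header_is_write`, `spi_header_addr`, `flip_bit`) are hypothetical and only model the byte on the wire.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed first-byte layout: bit 7 = write flag, bits 6..0 = address. */
#define SPI_WRITE_FLAG 0x80u
#define SPI_ADDR_MASK  0x7Fu

/* Build the first byte the host intends to send. */
static uint8_t spi_header(uint8_t addr, bool write)
{
    assert(addr <= SPI_ADDR_MASK);
    return (uint8_t)(addr | (write ? SPI_WRITE_FLAG : 0u));
}

/* How the receiving side would interpret that byte. */
static bool spi_header_is_write(uint8_t header)
{
    return (header & SPI_WRITE_FLAG) != 0u;
}

static uint8_t spi_header_addr(uint8_t header)
{
    return (uint8_t)(header & SPI_ADDR_MASK);
}

/* Model a glitch on the wire by flipping one bit (0 = least significant). */
static uint8_t flip_bit(uint8_t byte, unsigned bit)
{
    assert(bit < 8u);
    return (uint8_t)(byte ^ (1u << bit));
}
```

Under that assumption, an intended read of register 0x1D with bit 7 flipped is seen as a write to 0x1D, so whatever follows on the bus lands in that register. With bit 0 flipped it becomes a read of 0x1C instead.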

To spare users the effort of a major redesign, Semtech has come up with a workaround which involves a soft reset of the Semtech LoRa transceiver. It is important to note that this workaround was NOT created to patch a bug in the Semtech LoRa transceiver or reference designs, but rather to offer a simple alternative solution. For this reason, Semtech does not see the need to document this matter in the errata document. To improve clarity, Semtech will update the comments provided with the code for the workaround.