I2C Random Freezing

Hello,
I built a prototype using an Arduino Mega with an OLED display, EEPROM, and temperature sensors on the same I2C bus. Everything works for a while, but after a few hours the bus hangs and the display stops updating.

I’m using 4.7k pull-ups and around 25 cm wires between boards.

Questions:

  1. What is the best way to recover a stuck I2C bus in software?
  2. Could bus capacitance be causing this issue?
  3. If I move this design to a PCB, should I keep SDA/SCL traces short and route them away from SPI signals?

Hi
I think it is pointless to discuss an electronics problem without a connection diagram.
Also, showing your full code using code tags will help.

You are talking about the I2C bus and the SPI port. Tell us clearly how you have connected each device:
1. Is the OLED on the I2C bus or the SPI port?

2. What type of temperature sensor? Have you connected it using the I2C bus, the SPI port, or the 1-Wire bus?

3. Are you using the internal EEPROM or an external EEPROM? If external, what type is it? Which bus/port have you used to connect it to the MEGA?

You can't

Could bus capacitance be causing this issue?

It could

If I move this design to a PCB, should I keep SDA/SCL traces short and route them away from SPI signals?

Yes.

  1. Best way to recover a stuck I2C bus?

Yes, the standard recovery method for a stuck I2C bus is to manually clock the bus:

Reconfigure SDA/SCL as GPIO
Pulse SCL around 10 times to release any stuck slave
Generate a STOP condition (SDA rising from low to high while SCL is high)
Reinitialize the I2C peripheral
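
The steps above can be sketched roughly as follows. This is only an illustration, not code from any library: `BusPins`, `recoverBus`, and `SimulatedStuckSlave` are hypothetical names, the pin access is hidden behind an interface so the logic can be tried on a PC, and the default of nine pulses is an assumption taken from the usual "bus clear" advice (the post says around 10).

```cpp
#include <cassert>

// Hypothetical pin interface. On real hardware each method would wrap a GPIO
// call; "release" means switching the pin to input so the pull-up raises it.
struct BusPins {
    virtual void releaseScl() = 0;
    virtual void driveSclLow() = 0;
    virtual void releaseSda() = 0;
    virtual void driveSdaLow() = 0;
    virtual bool sdaIsHigh() = 0;
    virtual void halfBitDelay() = 0;  // e.g. about 5 us for a 100 kHz bus
    virtual ~BusPins() = default;
};

// Clocks SCL until the slave lets go of SDA, then issues a STOP.
// Returns true if SDA ends up high, false if the slave never released it.
bool recoverBus(BusPins& pins, int maxPulses = 9) {
    assert(maxPulses > 0);
    pins.releaseSda();
    pins.releaseScl();
    pins.halfBitDelay();

    for (int i = 0; i < maxPulses && !pins.sdaIsHigh(); ++i) {
        pins.driveSclLow();
        pins.halfBitDelay();
        pins.releaseScl();
        pins.halfBitDelay();
    }
    if (!pins.sdaIsHigh()) {
        return false;  // still stuck: clocking alone did not free the bus
    }

    // STOP condition: SDA goes low-to-high while SCL is high.
    pins.driveSclLow();
    pins.halfBitDelay();
    pins.driveSdaLow();
    pins.halfBitDelay();
    pins.releaseScl();
    pins.halfBitDelay();
    pins.releaseSda();
    pins.halfBitDelay();
    return pins.sdaIsHigh();
}

// Toy model for host-side testing: a slave that was interrupted mid-read and
// holds SDA low until it has been clocked through its remaining bits.
struct SimulatedStuckSlave : BusPins {
    int bitsLeft;
    int sclRisingEdges = 0;
    bool sclLow = false;
    bool masterHoldsSda = false;

    explicit SimulatedStuckSlave(int bits) : bitsLeft(bits) {}
    void releaseScl() override {
        if (sclLow) {
            sclLow = false;
            ++sclRisingEdges;
            if (bitsLeft > 0) --bitsLeft;
        }
    }
    void driveSclLow() override { sclLow = true; }
    void releaseSda() override { masterHoldsSda = false; }
    void driveSdaLow() override { masterHoldsSda = true; }
    bool sdaIsHigh() override { return !masterHoldsSda && bitsLeft == 0; }
    void halfBitDelay() override {}
};
```

On an Arduino, a `BusPins` implementation would presumably release a line with pinMode(pin, INPUT) and drive it with pinMode(pin, OUTPUT) plus digitalWrite(pin, LOW), and the sketch would call Wire.begin() again after a successful recovery. A real version should also confirm that SCL actually goes high after each release, since a slave holding SCL low cannot be clocked free this way.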

Read this guide, it covers everything about a stuck I2C bus: https://www.ti.com/lit/an/scpa069/scpa069.pdf

Again, YES: it's very likely that bus capacitance is causing this.

Your setup has:

25 cm wires: high capacitance + noise pickup
Multiple devices on one bus
4.7k pull-ups: may be too weak for that load

This leads to:

Slow rise times (invalid logic levels)
Glitches from EMI
Slaves getting stuck in incomplete reads/writes over time
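
To put a rough number on "slow rise times": the usual estimate for the 30% to 70% rise time of an open-drain line is t_r ≈ 0.8473 × R × C, and the I2C specification allows at most 1000 ns in standard mode (100 kHz) and 300 ns in fast mode (400 kHz). The sketch below just evaluates that formula. The bus capacitance was not stated in this thread, so any value passed in is an assumption, and the function names are made up for illustration.

```cpp
#include <cassert>

// ln(0.7 / 0.3): factor for the 30%-to-70% rise time of an RC-charged line.
constexpr double kRiseFactor = 0.8473;

// Maximum rise times allowed by the I2C specification.
constexpr double kStandardModeMaxRiseNs = 1000.0;  // 100 kHz
constexpr double kFastModeMaxRiseNs = 300.0;       // 400 kHz

// Estimated rise time in nanoseconds for a pull-up (ohms) and bus load (pF).
double riseTimeNs(double pullupOhms, double busPicofarads) {
    assert(pullupOhms > 0.0 && busPicofarads > 0.0);
    // ohm * pF = 1e-12 s = 1e-3 ns
    return kRiseFactor * pullupOhms * busPicofarads * 1e-3;
}

// Largest pull-up (ohms) that still meets a rise-time limit at a given load.
double maxPullupOhms(double maxRiseNs, double busPicofarads) {
    assert(maxRiseNs > 0.0 && busPicofarads > 0.0);
    return maxRiseNs / (kRiseFactor * busPicofarads * 1e-3);
}
```

For example, riseTimeNs(4700, 100) comes out at about 398 ns: fine for standard mode, but too slow for fast mode if the bus really carries 100 pF. Measuring the actual edges with a scope is more trustworthy than any estimate of the wiring capacitance.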

On PCB design, what should I do?

YES (it's a hat-trick) ..... layout matters a lot for I2C stability. This discussion would be helpful for you: https://electronics.stackexchange.com/questions/423122/i2c-issues-with-first-pcb
Read this guide for signal integrity during pcb design: https://www.aivon.com/blog/pcb-design/pcb-routing-techniques-achieving-signal-integrity-and-avoiding-common-pitfalls/

Keep SDA/SCL short and tightly routed together
Route away from SPI, PWM, motor, or switching signals
Use a solid ground plane under the traces. I would recommend using separate ground and power planes; read this guide to understand planes: Routing Layers and Ground Planes & Power Planes - Engineering Technical - PCBway
Consider stronger pull-ups (2.2k–3.3k) depending on bus capacitance
Optional: add small series resistors (22–100 Ω) near the MCU to reduce ringing

Your issue is almost certainly signal integrity (capacitance + noise), not I2C protocol limits. On a clean PCB layout, this kind of bus hang will be eliminated.

Describe your bus in more detail. Include:

  • number of boards with pullups
  • physical layout, routing of I2C
  • how many 25cm segments
  • bus speed used
  • exact devices used (links to products are useful). In particular, looking for things like built in pullups you haven't accounted for.

Thanks

In that case, would a 4-layer PCB be required? Does it make sense for such a small project?

If the slave is stuck holding either SCL or SDA low, there is nothing you can do in software.

Keep SDA/SCL short and tightly routed together

Short is good; tightly routed together is bad.

How random is it? Only after a few hours? Never sooner than that? Never longer than that?

Edit: When it stops working what's the voltage on SDA and SCL?

Also, measuring with everything connected, but all power off, what's the resistance between SDA and Vcc, and SCL and Vcc? Is it 4.7K?

Have you confirmed that the software is still running?

You may be able to change the I2C bus speed to make it less glitch prone.
See here for example: [solved] I2C speed to 400KHz ... HowTo?

But the suggestion of stronger pullup resistors is also good and easy to test.
If a rogue I2C device is hogging the bus then it may not be easy to solve.

Post an annotated schematic. That sounds like a hardware problem, possibly in the power system. Try running with the display disconnected; I expect the problem will go away. If you must see the data, use Serial.print(). What type of area is this in: home, office, etc.?

If a slave is stuck holding the bus low... can you depower the bus? I've had some success running the I2C bus powered from a GPIO, so that if the bus becomes truly stuck or broken I depower EVERYTHING on the bus for a period of time, then repower and reinitialize the bus. But... I admit that was just a stop-gap I tried and seemed to work (and requires powering everything from a GPIO pin, which works fine for very low-current-draw systems on the bus, but obviously is a problem for high-current-draw ones).

If there are other functioning devices on the bus, that could cause them to lock up, and it will not necessarily reset the stuck device.

Assuming all devices on the bus will reset when completely depowered (including the device supply power), then that is the only foolproof way to reset everything.

One of the most common causes of a "locked up" I2C bus when using an AVR processor is the Wire library hanging because it does not properly handle multi-master.
Even if you are not using multiple masters, the low level Wire library code still supports multi-master, and it can get confused and lock up.
Where this becomes problematic is that if the Wire library "thinks" there is another master on the bus wanting to talk because of noise on the bus, it will back off and wait for that other master to finish, which never happens. The way that is implemented in the low level code is that the Wire s/w spins in the library code waiting for the Wire h/w to post an event indicating that the other master is finished.
But since there is no other master, the h/w will never post that event, and the Wire s/w will spin forever because there is no timeout in that spin wait loop. This will hang the system.
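
To illustrate the difference a timeout makes in that kind of spin wait, here is a minimal generic sketch. It is not the Wire library's code: `waitForEvent`, `WaitResult`, and the callback parameters are hypothetical, and the event check and clock are passed in so the loop can be exercised on a PC.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

enum class WaitResult { Ready, TimedOut };

// Polls eventPosted() until it returns true or timeoutMicros has elapsed.
// Without the elapsed-time check this loop would spin forever whenever the
// hardware never posts the event, which is the hang described above.
WaitResult waitForEvent(const std::function<bool()>& eventPosted,
                        const std::function<uint32_t()>& nowMicros,
                        uint32_t timeoutMicros) {
    assert(timeoutMicros > 0);
    const uint32_t start = nowMicros();
    while (!eventPosted()) {
        // Unsigned subtraction stays correct across counter wrap-around.
        if (nowMicros() - start >= timeoutMicros) {
            return WaitResult::TimedOut;
        }
    }
    return WaitResult::Ready;
}
```

On a timeout the caller still has to decide what to do, typically re-initializing the bus and the devices on it. In recent AVR cores the Wire methods setWireTimeout() and getWireTimeoutFlag() appear to serve this purpose for the real library.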

There are two approaches.

  • Use the watchdog timer (this is the best approach)
    If the Wire library hangs, the watchdog will fire and reset the board.
    The Arduino sketch will start over and everything will be initialized again, just like after a power cycle or hardware reset.
  • If using the newer 2.x IDE and Wire library, enable the Wire library timeouts:
    Search for how to use them: "arduino avr wire library enable timeouts"
    Basically, you enable the timeouts and you can call a Wire method to see if a timeout occurred.
    There was LOTS of discussion about this before it was implemented.
    I agreed with the need to fix the forever spins, but was very much against the way it was implemented.
    There are quite a few potential issues with using these timeouts, beyond the portability issues.
    For example, in order to recover robustly, you will need to fully re-initialize the Wire library and every I2C slave on the bus.
    Some Wire slave libraries may have issues when their begin() function is called again. This is why I suggest using the watchdog timer.

--- bill

So you have blindly added 2 × 4k7?
Did you calculate the total resistance of all connected modules, including the 2 × 10k that the Mega already has?
Total R should not be below 1k66 (3 mA at 5 V).
Leo..
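
The check Leo describes is just resistors in parallel, per line. A small sketch, with hypothetical function names; the 10k and 4k7 values come from this thread, while any extra module pull-ups are assumptions you would need to confirm from the module schematics:

```cpp
#include <cassert>
#include <vector>

// Effective pull-up on one I2C line when several pull-ups sit in parallel.
double parallelOhms(const std::vector<double>& resistors) {
    assert(!resistors.empty());
    double conductance = 0.0;
    for (double r : resistors) {
        assert(r > 0.0);
        conductance += 1.0 / r;
    }
    return 1.0 / conductance;
}

// Current a device must sink to pull the line low, in milliamps.
double sinkMilliamps(double supplyVolts, double pullupOhms) {
    assert(pullupOhms > 0.0);
    return supplyVolts / pullupOhms * 1000.0;
}
```

With only the Mega's 10k and one added 4k7, parallelOhms({10000, 4700}) gives about 3.2k, comfortably above 1k66. If two modules on the bus each turned out to carry their own 4k7 as well, the total would drop to roughly 1.35k and exceed the 3 mA limit.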

because of noise on the bus

I would think it would be a very rare issue since it would require a very very strong noise source to pull the SCL or SDA line low.

It seems to happen quite a bit.
The AVR TWI / Wire implementation is half h/w, half s/w.
There are several situations that can cause the TWI code to lock up because the code incorrectly transitions to a state and gets stuck in spin loops polling for an event that will never happen.

It could be triggered by various things, induced noise, poor or missing external pullups (the Wire library enables the internal pullups which are not really strong enough to work properly all the time), power supply issues, etc...

There was quite a bit of discussion about it a few years back in the forum and in a few github issues.

One thread back in the 2020 time frame that really kicked off interest in finally resolving this issue involved some goofy person wanting to use an Arduino to make a ventilator, but the Wire library would occasionally lock up.

One of the cases is that the TWI s/w could get stuck waiting for a STOP that never happens.

Enough people finally seeing these lockups is what drove the addition of the timeouts to the low level twi code.

-- bill

Then it's a hardware issue, not a software issue. If your bus is highly susceptible to noise, then you need to rethink your design. You can't expect software to counteract the effects of a poorly designed system. From what I have seen on the forum, the most common causes of I2C problems are people using long wires, no pullups, incorrect pullups, incorrect pullup voltage, or a misunderstanding of how their devices work.

Granted in many cases the trigger for this issue is h/w, like wiring or power.
But the real issue is that a problem in the AVR TWI system is causing the processor to hang.
It is very common in the real world for s/w to work around h/w issues.
The TWI system has the ability to return error status codes when the s/w detects issues. The problem is the TWI system didn't have timeouts for all situations.

The AVR TWI system should not have cases where it can get stuck and hang the processor.
On the AVR TWI system, it is part h/w and part s/w to keep the h/w simple and inexpensive.

In order for a system to be robust, it must never hang, and should be able to recover from as many exceptions and errors as possible.

The AVR TWI system has/had a problem. Under certain scenarios, it causes a lockup in the low level TWI s/w because the s/w never times out waiting on certain h/w events. This can be fixed in s/w, and has been, by enabling the missing s/w timeouts, but these additional timeouts are not enabled by default.

Consider this scenario. There actually are multiple masters, the other master starts a transaction but fails to complete it because it just happened to lose power.
If the timing was just right, the AVR master could fall into a spin loop waiting forever for that other master to finish but it will never happen.

I also have a type of I2C LCD device whose chipset misbehaves if you read from it, which triggers one of the TWI lockup issues. I put code in my LCD library for this type of device that blocks reads to prevent a lockup on AVR systems.

These are just a couple of the many situations where the AVR can get locked up because the TWI s/w state machine can spin forever waiting for a h/w state change that never happens. If the additional timeouts are enabled, the Wire library won't lock up anymore and the user code can take whatever actions it deems necessary to re-initialize the Wire system and slaves.

Alternatively as I mentioned, the user could simply enable the watchdog timer and then, should one of these TWI issues happen, the AVR is reset and starts over cleanly.