and as 64-bit float calculations are VERY CPU expensive, you may actually need to do the calculations in a background thread, as a long calculation will prevent you from reading from the devices on time.

That is because the Math64 lib used is not very efficient. It also uses Taylor series, which are known to converge very slowly; tan is calculated as sin/cos, sqrt with the exp function, etc. You have used your own replacements, however that is still too many multiplications and divisions...
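
Just to illustrate what avoiding multiplications and divisions can look like, here is a minimal sketch of a bit-by-bit square root that needs only shifts, additions, subtractions and comparisons. It works on 32-bit integers, not on doubles, and the name isqrt32 is made up for this example; it is not code from Math64 or from any particular lib.

```c
#include <stdint.h>

/* Integer square root, floor(sqrt(n)), using the classic
 * digit-by-digit (base 4) method. No multiply, no divide. */
uint16_t isqrt32(uint32_t n)
{
    uint32_t root = 0;
    uint32_t bit = 1UL << 30;      /* highest power of four in 32 bits */

    /* Start at the highest power of four that is <= n. */
    while (bit > n) {
        bit >>= 2;
    }

    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;           /* this result bit is 1 */
            root = (root >> 1) + bit;
        } else {
            root >>= 1;                /* this result bit is 0 */
        }
        bit >>= 2;
    }
    return (uint16_t)root;
}
```

For example, isqrt32(1000000) returns 1000. The same idea can be extended to a wider mantissa, at the cost of more iterations.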

The famous HP-35 from 1972 used a custom-made chip to process BCD numbers, with an execution time of just 250us per instruction at one clock per instruction, but it was still able to process sqrt and trig functions (sin, cos, tan and natural log, including their inverses) with 10 digits of precision in max ~500ms. That was really impressive for the time, knowing it was all accomplished with addition and shifting in only 768 "words" of code, on a chip that had only 5 registers!
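
To give an idea of what "only addition and shifting" can look like in practice, here is a minimal CORDIC sketch for sin/cos in 16.16 fixed point. It is only an illustration of the shift-and-add family of algorithms: it is not the HP-35 firmware (which worked in BCD) and not a double precision routine. The names cordic_sincos and cordic_atan are made up for this example, the input is assumed to be within [-pi/2, pi/2], and the result is good to roughly 4 decimal digits.

```c
#include <stdint.h>

#define CORDIC_ITERATIONS 16
#define CORDIC_GAIN 39797L   /* 0.6072529 * 65536, compensates CORDIC gain */

/* atan(2^-i) in radians, scaled by 65536 (Q16.16). */
static const int32_t cordic_atan[CORDIC_ITERATIONS] = {
    51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
    256, 128, 64, 32, 16, 8, 4, 2
};

/* angle: radians in Q16.16, assumed within [-pi/2, pi/2].
 * Results are Q16.16 as well (65536 means 1.0).
 * Note: relies on >> of negative values being an arithmetic shift,
 * which is implementation-defined in C but true for avr-gcc and gcc. */
void cordic_sincos(int32_t angle, int32_t *sin_out, int32_t *cos_out)
{
    int32_t x = CORDIC_GAIN;   /* start on the x axis, pre-scaled */
    int32_t y = 0;
    int32_t z = angle;         /* remaining angle to rotate by */
    uint8_t i;

    for (i = 0; i < CORDIC_ITERATIONS; i++) {
        int32_t dx = x >> i;
        int32_t dy = y >> i;
        if (z >= 0) {          /* rotate counter-clockwise */
            x -= dy;
            y += dx;
            z -= cordic_atan[i];
        } else {               /* rotate clockwise */
            x += dy;
            y -= dx;
            z += cordic_atan[i];
        }
    }
    *sin_out = y;
    *cos_out = x;
}
```

The loop body contains no multiplication or division at all, only shifts, additions and a table lookup, and each iteration gives about one more bit of result. For more precision you need a wider word and a longer table.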

Imagine what can be done with a 16MHz clock and a modern (even if just 8-bit) CPU that has a multiplication instruction, at least 32K ROM, 2K RAM and 1K EEPROM...

I've been working for a while on a fast double-precision (to the last bit) lib for 8-bit MCUs, as I also need max precision for my own far-from-trivial GPS navigator project (including maps etc). Really, there is no need for a 32-bit MCU platform.