Rapa Nui
Offline
God Member
Karma: 16
Posts: 888
Pukao hats cleaning services
« Reply #75 on: October 15, 2012, 05:19:13 am » |
..not sure your measurement reflects reality:
//time elasped = 29721 micros = teensy
//time elasped = 47436 micros = Uno
A typical 32-bit float sin() on a CM3 takes ~1050 cycles = ~22 usec @48MHz, so for 400 calls you should get something like 9000 micros max...
CM3 without FPU:
fZ = fX * fY; // 41 cycles
fZ = sqrt(fY); // 624 cycles
fZ = sin(1.23); // 1017 cycles
CM4 with FPU:
fZ = fX * fY; // 6 cycles
fZ = sqrt(fY); // 20 cycles
fZ = sin(1.23); // 124 cycles
http://www.micromouseonline.com/2011/10/26/stm32f4-the-first-taste-of-speed/?doing_wp_cron=1350295560.3099548816680908203125#axzz29MQwGQCw
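The estimate in this post is simple arithmetic, and a few lines of C make it easy to check. This is only a sketch: the helper names `cycles_to_usec` and `total_usec` are made up for illustration, and the count of 400 calls is assumed from the benchmark loops posted elsewhere in this thread.

```c
/* Convert a cycle count to microseconds for a core clock given in MHz.
   Hypothetical helper, not part of any library. */
double cycles_to_usec(double cycles, double clock_mhz)
{
    return cycles / clock_mhz;   /* at 1 MHz, one cycle takes one microsecond */
}

/* Estimated total time in microseconds for n calls of `cycles` cycles each. */
double total_usec(double cycles, double clock_mhz, int n)
{
    return cycles_to_usec(cycles, clock_mhz) * n;
}
```

With these, `total_usec(1050, 48, 400)` works out to 8750 usec, which is where the "something like 9000 micros" figure comes from.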
« Last Edit: October 15, 2012, 05:32:59 am by pito »
Greenville, IL
Offline
Edison Member
Karma: 11
Posts: 1289
Warning Novice on board! 0 to 1 chance of errors!
« Reply #76 on: October 15, 2012, 06:32:47 am » |
I can only say that my code saves the results to an array which is printed after the math has been timed. Could the saving to the array cause the time difference?
Rapa Nui
Offline
God Member
Karma: 16
Posts: 888
Pukao hats cleaning services
« Reply #77 on: October 15, 2012, 07:09:44 am » |
.. a clock-to-clock comparison says the Teensy should be ~3 times faster than the Uno (@16MHz), and the Teensy is a 32-bit CM3, so a 32-bit fp sin() cannot be "only" 1.6x faster than on the Uno.. saving to an array cannot create such overhead, indeed..
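To put numbers on that comparison, here is a small sketch in C. The function names are hypothetical, and the two timings are the ones quoted earlier in the thread (47436 micros on the Uno, 29721 micros on the Teensy).

```c
/* How many times faster the second run was, from two elapsed times. */
double speedup(double slow_usec, double fast_usec)
{
    return slow_usec / fast_usec;
}

/* Speedup you would expect from clock frequency alone. */
double clock_ratio(double fast_mhz, double slow_mhz)
{
    return fast_mhz / slow_mhz;
}
```

`speedup(47436, 29721)` is about 1.6, while `clock_ratio(48, 16)` is 3.0, so the measured gain is roughly half of what the clocks alone would suggest.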
Belgium
Offline
Edison Member
Karma: 34
Posts: 1073
Arduino rocks; but with my plugin it can fly rocking the world ;-)
« Reply #78 on: October 15, 2012, 08:01:13 am » |
.. a clock-to-clock comparison says the Teensy should be ~3 times faster than the Uno (@16MHz), and the Teensy is a 32-bit CM3, so a 32-bit fp sin() cannot be "only" 1.6x faster than on the Uno.. saving to an array cannot create such overhead, indeed..
My 2 cents: a possible explanation may be that the compile options for the Teensy are not the same as those for the Uno.
Best regards
Jantje
Rapa Nui
Offline
God Member
Karma: 16
Posts: 888
Pukao hats cleaning services
« Reply #79 on: October 15, 2012, 08:22:57 am » |
This is with an STM32F100 CM3 @48MHz (Teensy compatible, I think):
timer = millis;
for (i=0;i<400;i++) {
  sinres[i]=sinf((float)i);
}
timer = millis - timer;
printf("\rElapsed time float sin 400x into array: %u millis\n", timer);
timer = millis;
for (i=0;i<400;i++) {
  sinresd[i]=sinl((long double)i);
}
timer = millis - timer;
printf("\rElapsed time double sin 400x into array: %u millis\n", timer);
Output:
Elapsed time float sin 400x into array: 8 millis
Elapsed time double sin 400x into array: 19 millis
« Last Edit: October 15, 2012, 08:24:48 am by pito »
0
Offline
Sr. Member
Karma: 19
Posts: 420
Always making something...
« Reply #80 on: October 15, 2012, 08:41:16 am » |
There are indeed some complex things going on with this test. For example, this takes 138 us:
for (int i = 0; i < 3; i++) {
  sinanswers[i] = sin(i);
}
time2 = micros();
But this takes 229 us, about 1.7 times as long, just because the input is offset by 400. Clearly sin()'s execution time is not constant.
for (int i = 0; i < 3; i++) {
  sinanswers[i] = sin(i+400);
}
time2 = micros();
I suspected the slowness was due to computing in double precision. But I tried changing sin() to sinf(), and amazingly sinf() takes MUCH longer. Clearly newlib or libgcc is not optimized very well, or some settings aren't quite right. I need to dig into that......
« Last Edit: October 15, 2012, 08:44:30 am by Paul Stoffregen »
Rapa Nui
Offline
God Member
Karma: 16
Posts: 888
Pukao hats cleaning services
« Reply #81 on: October 15, 2012, 09:29:22 am » |
And your test with 1000x (actually 500x) sin cos tan (STM32F100 CM3 @48MHz):
timer = millis;
for (i=0;i<500;i++) {
  fsi[i]=sinf((float)i);
  fco[i]=cosf((float)i);
  fta[i]=tanf((float)i);
}
timer = millis - timer;
printf("\rElapsed time float sin cos tan 500x into array: %u millis\n", timer);
Output:
Elapsed time float sin cos tan 500x into array: 31 millis
Such big arrays do not fit into my 8kB RAM, so double it for 1000x (= 62 millis; yours is 278 ms). Double it again for a double-precision fp result.
p.
« Last Edit: October 15, 2012, 09:37:10 am by pito »
0
Offline
Edison Member
Karma: 28
Posts: 1079
Arduino rocks
« Reply #82 on: October 15, 2012, 10:56:27 am » |
First, the Teensy Kinetis CPU does not have hardware floating point. Hardware floating point is optional for Cortex M4, and only the high-end K20 processors have it: http://www.freescale.com/files/microcontrollers/doc/fact_sheet/KNTSK20FMLYFS.pdf
The newlib math functions are really old C functions, and their execution times depend on the value of the arguments. Here are two examples for 32-bit sine, float sinf(float). I ran this code:
float sinanswers[401];
float sinarg[401];
for (int i = 0; i < 400; i++) {
  sinarg[i] = factor*i;
}
time1 = micros();
for (int i = 0; i < 400; i++) {
  sinanswers[i] = sinf(sinarg[i]);
}
time2 = micros();
If factor is 0.01, so the range is 0.0 - 4.0:
time elapsed = 17110 micros
If factor is 1.0, so the range is 0.0 - 400.0:
time elapsed = 105353 micros
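For anyone who wants to repeat the factor experiment off the board, the loop can be wrapped in a small harness along these lines. This is a rough sketch, not the code that produced the numbers above: `time_calls_usec`, `bench_out` and `identity_f` are names invented here, and the standard C `clock()` stands in for the Arduino `micros()` call, so its resolution may be too coarse for a fast desktop CPU.

```c
#include <time.h>

#define BENCH_N 400

/* Results are stored in a global so the compiler cannot discard the calls. */
float bench_out[BENCH_N];

/* Trivial stand-in so the harness can be tried without the math library. */
float identity_f(float x)
{
    return x;
}

/* Time BENCH_N calls of fn on the arguments factor*i and return the
   elapsed CPU time in microseconds. */
double time_calls_usec(float (*fn)(float), float factor)
{
    float args[BENCH_N];
    for (int i = 0; i < BENCH_N; i++) {
        args[i] = factor * (float)i;   /* build inputs outside the timed region */
    }

    clock_t start = clock();
    for (int i = 0; i < BENCH_N; i++) {
        bench_out[i] = fn(args[i]);
    }
    clock_t stop = clock();

    return (double)(stop - start) * 1e6 / (double)CLOCKS_PER_SEC;
}
```

After including `<math.h>`, comparing `time_calls_usec(sinf, 0.01f)` with `time_calls_usec(sinf, 1.0f)` reproduces the two cases. Passing the function as a pointer keeps the argument setup out of the timed region and lets the same harness time `cosf` or `tanf`.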
The algorithms for 64-bit double are totally different from those for 32-bit float. Much of this dates back to work in the 1980s on BSD Unix at UC Berkeley. I was at UCB when BSD Unix was developed. Bill Joy was a key developer of BSD and used it at Sun Microsystems as the base for Solaris.
« Last Edit: October 15, 2012, 11:13:40 am by fat16lib »
Dubai, UAE
Offline
Edison Member
Karma: 20
Posts: 1627
« Reply #83 on: October 15, 2012, 11:48:18 am » |
Hi, I assume that most of us will use fixed-point maths, but for those who have a reason to use float and double, is there an alternative implementation that can be included at compile time, or some other workaround that provides a more recent and faster implementation?
Duane B
rcarduino.blogspot.com
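One workaround of the kind asked about here is to trade accuracy for speed with a hand-written approximation. The sketch below is not from this thread and not part of newlib; `fast_sinf` is a hypothetical name, and it uses a well-known parabola-plus-correction approximation whose absolute error is on the order of 0.001, so it only suits applications that can live with that.

```c
#define PI_F      3.14159265f
#define TWO_PI_F  6.28318531f

/* Approximate sin(x) with a parabola and one correction step.
   Much coarser than the library sinf(); an illustration only. */
float fast_sinf(float x)
{
    /* Fold x into [-pi, pi] by subtracting the nearest multiple of 2*pi.
       The int cast limits this to moderate |x|, and precision degrades
       as |x| grows. */
    int k = (int)(x / TWO_PI_F + (x >= 0.0f ? 0.5f : -0.5f));
    x -= (float)k * TWO_PI_F;

    float ax = x < 0.0f ? -x : x;
    /* Parabola through (-pi, 0), (0, 0), (pi, 0) with peaks of +/-1. */
    float y = (4.0f / PI_F) * x - (4.0f / (PI_F * PI_F)) * x * ax;

    float ay = y < 0.0f ? -y : y;
    /* Blend with y*|y| to pull the curve closer to a true sine. */
    return 0.225f * (y * ay - y) + y;
}
```

The function needs only multiplies, adds and one divide, and it runs the same code path for every input, so its execution time should not depend on the argument the way the library version does. Whether it is actually faster on a given board has to be measured.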
SF Bay Area (USA)
Offline
Faraday Member
Karma: 78
Posts: 5453
Strongly opinionated, but not official!
« Reply #84 on: October 15, 2012, 12:06:25 pm » |
avr-gcc has floating point algorithms that have been carefully optimized for the AVR architecture. arm-gcc using newlib presumably has generic algorithms...
0
Offline
Edison Member
Karma: 28
Posts: 1079
Arduino rocks
« Reply #85 on: October 15, 2012, 05:05:20 pm » |
Thanks to Paul's help, I have first results for SdFat with large reads and writes. Writes are slow since I am not using the full TX FIFO yet. Here are results for 4096-byte writes and reads using the SdFat bench.ino example.
Type is FAT16
File size 5MB
Buffer size 4096 bytes
Starting write test. Please wait up to a minute
Write 723.07 KB/sec
Maximum latency: 21609 usec, Minimum Latency: 4929 usec, Avg Latency: 5625 usec
Starting read test. Please wait up to a minute
Read 1255.56 KB/sec
Maximum latency: 3874 usec, Minimum Latency: 3204 usec, Avg Latency: 3260 usec
I think I know how to speed up writes by using all four bytes of the SPI FIFO. DMA may be required to achieve maximum speed, since using the FIFO has a bit more overhead than I first assumed.
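The throughput and latency figures in this output can be cross-checked against each other. The sketch below assumes that the benchmark's "KB" means 1000 bytes, which is a guess based on the numbers lining up and not something stated in the post; `throughput_kb_per_sec` is a made-up helper name.

```c
/* Throughput in KB/sec (assuming 1 KB = 1000 bytes) implied by moving
   buf_bytes per call when each call takes avg_latency_usec on average. */
double throughput_kb_per_sec(double buf_bytes, double avg_latency_usec)
{
    double bytes_per_sec = buf_bytes * 1e6 / avg_latency_usec;
    return bytes_per_sec / 1000.0;
}
```

`throughput_kb_per_sec(4096, 5625)` gives about 728 KB/sec and `throughput_kb_per_sec(4096, 3260)` about 1256 KB/sec, both slightly above the reported 723.07 and 1255.56. That is what you would expect if the reported rate also includes a little time spent outside the individual write and read calls.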
Belgium
Offline
Edison Member
Karma: 34
Posts: 1073
Arduino rocks; but with my plugin it can fly rocking the world ;-)
« Reply #86 on: October 15, 2012, 05:32:16 pm » |
fat16lib, I don't know much about SD cards and read and write speeds, but those read figures look impressive to me.
Best regards
Jantje
Ayer, Massachusetts, USA
Offline
Edison Member
Karma: 27
Posts: 1097
« Reply #87 on: October 16, 2012, 02:49:53 pm » |
What does the S in sin.S stand for? I googled it, but it is too close to sin to find something relevant quickly.
The GCC compiler passes .S files through the C preprocessor, so that you can use #ifdef and #define within assembly files (because of the #ifdef, you can have one .S file that has code for several different targets, and the defines are set based on the -m<xxx> options on the command line). If the file is .s (lowercase), it is passed directly to the assembler and does not go through the preprocessor.
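The conditionals work the same way in a .S file as in C source, so the idea can be shown in C without inventing any assembly. In this sketch `target_name` is a hypothetical function; `__AVR__` and `__arm__` are macros that GCC predefines for AVR and 32-bit ARM targets respectively.

```c
/* Pick a string at preprocessing time based on the target GCC is
   compiling for. A .S file can wrap instruction sequences in the
   same #if / #elif / #endif lines. */
const char *target_name(void)
{
#if defined(__AVR__)
    return "avr";
#elif defined(__arm__)
    return "arm";
#else
    return "other";
#endif
}
```

Only one of the three return statements survives preprocessing, which is how a single source file can carry code for several targets.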
0
Offline
Edison Member
Karma: 28
Posts: 1079
Arduino rocks
« Reply #88 on: October 16, 2012, 10:18:54 pm » |
Teensy 3.0 is great. The hardware is a little gem and the way Paul did the software makes development really easy. I now have optimized native SPI running in programmed I/O mode and performance is really good. Here are my latest results for SdFat.
Free RAM: 4219
Type is FAT16
File size 10MB
Buffer size 8192 bytes
Starting write test. Please wait up to a minute
Write 1801.74 KB/sec
Maximum latency: 80495 usec, Minimum Latency: 4398 usec, Avg Latency: 4543 usec
Starting read test. Please wait up to a minute
Read 2017.82 KB/sec
Maximum latency: 4473 usec, Minimum Latency: 4048 usec, Avg Latency: 4057 usec
This is with the CPU overclocked at 96 MHz. I used an industrial SD card designed for embedded use.
« Last Edit: October 17, 2012, 05:51:08 am by fat16lib »
Newcastle, UK
Offline
Full Member
Karma: 0
Posts: 227
« Reply #89 on: October 17, 2012, 04:37:36 am » |
Great board.
Is there any news on when we can buy one? Are there any European distributors stocking it?