Topic: Profiling / why so slow?

cptdondo

I have the following body of code.  All the conditionals in this should fail except for once every 5 seconds.  The only processing being done is once every 5 seconds.

According to the manual, the Arduino is capable of 10,000 samples/sec on the A/D. I am running this code, with the conditionals failing, and only achieving 2,000 samples in 5 seconds, or about 1/25th of what the docs say.

Is there some way to profile this code?  Is it unrealistic to expect 10,000 samples unless I just sample?

Code:


long Iavg=0, Vavg=0;

unsigned int
analogRead100 (byte reg)
{
  unsigned long analog = 0;
  for (byte ai = 0; ai < 100; ai++)
    {
      analog += analogRead (reg);
    }
  analog /= 100l;
  return analog;
}

void
loop ()
{

      Iavg += analogRead100 (currentPin);
      Vavg += analogRead100 (voltagePin);
      ct++;

  if (debug && Serial.available () > 0)
    {
    }

  if (elapsedTime (podHistory.lastUpdated) > 5000)
    {
    }

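  // UL suffixes keep these products in 32-bit arithmetic; with the AVR's
  // 16-bit int, 60 * 1000 would overflow.
  if (fram && elapsedTime (timer60Msg) > (60UL * 1000UL))
    {
    }

  if (fram && elapsedTime (timer30Msg) > (60UL * 60UL * 24UL * 1000UL))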
    {
    }

  if (recvMsg (fAvailable, fRead, buf, 32) && buf[0] == podSetup.address)
    {
    }
}

PeterH

That doesn't compile. If you have a problem with your code, you need to post code that actually compiles and runs and demonstrates the problem. If your sketch involves a lot of code which you think is unrelated to the problem then create a test sketch that reproduces the problem in the simplest possible way. Quite often, the mere act of doing this will reveal a false assumption and give you insight that will enable you to figure the problem out for yourself, but even if it doesn't it makes it much easier for us to see what's going on.
I only provide help via the forum - please do not contact me for private consultancy.

Nick Gammon

I agree with PeterH.

Read this before posting a programming question


However, the 10,000 samples/sec figure is roughly right. An analogRead with default settings takes about 104 µs, so you should be able to do 9615 of them in one second. That assumes you do nothing else. Printing to serial ports takes time, of course.
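
For reference, the 104 µs figure can be derived from the ADC clock. This is only a minimal sketch in plain C++, assuming a 16 MHz ATmega328P-class board, the Arduino core's default ADC prescaler of 128, and 13 ADC clock cycles per normal conversion. The function names are made up for illustration.

```cpp
#include <cstdint>

// Time for one ADC conversion, in microseconds.
// cpu_hz:     CPU clock (assumed 16 MHz on an Uno-class board)
// prescaler:  ADC clock divider (assumed 128, the Arduino default)
// adc_cycles: ADC clocks per conversion (assumed 13 for a normal conversion)
constexpr uint32_t adcConversionMicros(uint32_t cpu_hz, uint32_t prescaler, uint32_t adc_cycles)
{
  // ADC clock = cpu_hz / prescaler; one conversion = adc_cycles / ADC clock.
  return (adc_cycles * prescaler * 1000000UL) / cpu_hz;
}

// Upper bound on conversions per second if nothing else runs.
constexpr uint32_t maxSamplesPerSecond(uint32_t conversion_us)
{
  return 1000000UL / conversion_us;
}
```

With those assumptions, adcConversionMicros(16000000UL, 128, 13) gives 104 and maxSamplesPerSecond(104) gives 9615, matching the figures above.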

You can do analog reads asynchronously, which means you can be starting the next read while you do something with the previous one.

http://www.gammon.com.au/interrupts

In particular this part: http://www.gammon.com.au/forum/?id=11488&reply=5#reply5
Please post technical questions on the forum, not by personal message. Thanks!

More info:
http://www.gammon.com.au/electronics

holmes4

It may compile, BUT you must learn to format your code. ALWAYS USE THE AUTO FORMAT IN THE IDE BEFORE POSTING! The type of the function should always be on the same line as the function name and its parameters. Good code layout is vital when others are expected to read it!

Code:
  analog /= 100l;


The above will only ever get you a number between 0 and 2. Integer math always rounds down, there is NEVER a decimal place with ints.

2/3 = 0 with ints

3/2 = 1 with ints

Anything less than 100 divided by 100 equals 0 with ints!

Mark

AWOL

Quote
The above will only ever get you a number between 0 and 2

"will only ever" is rather strong wording, don't you think?
What if the 100 analogReads you've just executed all return their maximum 1023?
That would be the number 1023, wouldn't it?
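
A quick way to check this sort of claim is to run the arithmetic. The sketch below is plain C++ and is not part of the original sketch (averageOf100 is a name made up here). It shows the truncation examples and the full-scale case, where 100 readings of 1023 sum to 102300.

```cpp
#include <cstdint>

// Integer division truncates: the fractional part is simply dropped.
constexpr int twoThirds  = 2 / 3;   // 0
constexpr int threeHalfs = 3 / 2;   // 1

// Average of 100 readings from their running sum, as in analogRead100().
// The original writes the divisor as 100l, which is the constant 100 with
// a long suffix; 100UL is used here to make that easier to read.
constexpr unsigned int averageOf100(unsigned long sum)
{
  return static_cast<unsigned int>(sum / 100UL);
}
```

Since the divisor is 100, the full-scale average comes out as 102300 / 100 = 1023, and only a sum below 100 truncates to 0.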
"Pete, it's a fool looks for logic in the chambers of the human heart." Ulysses Everett McGill.
Do not send technical questions via personal messaging - they will be ignored.

Nick Gammon

#5
Jul 09, 2013, 09:30 am Last Edit: Jul 09, 2013, 10:13 am by Nick Gammon Reason: 1
I modified your sketch so it would compile:

Code:

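const byte LED = 13;

// Running totals of the averaged readings (never reset in this test).
long Iavg=0, Vavg=0;

void setup ()
 {
 pinMode (LED, OUTPUT);
 }
 
// Take 100 readings from one analog pin and return their average.
unsigned int analogRead100 (byte reg)
{
 unsigned long analog = 0;
 for (byte ai = 0; ai < 100; ai++)
 {
   analog += analogRead (reg);
 }
 analog /= 100L;   // 100L: the constant 100 with type long
 return analog;
}

void loop ()
{
 Iavg += analogRead100 (A0);
 Vavg += analogRead100 (A1);
 // Toggle pin 13 once per pass (200 readings) so the loop time can be measured.
 digitalWrite (LED, ! digitalRead (LED));
}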


Pin 13 toggled every 22.48 ms. Since it took 200 samples in that time, each one took 22.48 / 200 = 0.1124 ms = 112.4 µs.

Working in ms, you would therefore get:

Code:

5000 / 0.1124 = 44484


44484 samples per 5 seconds. I predicted above that the analog reads alone should give you:

Code:

9615 * 5 = 48075


So you are 3591 samples short of the target, but in that loop you are also doing 200 "long" additions and 2 "long" divisions, plus the loop overhead and the pin toggle, so that would take time.

So on the whole I would say it is working to spec.
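
The same arithmetic can be done in integers, which avoids rounding surprises. A minimal sketch in plain C++, using the measured values from this post (200 reads per 22.48 ms toggle); the function name is made up for illustration.

```cpp
#include <cstdint>

// Samples taken in window_ms, given that samples_per_toggle reads complete
// in toggle_period_us. Rounds to the nearest whole sample.
constexpr uint32_t samplesInWindow(uint32_t window_ms,
                                   uint32_t samples_per_toggle,
                                   uint32_t toggle_period_us)
{
  // 64-bit intermediate so window_ms * 1000 * samples cannot overflow.
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(window_ms) * 1000u * samples_per_toggle
       + toggle_period_us / 2)
      / toggle_period_us);
}
```

samplesInWindow(5000, 200, 22480) gives 44484, which is 3591 short of 48075, a shortfall of about 7.5%.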

TanHadron

If you could somehow arrange to divide by 1024 instead of 100 it would save you a grundle of time.

MichaelMeissner

#7
Jul 09, 2013, 01:58 pm Last Edit: Jul 09, 2013, 02:26 pm by MichaelMeissner Reason: 1

Quote
If you could somehow arrange to divide by 1024 instead of 100 it would save you a grundle of time.

On most machines, division is slow. On an 8-bit machine like the AVR, division on a 64-bit type (long long) will, I would imagine, be very, very slow.

Let's see: on the AVR, the 32-bit divide is done as a loop executed 33 times of 15-19 instructions, plus some setup. Figure around 650 instructions to do a 32-bit unsigned divide (there are some optimizations for 24-bit that might reduce the number of instructions to 500 or so).

To do a 64-bit divide, the library breaks it up into a series of 32-bit divides, similar to the long division we learned in grade school, though there is an optimization if the upper half of the divisor is 0. As a rough estimate, you are probably doing 1,500 or so instructions every time you do a long long division when the divisor fits in 32 bits. If the divisor does not fit in 32 bits, you are looking at many more instructions.
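
The per-bit loop is what makes software division expensive. The following is only an illustrative C++ sketch of the shift-and-subtract idea (DivResult and divideBitSerial are names made up here); it is not the AVR runtime's actual routine. Note that unsigned long in the posted sketch is 32 bits on the AVR, so the 32-bit case is the one that applies there.

```cpp
#include <cstdint>

struct DivResult {
  uint32_t quotient;
  uint32_t remainder;
};

// Bit-serial (restoring) unsigned division: one loop pass per quotient bit.
// Precondition: divisor != 0.
DivResult divideBitSerial(uint32_t dividend, uint32_t divisor)
{
  uint32_t quotient = 0;
  uint64_t remainder = 0;  // wider than 32 bits so the shift cannot overflow
  for (int bit = 31; bit >= 0; --bit) {
    // Bring down the next dividend bit, most significant first.
    remainder = (remainder << 1) | ((dividend >> bit) & 1u);
    if (remainder >= divisor) {
      remainder -= divisor;
      quotient |= (uint32_t{1} << bit);
    }
  }
  return { quotient, static_cast<uint32_t>(remainder) };
}
```

At 15 to 19 instructions per pass, 33 passes is roughly 500 to 630 instructions before setup, in line with the estimate above.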

So some hints:

  • As TanHadron says, if you can arrange that you are dividing by a constant power of two (like 1024), the compiler can do the division as a series of shifts;

  • If you are doing the division with unsigned types, the compiler can emit simpler code where it doesn't have to worry about sign propagation;

  • If you can use 32-bit types over 64-bit types without losing bits, it will be faster (on the AVR, if you can do it in 16-bit types, it will be even faster, but not on the ARM boards);

  • If your main loop is doing lots of 64-bit or 32-bit calculations, consider getting a 32-bit board that can do 32-bit calculations in far fewer instructions (64-bit CPUs are common in laptops, desktops, and servers, but not as common in embedded environments);

  • If your main loop is compute-bound, consider using a faster processor (such as the ARM-based boards);

  • If you are doing lots and lots of divisions that aren't by constant powers of two, consider using a board that has a fast division instruction;

  • For example, the Teensy 3 is a 32-bit ARM board that is programmed using a modified Arduino IDE, is fairly low cost ($19 + shipping/handling), and has a higher clock rate (48 MHz vs. 16 MHz, with a boost to 96 MHz IIRC). The Arduino Due is another 32-bit ARM board. There are also ARM boards that more typically run the Linux operating system and have even higher clock rates (Raspberry Pi, BeagleBone Black, pcDuino, etc.).
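
To illustrate the first two hints, here is a small sketch in plain C++ (the function names are made up here). It shows that unsigned division by a constant power of two is the same as a right shift, and why the signed case needs extra work.

```cpp
#include <cstdint>

// Unsigned division by a power of two is exactly a right shift.
constexpr uint32_t divideBy128(uint32_t x) { return x / 128u; }
constexpr uint32_t shiftBy7(uint32_t x)    { return x >> 7; }

// Signed division rounds toward zero. An arithmetic right shift on a
// negative value typically rounds toward minus infinity instead, so the
// compiler has to add a correction step for signed operands.
constexpr int32_t signedDivideBy128(int32_t x) { return x / 128; }
```

For example, 128 full-scale readings sum to 130944, and both divideBy128 and shiftBy7 return 1023 for that input, while signedDivideBy128(-129) is -1 rather than -2.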

PaulS

Quote
If you are doing the division with unsigned types, the compiler can omit simpler code where it doesn't have to worry about sign propigation.

Can emit?

MichaelMeissner


Quote
If you are doing the division with unsigned types, the compiler can omit simpler code where it doesn't have to worry about sign propigation.

Can emit?

Yeah, yeah, yeah.  I noticed this and fixed it.

cptdondo

Interesting.  Thanks for the discussion.

I am trying to measure current and voltage in an automotive environment.  I have 3 sources of power:

1.  Solar panels, that have a 1500Hz PWM controller
2.  3 stage charger, that has a 3000Hz controller (not really PWM but some sort of modified / filtered PWM)
3.  Vehicle alternator

All of these produce harmonics.  Taking a single reading gives an almost random result.  The final product will incorporate an RC circuit to filter some of the higher harmonics, but still, I need to hammer away at the A/D and get enough samples to get a legitimate average before doing any calculations.

So if I can modify the code to get more readings, that's great.  The more the better.  A bigger CPU is not really an option; I am space and power constrained.  (I turn sensors on and off to save a few milliamps.)

I do a limited amount of float calcs. I need to do a lot of code cleanup and make sure my variable types are appropriate. I try to do most of my calcs in 16-bit unsigned ints, but for some I just need to use floats.

I can definitely convert most of my ints to unsigned if that will speed up calcs.  I could take 128 samples and then right-shift 7 bits if that would help speed things up.  I'll try different things.
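
A rough sketch of that idea in plain C++, untested on hardware. SampleFn is a hypothetical stand-in for a call such as analogRead on one pin, and the function names are made up here.

```cpp
#include <cstdint>

// Hypothetical sample source standing in for analogRead(pin):
// any function returning a 10-bit reading (0..1023).
using SampleFn = uint16_t (*)();

// Average 128 readings; the divide becomes a 7-bit right shift.
// The sum can reach 128 * 1023 = 130944, so it needs a 32-bit accumulator.
uint16_t average128(SampleFn readSample)
{
  uint32_t sum = 0;
  for (uint8_t i = 0; i < 128; ++i) {
    sum += readSample();
  }
  return static_cast<uint16_t>(sum >> 7);
}

// Average 64 readings: 64 * 1023 = 65472 still fits in 16 bits,
// so the accumulator can stay a 16-bit type.
uint16_t average64(SampleFn readSample)
{
  uint16_t sum = 0;
  for (uint8_t i = 0; i < 64; ++i) {
    sum += readSample();
  }
  return static_cast<uint16_t>(sum >> 6);
}
```

The 64-sample version trades some smoothing for a 16-bit accumulator, which may matter on an 8-bit AVR. Whether either one is enough averaging for the PWM noise described above would need measuring.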

Thanks!

MichaelMeissner

#11
Jul 09, 2013, 02:57 pm Last Edit: Jul 09, 2013, 03:02 pm by MichaelMeissner Reason: 1

Quote
All of these produce harmonics.  Taking a single reading gives an almost random result.  The final product will incorporate an RC circuit to filter some of the higher harmonics, but still, I need to hammer away at the A/D and get enough samples to get a legitimate average before doing any calculations.

I don't know if going to a dedicated chip to do the A/D reading, complete with smoothing over several reads and delivering the result via I2C might give you a better result than trying to do it in the Arduino.  I don't know what the Arduino does for A/D inputs behind the curtain, but I've seen complaints that it is fairly slow.


Quote
I do a limited amount of float calcs.  I need to do a lot of code cleanup and make sure my variable types are appropriate.  I try to do most of my calcs in 16-bit unsigned ints, but for some I just need to use floats.

Bear in mind, no Arduino has floating point built in, and all floating point is done via software emulation.  If you are doing floating point in your inner loop, it is perhaps better to rework the code to use integer types.  Or alternatively use hardware with floating point instructions, but those would blow your power/space budget.
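
As one example of reworking a float calculation into integers, here is a minimal sketch in plain C++ that converts a 10-bit reading to millivolts. It assumes a 5000 mV reference, which may not match your hardware, and the function name is made up here.

```cpp
#include <cstdint>

// Convert a 10-bit ADC reading to millivolts without floating point.
// Assumes a 5000 mV reference and 1024 steps, so the divide is a
// 10-bit right shift. The result is truncated, not rounded.
constexpr uint16_t countsToMillivolts(uint16_t counts)
{
  return static_cast<uint16_t>((static_cast<uint32_t>(counts) * 5000UL) >> 10);
}
```

The multiply is done in 32 bits because 1023 * 5000 does not fit in 16 bits. A reading of 512 gives 2500 mV and a reading of 1023 gives 4995 mV.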


Quote
I can definitely convert most of my ints to unsigned if that will speed up calcs.  I could take 128 samples and then right-shift 7 bits if that would help speed things up.  I'll try different things.

Note, the unsigned vs. signed is a small performance tick, but if you are up against the wall, it can give you that last speed boost.

It would be nice if the Arduino had facilities that exist in the higher end embedded processor development kits like profilers, debuggers, etc.  But it was more designed for the hobbyist market, and the tools are fairly basic.  I would guess the majority of Arduinos don't run into the performance problems, because they only do a few calculations, and then wait for the next interrupt/keypress/etc.

KeithRB

You might use an Arduino for each analog voltage input and let it sum a bunch of readings and pass it on to the main controller that performs the division.

Nick Gammon


Quote
So if I can modify the code to get more readings, that's great.  The more the better.


You are already getting more readings. I showed you can get one every 112 µs, which you haven't commented on.

Your claim that you can only get 2,000 readings in 5 seconds is not supported by my actual measurements. I got 44,484 readings in 5 seconds.

Quote
Is there some way to profile this code?  Is it unrealistic to expect 10,000 samples unless I just sample?


Over 44,000 readings is a lot more than 10,000. So rather than mucking around changing things right now you need to address this discrepancy.