DUE code compilation is too slow.

Hi, guys

I am using the Arduino IDE 1.6.13. I am trying to upload a sketch that contains a 10000-byte array.
After writing the sketch I try to compile it, but it takes about 10 minutes to compile.

How can I make this faster? The Due has sufficient memory to store this array.

Thanks
EJ

You mean like: byte bulkdata[] = {0, 1, 2, 3 ... 10000}; // (presumably with values that actually fit in a byte)

It went pretty quick for me. Can you be more specific about the formatting and the system environment? I could see a lack of newlines causing one of the preprocessing steps to get confused, for example. (I used 1000 lines with 10 integer bytes in each line...)

byte bigarray[] = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
  :

I am writing it like the following.

byte pos[10000];

void setup()
{
// setting up digital outputs for some other stuff
}
void loop()
{
pos[0] = port pin's data
pos[1] = port pin's data
pos[2] = port pin's data
.
.
.
.
.
pos[9999] = port pin's data

}

It's taking too much time to compile.

I am not surprised that it takes time. It must have taken you some time to type in 10000 “similar” statements.

Surely it would be easier to just say:

     for (int i = 0; i < 10000; i++) pos[i] = port_pin_data;

Let the compiler generate the code. You could possibly give a hint e.g. use pointers.
I would guess that the Compiler will be better than you.

If you are worried about execution time, DMA would handle the whole sequence by magic. (if you really are just recording a port pin)

David.
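A minimal sketch of the loop-based capture suggested above. The register name REG_PIOD_PDSR is borrowed from the code posted later in this thread. The helper names read_port and capture, and the host-side stand-in, are assumptions added for illustration only.

```cpp
#include <stddef.h>
#include <stdint.h>

#if defined(ARDUINO_ARCH_SAM)
#include <Arduino.h>
// On the Due, read the pin data status register of port D directly.
static inline uint32_t read_port(void) { return REG_PIOD_PDSR; }
#else
// Host-side stand-in (hypothetical) so the loop can be tried off-target.
static volatile uint32_t fake_port = 0xA5;
static inline uint32_t read_port(void) { return fake_port; }
#endif

static uint8_t pos[10000];

// Fill buf with n consecutive port samples, keeping only the low byte.
void capture(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        buf[i] = (uint8_t)read_port();
    }
}
```

In a sketch this would be called as capture(pos, sizeof(pos)); the compiler then generates one short loop instead of 10000 separate statements.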

Thank you, David.

I know the above solution, but it takes time in execution. I cannot tolerate that time loss in my measurement.

Each iteration of the for loop might take 1 microsecond minimum, so a total loss of 10 milliseconds over 10000 iterations.

Is there any other way to solve this without losing much time?

Thanks

Hmm. Even with 10000 individual lines:

  :
bigarray[9995] = digitalRead(13);
bigarray[9996] = digitalRead(13);
bigarray[9997] = digitalRead(13);
bigarray[9998] = digitalRead(13);
bigarray[9999] = digitalRead(13);

I was able to compile in far less than one minute. How long is your entire sketch? Can you zip it up and attach?
(Go EMACS. C-x r + C-x r i for the win!)

EJLED: each iteration of the for loop might take 1 microsecond minimum

Um. No. A microsecond is not likely. And you’ll probably hit indeterminate timing due to the flash buffer…
(Measured it with a scope. I see about 40ns loop overhead.)

void loop() {
  for (int i = 0; i < 10000; i++) {
    outPort->PIO_SODR = outMask;
    outPort->PIO_CODR = outMask;
  }
}
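To put the two figures side by side, here is a small arithmetic helper. It assumes a constant per-iteration cost, which (as noted above) the flash buffer may not guarantee. The function name total_ns is made up for this illustration.

```cpp
#include <stdint.h>

// Total time in nanoseconds for a number of loop iterations,
// assuming every iteration costs the same number of nanoseconds.
uint64_t total_ns(uint32_t iterations, uint32_t ns_per_iteration)
{
    return (uint64_t)iterations * ns_per_iteration;
}
```

With the feared 1 microsecond per iteration, total_ns(10000, 1000) gives 10000000 ns (10 ms). With the roughly 40 ns of loop overhead seen on the scope, total_ns(10000, 40) gives 400000 ns (0.4 ms) of overhead across the whole capture.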

EJLED: it is taking about 10 minutes to compile.

Any chance you have Trend Micro anti-virus?

Please find the code file attached.

I also want to know: is the data I read stored in the 96 KB of SRAM?

Kindly go through the code and help.
Thanks

Due_forum.ino (264 KB)

I get some compile errors:

error: 'PMC' was not declared in this scope
error: 'PMC_PCER0_PID13' was not declared in this scope
error: 'PIOB' was not declared in this scope
error: 'PIO_PB25' was not declared in this scope

It is for Due…
The compile worked for me. Took about 1 minute.

The code produced is “interesting.”

 ad [3930 ]=REG_PIOD_PDSR;
   85e3a:       6819            ldr     r1, [r3, #0]
   85e3c:       f882 1f5a       strb.w  r1, [r2, #3930] ; 0xf5a
 ad [3931 ]=REG_PIOD_PDSR;
   85e40:       6819            ldr     r1, [r3, #0]
   85e42:       f882 1f5b       strb.w  r1, [r2, #3931] ; 0xf5b
 ad [3932 ]=REG_PIOD_PDSR;
   85e46:       6819            ldr     r1, [r3, #0]
   85e48:       f882 1f5c       strb.w  r1, [r2, #3932] ; 0xf5c

Ok; that’s what one expects. Load, Store via offset from array beginning.

But every once in a while, THIS happens. It seems to be trying to make sure that the constants used for the array beginning and pio structure stay “within reach” of the PC-relative instructions…

   85e4c:       e004            b.n     85e58 <loop+0x5d10>
   85e4e:       bf00            nop
   85e50:       400e143c        .word   0x400e143c          ;; address of piod_pdsr
   85e54:       200708f8        .word   0x200708f8           ;; address of ad
 ad [3933 ]=REG_PIOD_PDSR;
   85e58:       6819            ldr     r1, [r3, #0]
   85e5a:       f882 1f5d       strb.w  r1, [r2, #3933] ; 0xf5d
 ad [3934 ]=REG_PIOD_PDSR;
   85e5e:       6819            ldr     r1, [r3, #0]
   85e60:       f882 1f5e       strb.w  r1, [r2, #3934] ; 0xf5e
:
 ad [4006 ]=REG_PIOD_PDSR;
   8600e:       6819            ldr     r1, [r3, #0]
   86010:       f882 1fa6       strb.w  r1, [r2, #4006] ; 0xfa6
 ad [4007 ]=REG_PIOD_PDSR;

And then every once in a while, the compiler seems to randomly choose a different register as the temp, and has to reload it

   86014:       681b            ldr     r3, [r3, #0]     ;;  What?  r3?   We've been using R1!!
   86016:       f882 3fa7       strb.w  r3, [r2, #4007] ; 0xfa7
 ad [4008 ]=REG_PIOD_PDSR;
   8601a:       4bc2            ldr     r3, [pc, #776]  ; (86324 <loop+0x61dc>)   ;; reload R3
   8601c:       6819            ldr     r1, [r3, #0]
   8601e:       f882 1fa8       strb.w  r1, [r2, #4008] ; 0xfa8
 ad [4009 ]=REG_PIOD_PDSR;
   86022:       6819            ldr     r1, [r3, #0]
   86024:       f882 1fa9       strb.w  r1, [r2, #4009] ; 0xfa9
:
 ad [4095 ]=REG_PIOD_PDSR;
   86226:       6819            ldr     r1, [r3, #0]
   86228:       f882 1fff       strb.w  r1, [r2, #4095] ; 0xfff

Then at offset 4096, we run out of room in the ARM indexed store instruction, so the compiler switches to a different instruction sequence (double indexed, with a constant loaded into the 2nd index register)

 ad [4096 ]=REG_PIOD_PDSR;
   8622c:       6818            ldr     r0, [r3, #0]
   8622e:       f502 5180       add.w   r1, r2, #4096   ; 0x1000
   86232:       7008            strb    r0, [r1, #0]
 ad [4097 ]=REG_PIOD_PDSR;
   86234:       6818            ldr     r0, [r3, #0]
   86236:       f241 0101       movw    r1, #4097       ; 0x1001
   8623a:       5450            strb    r0, [r2, r1]
 ad [4098 ]=REG_PIOD_PDSR;
   8623c:       6818            ldr     r0, [r3, #0]
   8623e:       f241 0102       movw    r1, #4098       ; 0x1002
   86242:       5450            strb    r0, [r2, r1]
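The switch at index 4096 follows from the 12-bit unsigned immediate in the strb.w rX, [rY, #imm] form seen in the listing above. A tiny check of that limit, with names invented for this illustration:

```cpp
#include <stdint.h>

// Largest byte offset that fits the 12-bit unsigned immediate of the
// Thumb-2 "strb.w rX, [rY, #imm]" form shown in the disassembly.
static const uint32_t kMaxImm12Offset = 4095;

// True if a store to base + byte_offset can use the single-instruction form.
bool fits_immediate_store(uint32_t byte_offset)
{
    return byte_offset <= kMaxImm12Offset;
}
```

fits_immediate_store(4095) is true and fits_immediate_store(4096) is false, which matches where the generated code changes shape.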

In comparison, a for loop is able to make use of autoincremented addressing, and does:

  for (int i = 0; i < sizeof(bigarray); i++) {
   bigarray[i] = REG_PIOD_PDSR;
   80156:       6802            ldr     r2, [r0, #0]    ;; load
   80158:       f803 2f01       strb.w  r2, [r3, #1]!   ;; store with auto-inc
   8015c:       428b            cmp     r3, r1          ;; compare against pre-loaded limit
   8015e:       d1fa            bne.n   80156 <loop+0xe> ;; loop.
  }
p = &bigarray[0];
*p++ = REG_PIOD_PDSR;
*p++ = REG_PIOD_PDSR;
*p++ = REG_PIOD_PDSR; // repeat 9996 more times...
   :
*p++ = REG_PIOD_PDSR;

My bad, the IDE had reverted to Uno, despite never even owning one :D

It took 35 seconds to compile. I don't suppose EJLED has an old PC, a slow drive, or is perhaps compiling over a network?

It may be helpful for us to know what the aim of this is, we might be able to suggest alternative ideas.

I have noticed there are times when "something" happens in my code: a small delay seems to be inserted into my loop. I can only assume this is an interrupt I don't know about, which would potentially interfere with your results. (Q: does anyone know where I can find out what interrupts are running? Are there standard ones which always run?)

I am using Windows 8 (64-bit) with 4 GB RAM.

Westfw, I do not understand what you are trying to say.

Thanks

I wrote a little sketch to compare different styles of loop:

#define ITERS 100
#if defined(AVR)
#define INPUT_PORT PINC
#else
#define INPUT_PORT PIOC->PIO_PDSR
#endif

uint8_t global_array[1000];

uint32_t ary_ptr(void)
{
    uint32_t start = millis();
    for (int iter = 0; iter < ITERS; iter++) {
        int16_t count = 1000;
        uint8_t *p = &global_array[0];
        while (count--) {
            *p++ = INPUT_PORT;
        }
    }
    return millis() - start;
}

uint32_t ary_unroll(void)
{
    uint32_t start = millis();
    for (int iter = 0; iter < ITERS; iter++) {
        int16_t count = 100;
        uint8_t *p = &global_array[0];
        while (count--) {
            *p++ = INPUT_PORT;
            *p++ = INPUT_PORT;
            *p++ = INPUT_PORT;
            *p++ = INPUT_PORT;
            *p++ = INPUT_PORT;
            *p++ = INPUT_PORT;
            *p++ = INPUT_PORT;
            *p++ = INPUT_PORT;
            *p++ = INPUT_PORT;
            *p++ = INPUT_PORT;
        }
    }
    return millis() - start;
}

uint32_t ary_idx(void)
{
    uint32_t start = millis();
    for (int iter = 0; iter < ITERS; iter++) {
        for (int count = 0; count < 1000; count++) {
            global_array[count] = INPUT_PORT;
        }
    }
    return millis() - start;
}

void report(const char *name, uint32_t (*f)(void))
{
    Serial.print(name);
    Serial.print(" takes ");
    Serial.print((1000.0 / ITERS) * (*f)());
    Serial.println("ns");
}

void setup()
{
    Serial.begin(9600);
    while (!Serial) ;
    Serial.println("Sequence timing");
    report("ary_ptr", ary_ptr);
    report("ary_unroll", ary_unroll);
    report("ary_idx", ary_idx);

}

void loop()
{
}

I have not bothered to look at the generated ASM.
It is interesting to note that on an AVR, indexing is faster than pointers.
And the ARM produces the same timing.

Yes, unrolling the loop is worth doing. There is little point in unrolling all 1000 (or 10000) accesses.
Just unrolling 10 or 16 is sufficient to achieve a reasonable saving.

As I said, I have not studied the ASM. But with an AVR, you would have an IN r16,PINC / ST Z+,r16 which is 3 cycles (188 ns @ 16MHz)
With an ARM, you still have to read the peripheral into a register and increment it in software. And have some loop control.
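The cycle arithmetic above can be checked with a one-line conversion. The function name cycles_to_ns is an assumption for this illustration; the 84 MHz figure is the Due's standard CPU clock.

```cpp
#include <stdint.h>

// Convert a cycle count at a given clock frequency (Hz) to nanoseconds.
double cycles_to_ns(uint32_t cycles, double clock_hz)
{
    return cycles * 1e9 / clock_hz;
}
```

cycles_to_ns(3, 16e6) gives 187.5, which rounds to the 188 ns quoted for the AVR. On an 84 MHz Due, a single cycle works out to a little under 12 ns, so the real cost there depends on how many cycles the load, store and loop control take, including any flash wait states.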

If you are attempting to make a Logic Analyser, be realistic.
A $10 Saleae clone can sample in 42ns and transfer an unlimited sequence to your PC.
Faster Logic Analysers will cost you more.

I am sure that you can use your Due with DMA. This should be faster than the Saleae clone.
A software loop will struggle to compete.

David.

Thanks, David. Can you guide me on how to use the Due with DMA? How do I do that? My application is to measure particle movements, using some pressure sensors.

May I ask what pressure sensors are you using? (are they really high bandwidth 1 bit digital?)

I doubt if silver bullets need to be measured faster than 24000000 times a second.

Nuclear bombs work pretty fast. Perhaps the Due is being used in the armaments industry.

David.

David, you are aiming too high. I am not using this in any such application as you mentioned. It's a simple condenser microphone. It's a school project to study the phenomenon of boiling pressurised water. As the water fed into the vessel heats up, bubbles form; as it cools, they disappear. I want to measure that bubble pressure during heating and cooling. It may be hypothetical, but I just want to try it. I hope you understand now. Thanks

With any project, you need to sit down with pencil and paper. Write down a few typical numbers.

For example, your steam pressure probably builds over a 10 minute cycle, i.e. 600 seconds. You might want to log the pressure at 200 ms intervals (3000 samples). 5 samples per second is a lot easier than 24000000 samples per second.

You might want to sample more frequently and record delta and timestamp.
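One way to read "record delta and timestamp" is to store an entry only when the reading has moved by at least some threshold since the last stored value. The following is a sketch under that assumption. The types LogEntry and DeltaLogger and the function log_sample are hypothetical names, and readings are assumed to be small enough (for example 12-bit ADC values) that a delta fits in 16 bits.

```cpp
#include <stddef.h>
#include <stdint.h>

struct LogEntry {
    uint32_t t_ms;   // timestamp of the change, in milliseconds
    int16_t  delta;  // change since the previously logged value
};

struct DeltaLogger {
    LogEntry *entries;   // caller-provided storage
    size_t    capacity;  // number of entries available
    size_t    count;     // number of entries used so far
    int16_t   last_value;
    int16_t   threshold; // minimum absolute change worth logging
};

// Store (timestamp, delta) only when the value moved by at least threshold.
// Returns true if an entry was written.
bool log_sample(DeltaLogger &log, uint32_t t_ms, int16_t value)
{
    int32_t delta = (int32_t)value - log.last_value;
    int32_t magnitude = delta < 0 ? -delta : delta;
    if (magnitude < log.threshold || log.count >= log.capacity) {
        return false;
    }
    LogEntry e;
    e.t_ms = t_ms;
    e.delta = (int16_t)delta;
    log.entries[log.count++] = e;
    log.last_value = value;
    return true;
}
```

On the board this might be called from loop() as log_sample(logger, millis(), analogRead(A0)), where A0 is only a placeholder for whichever analog pin the sensor is actually wired to.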

Quite honestly, your Due can solve your problem in several different ways.

David.

EJLED: Westfw, I do not understand what you are trying to say.

Yeah, I thought that might be a problem. If you're really trying to eke out the maximum possible sample rate, you SHOULD reach a point where you can understand it. But here's a summary:

  • your current method does not sample the pins at a consistent rate. Samples above 4095 will be taken at a different rate than samples below 4095, and there are somewhat random "other" spots where an interval will be different from those nearby. I think we're talking about 50 to 100% variation; I'd assume that that isn't a good thing.
  • A loop provides more consistent timing, and would have "similar" overall speed.
  • actual speed is difficult to determine. because ARM. And Flash wait states/buffering. And stuff.
  • While your current method is probably slightly faster than a loop, it does not result in the fastest possible sampling.

high bandwidth 1 bit digital?

They're reading a full byte at a time (actually, reading MORE than a byte, but only saving one byte.)

does anyone know where I can find out what interrupts are running? Are there standard ones which always run?

The Arduino core usually enables a 1ms SysTick timer interrupt, and interrupts on any serial ports that are in use. Since you can expect your 10000 samples to be read in less than 5ms, you can consider simply disabling interrupts for the duration, I guess.
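A sketch of what disabling interrupts for the duration could look like, using the Arduino noInterrupts() and interrupts() calls. The helper names and the host-side stand-ins are assumptions for illustration, and REG_PIOD_PDSR is simply the register used earlier in the thread.

```cpp
#include <stddef.h>
#include <stdint.h>

#if defined(ARDUINO_ARCH_SAM)
#include <Arduino.h>
static inline uint32_t read_port(void) { return REG_PIOD_PDSR; }
#else
// Host-side stand-ins (hypothetical) so the logic can be tried off-target.
static volatile uint32_t fake_port = 0x5A;
static bool irq_enabled = true;
static inline uint32_t read_port(void) { return fake_port; }
static inline void noInterrupts(void) { irq_enabled = false; }
static inline void interrupts(void) { irq_enabled = true; }
#endif

// Capture n samples with interrupts masked, then re-enable them.
void capture_uninterrupted(uint8_t *buf, size_t n)
{
    noInterrupts();               // keep SysTick and serial ISRs out of the loop
    for (size_t i = 0; i < n; i++) {
        buf[i] = (uint8_t)read_port();
    }
    interrupts();                 // restore normal operation
}
```

While interrupts are masked the SysTick handler cannot run, so millis() will not advance during the capture. That is probably acceptable for a burst of a few milliseconds, but it is worth keeping in mind when timing the capture itself.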

westfw: They're reading a full byte at a time (actually, reading MORE than a byte, but only saving one byte.)

Ah, OK, that wasn't obvious to me. A byte of what though? If there's an ADC connected to those pins, where's the synchronisation? Or is there something else I've missed :D

westfw: The Arduino core usually enables a 1ms SysTick timer interrupt, and interrupts on any serial ports that are in use. Since you can expect your 10000 samples to be read in less than 5ms, you can consider simply disabling interrupts for the duration, I guess.

That's good to know. In my sketch I'm using micros(); from what I've managed to google, this relies on the SysTick interrupt. Hmmm, food for thought. I'll stop hijacking this thread now :) But it does highlight that the code will be interfered with by interrupts, completely borking the continuous fast sampling.