What is the fastest way to read/write GPIOs on SAMD21 boards?

I have made a custom SAMD21 board. I am still trying to get a bootloader onto it, so I won't be able to test things for a little while.

I have read that Arduino's digitalRead() and digitalWrite() are very slow on some SAMD21 boards, so I would like to write my own. My SAMD21 board only exposes PORTA 0 - 31, which makes things a bit easier.

So I would like to know the fastest possible way to code GPIO controls such as setting a pin as output/input, enabling the internal pull-up/pull-down, setting the pin high/low, and reading the pin.

I have already found some code for reading; please correct me if I'm wrong.

reading a PORTA pin:

bool fastRead(uint8_t port_a_number){
   switch(port_a_number){
      case 0:
          if (REG_PORT_IN0 & PORT_PA00)
             return true;
          else
             return false;
      case 1:
          if (REG_PORT_IN0 & PORT_PA01)
             return true;
          else
             return false;

. . . . 

      case 31:
          if (REG_PORT_IN0 & PORT_PA31)
             return true;
          else
             return false;
      default:
          return false; // pin number outside PORTA 0 - 31
   }
}

Where did you read that?
What is slow for you? 25 milliseconds or 2 picoseconds?

Let me write the opposite for you:
The SAMD branch of Arduino boards (MKR Zero and others) has a fast digitalRead() and digitalWrite(). For the AVR branch of Arduino boards (Uno and others) there is a benefit to using direct I/O instructions, but for the SAMD branch not so much.

So now you can say that you have read that digitalRead() and digitalWrite() are fast for SAMD boards :wink:

@Koepel weeell, I did say I am still trying to upload a bootloader to my board, so I cannot test things for myself, hehe.

But it would be great to include them in my library so that the library will be independent of Arduino.

The goal is to use Arduino code that can be used on any board. That means that the code is fully dependent on Arduino libraries and the Arduino build environment.
Trying to make things "better" for no reason is just not right. If you write your own code (for no good reason) then we cannot help if you have a problem. If you already start to look for trouble before the board is even running, then you will never finish your project.

If you don't want to rely on Arduino, then you should write all your own libraries and your own code.

Suppose your board is running, and you really want that extra microsecond. That is possible of course. I always look at the OneWire library to see how they did the low level things. Look for the '__SAMD21G18A__'. But you have to have a very good reason to do that, because it is wrong if you do that for no good reason.

I have actually studied this issue and the much greater overhead in the SAMD21 vs the AVR based Arduino boards. See my blog post, but if TL;DR: the minimum pulse width you can create with the SAMD21 (at 48MHz) is 1.5µs using digitalWrite(pin,1); digitalWrite(pin,0). Writing directly to the port, the width is 20ns, which is about 75 times faster.

It doesn't really fare well compared to the AVR microcontrollers, which can do this sort of thing far more efficiently. Just changing the pin value takes 5 instructions because values must be loaded into registers before actually writing to the port. So it takes 100ns to change a pin value. With the Arduino Nano Every (ATmega4809) it can be done in a single instruction taking 62ns. Minimum pulse width is 1.2µs using digitalWrite and 62ns accessing the port directly.
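The nanosecond figures quoted here follow from the clock period. As a rough check, the sketch below converts cycle counts to time. It assumes one clock cycle per instruction (a simplification, since loads and stores over the normal bus can take longer) and a 16 MHz clock for the AVR; the name cyclesToPs is made up for this illustration.

```cpp
#include <cstdint>

// Duration of `cycles` clock cycles at `hz`, in picoseconds (integer, truncated).
// Hypothetical helper, not part of any Arduino or CMSIS header.
constexpr uint64_t cyclesToPs(uint64_t cycles, uint64_t hz) {
    return cycles * 1000000000000ULL / hz;
}

constexpr uint64_t kSamd21Hz = 48000000ULL; // 48 MHz, as stated in the post
constexpr uint64_t kAvrHz    = 16000000ULL; // assumed 16 MHz AVR clock

// One cycle at 48 MHz is about 20.8 ns: the "20ns" direct-write pulse.
static_assert(cyclesToPs(1, kSamd21Hz) == 20833, "1 cycle @ 48 MHz");
// Five single-cycle instructions are about 104 ns: the "100ns" figure.
static_assert(cyclesToPs(5, kSamd21Hz) == 104166, "5 cycles @ 48 MHz");
// One cycle at 16 MHz is 62.5 ns: the "62ns" AVR figure.
static_assert(cyclesToPs(1, kAvrHz) == 62500, "1 cycle @ 16 MHz");
```

So the 20ns pulse corresponds to a single 48 MHz clock cycle between the two stores.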

If you are doing this yourself, remember to access the port using PORT_IOBUS, otherwise you end up with wait states being added and it takes 4x longer to execute.

So what I read was correct, then: there is some abysmal overhead in what Arduino is doing.

Moving on, I found this documentation, which pretty much has examples for all possible GPIO related situations.

The documentation mentioned above does not use PORT_IOBUS. Is the performance about the same?

There are two available buses that can reference the GPIO registers. These are defined as PORT and PORT_IOBUS. PORT_IOBUS provides a fast path that can be accessed while the next instruction is being fetched over the other bus.

The documentation you reference calls a function (or perhaps a macro) port_pin_set_output_level and you would have to find its definition to see how it is implemented. I cannot find that function in any of the files distributed from Arduino.

I do know that a function that takes the level as an argument is potentially less efficient than one that strictly sets or clears the bit, since a different register is used for setting and clearing. What I have implemented is a set of macros, one for each pin, that indicates the group and bit position. For instance digital pin 4 is bit 7 of the first group:

#define digital_4 (7)

Pins on the second group, I just add 32 to the bit position.

Then I define two macros to decode the group index and bit mask from the earlier constant value for the pin:

#define PBMask(_x) (1u << ((_x)&0x1f))
#define PGrp(_x) ((_x) >> 5)

Then I define macros to set clear or toggle the bit:

#define OutSet(_x) (PORT_IOBUS->Group[PGrp(_x)].OUTSET.reg = PBMask(_x))
#define OutClr(_x) (PORT_IOBUS->Group[PGrp(_x)].OUTCLR.reg = PBMask(_x))
#define OutTgl(_x) (PORT_IOBUS->Group[PGrp(_x)].OUTTGL.reg = PBMask(_x))

So I can set and then clear digital pin 4 with:

OutSet(digital_4);
OutClr(digital_4);

If the value I want to output is a variable, then I have to use this macro:

#define OutPin(_x, _val) do { if (_val) OutSet(_x); else OutClr(_x); } while (0)

There are other macros that work similarly for configuration, direction and input. Not everything is accessible via PORT_IOBUS, then PORT must be used instead. This is all significantly more complicated than the AVR microcontrollers!
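As a rough sketch of what such configuration helpers could look like (this is not the poster's code), the example below sets direction and pull resistors. It assumes the usual SAMD21 scheme: DIRSET/DIRCLR select direction, PINCFG bit 1 (INEN) enables the input buffer, PINCFG bit 2 (PULLEN) enables the pull resistor, and the OUT bit then chooses pull-up (1) or pull-down (0). MockPortGroup and the helper names are hypothetical stand-ins so the code can run on a PC; on the real chip the writes would target PORT->Group[0], and PINCFG is presumably one of the registers that has to go through PORT rather than PORT_IOBUS.

```cpp
#include <cstdint>

// Host-side stand-in for one PORT group (hypothetical, simplified layout).
// Unlike the hardware, writes here are simply stored, not acted upon.
struct MockPortGroup {
    volatile uint32_t DIRCLR = 0, DIRSET = 0;
    volatile uint32_t OUTCLR = 0, OUTSET = 0;
    volatile uint8_t  PINCFG[32] = {};
};

constexpr uint8_t kPincfgInen   = 1u << 1; // INEN: enable the input buffer
constexpr uint8_t kPincfgPullen = 1u << 2; // PULLEN: enable the pull resistor

inline uint32_t pinMask(uint8_t bit) { return 1u << (bit & 0x1f); }

// Drive the pin as an output.
inline void pinOutput(MockPortGroup &g, uint8_t bit) {
    g.DIRSET = pinMask(bit);
}

// Floating input: direction cleared, input buffer on, no pull.
inline void pinInput(MockPortGroup &g, uint8_t bit) {
    g.DIRCLR = pinMask(bit);
    g.PINCFG[bit & 0x1f] = kPincfgInen;
}

// Input with pull-up: pull enabled and OUT bit set.
inline void pinInputPullup(MockPortGroup &g, uint8_t bit) {
    g.DIRCLR = pinMask(bit);
    g.PINCFG[bit & 0x1f] = static_cast<uint8_t>(kPincfgInen | kPincfgPullen);
    g.OUTSET = pinMask(bit);
}

// Input with pull-down: pull enabled and OUT bit cleared.
inline void pinInputPulldown(MockPortGroup &g, uint8_t bit) {
    g.DIRCLR = pinMask(bit);
    g.PINCFG[bit & 0x1f] = static_cast<uint8_t>(kPincfgInen | kPincfgPullen);
    g.OUTCLR = pinMask(bit);
}
```

Each helper is a handful of stores, so configuration done once in setup() should not matter for speed; only the set/clear/read paths are worth tuning.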

Hmmm, I can't seem to find the port_pin_set_output_level you are talking about. The examples, I think, access the registers directly, which looks pretty straightforward:

to set a pin HIGH

REG_PORT_DIRSET1 = PORT_PB27; // set PB27 to output
REG_PORT_OUTSET1 = PORT_PB27; // set PB27 HIGH

to set a pin LOW

REG_PORT_DIRSET1 = PORT_PB27; // set PB27 to output
REG_PORT_OUTCLR1 = PORT_PB27; // set PB27 LOW

and to toggle

REG_PORT_DIRSET1 = PORT_PB27; // set PB27 to output
REG_PORT_OUTTGL1 = PORT_PB27; // toggle PB27

Am I right, or am I missing something here?

What I think I will actually do is wrap this in a macro, so that if somebody calls, let's say, digitalWrite(x,y), it is replaced with the corresponding two lines of code like above. For now that is a problem for another day, since I have no idea how to write advanced macros yet.

Unfortunately, that's not true. In addition to some of the "it takes multiple instructions" comments that almytom's blog points out (not entirely correctly, alas), the SAMD digitalWrite() function also chooses to inherit some backward-compatibility issues from Uno that take an unfortunate amount of time (check if PWM was on, and turn it off. Check if the pinMode is INPUT, in which case enable the pullup instead of doing an output. Sigh.) It's also complicated by the fact that the AVR does "link time optimization", and the SAMD doesn't. Which saves the overhead of a function call, "some of the time."

For a custom SAMD21 board with consecutively numbered bits on PORTA, you can do the fastest read with something like:

static inline boolean fastRead(int bitnum) {
  return !! (PORT_IOBUS->Group[0].IN.reg & (1<<bitnum));
}

and write with:

static inline void fastWrite(int bitnum, int val) {
  if (val)
    PORT_IOBUS->Group[0].OUTSET.reg = (1<<bitnum);
  else
    PORT_IOBUS->Group[0].OUTCLR.reg = (1<<bitnum);
}

The output function yields pulses of about 40ns... It perhaps should be 20ns, but apparently the compiler is re-ordering some instructions because it thinks RAM access is slow (but IOBUS isn't. Sigh.)

That's not as much faster than an AVR as we might like, but it's a lot better than the default of about 1.5us. Note that an Uno using the standard digitalWrite() functions produces a pulse of about 3us.

Note that optimization comes seriously into play here. If you manage to defeat LTO on the AVR (say, your pin number is not a constant), it'll be slower. If you don't make the ARM functions inline, it'll be slower. If the two writes are far enough apart that the loaded addresses and constants fall out of registers, it'll get slower. It's all very ... not as deterministic as one might like.
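One way to make sure the pin number is a compile-time constant and the call is inlinable is to pass the bit as a template parameter. The sketch below is only an illustration of that idea, not code from this thread: MockPortGroup is a hypothetical host-side stand-in for PORT_IOBUS->Group[0], and fastSet, fastClear and fastPulse are made-up names.

```cpp
#include <cstdint>

// Host-side stand-in for the two registers used here (hypothetical).
// On the board these would be PORT_IOBUS->Group[0].OUTSET.reg / OUTCLR.reg.
struct MockPortGroup {
    volatile uint32_t OUTCLR = 0;
    volatile uint32_t OUTSET = 0;
};

// The bit number is a template argument, so the mask is always a constant
// and a bad pin number is rejected at compile time.
template <uint8_t Bit>
inline void fastSet(MockPortGroup &g) {
    static_assert(Bit < 32, "PORTA only has bits 0..31");
    g.OUTSET = (1u << Bit);
}

template <uint8_t Bit>
inline void fastClear(MockPortGroup &g) {
    static_assert(Bit < 32, "PORTA only has bits 0..31");
    g.OUTCLR = (1u << Bit);
}

// Shortest pulse: two back-to-back stores with the same mask.
template <uint8_t Bit>
inline void fastPulse(MockPortGroup &g) {
    fastSet<Bit>(g);
    fastClear<Bit>(g);
}
```

Whether the two stores actually end up adjacent still depends on the compiler, as noted above, so it is worth checking the generated code.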

almytom: do you want a more detailed critique of your blog entry? Here, or elsewhere? Most seriously, this part:

it takes that single instruction to store a variable boolean value into the port of the AVR, this cannot be done with the ARM. AVR:

port.digital_13 = value;

Isn't true. The AVR can set or clear single bits, but setting a bit to a particular value is about the same as an ARM (or worse, on the older AVRs.)

Heh. This will do 20ns pulses...

static inline void fastWrite(int bitnum, int val) {
  if (val)
    PORT_IOBUS->Group[0].OUTSET.reg = (1 << bitnum);
  else
    PORT_IOBUS->Group[0].OUTCLR.reg = (1 << bitnum);
  asm volatile ("" ::: "memory");
}

(gross)

I find the structure pointer method quite awkward to use, as I'm more used to the traditional notation.

Is the pointer method faster than direct register access, or about the same?

For example, the code you made

vs

OK, I see the REG_PORT_* references and looking them up they aren't using PORT_IOBUS. Other than that it will generate the same code as the macros I use. (that also applies to westfw's code).

You are right, of course. I should have said "constant" instead of "variable". And since I was particularly interested in minimum pulse generation, I wasn't concerned with a variable value. The change to having separate set, clear, and toggle registers in the newer AVRs was really necessitated by the limited number of addressable bits and the need to have atomic bit set and clear operations with the larger number of ports. The ugly case of the ATmega2560, which has some ports whose bits can be set or cleared individually but others that can't, shows the limitations of the original design.

I do see a pulse width of 20ns, so for some reason there is a difference in the code generated by my macros vs your inline function (which presumably will optimize out the if statement if val is constant). The short pulse does rely that all the register loading of address and mask is done prior to the stores into OUTSET and OUTCLR. I see that is the case for the macros.

I did also notice that if I generate a string of pulses, occasionally but deterministically and seemingly for no reason there will be a delay of 20ns thrown in, as though there is an instruction fetch not being done in parallel. So yes the timing isn't consistent.

Ah. Good question.
The "pointer method" is slightly better (neglecting, for the moment, that there isn't a REG_ definition for the IOBUS (there could be.))

All of that fancy pointer de-referencing and indexing in PORT_IOBUS->Group[1].OUTSET.reg happens at compile time. So if you have:

void trystruct() {
     PORT->Group[0].OUTSET.reg = (1 << 21);
}
void tryREG() {
     REG_PORT_OUTSET0 = (1 << 21);
}

They will compile to essentially EXACTLY the same code:

void trystruct() {
  PORT->Group[0].OUTSET.reg = (1 << 21);
    20fc:       2280            movs    r2, #128        ;construct constant 1<<21
    20fe:       4b02            ldr     r3, [pc, #8]    ; get address of PORT
    2100:       0392            lsls    r2, r2, #14     ; more constant construction (7+14 = 21)
    2102:       619a            str     r2, [r3, #24]   ; store to PORT.OUTSET.reg offset
    2104:       4770            bx      lr              ;return
    2106:       46c0            nop                     ;  (padding)
    2108:       41004400        .word   0x41004400      ;address of PORT0

void tryREG() {
  REG_PORT_OUTSET0 = (1 << 21);
    210c:       2280            movs    r2, #128        ; 0x80
    210e:       4b02            ldr     r3, [pc, #8]    ; address of  (individual) register
    2110:       0392            lsls    r2, r2, #14
    2112:       601a            str     r2, [r3, #0]
    2114:       4770            bx      lr
    2116:       46c0            nop
    2118:       41004418        .word   0x41004418      ; address of PORT_OUTSET0

However, if you expand the source code to generate a pulse:

void trystruct() {
  PORT->Group[0].OUTSET.reg = (1 << 21);
  PORT->Group[0].OUTCLR.reg = (1 << 21);
}
void tryREG() {
  REG_PORT_OUTSET0 = (1 << 21);
  REG_PORT_OUTCLR0 = (1 << 21);
}

The pointer-based version "knows" that OUTSET and OUTCLR are reachable via the same base PORTGroup pointer, while the register-based code thinks they are independent and needs to load them separately:

void trystruct() {
  PORT->Group[0].OUTSET.reg = (1 << 21);
    20fc:       2280            movs    r2, #128        ; constant
    20fe:       4b02            ldr     r3, [pc, #8]    ;base address of portgroup 0
    2100:       0392            lsls    r2, r2, #14     ; constant
    2102:       619a            str     r2, [r3, #24]   ; store to portgroup.outset
  PORT->Group[0].OUTCLR.reg = (1 << 21);
    2104:       615a            str     r2, [r3, #20]   ; store to portgroup.outclr
    2106:       4770            bx      lr              ; return
                                                        ; Look!  No padding! (lucky)
    2108:       41004400        .word   0x41004400      ; base address of portgroup0

void tryREG() {
  REG_PORT_OUTSET0 = (1 << 21);
    210c:       2380            movs    r3, #128        ; constant
    210e:       4a03            ldr     r2, [pc, #12]   ; get address of OUTSET0
    2110:       039b            lsls    r3, r3, #14     ; constant
    2112:       6013            str     r3, [r2, #0]    ; store to OUTSET0
  REG_PORT_OUTCLR0 = (1 << 21);
    2114:       4a02            ldr     r2, [pc, #8]    ; get address of OUTCLR0
    2116:       6013            str     r3, [r2, #0]    ; store to OUTCLR0
    2118:       4770            bx      lr              ; return
    211a:       46c0            nop                     ; padding
    211c:       41004418        .word   0x41004418  ; OUTSET0
    2120:       41004414        .word   0x41004414 ; OUTCLR0

Somewhat subject to the whims of the compiler, alas. It was doing weird things when I used IOBUS in trystruct, since it was able to "construct" the port address rather than using a literal value. And I guess it doesn't know that IOBUS is faster than normal memory (?), so it likes to break up consecutive store operations. Sigh.

Very interesting! When I tried the following, which also tries the inline function:

  // 20ns per, but can be 40 (instruction fetch??)
  PORT_IOBUS->Group[1].OUTSET.reg = 1 << 10;
  PORT_IOBUS->Group[1].OUTCLR.reg = 1 << 10;
  PORT_IOBUS->Group[1].OUTSET.reg = 1 << 10;
  PORT_IOBUS->Group[1].OUTCLR.reg = 1 << 10;

  // 80 ns per
  PORT->Group[1].OUTTGL.reg = 1 << 10;
  PORT->Group[1].OUTTGL.reg = 1 << 10;
  PORT->Group[1].OUTTGL.reg = 1 << 10;
  PORT->Group[1].OUTTGL.reg = 1 << 10;

  fastWrite(10, 1); // fastWrite modified to use group 1 instead of 0
  fastWrite(10, 0);
  fastWrite(10, 1);
  fastWrite(10, 0);

(By the way, it doesn't matter if I use SET and CLR or use TGL in terms of performance) It groups all the address and mask calculations together at the beginning of the function and then just has a sequence of 12 str instructions. I suspect the decision to build a constant value rather than loading it depends on which is quicker or takes less space depending on optimization choices.

    210c:	22c0      	movs	r2, #192	; 0xc0
    210e:	05d2      	lsls	r2, r2, #23
    2110:	0011      	movs	r1, r2
    2112:	2380      	movs	r3, #128	; 0x80
    2114:	4808      	ldr	r0, [pc, #32]	; (2138 <loop+0x2c>)
    2116:	00db      	lsls	r3, r3, #3
    2118:	3198      	adds	r1, #152	; 0x98
    211a:	3294      	adds	r2, #148	; 0x94
    211c:	600b      	str	r3, [r1, #0]
    211e:	6013      	str	r3, [r2, #0]
    2120:	600b      	str	r3, [r1, #0]
    2122:	6013      	str	r3, [r2, #0]
    2124:	6003      	str	r3, [r0, #0]
    2126:	6003      	str	r3, [r0, #0]
    2128:	6003      	str	r3, [r0, #0]
    212a:	6003      	str	r3, [r0, #0]
    212c:	600b      	str	r3, [r1, #0]
    212e:	6013      	str	r3, [r2, #0]
    2130:	600b      	str	r3, [r1, #0]
    2132:	6013      	str	r3, [r2, #0]
    2134:	4770      	bx	lr
    2136:	46c0      	nop			; (mov r8, r8)
    2138:	4100449c 	.word	0x4100449c
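The constants in this listing can be decoded by hand. As a sketch of the arithmetic, assuming a group stride of 0x80 and the OUTCLR/OUTSET/OUTTGL offsets 0x14/0x18/0x1C that are visible in the listings in this thread (the helper name regAddr is made up for the illustration):

```cpp
#include <cstdint>

constexpr uint32_t kPortBase      = 0x41004400u; // PORT, the literal in the earlier listings
constexpr uint32_t kPortIobusBase = 0x60000000u; // PORT_IOBUS, built as 0xC0 << 23 by movs/lsls
constexpr uint32_t kGroupStride   = 0x80u;       // distance between Group[0] and Group[1]
constexpr uint32_t kOffOutclr     = 0x14u;
constexpr uint32_t kOffOutset     = 0x18u;
constexpr uint32_t kOffOuttgl     = 0x1Cu;

// Address of a register inside a given port group (hypothetical helper).
constexpr uint32_t regAddr(uint32_t base, uint32_t group, uint32_t offset) {
    return base + group * kGroupStride + offset;
}

// movs r2, #192 ; lsls r2, r2, #23  builds the IOBUS base without a literal load.
static_assert((0xC0u << 23) == kPortIobusBase, "constructed IOBUS base");
// adds r1, #152 and adds r2, #148 give Group[1].OUTSET and Group[1].OUTCLR.
static_assert(regAddr(kPortIobusBase, 1, kOffOutset) == kPortIobusBase + 152, "r1");
static_assert(regAddr(kPortIobusBase, 1, kOffOutclr) == kPortIobusBase + 148, "r2");
// The .word at the end is PORT->Group[1].OUTTGL, loaded into r0.
static_assert(regAddr(kPortBase, 1, kOffOuttgl) == 0x4100449Cu, "r0");
```

Read that way, the first four str instructions are the IOBUS set/clear pairs, the middle four are the PORT toggles through r0, and the last four are the inlined fastWrite calls reusing r1 and r2.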

@westfw this is really cool to know. This is literally as fast as it can possibly be.

So is this an IDE-specific situation? Did you use the Arduino IDE to generate those assembly listings?

How about reading from a pin, is it still the same?

Just goes to show that even compilers still need optimizing.

There is a utility, "arm-none-eabi-objdump", in the same directory as the compiler, and that is what I used to get the assembler listing. (/Applications/Arduino-1.8.13.app/Contents/Java/portable/packages/arduino/tools/arm-none-eabi-gcc/7-2017q4/bin/arm-none-eabi-objdump -SC /tmp/Arduino1.8.13Build/*.elf | less in my case.)

How about reading from a pin are they still the same?

The same logic applies, except there was also that 32-bit read to boolean conversion that you might not need to do in many cases.
If you're just reading a bit, the REG and struct should give the same results, but if you have a function that toggles a clock bit and reads a data bit, that would likely have the same improvements by using the struct with its base register.
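To make the clock-and-data case concrete, here is a minimal sketch of a bit-banged 8-bit read that goes through a single group reference, so the compiler has one base address to work from. Everything in it is an assumption for illustration: MockPortGroup stands in for PORT_IOBUS->Group[0] so the code runs on a PC, shiftIn8 is a made-up name, and the protocol (MSB first, data sampled while the clock is high, no delays) is hypothetical.

```cpp
#include <cstdint>

// Host-side stand-in for the registers used here (hypothetical).
struct MockPortGroup {
    volatile uint32_t OUTCLR = 0;
    volatile uint32_t OUTSET = 0;
    volatile uint32_t IN = 0;
};

// Clock in 8 bits, MSB first, sampling the data pin while the clock is high.
// All three registers are reached through the same group reference.
inline uint8_t shiftIn8(MockPortGroup &g, uint8_t clkBit, uint8_t dataBit) {
    const uint32_t clk = 1u << (clkBit & 0x1f);
    const uint32_t dat = 1u << (dataBit & 0x1f);
    uint8_t value = 0;
    for (int i = 0; i < 8; ++i) {
        g.OUTSET = clk;                                   // clock high
        const uint8_t bit = (g.IN & dat) ? 1u : 0u;       // sample data pin
        value = static_cast<uint8_t>((value << 1) | bit); // shift into result
        g.OUTCLR = clk;                                   // clock low
    }
    return value;
}
```

A real device would likely need some settling time between the clock edge and the read, so treat this only as the register access pattern, not as a working driver.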

One of the reasons we're seeing some different code is that the compiler "knows" that a function has access to certain registers that it doesn't need to save, and it looks like it will make full use of that fact if it can. But if the code is in-line, it can better optimize the register usage.

For fun, you can consider that the SAMD21 allows byte writes in addition to 32-bit writes of the entire port. Theoretically, you can write any bit in the port using an 8-bit constant, which would be quicker to load. Something like:

  PORT->Group[0].OUTSET.reg = (1 << 21);
    20fc:   2280       movs    r2, #32           ;construct 8bit constant 1<<(21-(2*8))
    20fe:   4b02       ldr     r3, [pc, #8]      ; get address of PORT
    2102:   619a       strb    r2, [r3, #24+2]   ; byte store to 3rd byte of OUTSET

The samd21 .h files don't have any definitions for such 8bit access, so the C code to make it happen would be pretty gross.

Now there's an idea! Byte access! Slight reduction in code length, and not really that difficult if you are willing to modify CMSIS/device/ATMEL/SAMD21/Include/component/port.h (not really recommended!).

The types for the port registers are already defined as unions, so I just added uint8_t breg[4] to the existing uint32_t reg. It works using breg, for instance your example would be:

PORT->Group[0].OUTSET.breg[21 >> 3] = (1 << (21 & 0x07)); 

It generates the same code you just posted.
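The index and mask in that expression are just the bit number split into a byte offset and a bit-within-byte. A small sketch of the arithmetic, assuming the little-endian layout the Cortex-M0+ uses (byte n of the register holds bits 8n to 8n+7); the helper names are made up for the illustration:

```cpp
#include <cstdint>

// Which of the four register bytes holds the given bit (0..31).
constexpr uint8_t byteIndex(uint8_t bit) {
    return static_cast<uint8_t>((bit & 0x1f) >> 3);
}

// The 8-bit mask to write into that byte.
constexpr uint8_t byteMask(uint8_t bit) {
    return static_cast<uint8_t>(1u << (bit & 0x07));
}

// The 32-bit value the byte write is equivalent to, for checking.
constexpr uint32_t widened(uint8_t bit) {
    return static_cast<uint32_t>(byteMask(bit)) << (8 * byteIndex(bit));
}

// Bit 21 lives in byte 2 with mask 0x20 (decimal 32), matching the
// "#32" constant and the "+2" offset in the hypothetical listing above.
static_assert(byteIndex(21) == 2, "bit 21 is in the third byte");
static_assert(byteMask(21) == 0x20, "bit 5 of that byte");
static_assert(widened(21) == (1u << 21), "same bit as the 32-bit mask");
static_assert(widened(31) == (1u << 31), "top bit also works");
```

Because every byte mask fits in 8 bits, the constant can always be loaded with a single movs, which is where the shorter code comes from.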

oh! That's a nice way to do it!

(Unfortunately, Microchip has gotten rid of the unions in the latest CMSIS (CMSIS-like?) files for the SAM chips. Not that that's likely to creep into Arduino, but...)
