Question for Fastest way on rotation (circular) left on 16 bit

I want to get fastest way on this function

uint16_t rotation(uint16_t x)
{
return ( (uint16_t) (x<<4) | (uint16_t) (x>>12) );
}

example : x=0x abcd => rotation(x)=0x bcda

I use this function in other functions :

void function(x)
{
....
rotation(x)
....
}

I saw some inline assembly codes, but i can't apply on my problem.
(I can use C and C++, but assembly language is difficult on me....)

plz help me!!

That is about the fastest way.
Unless you want to get crazy sill with a lookup table (not practical for a 16 bit value).

    return  ( (uint16_t) (x<<4)  | (uint16_t) (x>>12) );
 52c:	58 2f       	mov	r21, r24
 52e:	52 95       	swap	r21
 530:	5f 70       	andi	r21, 0x0F	; 15
 532:	49 2f       	mov	r20, r25
 534:	42 95       	swap	r20
 536:	40 7f       	andi	r20, 0xF0	; 240
 538:	35 2f       	mov	r19, r21
 53a:	34 2b       	or	r19, r20
 53c:	92 95       	swap	r25
 53e:	9f 70       	andi	r25, 0x0F	; 15
 540:	82 95       	swap	r24
 542:	80 7f       	andi	r24, 0xF0	; 240
 544:	29 2f       	mov	r18, r25
 546:	28 2b       	or	r18, r24

It would be faster as a preprocessor macro, instead of a function call.

aarg:
It would be faster as a preprocessor macro, instead of a function call.

Okay, marginally.

One has to ask though, why does the OP think they need this perceived extreme level of performance, and why that particular function call.

...and why such a weird rotation... :slight_smile:

why save the sign bit, shift left and add/OR 1 if sign bit set?

this is called a barrel-shift

What sign bit? This function processes unsigned values.

aarg:
It would be faster as a preprocessor macro, instead of a function call.

That is not true, and it's actually harmful advice in my opinion. Using macros instead of functions is just asking for problems.

Modern compilers will inline functions if the implementation is available, you don't need a macro for that.
If you enable link-time optimizations (using GCC's -flto, which is enabled by default for AVR in the Arduino IDE), it doesn't even matter whether they're in the same source file/translation unit, it'll inline functions globally.

Proof: Compiler Explorer
Both the function and the macro code compile to the exact same assembly.

Also see the C++ Core Guidelines on the matter: C++ Core Guidelines

If the compiler insists on not inlining the function call, you could add the inline keyword to try and persuade it, or even stronger compiler-specific attributes if you're into that.
I'm sure you could find edge cases where a macro turns out to be marginally faster (probably at a high memory penalty), but in general, this is not the case, and "use a macro, it's faster" is terrible general advice.
If all else fails, in very rare cases and if you have measured it, you might consider it. Otherwise, use a function, and save yourself trouble in the long term.

Pieter

Edit: add note about LTO

The following is about 12% faster than your original function (on an Arduino UNO):

inline uint16_t __attribute__((always_inline)) rotate_inline_asm(uint16_t in) {
  asm ( "lsl %A0"              "\n\t"
        "rol %B0"              "\n\t"
        "adc %A0,__zero_reg__" "\n\t"
        "lsl %A0"              "\n\t"
        "rol %B0"              "\n\t"
        "adc %A0,__zero_reg__" "\n\t"
        "lsl %A0"              "\n\t"
        "rol %B0"              "\n\t"
        "adc %A0,__zero_reg__" "\n\t"
        "lsl %A0"              "\n\t"
        "rol %B0"              "\n\t"
        "adc %A0,__zero_reg__" "\n\t"
    : "=r" (in) : "0" (in)
  );
  return in;
}

It's just 4 left bit rotates, each rotate uses a logical shift left (lsl) of the low byte, which sets the carry flag to original bit 7, then a rotate left through carry (rol) of the high byte, which sets new bit 8 to the carry flag, and sets the carry flag to original bit 15. Finally, the carry flag is added (adc) to the low byte to set the new bit 0 if original bit 15 was set.

C is particularly bad at doing rotates and (related) doesn’t know about carry bits.

westfw:
C is particularly bad at doing rotates and (related) doesn’t know about carry bits.

But surely the compiler/optimizer should know about carry bits, right?

At least GCC uses the same optimization when rotating a single bit: Compiler Explorer

uint16_t rotation_1(uint16_t x) {
    return uint16_t(x << 1) | uint16_t(x >> 15);
}
rotation_1(unsigned int):
        lsl r24
        rol r25
        adc r24,__zero_reg__
ret

For some reason, for shifting 4 bits, it resorts to swapping and masking nibbles, as posted by pcbbc, which turns out to be slower.

isn't the carry bit an output of an arithmetic operation in the hardware?

the sign bit refers to the most significant bit of a value. there are op-codes specific to the sign bit

gcjr:
isn't the carry bit an output of an arithmetic operation in the hardware?

Yes, but also the bit shifted out of a shift or rotate operation in many implementations (including AVR).

Edit: Also on AVR the LSL Rd (Logical Shift Left) instruction doesn't really exist. The assembler implements it using the ADD Rd, Rd instruction.

The carry bit/flag is a single bit in the CPU status register. A bit shift instruction places the bit that was shifted out into the carry flag. Similarly for other arithmetic instructions. You can branch on the carry flag, or you can use instructions that shift back in the carry flag, or add the carry flag to a register, etc.

There's also a signed flag, that keeps track of whether the result of a subtraction was negative, for instance.

I don't think AVR has any instructions that operate on the most significant bit explicitly (BST etc. can operate on any bit). When dealing with unsigned numbers "sign bit" is pretty much meaningless.

But surely the compiler/optimizer should know about carry bits, right?

Clearly there's knowledge deep down in the code generation. But no C-level accessibility of the sort that would make doing the end-around carry for an IP checksum "easy" (for example.)

Also on AVR the LSL Rd (Logical Shift Left) instruction doesn't really exist.

Neither does ROL (it's "adc Rn, Rn")
(And it doesn't actually rotate in the way you might expect coming from a different CPU architecture.)
(that also means that your quicker code sequence could become this "prettier" version:)

        "add %A0,%A0"              "\n\t"
        "adc %B0,%B0"              "\n\t"
        "adc %A0,__zero_reg__" "\n\t"
            :

I don't think AVR has any instructions that operate on the most significant bit explicitly

BRMI and BRPL test the "N" bit in the status register, which is set to the MSB of the most-recent nbit-modifying instruction. A "shift left" instruction gives you "access" to three bits via conditional branches - the former MSB (C), the new MSB (N), and the former bit 3 (H)
(brownie points for anyone who manages to make use of that for ... anything.)

pcbbc:
Edit: Also on AVR the LSL Rd (Logical Shift Left) instruction doesn't really exist. The assembler implements it using the ADD Rd, Rd instruction.

westfw:
Neither does ROL (it's "adc Rn, Rn")

What would be the correct term that you'd use in a normal conversation? In our MIPS course they referred to it as "pseudoinstructions", but the AVR Instruction Set Manual lists LSL and ROL under "Bit and Bit-test Instructions". Is it common to just call everything that has its own mnemonic an "instruction"?

westfw:
BRMI and BRPL test the "N" bit in the status register, which is set to the MSB of the most-recent nbit-modifying instruction. A "shift left" instruction gives you "access" to three bits via conditional branches - the former MSB (C), the new MSB (N), and the former bit 3 (H)
(brownie points for anyone who manages to make use of that for ... anything.)

Clearly, someone has spent some time doing serious AVR programming :slight_smile:

What would be the correct term that you'd use in a normal conversation? In our MIPS course they referred to it as "pseudoinstructions"

I like "pseudo-instruction" or "alias." TI calls them "emulated" instructions (on MSP430.)
Whether the manufacturer claims them as separate instructions or not seems to be random.
I think back in the day when RISC was "new", they'd be more inclined to make sure that the "instruction set" contained all the mnemonics from prior architectures that they were competing against (the early AVRs were positioned against 8051 (the AT90S8515 was pin-compatible!)), so it needed to document all the old 8080-like instructions.
Now that RISC is more of a "feature", they're more likely to list the pseudo-instructions separately, to make the instruction set look more "reduced." (like TI. The MSP430 has something like 27 opcodes, and 24 additional "emulated" instructions.
On AVR, there are several other pseudo-instructions (Heh. Homework for the assembly language class, to identify them!)
On ARM, "nop" was "mov r8, r8" for a long time; I think the compiler still uses that. Current thumb/thumb2 has both 16bit and 32bit "nop" instructions.

someone has spent some time doing serious AVR programming

:slight_smile: Actually, that observation comes more from studying ARM code. ARM doesn't have a 16bit "test bits" or "and immediate" instruction (which means that M0 doesn't have them at all), but it does have a single-cycle "shift by any amount", so single-bit tests ("if (reg & (1<<13))") get implemented by shifting the value until the bit-of-interest hits something that sets the carry or sign bit.

The term, "pseudo instruction" has a solid history. But sometimes it refers to symbols that are for assembler control only and don't produce any processor instructions, like ORG, EQU, RMB, END... and many more.

Heh. This is cute. It cuts off a couple of cycles, and is shorter. Uses the multiply instruction, which is not present on all of the AVRs.
It trashes 4 registers, which is probably bad. If you're willing to trash a fifth register, you could get rid of another instruction...
Maybe if you had a long list of values that you wanted to rotate, the overhead that could be shared would factor out...

dorot4:    ;; enter with x in r25:r24
    ldi    r17, 16    
    mul    r25, r17
    mov    r18, r1 
    mov    r25, r0
    mul    r24, r17
    mov    r24, r18
    add    r24, r0    
    add    r25, r1    ;; exit with rotated value in r25:r24

(Comments stripped, in case you want to puzzle it out. It's based on a 16x8-->24 multiply I found, taking into account the special properties of multiplying by 16. I can't quite intuit whether it works for other shift factors (and it's tougher to see if it's working. I think... it might have to be more careful about carry bits.))

Clever! Thanks for sharing.