Problem with accurate timing (microseconds)

Please bear with me until I get to the coding part. I'm trying to make a two-channel biphasic pulse sequence generator with variable voltage amplitude, frequency, etc. Here's a biphasic pulse:

The pulse width has to be in the range 200 us to 500 us. I'm passing serial data from two Arduino Uno pins to two DACs, one pin per DAC. Each DAC has two outputs, R and L, which are set by alternating 16-bit words from the Arduino: the first 16 bits go to R.OUT and the next 16 bits go to L.OUT. For clarification about the DAC see this:

For each channel of the pulse generator, the two outputs of one DAC are added together to make the biphasic pulse. One output is inverted through an op-amp and supplies the negative phase; the other supplies the positive phase.

Now, let's get to the coding part.

//first line of inputs goes to R.OUT of DAC1, second goes to L.OUT of DAC1, 3rd line goes to R.OUT of DAC2, 4th goes to L.OUT of DAC2
void WriteRL (int r15_1,int r14_1,int r13_1,int r12_1,int r11_1,int r10_1,int r9_1,int r8_1,int r7_1,int r6_1,int r5_1,int r4_1,int r3_1,int r2_1,int r1_1,int r0_1,
              int l15_1,int l14_1,int l13_1,int l12_1,int l11_1,int l10_1,int l9_1,int l8_1,int l7_1,int l6_1,int l5_1,int l4_1,int l3_1,int l2_1,int l1_1,int l0_1,
              int r15_2,int r14_2,int r13_2,int r12_2,int r11_2,int r10_2,int r9_2,int r8_2,int r7_2,int r6_2,int r5_2,int r4_2,int r3_2,int r2_2,int r1_2,int r0_2,
              int l15_2,int l14_2,int l13_2,int l12_2,int l11_2,int l10_2,int l9_2,int l8_2,int l7_2,int l6_2,int l5_2,int l4_2,int l3_2,int l2_2,int l1_2,int l0_2 )
{

                          delayMicroseconds(d1);digitalWriteFast(SI,r15_1);digitalWriteFast(SI_2,r15_2);delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r14_1);digitalWriteFast(SI_2,r14_2);delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r13_1);digitalWriteFast(SI_2,r13_2);delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r12_1);digitalWriteFast(SI_2,r12_2);delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r11_1);digitalWriteFast(SI_2,r11_2);delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r10_1);digitalWriteFast(SI_2,r10_2);delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r9_1 );digitalWriteFast(SI_2,r9_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r8_1 );digitalWriteFast(SI_2,r8_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r7_1 );digitalWriteFast(SI_2,r7_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r6_1 );digitalWriteFast(SI_2,r6_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r5_1 );digitalWriteFast(SI_2,r5_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r4_1 );digitalWriteFast(SI_2,r4_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r3_1 );digitalWriteFast(SI_2,r3_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r2_1 );digitalWriteFast(SI_2,r2_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r1_1 );digitalWriteFast(SI_2,r1_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,r0_1 );digitalWriteFast(SI_2,r0_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);

  digitalWriteFast(Clk,0);
  digitalWriteFast(LRCk,1);
  
  // Previous data is written to R.Out. Now that LRCk is 1, the next 16 clocks determine the L.OUT.
  // At this point, we don't care about the L.out. For the next 16 clocks, we are going to assign the lowest value to L.OUT.

                          delayMicroseconds(d1);digitalWriteFast(SI,l15_1);digitalWriteFast(SI_2,l15_2);delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l14_1);digitalWriteFast(SI_2,l14_2);delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l13_1);digitalWriteFast(SI_2,l13_2);delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l12_1);digitalWriteFast(SI_2,l12_2);delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l11_1);digitalWriteFast(SI_2,l11_2);delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l10_1);digitalWriteFast(SI_2,l10_2);delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l9_1 );digitalWriteFast(SI_2,l9_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l8_1 );digitalWriteFast(SI_2,l8_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l7_1 );digitalWriteFast(SI_2,l7_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l6_1 );digitalWriteFast(SI_2,l6_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l5_1 );digitalWriteFast(SI_2,l5_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l4_1 );digitalWriteFast(SI_2,l4_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l3_1 );digitalWriteFast(SI_2,l3_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l2_1 );digitalWriteFast(SI_2,l2_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l1_1 );digitalWriteFast(SI_2,l1_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);
  digitalWriteFast(Clk,0);delayMicroseconds(d1);digitalWriteFast(SI,l0_1 );digitalWriteFast(SI_2,l0_2 );delayMicroseconds(d1);digitalWriteFast(Clk,1);delayMicroseconds(d2);

  
  digitalWriteFast(Clk,0);
  digitalWriteFast(LRCk,0);
}

Each time this function is called, serial data goes to the two DACs for 32 clocks. Let's call each block of 32 clocks an "Interval". I know that this much delay, along with the two cycles that each digitalWriteFast() takes, already defeats my purpose, but that can be fixed by using NOPs instead of the delays. The real problem is that the time my calculations take is also limiting. After the user enters the parameters for the pulses, I first calculate how many Intervals I need to send data for (N). Then a while(counter<N) loop repeats, and based on the counter it determines whether each output is low, high or zero in the current Interval. Based on that, the function above is called. I put

time = micros();

  Serial.println(time);

right before and after the while loop, and the difference is more than 300 us. Then the calculations take up another 600 us, and finally the function is called. All this while I wanted my Interval to take only 100 us. What concerns me the most is the time that while(counter<N) takes, because I cannot think of a way to avoid using it. Is there any way to make this faster?

a) Welcome to the Forum.

b) Rarely will you receive advice from a code snippet. Post your entire code.

c) When it comes to tuning for speed, you need to get rid of a few Arduino functions and get into direct port manipulation. You may want to read up a bit about that.

d) Read the post at the top of page one, "Using millis() for timing...". The same methods apply to micros(). Get rid of the delays.

e) Read b) again. You make mention of your calculations, first place to look for speed improvements. Show us.

The microseconds counter on an UNO increments in units of 4. If you're happy with +/-4us accuracy, then keep going and accept that the final result will be inaccurate.

Or switch to a faster processor. A Teensy 3.2 main clock works at up to 96MHz and a Teensy 3.6 works up to 180MHz. So you can run a large number of instructions in a single microsecond. For periods of 200us with accuracy required around +/-1us, I would just use the regular blinkWithoutDelay method.

Or switch to using the hardware timers to do what you want. You can get better than 1us accuracy on an UNO using the hardware timers.

I suspect you can use the SPI peripheral to transfer data to the DAC. That would allow the transfer to run in the "background" and eliminate all those busy loops.
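
As a rough check of the numbers above, here is a minimal sketch of the timer arithmetic, written in plain C++ so it can be run on a PC. It assumes a 16 MHz Uno, the Arduino core's default Timer0 prescaler of 64 (which is where the 4 us step of micros() comes from), and the 16-bit Timer1 running with a prescaler of 1. The function names are made up for illustration and are not part of any Arduino API.

```cpp
#include <cassert>
#include <cstdint>

// Microseconds per timer count for a given CPU clock and prescaler.
constexpr double tickMicroseconds(uint32_t cpuHz, uint32_t prescaler) {
    return 1e6 * prescaler / cpuHz;
}

// Value for a compare-match register (e.g. OCR1A in CTC mode) so that the
// timer fires every 'periodUs' microseconds: counts per period, minus one
// because the counter starts at 0.
inline uint32_t compareValue(uint32_t cpuHz, uint32_t prescaler, uint32_t periodUs) {
    assert(prescaler != 0);
    const uint64_t counts =
        static_cast<uint64_t>(cpuHz) * periodUs / (1000000ULL * prescaler);
    assert(counts >= 1 && counts <= 65536);  // must fit a 16-bit timer
    return static_cast<uint32_t>(counts - 1);
}
```

Under these assumptions, tickMicroseconds(16000000, 64) is 4.0, and compareValue(16000000, 1, 100) is 1599 for a 100 us Interval, with each count lasting 62.5 ns.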

Apart from not seeing the complete program, figuring out the code in the snippet would be as easy as finding a particular dust mite in a really scruffy carpet. Put one instruction on each line - like, for example, on lines 25 and 26 of the snippet.

As well as that, the code seems to be crying out for arrays to eliminate all the repetition.

Using delayMicroseconds() you will need to take account of the time taken to execute the instructions between the delays. If you use non-blocking timing using micros() then the execution time will be automatically taken care of. The technique in Several Things at a Time can also be applied with micros().
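
A minimal sketch of that non-blocking pattern, in plain C++ so it can be tested off the board. The struct name and the idea of passing the current time in as a parameter are assumptions for illustration; on an Arduino the value passed to due() would come from micros().

```cpp
#include <cstdint>

// Tracks a repeating interval without blocking. The caller supplies the
// current time in microseconds (micros() on a real board).
struct IntervalTimer {
    uint32_t last;      // time of the previous tick, in microseconds
    uint32_t interval;  // desired spacing between ticks, in microseconds

    // Returns true once per elapsed interval. Unsigned subtraction keeps
    // the comparison correct across the 32-bit rollover of the counter.
    bool due(uint32_t now) {
        if (static_cast<uint32_t>(now - last) >= interval) {
            last += interval;  // step by the interval, not to 'now', so errors do not accumulate
            return true;
        }
        return false;
    }
};
```

Because last advances by exactly one interval each time, the time spent executing the code between checks does not stretch the period, as long as that code finishes within one interval.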

I suspect for time intervals as long as 200µs that digitalWriteFast() will be good enough, and it is a lot easier than learning port manipulation.

I note that the graph in the Original Post has current on the Y axis. Arduinos deal with voltages.

...R

Isn't digitalWriteFast normally "as fast" as direct port manipulation? (My new version is :) )
(Hmm. Perhaps not, if the bit value isn't also a constant.)

What concerns me the most is the time that while(counter<N) takes

while() usually compiles to optimal code, and shouldn't take anywhere near 300us. OTOH, Serial.print() can be quite slow. As others have said, we'd need to see more complete code to get an idea what is actually going on.
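
One way to keep printing out of the measurement is to take both timestamps first and print only the difference afterwards. The following is a sketch in plain C++; the helper name is hypothetical, and the clock is passed in so the code can run on a PC. On a board the clock argument would simply wrap micros().

```cpp
#include <cstdint>

// Runs 'work' and returns the elapsed time according to 'nowMicros'.
// Nothing is printed between the two timestamps, so the cost of
// Serial.println() is not counted in the result.
template <typename Clock, typename Work>
uint32_t elapsedMicros(Clock nowMicros, Work work) {
    const uint32_t start = nowMicros();
    work();
    const uint32_t stop = nowMicros();
    return stop - start;  // unsigned subtraction survives counter rollover
}
```

The returned value can then be printed once, after the timed section has finished.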

void WriteRL (int r15_1,int r14_1,int r13_1,int r12_1,int r11_1... **** 64 ARGUMENTS ****

OMG! Don't do THAT!
Passing arguments to a function is relatively slow, especially once you try to pass more arguments than will fit in the registers available on the CPU. Build up your table in an array, and either pass a pointer to the array or leave it as a global variable. (and, it looks like it could be "byte" instead of "int", which will save some time as well.)
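
A rough sketch of the array idea, in plain C++ so it runs on a PC. SimPins is a made-up stand-in for the real pins (it latches the data lines on each clock pulse); on the Uno those three assignments would be digitalWriteFast() calls on SI, SI_2 and Clk. The layout of the array is also an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for the hardware: latches both data lines on each clock pulse,
// the way a shift-register input would.
struct SimPins {
    uint8_t si1 = 0, si2 = 0;       // data lines (SI and SI_2 in the original)
    uint16_t word1 = 0, word2 = 0;  // bits each DAC has received so far
    void clockPulse() {
        word1 = static_cast<uint16_t>((word1 << 1) | (si1 & 1));
        word2 = static_cast<uint16_t>((word2 << 1) | (si2 & 1));
    }
};

const size_t BITS_PER_WORD = 16;

// 'bits' holds 16 one-byte bit values for DAC1 followed by 16 for DAC2,
// most significant bit first. Only a pointer crosses the call boundary,
// so nothing is pushed onto the stack per bit.
void writeWordPair(const uint8_t *bits, SimPins &pins) {
    for (size_t i = 0; i < BITS_PER_WORD; ++i) {
        pins.si1 = bits[i];
        pins.si2 = bits[BITS_PER_WORD + i];
        pins.clockPulse();
    }
}
```

The same function would be called twice per Interval, once for the R words and once for the L words, with LRCk toggled in between as in the original code.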

An Arduino DUE could easily output the pulses described in your post, because this board has two built-in 12-bit DACs, and with some extra hardware the output range can be e.g. -1.6 V to +1.6 V instead of 0.6 V to 2.7 V.

Let's say you initialized two buffers of 30 uint16_t values in setup().
Buffer0 first 10 points = 4095, second 10 points = 2048, third 10 points = 0.
Buffer1 30 points = 2048.

You can output either of the two buffers through a single DAC with DMA, i.e. with (nearly) no load on the CPU. Before the end of a buffer output, the DMA triggers an interrupt to "ask" what the next buffer will be -> this is when you decide whether it will be buffer0 or buffer1.

Note that DAC output frequency can be dynamically updated if you need to.
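
The two buffers described above could be filled like this. This is only a sketch of the buffer contents in plain C++; the names are illustrative, and the DMA and interrupt setup on the DUE is not shown.

```cpp
#include <cstddef>
#include <cstdint>

const size_t BUFFER_LEN = 30;
uint16_t buffer0[BUFFER_LEN];  // one biphasic pulse
uint16_t buffer1[BUFFER_LEN];  // flat mid-scale output

// Fills both buffers with 12-bit DAC codes as described in the post.
void fillBuffers() {
    for (size_t i = 0; i < BUFFER_LEN; ++i) {
        if (i < 10) {
            buffer0[i] = 4095;  // first 10 points: full scale
        } else if (i < 20) {
            buffer0[i] = 2048;  // second 10 points: mid-scale
        } else {
            buffer0[i] = 0;     // third 10 points: lowest code
        }
        buffer1[i] = 2048;      // all 30 points: mid-scale
    }
}
```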

That may work better, but I'm too short on time to learn and implement that.
Edit: Added the code.

Robin2:
Using delayMicroseconds() you will need to take account of the time taken to execute the instructions between the delays. If you use non-blocking timing using micros() then the execution time will be automatically taken care of. The technique in Several Things at a Time can also be applied with micros().

I know how to deal with delayMicroseconds() and how to reduce the delay by replacing it with NOPs. I was more concerned about the other stuff in my loop. I will upload it once I have added comments, as soon as I can.

You appear to be passing four 16-bit integers as 64 integers, each containing 1 bit. That is going to slow things down a lot.

Why are you using delayMicroseconds() at all?!? The maximum clock rate is 10 MHz so there is no way a 16 MHz Arduino UNO is going to exceed that speed. The time it takes to call the digitalWriteFast() will keep the clock rate well below 10 MHz. You could even do direct port manipulation and not clock too fast. If you use the dedicated SPI hardware you could get the data clock up to 8 MHz.
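
To put those clock rates in perspective, here is a small sketch of the arithmetic in plain C++. The helper names are made up. It also shows how a 16-bit word would be split into the two bytes that an 8-bit SPI peripheral sends. Whether this DAC accepts plain SPI framing, and how the two data lines and LRCk would be handled, are assumptions to check against the datasheet.

```cpp
#include <cstdint>

// Time in microseconds to clock out 'bits' bits at 'clockHz'.
constexpr double transferMicroseconds(uint32_t bits, uint32_t clockHz) {
    return 1e6 * bits / clockHz;
}

// Split a 16-bit DAC word into the two bytes an 8-bit SPI peripheral
// would send, most significant byte first.
constexpr uint8_t highByteOf(uint16_t word) {
    return static_cast<uint8_t>(word >> 8);
}
constexpr uint8_t lowByteOf(uint16_t word) {
    return static_cast<uint8_t>(word & 0xFF);
}
```

At an 8 MHz data clock, transferMicroseconds(32, 8000000) is 4.0, so the 32 clocks of one Interval would take about 4 us of the 100 us budget.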

westfw:
Isn't digitalWriteFast normally "as fast" as direct port manipulation? (My new version is :) )
(Hmm. Perhaps not, if the bit value isn't also a constant.)

What is that new version you're talking about?

johnwasser:
Why are you using delayMicroseconds() at all?!? The maximum clock rate is 10 MHz so there is no way a 16 MHz Arduino UNO is going to exceed that speed. The time it takes to call the digitalWriteFast() will keep the clock rate well below 10 MHz. You could even do direct port manipulation and not clock too fast. If you use the dedicated SPI hardware you could get the data clock up to 8 MHz.

I thought the same, but when I tried that, it somehow didn't work. The output of the DAC didn't make any sense.

Mylaiza:
Had to break the code in 2 parts.

For a long program just add your .ino file as an attachment. It is too easy to make an error when joining parts of a program.

And it still has the appearance of scruffy carpet :)

...R

What is that new version you're talking about?

It's not quite published yet. What version are YOU using? (the first search result doesn't like your code at all...)
Frankly, you should start over. Your functions are based on the misguided impression that you can clock the values out faster if you first break them apart into individual bit values. While this might be true, in theory, in some cases, it's certainly not true of the way you've implemented this case. Fundamentally, testing the status of a bit in a register tends to be faster than checking an int/boolean status in memory (on almost all architectures!)
Start with something like:

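// Wrapped in do { } while (0) so each macro behaves as a single statement.
#define clkdelay() asm volatile("nop\n")
#define bit2data(pin, data, bit) do { if ((data) & (1U << (bit))) { digitalWriteFast(pin, 1); } else { digitalWriteFast(pin, 0); } } while (0)
#define clk(pin) do { digitalWriteFast(pin, 1); clkdelay(); digitalWriteFast(pin, 0); } while (0)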

void write4x16(uint16_t r1, uint16_t l1, uint16_t r2, uint16_t l2) {

    // send 2 right channels
   
    bit2data(SI, r1, 15);
    bit2data(SI_2, r2, 15);
    clk(Clk);

    bit2data(SI, r1, 14);
    bit2data(SI_2, r2, 14);
    clk(Clk);

    bit2data(SI, r1, 13);
    bit2data(SI_2, r2, 13);
    clk(Clk);

    bit2data(SI, r1, 12);
    bit2data(SI_2, r2, 12);
    clk(Clk);

    // ... bits 11 down to 0 follow the same pattern

    // Send 2 Left Channels
    bit2data(SI, l1, 15);
    bit2data(SI_2, l2, 15);
    clk(Clk);

    bit2data(SI, l1, 14);
    bit2data(SI_2, l2, 14);
    clk(Clk);

    bit2data(SI, l1, 13);
    bit2data(SI_2, l2, 13);
    clk(Clk);

    bit2data(SI, l1, 12);
    bit2data(SI_2, l2, 12);
    clk(Clk);
    // ... bits 11 down to 0 follow the same pattern
}

You probably shouldn't be trying to optimize to this degree unless you're willing to look at the assembly output. (You don't necessarily have to fully understand the assembly language to be able to look at it and go, "oh, that didn't work right.") For example, here's the caller overhead that I mentioned back in #5 (and there's a similar amount of overhead in the callee!):

     WriteRL(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,//before input, everything is zero.
            1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
     f64:	1f 92       	push	r1
     f66:	1f 92       	push	r1
     f68:	1f 92       	push	r1
     f6a:	1f 92       	push	r1
     f6c:	1f 92       	push	r1
     f6e:	1f 92       	push	r1
     f70:	1f 92       	push	r1
     f72:	1f 92       	push	r1
     f74:	1f 92       	push	r1
     f76:	1f 92       	push	r1
     f78:	1f 92       	push	r1
     f7a:	1f 92       	push	r1
     f7c:	1f 92       	push	r1
     f7e:	1f 92       	push	r1
     f80:	1f 92       	push	r1
     f82:	1f 92       	push	r1
     f84:	1f 92       	push	r1
     f86:	1f 92       	push	r1
     f88:	1f 92       	push	r1
     f8a:	1f 92       	push	r1
     f8c:	1f 92       	push	r1
     f8e:	1f 92       	push	r1
     f90:	1f 92       	push	r1
     f92:	1f 92       	push	r1
     f94:	1f 92       	push	r1
     f96:	1f 92       	push	r1
     f98:	1f 92       	push	r1
     f9a:	1f 92       	push	r1
     f9c:	1f 92       	push	r1
     f9e:	1f 92       	push	r1
     fa0:	1f 92       	push	r1
     fa2:	ee 24       	eor	r14, r14
     fa4:	e3 94       	inc	r14
     fa6:	ef 92       	push	r14
     fa8:	1f 92       	push	r1
     faa:	1f 92       	push	r1
     fac:	1f 92       	push	r1
     fae:	1f 92       	push	r1
     fb0:	1f 92       	push	r1
     fb2:	1f 92       	push	r1
     fb4:	1f 92       	push	r1
     fb6:	1f 92       	push	r1
     fb8:	1f 92       	push	r1
     fba:	1f 92       	push	r1
     fbc:	1f 92       	push	r1
     fbe:	1f 92       	push	r1
     fc0:	1f 92       	push	r1
     fc2:	1f 92       	push	r1
     fc4:	1f 92       	push	r1
     fc6:	1f 92       	push	r1
     fc8:	1f 92       	push	r1
     fca:	1f 92       	push	r1
     fcc:	1f 92       	push	r1
     fce:	1f 92       	push	r1
     fd0:	1f 92       	push	r1
     fd2:	1f 92       	push	r1
     fd4:	1f 92       	push	r1
     fd6:	1f 92       	push	r1
     fd8:	1f 92       	push	r1
     fda:	1f 92       	push	r1
     fdc:	1f 92       	push	r1
     fde:	1f 92       	push	r1
     fe0:	1f 92       	push	r1
     fe2:	1f 92       	push	r1
     fe4:	1f 92       	push	r1
     fe6:	ef 92       	push	r14
     fe8:	1f 92       	push	r1
     fea:	1f 92       	push	r1
     fec:	1f 92       	push	r1
     fee:	1f 92       	push	r1
     ff0:	1f 92       	push	r1
     ff2:	1f 92       	push	r1
     ff4:	1f 92       	push	r1
     ff6:	1f 92       	push	r1
     ff8:	1f 92       	push	r1
     ffa:	1f 92       	push	r1
     ffc:	1f 92       	push	r1
     ffe:	1f 92       	push	r1
    1000:	1f 92       	push	r1
    1002:	1f 92       	push	r1
    1004:	1f 92       	push	r1
    1006:	1f 92       	push	r1
    1008:	1f 92       	push	r1
    100a:	1f 92       	push	r1
    100c:	1f 92       	push	r1
    100e:	1f 92       	push	r1
    1010:	1f 92       	push	r1
    1012:	1f 92       	push	r1
    1014:	1f 92       	push	r1
    1016:	1f 92       	push	r1
    1018:	1f 92       	push	r1
    101a:	1f 92       	push	r1
    101c:	1f 92       	push	r1
    101e:	1f 92       	push	r1
    1020:	1f 92       	push	r1
    1022:	1f 92       	push	r1
    1024:	1f 92       	push	r1
    1026:	ef 92       	push	r14
    1028:	1f 92       	push	r1
    102a:	1f 92       	push	r1
    102c:	1f 92       	push	r1
    102e:	1f 92       	push	r1
    1030:	1f 92       	push	r1
    1032:	1f 92       	push	r1
    1034:	1f 92       	push	r1
    1036:	1f 92       	push	r1
    1038:	1f 92       	push	r1
    103a:	1f 92       	push	r1
    103c:	1f 92       	push	r1
    103e:	1f 92       	push	r1
    1040:	1f 92       	push	r1
    1042:	1f 92       	push	r1
    1044:	81 2c       	mov	r8, r1
    1046:	91 2c       	mov	r9, r1
    1048:	a1 2c       	mov	r10, r1
    104a:	b1 2c       	mov	r11, r1
    104c:	c1 2c       	mov	r12, r1
    104e:	d1 2c       	mov	r13, r1
    1050:	e1 2c       	mov	r14, r1
    1052:	f1 2c       	mov	r15, r1
    1054:	00 e0       	ldi	r16, 0x00	; 0
    1056:	10 e0       	ldi	r17, 0x00	; 0
    1058:	20 e0       	ldi	r18, 0x00	; 0
    105a:	30 e0       	ldi	r19, 0x00	; 0
    105c:	40 e0       	ldi	r20, 0x00	; 0
    105e:	50 e0       	ldi	r21, 0x00	; 0
    1060:	60 e0       	ldi	r22, 0x00	; 0
    1062:	70 e0       	ldi	r23, 0x00	; 0
    1064:	81 e0       	ldi	r24, 0x01	; 1
    1066:	90 e0       	ldi	r25, 0x00	; 0
    1068:	0e 94 94 00 	call	0x128	; 0x128 <WriteRL(int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int)>
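
To get a feel for what that listing costs, here is a rough cycle count in plain C++. It assumes the ATmega328P timings of 2 cycles per push, 1 cycle each for mov, ldi, eor and inc, and 4 cycles for call, and counts 110 pushes and 20 single-cycle instructions in the listing above. The helper names are made up for illustration.

```cpp
#include <cstdint>

// Cycle counts for the ATmega328P instructions that appear in the listing.
constexpr uint32_t PUSH_CYCLES = 2;
constexpr uint32_t SINGLE_CYCLE = 1;  // mov, ldi, eor, inc
constexpr uint32_t CALL_CYCLES = 4;

// Total cycles the caller spends setting up and making the call.
constexpr uint32_t callerCycles(uint32_t pushes, uint32_t singleCycleOps) {
    return pushes * PUSH_CYCLES + singleCycleOps * SINGLE_CYCLE + CALL_CYCLES;
}

// Converts a cycle count to microseconds at the given CPU clock.
constexpr double cyclesToMicroseconds(uint32_t cycles, uint32_t cpuHz) {
    return 1e6 * cycles / cpuHz;
}
```

With those counts, callerCycles(110, 20) is 244 cycles, which at 16 MHz is 15.25 us spent on the caller side alone, before any pin is written.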

westfw:
Frankly, you should start over. Your functions are based on the misguided impression that you can clock the values out faster if you first break them apart into individual bit values.

I actually tried that and it did make it faster.
What if I get rid of the functions altogether and just repeat that piece of code several times? I know it won't look pretty, but if it works, that would be a quick solution. Unfortunately I won't be able to get to the lab for 2 days, so I can't try it out myself. In the meantime I'm going to prepare a bunch of versions of the code, and I'll definitely try yours too.
Any suggestions about what is going on outside the function?