Tightest finite wait loop possible in AVR8

Question

I need a delay function with fine-grained control over the delay. I came up with the fairly standard approach:

      ldi   Rx, N
lp:   dec   Rx
      brne  lp

This code takes precisely N * 3 cycles to execute (the final "brne", which leaves the loop, takes one cycle less, compensating for the extra cycle used by the "ldi").
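The 3·N count can be double-checked with a small host-side model. The sketch below is plain C meant to run on a PC, not on the AVR. `loop_cycles` is a hypothetical helper, and the per-instruction costs (ldi 1, dec 1, brne 2 when taken and 1 when not) are the usual figures for classic 8-bit AVR cores, assumed here rather than measured.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cycle costs on a classic 8-bit AVR core. */
enum { CYC_LDI = 1, CYC_DEC = 1, CYC_BRNE_TAKEN = 2, CYC_BRNE_FALL = 1 };

/* Hypothetical helper: steps through "ldi Rx, N / lp: dec Rx / brne lp"
   and returns the total number of cycles for the start value n. */
unsigned loop_cycles(unsigned n)
{
    assert(n <= 255);                   /* ldi takes an 8-bit immediate */

    unsigned cycles = CYC_LDI;          /* ldi Rx, N */
    uint8_t rx = (uint8_t)n;

    for (;;) {
        rx--;                           /* dec Rx (0 wraps to 255) */
        cycles += CYC_DEC;
        if (rx != 0) {
            cycles += CYC_BRNE_TAKEN;   /* brne lp, branch taken */
        } else {
            cycles += CYC_BRNE_FALL;    /* brne lp, falls through */
            break;
        }
    }
    return cycles;
}
```

Under these assumptions the model returns 3·N for N = 1..255. For N = 0 the register wraps around, so the loop runs 256 times and the model returns 768.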

One way to achieve finer control would be to add two more copies of the loop, one with an extra "nop", the other with two extra "nop"s. Apart from complications with really short delays (the execution time of the code that branches to the correct copy has to be taken into account), this solution is ugly. Therefore my question:

Is there a way to create a finite wait loop where the execution of one loop takes only two cycles (or even one)?

Do you really want a "finite" delay or a "variable" loop? Your example code is a variable loop, where N is the variable. A finite delay could just use the basic NOP instruction, with as many of them in sequence as needed. Not very efficient, but it would give the finest-grained control.
– Nedd
Commented Mar 28 at 12:26

Justme · Accepted Answer · 2024-03-28 13:27:09Z

3

No loop can be shorter than one iteration, as it would mean there can be no loop code to begin with.

Many compilers handle this scenario by providing a macro or inline function that waits for a requested number of cycles (or amount of time). It is implemented as a combination of a loop and additional NOPs, so that the delay matches the request with a granularity of one clock cycle.

The compiler is then free to call or inline whichever delay function is available to implement what the user requested.

So, the tightest loops in 8-bit AVR assembly take three cycles per iteration, plus additional setup code for loading an 8-bit register with the immediate value.

The instruction set has no instructions to do tighter loops than decrementing a variable and branching back to decrement it again if it was non-zero.

edited Mar 28 at 13:27

answered Mar 28 at 12:18

You begin your answer with a commonplace which might not apply. What immediately comes to my mind is that some architectures (68k derivatives for instance) have a combined decrement-and-branch-if-not-zero instruction, hence a loop consisting of only one instruction (as opposed to two in the code from my question) can be constructed. Fancier looping code is also feasible (like skipping an absolute jump using the "sbrs" instruction); as I wrote, mine is just standard stuff.
– Peter Rottengatter
Commented Mar 28 at 13:12
I'm just not familiar enough with the AVR8 instruction set to see if such optimisations are possible in AVR8 or not, and I'd appreciate an answer by someone who is.
– Peter Rottengatter
Commented Mar 28 at 13:22
@PeterRottengatter OK, well, I did say a shorter loop is not possible. An 8-bit AVR has no instructions that implement a loop shorter than the 3 cycles needed to decrement a register and branch back if it is not zero. This is a delay loop. While academically intriguing, it really does not matter how efficiently you waste time in a loop, and under normal circumstances the additional NOPs for a granularity finer than 3 cycles are usually not important.
– Justme
Commented Mar 28 at 13:25

emacs drives me nuts · Accepted Answer · 2024-04-09 20:34:57Z

Is there a way to create a finite wait loop where the execution of one loop takes only two cycles (or even one)?

Yes, kind of, assuming that the delay is known at assembly time. The solution is a macro that adds NOPs as needed and computes the required number of loop iterations (which may be zero).

Here is the code for the GNU tools, feel free to transpose it for your favourite assembler:

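;; We need an upper d-register (R16..R31) for the loop, because LDI only
;; works on those. The #define requires the C preprocessor, e.g. a .S file.
#define WAIT_REG R16

;; wait <cycles>: burn exactly <cycles> CPU cycles, 0 <= cycles <= 767.
.macro wait cycles
.if \cycles / 3 > 0
    ;; LDI plus the loop take 3 cycles per iteration in total.
    ldi  WAIT_REG, \cycles / 3
.Loop.\@:
    dec  WAIT_REG
    brne .Loop.\@
.endif
;; Pad the remainder (0, 1 or 2 cycles) with NOPs.
.if \cycles % 3 > 0
    nop
.endif
.if \cycles % 3 > 1
    nop
.endif
.endm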
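
The arithmetic inside the macro can be checked off-target. The following is a minimal host-side sketch in C that mirrors the two `.if` computations; `wait_plan`, `plan_wait` and `plan_cycles` are hypothetical names used only for illustration, and the cost model (3 cycles per loop iteration including the ldi, 1 cycle per nop) is the one stated in the question.

```c
#include <assert.h>

/* Hypothetical mirror of the arithmetic in the `wait` macro. */
struct wait_plan {
    unsigned loops;  /* value loaded by ldi; 0 means no loop is emitted */
    unsigned nops;   /* number of trailing nops, 0..2 */
};

struct wait_plan plan_wait(unsigned cycles)
{
    /* A single 8-bit loop covers at most 2 + 3*255 = 767 cycles. */
    assert(cycles <= 767);

    struct wait_plan p = { cycles / 3, cycles % 3 };
    return p;
}

/* Cycles taken by the emitted code: 3 per loop iteration
   (the ldi is already accounted for), plus 1 per nop. */
unsigned plan_cycles(struct wait_plan p)
{
    return 3 * p.loops + p.nops;
}
```

For example, `plan_wait(500)` yields 166 loop iterations and 2 nops, and `plan_cycles` of that plan gives back 500.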

Sample usage:

.text
.global main

main:
    wait 0
    wait 1
    wait 2
    wait 3
    wait 4
    wait 10
    wait 100
    wait 500
    ret

The generated code disassembles as follows (comments added by hand):

;; Delay = 0

;; Delay = 1
   0:   00 00           nop

;; Delay = 2
   2:   00 00           nop
   4:   00 00           nop

;; Delay = 3
   6:   01 e0           ldi r16, 0x01
00000008 <.Loop.3>:
   8:   0a 95           dec r16
   a:   f1 f7           brne    .-4         ; 0x8 <.Loop.3>

;; Delay = 4
   c:   01 e0           ldi r16, 0x01
0000000e <.Loop.4>:
   e:   0a 95           dec r16
  10:   f1 f7           brne    .-4         ; 0xe <.Loop.4>
  12:   00 00           nop

;; Delay = 10
  14:   03 e0           ldi r16, 0x03   ; 3
00000016 <.Loop.5>:
  16:   0a 95           dec r16
  18:   f1 f7           brne    .-4         ; 0x16 <.Loop.5>
  1a:   00 00           nop

;; Delay = 100
  1c:   01 e2           ldi r16, 0x21   ; 33
0000001e <.Loop.6>:
  1e:   0a 95           dec r16
  20:   f1 f7           brne    .-4         ; 0x1e <.Loop.6>
  22:   00 00           nop

;; Delay = 500
  24:   06 ea           ldi r16, 0xA6   ; 166
00000026 <.Loop.7>:
  26:   0a 95           dec r16
  28:   f1 f7           brne    .-4         ; 0x26 <.Loop.7>
  2a:   00 00           nop
  2c:   00 00           nop

;; Epilogue
  2e:   08 95           ret

This can easily be generalized to delays longer than 2+3·255 = 767 cycles by more .ifs and nested loop(s). The required arithmetic is all linear in nature.
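
As an illustration of what the cycle count of a nested loop might look like, here is a host-side C sketch that steps through one possible two-level loop and compares the result with a closed form. The loop shape, the register choice and the names `nested_cycles` and `nested_cycles_formula` are assumptions made for this example, not part of the macro above; the instruction costs are the same as before (ldi 1, dec 1, brne 2 taken, 1 not taken).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-level loop, one possible shape of the nesting:
 *
 *         ldi  r17, M
 * outer:  ldi  r16, N
 * inner:  dec  r16
 *         brne inner
 *         dec  r17
 *         brne outer
 */
unsigned long nested_cycles(unsigned m, unsigned n)
{
    assert(m >= 1 && m <= 255 && n >= 1 && n <= 255);

    unsigned long cycles = 1;            /* ldi r17, M */
    uint8_t r17 = (uint8_t)m;

    do {
        uint8_t r16 = (uint8_t)n;
        cycles += 1;                     /* ldi r16, N */
        do {
            r16--;
            cycles += 1;                 /* dec r16 */
            cycles += r16 ? 2 : 1;       /* brne inner */
        } while (r16);
        r17--;
        cycles += 1;                     /* dec r17 */
        cycles += r17 ? 2 : 1;           /* brne outer */
    } while (r17);

    return cycles;
}

/* Closed form of the same count for this particular loop shape. */
unsigned long nested_cycles_formula(unsigned m, unsigned n)
{
    return 3UL * m * (n + 1);
}
```

For this shape the count is 3·M·(N+1), so it reaches up to 195840 cycles with M = N = 255. Cycle counts that do not fit this form would still need padding, for instance with NOPs or with a trailing single-level loop.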

In the case when the delay is not known at assembly time, stuff gets more complicated, and there will always be specific delays that cannot be realized, in particular short ones.

the busybee · Accepted Answer · 2024-04-10 06:27:12Z

0

Just an addition to the other answers, which are otherwise perfect.

avr-libc provides two header files for busy-waiting for constant or variable times. I assume the implementers strove for optimal code, so you can inspect the resulting assembly.

1. `<util/delay.h>`

This header file declares two functions or function-like macros void _delay_ms(double __ms) and void _delay_us(double __us), which use the functions declared in the next header file.

My own experiments show that optimization commonly inlines the resulting code. Moreover, NOPs are inserted to reach the best approximation of the desired delay.

2. `<util/delay_basic.h>`

This header file declares two functions or function-like macros void _delay_loop_1(uint8_t __count) and void _delay_loop_2(uint16_t __count). The first loop executes three CPU cycles per iteration, while the second loop executes four CPU cycles per iteration.

So three cycles per iteration is the optimum. Q.E.D.
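
To relate these per-iteration costs to an actual time, the iteration count can be computed with integer arithmetic. The following is a rough host-side sketch in C, not code from avr-libc; `delay_iterations` is a hypothetical helper, the clock frequency is passed in explicitly, and the setup overhead for loading the counter is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of loop iterations needed so that the loop
   lasts at least `us` microseconds at `f_cpu_hz`, given the cost of one
   iteration in cycles (3 for _delay_loop_1, 4 for _delay_loop_2).
   Setup overhead is not modelled. */
uint64_t delay_iterations(uint32_t f_cpu_hz, uint32_t us,
                          uint32_t cycles_per_iter)
{
    assert(cycles_per_iter > 0);

    /* Required cycles, rounded up; 64-bit avoids intermediate overflow. */
    uint64_t cycles = ((uint64_t)f_cpu_hz * us + 999999u) / 1000000u;

    /* Required iterations, rounded up. */
    return (cycles + cycles_per_iter - 1) / cycles_per_iter;
}
```

At 16 MHz, a 10 µs delay is 160 cycles, which this sketch turns into 54 iterations of a 3-cycle loop or 40 iterations of a 4-cycle loop. Whether the result fits the 8-bit or 16-bit counter still has to be checked by the caller.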

answered Apr 10 at 6:27

These headers won't work in assembly, only in C/C++. The question is tagged assembly.
– emacs drives me nuts
Commented Apr 12 at 14:49
@emacsdrivesmenuts I did not write that the OP should use these. I just suggested to look into the resulting assembly, which I presume to be optimal.
– the busybee
Commented Apr 12 at 17:47
These headers compute the number of required cycles using floating-point arithmetic and constant folding, so that the cycle count is a compile-time constant. This part cannot be done by an assembler or in assembly. The compiler then uses __builtin_avr_delay_cycles to emit code for a specific number of delay cycles. The magic behind that built-in is deep in the GCC sources; its transposition into assembly (at least for loops up to a depth of 1) can be found in my answer. There is no magic trick to transplant the functionality of delay.h into assembly.
– emacs drives me nuts
Commented Apr 12 at 19:24
@emacsdrivesmenuts Again, my answer just "proves" that 3 cycles are the minimum iteration time. I do not doubt your answer, see my first statement. All the references to C stuff are there for documentation. (BTW, with some clever integer calculations an assembler can produce a similar result from an assembly-time-known constant with an SI time unit. There is no need for floating-point calculations; 32-bit arithmetic suffices. Of course, the clock frequency needs to be defined.)
– the busybee
Commented Apr 12 at 19:53
