What for?
Moving from C to assembler (out of necessity), I ran into something unpleasant: my favorite function, _delay_ms (long millisecond), has no counterpart there. An Internet search turned up nothing, though I may have searched badly. Writing 8000 empty instructions by hand to hold 1 ms at 8 MHz is of course nonsense, and that is where the idea of writing my own delay came from.
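To make the arithmetic behind the 8000 figure explicit, here is a minimal C sketch. The helper name cycles_per_ms is made up for illustration, and it assumes each empty instruction takes exactly one clock cycle, as NOP does on AVR.

```c
#include <assert.h>

/* Number of CPU cycles in one millisecond for a clock of f_cpu_hz.
   Assumption: one single-cycle instruction (such as NOP) per cycle. */
static unsigned long cycles_per_ms(unsigned long f_cpu_hz)
{
    assert(f_cpu_hz >= 1000UL);
    return f_cpu_hz / 1000UL;
}
```

Calling cycles_per_ms(8000000UL) returns 8000, which is the number of single-cycle instructions that would be needed for 1 ms at 8 MHz.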
Step by step
Inspired by the idea, I of course rushed straight into battle and, without thinking much about it, decided to measure in milliseconds right away. The code was written successfully:
; First attempt. @0: delay in ms, @1: clock frequency in Hz
.macro DELAY_MS
    push R16
    push R17
    push R24
    push R28
    ldi R28, LOW(@0)
    ldi R24, HIGH(@0)
    rjmp cycMKS
cycSEK:
    subi R24, 1
    ldi R28, 255
cycMKS:
    cpi R28, 1
    brlo decMKS
    subi R28, 1
    ldi R16, LOW(@1/1000)
    ldi R17, HIGH(@1/1000)
    rjmp _delay_c
new_cycle:
    subi R17, 1
    ldi R16, 255
_delay_c:
    subi R16, 4
    cpi R16, 4
    brsh _delay_c
    NOP
    NOP
    cpi R17, 0
    brne new_cycle
    rjmp cycMKS
decMKS:
    cpi R24, 0
    brne cycSEK
    pop R28
    pop R24
    pop R17
    pop R16
.endm
For development I use AVR Studio 4 + gcc, and I tested the code there as well. The result at the end of debugging:

...
.equ F_CPU = 8000000    ; clock frequency in Hz
...
DELAY_MS 4, F_CPU       ; macro expansion for a 4 ms delay
...
The error was too large and grew with the length of the delay. It was glaring, and the code could not be tuned any further as written. I decided to go in order: first write a cycle counter for 1 byte, then for 2 bytes, and afterwards glue it all together to get a delay in milliseconds.
The result was 3 macros:
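; Delay for a given number of cycles (one-byte counter)
; @0: number of cycles, 9-255
.macro DELAY_CL
    ;push R16
    ldi R16, LOW(@0)-5
_delay_cl:                  ; main loop, 4 cycles per pass
    subi R16, 4
    cpi R16, 4
    brsh _delay_cl
    cpi R16, 1              ; pad out the remainder (0-3)
    breq end_cl_1
    cpi R16, 0
    breq end_cl
    cpi R16, 2
    breq end_cl
    rjmp end_cl
end_cl_1:
    NOP
    NOP
    NOP
end_cl:
.endm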
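; Delay for a given number of cycles (two-byte counter)
; @0: number of cycles, 15-65535
.macro DELAY_C
    ldi R16, LOW(@0)
    cpi R16, 17
    brsh fault
    rjmp init_R17
fault:                      ; low byte is large enough for DELAY_CL
    DELAY_CL LOW(@0-7)
init_R17:
    ldi R17, HIGH(@0)
    cpi R17, 0
    breq end_c
new_cycle:                  ; one pass per unit of the high byte
    subi R17, 1
    DELAY_CL 252
    cpi R17, 0
    brne new_cycle
    NOP
end_c:
.endm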
Within their respective ranges, these macros count cycles exactly.
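As a cross-check of the one-byte macro, the sketch below replays DELAY_CL step by step in C and adds up the cycles. It is an illustration, not a measurement: the function name delay_cl_cycles is made up, and the timings are the usual AVR ones (ldi, subi, cpi and NOP take 1 cycle, rjmp takes 2, a conditional branch takes 2 when taken and 1 when not).

```c
#include <assert.h>
#include <stdint.h>

/* Replays DELAY_CL for an argument n (9..255) and returns the cycles used.
   Assumed timings: ldi/subi/cpi/nop = 1, rjmp = 2,
   conditional branch = 2 if taken, 1 if not taken. */
static unsigned delay_cl_cycles(unsigned n)
{
    assert(n >= 9 && n <= 255);
    uint8_t r16 = (uint8_t)(n - 5);      /* ldi R16, LOW(@0)-5 */
    unsigned cycles = 1;

    for (;;) {                           /* _delay_cl: */
        r16 = (uint8_t)(r16 - 4);        /* subi R16, 4 */
        cycles += 2;                     /* subi + cpi R16, 4 */
        if (r16 >= 4) {
            cycles += 2;                 /* brsh taken */
        } else {
            cycles += 1;                 /* brsh falls through */
            break;
        }
    }

    if (r16 == 1)
        return cycles + 1 + 2 + 3;       /* cpi, breq taken, 3 NOPs */
    cycles += 2;                         /* cpi 1, breq not taken */
    if (r16 == 0)
        return cycles + 1 + 2;           /* cpi 0, breq taken */
    cycles += 2;                         /* cpi 0, breq not taken */
    if (r16 == 2)
        return cycles + 1 + 2;           /* cpi 2, breq taken */
    return cycles + 1 + 1 + 2;           /* cpi 2, breq not taken, rjmp */
}
```

Under these assumptions delay_cl_cycles(n) equals n for every n from 9 to 255. The chain of compares at the end of the macro exists to pad out the remainder left after the 4-cycle loop.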
; Delay in milliseconds
; @0: delay in ms, 1-65535
; @1: clock frequency in Hz (>= 1.3 MHz)
.macro DELAY_MS
    push R19
    push R18
    push R17
    push R16
    ldi R18, LOW(@0)
    ldi R19, HIGH(@0)
    cpi R18, 0
    breq re_init
_cicl_msl:                  ; one pass per millisecond
    DELAY_C @1/1000
    subi R18, 1
    cpi R18, 0
    breq re_init
    rjmp _cicl_msl
re_init:                    ; low byte exhausted, borrow from the high byte
    cpi R19, 0
    breq _end_c
    subi R19, 1
    ldi R18, 255
    DELAY_C (@1/1000)-255*5
    rjmp _cicl_msl
_end_c:
    pop R16
    pop R17
    pop R18
    pop R19
.endm
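The 255*5 term in DELAY_MS and the 1.3 MHz limit on @1 can be tied together with a little arithmetic. The C sketch below is one reading of the macro, not the author's own derivation. It assumes each pass of the _cicl_msl loop costs about 5 cycles on top of DELAY_C (subi, cpi, an untaken breq and rjmp), and it takes 15 as the smallest argument DELAY_C accepts, as stated in that macro's header. The names PASS_OVERHEAD, DELAY_C_MIN, compensated_cycles and min_f_cpu_hz are made up for illustration.

```c
#include <assert.h>

/* Hand-counted overhead of one pass of the _cicl_msl loop, in cycles:
   subi (1) + cpi (1) + breq not taken (1) + rjmp (2).
   This is an assumption from reading the macro, not a measurement. */
#define PASS_OVERHEAD 5L

/* Lower bound of DELAY_C's argument range, from that macro's header. */
#define DELAY_C_MIN 15L

/* Argument handed to the shortened DELAY_C call in re_init:
   one millisecond minus the overhead of the next 255 passes. */
static long compensated_cycles(long f_cpu_hz)
{
    assert(f_cpu_hz > 0);
    return f_cpu_hz / 1000L - 255L * PASS_OVERHEAD;
}

/* Smallest clock (Hz) for which that argument stays in DELAY_C's range. */
static long min_f_cpu_hz(void)
{
    return (255L * PASS_OVERHEAD + DELAY_C_MIN) * 1000L;
}
```

For F_CPU = 8000000 the shortened call gets 8000 - 1275 = 6725 cycles, and min_f_cpu_hz() comes out at 1290000 Hz, which agrees with the limit of about 1.3 MHz noted for @1.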
The results with this code are better:
For 1 ms (ATmega8535, F_CPU = 8001000 Hz)

For 300 ms (ATmega8535, F_CPU = 8001000 Hz)

For 32 s (ATmega8535, F_CPU = 8001000 Hz)

For 300 ms (ATmega6490, F_CPU = 4000000 Hz)

For 300 ms (ATtiny43U, F_CPU = 2000000 Hz)

The error lies in the range of roughly 4-150 microseconds, which is quite enough for me.
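A rough way to see where an error of this size can come from is to convert leftover loop cycles into microseconds. The C sketch below rests on two assumptions from reading DELAY_MS, not on measurements: each millisecond pass adds about 5 cycles of loop overhead, and up to 255 passes of the low-byte loop are not compensated. It ignores push/pop and the re_init code. The helper names are made up for illustration.

```c
#include <assert.h>

/* Converts a cycle count into microseconds at a clock of f_cpu_hz. */
static double cycles_to_us(unsigned long cycles, unsigned long f_cpu_hz)
{
    assert(f_cpu_hz > 0);
    return (double)cycles * 1e6 / (double)f_cpu_hz;
}

/* Upper estimate of uncompensated loop overhead in cycles:
   255 passes at an assumed 5 cycles each. */
static unsigned long max_uncompensated_cycles(void)
{
    return 255UL * 5UL;
}
```

At 8 MHz, cycles_to_us(max_uncompensated_cycles(), 8000000UL) gives about 159 microseconds, the same order as the upper end of the range quoted above.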
I would like to turn the DELAY_MS macro into a subroutine (it makes little sense to embed this much code at every call), but I have been tinkering with assembler for only about 2 weeks and have not yet figured out how to move all this into a separate module and write a proper function there.