I'd like to be specific on the atmega family.
As someone said, you can't, because the details are secret.
For your first effort, it would be extremely helpful if you were looking at an architecture that was fully documented.
It would also be helpful if you were looking at an older style of processor where the architecture was laid out to be easily understood by humans, instead of just easily implementable in silicon. You know - fixed position/size fields for opcode, source, and destination; less aliasing of "special" addresses, etc. Atmel, IMO, does a particularly poor job of describing the functional architecture of their chips (some other vendors do a better job with that, but a worse job of describing how the chip should be used). For example, the fact that peripherals and RAM are in the same address space is understated.
So for each clock cycle, you have an instruction in the instruction decode buffer, and one in the instruction prefetch register (or it will be there "soon.")
Some of the bits specify the source, which can be one of the registers, or a RAM location. RAM locations can be the instruction decode buffer itself ("immediate" addressing) (or maybe that's more like a register address?), or RAM pointed to by the instruction decode register (IO instructions), or RAM pointed to by the prefetch register ("direct addressing"), or RAM pointed to by the various index registers in the CPU.
Some of the bits specify the destination, which is similar but has somewhat fewer possibilities. The destination is also the source of the 2nd operand, for two-operand instructions. The destination can also be the PC (jmp), or it can get discarded (Compare instruction.)
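To make the "some of the bits specify..." idea concrete, here is a rough sketch of pulling the fields out of one AVR instruction format, the two-register form `oooo oord dddd rrrr` used by ADD, CP and friends. The function and constant names are my own invention for illustration; the bit layout is from the AVR instruction set manual. Note how the source register number is split across two places in the word, which is exactly the sort of thing that makes it harder for humans to read.

```python
# Hypothetical helper names; only the bit layout comes from the AVR manual.
OP_ADD = 0b000011  # ADD Rd, Rr : 0000 11rd dddd rrrr
OP_CP = 0b000101   # CP  Rd, Rr : 0001 01rd dddd rrrr (result is discarded)


def decode_two_register(opcode):
    """Split a 16-bit opcode of the form oooo oord dddd rrrr into fields."""
    op = (opcode >> 10) & 0x3F  # top 6 bits select the operation
    rd = (opcode >> 4) & 0x1F   # bits 8..4: destination (and first operand)
    # Source register: its high bit sits in bit 9, its low 4 bits in bits 3..0.
    rr = ((opcode >> 5) & 0x10) | (opcode & 0x0F)
    return op, rd, rr


# 0x0F12 is "ADD r17, r18"; 0x1712 is "CP r17, r18".
print(decode_two_register(0x0F12))
print(decode_two_register(0x1712))
```

The destination doubles as the first operand, so only two register numbers are needed for a two-operand instruction.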
More of the bits tell the ALU what it's supposed to do with the two 8-bit quantities. See a SN74181 datasheet for an approximation of what's available.
Probably on the next clock edge, the output of the ALU would get written to the destination. (Rising edge: fetch operands. High: wait for results to stabilize. Falling edge: write results. Or something like that.) The flag bits from the ALU also get stored in the status register.
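As a sketch of what "ALU output plus flag bits" means, here is an 8-bit add that produces both the result and the status bits an AVR ADD would set. The function name and the way the flags are packed are my own; the bit positions (C=0, Z=1, N=2, V=3, S=4, H=5) follow the AVR status register layout. Real hardware does all of this combinationally, not step by step like this.

```python
def alu_add(a, b):
    """Add two 8-bit values; return (result, sreg_bits) like an AVR ADD would."""
    total = a + b
    result = total & 0xFF
    c = int(total > 0xFF)                         # carry out of bit 7
    h = int((a & 0x0F) + (b & 0x0F) > 0x0F)       # half carry out of bit 3
    n = int(bool(result & 0x80))                  # result is negative (bit 7 set)
    # Signed overflow: operands had the same sign, result has a different one.
    v = int(bool(~(a ^ b) & (a ^ result) & 0x80))
    s = n ^ v                                     # sign bit, N xor V
    z = int(result == 0)
    sreg = c | (z << 1) | (n << 2) | (v << 3) | (s << 4) | (h << 5)
    return result, sreg


result, sreg = alu_add(0x7F, 0x01)
print(hex(result), bin(sreg))
```

For a compare instruction you would compute the flags the same way (with a subtract) and simply not write `result` back anywhere.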
Obviously, 16-bit operations, stack operations, multiply, and other instructions would be handled a bit differently. Immediate addressing only gets to access half the registers as the destination. And a bunch of other complications.
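The "half the registers" restriction falls straight out of the encoding. A sketch, with a made-up function name, of decoding LDI (`1110 KKKK dddd KKKK` per the AVR instruction set manual): the 8-bit constant eats so many bits of the instruction word that only 4 bits are left for the register number, so the hardware adds 16 and you can only reach r16..r31.

```python
def decode_ldi(opcode):
    """Decode LDI Rd, K (1110 KKKK dddd KKKK); return (rd, k)."""
    if (opcode >> 12) != 0b1110:
        raise ValueError("not an LDI opcode")
    rd = 16 + ((opcode >> 4) & 0x0F)              # only 4 bits: r16..r31
    k = ((opcode >> 4) & 0xF0) | (opcode & 0x0F)  # constant is split in two nibbles
    return rd, k


# 0xEA05 is "LDI r16, 0xA5"; 0xEFFF is "LDI r31, 0xFF".
print(decode_ldi(0xEA05))
print(decode_ldi(0xEFFF))
```

This is also what "immediate" addressing looks like: the operand comes out of the instruction word itself rather than from a register or RAM.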
This is all much easier to grasp on something like a PDP11, where the PC and SP are just ordinary registers, and there haven't been any "optimizations" done to squeeze 32 registers into 4 bits of identifier in the instruction. (or make everything happen in "one clock cycle"; it's really useful to think of multiple phases of each instruction taking "as long as needed", and letting the HW designer figure out how to make that happen in half a clock cycle. You can probably find a good description somewhere of what an 8051 does with each of the 12 clocks that make up ONE instruction "cycle" in THAT architecture. Or the 4 clocks of an 8080. etc.)