In fact an unconditional instruction prefetch will delay all memory data accesses by one memory cycle and effectively slow down overall operation.
An instruction prefetch only makes sense if the CPU is busy internally, in particular if instructions can be executed in parallel. That happens rarely in a RISC architecture where most instructions take a single clock. I also wonder why such big memory caches are required for efficient instruction prefetch. Doesn't today's 64-bit data bus width provide enough bytes for the next few instructions in one memory cycle?
That's possible only if the EU instructs the BIU about data fetches, with the full address (seg:off), at instruction time. The EU then has to wait until the data transfer has finished. In case of jumps the entire instruction queue has to be discarded.
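As a toy illustration of that last point (class name, queue depth, and instruction encoding are all made up, not any real BIU), a prefetch queue that tops itself up from idle bus cycles and flushes on taken jumps might look like:

```python
from collections import deque

class PrefetchQueue:
    """Toy model of a BIU-style prefetch queue (all details illustrative)."""
    def __init__(self, memory, depth=6):
        self.memory = memory      # flat list of instruction bytes
        self.queue = deque()
        self.depth = depth
        self.fetch_pc = 0         # where the BIU fetches ahead
        self.flushes = 0

    def fill(self):
        # the BIU uses otherwise-idle bus cycles to top up the queue
        while len(self.queue) < self.depth and self.fetch_pc < len(self.memory):
            self.queue.append(self.memory[self.fetch_pc])
            self.fetch_pc += 1

    def next_byte(self):
        # the EU consumes instruction bytes from the front of the queue
        self.fill()
        return self.queue.popleft()

    def jump(self, target):
        # a taken jump invalidates everything prefetched past it
        self.flushes += 1
        self.queue.clear()
        self.fetch_pc = target

code = list(range(16))            # stand-in instruction bytes
q = PrefetchQueue(code)
assert q.next_byte() == 0
q.jump(8)                         # queued bytes 1..6 are wasted bus traffic
assert q.next_byte() == 8
assert q.flushes == 1
```

The wasted fetches after each taken jump are exactly the cost being discussed: the bus cycles spent filling the queue bought nothing.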
Parallel instruction execution requires synchronization with the results (registers, flags, ...) of logically preceding instructions. This again encourages instruction reordering by the compiler.
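A minimal sketch of that kind of compiler reordering (a toy list scheduler, not any real compiler pass): given a dependency map, it avoids emitting an instruction immediately after the one that produces its input, so an independent instruction can hide the latency.

```python
# Toy list scheduler: never place an instruction directly after its
# producer if some other ready instruction can be slotted in between.
def schedule(instrs, deps):
    # instrs: list of names; deps: {consumer: set of producer names}
    done, order, pending = set(), [], list(instrs)
    while pending:
        for i in pending:
            ready = deps.get(i, set()) <= done
            stalls = bool(order) and order[-1] in deps.get(i, set())
            if ready and not stalls:
                break
        else:
            # no stall-free candidate; fall back to any ready instruction
            i = next(x for x in pending if deps.get(x, set()) <= done)
        order.append(i)
        done.add(i)
        pending.remove(i)
    return order

# b depends on a; independent c gets slotted between them
assert schedule(["a", "b", "c"], {"b": {"a"}}) == ["a", "c", "b"]
```

Real schedulers model multi-cycle latencies and register pressure, but the principle is the same: independent work fills the gaps that dependencies would otherwise leave.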
This way the x86 chips become bigger and bigger with every new version, with any number of technological and architectural problems. Segmented memory addressing also turned out to slow down execution at a high memory-cache cost, and is effectively unused nowadays. Nonetheless all old features and instructions have to be implemented in every new chip version in order to keep all existing programs running.
No such problems exist in RISC architectures, which could be redesigned from scratch with every new chip generation, without a heap of legacy crap.
Only if the CPU runs fast enough to fully utilize the memory bus, which USED to be uncommon. The 6502 famously left the bus unoccupied often and consistently enough for video generation to get in there, and the CDP1802 (at 16 clocks per instruction) had memory cycles long enough (6 clocks?) that a friend of mine significantly reduced power consumption by adding some circuitry to compress the access...
(And both of those even in the case where the instruction accesses data in memory, in addition to the instruction fetch.)
I'm unable to immediately find prefetch/timing information for modern von Neumann microcontrollers like the MSP430 or RS08/6805. I guess some of what was "prefetch" in microcontrollers has become "flash accelerator" in modern microcontrollers.
I guess the architectural differences between Harvard and von Neumann end up a bit blurred. If you have a separate code and data cache on a single memory bus, does that make your CPU "Harvard"?
There are four terms found in the literature:
von Neumann architecture: can execute a program out of ROM and RAM (8086)
non-von Neumann architecture: can't execute a program out of RAM (328P)
Harvard architecture: ?
Modified Harvard architecture: ?
In modern uses, it's all about separate instruction and data buses, right? "Address spaces" are less significant (thus some ARMs being "Harvard"). Having separate caches means separate buses going into the CPU...
In former times it was (and still is) nearly impossible to have chip packages with duplicate address and data buses. No such problem in microcontrollers with integrated memories.
How do you count chips like the 8051 with essentially TDM'ed buses (these parts of the 12+ clocks per instruction can access code, those other clocks can access data)? (Assuming you count the 8051 XRAM as "data" - arguably there are 3 buses!)
I'll take a shot. You can probably find more complete answers on Wikipedia...
In a von Neumann architecture, there is one memory (address space, bus, etc) that contains both the instructions and the data. You can write self-modifying code, accidentally execute your data, and easily do things like incremental compilation of interpreted code.
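To make the self-modifying-code point concrete, here's a toy von Neumann machine (instruction set and encoding entirely invented for illustration): because one memory array holds both code and data, a STORE can overwrite an instruction that later executes.

```python
# Toy von Neumann machine: one memory array holds both instructions and
# data, so a STORE can rewrite an instruction (self-modifying code).
# Each cell is either a (opcode, operand) tuple or plain data.
def run(mem):
    pc, acc = 0, 0
    while True:
        op, arg = mem[pc]
        pc += 1
        if op == "LOADI":
            acc = arg                 # load immediate (may itself be an instruction tuple)
        elif op == "STORE":
            mem[arg] = acc            # writes into the SAME memory code lives in
        elif op == "HALT":
            return acc

mem = [
    ("LOADI", ("LOADI", 42)),  # acc = a brand-new instruction, treated as data
    ("STORE", 2),              # overwrite slot 2: self-modifying
    ("HALT", 0),               # ...so this HALT is replaced by LOADI 42
    ("HALT", 0),
]
assert run(mem) == 42          # execution fell through the rewritten slot
```

On a Harvard machine this program is unwritable: the STORE simply has no path into instruction memory.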
Almost all "large" computers are von Neumann architectures.
In a Harvard architecture, the code memory and the data memory are completely separated. There are obvious examples like 8-bit PICs, where the instruction memory is entirely different in width (12 or 14 bits - just add bits and you can address more data!) than the data memory (8 bits). (AVRs too, but it's less obvious when the instruction is exactly two bytes.) This allows (theoretically) faster code execution because data manipulation doesn't interfere with instruction fetching.
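A small sketch of that width mismatch (PIC-like numbers, but the word values here are made up): program memory and data memory are separate arrays with separate address spaces, and the same numeric address refers to two unrelated things.

```python
# Toy PIC-style Harvard split: 14-bit program words, 8-bit data bytes,
# in two completely separate address spaces.
PROG_WIDTH, DATA_WIDTH = 14, 8

program = [0b10_1010_1010_1010, 0b01_0101_0101_0101]  # 14-bit opcode words
data = [0x00] * 16                                     # 8-bit data cells

assert all(w < (1 << PROG_WIDTH) for w in program)
assert all(b < (1 << DATA_WIDTH) for b in data)

# Address 0 means two different things on the two buses, and the two
# fetches can happen in the same cycle:
addr = 0
opcode = program[addr]   # over the instruction bus
byte = data[addr]        # over the data bus, concurrently
```

There's no way to express "execute data[5]" here: the data array simply can't be fed to the instruction decoder.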
"RAM" vs "ROM/Flash/etc" doesn't really have anything to do with it, although the microcontroller space, the fact that your program memory isn't (easily) writeable anyway make a Harvard architecture "obvious." (Not all microcontrollers are Harvard. MSP430 and RS08 are von Neumann, for example.)
Of course, being able to have some data in "program space" is useful, and having multiple address spaces is "inconvenient", so there is a lot of "fuzziness" around these terms in modern implementations. As implied by previous discussion here.
On a modern ARM chip, with the so-called "Modified Harvard" architecture, you have a single address space, but it's split between multiple memory buses, so you can still get some of the performance benefits of a Harvard architecture while also retaining the other benefits of a von Neumann machine. Apparently (according to WP), having separate data and instruction caches makes a processor into a "modified Harvard" machine...
WP says:
The term "von Neumann architecture" has evolved to refer to any stored-program computer in which an instruction fetch and a data operation cannot occur at the same time (since they share a common bus). This is referred to as the von Neumann bottleneck, which often limits the performance of the corresponding system.
It's a micro-controller with fast parallel internal code and data memory, a typical Harvard architecture.
Timing varies across chip versions. Access of both code and data memory during one instruction is common to von Neumann and Harvard architectures. On the bleeding edge ("1T") true parallel access to code and data can become essential.
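A back-of-envelope cycle count makes the difference visible (the model is deliberately crude: one cycle per access, no queueing): on a shared bus a data access serializes behind the code fetch, while separate buses let the two overlap.

```python
# Toy cycle count: each instruction needs one code fetch, and some also
# need one data access. Shared bus: the accesses serialize. Separate
# (Harvard-style) buses: they overlap in the same cycle.
def cycles(instr_stream, shared_bus):
    total = 0
    for needs_data in instr_stream:
        if shared_bus:
            total += 1 + (1 if needs_data else 0)  # fetch, then data access
        else:
            total += 1                             # data access overlaps fetch
    return total

stream = [True, False, True, True]        # 3 of 4 instructions touch data
assert cycles(stream, shared_bus=True) == 7   # von Neumann bottleneck
assert cycles(stream, shared_bus=False) == 4  # parallel access, "1T"-style
```

The gap grows with the fraction of instructions touching data, which is why single-cycle ("1T") parts can't do without the parallel access.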
For going into the detail of how a basic CPU works, you may find it useful to look at one of the breadboard 8-bit CPUs of the type made famous by Ben Eater. These are basic CPUs made from discrete logic components plus a few special chips, say for the ALU. There is no hidden detail and absolutely everything is exposed.
The one I find most interesting was put together by James Bates, because it is well documented and he even wrote an assembler for it, so you can test small programs: https://www.youtube.com/user/jimmyb998 . I built an emulator for it in C++ to run on an ESP32, with a variable trace level so you can step through the instructions or even the microcode and see what is happening. I was originally intending to make a hybrid - part emulated and part real hardware - because I could not face building the whole thing across multiple breadboards, but I don't now see that happening any time soon. I published the emulator here: Emulator of an 8 bit breadboard logic chip teaching computer, with a link to it on Wokwi.com .
Incidentally, the part I had the most difficulty with was the ALU (he used an SN74LS382 chip). I could not originally understand how it could be completely agnostic about the data type being signed or unsigned and still give a usable answer. The trick is that it is up to the programmer to specify the data type and then interpret the result, including the overflow and carry bits, accordingly.
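That trick is easy to demonstrate outside the hardware (a generic 8-bit two's-complement adder sketch, not specific to the SN74LS382's function-select details): the ALU produces one bit pattern plus carry and overflow flags, and "signed" vs "unsigned" is purely how the programmer reads them.

```python
# An 8-bit adder is type-agnostic: it just adds bit patterns. The carry
# flag signals unsigned overflow; the overflow flag signals signed
# (two's-complement) overflow. The programmer picks which one to believe.
def add8(a, b):
    raw = a + b
    result = raw & 0xFF
    carry = raw > 0xFF                                     # unsigned overflow
    overflow = ((a ^ result) & (b ^ result) & 0x80) != 0   # signed overflow
    return result, carry, overflow

r, c, v = add8(0x50, 0x50)   # 80 + 80, one addition, two readings
assert r == 0xA0
assert not c                 # as unsigned: 160, perfectly fine
assert v                     # as signed: 80 + 80 overflowed to -96

unsigned_view = r                            # 160
signed_view = r - 256 if r & 0x80 else r     # -96
```

Same sum, same flags - an unsigned program checks carry, a signed program checks overflow, and the silicon never needs to know which.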
The 8051 MCU can be operated as a von Neumann computer, fetching and executing programs from external RAM. It is done with the help of the JMP @A+DPTR instruction. The PC is replaced with a RAM address entered interactively from a keypad, which is then loaded into DPTR.
Fig-1 is an MCU Trainer based on the 8751 (a variant of the 8051), which I designed many years ago. It can execute programs from external RAM (Fig-2).
Well, sure, if you've "OR"ed PSEN and RD, and have thus decoded the external RAM as both "program" and "data" memory. (Which I guess is pretty common, especially in these days of easy-to-use large RAM chips - but not really the "intended configuration.")