Let's take a step back a moment and take a look at what is actually happening here, and get a few basic concepts straight. It's all very well bandying around magic words like "Harvard Architecture", but what does that actually mean?
Basically there are two (main) kinds of processor architecture in the world - Harvard (and "modified" Harvard) and Princeton (aka Von Neumann) architectures. The main difference between them is in how the core CPU looks at the outside world.
Let's go back in time a moment to the days of the ZX Spectrum. This is very much a simplified view of the "traditional" computer model. You have an address bus, a data bus, and some control signals. Onto those buses are attached different kinds of memory. For the ZX Spectrum 48K that's a 64KB (65536 address) memory space, as there are 16 address lines (2^16 = 65536). Addresses 0 - 16383 are attached to the system ROM. 16384 - 65535 are connected to RAM, with some of that RAM being used by the display system as a frame buffer. So, reading from address 362 will be reading from ROM, and writing to address 52876 will be writing to RAM. Simple enough, yes?
That's the "Princeton" architecture. All the memory of all different types lumped together into one big address space. Makes for a nice simple system.
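If it helps to see that as code, here's a minimal sketch of the single address space, written as plain C you can run on a PC. The names (bus_read, bus_write, princeton_demo) are made up for illustration, and it only models the memory map described above, not any real Spectrum hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the memory map described above:
   one 16-bit address space, ROM at 0-16383, RAM at 16384-65535. */
#define ADDRESS_SPACE_SIZE 65536u   /* 2^16 addresses from 16 address lines */
#define ROM_SIZE           16384u   /* addresses below this are ROM */

static uint8_t memory[ADDRESS_SPACE_SIZE];

/* One read function serves ROM and RAM alike: the address alone
   decides which kind of memory answers. */
uint8_t bus_read(uint16_t address)
{
    return memory[address];
}

/* Writes go over the same bus. Returns 1 if the value was stored,
   0 if the address falls in ROM and the write had no effect. */
int bus_write(uint16_t address, uint8_t value)
{
    if (address < ROM_SIZE) {
        return 0;
    }
    memory[address] = value;
    return 1;
}

/* Exercises the two addresses used in the text. */
void princeton_demo(void)
{
    assert(bus_write(362, 0xFF) == 0);    /* 362 is ROM: nothing stored */
    assert(bus_write(52876, 0x2A) == 1);  /* 52876 is RAM */
    assert(bus_read(52876) == 0x2A);
}
```

The point is that there is only one bus_read() and one bus_write(). Nothing in the call says "ROM" or "RAM".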
Small microcontrollers, however, tend to use the "Harvard" (or more commonly the "Modified Harvard") architecture. In this way of thinking, instead of one address and data bus for all the memory, the CPU has two! One of the bus sets goes to the Flash (or ROM) memory, and the other goes to the RAM. Consequently things get a little more complex. Each address bus (let's, for the sake of argument, take it to be like the above example and have 16-bit addresses) has its own addresses, so the Flash address bus has addresses 0-65535, and the RAM address bus has addresses 0-65535. This can give a big increase in speed over the Princeton architecture, as the CPU can be (for example) reading from Flash and writing to RAM at the same time! But that comes at a cost - the cost of complexity.
Say you want to read, as above, from the Flash address 362 and write to the RAM address 52876. Just saying "Read from 362" is now ambiguous, as there are two address 362s - the Flash 362 and the RAM 362, so which is it that gets read? Well, this is when it comes down to the specific chip to sort it out. Most will have different instructions for reading/writing RAM and for reading/writing Flash. You might have, for example, (hypothetical) instructions like this:
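MOV $r1, (362)        // plain read of address 362: ambiguous with two address spaces
MOV (52876), $r1      // plain write to address 52876
MOVF $r1, (362)       // hypothetical: read Flash address 362 into r1
MOVR (52876), $r1     // hypothetical: write r1 to RAM address 52876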
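To make the "two address 362s" point concrete, here's a rough model in host-side C. The function names mirror the hypothetical MOVF and MOVR instructions above; flash_program, ram_read and harvard_demo are helpers invented purely for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define SPACE_SIZE 65536u

/* Two separate memories, each with its own addresses 0-65535. */
static uint8_t flash_space[SPACE_SIZE];  /* program memory */
static uint8_t ram_space[SPACE_SIZE];    /* data memory */

/* Stand-in for the hypothetical MOVF instruction: read from Flash. */
uint8_t movf_read(uint16_t address)
{
    return flash_space[address];
}

/* Stand-in for the hypothetical MOVR instruction: write to RAM. */
void movr_write(uint16_t address, uint8_t value)
{
    ram_space[address] = value;
}

/* Helper for the sketch: read RAM back so we can check it. */
uint8_t ram_read(uint16_t address)
{
    return ram_space[address];
}

/* Helper for the sketch: pretend the Flash was programmed beforehand. */
void flash_program(uint16_t address, uint8_t value)
{
    flash_space[address] = value;
}

void harvard_demo(void)
{
    flash_program(362, 0x11);
    movr_write(362, 0x22);               /* same number, different memory */
    assert(movf_read(362) == 0x11);      /* the Flash 362 */
    assert(ram_read(362) == 0x22);       /* the RAM 362 */

    movr_write(52876, movf_read(362));   /* Flash 362 -> RAM 52876 */
    assert(ram_read(52876) == 0x11);
}
```

Address 362 now means nothing on its own. You have to say which memory you are talking to, which is exactly what the separate instructions do.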
Of course, some don't have instructions for accessing the flash at all, and instead use an internal peripheral to access the flash:
LDI $r1, 0x6A         // low byte of address 362 (0x016A)
MOV (FLADDL), $r1
LDI $r1, 0x01         // high byte of address 362
MOV (FLADDH), $r1
LDI $r1, 0x40         // hypothetical "read from flash" command for the peripheral
MOV (FLCON), $r1      // issue the command
MOV $r1, (FLDAT)      // collect the byte the peripheral fetched
As you can see, that gets even more complex. Load the two parts of the address into HIGH and LOW registers (we're talking 8-bit here), issue a command, and read the data.
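The same sequence can be sketched in C with the peripheral modelled in software. The register names FLADDL, FLADDH, FLCON and FLDAT and the 0x40 read command come from the hypothetical listing above; everything else (flcon_write, flash_read_byte, flash_program, peripheral_demo) is invented for this illustration and doesn't correspond to any particular chip.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flash-controller registers, named as in the listing. */
uint8_t FLADDL, FLADDH, FLCON, FLDAT;

#define FLCON_READ 0x40u   /* hypothetical "read from flash" command */

static uint8_t flash_cells[65536];

/* Helper for the sketch: pretend the Flash was programmed beforehand. */
void flash_program(uint16_t address, uint8_t value)
{
    flash_cells[address] = value;
}

/* Models what the peripheral does when the command register is written:
   on a read command it combines the two address bytes and latches the
   addressed Flash byte into the data register. */
void flcon_write(uint8_t command)
{
    FLCON = command;
    if (command == FLCON_READ) {
        uint16_t address = (uint16_t)(((uint16_t)FLADDH << 8) | FLADDL);
        FLDAT = flash_cells[address];
    }
}

/* The three steps from the text: load the address, issue the command,
   read the data. */
uint8_t flash_read_byte(uint16_t address)
{
    FLADDL = (uint8_t)(address & 0xFFu);   /* low byte of the address */
    FLADDH = (uint8_t)(address >> 8);      /* high byte of the address */
    flcon_write(FLCON_READ);
    return FLDAT;
}

void peripheral_demo(void)
{
    flash_program(362, 0x5A);
    assert(flash_read_byte(362) == 0x5A);
    assert(FLADDL == 0x6A && FLADDH == 0x01);  /* 362 == 0x016A */
}
```

Wrapping the sequence in one function like flash_read_byte() is the usual way to hide the fiddly part from the rest of the program.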
So how do you simplify the job?
Well, unless told otherwise, the compiler tries to make things easy for us. Data in the flash is first copied into RAM as part of the startup routine of the program (aka crt0 - C Run-Time phase 0). This makes accessing that data from then on as simple as reading any other data from RAM. But when you have a limited amount of RAM and lots of data, that can cause big problems. That's where things such as the PROGMEM flag come into play. This basically tells the compiler "This data must NOT be copied into RAM". Which is good, but then how do you get at the data? Your program then has to specifically use the facilities the chip provides to read that data, which as you have seen is not always as straightforward as just reading. In fact, it can get downright complex.
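To see why the default behaviour eats RAM, here's a rough host-side model of that startup copy. It's only a sketch of the idea: the array contents and the names flash_init_image, ram_data, startup_copy_data and friends are all hypothetical, and real startup code works on whole linker sections rather than one array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Initial values as they sit in Flash (hypothetical contents). */
static const uint8_t flash_init_image[] = { 'H', 'e', 'l', 'l', 'o', 0 };

/* RAM set aside for the same data: every byte in the Flash image
   needs a matching byte of RAM. */
static uint8_t ram_data[sizeof flash_init_image];

/* Conceptually what the startup code does before main() runs:
   copy each initial value from Flash into its RAM home. */
void startup_copy_data(void)
{
    for (size_t i = 0; i < sizeof flash_init_image; i++) {
        ram_data[i] = flash_init_image[i];
    }
}

/* How much RAM the copy costs. */
size_t ram_bytes_used(void)
{
    return sizeof ram_data;
}

/* After the copy the program just reads RAM like any other variable. */
const uint8_t *ram_data_ptr(void)
{
    return ram_data;
}

void startup_demo(void)
{
    startup_copy_data();
    assert(ram_bytes_used() == 6);       /* "Hello" plus terminator */
    assert(ram_data_ptr()[0] == 'H');
}
```

Six bytes is nothing, but scale that up to a few kilobytes of strings or lookup tables on a chip with 2KB of RAM and the problem is obvious.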
So Arduino provides (on AVR boards, by way of avr-libc's <avr/pgmspace.h>) a set of functions and macros to do the work for you. You can have direct access to the flash memory by using pgm_read_byte() and similar. These will run the special instructions needed to read the data from the flash memory. You pass them the address of a variable in flash and they fetch the data from that address. If you pass the address of a variable in RAM instead, they won't know the difference and will fetch whatever is at that address in flash anyway, so you have to be careful to only ever use them on variables that have been flagged as PROGMEM.
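As a small example, here's a lookup table kept in flash and read back with pgm_read_byte(). The table and the square_of() function are hypothetical. The #else branch is an assumption added so the same file also builds on a PC for testing, where there is only one address space and an ordinary pointer read does the job.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __AVR__
#include <avr/pgmspace.h>   /* real PROGMEM and pgm_read_byte() */
#else
/* Host fallback, for testing off-target only (an assumption of this
   sketch): one address space, so a plain read is equivalent. */
#define PROGMEM
#define pgm_read_byte(addr) (*(const uint8_t *)(addr))
#endif

/* Hypothetical lookup table that should live in flash only. */
static const uint8_t squares[] PROGMEM = { 0, 1, 4, 9, 16, 25 };

/* Returns n squared for n in 0..5, fetched from flash. */
uint8_t square_of(uint8_t n)
{
    assert(n < sizeof squares);          /* stay inside the table */
    return pgm_read_byte(&squares[n]);   /* NOT squares[n] on AVR */
}
```

On an AVR, writing squares[n] directly would compile but would read from the RAM address space instead, which is the mix-up described above.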
Also there are a number of extra versions of existing functions, especially string and memory functions, which deal specifically with data such as C strings in flash. These are suffixed with _P to indicate they work on PROGMEM variables. Examples are strcmp_P(), memcpy_P() etc.
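Here's a minimal sketch of the _P functions in use. The greeting string, is_greeting() and copy_greeting() are hypothetical. As before, the #else branch is only an assumption to let the file build on a PC, where the ordinary functions behave the same.

```c
#include <assert.h>
#include <string.h>

#ifdef __AVR__
#include <avr/pgmspace.h>   /* real strcmp_P() and memcpy_P() */
#else
/* Host fallback, for testing off-target only (an assumption of this
   sketch): map the _P variants onto the ordinary functions. */
#define PROGMEM
#define strcmp_P(ram, flash)   strcmp((ram), (flash))
#define memcpy_P(dst, src, n)  memcpy((dst), (src), (n))
#endif

/* Hypothetical string kept in flash only. */
static const char greeting[] PROGMEM = "hello";

/* Returns 1 if the string in RAM matches the one in flash.
   The flash string is the SECOND argument of strcmp_P(). */
int is_greeting(const char *ram_string)
{
    return strcmp_P(ram_string, greeting) == 0;
}

/* Copies the flash string, terminator included, into a RAM buffer.
   Returns the number of bytes copied, or 0 if the buffer is too small. */
size_t copy_greeting(char *buffer, size_t buffer_size)
{
    size_t needed = sizeof greeting;     /* 5 characters + '\0' */

    assert(buffer != NULL);
    if (buffer_size < needed) {
        return 0;
    }
    memcpy_P(buffer, greeting, needed);  /* source is in flash */
    return needed;
}
```

Once copy_greeting() has run, the buffer is ordinary RAM and can go to any normal string function.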
So to summarise, let's take the analogy of apartment blocks.
You have two apartment blocks next door to each other (A and B). Each block has 16 apartments in it, numbered 1 to 16. Someone sends a letter to apartment 3, but where does it go? There are two apartment 3s. And to top it all off, apartment block B doesn't have any mailboxes.
So, you have to address your letter to either apartment A3 or apartment B3, but to get it to apartment B3 you have to first send your letter to Geoff in apartment A6 who is friends with Gill in apartment B9. He will hand the letter to Gill, who can then pass it on to Arthur in apartment B3.
So there you have it - Harvard architecture in a nutshell.