Tuesday, July 24, 2012

XENTRAL: A Simple FPGA CPU


XENTRAL is a simple Harvard Architecture CPU. It targets the Spartan6 FPGA present on the Digilent's Nexys3 board, but it's all written using portable behavioral VHDL1. You should be able to use your favorite FPGA, synthesizer and simulator.

All project files are available on github under the Creative Commons Attribution Non-Commercial (CC-by-NC) license.

Architecture Overview

XENTRAL is a Harvard Architecture processor. This means that code and data are handled separately.
The picture above is a very high-level schematic representation of the XENTRAL architecture. Instructions are fetched from ROM3. Based on the instruction, the control unit manages the execution unit. The execution unit performs the actual execution of the instruction, by e.g. adding the contents of two CPU registers, writing a value in RAM or performing I/O.

XENTRAL is a Harvard Architecture processor. This means that code and data are handled separately. This effectively precludes the use of self-modifying code. It also prevents the processor from loading its own program. The program must be programmed into the instruction memory. This is a common approach for embedded devices and microcontrollers, but it does not play nice with operating systems and boot loaders. As the project grows, we will re-visit this limitation.

Overview of the Execution Unit

Schematic representation of the execution unit


The execution unit is designed around a three-bus principle. In addition to the general purpose registers (R1 ... R8, only R1 and R2 are shown in the picture) we also have the memory address register (MAR) and the memory block register (MBR), the stack pointer (SP), the immediate register (IMM) and two constants (0,1). MAR and MBR govern RAM access. A memory address is loaded into MAR. It's contents can then be read from or written to MBR. SP keeps track of the top-of-stack in RAM. IMM puts numbers embedded in the instruction stream (immediate operands) on the bus. It is driven from the control unit. In order to handle function returns and indirect jumps, we need to be able to pass a new program counter from memory/registers back to the control unit. For this purpose we use the register NPC.

Every execution cycle, bus 1 and bus 2 are driven from the registers and constants. They form the operands of an arithmetic/logic operation. The result of this operation appears on bus 3 and is routed back into one of the registers. Register-to-register transfers are also incorporated into this paradigm. The general purpose registers can drive both bus1 and bus2. Some of the special purpose registers or constants are only available on one of the two buses. All registers are always loaded from bus3.

The control unit determines, by reading the instruction stream, which registers get to drive bus1 and bus2, the ALU operation OP and also where the result that appears on bus3 will be stored.

Note that this architecture does not support in-place operations, such as R1 = R1+R2. Because the entire operation takes place in a single cycle, the result cannot be stored in a register that is also providing one of the operands. To overcome this limitation, temporary registers must be used. 

You can do in-place operations just fine. 

Overview of the Control Unit

Schematic representation of the control unit
The control init controls the execution unit based on the instructions read from the Instruction Memory. The program counter (PC) stores the address of the current instruction. The instruction is fetched from the instruction memory and store in the instruction register (IR). Based on the instruction in IR and the status of the ALU flags (stored in the FLAGS signal), the control unit determines what register will drive the input buses 1 and 2 (signals BUS1SEL and BUS2SEL), which operation will be performed on the inputs (OP) and where this result will be stores (BUS3SEL).

For instructions that include immediate operands, the control unit will put the operand on bus1 through the immediate register (IMM). Instructions that modify the program counter (PC) based on values from the execution unit will read the new value of PC from bus3 using the NPC register.

The control unit is also responsible for coordinating the execution of multi-clock instructions. This includes memory accesses, stack manipulations and function calls.

The Registers

Together with the ALU, the registers form the heart of the design. Registers are defined in VHDL simply by describing a signal that takes on the value of the input at the (rising) edge of the clock:

if rising_edge(clk) then
      if LDEN = '1' then
         VALUE <= IN1;      
      end if;
end if;

Remember how we said that we can't do one-clock in-place operations with this architecture? This can be really annoying, especially for increments and decrements. Not being able to increment/decrement the stack pointer means that every stack operation would need to expend two cycles and a register just to update the stack pointer. To alleviate this issue we design our registers with the ability to increment/decrement in-place. This means that every register with this capability will have, next to the flipflops that make up the register, its own set of adders and its own carry chain.

Adders and carry chains eat up FPGA real estate and introduce logic delays. Therefore we only want to synthesize them where we will be needing them, on SP. To avoid duplication of code, we put the increment/decrement logic in the module shared by all registers, but we add a default value to the associated INC and DEC signals that will ensure the synthesizer optimizes away the adders and the carry chain if we do not need them connect them to any actual inputs.

process(clk)
begin
   if rising_edge(clk) then
      if LDEN = '1' then
         VALUE <= IN1;      
      else
         if INC = '1' then
            VALUE <= unsigned(VALUE) + 1;
         else 
            if DEC = '1' then
               VALUE <= unsigned(VALUE) - 1;
            end if;
         end if;
      end if;
   end if;      
end process; 

The complete source for the registers can be found in reg1in2out.vhd.

The ALU

The ALU is responsible for the actual "processing" done by the CPU. The code is simple: we have two inputs, one output and a signal that select the operation. Based on the value of the output, we also set the flags. At the moment we only implement the Zero, Carry and Sign flag. In order to facilitate signed arithmetic we will also need to add an Overflow flag at a future point in time.


architecture Behavioral of ALU is
signal RR: STD_LOGIC_VECTOR (32 downto 0); -- extra bit for carry
begin      
   WITH OP SELECT
      RR <= '0'&A and '0'&B   WHEN X"0",
         '0'&A or '0'&B       WHEN X"1",
         '0'&A xor '0'&B      WHEN X"2",
         '0'&A nand '0'&B     WHEN X"3",
         '0'&A nor '0'&B      WHEN X"4",
         unsigned('0'&A) + unsigned('0'&B)  WHEN X"5",
         unsigned('0'&A) - unsigned('0'&B)  WHEN X"6",
         not '0'&A       WHEN X"7",
         '0'&A           WHEN X"A",
         '0'&B           WHEN X"B",
         (others => '0') WHEN OTHERS;   
   -- Output
   R <= RR(31 downto 0);
   -- Flags
   ALUFLAGS(1) <= RR(31); -- SIGN
   ALUFLAGS(2) <= RR(32); -- CARRY
   WITH RR(31 downto 0) SELECT ALUFLAGS(0) -- ZERO
      <= '1' WHEN X"00000000",
         '0' WHEN OTHERS;
   
end Behavioral;



We use an intermediate 33 bit signal to store the result. This helps us to handle the carry easily, and allows us to read back the values of the output (to update the flags) without having to declare the R port as "inout".
Note the two pseudo-operations A and B that pass operand A and operand B respectively to the output. Because all cycles pass through the ALU, these operations are needed to support register moves.


Note that we are missing quite a bit of instructions. The obvious ones are shifts (plain and arithmetic), rotates, add-with-carry (to support arbitrary precision math) etc.


The complete source for the ALU can be found in ALU.vhd.

Behavioral Description of RAM

We are going to need some RAM to store data and our stack. For now we are going to use RAM that is part of the FPGA itself, rather than accessing the memory on the Nexys3 board. There's actually two types of memory on the FPGA, block RAM and distributed ram. Block RAM consists or dedicated RAM areas in the FPGA. Distributed RAM refers to FPGA logic that is configured as RAM. Instead of instantiating RAM resources directly, or using Xilinx's Core Generator to do it for us, we are going write a behavioral description of our RAM blocks, and leave it up to the FPGA synthesis tools to choose how to implement it.

First we need a to allocate actual storage:

type ram_type is array (0 to (2**address'length)-1) 
      of std_logic_vector(datain'range);
signal ram : ram_type;

Note how the amount of storage is determined by the length of the address register. By using unconstrained types and the range and length qualifiers we can re-use this module in different situations and instantiate RAMs of different sizes, just by instantiating it with different size parameters.

The heart of the module is a synchronous process that, on the rising clock edge, either stores the value on the bus in the address register or in the RAM data structure we just defined.

process(clock) is
  begin
     if rising_edge(clock) then
         if LDDATAIN = '1' then
              ram(to_integer(unsigned(read_address))) <= datain;
         end if;
         if LDADDRESS = '1' then
            read_address <= address;
         end if;
     end if;
  end process;

We want to leave the option open to implement this module using single-port block RAM. This means that we cannot perform concurrent accesses, such as reading a memory location while it's being read. We therefore add a "drive data out" signal that determines whether we will read the memory location in the address register and put it on the bus or not:

WITH DRDATAOUT SELECT
    dataout <= ram(to_integer(unsigned(read_address))) WHEN '1',
         (others => '0') WHEN OTHERS;   

The complete source for the RAM module can be found in SIMPLERAM.vhd.

Tying the Execution Unit Together

In order to wrap up the execution unit, we need to wire up all our components on our three busses. At the center we have the ALU, wired directly to the three busses and the OP signal. It ouputs the flags into the ALUFLAGS signal, that is used to drive a portion of the CPU flag register.

ALU1 : entity work.ALU port map(bus1,bus2,OP,bus3,ALUFLAGS);

process(clk)
begin
   if (rising_edge(clk)) THEN
      FLAGS(3 downto 0) <= ALUFLAGS;
   end if;
end process;
All clocked components have a "load enable" (LDEN) signal, that determines whether the value on bus3 will be loaded into the component. This signal is derived by feeding the drive signal DR3 through a demultiplexer/decoder and connecting the individual lines of the output to the LDEN signal of the components. A value of DR3 of x"3" will pull the LDEN input of the third component high, and those of all other components low.

DEMUX3 : entity work.DEMUX16 port map(LD3,LDEN);
Then we instantiate all other components. We connect the outputs to special "output signals" (ending with the letter "O") for further processing. The load enable signals for the components are derived from the various bits of the multiplexer output LDEN, as mentioned in the previous paragraph. All components are loaded from bus3.
R1 : entity work.reg1in2out port map (clk,bus3,R1O,LDEN(1));
R2 : entity work.reg1in2out port map (clk,bus3,R2O,LDEN(2));
R3 : entity work.reg1in2out port map (clk,bus3,R3O,LDEN(3));
R4 : entity work.reg1in2out port map (clk,bus3,R4O,LDEN(4));
R5 : entity work.reg1in2out port map (clk,bus3,R5O,LDEN(5));
R6 : entity work.reg1in2out port map (clk,bus3,R6O,LDEN(6));
R7 : entity work.reg1in2out port map (clk,bus3,R7O,LDEN(7));
R8 : entity work.reg1in2out port map (clk,bus3,R8O,LDEN(8));
SP : entity work.reg1in2out port map (clk,bus3,SPO,LDEN(9),SPINC,SPDEC);

RAM : entity work.SIMPLERAM port map(clk,LDEN(12),LDEN(13),RAMREADEN,bus3(7 downto 0),bus3,RAMO);
Note how register SP implements increment and decrement signals, whereas the other registers do not. The address register MAR of the RAM component is loaded using bit 12 of LDEN, whereas the block register MBR is loaded using bit 13. By only tying 8 bits of the input to the address register, we constrain the RAM module to a size of 256 words of 32 bits each.
The bus selectors for bus1 and bus2 determine which component gets to drive the bus. For this task we use two selectors that drive the bus according to the value of DR1 and DR2. 
DRIVESEL1: entity work.DRIVESEL port map(DR1,bus1,
   X"00000000",
   R1O,R2O,R3O,R4O,R5O,R6O,R7O,R8O,
   SPO,
   X"00000000",
   IMMO,
   X"00000000",
   RAMO,
   X"00000000",
   X"00000000"   
);

DRIVESEL2: entity work.DRIVESEL port map(DR2,bus2,
   X"00000000",
   R1O,R2O,R3O,R4O,R5O,R6O,R7O,R8O,
   SPO,
   X"00000000",
   X"00000000",
   X"00000000",
   X"00000000",
   X"00000000",
   X"00000001"   
);

If DR1 is x"3", then bus1 will be driven with the the signal corresponding to the output of R3. The same applies to DR2 and bus2. Some components are only wired to bus1, and can therefore only place their output on that bus.

The execution unit is defined in the top-level source file, XENTRAL.vhd

The Control Unit

The control unit reads the instruction memory and drives the execution unit such that the instructions are executed. Its most basic task is to drive the bus selectors and the ALU operation input. To keep things consistent we will use a fixed instruction size of 32 bits. The four most significant bits of the instruction determine the instruction's type. Instead of using microcode, XENTRAL implements all instruction decoding and execution logic directly in hardware. 

To avoid dealing with multiple signals, we wire the bus1 and bus2 selectors (DR1, DR2), the ALU operation signal (OP) and the bus3 selector (LD3) together into a single control signal CONTR:
LD3 <= CONTR(3 downto 0);
DR2 <= CONTR(7 downto 4);
DR1 <= CONTR(11 downto 8);
OP <= CONTR(15 downto 12);

Setting the CONTR signal determines the activity of the execution unit for this cycle. All instructions that can be executed by the execution unit in a single cycle are delegated directly using this method and executed immediately. These instructions are known as instruction type x"0":
case IR(31 downto 28) is
   when X"0" =>
      -- Regular bus arbitration
      CONTR <= IR; 
      PC <= unsigned(PC) + 1;


On the other hand, we also have instructions that do not use the execution unit at all. An unconditional absolute jump for example (type x"f"), only updates the program counter (PC):
   when X"f" =>
      -- Absolute jump!
      CONTR <= (others => '0'); -- Preserve execution unit state
      PC <= X"0" & IR(27 downto 0);= 
Some instructions take more than one cycle to complete. All memory access fall into this category. To support multiple cycles we add a counter signal called PHASE that keeps track of the cycle within the instruction. Below is the logic for instruction type x"3", indirect register store:
when X"3" =>
   -- Indirect register store
   -- (Store the contents of rN at the address in rM)
   case PHASE is
      when X"0" =>
         -- Tranfer dest register into MAR       
         CONTR <= X"0000B0" & IR(7 downto 4) & X"C";     
         PHASE <= unsigned(phase) + 1;
      when others =>
         -- Transfer source register into MBR
         CONTR <= X"0000A" & IR(11 downto 8) & X"0D";                                     -- end of instruction, load the next instruction
         PHASE <= (others => '0');
         PC <= unsigned(PC) + 1; 
   end case;

The complete source for the control module can be found in CONTROL.vhd

The Instruction ROM

We've just started development on our FPGA-based CPU, so we do not really have a toolchain of assemblers and compilers yet. We need to assemble the code that the processor will execute by hand. Thanks to the sparse and regular instruction format we defined in the control unit, this is not a terribly difficult task.

We write a basic loop that writes an increasing value into memory, to exercise the main functions. We will use the following instruction types:

  • x"0": one-cycle execution unit only operation
  • x"1": store immediate value in register
  • x"2": memory load
  • x"3": memory store
  • x"f": unconditional jump
Again, we use a behavioral description of the ROM module, specifyign what outputs we expect to generate (instructions) for each value of the input (address).

WITH MAR SELECT
  IMCR <= 
    -- Fill memory with counter using multi-clock instructions
    X"1FFFFFF9"    WHEN X"00000000", -- Initialize SP
    X"1FFFFFA1"    WHEN X"00000001", -- R1 <- -5
    X"30000110"    WHEN X"00000002", -- [R1] <- R1
    X"20000103"    WHEN X"00000003", -- R3 <- [R1]
    X"000051F2"    WHEN X"00000004", -- R2 <- R1 + 1
    X"0000A201"    WHEN X"00000005", -- R1 <- R2
    X"f0000002"    WHEN X"00000006", -- JMP 2

Refer to the control unit for the instruction types and their parameters, to the ALU for the values of OP and to the execution unit for the numbers of the components on the busses.

It's interesting to see exactly how much optimization the FPGA synthesis tools can perform on a given program. If you do not use a register or component, it will most likely be optimized away, yielding a version of the processor core that is optimized specifically for the firmware used4.

The complete source for the instruction ROM can be found in CODEROM.vhd

Now what?

In the immediate future I will most likely focus my efforts on the toolchain, implementing new instructions and altering existing ones as needed to implement a basic assembler and compiler framework.

Notes

1) A lot of digital design courses around the world are taught using VHDL. This means that there is a wealth of information out on the Internet that is relatively good and readily accessible. Verilog, on the other hand, is more popular in industry where people are considerably less share-happy. I still utterly despise the VHDL syntax, but I found it a lot easier to get a working knowledge of it by using course material than I did with Verilog.

2) Deleted

3) Since an FPGA is a completely programmable device, calling it ROM is a bit of a misnomer. No provisions exist in the logic encoded in the FPGA for writing to this memory area, so the area is read-only between programming sessions. For one incarnation of the processor, it's effectively ROM.

4) But what about the halting problem? How can the synthesizer figure out what the program is going to do without executing it, potentially to infinity? We have to thank our Harvard architecture and our regular instruction format for that. We are only executing instructions from the instruction memory, and we can't change that memory while running. Also, the bus selectors are always wired up to the same bits of the instruction register. So if there's no instruction that has the value "8" in one of the three least significant digits, we can be sure that R8 will never be used and we can optimize it away without considering any of the sequential logic.

No comments:

Post a Comment