Designing A Skid Buffer
from FPGA Resources by GateForge Consulting Ltd.
Networks-on-Chip (NoC) are very common, and have as a fundamental building block a point-to-point connection with a handshaking mechanism so each end can signal if they have data to send, or if they are able to receive data. When both ends agree, a data transfer occurs. The canonical modern example is the valid/ready handshake in the AXI protocol.
Aside: as specifications go, the AXI4 spec is worth your time. It gets complicated, but that's because it aims to be a flexible, all-purpose interface. However, the basics are quite clear and of broader relevance to NoC design.
However, pipelining handshaking interfaces is more complicated: simply adding a pipeline register to the valid, ready, and data lines will work, but now each burst of transfers takes two cycles to start, and two cycles to stop. This isn't bad in terms of bandwidth if you have block transfers to do, but now each receiving end has to be aware of how many pipeline stages are in the connection, and have sufficient buffering to absorb the data that keeps arriving after it signals it is no longer ready to receive more data.
This is the basis of credit-based connections (which I'm not getting into here), which maximize bandwidth over long pipelines, but are overkill if you simply need to add a single pipeline stage between two ends, without having to modify them, so as to meet timing or allow each end to send off one item of data without having to wait for a response (thus overlapping communication and computation, which is desirable).
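To make the naive approach concrete, here is a minimal sketch of what "simply adding a pipeline register" to all three lines looks like. The module name, port names, and default width are my own assumptions for illustration, and this is not part of the skid buffer design that follows.

```verilog
`default_nettype none

// Hypothetical illustration only: registers valid, ready, and data with no
// other logic. Names and default width are assumptions, not from the design.
module naive_pipeline_stage
#(
    parameter WORD_WIDTH = 8
)
(
    input  wire                  clock,
    // Upstream (sending) side
    input  wire                  up_valid,
    output reg                   up_ready,
    input  wire [WORD_WIDTH-1:0] up_data,
    // Downstream (receiving) side
    output reg                   down_valid,
    input  wire                  down_ready,
    output reg  [WORD_WIDTH-1:0] down_data
);

    initial begin
        up_ready   = 1'b0;
        down_valid = 1'b0;
        down_data  = {WORD_WIDTH{1'b0}};
    end

    always @(posedge clock) begin
        down_valid <= up_valid;   // valid reaches the receiver one cycle late
        down_data  <= up_data;    // data travels alongside valid
        up_ready   <= down_ready; // ready reaches the sender one cycle late
    end

endmodule
```

When the receiver drops down_ready, the sender only sees up_ready drop one cycle later, and whatever it sent in the meantime still takes another cycle to arrive. With this one stage, the receiver should expect up to two more items after it stops being ready, which is the extra buffering described above.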
Figuring Out The Requirements
To begin designing this single pipeline stage, let's imagine a single unit which can perform a valid/ready handshake and receive an incoming item of data, then perform the same handshake with the other end to send the data. The receiving side is called (in AXI terminology) the slave interface, and the sending side is the master interface. (This is for a write transfer. The handshakes are reversed for a read transfer. See the AXI spec for details.)
Ideally, the slave and master interfaces operate concurrently for maximum bandwidth: in the same clock cycle, a new data item is received on the slave interface and put into a register, and that same register is simultaneously read out by the master interface. However, if the master interface is not transferring data on a given cycle, the slave interface must not transfer data during that cycle either, else we will overwrite the data register before it was read out. To avoid this problem, the slave interface should declare itself not ready in the same cycle as the master interface declaring itself not ready. But this forms a direct combinational connection between them, not a pipelined one. If we could connect both interfaces directly, and not affect timing or concurrency, we wouldn't need pipelining in the first place!
To resolve this contradiction, we need an extra buffer register to capture the incoming data during a clock cycle where the slave interface is transferring data, but the master interface isn't, and there is already data in the main register. Then, in the next cycle, the slave interface can signal it is no longer ready, and no data gets lost. We can imagine this extra buffer register as allowing the slave interface to "skid" to a stop, rather than stopping immediately, which we'd previously found contradicts our pipelining requirements.
Datapath Implementation
A good way to add this buffer register is to selectively feed the main register with data from either the incoming data stream, or the buffer register. This layout gives a neat registered output, which the CAD tools can then retime as necessary with any downstream logic, and forms the datapath of what will become our skid buffer. The Verilog implementation is straightforward.
    `default_nettype none

    module skid_buffer_datapath
    #(
        parameter WORD_WIDTH = 0
    )
    (
        input   wire                        clock,
        // Data
        input   wire    [WORD_WIDTH-1:0]    data_in,
        output  reg     [WORD_WIDTH-1:0]    data_out,
        // Control
        input   wire                        data_out_wren,
        input   wire                        data_buffer_wren,
        input   wire                        use_buffered_data
    );

    // --------------------------------------------------------------------------

        localparam WORD_ZERO = {WORD_WIDTH{1'b0}};

        initial begin
            data_out = WORD_ZERO;
        end

    // --------------------------------------------------------------------------

        reg [WORD_WIDTH-1:0] data_buffer    = WORD_ZERO;
        reg [WORD_WIDTH-1:0] selected_data  = WORD_ZERO;

        always @(*) begin
            selected_data = (use_buffered_data == 1'b1) ? data_buffer : data_in;
        end

        always @(posedge clock) begin
            data_buffer <= (data_buffer_wren == 1'b1) ? data_in       : data_buffer;
            data_out    <= (data_out_wren    == 1'b1) ? selected_data : data_out;
        end

    endmodule
Controlling The Datapath
To operate our datapath as a skid buffer, we need to understand which states we want to allow it to be in, and which state transitions we also allow. This skid buffer has three states:
- It is Empty.
- It is Busy, holding one item of data in the main register, either waiting or actively transferring data through that register.
- It is Full, holding data in both registers, and stopped until the main register is emptied and simultaneously refilled from the buffer register, so no data is lost or reordered. (Without an available empty register, the slave interface cannot skid to a stop, so it must signal it is not ready.)
The operations which transition between these states are:
- the slave interface inserting a data item into the datapath (+)
- the master interface removing a data item from the datapath (-)
- both interfaces inserting and removing at the same time (+-)
We can see from the resulting state diagram that when the datapath is empty, it can only support an insertion, and when it is full, it can only support a removal. If the interfaces try to remove while Empty, or insert while Full, data will be duplicated or lost, respectively.
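One way to restate these rules is as an occupancy count: 0 items when Empty, 1 when Busy, 2 when Full. As a rough sketch of that reading, here is a small simulation-only checker which tracks the count and flags the two forbidden operations. The module and its port names are hypothetical, it is not part of the skid buffer itself, and it assumes insert and remove each mean "a transfer completed on that interface this cycle".

```verilog
`default_nettype none

// Hypothetical simulation-only checker, not part of the original design.
// Tracks how many items the skid buffer holds: 0 = Empty, 1 = Busy, 2 = Full.
module skid_occupancy_checker
(
    input  wire       clock,
    input  wire       insert,     // an item entered the datapath this cycle (+)
    input  wire       remove,     // an item left the datapath this cycle (-)
    output reg  [1:0] occupancy,
    output reg        illegal     // sticky: stays set once a rule is broken
);

    initial begin
        occupancy = 2'd0;
        illegal   = 1'b0;
    end

    // The two constraints read off the state diagram.
    wire remove_while_empty = (occupancy == 2'd0) && (remove == 1'b1);
    wire insert_while_full  = (occupancy == 2'd2) && (insert == 1'b1);

    always @(posedge clock) begin
        if ((remove_while_empty == 1'b1) || (insert_while_full == 1'b1)) begin
            illegal <= 1'b1;
        end
        else begin
            // (+) adds one, (-) subtracts one, (+-) leaves the count unchanged.
            case ({insert, remove})
                2'b10:   occupancy <= occupancy + 2'd1;
                2'b01:   occupancy <= occupancy - 2'd1;
                default: occupancy <= occupancy;
            endcase
        end
    end

endmodule
```

Note that a simultaneous insert and remove (+-) is only legal when the count is 1: when Empty it would remove data that isn't there, and when Full there is no spare register to skid into.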
Controlpath Implementation
This simple FSM description helped us clarify the problem, but it also glossed over the potential complexity of the implementation: 3 states, each connected to 2 signals (valid/ready) per interface, so 4 signals with 16 possible combinations, for a total of 16 possible transitions out of each state, or 48 possible state transitions total.
We don't want to have to manually enumerate all the transitions to then coalesce the equivalent ones and rule out all the impossible or illegal ones. Instead, if we express in logic the constraints on removals and insertions we determined from the state diagram, and the possible transformations on the datapath, we then get the state transition logic and datapath control signal logic almost for free.
I'll list the code in chunks here, with explanations in between.
First, the module and port definitions, and the initial values for the outputs, which match those of an Empty datapath:
    `default_nettype none

    module skid_buffer_fsm
    // No parameters
    (
        input   wire    clock,
        // Slave interface
        input   wire    s_valid,
        output  reg     s_ready,
        // Master Interface
        output  reg     m_valid,
        input   wire    m_ready,
        // Control to Datapath
        output  reg     data_out_wren,
        output  reg     data_buffer_wren,
        output  reg     use_buffered_data
    );

    // --------------------------------------------------------------------------

        initial begin
            s_ready             = 1'b1; // empty at start, so accept data
            m_valid             = 1'b0;
            data_out_wren       = 1'b1; // empty at start, so accept data
            data_buffer_wren    = 1'b0;
            use_buffered_data   = 1'b0;
        end
Then, let's describe the possible states of the datapath, and initialize the state variable. This code describes a binary state encoding, but the CAD tool can re-encode and re-number the state encoding. Usually this is beneficial, but if the states+inputs fit in a single LUT, forcing binary encoding reduces area. See what works best (i.e.: reaches the highest speed) for your given FPGA.
        localparam STATE_BITS = 2;

        localparam [STATE_BITS-1:0] EMPTY = 'd0; // Output and buffer registers empty
        localparam [STATE_BITS-1:0] BUSY  = 'd1; // Output register holds data
        localparam [STATE_BITS-1:0] FULL  = 'd2; // Both output and buffer registers hold data
        // There is no case where only the buffer register would hold data.

        // No handling of erroneous and unreachable state 3.
        // We could check and raise an error flag.

        reg [STATE_BITS-1:0] state      = EMPTY;
        reg [STATE_BITS-1:0] state_next = EMPTY;
Now, let's express the constraints we figured out from the state diagram:
- The slave interface can only insert when the datapath is not full.
- The master interface can only remove data when the datapath is not empty.
We do this by computing the allowable output ready/valid handshake signals based on the datapath state. We use state_next so we can have a nice registered output. This little bit of code prunes away a large number of invalid state transitions. If some other logic seems to be missing, first see if this code has made it unnecessary.
        always @(posedge clock) begin
            s_ready <= (state_next != FULL);
            m_valid <= (state_next != EMPTY);
        end
After, let's describe the interface signal conditions which implement our two basic operations on the datapath: insert and remove. This also weeds out a number of possible state transitions.
        reg insert = 1'b0;
        reg remove = 1'b0;

        always @(*) begin
            insert = (s_valid == 1'b1) && (s_ready == 1'b1);
            remove = (m_valid == 1'b1) && (m_ready == 1'b1);
        end
Now that we have our datapath states and operations, let's use them to describe the possible transformations to the datapath, and in which state they can happen. You'll see that these exactly describe each of the 5 edges in the state diagram, and since we've pruned the space of possible interface conditions, we only need the minimum logic to describe them, and this logic gets re-used a lot later on, simplifying the code.
        reg load    = 1'b0; // Empty datapath inserts data into output register.
        reg flow    = 1'b0; // New inserted data into output register as the old data is removed.
        reg fill    = 1'b0; // New inserted data into buffer register. Data not removed from output register.
        reg flush   = 1'b0; // Move data from buffer register into output register. Remove old data. No new data inserted.
        reg unload  = 1'b0; // Remove data from output register, leaving the datapath empty.

        always @(*) begin
            load    = (state == EMPTY) && (insert == 1'b1);
            flow    = (state == BUSY)  && (insert == 1'b1) && (remove == 1'b1);
            fill    = (state == BUSY)  && (insert == 1'b1) && (remove == 1'b0);
            flush   = (state == FULL)  && (insert == 1'b0) && (remove == 1'b1);
            unload  = (state == BUSY)  && (insert == 1'b0) && (remove == 1'b1);
        end
And now we simply need to calculate the next state after each datapath transformation:
        always @(*) begin
            state_next = (load   == 1'b1) ? BUSY  : state;
            state_next = (flow   == 1'b1) ? BUSY  : state_next;
            state_next = (fill   == 1'b1) ? FULL  : state_next;
            state_next = (flush  == 1'b1) ? BUSY  : state_next;
            state_next = (unload == 1'b1) ? EMPTY : state_next;
        end

        always @(posedge clock) begin
            state <= state_next;
        end
Similarly, from the datapath transformations, we can compute the necessary control signals to the datapath. These are not registered here, as they end at registers in the datapath.
        always @(*) begin
            data_out_wren     = (load  == 1'b1) || (flow == 1'b1) || (flush == 1'b1);
            data_buffer_wren  = (fill  == 1'b1);
            use_buffered_data = (flush == 1'b1);
        end

    endmodule
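The comment in the state declarations mentions that we could check for the unreachable state 3 and raise an error flag. Here is a minimal sketch of one way to do that, written as a separate little module so it stands on its own. The module name and the state_error output are hypothetical, and it assumes the 2-bit binary encoding is kept: if the CAD tool re-encodes the states, the value 3 no longer means anything.

```verilog
`default_nettype none

// Hypothetical add-on, not in the original FSM: flags the unused state
// encoding. Assumes the binary encoding EMPTY = 0, BUSY = 1, FULL = 2.
module skid_buffer_state_check
(
    input  wire       clock,
    input  wire [1:0] state,        // the FSM state register
    output reg        state_error   // sticky: stays set once raised
);

    localparam [1:0] UNUSED = 2'd3; // the only encoding not assigned to a state

    initial begin
        state_error = 1'b0;
    end

    always @(posedge clock) begin
        state_error <= (state_error == 1'b1) || (state == UNUSED);
    end

endmodule
```

Making the flag sticky means a single-cycle glitch into state 3 is still visible later, at the cost of one extra register.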
And finally, we glue the datapath and FSM together into the skid buffer module proper:
    `default_nettype none

    module skid_buffer
    #(
        parameter WORD_WIDTH = 0
    )
    (
        input   wire                        clock,
        // Slave interface
        input   wire                        s_valid,
        output  wire                        s_ready,
        input   wire    [WORD_WIDTH-1:0]    s_data,
        // Master interface
        output  wire                        m_valid,
        input   wire                        m_ready,
        output  wire    [WORD_WIDTH-1:0]    m_data
    );

    // --------------------------------------------------------------------------
    // The FSM handles the master and slave port handshakes, and provides the
    // datapath control signals.

        wire data_out_wren;
        wire data_buffer_wren;
        wire use_buffered_data;

        skid_buffer_fsm
        // No parameters
        controlpath
        (
            .clock                  (clock),
            .s_valid                (s_valid),
            .s_ready                (s_ready),
            .m_valid                (m_valid),
            .m_ready                (m_ready),
            .data_out_wren          (data_out_wren),
            .data_buffer_wren       (data_buffer_wren),
            .use_buffered_data      (use_buffered_data)
        );

    // --------------------------------------------------------------------------
    // The datapath stores and steers the data.

        skid_buffer_datapath
        #(
            .WORD_WIDTH             (WORD_WIDTH)
        )
        datapath
        (
            .clock                  (clock),
            .data_in                (s_data),
            .data_out               (m_data),
            .data_out_wren          (data_out_wren),
            .data_buffer_wren       (data_buffer_wren),
            .use_buffered_data      (use_buffered_data)
        );

    endmodule
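To check that no data gets lost, duplicated, or reordered, a small self-checking testbench helps. The one below is only a sketch under my own assumptions: the testbench name, the 8-bit width, the item count, the clock period, and the random stimulus are all made up for illustration, and it must be compiled together with the three modules above. It sends a counting sequence into the slave interface, stalls both sides at random, and compares what comes out of the master interface against the same sequence.

```verilog
`default_nettype none
`timescale 1ns / 1ps

// Hypothetical self-checking testbench, not part of the original design.
// Compile it together with skid_buffer, skid_buffer_fsm, and
// skid_buffer_datapath.
module skid_buffer_tb;

    localparam WORD_WIDTH = 8;
    localparam ITEM_COUNT = 100;    // must fit in WORD_WIDTH bits without wrapping

    reg                     clock   = 1'b0;
    reg                     s_valid = 1'b0;
    wire                    s_ready;
    reg  [WORD_WIDTH-1:0]   s_data  = {WORD_WIDTH{1'b0}};
    wire                    m_valid;
    reg                     m_ready = 1'b0;
    wire [WORD_WIDTH-1:0]   m_data;

    skid_buffer
    #(
        .WORD_WIDTH (WORD_WIDTH)
    )
    dut
    (
        .clock      (clock),
        .s_valid    (s_valid),
        .s_ready    (s_ready),
        .s_data     (s_data),
        .m_valid    (m_valid),
        .m_ready    (m_ready),
        .m_data     (m_data)
    );

    always #5 clock = ~clock;

    reg [31:0] sent     = 0;    // items accepted by the slave interface
    reg [31:0] received = 0;    // items delivered by the master interface
    reg [31:0] errors   = 0;

    always @(posedge clock) begin
        // Source: count accepted items, and only change valid/data when idle
        // or when the current item was just accepted.
        if ((s_valid == 1'b1) && (s_ready == 1'b1)) begin
            sent = sent + 1;
        end
        if ((s_valid == 1'b0) || (s_ready == 1'b1)) begin
            s_valid <= (sent < ITEM_COUNT) && (($random & 1) == 1);
            s_data  <= sent[WORD_WIDTH-1:0];
        end

        // Sink: every delivered item must be the next one in the sequence.
        if ((m_valid == 1'b1) && (m_ready == 1'b1)) begin
            if (m_data !== received[WORD_WIDTH-1:0]) begin
                errors = errors + 1;
                $display("Mismatch: expected %0d, got %0d", received, m_data);
            end
            received = received + 1;
        end
        m_ready <= (($random & 1) == 1);

        if (received == ITEM_COUNT) begin
            if (errors == 0) $display("PASS: %0d items in order", received);
            else             $display("FAIL: %0d mismatches", errors);
            $finish;
        end
    end

    // Give up if the items never all arrive (e.g.: one was lost).
    initial begin
        #100000;
        $display("FAIL: timed out after %0d of %0d items", received, ITEM_COUNT);
        $finish;
    end

endmodule
```

With both sides stalling about half the time, the buffer should pass through the Empty, Busy, and Full states, but the testbench does not measure that, so treat it as a smoke test rather than proof of coverage.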
For a 64-bit connection, the resulting skid buffer uses 128 registers for the buffers, 4 to 9 registers (and associated LUTs) for the FSM and interface outputs, depending on the particular state encoding chosen by the CAD tool, and easily reaches a high operating speed.