Building System-on-Chip
Updated 2018-03-07
- Soft Core
- CPU Interface
- Interfacing with Hardware Accelerator
- Lab 5 - SoC Lab
- Example 1: Implementing a Slave Device
- Example 2: Master and Slave
- Other Way of Hardware Acceleration
Real systems are not built from scratch.
Real systems contain both hardware and software.
Real systems revolve around an embedded processor: a CPU coordinates the rest of the system, and CPUs are easy to program.
Real systems are designed using system-level design tools.
How do we design these systems?
In an SoC, hardware is connected to software through a memory-mapped interface or a dedicated circuit. The SoC consists of a processor and a number of embedded cores / accelerators. The cores are connected using a standard bus.
An analogy is a gaming computer that has a graphics card: the CPU still tells the GPU what to render (vertices, shaders, etc.).
page 6
Recall from FPGA architecture that there are soft and hard embedded processors. Soft processors are implemented using LUTs and other FPGA resources. Hard processors are actual processors embedded in the device.
Off-chip processors can also be coupled to the FPGA.
Soft Core
page 14
CPU Interface
CPUs are connected to devices using an “interconnect”. The simplest interconnect is a “bus”.
There are tristate drivers for each device connected to the bus so that only one device drives it at a time. But as shown before, this is not really possible on an FPGA, since the FPGA doesn’t have tristate drivers.
Modern SoCs use an “interconnect fabric”, which serves essentially the same function: any core can be accessed through some address space.
Master & Slaves
Most bus protocols draw a distinction between:
- Masters: initiate transaction, specify address, give instructions, etc.
- Slaves: respond to requests from masters and can generate data
Most peripherals are slaves.
page 19
The CPU provides an address to select one of the connected peripherals. Every connected device receives the address, and each one has to check whether the address belongs to it. If it does, that device responds.
The address is sent by the master and received by one or more slaves.
page 22
In the interconnect fabric there is a tree of MUXes instead of a shared bus. The interconnect fabric in the DE1-SoC uses Altera Avalon. In this design, multiple transactions can happen at the same time.
Memory mapped Peripherals
As said before, each peripheral is mapped to a memory address. A peripheral does not respond to accesses at addresses other than its own.
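The address check described above can be sketched in C as a lookup over a small memory map. This is only an illustrative model, not how the Avalon fabric is implemented: the base addresses are the ones used in the examples in these notes, while the 16-byte spans, the `device_t` type, and the `decode_address` name are assumptions made for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a hypothetical memory map: a device owns [base, base + span). */
typedef struct {
    const char *name;
    uint32_t base;
    uint32_t span;   /* size of the address window in bytes (assumed) */
} device_t;

/* Base addresses follow the examples in these notes; the spans are assumed. */
static const device_t memory_map[] = {
    { "switches",    0x2000u, 0x10u },
    { "leds",        0x2010u, 0x10u },
    { "accelerator", 0x2040u, 0x10u },
};

/* Return the name of the device that claims addr, or NULL if none does. */
const char *decode_address(uint32_t addr)
{
    for (size_t i = 0; i < sizeof memory_map / sizeof memory_map[0]; i++) {
        const device_t *d = &memory_map[i];
        if (addr >= d->base && addr - d->base < d->span) {
            return d->name;
        }
    }
    return NULL;
}
```

Calling `decode_address(0x2010u)` selects the LED entry, while an address that falls in no window, such as `0x2020u`, returns `NULL`, which models the case where no peripheral responds.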
Interfacing with Hardware Accelerator
1. Create an interface between your hardware and Avalon fabric.
We need to make sure it is assigned to a memory address range that is accessible and does not overlap with other devices.
2. Attach hardware to a previously designed parallel I/O port
We just need to implement our own logic (combinational logic) as a driver. The driver takes its input from a parallel interface (PIO), and its output is connected to the attached hardware.
In general, option 2 is much simpler to do, but option 1 is much more flexible.
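The non-overlap requirement in option 1 comes down to a simple interval test. The sketch below is a minimal illustration, assuming each device occupies a half-open window `[base, base + span)`; the `addr_range_t` type and `ranges_overlap` function are hypothetical names, not part of any Altera tool.

```c
#include <stdbool.h>
#include <stdint.h>

/* Half-open address window [base, base + span). */
typedef struct {
    uint32_t base;
    uint32_t span;
} addr_range_t;

/* Two windows overlap when each one starts before the other one ends.
 * The ends are computed in 64 bits so base + span cannot wrap around. */
bool ranges_overlap(addr_range_t a, addr_range_t b)
{
    uint64_t a_end = (uint64_t)a.base + a.span;
    uint64_t b_end = (uint64_t)b.base + b.span;
    return a.base < b_end && b.base < a_end;
}
```

For example, windows starting at 0x2000 and 0x2010 with a 16-byte span each sit next to each other without overlapping, whereas a 32-byte window at 0x2000 would collide with the one at 0x2010.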
Lab 5 - SoC Lab
The objective is to observe how SoCs are built in real life and become familiar with the system design tool (QSYS).
Experience:
- Implement and configure a processor on an FPGA
- Program processors
- Program the interconnect fabric
- Interface hardware and software
page 31-34
How do we program it?
There are two ways to write and debug software:
- Altera Monitor Program
- Eclipse
Super simple sample program:
#define Switches ((volatile char *) 0x0002000)
#define LEDs ((volatile char *) 0x0002010)
int main(void) {
    while (1) {
        /* Copy the switch state to the LEDs, forever */
        *LEDs = *Switches;
    }
}
The hardware is mapped to the memory space in the processor. To read something, we make a read request to the switches via address 0x0002000. To write something, we make a write request to the LED memory address, which is 0x0002010.
All we’re doing in this sample code is constantly sampling the switches, then assigning on/off to the LEDs.
YAY! We just unlocked the ability to write software! Note that software is much slower and higher power than hardware.
The processor we’re using in this lab is much more customizable and can be tuned to exact needs, but it is much slower (only up to 100MHz). Hence we need hardware accelerators to do certain tasks.
Example 1: Implementing a Slave Device
We want: a circuit to determine if a number is prime
- Define hardware / software interface
page 43
The software writes the number into location 0, which starts the computation in the hardware. The computation may take multiple cycles. When the computation is completed, done (another address in the memory space) is set to 1. The prime flag is also asserted in the memory space. Note that this is not necessarily a write to memory; it’s just a piece of data passing through the interconnect fabric.
The software has to poll for done going high.
- Define hardware that makes up the core
Because the hardware is a slave, the implementation is straightforward.
page 47
Note that we are not writing to memory; we are pretending to be the memory.
page 48
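To make the idea of "pretending to be the memory" concrete, the slave's register behaviour can be modelled in C. This is a software sketch only, not the hardware description from the slides: the word offsets follow the software in this example (0 for the number, 1 for prime, 2 for done), while `prime_accel_t`, `accel_write`, `accel_read`, and the use of trial division are assumptions made for illustration. The real core takes multiple cycles; the model finishes immediately.

```c
#include <stdint.h>

/* Word offsets of the three registers (matching the software in these notes). */
enum { REG_NUMBER = 0, REG_PRIME = 1, REG_DONE = 2 };

/* State held by the modelled accelerator. */
typedef struct {
    uint32_t number;
    uint32_t prime;
    uint32_t done;
} prime_accel_t;

/* Trial division stands in for whatever algorithm the real hardware uses. */
static uint32_t is_prime(uint32_t n)
{
    if (n < 2) return 0;
    for (uint32_t d = 2; d <= n / d; d++) {
        if (n % d == 0) return 0;
    }
    return 1;
}

/* A write to the number register starts a computation. Writes to any
 * other offset are ignored by this model. */
void accel_write(prime_accel_t *a, uint32_t offset, uint32_t value)
{
    if (offset == REG_NUMBER) {
        a->number = value;
        a->done = 0;                 /* busy while computing */
        a->prime = is_prime(value);
        a->done = 1;                 /* result is now valid */
    }
}

/* A read returns the register selected by the offset. */
uint32_t accel_read(const prime_accel_t *a, uint32_t offset)
{
    switch (offset) {
    case REG_NUMBER: return a->number;
    case REG_PRIME:  return a->prime;
    case REG_DONE:   return a->done;
    default:         return 0;
    }
}
```

From the master's point of view these are ordinary loads and stores, but the slave decides what a store means (start computing) and what a load returns (a status flag rather than stored data).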
- Write the software to interact with it
#define MY_ACCEL_BASE ((volatile int *) 0x0002040)
#define num 12973
int main(void) {
    int prime;
    // Write the number to location 0 in our memory chip
    // But we're actually not writing to memory
    // We're just sending the request via the bus
    *(MY_ACCEL_BASE + 0) = num;
    // Keep looping if not done
    while ((*(MY_ACCEL_BASE + 2)) == 0);
    // Read if prime
    prime = 0;
    if (*(MY_ACCEL_BASE + 1)) prime = 1;
}
Variables in C can end up either in registers or in memory. Using volatile ensures that every access through the pointer is performed as an actual memory access. Otherwise, the compiler might keep the value in a register, and there would be no transaction on the bus.
What will happen if the volatile on line 1 is removed? The value read through MY_ACCEL_BASE might be kept in a register, and thus the while loop might become an infinite loop, because it would just read the register over and over again.
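The polling pattern can also be wrapped in a small helper that takes the base address as a `volatile` pointer, which makes it possible to try the logic on a desktop machine with an ordinary array standing in for the accelerator. This is a sketch under assumptions: the `check_prime` name, the `ACCEL_REG_*` macros, and the poll limit are additions for illustration and do not appear in the lab code.

```c
/* Word offsets, matching the example software in these notes. */
#define ACCEL_REG_NUMBER 0
#define ACCEL_REG_PRIME  1
#define ACCEL_REG_DONE   2

/* Ask the accelerator at base whether n is prime.
 * Returns 1 (prime), 0 (not prime), or -1 if done never went high
 * within max_polls reads. The timeout is an addition for illustration. */
int check_prime(volatile int *base, int n, long max_polls)
{
    base[ACCEL_REG_NUMBER] = n;               /* the write starts the computation */
    for (long i = 0; i < max_polls; i++) {
        if (base[ACCEL_REG_DONE] != 0) {      /* volatile: re-read on every pass */
            return base[ACCEL_REG_PRIME] != 0;
        }
    }
    return -1;
}
```

On the board, `base` would be the accelerator's address cast to `volatile int *`. Because the parameter is `volatile`, the compiler has to issue a real read of the done location on each loop iteration instead of reusing a value it read earlier.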
Example 2: Master and Slave
We want: an accelerator that draws a box in the pixel buffer
The accelerator must be a slave because the processor writes and reads values in its control registers.
The accelerator must also be a master because the pixel buffer is stored in memory and the accelerator must initiate the transactions that write to it.
Memory map:
page 52
The slave interface would implement the registers and interface to the Avalon fabric.
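The split between the two roles can be illustrated with a software model: the control registers are what the CPU writes through the slave interface, and the drawing loop is the sequence of writes the accelerator issues as a master. The register set (`x`, `y`, `width`, `height`, `colour`), the one-byte pixel format, and the choice of a filled box are all assumptions for this sketch; the actual memory map is the one on the slide.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical control registers written by the CPU through the slave port. */
typedef struct {
    uint32_t x, y;           /* top-left corner of the box, in pixels */
    uint32_t width, height;  /* box size, in pixels */
    uint8_t  colour;         /* value stored into each pixel */
} box_regs_t;

/* Model of the master side: store colour into every pixel of the box.
 * buf holds buf_w * buf_h pixels in row-major order; the box is clipped
 * to the buffer. Returns the number of pixel writes issued. */
size_t draw_box(uint8_t *buf, uint32_t buf_w, uint32_t buf_h,
                const box_regs_t *r)
{
    size_t writes = 0;
    for (uint32_t row = r->y; row < buf_h && row - r->y < r->height; row++) {
        for (uint32_t col = r->x; col < buf_w && col - r->x < r->width; col++) {
            buf[(size_t)row * buf_w + col] = r->colour;  /* one master write */
            writes++;
        }
    }
    return writes;
}
```

The return value makes the cost visible: a 2 by 2 box costs four memory writes that the CPU no longer has to issue itself, since it only fills in the control registers.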
Other Way of Hardware Acceleration
page 60
One could create a custom instruction that is connected to some custom logic.
A question to ask is: “is it worth accelerating?”. If, for example, the system spends 80% of its time executing 10% of the code, such as a loop, then consider implementing one.
What are the limitations?
- In most cases, transferring the data to the accelerator takes more time than the accelerated computation itself
- Amdahl’s Law
Amdahl’s Law
Speedup is limited by the amount of code that we cannot speed up. Suppose \(P\) is the fraction of execution that can be sped up, and \(S\) is the factor by which that fraction is sped up. Then the overall speedup is given by:
\[\frac{1}{(1-P)+\frac{P}{S}}\]
For example, if 25% of the code executed can run twice as fast, then \(P=0.25\), \(S=2\) and the overall speedup is \(\frac{1}{0.75+0.125}\approx 1.14\), i.e. about \(14\%\).
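The formula is easy to evaluate for other values of \(P\) and \(S\). A minimal C sketch, where the function name `amdahl_speedup` is chosen here for illustration:

```c
/* Overall speedup when a fraction p of the execution time is made s times
 * faster (Amdahl's Law). Expects 0 <= p <= 1 and s > 0. */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

With `p = 0.25` and `s = 2` this gives roughly 1.14, matching the example above. With `p = 0.8`, as in the case where 80% of the time is spent in a loop, the result approaches but never reaches 5 as `s` grows, because the remaining 20% of the execution is not accelerated.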