
Introduction

Certain algorithms do not run efficiently or effectively on the processors inside most consumer computers. Consumer processors are optimized to be general-purpose: among their many functions, they run operating systems, interface with peripheral devices, and drive the display. A large amount of overhead, including memory and control logic, is required to handle so many functions.

Algorithms that operate on large amounts of data, include feedback (either in the form of filters or by reusing the same data multiple times), and require fast arithmetic are not well suited to consumer processors. Instead, these types of algorithms must be run on a more optimized, streamlined system. Often such algorithms must run in real-time, which further emphasizes the need for fast, clean computation.
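A finite impulse response (FIR) filter is a classic example of such a workload: every output sample costs several multiplies and adds, and each input sample is reused by several successive outputs. A minimal C sketch (function and constant names are illustrative only):

```c
#include <assert.h>

#define NTAPS 4  /* number of filter coefficients */

/* 4-tap FIR filter: each output needs 4 multiply-adds, and each input
 * sample is reused by 4 successive outputs -- the data reuse and
 * arithmetic density described above. */
void fir(const short *x, int n, const short *h, int *y)
{
    for (int i = NTAPS - 1; i < n; i++) {
        int acc = 0;                    /* accumulator */
        for (int k = 0; k < NTAPS; k++)
            acc += x[i - k] * h[k];     /* multiply-accumulate */
        y[i] = acc;
    }
}
```

For a real-time audio stream this inner loop runs once per sample, which is why a processor's multiply-accumulate throughput dominates performance.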

There is a certain type of chip, namely a Digital Signal Processor (DSP), that is specifically designed for these types of algorithms.


History

DSPs were invented in the late 1970s. In 1978, Intel created a chip with an on-chip analog-to-digital converter (ADC) and digital-to-analog converter (DAC) surrounding an internal signal processor. However, the chip had no hardware multiplier and ultimately failed.

In 1979, American Microsystems, Inc. (AMI) created a similar device, but framed it as a microprocessor peripheral. The chip had to be initialized by a host processor, and it too ultimately failed.

In 1980, Bell Laboratories created the first single-chip DSP. In 1981, NEC and AT&T followed suit, creating their own DSPs. These chips grew out of research the telecom companies had been doing for Public Switched Telephone Networks (PSTNs).

Today, there are three major suppliers of DSPs: Texas Instruments, Analog Devices and Motorola.


Analog Devices Blackfin

Most DSPs share certain functionalities. The implementation is often different, but the end results are similar. The 'Blackfin' DSP from Analog Devices will be explored as an example. It can be assumed that other DSPs, while not identical to the Blackfin, will look similar.

The Blackfin has a number of architectural components that assist in the execution of the types of algorithms mentioned above.


Fast MAC

The multiply-accumulate (MAC) operation is at the heart of most signal processing operations. It is the basis for convolution, which in turn underlies many higher-level functions. The Blackfin has a 16-bit multiplier and a 40-bit accumulator. Multiplying two 16-bit numbers never produces more than a 32-bit output. The extra 8 bits, however, are not superfluous - they give the accumulator 8 bits of headroom in which to sum up the products of the multiplier before it overflows.
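The value of those 8 guard bits can be sketched numerically: a 32-bit product can be summed at least 2^8 = 256 times before a 40-bit accumulator can overflow. A hedged C model, using a 64-bit integer to stand in for the hardware accumulator (names are illustrative, not Blackfin instructions):

```c
#include <stdint.h>

/* Repeatedly accumulate the same 16x16-bit product, as a convolution
 * kernel would.  The worst-case product, (-32768)*(-32768) = 2^30,
 * fits in 32 bits; 8 guard bits let the accumulator absorb at least
 * 2^8 = 256 such products before a 40-bit register could overflow. */
int64_t mac_series(int16_t a, int16_t b, int n)
{
    int64_t acc = 0;                     /* models the 40-bit accumulator */
    for (int i = 0; i < n; i++)
        acc += (int32_t)a * (int32_t)b;  /* one MAC per iteration */
    return acc;
}
```

After 256 worst-case products the total is 2^38, still within a signed 40-bit range, which is exactly the headroom the extra 8 bits provide.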

Fast MACs are very important for DSPs. The faster the DSP can execute a MAC, the more data it can process. The Blackfin can execute 1.2 billion MACs per second (1.2 GMACs) at 600 MHz, more than twice the rate of its closest competitor.


Parallelism

Parallelism is another important property of DSPs. Often, the same operation must be done many times on multiple sets of data. Having twice the amount of hardware doubles the rate at which data can be processed.

The Blackfin has two MAC units, as described above. It also has four 8-bit Arithmetic Logic Units (ALUs) devoted solely to video processing. Every clock cycle, each video ALU can output a 4-pixel difference. That is, the ALU compares four pixels from one image with the four pixels at the same location in another image. The output is a measure of how similar the two image regions are. This is very useful for applications such as stereo vision.
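The comparison the video ALUs perform can be approximated as a sum of absolute differences (SAD) over four 8-bit pixels. A minimal sketch, assuming SAD semantics rather than the exact Blackfin instruction behavior (the function name is hypothetical):

```c
#include <stdlib.h>
#include <stdint.h>

/* Sum of absolute differences over four 8-bit pixels -- roughly what
 * one video ALU produces each cycle.  A result of 0 means the two
 * 4-pixel groups are identical; larger values mean less similarity. */
int sad4(const uint8_t *a, const uint8_t *b)
{
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += abs((int)a[i] - (int)b[i]);
    return sum;
}
```

A stereo-vision kernel would slide one image over the other and keep the offset with the smallest SAD, which is why doing this four pixels at a time, in four ALUs at once, matters.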


Pipelines

Pipelining is a technique for keeping every section of the processor busy computing at all times. The Blackfin uses a 10-stage pipeline. The first three stages are Instruction Fetch stages, where an instruction is read from memory. The fourth stage is the Instruction Decode stage, where the processor determines what the instruction is telling it to do and sets itself up accordingly. The fifth stage is the Address Calculation stage, where addresses for data accesses and branch targets are calculated.

The Blackfin then has four Execute stages, where it carries out the operations the instruction specifies. Finally, the Write Back stage occurs, in which any required output is stored.
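The payoff of pipelining can be put in numbers: once the pipeline is full, one instruction completes every cycle, so N instructions take roughly depth + (N - 1) cycles rather than depth × N. A back-of-the-envelope C model (illustrative only - it ignores stalls and branch penalties):

```c
/* Idealized cycle count for n_instr instructions on a pipeline of the
 * given depth: the first instruction takes `depth` cycles to drain
 * through, then one more instruction completes per cycle. */
long pipelined_cycles(long n_instr, int depth)
{
    return n_instr > 0 ? depth + (n_instr - 1) : 0;
}
```

For the Blackfin's 10-stage pipeline, 1000 instructions take about 1009 cycles instead of 10000 - close to one instruction per cycle.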


Storage

The hardest task facing a DSP designer is not making the ALU fast, but feeding it data fast enough to keep it busy. Since off-chip memory access takes a very long time, the Blackfin has a large amount of L1 cache memory right next to the core. The memory is connected to the core via 64-bit high-speed buses.

The Blackfin, like many other DSPs these days, uses a modified form of the Harvard architecture. There are separate L1 caches for instructions and data, each with their own separate buses to the core. This ensures that instructions are not tying up valuable bandwidth on the data lines, allowing data to flow as quickly as possible.

The instruction bus is 64 bits wide. Since the Blackfin uses a variable-length instruction set (instructions can be either 16 bits or 32 bits), two to four instructions can be fetched at once. There are two data buses, each 32 bits wide with corresponding 32-bit address lines.
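The fetch-width arithmetic above can be written out directly (a hypothetical helper, not a Blackfin API):

```c
/* How many fixed-width instructions fit in one bus-width fetch.
 * A 64-bit fetch holds four 16-bit instructions or two 32-bit ones,
 * which is where the "two to four per fetch" figure comes from. */
int instrs_per_fetch(int bus_bits, int instr_bits)
{
    return bus_bits / instr_bits;
}
```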


External Interface

It would not be very useful to have a chip that can process data quickly if you couldn't feed it data or read results back. So, the Blackfin has a number of external interfaces. There is a 100 Mbps serial port, as well as a high-speed parallel port capable of moving digital video on and off the chip. There are memory interfaces as well, for SDRAM, Flash, SRAM, etc.


Dual Core

The Blackfin comes in a single core or a dual core model. There are various advantages to using a dual core model instead of a single core.

For instance, if two software teams develop two separate pieces of code, it's a non-trivial task to merge the two programs in such a way that they can peacefully coexist on a single core. With a dual core DSP, each program can run on a separate core.

Even if only a single program exists, running half of it on one core and half on the other can be advantageous. By using twice the hardware, the clock rate can be halved without reducing data throughput. A slower clock rate translates directly to lower power consumption. The Blackfin contains an on-chip Phase-Locked Loop (PLL) to control the clock frequency and voltage.
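The power argument follows from the standard dynamic-power model for CMOS logic, P = C·V²·f: halving the frequency alone halves power, and if the slower clock also permits a lower supply voltage, the V² term compounds the savings. A hedged illustration (idealized model, ignoring static leakage; parameter values are made up for the example):

```c
/* Dynamic power of CMOS logic: P = C * V^2 * f.
 * cap   -- effective switched capacitance (arbitrary units here)
 * volts -- supply voltage
 * freq  -- clock frequency in Hz */
double dynamic_power(double cap, double volts, double freq)
{
    return cap * volts * volts * freq;
}
```

For example, two cores at 300 MHz and a reduced supply voltage of 1.0 V consume less than one core at 600 MHz and 1.2 V, while delivering the same aggregate throughput - which is the dual-core argument in a nutshell.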

Some applications require real-time data processing and also some sort of control structure, such as a GUI. With a dual-core DSP, these functions can be delegated to separate cores. This ensures that the GUI doesn't interfere with the real-time data processing.


Research

Current research into DSPs is focused in a number of areas, including modeling, design and DSP-FPGA porting.

Modeling
DSPs are very complex systems. A lot of work has been put in to modeling DSPs in order to optimize them. A number of numerical and theoretical methods have been developed. Also, DSPs have recently been modeled in Java and used as online learning tools.
Design
Again, DSPs are very complex systems. It's difficult for developers to understand the entire design process, from algorithm design to hardware design to software design. Therefore, these functions are usually divided up and given to separate teams. The traditional design model requires the algorithm team to send their algorithm over to the hardware and software teams, who give feedback to the algorithm team, and the process repeats. None of these teams use the same development tools, so the algorithm must be recreated by the hardware and software developers in their respective development environments. Also, the algorithm designers don't have tools that let them see the hardware ramifications of their algorithms. One algorithm design may require 100 times as much hardware as another, but there's no way to know until it's sent to the hardware team.
The move is toward an Accelerated DSP Design (ADD) model, using a single development environment across all teams. This will greatly decrease the time between iterations and therefore reduce the time to market. Research has been done into how to create this environment.
DSP-FPGA
Recently, DSP functions have been implemented more and more on FPGAs. Research has been done into how to effectively use an FPGA for DSP functions and how to automate the process. Compilers exist that transform DSP code into HDL code suitable for an FPGA. Companies claim a 3-20x reduction in execution cycles and a 1.3-5x reduction in execution time when using an FPGA instead of a DSP.