Version: Vitis 2025.2
- Polyphase Channelizer
The polyphase channelizer [1] simultaneously downconverts frequency-division multiplexed (FDM) channels in a single data stream using efficient digital signal processing. Channelizer use is ubiquitous in many wireless communications systems. As RF-DAC and RF-ADC technology advances, channelizer sampling rates rise, making implementation on high-speed field-programmable gate arrays (FPGAs) challenging. This tutorial implements a high-speed channelizer design using a combination of AI Engine and programmable logic (PL) resources in AMD Versal™ adaptive SoC devices.
The following table shows the system requirements for the polyphase channelizer. The input sampling rate is 10.5 GSPS. The design supports M=16 channels with each one supporting 10.5G / 16 = 656.25 MHz of bandwidth. The channelizer employs a polyphase technique as outlined in [1] to achieve an oversampled output at a rate of P/Q = 8/7 times the channel bandwidth, or 656.25 * 8/7 = 750 MSPS. The prototype filter used by the channelizer uses K=8 taps per phase, leading to a total of 16 x 8 = 128 taps overall.
| Parameter | Value | Units |
|---|---|---|
| Input Sampling Rate (Fs) | 10.5 | GSPS |
| # of Channels (M) | 16 | channels |
| Interpolation Factor (P) | 8 | n/a |
| Decimation Factor (Q) | 7 | n/a |
| Channel Bandwidth | 656.25 | MHz |
| Output Sampling Rate | 750 | MSPS |
| # of taps per phase (K) | 8 | n/a |
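The derived values in the table follow directly from Fs, M, P/Q, and K. The short sketch below reproduces that arithmetic with exact fractions. The variable names are illustrative only and are not taken from the reference design.

```python
from fractions import Fraction

# System parameters taken from the requirements table.
FS_MSPS = Fraction(10500)   # input sampling rate: 10.5 GSPS expressed in MSPS
M = 16                      # number of channels
P, Q = 8, 7                 # output oversampling ratio P/Q
K = 8                       # taps per polyphase branch

channel_bw_mhz = FS_MSPS / M                 # bandwidth of one channel
output_rate_msps = channel_bw_mhz * P / Q    # oversampled output rate per channel
total_taps = M * K                           # prototype filter length
samples_per_block = Fraction(M * Q, P)       # new input samples per output vector

print(float(channel_bw_mhz), float(output_rate_msps), total_taps, samples_per_block)
```

Using `Fraction` keeps the 8/7 ratio exact, which makes it easy to confirm that the output rate is a whole number of MSPS.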
The following figure shows a block diagram of the polyphase channelizer. The following five blocks perform the required signal processing functions:
- The Circular Buffer converts the scalar input data stream into an M-vector output format for the downstream blocks, and introduces state to manage the P/Q output oversampling. Its memory depth spans the full extent of M x K samples. Conceptually, the circular buffer operates on an M x K array, employing a "serpentine shift" to introduce S = M x Q / P samples to each new output block. The remaining M - S samples come from the state history.
- The Polyphase Filter implements a parallel bank of M filters across the columns of the M x K circular buffer. Each filter employs K = 8 coefficients taken from an M-phase decomposition of the channelizer prototype filter. The filter produces a single vector of M output samples.
- The Cyclic Shift Buffer removes frequency-dependent phase shifts from the downstream Inverse Discrete Fourier Transform (IDFT) outputs using a memoryless and periodically time-varying circular shift of its inputs. A finite state machine (FSM) manages the sequence of input permutations across each input block. The number of states depends on the specific oversampling ratio factors P and Q and number of channels M.
- The Inverse Fast Fourier Transform (IFFT) performs an IDFT operation on its input vector of M samples to produce a transformed vector of output samples. In the channelizer context, the IDFT performs a parallel bank of M frequency down-conversion operations. Each IDFT output represents a separate down-converted channel of bandwidth Fs / M sampled at a rate of Fs / M * P / Q samples per second.
- The output buffer prepares the output channel samples for consumption by downstream processing. It is not included in this reference design.
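The first four blocks above can be checked against a brute-force reference. The following is a minimal floating-point sketch, not the tutorial's MATLAB model or its AI Engine code. It assumes a newest-first buffer layout, an unscaled IDFT, a placeholder Hann-shaped prototype filter, and a cyclic shift that references phase to time index zero. All function names are hypothetical.

```python
import cmath
import math

M, K, P, Q = 16, 8, 8, 7
S = M * Q // P                       # 14 new input samples per output vector

# Placeholder prototype filter (Hann shape). Any M*K taps exercise the structure.
h = [0.5 - 0.5 * math.cos(2 * math.pi * (m + 0.5) / (M * K)) for m in range(M * K)]

def channelize(x, num_blocks):
    """Polyphase form: circular buffer -> polyphase filter -> cyclic shift -> IDFT."""
    buf = [0j] * (M * K)                         # newest-first circular buffer
    blocks = []
    for n in range(num_blocks):
        new = x[n * S:(n + 1) * S]
        buf = new[::-1] + buf[:-S]               # serpentine shift by S samples
        newest = n * S + S - 1                   # time index of buf[0]
        # Polyphase filter: row r uses taps h[r], h[r+M], ... across the K columns.
        u = [sum(h[r + M * q] * buf[r + M * q] for q in range(K)) for r in range(M)]
        # Cyclic shift: a pure permutation, no arithmetic on the samples.
        v = [0j] * M
        for r in range(M):
            v[(r - newest) % M] = u[r]
        # Unscaled IDFT: one output sample per channel.
        blocks.append([sum(v[r] * cmath.exp(2j * math.pi * k * r / M) for r in range(M))
                       for k in range(M)])
    return blocks

def direct(x, n, k):
    """Reference: mix channel k to baseband, filter with h, keep one sample per block."""
    newest = n * S + S - 1
    acc = 0j
    for m in range(M * K):
        t = newest - m
        if t >= 0:
            acc += h[m] * x[t] * cmath.exp(-2j * math.pi * k * t / M)
    return acc

# Arbitrary two-tone test input.
x = [cmath.exp(2j * math.pi * 0.1372 * t) + 0.5 * cmath.exp(-2j * math.pi * 0.31 * t)
     for t in range(20 * S)]
y = channelize(x, 20)
max_err = max(abs(y[n][k] - direct(x, n, k)) for n in range(20) for k in range(M))
```

Under these assumptions, `max_err` stays at floating-point rounding level, which indicates that the four-block structure is equivalent to M independent mix, filter, and decimate chains.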
The following figure shows a system model of the polyphase channelizer built in MATLAB and encapsulated in a MATLAB app (GUI). This provides a comprehensive golden model of the channelizer algorithms and shows the relationships between the various system parameters. The model was built to support a broader range of parameter settings than the actual Versal adaptive SoC design:
- The model supports two different input sampling rates: Fs = 10.5 GSPS and Fs = 20.5 GSPS.
- You can set the number of channels M to 16, 32, 64, or 128 using a dial.
- You can set the output oversampling ratio P/Q to 1/1, 2/1, 4/3, or 8/7 using the appropriate button.
- The number of active channels can be entered in the bottom left. This value must be less than the chosen value of M.
Press the "Go" button to run the model. When this occurs, the model generates the desired number of active channels and positions them in carrier locations chosen at random. Each signal is modeled as filtered Gaussian noise for simplicity. The model displays the impulse response of the prototype channelizer filter computed for the given system parameters in the top left plot. The bottom-left plot shows the filter in the frequency domain (red) and the signal to be extracted by the channelizer (blue). The top right plot shows the input spectrum to the channelizer along with the active carriers and their index labels. The bottom right plot shows the extracted channels at baseband in the time domain, where the blue signals are the channelizer inputs (delayed by the known group delay of the channelizer), and the red signals are the channelizer outputs.
This section outlines the system partitioning for the polyphase channelizer. This step analyzes the design’s five functional blocks to determine which to implement in AI Engines versus PL. It establishes a data flow with sufficient bandwidth for the required computations.
Channelizers today can operate at sampling rates between 10 and 20 GSPS. At typical clock rates—1 GHz for the AI Engine and 500 MHz for PL—channelizers require Super Sample Rate (SSR) operation. Several I/O samples are produced and consumed on every clock cycle. A feasible clocking strategy is based on the following:
- IFFT processing employs sizes N = 2^m and hardware solutions become overly complex unless SSR = 2^n. Here SSR = 4, 8, or 16 makes sense given M = 16 for this design.
- Hardware design is further simplified when the input sampling rate Fs contains a factor of Q=7 matching its output oversampling factor P/Q = 8/7 because the output sampling rate is then an integral number of clock cycles.
- AI Engine supports clock rates ranging from Fc = 1.0 GHz to 1.3 GHz depending on speed grade. It follows SSR = Fs/Fc ranges from 10/1.3 to 20/1.0.
A suitable clocking strategy can be identified based on these considerations. This tutorial targets a nominal Fs = 10 GSPS with SSR = 8, which gives a nominal AI Engine clock rate of Fc = 1.25 GHz. A "-2M" speed grade device may meet this performance, with the specific clock rates chosen to satisfy the Q=7 divisibility requirement.
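The SSR bounds quoted above can be explored with a few lines of arithmetic. The sketch below lists the power-of-two SSR values that fall inside the range Fs/Fc for the stated sampling and clock rates. The helper name `ssr_candidates` is hypothetical.

```python
import math

def ssr_candidates(fs_min_gsps, fs_max_gsps, fc_min_ghz, fc_max_ghz):
    """Return the power-of-two SSR values inside [fs_min / fc_max, fs_max / fc_min]."""
    lo = fs_min_gsps / fc_max_ghz
    hi = fs_max_gsps / fc_min_ghz
    first = math.ceil(math.log2(lo))
    last = math.floor(math.log2(hi))
    return [2 ** n for n in range(first, last + 1)]

# Fs between 10 and 20 GSPS, AI Engine clock between 1.0 and 1.3 GHz.
candidates = ssr_candidates(10.0, 20.0, 1.0, 1.3)

# Clock rate implied by the nominal operating point (Fs = 10 GSPS, SSR = 8).
nominal_fc_ghz = 10.0 / 8
```

With these inputs the candidates are SSR = 8 and SSR = 16, and the nominal operating point corresponds to a 1.25 GHz clock.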
The following figure shows a diagram of the M x K Circular Buffer described earlier. Each cell contains one sample "x(n)", where each sample is labelled with its time index "n". Note there are M=16 rows and K=8 columns. The diagram shows the evolution of the buffer contents over three consecutive time epochs of the buffer. The leftmost column represents the current input samples. There are M=16 samples in total. Fourteen of these labelled in red are input to the buffer over two cycles. The two samples labelled in blue represent history samples from the previous epoch.
Notice how the circular or "serpentine" shift operates on the M x K buffer. From the left to the middle column, the buffer shifts down by 14 samples. The bottom of each column wraps to the top of the next column to the right. Samples shifted out of the rightmost column are discarded. The red input samples x13 and x12 in the top-left two rows become blue samples x13 and x12 in the middle bottom two rows. This is how the Circular Buffer introduces state into the filterbank processing.
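The serpentine shift can be modeled as a shift on a flat, newest-first list that is viewed as K columns of M rows. The sketch below tracks sample time indices rather than sample values, so history samples are easy to recognize. The layout (newest sample in the top row of the first column) is inferred from the description above, and the helper names are hypothetical.

```python
M, K = 16, 8          # rows (channels) and columns (taps per phase)
S = 14                # new samples per epoch: S = M * Q / P with P/Q = 8/7

def serpentine_shift(buf, new_samples):
    """Shift S new samples into a newest-first buffer and drop the S oldest."""
    assert len(new_samples) == S
    return list(reversed(new_samples)) + buf[:-S]

def column(buf, c):
    """Column c of the M x K view. The bottom of a column wraps to the top of the next."""
    return buf[c * M:(c + 1) * M]

# Pre-load with negative time indices: x(-1), x(-2), ...
buf = [-(i + 1) for i in range(M * K)]

buf = serpentine_shift(buf, list(range(0, S)))        # epoch 0: x0..x13 arrive
epoch0 = column(buf, 0)

buf = serpentine_shift(buf, list(range(S, 2 * S)))    # epoch 1: x14..x27 arrive
epoch1 = column(buf, 0)

print(epoch0)
print(epoch1)
```

In this model, x13 and x12 occupy the top two rows of the first column at epoch 0 and reappear in the bottom two rows at epoch 1, which is the M - S = 2 samples of carried state.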
The filterbank needs to process each row in the M x K array as a normal FIR filter. This is depicted as the green rectangle in the following figure. The state history in the green rectangle does not include the usual time-shifted samples found in an FIR filter. The sample ordering is jumbled and not time-correlated. A normal finite impulse response (FIR) filter in the AI Engine cannot implement this because the state history is not linear. The FIR would require the input sample and the entire state history on every cycle. This is not feasible.
However, the yellow boxes reveal a solution. The sample time indices in the yellow boxes exhibit the desired time-shifted characteristic of a normal FIR filter state. At each time sample, the state in the yellow boxes shifts by one sample, making room for a new sample. These yellow boxes, however, correspond to different logical filters of the filterbank. A workable solution is to map logical filters (rows in the M x K matrix) to AI Engine tiles that perform those filters. The mapping changes over time on a sample-by-sample basis, as shown in the following figure. It resembles a card-dealing operation: input samples for the desired logical filters are distributed to different AI Engine tiles. Inside those AI Engine tiles, the state history exhibits time-shifted state. The outputs of the physical tiles must then undergo an inverse "card dealing" pattern to assign the output samples to the proper logical filter. This "card dealing" permutation is implemented seamlessly in the PL through routing and multiplexing logic resources.
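The time-shifted state can be checked numerically from the buffer geometry alone. The sketch below assumes the newest-first M x K layout, with S = 14 new samples per epoch, and follows where the state of each logical row reappears one epoch later. The row-to-row rule in `next_row` is an inference from that geometry, not the mapping used in the reference design, and both helper names are hypothetical.

```python
M, K, S = 16, 8, 14   # channels, taps per phase, new samples per epoch

def row_indices(epoch, r):
    """Time indices held in row r of the M x K buffer at a given epoch."""
    newest = epoch * S + (S - 1)          # index of the newest sample in the buffer
    return [newest - r - M * c for c in range(K)]

def next_row(r):
    """Logical row that inherits the state of row r one epoch later."""
    return (r - (M - S)) % M

for epoch in range(8):
    for r in range(M):
        now = row_indices(epoch, r)
        later = row_indices(epoch + 1, next_row(r))
        if r >= M - S:
            # One new sample enters and the old state is delayed by one tap.
            assert later[1:] == now[:-1]
        else:
            # Rows that wrap around keep the same state and receive no new sample.
            assert later == now
```

Under these assumptions a fixed physical filter sees an ordinary FIR delay line, provided the logical row it serves moves by M - S = 2 rows per epoch. Each chain of rows then receives seven new samples every eight epochs.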
The AI Engine supports 16 MAC/cycle with "cint16" data and "int16" coefficients. It follows that four samples of a K=8 tap filter requires two cycles of compute. A single I/O stream delivers exactly four samples over four cycles. It follows this design is "I/O bound" rather than "Compute Bound" because the compute is busy only 50% of the time. The system must process M=16 samples every two cycles. It follows eight AI Engine tiles provide sufficient bandwidth with single stream I/O, each tile performing the compute for two filterbank channels. Additional design details are given below.
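The tile count follows from the compute and stream budgets stated above. The sketch below restates that arithmetic. The constant names are illustrative, and the one-sample-per-cycle stream rate is taken from the statement that a single stream delivers four samples over four cycles.

```python
MACS_PER_CYCLE = 16            # cint16 data with int16 coefficients
K = 8                          # taps per polyphase branch
SAMPLES_PER_BURST = 4          # samples processed per compute step
STREAM_SAMPLES_PER_CYCLE = 1   # one cint16 sample per cycle on a single stream
M = 16                         # channels
SSR = 8                        # samples entering the design per AI Engine cycle

compute_cycles = SAMPLES_PER_BURST * K // MACS_PER_CYCLE     # 32 MACs take 2 cycles
io_cycles = SAMPLES_PER_BURST // STREAM_SAMPLES_PER_CYCLE    # 4 samples take 4 cycles
compute_utilization = compute_cycles / io_cycles             # fraction of time computing

tiles = SSR // STREAM_SAMPLES_PER_CYCLE                      # tiles needed for stream I/O
channels_per_tile = M // tiles                               # filterbank channels per tile
```

Because `io_cycles` exceeds `compute_cycles`, stream bandwidth rather than MAC throughput sets the tile count.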
The cyclic shift performs no computations but simply introduces memoryless permutations in each input M-vector. No buffering occurs between inputs. The block simply performs a "cyclic shift" of each input M-vector. The shift amount varies according to an eight-stage FSM in this design. This block does not fit well in the AI Engine array. Its stream routing is more restrictive than PL for permutations, and it requires no computation to justify AI Engine placement. This function is a natural fit for a "PL Data Mover" and you can implement it using Vitis HLS.
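The number of FSM states can be derived from M and the block advance S = M x Q / P. The sketch below lists the shift amounts visited when the shift grows by S positions, modulo M, on every block. The shift direction and starting offset are assumptions, and the function names are hypothetical.

```python
from math import gcd

def shift_states(M, P, Q):
    """Distinct shift amounts visited when each block advances by S = M*Q/P samples."""
    S = M * Q // P
    period = M // gcd(M, S)
    return [(n * S) % M for n in range(period)]

def cyclic_shift(vec, amount):
    """Rotate a vector by `amount` positions. Pure permutation, no arithmetic."""
    amount %= len(vec)
    return vec[amount:] + vec[:amount]

states = shift_states(16, 8, 7)       # this design: M = 16, P/Q = 8/7
print(states)
print([len(shift_states(16, p, q)) for p, q in [(1, 1), (2, 1), (4, 3), (8, 7)]])
```

For M = 16 and P/Q = 8/7 this yields eight states, which agrees with the eight-stage FSM described above. The other oversampling ratios supported by the model give one, two, and four states respectively.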
The IDFT or IFFT must perform an M=16 point transform at the input sample rate Fs. Given the design adopts SSR = 8, it follows a complete transform must be performed once every M / SSR = 16/8 = 2 cycles. This is a very high throughput rate given the M=16 transform involves either four stages of Radix-2 butterflies (32 total) or two stages of Radix-4 butterflies (eight total). This is challenging to achieve at a sustained rate of two cycles per transform given the overhead of butterfly addressing required for FFT solutions.
In this case, computing the IDFT directly as a "matrix multiplication" provides a workable solution. For the "cint16" data types adopted in this design, the AI Engine can perform a single [1x2] x [2x4] vector-matrix product "OP" per cycle. The IDFT for M=16 requires a [1x16] x [16x16] vector-matrix product, equivalent to 32 such OPs. It follows that 16 AI Engine tiles are required to implement the IDFT matrix product in two cycles.
To support this 100% efficient compute bound, each tile must use two input streams and compute one OP every cycle without stalling. The final output tiles must deliver four samples every two cycles to meet the desired throughput. More design details are provided in the following sections.
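The OP count can be confirmed by tiling the [1x16] x [16x16] product into [1x2] x [2x4] pieces. The sketch below does this in floating point and compares the result with a plain IDFT sum. It illustrates the decomposition only, not the AI Engine kernel code, and the function name `idft_by_ops` is hypothetical.

```python
import cmath

M = 16
ROWS, COLS = 2, 4     # one OP is a [1x2] x [2x4] vector-matrix product

# Unscaled IDFT matrix: W[r][k] = exp(+j * 2 * pi * r * k / M).
W = [[cmath.exp(2j * cmath.pi * r * k / M) for k in range(M)] for r in range(M)]

def idft_by_ops(v):
    """Compute v @ W as a sequence of [1x2] x [2x4] OPs and count them."""
    out = [0j] * M
    ops = 0
    for r0 in range(0, M, ROWS):          # eight slices of two input samples
        for k0 in range(0, M, COLS):      # four slices of four output samples
            for k in range(k0, k0 + COLS):
                out[k] += sum(v[r] * W[r][k] for r in range(r0, r0 + ROWS))
            ops += 1
    return out, ops

v = [complex(n, -n) for n in range(M)]    # arbitrary test vector
out, ops = idft_by_ops(v)
reference = [sum(v[r] * W[r][k] for r in range(M)) for k in range(M)]
max_err = max(abs(a - b) for a, b in zip(out, reference))

tiles_for_two_cycles = ops // 2           # one OP per tile per cycle, two cycles per transform
```

The loop visits 8 x 4 = 32 blocks, so sustaining one transform every two cycles at one OP per tile per cycle takes 16 tiles.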
The following figure shows a hardware diagram of the final polyphase channelizer design. It consists of the following elements:
- The DMA Stream Source block uses a block RAM buffer to store channelizer input samples from DDR memory sampled at Fs. The samples stream over seven AXI streams into the channelizer. The block is implemented in PL with HLS and runs at 312.5 MHz.
- The Input Permute block introduces the "serpentine shift" required by the Circular Buffer, plus any "card dealing" permutations dictated by the periodic logical-to-physical channel pattern. It drives the AI Engine filterbank with the data ordering needed to establish fixed state history patterns in the array. This block is implemented in PL using HLS at 312.5 MHz.
- The Filterbank is implemented as an AI Engine sub-graph using the following design approach. The design uses eight tiles and has eight I/O AXI streams. The AI Engine array is clocked at 1.25 GHz.
- The Output Permute block removes the "card dealing" permutation applied for the filterbank processing so its output ordering is restored before the addition of the cyclic shift. This block is implemented in PL using HLS at 312.5 MHz.
- The IDFT is implemented as an AI Engine sub-graph using the following design approach. The design uses 16 tiles and has eight I/O AXI streams.
- The DMA Stream Sink block uses a block RAM buffer to capture the channelizer output samples and return them to DDR memory. The block is implemented in PL using HLS at 312.5 MHz.
The following figure shows the physical layout of the AI Engine array for the polyphase channelizer design. The overall design requires 24 tiles. The IDFT uses 4 x 4 = 16 tiles and the Filterbank uses 4 x 2 = 8 tiles. A total of 22 tiles are used for buffering. The design uses 32 PLIO in total, 16 for input and 16 for output.
The following figure shows the VC1902 die layout for the polyphase channelizer and summarizes the AI Engine and PL resources needed to build the full design.
The following figure shows the software scheduling of the polyphase filterbank design. Each tile implements the filtering for two physical channels, in this case "A" and "B". The stream inputs collect four samples over four cycles, alternately for each channel. Similarly, the compute is performed alternately over two cycles for each channel. The output results are then produced alternately on the output stream over another four cycles. This loop is scheduled with II=8 to achieve the desired throughput.
The compute gaps in the following figure and the two I/O streams per AI Engine tile raise a question: from a compute-bound perspective, why use eight tiles when four can suffice? Although the AI Engine supports two input and two output streams, a VLIW hardware restriction limits their use to one of three combinations: (i) two inputs and one output, (ii) one input and two outputs, or (iii) one input and one output. Under this restriction, it was not feasible to schedule an II=8 loop supporting four filters in a single tile.
The following figure shows a diagram of how the "vector x matrix" multiplication form of the IDFT is vectorized and mapped to the AI Engine array of 4 x 4 = 16 tiles. The figure shows two consecutive IDFT transforms, one above the other. Recall each full transform is performed over two cycles. The operation of the design is outlined as follows:
- The design consists of a 4 x 4 array of tiles. Each tile performs two [1x2] x [2x4] operations over two cycles. Each row of tiles passes its computed outputs to the following tile in the same column using the cascade stream.
- Four samples are input on each of two input streams for each tile. The same data broadcasts to each tile in the row. For example, the orange input samples broadcast to all tiles in the orange row, whereas the purple input samples broadcast to all tiles in the purple row.
- Notice how the four input samples on a given stream span particular consecutive samples of a pair of transform inputs. For example, the four orange inputs on stream "ss0" contain the first two samples in the top (current) and bottom (next) input vector. Similarly, the four left-most purple samples on (unlabelled) stream "ss4" contain the 9th and 10th samples in the top and bottom input vectors.
- The array combines outputs top-to-bottom (in the diagram) using the cascade streams. The four tiles in the bottom row produce the outputs, writing four samples every four cycles on both streams in each tile. Note in the physical array, the cascade streams run horizontally left to right—the physical layout rotates 90 degrees from the diagram in the following figure.
- Each full compute takes two cycles, with throughput sustained at that rate with 100% efficient compute in each AI Engine tile.
Build the polyphase channelizer design from the command line.
IMPORTANT: Before beginning the tutorial, make sure you have installed AMD Vitis™ 2025.2 software. Make sure you have downloaded the Common Images for Embedded Vitis Platforms from this link.
Set the environment variable COMMON_IMAGE_VERSAL to the full path where you have downloaded the Common Images. Then set the environment variable PLATFORM_REPO_PATHS to the value $XILINX_VITIS/base_platforms. Additional information on this process may be found here.
The remaining environment variables are configured in the top-level Makefile, <path-to-design>/04-Polyphase-Channelizer/Makefile:
RELEASE=2025.2
TOP_DIR ?= $(shell readlink -f .)
PLATFORM_NAME = xilinx_vck190_base_202520_1
PLATFORM_PATH = ${PLATFORM_REPO_PATHS}
export PLATFORM = ${PLATFORM_PATH}/${PLATFORM_NAME}/${PLATFORM_NAME}.xpfm
export SYSROOT = ${COMMON_IMAGE_VERSAL}/sysroots/cortexa72-cortexa53-amd-linux
export KERNEL_IMAGE = ${COMMON_IMAGE_VERSAL}/Image
export ROOTFS = ${COMMON_IMAGE_VERSAL}/rootfs.ext4
export PREBUILT_LINUX_PATH = ${COMMON_IMAGE_VERSAL}
You can build the channelizer design for hardware emulation using the Makefile as follows:
[shell]% cd <path-to-design>/04-Polyphase-Channelizer
[shell]% make all TARGET=hw_emu
This takes about 90 minutes to run. The build process generates the 04-Polyphase-Channelizer/package folder containing all the files required for hardware emulation. This can be run as shown below. An optional -g can be applied to the launch_hw_emu.sh command to launch the Vivado waveform GUI to observe the top-level AXI signal ports in the design.
[shell]% cd <path-to-design>/04-Polyphase-Channelizer/package
[shell]% ./launch_hw_emu.sh -run-app embedded_exec.sh
The channelizer design can be built for the VCK190 board using the Makefile as follows:
[shell]% cd <path-to-design>/04-Polyphase-Channelizer
[shell]% make all TARGET=hw
The build process generates the SD card image in the 04-Polyphase-Channelizer/package/sd_card folder.
The Power Design Manager (PDM) is the new, next-generation power estimation platform designed to bring accurate and consistent power estimation capabilities to the largest Versal and AMD Kria™ SOM products. It is the preferred power estimation tool for the Versal product family. You can find more information on the Power Design Manager (PDM) product page and in the Power Design Manager User Guide (UG1556).
The PDM has three modes to estimate power:
- Manual Estimation Flow: All device and design parameters including device part, design resources (AI Engine, PL and PS), clocks, toggle rate, etc. are input manually into the GUI.
- Import Compilation Flow: The file generated from XPE or Vivado Report Power is imported into the PDM after compiling the design.
- Import Simulation Flow: The file generated from XPE or Vivado Report Power is imported into the PDM after simulating the design.
This example uses the Import Compilation Flow mode to perform a Vectorless Power Analysis as defined in the Vivado Design Suite User Guide: Power Analysis and Optimization (UG907). This estimate is refined by running a simulation of the AI Engine portion of the design and updating the initial estimate.
[shell]% make all power TARGET=hw
This performs the following tasks:
- Compiles the design targeting vck190.
- Runs the `vivado_xpe` Makefile target under `vitis/final`, which opens the compiled design in Vivado and runs `report_power`. The output of this step is `system_power.xpe`, which is located in the `vitis/final/build_hw/_x/link/vivado/vpl/prj` folder.
- Runs the `vitis_xpe` Makefile target under `aie/m16_ssr8`, which simulates the AI Engine portion of the design and produces a refined power estimate. The output of this step is `m16_ssr8_app.xpe`, which is located in the `aie/m16_ssr8/aiesim_xpe/` folder.
- Launch the PDM.
- Select New Project from the Start menu. The New Project dialog box opens.
- In the New Project dialog box, type a name for your project.
- In Project location, specify a directory where the project files will be stored.
- Check the Create project subdirectory checkbox.
- Select the Import XPE file checkbox and provide the path to `system_power.xpe`.
- Click Next, then click Finish.
The following screen is displayed.
In the Import XPE wizard, provide the path to the .xpe file you want to import and click OK.
The following screen is displayed.
The following table shows a comparison between power estimates in compilation versus simulation flows in the PDM.
| Component | Import Compilation Flow: Static (W) | Import Compilation Flow: Dynamic (W) | Import Compilation Flow: Total (W) | Import Simulation Flow: Static (W) | Import Simulation Flow: Dynamic (W) | Import Simulation Flow: Total (W) |
|---|---|---|---|---|---|---|
| PL | 7.5 | 1.9 | 9.4 | 7.5 | 1.9 | 9.4 |
| AI Engine | 4.8 | 4.3 | 9.1 | 4.8 | 1.7 | 6.5 |
| PS+PMC | 0.2 | 1.3 | 1.5 | 0.2 | 1.3 | 1.5 |
| Everything else (NoC, DDRMC, GTY, etc.) | 1.1 | 8.7 | 9.8 | 1.1 | 8.7 | 9.8 |
| Total (W) | 13.6 | 16.3 | 29.9 | 13.6 | 13.7 | 27.3 |
[1] F.J. Harris et al., "Digital Receivers and Transmitters Using Polyphase Filter Banks for Wireless Communications", IEEE Transactions on Microwave Theory and Techniques, Vol. 51, No. 4, April 2003.
GitHub issues are used for tracking requests and bugs. For questions, go to support.xilinx.com.
Copyright © 2023-2025 Advanced Micro Devices, Inc.












