**Implementing a Streaming FFT**

On this page we'll quickly go over building a full DMA-accessible FFT pipeline. The entire project (Vivado 2018.2) is zipped up here for your convenience; however, it may also be beneficial to build this project from the ground up.
The system will involve a DMA interface to take in some time-series data, an FFT, some follow-up modules to compute the magnitude of the FFT output, and then a return to the DMA interface.

![](./resources/high_level.png width="500px")

Getting Started
=======================

Make a new project that targets the Pynq board, and use the standard `.xdc` file that you have been using. Create a new block diagram, add a Zynq Processing System, and run the default automation.

Modules and Wiring
=======================

Before adding additional modules, go into your Zynq settings and under **`PS/PL`** be sure to enable one **`S Slave AXI Interface`**. You'll need this for streaming data down and back up. Our goal is to build up to the overall block diagram shown below:

![](./resources/fft_working.png )

Let's start adding pieces:

## AXI-Streaming-DMA Module

Add one of these. It should be set up identically to how we've set up the DMA before, or as was done in this video. In particular, make sure to disable scatter-gather.

![](./resources/module_dma.png width="500px")

Feel free to auto-wire if it prompts you at this point.

## The FFT

Next let's add the FFT to the block diagram as shown below. The only input we'll use is the AXI Stream In, and the only output we'll use is the AXI Stream Out.

![](./resources/module_fft.png width="500px")

Set up your FFT for 1024 points, with the output values in **Natural Order**. Most of the other settings should stay the same, but please compare against the images below:

![](./resources/fft_conf_1.png width="500px")

![](./resources/fft_conf_2.png width="500px")

After you've added it, connect its AXI4 Streaming Input to the AXI4 Streaming Output of the AXI-DMA module.
To suppress an inevitable error about an undefined input, add a **`Constant`** IP with a value of 0 to the design and tie it to the **`s_axis_config_tvalid`** input on the FFT.

## Square-Sum

Now we need a module to take the complex output of the FFT and turn it into a magnitude (that's all we care about for this lab; in other applications you may very well want to keep the real and imaginary parts separate). To do this we're going to create a piece of custom AXI4 Streaming IP with an input Slave AXI-Streaming port and an output Master AXI-Streaming port, just like we did previously in The AXI Streaming Lab (Lab 5).

![](./resources/module_square_sum.png width="500px")

Also just like in that lab, we're going to edit the top-level Verilog file to implement a pipelined square-and-then-sum operation, shown in the image below:

![](./resources/square_sum.png width="500px")

The Verilog I chose to write is below. Study it, use it as a reference, or write your own. It is totally up to you, but this should serve as an example of how to write a simple, multi-step pass-through AXI-streaming module.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ c++ linenumbers
module square_and_sum_v1_0 #
(
    // Users to add parameters here

    // User parameters ends
    // Do not modify the parameters beyond this line

    // Parameters of Axi Slave Bus Interface S00_AXIS
    parameter integer C_S00_AXIS_TDATA_WIDTH = 32,

    // Parameters of Axi Master Bus Interface M00_AXIS
    parameter integer C_M00_AXIS_TDATA_WIDTH = 32,
    parameter integer C_M00_AXIS_START_COUNT = 32
)
(
    // Users to add ports here

    // User ports ends
    // Do not modify the ports beyond this line

    // Ports of Axi Slave Bus Interface S00_AXIS
    input wire s00_axis_aclk,
    input wire s00_axis_aresetn,
    output wire s00_axis_tready,
    input wire [C_S00_AXIS_TDATA_WIDTH-1 : 0] s00_axis_tdata,
    input wire [(C_S00_AXIS_TDATA_WIDTH/8)-1 : 0] s00_axis_tstrb,
    input wire s00_axis_tlast,
    input wire s00_axis_tvalid,

    // Ports of Axi Master Bus Interface M00_AXIS
    input wire m00_axis_aclk,
    input wire m00_axis_aresetn,
    output wire m00_axis_tvalid,
    output wire [C_M00_AXIS_TDATA_WIDTH-1 : 0] m00_axis_tdata,
    output wire [(C_M00_AXIS_TDATA_WIDTH/8)-1 : 0] m00_axis_tstrb,
    output wire m00_axis_tlast,
    input wire m00_axis_tready
);

    // Two-stage pipeline registers: *_pre holds stage 1, the rest hold stage 2.
    reg m00_axis_tvalid_reg_pre;
    reg m00_axis_tlast_reg_pre;
    reg m00_axis_tvalid_reg;
    reg m00_axis_tlast_reg;
    reg [C_M00_AXIS_TDATA_WIDTH-1 : 0] m00_axis_tdata_reg;

    reg s00_axis_tready_reg;

    reg signed [31:0] real_square;
    reg signed [31:0] imag_square;

    // The two signed 16-bit halves of the FFT output word. Both get squared
    // and summed, so which half is real and which is imaginary does not
    // change the result.
    wire signed [15:0] real_in;
    wire signed [15:0] imag_in;
    assign real_in = s00_axis_tdata[31:16];
    assign imag_in = s00_axis_tdata[15:0];

    assign m00_axis_tvalid = m00_axis_tvalid_reg;
    assign m00_axis_tlast = m00_axis_tlast_reg;
    assign m00_axis_tdata = m00_axis_tdata_reg;
    assign m00_axis_tstrb = {(C_M00_AXIS_TDATA_WIDTH/8){1'b1}}; // all bytes valid
    assign s00_axis_tready = s00_axis_tready_reg;

    always @(posedge s00_axis_aclk)begin
        if (s00_axis_aresetn==0)begin
            s00_axis_tready_reg <= 0;
        end else begin
            s00_axis_tready_reg <= m00_axis_tready; //if what you're feeding data to is ready, then you're ready.
        end
    end

    // Note: the data registers below advance on every clock and do not stall
    // when m00_axis_tready is low.
    always @(posedge m00_axis_aclk)begin
        if (m00_axis_aresetn==0)begin
            m00_axis_tvalid_reg_pre <= 0;
            m00_axis_tlast_reg_pre <= 0;
            m00_axis_tvalid_reg <= 0;
            m00_axis_tlast_reg <= 0;
            m00_axis_tdata_reg <= 0;
        end else begin
            // Stage 1: square each half and carry valid/last along with the data
            m00_axis_tvalid_reg_pre <= s00_axis_tvalid; //when new data is coming, you've got new data to put out
            m00_axis_tlast_reg_pre <= s00_axis_tlast;
            real_square <= real_in*real_in;
            imag_square <= imag_in*imag_in;

            // Stage 2: sum the squares
            m00_axis_tvalid_reg <= m00_axis_tvalid_reg_pre;
            m00_axis_tlast_reg <= m00_axis_tlast_reg_pre;
            m00_axis_tdata_reg <= real_square + imag_square;
        end
    end
endmodule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once your code is done, **DON'T FORGET TO REPACKAGE THE IP** and add a copy to your high-level block diagram. Wire the output of the FFT to its input.

## AXI Streaming FIFO

Now let's add an AXI Streaming FIFO between the output of the Square-Sum module and the input of our square root module (which we haven't gotten to yet). We probably don't actually need it here, but I put it in when designing since I wasn't sure how quickly the different modules would start generating valid values, and I worried that without a bit of in-series storage the system might misbehave.

![](./resources/module_axis_fifo.png width="500px")

I believe we'll use most of the default settings, but double-check against the image below. The width, for example, needs to change.

![](./resources/axis_fifo_conf.png width="500px")

Wire the output of your Square-Sum module to the input of the AXI Streaming FIFO.

## Square Root (CORDIC)

For calculating the square root we'll need a CORDIC. Find this IP, bring it into the design, and then customize it as shown below:

![](./resources/module_cordic.png width="500px")

Make sure to specify that the input data width will be 32 bits.
Also, of course, make sure to specify that you want a square root and not the default of sine or whatever pops up:

![](./resources/cordic_conf_1_better.png width="500px")

In order to make this module compatible with our AXI4-Streaming pipeline, make sure the **`TLAST`** and **`TREADY`** signals are enabled:

![](./resources/module_axis_fifo.png width="500px")

Attach the output of the AXI Streaming FIFO to the input of your CORDIC.

## Zero Padder

Annoyingly, the CORDIC output is stuck at 24 bits for a 32-bit input when running a square root (this makes sense, it is just annoying), and to add insult to injury the DMA IP freaks out if it doesn't get an exactly 32-bit-wide AXI stream as an input (other modules don't care and just auto-pad with zeroes). So we need to do the padding manually: I created a second piece of AXI-streaming IP with a Slave input of 32 bits and a Master output of 32 bits (both AXI Streaming ports). This IP is less picky, and I can readily connect the 24-bit CORDIC output to it with no errors (only critical warnings, which aren't as big of a deal).

![](./resources/zero_padder.png width="500px")

Just like with the square-sum module above, I then rewrote a big chunk of the default template as shown below. This is about as simple as you can get: I literally just tied every input to every output.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ c++ linenumbers
module zero_padder_v1_0 #
(
    // Users to add parameters here

    // User parameters ends
    // Do not modify the parameters beyond this line

    // Parameters of Axi Slave Bus Interface S00_AXIS
    parameter integer C_S00_AXIS_TDATA_WIDTH = 32,

    // Parameters of Axi Master Bus Interface M00_AXIS
    parameter integer C_M00_AXIS_TDATA_WIDTH = 32,
    parameter integer C_M00_AXIS_START_COUNT = 32
)
(
    // Users to add ports here

    // User ports ends
    // Do not modify the ports beyond this line

    // Ports of Axi Slave Bus Interface S00_AXIS
    input wire s00_axis_aclk,
    input wire s00_axis_aresetn,
    output wire s00_axis_tready,
    input wire [C_S00_AXIS_TDATA_WIDTH-1 : 0] s00_axis_tdata,
    input wire [(C_S00_AXIS_TDATA_WIDTH/8)-1 : 0] s00_axis_tstrb,
    input wire s00_axis_tlast,
    input wire s00_axis_tvalid,

    // Ports of Axi Master Bus Interface M00_AXIS
    input wire m00_axis_aclk,
    input wire m00_axis_aresetn,
    output wire m00_axis_tvalid,
    output wire [C_M00_AXIS_TDATA_WIDTH-1 : 0] m00_axis_tdata,
    output wire [(C_M00_AXIS_TDATA_WIDTH/8)-1 : 0] m00_axis_tstrb,
    output wire m00_axis_tlast,
    input wire m00_axis_tready
);

    // Purely combinational pass-through: every stream signal goes straight
    // from the slave side to the master side.
    assign m00_axis_tdata[31:0] = s00_axis_tdata[31:0];
    assign m00_axis_tstrb = s00_axis_tstrb;
    assign m00_axis_tlast = s00_axis_tlast;
    assign m00_axis_tvalid = s00_axis_tvalid;
    assign s00_axis_tready = m00_axis_tready;
endmodule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Build this simple module and add it to your block diagram. Then connect the output of the CORDIC to the input of this zero-padding module, and the output of the zero-padding module to the input of the AXI DMA module's streaming port.

Wiring It All Together
=========================

At this point you should be all wired up.
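As a sanity check on the datapath just wired up, here is a minimal software sketch of what the stages after the FFT compute for each 32-bit word: split it into two signed 16-bit halves, square and sum them, then take a square root. The function names `unpack_halves` and `magnitude_model` are made up for illustration, and the floor rounding is an assumption; the CORDIC's actual rounding depends on how it is configured.

```python
import numpy as np

def unpack_halves(words):
    """Split 32-bit words into signed 16-bit upper and lower halves."""
    words = np.asarray(words, dtype=np.uint32)
    # Reinterpret each 16-bit field as two's complement, then widen so the
    # squares below cannot overflow.
    upper = (words >> 16).astype(np.uint16).astype(np.int16).astype(np.int64)
    lower = (words & 0xFFFF).astype(np.uint16).astype(np.int16).astype(np.int64)
    return upper, lower

def magnitude_model(words):
    """Software model of square-sum followed by an integer square root."""
    upper, lower = unpack_halves(words)
    power = upper * upper + lower * lower      # square-sum stage
    # Assumed rounding: floor. The result always fits in 32 bits.
    return np.floor(np.sqrt(power)).astype(np.uint32)

# A 3-4-5 triangle, the same with a negative half, and the largest case
print(magnitude_model([0x00030004, 0xFFFD0004, 0x80008000]))
```

Because both halves are squared before being summed, the model gives the same answer whichever half carries the real part.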
Feel free to run the auto-wiring automation if it hasn't been done in a while, let Vivado redraw everything for cleanliness, and then compare your design with mine below (or with the complete project included at the top of the page):

![](./resources/fft_working.png )

If all looks good, go through your standard build process like you have been doing.

Interacting With It In Python
=========================

Once you've got your bit file and tcl file in place, the snippets of code below should work with your module.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ python linenumbers
from pynq import Overlay
import pynq.lib.dma

%matplotlib notebook
import matplotlib.pyplot as plt

def plot_to_notebook(time_sec, in_signal, n_samples):
    # Time-domain plot of the first n_samples points
    plt.figure()
    plt.subplot(1, 1, 1)
    plt.xlabel('Time in Seconds')
    plt.grid()
    plt.plot(time_sec[:n_samples], in_signal[:n_samples], 'y-', label='Signal')
    plt.legend()

def plot_fft(freq_hz, in_signal, n_samples):
    # Spectrum plot of the first n_samples bins
    plt.figure()
    plt.subplot(1, 1, 1)
    plt.xlabel('Frequency')
    plt.grid()
    plt.plot(freq_hz[:n_samples], in_signal[:n_samples], 'y-o', label='Signal')
    plt.legend()

overlay = Overlay('./fft8.bit')
overlay.ip_dict
dma = overlay.axi_dma_0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ python linenumbers
import time
import numpy as np
# Xlnk is the buffer allocator in the PYNQ release this lab was written for;
# newer PYNQ releases replace it with pynq.allocate.
from pynq import Xlnk

# Sampling frequency
fs = 44000
# Number of samples
n = 1024
# Total time
T = n*1.0/fs
# Time vector in seconds
t = np.linspace(0, T, n, endpoint=False)
# Frequency of each FFT bin
ns = np.linspace(0, fs, n, endpoint=False)
# Samples of the signal
samples = 1000*np.cos(1000*2*np.pi*t) + 3200*np.sin(2000*2*np.pi*t)
samples = samples.astype(np.int16)
print('Number of samples: ', len(samples))

# Plot signal to the notebook (against time, not frequency)
plot_to_notebook(t, samples, 1024)

# Optional: software FFT for a timing comparison
#start_time = time.time()
#out = np.fft.fft(samples)
#out = np.absolute(out)
#stop_time = time.time()
#sw_exec_time = stop_time-start_time
#print('Software FFT execution time (1024 samples): ',sw_exec_time)
#plot_fft(ns,out,512)

# Allocate buffers for the input and output signals
xlnk = Xlnk()
in_buffer = xlnk.cma_array(shape=(n,), dtype=np.int32)
out_buffer = xlnk.cma_array(shape=(n,), dtype=np.int32)

# Copy the samples to the in_buffer
np.copyto(in_buffer, samples)

# Trigger the DMA transfer and wait for the result
start_time = time.time()
dma.sendchannel.transfer(in_buffer)
dma.recvchannel.transfer(out_buffer)
dma.sendchannel.wait()
dma.recvchannel.wait()
stop_time = time.time()
hw_exec_time = stop_time-start_time
print('Hardware FFT execution time (1024 samples): ', hw_exec_time)

# Plot to the notebook (only the first half of the bins)
plot_fft(ns, out_buffer, 512)

# Free the buffers
in_buffer.close()
out_buffer.close()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you run all this, you should first get a time-domain plot that depends on the signal you specify. For example, if

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ python
samples = 500*np.cos(10000*2*np.pi*t) + 1000*np.cos(4410*2*np.pi*t) + 750*np.sin(500*2*np.pi*t)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

then you should get something like this:

![](./resources/generated_signal.png width="500px")

After that you'll get a hopefully correct FFT as an output!

![](./resources/spectrum.png width="500px")
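One way to judge whether the spectrum is "hopefully correct" is to compare where its peaks land against a software FFT of the same samples. The sketch below is an illustration only: `peak_frequencies` is a hypothetical helper, and the two test tones are chosen here to fall exactly on FFT bins so there is no leakage. It compares peak positions rather than amplitudes, since the hardware output may be scaled differently from NumPy's.

```python
import numpy as np

def peak_frequencies(magnitude, fs, count=2):
    """Return the `count` strongest frequencies (Hz) in the lower half of a spectrum."""
    magnitude = np.asarray(magnitude, dtype=np.float64)
    n = len(magnitude)
    half = magnitude[: n // 2]            # for real input the upper half mirrors the lower
    bins = np.argsort(half)[-count:]      # indices of the largest bins
    return sorted(float(b) * fs / n for b in bins)

# Software reference with the same fs and n as the lab
fs = 44000
n = 1024
t = np.arange(n) / fs
f1, f2 = 32 * fs / n, 64 * fs / n         # tones placed exactly on bins 32 and 64
samples = (1000 * np.cos(2 * np.pi * f1 * t)
           + 3200 * np.sin(2 * np.pi * f2 * t)).astype(np.int16)
reference = np.abs(np.fft.fft(samples))
print(peak_frequencies(reference, fs))
```

To check the hardware, copy the DMA result out before freeing the buffers (for example `hw = np.array(out_buffer)`) and call `peak_frequencies(hw, fs)`; if the pipeline is working it should report the same frequencies as the software reference.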
Some bits of this lab used this awesome video/page as a starting point.