Recently I made a lightweight HDMI driver for the Gowin Tang Nano 9k. Now I’m working on a Mandelbrot (and, more generally, fractal) engine. I found this idea on Bruce Land’s Cornell ECE 5760 Lab Site and thought it would be fun to make.
Limitations
Since I am running everything on the Tang Nano and not the standard DE-10 Nano board, I’m running into some fun challenges.
Specifically, the Mandelbrot math requires calculating $z_{n+1} = z_n^2 + c$ for complex $z$ and $c$. In hardware, this expansion requires 3 multipliers per iteration:
- z_re * z_re
- z_im * z_im
- z_re * z_im
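For reference, here is where those three products come from. This is just standard complex arithmetic, written with the same names as the Verilog signals further down ($z = z_{re} + i\,z_{im}$, $c = c_{re} + i\,c_{im}$):

```latex
\[
\begin{aligned}
z^2 + c &= (z_{re} + i\,z_{im})^2 + (c_{re} + i\,c_{im}) \\
        &= (z_{re}^2 - z_{im}^2 + c_{re}) + i\,(2\,z_{re}\,z_{im} + c_{im}) \\
|z|^2   &= z_{re}^2 + z_{im}^2
\end{aligned}
\]
```

The two squares are shared between the real part and the escape test $|z|^2 \ge 4$, and the factor of 2 in the imaginary part is a shift rather than a multiply, which is why three multipliers per iteration are enough.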
The fundamental limitations are:
- Gowin’s GW1NR-9 has 20 DSPs (multipliers) as opposed to the 112 available on the DE-10. At 3 multipliers per stage, this means that the DE-10 can fit 30-40 pipeline stages, while the Tang Nano only fits 6.
- The DE-10 allows for massive unrolling, while the Tang Nano requires efficient reuse, since the number of logic cells (LUTs) on the nano9k is 8,640, while the DE-10 has 110,000.
- The DE-10 can fit a 640x480 frame buffer inside the FPGA's fast RAM since it has ~500KB of BRAM (plus 1GB of external DDR), while the nano9k cannot (it has ~58KB of BRAM).
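The numbers in that list can be sanity-checked with a few lines of arithmetic. This is a rough sketch in Python, not a synthesis report: the helper names are made up for illustration, and it assumes 8 bits per pixel and 1 KB = 1024 bytes.

```python
# Back-of-the-envelope resource budget using the figures quoted above.
MULTS_PER_STAGE = 3  # z_re*z_re, z_im*z_im, z_re*z_im


def max_stages(dsp_multipliers):
    """Number of fully unrolled iterations that fit in the multiplier budget."""
    return dsp_multipliers // MULTS_PER_STAGE


def framebuffer_bytes(width, height, bits_per_pixel=8):
    """Size of one full frame; 8 bits per pixel is an assumption."""
    return width * height * bits_per_pixel // 8


tang_stages = max_stages(20)       # Tang Nano 9k: 20 multipliers
de10_stages = max_stages(112)      # DE-10: 112 multipliers
fb = framebuffer_bytes(640, 480)   # one 640x480 frame

print("Tang Nano stages:", tang_stages)
print("DE-10 stages:", de10_stages)
print("Frame buffer bytes:", fb)
print("Fits in ~58KB BRAM:", fb <= 58 * 1024)
print("Fits in ~500KB BRAM:", fb <= 500 * 1024)
```

Under those assumptions the frame buffer is 307,200 bytes, which is why it fits on the DE-10 but is roughly five times too large for the nano9k.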
Baby’s First Steps
Just to illustrate the output, I wrote mandelbrot.v, which is a first pass at a pipelined (spatially unrolled) Mandelbrot solver.
module mandelbrot #(
    parameter MAX_ITER = 5,  // Keep at or below 6 to fit in Tang Nano 9K DSPs (20 total)
    parameter WIDTH    = 10, // Bit width of the x pixel coordinate
    parameter HEIGHT   = 10  // Bit width of the y pixel coordinate
)(
    input  wire              clk,
    input  wire              rst_n,
    input  wire [WIDTH-1:0]  x,
    input  wire [HEIGHT-1:0] y,
    input  wire              active,
    output reg  [7:0]        color
);

    // Q3.14 Format: 1 sign, 3 integer, 14 fraction.
    // 1.0 = 16384
    reg signed [17:0] c_re_init, c_im_init;
    reg               active_init;

    // X Scale: (3.0 / 640) * 16384 = 76.8 -> Use 77
    // Y Scale: (2.4 / 480) * 16384 = 81.92 -> Use 82
    // X Offset: -2.0 * 16384 = -32768
    // Y Offset: -1.2 * 16384 = -19661
    always @(posedge clk) begin
        c_re_init   <= (x * 18'sd77) - 18'sd32768;
        c_im_init   <= (y * 18'sd82) - 18'sd19661;
        active_init <= active;
    end

    // Pipeline registers: index 0 is the pipeline input, index MAX_ITER the output
    reg signed [17:0] z_re       [0:MAX_ITER];
    reg signed [17:0] z_im       [0:MAX_ITER];
    reg signed [17:0] c_re       [0:MAX_ITER];
    reg signed [17:0] c_im       [0:MAX_ITER];
    reg        [4:0]  iter_count [0:MAX_ITER];
    reg               valid      [0:MAX_ITER];

    // Stage 0: start every pixel at z = 0 with its own c
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            z_re[0] <= 0; z_im[0] <= 0;
            c_re[0] <= 0; c_im[0] <= 0;
            iter_count[0] <= 0; valid[0] <= 0;
        end else begin
            z_re[0] <= 0;
            z_im[0] <= 0;
            c_re[0] <= c_re_init;
            c_im[0] <= c_im_init;
            iter_count[0] <= 0;
            valid[0] <= active_init;
        end
    end

    genvar i;
    generate
        for (i = 0; i < MAX_ITER; i = i + 1) begin : stage
            // 18-bit * 18-bit = 36-bit Result (Q6.28 Format)
            wire signed [35:0] z_re_sq = z_re[i] * z_re[i];
            wire signed [35:0] z_im_sq = z_im[i] * z_im[i];
            wire signed [35:0] z_mix   = z_re[i] * z_im[i];

            // Divergence Check: (Re^2 + Im^2) >= 4.0
            // 4.0 in Q6.28 format is (4 * 2^28) = 1,073,741,824
            wire divergence = (z_re_sq + z_im_sq) >= 36'sd1073741824;

            always @(posedge clk or negedge rst_n) begin
                if (!rst_n) begin
                    z_re[i+1] <= 0; z_im[i+1] <= 0;
                    c_re[i+1] <= 0; c_im[i+1] <= 0;
                    iter_count[i+1] <= 0; valid[i+1] <= 0;
                end else begin
                    // Pass C and Valid down the pipe
                    c_re[i+1]  <= c_re[i];
                    c_im[i+1]  <= c_im[i];
                    valid[i+1] <= valid[i];

                    if (!divergence) begin
                        // Z_next = Z^2 + C
                        // Shift Q6.28 down to Q3.14 (>>> 14)
                        z_re[i+1] <= ((z_re_sq - z_im_sq) >>> 14) + c_re[i];
                        z_im[i+1] <= ((z_mix <<< 1) >>> 14) + c_im[i];
                        iter_count[i+1] <= iter_count[i] + 1;
                    end else begin
                        // Lock values if diverged
                        z_re[i+1] <= z_re[i];
                        z_im[i+1] <= z_im[i];
                        iter_count[i+1] <= iter_count[i];
                    end
                end
            end
        end
    endgenerate

    always @(posedge clk) begin
        // Output result: iteration count scaled into an 8-bit gray level
        color <= valid[MAX_ITER] ? {iter_count[MAX_ITER], 3'b000} : 8'h00;
    end

endmodule
Coordinate Mapping
c_re_init <= (x * 18'sd77) - 18'sd32768;
This line converts my screen coordinates (x from 0 to 639) into the complex plane (real axis roughly -2.0 to 1.0). It uses fixed-point math (Q3.14), where 18'sd77 is the scaling factor (zoom level) and 18'sd32768 is my offset (2.0 in Q3.14, which pans the image so it is centered).
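To check the mapping without running a simulation, the same integer arithmetic can be modeled in software. The sketch below is a Python model, not part of the design: the constants are copied from mandelbrot.v, while the function names pixel_to_c and to_float are made up for illustration.

```python
FRAC_BITS = 14                    # Q3.14: 1.0 == 1 << 14 == 16384
X_SCALE, X_OFFSET = 77, 32768     # constants copied from mandelbrot.v
Y_SCALE, Y_OFFSET = 82, 19661


def pixel_to_c(x, y):
    """Map a pixel coordinate to (c_re, c_im) as raw Q3.14 integers."""
    return x * X_SCALE - X_OFFSET, y * Y_SCALE - Y_OFFSET


def to_float(q):
    """Convert a raw Q3.14 integer to a float for inspection."""
    return q / (1 << FRAC_BITS)


# Print the corners and the center of a 640x480 frame.
for x, y in [(0, 0), (320, 240), (639, 479)]:
    c_re, c_im = pixel_to_c(x, y)
    print((x, y), "->", (c_re, c_im), "=", (to_float(c_re), to_float(c_im)))
```

The top-left pixel maps to exactly -2.0 on the real axis, and the rightmost pixel lands slightly past 1.0 because the scale was rounded up from 76.8 to 77. Every value stays well inside the signed 18-bit range, so nothing wraps.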
Hardware Unrolling generate block
This is the part that currently kills my resources.
genvar i;
generate
    for (i = 0; i < MAX_ITER; i = i + 1) begin : stage
        // ... instantiates hardware ...
    end
endgenerate
This tells the synthesizer to copy and paste the logic inside this block MAX_ITER times. If MAX_ITER is 5, it creates stages 0 through 4. Each stage requires 3 multipliers, so the total cost is 15 multipliers. The Tang Nano 9k has 20 multipliers, so my upper limit is currently MAX_ITER = 6 (18 multipliers); a seventh stage would need 21.
Math Engine
Inside each stage, I perform the standard Mandelbrot operation $z_{n+1} = z_n^2 + c$. I also arithmetic-shift right by 14 (>>> 14) to normalize the fixed-point numbers after multiplication, since Q3.14 * Q3.14 is Q6.28.
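A software model of one stage is handy for checking the fixed-point behavior before synthesizing. The sketch below is a Python model that mirrors the Verilog stage step by step. It is an illustration rather than a verified golden model, and the names wrap18, stage, and escape_count are made up here.

```python
FRAC_BITS = 14
ONE = 1 << FRAC_BITS               # 1.0 in Q3.14
ESCAPE = 4 << (2 * FRAC_BITS)      # 4.0 in Q6.28 = 1,073,741,824


def wrap18(v):
    """Truncate to an 18-bit two's complement value, like the z registers."""
    v &= (1 << 18) - 1
    return v - (1 << 18) if v & (1 << 17) else v


def stage(z_re, z_im, c_re, c_im, count):
    """One pipeline stage: three multiplies, escape test, then update or lock."""
    re_sq = z_re * z_re
    im_sq = z_im * z_im
    mix = z_re * z_im
    if re_sq + im_sq >= ESCAPE:
        return z_re, z_im, count                   # diverged: lock values
    # Python's >> on ints is an arithmetic shift, matching Verilog's >>>.
    new_re = wrap18(((re_sq - im_sq) >> FRAC_BITS) + c_re)
    new_im = wrap18(((mix << 1) >> FRAC_BITS) + c_im)
    return new_re, new_im, count + 1


def escape_count(c_re, c_im, max_iter=5):
    """Run max_iter chained stages, like the unrolled generate loop."""
    z_re = z_im = count = 0
    for _ in range(max_iter):
        z_re, z_im, count = stage(z_re, z_im, c_re, c_im, count)
    return count


for c in (0, -ONE, ONE, 2 * ONE):
    n = escape_count(c, 0)
    print("c_re =", c / ONE, "iterations =", n, "color =", n << 3)
```

Points inside the set, such as c = 0 and c = -1, survive all 5 stages, while c = 1 and c = 2 lock after 2 and 1 iterations respectively. Because a value only updates while |z|^2 < 4, the next z stays within the Q3.14 range of -8 to 8, so wrap18 never actually changes a value in this design.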