FPGA HDMI Driver From Scratch: 640x480 on the Tang Nano 9k

Published: at 08:48 PM (15 min read)

For my Mandelbrot Engine, I need to build a chain of modules to handle everything from pixel generation to high-speed electrical signaling. Because the Tang Nano 9k lacks a dedicated HDMI encoder chip, the FPGA has to do all the work. I’ll be describing the architectural design I implemented and debugged. The project’s repository is here.

HDMI Protocol

HDMI is a digital video + audio interface, so it should be relatively easy to drive from a modern FPGA (here, a Gowin Tang Nano 9k). A standard HDMI connector has 19 pins. 8 of these pins form 4 TMDS differential pairs (three data lanes and one clock lane) that transport the actual high-speed video data.

The board I’m using already has the HDMI connector built in, so we can just configure 8 FPGA pins as 4 differential TMDS outputs. Because the Gowin GW1NR-9C lacks a dedicated HDMI PHY, we need to implement the TMDS serialization using OSER10 primitives and high-speed LVDS buffers.

Video Signal

I want to create a standard 640x480 RGB 24bpp @ 60 Hz video signal. This is 307200 pixels per frame, and since each pixel has 24 bits (8 per color channel), at 60 Hz the HDMI link transports about 0.44 Gbps of pixel data. However, video signals also have an “off-screen area” which is used by the HDMI receiver (monitor) for synchronization (described a bit more later). Thus, our frame is actually sent as an 800x525 frame.

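For 60 fps, an 800x525 frame requires an 800 x 525 x 60 = 25.2 MHz pixel clock, which sits just above the 25 MHz minimum clock that HDMI specifies. The classic VGA value of 25.175 MHz gives 59.94 Hz instead, which monitors treat as 60 Hz.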
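A quick way to check the numbers for this frame geometry is a few lines of Python. This is only a sketch of the arithmetic; the function names are made up for illustration and are not part of the project.

```python
# Frame geometry from the post: 640x480 visible inside an 800x525 total frame.
H_ACTIVE, V_ACTIVE = 640, 480
H_TOTAL, V_TOTAL = 800, 525
BITS_PER_PIXEL = 24


def payload_bps(refresh_hz):
    """Bits per second of visible pixel data (no blanking, no TMDS overhead)."""
    return H_ACTIVE * V_ACTIVE * BITS_PER_PIXEL * refresh_hz


def pixel_clock_hz(refresh_hz):
    """Pixel clock needed to scan the full frame, blanking included."""
    return H_TOTAL * V_TOTAL * refresh_hz


def frame_rate_hz(pixel_clock):
    """Refresh rate that a given pixel clock produces for this frame size."""
    return pixel_clock / (H_TOTAL * V_TOTAL)


if __name__ == "__main__":
    print("payload bits/s:", payload_bps(60))
    print("pixel clock for 60 Hz:", pixel_clock_hz(60))
    for clock in (25_000_000, 25_175_000, 25_200_000):
        print(clock, "Hz ->", round(frame_rate_hz(clock), 2), "fps")
```

Running it shows that the blanking area, not the visible pixels, sets the pixel clock.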

TMDS Signals

The FPGA has 4 TMDS differential pairs to drive. The TMDS clock is just the pixel clock, which runs at roughly 25 MHz. The other 3 pairs carry the red, green, and blue 8-bit signals, so at first glance each lane would simply send 8 bits per pixel clock.

However, things are a little more complicated, since HDMI requires the data to be scrambled and adds 2 bits per color lane, so we have 10 bits instead of 8, and the link ends up transporting 30 bits per pixel (24 + 2*3). The scrambling and extra bits are needed by the receiver to properly synchronize and acquire each lane.
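To put numbers on the 10-bit symbols, here is a small Python sketch of the link arithmetic. The 25.2 MHz pixel clock is an assumption taken from the working design later in the post, and the helper names are hypothetical.

```python
PIXEL_CLOCK_HZ = 25_200_000  # assumed: the pixel clock used by the working design
BITS_PER_SYMBOL = 10         # 8 data bits + 2 encoding bits per lane
DATA_LANES = 3               # red, green, blue


def link_bits_per_pixel():
    """Bits that cross the three data lanes for each pixel."""
    return BITS_PER_SYMBOL * DATA_LANES


def lane_bit_rate(pixel_clock_hz):
    """Serial bit rate on one data lane."""
    return pixel_clock_hz * BITS_PER_SYMBOL


def coding_overhead():
    """Extra bits on the wire relative to the raw 24-bit pixel."""
    return link_bits_per_pixel() / 24 - 1


if __name__ == "__main__":
    print("bits per pixel on the link:", link_bits_per_pixel())
    print("bit rate per lane:", lane_bit_rate(PIXEL_CLOCK_HZ))
    print("coding overhead:", coding_overhead())
```

Each lane therefore has to toggle ten times faster than the pixel clock, which is why the serializer hardware matters later on.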

video_timing.v

This module tells the monitor exactly where the electron beam (conceptually) is at any given nanosecond.

Anatomy of a Video Frame

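Even though we only care about the 640x480 pixels that show on screen, the timing generator actually counts up to 800 horizontal pixels and 525 vertical lines. These extra areas are remnants of old CRT (Cathode Ray Tube) technology, where the electron beam needs time to reset from the right side of the screen back to the left. These portions are sectioned as follows (values taken from the parameters in the full implementation below):

Horizontal: 640 active pixels, 16 front porch, 96 sync, 48 back porch (800 total).

Vertical: 480 active lines, 10 front porch, 2 sync, 33 back porch (525 total).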

How the Counters Work

This module uses two nested counters: h_cnt (horizontal) and v_cnt (vertical). The horizontal counter increments on every pulse of the 25.175 MHz pixel clock (counts from 0 to 799), and the vertical counter increments only when the horizontal counter finishes a full line (counts from 0 to 524).

if (h_cnt == H_TOTAL - 1) begin
    h_cnt <= 0;
    if (v_cnt == V_TOTAL - 1) v_cnt <= 0;
    else v_cnt <= v_cnt + 1;
end else h_cnt <= h_cnt + 1;
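The wrap-around behavior is easy to sanity-check in software. The following Python sketch mirrors the counter update one pixel clock at a time; it is a behavioral model for illustration only, not generated from the Verilog.

```python
H_TOTAL, V_TOTAL = 800, 525


def step(h_cnt, v_cnt):
    """Advance the nested counters by one pixel-clock tick."""
    if h_cnt == H_TOTAL - 1:
        h_cnt = 0
        v_cnt = 0 if v_cnt == V_TOTAL - 1 else v_cnt + 1
    else:
        h_cnt += 1
    return h_cnt, v_cnt


def ticks_per_frame():
    """Count ticks until the counters return to (0, 0)."""
    h, v = step(0, 0)
    ticks = 1
    while (h, v) != (0, 0):
        h, v = step(h, v)
        ticks += 1
    return ticks


if __name__ == "__main__":
    print("end of line:", step(799, 0))
    print("end of frame:", step(799, 524))
    print("ticks per frame:", ticks_per_frame())
```

The tick count per frame is just H_TOTAL times V_TOTAL, which is where the pixel clock requirement comes from.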

Signal Generation Logic

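The module uses the current values of h_cnt and v_cnt to decide what the output signals should be. h_sync goes low while h_cnt is inside the horizontal sync window (656 to 751), v_sync goes low while v_cnt is inside the vertical sync window (490 to 491), and active_video is high only when h_cnt < 640 and v_cnt < 480. The x and y outputs follow the counters inside the visible area and are held at 0 outside it.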
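As a rough software model of that decision logic, the Python sketch below computes the three flags for any counter position. It uses the timing parameters from the module and ignores the one-cycle register delay that the hardware outputs have; the function name is hypothetical.

```python
H_ACTIVE, H_FP, H_SYNC, H_BP = 640, 16, 96, 48
V_ACTIVE, V_FP, V_SYNC, V_BP = 480, 10, 2, 33


def signals(h_cnt, v_cnt):
    """Return (h_sync, v_sync, active_video); both syncs are active low."""
    h_start = H_ACTIVE + H_FP
    v_start = V_ACTIVE + V_FP
    h_sync = not (h_start <= h_cnt < h_start + H_SYNC)
    v_sync = not (v_start <= v_cnt < v_start + V_SYNC)
    active = h_cnt < H_ACTIVE and v_cnt < V_ACTIVE
    return h_sync, v_sync, active


if __name__ == "__main__":
    print("first visible pixel:", signals(0, 0))
    print("start of hsync pulse:", signals(656, 0))
    print("start of vsync pulse:", signals(0, 490))
```

A model like this is handy for checking a simulation waveform against the expected pulse positions.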

Full Implementation

module video_timing (
    input  wire pclk,
    input  wire rst_n,
    output reg  h_sync,
    output reg  v_sync,
    output reg  active_video,
    output reg  [9:0] x,
    output reg  [9:0] y
);
    // 640x480 @ 60Hz parameters
    parameter H_ACTIVE = 640;
    parameter H_FP     = 16;
    parameter H_SYNC   = 96;
    parameter H_BP     = 48;
    parameter H_TOTAL  = 800;

    parameter V_ACTIVE = 480;
    parameter V_FP     = 10;
    parameter V_SYNC   = 2;
    parameter V_BP     = 33;
    parameter V_TOTAL  = 525;

    reg [9:0] h_cnt, v_cnt;

    always @(posedge pclk or negedge rst_n) begin
        if (!rst_n) begin
            h_cnt <= 0;
            v_cnt <= 0;
        end else begin
            if (h_cnt == H_TOTAL - 1) begin
                h_cnt <= 0;
                if (v_cnt == V_TOTAL - 1) v_cnt <= 0;
                else v_cnt <= v_cnt + 1;
            end else h_cnt <= h_cnt + 1;
        end
    end

    always @(posedge pclk) begin
        h_sync <= ~(h_cnt >= (H_ACTIVE + H_FP) && h_cnt < (H_ACTIVE + H_FP + H_SYNC));
        v_sync <= ~(v_cnt >= (V_ACTIVE + V_FP) && v_cnt < (V_ACTIVE + V_FP + V_SYNC));
        active_video <= (h_cnt < H_ACTIVE) && (v_cnt < V_ACTIVE);
        x <= (h_cnt < H_ACTIVE) ? h_cnt : 0;
        y <= (v_cnt < V_ACTIVE) ? v_cnt : 0;
    end
endmodule

tmds_encoder.v

The goal of the tmds_encoder.v is to take the standard 8-bit color data and turn it into a 10-bit “wire-friendly” format. In DVI/HDMI, we don’t send raw bits because high-speed copper wire hates two things: too many transitions (which cause EMI/interference) and a DC imbalance (when there are more 1s than 0s, causing the voltage to drift). My implementation solves this in 3 separate stages.

1. Minimizing Transitions (8b to 9b step)

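The first part of the code looks at the 8-bit color byte and decides whether to use XOR or XNOR logic to encode it. If the byte has more than four 1s (or exactly four 1s and bit 0 is 0), it uses XNOR; otherwise it uses XOR. Each output bit is the previous output bit combined with the next input bit, and a ninth bit records which operation was used (1 for XOR, 0 for XNOR).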
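To see the effect on transitions, here is a Python sketch of this first stage following the same XOR/XNOR rule as the Verilog. It is a reference model written for illustration, with hypothetical function names.

```python
def encode_8b_to_9b(data):
    """Transition-minimizing stage: 8-bit value -> 9-bit q_m.

    Bit 8 of the result is 1 when XOR was used and 0 when XNOR was used.
    """
    bits = [(data >> i) & 1 for i in range(8)]
    ones = sum(bits)
    use_xnor = ones > 4 or (ones == 4 and bits[0] == 0)
    q = [bits[0]]
    for i in range(1, 8):
        x = q[i - 1] ^ bits[i]
        q.append(x ^ 1 if use_xnor else x)
    q.append(0 if use_xnor else 1)
    return sum(bit << i for i, bit in enumerate(q))


def transitions(value, width):
    """Number of adjacent bit pairs that differ within the low `width` bits."""
    return sum(((value >> i) ^ (value >> (i + 1))) & 1 for i in range(width - 1))


if __name__ == "__main__":
    raw = 0x55  # alternating bits, the worst case for a raw byte
    q_m = encode_8b_to_9b(raw)
    print("raw transitions:", transitions(raw, 8))
    print("encoded value:", hex(q_m))
    print("encoded transitions in the low 8 bits:", transitions(q_m & 0xFF, 8))
```

Looping over all 256 inputs shows that the low 8 bits of the encoded value never contain more than three transitions.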

2. DC Balancing (9b to 10b step)

Even with minimum transitions, the data might still send more 1s than 0s over time. This creates a “DC offset” on the wire. To fix this, the encoder keeps a running tally called bias (more commonly called disparity). If the encoder has sent 6 more 1s than 0s, the bias is +6. If the next byte has more ones and the current bias is already positive, the encoder inverts the data byte to send more zeroes instead. The tenth bit tells the monitor whether the byte is inverted to keep the wire happy (1), or whether it isn’t inverted (0).
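The balancing rule is easier to follow with a software model. The Python sketch below follows the TMDS encoding flow as described in the DVI 1.0 specification, tracking the bias as the exact count of ones minus zeros sent so far. It is a reference model with hypothetical function names, not a translation of the Verilog in this post.

```python
def stage1(data):
    """8b -> 9b transition-minimizing stage (bit 8 = 1 for XOR, 0 for XNOR)."""
    bits = [(data >> i) & 1 for i in range(8)]
    ones = sum(bits)
    use_xnor = ones > 4 or (ones == 4 and bits[0] == 0)
    q_m, prev = bits[0], bits[0]
    for i in range(1, 8):
        prev = prev ^ bits[i] ^ (1 if use_xnor else 0)
        q_m |= prev << i
    return q_m | ((0 if use_xnor else 1) << 8)


def stage2(q_m, bias):
    """9b -> 10b DC-balancing stage. Returns (symbol, new_bias)."""
    low = q_m & 0xFF
    q8 = (q_m >> 8) & 1
    n1 = bin(low).count("1")
    n0 = 8 - n1
    if bias == 0 or n1 == n0:
        if q8:
            return (1 << 8) | low, bias + (n1 - n0)
        return (1 << 9) | (low ^ 0xFF), bias + (n0 - n1)
    if (bias > 0 and n1 > n0) or (bias < 0 and n0 > n1):
        # Too many bits of the already-dominant kind: send the inverted byte.
        return (1 << 9) | (q8 << 8) | (low ^ 0xFF), bias + 2 * q8 + (n0 - n1)
    return (q8 << 8) | low, bias - 2 * (1 - q8) + (n1 - n0)


def disparity(symbol):
    """Ones minus zeros in a 10-bit symbol."""
    ones = bin(symbol & 0x3FF).count("1")
    return ones - (10 - ones)


def encode_stream(pixels):
    """Encode a list of 8-bit values, starting from a bias of zero."""
    bias, symbols = 0, []
    for pixel in pixels:
        symbol, bias = stage2(stage1(pixel), bias)
        symbols.append(symbol)
    return symbols, bias


if __name__ == "__main__":
    symbols, bias = encode_stream([0x00, 0x00, 0x00])
    print([hex(s) for s in symbols], "final bias:", bias)
```

Encoding the same black pixel three times produces alternating plain and inverted symbols, which is the balancing at work.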

3. “Control” Periods

HDMI isn’t just color. During the “Blanking” periods (when active_video is low), the monitor needs to see Sync pulses:

if (!active) begin
    case (ctrl)
        2'b00: tmds <= 10'b1101010100; // HSync/VSync encoding
        ...

When active is low, the encoder ignores color data and sends one of four specific 10-bit “Control Tokens.” These tokens are designed to be very easy for the receiver to identify so it can stay “locked” to the pixel clock even when no image is being sent.
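One reason the tokens are easy to spot is that they are packed with transitions, while encoded pixel data is deliberately kept to few transitions. The Python sketch below counts them for the four control tokens defined in the DVI 1.0 specification; the helper name is hypothetical.

```python
# Control tokens indexed by ctrl = {C1, C0}, as defined by DVI 1.0.
CONTROL_TOKENS = {
    0b00: 0b1101010100,
    0b01: 0b0010101011,
    0b10: 0b0101010100,
    0b11: 0b1010101011,
}


def transitions(symbol, width=10):
    """Number of adjacent bit pairs that differ within the symbol."""
    return sum(((symbol >> i) ^ (symbol >> (i + 1))) & 1 for i in range(width - 1))


if __name__ == "__main__":
    for ctrl, token in CONTROL_TOKENS.items():
        print(format(ctrl, "02b"), format(token, "010b"), transitions(token))
```

Every token has at least seven transitions, so a receiver can tell blanking from pixel data by transition count alone.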

Full Implementation

module tmds_encoder (
    input  wire clk,
    input  wire [7:0] data,
    input  wire [1:0] ctrl,
    input  wire active,
    output reg  [9:0] tmds
);
    wire [3:0] n1d = data[0] + data[1] + data[2] + data[3] + data[4] + data[5] + data[6] + data[7];
    wire use_xnor = (n1d > 4) || (n1d == 4 && data[0] == 0);
    wire [8:0] q_m;

    assign q_m[0] = data[0];
    assign q_m[1] = use_xnor ? (q_m[0] ~^ data[1]) : (q_m[0] ^ data[1]);
    assign q_m[2] = use_xnor ? (q_m[1] ~^ data[2]) : (q_m[1] ^ data[2]);
    assign q_m[3] = use_xnor ? (q_m[2] ~^ data[3]) : (q_m[2] ^ data[3]);
    assign q_m[4] = use_xnor ? (q_m[3] ~^ data[4]) : (q_m[3] ^ data[4]);
    assign q_m[5] = use_xnor ? (q_m[4] ~^ data[5]) : (q_m[4] ^ data[5]);
    assign q_m[6] = use_xnor ? (q_m[5] ~^ data[6]) : (q_m[5] ^ data[6]);
    assign q_m[7] = use_xnor ? (q_m[6] ~^ data[7]) : (q_m[6] ^ data[7]);
    assign q_m[8] = use_xnor ? 0 : 1;

    reg signed [4:0] bias = 0;
    wire [3:0] n1q_m = q_m[0] + q_m[1] + q_m[2] + q_m[3] + q_m[4] + q_m[5] + q_m[6] + q_m[7];
    wire [3:0] n0q_m = 8 - n1q_m;

    always @(posedge clk) begin
        if (!active) begin
            bias <= 0;
            case (ctrl)
                2'b00:   tmds <= 10'b1101010100;
                2'b01:   tmds <= 10'b0010101011;
                2'b10:   tmds <= 10'b0101010100; // DVI control token for ctrl = 2'b10
                default: tmds <= 10'b1010101011;
            endcase
        end else begin
            if (bias == 0 || n1q_m == n0q_m) begin
                tmds <= {~q_m[8], q_m[8], (q_m[8] ? q_m[7:0] : ~q_m[7:0])};
                if (q_m[8] == 0) bias <= bias + (n0q_m - n1q_m);
                else bias <= bias + (n1q_m - n0q_m);
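            end else if ((bias > 0 && n1q_m > n0q_m) || (bias < 0 && n0q_m > n1q_m)) begin
                tmds <= {1'b1, q_m[8], ~q_m[7:0]};
                // Inverted symbol: disparity changes by 2*q_m[8] + (N0 - N1)
                bias <= bias + {q_m[8], 1'b0} - (n1q_m - n0q_m);
            end else begin
                tmds <= {1'b0, q_m[8], q_m[7:0]};
                // Plain symbol: disparity changes by -2*(~q_m[8]) + (N1 - N0).
                // The concatenation keeps ~q_m[8] one bit wide before widening.
                bias <= bias - {~q_m[8], 1'b0} + (n1q_m - n0q_m);
            end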
        end
    end
endmodule

hdmi_top.v (NOT WORKING)

NOTE THAT THIS MODULE DOESN’T WORK. There are some issues with this code that I had to spend some time debugging. Please refer to my debugging section (later in this post) for the working implementation.

The hdmi_top module is the Physical Layer (PHY) and the integration hub. It takes the internal logic (timing and colors) and prepares them for the physical HDMI cable. Its primary job is to bridge the gap between the slow Pixel Clock (where logic operates) and the ultra-fast Serial Bitstream (cable).

1. Clocking Architecture

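HDMI requires two clocks working in harmony. In this design, I use the Gowin rPLL to create a pixel clock (PCLK, 25.175 MHz), which drives the timing generator and the encoders, and a serial clock (FCLK, 125.875 MHz, 5x PCLK), which drives the serializers.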

Because we use DDR (Double Data Rate), we move two bits per FCLK cycle. Since 5 cycles x 2 bits/cycle = 10 bits, we send a full 10-bit TMDS symbol for every 1 pixel clock cycle.
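The 5x ratio falls straight out of that arithmetic. Here is a minimal Python sketch, using the two pixel clock values that appear in this post; the function names are made up for illustration.

```python
BITS_PER_SYMBOL = 10      # one TMDS symbol per pixel clock
BITS_PER_FCLK_CYCLE = 2   # DDR: one bit on each FCLK edge


def fclk_hz(pclk_hz):
    """Serial clock needed to shift one full symbol out per pixel clock."""
    return pclk_hz * BITS_PER_SYMBOL // BITS_PER_FCLK_CYCLE


def lane_bit_rate(pclk_hz):
    """Bits per second leaving one serializer."""
    return fclk_hz(pclk_hz) * BITS_PER_FCLK_CYCLE


if __name__ == "__main__":
    for pclk in (25_175_000, 25_200_000):
        print(pclk, "->", fclk_hz(pclk), "Hz FCLK,", lane_bit_rate(pclk), "bit/s")
```

Without DDR the serial clock would have to be 10x the pixel clock instead of 5x.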

2. OSER10 Primitive

This primitive is a special hardware block inside the Gowin FPGA. I cannot write a standard always block to toggle pins at roughly 250 Mbps, since the fabric isn’t fast enough. Instead, we use OSER10 (a 10:1 Output Serializer).
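Conceptually, the serializer takes ten parallel bits on each pixel clock and shifts them out two per FCLK cycle. The Python sketch below models that behavior, assuming D0 leaves first (TMDS sends the least significant bit first, and the code in this post wires bit 0 to D0). The exact output order of the primitive is an assumption here and should be checked against the Gowin documentation.

```python
def serialize(symbol):
    """Split a 10-bit symbol into five (rising_edge_bit, falling_edge_bit) pairs.

    Assumes bit 0 (D0) is transmitted first.
    """
    bits = [(symbol >> i) & 1 for i in range(10)]
    return [(bits[i], bits[i + 1]) for i in range(0, 10, 2)]


def flatten(pairs):
    """Bit sequence as it would appear on the wire, first bit first."""
    return [bit for pair in pairs for bit in pair]


if __name__ == "__main__":
    clock_pattern = 0b0000011111  # D0..D4 = 1, D5..D9 = 0
    print(serialize(clock_pattern))
    print(flatten(serialize(clock_pattern)))
```

Feeding in the constant pattern with D0 to D4 high yields five ones followed by five zeros, that is, one clock period per pixel.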

3. Differential Signaling (LVDS)

HDMI doesn’t use standard 3.3V logic levels. It uses differential signaling, where two wires carry opposite signals (positive and negative). This helps cancel out electromagnetic noise. In the code, we use the TLVDS_OBUF primitive:

TLVDS_OBUF u_buf_data (.I(tmds_serialized), .O(tmds_d_p), .OB(tmds_d_n));

This tells the Tang Nano hardware to take a single bit and drive it across two physical pins as a differential pair.

Full Implementation (NOT WORKING)

module hdmi_top (
    input  wire clk_27m,      // Crystal Oscillator
    input  wire rst_n,
    // HDMI physical pins
    output wire tmds_clk_p, tmds_clk_n,
    output wire [2:0] tmds_d_p, tmds_d_n
);
    wire pclk;   // 25.175 MHz
    wire fclk;   // 125.875 MHz (5x PCLK)
    wire pll_lock;

    // --- PLL Configuration ---
    // Use Gowin IP Designer to generate rPLL: 27MHz -> 125.875MHz (CLKOUT), 25.175MHz (CLKOUTD)
    GW_PLL u_pll (
        .clkout(fclk),     .clkoutd(pclk),
        .lock(pll_lock),   .clkin(clk_27m)
    );

    // --- Video Timing ---
    wire [9:0] x, y;
    wire h_sync, v_sync, active;
    video_timing u_timing (
        .pclk(pclk), .rst_n(pll_lock),
        .h_sync(h_sync), .v_sync(v_sync), .active_video(active),
        .x(x), .y(y)
    );

    // --- Placeholder Color Gen (Mandelbrot Hook here) ---
    // Integration Tip: Replace 'red', 'green', 'blue' with your Mandelbrot escape-time color map
    wire [7:0] red   = active ? x[7:0] : 8'd0;
    wire [7:0] green = active ? y[7:0] : 8'd0;
    wire [7:0] blue  = active ? x[8:1] : 8'd0;

    // --- TMDS Encoding ---
    wire [9:0] tmds_red, tmds_green, tmds_blue;
    tmds_encoder enc_r (.clk(pclk), .data(red),   .ctrl(2'b00),          .active(active), .tmds(tmds_red));
    tmds_encoder enc_g (.clk(pclk), .data(green), .ctrl(2'b00),          .active(active), .tmds(tmds_green));
    tmds_encoder enc_b (.clk(pclk), .data(blue),  .ctrl({v_sync,h_sync}), .active(active), .tmds(tmds_blue));

    // --- Serialization & Differential Buffers ---
    wire [2:0] tmds_serialized;
    wire tmds_clk_serialized;

    // Serialize Data Lanes 0, 1, 2
    genvar i;
    generate
        for (i = 0; i < 3; i = i + 1) begin : ser_data
            wire [9:0] tmds_val = (i==0) ? tmds_blue : (i==1) ? tmds_green : tmds_red;
            OSER10 u_oser (
                .Q(tmds_serialized[i]), .D0(tmds_val[0]), .D1(tmds_val[1]), .D2(tmds_val[2]),
                .D3(tmds_val[3]), .D4(tmds_val[4]), .D5(tmds_val[5]), .D6(tmds_val[6]),
                .D7(tmds_val[7]), .D8(tmds_val[8]), .D9(tmds_val[9]), .FCLK(fclk), .PCLK(pclk), .RESET(!pll_lock)
            );
            TLVDS_OBUF u_buf_data (.I(tmds_serialized[i]), .O(tmds_d_p[i]), .OB(tmds_d_n[i]));
        end
    endgenerate

    // Serialize Clock Lane (Sends a constant 1111100000 pattern at PCLK)
    OSER10 u_oser_clk (
        .Q(tmds_clk_serialized), .D0(1'b1), .D1(1'b1), .D2(1'b1), .D3(1'b1), .D4(1'b1),
        .D5(1'b0), .D6(1'b0), .D7(1'b0), .D8(1'b0), .D9(1'b0), .FCLK(fclk), .PCLK(pclk), .RESET(!pll_lock)
    );
    TLVDS_OBUF u_buf_clk (.I(tmds_clk_serialized), .O(tmds_clk_p), .OB(tmds_clk_n));

endmodule

IP Core Generator

Generating the rPLL

The code above requires the usage of a module called GW_PLL. This must be generated using the Gowin IP Core Generator because PLL settings are specific to silicon:

However, I ran into some issues (see below)!

Implementation & Debugging

Pain

Unfortunately, when I started setting up the rPLL in Gowin’s IP Core Generator, I encountered a peculiar problem. Setting CLKOUT to 125.875 MHz threw an error.

And while setting CLKOUT to 126 doesn’t throw an error, the derived CLKOUTD being 25.2 throws an error.

This is almost certainly due to PLL hardware limitations. The Gowin PLL is an Integer-N PLL, not a true Fractional-N PLL that can hit arbitrary decimals. In addition, there are VCO (Voltage Controlled Oscillator) range constraints and PFD (Phase Frequency Detector) limits.
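The integer-ratio part of that explanation can be illustrated with a few lines of Python. The sketch checks whether a target frequency is an exact multiplier/divider ratio of the 27 MHz input. The divider limit of 64 is an assumption for illustration, and the model ignores the VCO and PFD range checks that the real generator also applies.

```python
from fractions import Fraction

MAX_DIV = 64  # assumed upper bound for both dividers; check the rPLL documentation


def integer_ratio(f_in_hz, f_out_hz, max_div=MAX_DIV):
    """Return (multiplier, divider) if f_out = f_in * multiplier / divider exactly.

    Returns None when the reduced ratio needs a larger multiplier or divider.
    """
    ratio = Fraction(f_out_hz, f_in_hz)
    if ratio.numerator <= max_div and ratio.denominator <= max_div:
        return ratio.numerator, ratio.denominator
    return None


if __name__ == "__main__":
    for target in (125_875_000, 126_000_000, 504_000_000):
        print(target, "->", integer_ratio(27_000_000, target))
```

Under these assumptions 125.875 MHz has no exact ratio from 27 MHz (it reduces to 1007/216), while 126 MHz and 504 MHz do.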

Thus, changing the fast clock (CLKOUT - serial) to 504 MHz and the pixel clock (CLKOUTD) to 25.2 MHz seemed to make the generator happy. However, now my serialization ratio is 20:1 instead of the original 5:1. Thus, I used the CLKDIV primitive to ‘downsample’ and divide the 504 MHz clock by 4, giving a 126 MHz serial clock.

More Pain

After a decent bit of debugging, I realized I’m an idiot. My implementation of hdmi_top.v was failing due to clock phase misalignment (I skewed it up). I generated the Pixel Clock (25.2 MHz) and the Serial Clock (126 MHz) as two separate outputs from the PLL:

So even though the logic was all sound, Path A and Path B have slightly different lengths and physical delays. This means the rising edge of the Pixel Clock arrived slightly early or late relative to the Serial Clock. The OSER10 tried to grab the data, but actually missed the window and sent garbage out.

The Gain

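The way I fixed this issue was through “daisy-chaining” my clocks together: the PLL outputs only the 504 MHz clock, a first CLKDIV divides it by 4 to get the 126 MHz serial clock, and a second CLKDIV divides that serial clock by 5 to get the 25.2 MHz pixel clock.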
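The frequencies in the chain are simple to verify. This Python sketch uses the 504 MHz PLL output and the divide-by-4 and divide-by-5 stages from the working module; the function names are hypothetical, and the model says nothing about phase, only about frequency.

```python
H_TOTAL, V_TOTAL = 800, 525


def clock_chain(vco_hz=504_000_000, serial_div=4, pixel_div=5):
    """Return (serial_clock_hz, pixel_clock_hz) for the chained dividers."""
    serial_hz = vco_hz // serial_div
    pixel_hz = serial_hz // pixel_div
    return serial_hz, pixel_hz


def frame_rate_hz(pixel_hz):
    """Refresh rate produced by the pixel clock for an 800x525 frame."""
    return pixel_hz / (H_TOTAL * V_TOTAL)


if __name__ == "__main__":
    serial_hz, pixel_hz = clock_chain()
    print("serial clock:", serial_hz)
    print("pixel clock:", pixel_hz)
    print("frame rate:", frame_rate_hz(pixel_hz))
```

Because the pixel clock is derived from the serial clock itself, the two stay at exactly 5:1 by construction.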

In future projects, I do plan to try other fixes that are more ‘jank’ but functionally similar. For example, I could add a specific delay to one of the clocks to introduce an offset that matches the phases. I also haven’t really used Gowin’s Timing Analyzer, which is something I want to learn to do.

Working hdmi_top.v!

module hdmi_top (
    input  wire clk_27m,
    input  wire rst_n,
    output wire tmds_clk_p, tmds_clk_n,
    output wire [2:0] tmds_d_p, tmds_d_n
);
    wire vco_fclk;   // 504 MHz
    wire ser_fclk;   // 126 MHz
    wire pclk;       // 25.2 MHz
    wire pll_lock;

    // 1. PLL: Generates only the high-speed VCO (504 MHz)
    Gowin_rPLL your_pll_inst (
        .clkin(clk_27m),
        .clkout(vco_fclk), // 504 MHz
        .lock(pll_lock),
        .reset(!rst_n)     // Check your IP: usually active high reset
    );

    // 2. CLKDIV 1: 504 MHz -> 126 MHz (Serial Clock)
    // Divides by 4
    CLKDIV #(.DIV_MODE("4")) u_div_serial (
        .CLKOUT(ser_fclk),
        .HCLKIN(vco_fclk),
        .RESETN(pll_lock)
    );

    // 3. CLKDIV 2: 126 MHz -> 25.2 MHz (Pixel Clock)
    // Divides by 5. THIS IS THE CRITICAL FIX.
    // It takes the *already divided* ser_fclk as input.
    CLKDIV #(.DIV_MODE("5")) u_div_pixel (
        .CLKOUT(pclk),
        .HCLKIN(ser_fclk), // Chain from the serial clock
        .RESETN(pll_lock)
    );

    // Video Timing
    wire [9:0] x, y;
    wire h_sync, v_sync, active;
    video_timing u_timing (
        .pclk(pclk),
        .rst_n(pll_lock),
        .h_sync(h_sync), .v_sync(v_sync), .active_video(active),
        .x(x), .y(y)
    );

    // Fixed Color Test (Half Red and Half Blue)
    wire [7:0] red   = active ? (x < 10'd320 ? 8'hFF : 8'h00) : 8'd0;
    wire [7:0] green = 8'd0;
    wire [7:0] blue  = active ? (x >= 10'd320 ? 8'hFF : 8'h00) : 8'd0;
    // TMDS Encoders
    wire [9:0] tmds_red, tmds_green, tmds_blue;
    tmds_encoder enc_r (.clk(pclk), .data(red),   .ctrl(2'b00),           .active(active), .tmds(tmds_red));
    tmds_encoder enc_g (.clk(pclk), .data(green), .ctrl(2'b00),           .active(active), .tmds(tmds_green));
    tmds_encoder enc_b (.clk(pclk), .data(blue),  .ctrl({v_sync,h_sync}), .active(active), .tmds(tmds_blue));

    // Serialization
    wire [2:0] tmds_serialized;
    wire tmds_clk_serialized;

    genvar i;
    generate
        for (i = 0; i < 3; i = i + 1) begin : ser_data
            wire [9:0] tmds_val = (i==0) ? tmds_blue : (i==1) ? tmds_green : tmds_red;
            OSER10 u_oser (
                .Q(tmds_serialized[i]),
                .D0(tmds_val[0]), .D1(tmds_val[1]), .D2(tmds_val[2]), .D3(tmds_val[3]), .D4(tmds_val[4]),
                .D5(tmds_val[5]), .D6(tmds_val[6]), .D7(tmds_val[7]), .D8(tmds_val[8]), .D9(tmds_val[9]),
                .FCLK(ser_fclk), // 126 MHz
                .PCLK(pclk),     // 25.2 MHz
                .RESET(!pll_lock)
            );
            ELVDS_OBUF u_buf_data (.I(tmds_serialized[i]), .O(tmds_d_p[i]), .OB(tmds_d_n[i]));
        end
    endgenerate

    // Clock Lane
    OSER10 u_oser_clk (
        .Q(tmds_clk_serialized),
        .D0(1'b1), .D1(1'b1), .D2(1'b1), .D3(1'b1), .D4(1'b1),
        .D5(1'b0), .D6(1'b0), .D7(1'b0), .D8(1'b0), .D9(1'b0),
        .FCLK(ser_fclk),
        .PCLK(pclk),
        .RESET(!pll_lock)
    );
    ELVDS_OBUF u_buf_clk (.I(tmds_clk_serialized), .O(tmds_clk_p), .OB(tmds_clk_n));

endmodule

Here’s the full digital schematic:

Example Output

Yay! This worked out! Here is a test image of a half red / half blue screen. Stay tuned for my Mandelbrot Engine! I also have a fun project I’m working on (a DLP Maskless Lithography system) that will need to make use of this driver (although implemented on an FPGA with more LUTs).
