Elf Sternberg: Code snippet: Using Awk to Find Specific Clauses in HTML

If you've ever needed to hunt through a massive HTML project for a specific, repetitive clause, there aren't a lot of tools that are really that helpful. I'm sure some tools are better suited to it, but I found a recipe for gawk that I like.

This recipe starts by setting gawk's record separator, RS, to the close tag of the component I'm looking for, so every record ends where one of those components ends. The gsub call then replaces all of the newlines with an obscure Unicode character, "∎", the "tombstone." You might have to change this if you're parsing math proofs, but in general it's rare enough that I can get away with it.

The search then looks for the open tag of the HTML component, followed by the tag inside it that I was looking for: here, an <input> with a non-empty type attribute. This is a combination layout & display we use a lot at authentik. I then restore the proper newlines in the result and print it under a separator comment.

Since the results are legitimate Lit-Element references, I can put them all into a .ts file and Prettier will format them into neat, regular columns, regardless of how heavily indented they were in the original.

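# Requires gawk: the three-argument match() is a gawk extension.
# Treat each component's close tag as the end of a record.
BEGIN {RS="</ak-form-element-horizontal>"}
{
    # Swap newlines for tombstones so the record is a single line.
    gsub(/\n/, "∎")
    # Open tag, then an <input> with a non-empty type attribute.
    if (match($0, /\s*<ak-form-element-horizontal[^>]*>.*<input[^>]*type="[^"][^"]*"[^>]*>.*/, m)) {
        # RS consumed the close tag, so put it back.
        result = m[0] "</ak-form-element-horizontal>"
        gsub(/∎/, "\n", result)  # Restore newlines
        print "<!-- ----------------------------------------------------------------------- -->"
        print result "\n"
    }
}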

As a simple addition, here's how I find all of my horizontal form clauses that do not have an <input> tag, but something else (like <select>, <textarea>, or just some display elements):

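# Requires gawk: the three-argument match() is a gawk extension.
BEGIN {RS="</ak-form-element-horizontal>"}
{
    gsub(/\n/, "∎")
    # Match any horizontal form element, whatever it contains.
    if (match($0, /\s*<ak-form-element-horizontal[^>]*>.*/, m)) {
        result = m[0] "</ak-form-element-horizontal>"
        # Keep only the components with no <input> tag inside.
        if (result !~ /<input/) {
            gsub(/∎/, "\n", result)  # Restore newlines
            print "<!-- ----------------------------------------------------------------------- -->"
            print result "\n"
        }
    }
}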

I know you're not supposed to be able to use regular expressions to scan HTML, but using Prettier to format individual tags onto their own lines makes it a heck of a lot easier.

Elf M. Sternberg
