08 Oct

The most asinine technical requirement I’ve encountered

Posted by Elf Sternberg as chat

David J Prokopetz asks, “What’s the most asinine technical requirement you’ve ever had to deal with?” I don’t know if it qualifies as “asinine,” but one of the most idiotic requirements I ever ran into on an assignment was really about contracts, money, and paranoia about the GPL.

It came down to this: in 1997, the Seattle-based company I was working at had been “acquired” by CompuServe (CIS) to be CIS’s “Internet division.” As part of the move we were required to switch to RADIUS, the Remote Authentication Dial-In User Service, an authentication protocol for people who dialed into an ISP over their landlines, so that our customers could dial in through CompuServe’s banks of modem closets. That was fine.

What wasn’t fine was that, at the time, CompuServe’s central Network Operations Center (NOC) in Columbus, Ohio, was 100% Microsoft NT, and we were a Sun house. The acquisition required a waiver from Microsoft because CIS was getting huge discounts from Microsoft for being a pure MS play. We were told that, if we had to run on Solaris, then we also had to run a pair of RADIUS servers written for NT and ported to Solaris, plus a pair of Oracle servers (CIS had a lot of contractual obligations about who they purchased software from as a result of their NT centricity). To make them line up, we also had to buy ODBC-on-Solaris shims that would let our ODBC-based RADIUS servers talk to Oracle, despite all of this running on Solaris.

So we had four machines in the rack, two running this RADIUS hack and the ODBC drivers, and two running Oracle. Four machines and the software alone was $300,000.

And it crashed every night.

“Yeah, it’s a memory leak,” the RADIUS server vendor told us. “We’re aware of it. It happens to NT too. We’ll get around to fixing it, but in the meantime, just reboot it every night. That’s what the NT people do.”

Now, at the time, there was a bit of pride among Unix programmers: we don’t reboot machines unless lightning strikes them. We could refresh and upgrade our computers without trauma and without resorting to therapeutic reboots. We had uptimes measured in years.

The counterpoint is that there was a GPL-licensed RADIUS server. We were allowed to use GPL-licensed code, but only under extremely strict circumstances, and in no case could we link the GPL-licensed RADIUS server to Oracle. That was a definitive ‘no.’ We had to use the ones CompuServe ordered for us.

So Brad, my programming buddy, and I came in one weekend and wrote a shim for the RADIUS server that used a pair of shared memory queues as a full-duplex communications channel: it would drop authentication requests into one, and pick up authentication responses in the other. We then wrote another server that found the same queues, and forwarded the details to Oracle over a Solaris-friendly channel using Oracle Pro*C, which was more performant and could be monitored more closely.
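The pattern is easier to see in miniature. What follows is a minimal sketch of a two-queue, full-duplex channel, not the code from 1997: it assumes Python's in-process `queue.Queue` and a thread in place of real shared-memory queues between two processes, and a hypothetical `USERS` dictionary in place of the Oracle lookup. The names `Shim`, `backend`, and `demo` are invented for illustration.

```python
import itertools
import queue
import threading

# Hypothetical credential table standing in for the Oracle database.
USERS = {"alice": "s3cret"}


def backend(requests, responses, users):
    """Answer each item from the request queue on the response queue."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: shut the worker down
            return
        req_id, user, password = item
        ok = users.get(user) == password
        responses.put((req_id, "Access-Accept" if ok else "Access-Reject"))


class Shim:
    """Front-end side: what the RADIUS server would call to authenticate."""

    def __init__(self, requests, responses):
        self.requests = requests
        self.responses = responses
        self._ids = itertools.count(1)

    def authenticate(self, user, password, timeout=2.0):
        req_id = next(self._ids)
        # Drop the request into one queue...
        self.requests.put((req_id, user, password))
        # ...and pick the answer up from the other. Only one request is
        # in flight at a time here, so the next response must be ours.
        resp_id, verdict = self.responses.get(timeout=timeout)
        if resp_id != req_id:
            raise RuntimeError("response does not match request")
        return verdict


def demo():
    requests, responses = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=backend, args=(requests, responses, USERS))
    worker.start()
    shim = Shim(requests, responses)
    results = [shim.authenticate("alice", "s3cret"),
               shim.authenticate("alice", "wrong")]
    requests.put(None)
    worker.join()
    return results
```

Calling `demo()` sends one good and one bad password through the channel. The point of the split is that the front end only ever touches the two queues, so whatever sits on the far side can be swapped without the front end knowing.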

We published the full-duplex queue shim for the RADIUS server, which was completely legit, and Legal let it go without wondering why we had written it.

A couple of months later my boss called us in. In his fine Scottish brogue he said, “I haven’t seen any case reports coming out of the RADIUS server in a while. I used to get one a week. What did you do?”

Brad and I hemmed and hawed, but finally we explained that we’d taken the GPL RADIUS servers and put them on yet another pair of Solaris boxes, in front of the corporate ones. We showed him the pass from legal, and how we’d kept our own protocol handler in-house and CIS IP separate (he was quite technically savvy), and how it was ticking over without a problem and had been for all this time.

“But we’re using the corporate servers, right?” he asked.

“Oh, sure,” I said. “If ours ever stops serving up messages, the failover will trigger and the corporate RADIUS boxes will pick up the slack.”

He nodded and said, “Well done. Just don’t tell Columbus.”
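The failover arrangement described to the boss can be sketched roughly as follows. This is an illustration under assumptions, not the actual setup: the function name `authenticate_with_failover`, the server labels, and the use of `TimeoutError` and `ConnectionError` to signal a dead server are all hypothetical.

```python
def authenticate_with_failover(request, servers):
    """Try each (name, handler) pair in order; move on if one is down."""
    for name, handler in servers:
        try:
            return name, handler(request)
        except (TimeoutError, ConnectionError):
            # This server stopped answering; fall through to the next one.
            continue
    raise RuntimeError("no RADIUS server answered")


# Hypothetical handlers: the first simulates an outage, the second answers.
def gpl_server_down(request):
    raise TimeoutError("no reply")


def corporate_server(request):
    return "Access-Accept"


servers = [("gpl", gpl_server_down), ("corporate", corporate_server)]
result = authenticate_with_failover({"user": "alice"}, servers)
```

Because the list is tried in order, the corporate boxes only see traffic when the front pair stops answering.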

Ultimately, we had to tell Columbus. A few months later CIS experienced a massive systemic failure in their RADIUS “wall,” a collection of sixty NT machines that served ten times as many customers as our foursome. Brad and I were flown to Columbus to give our presentation on why our system was so error-free.

After we gave our presentation, the response was, “Thanks. We’d love to be able to implement that here, but our contract with Microsoft means we can’t.”

There are many reasons CompuServe isn’t around anymore. This was just one of them.
