My First Bug
It’s only appropriate that one of my first blog posts be about my first bug. I’m sure I had written a few bugs before this, but not many, since my SW development career had just started. I’m also sure I have written more than a few since, given that my career has spanned more than a few years. But this bug was my first big one, the first one with actual consequences.
The advent of CMOS microcontrollers with very low power dissipation (I will leave it to you to figure out when those started to appear) made it possible for the environmental controls company I worked for to contemplate putting some smarts under one of their standard wall-mounted thermostat covers.
Obviously, one task that micro would need to do is measure some analog values, in particular temperature. Now I’m sure you can buy a micro today that has a very fine analog to digital converter built in, but not in those days. Being a young college grad, I immediately Googled AtoD converters, er, started searching integrated circuit catalogs for a suitable AtoD chip. I need not have bothered. The wiser EEs on the team told me, “We don’t need no stink’n AtoD converter chip! Besides, we can’t afford one anyway. We gonna roll our own!” Or words to that effect.
It turned out to be a lot simpler than my college education would have led me to believe. There was no need for a high bandwidth AtoD converter. The values we were measuring varied quite slowly, so we could trade hardware complexity (and cost) for time. Basically, all you need to do is measure the time it takes for a linearly increasing voltage ramp to go from zero until it equals the unknown voltage you are trying to measure (let’s call that X). If the voltage ramp is linear (increasing at a constant rate), the elapsed time will be in direct proportion to X. (In the case of measuring temperature, we use a sensor that converts physical temperature to a variable voltage.) Don’t think too hard about how it is done; this is not a blog about Electrical Engineering. The point is we need to measure elapsed time.
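To make the proportionality concrete, here is a minimal sketch in C (the original firmware was assembly). The function name, the millivolt units, and the ramp rate are all hypothetical, chosen only for illustration; it simply counts ticks until a simulated linear ramp reaches the input.

```c
#include <stdint.h>

/* Hypothetical model of the ramp measurement: count timer ticks until a
 * linear ramp reaches the unknown input X. The units (millivolts and
 * millivolts per tick) are illustrative assumptions, not from the post. */
uint32_t ramp_ticks(uint32_t x_mv, uint32_t ramp_mv_per_tick)
{
    uint64_t ramp_mv = 0;   /* 64 bit so the simulated ramp cannot wrap */
    uint32_t ticks = 0;

    if (ramp_mv_per_tick == 0) {
        return 0;           /* a flat ramp would never reach x_mv */
    }
    while (ramp_mv < x_mv) {
        ramp_mv += ramp_mv_per_tick;   /* ramp rises at a constant rate */
        ticks++;                       /* one counter tick per step */
    }
    return ticks;
}
```

With a ramp of 2 per tick, an input of 1000 takes 500 ticks and an input of 2000 takes 1000 ticks, so doubling X doubles the count.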
Now the particular micro that was chosen (I believe it was one of the Motorola 6805 family) already had an 8 bit counter built in. So if we zero the counter when we start the ramp and stop the counter when the voltage ramp equals X, then we have a number proportional to the value of X. However, 8 bits only gives us a maximum of 256 discrete values with which to measure X, not enough resolution for us.
So I created a 16 bit counter by designating another 8 bit register value to store the 8 most significant bits of our counter, or the high byte. That gives us a maximum of 65,536 discrete values with which to measure X, more than enough for our purposes. When we start the ramp, we zero out the 8 bit counter and the high byte. While the counter (the low byte) is running, it counts up to 255 and rolls over to zero again. The counter was configured to generate an interrupt at the point of the rollover. In the counter interrupt subroutine (code) I incremented the high byte by one. When the ramp voltage equaled X, that generated another interrupt. In the ramp interrupt subroutine I read the value of the counter (low byte) and the high byte to form our 16 bit digital value, which again was proportional to the value of X. Of course, the raw value still needed to be massaged to get an actual temperature or voltage value, but that’s not important to my story.
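As a rough illustration of that scheme, here is a small C model (the real code was assembly, and every name below is made up for this sketch). It assumes the rollover routine runs instantly at the moment of rollover, which real hardware does not guarantee.

```c
#include <stdint.h>

static uint8_t low_byte;    /* stands in for the 8 bit hardware counter */
static uint8_t high_byte;   /* software-maintained upper 8 bits */

/* Start of a conversion: zero both halves of the 16 bit count. */
void start_ramp(void)
{
    low_byte = 0;
    high_byte = 0;
}

/* Counter interrupt routine: runs on each rollover of the low byte. */
void counter_overflow_isr(void)
{
    high_byte++;
}

/* One tick of the simulated hardware counter. Assumption: the rollover
 * routine completes instantly, with nothing able to run in between. */
void counter_tick(void)
{
    low_byte++;
    if (low_byte == 0) {
        counter_overflow_isr();
    }
}

/* Ramp interrupt routine: combine both bytes into the 16 bit result. */
uint16_t ramp_isr(void)
{
    return (uint16_t)(((uint16_t)high_byte << 8) | low_byte);
}

/* Hypothetical driver: run the counter until the ramp matches X. */
uint16_t convert_ideal(uint16_t ticks_until_match)
{
    start_ramp();
    for (uint16_t i = 0; i < ticks_until_match; i++) {
        counter_tick();
    }
    return ramp_isr();
}
```

Under that idealised timing, the value read back always equals the number of ticks that elapsed, including values past 255.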
Some of you may already have seen the problem, but let me finish my story. This micro did more than just AtoD conversions, BTW. It could be configured to perform analog output, digital input, and digital output functions, plus it performed serial communication with a central controller too. So I coded everything up, in assembly language of course, tested it, and released it to pilot testing (or production; I’m a little fuzzy on the exact stage of the project when the bug emerged). Then reports started coming back from the field. Customers were complaining that their temperature readings sometimes randomly jumped to another value. This is a big deal, because if a temperature value is used as an input to a control loop and it just randomly changes, that could cause a valve to slam shut, a damper to open, a pump to start, or whatever, as the controller tries to compensate for the changing input value.
Well, the problem couldn’t be my AtoD conversion firmware, I thought to myself, and I set out to prove it by performing a test that, in retrospect, should have been done before I went to production. I set up a system consisting of one analog input module, a controller, and an analog output module. The analog input module, of course, contained the AtoD circuitry, microcontroller, and firmware in question. First, I fed a very, very slowly increasing voltage into the analog input. Then I set up the controller to read the analog input and perform a simple proportional control algorithm that just set the output equal to the input. Finally, I sent the output of the algorithm to the analog output module, to convert the digital value to a physical voltage. To record the test, I connected the analog output voltage to the Y axis of an XY plotter. The Y input of the plotter would record a voltage proportional to the input voltage of the system, while the X axis was set to increase (left to right) with time.
I positioned the plotter pen at the lower left corner of the chart and set the whole thing in motion. To my horror, I did not see a smooth, monotonically increasing line drawn on the plotter. Periodically the pen would make a sickening jump down for a bit and then return to normal, and it didn’t jump by a random amount, either. The jump was about the same each time. In retrospect (there’s that word again), it was obvious what the problem was. What if, after the counter rolled over and execution jumped to the counter interrupt routine, but before the code to increment the high byte was executed, the other interrupt signalling the end of the AtoD conversion caused execution to jump to the routine that recorded the final conversion value? Well, we would have a final count that is low by 256, that’s what!
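The race can be reproduced in a toy C model. This is only a sketch under stated assumptions: the names are hypothetical, `isr_latency` is an invented parameter for the number of counter ticks between a rollover and the high byte increment, and the ramp interrupt is assumed to be able to cut in during that window and read both bytes immediately.

```c
#include <stdint.h>

/* Toy model of the bug. isr_latency is a made-up parameter: the number
 * of counter ticks between the low byte rolling over and the overflow
 * routine actually incrementing the high byte. */
uint16_t buggy_convert(uint16_t match_tick, uint8_t isr_latency)
{
    uint8_t low = 0;        /* 8 bit hardware counter */
    uint8_t high = 0;       /* software high byte */
    unsigned pending = 0;   /* ticks left before high++ runs; 0 = idle */

    for (uint32_t t = 0; t < match_tick; t++) {
        if (pending > 0 && --pending == 0) {
            high++;         /* overflow routine finally does its job */
        }
        low++;
        if (low == 0) {     /* rollover: overflow interrupt requested */
            if (isr_latency == 0) {
                high++;     /* idealised case: no window at all */
            } else {
                pending = isr_latency;
            }
        }
    }
    /* Ramp interrupt fires now and reads both bytes, even if the
     * overflow routine has not yet incremented the high byte. */
    return (uint16_t)(((uint16_t)high << 8) | low);
}
```

With a latency of 3 ticks, a match at tick 255 or tick 259 reads correctly, but a match at tick 256 reads 0 and a match at tick 512 reads 256: exactly 256 too low, and only when the match lands inside the small window after a rollover.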
The fix was quite simple: disable interrupts inside the interrupt routines. I’m fairly certain I never made that mistake again. However, getting the fix to the customers was problematic. This was, of course, long before FLASH memory, download-the-latest-version-from-our-website, automatic updates, etc. The microcontroller we used was a mask-programmed part, meaning the code was written to the memory of the chip during one of the manufacturing steps. Getting new chips was very expensive and took a long time (don’t quote me on this, but I think a new mask cost $4,000 and the lead time was a couple of months). It was a sad day for me when I had to admit my screw-up to my boss. Fortunately, he didn’t fire me, and we were at least able to use the existing chips for the analog out and digital I/O versions.
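For the curious, the effect of the fix can be sketched with a toy C model. It is not the original assembly, and it rests on assumptions: the names are hypothetical, `isr_latency` is an invented parameter for the ticks between a rollover and the high byte increment, and the counter is assumed to keep running while the ramp interrupt is held off.

```c
#include <stdint.h>

/* Toy model of the fix: with interrupts disabled inside the overflow
 * routine, a ramp interrupt that arrives mid-routine has to wait.
 * isr_latency is a made-up parameter (ticks from rollover to high++). */
uint16_t masked_convert(uint16_t match_tick, uint8_t isr_latency)
{
    uint8_t low = 0;        /* 8 bit hardware counter */
    uint8_t high = 0;       /* software high byte */
    unsigned pending = 0;   /* ticks left before high++ runs; 0 = idle */

    for (uint32_t t = 0; t < match_tick; t++) {
        if (pending > 0 && --pending == 0) {
            high++;
        }
        low++;
        if (low == 0) {     /* rollover: overflow routine starts */
            if (isr_latency == 0) {
                high++;
            } else {
                pending = isr_latency;
            }
        }
    }
    /* Ramp interrupt requested here. It is held off until the overflow
     * routine has finished, so the high byte is always up to date.
     * Assumption: the counter keeps ticking during the wait. */
    while (pending > 0) {
        if (--pending == 0) {
            high++;
        }
        low++;
    }
    return (uint16_t)(((uint16_t)high << 8) | low);
}
```

In this model a match at tick 256 with a latency of 3 reads 259. The reading can be a few counts late, but it is never short by 256.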