Saturday, November 19, 2011

Coding Principles !


Whenever I review a piece of code, I stress a lot to follow even the simplest of the coding principles. When a developer questions the significance of those principles, I give mostly theoretical examples of why the principles are important in scenarios that don’t happen often. Seriously, how often do we see those coding principles making an actual difference in a normal project lifecycle?

The simple coding principles may not always make a lot of difference. But when they do, it’s an embarrassment to the developer and not many developers get embarrassed on a regular basis.

They are rare but very important. I haven’t seen the principles making a huge difference in practice, well, until recently. So when I have a live example, I better document it, for re-usability of course.  Here it goes.

Consider the piece of code below:

#/bin/bash
bash getpendinglist.sh  # populates /tmp/list.txt by querying against a database
while read line
do
            name=`echo $line | cut –d’|’ -f1`
location=`echo $line | cut –d’|’ -f2`
status=`echo $line | cut –d’|’ -f3`
if [ “$status” == “PENDING” ]; then
               echo “Pending entry identified: $location/$name”
               bash process.sh $name $location
fi
done < /tmp/list.txt

At the first look, everything looks OK. The getpendinglist.sh retrieves the list of pending entries (from a database using sqlplus). The file with the retrieved entries is read line by line in a while loop. The line contains 3 values delimited by a pipe (|) character. If the 3rd value is “PENDING”, the process.sh is called with the name and location variables. Otherwise the line is skipped. The process.sh is the actual business process, in our case an ETL process. Straight-forward, right?

But it is not following the simple rule mentioned above. The initialization of name and location is required only for lines with status value as “PENDING”.  So the 2 initialization statements should have been placed under if statement. But come on, how much could 2 simple initialization statements hurt?

The script worked fine for almost 18 months. The maximum CPU usage was around 70%. The CPU utilization itself didn’t bother us because after all the ETL process.sh is supposed to be CPU intensive and there was no operational impact.

All of a sudden, unexpected pauses started to appear in the processing. Even when there was only one PENDING entry in the database, it took a few hours even to start processing the PENDING entry and the CPU usage shot up to a consistent 90%. Even when there are no entries with PENDING as status in the list.txt, the CPU usage never came down below 75%. That triggered the analysis because if there are no PENDING entries, the process.sh that is supposed to be CPU-intensive never runs. So there is no way the CPU usage can be 90%. There is an obvious problem to analyse now. 

The unix admin for our servers, a very good one at that, did some deep-diving and finally figured out a crucial cause. Of 90% the CPU usage, 75% is in the system kernel level and too many processes are being created (~200 per second). The list.txt contained 1000s of entries without having status as PENDING. 

The getpendinglist.sh that populates the list.txt didn’t use a filter clause in its database query, it retrieved all the entries from the database rather than getting only the pending entries. Given that every cut statement is a new process, every 1000 entries in the file created 3000 processes (3 per entry). Obviously the cut is going to be a very short-lived process, only in milliseconds at best. As every new process creation has a kernel overhead, the CPU usage at the kernel level was consistently high.

The fix was very simple indeed,
  •  Add a filter in the database query to remove entries that don't have status as PENDING
  • Move the cut statements inside the if statement

A true silver bullet fix, brought down the CPU usage to 10% in the system kernel level. This is a clear demonstration of the importance of the coding principles that are overlooked many times.
  • Write a line of code only when it is necessary; initialize a variable only when it is going to be used
  • Don’t just go for a simple query, aim for the perfect query that will give you just what you wanted



No comments:

Post a Comment