Saturday, November 19, 2011

Coding Principles !


Whenever I review a piece of code, I stress a lot to follow even the simplest of the coding principles. When a developer questions the significance of those principles, I give mostly theoretical examples of why the principles are important in scenarios that don’t happen often. Seriously, how often do we see those coding principles making an actual difference in a normal project lifecycle?

The simple coding principles may not always make a lot of difference. But when they do, it’s an embarrassment to the developer and not many developers get embarrassed on a regular basis.

They are rare but very important. I haven’t seen the principles making a huge difference in practice, well, until recently. So when I have a live example, I better document it, for re-usability of course.  Here it goes.

Consider the piece of code below:

#/bin/bash
bash getpendinglist.sh  # populates /tmp/list.txt by querying against a database
while read line
do
            name=`echo $line | cut –d’|’ -f1`
location=`echo $line | cut –d’|’ -f2`
status=`echo $line | cut –d’|’ -f3`
if [ “$status” == “PENDING” ]; then
               echo “Pending entry identified: $location/$name”
               bash process.sh $name $location
fi
done < /tmp/list.txt

At the first look, everything looks OK. The getpendinglist.sh retrieves the list of pending entries (from a database using sqlplus). The file with the retrieved entries is read line by line in a while loop. The line contains 3 values delimited by a pipe (|) character. If the 3rd value is “PENDING”, the process.sh is called with the name and location variables. Otherwise the line is skipped. The process.sh is the actual business process, in our case an ETL process. Straight-forward, right?

But it is not following the simple rule mentioned above. The initialization of name and location is required only for lines with status value as “PENDING”.  So the 2 initialization statements should have been placed under if statement. But come on, how much could 2 simple initialization statements hurt?

The script worked fine for almost 18 months. The maximum CPU usage was around 70%. The CPU utilization itself didn’t bother us because after all the ETL process.sh is supposed to be CPU intensive and there was no operational impact.

All of a sudden, unexpected pauses started to appear in the processing. Even when there was only one PENDING entry in the database, it took a few hours even to start processing the PENDING entry and the CPU usage shot up to a consistent 90%. Even when there are no entries with PENDING as status in the list.txt, the CPU usage never came down below 75%. That triggered the analysis because if there are no PENDING entries, the process.sh that is supposed to be CPU-intensive never runs. So there is no way the CPU usage can be 90%. There is an obvious problem to analyse now. 

The unix admin for our servers, a very good one at that, did some deep-diving and finally figured out a crucial cause. Of 90% the CPU usage, 75% is in the system kernel level and too many processes are being created (~200 per second). The list.txt contained 1000s of entries without having status as PENDING. 

The getpendinglist.sh that populates the list.txt didn’t use a filter clause in its database query, it retrieved all the entries from the database rather than getting only the pending entries. Given that every cut statement is a new process, every 1000 entries in the file created 3000 processes (3 per entry). Obviously the cut is going to be a very short-lived process, only in milliseconds at best. As every new process creation has a kernel overhead, the CPU usage at the kernel level was consistently high.

The fix was very simple indeed,
  •  Add a filter in the database query to remove entries that don't have status as PENDING
  • Move the cut statements inside the if statement

A true silver bullet fix, brought down the CPU usage to 10% in the system kernel level. This is a clear demonstration of the importance of the coding principles that are overlooked many times.
  • Write a line of code only when it is necessary; initialize a variable only when it is going to be used
  • Don’t just go for a simple query, aim for the perfect query that will give you just what you wanted



Monday, November 30, 2009

Anything perfect is never simple, even a Singleton !

Having used the Singleton, a creational design pattern so often, a lot about it has not been very obvious. So here is my attempt to put together everything about singleton.

Here's what the GoF declares as the intent of singleton pattern:

The Singleton pattern ensures only one class instance and provides a global point of access to it.

Sample Java code for the simplest Singleton:

package demo;

public class Singleton {

private static Singleton instance = null;
public static Singleton getInstance() {
if (instance == null) {
instance = new Singleton();
}
return instance;
}
private Singleton() {
}
}

As the simplest implementation, the above code cares nothing about thread-safety, serialization etc.,

But then when we need to put together everything about singleton,
we are indeed going to care about those. ;-)

Thread-safety:

The getInstance() method is not thread-safe, which means it is not possible that the class can be called by multiple threads in parallel and assumed only one object will be created.

The simplest fix would be to create the instance as part of the declaration itself.

private static Singleton instance = new Singleton();
public static Singleton getInstance() {
return instance;
}

Here we face the biggest problem of early initialization. Even if the getInstance() is never going to be called (wonder why in the world, one would decide to keep such an object as singleton? ;-) ), just loading the class would result in the object creation, a memory overhead in case the object is huge (as in most of the times).

An alternative could be to declare the getInstance() method as synchronized. But then, it's going to create a performance overhead due to lock contention. With the API exposed to the whole world and expectedly hundreds of calls to retrieve the singleton instance, synchronized wouldn't be the best option.

By then, we come across a clever idea, the "Double Checked Locking" to overcome the disadvantages of these techniques. Using DCL, the instance is not created as part of the declaration and the method is not declared as synchronized either.

public static Singleton getInstance() {
if (instance == null) {
synchronized(Singleton.class) {
if (instance == null) {
instance = new Singleton();
}
}
}
return instance;
}

Looks interesting, isn't it? Quite unfortunately, the DCL just doesn't help (at least in theory, yeah I have failed miserably trying to reproduce the problem).

The link below explains the theory:


We could now think of a combination of early object creation and DCL based resource initialization as an alternative (see below). But since I haven't reproduced the DCL failure yet, I can't be really sure if this would help.

package demo;

public class Singleton {

private static boolean initialized = false;
private static Singleton instance = new Singleton();
public static Singleton getInstance() {
if (!initialized) {
synchronized(Singleton.class) {
if (!initialized) {
instance.runHeavyInit();
}
}
}
return instance;
}
private Singleton() {
}
private synchronized void runHeavyInit() {
if (!initialized) {
doHeavyInit();
initialized = true;
}
}
private void doHeavyInit() {
// Do things here
}
}

And java.util.concurrent.locks.ReadWriteLock has been hogging my mind, looking to help as another alternative.

Serialization:

Fortunately, there is no problem at all while serializing a Singleton instance. But de-serialization is a nightmare. The de-serialization creates another object, a colleague to the Singleton instance (given getInstance() is called already or early initialization has occurred) defeating the purpose of the whole implementation, but wait, don't we know that "failure is never final" ? ;-)

The Serializable API helps us in defining a "serializable" Singleton. Yeah, finally we found one way to use the rarely used method.
<ANY-ACCESS-MODIFIER> Object readResolve() throws ObjectStreamException {
return instance;
}

We add the above method to our Singleton class, and we are done. The existing instance is used for de-serialization to replace the existing contents with the de-serialized contents.

Finally, there are 2 problems design gurus (of course Developers, during their UT) hate about the Singleton design pattern.

1) A pure Singleton class cannot be sub-classed (due to the private constructor), against a famous design principle, a class should be open for extension.

2) Dependency on a class is injected. Another violation of a design principle, instead of programming to an interface, we must have to program to a concrete implementation.

P.S: I guess this is the first ever explanation on a design pattern without an UML diagram depicting it. :-)