Sunday, January 29, 2012

Efficient XML processing

Many developers deal with a lot of XML files every day. These files serve as configuration, documentation, or databases, and they are used for data sharing, data transport, or easing platform changes. They can grow very large and need to be processed efficiently.
For example, while reading a configuration file, the module that reads the XML iterates over the elements, reads the tag of the current XML element, and decides what processing to do. A relatively large XML file may contain many different XML elements (distinguished by their tags), and each one has to be checked as you encounter it.
A brute-force way to achieve this is to check for each XML tag with a string comparison in an if-else ladder. Suppose that, for now, your XML file contains just three tags: Config1, Config2 and Config3. Your code would look something like this:
  
    class Caller 
    { 
        public void Call(string inputValue) 
        { 
            // Using the if-else ladder 
            if(inputValue.Equals("Config1")) 
            { 
                Method1(); 
            } 
            else if(inputValue.Equals("Config2")) 
            { 
                Method2(); 
            } 
            else if(inputValue.Equals("Config3")) 
            { 
                Method3(); 
            } 
        } 
    } 
 
All works well. But what if the number of XML tags that need to be handled grows every day? You will be handling "Config1" to "ConfigN" the same way as before, with the if-else ladder. And what if you have no control over the value of N? The processing time for each file then increases, and the efficiency of the code needs a closer look. In the worst case the ladder performs N string comparisons for every element it sees, so the cost grows with both the number of tags and the size of the file, and the code suffers in efficiency, maintainability and scalability.

If you try to visualize the above code in terms of a map, you would see that:

"Config1" maps to Method1()
"Config2" maps to Method2()
and so on...

This is where a key-to-handler lookup table becomes useful. Initialize a dictionary that stores <key, value> pairs as <string, MethodHandler>, where the string is the XML tag such as "Config1", "Config2" and so on, and MethodHandler is a delegate type that references the method to call. This can be done using an initializer method such as:

        private delegate void MethodHandler(); 
        private SortedDictionary<string, MethodHandler> stringToDelegateDict; 
  
        public Caller() 
        { 
            stringToDelegateDict = new SortedDictionary<string, MethodHandler>(); 
        } 
  
        public void Initialize() 
        { 
            MethodHandler handler1 = new MethodHandler(Method1); 
            MethodHandler handler2 = new MethodHandler(Method2); 
            MethodHandler handler3 = new MethodHandler(Method3); 
            AddHandler ("Config1", handler1); 
            AddHandler ("Config2", handler2); 
            AddHandler ("Config3", handler3); 
            // All the handler methods are initialized here in the SortedDictionary. 
        } 

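Note that we use a generic SortedDictionary here rather than a Hashtable because we know the key and value types in advance. This saves the downcast from object that a Hashtable would require. A SortedDictionary keeps its keys ordered and looks them up in O(log n) time; if the ordering is not needed, the generic Dictionary<string, MethodHandler> offers the same type safety with hash-based lookups that are close to constant time.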
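
To make the casting point concrete, here is a minimal sketch that stores the same handler in a Hashtable, a SortedDictionary and a Dictionary. The class name LookupComparison and the Calls counter are made up for this illustration and are not part of the Caller class above.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical demo class, not part of the original Caller design.
public static class LookupComparison
{
    public delegate void MethodHandler();

    // Counts how many times the handler ran.
    public static int Calls;

    private static void Method1() { Calls++; }

    public static void Run()
    {
        // Non-generic Hashtable: values come back as object and need a cast.
        Hashtable table = new Hashtable();
        table["Config1"] = new MethodHandler(Method1);
        MethodHandler fromTable = (MethodHandler)table["Config1"];
        fromTable();

        // Generic SortedDictionary: no cast, tree-based lookup.
        var sorted = new SortedDictionary<string, MethodHandler>(StringComparer.Ordinal);
        sorted["Config1"] = Method1;
        sorted["Config1"]();

        // Generic Dictionary: no cast, hash-based lookup.
        var hashed = new Dictionary<string, MethodHandler>(StringComparer.Ordinal);
        hashed["Config1"] = Method1;
        hashed["Config1"]();
    }

    public static void Main()
    {
        Run();
        Console.WriteLine(Calls); // one call per container
    }
}
```

Passing StringComparer.Ordinal is a choice made for this sketch; it compares tag names by their exact characters, which is usually what you want for XML element names.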

With this approach, you can also decide at runtime which handlers should be present in the dictionary, using the following methods that add or remove a handler.

Now you can subscribe handlers only when they are needed.

       // Adding a new handler 
       public void AddHandler(string inputValue, MethodHandler handler) 
       { 
           stringToDelegateDict.Add(inputValue, handler); 
       } 

       // Removing an existing handler 
       public void RemoveHandler(string inputValue) 
       { 
           stringToDelegateDict.Remove(inputValue); 
       } 
 
When you encounter an XML tag now, you make the same call as before:

       Caller c = new Caller(); 
       c.Initialize(); 
       c.Call("Config1"); 
       c.Call("Config2"); 
  
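However, you change your Call method to the following implementation. It uses TryGetValue so that an unknown tag is ignored, as it was in the if-else ladder, instead of throwing a KeyNotFoundException.
 
       public void Call(string inputValue) 
       { 
           // Get the delegate to the respective method, if one is registered. 
           MethodHandler mh; 
           if (stringToDelegateDict.TryGetValue(inputValue, out mh)) 
           { 
               // Make the call. 
               mh(); 
           } 
       } 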

This makes life simpler when dealing with changes in the XML tags: adding handlers for new tags, removing handlers for existing tags, and doing both at runtime. Now any change to the XML format only requires changes in the Caller.Initialize method. Here, the method handlers act as subscribers that can dynamically subscribe to or unsubscribe from a particular event.
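
To show how such a lookup could be driven by real XML input, here is a rough sketch that walks a document with XmlReader and invokes the handler registered for each element name. The class XmlDispatchSketch, its Dispatch method and the sample XML are assumptions made for this example; the post itself does not show the reading side.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, written for illustration only.
public static class XmlDispatchSketch
{
    public delegate void MethodHandler();

    // Reads the XML text and invokes the handler registered for each element name.
    // Returns the names of the elements that had a handler, in document order.
    public static List<string> Dispatch(string xml, IDictionary<string, MethodHandler> handlers)
    {
        var handled = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                // Only start tags (and empty elements) are dispatched.
                if (reader.NodeType != XmlNodeType.Element) continue;

                MethodHandler handler;
                if (handlers.TryGetValue(reader.Name, out handler))
                {
                    handler();
                    handled.Add(reader.Name);
                }
            }
        }
        return handled;
    }

    public static void Main()
    {
        var handlers = new Dictionary<string, MethodHandler>
        {
            { "Config1", () => Console.WriteLine("Method1") },
            { "Config2", () => Console.WriteLine("Method2") }
        };
        // Made-up sample input; Settings and Unknown have no handler and are skipped.
        string xml = "<Settings><Config1/><Config2/><Unknown/></Settings>";
        List<string> handled = Dispatch(xml, handlers);
        Console.WriteLine(string.Join(",", handled));
    }
}
```

Because XmlReader streams through the document instead of loading it all into memory, each element costs one dictionary lookup no matter how many tags are registered.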

Now, what if your design says that both Method1 and Method2 must be called to handle the tag "Config1"? If you were still using the original if-else design (call it OrdinaryCaller), you would handle it something like this:

       // The OrdinaryCaller.Call method needs to be changed like this. 
       public void CallModified(string inputValue) 
       { 
           // Using the if-else ladder 
           if (inputValue.Equals("Config1")) 
           { 
               Handler.Method1(); 
               Handler.Method2(); 
           } 
           else if (inputValue.Equals("Config2")) 
           { 
               Handler.Method2(); 
           } 
           else if (inputValue.Equals("Config3")) 
           { 
               Handler.Method3(); 
           } 
       } 

The problem with this is that you need to keep changing the code of the Call() method, and the main disadvantage is that you cannot change the behaviour of Call() at runtime.

This is where multicast delegates come into the picture. You slightly modify the AddHandler() method to support multicast delegates. This lets you have any number of methods as handlers for each tag.

       // Adding a new handler 
       public void AddHandler(string inputValue, MethodHandler handler) 
       { 
           MethodHandler existing; 
           // Check if the string key is already available... 
           if (stringToDelegateDict.TryGetValue(inputValue, out existing)) 
           { 
               // If yes, combine into a multicast delegate 
               // and store it back under the same key. 
               stringToDelegateDict[inputValue] = existing + handler; 
           } 
           else 
           { 
               // else, simply add the handler. 
               stringToDelegateDict.Add(inputValue, handler); 
           } 
       } 

Just call the AddHandler() method to subscribe a new handler and voila! You’ve got another handler working right where you wanted it.
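
If multicast delegates are new to you, the small sketch below shows the behaviour that AddHandler() relies on: combining two delegates runs them in the order they were added, and -= takes a single one out again. The class MulticastSketch and the log list are invented for this demonstration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class, not part of the original Caller design.
public static class MulticastSketch
{
    public delegate void MethodHandler();

    // Returns the order in which the handlers ran.
    public static List<string> Run()
    {
        var log = new List<string>();
        MethodHandler first = () => log.Add("Method1");
        MethodHandler second = () => log.Add("Method2");

        // Delegates are immutable: += builds a new combined delegate.
        MethodHandler combined = first;
        combined += second;
        combined();   // runs first, then second

        // -= removes one target and leaves the others in place.
        combined -= first;
        combined();   // runs second only

        return log;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Run()));
    }
}
```

One thing to watch for: removing the last remaining target with -= yields null, so a per-handler unsubscribe built on this idea would need to drop the dictionary key when that happens.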

I feel that the technique above is simply a better way of handling large XML files. I'm sure cleaner options exist that I'm not aware of. What do you think? What would be a better and more efficient technique for this?
