Monday, September 30, 2019

Parsing Data With Metadata

Recently, I had worked on some parsing routines for messages received from another vendor. The messages come via TCP, but that's not relevant to this discussion. What I wanted to talk about is how a "generic" type of parsing mechanism ended up being useful for parsing data that is received in more than one format. The vendor initially only sent messages in a pipe-delimited fashion, which is how the parsing routine was originally designed. Then later, they added additional messages that were XML (which we could use to Fill a DataSet). We needed to be sure that the parsing routine would work with both types of messages, because we would still be receiving both types.

Let's pretend that this example comes from a ride-sharing system (it doesn't). So we have cars and drivers and we are receiving messages about the status of each car and its location.

A pipe-delimited message might come in looking like this:

|CARCOMPANY=GeeksRide|CARNUMBER=C121|STATUS=Active|OPERATORID=2401|OPERATORNAME=Bonnie|STATUSDATE=5/24/2018 12:54:40 PM|LASTCOORDTIME=5/24/2018 12:54:55|PHONENBR=555-555-0430|X=-122.650345|Y=45.535772|

As you can see, the data is coming into my application from a hypothetical company called GeeksRide.

An XML message might come in looking like this:

<RideShareDataSet>
  <RideShareCar>
    <CarCompany>GeeksRide</CarCompany>
    <CarNumber>C121</CarNumber>
    <Status>Active</Status>
    <X>-122.650345</X>
    <Y>45.535772</Y>
    <LastCoordTime>5/24/2018 12:54:55</LastCoordTime>
    <StatusDate>5/24/2018 12:54:40 PM</StatusDate>
    <OperatorID>2401</OperatorID>
    <OperatorName>Bonnie</OperatorName>
    <PhoneNBR>555-555-0430</PhoneNBR>
  </RideShareCar>
</RideShareDataSet>

For ease of comparison between the two methodologies, the XML in my example contains the same data as the pipe-delimited, although,  in reality, it could be different data and different schema (column names). In the case of this hypothetical ride-sharing system, we could be receiving messages from many different companies with different data and different schema. Each type of message in your application can have different metadata, if necessary, to correspond to the various companies schemas.

What does this metadata look like? Very, very similar to the pipe-delimited data! Whether the message is coming in as pipe-delimited or XML, this pipe-delimited metadata can be used for either! Here's an example of metadata that can be used to process the data coming in with the above schema:

|CARCOMPANY=Company|CARNUMBER=CarID|EVENTNO=TripID|STATUS=Status|OPERATORID=DriverID|OPERATORNAME=DriverName|RADIOID=|EVENTTYPE=|STATUSDATE=LastStatusDateTime|LOCATION=LocationName|X=Longitude|Y=Latitude|ADDRESS=Address|DIRECTION=heading|SPEED=speed|LASTCOORDTIME=LastGPSDateTime|PHONENBR=PhoneNumber|

Let me explain what this means: each pair of values between the pipes,  |XX=YY|, correspond to the name of the data coming in (XX) and the name of the column in our ride-sharing system's DataSet/DataTable (YY).  Pretty simple, right?

Note that the leading and trailing pipes are optional in both the metadata and pipe-delimited input data.

OK, so now some code.

First, to simulate an incoming message, I'll just create pipeInput and xmlInput string variables. Like so:

private string pipeInput = "|CARCOMPANY=GeeksRide|CARNUMBER=C121|STATUS=Active|OPERATORID=2401|OPERATORNAME=Bonnie|STATUSDATE=5/24/2018 12:54:40 PM|LASTCOORDTIME=5/24/2018 12:54:55|PHONENBR=555-555-0430|X=-122.650345|Y=45.535772|";

private string xmlInput =
    "<RideShareDataSet>" +
    "  <RideShareCar>" +
    "    <CarCompany>GeeksRide</CarCompany>" +
    "    <CarNumber>C121</CarNumber>" +
    "    <Status>Active</Status>" +
    "    <X>-122.650345</X>" +
    "    <Y>45.535772</Y>" +
    "    <LastCoordTime>5/24/2018 12:54:55</LastCoordTime>" +
    "    <StatusDate>5/24/2018 12:54:40 PM</StatusDate>" +
    "    <OperatorID>2401</OperatorID>" +
    "    <OperatorName>Bonnie</OperatorName>" +
    "    <PhoneNBR>555-555-0430</PhoneNBR>" +
    "  </RideShareCar>" +
    "</RideShareDataSet>";

And I'll also simulate the metadata with a string variable as well. Normally you'd get the metadata from a config file or from a database:

private string pipeMetadata = "|CARCOMPANY=Company|CARNUMBER=CarID|EVENTNO=TripID|STATUS=Status|OPERATORID=DriverID|OPERATORNAME=DriverName|RADIOID=|EVENTTYPE=|STATUSDATE=LastStatusDateTime|LOCATION=LocationName|X=Longitude|Y=Latitude|ADDRESS=Address|DIRECTION=heading|SPEED=speed|LASTCOORDTIME=LastGPSDateTime|PHONENBR=PhoneNumber|";

One more variable we'll need is a Dictionary to store the metadata in. I call it a SchemaMap:

/// <summary>
/// The key is the data name of the incoming data, as defined in the Metadata (pipeMetadata)
/// The value is the column name of the DataTable where we store the Parsed data, as defined in the Metadata
/// </summary>
Dictionary<string, string> SchemaMap = new Dictionary<string, string>();

To test all this, just call this Test method (you can set a breakpoint at the end and look at the two DataTables that now contain the data parsed from the two different input messages):

public void TestParsingMethods()
{
    this.CreateSchemaMap(pipeMetadata);

    DataTable dtWithPipeInput = new DataTable();
    this.ParsePipeDelimited(dtWithPipeInput, pipeInput);
    
    DataTable dtWithXmlInput = new DataTable();
    this.ParseXml(dtWithXmlInput, xmlInput);
}

Let's start with the CreateSchemaMap, which is very simple. It fills the dictionary with the metadata:

public void CreateSchemaMap(string schema)
{
    string[] names = schema.Split(new char[] { '|' }, StringSplitOptions.RemoveEmptyEntries);
    this.SchemaMap = new Dictionary<string, string>();
    
    for (int i = 0; i < names.Length; i++)
    {
        string[] s = names[i].Split(new char[] { '=' }, StringSplitOptions.RemoveEmptyEntries);
        if (s.Length == 2)
            SchemaMap.Add(s[0].ToUpper(), s[1]);
    }
}

The two Parsing methods, which I'll show first, both call a common method that puts the data into the DataRow with the proper column names, based on the metadata in the SchemaMap dictionary:

public void ParsePipeDelimited(DataTable dt, string pipeMessage)
{
    // If we're passed an undefined DataTable (one with no columns), we need to add string columns
    // as we have no way of accurately determining the type of data.
    // Typically, we'd already have defined a DataSet/DataTable with column datatypes.
    if (dt.Columns.Count == 0)
        foreach (KeyValuePair<string, string> kvp in this.SchemaMap)
        {
            if (dt.Columns.Contains(kvp.Value) == false)
                dt.Columns.Add(kvp.Value);  // default data type is string
        }
    
    // We'll be parsing data from the Message into this DataRow
    DataRow row = dt.NewRow();
    dt.Rows.Add(row);
    
    string[] nameValue = pipeMessage.Split(new char[] { '|' }, StringSplitOptions.RemoveEmptyEntries);
    string schemaKey, columnName;
    for (int i = 0; i < nameValue.Length; i++)
    {
        string[] s = nameValue[i].Split(new char[] { '=' }, StringSplitOptions.RemoveEmptyEntries);
        schemaKey = s[0].ToUpper();
        if (s.Length == 2 && SchemaMap.ContainsKey(schemaKey))
        {
            columnName = this.SchemaMap[schemaKey]; // get column name from dictionary
            this.ParseToDataRow(row, columnName, dt, s[1]);
        }
    }
}
public void ParseXml(DataTable dtParsed, string xmlMessage)
{
    // If we're passed an undefined DataTable (one with no columns), we need to add string columns
    // as we have no way of accurately determining the type of data.
    // Typically, we'd already have defined a DataSet/DataTable with column datatypes.
    if (dtParsed.Columns.Count == 0)
        foreach (KeyValuePair<string, string> kvp in this.SchemaMap)
        {
            if (dtParsed.Columns.Contains(kvp.Value) == false)
                dtParsed.Columns.Add(kvp.Value);  // default data type is string
        }
    
    // We'll be parsing data from the Message into this DataRow
    DataRow row = dtParsed.NewRow();
    dtParsed.Rows.Add(row);
    
    // Now, let's read the XML into a DataSet:
    // I suggest making this into an Extension method, it's pretty handy to have.
    DataSet dsFromXml = new DataSet();
    StringReader sr = new StringReader(xmlMessage);
    dsFromXml.ReadXml(sr, XmlReadMode.InferSchema);
    dsFromXml.AcceptChanges();
    
    // Most likely only one Table, but just in case
    foreach (DataTable dt in dsFromXml.Tables)
    {
        if (dt.Rows.Count == 0)
            continue;
        
        DataRow rowFromXml =  dt.Rows[0];
        string columnName;
        foreach (string schemaKey in this.SchemaMap.Keys)
        {
            columnName = this.SchemaMap[schemaKey]; // get column name from dictionary
            foreach (DataColumn dcFromXml in dt.Columns)
            {
                if (dcFromXml.ColumnName.ToUpper() == schemaKey)
                {
                    this.ParseToDataRow(row, columnName, dtParsed, rowFromXml[schemaKey].ToString());
                    break;
                }
            }
        }
    }
}

Note the comments at the beginning of each of those methods. I used Typed DataSets extensively, so I don't need to be adding any columns to a "generic" DataTable. But, you don't have to have a Typed DataSet to have a DataTable with pre-defined columns and datatypes.

And lastly, the common method that they both call:

private void ParseToDataRow(DataRow row, string columnName, DataTable dt, string data)
{
    foreach (DataColumn dc in dt.Columns)
    {
        if (dc.ColumnName.ToUpper() == columnName.ToUpper())
        {
            if (dc.DataType == Type.GetType("System.Double"))
            {
                double lat;
                if (double.TryParse(data, out lat))
                {
                    if (data.Contains('.'))
                        row[dc] = lat;
                    else
                        row[dc] = lat / 1000000;
                }
            }
            else if (dc.DataType == Type.GetType("System.Int32"))
            {
                int val;
                if (int.TryParse(data, out val))
                {
                    row[dc] = val;
                }
            }
            else if (dc.DataType == Type.GetType("System.DateTime"))
            {
                DateTime date;
                if (DateTime.TryParse(data.Replace('/', ' '), out date))
                {
                    row[dc] = date;
                }
            }
            else
            {
                if (string.IsNullOrWhiteSpace(data) == false)
                    row[dc] = data;
            }

            break;
        }
    }
}

So, now you've seen how the same metadata and (at least part of) the same code can be used to parse data out of two very different types of data. I hope that you, dear Reader, can find some use for this.

Happy Coding! =0)