Saturday, January 31, 2015

Compress / Decompress

Our software uses a message-bus style of architecture to send messages between the different integrated systems. The messages are compressed and encrypted before they're sent, and on the receiving end they are decrypted, decompressed and then processed. We developed this architecture about 6 years ago and it works pretty darn well.
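
To set the stage, here's a rough sketch of what the sending side of that pipeline looks like. The Encrypt and PublishToBus calls are placeholders (those names are made up for illustration), and CompressData is the method shown below:

// Hypothetical send-side pipeline -- compress first, then encrypt, then hand the bytes to the bus.
protected void SendMessage(string xml)
{
    byte[] compressed = this.CompressData(xml);   // shown below
    byte[] encrypted = this.Encrypt(compressed);  // placeholder -- encryption is a topic for another post
    this.PublishToBus(encrypted);                 // however your bus actually sends bytes
}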

So, today I'm going to talk about the compress/decompress methods; I'll save Encrypt/Decrypt for another time, since that is a bit more complicated. What prompted me to write about this was an MSDN forum thread I read, where people were suggesting the GZipStream class for this functionality. When we first developed our Compress/Decompress methods, we did NOT use GZipStream, so I was curious which approach was better … both in how much it compresses and in how long it takes. Here's the MSDN forum thread: https://social.msdn.microsoft.com/Forums/vstudio/en-US/22df4c05-9959-4188-9d8b-a755ccd9a8cc/maximal-compressing-of-xml?forum=csharpgeneral

Here's our set of methods:

// be sure you have:
//using System;
//using System.Collections;
//using System.IO;
//using System.IO.Compression;
//using System.Text;
protected byte[] CompressData(string xml)
{
    if (xml == null)
        xml = "";

    // convert the string to UTF-8 bytes and push them through a DeflateStream
    byte[] temp = Encoding.UTF8.GetBytes(xml);
    MemoryStream ms = new MemoryStream();
    DeflateStream ds = new DeflateStream(ms, CompressionMode.Compress);
    ds.Write(temp, 0, temp.Length);
    ds.Flush();
    ds.Close();   // closing the DeflateStream flushes the final compressed block into ms
    return ms.ToArray();
}
protected string DecompressData(byte[] data)
{
    const int BUFFER_SIZE = 10;
    byte[] tempArray = new byte[BUFFER_SIZE];
    ArrayList tempList = new ArrayList();
    int count = 0, length = 0;

    MemoryStream ms = new MemoryStream(data);
    DeflateStream ds = new DeflateStream(ms, CompressionMode.Decompress);

    // read the decompressed bytes in BUFFER_SIZE chunks, collecting them in tempList
    while ((count = ds.Read(tempArray, 0, BUFFER_SIZE)) > 0)
    {
        if (count == BUFFER_SIZE)
        {
            tempList.Add(tempArray);
            tempArray = new byte[BUFFER_SIZE];
        }
        else
        {
            // partial chunk -- only keep the bytes that were actually read
            byte[] temp = new byte[count];
            Array.Copy(tempArray, 0, temp, 0, count);
            tempList.Add(temp);
        }
        length += count;
    }

    // stitch the chunks back together into one byte array
    byte[] retVal = new byte[length];

    count = 0;
    foreach (byte[] temp in tempList)
    {
        Array.Copy(temp, 0, retVal, count, temp.Length);
        count += temp.Length;
    }

    return Encoding.UTF8.GetString(retVal);
}
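
If you want a quick sanity check that the pair round-trips correctly, something like this does the trick (the XML string here is just made up for illustration):

string original = "<order><id>12345</id><status>shipped</status></order>";
byte[] compressed = this.CompressData(original);
string roundTripped = this.DecompressData(compressed);
Console.WriteLine(roundTripped == original);   // should print True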

And here is a comparable set of methods using GZipStream:

protected byte[] CompressGZip(string xml)
{
    byte[] raw = Encoding.UTF8.GetBytes(xml);
    using (MemoryStream memory = new MemoryStream())
    {
        // leaveOpen: true so memory can still be read after the GZipStream is disposed
        using (GZipStream gzip = new GZipStream(memory, CompressionMode.Compress, true))
        {
            gzip.Write(raw, 0, raw.Length);
        }
        return memory.ToArray();
    }
}
protected string DecompressGZip(byte[] data)
{
    using (MemoryStream memory = new MemoryStream())
    {
        using (GZipStream stream = new GZipStream(new MemoryStream(data), CompressionMode.Decompress))
        {
            stream.CopyTo(memory);
            // use the commented code instead of CopyTo() for .NET prior to 4.0:
            //const int size = 4096;
            //byte[] buffer = new byte[size];
            //int count = 0;
            //while ((count = stream.Read(buffer, 0, size)) > 0)
            //    memory.Write(buffer, 0, count);
        }
        return Encoding.UTF8.GetString(memory.ToArray());
    }
}

I started with a small set of data and increased it several times over the course of testing. The largest string tested was 1,654,202 characters long, which compressed down to 113,180 bytes. Both sets of methods compressed the data to about the same size, with ours always coming out 24 bytes smaller than the GZipStream version, no matter how large the original data was. That constant difference is presumably mostly the fixed overhead of the gzip container (its header plus the CRC/length trailer) that GZipStream wraps around what is essentially the same deflate data.

Compression times were about the same for either method, but decompression was about twice as fast with GZipStream … though we're still only talking 40 milliseconds versus 20 milliseconds. Here's the code I used to benchmark this:

private void CompareCompressionAlgorithms()
{
    // get some data
    string data = this.GetSomeData(); // write this method to get data however you want to
    System.Diagnostics.Stopwatch stopWatch = new System.Diagnostics.Stopwatch();

    stopWatch.Start();
    byte[] dataBytes = this.CompressData(data);
    string dataCompressed = Convert.ToBase64String(dataBytes);
    stopWatch.Stop();

    Console.WriteLine("DeflateStream Class:");
    Console.WriteLine("Data Length: {0}", data.Length);
    Console.WriteLine("Compressed Bytes Length: {0}", dataBytes.Length);
    Console.WriteLine("Compressed String Length: {0}", dataCompressed.Length);
    Console.WriteLine("Total Milliseconds: {0}", stopWatch.ElapsedMilliseconds);

    // let's check the Decompress now
    stopWatch.Restart();
    string dataDecompressed = this.DecompressData(dataBytes);
    stopWatch.Stop();

    Console.WriteLine("Decompressed Length: {0}", dataDecompressed.Length);
    Console.WriteLine("Total Milliseconds: {0}", stopWatch.ElapsedMilliseconds);

    // Compare to using GZipStream
    stopWatch.Restart();
    dataBytes = this.CompressGZip(data);
    dataCompressed = Convert.ToBase64String(dataBytes);
    stopWatch.Stop();

    Console.WriteLine("GZipStream Class:");
    Console.WriteLine("Compressed Bytes Length: {0}", dataBytes.Length);
    Console.WriteLine("Compressed String Length: {0}", dataCompressed.Length);
    Console.WriteLine("Total Milliseconds: {0}", stopWatch.ElapsedMilliseconds);

    // now check the Decompress
    stopWatch.Restart();
    dataDecompressed = this.DecompressGZip(dataBytes);
    stopWatch.Stop();

    Console.WriteLine("Decompressed Length: {0}", dataDecompressed.Length);
    Console.WriteLine("Total Milliseconds: {0}", stopWatch.ElapsedMilliseconds);
}

Bottom line is that both methods work about the same, so choose whichever one you like.

Happy coding!