Programmatic Manipulation of Binary Excel (.xls) Files

A few months ago I worked on a small spare time project which included some manipulation of binary Excel (.xls) files. This seemingly simple task soon turned out to be quite a challenge if you want to handle it right. The post you are reading is a short summary of my experiences. They should make your choices easier if you are about to tackle a similar problem.

The most obvious choice for handling .xls files is Excel automation using the Excel object model. As long as your application is always going to be used interactively, you should be fine. It's probably the best method to use in spite of a few downsides:

  • The object model is COM based which means you'll have to do interop if you are developing a .NET application. Fortunately there are some nice improvements in .NET Framework which make coping with COM in C# much easier.
  • Your application will have to be compiled in 32-bit in order for it to work on 64-bit Windows. This usually isn't a real limitation just don't forget to switch the target platform from Any CPU to x86 if you're doing development on a 32-bit OS to avoid the problem. (You're going to notice it yourself when developing on a 64-bit OS.)
  • Don't try unit testing the Excel automation code outside your IDE (e.g. on your continuous integration server). As described below this scenario is not supported.

As soon as you want your code to be run non-interactively, you're out of luck with Excel automation. Since all Office applications assume to be running on the interactive desktop, there are several reasons why such usage is not supported and is even strongly discouraged by Microsoft. If you want to run your code on a web server, as a Windows service, a scheduled task or just automatically test it on your build server, you'll have to find a different approach. And in this case there are no obvious choices.

Your best bet is to use Open XML file format (.xlsx, .xlsm) instead of the binary one (.xls) if this is an option for you. Since this is a well documented XML based format you can manipulate it directly without running Excel at all. You can even use Open XML SDK for Microsoft Office which includes strongly typed classes that simplify many common tasks.

This is not the case with the binary format. Microsoft doesn't provide or support any SDK or API for manipulating the files directly. You are only provided with detailed documentation of the format. Based on it some third party solutions have been developed. Aspose and SoftArtisans have their own commercial offerings which I haven't evaluated because as such they weren't suitable for me.

On the other hand the only free library that I have found is MyXLS and this one leaves much to be desired. It also seems to be pretty much abandoned with the latest release almost a year ago and only a single commit to the repository in this year so far. That being said, it still might prove useful if you only want to create the files, mostly focusing on the appearance with only minimal requirements regarding formulas. According to the samples this seems to work fine. Reading existing files is another story. You are more or less on your own as soon as you need to read cells with formulas. This made the library useless for me therefore I used another approach based on the fact that the OLE DB Provider for Jet can be used to read and write data in Excel worksheets.

At first I haven't even considered this possibility because I was convinced that this method can only be used on worksheets designed as database tables (having columns of data with or without header columns). This article proved me wrong and in spite of many issues in the demo project it showed off techniques which turned out really useful for me. The most important one was the possibility to address regions, not only complete worksheets – this can be used to retrieve and set individual cell values as well as for accessing database table-like blocks of data in a part of the worksheet. Let's take a look at a sample:

SELECT F1 FROM [Sheet1$C2:C2];

This query retrieves a value of a single cell. The same syntax can be used to access any region: after the worksheet name with a trailing $ character the region is defined just like in Excel formulas. If the region doesn't include column headers (specified by including HDR=No in the Extended Properties of the connection string) the columns are named F# where # is the sequential number of the column in the defined region. Similarly the following query can be used to set a value of an individual cell:

UPDATE [Sheet1$C2:C2] SET F1 = 'Value';

A few more things worth mentioning:

  • Be aware of the IMEX option in the connection string. It doesn't seem to be documented very well, probably its best description is here.
  • Make sure you keep the connection to the file open if you plan to do more data access later. Opening the connection takes quite some time therefore the performance will be terrible if you keep closing and reopening it.
  • The technique doesn't seem to work if the worksheet name begins with a space. At least I couldn't make it work, no matter how I set the quotes.
  • I encountered a file which couldn't be opened in this way but the problem was resolved after I opened the file in Excel and saved it again.
Copyright
Creative Commons License