How to get rid of the BOM when downloading text from an Azure blob

When you download text content using CloudBlockBlob.DownloadText() and the blob starts with a BOM (byte order mark), the returned text contains some additional characters at the beginning.

For example, for a UTF-8 blob with a BOM, you will receive the ï»¿ sequence at the beginning of the content. If you then use the text, for example, to deserialize a JSON object, the deserialization fails with this message:

Message: Newtonsoft.Json.JsonReaderException : Unexpected character encountered while parsing value
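
The effect can be reproduced without a storage account. The following is a minimal sketch, not code from the storage library: BomDemo and its two methods are hypothetical names used only for illustration. It shows that Encoding.GetString() keeps the BOM in the decoded string as the character U+FEFF (the ï»¿ form is how the three bytes EF BB BF look in a single-byte code page), which is the unexpected character the JSON parser trips over.

```csharp
using System.Text;

// Hypothetical helper for illustration only; not part of any library.
public static class BomDemo
{
    // Decodes UTF-8 bytes the naive way. Encoding.GetString() does not strip
    // a BOM, so it stays in the result as the character U+FEFF.
    public static string DecodeNaively(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }

    // Builds sample bytes: the UTF-8 preamble (EF BB BF) followed by the text.
    public static byte[] WithUtf8Bom(string text)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes(text);
        byte[] result = new byte[preamble.Length + body.Length];
        preamble.CopyTo(result, 0);
        body.CopyTo(result, preamble.Length);
        return result;
    }
}
```

Calling BomDemo.DecodeNaively(BomDemo.WithUtf8Bom("{}")) returns a string of three characters, the first of which is '\uFEFF', rather than the two characters you would expect.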

How can you get rid of this BOM?

When you download a file using System.Net.WebClient.DownloadString(), the BOM is removed automatically by the DownloadString method and you receive only the content you expect. But to use this method to download a text blob, you need to create a SAS token and download the file using a URL that contains the SAS token.

If you take a look at the source code of WebClient.DownloadString() and the source code of CloudBlockBlob.DownloadText(), you will see that they behave differently.

I personally prefer the WebClient approach in most cases. So as a solution, I mixed some code from those classes to create a DownloadString extension method for CloudBlockBlob. To download text without the BOM, you can simply use it this way:
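
To see BOM-aware decoding in isolation, here is a small sketch that is not taken from either class. It uses StreamReader from the .NET base class library as a stand-in, because StreamReader also skips a leading BOM while reading. StreamReaderBomDemo and ReadWithoutBom are hypothetical names chosen for this example.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper for illustration only; not part of any library.
public static class StreamReaderBomDemo
{
    // Reads all text from the bytes through a StreamReader. The reader skips
    // a leading BOM and, because the third argument is true, switches to the
    // encoding that the BOM indicates. UTF-8 is used when there is no BOM.
    public static string ReadWithoutBom(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For the bytes EF BB BF 7B 7D (a UTF-8 BOM followed by "{}"), ReadWithoutBom returns just "{}", whereas a plain Encoding.UTF8.GetString() call on the same bytes keeps the BOM character.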

var connectionString = "connection string";
var storageAccount = CloudStorageAccount.Parse(connectionString);
var blobClient = storageAccount.CreateCloudBlobClient();
var container = blobClient.GetContainerReference("container name");
var blob = container.GetBlockBlobReference("blob name");
var text = blob.DownloadString();

DownloadString extension method

Here is the code for the extension method:

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Storage.Core;
using System.Text;

namespace AzureExtensions
{
    public static class CloudBlockBlobExtensions
    {
        /// <summary>
        /// Downloads the content of the given blob as a string and removes a leading BOM (byte order mark) if one is present.
        /// </summary>
        /// <param name="blob">The CloudBlockBlob to download.</param>
        /// <param name="encoding">Encoding used to decode the blob content. If null, the encoding is detected from the BOM, falling back to UTF-8.</param>
        /// <returns>The string content of the given blob, without a BOM.</returns>
        public static string DownloadString(this CloudBlockBlob blob, System.Text.Encoding encoding = null, AccessCondition accessCondition = null, BlobRequestOptions options = null, OperationContext operationContext = null)
        {
            using (SyncMemoryStream stream = new SyncMemoryStream())
            {
                blob.DownloadToStream(stream, accessCondition, options, operationContext);
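                // ToArray() returns only the bytes that were written. GetBuffer() would return
                // the whole internal buffer, including unused trailing zero bytes.
                byte[] streamAsBytes = stream.ToArray();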
                return GetStringUsingEncoding(streamAsBytes, encoding);
            }
        }
        private static string GetStringUsingEncoding(byte[] data, Encoding encoding = null)
        {
            int bomLengthInData = -1;

            // If the caller did not specify an encoding, check for a byte-order-mark (BOM)
            // in the data to figure out the encoding.
            if (encoding == null)
            {
                byte[] preamble;
                // UTF32 must be tested before Unicode because its BOM starts with the same bytes but is longer.
                Encoding[] encodings = { Encoding.UTF8, Encoding.UTF32, Encoding.Unicode, Encoding.BigEndianUnicode };
                for (int i = 0; i < encodings.Length; i++)
                {
                    preamble = encodings[i].GetPreamble();
                    if (ByteArrayHasPrefix(preamble, data))
                    {
                        encoding = encodings[i];
                        bomLengthInData = preamble.Length;
                        break;
                    }
                }
            }

            // Do we have an encoding guess?  If not, use default.
            if (encoding == null)
                encoding = Encoding.UTF8;

            // Calculate BOM length based on encoding guess.  Then check for it in the data.
            if (bomLengthInData == -1)
            {
                byte[] preamble = encoding.GetPreamble();
                if (ByteArrayHasPrefix(preamble, data))
                    bomLengthInData = preamble.Length;
                else
                    bomLengthInData = 0;
            }

            // Convert byte array to string stripping off any BOM before calling GetString().
            // This is required since GetString() doesn't handle stripping off BOM.
            return encoding.GetString(data, bomLengthInData, data.Length - bomLengthInData);
        }
        private static bool ByteArrayHasPrefix(byte[] prefix, byte[] byteArray)
        {
            if (prefix == null || byteArray == null || prefix.Length > byteArray.Length)
                return false;
            for (int i = 0; i < prefix.Length; i++)
            {
                if (prefix[i] != byteArray[i])
                    return false;
            }
            return true;
        }
    }
}
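
The order of the encodings in the detection loop matters, and a short sketch makes the reason visible. This is an illustration only: BomOrderDemo and its methods are hypothetical names, not part of the extension above. The UTF-32 little-endian BOM is FF FE 00 00 and the UTF-16 little-endian BOM is FF FE, so a UTF-32 file also matches the UTF-16 preamble.

```csharp
using System.Text;

// Hypothetical helper for illustration only; not part of any library.
public static class BomOrderDemo
{
    // Returns true when data starts with every byte of prefix.
    public static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (prefix.Length > data.Length)
            return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i])
                return false;
        }
        return true;
    }

    // Returns the first candidate whose preamble (BOM) prefixes the data,
    // or null when none matches. Candidates are tried in the given order.
    public static Encoding DetectFirstMatch(byte[] data, params Encoding[] candidates)
    {
        foreach (Encoding candidate in candidates)
        {
            byte[] preamble = candidate.GetPreamble();
            if (preamble.Length > 0 && StartsWith(data, preamble))
                return candidate;
        }
        return null;
    }
}
```

With the bytes FF FE 00 00 41 00 00 00 (the letter A in UTF-32 with a BOM), DetectFirstMatch(data, Encoding.UTF32, Encoding.Unicode) returns Encoding.UTF32, while swapping the two candidates returns Encoding.Unicode and the text would be decoded incorrectly.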

About the Author: Reza Aghaei
