Sitecore Media Library -> Cloud

By default, all of Sitecore’s images are stored in the database and retrieved on the fly when an image is requested via the media library. This has real costs, and much has been written on the web about how to move Sitecore’s images from the database to the cloud. The specific reasons to do this are to reduce the size of the content database, reduce page load times and reduce database hits. I’ve looked through a few options for porting images to Windows Azure Blob Storage (WABS)*, and wanted to outline the approaches that people have taken, and how we ended up solving this problem.

*Other providers, such as Amazon S3, are also available.

The options for achieving this I have read about so far are broadly:

  1. Swap out the SqlServerDataProvider for your own subclassed provider that pumps the blob data out to Azure, using a GUID for its name (reference).
  2. Reroute the media library to an Azure storage blob using ARR or something similar. Example using Sitecore’s own configuration to reroute.

When I came to investigate these approaches, I found slight problems with both of them, and hence developed a slightly different approach, which is outlined below.

Option 1 – Swapping out the data provider

The first approach is broadly elegant and offers some significant benefits, but unfortunately it also has some drawbacks. The benefits are that, being low-level, it should preserve Sitecore’s image resizing capabilities via the pipeline steps, and it allows us to remove the images entirely from the database, which minimizes the size of the database and makes it easier to move around. It would also allow logic to be added that pulls images from Azure only if they exist, and falls back to the database when no image has been uploaded, making it more robust.
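
The fallback idea can be sketched roughly as follows. This is only an illustration, not the provider code itself: BlobReader and the two delegates (fromCloud, fromDatabase) are hypothetical names, and the sketch assumes a store signals "not found" by returning null.

```csharp
#nullable enable
using System;
using System.IO;
using System.Text;

// Usage with in-memory stand-ins: the cloud store is empty, the database has the blob.
var blobId = Guid.NewGuid();
Stream? stream = BlobReader.Open(
    blobId,
    fromCloud: id => null,
    fromDatabase: id => new MemoryStream(Encoding.UTF8.GetBytes("image bytes")));

Console.WriteLine(stream != null ? "served from fallback" : "missing everywhere");

// Hypothetical helper: tries the cloud first, then the database.
public static class BlobReader
{
    public static Stream? Open(
        Guid blobId,
        Func<Guid, Stream?> fromCloud,
        Func<Guid, Stream?> fromDatabase)
    {
        // Assumption: a store returns null when it does not hold the blob.
        return fromCloud(blobId) ?? fromDatabase(blobId);
    }
}
```

In a real provider the two delegates would wrap the Azure call and the base data provider call respectively.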

The main drawback with this approach is that because we are hooking into quite a low-level part of Sitecore, in the SqlServerDataProvider, the functions to read / write the blob only get limited information about it – a GUID which would identify that blob in the SQL table, and the data itself. This leaves a problem whereby once all your images are published to the cloud, they don’t have the same item hierarchy (folder path) that would have been present in Sitecore, and worse, they don’t necessarily have the correct extension. So whilst this solution works acceptably, it’s not easy to see what has happened if an image is missing, and eventually you will have one container with potentially thousands of unstructured images in it, all named by GUIDs.

I spent some time trying to amend this solution to save / retrieve the images via a path rather than a GUID, and there are some options here. Remember that the GUID you have is not the ID of the media item; it’s the ID of the blob in the database, so it’s not so easy to get from it to the item in order to look up the path.

The first option is to look up the path from the GUID via a database hit. There’s some SQL below which should do it, but I don’t like this solution: you’re getting rid of one database hit and introducing another. It also feels “dirty”. There are further options, such as creating and maintaining some sort of lookup and caching these hits, but the whole thing starts to feel pretty messy at this point.

SELECT TOP 100
    *
FROM
    Items I
    JOIN SharedFields S
        ON S.ItemId = I.ID
        AND S.FieldId = '{40E50ED9-BA07-4702-992E-A912738D32DC}' -- the shared field whose value is the blob ID
    LEFT JOIN Blobs B
        ON S.Value = B.BlobId
WHERE
    S.Value = '{B018B71D-681E-4771-88E6-EFF99994F979}' -- the blob GUID being looked up (example value)
ORDER BY
    I.Created DESC
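
For what it’s worth, the “lookup and cache” variation mentioned above might look something like the sketch below. BlobPathCache is a hypothetical class, and the delegate passed to it stands in for whatever runs the SQL query; neither is part of Sitecore. The path used in the usage lines is made up.

```csharp
using System;
using System.Collections.Concurrent;

// Usage: the delegate stands in for the SQL lookup and counts how often it runs.
var dbHits = 0;
var cache = new BlobPathCache(blobId =>
{
    dbHits++;
    return "/sitecore/media library/Images/" + blobId.ToString("N"); // made-up path
});

var id = Guid.NewGuid();
var first = cache.GetPath(id);
var second = cache.GetPath(id); // answered from the cache
Console.WriteLine($"same path: {first == second}, database hits: {dbHits}");

// Hypothetical helper that remembers blob ID -> item path lookups.
public sealed class BlobPathCache
{
    private readonly ConcurrentDictionary<Guid, string> _paths =
        new ConcurrentDictionary<Guid, string>();
    private readonly Func<Guid, string> _lookupPath;

    public BlobPathCache(Func<Guid, string> lookupPath)
    {
        _lookupPath = lookupPath ?? throw new ArgumentNullException(nameof(lookupPath));
    }

    // Returns the cached path, running the lookup only on a miss.
    // Under heavy concurrency the lookup may run more than once for the same key.
    public string GetPath(Guid blobId) => _paths.GetOrAdd(blobId, _lookupPath);

    // Call this when an item is moved or renamed so that a stale path is not served.
    public void Invalidate(Guid blobId) => _paths.TryRemove(blobId, out _);
}
```

The invalidation is where the messiness comes from: every move or rename of a media item has to reach this cache somehow.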

The second option I looked into was to try and hook into the process at a higher level where the Media item GUID is still available to use. Looking in the call stack for the ‘SetBlobStream’ function, we see the below.

[Image: Sitecore call stack]

To get the full path for an item when calling SetStream, we would need to get into this call stack a bit higher – ideally at the MediaData / Media class. There is some config that looks like it might wire this up in Sitecore:

    <mediaLibrary>
      <!-- MEDIA PROVIDER
         The media provider used to generate URLs, create media items, control media caching, parse media requests, and other
         media related functionality.      
      -->
      <mediaProvider type="Sitecore.Resources.Media.MediaProvider, Sitecore.Kernel" />
      <!-- MEDIA REQUEST PREFIXES 
           Allows you to configure additional media prefixes (in addition to the prefix defined by the Media.MediaLinkPrefix setting)
           The prefixes are used by Sitecore to recognize media URLs. 
           Notice: For each custom media prefix, you must also add a corresponding entry to the <customHandlers> section 
      -->
      <mediaPrefixes>
        <!-- Example
        <prefix value="-/media"/>
        -->
      </mediaPrefixes>
      <requestParser type="Sitecore.Resources.Media.MediaRequest, Sitecore.Kernel" />
      <mediaTypes>
        <mediaType name="Any" extensions="*">
          <mimeType>application/octet-stream</mimeType>
          <forceDownload>true</forceDownload>
          <sharedTemplate>system/media/unversioned/file</sharedTemplate>
          <versionedTemplate>system/media/versioned/file</versionedTemplate>
          <metaDataFormatter type="Sitecore.Resources.Media.MediaMetaDataFormatter" />
          <mediaValidator type="Sitecore.Resources.Media.MediaValidator" />
          <thumbnails>
            <generator type="Sitecore.Resources.Media.MediaThumbnailGenerator, Sitecore.Kernel">
              <extension>png</extension>
              <filePath>/sitecore/shell/themes/Standard/Applications/32x32/Document.png</filePath>
            </generator>
            <width>150</width>
            <height>150</height>
            <backgroundColor>#FFFFFF</backgroundColor>
          </thumbnails>
          <prototypes>
            <media type="Sitecore.Resources.Media.Media, Sitecore.Kernel" />
            <mediaData type="Sitecore.Resources.Media.MediaData, Sitecore.Kernel" />
          </prototypes>
        </mediaType>

However, I found that when I changed the type that mediaData should link to, the changes had no impact. I could see my class being instantiated at points during the rendering of an image, but unfortunately it wasn’t instantiated from the Media class, which is what I needed. Looking at the class, it has an injected reference to the MediaData class, but I can’t see where I can influence this in config, and I suspect it can’t be easily done. At this point I decided that this was probably a dead end for me, and there were easier ways to get images working in cloud storage. So I moved on to looking at other options.

Option 2 – Rerouting using rewrite rules (ARR / URL Rewrite)

An alternative to the above is to use some mechanism to push images to the cloud, and then reroute browser requests for media library URLs to the cloud, thereby bypassing Sitecore’s own media handler.

To achieve the first part of this solution and push images to the cloud, it made the most sense to follow the method outlined here – a publishItem pipeline step. This pipeline step is quite simple: it checks whether a published item is a media item, and if so pushes it up to the cloud when the item has been added or updated. A code sample from our solution is below. The IImageStore interface / implementation are not provided, but hopefully it’s still clear what this is trying to do.

using System;
using Sitecore.Data.Items;
using Sitecore.Publishing;
using Sitecore.Publishing.Pipelines.PublishItem;

public class PublishItemProcessor : Sitecore.Publishing.Pipelines.PublishItem.PublishItemProcessor
{
    private readonly IImageStore _imageStore;

    // Parameterless constructor for the pipeline; resolves the store from the IoC container.
    public PublishItemProcessor() : this(IoC.Unity.Resolve<IImageStore>())
    {
    }

    public PublishItemProcessor(IImageStore imageStore)
    {
        if (imageStore == null) throw new ArgumentNullException("imageStore");
        _imageStore = imageStore;
    }

    public override void Process(PublishItemContext context)
    {
        // Look the item up in the publish target; ignore anything that is not a media item.
        var target = context.PublishOptions.TargetDatabase.GetItem(context.ItemId, context.PublishOptions.Language);
        if (target == null || !target.Paths.IsMediaItem) return;

        var mediaItem = new MediaItem(target);
        switch (context.Action)
        {
            case PublishAction.PublishVersion:
            case PublishAction.PublishSharedFields:
                // Add or update: the store overwrites any existing copy.
                _imageStore.Add(mediaItem);
                break;

            case PublishAction.DeleteTargetItem:
                _imageStore.Remove(mediaItem);
                break;
        }
    }
}

The IImageStore implementation here only knows that it takes a media item and publishes it to the cloud, so whether the item was added or updated, the media in the cloud is overwritten. Unfortunately we found a slight idiosyncrasy here: it looks like the DeleteTargetItem PublishAction never fires. This didn’t turn out to be a significant problem. It may be necessary to add a clean-up step later that goes through the Azure Storage container and removes any orphaned items, but for now the orphaned items don’t do any harm. This publish pipeline step is configured as per the article referenced above, so I won’t repeat that configuration here.
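
A clean-up step of that kind would boil down to a set difference between what is in the container and what is still published. The sketch below shows only that comparison. OrphanFinder and the sample names are hypothetical, the two lists would really come from the Azure container listing and from the Sitecore media library, and the case-insensitive comparison is an assumption (Sitecore paths are case-insensitive, blob names are not).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Usage with made-up names: one blob no longer has a matching media item.
var cloudBlobs = new[] { "images/banner.jpg", "images/logo.png", "images/old-promo.gif" };
var publishedMedia = new[] { "Images/Banner.jpg", "images/logo.png" };

foreach (var orphan in OrphanFinder.FindOrphans(cloudBlobs, publishedMedia))
{
    Console.WriteLine("orphaned: " + orphan);
}

// Hypothetical helper: returns the blob names that no published media item accounts for.
public static class OrphanFinder
{
    public static IReadOnlyList<string> FindOrphans(
        IEnumerable<string> cloudBlobNames,
        IEnumerable<string> publishedMediaPaths)
    {
        // Assumption: names are compared without regard to case.
        var published = new HashSet<string>(publishedMediaPaths, StringComparer.OrdinalIgnoreCase);
        return cloudBlobNames.Where(name => !published.Contains(name)).ToList();
    }
}
```

Deleting the returned blobs is left out deliberately, since that part depends on the storage client in use.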

The second part of the solution was to rewrite requests for http://<servername>/~/media/ to https://<azure_storage_name>/media/, so that images are served from the cloud rather than pulled from the database. We had one additional requirement: Sitecore should still serve images that are being resized by the server. This is largely a backwards compatibility concern, and could again be achieved with a rewrite rule. The rule that was applied is broadly as below:

    <rule name="CloudImages" stopProcessing="true">
      <match url="~/media(?:/(.+\.(?:jpg|jpeg|png|gif|bmp)))" />

      <!-- We still want Sitecore's image resizing functionality, so don't reroute requests where this is being invoked. -->
      <conditions logicalGrouping="MatchAll" trackAllCaptures="true">
        <add input="{QUERY_STRING}" negate="true" pattern="(?:h=|w=|bc=|width=|height=)" />
      </conditions>

      <action type="Redirect" redirectType="Permanent" url="https://<cloud.server>/media/{R:1}" />
    </rule>

This rule matches all images served from the media library, with the listed extensions, and serves them from a cloud server rather than the Sitecore instance. Having configured this, voila! Images are now stored in the cloud as well as the database. This allows us to take some load off the Sitecore database, with minimal interruption and fuss, and should improve page load time when Sitecore is heavily contended as well.
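
To sanity-check which requests a rule like this would catch, the two patterns can be exercised outside IIS. The sketch below is an approximation, not the rewrite module itself. RedirectRule is a hypothetical helper, the patterns are my reading of the rule (with the dot before the extension escaped and the resize check matching anywhere in the query string), and case-insensitive matching is assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Usage: the first request is redirected, the other two are left for Sitecore.
foreach (var (url, query) in new[]
{
    ("~/media/Images/banner.jpg", ""),
    ("~/media/Images/banner.jpg", "w=200&h=100"),
    ("~/media/Files/brochure.pdf", ""),
})
{
    var redirected = RedirectRule.TryGetRedirectPath(url, query, out var path);
    Console.WriteLine(redirected ? $"{url} -> {path}" : $"{url} stays on Sitecore");
}

// Hypothetical stand-in for the rewrite rule's match and condition.
public static class RedirectRule
{
    // Media library URLs ending in one of the listed image extensions.
    private static readonly Regex MediaUrl = new Regex(
        @"~/media(?:/(.+\.(?:jpg|jpeg|png|gif|bmp)))", RegexOptions.IgnoreCase);

    // Query strings that ask for resizing; these requests must not be redirected.
    private static readonly Regex ResizeQuery = new Regex(
        @"(?:h=|w=|bc=|width=|height=)", RegexOptions.IgnoreCase);

    public static bool TryGetRedirectPath(string url, string queryString, out string path)
    {
        path = string.Empty;
        var match = MediaUrl.Match(url);
        if (!match.Success || ResizeQuery.IsMatch(queryString)) return false;

        // Mirrors the {R:1} back-reference in the redirect URL.
        path = "media/" + match.Groups[1].Value;
        return true;
    }
}
```

Because the resize check is unanchored, any query string that merely contains "h=" or "w=" also keeps the request on Sitecore, which errs on the safe side.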

Further work:

In an ideal world, the following requirements would additionally be satisfied by this solution:

  1. Backwards compatibility, Sitecore can fall back to the database where an image has failed to upload to the cloud.
  2. Image resizing / other pipeline steps can still be integrated where necessary, without fetching these images from the database.
  3. Image remove / publish deletes redundant images from the cloud.

These requirements may be looked at as part of a refinement to this solution at some point in the future, but for now they are not considered so important, so we will press on with this solution. Feel free to point out other articles / approaches you consider effective in the comments section!
