S3 with Gallery2 – Details

Amazon S3 is a low-cost way to store data with high reliability. You only pay for what you use in terms of disk space and bandwidth. Gallery2 is a full-featured web gallery written in PHP that runs on a choice of databases and web servers. Understanding how I forced them to work together will be helpful if you wish to use S3 with your photo gallery.

Briefly, you use s3fs to mount an S3 ‘bucket’ as a mount point on your server. You then put the g2data/albums and g2data/cache/derivative folders on your s3fs mount point, and point to those folders using symlinks in your g2data folder. You then slightly modify the Gallery2 rewrite rules so that image requests go to a CGI, which determines whether the S3 version should be used or not. This setup will work without the S3-directed rewrites, but they make things much faster.
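In code terms, the folder arrangement is just a move plus a symlink per folder. Here is a minimal Python sketch (the paths are assumptions for illustration; doing the same by hand with mv and ln -s works just as well):

    import os
    import shutil

    S3FS = "/mnt/s3"                        # assumption: where the bucket is mounted
    G2DATA = "/var/www/gallery2/g2data"     # assumption: your g2data location

    # Move each folder onto the s3fs mount, then symlink it back into g2data.
    for rel in ("albums", "cache/derivative"):
        src = os.path.join(G2DATA, rel)
        dst = os.path.join(S3FS, rel)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.move(src, dst)
        os.symlink(dst, src)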

Take a look at the diagram below (yeah, yeah, I’m not an artist):

[Diagram: the client browser talks to the web server (Gallery2, rewrite.py, and the database), and image requests get redirected to S3.]

This shows roughly the topology of the setup. Newbies wonder why one doesn’t host all their pages on S3. The answer is that S3 isn’t even Web 1.0; you can’t run CGI scripts, PHP, SQL databases, etc. All it does is hold data reliably and serve it quickly. The power of S3 is that you can stop shopping for hosts based on disk space. You can shop based on CPU cycles and customer support, which will make you much happier than you’d be with one of the large oversold shared hosts.

Here is an overview of what goes on:

  1. A Gallery2 image request comes in from the client web browser. The web server rewrites that request to a CGI called rewrite.py.
  2. rewrite.py uses the ID of the image to query the Gallery2 database directly and build a string with the location of that image on S3.
  3. It then gets the HEAD of that file from S3 over HTTP (not through s3fs). If S3 returns an error, rewrite.py knows that the image isn’t on S3 and returns to the client browser a URL pointing to the standard core.Download link. This should only happen for resized images that haven’t been built yet (like thumbnails).
  4. If the HEAD is successful, it then calculates the age of the file on S3. If it’s too old (older than refreshTime), rewrite.py returns to the browser a URL pointing to the standard core.Download link. This way, if the resized image needs to be rebuilt by Gallery2, it will happen eventually. It also ‘touches’ the file (through s3fs) so the next request can come from S3 if the file isn’t regenerated. (I’m not sure whether touching breaks Gallery2; it does if Gallery2 looks at file update times on disk rather than the database.)
  5. If the HEAD is successful and the file is new enough, rewrite.py returns an S3 URL for the image, and the client browser gets the image directly from S3. (A sketch of this logic follows.)
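Here is a minimal Python sketch of that decision logic, not my actual rewrite.py. The bucket name, mount point, and fallback view name are assumptions, and the database lookup that produces the S3 key is left out:

    #!/usr/bin/env python3
    # Minimal sketch of rewrite.py's decision logic (not the real script).
    import http.client
    import os
    import time
    from email.utils import parsedate_to_datetime

    BUCKET = "example-bucket"          # assumption: your S3 bucket name
    S3FS_MOUNT = "/mnt/s3"             # assumption: where s3fs is mounted
    REFRESH_TIME = 7 * 24 * 3600       # refreshTime, in seconds

    def fallback_url(item_id):
        # Standard Gallery2 download link, served by the web server itself.
        return ("/gallery2/main.php?g2_view=core.DownloadItem&g2_itemId=%d"
                % item_id)

    def rewrite(item_id, s3_key):
        """Return an S3 URL if the object exists and is fresh enough,
        otherwise the standard core.Download link."""
        conn = http.client.HTTPConnection("%s.s3.amazonaws.com" % BUCKET)
        conn.request("HEAD", "/" + s3_key)
        resp = conn.getresponse()
        if resp.status != 200:
            # Not on S3 yet -- e.g. a thumbnail Gallery2 hasn't built.
            return fallback_url(item_id)
        modified = parsedate_to_datetime(resp.getheader("Last-Modified"))
        if time.time() - modified.timestamp() > REFRESH_TIME:
            # Stale: serve through Gallery2 this time, and touch the file
            # through s3fs so the next request can come from S3 again.
            os.utime(os.path.join(S3FS_MOUNT, s3_key))
            return fallback_url(item_id)
        return "http://%s.s3.amazonaws.com/%s" % (BUCKET, s3_key)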

There are some caveats:

  1. Having the data on S3 breaks nothing about Gallery2. Gallery2 will function just the same. You can move images between galleries, delete them, resize them, and so on. The difference is that it will be much, much slower. s3fs does feature local caching, so once a file is fetched, future reads are quick. File writes, however, are limited by your network speed each and every time, which is much slower than a local disk. In my experience this mainly affects bulk item management. Since file operations are so much slower, processes may time out.

    If you’re concerned about image freshness, you can turn off the S3 rewrites so that all image requests go through the server. It will be slower, but you can be sure that all images are as they should be.

  2. This isn’t kosher when it comes to Gallery2 development. If I wanted to do it ‘right,’ I would have written a plugin that anyone could install on their Gallery2. I thought seriously about this, but decided that a kosher S3 plugin would have to change so many fundamental parts of Gallery2 that this hack was the much lower-effort option. This hack changes almost none of the Gallery2 files (except for the Htaccess.tpl file), which saved me time.

    I’ve written my CGI in Python against MySQL. I like and know Python better than PHP, and MySQL is what I use. I started to write the CGI functionality using embedded Gallery2, a more ‘proper’ way (the point of this thread), but I found it really, really slow. If you’re kooky enough to use a different SQL server, nerts to you. Again, if I had wanted to do this kosher, it would have taken much longer.

    Update: Above I say the embedded method is slow. That may not be true. I started development of this hack on my old shared host, which is really, really slow. Now that I’m trying it on my new server, it’s much faster. But I’ve already written all my stuff in Python, and I’m not going to rewrite it just yet. If you want to do this more kosher than I did, here’s the beginning of the code you’ll need:

    <?php
    require_once('./embed.php');

    // Initialize embedded G2 (adjust these URIs for your site).
    $ret = GalleryEmbed::init(array(
        'embedUri'      => 'http://mysite.com/gallery2/thisscript.php',
        'g2Uri'         => 'http://mysite.com/gallery2/',
        'loginRedirect' => 'http://mysite.com/gallery2/main.php',
        'activeUserId'  => ''));
    if ($ret) {
        print $ret->getAsHtml();
        exit;
    }

    // Load the derivative image named by the id in the query string.
    $id = (int)$_GET['id'];
    list ($ret, $myItem) = GalleryCoreApi::loadEntitiesById($id, 'GalleryDerivativeImage');
    if ($ret) {
        print $ret->getAsHtml();
        exit;
    }

    // Resolve the entity to its path on disk (i.e. on the s3fs mount).
    list ($ret, $myItemPath) = $myItem->fetchPath();
    if ($ret) {
        print $ret->getAsHtml();
        exit;
    }

    echo $myItemPath;
    ?>
  3. When you transfer files over to S3 the first time, you may be moving lots of data. If you have caching turned on, you will be doubling the amount of data on your disk. Think about how to do this so you don’t fill up your disk or quota: perhaps turn off caching while uploading, or upload in stages, clearing out your cache directory between runs (see the sketch after this list).
  4. Copying the entire derivative folder is a necessary evil. It’s necessary so that new image resizes get uploaded to S3 automatically. It’s evil because each resized image has a small *-meta.inc file associated with it, and putting lots of small files on S3 through s3fs is a bad idea because of the large per-file time cost.
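For the staged upload mentioned in caveat 3, here is a minimal sketch. The directory locations are assumptions; adjust them to your layout, and note that where s3fs keeps its local cache depends on how you configured it:

    import os
    import shutil

    ALBUMS = "/var/www/gallery2/g2data/albums"   # assumption: your albums folder
    S3FS_ALBUMS = "/mnt/s3/albums"               # assumption: albums on the s3fs mount
    S3FS_CACHE = "/tmp/s3fs-cache"               # assumption: s3fs local cache dir

    # Copy one album at a time, clearing the s3fs cache between runs so the
    # local cache never holds more than one album's worth of duplicate data.
    for album in sorted(os.listdir(ALBUMS)):
        src = os.path.join(ALBUMS, album)
        dst = os.path.join(S3FS_ALBUMS, album)
        if os.path.isdir(src) and not os.path.exists(dst):
            shutil.copytree(src, dst)
        for entry in os.listdir(S3FS_CACHE):
            path = os.path.join(S3FS_CACHE, entry)
            if os.path.isdir(path):
                shutil.rmtree(path)
            else:
                os.remove(path)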

There are things I think can be improved. Comments and suggestions are welcome!

  1. Instead of getting the HEAD from S3, use the Gallery2 database to decide whether an image needs to be refreshed. Gallery2’s embedded mode might make this easier. (A sketch of this idea follows the list.)
  2. Caching of rewritten URLs to save SQL queries? I don’t know if there’s a real advantage to this if I’m already making SQL queries.
  3. Convert my Python script to PHP/embedded Gallery2 mode. This would make this a bit more ‘kosher’ and future-proof. Any volunteers?
  4. I need to learn more about how Gallery2 interfaces with images on disk. Ideally I’d like to minimize that interaction because read/writes over s3fs are time costly. Are there any suggestions?
  5. If this becomes popular enough, I’ll put it in a version control system.
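For the first idea above, here is a sketch of what a database-side freshness check might look like. I’m assuming Gallery2’s default g2_/g_ table and column prefixes, and that you track the time each file was pushed to S3 yourself:

    import MySQLdb  # assumption: a MySQLdb/pymysql-style DB-API driver

    def is_fresh(db, item_id, uploaded_at):
        """True if the Gallery2 entity hasn't been modified since we
        pushed its file to S3 (uploaded_at: a Unix timestamp we track)."""
        cur = db.cursor()
        cur.execute(
            "SELECT g_modificationTimestamp FROM g2_Entity WHERE g_id = %s",
            (item_id,))
        row = cur.fetchone()
        return row is not None and row[0] <= uploaded_at

    # Usage sketch: db = MySQLdb.connect(user="gallery", db="gallery2")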

9 thoughts on “S3 with Gallery2 – Details”

  1. Nice idea. This will work well for those who host G2 from their home cable or DSL. You can serve the processing-heavy, bandwidth-friendly PHP/HTML from home and offload most of the big transfers to S3.

    I’d be interested in a “kosher” way to do this. Like you said, a G2 module may be more appropriate. Unfortunately, it looks like only Stopb from the gallery forum and I are interested.

    P.S. Please download Dia or buy Visio. Your diagram makes me sad…

  2. I too would prefer a G2 module for S3. But the lack of interest is probably not going away soon. Most people are happy with the (promised but practically unusable) terabytes they can get on shared webhosts. On my old shared host, I wasn’t running out of room, but I was completely dissatisfied with the speed and reliability of the server.

    When I wrote up this page, I intended to come back and make a prettier diagram. Obviously I never got around to that. Oh well!

  3. Hi Stephen,
    Thank you for these descriptions. Your gallery seems much faster than mine, which takes about 25 seconds to go from an album to the “add an item” page on Dreamhost. I wonder if this will help, so I’m ready to try… I’ll have to hire a programmer to do this, I believe.

    I’m just curious about one thing: I have 300 gigs of photos, all organized. When I try to add a folder with 30 sub-folders, I get memory errors. Do I overcome this PHP memory problem with Amazon S3, or is it still related to shared hosting?

    I mention this here: http://gallery.menalto.com/node/81666

  4. Looking at your error messages on the forums page, it certainly does appear that you’re hitting a memory maximum on Dreamhost. This is most certainly related to the low resource limits of shared hosting. I should point out that this S3+Gallery solution of mine will not work on Dreamhost, nor likely any other shared host, so you’d have to switch to a different kind of host to use this.

    I’m using a virtual host from Linode.com which I very much like. I’m using their lowest option which has more than enough power for me.

  5. I think I might go with Linode, or another service called Vpslink, plus Gallery and S3 as well. Do you have any experience with videos on Amazon S3? Do you think uploading or streaming might be an issue?

  6. Hi Stephen,

    I think I’ll go with Linode; they provide much more RAM at better prices, and I’ll need that for Dcraw raw conversions and FFmpeg. Would you like to give me your referral code?
