Production Data to Development
December 1st, 2008
It's often helpful to be able to bring MySQL production data down to a development machine. I wanted to do this for my Mephisto blog engine (see gitting Started with Mephisto and Deploying Mephisto with Capistrano to DreamHost) so that I could see how changes would look with existing data before deploying an update. Production data is also useful for tracking down bugs with production data and for backups. This post describes the Capistrano namespace I created for doing this.
Before writing my own code I googled and found that Peter Harkins had posted Capistrano Task to Load Production Data with an implementation that did basically what I wanted. I also found a Joyent CodeSnippets post from Jamie Wilkinson that was based on Peter's code but added support for Capistrano 2.0 and gzip before doing the download. I didn't need the gzip support so I started with Peter's code. I'm using Capistrano 2.5.2 and I added additional code to include downloading files in the public/assets directory, pull the production database connection information from the server (instead of the development database.yml), get the password off of the mysqldump command line, split the download from the load to provide cached loading and backup functionality, and use rsync delta transfers to speed up the transfers.
With those changes I ended up with a larger (but still not big) implementation with three tasks in a namespace called data. You could load this from a separate recipe file but I just added it to the end of my deploy.rb file. Here's the namespace:
62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 |
# Derived from http://push.cx/2007/capistrano-task-to-load-production-data namespace :data do require 'yaml' set :sql_dump_file, "dump.#{application}.sql" desc <<-DESC Get production data and load it into development. A new copy of the production database and production public/assets are cached and loaded into the development environment. DESC task :load_from_prod do cache_prod_data load_from_cache end desc <<-DESC Get production and cache it locally. The production database and production public/assets directory are downloaded and cached. DESC task :cache_prod_data, :roles => :db, :only => { :primary => true } do on_rollback { run "rm #{current_path}/tmp/#{sql_dump_file}" } require 'yaml' database_yml = "" run "cat #{current_path}/config/database.yml" do |_, _, database_yml| end config = YAML::load(database_yml)['production'] run "mysqldump #{mysql_options(config, true)}"+ " > #{current_path}/tmp/#{sql_dump_file}" do |ch, _, out| if out =~ /^Enter password: / ch.send_data "#{config['password']}\n" else puts out end end FileUtils.mkdir_p("tmp/data/public/assets") logger.debug "rsyncing #{sql_dump_file} from #{application}" system "rsync -lrp "+ "#{user}@#{application}:#{current_path}/tmp/#{sql_dump_file} tmp/data" run "rm #{current_path}/tmp/#{sql_dump_file}" logger.debug "rsyncing assets from #{application}" system "rsync -lrp --delete "+ "#{user}@#{application}:#{shared_path}/assets tmp/data/public" end desc <<-DESC (Re)load cached production data into development. Use data:cache_prod_data or data:load_from_prod to create the cache. DESC task :load_from_cache do config = YAML::load_file('config/database.yml')['development'] if File.exist?("tmp/data/#{sql_dump_file}") mysql_load = "mysql #{mysql_options(config)} < tmp/data/#{sql_dump_file}" logger.debug %(executing "#{mysql_load.sub(/-p\S+/, '-px')}") system mysql_load FileUtils.rm_rf("public/assets/*") FileUtils.cp_r("tmp/data/public/assets", "public") else abort "The data cache is empty, try 'cap data:load_from_prod'" end end # Return MySQL options for a specific configuration. def mysql_options(config, prompt_for_password=false) if config['password'] password_opt = prompt_for_password ? " -p" : " -p#{config['password']}" else password_opt = "" end "-u #{config['username']} -h #{config['host']} #{config['database']}"+ password_opt end end |
And here's a breakdown of the tasks and their features:
cap data:cache_prod_datadumps and downloads the latest copy of the production database and the public/assets directory. The database configuration is pulled from thedatabase.ymlon the server so no account information is needed on the local machine. The MySQL password is kept out of the command line so it's left out of the Capistrano log and out of the process list (see Hiding MySQL Passwords with Capistrano). The download uses rsync to a cache in the localtmp/datadirectory. This is intended to make the transfer fast and eliminate the need for a creating new dump when reloading a previous dump is good enough.data:cache_prod_datais used as a step indata:load_from_prodbut is also useful as standalone task for getting a backup copy of the database. It could be used in a crontab as is with backup software like Time Machine taking care of versioning. Alternatively, with some additional software you could create time stamped compressed versions of thetmp/datadirectory after download.cap data:load_from_cachewill reload a the last version of the production data, printing an error if the data doesn't exist in the local cache. This eliminates the dump and transfer when you're fine with just restarting with a fresh copy of the last downloaded version of production data. One scenario is that you've mucking with production data in the development environment and want to reload a fresh copy. Another is that you're doing regular back-ups usingdata:cache_prod_dataduring off hours and the cache contains current enough information for your development testing.cap data:load_from_proddownloads the current production data into the cache and development environment by combiningdata:cache_prod_dataanddata:load_from_cache. If you always want the latest copy of production data you can use this task and forget about the other two.
Limitations
I'm sure there are lot more than this but the obvious ones are:
- The code specific to MySQL and specific to moving from production to development.
- It does a complete database dump. I think the rsync delta transfer and caching are sweet but if your database is quite big then some form of delta dumping and loading of the database will also be needed. Perhaps tracking changes since the last backup using some type of journaling or a similar technology along with smarter dumping.
- The entire production database is made available on a development machine. That's fine for many applications but if you have privacy concerns about the contents of the production database then copying the production database to a development machine isn't a good idea and in some cases, such as the need to comply with HIPPA regulations, it may even be a legal violation. In those situations you'll need to be smarter on the server side and either obfuscate or limit the what data is transfered.
- MySQL vulnerabilities to process list visibility are handled on the production server but not on the development client. Hiding MySQL Passwords with Capistrano describes the scope of the issue and suggestions for possible solutions if this is a problem in your environment.
Sorry, comments are closed for this article.