Production Data to Development

December 1st, 2008

It's often helpful to be able to bring MySQL production data down to a development machine. I wanted to do this for my Mephisto blog engine (see gitting Started with Mephisto and Deploying Mephisto with Capistrano to DreamHost) so that I could see how changes would look with existing data before deploying an update. Production data is also useful for tracking down bugs with production data and for backups. This post describes the Capistrano namespace I created for doing this.

Before writing my own code I googled and found that Peter Harkins had posted Capistrano Task to Load Production Data with an implementation that did basically what I wanted. I also found a Joyent CodeSnippets post from Jamie Wilkinson that was based on Peter's code but added support for Capistrano 2.0 and gzip before doing the download. I didn't need the gzip support so I started with Peter's code. I'm using Capistrano 2.5.2 and I added additional code to include downloading files in the public/assets directory, pull the production database connection information from the server (instead of the development database.yml), get the password off of the mysqldump command line, split the download from the load to provide cached loading and backup functionality, and use rsync delta transfers to speed up the transfers.

With those changes I ended up with a larger (but still not big) implementation with three tasks in a namespace called data. You could load this from a separate recipe file but I just added it to the end of my deploy.rb file. Here's the namespace:

62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
# Derived from http://push.cx/2007/capistrano-task-to-load-production-data
namespace :data do
  require 'yaml'
  set :sql_dump_file, "dump.#{application}.sql"
  
  desc <<-DESC
    Get production data and load it into development. A new copy of the
    production database and production public/assets are cached and
    loaded into the development environment.
  DESC
  task :load_from_prod do
    cache_prod_data
    load_from_cache
  end
  
  desc <<-DESC
    Get production and cache it locally. The production database and 
    production public/assets directory are downloaded and cached.
  DESC
  task :cache_prod_data, :roles => :db, :only => { :primary => true } do
    on_rollback { run "rm #{current_path}/tmp/#{sql_dump_file}" }

    require 'yaml'

    database_yml = ""
    run "cat #{current_path}/config/database.yml" do |_, _, database_yml| end
    config = YAML::load(database_yml)['production']
    
    run "mysqldump #{mysql_options(config, true)}"+
        " > #{current_path}/tmp/#{sql_dump_file}" do |ch, _, out| 
      if out =~ /^Enter password: /
        ch.send_data "#{config['password']}\n"
      else
        puts out 
      end
    end
    
    FileUtils.mkdir_p("tmp/data/public/assets")

    logger.debug "rsyncing #{sql_dump_file} from #{application}"
    system "rsync -lrp "+
      "#{user}@#{application}:#{current_path}/tmp/#{sql_dump_file} tmp/data"
    run "rm #{current_path}/tmp/#{sql_dump_file}"
    
    logger.debug "rsyncing assets from #{application}"
    system "rsync -lrp --delete "+
             "#{user}@#{application}:#{shared_path}/assets tmp/data/public"
  end

  desc <<-DESC
    (Re)load cached production data into development. Use
    data:cache_prod_data or data:load_from_prod to create the cache.
  DESC
  task :load_from_cache do
    config = YAML::load_file('config/database.yml')['development']
    if File.exist?("tmp/data/#{sql_dump_file}") 
      mysql_load = "mysql #{mysql_options(config)} < tmp/data/#{sql_dump_file}"
      logger.debug %(executing "#{mysql_load.sub(/-p\S+/, '-px')}")
      system mysql_load 
      FileUtils.rm_rf("public/assets/*")
      FileUtils.cp_r("tmp/data/public/assets", "public")
    else
      abort "The data cache is empty, try 'cap data:load_from_prod'"
    end
  end
  
  # Return MySQL options for a specific configuration.
  def mysql_options(config, prompt_for_password=false)
    if config['password'] 
      password_opt = prompt_for_password ? " -p" : " -p#{config['password']}" 
    else
      password_opt = ""
    end
    "-u #{config['username']} -h #{config['host']} #{config['database']}"+ 
    password_opt
  end
  
end

And here's a breakdown of the tasks and their features:

  • cap data:cache_prod_data dumps and downloads the latest copy of the production database and the public/assets directory. The database configuration is pulled from the database.yml on the server so no account information is needed on the local machine. The MySQL password is kept out of the command line so it's left out of the Capistrano log and out of the process list (see Hiding MySQL Passwords with Capistrano). The download uses rsync to a cache in the local tmp/data directory. This is intended to make the transfer fast and eliminate the need for a creating new dump when reloading a previous dump is good enough. data:cache_prod_data is used as a step in data:load_from_prod but is also useful as standalone task for getting a backup copy of the database. It could be used in a crontab as is with backup software like Time Machine taking care of versioning. Alternatively, with some additional software you could create time stamped compressed versions of the tmp/data directory after download.
  • cap data:load_from_cache will reload a the last version of the production data, printing an error if the data doesn't exist in the local cache. This eliminates the dump and transfer when you're fine with just restarting with a fresh copy of the last downloaded version of production data. One scenario is that you've mucking with production data in the development environment and want to reload a fresh copy. Another is that you're doing regular back-ups using data:cache_prod_data during off hours and the cache contains current enough information for your development testing.
  • cap data:load_from_prod downloads the current production data into the cache and development environment by combining data:cache_prod_data and data:load_from_cache. If you always want the latest copy of production data you can use this task and forget about the other two.

Limitations

I'm sure there are lot more than this but the obvious ones are:

  1. The code specific to MySQL and specific to moving from production to development.
  2. It does a complete database dump. I think the rsync delta transfer and caching are sweet but if your database is quite big then some form of delta dumping and loading of the database will also be needed. Perhaps tracking changes since the last backup using some type of journaling or a similar technology along with smarter dumping.
  3. The entire production database is made available on a development machine. That's fine for many applications but if you have privacy concerns about the contents of the production database then copying the production database to a development machine isn't a good idea and in some cases, such as the need to comply with HIPPA regulations, it may even be a legal violation. In those situations you'll need to be smarter on the server side and either obfuscate or limit the what data is transfered.
  4. MySQL vulnerabilities to process list visibility are handled on the production server but not on the development client. Hiding MySQL Passwords with Capistrano describes the scope of the issue and suggestions for possible solutions if this is a problem in your environment.

Sorry, comments are closed for this article.