Using bulkloader to back up, restore and migrate data

Bulkloader is a Google App Engine interface for uploading data to and downloading it from the datastore on Google's servers. It is useful for backing up, restoring and migrating application data, but documentation and usage examples are scarce, and in a complex application you are bound to run into various problems and bugs. I spent quite a long time digging through scattered sources of information and the SDK source code, reading bug reports and writing my own workarounds, and now I am ready to present some of the results as a detailed article.

Be warned: the article is very long.

I will not go into the details of creating App Engine applications: the topic has been covered many times, and there is a sea of examples, including in Russian. Synthetic examples, however, are usually hard to follow, so we will look at a “real” application, a personal blog engine; the subject is well known and easy to understand. As the backup file format we will use plain XML.

I will not retell the documentation here either.

Terminology

There is no established Russian-language terminology, so I have taken some liberties and call a “Kind” either a “class” or a “type”. “Entity” remains “entity”.

Import = deserialization = restore. Export = serialization = backup.

Preparation

From here on we use GAE SDK version 1.4.1 or higher on a local Unix/Linux machine; on Windows everything is almost the same. Working with an application on Google's servers has certain nuances, which you can read about in the official documentation; here we work only with a local server.

The main programs of the GAE SDK (appcfg.py, dev_appserver.py) should be runnable from the console (for example, with the SDK paths added to the appropriate environment variables).

What is bulkloader

Bulkloader is a Python framework, and to use it you will have to write not only configuration files but also code. Once you have mastered it, however, you get a very powerful mechanism for saving and restoring the data your application keeps in App Engine.

You choose the format in which data is stored on the local machine, and bulkloader converts it according to the rules you define (for import and for export). At the end of the article there is a link by which you can learn more about bulkloader.

For bulkloader to work, the application must first expose an API access point. So we turn on remote_api by adding the following section to the application configuration (app.yaml), if it is not already there:

    builtins:
    - remote_api: on

This section enables an API access point at http://servername/_ah/remote_api; for a local server with default settings it is http://localhost:8080/_ah/remote_api.

Data schema

Let's start with the application's data schema. For a blog everything is clear: articles (Article), comments (ArticleComment) and rendered articles (RenderedArticle). Comments form a tree. The rendered HTML of an article is stored in a separate datastore entity.

Entity classes refer to each other as follows:

Article → RenderedArticle (link to a rendered article in an article)
ArticleComment → Article (link to article from comment)
ArticleComment → ArticleComment (link to parent comment)

    from google.appengine.ext import db

    class RenderedArticle(db.Model):
        html_body = db.TextProperty(required=True)

    class Article(db.Model):
        shortcut = db.StringProperty(required=True)
        title = db.StringProperty(required=True)
        body = db.TextProperty(required=True)
        html_preview = db.TextProperty()
        rendered_html = db.ReferenceProperty(RenderedArticle)
        published = db.DateTimeProperty(auto_now_add=True)
        updated = db.DateTimeProperty(auto_now_add=True)
        tags = db.StringListProperty()
        is_commentable = db.BooleanProperty()
        is_draft = db.BooleanProperty()

    class ArticleComment(db.Model):
        parent_comment = db.SelfReferenceProperty()
        name = db.StringProperty()
        email = db.StringProperty()
        homepage = db.StringProperty()
        body = db.TextProperty(required=True)
        html_body = db.TextProperty(required=True)
        published = db.DateTimeProperty(auto_now_add=True)
        article = db.ReferenceProperty(Article)
        ip_address = db.StringProperty()
        is_approved = db.BooleanProperty(default=False)
        is_subscribed = db.BooleanProperty(default=False)

The models show that many different data types are used: two kinds of references, dates, strings, booleans and lists. Looking ahead, I will note that lists and references cause the most problems.

Checking the work of bulkloader

We fill the database with data and check that bulkloader works through the API:

    appcfg.py download_data --email=doesntmatter -A wereword --url=http://localhost:8080/_ah/remote_api --kind=Article --filename=out.dat

In the -A parameter we specify the name of the application, in --email any string (it does not matter for the local server), and in --kind the entity class we want to download (note the download_data action). After the command runs (just press Enter when prompted for a password), a backup file for the specified entity class (out.dat) and a bunch of logs (files named bulkloader-*) appear in the current directory. The default backup format is SQLite3, so the resulting file (out.dat) can be opened and studied in any SQLite3 viewer. Its structure is of little practical use (for migration, for example), so next we will write a config (and related files) for bulkloader so that data is exported in a format more convenient for us.
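
If you have no SQLite viewer at hand, the standard sqlite3 module is enough for a first look. This is only a sketch: the function name list_tables is mine, and I make no assumptions about which tables bulkloader creates; the code simply lists whatever it finds.

```python
import sqlite3


def list_tables(conn):
    """Return {table_name: row_count} for every table visible on the connection."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    counts = {}
    for name in names:
        # Quote the identifier so unusual table names do not break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute('SELECT COUNT(*) FROM ' + quoted).fetchone()[0]
    return counts
```

For the file from the command above you would call list_tables(sqlite3.connect('out.dat')).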

Writing a configuration file for bulkloader

The current SDK version supports two export/import formats, CSV and XML; we will use the second. The configuration file is a YAML file, a format already familiar to you, and it describes exactly how data is transformed on export from and import into the datastore. The official documentation explains how to generate a basic config from the application, but we will write ours from scratch. The file is called config.yaml. I usually create a separate backup directory in the application tree and put everything I need there; it hardly overlaps with the main application.

At the beginning, the python_preamble section lists the Python modules needed during export and import. Here is the “gentleman's set” of modules: base64 and re are standard Python modules, google.* are modules from the SDK, and helpers is our own module, the helpers.py file in the current directory. helpers.py will hold various workarounds and other useful functions for importing and exporting data; for now just create an empty file with that name, and we will add the code later.

    python_preamble:
    - import: base64
    - import: re
    - import: google.appengine.ext.bulkload.transform
    - import: google.appengine.ext.bulkload.bulkloader_wizard
    - import: google.appengine.ext.db
    - import: google.appengine.api.datastore
    - import: google.appengine.api.users
    - import: helpers

The next section of the config, transformers, describes the “converters” that turn entities into the local backup format and back. Here you describe all the fields of the entity class that you need. Each entity class gets its own section introduced by a kind key. Here is the simplest example of such a section, describing a converter for the Article class:

    transformers:
    - kind: Article
      connector: simplexml
      connector_options:
        xpath_to_nodes: "/blog/Articles/Article"
        style: element_centric
      property_map:
        - property: __key__
          external_name: key
          export_transform: transform.key_id_or_name_as_string

A small note: XPath support is very limited; in practice you can only use expressions of the form "/AAA/BBB/CCC".

Now download the data from the server using the newly created config (option --config):

    appcfg.py download_data --email=doesntmatter -A wereword --url=http://localhost:8080/_ah/remote_api --kind=Article --config=test.yaml --filename=Article.xml

And we get this final XML containing data about two objects:

    <?xml version="1.0"?>
    <blog>
      <Articles>
        <Article>
          <key>6</key>
        </Article>
        <Article>
          <key>8</key>
        </Article>
      </Articles>
    </blog>

Note that the XML contains only the fields we described in the transformers section of the configuration file; in our case, that is only the entity key. In the export_transform parameter we specified a converter for this field, transform.key_id_or_name_as_string, a function from the google.appengine.ext.bulkload.transform module. Fields of other types use other converter functions, and an ordinary Python lambda expression can also serve as a converter.

And now the complete part of the config describing the Article entity class:
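
To make the idea of “one converter function per field” more tangible, here is a toy model in plain Python. It is not how bulkloader is implemented: export_entity, the rule tuples and the dict standing in for an entity are all my own inventions for illustration; only the principle (a field is exported if and only if it is listed, and a converter is any one-argument callable) comes from the text above.

```python
import xml.etree.ElementTree as ET


def export_entity(entity, property_map, tag):
    """Build one XML element from a dict, using (property, external_name, converter) rules."""
    node = ET.Element(tag)
    for prop, external_name, export_transform in property_map:
        value = entity.get(prop)
        if export_transform is not None:
            value = export_transform(value)
        child = ET.SubElement(node, external_name)
        child.text = '' if value is None else str(value)
    return node


# Hypothetical rules: a lambda as a converter, and a field with no converter at all.
property_map = [
    ('__key__', 'key', lambda key: str(key)),
    ('title', 'title', None),
]
# 'body' is not listed in property_map, so it does not reach the XML.
article = {'__key__': 6, 'title': 'this is new article', 'body': 'aaa bbb ccc'}
xml_text = ET.tostring(export_entity(article, property_map, 'Article'), encoding='unicode')
```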

    - kind: Article
      connector: simplexml
      connector_options:
        xpath_to_nodes: "/blog/Articles/Article"
        style: element_centric
      property_map:
        - property: __key__
          external_name: key
          export_transform: transform.key_id_or_name_as_string
        - property: rendered_html
          external_name: rendered-html
          export_transform: transform.key_id_or_name_as_string
          # deep key! It's required here!
          import_transform: transform.create_deep_key(('Article', 'key'), ('RenderedArticle', transform.CURRENT_PROPERTY))
        - property: shortcut
          external_name: shortcut
        - property: body
          external_name: body
        - property: title
          external_name: title
        - property: html_preview
          external_name: html-preview
        - property: published
          external_name: published
          export_transform: transform.export_date_time('%Y-%m-%dT%H:%M')
          import_transform: transform.import_date_time('%Y-%m-%dT%H:%M')
        - property: updated
          external_name: updated
          export_transform: transform.export_date_time('%Y-%m-%dT%H:%M')
          import_transform: transform.import_date_time('%Y-%m-%dT%H:%M')
        - property: tags
          external_name: tags
          import_transform: "lambda x: x is not None and len(x) > 0 and eval(x) or []"
        - property: is_commentable
          external_name: is-commentable
          import_transform: transform.regexp_bool('^True$')
        - property: is_draft
          external_name: is-draft
          import_transform: transform.regexp_bool('^True$')

Let's analyze it in detail. Each field of the object gets a property entry, which describes the data conversion rules for that field.

The external_name parameter specifies the name of the corresponding element in the XML file.

The import_transform parameter holds the function for importing data: it converts a value from the backup into the field's data type. You can think of it as deserialization.

The export_transform parameter holds the function that converts a field into the text written to the backup, that is, serialization.

For simple types (String, for example), the import and export functions need not be specified explicitly; the standard ones are used and are quite sufficient. The other types are discussed separately below.

Let's start with the rendered_html field. First, it is a reference to an object of another class (RenderedArticle in our case); second, that RenderedArticle object is a child of the corresponding Article object. Therefore, during deserialization we must “construct” a valid reference to the object; this is done from the values of two fields using the standard transform.create_deep_key function:

    - property: rendered_html
      external_name: rendered-html
      export_transform: transform.key_id_or_name_as_string
      # deep key! It's required here!
      import_transform: transform.create_deep_key(('Article', 'key'), ('RenderedArticle', transform.CURRENT_PROPERTY))

Note that the import_transform and export_transform parameters must contain expressions that evaluate to a function taking one argument and returning one value. In the example above we see a function call with specific arguments: this function is a kind of factory that returns a ready-made conversion function. transform.create_deep_key takes several two-element tuples, each describing one level in the chain of object relationships; a tuple contains the name of the entity class and the name of an element (from the XML file), and the key is built from the values of the specified elements.

In our case the chain consists of two objects, and we use transform.CURRENT_PROPERTY to avoid spelling out the name of the current field in the chain. In principle, you could just as well write the element name explicitly instead of transform.CURRENT_PROPERTY.
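
The “function that returns a function” pattern is easy to show without the SDK. The sketch below is a simplified stand-in, not the SDK code: make_deep_key and the CURRENT_PROPERTY sentinel are hypothetical names, a key is represented by a flat tuple instead of a datastore key, and the distinction between numeric ids and string names is left out entirely.

```python
CURRENT_PROPERTY = object()  # hypothetical stand-in for transform.CURRENT_PROPERTY


def make_deep_key(*path_info):
    """Return a converter that builds a key path from one row of the backup."""
    def convert(value, current_row):
        path = []
        for kind, element_name in path_info:
            # The sentinel means "use the value of the field being converted".
            raw = value if element_name is CURRENT_PROPERTY else current_row[element_name]
            path.extend([kind, raw])
        return tuple(path)
    return convert


to_key = make_deep_key(('Article', 'key'), ('RenderedArticle', CURRENT_PROPERTY))
row = {'key': '6', 'rendered-html': '7'}
key_path = to_key(row['rendered-html'], row)
```

The outer call runs once, when the config is read; the inner function runs for every imported row.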

Date fields also require special handling, but it is simple: we use the function generators from the SDK, passing a date/time format pattern as the argument:

    - property: published
      external_name: published
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M')
      import_transform: transform.import_date_time('%Y-%m-%dT%H:%M')
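
One consequence of the chosen pattern is worth seeing in plain Python. The sketch uses only the standard datetime module with the same pattern; export_date and import_date are my own helper names, not SDK functions. Since the pattern has no seconds, they do not survive a round trip.

```python
from datetime import datetime

FORMAT = '%Y-%m-%dT%H:%M'  # the same pattern as in the config above


def export_date(value):
    """Serialize a datetime with the backup pattern."""
    return value.strftime(FORMAT)


def import_date(text):
    """Parse a datetime written with the backup pattern."""
    return datetime.strptime(text, FORMAT)


original = datetime(2011, 1, 20, 8, 19, 42)
text = export_date(original)
restored = import_date(text)  # the seconds are not part of the pattern, so they are lost
```

If you need exact timestamps after a restore, extend the pattern (for example with %S) in both the export and the import function.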

For fields holding a list of strings, the standard method is used for serialization, so nothing needs to be written there, but import requires special handling:

    - property: tags
      external_name: tags
      import_transform: "lambda x: x is not None and len(x) > 0 and eval(x) or []"

On export (serialization), a list of strings is converted to an element like this:

 <tags>[u'x2', u'another string']</tags>

However, an empty list of strings is converted to an empty string:

 <tags></tags>

On import with the standard converter, an empty field becomes None, which is obviously not a valid list and causes problems when the application tries to read the field. Therefore we use a lambda expression that performs a (relatively) correct conversion. However, because of a bug in the SDK's field type validator, even this helps only a little.

For Boolean fields we also use a simple converter for deserialization:
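
A side remark on eval: it executes whatever ends up in the backup file. As a possible alternative (my suggestion, not something taken from the SDK), the same conversion can be written with ast.literal_eval, which accepts only literals. The function name parse_string_list is hypothetical; it could live in helpers.py and be referenced from the config as helpers.parse_string_list.

```python
import ast


def parse_string_list(text):
    """Turn "[u'x2', u'another string']" into a list; empty or missing text becomes []."""
    if text is None or not text.strip():
        return []
    value = ast.literal_eval(text)  # evaluates literals only, never arbitrary code
    if not isinstance(value, list):
        raise ValueError('expected a list literal, got %r' % (value,))
    return value
```

This does not cure the validator bug mentioned above; it only makes the parsing step itself safer.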

    - property: is_commentable
      external_name: is-commentable
      import_transform: transform.regexp_bool('^True$')

With standard export, Boolean values are converted to the strings “True” and “False”; on import we use a more general rule: only the string “True” is converted to True, and everything else becomes False.

The resulting XML file with the exported objects of the Article class looks like this:
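
The rule is small enough to model in a few lines. This is a stand-in written for illustration, assuming the behaviour described above; make_regexp_bool is a hypothetical name and the code is not copied from the SDK.

```python
import re


def make_regexp_bool(pattern):
    """Return a converter: True if the text matches the pattern, False otherwise."""
    compiled = re.compile(pattern)

    def convert(text):
        return bool(text is not None and compiled.match(text))

    return convert


to_bool = make_regexp_bool('^True$')
```

Note that the match is case-sensitive, so a hand-edited backup containing "true" or "1" would be restored as False.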

    <?xml version="1.0"?>
    <blog>
      <Articles>
        <Article>
          <body>aaa bbb ccc</body>
          <updated>2011-01-20T08:19</updated>
          <key>6</key>
          <is-draft>False</is-draft>
          <title>this is new article</title>
          <html-preview><p>aaa bbb ccc</p></html-preview>
          <tags></tags>
          <shortcut>short-cut-1295418565</shortcut>
          <rendered-html>7</rendered-html>
          <published>2011-01-19T06:29</published>
          <is-commentable>True</is-commentable>
        </Article>
        <Article>
          <body>ff gg hh</body>
          <updated>2011-01-19T06:30</updated>
          <key>8</key>
          <is-draft>False</is-draft>
          <title>another article</title>
          <html-preview><p>ff gg hh</p></html-preview>
          <tags>[u'x2']</tags>
          <shortcut>short-cut-1295418590</shortcut>
          <rendered-html>9</rendered-html>
          <published>2011-01-19T06:29</published>
          <is-commentable>True</is-commentable>
        </Article>
      </Articles>
    </blog>
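
A backup in this form is easy to process outside App Engine, which is the whole point of choosing XML for migration. Below is a minimal sketch with the standard ElementTree module; read_articles is my own helper name and SAMPLE is a shortened, hand-made copy of the file above, not real output.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same layout as the export (illustrative data only).
SAMPLE = """<?xml version="1.0"?>
<blog>
  <Articles>
    <Article>
      <key>6</key>
      <title>this is new article</title>
      <tags></tags>
    </Article>
    <Article>
      <key>8</key>
      <title>another article</title>
      <tags>[u'x2']</tags>
    </Article>
  </Articles>
</blog>"""


def read_articles(xml_text):
    """Return one dict per <Article>, mapping element names to their raw text."""
    root = ET.fromstring(xml_text)
    return [{child.tag: (child.text or '') for child in node}
            for node in root.findall('./Articles/Article')]


articles = read_articles(SAMPLE)
```

All values come back as strings, exactly as bulkloader wrote them; converting them to dates, lists or booleans is a separate step.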

Work with object relationship chains

Relationships, or dependencies, between objects are established with the parent argument when an entity object is created. The new object then lands in the same entity group as the object given in parent. This makes it possible, for example, to use transactions to preserve data integrity. Chains of relationships must be handled in a special way during import and export, and there are several nuances, which we consider below.

So, we have the Article entity class. Objects of this type are articles; each contains the article source in a markup language, a short preview and other service information. The article text rendered to HTML is stored in a separate object of the RenderedArticle class. The rendered text was moved into a separate entity class to get around App Engine's limit on the total size of an object; in effect, Article and RenderedArticle objects are in a one-to-one relationship. The RenderedArticle object is created in the same entity group as the Article object.

Here is the part of config.yaml that describes the RenderedArticle entity class:

    - kind: RenderedArticle
      connector: simplexml
      connector_options:
        xpath_to_nodes: "/blog/RenderedArticles/Article"
        style: element_centric
      property_map:
        - property: __key__
          external_name: key
          export:
            - external_name: ParentArticle
              export_transform: transform.key_id_or_name_as_string_n(0)
            - external_name: key
              export_transform: transform.key_id_or_name_as_string_n(1)
          import_transform: transform.create_deep_key(('Article', 'ParentArticle'), ('RenderedArticle', transform.CURRENT_PROPERTY))
        - property: html_body
          external_name: html-body

Notice how the data export is described in the example above. First, the single key field of the object is converted into two elements in the backup. Second, on import the key field is “assembled” from the values of two elements, ParentArticle and key. The expression transform.key_id_or_name_as_string_n(0) returns a function that, applied to the key field, returns the specified component of the composite key.
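
The splitting of a composite key can be pictured as follows. Again this is a toy model under my own assumptions: a key is a flat tuple of (kind, id, kind, id, ...) rather than a datastore key, and key_component_as_string is a hypothetical name.

```python
def key_component_as_string(index):
    """Return a converter that extracts the id or name at the given level of a key path."""
    def convert(key_path):
        # Level 0 is the root ancestor, level 1 its child, and so on.
        return str(key_path[index * 2 + 1])
    return convert


rendered_key = ('Article', 6, 'RenderedArticle', 7)
parent_id = key_component_as_string(0)(rendered_key)  # goes to <ParentArticle>
own_id = key_component_as_string(1)(rendered_key)     # goes to <key>
```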

The generated XML based on this config looks like this:

    <?xml version="1.0"?>
    <blog>
      <RenderedArticles>
        <Article>
          <ParentArticle>6</ParentArticle>
          <html-body><p>aaa bbb ccc</p></html-body>
          <key>7</key>
        </Article>
        <Article>
          <ParentArticle>8</ParentArticle>
          <html-body><p>ff gg hh</p></html-body>
          <key>9</key>
        </Article>
      </RenderedArticles>
    </blog>

Now consider export and import of objects of the ArticleComment class. Recall that comments form a tree, that is, a comment can have a “parent” comment; in addition, each comment holds a reference to the article it belongs to.

    - kind: ArticleComment
      connector: simplexml
      connector_options:
        xpath_to_nodes: "/blog/Comments/Comment"
        style: element_centric
      property_map:
        - property: __key__
          external_name: key
          export_transform: transform.key_id_or_name_as_string
          import_transform: transform.create_deep_key(('Article', 'article'), ('ArticleComment', transform.CURRENT_PROPERTY))
        - property: parent_comment
          external_name: parent-comment
          export_transform: transform.key_id_or_name_as_string
          import_transform: helpers.create_deep_key(('Article', 'article'), ('ArticleComment', transform.CURRENT_PROPERTY))
        - property: article
          external_name: article
          export_transform: transform.key_id_or_name_as_string
          import_transform: transform.create_foreign_key('Article')
        - property: name
          external_name: name
        - property: body
          external_name: body

At first glance everything looks simple, but at one point the default behavior of the converters breaks down. Note that the parent_comment field can be None, which means a top-level comment. If we use transform.create_deep_key during import, we get an error on the None value:

  BadArgumentError: Expected an integer id or string name as argument 4;  received None (a NoneType).

I filed a bug about this error, but so far I have received no reaction from the developers. To get around the bug we use the helpers.py file, where we put a replacement for transform.create_deep_key. The workaround is very simple: we generate the key only if the value is not None:

    from google.appengine.ext.bulkload import transform

    def create_deep_key(*path_info):
        """None-tolerant replacement for transform.create_deep_key."""
        f = transform.create_deep_key(*path_info)

        def create_deep_key_lambda(value, bulkload_state):
            # A missing reference stays None instead of raising BadArgumentError.
            if value is None:
                return None
            return f(value, bulkload_state)

        return create_deep_key_lambda

If anyone is interested, I can explain in the comments what happens in this function in more detail.

This way an optional object reference is restored correctly.

Now for the article field, which holds a reference to the article the comment belongs to. To restore the reference we use transform.create_foreign_key; it works like transform.create_deep_key, only without regard to chains of relationships. I want to draw attention to a potential bug here: if the reference is empty, on restore you will hit exactly the same error as described a couple of paragraphs above.
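
The guard used in helpers.py can be generalized to any converter, which may be one way to protect the article field as well. The sketch below runs without the SDK: none_safe is a hypothetical helper of mine, and strict_foreign_key is only a stand-in that imitates a converter failing on None; it is not the SDK function.

```python
def none_safe(transform_function):
    """Wrap a converter so that None passes through instead of reaching it."""
    def guarded(value, *args):
        if value is None:
            return None
        return transform_function(value, *args)
    return guarded


def strict_foreign_key(value):
    """Stand-in for a converter that cannot handle None (illustration only)."""
    if value is None:
        raise ValueError('Expected an integer id or string name; received None')
    return ('Article', value)


safe_foreign_key = none_safe(strict_foreign_key)
```

In a real config the wrapped object would be the function returned by the SDK factory; whether that function takes one or two arguments is covered by *args.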

Conclusion

It is quite possible to work with bulkloader, but you have to be very careful. You need to keep an eye on announcements and read the documentation carefully after each SDK release, since not all changes make it into the changelog. Working with binary data has also been left out of this overview, but it is simple:

    - property: data
      external_name: data
      export_transform: base64.b64encode
      import_transform: transform.blobproperty_from_base64
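
Base64 is used because raw bytes cannot be placed into an XML text node as they are. The round trip itself can be checked with the standard library alone; the payload here is arbitrary sample data.

```python
import base64

# Arbitrary binary sample, including bytes that are not valid text.
payload = b'\x89PNG\r\n\x1a\n\x00binary'

encoded = base64.b64encode(payload)   # ASCII-only, safe to store in an XML element
decoded = base64.b64decode(encoded)   # restores the original bytes exactly
```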

Next time we will talk about the features of localization in GAE-python-django applications.