Migrating DasBlog Content to Markdown Based DocPad Site

Source Data

One of the most daunting parts of replacing my current blogging platform, DasBlog, with a site created with DocPad is the migration of existing content. Being a software developer, I wanted to automate as much of the process as possible. Even if the total time required wouldn't be all that much shorter, I'd rather spend it writing scripts and learning new tools and technologies than doing mundane tasks.

The most critical part of the conversion was the switch from HTML content in DasBlog to Markdown content in my DocPad site. Although DocPad would have allowed me to use a different source format for old blog posts, this would have made it much more difficult to use consistent styles across all posts. I soon realized that there are not many options available for converting HTML to Markdown. In the end I chose to-markdown for JavaScript by Dom Christie. Surprisingly enough, installing it turned out to be the most challenging part.

The next obstacle was getting the content out of DasBlog. At first I wanted to use its internal dayentry.xml files directly, but they don't seem to contain the permalinks for blog posts. I tried looking at the DasBlog sources, but quickly decided to search for an alternative solution. I stumbled across an interesting tool for exporting DasBlog content to BlogML. Unfortunately the original link to its sources didn't work anymore, so I had to settle for a binary-only repository I happened to find.

Blog Post Conversion

Selecting to-markdown as my conversion library meant that I was going to write my conversion script in Node.js. Although I probably could have written a command line script directly in Node.js, I decided to take advantage of Grunt instead, since I had more previous experience with it. Its direct support for CoffeeScript scripts was just an extra bonus. I even knew already how to debug the script in WebStorm, my favorite IDE for the JavaScript stack.

Writing a custom Grunt multi task is simple enough:

module.exports = (grunt) ->
  grunt.initConfig(
    {
      blogml2docpad:
        convert:
          src: './DasBlog.xml'
          dest: './DocPad/'
    }
  )

  grunt.registerMultiTask('blogml2docpad', \
  'Convert BlogML file to Markdown files for DocPad', ->
    grunt.log.writeln('src : ' + this.data.src)
    grunt.log.writeln('dest : ' + this.data.dest)
  )

  grunt.registerTask('default', ['blogml2docpad'])

Next, I was ready to read the contents of the DasBlog.xml file, my BlogML export of DasBlog content. I chose xml2js as the XML parser and dumped all blog posts to the console, unmodified:

exportPost = (post) ->
  # xml2js puts the text of an element that also has attributes under the "_" key
  grunt.log.writeln(post.content[0]._)

grunt.registerMultiTask('blogml2docpad', \
'Convert BlogML file to Markdown files for DocPad', ->
  fs = require 'fs'
  xml2js = require 'xml2js'

  parser = new xml2js.Parser

  data = fs.readFileSync this.data.src
  parser.parseString data, (err, result) =>
    # stop the task if the BlogML file cannot be parsed
    grunt.fail.fatal err if err
    exportPost post for post in result.blog.posts[0].post
)

You might have noticed how I accessed the /blog/posts element in the XML document: result.blog.posts[0].post. By default, xml2js wraps child elements in arrays, even when there is only one of them, hence the [0] index. I found the syntax not quite intuitive and had to inspect the result object in the debugger for some time before getting used to it.
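To make that structure more tangible, here is a minimal sketch in plain JavaScript (so it runs in Node.js without compiling CoffeeScript). It does not call xml2js at all; sampleResult is a hand-written stand-in for what the parser returns with its default options, where child elements become arrays, attributes live under $ and text content under _. The sample values and the summarizePosts helper are hypothetical and only serve as illustration.

```javascript
// Hand-written stand-in for the object xml2js returns with default options.
// It mirrors this BlogML fragment (sample data, not taken from a real export):
//   <blog><posts>
//     <post date-created="2015-07-19T10:00:00">
//       <title type="text">Hello World</title>
//       <content type="text">&lt;p&gt;Hi&lt;/p&gt;</content>
//     </post>
//   </posts></blog>
const sampleResult = {
  blog: {
    posts: [
      {
        post: [
          {
            $: { 'date-created': '2015-07-19T10:00:00' },
            title: [{ _: 'Hello World', $: { type: 'text' } }],
            content: [{ _: '<p>Hi</p>', $: { type: 'text' } }]
          }
        ]
      }
    ]
  }
};

// Hypothetical helper: flattens every post into a plain object.
function summarizePosts(result) {
  // posts is an array with a single <posts> element; its post property
  // is the array of all <post> elements.
  return result.blog.posts[0].post.map((post) => ({
    created: post.$['date-created'], // attribute value
    title: post.title[0]._,          // text of the first <title> element
    html: post.content[0]._          // text of the first <content> element
  }));
}

console.log(summarizePosts(sampleResult));
```

Once the pattern is clear (index with [0] to step into a single child element, read $ for attributes and _ for text), the rest of the conversion code becomes much easier to follow.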

Converting that HTML to Markdown couldn't have been easier:

exportPost = (post) ->
  toMarkdown = require 'to-markdown'
  grunt.log.writeln(toMarkdown(post.content[0]._))

Generating DocPad Source Files with Metadata Header

Once I got the basic conversion working, it was time to write the posts to files which could be used by DocPad, instead of just dumping them to the console. Since I had already settled on the filename structure for blog posts (<date>-<title>.html.md, e.g. 20150719-MigratingDasBlogContentToMarkdownBasedDocPadSite.html.md for the post you're currently reading), I had to generate such filenames based on post metadata in BlogML:

createFilename = (post) ->
  moment = require 'moment'
  slug = require 'slug'
  titleCase = require 'title-case'

  datePrefix = moment(post.$['date-created']).format('YYYYMMDD')
  titleSlug = slug(titleCase(post.title[0]._), '')
  datePrefix + '-' + titleSlug + '.html.md'

# params is assumed to be the task configuration (this.data), passed in by the caller
exportPost = (post, params) ->
  fs = require 'fs'
  toMarkdown = require 'to-markdown'

  filename = createFilename post
  contents = toMarkdown post.content[0]._
  fs.writeFileSync params.dest + filename, contents

Since my site needs some metadata about the posts to display them correctly, it's not enough to have just the blog post in the file; a YAML header with metadata is required as well. This is how it looks for this blog post:

---
title: "Migrating DasBlog Content to Markdown Based DocPad Site"
date: 2015-07-19
description: "One of the most daunting parts of replacing my current blogging platform DasBlog by a site created with DocPad, is the migration of existing content. Being a software developer, I wanted to automate as much of the process as possible. Even if the total time required wouldn't be all that much shorter, I'd rather spend it writing scripts and learning new tools and technologies, than doing mundane tasks."
tags:
 - DasBlog
 - CoffeeScript
 - JavaScript
 - Grunt
 - Node.js
---

I installed yamljs and got to work:

createHeader = (post) ->
  moment = require 'moment'

  categories = post.categories[0].category
  # no tags for posts without categories
  tags = (category.$['ref'] for category in categories) if categories

  header =
    title: post.title[0]._
    date: moment(post.$['date-created']).format('YYYY-MM-DD')
    description: ''
    tags: tags

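# params is assumed to be the task configuration (this.data), passed in by the caller
exportPost = (post, params) ->
  os = require 'os'
  yaml = require 'yamljs'
  fs = require 'fs'
  toMarkdown = require 'to-markdown'

  filename = createFilename post
  header = createHeader post
  # YAML header between two '---' lines, followed by the Markdown body
  contents = '---' + os.EOL + yaml.stringify(header) + '---' + os.EOL +
    toMarkdown(post.content[0]._)
  fs.writeFileSync params.dest + filename, contents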

That's it! My existing content was exported from DasBlog and converted to the new format expected by my DocPad site. To be honest, I did some additional post-processing on post contents to transform internal links to other posts, images and downloads, but including that code here would just obscure the core of the solution. In case you're curious, I just prepared the mappings of old URLs to new ones and used regular expressions to apply them to the Markdown text.
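For readers who want a starting point, the mapping idea could be sketched roughly as follows in plain JavaScript. This is not the code I actually used; the URLs in linkMappings are made-up sample values, and escapeRegExp and applyMappings are hypothetical helper names. The sketch only rewrites the targets of Markdown links and images, i.e. the part following "](".

```javascript
// Hypothetical mapping of old DasBlog URLs to new site URLs (sample values).
const linkMappings = {
  'http://example.com/blog/OldPost.aspx': '/blog/20150101-OldPost.html',
  'http://example.com/content/binary/logo.png': '/images/logo.png'
};

// Escape characters that have a special meaning in a regular expression,
// so that a URL can be embedded in a pattern literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace the target of every Markdown link or image that has a mapping.
function applyMappings(markdown, mappings) {
  return Object.keys(mappings).reduce((text, oldUrl) => {
    // Match "](" + old URL, but only when the URL ends right there:
    // the lookahead requires a closing parenthesis or whitespace
    // (the latter covers targets followed by a quoted title).
    const pattern = new RegExp(
      '\\]\\(' + escapeRegExp(oldUrl) + '(?=[\\s)])', 'g');
    // A replacer function keeps "$" in new URLs from being interpreted.
    return text.replace(pattern, () => '](' + mappings[oldUrl]);
  }, markdown);
}

const sampleMarkdown =
  'See [an old post](http://example.com/blog/OldPost.aspx) and ' +
  '![logo](http://example.com/content/binary/logo.png "Logo").';

console.log(applyMappings(sampleMarkdown, linkMappings));
```

Anchoring the pattern on "](" keeps URLs that merely appear in running text or inside code samples untouched, which seemed the safer default for a one-time migration.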

Copyright
Creative Commons License