
Almost done with Wordpress Import

I've been working the last couple of days on importing a Wordpress eXtended RSS export file into a concrete 5 website.  Read this post to find out the status of the package and how to download it.

First off let me say that I didn't do all of the work on this - a lot of the hard work was already done by Scott C, and then by Greg Joyce from the core team. I just took what they had and polished it up a lot because I need it for a super secret project that I'm working on for a local cycling related non-profit.  It's cutting into development time on bikedate.mn but I think that's OK.  Honestly, I'm wondering a little how hard I should work on bikedate after the turnout for the last group ride.  I'm trying to tell myself that the idea is still solid, that it was just a bad weekend, and that competing with Free Ride is always a pretty tough proposition.  I'm still planning on finishing the site, but moving on to do a little work on this other project seemed like a good idea.  They currently only have a hosted wordpress.com blog, so I first needed to know if I could convert the last three years' worth of posts into something I could actually work with in concrete 5.  Work is still pretty slow while we're waiting on designers to get us files for our next couple of projects, so I've had a pretty solid couple of days to work on it, which is nice.

Anyway, from the original post on concrete5.org I picked up a copy of the existing code over on github.  This is what they said about the current state of things:

Wordpress Importer does a basic import of a wordpress blog.
The main idea is that someone can take their old content and have it
represented on a concrete5 site.

What it is doing:
- Uses a wizard-like interface to import
- Import is incremental -- keeps user in the loop and continues where it left off
- Includes basic image import
- Uses WP's own text formatting functions so everything comes through looking like a paragraph, etc

Not so great:
- pages and posts import under the same page
- No "Start Over" on the first step. Choosing a new database when you have existing records doesn't do anything so the dialog shouldn't be there.
- Not selecting a page to import under is very bad. Hundreds of pages show up under "home" and the link does not work at the WORDPRESSED step.
- Should not proceed if "Posts" are going under Home. This basically ruins a site.
- icon.png does not look good

Would be nice:
- Import comments
- Import tags
- Import categories (or be able to choose a different top-level C5 page to import all posts of just one category under)
- Have an option to import "posts" under one page and "pages" under another.
- Option to import just posts or just pages.
- Include a new page type called "WordPress Post" that is basically the "Blog" page type but without the lipsum text.
- some kind of rudimentary support for image captions, like caption

Seems like it's ok, but there's a lot there that didn't work for me. Once I had the code, it was time to get working.  I've been able to get a lot done so far, but there are a few things that don't really work that I'm not sure how to get around.  The README over on my fork of the code on github currently looks a little more like this:

Wordpress Importer does a basic import of a wordpress blog.
The main idea is that someone can take their old content and have it
represented on a concrete5 site.

What it is doing:
- Uses a wizard-like interface to import
- Import is incremental -- keeps user in the loop and continues where it left off
- Includes basic image import
- Uses WP's own text formatting functions so everything comes through looking like a paragraph, etc
- Pages and post import to different pages, and use different page types if selected
- Page parent/child relations kept
- Imports Wordpress Categories to the attribute 'wordpress_categories'
  - no nested categories, but the category name is kept
- Imports Wordpress Tags to the 'tags' attribute
- Removes all blocks from "Main" and "Blog Post More" areas on newly created pages to keep lipsum text from showing up
- Imports comments and comment dates
- If users exist with the same username as a post author's username, uses that user for the author of the newly created page

Not so great:
- No "Start Over" on the first step. Choosing a new database when you have existing records doesn't do anything so the dialog shouldn't be there.
- Not selecting a page to import under is very bad. Hundreds of pages show up under "home" and the link does not work at the WORDPRESSED step.
- Should not proceed if "Posts" are going under Home. This basically ruins a site.
- icon.png does not look good

Would be nice:
- Option to import just posts or just pages.
- some kind of rudimentary support for image captions, like caption
- Somehow keep post url structure, or update internal links in post content
- Better import of images, script as is tries to import flickr images that probably shouldn't be imported
- Keeping pingback comments would be nice if there was a comment block that supported pingbacks

So I'm getting most of the metadata attached to the posts (tags and categories) and keeping the page structure of the site, but there are still some limitations.  The biggest one that I can think of right now is the internal links on the blog.  I think I can set it up to add a collection alias that follows the permalink structure of the blog post (/year/month/day/blog-title-with-dashes), but that might not always be the structure of the internal link.  Also, some of the links start with ../../ while others include the full web address of the existing site.  If the full address is there I don't think there's anything that I could do; there's no way the link would work.  Honestly, for the purposes of this site I will probably end up doing a find and replace on href=" and just checking every link individually.  It should only take an hour or two to do.
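For what it's worth, the find-and-replace pass could probably be scripted. Here is a minimal Python sketch of the idea, not anything from the importer itself. The old host name, the `PATH_MAP` contents, and the function names are all made up for illustration. It assumes you already have a table of old permalink paths and the new paths they were imported to.

```python
import re
from urllib.parse import urlparse

# Hypothetical values for illustration; neither comes from the importer.
OLD_HOST = "example.wordpress.com"
PATH_MAP = {
    "/2010/05/14/spring-ride-recap/": "/blog/spring-ride-recap/",
}

HREF_RE = re.compile(r'href="([^"]*)"')


def normalize(href):
    """Reduce an old internal link to a /year/month/day/slug/ path, or None."""
    parsed = urlparse(href)
    if parsed.scheme not in ("", "http", "https"):
        return None  # mailto:, javascript:, etc.
    if parsed.netloc and parsed.netloc != OLD_HOST:
        return None  # external link, leave it alone
    path = parsed.path
    if not path:
        return None  # bare #anchor or empty href
    # Strip leading ../ segments from relative links.
    while path.startswith("../"):
        path = path[3:]
    if not path.startswith("/"):
        path = "/" + path
    if not path.endswith("/"):
        path += "/"
    return path


def rewrite_links(html):
    """Rewrite known internal links; collect the rest for a manual check."""
    unresolved = []

    def replace(match):
        href = match.group(1)
        path = normalize(href)
        if path is None:
            return match.group(0)
        if path in PATH_MAP:
            return 'href="%s"' % PATH_MAP[path]
        unresolved.append(href)
        return match.group(0)

    return HREF_RE.sub(replace, html), unresolved
```

Anything that comes back in `unresolved` is a link that looked internal but had no known new home, so the checking by hand shrinks to just that list.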

This wouldn't be an issue if wordpress did something like concrete 5 does with internal links in content.  Concrete 5 stores links to internal pages in a coded form, something like href="{CCM:CID_108}", which shows up on the rendered page as href="/about/musical-tastes/".  The 108 is the cID or collection ID of the page, which is unique to the page and independent of any url to the page.  At run time this is converted to the full path to the page, whatever that path currently is.  This allows you to move pages and posts all around your site and not worry too much about the link structure.  You should of course worry about SEO and permalinks, and you should probably be putting in collection alias urls for the old urls, or 301 redirects in your .htaccess file.  But the point is, the links themselves will work.  Your site doesn't break just because you rename a page and the corresponding URL changes.
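To make the idea concrete, here is a rough Python sketch of how a coded link can be resolved at render time. This is not concrete 5's actual code. The `{CCM:CID_108}` placeholder format is my assumption about what the stored link looks like, and the `PAGES` table stands in for a database lookup.

```python
import re

# Hypothetical page table: cID -> current path. The real lookup would hit
# the database; these values are made up.
PAGES = {108: "/about/musical-tastes/"}

# Assumed placeholder format for stored internal links.
PLACEHOLDER_RE = re.compile(r"\{CCM:CID_(\d+)\}")


def render_links(stored_html, pages):
    """Swap each cID placeholder for the page's current path."""
    def replace(match):
        cid = int(match.group(1))
        # Leave unknown IDs untouched rather than guessing a path.
        return pages.get(cid, match.group(0))

    return PLACEHOLDER_RE.sub(replace, stored_html)
```

Because the stored content only ever holds the ID, moving page 108 just means its entry in the lookup changes; the saved post content never has to be touched.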

The image import I'm a little more unclear on.  The way the script works, it's looking for a regular expression that matches a link surrounding an image. I'd post the regex but tinymce eats it.

Anyway, this matched a lot of files on the site, but not in the way I think it was supposed to.  There were several images on their site that were linked back to other pages, and the script expected the href match to be the full resolution version of the file.  I have the feeling this is how it would work if the blog had any hosted images, but since it's from wordpress.com maybe they don't get to add attachments or something?  I'm not sure, but at any rate the import would fail because it thought that all it was matching was the thumbnail image, or the src attribute of the image.  I think this should be handled better, since it imported images even though it didn't need to.  Because none of the links were to a full resolution image, the second part of the script, which replaces the image src with another coded link (this time for image/file links), never ran, and the image link just stayed as it was.  But I still ended up importing the image into my file system on concrete 5.
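One way to tell the two cases apart would be to only import when the link's href itself points at an image file. The Python sketch below shows that check. It is not the importer's real regex (that one still gets eaten), and the pattern, the extension list, and the function names are my own guesses at something equivalent.

```python
import re
from urllib.parse import urlparse

# Assumed pattern: an <a> whose only content is a single <img>.
LINKED_IMG_RE = re.compile(
    r'<a\s[^>]*href="([^"]+)"[^>]*>\s*<img\s[^>]*src="([^"]+)"[^>]*>\s*</a>',
    re.IGNORECASE,
)
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def is_image_url(url):
    """True when the URL path ends in a known image extension."""
    return urlparse(url).path.lower().endswith(IMAGE_EXTENSIONS)


def find_importable_images(html):
    """Return (full_size_url, thumbnail_url) pairs worth importing."""
    pairs = []
    for href, src in LINKED_IMG_RE.findall(html):
        # Skip images that merely link to another page.
        if is_image_url(href):
            pairs.append((href, src))
    return pairs
```

An image wrapped in a link to some other page never makes it into the list, so nothing gets downloaded for it and its src is left alone.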

Also, at one point one of the websites that I needed to download an image from was timing out and causing everything to crash.  This was because the zend http client just throws an exception, and the entire concrete 5 execution stops, when a url doesn't respond in a timely manner.  I think I added in a check with curl to make sure that the website actually responds before doing the import, but like I said, I don't think that the import should happen at all in the cases I was matching, and I wasn't sure how to differentiate between those and a good image from wordpress.  I have the feeling that image captions are probably also formatted a certain way, but I never saw anything in any of the xml records I looked at to show me what they were supposed to look like.
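The general shape of the fix is to put a timeout on the download and treat a failure as "skip this image" instead of letting it kill the whole import. Here is a small Python sketch of that pattern using only the standard library. It is an illustration, not the curl check from the package, and the function names and the `opener` parameter are hypothetical.

```python
import urllib.request


def fetch_image(url, timeout=5, opener=urllib.request.urlopen):
    """Return the image bytes, or None if the host fails or times out."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError) as exc:
        # OSError covers URLError and socket timeouts; ValueError covers
        # malformed URLs. Either way, skip the image and keep going.
        print("skipping %s: %s" % (url, exc))
        return None


def import_images(urls, opener=urllib.request.urlopen):
    """Download what we can and report the URLs that had to be skipped."""
    imported, skipped = {}, []
    for url in urls:
        data = fetch_image(url, opener=opener)
        if data is None:
            skipped.append(url)
        else:
            imported[url] = data
    return imported, skipped
```

The `opener` argument is only there so the download can be swapped for a fake one when testing; in normal use you would leave it at the default.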

I kind of want to expand this out and actually include a composer blog in the package and make it something where you can import your wordpress site and have a fully functioning blog with category pages that ajax filter your content based on tags or date navigation.  I think this would be a really popular marketplace add-on, maybe the first one to make real money. 

I will of course be making sure that the stuff that I've done for just the basic import is available to everyone for free. In fact, you can download the package here and use it if you'd like.
