The data science of moving a blog site

Posted from Discovering Biology in a Digital World by Todd Smith on Tue Oct 31, 2017
Getting the images was the hardest part.

A lot of our colleagues are interested in bioinformatics and data science. While it is clear that bioinformatics is important, even essential, in modern biology research, there is a question about the levels of programming and computer skills needed for different jobs. I'm not going to answer that question here. Instead, I'm going to focus on one end of the problem: how is a bioinformatician similar to a data scientist? I'll illustrate the kinds of skills needed for this role by sharing my recent experience of moving the Discovering Biology in a Digital World blogs from Scienceblogs/digitalbio to this web site, Digital World Biology (DWB). This process involved many steps, including:

  • Reading source file(s)
  • Adding data to a database
  • Parsing data
  • Getting data from the web
  • Automating tasks, performing bulk actions
  • Modifying, cleaning data
  • Mapping pointers to data
  • Managing exceptions
  • Testing your work

The skills, knowledge, and abilities needed to accomplish these tasks include: having an end goal, understanding HTML/XML and other file formats, using UNIX command-line tools, knowing Perl (or shell scripting, or another scripting language), using Excel, installing software, and being able to formulate questions for web searches to find solutions.

By sharing the experience, I'm also documenting my work so others might benefit. 

Getting the data

The first step was to get the Scienceblogs blogs. Scienceblogs (SB) was a WordPress site, and the blog data can be exported in bulk into a "WordPress Extended RSS" (WXR) XML file. That's easy enough. I should be able to import that into my site and be done, right? No; if it were that easy there would be nothing to write about.

The first issue is that SB was a WordPress site, and we're a Drupal site. Luckily, the Drupal migrate module is able to read WXR files (via Import from WordPress), so that is not a big deal. The bigger deal, as we will see, is that not all of your content gets exported to the WXR file. Only the text is exported; media, like images, are referenced through their URL pointers.

Loading the WXR file and clicking submit worked great: 684 blogs were loaded into the Drupal site. I made a content type, "WP archive blog," to include fields for the WordPress tags and categories, and also to keep this content separate from our other Blog content. A couple of tries were needed to get it right, but Drupal has the power tools needed to test, clean up, and do over.

Next, I checked the blogs. The text was there, but seeing the images required that the image paths be updated from the scienceblogs URL to the DWB URL. On the server I could see new directories (folders) that matched the SB paths, so it should be a simple matter of changing "scienceblogs.com/digitalbio" to "digitalworldbiology/sites/all/default/files/digitalbio." As is often the case, there is a Drupal module for that: Search and Replace Scanner (SRS). SRS is powerful. You can set the content types to scan, use regular expressions in the search, and use variables in the replace. Replacing the SB URL with the DWB URL worked, and I could see some of the images.
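
Had the replacement been done on the exported file before import, instead of in the Drupal database through SRS, it would have been a one-liner along these lines (a sketch; the output file name is made up):

perl -pe 's{scienceblogs\.com/digitalbio}{digitalworldbiology/sites/all/default/files/digitalbio}g' wxr.xml > wxr-dwb.xml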

The data science: retrieving the missing images

I quickly learned that most images were missing. In 2012, SB moved the bloggers from a Drupal site to a WordPress site. This migration process put all the images in a single directory with a long pathname. Only a few of the images were transferred correctly to this directory. A second issue was encountered with the newer data. While the image paths logically matched the blog paths, the image file names in the WXR URLs had display dimensions appended, like image-640x480.png. This was a problem because the images downloaded from SB did not have those appendages. Thus, the names mismatched and those images would not display.

Dealing with the second issue meant that the dimension endings in the URL associated with the "src" field had to be removed. After some trial and error I got a regexp that would work. The command line

perl -pe 's/(img.*src.*)-\d+x\d+/$1/' wxr.xml 

would delete the dimension data from the URL. Briefly, we match "img" (the HTML image tag) through "src" (the URL reference attribute), then match the -640x480 part as any number of digits on either side of an "x." The "()" around "img.*src.*" captures that match into a variable, so when we output $1 we get everything from "img" through the file name up to the "-"; only the "-digitsxdigits" is removed. I tested this with Perl on the command line.
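
A quick way to see the pattern behave is to pipe a made-up image tag through it (the URL here is just an example):

echo '<img src="http://example.com/files/2012/04/image-640x480.png">' | perl -pe 's/(img.*src.*)-\d+x\d+/$1/'

This prints the same tag with image.png as the file name; only the "-640x480" is dropped and the rest of the line is untouched. Now to see if it works in SRS.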

Of course not.

Drupal uses PHP as its core language, and the default regular expressions in SRS did not recognize \d+. But [0-9]+ is equivalent, so I tested (img.*src.*)-[0-9]+x[0-9]+ and got the lines that needed to be updated. When I used $1 in the replace box, like in the Perl command line above, 47 blogs with between 60 and 90 image names were updated and fixed - except for one, which was later found by Link Checker (below). On to the missing "old" images.

The first thing to do to get the old images from SB was to email SB and ask for them. While we couldn't pull them off the web server ourselves, they should be on a disk somewhere, perhaps in a directory at the end of that long path, right? No. SB emailed back saying they were lost when the site was migrated in 2012. This didn't seem right, because I could paste a URL into my browser and see the images on the web.

If I can see them, I can get them. But there are almost 700 (grep img from the WXR file and count the lines with wc; see the one-liner below), so this has to be automated. A command-line web tool like wget is needed. I'm on a Mac, which has curl (wget does not ship with Macs because of its GPL license; it's a different religion and a legal thing). However, for this task wget is the better tool, so I needed to install it. A quick web search led me to brew (I could use one now). Brew (from Homebrew) was not on my Mac either, so I took a deep breath and ran:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" 
brew install wget --with-libressl
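
As an aside, the "almost 700" estimate above came from a count along these lines (the exact grep pattern is a guess):

grep img wxr.xml | wc -l
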
Now I have wget, and we're ready to rock and roll. I just need to get the image names and paths from the WXR file. These image names were encoded, like i-bd78a79396aa4bd9ffb083cb90ef24fd (abbreviated wtf), and had a different path than the 2012 images - this helped. To get the names out of the WXR file, I used three regular expressions, each followed by an egrep to filter out lines with matches that were not the specific match I wanted. They were as follows:
 
perl -pe 's/^.*(i-.*gif).*$/$1/'  wxr.xml | egrep "i-.*gif" > gifs
perl -pe 's/^.*(i-.*jpg).*$/$1/'  wxr.xml | egrep "i-.*jpg" > jpgs 
perl -pe 's/^.*(i-.*png).*$/$1/'  wxr.xml | egrep "i-.*png" > pngs 
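
A quick sanity check is to count the extracted names (the three files created above):

wc -l gifs jpgs pngs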

The names, now written to three files, were ready to be added to a URL base and downloaded with wget.

perl -pe '`wget http://scienceblogs.com/digitalbio/wp-content/blogs.dir/460/files/2012/04/$_`' ../gifs
perl -pe '`wget http://scienceblogs.com/digitalbio/wp-content/blogs.dir/460/files/2012/04/$_`' ../jpgs 
perl -pe '`wget http://scienceblogs.com/digitalbio/wp-content/blogs.dir/460/files/2012/04/$_`' ../pngs 

[Image: examples of the retrieved images]

This worked flawlessly until we came to the pngs. It turns out there were spaces in some of the names. This created a bunch of funny "*png%"-style names derived from the space and newline characters. The fix is to chomp the newline and put $_ in quotes, "$_", to get the data.

perl -pe 'chomp $_; `wget http://scienceblogs.com/digitalbio/wp-content/blogs.dir/460/files/2012/04/"$_"`' ../pngs 

Voila - ~650 files transferred! The files were zipped up and uploaded to our server. Testing a random page showed a missing image on the first try! I've always been a good tester. One-offs are exceptions, so exceptional methods are needed: I manually retrieved the file and uploaded it, checked many more pages, and the images were there.
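
For completeness, packaging and uploading the files was nothing special; something like the commands below, where the archive name, user, server, and path are all placeholders:

zip sb-images.zip i-*
scp sb-images.zip user@server:/path/to/files/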

The last step was to make redirects from SB to DWB. SB (now closed) will run its server for another month with redirects, so any direct links to a blog will be redirected to the same blog on DWB. The challenge is that DWB URLs are different from SB's URLs, so a translation is needed, as follows:

ScienceBlogs URL -> DWB URL with ScienceBlogs URL structure -> Digital World Biology URL
           | ScienceBlogs Server |                                 | Digital World Biology Server | 

The steps for this are: 1. get the relevant URLs from the WXR file, 2. clean up the data and get the unique lines with the needed URLs, 3. make a map of URLs so the redirects go to the correct pages.

1. Getting links and relevant text required:

grep "<link>" discoveringbiologyinadigitalworld.wordpress.2017-10-12.xml > scienblogs_urls.txt, and
egrep "http://scienceblogs.com/digitalbio/[0-9]{4}/[0-9]{2}/[0-9]{2}/.*/</link>" scienblogs_urls.txt > scienceblogs-base-urls.txt

2. Cleaning the URLs required:

Remove <link></link> and the extra path data pointing to image views:

perl -pe 's/<link>(http:\/\/scienceblogs.com\/digitalbio\/\d{4}\/\d{2}\/\d{2}\/.*\/).*\/<\/link>/$1/' scienceblogs-base-urls.txt > scienceblogs-base-urls-really.txt

This had to be done twice to deal with the trailing path stuff:

perl -pe 's/<link>(http:\/\/scienceblogs.com\/digitalbio\/\d{4}\/\d{2}\/\d{2}\/.*\/)<\/link>/$1/' scienceblogs-base-urls.txt > scienceblogs-base-urls-really.txt

Getting the unique lines:

perl -ne 'print unless $seen{$_}++' scienceblogs-base-urls-really.txt > one-more-time

One more cleaning:

perl -pe 's/<link>(http:\/\/scienceblogs.com\/digitalbio\/\d{4}\/\d{2}\/\d{2}\/.*\/)<\/link>/$1/' one-more-time > almost
perl -ne 'print unless $seen{$_}++' almost > done
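
Counting the cleaned list is a one-liner:

wc -l done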

The cleaned list had 684 URLs - the same number as posts loaded. The URLs were tested with:

perl -pe '`wget -nv --spider $_`' done > urltest 

The appropriate 200 response code was returned 684 times - plus all the pages got downloaded (oops); rm index.html* cleared those away.

3. Making the map - this used Excel, but first Unix sort was used to put the URLs in order.
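
That sort was just (the output file name here is mine):

sort done > done-sorted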

a. The URLs were opened in Excel, and scienceblogs.com was replaced with digitalworldbiology.com. The DWB URLs were obtained from the DWB site (with a view that gave paths sorted similarly to the SB data). These had to be reviewed, and small adjustments were needed to make sure the redirects were correct. A new problem appeared because some days had more than one blog associated with them, and the SB path was scienceblogs.com/digitalbio/YYYY/MM/DD/blog - and the blogs are named in different ways (long story; it's legacy data).

b. Next, the Drupal module "Path redirect import" was installed and the 684 redirects were uploaded. Two errors with funny characters were encountered; those were fixed.

Done! 684 blogs were uploaded, ~650 images were retrieved, and 684 URL redirects map SB URLs to DWB URLs.

A last step was to run Link Checker and catch any exceptions. These will be URLs that return 404 errors. Link Checker is a great module, but not exceptional. For example, links are truncated in the user interface and can neither be sorted nor searched. With old data, like what we just imported, there will be a large number of broken links pointing to the defunct detritus that is the Internet. Thus, finding missing images requires some searching. The Link checker Views module (dev version only) turns the Link Checker report into a view that can be edited to some degree. Most importantly, it let me add a search capability to the Link Checker report. With a search in hand, ~30-50 more URLs were fixed and images retrieved, so the migration is now the best it can be.

So how does this relate to bioinformatics? In bioinformatics, we would just replace "Scienceblogs" with GEO, SRA, NCBI, MMDB, PDB ... or any other data resource whose details you want to study, and integrate portions of that data with data from other databases to learn something new.