Write a Google sitemap for your WordPress blog

September 21st, 2007

One of the most effective ways to increase the visibility of your content is to make sure it’s indexed regularly by Google. However, the Googlebot sometimes has a hard time with database-driven websites like WordPress blogs, so it helps if you tell Google which URLs to visit. The way to do that is with an XML sitemap. There are a couple different kinds of sitemaps, which work with different search engines, but I’m only going to talk about the XML sitemap supported by Google and Yahoo. There’s also a Google sitemap generator for WordPress, but if you’re like me, you try to keep the number of active plug-ins to a minimum to make your site as fast as possible.

Not only will a sitemap ensure Google has the freshest content from your site, but it will also make your site run faster by telling the Googlebot that it doesn’t need to crawl your back archives with the same frequency as your front page. This is especially important for shared hosting situations like Dreamhost. Because the Googlebot alone can use 50% of the CPU of the shared server, if your site isn’t configured properly, you could bog down the server for everyone else and even get your site taken offline [1].

To set this up you’ll need an account with Google Webmaster Tools, the downloadable sitemap generator, and a hosting account that uses Analog logging and offers Python support. I use Dreamhost. If you need a host, check ‘em out (and use promo code “Synthesis” to get your first year for $60).

First, download the program and upload it to the base directory of your website. Unzip the package and open up config-example.xml. This file holds the parameters that control how the URL list that makes up the sitemap is generated. You’ll need to rename it to config.xml for it to work. There are two steps to setting up config.xml: including URLs, and excluding URLs. Because sitemap_gen doesn’t do any crawling itself, you have to supply it with a list of URLs. One simple way to do this is with a text listing of URLs, but manually adding to this list every time you write a new post would get tedious. Conveniently, sitemap_gen can parse logfiles, so you can use your logs as the URL list. The frequency with which URLs appear in your logs also allows sitemap_gen to assign a priority score to each URL, letting the Googlebot know which pages to update more frequently and which pages it doesn’t need to crawl as often.
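To make the log-frequency idea concrete, here is a rough Python sketch of how hit counts could be turned into relative priorities. It is not sitemap_gen’s actual code: the function names, the sample log lines, and the "divide by the busiest URL" scoring are all assumptions for illustration, and the real formula may well differ.

```python
import re
from collections import Counter

# Matches the request part of a Common/Combined Log Format line,
# e.g. "GET /about/ HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

def count_urls(log_lines):
    """Count how many times each requested path appears in the log."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def relative_priority(counts):
    """Scale hit counts to 0.0-1.0 so the busiest URL gets 1.0.

    Hypothetical scoring for illustration only.
    """
    if not counts:
        return {}
    busiest = max(counts.values())
    return {url: round(hits / busiest, 2) for url, hits in counts.items()}

# Made-up log lines: 4 hits on the front page, 2 on /about/, 1 on an old post.
sample_log = (
    ['1.2.3.4 - - [21/Sep/2007:10:00:00 -0700] "GET / HTTP/1.1" 200 5120'] * 4
    + ['1.2.3.4 - - [21/Sep/2007:10:05:00 -0700] "GET /about/ HTTP/1.1" 200 2048'] * 2
    + ['5.6.7.8 - - [21/Sep/2007:10:09:00 -0700] "GET /2007/08/old-post/ HTTP/1.1" 200 4096']
)

counts = count_urls(sample_log)
priorities = relative_priority(counts)
for url, score in sorted(priorities.items(), key=lambda item: -item[1]):
    print(score, url)
```

The point is only that pages people actually request a lot end up with a higher score than the dusty corners of the archive.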

Next, find the section in config.xml that says, “The “site” node describes your basic web site.” In this section, you want to replace http://www.example.com with the URL of your site. Replace /var/www/docroot/sitemap.xml.gz or whatever comes after store_into with the name of your sitemap. I used sitemap.xml.gz, to generate a compressed sitemap for Google to read.

Moving down the file, find the INPUTS section. This is where you will specify which URLs to include in the sitemap. This part is broken up into sections which contain different link inclusion mechanisms. You can only use one mechanism at a time, so delete or comment out the sections until you get to the one that talks about accesslogs. Remove two of the three example statements in brackets in this section, and modify the remaining one to contain the full path to your access logs. You can use the * character to specify all the logs in the directory, like so: <accesslog path="/path/to/logs/access.log*" encoding="UTF-8" />. Delete the remaining sections in the INPUTS section.
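If you aren’t sure which files a pattern like access.log* will pick up, you can try it out with Python’s glob module. This is just a sketch, and I’m assuming sitemap_gen expands the wildcard the way a shell would; the file names below are made up and live in a throwaway temp directory.

```python
import glob
import os
import tempfile

def matching_logs(pattern):
    """Return the files a shell-style wildcard pattern would pick up."""
    return sorted(glob.glob(pattern))

# Build a scratch directory with some hypothetical log files in it.
with tempfile.TemporaryDirectory() as log_dir:
    for name in ("access.log", "access.log.0", "access.log.1.gz", "error.log"):
        open(os.path.join(log_dir, name), "w").close()

    pattern = os.path.join(log_dir, "access.log*")
    found = [os.path.basename(path) for path in matching_logs(pattern)]

print(found)  # ['access.log', 'access.log.0', 'access.log.1.gz']
```

Note that error.log is left out, while today’s log and the rotated ones are all included.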

The next section is the filters section. This is where you will specify which URLs to exclude. You can do a lot of fancy stuff here, but the most important thing for WordPress is to remove URLs that lead to non-content pages, like wp-login, for example [2]. In these statements you tell sitemap_gen which URLs to add or remove from the list, using normal wildcards or regular expressions. I recommend keeping this as simple as possible and avoiding pass statements. Those act like short circuits and will leave matching URLs in the list no matter what you specify later. In conjunction with regular expressions, this can sometimes be non-intuitive and hard to debug.
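Here is a small Python model of that short-circuit behavior. It is my own sketch of the "first matching filter decides" rule as described above, not code lifted from sitemap_gen, and the keep_url function and both filter lists are hypothetical.

```python
import re

def keep_url(url, filters):
    """Return True if the URL survives the filter list.

    Filters are tried in order and the first pattern that matches decides,
    so a "pass" shields the URL from every "drop" listed after it.
    """
    for action, pattern in filters:
        if re.search(pattern, url):
            return action == "pass"
    return True  # no filter matched: the URL stays in the list

# A too-broad pass rule placed ahead of a drop rule.
filters_with_pass = [
    ("pass", r"/wp-"),
    ("drop", r"wp-login\.php"),
]

# The same drop rule on its own.
filters_drop_only = [
    ("drop", r"wp-login\.php"),
]

print(keep_url("/wp-login.php", filters_with_pass))  # True: the pass wins
print(keep_url("/wp-login.php", filters_drop_only))  # False: dropped
```

With the pass rule in front, the login page sneaks into the sitemap even though a later rule says to drop it, which is exactly the kind of surprise I’d rather avoid.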

Here’s my filters section:

<filter action="drop" type="regexp" pattern="/wp-admin/" />
<filter action="drop" type="regexp" pattern="/wp-login/" />
<filter action="drop" type="regexp" pattern="wp-cron\.php" />
<filter action="drop" type="regexp" pattern="wp-login\.php" />
<filter action="drop" type="regexp" pattern="/doc/" />
<filter action="drop" type="regexp" pattern="/noexist_" />
<filter action="drop" type="regexp" pattern="/\?p=[\d]" />
<filter action="drop" type="regexp" pattern="/\?s=[a-zA-Z0-9]" />
<filter action="drop" type="regexp" pattern="/Photos/tags/.*\.html" />
<filter action="drop" type="regexp" pattern="/Photos/tags/.*/tags/" />
<filter action="drop" type="regexp" pattern="/wp-content/" />
<filter action="drop" type="regexp" pattern="/wp-includes/" />
<filter action="drop" type="regexp" pattern="/stats/" />
<filter action="drop" type="regexp" pattern="/_vti_bin/" />
<filter action="drop" type="regexp" pattern="/MSOffice/" />
<filter action="drop" type="regexp" pattern="/dh_phpmyadmin/" />
<filter action="drop" type="regexp" pattern="/htmledit/" />
<filter action="drop" type="regexp" pattern="/robots\.txt" />
<filter action="drop" type="regexp" pattern="/sitemap\.xml" />
<filter action="drop" type="regexp" pattern="/xmlrpc\.php" />
<filter action="drop" type="wildcard" pattern="*.jpg" />
<filter action="drop" type="wildcard" pattern="*.tif" />
<filter action="drop" type="wildcard" pattern="*.tiff" />
<filter action="drop" type="wildcard" pattern="*.bmp" />
<filter action="drop" type="wildcard" pattern="*.ico" />
<filter action="drop" type="wildcard" pattern="*.js" />
<filter action="drop" type="wildcard" pattern="*.css" />
<filter action="drop" type="wildcard" pattern="*.gif" />
<!-- Exclude URLs within UNIX-style hidden files or directories -->
<filter action="drop" type="regexp" pattern="/\.[^/]*" />

That’s all fairly straightforward, I hope, but two things merit explaining. The section below

<filter action="drop" type="regexp" pattern="/\?p=[\d]" />
<filter action="drop" type="regexp" pattern="/\?s=[a-zA-Z0-9]" />
<filter action="drop" type="regexp" pattern="/Photos/tags/.*\.html" />
<filter action="drop" type="regexp" pattern="/Photos/tags/.*/tags/" />

is an example of one way to remove redundant URLs from your list. You don’t need both the “Pretty URL” to a post and its /?p=number URL, and if you’ve changed that setting recently, they will both show up in your logs. The /\?p=[\d] pattern tells sitemap_gen to exclude any URL of the form /?p=some number. Also, you don’t necessarily need search result pages (/?s=search terms) to appear in the list, so the next line takes care of that. The following two lines are for use with the Flickr Photo Gallery plugin. This plugin allows you to browse your tags just as you would at Flickr, but this creates a URL problem when the site is crawled, resulting in 90% of your logs being composed of redundant crap. Those two lines remove all the URLs pertaining to the gallery except gallery pages and display pages for a single tag.
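If you want to see what those four patterns catch before running the real thing, a few lines of Python will do it. This is a sketch under the assumption that a regexp filter matches anywhere in the URL (which is how re.search behaves); the sample URLs are made up, and the gallery paths in particular are only my guess at what the plugin’s URLs look like.

```python
import re

# The four drop patterns discussed above, copied from the filters section.
DROP_PATTERNS = [
    r"/\?p=[\d]",
    r"/\?s=[a-zA-Z0-9]",
    r"/Photos/tags/.*\.html",
    r"/Photos/tags/.*/tags/",
]

def is_dropped(url):
    """True if any drop pattern matches somewhere in the URL."""
    return any(re.search(pattern, url) for pattern in DROP_PATTERNS)

# Hypothetical URLs of the kind that show up in the logs.
sample_urls = [
    "/?p=123",                       # old-style post link
    "/2007/09/21/google-sitemap/",   # pretty permalink
    "/?s=sitemap",                   # search results page
    "/Photos/tags/cats/",            # display page for a single tag
    "/Photos/tags/cats/tags/dogs/",  # tag browsed within a tag
    "/Photos/tags/cats/page2.html",  # .html page under a tag
]

kept = [url for url in sample_urls if not is_dropped(url)]
dropped = [url for url in sample_urls if is_dropped(url)]

print("kept:", kept)
print("dropped:", dropped)
```

Only the pretty permalink and the single-tag page survive, which is the behavior I was after.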

The next thing worth mentioning is the pair of lines below. They filter out requests that are generated when someone using IE visits your page with the discussion toolbar loaded. IE looks to see if your site supports it, which mine doesn’t.

<filter action="drop" type="regexp" pattern="/_vti_bin/" />
<filter action="drop" type="regexp" pattern="/MSOffice/" />

Once your logs are hooked up and you’ve written some intelligent filter rules to exclude URLs that aren’t content-containing parts of your site, you’re ready to generate the sitemap. Run python sitemap_gen.py --config=config.xml --testing, extract the sitemap.xml file from sitemap.xml.gz, and load it in your browser. Look through it and make sure your rules have worked as expected, then run the command again, removing the --testing part. If you want to get fancy, you can set this up as a cron job. If you do, run it on access.log.0, yesterday’s logs, around 2am. That way you don’t miss any traffic as the logging switches over at midnight.
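If you’d rather not unzip the file and eyeball it in a browser, something like the following Python sketch can list what ended up in the sitemap. It assumes the usual urlset/url/loc/priority layout and ignores the XML namespace so it shouldn’t matter which schema version the generator writes. The sample sitemap here is fabricated so the snippet runs on its own; point list_sitemap_urls at your real sitemap.xml.gz instead.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def list_sitemap_urls(path):
    """Return (loc, priority) pairs from a gzip-compressed XML sitemap."""
    with gzip.open(path, "rb") as handle:
        root = ET.parse(handle).getroot()
    entries = []
    for element in root.iter():
        if local_name(element.tag) != "url":
            continue
        fields = {local_name(child.tag): child.text for child in element}
        entries.append((fields.get("loc"), fields.get("priority")))
    return entries

# A tiny made-up sitemap standing in for the generator's output.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc><priority>1.0000</priority></url>
  <url><loc>http://www.example.com/about/</loc><priority>0.2500</priority></url>
</urlset>
"""

with tempfile.TemporaryDirectory() as scratch:
    sitemap_path = os.path.join(scratch, "sitemap.xml.gz")
    with gzip.open(sitemap_path, "wb") as out:
        out.write(SAMPLE)
    entries = list_sitemap_urls(sitemap_path)

# Flag anything that should have been filtered out.
leaked = [loc for loc, _ in entries if "wp-login" in loc or "wp-admin" in loc]

for loc, priority in entries:
    print(priority, loc)
print("leaked admin URLs:", leaked)
```

An empty "leaked" list is a quick sanity check that the drop rules did their job.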

Finally, log into Google Webmaster Tools and submit your sitemap to Google!

[1] To see how much of your traffic is coming from the Googlebot, SSH to your server and run tail -10000 access.log | awk '{print $1}' | sort | uniq -c | sort -n from the same directory as your access.log files. On each line of output, the first number is the count of connections and the second field is the IP making those connections. IPs that start with 66.249 are the Googlebot. If 66.249 is the last entry (the busiest client), and the number of connections is very high (over a thousand, say) and many times bigger than the number of connections for the second most frequent IP, you probably need to do something before the hosting company does something for you, like ban Google from accessing your site.
[2] I’m not exactly sure if it would be better to leave some things in, but set to a zero priority. For now, I have the non-content stuff removed. Really, the non-content pages should probably be excluded in robots.txt.
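The tail/awk/sort/uniq pipeline above can also be written in Python if you want to do more with the numbers. This is only a sketch: the function names and the sample log lines are made up, and the 66.249 prefix is simply the rule of thumb from the note above, not an authoritative list of Googlebot addresses.

```python
from collections import Counter

def client_ips(log_lines):
    """Pull the first whitespace-separated field (the client IP) from each line."""
    return [line.split(None, 1)[0] for line in log_lines if line.strip()]

def top_clients(log_lines, limit=5):
    """Return the busiest IPs as (ip, request_count), most frequent first."""
    return Counter(client_ips(log_lines)).most_common(limit)

def googlebot_share(log_lines, prefix="66.249."):
    """Fraction of requests coming from IPs that start with the given prefix."""
    ips = client_ips(log_lines)
    if not ips:
        return 0.0
    return sum(ip.startswith(prefix) for ip in ips) / len(ips)

# Made-up log lines: three requests from one 66.249 address, one from a visitor.
sample_log = [
    '66.249.66.1 - - [21/Sep/2007:10:00:00 -0700] "GET / HTTP/1.1" 200 5120',
    '66.249.66.1 - - [21/Sep/2007:10:00:01 -0700] "GET /?p=12 HTTP/1.1" 200 4096',
    '66.249.66.1 - - [21/Sep/2007:10:00:02 -0700] "GET /?p=13 HTTP/1.1" 200 4096',
    '1.2.3.4 - - [21/Sep/2007:10:03:00 -0700] "GET /about/ HTTP/1.1" 200 2048',
]

busiest = top_clients(sample_log)
share = googlebot_share(sample_log)

print(busiest)
print("Googlebot share of requests:", share)
```

To mimic tail -10000 on a real file, you could feed it collections.deque(open("access.log"), maxlen=10000) instead of the sample list.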

I’ve had enough.

August 25th, 2007

I’m sticking with the default lame-ass Kubrick theme, as it seems to be the only one that plug-in developers test against, and I don’t have time to mess around editing the template to fix one thing while breaking another.

EDIT: I couldn’t resist, I’m trying K2

The REAL Nigerian Finance Minister, Mrs. Ngozi Okonjo-Iweala

August 3rd, 2007

Not the one who has been emailing you for your assistance transferring TEN MILLION US DOLLARS to a foreign bank account.

Discussion thread here.

Google Documents and WordPress

June 27th, 2007

My dissertation post is here. When I edit the document at Google Documents and republish, it overwrites the post, so any explanatory text or tags are lost. One thing that is a little annoying is how it tries to take over the right-click context menu. I end up with the Google Document right-click menu opening up, with the Firefox context menu on top of it, obscuring the top half of the Google menu.

I would have thought Google would have known better than to try to subvert such an important browser function. Bad Google, Bad!

The good: revision control, easy collaboration, seamless output to many formats, rich editing features.
The bad: post metadata isn’t preserved, non-standard browser UI, no way (that I know of) to put the post on a separate page.

Maybe I could get the best of both by sticking the RSS feed of revisions on a separate WP page.

Keywords work now, and editing works. Now to get widgets figured out.

June 13th, 2007

Deleting posts from the manage page doesn’t work, but deleting from the edit entry page does work. There are about 10 support threads at WordPress for this, but no resolution. The ones where it was a rights issue have been figured out, but not the weird behavior of the manage page.

Because the widgets work in the default theme, but not in Tiga, there must be some weirdness with the theme, but I should be able to paste the widget code into sidebar.php in the theme directory.

I don’t think wp-admin/widgets.php works with Tiga, because it expects wp-content/plugins/widgets.php. I’ll have to check that soon, and in the meantime, I could probably just paste the code in.

Replacing Tiga’s sidebar.php with the default’s works, but the formatting is screwed up. I need to figure out what parts of the default sidebar need to be reproduced in Tiga’s.

I’m having issues with my old theme, Tiga, and the new WordPress.

June 12th, 2007

Some things don’t work until I get this figured out.

Specifically, the Jerome’s Keywords plugin doesn’t work with both Tiga and WordPress 2.2, though they work with either alone.
Many of the fancier sidebar widgets don’t work, like the one that displays RSS feeds.

Not only that, but I can’t delete posts.

Please God, let me find nukes in Iran.

May 8th, 2007

...and make Russia, China, and India get mad at Iran, too.

Finally, I have a feed reading system that works how I want it to. Almost.

May 2nd, 2007

I used Sharpreader for years, and I still think it’s the best of the free stand-alone readers, even though it’s not the prettiest, and the mechanism for changing feed properties is about the wonkiest thing I’ve ever seen. I didn’t want to change, because I’m a minimalist at heart, but I just wanted a little more.

While looking around for something that would be just a little easier on the eyes and not so restrictive in terms of user-configuration, I found Newzie. I used it for a couple weeks at home, and I liked how it looked better than Sharpreader, RSSOwl, or Greatnews. The large variety of news reading modes was a welcome change. I used the single column, full-article newspaper view the most, because the headline rarely indicates the content in many of the feeds I subscribe to, such as Ask.Metafilter and all of the Scienceblogs feeds; however, if you have the full posts laid out one after the other, you can skim them easily. This works particularly well for Flickr feeds.

I realized after a couple weeks, however, that I just liked reading things in Sharpreader better. The reason is that Newzie has designed its own novel UI. Right-clicking doesn’t work like you expect, closing and minimizing don’t work like you’re used to, and it’s just unpleasant to switch between UI styles like that. Simply clicking a link would open up an IE tab within the application, no matter if you selected “open” or “open in new window”. There was a third option, called “Open (ext.)” that would open the link in IE externally. There was no way to open a Firefox window externally if you were using the Gecko rendering engine, and the only way to do it using embedded IE was to install the “Open in Firefox” extension for IE, and right-click and select open link in Firefox from the context menu. The way this worked would change depending on what mode you were in, too. Simple things, like right-clicking a feed in the left pane, would open a slew of options, none of which were “Mark all read”. Every article container had all these buttons and options that would pop up upon mouse-over, but I never used them, because they were for deleting the post or changing the read status or some other thing I didn’t care about. Who wants to manually mark each individual post they read? Who wants to go through and individually delete old, read posts? Another annoying thing was the unnecessary distinction between “New” and “Unread”. If I haven’t read it, it’s new to me, OK? So while I found Sharpreader to be unpleasantly restrictive and minimalist, Newzie was way over the top with unnecessary, cluttered features. The developer should focus on making what features he has coded work right, rather than grafting on a slew of half-ass new ones.

I’m using Sage now, which used to work not so well on pre-2.0 Firefox, but seems to work great in the latest release. It works in the sidebar just like History or Bookmarks do, instead of grafting on some new interface. Marking of items read is done via the browser history, instead of some hack. Because it obeys the conventions of the system within which it operates, I was able to use the Optimoz tweaks extension to auto-hide it in the sidebar, giving me all the navigational ease of a 3-pane interface, with the page presentation of a full window. For the same reason, you can use a custom stylesheet to display the feed however you’d like, with no need for a little button or preference for each style, color, background, font, and so on. This also allows you to benefit from the design capabilities of someone other than the developer. Judging from the available styles, this is a very good thing. The only thing I don’t have that I’d like here is a display of unread messages. There was an extension that purported to do that, but it wasn’t listed on mozilla.org, and wasn’t compatible with 2.0.0.3, so I’m reluctant to mess with it.

Transcript: Rock the Vote Democratic Presidential Debate (washingtonpost.com)

November 5th, 2003

I was just reading the Transcript of the Rock the Vote Democratic Presidential Debate (washingtonpost.com) and I remembered a lesson my dad taught me. The lesson was: “Find the Referent.” It’s from Stuart Chase’s “The Tyranny of Words”. In other words, think about what the words being used actually refer to, physically. Political speak is meticulously devoid of referents, except, of course, in the cases where the speakers abuse statistics. I want to see a debate where candidates aren’t allowed to use the phrase, “the American people”.

I was kinda enjoying the debate, and laughing about how all the other candidates, including the Reverend Al Sharpton, lord help us all, were trying to give Dean such a hard time for pointing out the unrequited loyalty of Southern white voters to the Republican Party. Bless ‘im.

Bush renews rebuke of Boykin – The Washington Times: Nation/Politics

October 29th, 2003

Trent Lott loses his spot, Rush Limbaugh gets booted, Gregg Easterbrook is fired. Each of them for making statements that could possibly have been misinterpreted, or could be seen in a more understanding light, about relatively benign issues. Lt. Gen. William G. Boykin, however, has made and, as far as we know, is continuing to make absolutely unambiguous statements about issues that are likely to precipitate immediate terrorist attacks against U.S. citizens, as well as to further the negative image Americans are getting in other countries. He still has his job.

I have to ask: Who would you rather piss off? The American Religious Right, or all Muslims worldwide? I guess all Muslims worldwide aren’t going to be voting for Bush in the coming elections, are they?