Category Archives: analysis

Audio comparison: Blue Yeti vs. headset, webcam, and built-in microphone

You know how you sound lower-pitched to yourself and higher-pitched to others? (Science says it’s because of bone conduction.)

I sound high-pitched to myself. I’ve always sounded like a kid. Friends teased me about voice-acting for anime. I tried to avoid being self-conscious about it, but you know how sometimes that sneaks in anyway. I found it difficult to listen to recordings of my presentations or podcasts. I ended up paying other people to transcribe them.

As it turns out, this might be one of the things you can fix with money. Here’s a quick comparison of:

  • the Blue Yeti
  • the Logitech H800 headset I usually use
  • the Logitech Webcam Pro 9000 I use for video
  • my computer’s built-in microphone

With the Yeti, I can deal with listening to myself. Hmm. That’s something. It’s not cheap, but if this is one of those rare occasions you can spend money to get around confidence barriers… bring it on!

I justify the expense by telling myself that this will encourage me to make more videos and screencasts. Let’s see how it works out. For recording and webcasts, I put the Yeti on a thick stack of fleece to help muffle the vibrations from the desk.

I know there are even fancier microphones out there. I think those will have to wait for a better setup, though. I’m fine with the quality of the Yeti, and I’ve learned not to let my expenses outpace my senses by too much.

If you’re thinking of getting your own, try getting it from a brick-and-mortar store so that you can return it if it doesn’t work out for you. Apparently, microphones do different things for different voices. Here’s an affiliate link to it on Amazon, if you’re inclined to get it online: Blue Microphones Yeti USB Microphone – Silver Edition. =) (I’ll get a tiny fraction of the purchase price, which is handy for buying the occasional book.)

Hope this helps! I’d love to hear (from) you.

Analyzing my Lotus Notes sent mail since January 2011

Today is my penultimate day at IBM! Having successfully turned my projects over to another developer (hooray for the habit of organizing project-related files in Lotus Connections Activities), I’ve been focusing on getting things ready for the traditional goodbye e-mail, which I plan to send tomorrow.

I dug around in the Lotus Connections Profiles API to see if I could get a list of my contacts’ e-mail addresses. I fixed a small bug in the feed exporter of the Community Toolkit (w4.ibm.com/community for people in the IBM intranet) and exported my contacts, giving me a list of 530 IBMers who had accepted or sent me an invitation to connect.

Not everyone participates in that Web 2.0 network, though, so I wanted to analyze my sent mail to identify other people to whom I should send a note. I couldn’t find a neat LotusScript to do the job, and I couldn’t get the NSF to EML or mbox converter to work. Because I didn’t need all the information, just the recipients, subjects, and times, I wrote my own script (included at the end of this blog post).

I used the script to summarize the messages in my sent mail folder, and crunched the numbers using PivotTables in Microsoft Excel. I worked with monthly batches so that it was easier to find and fix errors. I decided to analyze all the mail going back to the beginning of last year in order to identify the people I mailed the most frequently, and to come up with some easy statistics as well.

(chart: e-mails sent over time)

Spiky around project starts/ends, I’d guess.


I wanted to see which roles I tended to e-mail often, so I categorized each recipient with their role. I distinguished between people I’d worked with directly on projects (coworkers) and people who worked with IBM but with whom I didn’t work on a project (colleagues). The numbers below count individual recipients.

Role          People   Individual e-mails sent   Average e-mails per person
colleague     407      827                       2.0
coworker      50       562                       11.2
client        21       387                       18.4
manager       4        109                       27.3
partner       9        51                        5.7
system        9        21                        2.3
other         8        11                        1.4
self          1        5                         5.0
Grand Total   509      1973                      3.9

As it turns out, I sent a lot of mail to a lot of people throughout IBM, mostly in response to questions about Lotus Connections, Idea Labs, or collaboration tools.

Now I can sort my summarized data to see whom I e-mailed the most often, and add more names to my don’t-forget-to-say-goodbye list. If all goes well, I might even be able to use that mail merge script. =)

The following agent processes selected messages and creates a table with one row per recipient, e-mailing the results to the specified mail address. It seems to choke on calendar entries and other weird documents, but if you go through your sent mail box in batches (Search This View by date is handy), then you should be able to find and delete the offending entries.

Option Public
Dim TempNitem As NotesItem
Dim TempNm As NotesName
Dim session As NotesSession
Dim db As NotesDatabase
Sub Initialize
	mailAddress = "YOUR_ADDRESS@HERE"
	
	Dim ws As New NotesUIWorkspace
	Dim uidoc As NotesUIDocument
	Dim partno As String
	Dim db As NotesDatabase
	Dim view As NotesView
	Dim doc As NotesDocument
	Dim collection As NotesDocumentCollection
	Dim memo As NotesDocument
	Dim body As NotesRichTextItem
	Dim range As NotesRichTextRange
	Dim count As Integer
	
	Set session = New NotesSession
	Set db = session.CurrentDatabase
	Set collection = db.UnprocessedDocuments
	
	Dim FldTitles(3) As String
	FldTitles(0) = "E-mail"
	FldTitles(1) = "Subject"
	FldTitles(2) = "Date sent"
	
	Set maildoc = db.CreateDocument
	maildoc.Form = "Memo"
	maildoc.Subject = "Summary"
	maildoc.SendTo = mailAddress
	Dim ritem As NotesRichTextItem
	Set ritem = New NotesRichTextItem(maildoc, "Body")
	' Pass the rich text item and other relevant details
	Set ritem = CreateTable(FldTitles, collection, ritem, "Sent items", "Summary created on " + Format(Now, "YYYY-MM-DD"))
	Call maildoc.Send(False)
End Sub
Function CreateTable(FldTitles As Variant, doccoll As NotesDocumentCollection, rtitem As NotesRichTextItem, msgTitle As String, msgBody As String) As NotesRichTextItem
	' http://searchdomino.techtarget.com/tip/0,289483,sid4_gci1254682_mem1,00.html
	' Takes a document collection and renders its contents as a table in the passed rich text item
	
	Set ritem = rtitem
	Set rtnav = ritem.CreateNavigator
	Set rstyle = session.CreateRichTextStyle
	
	'===================================================
	' Heading in the body section of the mail
	rstyle.Bold = True
	rstyle.NotesColor = COLOR_RED
	rstyle.Underline = True
	rstyle.NotesFont = FONT_COURIER
	rstyle.FontSize = 12
	Call ritem.AppendStyle(rstyle)
	ritem.AppendText(msgTitle)
	
	rstyle.Underline = False
	rstyle.NotesColor = COLOR_BLACK
	ritem.AddNewline(2)
	
	rstyle.FontSize = 10
	rstyle.Bold = False
	rstyle.NotesColor = COLOR_BLACK
	Call ritem.AppendStyle(rstyle)
	ritem.AppendText(msgBody)
	ritem.AddNewline(1)
	
	'===================================================
	cols = CInt(UBound(FldTitles))	' UBound doubles as the column count; FldTitles is dimensioned one larger than the titles it holds
	
	' Create a one-row table; a data row is added for each recipient later
	Call ritem.AppendTable(1, cols)
	Dim rtt As NotesRichTextTable
	Call rtnav.FindFirstElement(RTELEM_TYPE_TABLE)
	Set rtt = rtnav.GetElement
	'=================================================
	' Heading of the table
	rstyle.Bold = True
	rstyle.NotesColor = COLOR_BLUE
	rstyle.FontSize = 10
	Call ritem.AppendStyle(rstyle)
	
	For i = 0 To UBound(FldTitles) - 1
		Call rtnav.FindNextElement(RTELEM_TYPE_TABLECELL)
		Call ritem.BeginInsert(rtnav)
		Call ritem.AppendText(FldTitles(i))
		Call ritem.EndInsert
	Next
	
	'=================================================
	' Table body: one row per recipient of each message
	rstyle.FontSize = 10
	rstyle.Bold = False
	rstyle.NotesColor = COLOR_BLACK
	Call ritem.AppendStyle(rstyle)
	Dim count As Integer
	count = 0
	Set doc = doccoll.GetFirstDocument
	While Not (doc Is Nothing)
		subject = doc.GetFirstItem("Subject").Values(0)
		posted = doc.GetFirstItem("PostedDate").Values(0)
		Set sendTo = doc.GetFirstItem("SendTo")
		For i = 0 To UBound(sendTo.Values)
			Call rtt.AddRow(1)
			Call rtnav.FindNextElement(RTELEM_TYPE_TABLECELL)
			Call ritem.BeginInsert(rtnav)
			ritem.AppendText(sendTo.Values(i))
			Call ritem.EndInsert
			Call rtnav.FindNextElement(RTELEM_TYPE_TABLECELL)
			Call ritem.BeginInsert(rtnav)
			ritem.AppendText(subject)
			Call ritem.EndInsert
			Call rtnav.FindNextElement(RTELEM_TYPE_TABLECELL)
			Call ritem.BeginInsert(rtnav)
			ritem.AppendText(posted)
			Call ritem.EndInsert
		Next
		count = count + 1
		Set doc = doccoll.GetNextDocument(doc)
	Wend
	Set CreateTable = ritem
	MsgBox "E-mails summarized: " & count
End Function

I find it helpful to save it as the "Summarize Recipients" agent and assign it to a toolbar button that runs @Command([RunAgent]; "Summarize Recipients").

Quantified Awesome: Squishing my excuses

I’ve been fiddling with Quantified Awesome, this personal dashboard that I’m building so that I can keep track of what’s going on in my life and use that data to make it even more awesome. For example:

  • Tracking my time helps me make sure work doesn’t tempt me too much, and that I make time both for personal projects and for connecting with other people. It also helps me improve my time estimates: How much time does it really take to walk to the subway station? How instant are instant noodles?
  • Tracking library books reminds me before they’re overdue, helps me collect my reading history, and gives me a greater appreciation for where my tax dollars go.
  • Tracking my clothes helps me remember to wear different types of clothes more often, makes it easier to donate items I don’t typically wear, and encourages me to try new combinations.
  • Tracking the produce we get from community-supported agriculture helps us avoid waste.
  • Tracking stuff helps me remember where infrequently-accessed items are.

It turns out that other people are interested in this too. 21 people have signed up through my “I’ll e-mail you when I figure out how to get this ready for other people” page, and my mom wants to use it too. That’s awesome!

Now I have to go ahead and actually build it so that other people can use it. That’s scary.

And like the way I deal with other scary, intimidating, procrastination-inducing things, I’m going to list my excuses here, so that I can shine a light on those assumptions, watch them scurry away like the cockroaches they are, and, if necessary, squish them with a well-applied flip-flop.

  • Excuse #1: Idiosyncrasy. The way I work might be really weird, and other people may not be able to figure out what to do.
    • What’s the worst-case scenario? “I have no idea how this works!” I end up with lots of crufty special cases because I can’t figure out how to reconcile different ways of working.
    • What’s the best case? I adapt the system to the way other people work, and I get inspired by what they do. I build a lovely, flexible web app and API.
  • Excuse #2: Risk. I’m fine with loading my own data into an experimental system, but if I mess up and delete other people’s data, I’ll feel terrible. Also, they might trigger bugs.
    • What’s the worst-case scenario? Catastrophic data failure, nothing saved.
    • What’s the best case? Regular backups help me recover from any major mishaps, and careful coding avoids more common mistakes.
  • Excuse #3: Support. I’m going to spend more time handling bug reports and feature requests, and less time building little things that might be useful only for me.
    • What’s the worst-case scenario? People get annoyed and frustrated because I’m currently focused on other things, like my work.
    • What’s the best case? I get the system to become mostly usable for people, and I use my discretionary time to build more features. People’s requests inspire me to build more stuff and create more value.
  • Excuse #4: Documentation. I’ll need to write documentation, or at the very least online help. This means confronting the less-than-intuitive parts of the system. ;)
    • What’s the worst-case scenario? I describe what currently exists, get frustrated because I want to improve it, and end up cycling between updating documentation and improving the system.
    • What’s the best case? I describe what currently exists, and end up improving it along the way. I build online help into the system so that it’s easy to change. There’s a blog that helps people learn about updates, too.
  • Excuse #5: Offline access. A web-based time tracker might be of limited use if you don’t have web access often. I’ve been working on an offline HTML5 interface, but it’s still buggy.
    • What’s the worst-case scenario? Early testers try it out, but get frustrated because of the lack of offline access.
    • What’s the best case? I figure out the HTML5 offline thing. Someone else might be interested in building a native app, and we work together on fleshing out an API.
  • Excuse #6: Impatience. If I bring people on too early, they might get annoyed with a buggy system, and lose interest.
    • What’s the worst-case scenario? People give it a cursory try, and give up in annoyance.
    • What’s the best case? Early users are extraordinarily patient. We figure out a minimal viable product for each of them – the simplest thing that could possibly support what they want to do. Over time, things keep getting better and better. Also, I build a decent export interface, so even if people move on to a different system, they’ll still have their data.
  • Excuse #7: Privacy and control. A bug might accidentally expose people’s information, which is not fun. I also don’t want to have to police the system for objectionable content, considering the thumbnail uploads.
    • What’s the worst-case scenario? Someone’s private notes get accidentally published.
    • What’s the best case? People sign on knowing that I might have bugs, and don’t save any super-secret or inappropriate information on the system.

Okay. I think I can deal with that. So, what are the smallest, least-intimidating steps I need to take in order to get closer to opening up?

  • Write a quick test to make sure that people’s data will stay private. We’ll make people’s accounts private by default, although mine will stay mostly-public.
  • Make a list of things that people should be able to do right now. (Not including new functionality!) Gradually write tests to nail down that behaviour.
  • Make a list of things that people may want to do some day. Eventually set up an issue tracker.
  • Enable Devise’s invitable feature so that I can set up accounts for people easily (see the sketch after this list).
  • Double-check backups.
  • Bring one person on. Then the next, then the next…
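
Here’s a rough sketch of what that invitable step might look like, assuming the devise_invitable gem; the module list and e-mail address are illustrative, not my actual setup:

# Gemfile
gem 'devise_invitable'
# then: rails generate devise_invitable:install
#       rails generate devise_invitable User

# app/models/user.rb - add :invitable to whatever Devise modules are already there
class User < ActiveRecord::Base
  devise :invitable, :database_authenticatable, :rememberable, :validatable
end

# Rails console: create the account and send the invitation e-mail
User.invite!(:email => "tester@example.com")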

It will still be better than nothing, it will be a good learning experience, and participation is purely voluntary anyway.

One step at a time.

Quantified Awesome: Community-supported agriculture with Plan B Organic Farms, fall 2011


After a good summer season with Plan B Organic Farms, we decided to sign up for their fall season as well. This time, I made sure to weigh and track all the produce that came in. I also took notes on what we did with the produce, which made it easier to think of ways to use everything before it went to waste.

Here’s what I was curious about:

  • How much did we get?
  • What was the cost per kilo or pound?
  • How does it compare to organic produce prices at the supermarket?
  • What were the proportions like? Did they match up with our perceptions?
  • How do I feel about the different vegetables now?

How much did we get? Over the 11 distributions I tracked, we received a total of 71.6 kilograms of organic produce and a container of apple cider. This worked out to an average of 6.5kg per distribution, with a standard deviation of 1.08kg.

What was the cost per kilo or pound? Weekly half-shares cost $25, about $3.84/kg or $1.75/lb of organic produce (not including the cider).

How does it compare to organic produce prices at the supermarket? The No Frills supermarket we usually shop at doesn’t have a wide selection of organic produce, so I used prices from GroceryGateway instead. In a past analysis, I found them to be usually 10% more than No Frills prices, and there are minimum order limits and delivery fees as well. Using the prices for organic produce whenever available and guessing “bunch” weights from my data, I calculated that we received an average of $31 of produce each week (including the cider). This worked out to a savings of $6 per week, or 20% (not including taxes, delivery charges, or other purchases to meet the minimum).

Would we have bought all that produce if we weren’t part of the community-supported agriculture program? I’m not sure, but the commitment device of having a box of vegetables come into our house every week helped us improve our diet.

What were the proportions like? Did they match up with our perceptions?

(chart: produce weight by category)

I’m surprised by this, because it felt like we received a whole lot more squash and cabbage (which I’ve included in the Greens category). They were bulky and not in our usual cooking repertoire, so they were more of a challenge. We mostly managed to finish the cabbage, but we had to cut up and throw some of the squash away. The apples and tomatoes were occasionally suspect, too.

Here’s the breakdown within each category:

(chart: breakdown within each category)

On average, we received 11 different types in a distribution (standard deviation = 1.2), covering 32 different types in total. The fall box included imported items such as bananas and kiwi to fill out the selection, as well as produce grown in greenhouses.

How do I feel about the different vegetables now? After two seasons of community-supported agriculture, I’m more comfortable dealing with the increased volume of vegetables passing through our kitchen. We’ve organized the pantries with bins so that we can store all the squash and onions neatly, and we manage to get through the produce in our fridge drawers in a reasonable period of time. We waste a small fraction of the produce through inattention (apples, mostly), but have managed to convert most of the produce into good food. I’d say we’re working at 90-95% efficiency or so.

Some experimental recipes have been more fun than others. Sweet potato fries have become a favourite in the house. Baked acorn squash with brown sugar and butter is a nice winter dessert. We discovered that adding sausages to butternut squash soup makes it much easier to finish. Turnips and beets still need a lot of tweaking.

We’ve signed up for a bi-weekly winter share from Cooper’s Farm CSA in order to take advantage of delivery. We happened to start with their program in time to make a side-by-side comparison with Plan B Organic Farms, and they compared favourably (although their produce required more scrubbing). We’ll see how things work out over the next season.

Here’s my raw data.

How I tracked this: I built a small tool for tracking community-supported agriculture into my Quantified Awesome website. Every week, I weighed all the produce and typed in their names and weights. At the end of the season, I copied the data and used pivot tables in Microsoft Excel to analyze the results by category and week. I manually checked the GroceryGateway website for prices, and I used VLOOKUP to cross-reference the data with the prices.
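
If you’re curious, the summary math is simple enough to script, too. Here’s a rough Ruby sketch of the same crunching; the CSV layout and column names are made up for illustration:

require 'csv'

# produce.csv columns (hypothetical): date, item, category, weight_kg
rows = CSV.read('produce.csv', :headers => true)

total_kg = rows.inject(0) { |sum, row| sum + row['weight_kg'].to_f }
weeks = rows.map { |row| row['date'] }.uniq.size
puts "Total: %.1f kg over %d distributions" % [total_kg, weeks]
puts "Cost per kg: $%.2f" % (25.0 * weeks / total_kg)   # $25 per weekly half-share

# Weight by category, heaviest first
rows.group_by { |row| row['category'] }.
  map { |category, items| [category, items.inject(0) { |s, r| s + r['weight_kg'].to_f }] }.
  sort_by { |category, kg| -kg }.
  each { |category, kg| puts "%-10s %.1f kg" % [category, kg] }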

My input system didn’t do anything special that a spreadsheet couldn’t handle, although I liked how the weights became part of my dashboard. If you want to start tracking either community-supported agriculture or your regular groceries, you can start with a spreadsheet (Microsoft Excel or OpenOffice). Log the produce you receive or buy, and summarize them in ways that help you answer your questions. Have fun!

Learning plans and time budgets: packing things into 2012

Nudged by @catehstn‘s recommendation of my blog to @Tending2Entropy as an example of goal planning in personal life, I updated my learning plan with the things I’m planning to learn next year.

It was easy to come up with a quick outline. There are so many interesting things I want to learn. The tough part, however, was thinking about what I might actually get to do.

What does my cognitive surplus look like? I wanted to get a sense of how much discretionary time I actually had on a regular basis. I have about 20 weeks of data since I resumed time-tracking near the end of July. So that my numbers wouldn’t be thrown off by the vacation we took, I focused on the last eight weeks (graph: 2011-10-16 to 2011-12-11).

Over the eight-week period, I got an average of 3.5 hours of discretionary time per weekday and 7 hours of discretionary time per weekend day. I can simplify that to an average of 4.5 hours per day, which comes out to about 1,642 hours for 2012 (not counting vacations, which come with even more discretionary time).

Around 40% of discretionary time was used for social activities. Let’s say that another 30% is a buffer for breaks and other things that come up, leaving 30% for focused learning. That gives me a time budget of around 500 hours. I want to do more than 1,000. Hmm.
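
In Rails-console terms, the back-of-the-envelope math:

weekday, weekend = 3.5, 7.0                 # discretionary hours per day
daily = (weekday * 5 + weekend * 2) / 7     # => 4.5
yearly = daily * 365                        # => 1642.5 hours
yearly * (1 - 0.4 - 0.3)                    # => ~493 hours left for focused learning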

Prioritization is important. I can focus on the things I want the most, then see how the rest of the year shakes out. Plans will change anyway, and estimates are flexible. My first few priorities for personal learning:

  • Android development, so that I can save time syncing and get more of the data I want
  • Goal tracking (handy for keeping the rest of my time in line)
  • Behavioural change (trying small experiments)

Another way to deal with the gap is to shift more time. Over those eight weeks, tidying took about 0.7 hours a day, and cooking took about that much time too. Let’s say half of future tidying and all of future cooking is outsourceable at $20/hour. That’s an additional 384 hours a year (1.05 hours/day × 365 days) for a trade-off of $7,680 after tax, which is a large chunk of money. I’d rather save the money and let it compound for later use, especially if I time chores so that they take advantage of low energy. Besides, cooking and other chores are partly social time too.

I can shift time in other ways. For example, I can use commuting time to learn more about Emacs, Org, and Rails, so that will help too. I can also use walking time to record life stories if I can figure out a workflow for dealing with audio or short notes.

Good to know what the size of the box is, and how much I want to pack into it! Let’s see how it all works out…

Blog analysis for 2011: 173,363 words so far; also, using the Rails console to work with WordPress

How many posts did I post per month, not including this or future posts? (See the geek appendix below to find out how I got to the point of being able to run code snippets like this:)

posts = WpBlogPost.published.posts.year(2011)
posts.count(:id, :group => 'month(post_date)').sort { |a,b| a[0].to_i <=> b[0].to_i }

Result: [["1", 32], ["2", 34], ["3", 33], ["4", 33], ["5", 34], ["6", 39], ["7", 33], ["8", 33], ["9", 31], ["10", 33], ["11", 31], ["12", 8]]

This is a straightforward SQL query to write, but ActiveRecord and scopes make it more fun, and I can easily slice the data in different ways. Because I’ve connected Rails with my WordPress data, I can use all sorts of other gems. For example, Lingua::EN::Readability can give me text statistics. It’s not a gem, but it’s easy to install with the provided install.rb. Code tends to throw off my word count, so let’s get rid of HTML tags and anything in pre tags, then calculate some text statistics:

include ActionView::Helpers::SanitizeHelper
require 'lingua/en/readability'
# Needs lots of memory =)
post_bodies = posts.map { |x| strip_tags(x.post_content.gsub(/<pre.+?<\/pre>/m, '')) }
all_text = post_bodies.join("\n").downcase
report = Lingua::EN::Readability.new(all_text)
Number of words in 2011      173,363
Flesch reading ease          65.3
Gunning Fog index            11.0
Flesch-Kincaid grade level   8.4

According to this, my writing should be readable by high school seniors, although they’ll probably have to be geeks in order to be interested in the first place.

The Readability library has other handy functions, like occurrences for finding out how frequently a word shows up in your text.
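
For example, using the report object built above (all_text was downcased earlier, so the words are lowercase):

report.occurrences('zucchini')   # => 23
report.occurrences('awesome')    # => 88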

I              4375   #4 – It’s a personal blog, after all
you            1926   #9 – Not so bad
my             1555
time            933
people          897
work            710
W-              200
presentations   190
J-              133
Drupal          111
Rails            97
Emacs            77
zucchini         23   Oh, the summer of all that zucchini…

I want to get better at clear, specific descriptions. That means avoiding adjectives like ‘nice’ and hedging words like ‘really’.

really     227   Hmm, I can cut down on this
maybe      211   This one too
probably   211   Down with hedging!
awesome     88   I overuse this, but it’s a fun word
nice        15   The war on generic adjectives continues.

Let’s look at feelings:

happy / happiness / wonderful   107
busy                             33
worried / anxious / worry        30
tired                            20
excited / exciting               21
delighted                         4
suck                              4
sad                               2

I recently used the N-Gram gem to analyze the text of Homestar reviews looking for recurring phrases. I suspected that one of the contractors we were considering had salted his reviews, and unusual recurring phrases or spikes in frequency might be a tip-off. I can use the same technique to identify any pet phrases of mine.

csv = FasterCSV.open('ngrams.csv', 'w')
n_gram = NGram.new(all_text, :n => [2, 3])
csv << ["NGRAM 2"]   # rows appended to the CSV need to be arrays
n_gram.ngrams_of_all_data[2].sort { |a,b| a[1] <=> b[1] }.map { |a| csv << a };
csv << ["NGRAM 3"]
n_gram.ngrams_of_all_data[3].sort { |a,b| a[1] <=> b[1] }.map { |a| csv << a };
csv.close

The ten most common 3-word phrases on my blog tend to be related to planning and explaining. It figures. I can stop saying “a lot of”, though.

Phrase         Frequency
i want to      158
a lot of       126
so that i       94
be able to      86
that i can      76
you want to     74
one of the      68
that you can    63
in order to     55
i need to       55

Some frequent two-word phrases:

i can 425
you can 408

Two-word phrases starting with “I’m…”

i’m going 52
i’m not 29
i’m looking 25
i’m working 24
i’m learning 23
i’m sure 16
i’m thinking 15
i’m glad 14
i’m getting 12

I wonder what other questions I might ask with this data…

Geek appendix: Using the Rails Console to work with WordPress data

The Rails console is awesome. You can do all sorts of things with it, like poke around your data objects or run scripts. With a little hacking, you can even use it as a smarter interface to other databases.

For example, I decided to get rid of all the syntax formatting that Org-mode tried to do with my blog posts when I published them to WordPress. Fortunately, this was the only use of span tags in my post content, so I could zap them all with a regular expression… if I could confidently do regular expressions in the MySQL console.

In the past, I might have written a Perl script to go through my database. If desperate, I might have even written a Perl script to do a regular expression replacement on my database dump file.

Rails to the rescue! I decided that since I was likely to want to use data from my WordPress blog in my Rails-based self-tracking system anyway, I might as well connect the two.

I found some code that created ActiveRecord models for WordPress posts and comments, and I modified it to connect to a different database. I added some scopes for easier queries, too.

class WpBlogPost < ActiveRecord::Base
  establish_connection Rails.configuration.database_configuration["wordpress"]

  set_table_name "wp_posts"
  set_primary_key "ID"

  has_many :comments, :class_name => "WpBlogComment", :foreign_key => "comment_post_ID"

  def self.find_by_permalink(year, month, day, title)
    find(:first,
         :conditions => ["YEAR(post_date) = ? AND MONTH(post_date) = ? AND DAYOFMONTH(post_date) = ? AND post_name = ?",
                         year.to_i, month.to_i, day.to_i, title])
  end

  scope :posts, where("post_type='post'")
  scope :published, where("post_status='publish'")
  scope :year, lambda { |year| where("year(post_date)=?", year) }
end
# http://snippets.dzone.com/posts/show/1314
class WpBlogComment < ActiveRecord::Base
  establish_connection Rails.configuration.database_configuration["wordpress"]

  # if wordpress tables live in a different database (i.e. 'wordpress') change the following
  # line to set_table_name "wordpress.wp_comments"
  # don't forget to give the db user permissions to access the wordpress db
  set_table_name "wp_comments"
  set_primary_key "comment_ID"

  belongs_to :post , :class_name => "WpBlogPost", :foreign_key => "comment_post_ID"

  validates_presence_of :comment_post_ID, :comment_author, :comment_content, :comment_author_email

  def validate_on_create
    if WpBlogPost.find(comment_post_ID).comment_status != 'open'
      errors.add_to_base('Sorry, comments are closed for this post')
    end
  end

end

I specified the database configuration in config/database.yml, and granted my user access to the tables:

wordpress:
  adapter: mysql
  encoding: utf8
  database: wordpress_database_goes_here
  username: rails_username_goes_here

After I rigged that up, I could then run this little bit of code in Rails console to clean up all those entries.

WpBlogPost.where('post_content LIKE ?', '%<span style="color:%').each do |p|
  s = p.post_content.gsub /<span style="color:[^>]+>/, ''
  s.gsub! '</span>', ''
  p.update_attributes(:post_content => s)
end

Cleaning up subscripts (accidental use of underscore without escaping):

WpBlogPost.where('post_content LIKE ?', '%<sub>%').each do |p|
  s = p.post_content.gsub /<sub>/, '_'
  s.gsub! '</sub>', ''
  p.update_attributes(:post_content => s)
end

Now I can use all sorts of other ActiveRecord goodness when generating my statistics, like the code above.
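
For example, WordPress already stores a comment_count column on wp_posts, so a quick sketch for listing 2011’s ten most-commented posts (reusing the scopes above) could be:

WpBlogPost.published.posts.year(2011).order('comment_count DESC').limit(10).each do |p|
  puts "#{p.comment_count} #{p.post_title}"
end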