Blog analysis for 2011: 173,363 words so far; also, using the Rails console to work with WordPress

analysis, blogging, geek, review

How many posts did I publish each month, not counting this one or future posts? (See the geek appendix below for how I got to the point of being able to run snippets like this one:)

posts = WpBlogPost.published.posts.year(2011)
posts.count(:id, :group => 'month(post_date)').sort { |a,b| a[0].to_i <=> b[0].to_i }

Result: [["1", 32], ["2", 34], ["3", 33], ["4", 33], ["5", 34], ["6", 39], ["7", 33], ["8", 33], ["9", 31], ["10", 33], ["11", 31], ["12", 8]]
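
The month keys come back as strings, which is why the sort block converts them with to_i. Here's a minimal plain-Ruby sketch of the difference, using the counts from the result above. The constant and method names are made up for illustration, and the average is only a rough figure, since the word count excludes code.

```ruby
# Month/count pairs copied from the result above; the keys are strings.
MONTHLY_COUNTS = [["1", 32], ["2", 34], ["3", 33], ["4", 33], ["5", 34], ["6", 39],
                  ["7", 33], ["8", 33], ["9", 31], ["10", 33], ["11", 31], ["12", 8]]
WORDS_IN_2011 = 173_363 # word count reported in this post

# Sorting the raw string keys is lexicographic, so "10" lands before "2".
def months_sorted_as_strings(counts)
  counts.sort_by { |month, _| month }.map { |month, _| month }
end

# Converting the keys to integers gives calendar order.
def months_sorted_as_numbers(counts)
  counts.sort_by { |month, _| month.to_i }.map { |month, _| month }
end

# Total number of posts across all months.
def total_posts(counts)
  counts.inject(0) { |sum, (_, count)| sum + count }
end

# Rough average words per post, rounded to the nearest whole word.
def average_words_per_post(words, counts)
  (words.to_f / total_posts(counts)).round
end

puts months_sorted_as_strings(MONTHLY_COUNTS).inspect
puts months_sorted_as_numbers(MONTHLY_COUNTS).inspect
puts total_posts(MONTHLY_COUNTS)
puts average_words_per_post(WORDS_IN_2011, MONTHLY_COUNTS)
```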

This is a straightforward SQL query to write, but ActiveRecord and scopes make it more fun, and I can easily slice the data in different ways. Because I’ve connected Rails with my WordPress data, I can also use all sorts of Ruby libraries. For example, Lingua::EN::Readability can give me text statistics. It isn’t packaged as a gem, but it’s easy to install with the provided install.rb. Code tends to throw off my word count, so let’s get rid of HTML tags and anything in pre tags, then calculate some text statistics:

include ActionView::Helpers::SanitizeHelper
require 'lingua/en/readability'
# Needs lots of memory =)
post_bodies = posts.map { |x| strip_tags(x.post_content.gsub(/<pre.+?<\/pre>/m, '')) }
all_text = post_bodies.join("\n").downcase
report = Lingua::EN::Readability.new(all_text)

Number of words in 2011: 173,363
Flesch reading ease: 65.3
Gunning Fog index: 11.0
Flesch-Kincaid grade level: 8.4

According to this, my writing should be readable by high school students (the grade-level scores fall between roughly grade 8 and grade 11), although they’ll probably have to be geeks in order to be interested in the first place.
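
For a sense of where those scores come from, here's a minimal sketch of the standard published formulas. It takes counts as input instead of text, because counting syllables is the hard part and the library's own counting may differ. The method names are made up for illustration and are not the library's API.

```ruby
# Flesch reading ease: higher scores mean easier text.
def flesch_reading_ease(words, sentences, syllables)
  206.835 - 1.015 * (words.to_f / sentences) - 84.6 * (syllables.to_f / words)
end

# Flesch-Kincaid grade level: approximate US school grade.
def flesch_kincaid_grade(words, sentences, syllables)
  0.39 * (words.to_f / sentences) + 11.8 * (syllables.to_f / words) - 15.59
end

# Gunning Fog index: complex_words is the number of words with three or more syllables.
def gunning_fog(words, sentences, complex_words)
  0.4 * ((words.to_f / sentences) + 100.0 * complex_words / words)
end

# Made-up counts: 100 words, 5 sentences, 150 syllables, 10 complex words.
puts flesch_reading_ease(100, 5, 150)
puts flesch_kincaid_grade(100, 5, 150)
puts gunning_fog(100, 5, 10)
```

Longer sentences and longer words push the reading ease down and the grade levels up, so shorter sentences are the quickest lever.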

The Readability library has other handy functions, like occurrences for finding out how frequently a word shows up in your text.

I 4375 #4 – It’s a personal blog, after all
you 1926 #9 – Not so bad
my 1555
time 933
people 897
work 710
W- 200
presentations 190
J- 133
Drupal 111
Rails 97
Emacs 77
zucchini 23 Oh, the summer of all that zucchini…
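
If you don't have the library handy, a word tally like the one above is easy to approximate in plain Ruby. This is only a sketch: the tokenizing regular expression is an assumption (lowercase letters and straight apostrophes), so its counts won't exactly match the library's, and the method names are made up for illustration.

```ruby
# Count how often each word appears, ignoring case.
def word_counts(text)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z']+/) { |word| counts[word] += 1 }
  counts
end

# 1-based rank of a word by frequency (ties broken alphabetically), or nil if absent.
def rank_of(counts, word)
  ranked = counts.sort_by { |w, count| [-count, w] }.map { |w, _| w }
  index = ranked.index(word)
  index && index + 1
end

# Made-up sample text
sample_counts = word_counts("I grew zucchini. I ate zucchini, and I really liked it.")
puts sample_counts["zucchini"]
puts rank_of(sample_counts, "i")
```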

I want to get better at clear, specific descriptions. That means avoiding generic adjectives like ‘nice’, intensifiers like ‘really’, and hedging words like ‘maybe’.

really 227 Hmm, I can cut down on this
maybe 211 This one too
probably 211 Down with hedging!
awesome 88 I overuse this, but it’s a fun word
nice 15 The war on generic adjectives continues.

Let’s look at feelings:

happy / happiness / wonderful 107
busy 33
worried / anxious / worry 30
tired 20
excited / exciting 21
delighted 4
suck 4
sad 2
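
Grouped counts like "happy / happiness / wonderful" are just sums over related words. Here's a plain-Ruby sketch of that idea. The group labels and the sample sentence are made up, and it matches exact words only, so "happier" would not count toward "happy".

```ruby
# Hypothetical groups of related feeling words, keyed by a label.
FEELING_GROUPS = {
  "happy"   => %w[happy happiness wonderful],
  "worried" => %w[worried anxious worry],
  "excited" => %w[excited exciting]
}

# Sum the occurrences of every word in each group.
def group_totals(text, groups)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z']+/) { |word| counts[word] += 1 }
  totals = {}
  groups.each do |label, words|
    totals[label] = words.inject(0) { |sum, word| sum + counts[word] }
  end
  totals
end

puts group_totals("Happy day! Such happiness. I worry less, and I'm excited.", FEELING_GROUPS).inspect
```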

I recently used the N-Gram gem to analyze the text of Homestar reviews looking for recurring phrases. I suspected that one of the contractors we were considering had salted his reviews, and unusual recurring phrases or spikes in frequency might be a tip-off. I can use the same technique to identify any pet phrases of mine.

# Write the two- and three-word phrase counts to a CSV file, least frequent first
FasterCSV.open('ngrams.csv', 'w') do |csv|
  n_gram = NGram.new(all_text, :n => [2, 3])
  [2, 3].each do |n|
    csv << ["NGRAM #{n}"] # FasterCSV rows are arrays
    n_gram.ngrams_of_all_data[n].sort_by { |phrase, count| count }.each { |row| csv << row }
  end
end

The ten most common 3-word phrases on my blog tend to be related to planning and explaining. It figures. I can stop saying “a lot of”, though.

Phrase Frequency
i want to 158
a lot of 126
so that i 94
be able to 86
that i can 76
you want to 74
one of the 68
that you can 63
in order to 55
i need to 55

Some frequent two-word phrases:

i can 425
you can 408

Two-word phrases starting with “I’m…”

i’m going 52
i’m not 29
i’m looking 25
i’m working 24
i’m learning 23
i’m sure 16
i’m thinking 15
i’m glad 14
i’m getting 12
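
For anyone curious what the phrase counting boils down to, here's a plain-Ruby sketch that doesn't need the N-Gram gem. It is not how the gem works internally. The tokenizer is an assumption (lowercase letters and straight apostrophes only), phrases are allowed to run across sentence boundaries, and the method names are made up for illustration.

```ruby
# Count every run of n consecutive words in the text.
def ngram_counts(text, n)
  words = text.downcase.scan(/[a-z']+/)
  counts = Hash.new(0)
  words.each_cons(n) { |gram| counts[gram.join(' ')] += 1 }
  counts
end

# Most frequent phrases first, ties broken alphabetically.
def top_phrases(counts, limit = 10)
  counts.sort_by { |phrase, count| [-count, phrase] }.first(limit)
end

# Keep only the phrases whose first word is the given prefix.
def phrases_starting_with(counts, prefix)
  counts.select { |phrase, _| phrase.start_with?("#{prefix} ") }
end

# Made-up sample text
sample = "I want to write. I want to draw. I'm going to share; I'm going home."
puts top_phrases(ngram_counts(sample, 3), 1).inspect
puts phrases_starting_with(ngram_counts(sample, 2), "i'm").inspect
```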

I wonder what other questions I might ask with this data…

Geek appendix: Using the Rails Console to work with WordPress data

The Rails console is awesome. You can do all sorts of things with it, like poke around your data objects or run scripts. With a little hacking, you can even use it as a smarter interface to other databases.

For example, I decided to get rid of all the syntax colouring that Org-mode added to my blog posts when I published them to WordPress. Fortunately, this was the only use of span tags in my post content, so I could zap them all with a regular expression… if I could confidently do regular expressions in the MySQL console.

In the past, I might have written a Perl script to go through my database. If desperate, I might have even written a Perl script to do a regular expression replacement on my database dump file.

Rails to the rescue! I decided that since I was likely to want to use data from my WordPress blog in my Rails-based self-tracking system anyway, I might as well connect the two.

I found some code that created ActiveRecord models for WordPress posts and comments, and I modified it to connect to a different database. I added some scopes for easier queries, too.

class WpBlogPost < ActiveRecord::Base
  establish_connection Rails.configuration.database_configuration["wordpress"]

  set_table_name "wp_posts"
  set_primary_key "ID"

  has_many :comments, :class_name => "WpBlogComment", :foreign_key => "comment_post_ID"

  def self.find_by_permalink(year, month, day, title)
    find(:first,
         :conditions => ["YEAR(post_date) = ? AND MONTH(post_date) = ? AND DAYOFMONTH(post_date) = ? AND post_name = ?",
                         year.to_i, month.to_i, day.to_i, title])
  end

  scope :posts, where("post_type='post'")
  scope :published, where("post_status='publish'")
  scope :year, lambda { |year| where("year(post_date)=?", year) }
end
# http://snippets.dzone.com/posts/show/1314
class WpBlogComment < ActiveRecord::Base
  establish_connection Rails.configuration.database_configuration["wordpress"]

  # if wordpress tables live in a different database (i.e. 'wordpress') change the following
  # line to set_table_name "wordpress.wp_comments"
  # don't forget to give the db user permissions to access the wordpress db
  set_table_name "wp_comments"
  set_primary_key "comment_ID"

  belongs_to :post, :class_name => "WpBlogPost", :foreign_key => "comment_post_ID"

  validates_presence_of :comment_post_ID, :comment_author, :comment_content, :comment_author_email

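  # Rails 3 style: register the check as a validation that runs only on create
  validate :post_must_be_open_for_comments, :on => :create

  def post_must_be_open_for_comments
    if WpBlogPost.find(comment_post_ID).comment_status != 'open'
      errors.add(:base, 'Sorry, comments are closed for this post')
    end
  end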

end

I specified the database configuration in config/database.yml, and granted my user access to the tables:

wordpress:
  adapter: mysql
  encoding: utf8
  database: wordpress_database_goes_here
  username: rails_username_goes_here

After I rigged that up, I could run this little bit of code in the Rails console to clean up all those entries.

WpBlogPost.where('post_content LIKE ?', '%<span style="color:%').each do |post|
  # Remove the opening colour spans first, then the closing tags
  s = post.post_content.gsub(/<span style="color:[^>]+>/, '')
  s.gsub!('</span>', '')
  post.update_attributes(:post_content => s)
end
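
Before running a cleanup like this against real posts, it can help to try the regular expressions on a sample string. Here's a small sketch of that dry run. The method name and the sample HTML are made up, and like the cleanup above it assumes colour spans are the only span tags in the content.

```ruby
# Strip colour spans from a string, leaving the text inside them.
def strip_color_spans(html)
  html.gsub(/<span style="color:[^>]+>/, '').gsub('</span>', '')
end

# Made-up sample of the kind of markup being removed
sample_html = '<p><span style="color: #a020f0;">def</span> foo</p>'
puts strip_color_spans(sample_html)
```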

Cleaning up subscripts (accidental use of underscore without escaping):

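WpBlogPost.where('post_content LIKE ?', '%<sub>%').each do |post|
  # Turn each opening tag back into the underscore it came from, then drop the closing tag
  s = post.post_content.gsub(/<sub>/, '_')
  s.gsub!('</sub>', '')
  post.update_attributes(:post_content => s)
end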
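
The same kind of dry run works for the subscript cleanup. This sketch assumes the closing tag should be removed in full and that every sub tag came from an unescaped underscore. The method name and sample text are made up.

```ruby
# Put back the underscore that was turned into a subscript, and drop the closing tag.
def restore_underscores(html)
  html.gsub('<sub>', '_').gsub('</sub>', '')
end

# Made-up sample: "post_content" that was exported as a subscript
puts restore_underscores('Update post<sub>content</sub> for each row')
```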

Now I can use all sorts of other ActiveRecord goodness when generating my statistics, like the code above.
