December 9, 2011

Blog analysis for 2011: 173,363 words so far; also, using the Rails console to work with WordPress

December 9, 2011 - Categories: analysis, blogging, geek, review

How many posts did I post per month, not including this or future posts? (See the geek appendix below to find out how I got to the point of being able to run code snippets like this:)

posts = WpBlogPost.published.posts.year(2011)
posts.count(:id, :group => 'month(post_date)').sort { |a,b| a[0].to_i <=> b[0].to_i }

Result: [["1", 32], ["2", 34], ["3", 33], ["4", 33], ["5", 34], ["6", 39], ["7", 33], ["8", 33], ["9", 31], ["10", 33], ["11", 31], ["12", 8]]

This is a straightforward SQL query to write, but ActiveRecord and scopes make it more fun, and I can easily slice the data in different ways. Becuase I've connected Rails with my WordPress data, I can use all sorts of other gems. For example, Lingua::EN::Readability can give me text statistics. It's not a gem, but it's easy to install with the provided install.rb. Code tends to throw off my word count, so let's get rid of HTML tags and anything in pre tags, then calculate some text statistics:

include ActionView::Helpers::SanitizeHelper
require 'lingua/en/readability'
# Needs lots of memory =)
post_bodies = posts.map { |x| strip_tags(x.post_content.gsub(/<pre.+?<\/pre>/m, '')) }
all_text = post_bodies.join("\n").downcase
report = Lingua::EN::Readability.new(all_text)
Number of words in 2011173,363
Flesch reading ease65.3
Gunning Fog index11.0
Flesch-Kincaid grade level8.4

According to this, my writing should be readable by high school seniors, although they'll probably have to be geeks in order to be interested in the first place.

The Readability library has other handy functions, like occurrences for finding out how frequently a word shows up in your text.

I4375#4 - It's a personal blog, after all
you1926#9 - Not so bad
my1555
time933
people897
work710
W-200
presentations190
J-133
Drupal111
Rails97
Emacs77
zucchini23Oh, the summer of all that zucchini…

I want to get better at clear, specific descriptions. That means avoiding adjectives like 'nice' and hedging words like 'really'.

really227Hmm, I can cut down on this
maybe211This one too
probably211Down with hedging!
awesome88I overuse this, but it's a fun word
nice15The war on generic adjectives continues.

Let's look at feelings:

happy / happiness / wonderful107
busy33
worried / anxious / worry30
tired20
excited / exciting21
delighted4
suck4
sad2

I recently used the N-Gram gem to analyze the text of Homestar reviews looking for recurring phrases. I suspected that one of the contractors we were considering had salted his reviews, and unusual recurring phrases or spikes in frequency might be a tip-off. I can use the same technique to identify any pet phrases of mine.

csv = FasterCSV.open('ngrams.csv', 'w')
n_gram = NGram.new(all_text, :n => [2, 3])
csv << "NGRAM 2"
n_gram.ngrams_of_all_data[2].sort { |a,b| a[1] <=> b[1] }.map { |a| csv << a };
csv << "NGRAM 3"
n_gram.ngrams_of_all_data[3].sort { |a,b| a[1] <=> b[1] }.map { |a| csv << a };
csv.close

The ten most common 3-word phrases on my blog tend to be related to planning and explaining. It figures. I can stop saying "a lot of", though.

PhraseFrequency
i want to158
a lot of126
so that i94
be able to86
that i can76
you want to74
one of the68
that you can63
in order to55
i need to55

Some frequent two-word phrases:

i can425
you can408

Two-word phrases starting with "I'm…"

i'm going52
i'm not29
i'm looking25
i'm working24
i'm learning23
i'm sure16
i'm thinking15
i'm glad14
i'm getting12

I wonder what other questions I might ask with this data…

Geek appendix: Using the Rails Console to work with WordPress data

The Rails console is awesome. You can do all sorts of things with it, like poke around your data objects or run scripts. With a little hacking, you can even use it as a smarter interface to other databases.

For example, I decided to get rid of all the syntax formatting that Org-mode tried to do with my blog posts when I published them to WordPress. Fortunately, this was the only use of span tags in my post content, so I could zap them all with a regular expression… if I could confidently do regular expressions in the MySQL console.

In the past, I might have written a Perl script to go through my database. If desperate, I might have even written a Perl script to do a regular expression replacement on my database dump file.

Rails to the rescue! I decided that since I was likely to want to use data from my WordPress blog in my Rails-based self-tracking system anyway, I might as well connect the two.

I found some code that created ActiveRecord models for WordPress posts and comments, and I modified it to connect to a different database. I added some scopes for easier queries, too.

class WpBlogPost < ActiveRecord::Base
  establish_connection Rails.configuration.database_configuration["wordpress"]

  set_table_name "wp_posts"
  set_primary_key "ID"

  has_many :comments, :class_name => "WpBlogComment", :foreign_key => "comment_post_ID"

  def self.find_by_permalink(year, month, day, title)
    find(:first,
         :conditions => ["YEAR(post_date) = ? AND MONTH(post_date) = ? AND DAYOFMONTH(post_date) = ? AND post_name = ?",
                         year.to_i, month.to_i, day.to_i, title])
  end

  scope :posts, where("post_type='post'")
  scope :published, where("post_status='publish'")
  scope :year, lambda { |year| where("year(post_date)=?", year) }
end
# http://snippets.dzone.com/posts/show/1314
class WpBlogComment < ActiveRecord::Base
  establish_connection Rails.configuration.database_configuration["wordpress"]

  # if wordpress tables live in a different database (i.e. 'wordpress') change the following
  # line to set_table_name "wordpress.wp_comments"
  # don't forget to give the db user permissions to access the wordpress db
  set_table_name "wp_comments"
  set_primary_key "comment_ID"

  belongs_to :post , :class_name => "WpBlogPost", :foreign_key => "comment_post_ID"

  validates_presence_of :comment_post_ID, :comment_author, :comment_content, :comment_author_email

  def validate_on_create
    if WpBlogPost.find(comment_post_ID).comment_status != 'open'
      errors.add_to_base('Sorry, comments are closed for this post')
    end
  end

end

I specified the database configuration in config/database.yml, and granted my user access to the tables:

wordpress:
  adapter: mysql
  encoding: utf8
  database: wordpress_database_goes_here
  username: rails_username_goes_here

After I rigged that up, I could then run this little bit of code in Rails console to clean up all those entries.

WpBlogPost.where('post_content LIKE ?', '%<span style="color:%').each do |p|
  s = p.post_content.gsub /<span style="color:[^>]+>/, ''
  s.gsub! '</span>', ''
  p.update_attributes(:post_content => s)
end

Cleaning up subscripts (accidental use of underscore without escaping):

WpBlogPost.where('post_content LIKE ?', '%<sub>%').each do |p|
  s = p.post_content.gsub /<sub>/, '_'
  s.gsub! '</sub', ''
  p.update_attributes(:post_content => s)
end

Now I can use all sorts of other ActiveRecord goodness when generating my statistics, like the code above.