
Blog analysis for 2011: 173,363 words so far; also, using the Rails console to work with WordPress

How many posts did I publish each month, not counting this post or anything after it? (The geek appendix below explains how I got to the point of being able to run code snippets like this.)

posts = WpBlogPost.published.posts.year(2011)
posts.count(:id, :group => 'month(post_date)').sort { |a,b| a[0].to_i <=> b[0].to_i }

Result: [["1", 32], ["2", 34], ["3", 33], ["4", 33], ["5", 34], ["6", 39], ["7", 33], ["8", 33], ["9", 31], ["10", 33], ["11", 31], ["12", 8]]
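
The to_i in that sort matters: the month keys come back as strings, and strings sort alphabetically rather than numerically. Here is a minimal plain-Ruby sketch using the result above (no Rails needed; sort_by is just another way to spell the same comparison):

```ruby
# Monthly counts copied from the result above: [month as string, number of posts].
counts = [["1", 32], ["2", 34], ["3", 33], ["4", 33], ["5", 34], ["6", 39],
          ["7", 33], ["8", 33], ["9", 31], ["10", 33], ["11", 31], ["12", 8]]

# Sorting on the raw string keys puts "10", "11" and "12" ahead of "2".
string_order = counts.sort_by { |month, _| month }.map(&:first)

# Converting each key to an integer first restores calendar order.
calendar_order = counts.sort_by { |month, _| month.to_i }.map(&:first)

# Total number of posts across all twelve entries.
total_posts = counts.sum { |_, n| n }
```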

This is a straightforward SQL query to write, but ActiveRecord and scopes make it more fun, and I can easily slice the data in different ways. Because I’ve connected Rails with my WordPress data, I can use all sorts of other gems. For example, Lingua::EN::Readability can give me text statistics. It’s not a gem, but it’s easy to install with the provided install.rb. Code tends to throw off my word count, so let’s get rid of HTML tags and anything in pre tags, then calculate some text statistics:

include ActionView::Helpers::SanitizeHelper
require 'lingua/en/readability'
# Needs lots of memory =)
post_bodies = posts.map { |x| strip_tags(x.post_content.gsub(/<pre.+?<\/pre>/m, '')) }
all_text = post_bodies.join("\n").downcase
report = Lingua::EN::Readability.new(all_text)

Result:

Number of words in 2011: 173,363
Flesch reading ease: 65.3
Gunning Fog index: 11.0
Flesch-Kincaid grade level: 8.4

According to this, my writing should be readable by high school seniors, although they’ll probably have to be geeks in order to be interested in the first place.
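
For anyone curious what goes into those scores, here is a rough sketch of the standard published formulas in plain Ruby. It is not the library's code: the library counts sentences and syllables in its own way, so hand-computed numbers will not match it exactly. The method names and the sample counts are made up for illustration.

```ruby
# Standard readability formulas. All three take simple counts from the text.

# Flesch reading ease: higher scores mean easier reading.
def flesch_reading_ease(words, sentences, syllables)
  206.835 - 1.015 * (words.to_f / sentences) - 84.6 * (syllables.to_f / words)
end

# Flesch-Kincaid grade level: an approximate US school grade.
def flesch_kincaid_grade(words, sentences, syllables)
  0.39 * (words.to_f / sentences) + 11.8 * (syllables.to_f / words) - 15.59
end

# Gunning Fog index: complex_words are words with three or more syllables.
def gunning_fog(words, sentences, complex_words)
  0.4 * ((words.to_f / sentences) + 100.0 * complex_words / words)
end

# Hypothetical counts for illustration only (not measured from the blog).
words, sentences, syllables, complex_words = 1000, 60, 1500, 120
puts flesch_reading_ease(words, sentences, syllables).round(1)
puts flesch_kincaid_grade(words, sentences, syllables).round(1)
puts gunning_fog(words, sentences, complex_words).round(1)
```

All three lean on average sentence length, so shorter sentences move every score toward "easier".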

The Readability library has other handy functions, like occurrences for finding out how frequently a word shows up in your text.

I 4375 #4 – It’s a personal blog, after all
you 1926 #9 – Not so bad
my 1555
time 933
people 897
work 710
W- 200
presentations 190
J- 133
Drupal 111
Rails 97
Emacs 77
zucchini 23 Oh, the summer of all that zucchini…
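
If you don't have the library handy, a similar tally can be sketched in plain Ruby with a hash. This is a stand-in, not how the library counts: the word pattern below is an assumption (letters, digits, apostrophes and hyphens, so that names like W- stay in one piece), and the sample sentence is invented.

```ruby
# Count how often each word appears, ignoring case.
def word_counts(text)
  counts = Hash.new(0) # missing words default to a count of zero
  # Treat runs of letters, digits, apostrophes and hyphens as words.
  text.downcase.scan(/[a-z0-9'’-]+/) { |word| counts[word] += 1 }
  counts
end

sample = "I like zucchini. You like zucchini too, and I grow it."
counts = word_counts(sample)

# Most frequent words first, with ties broken alphabetically.
top = counts.sort_by { |word, n| [-n, word] }.first(3)
top.each { |word, n| puts "#{word} #{n}" }
```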

I want to get better at clear, specific descriptions. That means avoiding adjectives like ‘nice’ and hedging words like ‘really’.

really 227 Hmm, I can cut down on this
maybe 211 This one too
probably 211 Down with hedging!
awesome 88 I overuse this, but it’s a fun word
nice 15 The war on generic adjectives continues.

Let’s look at feelings:

happy / happiness / wonderful 107
busy 33
worried / anxious / worry 30
tired 20
excited / exciting 21
delighted 4
suck 4
sad 2

I recently used the N-Gram gem to analyze the text of Homestar reviews looking for recurring phrases. I suspected that one of the contractors we were considering had salted his reviews, and unusual recurring phrases or spikes in frequency might be a tip-off. I can use the same technique to identify any pet phrases of mine.

csv = FasterCSV.open('ngrams.csv', 'w')
n_gram = NGram.new(all_text, :n => [2, 3])
# Every row is an array: a one-cell section header, then [phrase, count] pairs sorted by count
csv << ["NGRAM 2"]
n_gram.ngrams_of_all_data[2].sort { |a,b| a[1] <=> b[1] }.each { |a| csv << a }
csv << ["NGRAM 3"]
n_gram.ngrams_of_all_data[3].sort { |a,b| a[1] <=> b[1] }.each { |a| csv << a }
csv.close
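
FasterCSV became Ruby's standard CSV library in 1.9, so on a newer Ruby the same kind of export might look like the sketch below. The three rows are borrowed from the phrase table in this post as stand-ins for the n-gram output, and the sketch sorts most-frequent-first, which is a choice made here rather than what the snippet above does.

```ruby
require 'csv'

# Three [phrase, count] pairs standing in for the n-gram output.
trigrams = [["a lot of", 126], ["i want to", 158], ["be able to", 86]]

# Most frequent phrases first.
sorted = trigrams.sort_by { |_, count| -count }

# CSV.generate builds the CSV text in memory; CSV.open(path, 'w') accepts
# the same kind of block when writing straight to a file.
csv_text = CSV.generate do |csv|
  csv << ["NGRAM 3"]                # section header as a one-cell row
  sorted.each { |row| csv << row }  # one [phrase, count] row per line
end

puts csv_text
```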

The ten most common 3-word phrases on my blog tend to be related to planning and explaining. It figures. I can stop saying “a lot of”, though.

Phrase Frequency
i want to 158
a lot of 126
so that i 94
be able to 86
that i can 76
you want to 74
one of the 68
that you can 63
in order to 55
i need to 55
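
The idea behind a count like this can be sketched in a few lines of plain Ruby using each_cons. This illustrates the technique, not the N-Gram gem's internals; the word pattern and the sample text are assumptions.

```ruby
# Count every run of n consecutive words in the text.
def ngram_counts(text, n)
  words = text.downcase.scan(/[a-z0-9'’]+/)
  counts = Hash.new(0)
  # each_cons slides a window of n words across the list, one step at a time.
  words.each_cons(n) { |gram| counts[gram.join(' ')] += 1 }
  counts
end

sample = "I want to write. I want to draw. I need to sleep."
trigrams = ngram_counts(sample, 3)

# The single most frequent three-word phrase and its count.
top = trigrams.max_by { |_, count| count }
puts top.inspect
```

Because punctuation is thrown away, the windows run across sentence boundaries ("to write i" counts as a phrase), which adds some noise among the rarer entries.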

Some frequent two-word phrases:

i can 425
you can 408

Two-word phrases starting with “I’m…”

i’m going 52
i’m not 29
i’m looking 25
i’m working 24
i’m learning 23
i’m sure 16
i’m thinking 15
i’m glad 14
i’m getting 12
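
A list like this can also be pulled straight from the text with a regular expression, without building every bigram first. This is a minimal sketch on invented sample text. It accepts both straight and curly apostrophes, since blog text often has a mix.

```ruby
# Tally the word that follows "i'm", accepting straight or curly apostrophes.
def im_phrases(text)
  counts = Hash.new(0)
  # With one capture group, scan yields a one-element array per match.
  text.downcase.scan(/\bi['’]m\s+([a-z]+)/) do |match|
    counts["i'm #{match.first}"] += 1
  end
  counts
end

sample = "I'm going home. I’m going out later. I'm not sure."
counts = im_phrases(sample)
counts.sort_by { |_, n| -n }.each { |phrase, n| puts "#{phrase} #{n}" }
```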

I wonder what other questions I might ask with this data…

Geek appendix: Using the Rails Console to work with WordPress data

The Rails console is awesome. You can do all sorts of things with it, like poke around your data objects or run scripts. With a little hacking, you can even use it as a smarter interface to other databases.

For example, I decided to get rid of all the syntax formatting that Org-mode tried to do with my blog posts when I published them to WordPress. Fortunately, this was the only use of span tags in my post content, so I could zap them all with a regular expression… if I could confidently do regular expressions in the MySQL console.

In the past, I might have written a Perl script to go through my database. If desperate, I might have even written a Perl script to do a regular expression replacement on my database dump file.

Rails to the rescue! I decided that since I was likely to want to use data from my WordPress blog in my Rails-based self-tracking system anyway, I might as well connect the two.

I found some code that created ActiveRecord models for WordPress posts and comments, and I modified it to connect to a different database. I added some scopes for easier queries, too.

class WpBlogPost < ActiveRecord::Base
  establish_connection Rails.configuration.database_configuration["wordpress"]

  set_table_name "wp_posts"
  set_primary_key "ID"

  has_many :comments, :class_name => "WpBlogComment", :foreign_key => "comment_post_ID"

  def self.find_by_permalink(year, month, day, title)
    find(:first,
         :conditions => ["YEAR(post_date) = ? AND MONTH(post_date) = ? AND DAYOFMONTH(post_date) = ? AND post_name = ?",
                         year.to_i, month.to_i, day.to_i, title])
  end

  scope :posts, where("post_type='post'")
  scope :published, where("post_status='publish'")
  scope :year, lambda { |year| where("year(post_date)=?", year) }
end
# http://snippets.dzone.com/posts/show/1314
class WpBlogComment < ActiveRecord::Base
  establish_connection Rails.configuration.database_configuration["wordpress"]

  # if wordpress tables live in a different database (i.e. 'wordpress') change the following
  # line to set_table_name "wordpress.wp_comments"
  # don't forget to give the db user permissions to access the wordpress db
  set_table_name "wp_comments"
  set_primary_key "comment_ID"

  belongs_to :post , :class_name => "WpBlogPost", :foreign_key => "comment_post_ID"

  validates_presence_of :comment_post_ID, :comment_author, :comment_content, :comment_author_email

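  # Rails 3 style: register the check explicitly so it runs when a comment is created
  validate :post_must_accept_comments, :on => :create

  def post_must_accept_comments
    if WpBlogPost.find(comment_post_ID).comment_status != 'open'
      errors.add(:base, 'Sorry, comments are closed for this post')
    end
  end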

end

I specified the database configuration in config/database.yml, and granted my user access to the tables:

wordpress:
  adapter: mysql
  encoding: utf8
  database: wordpress_database_goes_here
  username: rails_username_goes_here

After I rigged that up, I could then run this little bit of code in Rails console to clean up all those entries.

WpBlogPost.where('post_content LIKE ?', '%<span style="color:%').each do |p|
  s = p.post_content.gsub /<span style="color:[^>]+>/, ''
  s.gsub! '</span>', ''
  p.update_attributes(:post_content => s)
end
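
Before running something like this against real posts, it can help to try the substitutions on a sample string first. Here is a minimal sketch in plain Ruby; the helper name and the sample markup are made up for illustration.

```ruby
# Remove colour spans from one post body, keeping the text inside them.
def strip_color_spans(html)
  html.gsub(/<span style="color:[^>]+>/, '') # opening tags with a colour style
      .gsub('</span>', '')                   # every closing span tag
end

sample = 'puts <span style="color: #8b2252;">"hello"</span> if ' \
         '<span style="color: #a020f0;">true</span>'
cleaned = strip_color_spans(sample)
puts cleaned
```

The second substitution removes every closing span tag, not just the ones paired with colour spans. That is only safe here because these were the only spans in the post content.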

Cleaning up subscripts (accidental use of underscore without escaping):

WpBlogPost.where('post_content LIKE ?', '%<sub>%').each do |p|
  s = p.post_content.gsub /<sub>/, '_'
  s.gsub! '</sub>', ''
  p.update_attributes(:post_content => s)
end
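
The same substitution can be tried on a sample string, with the closing tag matched in full so that no stray ">" is left behind. The helper name and the sample text are assumptions for illustration.

```ruby
# Turn accidental subscripts back into the underscores they came from.
def restore_underscores(html)
  html.gsub('<sub>', '_').gsub('</sub>', '')
end

sample = 'Set org<sub>export</sub> and my<sub>var</sub> first.'
restored = restore_underscores(sample)
puts restored
```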

Now I can use all sorts of other ActiveRecord goodness when generating my statistics, like the code above.

Short URL: http://sachachua.com/blog/p/22664
