Cleaning up HTML from Microsoft Word


I often see HTML pasted in from Microsoft Word. It carries a lot of non-standard, Word-specific markup, which sometimes breaks our systems. It’s also hard to edit afterwards.

An easy way to clean that up is to paste it into Windows Live Writer using Edit > Paste Special > Thinned HTML, which removes most of the Microsoft Word extras while leaving the basic formatting in place. You can then copy and paste the result into the blog/wiki editor. You can also use View > Source to get the HTML source code, which you can paste into the HTML mode of the blog/wiki editor.
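If you'd rather script it, here is a minimal sketch in Python of the same idea: drop the Word-only markup and keep the basic formatting tags. This is not what Windows Live Writer does internally. The function name `thin_word_html` and the list of things to strip (conditional comments, `<style>` and `<xml>` blocks, Office-namespaced tags like `<o:p>`, `class`/`style`/`lang` attributes, and `<span>` wrappers) are my own assumptions about what counts as "Word extras". It uses regular expressions only, so it will trip over unusual markup such as a `>` inside an attribute value.

```python
import re

# Hypothetical helper: the name and the exact rules are illustrative assumptions.
_COND_COMMENT = re.compile(r"<!--\[if.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE)
_STYLE_OR_XML = re.compile(r"<(style|xml)\b.*?</\1\s*>", re.DOTALL | re.IGNORECASE)
_NAMESPACED_TAG = re.compile(r"</?[a-z]+:[^>]*>", re.IGNORECASE)
_ANY_TAG = re.compile(r"<[^>]+>")
_WORD_ATTR = re.compile(
    r"""\s+(?:class|style|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)
_SPAN_TAG = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)


def thin_word_html(html: str) -> str:
    """Strip common Microsoft Word markup, keeping basic formatting tags."""
    # Remove conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = _COND_COMMENT.sub("", html)
    # Remove embedded <style> and <xml> blocks, including their contents
    html = _STYLE_OR_XML.sub("", html)
    # Remove Office-namespaced tags such as <o:p>, keeping any text inside them
    html = _NAMESPACED_TAG.sub("", html)
    # Inside each remaining tag, remove class, style, and lang attributes
    html = _ANY_TAG.sub(lambda m: _WORD_ATTR.sub("", m.group(0)), html)
    # Remove <span> wrappers, keeping their contents
    html = _SPAN_TAG.sub("", html)
    return html.strip()


if __name__ == "__main__":
    sample = (
        '<p class=MsoNormal style="margin:0in">'
        "<span lang=EN-US>Hello <b>world</b><o:p></o:p></span></p>"
    )
    print(thin_word_html(sample))
```

Bold, italics, links, and paragraphs pass through untouched, because only the attributes and tags listed above are removed.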

Hope that helps!

You can view 5 comments or e-mail me at sacha@sachachua.com.

5 comments

Archimedes Trajano

2010-10-08T18:09:06Z

I usually do Paste Special -> Text only. Or, worst case, I copy and paste into Notepad, then copy and paste from there. However, your tip is good if the target object has support for HTML.

I like to use CleanHaven (http://www.holymackerelsoft... - it's cross-platform and lets you adjust a lot more than just formatting...

Samuel Driessen

2010-10-08T23:57:58Z

Good tips, Sacha! I use them as well. I recently learned you can use Ctrl-C and then Ctrl-Shift-V in Chrome to paste without the extras. Works great!

DUDE! (sorry - I can't help but call you that... - perhaps I should have said 'dudette')

I was so thinking you were gonna put up a sed script, or at least a tidy config, or even: tidy --word-2000 -m msword.html

I shudder at the thought of using more Microsoft software than I absolutely have to.

Ramon

Heh. It's a simple thing to do with tidy, but in this case, I do have to help a number of people who are on Microsoft Windows and who may boggle at command lines. If I can help them strip out the HTML using whichever tool they feel comfortable using, then I have to deal with fewer broken posts. =)

If it weren't already built in to tidy, I'd probably reach for Emacs Lisp rather than sed... ;)