Cleaning up HTML from Microsoft Word
I often see HTML pasted in from Microsoft Word. It has a lot of non-standard and irrelevant code in it, so sometimes it breaks our systems. It’s also hard to edit afterwards.
An easy way to clean that up is to paste it into Windows Live Writer using Edit > Paste Special > Thinned HTML, which removes most of the Microsoft Word extras while leaving the basic formatting in place. You can then copy-and-paste it into the blog/wiki editor. You can also use View > Source to get the HTML source code, which you can paste into the HTML mode of the blog/wiki editor.
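If you are comfortable with a little scripting, here is a rough Python sketch of the same kind of cleanup. It is only an approximation of what a "thinned" paste might do, not how Windows Live Writer actually works. The function name thin_word_html and the list of things it strips are my own assumptions, and regular expressions are a blunt tool for HTML, so treat it as a starting point for simple snippets rather than a robust cleaner.

```python
import re

# Attributes that Word tends to add and that basic formatting does not need.
# (This list is an assumption for illustration, not an exhaustive one.)
WORD_ATTRS = re.compile(
    r"""\s+(?:class|style|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.I,
)


def thin_word_html(html: str) -> str:
    """Roughly strip Word-specific markup, keeping basic formatting tags."""
    # Drop conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html, flags=re.S | re.I)
    # Drop any remaining ordinary comments
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    # Drop <style> and <xml> blocks along with their contents
    html = re.sub(r"<(style|xml)\b.*?</\1\s*>", "", html, flags=re.S | re.I)
    # Drop <meta> and <link> tags
    html = re.sub(r"<(?:meta|link)\b[^>]*>", "", html, flags=re.I)
    # Drop namespaced tags such as <o:p> and </o:p>, keeping any text inside
    html = re.sub(r"</?\w+:\w+[^>]*>", "", html)
    # Remove class, style, and lang attributes, but only inside opening tags
    html = re.sub(
        r"<[a-zA-Z][^>]*>",
        lambda m: WORD_ATTRS.sub("", m.group(0)),
        html,
    )
    # Unwrap <span> tags, which carry nothing useful once attributes are gone
    html = re.sub(r"</?span\b[^>]*>", "", html, flags=re.I)
    return html.strip()


sample = (
    '<p class="MsoNormal" style="margin:0in"><span lang="EN-US">Hello '
    "<b>world</b><o:p></o:p></span></p>"
    "<!--[if gte mso 9]><xml><w:WordDocument/></xml><![endif]-->"
)
print(thin_word_html(sample))  # prints: <p>Hello <b>world</b></p>
```

Paragraphs, bold, italics, links, and lists pass through untouched, which is about what you want before pasting into the HTML mode of the blog/wiki editor.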
Hope that helps!
5 comments
Archimedes Trajano
2010-10-08T18:09:06Z
I usually do Paste Special -> Text only. Or worst case copy and paste into Notepad then paste. However, your tip is good if the target object has support for HTML.
Ken Krause
2010-10-08T18:13:41Z
I like to use CleanHaven (http://www.holymackerelsoft... - it's cross platform and lets you adjust a lot more than just formatting...
Samuel Driessen
2010-10-08T23:57:58Z
Good tips, Sacha! I use them as well. I recently learned you can use Ctrl-C and then Ctrl-Shift-V in Chrome to paste without extras. Works great!
ramon
2010-10-13T02:24:45Z
DUDE! (sorry - I can't help but call you that... - perhaps I should have said 'dudette')
I was so thinking you were gonna put up a sed script, or at least a tidy config, or even: tidy --word-2000 -m msword.html
I shudder at the thought of using more Microsoft software than I absolutely have to.
Ramon
Sacha Chua
2010-10-13T04:45:58Z
Heh. It's a simple thing to do with tidy, but in this case, I do have to help a number of people who are on Microsoft Windows and who may boggle at command lines. If I can help them strip out the HTML using whichever tool they feel comfortable using, then I have to deal with fewer broken posts. =)
If it weren't already built in to tidy, I'd probably reach for Emacs Lisp rather than sed... ;)