Dealing with "smart" quotes and other HTML annoyances when parsing HTML text using Python and BeautifulSoup

I've recently been writing some homegrown scripts to parse web pages and do something useful with the information using Python 2.7 and the BeautifulSoup library.  This involves printing the output to a terminal (e.g. Gnome Terminal) where the output is hopefully human readable but in some cases was not.

The most common issue I ran into was the various forms of "smart quotes" and similar HTML annoyances. If these are not properly handled you'll get output like

JohnXs Blog

(where X is the following 7 character group without the intervening spaces - hard to display it verbatim in the blogger environment!)
 & # 8 2 1 7 ; 

or even

JohnXs Blog 

(where X is some awkward block symbol that can't be adequately described or rendered here. This actually results from the MS quote being "decoded" as Latin-1 - so it's sort of a cascade of errors.)

The first of these is known as an "HTML entity". The group shown above is the numeric entity for the right single quotation mark (Unicode U+2019).

The second of these comes about due to so-called "Microsoft Smart Quotes" that have crept into various web documents, presumably because people cut and pasted text that had been generated at some point using some Microsoft tool or other.
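To make that cascade of errors concrete, here is a small sketch of what happens to one Microsoft smart quote when its byte is decoded as Latin-1. It assumes Python 3 (the scripts in this post were Python 2.7), and the sample bytes and variable names are made up for illustration.

```python
import unicodedata

# Hypothetical sample: "John's Blog" with a Windows-1252 right single quote (byte 0x92).
raw = b"John\x92s Blog"

# Latin-1 maps every byte to the code point with the same number, so 0x92
# becomes U+0092, a C1 control character rather than a quote.
as_latin1 = raw.decode("latin-1")
bad_char = as_latin1[4]

print(repr(bad_char))                  # the escaped control character
print(unicodedata.category(bad_char))  # "Cc" is the Unicode category for control characters
```

A terminal has no glyph for a control character like this, which is one plausible way the block symbol ends up on screen.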

The first case (the HTML entities) is easily handled by BeautifulSoup. When creating a BeautifulSoup object in Python, simply add a convertEntities parameter to the constructor like so:

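from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2.7
html = f.read()  # f is the file-like object obtained by opening the URL
soup = BeautifulSoup(html, convertEntities="html")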

Here, the first parameter, html, is a string obtained by opening a URL for reading and calling f.read(). BeautifulSoup will cleanly convert these entities into the corresponding Unicode smart quotes or whatever other characters the HTML entities encode.
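For comparison, the same conversion can be sketched without BeautifulSoup. This is not the convertEntities mechanism described above. It assumes Python 3, whose standard library provides html.unescape, and the sample string is invented.

```python
import html

# Hypothetical page fragment containing a numeric HTML entity.
fragment = "John&#8217;s Blog"

# html.unescape replaces numeric and named entities with the characters they stand for.
text = html.unescape(fragment)

print(text)                # the entity is now a real right single quotation mark
print(hex(ord(text[4])))   # code point of the converted character
```

Note that the result is still a "smart" quote, just a properly encoded one, so the terminal needs to be able to display Unicode.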

The second case is also easily handled. These Microsoft Smart Quotes are particularly pernicious because they were basically appropriated by MS: they sit in the upper half (128-255) of the 256-value "extended ASCII" range, specifically at bytes 0x91-0x94 of the Windows-1252 code page (https://en.wikipedia.org/wiki/Windows-1252). Latin-1 and Unicode reserve those same values for control characters, so if you simply treat the bytes as Unicode code points they won't render properly. Moreover, if these symbols are left in your data you will end up with a lot of

UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)

Python errors, since functions that expect ASCII input will not be happy with these symbols, as can be seen above.
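The error can be reproduced on purpose with a few lines. This is a sketch assuming Python 3, where the ASCII decode has to be requested explicitly (Python 2.7 often attempts it implicitly when byte strings and unicode strings are mixed). The sample bytes are hypothetical.

```python
# Hypothetical input: bytes with Windows-1252 curly double quotes (0x93 and 0x94).
raw = b"Says \x93hi\x94"

message = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    # err.start is the index of the first byte the ASCII codec rejected.
    message = str(err)
    bad_index = err.start
    bad_byte = raw[err.start]

print(message)
```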

There are basically two easy ways to get around this problem, which is really one of non-compliance with standards.

An easy way is to simply decode the text as windows-1252 by saying

html = f.read().decode('windows-1252')

This is not only easy - it has the additional advantage of turning the non-standard MS smart quotes into standard Unicode smart quotes.  However, at the bottom of this post there is perhaps an even easier way, which is to pass this encoding to the BeautifulSoup constructor using the "fromEncoding" parameter.
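As a quick check of what that decode call produces, here is a sketch that assumes Python 3 and uses an invented byte string in place of f.read().

```python
# The four quote bytes that Windows-1252 places at 0x91-0x94 (sample text is made up).
raw = b"\x91single\x92 and \x93double\x94"

# "windows-1252" and "cp1252" are aliases for the same Python codec.
text = raw.decode("windows-1252")

for ch in text:
    if ord(ch) > 127:
        print(hex(ord(ch)))  # code points of the resulting Unicode quotes
```

The four code points printed are the standard Unicode left and right single and double quotation marks.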

A more manual, homegrown method is to use the string module's maketrans function together with the string translate method. To do that you create an input string (the characters you want to translate from) and an output string (the characters you want to translate to) and from these make a translation table. Then, using that table, you can get rid of the pernicious MS quotes. The disadvantage of this is that you can only turn these into "dumb" or ASCII quotes, which may or may not be what you want. But if it is, here's how it's done:

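import string

# Byte values given in octal: MS quotes, bullet, dashes, non-breaking space
intab =  '\221\222\223\224\225\226\227\240'
# ASCII replacements: ' ' " " * - - space
outtab = '\047\047\042\042\052\055\055\040'
transtab = string.maketrans(intab, outtab)
html = (f.read()).translate(transtab)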

Here, the intab has the octal specifications for the MS smart quotes (and a few other symbols) and the outtab has the octal specifications for the basic ASCII single quotes, double quotes and a few others. It turns out that the transtab is a 256 character string where, in the above example, the character in the \221th or 145th position is a single quote as is the character in the \222th or 146th position. 
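The table described above can be inspected directly. This sketch assumes Python 3, where bytes.maketrans takes the place of string.maketrans and the inputs are bytes literals. The sample input is invented.

```python
# Python 3 spelling (an assumption): bytes.maketrans replaces string.maketrans.
intab = b"\221\222\223\224\225\226\227\240"
outtab = b"\047\047\042\042\052\055\055\040"
transtab = bytes.maketrans(intab, outtab)

print(len(transtab))            # one entry per possible byte value
print(chr(transtab[0o221]))     # entry 145 holds the ASCII single quote
print(chr(transtab[ord("A")]))  # bytes not listed in intab map to themselves

# Hypothetical input containing MS single and double quotes.
sample = b"John\222s \223Blog\224"
cleaned = sample.translate(transtab)
print(cleaned)
```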

That covers translating ordinary byte strings. If one has a unicode string that one wishes to translate, its translate method takes a dictionary (not a string) where the keys are the ordinal code points and the values are the desired replacement characters. So something like

unicode_transtab = {0xbf: u'?', 0xd7: u'*'}  # values must be unicode, not byte strings

might be used to convert an inverted question mark to a normal one and a multiplication sign that looks like an 'x' to an asterisk.
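Here is a sketch of that dictionary in use, assuming Python 3, where every str is a Unicode string. The sample text and the extra entry for the right single quote are illustrative additions, not part of the original table.

```python
# Keys are code points; values are the replacement strings.
unicode_transtab = {
    0xBF: "?",    # inverted question mark
    0xD7: "*",    # multiplication sign
    0x2019: "'",  # right single quotation mark (illustrative extra entry)
}

# Hypothetical sample containing all three characters.
sample = "\u00bfJohn\u2019s 3 \u00d7 4"
converted = sample.translate(unicode_transtab)
print(converted)
```

Unlike the 256-entry byte table, the dictionary only needs entries for the characters you want to change, so it can also be used to turn Unicode smart quotes back into ASCII ones.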

Actually, I think this is even better than calling decode:

soup = BeautifulSoup(html, convertEntities="html", fromEncoding="windows-1252")

I will add more shortly that explains how the block-like characters may have crept in, and attach a screenshot.





