Processing RSS Feeds

The blogosphere is an important source of text, in both formal and informal registers. With the help of a third-party Python library called the Universal Feed Parser, freely downloadable from http://feedparser.org/, we can access the content of a blog, as shown here:

>>> import nltk
>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> post = llog.entries[2]
>>> post.title
u"He's My BF"
>>> content = post.content[0].value
>>> content[:70]
u'<p>Today I was chatting with three of our visiting graduate students f'
>>> nltk.word_tokenize(nltk.clean_html(llog.entries[2].content[0].value))
[u'Today', u'I', u'was', u'chatting', u'with', u'three', u'of', u'our',
u'visiting', u'graduate', u'students', u'from', u'the', u'PRC', u'.',
u'Thinking', u'that', u'I', u'was', u'being', u'au', u'courant', u',', u'I',
u'mentioned', u'the', u'expression', u'DUI4XIANG4', u'\u5c0d\u8c61', u'("',
u'boy', u'/', u'girl', u'friend', u'"', ...]

Note that the resulting strings have a u prefix to indicate that they are Unicode strings (see Section 3.3). With some further work, we can write programs to create a small corpus of blog posts, and use this as the basis for our NLP work.
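The "further work" mentioned above might look something like the following minimal sketch. It is written for Python 3, whereas the session above is Python 2, and it makes several assumptions that are not part of the original text: the helper names strip_html, build_corpus, and fetch_corpus are hypothetical, the tag stripping uses the standard library's html.parser rather than nltk.clean_html (which later NLTK releases no longer implement), and every entry is assumed to carry a content list, which not every feed provides.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, discarding the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text of an HTML fragment (a stand-in for clean_html)."""
    extractor = _TextExtractor()
    extractor.feed(markup)
    extractor.close()  # flush any text still buffered by the parser
    return "".join(extractor.chunks)


def build_corpus(entries):
    """Map each post title to the plain text of its first content block.

    `entries` is assumed to look like feedparser's `entries` list: each
    item has a `title` and a `content` list whose items carry a `value`.
    Entries without `content` (summary-only feeds) would raise KeyError.
    """
    corpus = {}
    for entry in entries:
        corpus[entry["title"]] = strip_html(entry["content"][0]["value"])
    return corpus


def fetch_corpus(url):
    """Download a feed with feedparser (third-party) and build a corpus."""
    import feedparser  # imported lazily so the helpers above work without it
    return build_corpus(feedparser.parse(url).entries)
```

Calling fetch_corpus with the Language Log feed URL used earlier would return a dictionary from post titles to plain text, which can then be passed to nltk.word_tokenize one post at a time. Keeping build_corpus separate from the download step makes it possible to try the code on hand-written entries without network access.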
