A colleague and I were talking about doing some text analysis (which, by the way, I have never done before), for which the first issue is getting text into R. Not just any text, but files that can be accessed over the internet. In summary, we need to access an HTML file, parse it so we can reach specific content, and then remove the HTML tags. Finally, we may want to replace some text (the end-of-line character, \n, for example) before continuing to process the files.
The XML package has the necessary functionality to deal with HTML, while the rest is done using a few standard R functions.
library(XML)

# Read and parse the HTML file
doc.html <- htmlTreeParse('http://luis.apiolaza.net/babel.html', useInternalNodes = TRUE)

# Extract all the paragraphs (HTML tag is p, starting at
# the root of the document). unlist flattens the list to
# create a character vector.
doc.text <- unlist(xpathApply(doc.html, '//p', xmlValue))

# Replace all \n by spaces
doc.text <- gsub('\n', ' ', doc.text)

# Join all the elements of the character vector into a single
# character string, separated by spaces
doc.text <- paste(doc.text, collapse = ' ')
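To see what each step does without depending on a live web page, the same steps can be wrapped in a small function and tried on HTML held in memory. This is only a sketch: the function name html_to_text, its as_text argument and the two toy pages are made up for illustration and are not part of the original code.

```r
library(XML)

# Hypothetical helper (not from the original post): parses HTML given
# either as a URL/file name or, with as_text = TRUE, as a string, and
# returns the text of all paragraphs as a single character string.
html_to_text <- function(input, as_text = FALSE) {
  # Parse into an internal document so that XPath queries work
  doc <- htmlTreeParse(input, useInternalNodes = TRUE, asText = as_text)
  # Extract the text content of every <p> node
  paragraphs <- unlist(xpathApply(doc, '//p', xmlValue))
  # Replace line breaks by spaces
  paragraphs <- gsub('\n', ' ', paragraphs)
  # Join everything into one string, separated by spaces
  paste(paragraphs, collapse = ' ')
}

# Two tiny in-memory "pages" stand in for downloaded files
pages <- c(
  first = '<html><body><p>The universe\nis a library.</p><p>It is infinite.</p></body></html>',
  second = '<html><body><h1>Title</h1><p>Only one paragraph.</p></body></html>'
)

# One cleaned-up string per page, named after the elements of pages
texts <- sapply(pages, html_to_text, as_text = TRUE)
print(texts)
```

With the default as_text = FALSE the same call should work on a vector of addresses, assuming htmlTreeParse can reach them. A page with no p elements gives an empty string.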
Incidentally, babel.html contains a translation of the short story 'The Library of Babel' by Jorge Luis Borges. Great story! We can repeat this process with several files and then create a corpus (and analyze it) using the
0 responses to “Reading HTML pages in R for text processing”
Tried this, but got doc.text=””. I may learn a bit by debugging it myself, which I am in the process of doing, but in case somebody already knows the answer, I’m posting this before I begin my investigation.
although leaving the “.html” off and just asking for “…/babel/” seemed to work.
As you probably realise, this post is almost exactly 5 years old, and the content of my site (and its location) has been reorganised in the meantime. As you discovered in your follow-up comment, the code works with the new address.