Rottentomatoes.com Data Scraping: Grabbing rotten tomatoes ratings via their API

Basically, given a list of movie names and the years they were released, we can query the Rotten Tomatoes API to extract the critics' and audience ratings. One of the problems is that when a movie title of ‘untitled’ is used, even in conjunction with 2009 when it was released, the API returns many more movies, such as ‘untitled star trek sequel’, without a year. So, when you extract the data from the parsed response, you need to be careful to find just which element within the list of lists (argh!) relates to the movie in question, and then extract the subcomponents from there.
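
To make the list-of-lists problem concrete, here is a minimal sketch of picking out the right entry by both title and year. The sample data and the field names `title` and `year` are made up for illustration, and `find_movie` is a hypothetical helper, not part of the script below (which matches on title only, by position).

```r
# A made-up parsed response: a list of movies, each itself a list.
# The field names `title` and `year` are assumptions for illustration.
movies <- list(
  list(title = "Untitled Star Trek Sequel"),   # no year field at all
  list(title = "Untitled", year = 2009),
  list(title = "Untitled", year = 2011)
)

# Hypothetical helper: return the index of the first entry whose title and
# year both match, or NA_integer_ when nothing matches.
find_movie <- function(movies, title, year) {
  is_match <- vapply(movies, function(m) {
    same_title <- identical(tolower(m$title), tolower(title))
    # a missing year is NULL, so guard before comparing
    same_year <- !is.null(m$year) && as.integer(m$year) == as.integer(year)
    same_title && same_year
  }, logical(1))
  hits <- which(is_match)
  if (length(hits) == 0) NA_integer_ else hits[1]
}

find_movie(movies, "Untitled", 2009)  # returns 2
find_movie(movies, "Untitled", 1999)  # returns NA
```

Matching on the year as well as the title is what rules out both the sequel entry and the 2011 entry here; whether the real response always carries a usable year is something to check against actual output.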

R can be used for data scraping and munging, but it is very frustrating for a junior R person like myself, especially when I know that various ETL tools offer a much simpler solution to this issue. Perhaps it is about time I picked up an open source ETL tool to alleviate some of my R frustrations. After all, my interest is in data mining with R, not extracting and munging with R… but they do say that data preparation is 80% of the effort.

# SET YOUR FREE ROTTEN TOMATOES API KEY HERE
my.key <- 'sign up for your key at the developer website of rotten tomatoes'

# load the required packages
library(RJSONIO)
library(RCurl)
require(stringr)
require(plyr)

# extract the data
extract_details <- function(doc, movie, year){
  # this function needs to handle when it finds multiple movies....
  # doc[[2]] holds the list of movies returned by the API
  movies <- doc[[2]]
  cat(movie,' - ',year,' - ',length(movies),' matches\n',sep='')
  # default result: NA ratings, used when no returned title matches exactly
  dtls <- data.frame(title = movie,
                     year = year,
                     critics_rating = NA,
                     critics_score = NA,
                     audience_rating = NA,
                     audience_score = NA,
                     stringsAsFactors=FALSE)
  # find the movie whose title matches exactly (the last match wins)
  index <- 0
  for(i in seq_along(movies)){
    if(movies[[i]][2] == movie){
      index <- i
    }
  }
  # fields are read by position: 2 = title, 3 = year, 8 = the four ratings
  if(index != 0){
    dtls <- data.frame(title = movies[[index]][2],
                       year = movies[[index]][3],
                       critics_rating = movies[[index]][[8]][1],
                       critics_score = movies[[index]][[8]][2],
                       audience_rating = movies[[index]][[8]][3],
                       audience_score = movies[[index]][[8]][4],
                       stringsAsFactors=FALSE)
  }
  return(dtls)
}

# grabs the ratings for one movie from the API; movie is c(title, year)
grab_rotten_data <- function(movie){
  old.warn <- options(warn=-1) # silence warnings while scraping
  on.exit(options(old.warn))   # restore the previous setting on exit
  # 1: replace spaces with +
  movie.plus <- str_replace_all(movie[1],' ','+')
  # create a bad result if we have a scrape error
  bad.scrape <- data.frame(title = movie[1],
                           year = movie[2],
                           critics_rating = NA,
                           critics_score = NA,
                           audience_rating = NA,
                           audience_score = NA,
                           stringsAsFactors=FALSE)

  # 2: Build URL (the '?' starts the query string)
  rottoms.url <- paste('http://api.rottentomatoes.com/api/public/v1.0/movies.json?q=',movie.plus,'&year=',movie[2],'&apikey=',my.key,sep='')

  # 3: Query the API; fall back to the bad result if the request or parse fails
  df <- tryCatch({
    rottoms.out <- getURLContent(rottoms.url,curl=getCurlHandle()) # this is the data
    doc <- fromJSON(rottoms.out) # return a list structure of the movies that match the title
    extract_details(doc, movie[1],movie[2]) # extract the details I need
  }, error = function(e) bad.scrape)
  if(nrow(df) == 0){ # if the exact movie was not returned, then return a bad scrape
    df <- bad.scrape
  }
  return(df)
}

###--- Main ---###

# Step 1: Read in the movie names
movie.file <- read.csv("NamesAndYears.csv")

# Step 2: Grab only the movie names and set to character, and get the YEAR from the date
movie.names <- matrix(NA, nrow=nrow(movie.file),ncol=2)
movie.names[,1] <- as.character(movie.file[,1])
movie.names[,2] <- as.numeric(format(as.Date(movie.file[,7],'1970-01-01'),'%Y'))

# Step 3: Grab all the data ! - NOT DOING THE RIGHT APPLY CALL HERE?
movie.names <- as.data.frame(movie.names, stringsAsFactors=FALSE)
rotten.df <- do.call("rbind", apply(movie.names, 1, grab_rotten_data))
# Step 4: write to file
write.csv(rotten.df ,file="RottenRatings.csv", row.names=FALSE)
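
On the question in Step 3 about the `apply` call: as far as I can tell it does work, because `apply` turns the data frame into a character matrix and hands each row to the function as a character vector, which is the shape `grab_rotten_data` expects. The sketch below compares it with an explicit loop over row indices. The two sample rows are made up, and `fake_grab` is a hypothetical stand-in for `grab_rotten_data` so that nothing touches the network.

```r
# Two made-up rows standing in for the movie list.
movie.names <- data.frame(V1 = c("Up", "Untitled"),
                          V2 = c("2009", "2009"),
                          stringsAsFactors = FALSE)

# Hypothetical stand-in for grab_rotten_data(): same input shape
# (a vector holding the title, then the year), but no network call.
fake_grab <- function(movie) {
  data.frame(title = movie[[1]],
             year = movie[[2]],
             critics_score = NA,
             stringsAsFactors = FALSE)
}

# Option 1: apply() over rows, as in the script. Each row arrives as a
# character vector, and the result is a list of one-row data frames.
by.apply <- do.call("rbind", apply(movie.names, 1, fake_grab))

# Option 2: lapply() over row indices, which makes the row-by-row
# iteration explicit and skips the matrix conversion.
by.rows <- do.call("rbind", lapply(seq_len(nrow(movie.names)), function(i) {
  fake_grab(c(movie.names[i, 1], movie.names[i, 2]))
}))
```

Both options give one row per movie. The `apply` version is only safe here because both columns are already character; with mixed column types the matrix conversion would coerce everything to character first.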

Source: http://binalytics.wordpress.com/2012/03/07/grabbing-rotten-tomatoes-ratings-via-their-api/

Friday, 24 May 2013
