Rottentomatoes.com Data Scraping: April 2013

Saturday, 27 April 2013

Three common methods for data extraction

Probably the most common traditional technique for extracting data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you’re already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.
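
To make the regular-expression approach concrete, here is a minimal sketch in Python (the article itself mentions Perl and Java; Python is only a stand-in). The pattern, the `extract_links` helper, and the sample HTML are all illustrative assumptions, and the pattern only handles double-quoted `href` values with plain-text link titles.

```python
import re

# Matches <a ... href="URL" ...>TITLE</a>. This is a simplification that
# assumes double-quoted href values and no nested tags inside the link text.
LINK_RE = re.compile(
    r'<a\s[^>]*?href="([^"]+)"[^>]*>([^<]*)</a>',
    re.IGNORECASE,
)

def extract_links(html):
    """Return a list of (url, title) tuples found in the HTML string."""
    return [(url, title.strip()) for url, title in LINK_RE.findall(html)]

# Made-up page fragment used only for demonstration.
sample_html = """
<ul>
  <li><a href="http://example.com/news/1">First headline</a></li>
  <li><a class="story" href="http://example.com/news/2"> Second headline </a></li>
</ul>
"""

for url, title in extract_links(sample_html):
    print(title, "->", url)
```

For a small job such as pulling headlines off one page, something of this size is often all that is needed.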

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing “ontologies”, or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they’re often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it’s probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what’s the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

    If you’re already familiar with regular expressions and at least one programming language, this can be a quick solution.
    Regular expressions allow for a fair amount of “fuzziness” in the matching such that minor changes to the content won’t break them.
    You likely don’t need to learn any new languages or tools (again, assuming you’re already familiar with regular expressions and a programming language).
    Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It’s also nice because the various regular expression implementations don’t vary too significantly in their syntax.
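
The “fuzziness” mentioned above can be illustrated with a short Python sketch. The table layout, the `Price:` label, and the `find_price` helper are assumptions made up for this example. The pattern tolerates extra whitespace, added attributes, and tag case, but it still fails when a new tag is wrapped around the value.

```python
import re

# \s* and [^>]* absorb minor markup changes such as extra whitespace or
# added attributes. The price format (dollar sign, digits, commas) is assumed.
PRICE_RE = re.compile(
    r'<td[^>]*>\s*Price:\s*</td>\s*<td[^>]*>\s*\$([\d,]+(?:\.\d{2})?)\s*</td>',
    re.IGNORECASE,
)

def find_price(html):
    """Return the price text after the 'Price:' label, or None if not found."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None

original = '<td>Price:</td><td>$1,250.00</td>'
reformatted = '<TD class="label"> Price: </TD>\n  <td align="right"> $1,250.00 </td>'
with_font_tag = '<td>Price:</td><td><font color="red">$1,250.00</font></td>'

print(find_price(original))       # 1,250.00
print(find_price(reformatted))    # 1,250.00 (whitespace and attributes tolerated)
print(find_price(with_font_tag))  # None (a new wrapping tag breaks the match)
```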

Disadvantages:

    They can be complex for those who don’t have a lot of experience with them. Learning regular expressions isn’t like going from Perl to Java. It’s more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.
    They’re often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you’ll see what I mean.
    If the content you’re trying to match changes (e.g., they change the web page by adding a new “font” tag) you’ll likely need to update your regular expressions to account for the change.
    The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.
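
As a rough illustration of the data discovery step, the sketch below uses only the Python standard library to follow links while keeping cookies between requests. The function names, the `must_contain` filter, and the page limit are assumptions for this example. A real crawler would also need error handling, politeness delays, and respect for each site’s terms of use.

```python
import re
import urllib.request
from http.cookiejar import CookieJar
from urllib.parse import urljoin

HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def make_opener():
    """Build an opener that stores cookies and sends them on later requests."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )

def find_links(html, base_url, must_contain):
    """Return absolute URLs from the page that contain a marker string."""
    links = []
    for href in HREF_RE.findall(html):
        absolute = urljoin(base_url, href)
        if must_contain in absolute and absolute not in links:
            links.append(absolute)
    return links

def crawl(start_url, must_contain, max_pages=10):
    """Fetch the start page, then each matching page, sharing one cookie jar."""
    opener = make_opener()
    queue, seen, pages = [start_url], set(), {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        with opener.open(url, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")
        pages[url] = html
        queue.extend(find_links(html, url, must_contain))
    return pages
```

The pages returned by `crawl` would then be handed to whatever extraction patterns you have written for the data itself.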

When to use this approach: You’ll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there’s no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

    You create it once and it can more or less extract the data from any page within the content domain you’re targeting.
    The data model is generally built in. For example, if you’re extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).
    There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.
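
A real ontology-driven engine is far more elaborate than anything that fits here, but the idea of a built-in data model can be hinted at with a toy Python sketch. The `CarListing` structure, the tiny `KNOWN_MODELS` vocabulary, and the `extract_car` helper are all hypothetical. The point is only that the extractor knows which fields exist before it ever sees a page.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class CarListing:
    """The fixed data model that every extracted ad is mapped onto."""
    make: Optional[str] = None
    model: Optional[str] = None
    price: Optional[int] = None

# Hypothetical, deliberately tiny vocabulary for the content domain.
KNOWN_MODELS = {
    "honda": ["civic", "accord"],
    "toyota": ["camry", "corolla"],
    "ford": ["focus", "mustang"],
}
PRICE_RE = re.compile(r'\$\s*(\d[\d,]*)')

def extract_car(text):
    """Map free text onto CarListing using the vocabulary above."""
    listing = CarListing()
    lowered = text.lower()
    for make, models in KNOWN_MODELS.items():
        if re.search(r'\b' + make + r'\b', lowered):
            listing.make = make.title()
            for model in models:
                if re.search(r'\b' + model + r'\b', lowered):
                    listing.model = model.title()
                    break
            break
    price = PRICE_RE.search(text)
    if price:
        listing.price = int(price.group(1).replace(",", ""))
    return listing

print(extract_car("2004 Honda Civic, low miles, asking $6,500 obo"))
```

Because the output is already a structured record, inserting it into the matching database columns is straightforward.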

Disadvantages:

    It’s relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.
    These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you’re targeting.
    You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you’ll only get into ontologies and artificial intelligence when you’re planning on extracting information from a very large number of sources. It also makes sense to do this when the data you’re trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

    Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.
    Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application, scraping a site takes significantly less time than it does with other methods.
    Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

    The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.
    A potential cost. Most ready-to-go screen-scraping applications are commercial, so you’ll likely be paying in dollars as well as time for this solution.
    A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you’re locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you’re using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don’t mind paying a bit, you can save yourself a significant amount of time by using one. If you’re doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you’re probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we’ve been involved with that has actually required a hybrid approach of two of the aforementioned methods. We’re currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term “number of bedrooms” can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we’ve done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it’s handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we’ve written that uses ontologies in order to extract out the individual pieces we’re after. Once the data has been extracted we then insert it into a database.
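
To give a feel for why “number of bedrooms” is awkward to extract, here is a small Python sketch that normalizes a handful of spellings to one number. It is not the code used in the project described above. The synonym list, the `WORD_NUMBERS` table, and the `bedrooms` helper are assumptions, and a real vocabulary would cover many more variants.

```python
import re

# Hypothetical synonym list for the "bedrooms" concept.
BEDROOM_TERMS = r'(?:bedrooms?|bdrms?|bdr|beds?|brs?|bd)'
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

BEDROOM_RE = re.compile(
    r'\b(\d+|' + '|'.join(WORD_NUMBERS) + r')[\s-]*' + BEDROOM_TERMS + r'\b',
    re.IGNORECASE,
)

def bedrooms(ad_text):
    """Return the bedroom count mentioned in an ad, or None if none is found."""
    match = BEDROOM_RE.search(ad_text)
    if not match:
        return None
    raw = match.group(1).lower()
    return int(raw) if raw.isdigit() else WORD_NUMBERS[raw]

for ad in ["3 bedroom ranch", "Charming 3BR/2BA", "two-bdrm apt", "4 bd, 2 ba"]:
    print(ad, "->", bedrooms(ad))
```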

Source: http://blog.screen-scraper.com/2006/03/21/three-common-methods-for-data-extraction/

Note:

Delta Ray is an experienced web scraping consultant who writes articles on Flixster.com Data Scraping, Rottentomatoes.com Data Scraping, Fandango.com Data Scraping, Moviefone.com Data Scraping, Boxofficemojo.com Data Scraping, Comingsoon.net Data Scraping, and related topics.

Facebook Expands Instant Personalization Program, Adds Rotten Tomatoes As Partner

Facebook has just expanded its highly controversial Instant Personalization program — which allows select third-party sites to access some of your data without requiring you to login or ‘Connect’ — to popular movie reviews community Rotten Tomatoes. In a blog post announcing the news, Facebook says that the new feature will allow users to “immediately see the reviews most relevant to you, without having to register, search for friends, or fill out a profile.”

Instant Personalization was first announced at Facebook’s f8 conference in April. The feature gives sites that have received Facebook’s blessing the ability to access any information you’ve shared with ‘Everyone’ on Facebook as soon as you arrive at the third-party site, with no authentication required. At launch only three sites featured Instant Personalization: Yelp, Microsoft’s Docs.com, and Pandora — this is the first expansion since the April launch, and Facebook says that it will be expanding it slowly over the next few months. I suspect Facebook would have liked to begin rolling this out more broadly before now, but Instant Personalization sparked waves of privacy concerns as soon as it was announced.

The concept behind Instant Personalization is compelling: it offers the promise of a personalized web, where sites know what you’re interested in as soon as you arrive. But it’s rife with privacy issues. Instant Personalization drew fire from advocacy groups and even some senators, in part because it was initially frustratingly difficult to opt-out of (Facebook has since improved this).

Facebook makes the case that handing over this data to trusted third parties is reasonable, because it’s only handing over data that users have chosen to share with ‘Everyone’. Of course, for nearly a year now Facebook has been prodding users toward sharing more of their data under this less private setting.

The biggest push came last December, when it forced users to go through a new privacy wizard that encouraged them to share their Updates and other key information with ‘Everyone’ (I still believe this was the site’s most egregious privacy move to date). Later changes included forcing users to move their Interests, which could previously be hidden on profiles, to public ‘Likes’. In other words, a lot of people are sharing a significant amount of information with ‘Everyone’.

Facebook is taking baby steps with Instant Personalization, and for good reason — even without press and privacy advocates raising red flags about the issues here, there’s a distinct chance that people will totally freak out when they arrive at a site and it already knows who their friends are. But, as with every other privacy-related issue Facebook encounters, there’s a chance that its users simply won’t care — they’ll just like seeing reviews of their favorite movies as soon as they visit Rotten Tomatoes.

Here are some data points about the program that Facebook sent along:

    Users control Instant Personalization – when they arrive at a site they can disable the experience, or they can turn off the program for all websites in their Facebook settings.
    If you have previously opted out of the Instant Personalization program, you’ll continue to be opted out for any new sites.
    Partner sites follow clear product/security/privacy guidelines and may only use your public information and friend lists to offer a more personalized experience.
    All experiences are based on explicit actions (i.e. info you’ve typed into your profile or clicked “Like”); passive behavior (what you’re reading) is never surfaced.
    User data is never transferred to ad networks. Update: Facebook also clarifies that “No revenue is ever exchanged as part of this program and user data cannot be transferred by partners to third-party ad networks.”
    The program is expanding slowly over the next few months with a handful of partner sites where the value to people is clear. It is focused on verticals where you already find information through friends in the real world (examples: reviews, food, travel, music, movies).

Source: http://techcrunch.com/2010/09/17/facebook-expands-instant-personalization-program-adds-rotten-tomatoes-as-partner/

Friday, 26 April 2013

Why Outsource Data Mining Services?

Are huge volumes of raw data waiting to be converted into information that you can use? Your organization's hunt for valuable information ends with data mining, which can bring more accuracy and clarity to the decision-making process.

Today's world is hungry for information, and with the Internet offering flexible communication, the flow of data is remarkable. It is important to make that data available in a readily workable format where it can be of real help to your business. Filtered data is of considerable use to an organization: used efficiently, these services can increase profits, smooth workflow, and reduce overall risk.

Data mining is the process of sorting through vast amounts of data and picking out the pertinent information. Most data mining is conducted by professionals, business organizations, and financial analysts, although a growing number of fields are finding benefits in using it.

Data mining helps make decisions quicker and more feasible. The information it yields is used for decision-making in direct marketing, e-commerce, customer relationship management, healthcare, scientific testing, telecommunications, financial services, and utilities.

Data mining services include:

    Collecting data from websites into an Excel database
    Searching for and collecting contact information from websites
    Using software to extract data from websites
    Extracting and summarizing stories from news sources
    Gathering information about competitors' businesses

In this era of globalization, handling important data is becoming a headache for many business verticals, and outsourcing can be a profitable option. Since all projects are customized to suit the exact needs of the customer, huge savings in time, money, and infrastructure can be realized.

Advantages of Outsourcing Data Mining Services:

    Skilled and qualified technical staff who are proficient in English
    Improved technology scalability
    Advanced infrastructure resources
    Quick turnaround time
    Cost-effective prices
    Secure Network systems to ensure data safety
    Increased market coverage

Outsourcing helps you focus on your core business operations and thus improve overall productivity, which is why data mining outsourcing has become a wise choice for businesses. Outsourcing these services helps businesses manage their data effectively, which in turn enables them to achieve higher profits.

Article Source: http://EzineArticles.com/3066061

News Corp. Unloads Rotten Tomatoes Onto Flixster

News Corp is unloading more of its digital assets. This time it’s the movie review site Rotten Tomatoes, which is being acquired by startup Flixster, which has the most popular movie app for the iPhone and other mobile devices. The purchase price was not disclosed, but it was at least in part a stock transaction. News Corp now owns a minority stake in Flixster, which has only raised a total of $7 million in venture capital.

Flixster already shows Rotten Tomatoes reviews and ratings within its iPhone app (you can contrast the critics’ reviews from Rotten Tomatoes with Flixster user reviews). Putting the two companies together certainly strengthens Flixster. The combined reach of both is 30 million unique visitors a month across all platforms, according to the companies. Just looking at their websites, Flixster has 10 million monthly global unique visitors versus 7.5 million for Rotten Tomatoes.

In October, News Corp sold off Photobucket to Ontela for $60 million. Expect it to divest more of its digital businesses this year.

Source: http://techcrunch.com/2010/01/04/rotten-tomatoes-flixster/

Wednesday, 24 April 2013

Stemming the Tide of Data Misappropriation – Who’s Using Your Information?

Gone are the days when REALTORS® would do the legwork and provide a stack of listings to buyers. Would-be buyers can view listing data any number of places online, though they tend to go in droves to the big online real estate portals (think realtor.com®, Homes.com, Trulia, Zillow). Yet, issues surrounding who owns the listing data, and how it is being disseminated across the Web—at worst, illegally and, at best, contrary to the intentions of its owners—continue to raise the ire of real estate brokers and online listings organizations.

A growing number are asking how they can protect their data, their brands and their business from a rising tide of data misappropriation. What follows are some important considerations for protecting those hard-won assets—your listings.

Understanding the Issues
Three areas are of primary concern: 1) data scraping; 2) data leakage; and 3) advertising practices of real estate portals. MLSs and brokers with effective websites tend to be most concerned about data scraping, which is the illicit copying or indexing of data by computer programs from a website. The second type of widespread misuse is data leakage, in which once-licensed data is repurposed or passed on for unauthorized uses.

“There is a whole flood of secondary data, where you send your data someplace and then it flows right out the back door,” says Curt Beardsley, vice president of consumer and industry development at Move, Inc.

Jason Doyle, vice president of Homes.com, agrees. “Brokers need to be educated about where their listings are going, both directly or indirectly,” he advises.

Many scrapers are repackaging and selling the data for unauthorized uses—such as lead aggregation, market analytics or trending applications—to companies such as banks, lending institutions or hedge funds.

“There’s a huge grey market for this data,” explains Marty Frame, president of Realtors Property Resource®. For example, “Lending and loan servicing institutions often will buy data from aggregators to check listing prices on short sales, review appraisals or monitor for listings in the bank’s portfolio that might be for sale. That bank may ask for a warranty that the aggregator has the right to sell the listings, but this data is almost always outside the terms of a valid license.”

But the issue that brokers tend to be angriest about is what happens to their data once they release it—free of charge—to real estate portals such as Trulia or Zillow. On the plus side, the sites drive a lot of traffic and exposure to listings. The business model of portals is to generate revenue by selling advertisements and, in some cases, leads, rather than selling property. But many brokers complain about the inappropriate sale of advertising space by these portals around their listings, which often promote competing brokers or ancillary businesses such as mortgage lenders. These ads, brokers say, are confusing buyers about who owns the listing, and driving traffic—and business—to their competitors. Moreover, many brokers feel railroaded into buying ad space around their own listings simply because if they don’t, their competitors will.

“Real estate portals are using the authorized data in ways that were never contemplated,” says David Charron, president and CEO of Metropolitan Regional Information Systems (MRIS). “Brokers have to spend a lot more money to protect their own assets.”

Alex Perriello, president and CEO of Realogy Franchise Group, comments: “Giving your data to a third party, who then wants you to pay them for business generated by that listing—either directly or indirectly—is analogous to lending someone your watch and then paying them later to tell you what time it is.”

It’s worth noting that REALTOR.com®, like its competitors, sells ad space around listings, although never to a competing broker. Explains Beardsley, “Our monetization strategy is around the enhancement of listings as a paid product, but if REALTORS® don’t buy the ads next to their listings, we don’t slap a competitor’s brand on it.”

To a lesser extent, data accuracy and integrity are also at issue, since many sites are not updated in a timely way or receive content from unreliable sources. Potential buyers are given incorrect property details and out-of-date information, which frustrates consumers and reflects poorly on brokers.

How can brokers protect/maximize the value of their listings?

1. Understand the terms and conditions of sites you are authorizing. Whether you’re signing license agreements with a virtual tour company, opting in to list with a syndicator, or uploading photos to the MLS, you have to know exactly what permissions you are granting. If necessary, engage an attorney to make sure you understand the license agreement that lays out exactly what the third party can and can’t do with your data.

“Better yet, insist they abide by the practices and procedures that you, or whomever you designate, establish with them directly in writing. If online publishers are unwilling to abide by those practices, then don’t send them your listing content,” advises Perriello.

“Brokers need to go out and assess the risks and benefits of working with listing sites,” says Doyle. “They need to ask themselves, ‘Am I getting enough value to feel comfortable with the risk of where this company is putting my data?’”

2. Enforce copyrights. There is growing attention to copyright enforcement. Typically, brokers hold the copyrights in works that the broker and their employees create. They can’t copyright facts, such as the number of bedrooms and square footage, but original text, photographs (assuming they’ve taken them), and even list price are copyrightable. Once brokers turn listings over to a multiple listing service (MLS), the MLS can copyright the compilation of the data, i.e. the selection, coordination and arrangement of the database. Yet tracking down and enforcing copyright infringement can be a costly and time-consuming endeavor. Brokers are dependent on MLSs to police the data. To enforce these ownership rights, you need to be willing to lawyer up. As a couple of high-profile lawsuits have confirmed, MLSs are getting more serious about fighting piracy.

3. Protect your IDX site with anti-scraping software. If you have an IDX website, anti-scraping software is a must-have. In fact, MLSs may soon require it. Some in the industry hold that watermarking images and seeding data are complementary approaches to protecting data, and can enable you to track data, monitor compliance and enforce ownership rights.

“Data seeding can catch nonprofessionals, but sophisticated data thieves can strip out seeds,” explains Gregg Larson, president and CEO of Clareity Consulting. “Watermarked photos are more challenging, but the majority of stolen data uses don’t require photos. Scrapers often don’t even download photos.”
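
The article does not describe how data seeding is implemented, so the following Python sketch is only one plausible reading of the idea: plant a fictitious record carrying a recipient-specific marker in each outgoing feed, then look for that marker in data found elsewhere. The `seed_token`, `make_seed_listing`, and `trace_leak` helpers and the record fields are all hypothetical. As Larson notes, a sophisticated scraper can strip such seeds, so this catches only the careless cases.

```python
import hashlib
import hmac

def seed_token(recipient, secret):
    """Derive a short marker that is unique to one data recipient."""
    digest = hmac.new(secret.encode("utf-8"), recipient.encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return digest[:8]

def make_seed_listing(recipient, secret):
    """Create a fictitious listing whose remarks embed the recipient's marker."""
    return {
        "address": "100 Placeholder Ln",  # made-up address, not a real property
        "remarks": "Sunny unit, ref " + seed_token(recipient, secret),
    }

def trace_leak(found_text, recipients, secret):
    """Return the recipients whose marker appears in text found elsewhere."""
    lowered = found_text.lower()
    return [r for r in recipients if seed_token(r, secret) in lowered]

recipients = ["portal-a", "portal-b"]
feed_for_a = make_seed_listing("portal-a", "my-private-secret")
print(trace_leak(feed_for_a["remarks"], recipients, "my-private-secret"))
```

Keeping the secret private matters here, since anyone who knows it could compute the markers and remove them.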

4. Think through where your data belongs. Every broker has decisions to make about with whom they choose to share their data—decisions they make when they put their inventory on the MLS, and can update at any time. In fact, a number of companies have decided to withdraw from the major sites. “This is the brokers’ decision, since the asset belongs to them. They worked hard to get it, and get to decide where they want to distribute it,” says Charron.

Brokers, take heed. It’s time to get more engaged with where, how and under what terms you distribute your data going forward. Protecting your data and managing its distribution channels are critical to safeguarding your business and your brand.

Source: http://rismedia.com/2013-03-31/stemming-the-tide-of-data-misappropriation/
