Sunday 30 November 2014

Web Scraping’s 2013 Review – part 1

Here we are, almost having ended another year and having the chance to analyze the aspects of the Web scraping market over the last twelve months. First of all i want to underline all the buzzwords on the tech field as published in the Yahoo’s year in review article . According to Yahoo, the most searched items wore

  •     iPhone (including 4, 5, 5s, 5c, and 6)
  •     Samsung (including Galaxy, S4, S3, Note)
  •     Siri
  •     iPad Cases
  •     Snapchat
  •     Google Glass
  •     Apple iPad
  •     BlackBerry Z10
  •     Cloud Computing

It’s easy to see that none of this terms regards in any way with the field of data mining, and they rather focus on the gadgets and apps industry, which is just one of the ways technology can evolve to. Regarding actual data mining industry there were a lot of talks about it in this year’s MIT’s Engaging Data 2013 Conference. One of the speakers Noam Chomsky gave an acid speech relating data extraction and its connection to the Big Data phenomena that is also on everyone’s lips this year. He defined a good way to see if Big Data works by following a series of few simple factors: 1. It’s the analysis, not the raw data, that counts. 2. A picture is worth a thousand words 3. Make a big data portal (Not sure if Facebook is planning on dominating in cloud services some day) 4. Use a hybrid organizational model (We’re asleep already, soon)  let’s move 5. Train employees Other interesting declaration  was given by EETimes saying, “Data science will do more for medicine in the next 10 years than biological science.” which says a lot about the volume of required extracted data.

Because we want to cover as many as possible events about data mining this article will be a two parter, so don’t forget to check our blog tomorrow when the second part of this article will come up!

Source:http://thewebminer.com/blog/2013/12/

Thursday 27 November 2014

Scraping SSL Labs Server Test Results With R

    NOTE: Qualys allows automated access to their SSL Server Test site in their T&C’s, and the R fucntion/script provided here does its best to adhere to their guidelines. However, if you launch multiple scripts at one time and catch their attention you will, no doubt, be banned.

This post will show you how to do some basic web page data scraping with R. To make it more palatable to those in the security domain, we’ll be scraping the results from Qualys’ SSL Labs SSL Test site by building an R function that will:

    fetch the contents of a URL with RCurl
    process the HTML page tags with R’s XML library
    identify the key elements from the page that need to be scraped
    organize the results into a usable R data structure

You can skip ahead to the code at the end (or in this gist) or read on for some expository that isn’t in the code’s comments.

Setting up the script and processing flow

We’ll need some assistance from three R packages to perform the scraping, processing and transformation tasks:

library(RCurl) # scraping
library(XML)   # XML (HTML) processing
library(plyr)  # data transformation

If you poke at the SSL Test site with a few different URLs, you’ll see there are three primary inputs to the GET request we’ll need to issue:

    d (the domain)
    s (the IP address to test)
    ignoreMismatch (which we’ll leave as ‘on‘)

You’ll also see that there’s often a delay between issuing a request and getting the results, so we’ll need to build in a GET+check-loop (like the javascript on the page does automagically). Finally, when the results are eventually displayed they are (at least for this example) usually either "Overall Rating" or "Assessment" and, we’ll use that status result in our tests for what to return.

We’ll account for the domain and IP address in the function parameters along with the amount of time we should pause between GET+check attempts. It’s also a good idea to provide a way to pass in any extra curl options (e.g. in the event folks are behind a proxy server and need to input that to make the requests work). We’ll define the function with some default parameters:

get_rating <- function(site="rud.is", ip="", pause=5, curl.opts=list()) {

}

This definition says that if we just call get_rating(), it will

    default to using "rud.is" as the domain (you can pick what you want in your implementation)
    not supply an IP address (which the script will then have to lookup with nsl)
    will pause 5s between GET+check attempts
    pass no extra curl options

Getting into the details

For the IP address logic, we’ll have to test if we passed in an an address string and perform a lookup if not:

# try to resolve IP if not specified; if no IP can be found, return
# a "NA" data frame

  if (ip == "") {

    tmp <- nsl(site)
    if (is.null(tmp)) {
      return(data.frame(site=site, ip=NA, Certificate=NA,
                        Protocol.Support=NA, Key.Exchange=NA,
                        Cipher.Strength=NA)) }
    ip <- tmp
  }

(don’t worry about the return(...) part yet, we’ll get there in a bit).

Once we have an IP address, we’ll need to make the call to the ssllabs.com test site and perform the check loop:

# get the contents of the URL (will be the raw HTML text)
# build the URL with sprintf

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on", site, ip), .opts=curl.opts)

# while we don't find some indication of a completed request,
# pause and try again

while(!grepl("(Overall Rating|Assessment failed)", rating.dat)) {
  Sys.sleep(pause)
  rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on", site, ip), .opts=curl.opts)
}

We can then start making some decisions based on the results:

# if the assessment failed, return a data frame of NA's

if (grepl("Assessment failed", rating.dat)) {

  return(data.frame(site=site, ip=NA, Certificate=NA,
                    Protocol.Support=NA, Key.Exchange=NA,
                    Cipher.Strength=NA))
}

# otherwise, parse the resultant HTML

x <- htmlTreeParse(rating.dat, useInternalNodes = TRUE)

Unfortunately, the results are not “consistent”. While there are plenty of uniquely identifiable <div>s, there are enough differences between runs that we have to be a bit generic in our selection of data elements to extract. I’ll leave the view-source: of a result as an exercise to the reader. For this example, we’ll focus on extracting:

        the overall rating (A-F)
        the “Certificate” score
        the “Protocol Support” score
        the “Key Exchange” score
        the “Cipher Strength” score

There are plenty of additional fields to extract, but you should be able to extrapolate and grab what you want to from the rest of the example.

Extracting the results

We’ll need to delve into XPath to extract the <div> values. We’ll use the xpathSApply function to perform this task. Since there sometimes is a <span> tag within the <div> for the rating and since the rating has a class tag to help identify which color it should be, we use a starts-with selection parameter to just get anything beginning with rating_. If it returns an R list structure, we know we have the one with a <span> element, so we re-issue the call with that extra XPath component.

rating <- xpathSApply(x,"//div[starts-with(@class,'rating_')]/text()", xmlValue)

if (class(rating) == "list") {

  rating <- xpathSApply(x,"//div[starts-with(@class,'rating_')]/span/text()", xmlValue)
}

For the four attributes (and values) we’ll be extracting, we can use the getNodeSet call which will give us all of them into a structure we can process with xpathSApply

labs <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[@class='chartLabel']")

vals <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[starts-with(@class,'chartValue')]")

# convert them to vectors

labs <- xpathSApply(labs[[1]], "//div[@class='chartLabel']/text()", xmlValue)

vals <- xpathSApply(vals[[1]], "//div[starts-with(@class,'chartValue')]/text()", xmlValue)

At this point, labs will be a vector of label names and vals will be the corresponding values. We’ll put them, the original domain and the IP address into a data frame:

# rbind will turn the vector into row elements, with each

# value being in a column

rating.result <- data.frame(site=site, ip=ip,

                            rating=rating, rbind(vals),
                            row.names=NULL)

# we use the labs vector as the column names (in the right spot)    

colnames(rating.result) <- c("site", "ip", "rating",

                              gsub(" ", "\\.", labs))

and return the result:
return(rating.result)
Finishing up

If we run the whole function on one domain we’ll get a one-row data frame back as a result. If we use ldply from the plyr package to run the get_rating function repeatedly on a vector of domains, it will combine them all into one whole data frame. For example:

sites <- c("rud.is", "stackoverflow.com", "er-ant.com")

ratings <- ldply(sites, get_rating)

ratings

##                site              ip rating Certificate Protocol.Support Key.Exchange Cipher.Strength

## 1            rud.is  184.106.97.102      B         100               70           80              90

## 2 stackoverflow.com 198.252.206.140      A         100               90           80              90

## 3        er-ant.com            <NA>   <NA>        <NA>             <NA>         <NA>            <NA>

There are many tweaks you can make to this function to extract more data and perform additional processing. If you make some of your own changes, you’re encouraged to add to the gist (link above & below) and/or drop a note in the comments.

Hopefully you’ve seen how well-suited R is for this type of operation and have been encouraged to use it in your next attempt at some site/data scraping.

library(RCurl)
library(XML)
library(plyr)

 #' get the Qualys SSL Labs rating for a domain+cert

#'

#' @param site domain to test SSL configuration of

#' @param ip address of \code{site} (will resolve it and take\cr

#' first response if not specified, but that may not always work as you expect)

#' @param hide.results ["on"|"off"] should the results show up in the SSL Labs history (default "on")

#' @param pause timeout between tries (default 5s)

#' @param curl.opts options to pass to \code{getURL} i.e. proxy setting

#' @return data frame of results

#'

  get_rating <- function(site="rud.is", ip="", hide.results="on", pause=5, curl.opts=list()) {

# try to resolve IP if not specified; if no IP can be found, return

# a "NA" data frame

if (ip == "") {

tmp <- nsl(site)

if (is.null(tmp)) { return(data.frame(site=site, ip=NA, Certificate=NA,

Protocol.Support=NA, Key.Exchange=NA, Cipher.Strength=NA)) }

ip <- tmp

}

# need to let it actually process the certificate if not already cached

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on&hideResults=%s", site, ip, hide.results), .opts=curl.opts)

while(!grepl("(Overall Rating|Assessment failed)", rating.dat)) {

Sys.sleep(pause)

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on&hideResults=%s", site, ip, hide.results), .opts=curl.opts)

}

if (grepl("Assessment failed", rating.dat)) {

return(data.frame(site=site, ip=NA, Certificate=NA,

Protocol.Support=NA, Key.Exchange=NA, Cipher.Strength=NA))

}

x <- htmlTreeParse(rating.dat, useInternalNodes = TRUE)

# sometimes there is a <span ...> tag in the <div>, which will result in an

# empty list() object being returned. we check for that and handle it

# appropriately.

rating <- xmlValue(x[["//div[starts-with(@class,'rating_')]/text()"]])

if (class(rating) == "list") {

rating <- xmlValue(x[["//div[starts-with(@class,'rating_')]/span/text()"]])

}

# extract the XML objects for the ratings labels & values

labs <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[@class='chartLabel']")

vals <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[starts-with(@class,'chartValue')]")

# convert them to vectors

labs <- xpathSApply(labs[[1]], "//div[@class='chartLabel']/text()", xmlValue)

vals <- xpathSApply(vals[[1]], "//div[starts-with(@class,'chartValue')]/text()", xmlValue)

# make them into a data frame

rating.result <- data.frame(site=site, ip=ip, rating=rating, rbind(vals), row.names=NULL)

colnames(rating.result) <- c("site", "ip", "rating", gsub(" ", "\\.", labs))

return(rating.result)

}

 sites <- c("rud.is", "stackoverflow.com", "er-ant.com")

ratings <- ldply(sites, get_rating)

ratings

## site ip rating Certificate Protocol.Support Key.Exchange Cipher.Strength

## 1 rud.is 184.106.97.102 B 100 70 80 90

## 2 stackoverflow.com 198.252.206.140 A 100 90 80 90

## 3 er-ant.com <NA> <NA> <NA> <NA> <NA> <NA>

Source: http://www.r-bloggers.com/scraping-ssl-labs-server-test-results-with-r/

Wednesday 26 November 2014

Web Scraping Tools for Non-developers

I recently spoke with a resource-limited organization that is investigating government corruption and wants to access various public datasets to monitor politicians and law firms. They don’t have developers in-house, but feel pretty comfortable analyzing datasets in CSV form. While many public datasources are available in structured form, some sources are hidden in what us data folks call the deep web. Amazon is a nice example of a deep website, where you have to enter text into a search box, click on a few buttons to narrow down your results, and finally access relatively structured data (prices, model numbers, etc.) embedded in HTML. Amazon has a structured database of their products somewhere, but all you get to see is a bunch of webpages trapped behind some forms.

A developer usually isn’t hindered by the deep web. If we want the data on a webpage, we can automate form submissions and key presses, and we can parse some ugly HTML before emitting reasonably structured CSVs or JSON. But what can one accomplish without writing code?

This turns out to be a hard problem. Lots of companies have tried, to varying degrees of success, to build a programmer-free interface for structured web data extraction. I had the pleasure of working on one such project, called Needlebase at ITA before Google acquired it and closed things down. David Huynh, my wonderful colleague from grad school, prototyped a tool called Sifter that did most of what one would need, but like all good research from 2006, the lasting impact is his paper rather than his software artifact.

Below, I’ve compiled a list of some available tools. The list comes from memory, the advice of some friends that have done this before, and, most productively, a question on Twitter that Hilary Mason was nice enough to retweet.

The bad news is that none of the tools I tested would work out of the box for the specific use case I was testing. To understand why, I’ll break down the steps required for a working web scraper, and then use those steps to explain where various solutions broke down.

The anatomy of a web scraper

There are three steps to a structured extraction pipeline:

    Authenticate yourself. This might require logging in to a website or filling out a CAPTCHA to prove you’re not…a web scraper. Because the source I wanted to scrape required filling out a CAPTCHA, all of the automated tools I’ll review below failed step 1. It suggests that as a low bar, good scrapers should facilitate a human in the loop: automate the things machines are good at automating, and fall back to a human to perform authentication tasks the machines can’t do on their own.

    Navigate to the pages with the data. This might require entering some text into a search box (e.g., searching for a product on Amazon), or it might require clicking “next” through all of the pages that results are split over (often called pagination). Some of the tools I looked at allowed entering text into search boxes, but none of them correctly handled pagination across multiple pages of results.

    Extract the data. On any page you’d like to extract content from, the scraper has to help you identify the data you’d like to extract. The cleanest example of this that I’ve seen is captured in a video for one of the tools below: the interface lets you click on some text you want to pluck out of a website, asks you to label it, and then allows you to correct mistakes it learns how to extract the other examples on the page.

As you’ll see in a moment, the steps at the top of this list are hardest to automate.

What are the tools?

Here are some of the tools that came highly recommended, and my experience with them. None of those passed the CAPTCHA test, so I’ll focus on their handling of navigation and extraction.

    Web Scraper is a Chrome plugin that allows you to build navigable site maps and extract elements from those site maps. It would have done everything necessary in this scenario, except the source I was trying to scrape captured click events on links (I KNOW!), which tripped things up. You should give it a shot if you’d like to scrape a simpler site, and the youtube video that comes with it helps get around the slightly confusing user interface.

    import.io looks like a clean webpage-to-api story. The service views any webpage as a potential data source to generate an API from. If the page you’re looking at has been scraped before, you can access an API or download some of its data. If the page hasn’t been processed before, import.io walks you through the process of building connectors (for navigation) or extractors (to pull out the data) for the site. Once at the page with the data you want, you can annotate a screenshot of the page with the fields you’d like to extract. After you submit your request, it appears to get queued for extraction. I’m still waiting for the data 24 hours after submitting a request, so I can’t vouch for the quality, but the delay suggests that import.io uses crowd workers to turn your instructions into some sort of semi-automated extraction process, which likely helps improve extraction quality. The site I tried to scrape requires an arcane combination of javascript/POST requests that threw import.io’s connectors for a lo
op, and ultimately made it impossible to tell import.io how to navigate the site. Despite the complications, import.io seems like one of the more polished website-to-data efforts on this list.

    Kimono was one of the most popular suggestions I got, and is quite polished. After installing the Kimono bookmarklet in your browser, you can select elements of the page you wish to extract, and provide some positive/negative examples to train the extractor. This means that unlike import.io, you don’t have to wait to get access to the extracted data. After labeling the data, you can quickly export it as CSV/JSON/a web endpoint. The tool worked seamlessly to extract a feed from the Hackernews front page, but I’d imagine that failures in the automated approach would make me wish I had access to import.io’s crowd workers. The tool would be high on my list except that navigation/pagination is coming soon, and will ultimately cost money.

    Dapper, which is now owned by Yahoo!, provides about the same level of scraping capabilities as Kimono. You can extract content, but like Kimono it’s unclear how to navigate/paginate.

    Google Docs was an unexpected contender. If the data you’re extracting is in an HTML table/RSS Feed/CSV file/XML document on a single webpage with no navigation/authentication, you can use one of the Import* functions in Google Docs. The IMPORTHTML macro worked as advertised in a quick test.

    iMacros is a tool that I could imagine solves all of the tasks I wanted, but costs more than I was willing to pay to write this blog post. Interestingly, the free version handles the steps that the other tools on this list don’t do as well: navigation. Through your browser, iMacros lets you automate filling out forms, clicking on “next” links, etc. To perform extraction, you have to pay at least $495.

    A friend has used Screen-scraper in the past with good outcomes. It handles navigation as well as extraction, but costs money and requires a small amount of programming/tokenization skills.

    Winautomation seems cool, but it’s only available for Windows, which was a dead end for me.

So that’s it? Nothing works?

Not quite. None of these tools solved the problem I had on a very challenging website: the site clearly didn’t want to be crawled given the CAPTCHA, and the javascript-submitted POST requests threw most of the tools that expected navigation through links for a loop. Still, most of the tools I reviewed have snazzy demos, and I was able to use some of them for extracting content from sites that were less challenging than the one I initially intended to scrape.

All hope is not lost, however. Where pure automation fails, a human can step in. Several proposals suggested paying people on oDesk, Mechanical Turk, or CrowdFlower to extract the content with a human touch. This would certainly get us past the CAPTCHA and hard-to-automate navigation. It might get pretty expensive to have humans copy/paste the data for extraction, however. Given that the tools above are good at extracting content from any single page, I suspect there’s room for a human-in-the-loop scraping tool to steal the show: humans can navigate and train the extraction step, and the machine can perform the extraction. I suspect that’s what import.io is up to, and I’m hopeful they keep the tool available to folks like the ones I initially tried to help.

While we’re on the topic of human-powered solutions, it might make sense to hire a developer on oDesk to just implement the scraper for the site this organization was looking at. While a lot of the developer-free tools I mentioned above look promising, there are clearly cases where paying someone for a few hours of script-building just makes sense.

Source: http://blog.marcua.net/post/74655674340

Sunday 23 November 2014

Using Kimono Labs to Scrape the Web for Free

Historically, I have written and presented about big data—using data to create insights, and how to automate your data ingestion process by connecting to APIs and leveraging advanced database technologies.

Recently I spoke at SMX West about leveraging the rich data in webmaster tools. After the panel, I was approached by the in-house SEO of a small company, who asked me how he could extract and leverage all the rich data out there without having a development team or large budget. I pointed him to the CSV exports and some of the more hidden tools to extract Google data, such as the GA Query Builder and the YouTube Analytics Query Builder.

However, what do you do if there is no API? What do you do if you want to look at unstructured data, or use a data source that does not provide an export?

For today's analytics pros, the world of scraping—or content extraction (sounds less black hat)—has evolved a lot, and there are lots of great technologies and tools out there to help solve those problems. To do so, many companies have emerged that specialize in programmatic content extraction such as Mozenda, ScraperWiki, ImprtIO, and Outwit, but for today's example I will use Kimono Labs. Kimono is simple and easy to use and offers very competitive pricing (including a very functional free version). I should also note that I have no connection to Kimono; it's simply the tool I used for this example.

Before we get into the actual "scraping" I want to briefly discuss how these tools work.

The purpose of a tool like Kimono is to take unstructured data (not organized or exportable) and convert it into a structured format. The prime example of this is any ranking tool. A ranking tool reads Google's results page, extracts the information and, based on certain rules, it creates a visual view of the data which is your ranking report.

Kimono Labs allows you to extract this data either on demand or as a scheduled job. Once you've extracted the data, it then allows you to either download it via a file or extract it via their own API. This is where Kimono really shines—it basically allows you to take any website or data source and turn it into an API or automated export.

For today's exercise I would like to create two scrapers.

A. A ranking tool that will take Google's results and store them in a data set, just like any other ranking tool. (Disclaimer: this is meant only as an example, as scraping Google's results is against Google's Terms of Service).

B. A ranking tool for Slideshare. We will simulate a Slideshare search and then extract all the results including some additional metrics. Once we have collected this data, we will look at the types of insights you are able to generate.

1. Sign up

Signup is simple; just go to http://www.kimonolabs.com/signup and complete the form. You will then be brought to a welcome page where you will be asked to drag their bookmarklet into your bookmarks bar.

The Kimonify Bookmarklet is the trigger that will start the application.

2. Building a ranking tool

Simply navigate your browser to Google and perform a search; in this example I am going to use the term "scraping." Once the results pages are displayed, press the kimonify button (in some cases you might need to search again). Once you complete your search you should see a screen like the one below:

It is basically the default results page, but on the top you should see the Kimono Tool Bar. Let's have a close look at that:

The bar is broken down into a few actions:

    URL – Is the current URL you are analyzing.

    ITEM NAME – Once you define an item to collect, you should name it.

    ITEM COUNT – This will show you the number of results in your current collection.

    NEW ITEM – Once you have completed the first item, you can click this to start to collect the next set.

    PAGINATION – You use this mode to define the pagination link.

    UNDO – I hope I don't have to explain this ;)

    EXTRACTOR VIEW – The mode you see in the screenshot above.

    MODEL VIEW – Shows you the data model (the items and the type).

    DATA VIEW – Shows you the actual data the current page would collect.

    DONE – Saves your newly created API.

After you press the bookmarklet you need to start tagging the individual elements you want to extract. You can do this simply by clicking on the desired elements on the page (if you hover over it, it changes color for collectable elements).

Kimono will then try to identify similar elements on the page; it will highlight some suggested ones and you can confirm a suggestion via the little checkmark:

A great way to make sure you have the correct elements is by looking at the count. For example, we know that Google shows 10 results per page, therefore we want to see "10" in the item count box, which indicates that we have 10 similar items marked. Now go ahead and name your new item group. Each collection of elements should have a unique name. In this page, it would be "Title".

Now it's time to confirm the data; just click on the little Data icon to see a preview of the actual data this page would collect. In the data view you can switch between different formats (JSON, CSV and RSS). If everything went well, it should look like this:

As you can see, it not only extracted the visual title but also the underlying link. Good job!

To collect some more info, click on the Extractor icon again and pick out the next element.

Now click on the Plus icon and then on the description of the first listing. Since the first listing contains site links, it is not clear to Kimono what the structure is, so we need to help it along and click on the next description as well.

As soon as you do this, Kimono will identify some other descriptions; however, our count only shows 8 instead of the 10 items that are actually on that page. As we scroll down, we see some entries with author markup; Kimono is not sure if they are part of the set, so click the little checkbox to confirm. Your count should jump to 10.

Now that you identified all 10 objects, go ahead and name that group; the process is the same as in the Title example. In order to make our Tool better than others, I would like to add one more set— the author info.

Once again, click the Plus icon to start a new collection and scroll down to click on the author name. Because this is totally unstructured, Google will make a few recommendations; in this case, we are working on the exclusion process, so press the X for everything that's not an author name. Since the word "by" is included, highlight only the name and not "by" to exclude that (keep in mind you can always undo if things get odd).

Once you've highlighted both names, results should look like the one below, with the count in the circle being 2 representing the two authors listed on this page.

Out of interest I did the same for the number of people in their Google+ circles. Once you have done that, click on the Model View button, and you should see all the fields. If you click on the Data View you should see the data set with the authors and circles.

As a final step, let's go back to the Extractor view and define the pagination; just click the Pagination button (it looks like a book) and select the next link. Once you have done that, click Done.

You will be presented with a screen similar to this one:

Here you simply name your API, define how often you want this data to be extracted and how many pages you want to crawl. All of these settings can be changed manually; I would leave it with On demand and 10 pages max to not overuse your credits.

Once you've saved your API, there are a ton of options (too many to review here). Kimono has a great learning section you can check out any time.

To collect the listings requires a quick setup. Click on the pagination tab, turn it on and set your schedule to On demand to pull data when you ask it to. Your screen should look like this:

Now press Crawl and Kimono will start collecting your data. If you see any issues, you can always click on Edit API and go back to the extraction screen.

Once the crawl is completed, go to the Test Endpoint tab to view or download your data (I prefer CSV because you can easily open it in Excel, CSV, Spotfire, etc.) A possible next step here would be doing this for multiple keywords and then analyzing the impact of, say, G+ Authority on rankings. Again, many of you might say that a ranking tool can already do this, and that's true, but I wanted to cover the basics before we dive into the next one.

3. Extracting SlideShare data

With Slideshare's recent growth in popularity it has become a document sharing tool of choice for many marketers. But what's really on Slideshare, who are the influencers, what makes it tick? We can utilize a custom scraper to extract that kind data from Slideshare.

To get started, point your browser to Slideshare and pick a keyword to search for.

For our example I want to look at presentations that talk about PPC in English, sorted by popularity, so the URL would be:

http://www.slideshare.net/search/slideshow?ft=presentations&lang=en&page=1&q=ppc&qf=qf1&sort=views&ud=any

Once you are on that page, pick the Kimonify button as you did earlier and tag the elements. In this case I will tag:

    Title
    Description
    Category
    Author
    Likes
    Slides

Once you have tagged those, go ahead and add the pagination as described above.

That will make a nice rich dataset which should look like this:

Hit Done and you're finished. In order to quickly highlight the benefits of this rich data, I am going to load the data into Spotfire to get some interesting statics (I hope).

4. Insights

Rather than do a step-by-step walktrough of how to build dashboards, which you can find here, I just want to show you some insights you can glean from this data:

    Most Popular Authors by Category. This shows you the top contributors and the categories they are in for PPC (squares sized by Likes)

    Correlations. Is there a correlation between the numbers of slides vs. the number of likes? Why not find out?
    Category with the most PPC content. Discover where your content works best (most likes).

5. Output

One of the great things about Kimono we have not really covered is that it actually converts websites into APIs. That means you build them once, and each time you need the data you can call it up. As an example, if I call up the Slideshare API again tomorrow, the data will be different. So you basically appified Slisdeshare. The interesting part here is the flexibility that Kimono offers. If you go to the How to Use slide, you will see the way Kimono treats the Source URL In this case it looks like this:

The way you can pull data from Kimono aside from the export is their own API; in this case you call the default URL,

http://www.kimonolabs.com/api/YOURPAIID?apikey=YO...

You would get the default data from the original URL; however, as illustrated in the table above, you can dynamically adjust elements of the source URL.

For example, if you append "&q=SEO"

(http://www.kimonolabs.com/api/YOURPAIID?apikey=YOURAPIKEY&q=SEO)

you would get the top slides for SEO instead of PPC. You can change any of the URL options easily.

I know this was a lot of information, but believe me when I tell you, we just scratched the surface. Tools like Kimono offer a variety of advanced functions that really open up the possibilities. Once you start to realize the potential, you will come up with some amazing, innovative ideas. I would love to see some of them here shared in the comments. So get out there and start scraping … and please feel free to tweet at me or reply below with any questions or comments!

Source: http://moz.com/blog/web-scraping-with-kimono-labs

Wednesday 19 November 2014

Web Scraping for SEO with these Open-Source Scrapers

When conducting Search Engine Optimization (SEO), we’re required to scrape websites for data, our campaigns, and reports for our clients. At the lowest level we utilize scraping to keep track of rankings on search engines like Google, Bing, and Yahoo, even keep a track of links on websites to know when it’s completed its lifespan. Then we’ve used them to help us aggregate data from APIs, RSS feeds, and websites to conduct some of our data mining to find patterns to help us become more competitive. 

So scraping is a function majority of companies (SEOmoz, Raventools, and Google) have to do to either save money, protect intellectual property, track trends, etc… Businesses can find infinite uses with scraping tools, it just depends if you’re an printed circuit board manufacturer looking for ideas on your e-mail marketing campaign or a Orange County based business trying to keep an eye out on the competition. which is why we’ve created a comprehensive list of open source scrapers out there to help all the businesses out there. Just keep in mind we haven’t used all of them!

Words of caution, web scrapers require knowledge specific to the language such as PHP & cURL. Take into considerations issues like cookie management, fault tolerance, organizing the data properly, not crashing the website being scraped, and making sure the website doesn’t prohibit scraping.

If you’re ready, here’s the list…

Erlang

    eBot

Java

    Heritrix
    Nutch
    Piggy Bank
    WebSPHINX
    WebHarvest

PHP

    PHPCrawl
    Snoopy
    SpiderMonkey

Python

    BeautifulSoap
    HarvestMan
    Scrape.py
    Scrapemark
    Scrapy **
    Mechanize

Ruby

    Anemone
    scRUBYt

We’ll come back and update this list as we encounter more! If you would like to submit a solution we missed, feel free. Also we’re looking for guides related to each of these, so if you know of any or would be interested in guesting blogging about one, let us know!

Source:http://www.annexcore.com/blog/web-scraping-for-seo-with-these-open-source-scrapers/

Monday 17 November 2014

How to scrape data without coding? A step by step tutorial on import.io

Import.io (pronounced import-eye-oh) lets you scrape data from any website into a searchable database. It is perfect for gathering, aggregating and analysing data from websites without the need for coding skills. As Sally Hadadi, from Import.io, told Journalism.co.uk: the idea is to “democratise” data. “We want journalists to get the best information possible to encourage and enhance unique, powerful pieces of work and generally make their research much easier.” Different uses for journalists, supplemented by case studies, can be found here.

A beginner’s guide

After downloading and opening import.io browser, copy the URL of the page you want to scrape into the import.io browser. I decided to scrape the search results website of orphanages in London:

001 Orphanages in London

After opening the website, press the tiny pink button in top right corner of the browser and follow up with “Let’s get cracking!” in the bottom right menu which has just appeared.

Then, choose the type of scraping you want to perform. In my case, it’s a Crawler (we’ll be getting data from multiple similar pages on the same site):

crawler

And confirm the URL of the website you want to scrape by clicking “I’m there”.

As advised, choose “Detect optimal settings” and confirm the following:

data

In the menu “Rows per page” select the format in which data appears on the website, whether it is “single” or “multiple”. I’m opting for the multiple as my URL is a listing of multiple search results:multiple

Now, the time has come to “train your rows” i.e. mark which part of the website you are interested in scraping. Hover over an entire “entry” or “paragraph”:hover over entry

…and he entry will be highlighted in pink or blue. Press “Train rows”.

train rows

Repeat the operation with the next entry/paragraph so that the scraper gets the hang of the pattern of your selections. Two examples should suffice. Scroll down to the bottom of your website to make sure that all entries until the last one are selected (=highlighted in pink or blue alternately).

If it is, press “I’ve got all 50 rows” (the number depends on how many rows you have selected).

Now it’s time to focus on particular chunks of data you would like to extract. My entries consist of a name of the orphanage, address, phone number and a short description so I will extract all those to separate columns. Let’s start by adding a column “name”:

add column

Next, highlight the name of the first orphanage in the list and press “Train”.

highlighttrain

Your table should automatically fill in with names of all orphanages in the list:table name

If it didn’t, try tweaking your selection a bit. Then add another column “address” and extract the address of the orphanage by highlighting the two lines of addresses and “training” the rows.

Repeat the operation for a “phone number” and “description”. Your table should end up looking like this:table final

*Before passing on to the next column it is worth to check that all the rows have filled up. If not, highlighting and training of the individual elements might be necessary.

Once you’ve grabbed all that you need, click “I’ve got what I need”. The menu will now ask you if you want to scrape more pages. In this case, the search yielded two pages of search results so I will add another page. In order to this this, go back to your website in you regular browser, choose page 2 (or any next one) of your search results and copy the URL. Paste it into the import.io browser and confirm by clicking “I’m there”:

i'm there

The scraper should automatically fill in your table for page 2. Click “I’ve got all 45 rows” and “I’ve got what I needed”.

You need to add at least 5 pages, which is a bit frustrating with a smaller data set like this one. The way around it is to add page 2 a couple of times and delete the unnecessary rows in the final table.

Once the cheating is done, click “I’m done training!” and “Upload to import.io”.

upload

Give the name to your Crawler, e.g. “Orphanages in London” and wait for import.io to upload your data. Then, run crawler:run crawler

Make sure that the page depth is 10 and that click “Go”. If you’re scraping a huge dataset with several pages of search results, you can copy your URLs to Excel, highlight them and drag down with a black cross (bottom right of the cell) to obtain a comprehensive list. Paste it into the “Where to start?” window and press “Go”.go

crawlingAfter the crawling is complete, you can download you data in EXCEL, HTML, JSON or CSV.dataset

As a result, we obtain a data set which can be easily turned into a map of orphanages in London, e.g. using Google Fusion Tables.

Source:http://www.interhacktives.com/2014/03/06/scrape-data-without-coding-step-step-tutorial-import-io/

Thursday 13 November 2014

Scraping Data: Site-specific Extractors vs. Generic Extractors

Scraping is becoming a rather mundane job with every other organization getting its feet wet with it for their own data gathering needs. There have been enough number of crawlers built – some open-sourced and others internal to organizations for in-house utilities. Although crawling might seem like a simple technique at the onset, doing this at a large-scale is the real deal. You need to have a distributed stack set up to take care of handling huge volumes of data, to provide data in a low-latency model and also to deal with fail-overs. This still is achievable after crossing the initial tech barrier and via continuous optimizations. (P.S. Not under-estimating this part because it still needs a team of Engineers monitoring the stats and scratching their heads at times).

Social Media Scraping

Focused crawls on a predefined list of sites

However, you bump into a completely new land if your goal is to generate clean and usable data sets from these crawls i.e. “extract” data in a format that your DB can process and aid in generating insights. There are 2 ways of tackling this:

a. site-specific extractors which give desired results

b. generic extractors that result in few surprises

Assuming you still do focused crawls on a predefined list of sites, let’s go over specific scenarios when you have to pick between the two-

1. Mass-scale crawls; high-level meta data - Use generic extractors when you have a large-scale crawling requirement on a continuous basis. Large-scale would mean having to crawl sites in the range of hundreds of thousands. Since the web is a jungle and no two sites share the same template, it would be impossible to write an extractor for each. However, you have to settle in with just the document-level information from such crawls like the URL, meta keywords, blog or news titles, author, date and article content which is still enough information to be happy with if your requirement is analyzing sentiment of the data.

cb1c0_one-size

A generic extractor case

Generic extractors don’t yield accurate results and often mess up the datasets deeming it unusable. Reason being

programatically distinguishing relevant data from irrelevant datasets is a challenge. For example, how would the extractor know to skip pages that have a list of blogs and only extract the ones with the complete article. Or delineating article content from the title on a blog page is not easy either.

To summarize, below is what to expect of a generic extractor.

Pros-

minimal manual intervention

low on effort and time

can work on any scale

Cons-

Data quality compromised

inaccurate and incomplete datasets

lesser details suited only for high-level analyses

Suited for gathering- blogs, forums, news

Uses- Sentiment Analysis, Brand Monitoring, Competitor Analysis, Social Media Monitoring.

2. Low/Mid scale crawls; detailed datasets - If precise extraction is the mandate, there’s no going away from site-specific extractors. But realistically this is do-able only if your scope of work is limited i.e. few hundred sites or less. Using site-specific extractors, you could extract as many number of fields from any nook or corner of the web pages. Most of the times, most pages on a website share similar templates. If not, they can still be accommodated for using site-specific extractors.

cutlery

Designing extractor for each website

Pros-

High data quality

Better data coverage on the site

Cons-

High on effort and time

Site structures keep changing from time to time and maintaining these requires a lot of monitoring and manual intervention

Only for limited scale

Suited for gathering - any data from any domain on any site be it product specifications and price details, reviews, blogs, forums, directories, ticket inventories, etc.

Uses- Data Analytics for E-commerce, Business Intelligence, Market Research, Sentiment Analysis

Conclusion

Quite obviously you need both such extractors handy to take care of various use cases. The only way generic extractors can work for detailed datasets is if everyone employs standard data formats on the web (Read our post on standard data formats here). However, given the internet penetration to the masses and the variety of things folks like to do on the web, this is being overly futuristic.

So while site-specific extractors are going to be around for quite some time, the challenge now is to tweak the generic ones to work better. At PromptCloud, we have added ML components to make them smarter and they have been working well for us so far.

What have your challenges been? Do drop in your comments.

Source: https://www.promptcloud.com/blog/scraping-data-site-specific-extractors-vs-generic-extractors/

Wednesday 12 November 2014

'Scrapers' Dig Deep for Data on Web

At 1 a.m. on May 7, the website PatientsLikeMe.com noticed suspicious activity on its "Mood" discussion board. There, people exchange highly personal stories about their emotional disorders, ranging from bipolar disease to a desire to cut themselves.

It was a break-in. A new member of the site, using sophisticated software, was "scraping," or copying, every single message off PatientsLikeMe's private online forums.

Enlarge Image

Bilal Ahmed wrote about his health on a site that was scraped. Andrew Quilty for The Wall Street Journal.

PatientsLikeMe managed to block and identify the intruder: Nielsen Co., the privately held New York media-research firm. Nielsen monitors online "buzz" for clients, including major drug makers, which buy data gleaned from the Web to get insight from consumers about their products, Nielsen says.

"I felt totally violated," says Bilal Ahmed, a 33-year-old resident of Sydney, Australia, who used PatientsLikeMe to connect with other people suffering from depression. He used a pseudonym on the message boards, but his PatientsLikeMe profile linked to his blog, which contains his real name.

After PatientsLikeMe told users about the break-in, Mr. Ahmed deleted all his posts, plus a list of drugs he uses. "It was very disturbing to know that your information is being sold," he says. Nielsen says it no longer scrapes sites requiring an individual account for access, unless it has permission.

Related Reading

    Digits: Escaping the 'Scrapers'
    Complete Coverage: What They Know

Journal Community

The market for personal data about Internet users is booming, and in the vanguard is the practice of "scraping." Firms offer to harvest online conversations and collect personal details from social-networking sites, résumé sites and online forums where people might discuss their lives.

The emerging business of web scraping provides some of the raw material for a rapidly expanding data economy. Marketers spent $7.8 billion on online and offline data in 2009, according to the New York management consulting firm Winterberry Group LLC. Spending on data from online sources is set to more than double, to $840 million in 2012 from $410 million in 2009.

The Wall Street Journal's examination of scraping—a trade that involves personal information as well as many other types of data—is part of the newspaper's investigation into the business of tracking people's activities online and selling details about their behavior and personal interests.

Some companies collect personal information for detailed background reports on individuals, such as email addresses, cell numbers, photographs and posts on social-network sites.

Others offer what are known as listening services, which monitor in real time hundreds or thousands of news sources, blogs and websites to see what people are saying about specific products or topics.

One such service is offered by Dow Jones & Co., publisher of the Journal. Dow Jones collects data from the Web—which may include personal information contained in news articles and blog postings—that help corporate clients monitor how they are portrayed. It says it doesn't gather information from password-protected parts of sites.

It's rarely a coincidence when you see Web ads for products that match your interests. WSJ's Christina Tsuei explains how advertisers use cookies to track your online habits.

The competition for data is fierce. PatientsLikeMe also sells data about its users. PatientsLikeMe says the data it sells is anonymized, no names attached.

Nielsen spokesman Matt Anchin says the company's reports to its clients include publicly available information gleaned from the Internet, "so if someone decides to share personally identifiable information, it could be included."

Internet users often have little recourse if personally identifiable data is scraped: There is no national law requiring data companies to let people remove or change information about themselves, though some firms let users remove their profiles under certain circumstances.

California has a special protection for public officials, including politicians, sheriffs and district attorneys. It makes it easier for them to remove their home address and phone numbers from these databases, by filling out a special form stating they fear for their safety.

Data brokers long have scoured public records, such as real-estate transactions and courthouse documents, for information on individuals. Now, some are adding online information to people's profiles.

Many scrapers and data brokers argue that if information is available online, it is fair game, no matter how personal.

"Social networks are becoming the new public records," says Jim Adler, chief privacy officer of Intelius Inc., a leading paid people-search website. It offers services that include criminal background checks and "Date Check," which promises details about a prospective date for $14.95.

"This data is out there," Mr. Adler says. "If we don't bring it to the consumer's attention, someone else will."

Scraping for Your Real Name

PeekYou.com has applied for a patent for a way to, among other things, match people's real names to pseudonyms they use on blogs, Twitter and online forums.

Read PeekYou.com's patent application.

Enlarge Image

New York-based PeekYou LLC has applied for a patent for a method that, among other things, matches people's real names to the pseudonyms they use on blogs, Twitter and other social networks. PeekYou's people-search website offers records of about 250 million people, primarily in the U.S. and Canada.

PeekYou says it also is starting to work with listening services to help them learn more about the people whose conversations they are monitoring. It says it hands over only demographic information, not names or addresses.

Employers, too, are trying to figure out how to use such data to screen job candidates. It's tricky: Employers legally can't discriminate based on gender, race and other factors they may glean from social-media profiles.

One company that screens job applicants for employers, InfoCheckUSA LLC in Florida, began offering limited social-networking data—some of it scraped—to employers about a year ago. "It's slowly starting to grow," says Chris Dugger, national account manager. He says he's particularly interested in things like whether people are "talking about how they just ripped off their last employer."

Scrapers operate in a legal gray area. Internationally, anti-scraping laws vary. In the U.S., court rulings have been contradictory. "Scraping is ubiquitous, but questionable," says Eric Goldman, a law professor at Santa Clara University. "Everyone does it, but it's not totally clear that anyone is allowed to do it without permission."

Scrapers and listening companies say what they're doing is no different from what any person does when gathering information online—they just do it on a much larger scale.

"We take an incomprehensible amount of information and make it intelligent," says Chase McMichael, chief executive of InfiniGraph, a Palo Alto, Calif., "listening service" that helps companies understand the likes and dislikes of online customers.

Scraping services range from dirt cheap to custom-built. Some outfits, such as 80Legs.com in Texas, will scrape a million Web pages for $101. One Utah company, screen-scraper.com, offers do-it-yourself scraping software for free. The top listening services can charge hundreds of thousands of dollars to monitor and analyze Web discussions.

Some scrapers-for-hire don't ask clients many questions.

"If we don't think they're going to use it for illegal purposes—they often don't tell us what they're going to use it for—generally, we'll err on the side of doing it," says Todd Wilson, owner of screen-scraper.com, a 10-person firm in Provo, Utah, that operates out of a two-room office. It is one of at least three firms in a scenic area known locally as "Happy Valley" that specialize in scraping.

Enlarge Image

Some of the computer code behind screen-scraper.com's software. Chris Detrick for The Wall Street Journal

Screen-scraper charges between $1,500 and $10,000 for most jobs. The company says it's often hired to conduct "business intelligence," working for companies who want to scrape competitors' websites.

One recent assignment: A major insurance company wanted to scrape the names of agents working for competitors. Why? "We don't know," says Scott Wilson, the owner's brother and vice president of sales. Another job: attempting to scrape Facebook for a multi-level marketing company that wanted email addresses of users who "like" the firm's page—as well as their friends—so they all could be pitched products.

Scraping often is a cat-and-mouse game between websites, which try to protect their data, and the scrapers, who try to outfox their defenses. Scraping itself isn't difficult: Nearly any talented computer programmer can do it. But penetrating a site's defenses can be tough.

One defense familiar to most Internet users involves "captchas," the squiggly letters that many websites require people to type to prove they're human and not a scraping robot. Scrapers sometimes fight back with software that deciphers captchas.

More From the Series

    Web's New Goldmine: Your Secrets

    Personal Details Exposed Via Biggest Websites

    Microsoft Quashed Bid to Boost Web Privacy

    On Web's Cutting Edge, Anonymity in Name Only

    Stalking by Cellphone

    Google Agonizes Over Privacy

    The Tracking Ecosystem

    On the Web, Children Face Intensive Tracking

Some professional scrapers stage blitzkrieg raids, mounting around a dozen simultaneous attacks on a website to grab as much data as quickly as possible without being detected or crashing the site they're targeting.

Raids like these are on the rise. "Customers for whom we were regularly blocking about 1,000 to 2,000 scrapes a month are now seeing three times or in some cases 10 times as much scraping," says Marino Zini, managing director of Sentor Anti Scraping System. The company's Stockholm team blocks scrapers on behalf of website clients.

At Monster.com, the jobs website that stores résumés for tens of millions of individuals, fighting scrapers is a full-time job, "every minute of every day of every week," says Patrick Manzo, global chief privacy officer of Monster Worldwide Inc. Facebook, with its trove of personal data on some 500 million users, says it takes legal and technical steps to deter scraping.

At PatientsLikeMe, there are forums where people discuss experiences with AIDS, supranuclear palsy, depression, organ transplants, post-traumatic stress disorder and self-mutilation. These are supposed to be viewable only by members who have agreed not to scrape, and not by intruders such as Nielsen.

"It was a bad legacy practice that we don't do anymore," says Dave Hudson, who in June took over as chief executive of the Nielsen unit that scraped PatientsLikeMe in May. "It's something that we decided is not acceptable, and we stopped."

Mr. Hudson wouldn't say how often the practice occurred, and wouldn't identify its client.

The Nielsen unit that did the scraping is now part of a joint venture with McKinsey & Co. called NM Incite. It traces its roots to a Cincinnati company called Intelliseek that was founded in 1997. One of its most successful early businesses was scraping message boards to find mentions of brand names for corporate clients.

In 2001, the venture-capital arm of the Central Intelligence Agency, In-Q-Tel Inc., was among a group of investors that put $8 million into the business.

Intelliseek struggled to set boundaries in the new business of monitoring individual conversations online, says Sundar Kadayam, Intelliseek's co-founder. The firm decided it wouldn't be ethical to use automated software to log into private message boards to scrape them.

But, he says, Intelliseek occasionally would ask employees to do that kind of scraping if clients requested it. "The human being can just sign in as who they are," he says. "They don't have to be deceitful."

In 2006, Nielsen bought Intelliseek, which had revenue of more than $10 million and had just become profitable, Mr. Kadayam says. He left one year after the acquisition.

At the time, Nielsen, which provides television ratings and other media services, was looking to diversify into digital businesses. Nielsen combined Intelliseek with a New York startup it had bought called BuzzMetrics.

The new unit, Nielsen BuzzMetrics, quickly became a leader in the field of social-media monitoring. It collects data from 130 million blogs, 8,000 message boards, Twitter and social networks. It sells services such as "ThreatTracker," which alerts a company if its brand is being discussed in a negative light. Clients include more than a dozen of the biggest pharmaceutical companies, according to the company's marketing material.

Like many websites, PatientsLikeMe has software that detects unusual activity. On May 7, that software sounded an alarm about the "Mood" forum.

David Williams, the chief marketing officer, quickly determined that the "member" who had triggered the alert actually was an automated program scraping the forum. He shut down the account.

The next morning, the holder of that account e-mailed customer support to ask why the login and password weren't working. By the afternoon, PatientsLikeMe had located three other suspect accounts and shut them down. The site's investigators traced all of the accounts to Nielsen BuzzMetrics.

On May 18, PatientsLikeMe sent a cease-and-desist letter to Nielsen. Ten days later, Nielsen sent a letter agreeing to stop scraping. Nielsen says it was unable to remove the scraped data from its database, but a company spokesman later said Nielsen had found a way to quarantine the PatientsLikeMe data to prevent it from being included in its reports for clients.

PatientsLikeMe's president, Ben Heywood, disclosed the break-in to the site's 70,000 members in a blog post. He also reminded users that PatientsLikeMe also sells its data in an anonymous form, without attaching user's names to it. That sparked a lively debate on the site about the propriety of selling sensitive information. The company says most of the 350 responses to the blog post were supportive. But it says a total of 218 members quit.

In total, PatientsLikeMe estimates that the scraper obtained about 5% of the messages in the site's forums, primarily in "Mood" and "Multiple Sclerosis."

Source: http://online.wsj.com/articles/SB10001424052748703358504575544381288117888

Monday 10 November 2014

My Experience in Choosing a Web Scraping Service

Recently I decided to outsource a web scraping project to another company. I typed “web scraping service” in Google, chose six services from the first two search result pages and sent the project specifications to all of them to get quotes. Eventually I decided to go another way and did not order the services, but my experience may be useful for others who want to entrust web scraping jobs to third party services.

If you are interested in price comparisons only and not ready to read the whole story just scroll down.

A list of web scraping services I sent my project to:

    www.datahen.com - Canadian web scraping service with nice web design
    webdata-scraping.com - Indian service by Keval Kothari
    www.iwebscraping.com - India based web scraping company (same as www.3idatascraping.com)
    scrapinghub.com - A scraping service founded by creators of Scrapy
    web-scraper.com - Yet another web scraping service
    grepsr.com - A scraping service that we already reviewed two years ago


Sending the request
All the services except scrapinghub.com have quite simple forms for the description of the project requirements. Basically, you just need to give your contact details and a project description in any form. Some of them are pretty (like datahen.com), some of them are more ascetic (like web-scraper.com), but all of them allow you to send your requirements to developers.

Scrapinghub.com has a quite long form, but most of the fields are optional and all the questions are quite natural. If you really know what you need, then it won’t be hard to answer all of them; moreover they rather help you to describe your need in detail.

Note, that in the context of the project I didn’t make a request for a scraper itself. I asked to receive data on a weekly basis only.

Getting responses

Since I sent my request on Sunday it would have been ok not to receive responses the same day, but I got the first response in 3 hrs! It was from web-scraper.com and stated that this project will cost me $250 monthly. Simple and clear. Thank you, Thang!

Right after that, I received the second response. This time it was Keval from webdata-scraping.com. He had some questions regarding the project. Then after two days he wrote me that it would be hard to scrape some of my data with the software he uses, and that he will try to use a custom scraper. After that he disappeared… ((

Then on Monday I received Cost & ETAT details from datahen.com. It looked quite professional and contained not only price, but also time estimation. They were ready to create such a scraper in 3-4 days for $249 and then maintain it for just $65/month.

On the same day I received a quote from iwebscraping.com. It was $60 per week. Everything is fine, but I’d like to mention that it wasn’t the last letter from them. After I replied to them (right after receiving the quote), I received a reminder letter from them every other day for about a week. So be ready for aggressive marketing if you ask them for a quote )).

Finally in two days after requesting a quote I got a response from scrapinghub.com. Paul Tremberth wrote me that they were ready to build a scraper for $1200 and then maintain it for $300/month.

It is interesting that I have never received an answer from grepsr.com! Two years ago it was the first web scraping service we faced on the web, but now they simply ignored my request! Or perhaps they didn’t receive it somehow? Anyway I had no time for investigation.

So what?

Let us put everything together. Out of six web scraping  services I received four quotes with the following prices:

Service     Setup fee     Monthly fee

web-scraper.com     -     $250
datahen.com     $249     $65
iwebscraping.com     -     $240
scrapinghub.com     $1200     $300


From this table you can see that  scrapinghub.com appears to be the most expensive service among those compared.

EDIT: These $300/month gives you as much support and development needed to fix a 5M multi-site web crawler, for example. If you need a cheaper solution you can use their Autoscraping tool, which is free, and would have costed around $2/month to crawl at my requested rates.

The average cost of monthly scraping is about $250, but from a long term perspective datahen.com may save you money due to their low monthly fee.

That’s it! If I had enough money available it would be interesting to compare all these services in operation and provide you a more complete report, but this is all I have for now.

If you have anything to share about your experience in using similar services, please contribute to this post by commenting on it below. Cheers!

Source: http://scraping.pro/choosing-web-scraping-service/

Saturday 8 November 2014

Why People Hesitate To Try Data Mining

What is hindering a number of people from venturing into the promising world of data mining? Despite so much encouragement, promotions, testimonials, and evidences of the benefits of online data collection, still only a handful take the challenge and really gain the pay offs it has to offer.

It may sound unthinkable that such an opportunity for success has been neglected by many. It may also sound absurd why many well-meaning individuals are hindered from enjoying the benefits of the blessings of the 21st century.

The Causes

After considerable observation and analysis of the human psyche, one can understand the underlying reasons behind the hesitance to try the profitable data mining service. The most common reasons why people are afraid to try new technology or why they remain passive and uninvolved are: fear; lack of knowledge; and pride.

Fear. The most paralyzing of human emotions is fear. It can, to some extent, cause a person to be insane, unprofitable, sick, and lost. Although fear is a normal reaction to certain stimuli and a natural feeling experienced by humans, it must always be monitored and controlled.  Usually, people share common fears, such as: fear of change; fear of anything new; and fear of the unknown.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/people-hesitate-try-data-mining/

Wednesday 5 November 2014

Why Web Scraping is Indispensable

The 21st century has opened the gates to hidden treasures and unlimited access to information globally without the constraints of time and space, through Internet technology. Along with this development comes the necessity for each business or company to get as much information as possible in order in order to thrive in the ever increasing demand for new innovations, comparisons, and trends.

Web scraping has consequently become an indispensable option to achieve all the needed data as quickly and efficiently as possible. In this view, data mining then appears to be the best and the only way to answer the present demand for updates, data, coping, foreknowledge, analysis, and evaluation. Indeed, information has inevitably become a valuable commodity and the most sought after product among online and offline entrepreneurs.

Need for Data

The increasing need for new data makes it possible for the experts to become increasingly creative in accessing information worldwide. The more knowledge one has, the better are his or her chances of growing and surviving. There seems to be no other time in the human existence where data has become so much a major source of revenue as the contemporary times.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-indispensable/