It is common practice for Data Scientists to scrape public and private data for use in building their Models. The scraped data augments existing datasets and can be presented to the business in varied forms.

To begin with, let's review some of the frequently asked questions on Web Scraping:

  1. Build your own (BYO) or Rent from others (RTO)?
  2. What are the pitfalls of BYO?
  3. Is it ethically right?
  4. What is the right technology?

Web scraping is used primarily by Marketing companies, Real Estate firms (scraping listings data), and e-commerce businesses seeking a competitive edge, with the results eventually feeding into Data Science work.


Build your own (BYO) or Rent from others (RTO)?


  • Target advertising to consumers shopping for products such as appliances, automobiles, and travel. Data is scraped from the profiles and other relevant information consumers submit; the key items are the email address and any metadata relevant to their searches
  • Compare listings of properties for sale within an area; real estate agents' apps are enhanced by scraping newly listed properties
  • Useful in lead generation by scraping contact details
  • Retail data for product comparison, driving increases or reductions in pricing models

There are companies that sell web scraping as a subscription service. Renting makes more sense if you do not have the in-house skill set to build your own web scraping models. The typical pricing from ParseHub is about $499 per month for a total of 120 projects, so the annual cost could amount to about $6K plus any customization.

The pricing model is listed here. Other competitors provide similar services ranging from $4K to $10K, with size limits and limited data retention periods. In short, assuming that you are scraping 5-10 websites, you are looking at roughly $10K in implementation cost plus $1,000 per month in maintenance and upkeep.

Freelance Web Scrapers charge anywhere between $5 and $30 an hour.
However, if web scraping is your primary method of consumer engagement, it is advisable to have an in-house developer with skills in open source technologies. Data Analysts can also develop models that can be tweaked and regularly maintained, and the data can be saved in your own cloud database. Keep in mind that website owners frequently change their layouts to deter web scraping, so a model that worked last month may not work today. Because of this constant adjustment to the web scraping model, BYO makes more sense if you have a cloud-based resource such as a PaaS or a hybrid cloud.

So, what are the pitfalls of BYO?

  • The initial setup and product build can end up being more expensive than budgeted, especially while developing the model and resolving issues
  • The scraping model requires constant upkeep, and your developer has to stay consistently engaged in that upkeep
  • Limiting the scrape to one or a few websites will not get you everything you need; many websites lack key elements such as phone numbers, pricing, and reviews, so include many websites to enrich the data
  • Putting all the scraped data together as it arrives from disparate sites requires a lot of rules to be set up upfront, and these rules need to be flexible enough to accommodate future changes (see the sketch below)
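
As a rough illustration of those upfront rules, the sketch below maps site-specific field names onto one common schema. The site names and field names are hypothetical; the point is that each site gets its own mapping, and new sites only require a new entry.

```python
# Hypothetical mapping rules: each site's raw field names -> one common schema.
FIELD_MAPPINGS = {
    "site_a": {"item_name": "product", "cost": "price", "stars": "rating"},
    "site_b": {"title": "product", "amount": "price", "review_score": "rating"},
}

def normalize(record, site):
    """Translate one scraped record from `site` into the common schema."""
    mapping = FIELD_MAPPINGS[site]
    return {common: record.get(raw) for raw, common in mapping.items()}

# Two differently shaped records end up with identical keys.
print(normalize({"item_name": "Blender", "cost": 29.99, "stars": 4.2}, "site_a"))
print(normalize({"title": "Blender", "amount": 31.50, "review_score": 4.0}, "site_b"))
```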

Let's discuss ethics:

The law around scraping websites is a bit fuzzy, and eventually there will likely be reform in this area. The ethics are up to the company investing its resources into developing the web scrapers. Some practices that can be applied:
  • Use the scraped data only to derive new value from it. Do not duplicate the content, and do not save unused data
  • Provide developer/company contact info in the code used to scrape the data as a courtesy
  • Request/scrape data at a reasonable rate (see the sketch after this list)
  • Give credit to the sites that you are scraping data from
  • If a public API is available, use it rather than scraping that source
  • Check whether a site's terms of service or terms of use address web scraping
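
A minimal sketch of the "contact info" and "reasonable rate" points, assuming the requests library and a hypothetical target URL and contact address: it identifies the scraper in the User-Agent header, honors robots.txt, and pauses between requests.

```python
import time
import requests
from urllib.robotparser import RobotFileParser

# Hypothetical target site and contact address; replace with your own.
BASE_URL = "https://example.com"
HEADERS = {"User-Agent": "my-scraper/0.1 (contact: dev@example.com)"}

# Honor robots.txt before requesting anything.
robots = RobotFileParser(BASE_URL + "/robots.txt")
robots.read()

def polite_get(url, delay_seconds=2.0):
    """Fetch a page only if allowed, then pause so requests stay at a reasonable rate."""
    if not robots.can_fetch(HEADERS["User-Agent"], url):
        return None
    response = requests.get(url, headers=HEADERS, timeout=10)
    time.sleep(delay_seconds)
    return response

page = polite_get(BASE_URL + "/listings")  # hypothetical path
```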

As a Data Scientist solving a regression problem for a retail product, a rule of thumb is to identify at least 5 websites/web pages to scrape initially. Inspect each page and find commonalities in the HTML tags (div, img, or other container elements). These tags can be viewed with the Inspect tool in Chrome's developer console.
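
A minimal scraping sketch using requests and BeautifulSoup, assuming a hypothetical page whose product cards share a common div class; substitute the real URL and the container class you identified with Chrome's Inspect tool.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical URL and class names; replace with the real container class.
URL = "https://example.com/products"
soup = BeautifulSoup(requests.get(URL, timeout=10).text, "html.parser")

records = []
for card in soup.find_all("div", class_="product-card"):
    name_tag = card.find("h2")
    price_tag = card.find("span", class_="price")
    img_tag = card.find("img")
    records.append({
        "product": name_tag.get_text(strip=True) if name_tag else None,
        "price": price_tag.get_text(strip=True) if price_tag else None,
        "image": img_tag["src"] if img_tag else None,
    })

print(f"Scraped {len(records)} records from {URL}")
```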

Use an Anaconda Jupyter Notebook to pull in the relevant Python libraries; these libraries have to be installed in the correct Python environment.
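
As a quick sanity check, the imports below should all succeed in the environment the notebook runs in; any that fail can be installed first (for example with `pip install requests beautifulsoup4 pymongo flask scikit-learn`).

```python
# Quick check that the scraping and modeling libraries are importable
# in the active environment.
import requests   # HTTP requests
import bs4        # installed as beautifulsoup4
import pymongo    # MongoDB driver
import flask      # web presentation layer
import sklearn    # machine learning (installed as scikit-learn)

print("All libraries are available.")
```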

Set up a MongoDB instance either in the cloud or, initially, on your personal workstation. This document-based database is useful for storing data in collections whose documents tend to vary in structure.
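
A minimal sketch with pymongo, assuming a local MongoDB instance and hypothetical database/collection names; note that the two inserted documents do not share the same fields, which is exactly what the document model tolerates.

```python
from pymongo import MongoClient

# Assumes a local MongoDB instance on the default port; swap in your
# cloud connection string if the database is hosted.
client = MongoClient("mongodb://localhost:27017/")
collection = client["scraping_db"]["product_listings"]  # hypothetical names

# Documents can vary in shape; a record missing a field is still accepted.
records = [
    {"product": "Blender", "price": 29.99, "rating": 4.2, "source": "site_a"},
    {"product": "Blender", "price": 31.50, "source": "site_b"},  # no rating scraped
]
result = collection.insert_many(records)
print(f"Inserted {len(result.inserted_ids)} documents")
```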

Use an application framework such as Flask to present the data that you have scraped from various websites and saved into MongoDB. Ensure that the correct CSS is applied using the Bootstrap grid system; this is needed to present Data Visualizations to external customers/clients.
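
A minimal Flask sketch, reusing the hypothetical collection above and laying the records out with Bootstrap's grid classes; the CDN link, field names, and route are assumptions, not a prescribed layout.

```python
from flask import Flask, render_template_string
from pymongo import MongoClient

app = Flask(__name__)
# Hypothetical database/collection names from the MongoDB step above.
collection = MongoClient("mongodb://localhost:27017/")["scraping_db"]["product_listings"]

# Bootstrap's grid classes (row / col-*) lay the scraped records out in columns.
TEMPLATE = """
<link rel="stylesheet"
      href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/css/bootstrap.min.css">
<div class="container">
  {% for item in items %}
  <div class="row border-bottom py-2">
    <div class="col-6">{{ item.product }}</div>
    <div class="col-3">{{ item.price }}</div>
    <div class="col-3">{{ item.get('rating', 'n/a') }}</div>
  </div>
  {% endfor %}
</div>
"""

@app.route("/")
def listings():
    items = list(collection.find({}, {"_id": 0}))  # drop ObjectIds for display
    return render_template_string(TEMPLATE, items=items)

if __name__ == "__main__":
    app.run(debug=True)
```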

Use the scraped data to train and test your model, and use the SciPy / scikit-learn packages to leverage existing algorithms and produce the desired result.
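
A minimal train/test sketch with scikit-learn, assuming the scraped documents carry numeric rating and review_count fields used to predict price; substitute whatever features your own scrape actually captured.

```python
import pandas as pd
from pymongo import MongoClient
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Pull the scraped documents back out of MongoDB into a DataFrame.
docs = MongoClient("mongodb://localhost:27017/")["scraping_db"]["product_listings"].find()
df = pd.DataFrame(docs).dropna(subset=["price", "rating", "review_count"])

X = df[["rating", "review_count"]]   # assumed numeric features
y = df["price"]                      # assumed target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("Mean absolute error:", mean_absolute_error(y_test, model.predict(X_test)))
```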

Summary: Web scraping is a great tool in a Data Scientist's data-collection arsenal. As this article shows, with a few lines of code and an understanding of website layouts, key data points can be scraped and added to your predictive models. Small businesses can leverage this technology to stay ahead of their competitors. Tech-savvy individuals can build cloud-based commercial web scraping services that consumers can subscribe to through apps.