It is common practice for Data Scientists to scrape public or private data for use in building their models. Scraped data can augment existing datasets and be presented to the business in varied forms.
To begin, let's review two of the most frequently asked questions about web scraping: who uses it, and what does it cost?
Web scraping is primarily used by Marketing companies, Real Estate firms (scraping listings data), and e-commerce businesses seeking a competitive edge, with the scraped data typically rolling into Data Science work at some point.
There are companies that sell web scraping as a subscription service. This can make sense if you do not have the in-house skill-set to build your own scrapers. ParseHub, for example, charges about $499 per month for a plan that includes 120 projects; the annual cost comes to roughly $6K, plus any customization.
ParseHub's full pricing model is listed on its website. Other competitors provide similar services in the $4K to $10K range, typically with size limits and limited data-retention windows. In short, if you are scraping 5-10 websites, you are looking at roughly $10K in implementation costs plus about $1,000 per month in maintenance and upkeep.
As a Data Scientist, if you are solving a regression problem for a retail product, a rule of thumb is to identify at least 5 websites/web pages to scrape initially. Inspect each page and look for commonalities in the HTML tags (div, img, or other container elements). These tags can be viewed with Chrome's DevTools Inspect feature, and a minimal scraping sketch follows below.
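Here is a minimal sketch of that step using requests and BeautifulSoup. The URL and the class names (`product-card`, `price`) are hypothetical placeholders; substitute whatever container elements the Inspect tool reveals on your target pages.

```python
# A minimal scraping sketch -- the URL and tag/class names below are
# placeholders, not real endpoints; adapt them to what Inspect shows you.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"  # hypothetical retail listings page
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
products = []
for card in soup.find_all("div", class_="product-card"):
    name = card.find("h2")
    price = card.find("span", class_="price")
    img = card.find("img")
    products.append({
        "name": name.get_text(strip=True) if name else None,
        "price": price.get_text(strip=True) if price else None,
        "image_url": img["src"] if img and img.has_attr("src") else None,
    })

print(products[:5])
```

Running the same loop against each of your 5 target sites, with per-site tag selectors, gives you a uniform list of dictionaries regardless of how each page is laid out.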
Use a Jupyter Notebook from the Anaconda distribution to work with the relevant Python libraries, and make sure those libraries are installed in the correct Python environment.
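A quick sanity check like the one below, run inside the target environment, catches missing installs before you start scraping. The package list is an assumption based on the stack described in this article; trim it to what you actually use.

```python
# Verify the scraping stack is installed in the active environment.
# The package list is illustrative -- adjust to your own project.
import importlib

for pkg in ["requests", "bs4", "pymongo", "flask", "sklearn"]:
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError:
        print(f"{pkg}: missing -- install it with pip or conda")
```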
Set up a MongoDB instance, either in the cloud or initially on your personal workstation. A document-based database is useful here because scraped records tend to vary in structure from page to page, and MongoDB collections accommodate that without a fixed schema.
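A minimal storage sketch with pymongo, assuming a local MongoDB on the default port; the database and collection names are hypothetical, and in practice you would insert the `products` list built in the scraping sketch above.

```python
from pymongo import MongoClient

# Example documents -- in practice, use the `products` list from the
# scraping step. Note the two records have different fields: MongoDB
# stores varying document shapes in the same collection without complaint.
products = [
    {"name": "Widget A", "price": "$19.99", "image_url": "https://example.com/a.jpg"},
    {"name": "Widget B", "price": "$24.50"},
]

client = MongoClient("mongodb://localhost:27017/")  # local default instance
collection = client["scraping_db"]["retail_products"]  # hypothetical names
collection.insert_many(products)
print(f"{collection.count_documents({})} documents stored")
```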
Use a web framework such as Flask to present the data you have scraped from various websites and saved into MongoDB. Apply CSS via Bootstrap's grid system so the layout holds up when you present data visualizations to external customers/clients.
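A minimal Flask sketch of that idea, reusing the hypothetical database names from the storage step. It assumes you supply a `templates/index.html` that iterates over `products` and lays them out with Bootstrap grid classes (e.g. `row` / `col-md-4`).

```python
# A small Flask app that reads scraped documents from MongoDB and hands
# them to a Bootstrap-styled template. Database/collection names match
# the illustrative ones used earlier.
from flask import Flask, render_template
from pymongo import MongoClient

app = Flask(__name__)
client = MongoClient("mongodb://localhost:27017/")
collection = client["scraping_db"]["retail_products"]

@app.route("/")
def index():
    products = list(collection.find({}, {"_id": 0}))  # drop Mongo's internal id
    return render_template("index.html", products=products)

if __name__ == "__main__":
    app.run(debug=True)
```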
Use the scraped data to train and test your model. Packages such as SciPy and scikit-learn provide well-tested implementations of the standard algorithms, so you can focus on features and evaluation rather than writing an algorithm from scratch.
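A minimal train/test sketch for the retail regression scenario described earlier. The feature columns are hypothetical stand-ins for whatever numeric fields you actually scraped, and the price-cleaning step assumes prices were stored as strings like "$19.99".

```python
# Pull the scraped documents into a DataFrame and fit a simple regression.
# Column names are placeholders -- replace them with your scraped fields.
import pandas as pd
from pymongo import MongoClient
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

client = MongoClient("mongodb://localhost:27017/")
records = list(client["scraping_db"]["retail_products"].find({}, {"_id": 0}))
df = pd.DataFrame(records)

# Strip currency symbols so price can serve as the regression target.
df["price"] = pd.to_numeric(
    df["price"].astype(str).str.replace(r"[^0-9.]", "", regex=True),
    errors="coerce",
)
df = df.dropna(subset=["price"])

X = df[["feature_1", "feature_2"]]  # hypothetical numeric feature columns
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```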
Summary: Web scraping is a great tool in a Data Scientist's data-collection arsenal. As this article shows, with a few lines of code and an understanding of a website's layout, key data points can be scraped and fed into your predictive models. Small businesses can leverage this technology to stay ahead of their competitors, and tech-savvy individuals can even build cloud-based commercial scraping services and offer them to consumers as app-based subscriptions.