Scraping. Get any data from any website automatically

Obtaining data automatically from other websites, known as “scraping”, is increasingly common and, for many websites, both useful and necessary. The best example that comes to mind, and surely the most widely used, is tracking competitors’ prices for different purposes: adapting your own e-commerce prices, keeping an eye on deals, monitoring how competitors price your products, or adjusting your campaigns on Google Shopping or Amazon. The possibilities and needs are many.

Professionally I have had to develop countless scrapers for very different types of websites: mostly e-commerce sites to get competitor prices, but also business directories, recipe sites, and surely many more if I think hard. Along the way I have fought with Google Shopping, Amazon, marketplaces, national business directories… I do not consider myself an expert in the field, but I can contribute some knowledge.

I am not going to get into the ethical and legal questions that scraping raises, but it is certainly a fundamental part of many types of business today and a very widespread activity on the internet (Google itself and other search engine robots are ultimately “scrapers”).

I will focus on the purely technical aspect.

The difficulty of scraping depends on many factors, mainly:

  • Number of websites to scrape.
  • Number of concurrent requests and necessary frequency.
  • The protection measures of the website itself against scraping techniques.
  • The HTML of the page from which the data has to be extracted.
  • Need to render JavaScript and simulate human actions (for example logging in or clicking).
  • The structure of the website.

Any developer could obtain data from one specific page, but few have the knowledge needed to work at volume and bypass the protection measures of the target websites. That is what makes scraping interesting to a more systems-oriented professional profile: knowing the entire chain of technologies, from the moment a request is sent to a server until it returns the data, is a great help.

There are countless libraries (especially in Python), classes and tools that make it easier to develop a scraper but, in my case, most of the time I ended up using none of them. All of them required large hardware resources as soon as I needed to make several hundred thousand requests per day, and the supposed advantage of working with the HTML DOM, using specific tools to locate the XPath of the data I needed, was not that useful, nor did it give me a real edge over my old familiar regular expressions. It is a different matter when human interaction has to be simulated, in which case these tools are almost mandatory.
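To make the regular-expression approach concrete, here is a minimal sketch in Python using only the standard library. The markup, the `price` class name and the sample HTML are assumptions made up for the illustration; a real page needs its own pattern, and this one does not handle thousands separators.

```python
import re
from decimal import Decimal

# Assumption: prices live in <span> tags whose class list contains "price".
# The class name and the HTML below are hypothetical, not taken from a real site.
PRICE_RE = re.compile(
    r'<span[^>]*class="[^"]*\bprice\b[^"]*"[^>]*>\s*'
    r'([0-9]+(?:[.,][0-9]{2})?)',
    re.IGNORECASE,
)


def extract_prices(html: str) -> list[Decimal]:
    """Return every price found in the HTML, normalizing ',' to '.'."""
    return [Decimal(raw.replace(",", ".")) for raw in PRICE_RE.findall(html)]


SAMPLE_HTML = """
<div class="product"><h2>Widget</h2><span class="price">19,99 €</span></div>
<div class="product"><h2>Gadget</h2><span class="price old">24.50 €</span></div>
"""

if __name__ == "__main__":
    for price in extract_prices(SAMPLE_HTML):
        print(price)
```

No DOM is built here: the pattern is run directly against the raw response text, which is why this style stays cheap at high request volumes. The trade-off is that the pattern breaks as soon as the site changes its markup.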

The other important aspect is having either a large pool of IPs on different subnets (OVH is very helpful here) or a good global proxy provider with IPs of different types (datacenter, residential, mobile network, by country) at a reasonable price. In my case I have both options in order to optimize costs: when the security measures of the target website allow it, I use my own IP pool, and if things get complicated, I switch to the proxy provider.
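The "cheap pool first, paid proxies only when needed" idea can be sketched as follows. This is an illustration under assumptions, not my production code: `ProxyTier`, `fetch_with_fallback`, the proxy addresses and the `fetch` callback are all hypothetical names, and each tier is assumed to contain at least one proxy.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Callable, Iterator, Optional


@dataclass
class ProxyTier:
    """A named group of proxies, rotated round-robin (hypothetical helper)."""
    name: str
    proxies: list[str]
    _rotation: Iterator[str] = field(init=False, repr=False)

    def __post_init__(self) -> None:
        self._rotation = cycle(self.proxies)

    def next_proxy(self) -> str:
        return next(self._rotation)


def fetch_with_fallback(
    url: str,
    tiers: list[ProxyTier],
    fetch: Callable[[str, str], Optional[str]],
    attempts_per_tier: int = 2,
) -> Optional[tuple[str, str]]:
    """Try tiers in order (cheapest first); return (tier name, body) or None.

    `fetch(url, proxy)` is supplied by the caller and must return the page
    body, or None when the request was blocked or failed.
    """
    for tier in tiers:
        for _ in range(attempts_per_tier):
            body = fetch(url, tier.next_proxy())
            if body is not None:
                return tier.name, body
    return None


if __name__ == "__main__":
    # Fake fetcher for the demo: pretend the site blocks the own-pool IPs.
    def fake_fetch(url: str, proxy: str) -> Optional[str]:
        return "<html>ok</html>" if "res-" in proxy else None

    tiers = [
        ProxyTier("own-pool", ["http://10.0.0.1:3128", "http://10.0.0.2:3128"]),
        ProxyTier("residential", ["http://res-1.example:8000"]),
    ]
    print(fetch_with_fallback("https://shop.example/p/1", tiers, fake_fetch))
```

Keeping the HTTP call behind a callback means the fallback logic can be tested without touching the network, and the order of the `tiers` list is the only place where cost priorities are encoded.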

Then there is much more, although at a level further removed from what is, for me, the most exciting part: the design of the database where the data is stored, and the application that uses the data for the desired purpose. For example, generating an optimized data feed for Google Shopping with your best products, based on your prices and those of the competition, and thus optimizing the return on your advertising investment.
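As a rough illustration of that last step, the sketch below builds a small CSV feed from scraped prices. The selection rule (keep a product only when your price is at or below the lowest competitor price), the column names and the function name `build_feed` are assumptions chosen for the example; a real feed would follow the advertising platform's own specification.

```python
import csv
import io
from decimal import Decimal


def build_feed(
    own_prices: dict[str, Decimal],
    competitor_prices: dict[str, list[Decimal]],
) -> str:
    """Return CSV text with the products priced at or below every competitor.

    Assumed rule for the example: a product is "best" when our price is
    less than or equal to the lowest scraped competitor price.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "price", "lowest_competitor_price"])
    for product_id, price in sorted(own_prices.items()):
        rivals = competitor_prices.get(product_id)
        if not rivals:
            continue  # no competitor data scraped: skip rather than guess
        lowest = min(rivals)
        if price <= lowest:
            writer.writerow([product_id, price, lowest])
    return out.getvalue()


if __name__ == "__main__":
    own = {
        "A1": Decimal("19.99"),
        "B2": Decimal("30.00"),
        "C3": Decimal("5.00"),
    }
    scraped = {
        "A1": [Decimal("21.00"), Decimal("19.99")],
        "B2": [Decimal("25.00")],
    }
    print(build_feed(own, scraped), end="")
```

Prices are kept as `Decimal` rather than floats so that comparisons such as 19.99 against 19.99 behave exactly.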

In future posts I will focus on more practical examples and more specific information on how to obtain data from a website.

 
