Monday, May 1, 2023

scrapy CrawlSpider instead of wget -r


I needed to download all pages from a Wix-made website.

`wget -r` didn't work.

httrack and lynx didn't work, either.

https://askubuntu.com/questions/391622/download-a-whole-website-with-wget-or-other-including-all-its-downloadable-con


I could download the site with scrapy's CrawlSpider and link-following Rules (LinkExtractor): https://github.com/scrapy/scrapy

https://www.youtube.com/watch?v=o1g8prnkuiQ


I'll try Playwright later, too (instead of Selenium).

https://scrapeops.io/python-scrapy-playbook/scrapy-playwright/
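Per the scrapy-playwright docs, wiring Playwright into scrapy is mostly a settings change: route downloads through its handler and switch Twisted to the asyncio reactor. A sketch of the relevant `settings.py` fragment:

```python
# settings.py fragment for scrapy-playwright (pip install scrapy-playwright)

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Individual requests then opt in with `meta={"playwright": True}` so the page is rendered in a real browser before parsing.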


I also found some other candidates:

https://github.com/crawlab-team/crawlab