For simple serverless web scraping, we can use Cloudflare Workers. The platform offers a distinctive way to parse and read DOM content through its HTMLRewriter API.
HTML Rewriter
HTML Rewriter is a powerful tool that lets developers modify HTML content in real time at the edge. Technically, it is a streaming HTML parser and modifier: it enables dynamic rewriting of web pages, so you can add, remove, or modify elements on the fly. For scraping, the same handlers can simply read elements and text as the page streams through. It takes a while to get your head around, but once your code works it will amaze you.
Check out the official documentation to learn more about HTML Rewriter.
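To make the handler idea concrete, here is a minimal sketch of an element handler that collects link targets. The names LinkCollector and ElementLike are my own for illustration, and ElementLike is a local stand-in that assumes only the getAttribute method of the element object the runtime passes in.

```typescript
// Local stand-in for the element object passed to an element handler.
// Assumption: only getAttribute is needed here.
interface ElementLike {
  getAttribute(name: string): string | null;
}

// Hypothetical handler: collects absolute link targets from matched elements.
class LinkCollector {
  readonly links: string[] = [];
  private readonly baseUrl: string;

  constructor(baseUrl: string) {
    this.baseUrl = baseUrl;
  }

  // HTMLRewriter calls element() once for every element matching the selector.
  element(el: ElementLike): void {
    const href = el.getAttribute("href");
    if (href === null) return;
    try {
      // Resolve relative links against the page URL.
      this.links.push(new URL(href, this.baseUrl).toString());
    } catch {
      // Skip values that cannot be resolved to a URL.
    }
  }
}

// Inside a Worker's fetch handler, the wiring would look roughly like this:
//   const pageUrl = "https://example.com/";
//   const collector = new LinkCollector(pageUrl);
//   const page = await fetch(pageUrl);
//   // transform() is lazy, so the body must be consumed to run the handlers.
//   await new HTMLRewriter().on("a[href]", collector).transform(page).arrayBuffer();
//   return new Response(JSON.stringify(collector.links));
```

Keeping the handler free of runtime globals means it can be exercised with plain objects outside a Worker, which helps when a selector is not returning what you expect.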
Points to note
- The HTMLRewriter class should be instantiated once in your Worker script, with handlers attached through its on() and onDocument() methods.
Issues I came across
- Selectors do not always match exactly the elements you are looking for, although the same happens with CSS selectors in other frameworks.
- Text arrives in chunks, so check that you have received the complete text of a node and not just part of it. Each chunk exposes a lastInTextNode flag that is true only on the final chunk of a text node.
- Text contains encoded characters (HTML entities) for some special characters such as quotes, semicolons, and asterisks, so incoming text needs to be cleaned up before you can make sense of it.
- Many websites have protections against bot crawling, which can make it difficult to reach the page and other secured areas.
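The chunking and encoding issues above can be handled together. The sketch below buffers chunks until lastInTextNode is true and only then decodes entities, since an entity may be split across two chunks. TextCollector, TextChunkLike, and decodeEntities are illustrative names, and the entity table is deliberately tiny, not a full HTML entity list.

```typescript
// Local stand-in for the text chunk passed to a text handler.
// Assumption: only these two properties are needed here.
interface TextChunkLike {
  readonly text: string;
  readonly lastInTextNode: boolean;
}

// A few common named entities; extend as needed (this list is not complete).
const NAMED_ENTITIES = new Map<string, string>([
  ["amp", "&"], ["lt", "<"], ["gt", ">"],
  ["quot", '"'], ["apos", "'"], ["nbsp", "\u00a0"],
]);

// Decode named, decimal (&#42;) and hex (&#x2a;) entities in a single pass.
function decodeEntities(raw: string): string {
  return raw.replace(
    /&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g,
    (match: string, body: string): string => {
      if (body[0] !== "#") return NAMED_ENTITIES.get(body) ?? match;
      const isHex = body[1] === "x" || body[1] === "X";
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    },
  );
}

// Hypothetical handler: joins chunks into whole text nodes before cleaning them.
class TextCollector {
  readonly nodes: string[] = [];
  private buffer = "";

  // HTMLRewriter calls text() once per chunk, possibly several times per node.
  text(chunk: TextChunkLike): void {
    this.buffer += chunk.text;
    if (!chunk.lastInTextNode) return;
    const cleaned = decodeEntities(this.buffer).trim();
    if (cleaned !== "") this.nodes.push(cleaned);
    this.buffer = "";
  }
}
```

Decoding only after the node is complete is the design choice that matters here: decoding each chunk on its own could miss an entity that straddles a chunk boundary.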
Bonus points
- You can easily make the Worker run as a cron job. Go to its settings and create a cron trigger. The Worker's scheduled handler will then run automatically on every cron run and do the scraping or whatever job you want it to.
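A cron trigger invokes the Worker's scheduled handler, not its fetch handler, so the Worker needs to export one. Here is a minimal sketch of what that could look like. The two interfaces are local stand-ins for the runtime's types, and runScrapeJob is a hypothetical placeholder for your own scraping logic.

```typescript
// Local stand-ins for the objects the runtime passes to scheduled().
// Assumption: only these members are needed here.
interface ScheduledEventLike {
  readonly cron: string;          // the cron expression that fired
  readonly scheduledTime: number; // trigger time in milliseconds since the epoch
}
interface ExecutionContextLike {
  waitUntil(promise: Promise<unknown>): void;
}

// Hypothetical job: replace the body with the real scraping work.
async function runScrapeJob(cron: string, startedAt: number): Promise<string> {
  return `scrape triggered by "${cron}" at ${new Date(startedAt).toISOString()}`;
}

const worker = {
  // Runs on every cron tick configured for this Worker.
  async scheduled(
    event: ScheduledEventLike,
    _env: unknown,
    ctx: ExecutionContextLike,
  ): Promise<void> {
    // waitUntil keeps the invocation alive until the job settles.
    ctx.waitUntil(
      runScrapeJob(event.cron, event.scheduledTime).then((summary) => console.log(summary)),
    );
  },
};

export default worker;
```

The same object can also carry a fetch handler, so one Worker can serve requests and run the scheduled job.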
Limits and restrictions to take care of
- The number of subrequests a Worker can make is limited, so be careful about how many HTTP requests you make from inside the Worker. A call from one Worker to another also counts toward this limit.
- There is also a limit on the number of Workers per account, so see if you can consolidate your work into fewer Workers.
- The number of cron triggers is also limited.
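One way to stay inside the subrequest limit is to plan each run against an explicit budget. In the sketch below, planSubrequests and the budget value are hypothetical; the real limit depends on your plan, so check the current Workers limits before choosing a number.

```typescript
interface SubrequestPlan {
  now: string[];      // URLs to fetch in this invocation
  deferred: string[]; // URLs left for a later run
}

// Split a list of URLs so that at most `budget` of them are fetched now.
// Duplicates are dropped first so they do not waste the budget.
function planSubrequests(urls: readonly string[], budget: number): SubrequestPlan {
  const unique = Array.from(new Set(urls));
  const allowed = Math.max(0, Math.floor(budget));
  return { now: unique.slice(0, allowed), deferred: unique.slice(allowed) };
}

// Example: with an assumed budget of 2, the third unique URL is deferred.
const plan = planSubrequests(
  ["https://example.com/a", "https://example.com/b", "https://example.com/a", "https://example.com/c"],
  2,
);
```

Remember to leave room in the budget for any other calls the Worker makes, including calls to other Workers.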
Thanks for reading along.
Let me know your feedback/suggestions in the comments.
- Ayush 🙂