
Master the Game of Information: Web Scraping meets AI Summarization

Recently, I’ve been enjoying scraping content from websites with shot-scraper and summarizing it with the gpt-3.5-turbo-16k model.

It makes information feel instantly accessible, which greatly simplifies retrieval. Here is an example of the usage:

Bash
shot-scraper javascript https://vitamincpu.de "
async () => {
    const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
    return (new readability.Readability(document)).parse().content;
}" | strip-tags -m | llm --system 'concise summary bullet points' -m gpt-3.5-turbo-16k

Output:

Text
- The webpage features different sections such as Home, Writings, Tags, Contact, Citations, and About
- The author is an IT-Security Researcher who does not use cookies or tracking
- The author can be found on social media platforms such as Twitter, Facebook, and LinkedIn
- The Writings section includes ten different blog posts ranging from technology tutorials to cybersecurity threats and AI-powered attacks
- The Projects section presents three different tools created by the author: pyMetnoForecast, LostArk Serverstatus, and Substack Archiver

Let’s break down the process. Run on its own, shot-scraper loads the specified URL in a headless browser and captures a screenshot. Its javascript command instead executes JavaScript against the loaded page and returns the result as JSON. Here, the script imports Readability.js from the Skypack CDN. This is Mozilla’s standalone version of the readability library used in Firefox’s Reader View, and you’ve likely used a reader mode like it in your browser already. You could fetch the page with curl instead of shot-scraper, but curl does not execute JavaScript, so it cannot run Readability inside the page or see content that is rendered client-side.
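To see the JSON output on its own, a much smaller expression is enough. The following is a minimal sketch, not part of the original pipeline: `page_title` is a hypothetical helper name, and it assumes shot-scraper is installed and on your PATH.

```shell
# Hypothetical helper: print a page's title via shot-scraper.
# Assumes shot-scraper (and the browser it drives) is installed.
page_title() {
    # The expression is evaluated inside the loaded page; shot-scraper
    # prints the result as JSON, so a string comes back quoted.
    shot-scraper javascript "$1" "document.title"
}
```

Calling `page_title https://vitamincpu.de` should print the page title as a JSON string, which is the same shape of output the Readability script hands to the next tool in the pipe.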

The result is then piped into strip-tags, a tool that removes HTML tags from text and can optionally output only a subset of the page chosen by CSS selectors. The -m flag used here minifies the remaining whitespace. Finally, we pipe the output into llm, which sends it to the OpenAI API together with our predefined system prompt.
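If you run this often, the three stages can be wrapped in a small shell function. This is a rough sketch under a few assumptions: `summarize_url` is a hypothetical name, all three tools are installed, and llm already has an OpenAI API key configured. The flags are the same ones used in the command above.

```shell
# Hypothetical wrapper around the pipeline from this post.
# Assumes shot-scraper, strip-tags and llm are installed, and that
# llm has an OpenAI API key configured.
summarize_url() {
    url="$1"
    # The system prompt and model default to the values used above.
    system_prompt="${2:-concise summary bullet points}"
    model="${3:-gpt-3.5-turbo-16k}"

    # 1. Extract the readable article HTML inside the page.
    # 2. Strip the tags and minify whitespace.
    # 3. Send the text to the model with the system prompt.
    shot-scraper javascript "$url" "
async () => {
    const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
    return (new readability.Readability(document)).parse().content;
}" | strip-tags -m | llm --system "$system_prompt" -m "$model"
}
```

With that in place, `summarize_url https://vitamincpu.de` reproduces the example, and a second argument swaps in a different system prompt.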

All of the tools I’ve mentioned were developed by Simon Willison. I’ve read a substantial amount of his code and learned a lot in the process. I highly recommend his blog and projects.