I discovered Y Combinator’s Hacker News aggregator in early 2021, and browsing HN quickly became part of my daily routine. But the stock HN interface provides very little information about each link, which made it hard to decide efficiently which articles I wanted to read.
I love the variety and quality of the articles and the knowledgeable HN community. Now, the plain UI of the site is prized by some of HN’s most ardent members, but I found it a bit lacking in the departments of aesthetics and information. I wanted more insight into what the links were about so I could make faster and better decisions about whether to read an article and/or its comments or just keep on scrolling, and I wanted the presentation of the links to be less monotonous. For example, I like seeing an article’s og:image, so I display that. If the link goes to a social media platform, I display the name of the account or channel that published it.
For the og:image thumbnails, THNR creates four sizes of each image and converts them to WebP format.
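As a rough illustration of that resize-and-convert step, here is a minimal sketch using Wand. The four widths, the output file naming, and the function names are assumptions made for the example, not THNR's actual values, and writing WebP depends on the local ImageMagick build having WebP support.

```python
from pathlib import Path

# Hypothetical target widths; the four sizes THNR really uses are not stated.
THUMB_WIDTHS = (160, 320, 640, 1280)


def scaled_size(width: int, height: int, target_width: int) -> tuple[int, int]:
    """Return (width, height) scaled to target_width, keeping the aspect ratio."""
    target_height = max(1, round(height * target_width / width))
    return target_width, target_height


def make_thumbnails(src: Path, out_dir: Path) -> list[Path]:
    """Write one WebP thumbnail per width in THUMB_WIDTHS and return the paths."""
    # Imported here so scaled_size() stays usable without ImageMagick installed.
    from wand.image import Image

    outputs = []
    with Image(filename=str(src)) as original:
        for target_width in THUMB_WIDTHS:
            new_w, new_h = scaled_size(original.width, original.height, target_width)
            with original.clone() as thumb:
                thumb.resize(new_w, new_h)
                thumb.format = "webp"
                out_path = out_dir / f"{src.stem}-{target_width}.webp"
                thumb.save(filename=str(out_path))
                outputs.append(out_path)
    return outputs
```

Calling `make_thumbnails(Path("og.png"), Path("thumbs"))` would, under these assumptions, leave four files such as `thumbs/og-160.webp` in an existing `thumbs` directory.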
Two of my favorite THNR features might fly under my readers’ radar. First, I surface the programming languages used in GitHub projects by scraping the languages and percentages from the project’s GitHub page and displaying them as text below the project’s og:image. Second, when an HN story links directly to a PDF, I display a thumbnail of page 1 of the PDF with a small dogear on the corner of the page. I’m tickled every time I see that dogeared page!
THNR’s “about” page has additional info, if you’re curious.
THNR gets 50 unique visitors per day, with 15 pages visited per unique visitor.
On the back end, THNR queries HN’s official API for the lists of top and new stories and for the details of each story. THNR scrapes the linked articles for the og:image and other details as applicable (e.g., the social media account name and the GitHub programming languages). Then the HTML for each story’s <div> is generated, containing the details that will be displayed on thnr.net. Each chunk of 20 divs gets sandwiched between a header and footer and written out to the HTML file that will be served. HN’s API offers separate endpoints for “top stories” and “new stories,” and there are 500 stories in each feed, so each feed translates to 25 HTML pages.
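The paging arithmetic (500 stories at 20 per page gives 25 pages per feed) can be sketched as follows. The function names and the header and footer strings are placeholders for illustration; only the chunk size of 20 and the feed size of 500 come from the description above.

```python
from typing import Iterator, Sequence

STORIES_PER_PAGE = 20  # chunk size described above


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_pages(story_divs: Sequence[str], header: str, footer: str,
                per_page: int = STORIES_PER_PAGE) -> list[str]:
    """Wrap each chunk of story divs in the header and footer, one string per page."""
    return [
        header + "\n".join(group) + footer
        for group in chunks(story_divs, per_page)
    ]


# Placeholder divs standing in for the generated story markup.
divs = [f'<div class="story">story {n}</div>' for n in range(500)]
pages = build_pages(divs, "<html><body>\n", "\n</body></html>")
print(len(pages))
```

Each string in `pages` would then be written to its own HTML file for one feed; doing the same for the second feed gives the 50 story pages.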
On the front end, thnr.net is a folder of 50 static HTML pages updated every 20 minutes or so and served by nginx. (Plus the about page makes 51 HTML files.)
@media queries are used to serve the right size thumbnail.
The THNR stack is primarily Python. Packages include requests for ingesting HN data, Playwright (async) for scraping articles, Wand for thumbnail resizing and conversion, PyPDF2 for extracting the first page of PDFs, pickle for caching HN story data to avoid pestering the hacker-news.firebaseio.com API, and logging for logging. Beyond Python, there is YAML for config and settings, a bash script that runs the THNR backend on a schedule (after the scraper exits, it sleeps for 20 minutes before running the scraper again), and, as mentioned, nginx for URL routing and for serving the statically generated HTML pages.
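To illustrate the caching idea, here is a minimal sketch of a pickle-backed story cache in front of the HN item endpoint. The cache layout (a dict keyed by story ID), the function names, and the injectable `fetch` parameter are assumptions made for this example; the item URL is the public Hacker News API endpoint.

```python
import pickle
from pathlib import Path
from typing import Callable

HN_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"


def fetch_item(story_id: int) -> dict:
    """Fetch one story's details from the HN API."""
    # Imported here so the cache helpers work even where requests is absent.
    import requests

    response = requests.get(HN_ITEM_URL.format(id=story_id), timeout=10)
    response.raise_for_status()
    return response.json()


def load_cache(path: Path) -> dict:
    """Load the pickled cache, or start with an empty one if no file exists yet."""
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)
    return {}


def save_cache(path: Path, cache: dict) -> None:
    """Persist the cache so the next scheduled run can reuse it."""
    with path.open("wb") as handle:
        pickle.dump(cache, handle)


def get_story(story_id: int, cache: dict,
              fetch: Callable[[int], dict] = fetch_item) -> dict:
    """Return a story from the cache, calling the API only on a cache miss."""
    if story_id not in cache:
        cache[story_id] = fetch(story_id)
    return cache[story_id]
```

A run would call `load_cache` once at startup, `get_story` for each story ID in the feed, and `save_cache` before exiting. A real cache would also need some expiry rule, since scores and comment counts change over time; that part is left out here.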