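Show HN: Self-Hosted Reddit – 2.38 Billion Posts, Works Offline, Yours Forever

Hacker News · 3 months ago

This "Show HN" post introduces Redd-Archiver, a PostgreSQL-backed archive generator that builds browsable HTML archives from link aggregator platforms such as Reddit, Voat, and Ruqqus, so that users can self-host the content and keep it available offline.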

A PostgreSQL-backed archive generator that creates browsable HTML archives from link aggregator platforms including Reddit, Voat, and Ruqqus.

19-84/redd-archiver

Redd-Archiver

Transform compressed data dumps into browsable HTML archives with flexible deployment options. Redd-Archiver supports offline browsing via sorted index pages OR full-text search with Docker deployment. Features mobile-first design, multi-platform support, and enterprise-grade performance with PostgreSQL full-text indexing.

Supported Platforms:

Tracked content: 2.384 billion posts across 68,883 communities (Reddit full Pushshift dataset through Dec 31 2024, Voat/Ruqqus complete archives)

Version 1.0 features multi-platform archiving, REST API with 30+ endpoints, MCP server for AI integration, and PostgreSQL-backed architecture for large-scale processing.

🚀 Quick Start

Try the live demo: Browse Example Archive →

New to Redd-Archiver? Start here: QUICKSTART.md

Get running in 2-15 minutes with our step-by-step guide covering:

🎯 Key Features

🌐 Multi-Platform Support

Archive content from multiple link aggregator platforms in a single unified archive:

🤖 MCP Server (AI Integration)

29 MCP tools auto-generated from OpenAPI for AI assistants:

See MCP Server Documentation for complete setup guide.

Core Functionality

Technical Excellence

Deployment Options

📦 Deployment Options

Redd-Archiver generates static HTML files that can be browsed offline OR deployed with full-text search:

Offline Browsing Features:

With Search Server:

🚨 Get Involved: Help Preserve Internet History

Internet content disappears every day. Communities get banned, platforms shut down, and valuable discussions vanish. You can help prevent this.

📥 Download & Mirror Data Now

Don't wait for content to disappear. Download these datasets today:

† Voat Performance Tip: Use pre-split files for much faster imports (2-5 min vs 30+ min per subverse)
‡ Ruqqus: Docker image includes p7zip for automatic .7z decompression

Every mirror matters. Store locally, seed torrents, share with researchers. Be part of the preservation network.

🌐 Join the Registry: Deploy Your Instance

Already running an archive? Register it on our public leaderboard:

Benefits:

👉 Register Your Instance Now →

🆕 Submit New Data Sources

Found a new platform dataset? Help expand the archive network:

👉 Submit Data Source →

Why submit?

📸 Screenshots

Dashboard

Main landing page showing archive overview with statistics for 9,592 posts across Reddit, Voat, and Ruqqus. Features customizable branding (site name, project URL), responsive cards, activity metrics, and content statistics. (Works offline)

Subreddit Index

Post listing with sorting options (score, comments, date), pagination, and badge coloring. Includes navigation and theme toggle. (Works offline - sorted by score/comments/date)
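
The offline sorting described above can be pictured as pre-computing one paginated listing per sort key. The following is a minimal sketch, not the project's implementation: the field names (`score`, `num_comments`, `created_utc`), the page size, and the `index-<key>-<page>.html` naming are all assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical post record; the field names are illustrative only.
Post = Dict[str, object]

# Sort keys matching the options named above: score, comments, date.
SORT_KEYS = ("score", "num_comments", "created_utc")


def paginate_sorted(
    posts: List[Post], key: str, per_page: int = 100
) -> Iterator[Tuple[int, List[Post]]]:
    """Yield (page_number, posts) pairs, sorted descending by `key`."""
    if key not in SORT_KEYS:
        raise ValueError(f"unsupported sort key: {key}")
    if per_page < 1:
        raise ValueError("per_page must be at least 1")
    ordered = sorted(posts, key=lambda p: p[key], reverse=True)
    for start in range(0, len(ordered), per_page):
        yield start // per_page + 1, ordered[start:start + per_page]


def index_filename(key: str, page: int) -> str:
    """Assumed naming scheme for a static index page."""
    return f"index-{key}-{page}.html"
```

Because every ordering is written out as static pages ahead of time, a browser can switch between sort orders without any server or database.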

Post Page with Comments

Individual post displaying nested comment threads with collapsible UI, user flair, and timestamps. Comments include anchor links for direct navigation from user pages. (Works offline)
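
Nested threads like these are typically rebuilt from a flat list of comments. Here is a minimal sketch, assuming Pushshift-style records in which `id` is a bare ID and `parent_id` is `t3_<post id>` for top-level comments or `t1_<comment id>` for replies. The function name, the `replies` key, and sorting siblings by score are illustrative choices, not taken from the project.

```python
from collections import defaultdict
from typing import Dict, List


def build_comment_tree(comments: List[dict], post_id: str) -> List[dict]:
    """Nest a flat comment list under its parents.

    Comments whose parent is missing from the input (for example, a
    deleted parent) are not reachable from the post and are left out.
    """
    # Group comments by the fullname of their parent.
    children: Dict[str, List[dict]] = defaultdict(list)
    for comment in comments:
        children[comment["parent_id"]].append(comment)

    def attach(parent_fullname: str) -> List[dict]:
        # Highest-scored siblings first; each node gets a "replies" list.
        siblings = sorted(
            children.get(parent_fullname, []),
            key=lambda c: c["score"],
            reverse=True,
        )
        return [
            {**c, "replies": attach("t1_" + c["id"])} for c in siblings
        ]

    return attach("t3_" + post_id)
```

The recursion depth equals the thread depth, so an extremely deep thread would need an iterative variant or a raised recursion limit.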

Mobile Responsive Design

Fully optimized for mobile devices with touch-friendly navigation and responsive layout.

Search Interface

PostgreSQL full-text search with Google-style operators. Supports filtering by subreddit, author, date range, and score. (Requires Docker deployment)

Search results with highlighted excerpts using PostgreSQL ts_headline(). Sub-second response times with GIN indexing. (Server-based, Tor-compatible)
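
To make the search description concrete, the sketch below builds a parameterized query that combines `websearch_to_tsquery()` (PostgreSQL's parser for web-search-style input), `ts_rank()`, and `ts_headline()`. The table and column names (`posts`, `search_vector`, `body`, `subreddit`, `score`) are assumptions for illustration and may not match the project's schema; the `%s` placeholders follow the psycopg convention.

```python
from typing import List, Optional, Tuple


def build_search_query(
    terms: str,
    subreddit: Optional[str] = None,
    min_score: Optional[int] = None,
    limit: int = 25,
) -> Tuple[str, List[object]]:
    """Return (sql, params) for a ranked full-text search with excerpts.

    Assumes a hypothetical `posts` table with a tsvector column
    `search_vector` (the column a GIN index would cover).
    """
    where = ["p.search_vector @@ query"]
    params: List[object] = [terms]  # consumed by websearch_to_tsquery
    if subreddit is not None:
        where.append("p.subreddit = %s")
        params.append(subreddit)
    if min_score is not None:
        where.append("p.score >= %s")
        params.append(min_score)
    params.append(limit)

    where_sql = " AND ".join(where)
    sql = (
        "SELECT p.id, p.title, "
        "ts_headline('english', p.body, query, "
        "'MaxWords=35, MinWords=15') AS excerpt, "
        "ts_rank(p.search_vector, query) AS rank "
        "FROM posts AS p, websearch_to_tsquery('english', %s) AS query "
        f"WHERE {where_sql} "
        "ORDER BY rank DESC LIMIT %s"
    )
    return sql, params
```

The pair can be passed to a database cursor as `cursor.execute(sql, params)`; user input is kept out of the SQL text and travels only in `params`.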

Sample Archive: Multi-platform archive featuring programming and technology communities from Reddit, Voat, and Ruqqus · See all screenshots →

🛠️ Installation

Prerequisites

Python Dependencies

Redd-Archiver uses modern, performance-focused dependencies:

Core:

HTML Generation:

Performance:

Quick Start

Review the CHANGELOG.md for version updates and changes.

📊 Usage

1. Prepare Your Data

Redd-Archiver processes data dumps from multiple platforms:

2. Identify High-Priority Communities (Optional)

Scanner Tools help you identify which communities to archive first based on priority scores:

What the scanners do:

Example output:

Use cases:

Output files (included in tools/ directory):

View the complete data catalog to browse all communities and their priority scores.

3. Configure PostgreSQL

Ensure DATABASE_URL is set (see Installation above):
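
A quick way to confirm the variable before starting a long run is a small check like the one below. This is a sketch that assumes a standard `postgresql://user:password@host:port/database` URL; the helper name and the example credentials are hypothetical.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit


def check_database_url(env: Optional[Mapping[str, str]] = None) -> dict:
    """Validate DATABASE_URL and return its non-secret components."""
    env = os.environ if env is None else env
    url = env.get("DATABASE_URL")
    if not url:
        raise RuntimeError("DATABASE_URL is not set")
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise RuntimeError(f"unexpected URL scheme: {parts.scheme!r}")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # PostgreSQL's default port
        "database": parts.path.lstrip("/"),
        "user": parts.username,
    }
```

Calling `check_database_url()` with no arguments reads the real environment; passing a dictionary makes the check easy to exercise in isolation.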

4. Generate Your Archive

Reddit Archives (.zst files):

Voat Archives (SQL dumps):

Ruqqus Archives (.7z files):

Multi-Platform Mixed Archive:

With filtering and SEO:

Import/Export workflow (for large datasets):

5. Deploy Your Archive

Multiple deployment options are available:

Local/Development (HTTP):

Production HTTPS (Let's Encrypt):

Homelab/Tor (.onion hidden service):

Dual-Mode (HTTPS + Tor):

Static Hosting (GitHub/Codeberg Pages):

See deployment guides:

6. Advanced CLI Options

Processing Control:

Logging:

Performance Tuning:

Environment Variables:

🏗️ Architecture

Redd-Archiver features a clean modular architecture with specialized components:

Project Structure

HTML Modules (18 specialized modules)

Jinja2 Templates (15 templates)

Database Schema

🔍 PostgreSQL Full-Text Search

Lightning-Fast Database Search

Redd-Archiver v1.0 uses PostgreSQL full-text search with GIN indexing for blazing-fast search capabilities:

Key Features:

Search API

PostgreSQL search is exposed via postgres_search.py (CLI) and search_server.py (Web API):

Command-Line Interface:

Web API (✅ Implemented):

Features:

🌐 REST API & Registry

REST API v1

Full-featured API with 30+ endpoints for programmatic access and MCP/AI integration:

MCP/AI-Optimized Features:

Rate limited to 100 requests/minute. See API Documentation for complete reference.
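
A client can stay under this limit by pacing its own requests. The sketch below is a client-side sliding-window limiter. Whether the server counts requests in a sliding or fixed window, and how it signals that the limit was exceeded, is not stated here, so treat the window model as an assumption. The class and method names are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(
        self,
        max_calls: int = 100,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._calls: Deque[float] = deque()  # timestamps of recent calls

    def wait_time(self) -> float:
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period - (now - self._calls[0])

    def record(self) -> None:
        """Note that a call is being made now."""
        self._calls.append(self._clock())

    def acquire(self, sleep: Callable[[float], None] = time.sleep) -> None:
        """Block until a call is allowed, then record it."""
        delay = self.wait_time()
        if delay > 0:
            sleep(delay)
        self.record()
```

Calling `limiter.acquire()` before each API request keeps a single client within 100 requests per minute; several clients sharing one address would need to share a limiter.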

Instance Registry & Leaderboard

Redd-Archiver supports a distributed registry system for tracking archive instances:

See Registry Setup Guide for configuration.

📈 Performance & Optimization

PostgreSQL Backend Performance (v1.0+)

Constant Memory Usage:

Database Storage:

Processing Speed:

Search Performance

Performance varies based on dataset size, query complexity, and hardware:

Architecture Benefits

PostgreSQL v1.0 Features:

🔀 Scaling for Very Large Archives

Single Instance Limits

Redd-Archiver has been tested with archives up to hundreds of gigabytes. For optimal performance:

Horizontal Scaling Strategy

For very large archive collections (multiple terabytes), deploy multiple instances divided by topic:

Architecture:

Benefits:

Deployment Options:

Example Multi-Instance Setup:
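
One way to picture a topic-based split is a small routing table that maps each community to the instance holding its topic. This is a hypothetical sketch only: the topics, community assignments, and instance URLs below are invented for illustration and do not describe a real deployment.

```python
from typing import Dict, Optional

# Hypothetical instances, one per topic (URLs are placeholders).
INSTANCES: Dict[str, str] = {
    "technology": "https://tech.archive.example",
    "gaming": "https://gaming.archive.example",
}

# Hypothetical assignment of communities to topics.
COMMUNITY_TOPICS: Dict[str, str] = {
    "programming": "technology",
    "linux": "technology",
    "speedrun": "gaming",
}


def instance_for(community: str) -> Optional[str]:
    """Return the base URL of the instance archiving `community`."""
    topic = COMMUNITY_TOPICS.get(community.lower())
    if topic is None:
        return None  # community is not archived by any instance
    return INSTANCES.get(topic)
```

Each instance then runs its own database and search server over only its share of the data, so no single PostgreSQL instance has to hold the full collection.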

When to Use:

🎯 Use Cases

Research & Academia

Community Archiving

Investigation & Analysis

📚 Documentation

Deployment Guides

API & Integration

Project Documentation

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for development guidelines, code structure, and testing procedures.

Key areas for contribution:

See our modular architecture (18 specialized modules) for easy entry points to contribute.

📝 License

This is free and unencumbered software released into the public domain. See the LICENSE file (Unlicense) for details.

Anyone is free to copy, modify, publish, use, compile, sell, or distribute this software for any purpose, commercial or non-commercial, and by any means.

📦 Data Sources

This project leverages public datasets from the following sources:

🙏 Acknowledgments

This project builds upon the work of several excellent archival projects:

📧 Contact

💰 Support the Project

Redd-Archiver was built by one person over 6 months as a labor of love to preserve internet history before it disappears forever.

This isn't backed by a company or institution—just an individual committed to keeping valuable discussions accessible. Your support helps:

Every donation, no matter the size, helps keep this preservation effort alive.

Bitcoin (BTC)

Scan to donate Bitcoin

Monero (XMR)

Scan to donate Monero

Thank you for supporting internet archival efforts! Every contribution helps maintain and improve this project.

This software is provided "as is" under the Unlicense. See LICENSE for details. Users are responsible for compliance with applicable laws and terms of service when processing data.
