# Site Content Protocol (SCP)

A collection-based protocol that reduces wasted bandwidth, processing power, and energy by serving pre-generated snapshots and deltas.

## The Problem

Web crawlers (search engines, AI bots, aggregators) consume massive bandwidth and server resources by parsing web pages designed for human viewing.
With the explosion of AI crawlers, this traffic has become a significant cost for websites and a strain on internet infrastructure.

Sources:

- https://radar.cloudflare.com/year-in-review/2025
- https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
- https://scrapeops.io/web-scraping-playbook/web-scraping-market-report-2025/


## The Solution

SCP enables websites to serve pre-generated collections of their content in compressed JSON Lines format.

**Target Goals**:

- 50-60% bandwidth reduction for initial snapshots vs compressed HTML
- 90-95% bandwidth reduction with delta updates (after the initial download)
- 90% faster parsing than HTML/CSS/JS processing
- 90% fewer requests: one download fetches an entire site section
- Zero impact on user experience (users continue accessing the regular site)

## How It Works

Websites pre-generate compressed collections and host them on a CDN or cloud object storage:

1. The website generates `blog-snapshot-2025-01-15.scp.gz` (5,247 pages → 52 MB)
2. Uploads it to a CDN or cloud object storage
3. Declares the available content collections in `sitemap.xml`
4. A crawler downloads the entire collection in one request
5. Later, the crawler downloads the delta `blog-delta-2025-01-16.scp.gz` (47 pages → 480 KB)

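Step 1 can be sketched in a few lines of Python. This is only an illustration of the wire format, not the reference implementation; `write_snapshot` and its parameter names are hypothetical:

```python
import gzip
import json

def write_snapshot(path, collection_meta, pages):
    """Write an SCP snapshot: collection metadata on line 1, one page per line after.

    `collection_meta` is the dict that goes under the "collection" key;
    `pages` is an iterable of page dicts. Hypothetical helper, not spec API.
    """
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(json.dumps({"collection": collection_meta}, ensure_ascii=False) + "\n")
        for page in pages:
            f.write(json.dumps(page, ensure_ascii=False) + "\n")
```

Because each page is an independent line, the file can be produced incrementally without holding the whole site in memory.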

## Technical Overview

SCP uses JSON Lines (newline-delimited JSON), compressed with gzip or zstd.

### File Structure

- File extension: `.scp.gz` (gzip) or `.scp.zst` (zstd)
- Content-Type: `application/x-ndjson+gzip` or `application/x-ndjson+zstd`
- Format: one JSON object per line

```jsonl
{"collection":{"id":"blog-snapshot-2025-01-15","section":"blog","type":"snapshot","generated":"2025-01-15T00:00:00Z","version":"0.1"}}
{"url":"https://example.com/blog/post-1","title":"First Post","description":"...","modified":"2025-01-15T09:00:00Z","language":"en","content":[...]}
{"url":"https://example.com/blog/post-2","title":"Second Post","description":"...","modified":"2025-01-14T10:00:00Z","language":"en","content":[...]}
```

- Line 1: collection metadata (snapshot or delta)
- Lines 2+: individual pages

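Reading a collection back is symmetric. A minimal sketch using only the Python standard library (`read_collection` is a hypothetical helper, not part of the spec):

```python
import gzip
import json

def read_collection(path):
    """Parse a gzip-compressed SCP collection.

    Returns (collection_metadata, list_of_pages): line 1 carries the
    metadata under "collection", every following line is one page object.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return records[0]["collection"], records[1:]
```

For very large snapshots a generator over the file object would avoid building the full list; the eager version above keeps the sketch short.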
### Page Structure

Each page is a JSON object with:

```json
{
  "url": "https://example.com/blog/post-title",
  "title": "Page Title",
  "description": "Meta description for SEO",
  "author": "John Doe",
  "published": "2024-01-15T10:30:00Z",
  "modified": "2024-01-20T14:22:00Z",
  "language": "en",
  "content": [
    {"type": "heading", "level": 1, "text": "Main Heading"},
    {"type": "text", "text": "Paragraph content goes here."},
    {"type": "link", "url": "https://example.com", "text": "Link text"},
    {"type": "image", "url": "https://example.com/image.jpg", "alt": "Alt text"},
    {"type": "list", "ordered": false, "items": ["Item 1", "Item 2"]},
    {"type": "code", "language": "python", "code": "print('Hello')"},
    {"type": "table", "rows": [["Cell 1", "Cell 2"], ["Cell 3", "Cell 4"]]}
  ]
}
```

### Content Block Types

- `text`: Paragraph text
- `heading`: H1-H6 headings (level 1-6)
- `link`: Hyperlinks with optional rel attributes
- `image`: Images with alt text
- `list`: Ordered or unordered lists
- `code`: Code blocks with a language tag
- `table`: Tables (row-major array format)
- `quote`: Blockquotes with an optional citation
- `video`: Video embeds with sources, captions, transcripts
- `audio`: Audio content with metadata
- `structured`: Schema.org structured data (JSON-LD)

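A consumer might flatten these blocks to plain text for indexing. The sketch below covers only a subset of the block types, and its rendering choices (markers, separators) are illustrative assumptions, not prescribed by the spec:

```python
def block_to_text(block):
    """Flatten one SCP content block to a plain-text line (illustrative subset)."""
    t = block["type"]
    if t == "text":
        return block["text"]
    if t == "heading":
        return "#" * block["level"] + " " + block["text"]
    if t == "link":
        return "{} ({})".format(block["text"], block["url"])
    if t == "image":
        return "[image: {}]".format(block.get("alt", ""))
    if t == "list":
        marker = "1." if block.get("ordered") else "-"
        return "\n".join("{} {}".format(marker, item) for item in block["items"])
    if t == "code":
        return block["code"]
    if t == "table":
        return "\n".join(" | ".join(row) for row in block["rows"])
    return ""  # quote, video, audio, structured: omitted in this sketch
```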
## Discovery via Sitemap

Crawlers discover SCP collections through `sitemap.xml`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:scp="https://scp-protocol.org/schemas/sitemap/1.0">

  <!-- SCP Metadata -->
  <scp:version>0.1</scp:version>
  <scp:compression>zstd,gzip</scp:compression>

  <!-- Available Sections -->
  <scp:section name="blog" updateFreq="daily" pages="~5000"/>
  <scp:section name="docs" updateFreq="weekly" pages="~200"/>

  <!-- Snapshot Collections (full state) -->
  <scp:collection section="blog" type="snapshot"
      url="https://r2.example.com/blog-snapshot-2025-01-15.scp.gz"
      generated="2025-01-15T00:00:00Z"
      expires="2025-01-16T00:00:00Z"
      pages="5247" size="52000000"/>

  <!-- Delta Collections (incremental changes) -->
  <scp:delta section="blog" period="2025-01-15"
      url="https://r2.example.com/blog-delta-2025-01-15.scp.gz"
      generated="2025-01-15T23:00:00Z"
      expires="2025-01-17T00:00:00Z"
      pages="47" size="480000"
      since="2025-01-14T00:00:00Z"/>
</urlset>
```

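A crawler can extract these entries with a standard XML parser. A minimal sketch assuming only the attributes shown above (`find_collections` is a hypothetical helper):

```python
import xml.etree.ElementTree as ET

SCP_NS = "https://scp-protocol.org/schemas/sitemap/1.0"

def find_collections(sitemap_xml):
    """Return SCP snapshot and delta entries found in a sitemap.xml string."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for tag in ("collection", "delta"):
        # Namespaced element lookup: {namespace}localname
        for el in root.iter("{{{}}}{}".format(SCP_NS, tag)):
            entries.append({
                "kind": tag,
                "section": el.get("section"),
                "url": el.get("url"),
                "pages": int(el.get("pages")),
            })
    return entries
```

A real crawler would also compare `generated`/`expires` against its last crawl time to decide between the snapshot and a delta.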
## Snapshots and Deltas

### Snapshot Collections

Full section state, regenerated periodically:

- Contains all pages in the section
- Regenerated daily or weekly, per the section's `updateFreq`
- The first crawl downloads the full snapshot
- Example: `blog-snapshot-2025-01-15.scp.gz` (5,247 pages, 52 MB)

### Delta Collections

Incremental changes only:

- Contains only pages modified or added during the period
- Much smaller than snapshots (typically <1% of snapshot size)
- Subsequent crawls download deltas and merge them locally
- Example: `blog-delta-2025-01-15.scp.gz` (47 pages, 480 KB)

### Crawler Workflow

**Initial Crawl**:
1. Parse `sitemap.xml`
2. Download the snapshot collection: `blog-snapshot-2025-01-15.scp.gz`
3. Decompress and parse the JSON Lines
4. Index all 5,247 pages

**Incremental Updates** (next day):
1. Check `sitemap.xml` for new deltas
2. Download `blog-delta-2025-01-16.scp.gz` (89 pages, 920 KB)
3. Merge the delta into the local index (update/add pages)

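The merge in the incremental step can be a simple upsert keyed by URL. A sketch, assuming `modified` timestamps share the ISO 8601 UTC format shown above, so plain string comparison orders them correctly:

```python
def merge_delta(index, delta_pages):
    """Upsert delta pages into a local index dict keyed by URL.

    ISO 8601 UTC timestamps sort lexicographically, so string comparison
    on "modified" is enough to keep the newer version of each page.
    """
    for page in delta_pages:
        current = index.get(page["url"])
        if current is None or page["modified"] >= current["modified"]:
            index[page["url"]] = page
    return index
```

Deletions are not covered by this sketch; how a delta signals a removed page is a question for the full specification.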
**Timeline Example**:
- Day 1: Download snapshot (5,247 pages, 52 MB)
- Day 2: Download delta (47 pages, 480 KB)
- Day 3: Download delta (89 pages, 920 KB)
- Day 4: Download delta (124 pages, 1.2 MB)

**Total bandwidth**: 54.6 MB, vs 208 MB for four traditional daily full crawls = **74% savings**

## Project Status

**Current Phase**: Specification draft complete (v0.1)

**Next Steps**:

- Ask the community to review the specification draft
- Reference implementation (Python)
- Crawler support for the [qCrawl](https://github.com/crawlcore/qcrawl) crawler
- Collection generator tools (plugins for CMSs and frameworks)

**After that**:

- Bot verification, so that only approved crawlers access site content, using [Web Bot Auth](https://developers.cloudflare.com/bots/reference/bot-verification/web-bot-auth/)
- Paid content access, supporting a fair dynamic between crawlers and content creators, using a model similar to [Pay Per Crawl](https://blog.cloudflare.com/introducing-pay-per-crawl/)

## Getting Involved

- Implementers: Build collection generators and parsers (Python, Go, Rust, JavaScript)
- CMS Plugin Developers: WordPress, Drupal, Django integrations
- Crawler Developers: Crawler implementations
- Benchmarkers: Validate bandwidth savings on real websites


## Resources

- Specification: [scp_specification.md](scp_specification.md) - Technical specification (v0.1)
- License: [CC0 1.0 Universal](LICENSE) - Public Domain

## Contact

For questions and feedback:

Vasiliy Kiryanov

- https://github.com/vasiliyk
- https://x.com/vasiliykiryanov
- https://linkedin.com/in/vasiliykiryanov