
Commit f4d031a

Reference implementation and docs added
1 parent 69f8cb6 commit f4d031a

38 files changed: +3952 / -184 lines

.github/workflows/deploy_site.yml

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
name: Site Deployment
permissions:
  contents: read
  pages: write
  id-token: write

on:
  push:
    branches: [ main ]
    paths:
      - 'docs/**'
      - 'mkdocs.yml'
  workflow_dispatch:

concurrency:
  group: pages
  cancel-in-progress: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Cache pip
        uses: actions/cache@v4
        with:
          key: mkdocs-${{ runner.os }}-${{ hashFiles('**/requirements.txt') }}
          path: ~/.cache/pip

      - run: pip install mkdocs-material

      - run: mkdocs build --strict

      - uses: actions/upload-pages-artifact@v3
        with:
          path: site/

  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    needs: build
    steps:
      - uses: actions/deploy-pages@v4
        id: deployment

.gitignore

Lines changed: 0 additions & 1 deletion
@@ -46,7 +46,6 @@ htmlcov/
 Thumbs.db
 
 # Exports
-*.json
 *.csv
 *.xml
 *.pickle

README.md

Lines changed: 4 additions & 180 deletions
@@ -7,16 +7,9 @@ A collection-based protocol that reduces waste of bandwidth, processing power, a
 Web crawlers (search engines, AI bots, aggregators) consume massive bandwidth and server resources by parsing web-pages designed for human viewing.
 With the explosion of AI crawlers, this traffic has become a significant cost for websites and strain on internet infrastructure.
 
-Sources:
-
-- https://radar.cloudflare.com/year-in-review/2025
-- https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
-- https://scrapeops.io/web-scraping-playbook/web-scraping-market-report-2025/
-
-
 ## The Solution
 
-SCP enables websites to serve pre-generated collections of their content in compressed JSON Lines format.
+SCP enables websites to serve pre-generated collections of their content in compressed format from CDN or Cloud Object Storage.
 
 **Target Goals**:
 
@@ -26,183 +19,14 @@ SCP enables websites to serve pre-generated collections of their content in comp
 - 90% fewer requests - one download fetches entire site sections
 - Zero impact on user experience (users continue accessing regular sites)
 
-## How It Works
-
-Websites pre-generate compressed collections and host them on CDN or Cloud Object Storage:
-
-1. Website generates blog-snapshot-2025-01-15.scp.gz (5,247 pages → 52 MB)
-2. Uploads to CDN or Cloud Object Storage
-3. Declares availability of content collections in sitemap.xml
-4. Crawler downloads entire collection in one request
-5. Later: crawler downloads delta blog-delta-2025-01-16.scp.gz (47 pages → 480 KB)
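
The five numbered steps above describe the publisher side. As a purely illustrative sketch (not part of this commit, and not the reference implementation), step 1 could look like this in Python, writing a snapshot as gzip-compressed JSON Lines with the header and page fields used in the examples below:

```python
# Illustrative sketch only: write a snapshot collection as gzip-compressed
# JSON Lines. Field names and file naming follow the README examples; the
# actual reference implementation may differ.
import gzip
import json

def write_snapshot(pages, section, generated, path):
    """pages: iterable of page dicts; generated: ISO 8601 UTC timestamp string."""
    header = {"collection": {
        "id": f"{section}-snapshot-{generated[:10]}",
        "section": section,
        "type": "snapshot",
        "generated": generated,
        "version": "0.1",
    }}
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(json.dumps(header, ensure_ascii=False) + "\n")  # line 1: collection metadata
        for page in pages:                                       # lines 2+: one page per line
            f.write(json.dumps(page, ensure_ascii=False) + "\n")

write_snapshot(
    pages=[{"url": "https://example.com/blog/post-1", "title": "First Post",
            "modified": "2025-01-15T09:00:00Z", "language": "en", "content": []}],
    section="blog",
    generated="2025-01-15T00:00:00Z",
    path="blog-snapshot-2025-01-15.scp.gz",
)
```

Uploading the file (step 2) and advertising it in sitemap.xml (step 3) are separate concerns and are not sketched here.
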
-
-
-## Technical Overview
-
-SCP uses JSON Lines (newline-delimited JSON) format, compressed with gzip or zstd.
-
-### File Structure
-
-- File extension: `.scp.gz` (gzip) or `.scp.zst` (zstd)
-- Content-Type: `application/x-ndjson+gzip` or `application/x-ndjson+zstd`
-- Format: One JSON object per line
-
-```jsonl
-{"collection":{"id":"blog-snapshot-2025-01-15","section":"blog","type":"snapshot","generated":"2025-01-15T00:00:00Z","version":"0.1"}}
-{"url":"https://example.com/blog/post-1","title":"First Post","description":"...","modified":"2025-01-15T09:00:00Z","language":"en","content":[...]}
-{"url":"https://example.com/blog/post-2","title":"Second Post","description":"...","modified":"2025-01-14T10:00:00Z","language":"en","content":[...]}
-```
-
-- Line 1: Collection metadata (snapshot or delta)
-- Lines 2+: Individual pages
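
Reading a collection back is the mirror image. A minimal, illustrative consumer using only the standard library (gzip variant only; `.scp.zst` would need a third-party zstd binding and is omitted):

```python
# Minimal sketch: decompress a collection and parse one JSON object per line.
# File name and field names are taken from the example above.
import gzip
import json

with gzip.open("blog-snapshot-2025-01-15.scp.gz", "rt", encoding="utf-8") as f:
    meta = json.loads(f.readline())["collection"]              # line 1: collection metadata
    pages = [json.loads(line) for line in f if line.strip()]   # lines 2+: pages

print(meta["id"], "->", len(pages), "pages")
```
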
-
-### Page Structure
-
-Each page is a JSON object with:
-
-```json
-{
-  "url": "https://example.com/blog/post-title",
-  "title": "Page Title",
-  "description": "Meta description for SEO",
-  "author": "John Doe",
-  "published": "2024-01-15T10:30:00Z",
-  "modified": "2024-01-20T14:22:00Z",
-  "language": "en",
-  "content": [
-    {"type": "heading", "level": 1, "text": "Main Heading"},
-    {"type": "text", "text": "Paragraph content goes here."},
-    {"type": "link", "url": "https://example.com", "text": "Link text"},
-    {"type": "image", "url": "https://example.com/image.jpg", "alt": "Alt text"},
-    {"type": "list", "ordered": false, "items": ["Item 1", "Item 2"]},
-    {"type": "code", "language": "python", "code": "print('Hello')"},
-    {"type": "table", "rows": [["Cell 1", "Cell 2"], ["Cell 3", "Cell 4"]]}
-  ]
-}
-```
-
-### Content Block Types
-
-- text: Paragraph text
-- heading: H1-H6 headings (level 1-6)
-- link: Hyperlinks with optional rel attributes
-- image: Images with alt text
-- list: Ordered or unordered lists
-- code: Code blocks with language syntax
-- table: Tables (row-major array format)
-- quote: Blockquotes with optional citation
-- video: Video embeds with sources, captions, transcripts
-- audio: Audio content with metadata
-- structured: Schema.org structured data (JSON-LD)
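
How a crawler consumes these blocks is left to the implementer. A hypothetical helper that flattens a page's `content` array into plain text for indexing, handling only a subset of the types listed above (field names follow the Page Structure example):

```python
# Hypothetical helper: flatten content blocks to plain text for indexing.
# Only a few block types are handled; image/video/audio/quote/structured are skipped.
def block_to_text(block):
    kind = block.get("type")
    if kind in ("text", "heading", "link"):
        return block.get("text", "")
    if kind == "list":
        return "\n".join(block.get("items", []))
    if kind == "table":
        return "\n".join(" | ".join(row) for row in block.get("rows", []))
    if kind == "code":
        return block.get("code", "")
    return ""

def page_to_text(page):
    return "\n".join(t for t in (block_to_text(b) for b in page.get("content", [])) if t)
```
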
-
-## Discovery via Sitemap
-
-Crawlers discover SCP collections through `sitemap.xml`:
-
-```xml
-<?xml version="1.0" encoding="UTF-8"?>
-<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
-        xmlns:scp="https://scp-protocol.org/schemas/sitemap/1.0">
-
-  <!-- SCP Metadata -->
-  <scp:version>0.1</scp:version>
-  <scp:compression>zstd,gzip</scp:compression>
-
-  <!-- Available Sections -->
-  <scp:section name="blog" updateFreq="daily" pages="~5000"/>
-  <scp:section name="docs" updateFreq="weekly" pages="~200"/>
-
-  <!-- Snapshot Collections (full state) -->
-  <scp:collection section="blog" type="snapshot"
-                  url="https://r2.example.com/blog-snapshot-2025-01-15.scp.gz"
-                  generated="2025-01-15T00:00:00Z"
-                  expires="2025-01-16T00:00:00Z"
-                  pages="5247" size="52000000"/>
-
-  <!-- Delta Collections (incremental changes) -->
-  <scp:delta section="blog" period="2025-01-15"
-             url="https://r2.example.com/blog-delta-2025-01-15.scp.gz"
-             generated="2025-01-15T23:00:00Z"
-             expires="2025-01-17T00:00:00Z"
-             pages="47" size="480000"
-             since="2025-01-14T00:00:00Z"/>
-</urlset>
-```
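
For illustration, the `scp:` elements in the sample sitemap above can be listed with the standard-library XML parser. The sitemap URL is hypothetical; the namespace URI and attribute names are the ones shown in the example:

```python
# Sketch: list snapshot and delta collections advertised in a sitemap.
# Element and attribute names follow the XML example above.
import urllib.request
import xml.etree.ElementTree as ET

NS = {"scp": "https://scp-protocol.org/schemas/sitemap/1.0"}

with urllib.request.urlopen("https://example.com/sitemap.xml") as resp:
    root = ET.fromstring(resp.read())

for el in root.findall("scp:collection", NS) + root.findall("scp:delta", NS):
    attrs = el.attrib
    print(el.tag.split("}")[-1], attrs.get("section"), attrs.get("url"), attrs.get("pages"))
```
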
-
-## Snapshots and Deltas
-
-### Snapshot Collections
-
-Full section state, regenerated periodically:
-
-- Contains all pages in the section
-- Updated daily/weekly based on section updateFreq
-- First crawl downloads full snapshot
-- Example: `blog-snapshot-2025-01-15.scp.gz` (5,247 pages, 52 MB)
-
-### Delta Collections
-
-Incremental changes only:
-
-- Contains only modified/new pages during the period
-- Much smaller than snapshots (typically <1% of snapshot size)
-- Subsequent crawls download deltas and merge locally
-- Example: `blog-delta-2025-01-15.scp.gz` (47 pages, 480 KB)
-
-### Crawler Workflow
-
-**Initial Crawl**:
-1. Parse `sitemap.xml`
-2. Download snapshot collection: `blog-snapshot-2025-01-15.scp.gz`
-3. Decompress and parse JSON Lines
-4. Index all 5,247 pages
-
-**Incremental Updates** (next day):
-1. Check `sitemap.xml` for new deltas
-2. Download `blog-delta-2025-01-16.scp.gz` (89 pages, 920 KB)
-3. Merge delta into local index (update/add pages)
-
-**Timeline Example**:
-- Day 1: Download snapshot (5,247 pages, 52 MB)
-- Day 2: Download delta (47 pages, 480 KB)
-- Day 3: Download delta (89 pages, 920 KB)
-- Day 4: Download delta (124 pages, 1.2 MB)
-
-**Total bandwidth**: 54.6 MB vs 208 MB traditional (4 daily full crawls) = **74% savings**
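
The crawler workflow and timeline above amount to a local index keyed by URL: load the snapshot once, then apply each day's delta on top. A minimal, illustrative sketch (sitemap polling, scheduling, and error handling omitted; only the delta file named above is used):

```python
# Sketch of the crawler side: snapshot once, then merge deltas by URL.
import gzip
import json

def load_collection(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        meta = json.loads(f.readline())["collection"]
        return meta, [json.loads(line) for line in f if line.strip()]

index = {}

# Day 1: full snapshot
_, pages = load_collection("blog-snapshot-2025-01-15.scp.gz")
index.update({p["url"]: p for p in pages})

# Following days: each delta overwrites or adds pages keyed by URL
for delta_file in ("blog-delta-2025-01-16.scp.gz",):  # later deltas follow the same pattern
    _, changed = load_collection(delta_file)
    index.update({p["url"]: p for p in changed})

# Bandwidth per the timeline above: 52 + 0.48 + 0.92 + 1.2 = 54.6 MB,
# versus 4 full daily crawls at 52 MB = 208 MB; 1 - 54.6/208 ≈ 0.74, i.e. ~74% saved.
```
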
-
-## Project Status
-
-**Current Phase**: Specification draft complete (v0.1)
-
-**Next Steps**:
-
-- Ask community for review of specification draft
-- Reference implementation (Python)
-- Crawler support implementation for [qCrawl](https://github.com/crawlcore/qcrawl) crawler
-- Collection generator tools (CMS and Frameworks plugins)
-
-**After that**:
-
-- Bot verification to ensure only approved crawlers access site content using [Web Bot Auth](https://developers.cloudflare.com/bots/reference/bot-verification/web-bot-auth/)
-- Pay for content to support fair crawler-content creator dynamic using model similar to [Pay Per Crawl](https://blog.cloudflare.com/introducing-pay-per-crawl/)
-
-## Getting Involved
-
-- Implementers: Build collection generators and parsers (Python, Go, Rust, JavaScript)
-- CMS Plugin Developers: WordPress, Drupal, Django integrations
-- Crawler Developers: Crawler implementations
-- Benchmarkers: Validate bandwidth savings on real websites
-
-
 ## Resources
 
-- Specification: [scp_specification.md](scp_specification.md) - Technical specification (v0.1)
-- License: [CC0 1.0 Universal](LICENSE) - Public Domain
+- **Documentation**: [scp-protocol.org](https://scp-protocol.org) - Getting started, guides, and examples
+- **Specification**: [scp_specification.md](scp_specification.md) - Technical specification (v0.1)
+- **License**: [CC0 1.0 Universal](LICENSE) - Public Domain
 
 ## Contact
 
-For questions, feedback:
-
 Vasiliy Kiryanov
 
 - https://github.com/vasiliyk

docs/CNAME

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+www.scp-protocol.org
