@@ -7,16 +7,9 @@ A collection-based protocol that reduces waste of bandwidth, processing power, a
  Web crawlers (search engines, AI bots, aggregators) consume massive bandwidth and server resources by parsing web-pages designed for human viewing.
  With the explosion of AI crawlers, this traffic has become a significant cost for websites and strain on internet infrastructure.

- Sources:
-
- - https://radar.cloudflare.com/year-in-review/2025
- - https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
- - https://scrapeops.io/web-scraping-playbook/web-scraping-market-report-2025/
-
-
  ## The Solution

- SCP enables websites to serve pre-generated collections of their content in compressed JSON Lines format.
+ SCP enables websites to serve pre-generated collections of their content in compressed format from CDN or Cloud Object Storage.

  **Target Goals**:

@@ -26,183 +19,14 @@ SCP enables websites to serve pre-generated collections of their content in comp
  - 90% fewer requests - one download fetches entire site sections
  - Zero impact on user experience (users continue accessing regular sites)

- ## How It Works
-
- Websites pre-generate compressed collections and host them on CDN or Cloud Object Storage:
-
- 1. Website generates blog-snapshot-2025-01-15.scp.gz (5,247 pages → 52 MB)
- 2. Uploads to CDN or Cloud Object Storage
- 3. Declares availability of content collections in sitemap.xml
- 4. Crawler downloads entire collection in one request
- 5. Later: crawler downloads delta blog-delta-2025-01-16.scp.gz (47 pages → 480 KB)
-
-
- ## Technical Overview
-
- SCP uses JSON Lines (newline-delimited JSON) format, compressed with gzip or zstd.
-
- ### File Structure
-
- - File extension: `.scp.gz` (gzip) or `.scp.zst` (zstd)
- - Content-Type: `application/x-ndjson+gzip` or `application/x-ndjson+zstd`
- - Format: One JSON object per line
-
- ```jsonl
- {"collection":{"id":"blog-snapshot-2025-01-15","section":"blog","type":"snapshot","generated":"2025-01-15T00:00:00Z","version":"0.1"}}
- {"url":"https://example.com/blog/post-1","title":"First Post","description":"...","modified":"2025-01-15T09:00:00Z","language":"en","content":[...]}
- {"url":"https://example.com/blog/post-2","title":"Second Post","description":"...","modified":"2025-01-14T10:00:00Z","language":"en","content":[...]}
- ```
-
- - Line 1: Collection metadata (snapshot or delta)
- - Lines 2+: Individual pages
-
- ### Page Structure
-
- Each page is a JSON object with:
-
- ```json
- {
-   "url": "https://example.com/blog/post-title",
-   "title": "Page Title",
-   "description": "Meta description for SEO",
-   "author": "John Doe",
-   "published": "2024-01-15T10:30:00Z",
-   "modified": "2024-01-20T14:22:00Z",
-   "language": "en",
-   "content": [
-     {"type": "heading", "level": 1, "text": "Main Heading"},
-     {"type": "text", "text": "Paragraph content goes here."},
-     {"type": "link", "url": "https://example.com", "text": "Link text"},
-     {"type": "image", "url": "https://example.com/image.jpg", "alt": "Alt text"},
-     {"type": "list", "ordered": false, "items": ["Item 1", "Item 2"]},
-     {"type": "code", "language": "python", "code": "print('Hello')"},
-     {"type": "table", "rows": [["Cell 1", "Cell 2"], ["Cell 3", "Cell 4"]]}
-   ]
- }
- ```
-
- ### Content Block Types
-
- - text: Paragraph text
- - heading: H1-H6 headings (level 1-6)
- - link: Hyperlinks with optional rel attributes
- - image: Images with alt text
- - list: Ordered or unordered lists
- - code: Code blocks with language syntax
- - table: Tables (row-major array format)
- - quote: Blockquotes with optional citation
- - video: Video embeds with sources, captions, transcripts
- - audio: Audio content with metadata
- - structured: Schema.org structured data (JSON-LD)
-
- ## Discovery via Sitemap
-
- Crawlers discover SCP collections through `sitemap.xml`:
-
- ```xml
- <?xml version="1.0" encoding="UTF-8"?>
- <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
-         xmlns:scp="https://scp-protocol.org/schemas/sitemap/1.0">
-
-   <!-- SCP Metadata -->
-   <scp:version>0.1</scp:version>
-   <scp:compression>zstd,gzip</scp:compression>
-
-   <!-- Available Sections -->
-   <scp:section name="blog" updateFreq="daily" pages="~5000"/>
-   <scp:section name="docs" updateFreq="weekly" pages="~200"/>
-
-   <!-- Snapshot Collections (full state) -->
-   <scp:collection section="blog" type="snapshot"
-                   url="https://r2.example.com/blog-snapshot-2025-01-15.scp.gz"
-                   generated="2025-01-15T00:00:00Z"
-                   expires="2025-01-16T00:00:00Z"
-                   pages="5247" size="52000000"/>
-
-   <!-- Delta Collections (incremental changes) -->
-   <scp:delta section="blog" period="2025-01-15"
-              url="https://r2.example.com/blog-delta-2025-01-15.scp.gz"
-              generated="2025-01-15T23:00:00Z"
-              expires="2025-01-17T00:00:00Z"
-              pages="47" size="480000"
-              since="2025-01-14T00:00:00Z"/>
- </urlset>
- ```
-
- ## Snapshots and Deltas
-
- ### Snapshot Collections
-
- Full section state, regenerated periodically:
-
- - Contains all pages in the section
- - Updated daily/weekly based on section updateFreq
- - First crawl downloads full snapshot
- - Example: `blog-snapshot-2025-01-15.scp.gz` (5,247 pages, 52 MB)
-
- ### Delta Collections
-
- Incremental changes only:
-
- - Contains only modified/new pages during the period
- - Much smaller than snapshots (typically <1% of snapshot size)
- - Subsequent crawls download deltas and merge locally
- - Example: `blog-delta-2025-01-15.scp.gz` (47 pages, 480 KB)
-
- ### Crawler Workflow
-
- **Initial Crawl**:
- 1. Parse `sitemap.xml`
- 2. Download snapshot collection: `blog-snapshot-2025-01-15.scp.gz`
- 3. Decompress and parse JSON Lines
- 4. Index all 5,247 pages
-
- **Incremental Updates** (next day):
- 1. Check `sitemap.xml` for new deltas
- 2. Download `blog-delta-2025-01-16.scp.gz` (89 pages, 920 KB)
- 3. Merge delta into local index (update/add pages)
-
- **Timeline Example**:
- - Day 1: Download snapshot (5,247 pages, 52 MB)
- - Day 2: Download delta (47 pages, 480 KB)
- - Day 3: Download delta (89 pages, 920 KB)
- - Day 4: Download delta (124 pages, 1.2 MB)
-
- **Total bandwidth**: 54.6 MB vs 208 MB traditional (4 daily full crawls) = **74% savings**
-
- ## Project Status
-
- **Current Phase**: Specification draft complete (v0.1)
-
- **Next Steps**:
-
- - Ask community for review of specification draft
- - Reference implementation (Python)
- - Crawler support implementation for [qCrawl](https://github.com/crawlcore/qcrawl) crawler
- - Collection generator tools (CMS and Frameworks plugins)
-
- **After that**:
-
- - Bot verification to ensure only approved crawlers access site content using [Web Bot Auth](https://developers.cloudflare.com/bots/reference/bot-verification/web-bot-auth/)
- - Pay for content to support fair crawler-content creator dynamic using model similar to [Pay Per Crawl](https://blog.cloudflare.com/introducing-pay-per-crawl/)
-
- ## Getting Involved
-
- - Implementers: Build collection generators and parsers (Python, Go, Rust, JavaScript)
- - CMS Plugin Developers: WordPress, Drupal, Django integrations
- - Crawler Developers: Crawler implementations
- - Benchmarkers: Validate bandwidth savings on real websites
-
-
  ## Resources

- - Specification: [scp_specification.md](scp_specification.md) - Technical specification (v0.1)
- - License: [CC0 1.0 Universal](LICENSE) - Public Domain
+ - **Documentation**: [scp-protocol.org](https://scp-protocol.org) - Getting started, guides, and examples
+ - **Specification**: [scp_specification.md](scp_specification.md) - Technical specification (v0.1)
+ - **License**: [CC0 1.0 Universal](LICENSE) - Public Domain

  ## Contact

- For questions, feedback:
-
  Vasiliy Kiryanov

  - https://github.com/vasiliyk
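
Note on the removed material: the "Crawler Workflow" text deleted in this commit describes downloading a snapshot and then merging later deltas into a local index. A minimal Python sketch of that flow follows, assuming gzip-compressed JSON Lines with the collection metadata on line 1, as the removed "File Structure" section states. The helper names (`write_collection`, `read_collection`, `merge_delta`) are hypothetical and not part of the specification, and keying the index by `url` is an assumption. The removed text does not say how page deletions are represented, so they are not handled here.

```python
import gzip
import io
import json


def write_collection(metadata, pages):
    """Serialize a collection as gzip-compressed JSON Lines (metadata first)."""
    lines = [json.dumps({"collection": metadata})]
    lines += [json.dumps(page) for page in pages]
    return gzip.compress("\n".join(lines).encode("utf-8"))


def read_collection(blob):
    """Return (metadata, pages) from a gzip-compressed JSON Lines blob."""
    with gzip.open(io.BytesIO(blob), mode="rt", encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    return records[0]["collection"], records[1:]


def merge_delta(index, pages):
    """Add new pages to the index and overwrite changed ones, keyed by URL."""
    for page in pages:
        index[page["url"]] = page
    return index


# Illustrative data only; a real crawler would download these blobs
# from the URLs declared in sitemap.xml.
snapshot = write_collection(
    {"id": "blog-snapshot-2025-01-15", "section": "blog", "type": "snapshot"},
    [
        {"url": "https://example.com/blog/post-1", "title": "First Post"},
        {"url": "https://example.com/blog/post-2", "title": "Second Post"},
    ],
)
delta = write_collection(
    {"id": "blog-delta-2025-01-16", "section": "blog", "type": "delta"},
    [
        {"url": "https://example.com/blog/post-2", "title": "Second Post (edited)"},
        {"url": "https://example.com/blog/post-3", "title": "Third Post"},
    ],
)

# Initial crawl: build the index from the snapshot.
_, snapshot_pages = read_collection(snapshot)
index = merge_delta({}, snapshot_pages)

# Incremental update: apply the delta on top of the existing index.
delta_meta, delta_pages = read_collection(delta)
merge_delta(index, delta_pages)

print(len(index))  # 3
```

Keying the index by URL means a delta entry replaces the earlier version of the same page, which matches the "update/add pages" wording of the removed merge step.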