Commit 69f8cb6

First commit
0 parents  commit 69f8cb6

4 files changed

+1595
-0
lines changed

.gitignore

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
.venv/
env/
venv/
ENV/

# IDE
.vscode/
.idea/
*.swp
*.swo

# Testing
.pytest_cache/
.tox/
.coverage
.mypy_cache
.ruff_cache

htmlcov/
*.log

# OS
.DS_Store
Thumbs.db

# Exports
*.json
*.csv
*.xml
*.pickle

# Mkdocs site output
/site/

LICENSE

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
CC0 1.0 Universal

Statement of Purpose

The laws of most jurisdictions throughout the world automatically confer
exclusive Copyright and Related Rights (defined below) upon the creator and
subsequent owner(s) (each and all, an "owner") of an original work of
authorship and/or a database (each, a "Work").

Certain owners wish to permanently relinquish those rights to a Work for the
purpose of contributing to a commons of creative, cultural and scientific
works ("Commons") that the public can reliably and without fear of later
claims of infringement build upon, modify, incorporate in other works, reuse
and redistribute as freely as possible in any form whatsoever and for any
purposes, including without limitation commercial purposes. These owners may
contribute to the Commons to promote the ideal of a free culture and the
further production of creative, cultural and scientific works, or to gain
reputation or greater distribution for their Work in part through the use and
efforts of others.

For these and/or other purposes and motivations, and without any expectation
of additional consideration or compensation, the person associating CC0 with a
Work (the "Affirmer"), to the extent that he or she is an owner of Copyright
and Related Rights in the Work, voluntarily elects to apply CC0 to the Work
and publicly distribute the Work under its terms, with knowledge of his or her
Copyright and Related Rights in the Work and the meaning and intended legal
effect of CC0 on those rights.

1. Copyright and Related Rights. A Work made available under CC0 may be
protected by copyright and related or neighboring rights ("Copyright and
Related Rights"). Copyright and Related Rights include, but are not limited
to, the following:

i. the right to reproduce, adapt, distribute, perform, display, communicate,
and translate a Work;

ii. moral rights retained by the original author(s) and/or performer(s);

iii. publicity and privacy rights pertaining to a person's image or likeness
depicted in a Work;

iv. rights protecting against unfair competition in regards to a Work,
subject to the limitations in paragraph 4(a), below;

v. rights protecting the extraction, dissemination, use and reuse of data in
a Work;

vi. database rights (such as those arising under Directive 96/9/EC of the
European Parliament and of the Council of 11 March 1996 on the legal
protection of databases, and under any national implementation thereof,
including any amended or successor version of such directive); and

vii. other similar, equivalent or corresponding rights throughout the world
based on applicable law or treaty, and any national implementations thereof.

2. Waiver. To the greatest extent permitted by, but not in contravention of,
applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and
unconditionally waives, abandons, and surrenders all of Affirmer's Copyright
and Related Rights and associated claims and causes of action, whether now
known or unknown (including existing as well as future claims and causes of
action), in the Work (i) in all territories worldwide, (ii) for the maximum
duration provided by applicable law or treaty (including future time
extensions), (iii) in any current or future medium and for any number of
copies, and (iv) for any purpose whatsoever, including without limitation
commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes
the Waiver for the benefit of each member of the public at large and to the
detriment of Affirmer's heirs and successors, fully intending that such Waiver
shall not be subject to revocation, rescission, cancellation, termination, or
any other legal or equitable action to disrupt the quiet enjoyment of the Work
by the public as contemplated by Affirmer's express Statement of Purpose.

3. Public License Fallback. Should any part of the Waiver for any reason be
judged legally invalid or ineffective under applicable law, then the Waiver
shall be preserved to the maximum extent permitted taking into account
Affirmer's express Statement of Purpose. In addition, to the extent the Waiver
is so judged Affirmer hereby grants to each affected person a royalty-free,
non transferable, non sublicensable, non exclusive, irrevocable and
unconditional license to exercise Affirmer's Copyright and Related Rights in
the Work (i) in all territories worldwide, (ii) for the maximum duration
provided by applicable law or treaty (including future time extensions), (iii)
in any current or future medium and for any number of copies, and (iv) for any
purpose whatsoever, including without limitation commercial, advertising or
promotional purposes (the "License"). The License shall be deemed effective as
of the date CC0 was applied by Affirmer to the Work. Should any part of the
License for any reason be judged legally invalid or ineffective under
applicable law, such partial invalidity or ineffectiveness shall not
invalidate the remainder of the License, and in such case Affirmer hereby
affirms that he or she will not (i) exercise any of his or her remaining
Copyright and Related Rights in the Work or (ii) assert any associated claims
and causes of action with respect to the Work, in either case contrary to
Affirmer's express Statement of Purpose.

4. Limitations and Disclaimers.

a. No trademark or patent rights held by Affirmer are waived, abandoned,
surrendered, licensed or otherwise affected by this document.

b. Affirmer offers the Work as-is and makes no representations or warranties
of any kind concerning the Work, express, implied, statutory or otherwise,
including without limitation warranties of title, merchantability, fitness
for a particular purpose, non infringement, or the absence of latent or
other defects, accuracy, or the present or absence of errors, whether or not
discoverable, all to the greatest extent permissible under applicable law.

c. Affirmer disclaims responsibility for clearing rights of other persons
that may apply to the Work or any use thereof, including without limitation
any person's Copyright and Related Rights in the Work. Further, Affirmer
disclaims responsibility for obtaining any necessary consents, permissions
or other rights required for any use of the Work.

d. Affirmer understands and acknowledges that Creative Commons is not a
party to this document and has no duty or obligation with respect to this
CC0 or use of the Work.

For more information, please see
<https://creativecommons.org/publicdomain/zero/1.0/>

README.md

Lines changed: 210 additions & 0 deletions
@@ -0,0 +1,210 @@
# Site Content Protocol (SCP)

A collection-based protocol that reduces wasted bandwidth, processing power, and energy by serving crawlers pre-generated snapshots and deltas.

## The Problem

Web crawlers (search engines, AI bots, aggregators) consume massive bandwidth and server resources by parsing web pages designed for human viewing.
With the explosion of AI crawlers, this traffic has become a significant cost for websites and a strain on internet infrastructure.

Sources:

- https://radar.cloudflare.com/year-in-review/2025
- https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
- https://scrapeops.io/web-scraping-playbook/web-scraping-market-report-2025/

## The Solution

SCP enables websites to serve pre-generated collections of their content in compressed JSON Lines format.

**Target Goals**:

- 50-60% bandwidth reduction for initial snapshots vs compressed HTML
- 90-95% bandwidth reduction with delta updates (after initial download)
- 90% faster parsing than HTML/CSS/JS processing
- 90% fewer requests: one download fetches an entire site section
- Zero impact on user experience (users continue accessing regular sites)

## How It Works

Websites pre-generate compressed collections and host them on a CDN or cloud object storage:

1. The website generates `blog-snapshot-2025-01-15.scp.gz` (5,247 pages → 52 MB)
2. It uploads the file to a CDN or cloud object storage
3. It declares the available content collections in `sitemap.xml`
4. A crawler downloads the entire collection in one request
5. Later, the crawler downloads the delta `blog-delta-2025-01-15.scp.gz` (47 pages → 480 KB)

## Technical Overview

SCP uses JSON Lines (newline-delimited JSON) format, compressed with gzip or zstd.

### File Structure

- File extension: `.scp.gz` (gzip) or `.scp.zst` (zstd)
- Content-Type: `application/x-ndjson+gzip` or `application/x-ndjson+zstd`
- Format: One JSON object per line

```jsonl
{"collection":{"id":"blog-snapshot-2025-01-15","section":"blog","type":"snapshot","generated":"2025-01-15T00:00:00Z","version":"0.1"}}
{"url":"https://example.com/blog/post-1","title":"First Post","description":"...","modified":"2025-01-15T09:00:00Z","language":"en","content":[...]}
{"url":"https://example.com/blog/post-2","title":"Second Post","description":"...","modified":"2025-01-14T10:00:00Z","language":"en","content":[...]}
```

- Line 1: Collection metadata (snapshot or delta)
- Lines 2+: Individual pages
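
The layout above (one metadata line, then one page per line, all gzip-compressed) can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the function names `write_collection` and `read_collection` are hypothetical, only the gzip variant is shown, and the whole file is read into memory for simplicity.

```python
import gzip
import json


def write_collection(path: str, metadata: dict, pages: list[dict]) -> None:
    """Write a gzip-compressed JSON Lines file: metadata on line 1, then one page per line."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        # Line 1 wraps the metadata in a "collection" key, as in the example above.
        fh.write(json.dumps({"collection": metadata}, separators=(",", ":")) + "\n")
        for page in pages:
            fh.write(json.dumps(page, separators=(",", ":")) + "\n")


def read_collection(path: str) -> tuple[dict, list[dict]]:
    """Return (collection metadata, list of page objects) from a .scp.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        metadata = json.loads(fh.readline())["collection"]
        # Remaining non-empty lines are individual pages.
        pages = [json.loads(line) for line in fh if line.strip()]
    return metadata, pages
```

A crawler handling large collections would likely iterate over lines instead of building a list, but the file layout is the same.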

### Page Structure

Each page is a JSON object with:

```json
{
  "url": "https://example.com/blog/post-title",
  "title": "Page Title",
  "description": "Meta description for SEO",
  "author": "John Doe",
  "published": "2024-01-15T10:30:00Z",
  "modified": "2024-01-20T14:22:00Z",
  "language": "en",
  "content": [
    {"type": "heading", "level": 1, "text": "Main Heading"},
    {"type": "text", "text": "Paragraph content goes here."},
    {"type": "link", "url": "https://example.com", "text": "Link text"},
    {"type": "image", "url": "https://example.com/image.jpg", "alt": "Alt text"},
    {"type": "list", "ordered": false, "items": ["Item 1", "Item 2"]},
    {"type": "code", "language": "python", "code": "print('Hello')"},
    {"type": "table", "rows": [["Cell 1", "Cell 2"], ["Cell 3", "Cell 4"]]}
  ]
}
```

### Content Block Types

- text: Paragraph text
- heading: H1-H6 headings (level 1-6)
- link: Hyperlinks with optional rel attributes
- image: Images with alt text
- list: Ordered or unordered lists
- code: Code blocks with a language identifier
- table: Tables (row-major array format)
- quote: Blockquotes with optional citation
- video: Video embeds with sources, captions, transcripts
- audio: Audio content with metadata
- structured: Schema.org structured data (JSON-LD)
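
To illustrate why this block structure is cheap to consume, here is a rough Python sketch that flattens a page's `content` array into plain text for indexing. It only handles the block types whose fields appear in the Page Structure example; the fields of `quote`, `video`, `audio`, and `structured` blocks are not shown in this README, so the sketch skips them. The helper names `block_to_text` and `page_to_text` are hypothetical.

```python
def block_to_text(block: dict) -> str:
    """Render one content block as plain text (empty string for unhandled types)."""
    kind = block.get("type")
    if kind in ("heading", "text"):
        return block.get("text", "")
    if kind == "link":
        return f'{block.get("text", "")} ({block.get("url", "")})'
    if kind == "image":
        return block.get("alt", "")
    if kind == "list":
        items = block.get("items", [])
        if block.get("ordered"):
            return "\n".join(f"{i}. {item}" for i, item in enumerate(items, 1))
        return "\n".join(f"- {item}" for item in items)
    if kind == "code":
        return block.get("code", "")
    if kind == "table":
        # Tables are row-major: each row is a list of cells.
        return "\n".join(" | ".join(str(cell) for cell in row) for row in block.get("rows", []))
    return ""  # quote, video, audio, structured: field layout not covered here


def page_to_text(page: dict) -> str:
    """Join the text of all handled blocks, separated by blank lines."""
    parts = [block_to_text(block) for block in page.get("content", [])]
    return "\n\n".join(part for part in parts if part)
```

No HTML parsing, CSS, or JavaScript execution is involved; the crawler walks a flat list of typed blocks.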

## Discovery via Sitemap

Crawlers discover SCP collections through `sitemap.xml`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:scp="https://scp-protocol.org/schemas/sitemap/1.0">

  <!-- SCP Metadata -->
  <scp:version>0.1</scp:version>
  <scp:compression>zstd,gzip</scp:compression>

  <!-- Available Sections -->
  <scp:section name="blog" updateFreq="daily" pages="~5000"/>
  <scp:section name="docs" updateFreq="weekly" pages="~200"/>

  <!-- Snapshot Collections (full state) -->
  <scp:collection section="blog" type="snapshot"
                  url="https://r2.example.com/blog-snapshot-2025-01-15.scp.gz"
                  generated="2025-01-15T00:00:00Z"
                  expires="2025-01-16T00:00:00Z"
                  pages="5247" size="52000000"/>

  <!-- Delta Collections (incremental changes) -->
  <scp:delta section="blog" period="2025-01-15"
             url="https://r2.example.com/blog-delta-2025-01-15.scp.gz"
             generated="2025-01-15T23:00:00Z"
             expires="2025-01-17T00:00:00Z"
             pages="47" size="480000"
             since="2025-01-14T00:00:00Z"/>
</urlset>
```
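
A crawler could extract these entries with Python's standard `xml.etree.ElementTree`. The sketch below assumes the namespace URI and attribute names shown in the sitemap example; the function name `list_collections` and the returned dictionary shape are illustrative choices, not part of the specification.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:scp declaration in the example sitemap.
SCP_NS = "https://scp-protocol.org/schemas/sitemap/1.0"


def list_collections(sitemap_xml: bytes, section: str) -> dict:
    """Return snapshot and delta entries (as attribute dicts) for one section."""
    root = ET.fromstring(sitemap_xml)
    snapshots = [
        dict(el.attrib)
        for el in root.iter(f"{{{SCP_NS}}}collection")
        if el.get("section") == section
    ]
    deltas = [
        dict(el.attrib)
        for el in root.iter(f"{{{SCP_NS}}}delta")
        if el.get("section") == section
    ]
    # Oldest delta first, so they can be applied in order.
    deltas.sort(key=lambda d: d.get("generated", ""))
    return {"snapshots": snapshots, "deltas": deltas}
```

Sorting by the `generated` string works here because the timestamps are ISO 8601 in UTC with a fixed format.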

## Snapshots and Deltas

### Snapshot Collections

Full section state, regenerated periodically:

- Contains all pages in the section
- Updated daily/weekly based on section updateFreq
- First crawl downloads full snapshot
- Example: `blog-snapshot-2025-01-15.scp.gz` (5,247 pages, 52 MB)

### Delta Collections

Incremental changes only:

- Contains only modified/new pages during the period
- Much smaller than snapshots (typically <1% of snapshot size)
- Subsequent crawls download deltas and merge locally
- Example: `blog-delta-2025-01-15.scp.gz` (47 pages, 480 KB)

### Crawler Workflow

**Initial Crawl**:
1. Parse `sitemap.xml`
2. Download snapshot collection: `blog-snapshot-2025-01-15.scp.gz`
3. Decompress and parse JSON Lines
4. Index all 5,247 pages

**Incremental Updates** (next day):
1. Check `sitemap.xml` for new deltas
2. Download `blog-delta-2025-01-15.scp.gz` (47 pages, 480 KB)
3. Merge delta into local index (update/add pages)
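
The merge step can be sketched as a dictionary update keyed by page URL. This is an assumption-laden illustration: the function name `merge_delta` is hypothetical, the local index is modeled as a plain `dict`, and the `modified` timestamp is used to avoid overwriting a newer local copy. How deleted pages are signalled is not covered in this README, so the sketch only updates and adds.

```python
def merge_delta(index: dict[str, dict], delta_pages: list[dict]) -> dict[str, dict]:
    """Merge delta pages into a local index keyed by URL (update existing, add new)."""
    for page in delta_pages:
        url = page["url"]
        current = index.get(url)
        # ISO 8601 UTC timestamps in a fixed format compare correctly as strings.
        if current is None or page.get("modified", "") >= current.get("modified", ""):
            index[url] = page
    return index
```

Applying deltas oldest-first keeps the index in the same state as the site's latest snapshot would produce, as long as no delta is skipped.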

**Timeline Example**:
- Day 1: Download snapshot (5,247 pages, 52 MB)
- Day 2: Download delta (47 pages, 480 KB)
- Day 3: Download delta (89 pages, 920 KB)
- Day 4: Download delta (124 pages, 1.2 MB)

**Total bandwidth**: 54.6 MB vs 208 MB traditional (4 daily full crawls) = **74% savings**
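
The totals above can be checked with a few lines of Python. The only assumption is the one the comparison already makes: a traditional full crawl costs about as much as one snapshot (52 MB) each day.

```python
snapshot_mb = 52.0
delta_mb = [0.48, 0.92, 1.2]  # days 2, 3, and 4

# SCP: one snapshot on day 1, then one delta per day.
scp_total = snapshot_mb + sum(delta_mb)

# Traditional: a full crawl on each of the 4 days.
traditional_total = snapshot_mb * 4

savings = 1 - scp_total / traditional_total
print(f"SCP: {scp_total:.1f} MB, traditional: {traditional_total:.0f} MB, savings: {savings:.0%}")
```

The savings grow with every additional day between snapshots, since each extra day adds a small delta on one side and a full crawl on the other.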

## Project Status

**Current Phase**: Specification draft complete (v0.1)

**Next Steps**:

- Ask the community to review the specification draft
- Reference implementation (Python)
- Crawler support for the [qCrawl](https://github.com/crawlcore/qcrawl) crawler
- Collection generator tools (CMS and framework plugins)

**After that**:

- Bot verification using [Web Bot Auth](https://developers.cloudflare.com/bots/reference/bot-verification/web-bot-auth/), so that only approved crawlers access site content
- Paid content access, using a model similar to [Pay Per Crawl](https://blog.cloudflare.com/introducing-pay-per-crawl/), to support a fair relationship between crawlers and content creators

## Getting Involved

- Implementers: Build collection generators and parsers (Python, Go, Rust, JavaScript)
- CMS Plugin Developers: WordPress, Drupal, Django integrations
- Crawler Developers: Crawler implementations
- Benchmarkers: Validate bandwidth savings on real websites

## Resources

- Specification: [scp_specification.md](scp_specification.md) - Technical specification (v0.1)
- License: [CC0 1.0 Universal](LICENSE) - Public Domain

## Contact

For questions and feedback:

Vasiliy Kiryanov

- https://github.com/vasiliyk
- https://x.com/vasiliykiryanov
- https://linkedin.com/in/vasiliykiryanov
