# Site Content Protocol (SCP)

A collection-based protocol that reduces wasted bandwidth, processing power, and energy by serving pre-generated snapshots and deltas.

## The Problem

Web crawlers (search engines, AI bots, aggregators) consume massive bandwidth and server resources by parsing web pages designed for human viewing.
With the explosion of AI crawlers, this traffic has become a significant cost for websites and a strain on internet infrastructure.

Sources:

- https://radar.cloudflare.com/year-in-review/2025
- https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
- https://scrapeops.io/web-scraping-playbook/web-scraping-market-report-2025/


## The Solution

SCP enables websites to serve pre-generated collections of their content in compressed JSON Lines format.

**Target Goals**:

- 50-60% bandwidth reduction for initial snapshots vs compressed HTML
- 90-95% bandwidth reduction with delta updates (after the initial download)
- 90% faster parsing than HTML/CSS/JS processing
- 90% fewer requests: one download fetches an entire site section
- Zero impact on user experience (users continue accessing the regular site)

## How It Works

Websites pre-generate compressed collections and host them on a CDN or cloud object storage:

1. The website generates `blog-snapshot-2025-01-15.scp.gz` (5,247 pages → 52 MB)
2. Uploads it to a CDN or cloud object storage
3. Declares the available content collections in `sitemap.xml`
4. A crawler downloads the entire collection in one request
5. Later, the crawler downloads the delta `blog-delta-2025-01-16.scp.gz` (47 pages → 480 KB)

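Step 1 can be sketched in a few lines of Python. This is only an illustration of the wire format, not the reference implementation; `write_snapshot` and its parameter names are hypothetical:

```python
import gzip
import json

def write_snapshot(path, collection_meta, pages):
    """Write an SCP snapshot: collection metadata on line 1, one page per line after.

    `collection_meta` is the dict that goes under the "collection" key;
    `pages` is an iterable of page dicts. Hypothetical helper, not spec API.
    """
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(json.dumps({"collection": collection_meta}, ensure_ascii=False) + "\n")
        for page in pages:
            f.write(json.dumps(page, ensure_ascii=False) + "\n")
```

Because each page is an independent line, the file can be produced incrementally without holding the whole site in memory.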

## Technical Overview

SCP uses JSON Lines (newline-delimited JSON), compressed with gzip or zstd.

### File Structure

- File extension: `.scp.gz` (gzip) or `.scp.zst` (zstd)
- Content-Type: `application/x-ndjson+gzip` or `application/x-ndjson+zstd`
- Format: one JSON object per line

```jsonl
{"collection":{"id":"blog-snapshot-2025-01-15","section":"blog","type":"snapshot","generated":"2025-01-15T00:00:00Z","version":"0.1"}}
{"url":"https://example.com/blog/post-1","title":"First Post","description":"...","modified":"2025-01-15T09:00:00Z","language":"en","content":[...]}
{"url":"https://example.com/blog/post-2","title":"Second Post","description":"...","modified":"2025-01-14T10:00:00Z","language":"en","content":[...]}
```

- Line 1: collection metadata (snapshot or delta)
- Lines 2+: individual pages

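Reading a collection back is symmetric. A minimal sketch using only the Python standard library (`read_collection` is a hypothetical helper, not part of the spec):

```python
import gzip
import json

def read_collection(path):
    """Parse a gzip-compressed SCP collection.

    Returns (collection_metadata, list_of_pages): line 1 carries the
    metadata under "collection", every following line is one page object.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return records[0]["collection"], records[1:]
```

For very large snapshots a generator over the file object would avoid building the full list; the eager version above keeps the sketch short.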
### Page Structure

Each page is a JSON object with:

```json
{
  "url": "https://example.com/blog/post-title",
  "title": "Page Title",
  "description": "Meta description for SEO",
  "author": "John Doe",
  "published": "2024-01-15T10:30:00Z",
  "modified": "2024-01-20T14:22:00Z",
  "language": "en",
  "content": [
    {"type": "heading", "level": 1, "text": "Main Heading"},
    {"type": "text", "text": "Paragraph content goes here."},
    {"type": "link", "url": "https://example.com", "text": "Link text"},
    {"type": "image", "url": "https://example.com/image.jpg", "alt": "Alt text"},
    {"type": "list", "ordered": false, "items": ["Item 1", "Item 2"]},
    {"type": "code", "language": "python", "code": "print('Hello')"},
    {"type": "table", "rows": [["Cell 1", "Cell 2"], ["Cell 3", "Cell 4"]]}
  ]
}
```

### Content Block Types

- `text`: Paragraph text
- `heading`: H1-H6 headings (level 1-6)
- `link`: Hyperlinks with optional rel attributes
- `image`: Images with alt text
- `list`: Ordered or unordered lists
- `code`: Code blocks with a language tag
- `table`: Tables (row-major array format)
- `quote`: Blockquotes with an optional citation
- `video`: Video embeds with sources, captions, transcripts
- `audio`: Audio content with metadata
- `structured`: Schema.org structured data (JSON-LD)

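A consumer might flatten these blocks to plain text for indexing. The sketch below covers only a subset of the block types, and its rendering choices (markers, separators) are illustrative assumptions, not prescribed by the spec:

```python
def block_to_text(block):
    """Flatten one SCP content block to a plain-text line (illustrative subset)."""
    t = block["type"]
    if t == "text":
        return block["text"]
    if t == "heading":
        return "#" * block["level"] + " " + block["text"]
    if t == "link":
        return "{} ({})".format(block["text"], block["url"])
    if t == "image":
        return "[image: {}]".format(block.get("alt", ""))
    if t == "list":
        marker = "1." if block.get("ordered") else "-"
        return "\n".join("{} {}".format(marker, item) for item in block["items"])
    if t == "code":
        return block["code"]
    if t == "table":
        return "\n".join(" | ".join(row) for row in block["rows"])
    return ""  # quote, video, audio, structured: omitted in this sketch
```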
## Discovery via Sitemap

Crawlers discover SCP collections through `sitemap.xml`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:scp="https://scp-protocol.org/schemas/sitemap/1.0">

  <!-- SCP Metadata -->
  <scp:version>0.1</scp:version>
  <scp:compression>zstd,gzip</scp:compression>

  <!-- Available Sections -->
  <scp:section name="blog" updateFreq="daily" pages="~5000"/>
  <scp:section name="docs" updateFreq="weekly" pages="~200"/>

  <!-- Snapshot Collections (full state) -->
  <scp:collection section="blog" type="snapshot"
      url="https://r2.example.com/blog-snapshot-2025-01-15.scp.gz"
      generated="2025-01-15T00:00:00Z"
      expires="2025-01-16T00:00:00Z"
      pages="5247" size="52000000"/>

  <!-- Delta Collections (incremental changes) -->
  <scp:delta section="blog" period="2025-01-15"
      url="https://r2.example.com/blog-delta-2025-01-15.scp.gz"
      generated="2025-01-15T23:00:00Z"
      expires="2025-01-17T00:00:00Z"
      pages="47" size="480000"
      since="2025-01-14T00:00:00Z"/>
</urlset>
```

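A crawler can extract these entries with a standard XML parser. A minimal sketch assuming only the attributes shown above (`find_collections` is a hypothetical helper):

```python
import xml.etree.ElementTree as ET

SCP_NS = "https://scp-protocol.org/schemas/sitemap/1.0"

def find_collections(sitemap_xml):
    """Return SCP snapshot and delta entries found in a sitemap.xml string."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for tag in ("collection", "delta"):
        # Namespaced element lookup: {namespace}localname
        for el in root.iter("{{{}}}{}".format(SCP_NS, tag)):
            entries.append({
                "kind": tag,
                "section": el.get("section"),
                "url": el.get("url"),
                "pages": int(el.get("pages")),
            })
    return entries
```

A real crawler would also compare `generated`/`expires` against its last crawl time to decide between the snapshot and a delta.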
## Snapshots and Deltas

### Snapshot Collections

Full section state, regenerated periodically:

- Contains all pages in the section
- Regenerated daily or weekly, per the section's `updateFreq`
- The first crawl downloads the full snapshot
- Example: `blog-snapshot-2025-01-15.scp.gz` (5,247 pages, 52 MB)

### Delta Collections

Incremental changes only:

- Contains only pages modified or added during the period
- Much smaller than snapshots (typically <1% of snapshot size)
- Subsequent crawls download deltas and merge them locally
- Example: `blog-delta-2025-01-15.scp.gz` (47 pages, 480 KB)

### Crawler Workflow

**Initial Crawl**:
1. Parse `sitemap.xml`
2. Download the snapshot collection: `blog-snapshot-2025-01-15.scp.gz`
3. Decompress and parse the JSON Lines
4. Index all 5,247 pages

**Incremental Updates** (next day):
1. Check `sitemap.xml` for new deltas
2. Download `blog-delta-2025-01-16.scp.gz` (89 pages, 920 KB)
3. Merge the delta into the local index (update/add pages)

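The merge in the incremental step can be a simple upsert keyed by URL. A sketch, assuming `modified` timestamps share the ISO 8601 UTC format shown above, so plain string comparison orders them correctly:

```python
def merge_delta(index, delta_pages):
    """Upsert delta pages into a local index dict keyed by URL.

    ISO 8601 UTC timestamps sort lexicographically, so string comparison
    on "modified" is enough to keep the newer version of each page.
    """
    for page in delta_pages:
        current = index.get(page["url"])
        if current is None or page["modified"] >= current["modified"]:
            index[page["url"]] = page
    return index
```

Deletions are not covered by this sketch; how a delta signals a removed page is a question for the full specification.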
**Timeline Example**:
- Day 1: Download snapshot (5,247 pages, 52 MB)
- Day 2: Download delta (47 pages, 480 KB)
- Day 3: Download delta (89 pages, 920 KB)
- Day 4: Download delta (124 pages, 1.2 MB)

**Total bandwidth**: 54.6 MB, vs 208 MB for four traditional daily full crawls = **74% savings**

## Project Status

**Current Phase**: Specification draft complete (v0.1)

**Next Steps**:

- Ask the community to review the specification draft
- Reference implementation (Python)
- Crawler support for the [qCrawl](https://github.com/crawlcore/qcrawl) crawler
- Collection generator tools (plugins for CMSs and frameworks)

**After that**:

- Bot verification, so that only approved crawlers access site content, using [Web Bot Auth](https://developers.cloudflare.com/bots/reference/bot-verification/web-bot-auth/)
- Paid content access, supporting a fair dynamic between crawlers and content creators, using a model similar to [Pay Per Crawl](https://blog.cloudflare.com/introducing-pay-per-crawl/)

## Getting Involved

- Implementers: Build collection generators and parsers (Python, Go, Rust, JavaScript)
- CMS Plugin Developers: WordPress, Drupal, Django integrations
- Crawler Developers: Crawler implementations
- Benchmarkers: Validate bandwidth savings on real websites


## Resources

- Specification: [scp_specification.md](scp_specification.md) - Technical specification (v0.1)
- License: [CC0 1.0 Universal](LICENSE) - Public Domain

## Contact

For questions and feedback:

Vasiliy Kiryanov

- https://github.com/vasiliyk
- https://x.com/vasiliykiryanov
- https://linkedin.com/in/vasiliykiryanov