
llms-full.txt Format

Detailed specification of the enhanced llms-full.txt structured format

The /llms-full.txt endpoint provides a comprehensive, structured format optimized for AI agents and RAG systems.

Overview

The enhanced format includes:

  • Metadata header with generation info
  • Table of contents for navigation
  • Structured page sections with clear separators
  • Individual metadata for each page
  • AI-friendly formatting for easy parsing

This format is designed to be both human-readable and machine-parsable, making it ideal for RAG systems, embeddings, and AI analysis.

Format Structure

The document follows this hierarchical structure:

================================================================================
HEADER SECTION
================================================================================
├── Metadata (date, page count, base URL)
├── Description
├── Structure explanation
└── Table of Contents

================================================================================
DOCUMENTATION CONTENT
================================================================================
├── PAGE 1
│   ├── Page metadata (title, URL, description, path)
│   ├── Content separator
│   ├── Full markdown content
│   └── End marker
├── PAGE 2
│   └── ...
└── PAGE N

================================================================================
FOOTER SECTION
================================================================================
└── Summary and access information
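Given this layout, the three top-level sections can be pulled apart by splitting on the two banner lines. A minimal sketch, assuming the `DOCUMENTATION CONTENT` and `END OF DOCUMENTATION` banners each appear exactly once between 80-character separators, as in the examples below:

```python
# Sketch: split an llms-full.txt document into its three top-level
# sections using the banner lines from the structure diagram above.
SEP = "=" * 80

def split_sections(doc: str) -> dict:
    content_banner = f"{SEP}\nDOCUMENTATION CONTENT\n{SEP}"
    footer_banner = f"{SEP}\nEND OF DOCUMENTATION\n{SEP}"
    header, rest = doc.split(content_banner, 1)
    content, footer = rest.split(footer_banner, 1)
    return {"header": header, "content": content, "footer": footer}
```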

Header Section

Metadata Block

Essential information about the documentation:

================================================================================
AI WEB FEEDS - COMPLETE DOCUMENTATION
================================================================================

METADATA
--------------------------------------------------------------------------------
Generated: 2025-10-14T12:00:00.000Z
Total Pages: 5
Base URL: https://yourdomain.com
Format: Markdown
Encoding: UTF-8

Description Block

Project overview for context:

DESCRIPTION
--------------------------------------------------------------------------------
A comprehensive collection of curated RSS/Atom feeds optimized for AI agents
and large language models. This document contains the complete documentation
for the AI Web Feeds project, including setup guides, API references, and
usage examples.

Structure Explanation

Format guide for parsers:

STRUCTURE
--------------------------------------------------------------------------------
Each page section follows this format:
  - Page separator (===)
  - Page number (X OF Y)
  - Page metadata (title, URL, description, path)
  - Content separator (---)
  - Full markdown content

Table of Contents

Complete navigation index:

NAVIGATION
--------------------------------------------------------------------------------
Table of Contents:

  1. Getting Started - /docs
  2. PDF Export - /docs/features/pdf-export
  3. AI Integration - /docs/features/ai-integration
  4. Testing Guide - /docs/guides/testing
  5. Quick Reference - /docs/guides/quick-reference

================================================================================
DOCUMENTATION CONTENT
================================================================================
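Each entry in the table of contents follows the pattern `N. Title - /path`, so the whole index can be parsed with one regex. A sketch, with the entry format taken from the example above (titles containing ` - ` would need a stricter pattern):

```python
import re

# Sketch: parse "  1. Getting Started - /docs" style TOC entries
# into (number, title, path) dicts.
TOC_ENTRY = re.compile(r'^\s*(\d+)\.\s+(.+?)\s+-\s+(\S+)\s*$', re.MULTILINE)

def parse_toc(toc_text: str) -> list:
    return [
        {"number": int(n), "title": title, "path": path}
        for n, title, path in TOC_ENTRY.findall(toc_text)
    ]
```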

Page Section Format

Each page follows a consistent structure:

================================================================================
PAGE 1 OF 5
================================================================================

TITLE: Getting Started
URL: https://yourdomain.com/docs
MARKDOWN: https://yourdomain.com/docs.mdx
DESCRIPTION: Quick start guide for AI Web Feeds
PATH: /

--------------------------------------------------------------------------------
CONTENT
--------------------------------------------------------------------------------

# Getting Started

[Full markdown content of the page...]

--------------------------------------------------------------------------------
END OF PAGE 1
--------------------------------------------------------------------------------

Page Metadata Fields

Field         Description         Example
-----------   -----------------   -------------------------------
TITLE         Page title          Getting Started
URL           Full page URL       https://yourdomain.com/docs
MARKDOWN      Markdown endpoint   https://yourdomain.com/docs.mdx
DESCRIPTION   Page description    Quick start guide...
PATH          Relative path       /
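Because every metadata field is an all-uppercase `KEY: value` line, the block can be collected into a dictionary in a few lines. A sketch (splitting on the first colon only, so URL values keep their `https://` scheme):

```python
def parse_page_metadata(block: str) -> dict:
    # Sketch: turn "TITLE: Getting Started" style lines into a dict.
    # Assumes metadata keys are all-uppercase, as in the table above.
    metadata = {}
    for line in block.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            if key.strip().isupper():
                metadata[key.strip()] = value.strip()
    return metadata
```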

Footer Section

Summary and access instructions:

================================================================================
END OF DOCUMENTATION
================================================================================

Total pages processed: 5
Generated: 2025-10-14T12:00:00.000Z
Format: Plain text with markdown content

For individual pages, append .mdx to any documentation URL.
For the discovery file, visit /llms.txt

================================================================================

Benefits for AI Agents

Clear Structure

  • Consistent separators - 80-character wide = and - lines
  • Numbered pages - PAGE X OF Y format
  • Hierarchical organization - Header → Content → Footer
  • Predictable format - Easy to parse with regex

Rich Metadata

  • Generation timestamp - Know when docs were created
  • Total page count - Plan context window usage
  • Base URL - Resolve relative links
  • Per-page metadata - Title, URL, description, path

Multiple Access Patterns

  • Complete documentation - Single request for all content
  • Table of contents - Quick overview of structure
  • Individual pages - URLs for targeted access
  • Markdown endpoints - Source content links

Parser-Friendly

  • Fixed-width separators - 80 characters for consistency
  • Clear section markers - Unmistakable boundaries
  • Predictable structure - Same format every time
  • UTF-8 encoding - Universal character support

HTTP Headers

Enhanced response headers provide additional metadata:

Content-Type: text/plain; charset=utf-8
Cache-Control: public, max-age=0, must-revalidate
X-Content-Pages: 5
X-Generated-Date: 2025-10-14T12:00:00.000Z

Custom headers allow clients to access metadata without parsing the document body.
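Since the metadata lives in headers, a HEAD request is enough to check page count or freshness without downloading the body. A sketch, assuming the server answers HEAD requests with the same headers as GET (typical, but worth verifying for your deployment):

```python
def parse_metadata_headers(headers) -> dict:
    # Pull the custom metadata out of a mapping of response headers.
    return {
        "pages": int(headers.get("X-Content-Pages", 0)),
        "generated": headers.get("X-Generated-Date"),
    }

def fetch_feed_metadata(url: str) -> dict:
    # A HEAD request returns the headers only -- no body is transferred.
    import requests  # imported here so the parsing helper stays dependency-free

    response = requests.head(url, timeout=10)
    response.raise_for_status()
    return parse_metadata_headers(response.headers)
```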

Usage Examples

RAG System Integration

import requests

# Fetch complete documentation
response = requests.get('https://yourdomain.com/llms-full.txt')
content = response.text

# Parse metadata from headers
total_pages = int(response.headers['X-Content-Pages'])
generated = response.headers['X-Generated-Date']

# Split by page separators
separator = '=' * 80 + '\nPAGE '
pages = content.split(separator)

# Extract table of contents
toc_start = content.find('Table of Contents:')
toc_end = content.find('=' * 80 + '\nDOCUMENTATION CONTENT')
toc = content[toc_start:toc_end]

# Process individual pages
for i, page in enumerate(pages[1:], 1):
    if 'TITLE:' in page:
        # Extract page metadata
        title = page.split('TITLE: ')[1].split('\n')[0]
        url = page.split('URL: ')[1].split('\n')[0]

        # Extract content (slice must start after the CONTENT marker, not at it)
        marker = 'CONTENT\n' + '-' * 80 + '\n\n'
        body_start = page.find(marker) + len(marker)
        body_end = page.find('\n\n' + '-' * 80 + '\nEND OF PAGE')
        page_body = page[body_start:body_end]

        print(f"Page {i}: {title}")

The same flow in JavaScript:

// Fetch complete documentation
const response = await fetch('https://yourdomain.com/llms-full.txt');
const content = await response.text();

// Parse metadata from headers
const totalPages = parseInt(response.headers.get('X-Content-Pages'));
const generated = response.headers.get('X-Generated-Date');

// Split by page separators
const separator = '='.repeat(80) + '\nPAGE ';
const pages = content.split(separator);

// Extract table of contents
const tocStart = content.indexOf('Table of Contents:');
const tocEnd = content.indexOf('='.repeat(80) + '\nDOCUMENTATION CONTENT');
const toc = content.substring(tocStart, tocEnd);

// Process individual pages
pages.slice(1).forEach((page, index) => {
  if (page.includes('TITLE:')) {
    // Extract page metadata
    const title = page.split('TITLE: ')[1].split('\n')[0];
    const url = page.split('URL: ')[1].split('\n')[0];

    // Extract content (slice must start after the CONTENT marker, not at it)
    const marker = 'CONTENT\n' + '-'.repeat(80) + '\n\n';
    const bodyStart = page.indexOf(marker) + marker.length;
    const bodyEnd = page.indexOf('\n\n' + '-'.repeat(80) + '\nEND OF PAGE');
    const pageContent = page.substring(bodyStart, bodyEnd);

    console.log(`Page ${index + 1}: ${title}`);
  }
});

The same operations with cURL:

# Download complete documentation
curl https://yourdomain.com/llms-full.txt -o docs.txt

# View headers
curl -I https://yourdomain.com/llms-full.txt

# Extract table of contents
curl https://yourdomain.com/llms-full.txt | \
  sed -n '/Table of Contents:/,/^===/p'

# Count pages
curl https://yourdomain.com/llms-full.txt | \
  grep -c "^PAGE [0-9]"

# Extract first page
curl https://yourdomain.com/llms-full.txt | \
  sed -n '/^PAGE 1 OF/,/^END OF PAGE 1/p'

Parsing Tips

Regular Expressions

import re

# Extract page numbers
page_pattern = r'PAGE (\d+) OF (\d+)'
matches = re.findall(page_pattern, content)

# Extract metadata fields
title_pattern = r'TITLE: (.+)'
url_pattern = r'URL: (.+)'
desc_pattern = r'DESCRIPTION: (.+)'

# Split by separators
separator_80 = r'={80}'
separator_dash = r'-{80}'

Content Extraction

import re

def extract_pages(content: str) -> list:
    """Extract individual pages from llms-full.txt"""
    pages = []

    # Find all page sections
    page_pattern = r'={80}\nPAGE (\d+) OF (\d+)\n={80}\n(.+?)(?=={80}\nPAGE |\Z)'

    for match in re.finditer(page_pattern, content, re.DOTALL):
        page_num, total, page_content = match.groups()

        # Extract metadata
        metadata = {}
        for line in page_content.split('\n'):
            if ':' in line and line.split(':')[0].isupper():
                key, value = line.split(':', 1)
                metadata[key.strip()] = value.strip()

        # Extract content
        content_match = re.search(
            r'CONTENT\n-{80}\n\n(.+?)\n\n-{80}',
            page_content,
            re.DOTALL
        )

        if content_match:
            pages.append({
                'page_number': int(page_num),
                'total_pages': int(total),
                'metadata': metadata,
                'content': content_match.group(1).strip()
            })

    return pages

Token Counting

def count_tokens_per_page(content: str) -> dict:
    """Estimate token count for each page"""
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    pages = extract_pages(content)

    token_counts = {}
    for page in pages:
        page_content = page['content']
        tokens = len(enc.encode(page_content))
        token_counts[page['metadata']['TITLE']] = tokens

    return token_counts

Comparison with Previous Format

Before Enhancement

# Page Title (url)

Content...

# Another Page (url)

Content...

Limitations:

  • No metadata header
  • No table of contents
  • Basic separators
  • No page numbers
  • No HTTP headers

After Enhancement

================================================================================
HEADER WITH METADATA
================================================================================
...
Table of Contents: [all pages]
================================================================================
PAGE 1 OF 5
================================================================================
TITLE: ...
URL: ...
MARKDOWN: ...
...

Improvements:

  • ✅ Rich metadata header
  • ✅ Complete table of contents
  • ✅ 80-character separators
  • ✅ Page numbers (X OF Y)
  • ✅ Custom HTTP headers
  • ✅ Structured format

Best Practices

For RAG Systems

  1. Parse metadata first - Get page count and base URL
  2. Use table of contents - Quick overview of structure
  3. Extract pages individually - Process one at a time
  4. Respect token limits - Use page numbers to estimate size
  5. Cache the response - Revalidate periodically

For Embeddings

  1. Chunk by pages - Natural boundaries
  2. Include metadata - Title, URL, description in embeddings
  3. Cross-reference - Use URLs for linking
  4. Update regularly - Check X-Generated-Date header
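Steps 1 and 2 above (chunk by page, carry metadata into the embedding text) can be sketched as follows; `pages` is assumed to be the output of the `extract_pages` helper from the Parsing Tips section, i.e. a list of dicts with `page_number`, `metadata`, and `content` keys:

```python
def build_embedding_chunks(pages: list) -> list:
    # Sketch: one chunk per page, with title and URL prepended so the
    # metadata travels with the text into the embedding.
    chunks = []
    for page in pages:
        meta = page["metadata"]
        header = f"{meta.get('TITLE', '')}\n{meta.get('URL', '')}\n\n"
        chunks.append({
            "id": meta.get("URL") or str(page["page_number"]),
            "text": header + page["content"],
        })
    return chunks
```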

For Analysis

  1. Validate structure - Check separator consistency
  2. Handle errors - Missing descriptions are optional
  3. Use HTTP headers - Metadata without parsing
  4. Test parsing - Verify on sample data first
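Structure validation (step 1 above) can be as simple as checking that every separator line is exactly 80 characters wide and that the page count declared in the header matches the number of page banners. A sketch:

```python
import re

def validate_structure(doc: str) -> list:
    # Sketch: collect structural problems instead of raising,
    # so callers can report them all at once.
    problems = []

    # Every separator line must be exactly 80 characters wide.
    for i, line in enumerate(doc.splitlines(), 1):
        if set(line) in ({"="}, {"-"}) and len(line) != 80:
            problems.append(f"line {i}: separator is {len(line)} chars, expected 80")

    # The declared page count must match the number of page banners.
    declared = re.search(r'^Total Pages: (\d+)$', doc, re.MULTILINE)
    banners = re.findall(r'^PAGE (\d+) OF (\d+)$', doc, re.MULTILINE)
    if declared and len(banners) != int(declared.group(1)):
        problems.append(
            f"header declares {declared.group(1)} pages, found {len(banners)} banners"
        )
    return problems
```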

Testing

Verify Format

# Download and inspect
curl https://yourdomain.com/llms-full.txt > docs.txt

# Check header
head -50 docs.txt

# Count separators (should be consistent)
grep -c "^====" docs.txt
grep -c "^----" docs.txt

# Verify page numbers
grep "^PAGE [0-9]" docs.txt

Validate Headers

# Check custom headers
curl -I https://yourdomain.com/llms-full.txt | grep "X-"

# Expected output:
# X-Content-Pages: 5
# X-Generated-Date: 2025-10-14T12:00:00.000Z