================================================================================ AI WEB FEEDS - COMPLETE DOCUMENTATION ================================================================================ METADATA -------------------------------------------------------------------------------- Generated: 2026-03-24T06:11:20.077Z Total Pages: 57 Base URL: https://ai-web-feeds.w4w.dev Format: Markdown Encoding: UTF-8 DESCRIPTION -------------------------------------------------------------------------------- A comprehensive collection of curated RSS/Atom feeds optimized for AI agents and large language models. This document contains the complete documentation for the AI Web Feeds project, including setup guides, API references, and usage examples. STRUCTURE -------------------------------------------------------------------------------- Each page section follows this format: - Page separator (===) - Page title and URL - Page metadata (description, tags, etc.) - Content separator (---) - Full markdown content NAVIGATION -------------------------------------------------------------------------------- Table of Contents: 1. Getting Started - /docs 2. Security Policy - /docs/security 3. Tags Taxonomy Visualization - /docs/taxonomy-visualization 4. Math Test - /docs/test-math 5. Components - /docs/test 6. Conventional Commits - /docs/contributing/conventional-commits 7. Development Workflow - /docs/contributing/development-workflow 8. Pre-commit Hooks - /docs/contributing/pre-commit-hooks 9. Simplified Architecture - /docs/development/architecture 10. CLI Integration in Workflows - /docs/development/cli-workflows 11. CLI Usage - /docs/development/cli 12. Contributing - /docs/development/contributing 13. Database Architecture - /docs/development/database-architecture 14. Database Enhancements - /docs/development/database-enhancements 15. Database & Storage - /docs/development/database-storage 16. Database Setup - /docs/development/database 17. Complete Database Refactoring - FINAL STATUS - /docs/development/final-status 18. Implementation Details - /docs/development/implementation 19. Overview - /docs/development 20. Pre-commit Hook Fixes - /docs/development/pre-commit-fixes 21. Python API - /docs/development/python-api 22. Python API Documentation - /docs/development/python-autodoc 23. Database & Storage Refactoring Summary - /docs/development/refactoring-summary 24. Test Infrastructure - /docs/development/testing 25. GitHub Actions Workflows - /docs/development/workflows 26. AI & LLM Integration - /docs/features/ai-integration 27. Analytics Dashboard - /docs/features/analytics 28. Data Enrichment & Analytics - /docs/features/data-enrichment 29. Entity Extraction - /docs/features/entity-extraction 30. Link Validation - /docs/features/link-validation 31. llms-full.txt Format - /docs/features/llms-full-format 32. Math Equations - /docs/features/math 33. Mermaid Diagrams - /docs/features/mermaid 34. Features Overview - /docs/features/overview 35. PDF Export - /docs/features/pdf-export 36. Platform Integrations - /docs/features/platform-integrations 37. Quality Scoring - /docs/features/quality-scoring 38. Real-Time Feed Monitoring - /docs/features/real-time-monitoring 39. AI-Powered Recommendations - /docs/features/recommendations 40. RSS Feeds - /docs/features/rss-feeds 41. Search & Discovery - /docs/features/search 42. Sentiment Analysis - /docs/features/sentiment-analysis 43. SEO & Metadata - /docs/features/seo-metadata 44. Topic Modeling - /docs/features/topic-modeling 45. Twitter/X and arXiv Integration - /docs/features/twitter-arxiv-integration 46. Analytics & Monitoring - /docs/guides/analytics 47. Data Explorer - /docs/guides/data-explorer 48. Database Quick Start - /docs/guides/database-quick-start 49. Deployment Guide - /docs/guides/deployment 50. Feed Schema Reference - /docs/guides/feed-schema 51. Getting Started - /docs/guides/getting-started 52. GitHub Infrastructure - /docs/guides/github-infrastructure 53. GitHub Setup Summary - /docs/guides/github-setup-summary 54. Quick Reference - /docs/guides/quick-reference 55. Testing Guide - /docs/guides/testing 56. Workflow Quick Reference - /docs/guides/workflow-reference 57. Visualization & Analytics - /docs/visualization/getting-started ================================================================================ DOCUMENTATION CONTENT ================================================================================ ================================================================================ PAGE 1 OF 57 ================================================================================ TITLE: Getting Started URL: https://ai-web-feeds.w4w.dev/docs MARKDOWN: https://ai-web-feeds.w4w.dev/docs.mdx DESCRIPTION: AI Web Feeds Documentation - Your comprehensive guide to PDF export and AI/LLM integration PATH: / -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Getting Started (/docs) import { Card, Cards } from "fumadocs-ui/components/card"; Welcome to the **AI Web Feeds** documentation! This site includes powerful features for both human readers and AI agents. ## 🚀 Quick Start Get up and running in minutes: ## ✨ Key Features ### 📄 PDF Export * **Automatic page discovery** - Export all documentation pages * **Clean output** - Navigation and UI elements hidden * **Interactive content** - Accordions and tabs expanded * **Batch processing** - Concurrent exports with rate limiting ### 🤖 AI & LLM Integration * **Discovery endpoint** - `/llms.txt` for AI agent discovery * **Full documentation** - `/llms-full.txt` with structured format * **Markdown extensions** - `.mdx` and `.md` for any page * **Content negotiation** - Automatic markdown for AI agents * **Page actions** - Copy markdown and AI tool integration ### 📡 RSS Feeds * **Multiple formats** - RSS 2.0, Atom 1.0, and JSON Feed * **Auto-discovery** - Feeds discoverable via metadata * **Sitewide & docs feeds** - Subscribe to all or just docs * **Hourly updates** - Fresh content with smart caching ### 🔗 Link Validation * **Automatic scanning** - Validates all documentation links * **Anchor checking** - Verifies headings and sections exist * **Component links** - Checks links in MDX components * **CI/CD integration** - Fail builds on broken links ### 🔍 SEO & Metadata * **Dynamic OG images** - Custom images for every page * **Rich metadata** - Complete SEO tags and structured data * **Social sharing** - Optimized for Twitter, LinkedIn, Slack * **AI crawlers** - Special rules for GPTBot, ClaudeBot, etc. ### 📊 Mermaid Diagrams * **Multiple diagram types** - Flowcharts, sequences, classes, ER diagrams * **Theme-aware** - Automatically adapts to light/dark mode * **Interactive** - Clickable elements and tooltips * **Simple syntax** - Markdown-like diagram definition ### 🧮 Math Equations * **KaTeX rendering** - Fast, beautiful mathematical notation * **Inline & block** - Support for both inline $x^2$ and display equations * **LaTeX syntax** - Familiar TeX/LaTeX commands * **Self-contained** - No external dependencies or fonts ### 🎯 Built With * [Next.js 15](https://nextjs.org) - Application framework * [Fumadocs](https://fumadocs.dev) - Documentation framework * [Puppeteer](https://pptr.dev) - PDF generation * [MDX](https://mdxjs.com) - Enhanced markdown ## 📚 Documentation Sections ### Features Detailed guides for each major feature: * [PDF Export](/docs/features/pdf-export) - Complete PDF export guide * [AI Integration](/docs/features/ai-integration) - Comprehensive AI/LLM integration * [llms-full.txt Format](/docs/features/llms-full-format) - Structured format specification * [RSS Feeds](/docs/features/rss-feeds) - Subscribe to documentation updates * [Link Validation](/docs/features/link-validation) - Ensure all links are correct * [SEO & Metadata](/docs/features/seo-metadata) - Rich metadata and Open Graph images * [Mermaid Diagrams](/docs/features/mermaid) - Create beautiful diagrams with simple syntax * [Math Equations](/docs/features/math) - Render beautiful equations with KaTeX ### Guides Practical how-to guides: * [Quick Reference](/docs/guides/quick-reference) - Commands and endpoints * [Testing Guide](/docs/guides/testing) - Verify your setup ## 🎨 Philosophy This documentation is designed to be: * **User-friendly** - Clear, concise, and well-organized * **Developer-friendly** - Code examples and technical details * **AI-friendly** - Structured formats and multiple access patterns * **Performance-optimized** - Static generation and smart caching ## 🔗 Quick Links ## 🤝 Contributing We welcome contributions! See our [Contributing Guide](https://github.com/wyattowalsh/ai-web-feeds/blob/main/CONTRIBUTING.md) for details. ## 📝 License This project is licensed under the MIT License. See the [LICENSE](https://github.com/wyattowalsh/ai-web-feeds/blob/main/LICENSE) file for details. -------------------------------------------------------------------------------- END OF PAGE 1 -------------------------------------------------------------------------------- ================================================================================ PAGE 2 OF 57 ================================================================================ TITLE: Security Policy URL: https://ai-web-feeds.w4w.dev/docs/security MARKDOWN: https://ai-web-feeds.w4w.dev/docs/security.mdx DESCRIPTION: Security guidelines, vulnerability reporting, and best practices for AI Web Feeds PATH: /security -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Security Policy (/docs/security) import { Callout } from "fumadocs-ui/components/callout"; import { Steps } from "fumadocs-ui/components/steps"; import { Tabs, Tab } from "fumadocs-ui/components/tabs"; ## Supported Versions We release patches for security vulnerabilities in the following versions: | Version | Supported | | ------- | --------- | | 1.x.x | ✅ Yes | | \< 1.0 | ❌ No | We recommend always using the latest stable version to ensure you have the most recent security updates. ## Reporting a Vulnerability We take the security of AI Web Feeds seriously. If you believe you have found a security vulnerability, please report it to us as described below. **Please do not report security vulnerabilities through public GitHub issues.** ### How to Report ### Use GitHub Security Advisories (Preferred) 1. Go to [github.com/wyattowalsh/ai-web-feeds/security/advisories](https://github.com/wyattowalsh/ai-web-feeds/security/advisories) 2. Click "Report a vulnerability" 3. Fill out the form with detailed information ### Or Send Secure Email * Send email to: [wyattowalsh@gmail.com](mailto:wyattowalsh@gmail.com) * Include "SECURITY" in the subject line * Provide detailed vulnerability information ### What to Include Please include the following information in your report: * **Type of issue**: buffer overflow, SQL injection, XSS, etc. * **Affected files**: Full paths of source files related to the issue * **Source location**: Tag/branch/commit or direct URL * **Configuration**: Any special configuration required to reproduce * **Reproduction steps**: Step-by-step instructions to reproduce the issue * **Proof-of-concept**: Exploit code or PoC (if possible) * **Impact assessment**: How an attacker might exploit the vulnerability The more detail you provide, the faster we can validate and fix the issue. ### Response Timeline ### Initial Acknowledgment We will acknowledge receipt of your vulnerability report **within 48 hours**. ### Detailed Response We will send a detailed response **within 7 days** indicating next steps and requesting any additional information needed. ### Progress Updates We will keep you informed of progress towards a fix and full announcement. ### Coordinated Disclosure We will coordinate with you on the timing of public disclosure. ## Disclosure Policy * We prefer to **fully remediate vulnerabilities** before public disclosure * We will **coordinate disclosure timing** with you * We will **credit you** in the security advisory (unless you prefer anonymity) * We ask that you **avoid public disclosure** until we've had time to address the issue ## Safe Harbor We support safe harbor for security researchers who: ### Act in Good Faith * Avoid privacy violations, data destruction, or service interruption * Only interact with accounts you own or have explicit permission to test ### Report Responsibly * Do not exploit security issues you discover for any reason * Report vulnerabilities as soon as you discover them ### Follow Guidelines * Respect our disclosure policy * Provide reasonable time for remediation before any public disclosure Researchers acting in good faith under these guidelines will not face legal action for security testing. ## Scope ### In Scope ✅ The following components are **in scope** for security reports: * AI Web Feeds CLI tool * AI Web Feeds web application * Feed processing and validation logic * Data schema and validation * CI/CD workflows that could impact security * API endpoints and data handling * Authentication and authorization mechanisms ### Out of Scope ❌ The following are **out of scope**: * Social engineering attacks * Physical attacks against infrastructure * Attacks requiring physical access to user devices * Denial of service attacks * Issues in third-party services or libraries (report to respective projects) * Publicly disclosed vulnerabilities (already known) ## Security Best Practices for Contributors When contributing to AI Web Feeds, follow these security best practices: ### Input Validation * Always validate and sanitize user input * Use schema validation for all external data * Implement proper type checking * Escape output for different contexts (HTML, SQL, shell, etc.) ```python from pydantic import BaseModel, HttpUrl, validator class FeedInput(BaseModel): url: HttpUrl name: str @validator('name') def validate_name(cls, v): if len(v) > 200: raise ValueError('Name too long') return v.strip() ``` ### Dependencies * Keep all dependencies up to date * Review security advisories for dependencies * Use `pip-audit` or similar tools to scan for vulnerabilities * Pin dependency versions in production ```bash # Check for vulnerabilities pip-audit # Update dependencies safely pip install --upgrade package-name ``` ### Secrets Management * **Never** commit API keys, passwords, or secrets to version control * Use environment variables for sensitive configuration * Use `.env` files (add to `.gitignore`) * Rotate secrets regularly ```python import os from dotenv import load_dotenv load_dotenv() api_key = os.getenv('API_KEY') # Never hardcode! ``` ### Code Review * All code changes require review before merging * Include security considerations in review checklist * Test for common vulnerabilities (OWASP Top 10) * Document security implications of changes **Review Checklist:** * ✅ Input validation implemented * ✅ No hardcoded secrets * ✅ Dependencies are up to date * ✅ Tests include security scenarios * ✅ Documentation updated ## Automated Security We use several automated tools to maintain security: ### Dependency Scanning * **Dependabot**: Automatically checks for vulnerable dependencies * **pip-audit**: Scans Python packages for known vulnerabilities * **npm audit**: Scans Node.js packages for security issues ### Code Analysis * **CodeQL**: Automated security scanning of code * **Ruff**: Python linter with security rules * **ESLint**: JavaScript/TypeScript security linting ### CI/CD Security * **Dependency Review**: Reviews dependency changes in PRs * **Secret Scanning**: Prevents accidental secret commits * **Security Policy Enforcement**: Automated checks for security requirements All pull requests are automatically scanned for security issues before merging. ## Security Updates Security updates are released according to severity: | Severity | Response Time | Release Type | | ------------ | -------------------- | -------------------------- | | **Critical** | Immediate | Patch version (within 24h) | | **High** | Within 7 days | Patch version | | **Medium** | Within 30 days | Minor version | | **Low** | Next planned release | Minor/Patch version | ### Security Advisories Security advisories are published at: [github.com/wyattowalsh/ai-web-feeds/security/advisories](https://github.com/wyattowalsh/ai-web-feeds/security/advisories) Subscribe to receive notifications: * Watch the repository * Enable security alerts in your GitHub settings * Subscribe to release notifications ## Common Security Scenarios ### Feed URL Validation ```python from ai_web_feeds.models import FeedSource from pydantic import HttpUrl # Always validate URLs def add_feed(url: str) -> FeedSource: # Pydantic validates URL format validated_url = HttpUrl(url) # Additional checks if validated_url.scheme not in ['http', 'https']: raise ValueError("Invalid URL scheme") return FeedSource(url=str(validated_url)) ``` ### SQL Injection Prevention ```python from sqlmodel import select, Session # ✅ Good: Using parameterized queries def get_feed_by_name(session: Session, name: str): statement = select(FeedSource).where(FeedSource.name == name) return session.exec(statement).first() # ❌ Bad: String interpolation (vulnerable to SQL injection) # def get_feed_by_name(session: Session, name: str): # query = f"SELECT * FROM feedsource WHERE name = '{name}'" # return session.exec(query) ``` ### XSS Prevention in Web UI ```tsx // ✅ Good: React automatically escapes content function FeedTitle({ title }: { title: string }) { return

{title}

; // Escaped by default } // ❌ Bad: dangerouslySetInnerHTML without sanitization // function FeedContent({ html }: { html: string }) { // return
; // } ``` ## Recognition We appreciate the security research community's efforts to responsibly disclose vulnerabilities. Contributors who report valid security issues will be: * ✅ **Credited** in the security advisory (if desired) * ✅ **Listed** in our security acknowledgments * ✅ **Recognized** in our Hall of Fame * ✅ **Eligible** for potential rewards (to be determined) Thank you for helping keep AI Web Feeds and our users safe! ## Additional Resources * [OWASP Top 10](https://owasp.org/www-project-top-ten/) * [GitHub Security Best Practices](https://docs.github.com/en/code-security) * [Python Security Best Practices](https://python.readthedocs.io/en/latest/library/security_warnings.html) * [Node.js Security Best Practices](https://nodejs.org/en/docs/guides/security/) ## Contact For general security questions (not vulnerability reports): * Open a [GitHub Discussion](https://github.com/wyattowalsh/ai-web-feeds/discussions) * Email: [wyattowalsh@gmail.com](mailto:wyattowalsh@gmail.com) -------------------------------------------------------------------------------- END OF PAGE 2 -------------------------------------------------------------------------------- ================================================================================ PAGE 3 OF 57 ================================================================================ TITLE: Tags Taxonomy Visualization URL: https://ai-web-feeds.w4w.dev/docs/taxonomy-visualization MARKDOWN: https://ai-web-feeds.w4w.dev/docs/taxonomy-visualization.mdx DESCRIPTION: Visualize the hierarchical tags ontology and taxonomy graph PATH: /taxonomy-visualization -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Tags Taxonomy Visualization (/docs/taxonomy-visualization) ## Overview AIWebFeeds provides a comprehensive **tags taxonomy** that organizes AI/ML topics into a hierarchical ontology. This system supports: * **Hierarchical relationships** (parent/child) * **Semantic relations** (depends\_on, implements, influences, etc.) * **Facet classification** (domain, task, methodology, etc.) * **Multiple visualization formats** (Mermaid, JSON graphs, DOT) ## Taxonomy Structure The taxonomy is defined in `/data/topics.yaml` and includes: * **\~100+ topics** across AI/ML domains * **4 facet groups**: conceptual, technical, contextual, communicative * **Directed relations**: depends\_on, implements, influences * **Symmetric relations**: related\_to, same\_as, contrasts\_with ### Example Topic ```yaml - id: llm label: Large Language Models facet: task facet_group: conceptual parents: [genai, nlp] relations: depends_on: [training, data] influences: [product, education] related_to: [agents, evaluation] rank_hint: 0.99 ``` ## Visualization Methods ### 1. CLI Visualization Generate Mermaid diagrams, JSON graphs, or view statistics: ```bash # Generate Mermaid diagram aiwebfeeds visualize mermaid -o taxonomy.mermaid # With options aiwebfeeds visualize mermaid \ --direction LR \ --max-depth 3 \ --facets "domain,task" \ --no-relations # Generate JSON graph for D3.js/visualization libraries aiwebfeeds visualize json -o taxonomy.json # View statistics aiwebfeeds visualize stats ``` ### 2. Python API Use the taxonomy module programmatically: ```python from ai_web_feeds.taxonomy import load_taxonomy, TaxonomyVisualizer # Load taxonomy taxonomy = load_taxonomy() # Create visualizer visualizer = TaxonomyVisualizer(taxonomy) # Generate Mermaid diagram mermaid_code = visualizer.to_mermaid( direction="TD", max_depth=3, include_relations=True ) # Get JSON graph for D3.js graph = visualizer.to_json_graph() print(f"Nodes: {len(graph['nodes'])}, Links: {len(graph['links'])}") # Get statistics stats = visualizer.get_statistics() print(f"Total topics: {stats['total_topics']}") print(f"Max depth: {stats['max_depth']}") ``` ### 3. Interactive Mermaid Diagram Below is an interactive visualization of the core AI/ML taxonomy (depth=2): ## Facet Groups Topics are organized into four facet groups with distinct visual styling:
Conceptual

Core AI/ML concepts, domains, and tasks

Technical

Infrastructure, tools, and technical components

Contextual

Industry, governance, and application domains

Communicative

Media types and communication channels

## Use Cases ### Feed Categorization Topics are used to categorize and filter RSS/Atom feeds: ```python from ai_web_feeds.taxonomy import load_taxonomy taxonomy = load_taxonomy() # Get all LLM-related topics llm_topic = taxonomy.get_topic("llm") llm_children = taxonomy.get_children("llm") # Filter feeds by topic conceptual_topics = taxonomy.get_topics_by_facet_group("conceptual") ``` ### Recommendation Systems Use the taxonomy for content recommendations: ```python # Find related topics topic = taxonomy.get_topic("llm") related = topic.relations.get("related_to", []) # Get topic dependencies dependencies = topic.relations.get("depends_on", []) ``` ### Analytics & Insights Generate insights about your feed collection: ```python visualizer = TaxonomyVisualizer(taxonomy) stats = visualizer.get_statistics() print(f"Facet distribution: {stats['facets']}") print(f"Average depth: {stats['avg_depth']:.2f}") ``` ## Advanced Features ### Filtering by Depth Visualize only top-level topics: ```python mermaid_code = visualizer.to_mermaid(max_depth=2) ``` ### Filtering by Facet Focus on specific topic types: ```python mermaid_code = visualizer.to_mermaid( filter_facets=["domain", "task"] ) ``` ### Custom Styling The Mermaid diagrams include custom CSS classes based on facet groups, which you can override in your rendering environment. ## Data Format The taxonomy follows a strict JSON Schema (see `/data/topics.schema.json`): ```json { "id": "string (kebab-case)", "label": "Human-readable name", "facet": "Category type", "facet_group": "conceptual | technical | contextual | communicative", "parents": ["parent-topic-ids"], "relations": { "depends_on": ["topic-ids"], "implements": ["topic-ids"], "influences": ["topic-ids"] }, "rank_hint": 0.0-1.0 } ``` ## Export Formats ### Mermaid Best for documentation and GitHub/GitLab READMEs. ### JSON Graph Compatible with D3.js, Cytoscape.js, and other graph visualization libraries: ```json { "nodes": [ { "id": "ai", "label": "Artificial Intelligence", "facet": "domain", "facet_group": "conceptual" } ], "links": [ { "source": "ai", "target": "ml", "type": "parent" } ] } ``` ### DOT (Graphviz) For high-quality static diagrams (requires Graphviz): ```bash # Generate DOT file python -c " from ai_web_feeds.taxonomy import load_taxonomy, TaxonomyVisualizer viz = TaxonomyVisualizer(load_taxonomy()) print(viz.to_dot()) " > taxonomy.dot # Render with Graphviz dot -Tpng taxonomy.dot -o taxonomy.png ``` ## Contributing To add or modify topics: 1. Edit `/data/topics.yaml` 2. Validate against `/data/topics.schema.json` 3. Run `aiwebfeeds validate data/topics.yaml` 4. Generate updated visualizations 5. Submit a pull request ## API Reference See the [Python API documentation](/docs/api/taxonomy) for complete details on: * `TopicNode` - Topic model * `TopicsTaxonomy` - Taxonomy container * `TaxonomyVisualizer` - Visualization generator * `load_taxonomy()` - Load from YAML * `export_mermaid()` - Export Mermaid diagram * `export_json_graph()` - Export JSON graph -------------------------------------------------------------------------------- END OF PAGE 3 -------------------------------------------------------------------------------- ================================================================================ PAGE 4 OF 57 ================================================================================ TITLE: Math Test URL: https://ai-web-feeds.w4w.dev/docs/test-math MARKDOWN: https://ai-web-feeds.w4w.dev/docs/test-math.mdx DESCRIPTION: Test page for verifying KaTeX math rendering PATH: /test-math -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Math Test (/docs/test-math) # Math Rendering Test ## Inline Math The Pythagorean theorem: $a^2 + b^2 = c^2$ Einstein's mass-energy equivalence: $E = mc^2$ ## Block Math ### Simple Equation ```math \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} ``` ### Complex Equation ```math \int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi} ``` ### Matrix ```math \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} ``` If you can see properly formatted mathematical equations above, KaTeX is working correctly! ✅ -------------------------------------------------------------------------------- END OF PAGE 4 -------------------------------------------------------------------------------- ================================================================================ PAGE 5 OF 57 ================================================================================ TITLE: Components URL: https://ai-web-feeds.w4w.dev/docs/test MARKDOWN: https://ai-web-feeds.w4w.dev/docs/test.mdx DESCRIPTION: Components PATH: /test -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Components (/docs/test) ## Code Block ```js console.log("Hello World"); ``` ## Cards -------------------------------------------------------------------------------- END OF PAGE 5 -------------------------------------------------------------------------------- ================================================================================ PAGE 6 OF 57 ================================================================================ TITLE: Conventional Commits URL: https://ai-web-feeds.w4w.dev/docs/contributing/conventional-commits MARKDOWN: https://ai-web-feeds.w4w.dev/docs/contributing/conventional-commits.mdx DESCRIPTION: Guide to using Conventional Commits specification in AI Web Feeds PATH: /contributing/conventional-commits -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Conventional Commits (/docs/contributing/conventional-commits) ## Overview AI Web Feeds uses the [Conventional Commits](https://www.conventionalcommits.org/) specification for all commit messages. This provides a structured format that enables automated changelog generation, semantic versioning, and clear project history. ## Format Each commit message consists of a **header**, optional **body**, and optional **footer**: ``` (): [optional body] [optional footer] ``` ### Header (Required) The header has a special format that includes a **type**, optional **scope**, and **subject**: ``` (): │ │ │ │ │ └─> Summary in present tense. Not capitalized. No period at end. │ │ │ └─> Scope: core|analytics|monitoring|nlp|cli|web|docs|tests|deps|ci|etc. │ └─> Type: feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert ``` **Rules:** * Maximum 100 characters * Type and subject are required * Scope is recommended but optional * Subject is lowercase, imperative mood ("add" not "added" or "adds") * No period at the end ## Commit Types | Type | Description | Changelog Section | Example | | ---------- | ---------------------------------------- | ----------------- | ------------------------------------------------- | | `feat` | New feature | Features | `feat(core): add RSS feed parser` | | `fix` | Bug fix | Bug Fixes | `fix(analytics): correct topic count calculation` | | `docs` | Documentation only | Documentation | `docs(api): update fetch endpoint examples` | | `style` | Code style/formatting (no logic change) | - | `style(core): format with ruff` | | `refactor` | Code refactoring (no feature/fix) | - | `refactor(storage): simplify query builder` | | `perf` | Performance improvement | Performance | `perf(nlp): optimize embedding generation` | | `test` | Add/update tests | - | `test(validate): add edge case coverage` | | `build` | Build system/dependencies | - | `build(deps): update pydantic to 2.5.0` | | `ci` | CI/CD changes | - | `ci(workflow): add caching for npm deps` | | `chore` | Other changes (no src/test modification) | - | `chore(release): bump version to 0.2.0` | | `revert` | Revert previous commit | - | `revert(feat): remove experimental feature` | ## Scopes Scopes indicate which part of the codebase is affected: ### Core Package Scopes * `core` - Core functionality * `models` - Data models and schemas * `storage` - Database and persistence * `load` - Feed loading and fetching * `validate` - Validation logic * `export` - Export functionality * `enrich` - Enrichment pipeline * `logger` - Logging utilities * `utils` - Utility functions * `config` - Configuration management ### Phase-Specific Scopes * `analytics` - Phase 002: Analytics & Discovery * `discovery` - Phase 002: Feed discovery * `monitoring` - Phase 003: Real-time monitoring * `realtime` - Phase 003: Real-time features * `nlp` - Phase 005: NLP/AI features * `ai` - Phase 005: AI-powered features ### Component Scopes * `cli` - Command-line interface * `web` - Web documentation site * `api` - API endpoints ### Infrastructure Scopes * `db` - Database changes * `schema` - Schema definitions * `migrations` - Database migrations * `data` - Data files (feeds.yaml, topics.yaml) ### Meta Scopes * `docs` - Documentation * `tests` - Test infrastructure * `deps` - Dependencies * `ci` - CI/CD pipeline * `tooling` - Development tools * `release` - Release management ## Examples ### Feature Addition ```bash feat(analytics): add topic trending analysis Implement z-score based trending detection for topics with configurable thresholds and time windows. Closes #123 ``` ### Bug Fix ```bash fix(load): handle malformed RSS feed dates Parse dates with lenient mode and fallback to current timestamp when feed dates are invalid or missing. Fixes #456 ``` ### Documentation ```bash docs(cli): add examples for export command Add usage examples for JSON, OPML, and CSV export formats with filtering options. ``` ### Breaking Change ```bash feat(api)!: redesign feed validation endpoint BREAKING CHANGE: The /validate endpoint now returns structured validation results instead of boolean. Update client code: Before: - GET /validate?url= → { "valid": true } After: - GET /validate?url= → { "status": "valid", "issues": [] } Closes #789 ``` ### Multiple Scopes ```bash feat(core,analytics): integrate embedding generation Add sentence-transformers support for generating feed embeddings with batch processing and caching. ``` ## Body Guidelines The body is optional but recommended for: * Complex changes requiring explanation * Breaking changes (required) * Performance impacts * Migration instructions **Format:** * Separate from header with blank line * Wrap at 100 characters * Use imperative mood * Explain "what" and "why", not "how" ## Footer Guidelines Footers are optional and used for: ### Issue References ```bash Closes #123 Fixes #456, #789 Relates to #101 ``` ### Breaking Changes ```bash BREAKING CHANGE: ``` ### Deprecations ```bash DEPRECATED: ``` ### Co-authors ```bash Co-authored-by: Name ``` ## Interactive Commits with Commitizen For interactive commit creation, use commitizen: ```bash # Initialize (one-time setup) npx commitizen init cz-conventional-changelog --save-dev --save-exact # Create commits interactively npx cz # or git cz ``` Commitizen will prompt you for: 1. Type of change 2. Scope of change 3. Short description 4. Longer description (optional) 5. Breaking changes (optional) 6. Issue references (optional) ## Tools Integration ### Pre-commit Hook Conventional commits are enforced via pre-commit hook: ```yaml # .pre-commit-config.yaml - repo: https://github.com/compilerla/conventional-pre-commit rev: v3.0.0 hooks: - id: conventional-pre-commit stages: [commit-msg] ``` ### Commitlint Validation rules are defined in `commitlint.config.js`: ```javascript module.exports = { extends: ['@commitlint/config-conventional'], rules: { 'type-enum': [2, 'always', ['feat', 'fix', 'docs', ...]], 'scope-enum': [2, 'always', ['core', 'analytics', ...]], 'subject-case': [2, 'never', ['sentence-case', 'start-case', ...]], 'header-max-length': [2, 'always', 100], }, }; ``` ### CI/CD Validation GitHub Actions validates commits on PRs: ```yaml # .github/workflows/ci.yml conventional-commits: name: Validate Conventional Commits if: github.event_name == 'pull_request' steps: - name: Validate PR commits run: | npx commitlint --from ${{ github.event.pull_request.base.sha }} \ --to ${{ github.event.pull_request.head.sha }} ``` ## Common Patterns ### Feature Development ```bash feat(scope): add new capability feat(scope): enhance existing feature feat(scope): implement X support ``` ### Bug Fixes ```bash fix(scope): correct incorrect behavior fix(scope): handle edge case in X fix(scope): prevent Y when Z ``` ### Refactoring ```bash refactor(scope): simplify X logic refactor(scope): extract Y into separate module refactor(scope): rename X to Y for clarity ``` ### Performance ```bash perf(scope): optimize X operation perf(scope): cache Y results perf(scope): reduce memory usage in Z ``` ### Documentation ```bash docs(scope): add X documentation docs(scope): update Y examples docs(scope): clarify Z behavior ``` ## Validation Test your commit message format: ```bash # Test with commitlint echo "feat(core): test message" | npx commitlint # Validate last commit npx commitlint --from HEAD~1 # Validate range npx commitlint --from HEAD~5 --to HEAD ``` ## Best Practices ### ✅ Good Commits ```bash feat(analytics): add topic clustering algorithm fix(load): handle timeout for slow RSS feeds docs(api): add authentication examples perf(nlp): optimize embedding batch processing test(validate): add schema validation edge cases ``` ### ❌ Bad Commits ```bash # Too vague fix: bug fix # Not imperative mood feat(core): Added new parser # Capitalized subject feat(core): Add new parser # Period at end feat(core): add new parser. # Missing scope (when appropriate) feat: add trending analysis # Wrong type feat(core): fix typo in README ``` ## Changelog Generation Conventional commits enable automated changelog generation: ```bash # Generate changelog npx standard-version # Preview next version npx standard-version --dry-run # First release npx standard-version --first-release ``` ## Resources * [Conventional Commits Specification](https://www.conventionalcommits.org/) * [Commitlint Documentation](https://commitlint.js.org/) * [Commitizen](https://github.com/commitizen/cz-cli) * [Standard Version](https://github.com/conventional-changelog/standard-version) ## FAQ ### Why conventional commits? 1. **Automated Changelog**: Generate release notes automatically 2. **Semantic Versioning**: Determine version bumps (major/minor/patch) 3. **Clear History**: Understand changes at a glance 4. **Better Collaboration**: Consistent format across team 5. **Tooling Integration**: Enable automation and analysis ### What if I forget the format? Use commitizen for interactive prompts: ```bash npx cz ``` Or refer to this guide! ### Can I use multiple scopes? Yes, separate with commas: ```bash feat(core,cli): add new export format ``` ### What about merge commits? Merge commits follow the same format: ```bash Merge pull request #123 from feature-branch feat(analytics): add trending detection ``` ### How do I indicate breaking changes? Three ways: 1. `!` after scope: `feat(api)!: redesign endpoint` 2. Footer: `BREAKING CHANGE: description` 3. Both (recommended for visibility) ## Support For questions or issues with conventional commits: * Check this documentation * Review [commitlint.config.js](https://github.com/wyattowalsh/ai-web-feeds/blob/main/commitlint.config.js) * Open an issue on [GitHub](https://github.com/wyattowalsh/ai-web-feeds/issues) -------------------------------------------------------------------------------- END OF PAGE 6 -------------------------------------------------------------------------------- ================================================================================ PAGE 7 OF 57 ================================================================================ TITLE: Development Workflow URL: https://ai-web-feeds.w4w.dev/docs/contributing/development-workflow MARKDOWN: https://ai-web-feeds.w4w.dev/docs/contributing/development-workflow.mdx DESCRIPTION: Complete guide to the development workflow and tooling in AI Web Feeds PATH: /contributing/development-workflow -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Development Workflow (/docs/contributing/development-workflow) ## Overview AI Web Feeds uses a modern, automated development workflow that ensures code quality, consistency, and maintainability. This guide covers the complete development process from setup to deployment. ## Quick Start ```bash # 1. Clone and setup git clone https://github.com/wyattowalsh/ai-web-feeds.git cd ai-web-feeds uv sync # 2. Install pre-commit hooks uv run pre-commit install uv run pre-commit install --hook-type commit-msg # 3. Create a feature branch git checkout -b feat/your-feature # 4. Make changes and commit git add . git commit -m "feat(scope): description" # 5. Push and create PR git push origin feat/your-feature ``` ## Development Environment ### Prerequisites * **Python 3.13+** - Core language * **Node.js 20.11+** - For web app and tooling * **uv** - Python package manager (REQUIRED - do not use pip) * **pnpm** - Node package manager (REQUIRED - do not use npm/yarn) * **Git** - Version control ### ⚠️ Package Manager Requirements **CRITICAL: You MUST use the correct package managers:** * **Python:** ONLY `uv` ✅ (NEVER `pip`, `pip install`, `python -m pip`) ❌ * **Node.js:** ONLY `pnpm` ✅ (NEVER `npm install`, `yarn`) ❌ **Why?** * `uv` is 10-100x faster than pip and correctly handles workspace dependencies * `pnpm` uses efficient disk space with symlinks and has superior monorepo support **Examples:** ✅ **CORRECT:** ```bash uv sync # Install Python dependencies uv add package # Add Python package uv run pytest # Run Python commands pnpm install # Install Node dependencies pnpm add package # Add Node package ``` ❌ **FORBIDDEN:** ```bash pip install package # NEVER npm install # NEVER yarn add package # NEVER python -m pip install # NEVER ``` ### Initial Setup ```bash # Install uv (if not already installed) curl -LsSf https://astral.sh/uv/install.sh | sh # Install pnpm (if not already installed) npm install -g pnpm # Clone repository git clone https://github.com/wyattowalsh/ai-web-feeds.git cd ai-web-feeds # Install Python dependencies uv sync # Install web dependencies cd apps/web && pnpm install # Install pre-commit hooks uv run pre-commit install uv run pre-commit install --hook-type commit-msg # Install commitlint (optional, for interactive commits) npm install -g @commitlint/cli @commitlint/config-conventional npm install -g commitizen cz-conventional-changelog ``` ## Project Structure ``` ai-web-feeds/ ├── packages/ │ └── ai_web_feeds/ # Core Python package │ ├── src/ # Source code │ │ ├── models.py # Data models │ │ ├── load.py # Feed loading │ │ ├── validate.py # Validation │ │ ├── export.py # Export functions │ │ └── ... │ └── tests/ # Test suite ├── apps/ │ ├── cli/ # Command-line interface │ └── web/ # Documentation website │ ├── app/ # Next.js app │ ├── content/docs/ # MDX documentation │ ├── components/ # React components │ └── ... ├── data/ # Data files │ ├── feeds.yaml # Feed definitions │ ├── topics.yaml # Topic taxonomy │ ├── *.schema.json # JSON schemas │ └── aiwebfeeds.db # SQLite database ├── tests/ # Integration tests └── .github/ # GitHub workflows ``` ## Development Workflow ### 1. Branch Strategy We use **GitHub Flow** with feature branches: ```bash # Main branch (protected) main # Feature branches feat/feature-name fix/bug-name docs/doc-update refactor/refactor-name ``` **Rules:** * All changes via pull requests * Feature branches from `main` * Delete branches after merge * Use descriptive branch names ### 2. Making Changes #### Python Development ```bash # Navigate to package cd packages/ai_web_feeds # Make changes to source vim src/models.py # Run tests uv run pytest tests/ # Run with coverage uv run pytest tests/ --cov=src --cov-report=term # Type check uv run mypy src/ # Lint and format uv run ruff check . uv run ruff format . ``` #### Web Development ```bash # Navigate to web app cd apps/web # Start dev server pnpm dev # Visit http://localhost:3000 # Lint and format pnpm lint pnpm prettier --write . # Type check pnpm tsc --noEmit # Build pnpm build ``` #### CLI Development ```bash # Navigate to CLI cd apps/cli # Run CLI uv run aiwebfeeds --help # Test commands uv run aiwebfeeds fetch --url https://example.com/feed uv run aiwebfeeds validate --all uv run aiwebfeeds export --format json ``` ### 3. Testing #### Unit Tests ```bash # Run all tests cd packages/ai_web_feeds uv run pytest tests/ # Run specific test file uv run pytest tests/test_models.py # Run specific test uv run pytest tests/test_models.py::test_source_model # Run with coverage uv run pytest tests/ --cov=src --cov-report=html open htmlcov/index.html ``` #### Integration Tests ```bash # Run integration tests cd tests uv run pytest tests/ # Test CLI commands cd apps/cli uv run pytest tests/ ``` #### Coverage Requirements * **Minimum:** 90% coverage * **Target:** 95%+ coverage * Enforced by CI and pre-commit hooks ### 4. Committing Changes #### Option A: Interactive (Recommended) ```bash # Stage changes git add . # Interactive commit npx cz # Follow prompts: # 1. Select type (feat, fix, docs, etc.) # 2. Enter scope (core, cli, web, etc.) # 3. Write short description # 4. Add longer description (optional) # 5. Mark breaking changes (if any) # 6. Reference issues (if any) ``` #### Option B: Manual ```bash # Stage changes git add . # Commit with conventional format git commit -m "feat(core): add RSS feed parser" # Pre-commit hooks run automatically: # ✓ Ruff (Python linting/formatting) # ✓ MyPy (type checking) # ✓ ESLint (TypeScript linting) # ✓ Prettier (code formatting) # ✓ Tests (if Python files changed) # ✓ Secrets detection # ✓ Conventional commits validation ``` #### Commit Message Format ``` (): [optional body] [optional footer] ``` **Examples:** ```bash # Feature git commit -m "feat(analytics): add topic trending analysis" # Bug fix git commit -m "fix(load): handle malformed RSS dates" # Documentation git commit -m "docs(api): update fetch examples" # Breaking change git commit -m "feat(api)!: redesign validation endpoint BREAKING CHANGE: validation response format changed" ``` See [Conventional Commits](/docs/contributing/conventional-commits) guide for details. ### 5. Pre-commit Hooks Hooks run automatically on `git commit`: * **Python:** ruff, mypy, bandit, pytest * **TypeScript:** eslint, prettier, tsc * **General:** trailing whitespace, line endings, YAML/JSON validation * **Security:** secrets detection * **Commits:** conventional commits validation **Manual run:** ```bash # Run all hooks uv run pre-commit run --all-files # Run specific hook uv run pre-commit run ruff --all-files ``` See [Pre-commit Hooks](/docs/contributing/pre-commit-hooks) guide for details. ### 6. Pushing Changes ```bash # Push to your branch git push origin feat/your-feature # First push of new branch git push -u origin feat/your-feature ``` ### 7. Creating Pull Requests #### Via GitHub UI 1. Go to [repository](https://github.com/wyattowalsh/ai-web-feeds) 2. Click "Pull requests" → "New pull request" 3. Select your branch 4. Fill out PR template 5. Request reviews #### Via GitHub CLI ```bash # Install gh (if not already) brew install gh # Authenticate gh auth login # Create PR gh pr create \ --title "feat(core): add RSS parser" \ --body "Implements RSS 2.0 parser with validation" # Create draft PR gh pr create --draft ``` #### PR Template Checklist * [ ] Tests pass locally * [ ] Coverage ≥90% * [ ] Conventional commits used * [ ] Documentation updated * [ ] Pre-commit hooks pass * [ ] No new linting warnings * [ ] Type hints added * [ ] CHANGELOG.md updated (if significant) ### 8. CI/CD Pipeline On PR creation, GitHub Actions runs: 1. **Python Linting** - Ruff, MyPy, Bandit 2. **Python Tests** - Pytest across Python 3.11-3.13, Linux/Mac/Windows 3. **Coverage Check** - Minimum 90% required 4. **TypeScript Linting** - ESLint, Prettier 5. **TypeScript Build** - Next.js build 6. **Data Validation** - Schema validation 7. **Conventional Commits** - Commit message validation **View results:** PR → Checks tab **All checks must pass** before merge. ### 9. Code Review #### For Authors * Respond to all comments * Make requested changes * Push updates to same branch * Request re-review when ready #### For Reviewers * Review within 24-48 hours * Be constructive and specific * Suggest alternatives * Approve when satisfied ### 10. Merging **Merge strategies:** * **Squash and merge** (default) - Clean history * **Rebase and merge** - Linear history * **Merge commit** - Preserve branch history **After merge:** ```bash # Switch to main git checkout main # Pull latest git pull origin main # Delete local branch git branch -d feat/your-feature # Delete remote branch (auto-deleted on GitHub) git push origin --delete feat/your-feature ``` ## Code Quality Standards ### Python * **Style:** PEP 8 via Ruff * **Type hints:** Required with strict MyPy * **Docstrings:** Google style * **Line length:** 100 characters * **Imports:** Sorted via Ruff (isort rules) * **Complexity:** Max 10 (McCabe) ### TypeScript * **Style:** Standard via ESLint * **Strict mode:** Enabled * **Formatting:** Prettier * **Line length:** 100 characters * **React:** Hooks, functional components ### Documentation * **Format:** MDX for web docs * **Location:** `apps/web/content/docs/` * **Style:** Clear, concise, with examples * **Code blocks:** With language and titles ### Testing * **Framework:** Pytest (Python), Jest (TypeScript) * **Coverage:** ≥90% required * **Style:** Descriptive test names * **Structure:** Arrange-Act-Assert * **Fixtures:** Use conftest.py ## Tools Reference ### Python Tools ```bash # Package management uv sync # Install dependencies uv add package # Add dependency uv remove package # Remove dependency # Testing uv run pytest # Run tests uv run pytest --cov # With coverage uv run pytest -v # Verbose uv run pytest -k test_name # Run specific test # Linting & formatting uv run ruff check . # Lint uv run ruff format . # Format uv run mypy src/ # Type check # Security uv run bandit -r src/ # Security scan ``` ### Web Tools ```bash # Package management pnpm install # Install dependencies pnpm add package # Add dependency pnpm remove package # Remove dependency # Development pnpm dev # Start dev server pnpm build # Production build pnpm start # Start production server # Linting & formatting pnpm lint # Lint pnpm lint --fix # Lint with auto-fix pnpm prettier --write . # Format pnpm tsc --noEmit # Type check ``` ### Git Tools ```bash # Pre-commit uv run pre-commit run --all-files # Run all hooks uv run pre-commit autoupdate # Update hooks # Commitizen npx cz # Interactive commit git cz # Alternative # Commitlint npx commitlint --from HEAD~1 # Validate last commit echo "msg" | npx commitlint # Test message ``` ## Troubleshooting ### Pre-commit Hooks Failing ```bash # Reinstall hooks uv run pre-commit uninstall uv run pre-commit install uv run pre-commit install --hook-type commit-msg # Clean and reinstall environments uv run pre-commit clean uv run pre-commit install-hooks ``` ### Tests Failing ```bash # Run in verbose mode uv run pytest -vv # Show print statements uv run pytest -s # Stop on first failure uv run pytest -x # Run last failed tests uv run pytest --lf ``` ### Type Checking Issues ```bash # Run with verbose output uv run mypy src/ --verbose # Show error codes uv run mypy src/ --show-error-codes # Ignore missing imports uv run mypy src/ --ignore-missing-imports ``` ### Build Issues ```bash # Python: Clear cache rm -rf .pytest_cache .mypy_cache .ruff_cache __pycache__ uv sync # Web: Clear cache cd apps/web rm -rf .next node_modules pnpm install pnpm build ``` ## Resources * [Contributing Guide](/docs/contributing) * [Conventional Commits](/docs/contributing/conventional-commits) * [Pre-commit Hooks](/docs/contributing/pre-commit-hooks) * [Testing Guide](/docs/contributing/testing) * [GitHub Repository](https://github.com/wyattowalsh/ai-web-feeds) ## FAQ ### How do I run the full CI pipeline locally? ```bash # Run pre-commit (close to CI) uv run pre-commit run --all-files # Run tests with coverage cd packages/ai_web_feeds uv run pytest tests/ --cov=src --cov-fail-under=90 # Build web app cd apps/web pnpm build ``` ### Can I skip pre-commit hooks? **Not recommended.** CI will still enforce all checks. If needed: ```bash git commit --no-verify ``` ### How do I update dependencies? ```bash # Python uv add package@latest # Web cd apps/web && pnpm update package ``` ### What's the release process? See [Release Process](/docs/contributing/release-process) (coming soon). ## Support Need help? * **Documentation:** Check this guide and related docs * **Issues:** [GitHub Issues](https://github.com/wyattowalsh/ai-web-feeds/issues) * **Discussions:** [GitHub Discussions](https://github.com/wyattowalsh/ai-web-feeds/discussions) * **Contact:** See [README](https://github.com/wyattowalsh/ai-web-feeds#readme) -------------------------------------------------------------------------------- END OF PAGE 7 -------------------------------------------------------------------------------- ================================================================================ PAGE 8 OF 57 ================================================================================ TITLE: Pre-commit Hooks URL: https://ai-web-feeds.w4w.dev/docs/contributing/pre-commit-hooks MARKDOWN: https://ai-web-feeds.w4w.dev/docs/contributing/pre-commit-hooks.mdx DESCRIPTION: Guide to pre-commit hooks and code quality automation in AI Web Feeds PATH: /contributing/pre-commit-hooks -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Pre-commit Hooks (/docs/contributing/pre-commit-hooks) ## Overview AI Web Feeds uses [pre-commit](https://pre-commit.com/) to automatically run code quality checks before each commit. This ensures consistent code style, catches common errors, and maintains high code quality across the project. ## Installation Pre-commit is included in the dev dependencies. Install and activate hooks: ```bash # Sync dependencies uv sync # Install pre-commit hooks uv run pre-commit install # Install commit-msg hook (for conventional commits) uv run pre-commit install --hook-type commit-msg # Verify installation ls -la .git/hooks/pre-commit ls -la .git/hooks/commit-msg ``` ## Configured Hooks ### Python - Ruff (Linting & Formatting) **Fast, comprehensive Python linter and formatter** ```yaml - repo: https://github.com/astral-sh/ruff-pre-commit hooks: - id: ruff # Linting with auto-fix - id: ruff-format # Code formatting ``` **Checks:** * Code style (PEP 8) * Import organization * Unused variables/imports * Type annotations * Security issues (bandit rules) * Complexity * And 100+ other rules **Manual run:** ```bash uv run ruff check . # Lint uv run ruff check --fix . # Lint with auto-fix uv run ruff format . # Format ``` ### Python - MyPy (Type Checking) **Static type checking for Python** ```yaml - repo: https://github.com/pre-commit/mirrors-mypy hooks: - id: mypy name: mypy (packages) files: ^packages/ ``` **Checks:** * Type consistency * Type annotations * Return type validation * Optional handling **Manual run:** ```bash cd packages/ai_web_feeds && uv run mypy src/ cd apps/cli && uv run mypy . ``` ### Python - Bandit (Security) **Security vulnerability scanner** ```yaml - repo: https://github.com/PyCQA/bandit hooks: - id: bandit args: [-c, pyproject.toml] ``` **Checks:** * SQL injection risks * Command injection * Unsafe deserialization * Hardcoded passwords * Weak cryptography **Manual run:** ```bash uv run bandit -r src/ -c pyproject.toml ``` ### TypeScript/JavaScript - ESLint **Linting for TypeScript and React code** ```yaml - repo: https://github.com/pre-commit/mirrors-eslint hooks: - id: eslint name: eslint (apps/web) files: ^apps/web/.*\.[jt]sx?$ args: [--fix, --max-warnings=0] ``` **Checks:** * TypeScript errors * React best practices * Next.js patterns * Unused variables * Import issues **Manual run:** ```bash cd apps/web && pnpm lint cd apps/web && pnpm lint --fix ``` ### TypeScript/JavaScript - Prettier **Opinionated code formatter** ```yaml - repo: https://github.com/pre-commit/mirrors-prettier hooks: - id: prettier name: prettier (apps/web) files: ^apps/web/.*\.(js|jsx|ts|tsx|json|css|scss|md|mdx)$ ``` **Formats:** * JavaScript/TypeScript * JSON * CSS/SCSS * Markdown/MDX **Manual run:** ```bash cd apps/web && pnpm prettier --write . ``` ### YAML Formatting **YAML linting and formatting** ```yaml - repo: https://github.com/macisamuele/language-formatters-pre-commit-hooks hooks: - id: pretty-format-yaml args: [--autofix, --indent, "2"] ``` **Manual run:** ```bash yamllint data/feeds.yaml ``` ### Markdown Formatting **Markdown linting and formatting** ```yaml - repo: https://github.com/executablebooks/mdformat hooks: - id: mdformat additional_dependencies: - mdformat-gfm - mdformat-black args: [--wrap, "88"] ``` **Manual run:** ```bash mdformat README.md ``` ### Spell Checking **Catch common spelling mistakes** ```yaml - repo: https://github.com/codespell-project/codespell hooks: - id: codespell args: [--ignore-words-list=crate, nd, sav, ba, als, datas, socio] ``` **Manual run:** ```bash codespell . ``` ### Shell Scripts **Shell script linting** ```yaml - repo: https://github.com/shellcheck-py/shellcheck-py hooks: - id: shellcheck args: [--severity=warning] ``` **Manual run:** ```bash shellcheck scripts/*.sh ``` ### SQL Formatting **SQL linting and formatting** ```yaml - repo: https://github.com/sqlfluff/sqlfluff hooks: - id: sqlfluff-lint args: [--dialect, sqlite] - id: sqlfluff-fix args: [--dialect, sqlite, --force] ``` **Manual run:** ```bash sqlfluff lint data/*.sql sqlfluff fix data/*.sql ``` ### Secrets Detection **Prevent committing secrets** ```yaml - repo: https://github.com/Yelp/detect-secrets hooks: - id: detect-secrets args: [--baseline, .secrets.baseline] ``` **Manual run:** ```bash uv run detect-secrets scan uv run detect-secrets audit .secrets.baseline ``` ### Conventional Commits **Enforce commit message format** ```yaml - repo: https://github.com/compilerla/conventional-pre-commit hooks: - id: conventional-pre-commit stages: [commit-msg] ``` **Manual test:** ```bash echo "feat(core): test message" | npx commitlint ``` ### General File Checks **Basic file hygiene** ```yaml - repo: https://github.com/pre-commit/pre-commit-hooks hooks: - id: trailing-whitespace - id: end-of-file-fixer - id: check-yaml - id: check-json - id: check-toml - id: check-added-large-files - id: check-merge-conflict - id: mixed-line-ending - id: detect-private-key - id: no-commit-to-branch ``` ## Local Hooks (Project-Specific) ### Python Tests ```yaml - id: pytest name: pytest (packages) entry: bash -c 'cd packages/ai_web_feeds && uv run pytest tests/ -v' files: ^packages/ai_web_feeds/(src|tests)/.*\.py$ ``` **Run tests when Python files change** ### Python Coverage Check ```yaml - id: pytest-cov name: pytest coverage (≥90%) entry: bash -c 'cd packages/ai_web_feeds && uv run pytest tests/ --cov=src --cov-fail-under=90' stages: [push] ``` **Enforces 90% coverage threshold on push** ### TypeScript Type Check ```yaml - id: tsc name: tsc (apps/web) entry: bash -c 'cd apps/web && pnpm tsc --noEmit' files: ^apps/web/.*\.[jt]sx?$ ``` **Type check TypeScript files** ### Next.js Build Check ```yaml - id: nextjs-build name: next build check entry: bash -c 'cd apps/web && pnpm build' stages: [push] ``` **Verify Next.js builds successfully on push** ### Data Assets Validation ```yaml - id: validate-data-assets name: validate data assets entry: bash -c 'cd data && uv run python validate_data_assets.py' files: ^data/(feeds|topics)\.(yaml|json|schema\.json)$ ``` **Validate feeds.yaml and topics.yaml against schemas** ## Usage ### Automatic (Default) Hooks run automatically on `git commit`: ```bash git add . git commit -m "feat(core): add new feature" # Pre-commit hooks run automatically ``` ### Manual Run Run all hooks on all files: ```bash uv run pre-commit run --all-files ``` Run specific hook: ```bash uv run pre-commit run ruff --all-files uv run pre-commit run mypy --all-files uv run pre-commit run prettier --all-files ``` Run on specific files: ```bash uv run pre-commit run --files src/models.py ``` ### Skip Hooks (Not Recommended) Skip all hooks: ```bash git commit --no-verify -m "message" # or git commit -n -m "message" ``` Skip specific hook by modifying `SKIP` env var: ```bash SKIP=pytest git commit -m "message" ``` **⚠️ Warning:** Only skip hooks when absolutely necessary. CI will still run all checks. ## Configuration ### pyproject.toml Ruff, MyPy, Pytest, and Coverage are configured in `pyproject.toml`: ```toml [tool.ruff] target-version = "py313" line-length = 100 [tool.ruff.lint] select = ["E", "F", "I", "N", "UP", "ANN", "S", "B", ...] ignore = ["ANN101", "ANN102", "S101", ...] [tool.mypy] python_version = "3.13" strict = true warn_return_any = true [tool.pytest.ini_options] testpaths = ["tests"] addopts = ["--cov", "--cov-report=term-missing"] [tool.coverage.report] fail_under = 90 ``` ### .pre-commit-config.yaml Main pre-commit configuration: ```yaml default_language_version: python: python3.13 node: 20.11.0 repos: - repo: https://github.com/astral-sh/ruff-pre-commit rev: v0.8.4 hooks: - id: ruff - id: ruff-format # ... more hooks ``` ### Update Hook Versions ```bash # Update to latest versions uv run pre-commit autoupdate # Commit the changes git add .pre-commit-config.yaml git commit -m "chore(tooling): update pre-commit hook versions" ``` ## Troubleshooting ### Hooks Not Running ```bash # Reinstall hooks uv run pre-commit uninstall uv run pre-commit install uv run pre-commit install --hook-type commit-msg ``` ### Hook Environment Issues ```bash # Clean hook environments uv run pre-commit clean # Reinstall all hook environments uv run pre-commit install-hooks ``` ### Specific Hook Failing ```bash # Run in verbose mode uv run pre-commit run --all-files --verbose # Example uv run pre-commit run mypy --all-files --verbose ``` ### Update Hook Dependencies ```bash # For Python hooks uv sync # For Node hooks cd apps/web && pnpm install ``` ### Skip Problematic Files Add to `.pre-commit-config.yaml`: ```yaml - id: hook-id exclude: ^path/to/exclude/ ``` ## CI Integration Pre-commit hooks also run in CI (`.github/workflows/ci.yml`): ```yaml - name: Run pre-commit run: | pip install pre-commit pre-commit run --all-files ``` CI runs are more comprehensive and cannot be skipped. ## Performance ### First Run First run is slow (installing hook environments): ```bash # Install all environments upfront uv run pre-commit install-hooks ``` ### Cached Runs Subsequent runs are fast (seconds): * Hooks only run on changed files * Environments are cached * Results are cached ### Optimize Large Repos ```bash # Run hooks in parallel uv run pre-commit run --all-files --verbose --parallel ``` ## Best Practices ### 1. Run Before Committing ```bash # Run all hooks on staged changes uv run pre-commit run # Or commit normally (auto-runs) git commit ``` ### 2. Fix Issues Early Don't skip hooks - fix the issues: ```bash # Auto-fix what can be fixed uv run pre-commit run --all-files # Review and fix remaining issues ``` ### 3. Keep Hooks Updated ```bash # Monthly or quarterly uv run pre-commit autoupdate ``` ### 4. Understand Each Hook Know what each hook does and why it's important. ### 5. Add Project-Specific Hooks Add local hooks for project-specific validations. ## Resources * [Pre-commit Documentation](https://pre-commit.com/) * [Supported Hooks](https://pre-commit.com/hooks.html) * [Ruff Documentation](https://docs.astral.sh/ruff/) * [MyPy Documentation](https://mypy.readthedocs.io/) * [ESLint Rules](https://eslint.org/docs/rules/) * [Prettier Options](https://prettier.io/docs/en/options.html) ## FAQ ### Why pre-commit hooks? * **Catch issues early** - Before CI, before review * **Consistent quality** - Same checks for everyone * **Fast feedback** - Seconds, not minutes * **Reduce CI load** - Less failed CI runs * **Learn best practices** - Hooks teach good patterns ### Can I customize rules? Yes! Edit configuration files: * Python: `pyproject.toml` * TypeScript: `eslint.config.mjs` * Pre-commit: `.pre-commit-config.yaml` ### What if a hook is too slow? * Run only on changed files (default) * Skip expensive hooks: `SKIP=pytest git commit` * Move slow checks to CI only: `stages: [push]` ### How do I add a new hook? 1. Find hook repo on [pre-commit.com/hooks.html](https://pre-commit.com/hooks.html) 2. Add to `.pre-commit-config.yaml` 3. Test: `uv run pre-commit run --all-files` 4. Commit configuration ### What about Windows? Pre-commit works on Windows with Git Bash or WSL. ## Support For issues with pre-commit hooks: * Check this documentation * Review [.pre-commit-config.yaml](https://github.com/wyattowalsh/ai-web-feeds/blob/main/.pre-commit-config.yaml) * Run with `--verbose` flag * Open an issue on [GitHub](https://github.com/wyattowalsh/ai-web-feeds/issues) -------------------------------------------------------------------------------- END OF PAGE 8 -------------------------------------------------------------------------------- ================================================================================ PAGE 9 OF 57 ================================================================================ TITLE: Simplified Architecture URL: https://ai-web-feeds.w4w.dev/docs/development/architecture MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/architecture.mdx DESCRIPTION: Overview of the simplified AIWebFeeds architecture with linear pipeline and modular design PATH: /development/architecture -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Simplified Architecture (/docs/development/architecture) # Simplified Architecture AIWebFeeds has been designed with a clean, linear processing pipeline that makes it easy to understand and use. ## Processing Pipeline The core workflow follows a simple, predictable pattern: ## Core Modules The project is organized into 8 primary modules: ### 1. Load (`load.py`) Handles all YAML loading and saving operations. **Functions:** * `load_feeds(path)` - Load feeds from YAML file * `load_topics(path)` - Load topics from YAML file * `save_feeds(data, path)` - Save feeds to YAML file * `save_topics(data, path)` - Save topics to YAML file ### 2. Validate (`validate.py`) Validates feeds against JSON schemas and performs additional checks. **Functions:** * `validate_feeds(data, schema_path)` - Validate feeds against schema * `validate_topics(data, schema_path)` - Validate topics against schema **Returns:** `ValidationResult` object with `.valid` boolean and `.errors` list ### 3. Enrich (`enrich.py`) Enriches feeds with metadata, quality scores, and AI-generated content. **Functions:** * `enrich_all_feeds(feeds_data)` - Enrich all feed sources * `enrich_feed_source(source)` - Enrich a single feed source ### 4. Export (`export.py`) Exports data to various formats (JSON, OPML). **Functions:** * `export_to_json(data, output_path)` - Export to JSON * `export_to_opml(data, output_path, categorized)` - Export to OPML * `export_all_formats(data, base_path, prefix)` - Export to all formats ### 5. Logger (`logger.py`) Configures structured logging with loguru. **Features:** * Colored console output * File logging with rotation * Structured log messages ### 6. Models (`models.py`) Data models using SQLModel (SQLAlchemy + Pydantic). **Main Models:** * `FeedSource` - Feed source with metadata * `Topic` - Topic with graph structure * `FeedItem` - Individual feed items * Enums: `SourceType`, `FeedFormat`, `CurationStatus`, etc. ### 7. Storage (`storage.py`) Database operations and persistence. **DatabaseManager Methods:** * `create_db_and_tables()` - Initialize database * `add_feed_source(feed_source)` - Store feed source * `get_all_feed_sources()` - Retrieve all sources * `add_topic(topic)` - Store topic ### 8. Utils (`utils.py`) Helper functions for various operations. **Features:** * Platform-specific feed URL generation * Feed discovery * URL validation * Other utilities ## CLI Usage ### Complete Pipeline Run the entire workflow with a single command: ```bash ai-web-feeds process ``` **Options:** * `--input`, `-i` - Input feeds YAML file (default: `data/feeds.yaml`) * `--output`, `-o` - Output enriched YAML file (default: `data/feeds.enriched.yaml`) * `--schema`, `-s` - JSON schema file for validation * `--database`, `-d` - Database URL (default: `sqlite:///data/aiwebfeeds.db`) * `--export/--no-export` - Export to additional formats * `--skip-validation` - Skip validation steps * `--skip-enrichment` - Skip enrichment step ### Individual Commands For granular control: ```bash # Load only ai-web-feeds load data/feeds.yaml # Validate only ai-web-feeds validate data/feeds.yaml --schema data/feeds.schema.json # Enrich only ai-web-feeds enrich data/feeds.yaml --output data/feeds.enriched.yaml # Export only ai-web-feeds export data/feeds.yaml --output-dir data --prefix feeds ``` ## Python API You can also use the core package directly in Python: ```python from ai_web_feeds import ( load_feeds, validate_feeds, enrich_all_feeds, export_all_formats, DatabaseManager, ) # Load feeds_data = load_feeds("data/feeds.yaml") # Validate result = validate_feeds(feeds_data, "data/feeds.schema.json") if not result.valid: print("Validation errors:", result.errors) # Enrich enriched_data = enrich_all_feeds(feeds_data) # Export export_all_formats(enriched_data, "output/", "feeds.enriched") # Store db = DatabaseManager("sqlite:///data/aiwebfeeds.db") db.create_db_and_tables() ``` ## Benefits 1. **Linear Flow** - Easy to understand: load → validate → enrich → export + store 2. **Modular** - Each step is independent and can be used separately 3. **Testable** - Simple functions with clear inputs/outputs 4. **Flexible** - Skip steps as needed, use CLI or Python API 5. **Clear Separation** - Core logic in package, user interface in CLI 6. **Type-Safe** - Full type annotations throughout 7. **Logged** - All operations are logged for debugging ## Data Flow ## Package Structure ``` packages/ai_web_feeds/src/ai_web_feeds/ ├── __init__.py # Public API exports ├── load.py # Load/save YAML ├── validate.py # Schema validation ├── enrich.py # Metadata enrichment ├── export.py # Format conversion ├── logger.py # Logging setup ├── models.py # Data models ├── storage.py # Database operations └── utils.py # Helper functions ``` ## Next Steps * [CLI Guide](/docs/guides/cli-usage) - Learn how to use the CLI * [Python API](/docs/reference/api) - Use the Python API * [Development](/docs/development) - Contributing to AIWebFeeds -------------------------------------------------------------------------------- END OF PAGE 9 -------------------------------------------------------------------------------- ================================================================================ PAGE 10 OF 57 ================================================================================ TITLE: CLI Integration in Workflows URL: https://ai-web-feeds.w4w.dev/docs/development/cli-workflows MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/cli-workflows.mdx DESCRIPTION: How the aiwebfeeds CLI powers our CI/CD pipeline PATH: /development/cli-workflows -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # CLI Integration in Workflows (/docs/development/cli-workflows) # CLI Integration in GitHub Actions The **aiwebfeeds CLI** is the backbone of our CI/CD pipeline. Every workflow leverages CLI commands for consistent, reliable automation. ## 🎯 Why CLI-First Workflows? ### Benefits 1. **Consistency**: Same commands in CI/CD and local development 2. **Testability**: CLI is fully tested (90%+ coverage) 3. **Maintainability**: Logic in Python, not YAML 4. **Reusability**: One command, many workflows 5. **Debugging**: Run exact CI command locally ### Anti-Pattern ❌ ```yaml # DON'T: Duplicate logic in YAML - name: Validate feeds run: | python -c "import yaml; data = yaml.safe_load(open('data/feeds.yaml'))" # ... 50 lines of shell script validation logic ``` ### Best Practice ✅ ```yaml # DO: Use CLI command - name: Validate feeds run: uv run aiwebfeeds validate --all --strict ``` *** ## 🔧 Available CLI Commands ### Validation Commands #### `validate` - Comprehensive Feed Validation **Purpose**: Validate feed data, schemas, URLs, and parsing **Workflow Usage**: ```yaml # Validate all feeds - name: Validate all feeds run: uv run aiwebfeeds validate --all # Schema validation only - name: Validate schema run: uv run aiwebfeeds validate --schema --strict # Check URL accessibility - name: Check feed URLs run: uv run aiwebfeeds validate --check-urls --timeout 30 # Validate specific feeds (for PR changes) - name: Validate changed feeds run: | CHANGED_FEEDS=$(git diff origin/main -- data/feeds.yaml | grep -oP 'url:\s*\K\S+') uv run aiwebfeeds validate --feeds $CHANGED_FEEDS ``` **Options**: * `--all` - Validate all feeds in `data/feeds.yaml` * `--schema` - Schema validation only * `--check-urls` - Test URL accessibility * `--parse-feeds` - Validate feed parsing * `--strict` - Fail on warnings * `--timeout` - Request timeout (default: 30s) * `--feeds` - Validate specific feed URLs **Exit Codes**: * `0` - All validations passed * `1` - Validation failures * `2` - Schema errors *** #### `test` - Run Test Suite **Purpose**: Execute pytest test suite with coverage **Workflow Usage**: ```yaml # Full test suite - name: Run tests run: uv run aiwebfeeds test --coverage # Quick tests only - name: Quick test run: uv run aiwebfeeds test --quick # Specific test markers - name: Unit tests run: uv run aiwebfeeds test --marker unit ``` **Options**: * `--coverage` - Generate coverage report * `--quick` - Fast tests only (no slow/integration) * `--marker` - Run specific test markers (unit, integration, e2e) * `--verbose` - Detailed output **Output**: * Creates `reports/coverage/` directory * Generates `coverage.xml` for Codecov * Exit code 1 if tests fail or coverage below 90% *** ### Analytics Commands #### `analytics` - Generate Feed Statistics **Purpose**: Calculate feed metrics and insights **Workflow Usage**: ```yaml # Generate analytics JSON - name: Generate analytics run: uv run aiwebfeeds analytics --output data/analytics.json # Display in workflow - name: Show analytics run: uv run aiwebfeeds analytics --format table # Track changes - name: Analytics diff run: | uv run aiwebfeeds analytics --output /tmp/new.json diff data/analytics.json /tmp/new.json || echo "Analytics changed" ``` **Options**: * `--output` - Save to JSON file * `--format` - Output format (table, json, yaml) * `--metrics` - Specific metrics to calculate * `--changed-feeds` - Only analyze changed feeds **Metrics**: * Total feed count * Feeds per category * Language distribution * Feed health status * Update frequency statistics *** #### `stats` - Display Feed Statistics **Purpose**: Show human-readable feed statistics **Workflow Usage**: ```yaml # Post stats as PR comment - name: Generate stats id: stats run: | STATS=$(uv run aiwebfeeds stats --format markdown) echo "stats<> $GITHUB_OUTPUT echo "$STATS" >> $GITHUB_OUTPUT echo "EOF" >> $GITHUB_OUTPUT - name: Comment PR uses: actions/github-script@v7 with: script: | github.rest.issues.createComment({ issue_number: context.issue.number, owner: context.repo.owner, repo: context.repo.repo, body: ${{ steps.stats.outputs.stats }} }) ``` **Options**: * `--format` - markdown, table, or json * `--categories` - Show per-category stats * `--trends` - Include trend analysis *** ### Export Commands #### `export` - Export Feed Data **Purpose**: Generate output in various formats **Workflow Usage**: ```yaml # Export to JSON for artifacts - name: Export feeds run: uv run aiwebfeeds export --format json --output feeds.json - name: Upload artifact uses: actions/upload-artifact@v4 with: name: feed-data path: feeds.json # Validate export - name: Export with validation run: uv run aiwebfeeds export --validate --format opml ``` **Options**: * `--format` - json, yaml, opml, csv * `--output` - Output file path * `--validate` - Validate before export * `--pretty` - Pretty-print JSON/YAML *** #### `opml` - OPML Management **Purpose**: Import/export OPML feed lists **Workflow Usage**: ```yaml # Export to OPML - name: Generate OPML run: uv run aiwebfeeds opml export --output data/all.opml # Export categorized OPML - name: Generate categorized OPML run: uv run aiwebfeeds opml export --categorized --output data/categorized.opml # Validate OPML structure - name: Validate OPML run: uv run aiwebfeeds opml validate data/all.opml # Import from OPML (for migration) - name: Import OPML run: uv run aiwebfeeds opml import feeds.opml --merge ``` **Subcommands**: * `export` - Generate OPML from feeds.yaml * `import` - Import OPML into feeds.yaml * `validate` - Validate OPML structure **Options**: * `--categorized` - Group by categories * `--validate` - Validate structure * `--merge` - Merge with existing feeds * `--fix-structure` - Auto-fix common issues *** ### Enrichment Commands #### `enrich` - Enhance Feed Metadata **Purpose**: Add/update feed metadata automatically **Workflow Usage**: ```yaml # Enrich all feeds - name: Enrich feeds run: uv run aiwebfeeds enrich --all --output data/feeds.enriched.yaml # Enrich specific feed - name: Enrich new feed run: | FEED_URL="${{ github.event.inputs.feed_url }}" uv run aiwebfeeds enrich --url "$FEED_URL" --output data/feeds.yaml # Fix schema issues - name: Fix schema run: uv run aiwebfeeds enrich --fix-schema --all # Fetch feed metadata - name: Fetch metadata run: uv run aiwebfeeds fetch --url "$FEED_URL" --metadata-only ``` **Options**: * `--all` - Enrich all feeds * `--url` - Enrich specific feed URL * `--fix-schema` - Auto-fix schema violations * `--output` - Output file * `--metadata-only` - Fetch metadata without full parsing **Enrichment Process**: 1. Fetches feed content 2. Extracts title, description, language 3. Detects feed type (RSS/Atom) 4. Validates against schema 5. Adds missing required fields 6. Updates timestamps *** ## 🔄 Workflow Patterns ### Pattern 1: Incremental Validation **Use Case**: Only validate feeds changed in PR ```yaml name: Validate Changed Feeds on: pull_request: paths: - "data/feeds.yaml" jobs: validate: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 with: fetch-depth: 0 # Need history for diff - name: Install uv uses: astral-sh/setup-uv@v5 - name: Get changed feeds id: changes run: | # Extract URLs from diff CHANGED=$(git diff origin/${{ github.base_ref }} -- data/feeds.yaml | \ grep -oP '^\+\s+url:\s*\K\S+' | \ tr '\n' ' ') echo "feeds=$CHANGED" >> $GITHUB_OUTPUT - name: Validate changed feeds if: steps.changes.outputs.feeds != '' run: uv run aiwebfeeds validate --feeds ${{ steps.changes.outputs.feeds }} ``` *** ### Pattern 2: Matrix Validation **Use Case**: Validate feeds in parallel for speed ```yaml name: Parallel Feed Validation on: push: branches: [main] jobs: prepare: runs-on: ubuntu-latest outputs: matrix: ${{ steps.feeds.outputs.matrix }} steps: - uses: actions/checkout@v4 - name: Install uv uses: astral-sh/setup-uv@v5 - name: Generate feed matrix id: feeds run: | # Extract all feed URLs into JSON array FEEDS=$(uv run python -c " import yaml, json with open('data/feeds.yaml') as f: data = yaml.safe_load(f) feeds = [item['url'] for item in data['feeds']] # Split into chunks of 10 chunks = [feeds[i:i+10] for i in range(0, len(feeds), 10)] print(json.dumps({'chunk': list(range(len(chunks)))})) ") echo "matrix=$FEEDS" >> $GITHUB_OUTPUT validate: needs: prepare runs-on: ubuntu-latest strategy: matrix: ${{ fromJson(needs.prepare.outputs.matrix) }} fail-fast: false steps: - uses: actions/checkout@v4 - name: Install uv uses: astral-sh/setup-uv@v5 - name: Validate chunk ${{ matrix.chunk }} run: | # Get feeds for this chunk FEEDS=$(uv run python -c " import yaml with open('data/feeds.yaml') as f: data = yaml.safe_load(f) feeds = [item['url'] for item in data['feeds']] chunk = feeds[${{ matrix.chunk }}*10:(${{ matrix.chunk }}+1)*10] print(' '.join(chunk)) ") uv run aiwebfeeds validate --feeds $FEEDS ``` *** ### Pattern 3: Conditional Workflow Steps **Use Case**: Run different CLI commands based on file changes ```yaml name: Smart Validation on: [pull_request] jobs: detect-changes: runs-on: ubuntu-latest outputs: feeds: ${{ steps.filter.outputs.feeds }} python: ${{ steps.filter.outputs.python }} web: ${{ steps.filter.outputs.web }} steps: - uses: actions/checkout@v4 - uses: dorny/paths-filter@v3 id: filter with: filters: | feeds: - 'data/feeds.yaml' python: - 'packages/**/*.py' - 'apps/cli/**/*.py' web: - 'apps/web/**/*' validate-feeds: needs: detect-changes if: needs.detect-changes.outputs.feeds == 'true' runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Install uv uses: astral-sh/setup-uv@v5 - name: Validate feeds run: uv run aiwebfeeds validate --all --strict test-python: needs: detect-changes if: needs.detect-changes.outputs.python == 'true' runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Install uv uses: astral-sh/setup-uv@v5 - name: Run Python tests run: uv run aiwebfeeds test --coverage test-web: needs: detect-changes if: needs.detect-changes.outputs.web == 'true' runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: pnpm/action-setup@v4 - name: Test web run: | cd apps/web pnpm install pnpm lint pnpm build ``` *** ### Pattern 4: PR Comments with CLI Output **Use Case**: Post CLI results as PR comments ```yaml name: Post Feed Stats on: pull_request: paths: - "data/feeds.yaml" jobs: stats: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Install uv uses: astral-sh/setup-uv@v5 - name: Generate stats id: stats run: | { echo 'stats<> $GITHUB_OUTPUT - name: Generate analytics id: analytics run: | { echo 'analytics<> $GITHUB_OUTPUT - name: Comment PR uses: actions/github-script@v7 with: script: | const stats = `${{ steps.stats.outputs.stats }}`; const analytics = `${{ steps.analytics.outputs.analytics }}`; const body = `## 📊 Feed Statistics ${stats} ## 📈 Analytics \`\`\` ${analytics} \`\`\` `; github.rest.issues.createComment({ issue_number: context.issue.number, owner: context.repo.owner, repo: context.repo.repo, body: body }); ``` *** ### Pattern 5: Workflow Artifacts **Use Case**: Save CLI output as downloadable artifacts ```yaml name: Generate Feed Reports on: schedule: - cron: "0 0 * * 0" # Weekly on Sunday jobs: reports: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Install uv uses: astral-sh/setup-uv@v5 - name: Generate reports run: | mkdir -p reports # Analytics report uv run aiwebfeeds analytics --output reports/analytics.json # Export feeds uv run aiwebfeeds export --format json --output reports/feeds.json # OPML export uv run aiwebfeeds opml export --output reports/feeds.opml uv run aiwebfeeds opml export --categorized --output reports/feeds-categorized.opml # Validation report uv run aiwebfeeds validate --all > reports/validation.txt || true # Stats uv run aiwebfeeds stats --format markdown > reports/stats.md - name: Upload reports uses: actions/upload-artifact@v4 with: name: weekly-reports path: reports/ retention-days: 90 ``` *** ## 🎨 Custom CLI Commands for Workflows You can add workflow-specific CLI commands: ### Example: `workflow-report` Command **File**: `apps/cli/ai_web_feeds/cli/commands/workflow.py` ```python import typer from rich.console import Console from rich.table import Table app = typer.Typer() console = Console() @app.command() def report( pr_number: int = typer.Option(..., help="PR number"), format: str = typer.Option("markdown", help="Output format") ) -> None: """Generate workflow report for PR.""" from ai_web_feeds.analytics import calculate_metrics from ai_web_feeds.storage import get_changed_feeds changed = get_changed_feeds(pr_number) metrics = calculate_metrics(changed) if format == "markdown": console.print(f"## Changed Feeds: {len(changed)}") console.print(f"**Categories**: {', '.join(metrics['categories'])}") console.print(f"**Languages**: {', '.join(metrics['languages'])}") elif format == "json": import json console.print(json.dumps(metrics, indent=2)) ``` **Workflow Usage**: ```yaml - name: Generate PR report run: uv run aiwebfeeds workflow report --pr-number ${{ github.event.number }} ``` *** ## 🐛 Debugging CLI in Workflows ### Enable Verbose Output ```yaml - name: Validate with debug run: uv run aiwebfeeds validate --all --verbose env: AIWEBFEEDS_LOG_LEVEL: DEBUG ``` ### Capture Logs ```yaml - name: Validate and save logs run: | uv run aiwebfeeds validate --all --verbose 2>&1 | tee validation.log - name: Upload logs if: failure() uses: actions/upload-artifact@v4 with: name: validation-logs path: validation.log ``` ### Test CLI Locally ```bash # Run exact command from workflow uv run aiwebfeeds validate --all --strict # With environment variables AIWEBFEEDS_LOG_LEVEL=DEBUG uv run aiwebfeeds validate --all ``` *** ## 📊 Monitoring & Metrics ### Track CLI Command Usage Add telemetry to CLI commands: ```python # In CLI command import time from loguru import logger start = time.time() # ... command logic ... duration = time.time() - start logger.info(f"Command completed in {duration:.2f}s") # In workflow - name: Track validation time run: | START=$(date +%s) uv run aiwebfeeds validate --all END=$(date +%s) DURATION=$((END - START)) echo "validation_duration=$DURATION" >> $GITHUB_OUTPUT ``` ### Workflow Performance ```yaml name: Performance Tracking on: [push] jobs: benchmark: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Install uv uses: astral-sh/setup-uv@v5 - name: Benchmark CLI commands run: | echo "## CLI Performance" > benchmark.md time_command() { START=$(date +%s.%N) $1 END=$(date +%s.%N) DURATION=$(echo "$END - $START" | bc) echo "- $1: ${DURATION}s" >> benchmark.md } time_command "uv run aiwebfeeds validate --schema" time_command "uv run aiwebfeeds analytics" time_command "uv run aiwebfeeds export --format json" cat benchmark.md ``` *** ## 📚 Related Documentation * [GitHub Actions Workflows](/docs/development/workflows) - Complete workflow reference * [CLI Commands](/docs/development/cli) - Full CLI documentation * [Testing](/docs/development/testing) - Testing guide * [Contributing](/docs/development/contributing) - Contribution workflow *** *Last Updated: October 2025* -------------------------------------------------------------------------------- END OF PAGE 10 -------------------------------------------------------------------------------- ================================================================================ PAGE 11 OF 57 ================================================================================ TITLE: CLI Usage URL: https://ai-web-feeds.w4w.dev/docs/development/cli MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/cli.mdx DESCRIPTION: Command-line interface for managing feeds PATH: /development/cli -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # CLI Usage (/docs/development/cli) # CLI Usage The `aiwebfeeds` CLI provides commands for enrichment, OPML generation, and statistics. ## Installation ```bash # From project root uv sync uv pip install -e apps/cli ``` ## Quick Start ```bash # 1. Enrich feeds from feeds.yaml uv run aiwebfeeds enrich all # 2. Generate OPML files uv run aiwebfeeds opml all uv run aiwebfeeds opml categorized # 3. View statistics uv run aiwebfeeds stats show # 4. Generate filtered OPML uv run aiwebfeeds opml filtered data/nlp-feeds.opml --topic nlp --verified ``` ## Commands ### `enrich` - Enrich Feed Data Enrich feeds with metadata, discover feed URLs, validate formats, and save to database. ```bash # Enrich all feeds uv run aiwebfeeds enrich all # Custom paths uv run aiwebfeeds enrich all \ --input data/feeds.yaml \ --output data/feeds.enriched.yaml \ --schema data/feeds.enriched.schema.json \ --database sqlite:///data/aiwebfeeds.db # Preview enrichment for one feed uv run aiwebfeeds enrich one ``` **What it does:** * Discovers feed URLs from site URLs (if `discover: true`) * Detects feed format (RSS, Atom, JSONFeed) * Validates feed accessibility * Saves to: * `feeds.enriched.yaml` - Enriched YAML with all metadata * `feeds.enriched.schema.json` - JSON schema for validation * `aiwebfeeds.db` - SQLite database ### `opml` - Generate OPML Files Generate OPML files for feed readers. ```bash # All feeds (flat list) uv run aiwebfeeds opml all --output data/all.opml # Categorized by source type uv run aiwebfeeds opml categorized --output data/categorized.opml # Filtered OPML uv run aiwebfeeds opml filtered [OPTIONS] ``` **Filter Options:** * `--topic, -t` - Filter by topic (e.g., nlp, mlops) * `--type, -T` - Filter by source type (e.g., blog, podcast) * `--tag, -g` - Filter by tag (e.g., official, community) * `--verified, -v` - Only include verified feeds **Examples:** ```bash # NLP-related feeds only uv run aiwebfeeds opml filtered data/nlp.opml --topic nlp # Official blogs uv run aiwebfeeds opml filtered data/official-blogs.opml \ --type blog \ --tag official # Verified ML podcasts uv run aiwebfeeds opml filtered data/ml-podcasts.opml \ --topic ml \ --type podcast \ --verified ``` ### `stats` - View Statistics Display feed statistics and summaries. ```bash uv run aiwebfeeds stats show ``` **Example output:** ``` 📊 Feed Statistics ══════════════════════════════════════════════════ Total Feeds: 150 Verified: 120 (80.0%) By Source Type: blog : 45 preprint : 30 podcast : 20 organization : 15 newsletter : 12 video : 10 aggregator : 8 journal : 5 docs : 3 forum : 2 ══════════════════════════════════════════════════ ``` ### `export` - Export Data Export feed data in various formats (coming soon). ```bash uv run aiwebfeeds export json # Export as JSON uv run aiwebfeeds export csv # Export as CSV ``` ### `validate` - Validate Data Validate feed data against schemas (coming soon). ```bash uv run aiwebfeeds validate # Validate feeds.yaml ``` ## Workflows ### Initial Setup ```bash # 1. Create or edit data/feeds.yaml with your feed sources # 2. Enrich the feeds uv run aiwebfeeds enrich all # 3. Generate OPML files for your feed reader uv run aiwebfeeds opml all uv run aiwebfeeds opml categorized # 4. Check the results uv run aiwebfeeds stats show ``` ### Adding New Feeds ```bash # 1. Add feed entries to data/feeds.yaml # 2. Re-enrich uv run aiwebfeeds enrich all # 3. Regenerate OPML files uv run aiwebfeeds opml all uv run aiwebfeeds opml categorized ``` ### Creating Custom Feed Collections ```bash # Create topic-specific OPML files uv run aiwebfeeds opml filtered data/nlp.opml --topic nlp uv run aiwebfeeds opml filtered data/mlops.opml --topic mlops uv run aiwebfeeds opml filtered data/research.opml --topic research # Create type-specific collections uv run aiwebfeeds opml filtered data/podcasts.opml --type podcast uv run aiwebfeeds opml filtered data/blogs.opml --type blog # Verified feeds only uv run aiwebfeeds opml filtered data/verified.opml --verified # Combine filters for precise collections uv run aiwebfeeds opml filtered data/verified-nlp-blogs.opml \ --topic nlp \ --type blog \ --verified ``` ## Configuration ### Environment Variables ```bash # Database location export AIWF_DATABASE_URL=sqlite:///data/aiwebfeeds.db # Logging export AIWF_LOGGING__LEVEL=INFO export AIWF_LOGGING__FILE=True export AIWF_LOGGING__FILE_PATH=logs/aiwebfeeds.log ``` ### Default File Locations * Input: `data/feeds.yaml` * Output: `data/feeds.enriched.yaml` * Schema: `data/feeds.enriched.schema.json` * Database: `data/aiwebfeeds.db` * OPML: `data/*.opml` Override with command options (`--input`, `--output`, `--database`, etc.) ## Help Get help for any command: ```bash # General help uv run aiwebfeeds --help # Command-specific help uv run aiwebfeeds enrich --help uv run aiwebfeeds opml --help uv run aiwebfeeds opml filtered --help ``` -------------------------------------------------------------------------------- END OF PAGE 11 -------------------------------------------------------------------------------- ================================================================================ PAGE 12 OF 57 ================================================================================ TITLE: Contributing URL: https://ai-web-feeds.w4w.dev/docs/development/contributing MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/contributing.mdx DESCRIPTION: How to contribute to AI Web Feeds PATH: /development/contributing -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Contributing (/docs/development/contributing) # Contributing Thank you for your interest in contributing to AI Web Feeds! This guide will help you get started. ## Development Setup ### Prerequisites * Python 3.13+ * [uv](https://github.com/astral-sh/uv) - Fast Python package installer * Git ### Clone and Install ```bash # Clone the repository git clone https://github.com/wyattowalsh/ai-web-feeds.git cd ai-web-feeds # Install dependencies uv sync uv pip install -e apps/cli ``` ### Run Tests ```bash # Run all tests uv run pytest # Run with coverage uv run pytest --cov=ai_web_feeds # Run specific test file uv run pytest tests/packages/ai_web_feeds/test_models.py ``` ## Project Structure ``` ai-web-feeds/ ├── packages/ai_web_feeds/ # Core library │ ├── src/ai_web_feeds/ │ │ ├── models.py # SQLModel database models │ │ ├── storage.py # Database operations │ │ ├── utils.py # Utilities (enrichment, OPML, schema) │ │ ├── config.py # Configuration │ │ └── logger.py # Logging setup │ └── pyproject.toml │ ├── apps/cli/ # CLI application │ ├── ai_web_feeds/cli/ │ │ ├── __init__.py # Main CLI app │ │ └── commands/ # CLI commands │ │ ├── enrich.py │ │ ├── opml.py │ │ ├── stats.py │ │ ├── export.py │ │ └── validate.py │ └── pyproject.toml │ ├── apps/web/ # Fumadocs website │ └── content/docs/ # Documentation │ ├── data/ # Feed data │ ├── feeds.yaml # Source feed definitions │ ├── feeds.enriched.yaml # Enriched feeds │ └── *.opml # Generated OPML files │ └── pyproject.toml # Workspace root ``` ## Key Features Implementation ### ✅ Implemented * [x] SQLModel database layer with migrations * [x] Feed enrichment pipeline * [x] OPML generation (all, categorized, filtered) * [x] Schema generation * [x] CLI interface with Typer * [x] Statistics display ### 🚧 In Progress / TODO * [ ] Feed item extraction from RSS/Atom/JSONFeed * [ ] Fetch logging implementation * [ ] Complete export commands (JSON, CSV) * [ ] Schema validation commands * [ ] Topics loading from YAML * [ ] Unit tests for all modules * [ ] Integration tests * [ ] CI/CD pipeline ## Contributing Guidelines ### Code Style We follow PEP 8 with some modifications: * Line length: 88 characters (Black default) * Use type hints for all functions * Docstrings for all public functions/classes * Import sorting with isort ```bash # Format code uv run black packages/ai_web_feeds apps/cli # Sort imports uv run isort packages/ai_web_feeds apps/cli # Type checking uv run mypy packages/ai_web_feeds ``` ### Commit Messages Follow [Conventional Commits](https://www.conventionalcommits.org/): ``` feat: add feed item extraction fix: correct OPML XML escaping docs: update CLI usage guide test: add tests for storage module chore: update dependencies ``` ### Pull Request Process 1. **Fork the repository** and create a feature branch: ```bash git checkout -b feat/your-feature-name ``` 2. **Make your changes** with clear, focused commits 3. **Add tests** for new functionality 4. **Update documentation** if needed 5. **Run tests and linting**: ```bash uv run pytest uv run black --check . uv run isort --check . ``` 6. **Submit a pull request** with: * Clear description of changes * Link to related issues * Screenshots/examples if applicable ### Adding New Features #### Adding a CLI Command 1. Create command file in `apps/cli/ai_web_feeds/cli/commands/` 2. Define Typer app and commands 3. Import and register in `__init__.py` Example: ```python # apps/cli/ai_web_feeds/cli/commands/mycommand.py import typer app = typer.Typer(help="My new command") @app.command() def run(): """Run my command.""" typer.echo("Hello from my command!") ``` ```python # apps/cli/ai_web_feeds/cli/__init__.py from ai_web_feeds.cli.commands import mycommand # ... app.add_typer(mycommand.app, name="mycommand") ``` #### Adding Database Models 1. Define SQLModel in `packages/ai_web_feeds/src/ai_web_feeds/models.py` 2. Add relationships if needed 3. Update `DatabaseManager` with new operations 4. Create Alembic migration Example: ```python class NewTable(SQLModel, table=True): __tablename__ = "new_table" id: UUID = SQLField(default_factory=uuid4, primary_key=True) name: str = SQLField(description="Name field") # ... other fields ``` ```bash # Create migration cd packages/ai_web_feeds alembic revision --autogenerate -m "Add new_table" alembic upgrade head ``` ## Testing ### Writing Tests Place tests in the `tests/` directory mirroring the source structure: ``` tests/ ├── packages/ │ └── ai_web_feeds/ │ ├── test_models.py │ ├── test_storage.py │ └── test_utils.py └── apps/ └── cli/ └── test_commands.py ``` Example test: ```python import pytest from ai_web_feeds.models import FeedSource, SourceType def test_feed_source_creation(): feed = FeedSource( id="test-feed", title="Test Feed", source_type=SourceType.BLOG, ) assert feed.id == "test-feed" assert feed.source_type == SourceType.BLOG ``` ### Test Database Use SQLite in-memory for tests: ```python @pytest.fixture def test_db(): db = DatabaseManager("sqlite:///:memory:") db.create_db_and_tables() yield db ``` ## Documentation Documentation is built with Fumadocs and lives in `apps/web/content/docs/`. ### Adding Documentation 1. Create `.mdx` file in appropriate section 2. Update `meta.json` to include new page 3. Use frontmatter for metadata: ```mdx --- title: Page Title description: Page description for SEO --- # Page Title Content here... ``` ### Local Development ```bash cd apps/web pnpm install pnpm dev ``` Visit [http://localhost:3000/docs](http://localhost:3000/docs) ## Getting Help * **Issues:** [GitHub Issues](https://github.com/wyattowalsh/ai-web-feeds/issues) * **Discussions:** [GitHub Discussions](https://github.com/wyattowalsh/ai-web-feeds/discussions) ## License By contributing, you agree that your contributions will be licensed under the same license as the project. -------------------------------------------------------------------------------- END OF PAGE 12 -------------------------------------------------------------------------------- ================================================================================ PAGE 13 OF 57 ================================================================================ TITLE: Database Architecture URL: https://ai-web-feeds.w4w.dev/docs/development/database-architecture MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/database-architecture.mdx DESCRIPTION: Comprehensive database implementation using SQLModel and Alembic PATH: /development/database-architecture -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Database Architecture (/docs/development/database-architecture) # Database Architecture AI Web Feeds uses a robust database implementation with SQLModel (SQLAlchemy + Pydantic) and Alembic for migrations. ## Architecture Overview The database implementation has been organized and enhanced with: ### 1. Organized Analytics Subpackage ``` ai_web_feeds/analytics/ ├── __init__.py # Package exports ├── core.py # Core analytics (FeedAnalytics) └── advanced.py # ML-powered advanced analytics ``` **Core Analytics** (`analytics/core.py`): * Feed statistics and distributions * Quality metrics * Content analysis * Publishing trends * Health reports * Anomaly detection * Benchmarking **Advanced Analytics** (`analytics/advanced.py`): * Predictive feed health modeling * Content similarity and clustering * ML-powered pattern detection * Topic relationship analysis * Recommendation engine ### 2. Database Models **Core Models** (`models.py`): * `FeedSource` - Feed metadata and configuration * `FeedItem` - Individual feed entries * `FeedFetchLog` - Fetch attempt history * `Topic` - Topic taxonomy **Advanced Models** (`models_advanced.py`): * `FeedValidationHistory` - Validation tracking over time * `FeedHealthMetric` - Health scores and metrics * `DataQualityMetric` - Multi-dimensional quality tracking * `ContentEmbedding` - Semantic search embeddings * `TopicRelationship` - Computed topic associations * `UserFeedPreference` - User interactions and preferences * `AnalyticsCacheEntry` - Computed analytics caching ### 3. Data Synchronization Robust ETL pipeline for YAML ↔ Database (`data_sync.py`): * **FeedDataLoader**: Load `feeds.yaml` → Database * **TopicDataLoader**: Load `topics.yaml` → Database * **DataExporter**: Export Database → `feeds.enriched.yaml` * **DataSyncOrchestrator**: Full bidirectional sync Features: * Upsert operations (insert or update) * Batch processing * Progress tracking * Error handling with optional skip * Schema validation * Stable ID generation from URLs ### 4. Database Migrations (Alembic) Location: `packages/ai_web_feeds/alembic/` Initialize Alembic: ```bash cd packages/ai_web_feeds uv run alembic init alembic ``` Create migration: ```bash uv run alembic revision --autogenerate -m "description" ``` Apply migrations: ```bash uv run alembic upgrade head ``` ## Database Schema ### Core Tables #### `feed_sources` Table Core feed metadata and configuration: * **Core fields:** `id`, `feed`, `site`, `title` * **Classification:** `source_type`, `mediums`, `tags` * **Topics:** `topics`, `topic_weights` * **Metadata:** `language`, `format`, `updated`, `last_validated`, `verified`, `contributor` * **Curation:** `curation_status`, `curation_since`, `curation_by`, `quality_score`, `curation_notes` * **Provenance:** `provenance_source`, `provenance_from`, `provenance_license` * **Discovery:** `discover_enabled`, `discover_config` * **Relations:** `relations`, `mappings` (JSON fields) #### `feed_items` Table Individual feed entries: * **Identifiers:** `id` (UUID), `feed_source_id` (foreign key) * **Content:** `title`, `link`, `description`, `content`, `author` * **Timestamps:** `published`, `updated`, `created_at`, `updated_at` * **Metadata:** `guid`, `categories`, `tags`, `enclosures`, `extra_data` #### `feed_fetch_logs` Table Fetch attempt tracking: * **Fetch info:** `fetched_at`, `fetch_url`, `success` * **Response:** `status_code`, `content_type`, `content_length`, `etag`, `last_modified` * **Errors:** `error_message`, `error_type` * **Stats:** `items_found`, `items_new`, `items_updated`, `fetch_duration_ms` * **Data:** `response_headers`, `extra_data` (JSON fields) #### `topics` Table Topic definitions: * **Core:** `id`, `name`, `description`, `parent_id` * **Metadata:** `aliases`, `related_topics` * **Timestamps:** `created_at`, `updated_at` ### Advanced Tables #### `feed_validation_history` Tracks validation attempts over time: * Validation timestamp and status * Schema version used * Validation errors (JSON) * Environment context #### `feed_health_metrics` Monitors feed health with component scores: * Overall health score * Availability score * Freshness score * Content quality score * Reliability score #### `data_quality_metrics` Multi-dimensional quality tracking: * Quality dimension (completeness, accuracy, consistency, timeliness, uniqueness, validity) * Quality score and threshold * Record counts (total vs. valid) * Improvement suggestions #### `content_embeddings` Store embeddings for semantic search: * Embedding vector (JSON array) * Model name and version * Dimension count * Computation metadata #### `topic_relationships` Computed topic associations: * Source and target topics * Relationship type (parent, related, similar, prerequisite, inverse) * Strength score (0.0-1.0) * Computation method #### `user_feed_preferences` User interactions and preferences: * User and feed identifiers * Preference type (subscription, bookmark, like, hide, report) * Preference value (JSON) * Creation and update timestamps #### `analytics_cache_entries` Cache expensive analytics computations: * Cache key and value (JSON) * Computation timestamp * TTL (seconds) * Hit count * Metadata ### Indexes All tables include appropriate indexes for performance: * **Time-based queries**: `created_at`, `updated_at`, `calculated_at` * **Status filtering**: `validation_status`, `health_status`, `is_valid` * **Feed lookups**: `feed_source_id`, `feed_item_id` * **Relationships**: Foreign key indexes * **Compound indexes**: Multi-column for complex queries ## Performance Considerations ### SQLite Optimizations 1. Batch inserts for bulk operations 2. `render_as_batch=True` for ALTER TABLE support 3. Connection pooling disabled (NullPool) for SQLite ### Caching * `AnalyticsCacheEntry` for expensive computations * TTL-based expiration * Hit tracking for cache effectiveness ### Future: Materialized Views * Topic relationship matrices * Feed similarity scores * Aggregated statistics ## Data Quality The enhanced system includes comprehensive quality tracking: ### Quality Dimensions 1. **Completeness**: Are required fields populated? 2. **Accuracy**: Are values correct and valid? 3. **Consistency**: Are values consistent across records? 4. **Timeliness**: Are records up-to-date? 5. **Uniqueness**: Are there duplicates? 6. **Validity**: Do values conform to schemas? ### Quality Metrics ```python from ai_web_feeds.models_advanced import DataQualityMetric, QualityDimension # Track quality metric metric = DataQualityMetric( feed_source_id="feed_xyz", dimension=QualityDimension.COMPLETENESS, quality_score=0.95, threshold=0.9, meets_threshold=True, total_records=100, valid_records=95, ) ``` ## Best Practices 1. **Always use context managers** for database sessions 2. **Batch operations** for bulk inserts/updates 3. **Validate data** before database operations 4. **Use transactions** for multi-step operations 5. **Index frequently queried fields** 6. **Monitor query performance** using `echo=True` during development 7. **Cache expensive analytics** using `AnalyticsCacheEntry` 8. **Regular backups** of `aiwebfeeds.db` ## Future Enhancements * [ ] PostgreSQL support for production deployments * [ ] Vector database integration (pgvector) for embeddings * [ ] Real-time analytics streaming * [ ] Distributed caching (Redis) * [ ] GraphQL API for database access * [ ] Automated data quality reporting * [ ] ML model versioning and tracking * [ ] Time-series optimizations for metrics ## Related Documentation * [Database Quick Start](/docs/guides/database-quick-start) - Get started quickly * [Database Enhancements](/docs/development/database-enhancements) - What was added and why * [Python API](/docs/development/python-api) - Using the database API * [Testing](/docs/development/testing) - Database testing guidelines *** **Version**: 0.1.0 **Last Updated**: October 15, 2025 -------------------------------------------------------------------------------- END OF PAGE 13 -------------------------------------------------------------------------------- ================================================================================ PAGE 14 OF 57 ================================================================================ TITLE: Database Enhancements URL: https://ai-web-feeds.w4w.dev/docs/development/database-enhancements MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/database-enhancements.mdx DESCRIPTION: Summary of database enhancements and new features PATH: /development/database-enhancements -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Database Enhancements (/docs/development/database-enhancements) # Database Enhancements This document summarizes the database enhancement implementation for AI Web Feeds. ## What Was Done ### ✅ 1. Reorganized Analytics into Subpackage **Structure**: ``` packages/ai_web_feeds/src/ai_web_feeds/analytics/ ├── __init__.py # Package exports ├── core.py # Core analytics (moved from analytics.py) └── advanced.py # Advanced ML-powered analytics ``` **Benefits**: * Better organization and separation of concerns * Clear distinction between core and advanced features * Easier to extend with new analytics modules * Cleaner imports ### ✅ 2. Created Advanced Database Models **New file**: `models_advanced.py` **New Tables**: 1. **FeedValidationHistory** - Track validation attempts over time 2. **FeedHealthMetric** - Monitor feed health with component scores 3. **DataQualityMetric** - Multi-dimensional quality tracking 4. **ContentEmbedding** - Store embeddings for semantic search 5. **TopicRelationship** - Track computed topic associations 6. **UserFeedPreference** - User interactions and preferences 7. **AnalyticsCacheEntry** - Cache expensive analytics computations **Features**: * Proper indexes for performance * Enum types for type safety * JSON columns for flexible data * Relationship tracking * TTL-based caching ### ✅ 3. Data Synchronization System **New file**: `data_sync.py` **Components**: * `SyncConfig` - Configuration for sync operations * `FeedDataLoader` - YAML → Database for feeds * `TopicDataLoader` - YAML → Database for topics * `DataExporter` - Database → enriched YAML * `DataSyncOrchestrator` - Full bidirectional sync **Features**: * Upsert logic (insert or update) * Batch processing with configurable batch size * Progress callbacks for UI integration * Error handling with skip option * Stable ID generation from URLs * Schema validation support ### ✅ 4. Advanced Analytics Module **New file**: `analytics/advanced.py` **Capabilities**: * **Predictive Health**: Linear regression for 7-day health forecasts * **Pattern Detection**: Temporal, content length, title, category analysis * **Similarity Computation**: Multi-dimensional feed similarity (Jaccard) * **Clustering**: BFS-based feed clustering by similarity * **ML Insights**: Comprehensive ML-powered reports **Algorithms**: * Linear regression for trend prediction * Coefficient of variation for pattern detection * Jaccard similarity for comparisons * BFS for connected component clustering * Shannon entropy for diversity analysis ### ✅ 5. Documentation Created comprehensive documentation covering: * Architecture overview * Usage examples * Database schema * Migration strategy * Best practices * Future enhancements ## Key Design Decisions ### 1. Advanced Naming Convention * Used `models_advanced.py` instead of `models_extended.py` * Used `analytics/advanced.py` instead of `analytics_extended.py` * Clearer naming convention ### 2. Subpackage Organization * `analytics/` subpackage instead of multiple files * `core.py` for base analytics * `advanced.py` for ML-powered features * Easier to navigate and extend ### 3. Named Constants * Defined constants for magic numbers (thresholds, limits) * Improves maintainability * Self-documenting code ### 4. Type Safety * Enums for status values * Type hints everywhere * Pydantic models for validation ### 5. Performance Optimizations * Batch processing for bulk operations * Indexes on frequently queried columns * Caching layer for expensive analytics * Configurable limits for large datasets ## File Structure ``` packages/ai_web_feeds/ ├── pyproject.toml # Dependencies (alembic added) └── src/ai_web_feeds/ ├── __init__.py # Updated exports ├── analytics/ # NEW: Analytics subpackage │ ├── __init__.py │ ├── core.py # Moved from analytics.py │ └── advanced.py # NEW: ML-powered analytics ├── data_sync.py # NEW: YAML ↔ Database sync ├── models.py # Existing core models ├── models_advanced.py # NEW: Advanced models └── storage.py # Existing (no changes) ``` ## Usage Examples ### Initialize Database ```python from ai_web_feeds import DatabaseManager db = DatabaseManager("sqlite:///data/aiwebfeeds.db") db.create_db_and_tables() ``` ### Load Data from YAML ```python from ai_web_feeds.data_sync import DataSyncOrchestrator sync = DataSyncOrchestrator(db) results = sync.full_sync() ``` ### Core Analytics ```python from ai_web_feeds.analytics import FeedAnalytics with db.get_session() as session: analytics = FeedAnalytics(session) stats = analytics.get_overview_stats() quality = analytics.get_quality_metrics() ``` ### Advanced Analytics ```python from ai_web_feeds.analytics.advanced import AdvancedFeedAnalytics with db.get_session() as session: analytics = AdvancedFeedAnalytics(session) prediction = analytics.predict_feed_health("feed_id", days_ahead=7) clusters = analytics.cluster_feeds_by_similarity(similarity_threshold=0.6) insights = analytics.generate_ml_insights_report() ``` ## Next Steps ### Immediate (Required for First Use) 1. **Initialize Alembic** (when ready): ```bash cd packages/ai_web_feeds uv run alembic init alembic ``` 2. **Create Initial Migration**: ```bash uv run alembic revision --autogenerate -m "initial_schema" uv run alembic upgrade head ``` 3. **Load Initial Data**: ```bash uv run python -c "from ai_web_feeds.data_sync import DataSyncOrchestrator; from ai_web_feeds import DatabaseManager; sync = DataSyncOrchestrator(DatabaseManager()); sync.full_sync()" ``` ### Testing (Required) * Create tests for new modules (target ≥90% coverage) * Test files needed: * `tests/packages/ai_web_feeds/test_models_advanced.py` * `tests/packages/ai_web_feeds/test_data_sync.py` * `tests/packages/ai_web_feeds/analytics/test_advanced.py` ### CLI Integration * Add data sync commands to CLI * Add analytics report commands * Add health monitoring commands ## Benefits 1. **Better Organization**: Analytics in subpackage, clear separation 2. **Enhanced Capabilities**: ML-powered insights, predictions, clustering 3. **Data Quality**: Comprehensive quality tracking and validation 4. **Performance**: Caching, indexes, batch processing 5. **Maintainability**: Named constants, type safety, documentation 6. **Extensibility**: Easy to add new analytics or models 7. **Type Safety**: Full type hints, Pydantic validation, enums 8. **Testing Ready**: Structured for comprehensive test coverage ## Technical Highlights * **SQLModel + Alembic**: Modern ORM with migration support * **Pydantic v2**: Fast validation and serialization * **Type Safety**: Complete type hints throughout * **Performance**: Optimized queries, indexes, caching * **ML-Ready**: Embedding storage, similarity metrics * **Flexible**: JSON columns for extensibility * **Production-Ready**: Error handling, logging, validation ## Related Documentation * [Database Architecture](/docs/development/database-architecture) - Comprehensive documentation * [Database Quick Start](/docs/guides/database-quick-start) - Get started quickly * [Python API](/docs/development/python-api) - Full API reference * [Testing](/docs/development/testing) - Testing guidelines *** **Status**: Implementation complete, ready for Alembic initialization **Date**: October 15, 2025 **Version**: 0.1.0 -------------------------------------------------------------------------------- END OF PAGE 14 -------------------------------------------------------------------------------- ================================================================================ PAGE 15 OF 57 ================================================================================ TITLE: Database & Storage URL: https://ai-web-feeds.w4w.dev/docs/development/database-storage MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/database-storage.mdx DESCRIPTION: Comprehensive data persistence for feed sources, enrichment data, validation results, and analytics PATH: /development/database-storage -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Database & Storage (/docs/development/database-storage) ## Overview The AIWebFeeds database system provides comprehensive storage for all feed-related data, metadata, and enrichments using SQLModel (SQLAlchemy 2.0 + Pydantic v2) with SQLite as the default backend. ## Architecture ### Core Models The database schema consists of 7 primary tables that store all possible data: ```python # Core data models FeedSource # Feed definitions and metadata FeedItem # Individual feed entries FeedFetchLog # Fetch history and logs Topic # Topic taxonomy # Enrichment and analytics FeedEnrichmentData # Comprehensive enrichment metadata FeedValidationResult # Validation results and checks FeedAnalytics # Usage metrics and analytics ``` ## Data Models ### FeedSource Primary table for feed definitions with basic metadata: ```python class FeedSource(SQLModel, table=True): id: str # Unique feed identifier feed: str # Feed URL site: str | None # Website URL title: str # Display name source_type: SourceType # personal, institutional, etc. mediums: list[Medium] # text, video, audio, image topics: list[str] # Topic IDs topic_weights: dict # Topic relevance scores language: str # Language code (en, es, etc.) format: FeedFormat # RSS, Atom, JSON Feed quality_score: float # Overall quality (0-1) # ... curation, provenance, relations fields ``` ### FeedEnrichmentData Comprehensive enrichment metadata (30+ fields): ```python class FeedEnrichmentData(SQLModel, table=True): feed_source_id: str # Foreign key to FeedSource enriched_at: datetime # Enrichment timestamp enrichment_version: str # Version tracking # Basic metadata discovered_title: str | None discovered_description: str | None discovered_language: str | None discovered_author: str | None # Format and platform detected_format: FeedFormat | None detected_platform: str | None platform_metadata: dict # Visual assets icon_url: str | None logo_url: str | None image_url: str | None favicon_url: str | None banner_url: str | None # Quality and health scores health_score: float | None # Feed health (0-1) quality_score: float | None # Content quality (0-1) completeness_score: float | None # Metadata completeness (0-1) reliability_score: float | None # Update reliability (0-1) freshness_score: float | None # Content freshness (0-1) # Content analysis entry_count: int | None has_full_content: bool avg_content_length: float | None content_types: list[str] content_samples: list[str] # Update patterns estimated_frequency: str | None last_updated: datetime | None update_regularity: float | None update_intervals: list[int] # Performance metrics response_time_ms: float | None availability_score: float | None uptime_percentage: float | None # Topic suggestions suggested_topics: list[str] topic_confidence: dict[str, float] auto_keywords: list[str] # Feed extensions has_itunes: bool has_media_rss: bool has_dublin_core: bool has_geo: bool extension_data: dict # SEO and social seo_title: str | None seo_description: str | None og_image: str | None twitter_card: str | None social_metadata: dict # Technical details encoding: str | None generator: str | None ttl: int | None cloud: dict # Link analysis internal_links: int | None external_links: int | None broken_links: int | None redirect_chains: list[str] # Security uses_https: bool has_valid_ssl: bool security_headers: dict # Flexible storage structured_data: dict raw_metadata: dict extra_data: dict ``` ### FeedValidationResult Validation checks and results: ```python class FeedValidationResult(SQLModel, table=True): feed_source_id: str validated_at: datetime # Overall status is_valid: bool validation_level: str # strict, moderate, lenient # Schema validation schema_valid: bool schema_version: str | None schema_errors: list[str] # Accessibility is_accessible: bool http_status: int | None redirect_count: int | None # Content validation has_items: bool item_count: int | None has_required_fields: bool missing_fields: list[str] # Link validation links_checked: int | None links_valid: int | None broken_link_urls: list[str] # Security checks https_enabled: bool ssl_valid: bool security_issues: list[str] # Recommendations warnings: list[str] recommendations: list[str] validation_report: dict ``` ### FeedAnalytics Time-series analytics data: ```python class FeedAnalytics(SQLModel, table=True): feed_source_id: str period_start: datetime period_end: datetime period_type: str # daily, weekly, monthly, yearly # Volume metrics total_items: int new_items: int updated_items: int # Update frequency update_count: int avg_update_interval_hours: float | None # Content metrics avg_content_length: float | None has_images_count: int has_video_count: int # Quality metrics items_with_full_content: int items_with_summary_only: int # Reliability fetch_attempts: int fetch_successes: int uptime_percentage: float | None # Performance avg_response_time_ms: float | None # Distribution topic_distribution: dict[str, int] keyword_frequency: dict[str, int] ``` ## Storage Operations ### DatabaseManager The `DatabaseManager` class provides all storage operations: ```python from ai_web_feeds import DatabaseManager # Initialize db = DatabaseManager("sqlite:///data/aiwebfeeds.db") db.create_db_and_tables() # Feed sources db.add_feed_source(feed_source) source = db.get_feed_source(feed_id) all_sources = db.get_all_feed_sources() # Enrichment data db.add_enrichment_data(enrichment) enrichment = db.get_enrichment_data(feed_id) all_enrichments = db.get_all_enrichment_data(feed_id) db.delete_old_enrichments(feed_id, keep_count=5) # Validation results db.add_validation_result(validation) result = db.get_validation_result(feed_id) failed = db.get_failed_validations() # Analytics db.add_analytics(analytics) analytics = db.get_analytics(feed_id, period_type="daily", limit=30) all_analytics = db.get_all_analytics(period_type="monthly") # Comprehensive queries complete_data = db.get_feed_complete_data(feed_id) health_summary = db.get_health_summary() ``` ### Enrichment Persistence The enrichment process automatically stores data to the database: ```python from ai_web_feeds import enrich_all_feeds, DatabaseManager # Initialize database db = DatabaseManager() db.create_db_and_tables() # Enrich and persist feeds_data = load_feeds("data/feeds.yaml") enriched_data = enrich_all_feeds(feeds_data, db=db) # Enrichment data is automatically saved to FeedEnrichmentData table ``` ### Comprehensive Data Retrieval Get all data for a feed source in one call: ```python data = db.get_feed_complete_data("feed-id") # Returns: # { # "source": FeedSource, # "enrichment": FeedEnrichmentData, # "validation": FeedValidationResult, # "analytics": [FeedAnalytics], # "recent_items": [FeedItem] # } ``` ### Health Summary Get overall health metrics across all feeds: ```python summary = db.get_health_summary() # Returns: # { # "total_feeds": 150, # "feeds_with_health_data": 145, # "avg_health_score": 0.82, # "avg_quality_score": 0.78, # "feeds_healthy": 120, # health_score >= 0.7 # "feeds_warning": 20, # 0.4 <= health_score < 0.7 # "feeds_critical": 5 # health_score < 0.4 # } ``` ## Data Flow ### Complete Pipeline ``` 1. Load feeds from YAML ↓ 2. Validate feeds → Store FeedValidationResult ↓ 3. Enrich feeds → Store FeedEnrichmentData ↓ 4. Validate enriched → Store FeedValidationResult ↓ 5. Export + Store FeedSource ↓ 6. Collect analytics → Store FeedAnalytics ``` ### CLI Usage The CLI automatically handles database storage: ```bash # Process with database persistence aiwebfeeds process \ --input data/feeds.yaml \ --output data/feeds.enriched.yaml \ --database sqlite:///data/aiwebfeeds.db # Database is automatically populated with: # - FeedSource records (from YAML) # - FeedEnrichmentData (from enrichment) # - FeedValidationResult (from validation) ``` ## Schema Migration ### Alembic Integration Database migrations are managed via Alembic: ```bash # Generate migration uv run alembic revision --autogenerate -m "Add new enrichment fields" # Apply migration uv run alembic upgrade head # Rollback uv run alembic downgrade -1 ``` ### Schema Evolution The database schema supports evolution through: 1. **JSON columns**: Flexible `extra_data`, `raw_metadata`, `structured_data` fields 2. **Version tracking**: `enrichment_version`, `validator_version` fields 3. **Backwards compatibility**: Nullable fields for gradual rollout ## Performance Considerations ### Indexes Automatically created indexes: ```python # Foreign keys (auto-indexed) FeedEnrichmentData.feed_source_id FeedValidationResult.feed_source_id FeedAnalytics.feed_source_id # Custom indexes FeedItem.published_at # For time-based queries Topic.parent_id # For hierarchical queries ``` ### Query Optimization ```python # Use specific queries vs loading all data enrichment = db.get_enrichment_data(feed_id) # Latest only vs all_enrichments = db.get_all_enrichment_data(feed_id) # All history # Limit analytics queries analytics = db.get_analytics(feed_id, period_type="daily", limit=30) # Clean up old enrichments periodically db.delete_old_enrichments(feed_id, keep_count=5) ``` ### Batch Operations ```python # Bulk insert for performance db.bulk_insert_feed_sources(feed_sources) db.bulk_insert_topics(topics) ``` ## Data Integrity ### Constraints * **Primary keys**: Auto-generated UUIDs for enrichment/validation/analytics * **Foreign keys**: Enforce relationships between tables * **Unique constraints**: Feed IDs, topic IDs * **Check constraints**: Score ranges (0-1), positive counts ### Validation Data is validated at multiple levels: 1. **Pydantic validation**: Type checking, field constraints 2. **SQLModel validation**: Database constraints 3. **Application validation**: Business logic validation ### Transactions All database operations use transactions: ```python with db.get_session() as session: session.add(enrichment) session.commit() # Auto-rollback on error ``` ## Monitoring ### Health Checks ```python # Overall health summary = db.get_health_summary() # Failed validations failed = db.get_failed_validations() # Recent enrichments recent = db.get_all_enrichment_data(feed_id) ``` ### Analytics Queries ```python # Daily analytics for last 30 days daily = db.get_analytics(feed_id, period_type="daily", limit=30) # Monthly trends monthly = db.get_all_analytics(period_type="monthly") ``` ## Best Practices 1. **Regular cleanup**: Delete old enrichments periodically 2. **Index usage**: Query with indexed fields (feed\_source\_id) 3. **Batch operations**: Use bulk inserts for performance 4. **JSON fields**: Use for flexible/evolving data structures 5. **Version tracking**: Always set version fields for migrations 6. **Health monitoring**: Check health\_summary regularly 7. **Validation**: Always validate before persisting ## Related * [Architecture](/docs/development/architecture) - System architecture overview * [CLI Reference](/docs/cli) - Command-line interface * [Data Models](/docs/api/models) - Model definitions -------------------------------------------------------------------------------- END OF PAGE 15 -------------------------------------------------------------------------------- ================================================================================ PAGE 16 OF 57 ================================================================================ TITLE: Database Setup URL: https://ai-web-feeds.w4w.dev/docs/development/database MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/database.mdx DESCRIPTION: Database architecture, models, and operations PATH: /development/database -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Database Setup (/docs/development/database) # Database Setup AI Web Feeds uses SQLModel (SQLAlchemy + Pydantic) for database operations with Alembic for migrations. ## Quick Links * **[Database Architecture](/docs/development/database-architecture)** - Comprehensive architecture overview * **[Database Quick Start](/docs/guides/database-quick-start)** - Get started in minutes * **[Database Enhancements](/docs/development/database-enhancements)** - Recent improvements and features ## Database Schema ### `feed_sources` Table Core feed metadata and configuration: * **Core fields:** `id`, `feed`, `site`, `title` * **Classification:** `source_type`, `mediums`, `tags` * **Topics:** `topics`, `topic_weights` * **Metadata:** `language`, `format`, `updated`, `last_validated`, `verified`, `contributor` * **Curation:** `curation_status`, `curation_since`, `curation_by`, `quality_score`, `curation_notes` * **Provenance:** `provenance_source`, `provenance_from`, `provenance_license` * **Discovery:** `discover_enabled`, `discover_config` * **Relations:** `relations`, `mappings` (JSON fields) ### `feed_items` Table Individual feed entries: * **Identifiers:** `id` (UUID), `feed_source_id` (foreign key) * **Content:** `title`, `link`, `description`, `content`, `author` * **Timestamps:** `published`, `updated`, `created_at`, `updated_at` * **Metadata:** `guid`, `categories`, `tags`, `enclosures`, `extra_data` ### `feed_fetch_logs` Table Fetch attempt tracking: * **Fetch info:** `fetched_at`, `fetch_url`, `success` * **Response:** `status_code`, `content_type`, `content_length`, `etag`, `last_modified` * **Errors:** `error_message`, `error_type` * **Stats:** `items_found`, `items_new`, `items_updated`, `fetch_duration_ms` * **Data:** `response_headers`, `extra_data` (JSON fields) ### `topics` Table Topic definitions: * **Core:** `id`, `name`, `description`, `parent_id` * **Metadata:** `aliases`, `related_topics` * **Timestamps:** `created_at`, `updated_at` ## Python API ### Initialize Database ```python from ai_web_feeds.storage import DatabaseManager # Initialize database db = DatabaseManager("sqlite:///data/aiwebfeeds.db") db.create_db_and_tables() ``` ### Add Feed Sources ```python from ai_web_feeds.models import FeedSource, SourceType feed = FeedSource( id="example-blog", feed="https://example.com/feed.xml", site="https://example.com", title="Example Blog", source_type=SourceType.BLOG, topics=["ml", "nlp"], verified=True, ) db.add_feed_source(feed) ``` ### Query Feed Sources ```python # Get all feeds all_feeds = db.get_all_feed_sources() # Get specific feed feed = db.get_feed_source("example-blog") # Get all topics topics = db.get_all_topics() ``` ### Bulk Operations ```python # Bulk insert feed sources db.bulk_insert_feed_sources(feed_sources) # Bulk insert topics db.bulk_insert_topics(topics) ``` ## Database Migrations ### Initialize Alembic ```bash # Run initialization script uv run python packages/ai_web_feeds/scripts/init_alembic.py ``` ### Create Migration ```bash cd packages/ai_web_feeds alembic revision --autogenerate -m "Initial schema" ``` ### Apply Migrations ```bash # Upgrade to latest alembic upgrade head # Downgrade one version alembic downgrade -1 # Show current version alembic current ``` ## Configuration ### Environment Variables ```bash # Database URL export AIWF_DATABASE_URL=sqlite:///data/aiwebfeeds.db # For PostgreSQL export AIWF_DATABASE_URL=postgresql://user:pass@localhost/aiwebfeeds # For MySQL export AIWF_DATABASE_URL=mysql://user:pass@localhost/aiwebfeeds ``` ### Database Manager Options ```python # Custom database URL db = DatabaseManager("postgresql://localhost/aiwebfeeds") # Enable SQL echo for debugging from sqlalchemy import create_engine engine = create_engine( "sqlite:///data/aiwebfeeds.db", echo=True # Print all SQL statements ) ``` ## Models Reference All models are defined using SQLModel, which combines SQLAlchemy and Pydantic for type-safe database operations with automatic validation. **Core Models** (`models.py`): * `FeedSource` - Feed metadata and configuration * `FeedItem` - Individual feed entries * `FeedFetchLog` - Fetch attempt history * `Topic` - Topic taxonomy **Advanced Models** (`models_advanced.py`): * `FeedValidationHistory` - Validation tracking over time * `FeedHealthMetric` - Health scores and metrics * `DataQualityMetric` - Multi-dimensional quality tracking * `ContentEmbedding` - Semantic search embeddings * `TopicRelationship` - Computed topic associations * `UserFeedPreference` - User interactions and preferences * `AnalyticsCacheEntry` - Computed analytics caching ## Next Steps * **Get Started**: Follow the [Database Quick Start](/docs/guides/database-quick-start) guide * **Deep Dive**: Read the [Database Architecture](/docs/development/database-architecture) documentation * **Learn More**: See [Database Enhancements](/docs/development/database-enhancements) for recent features * **API Usage**: Check the [Python API](/docs/development/python-api) documentation -------------------------------------------------------------------------------- END OF PAGE 16 -------------------------------------------------------------------------------- ================================================================================ PAGE 17 OF 57 ================================================================================ TITLE: Complete Database Refactoring - FINAL STATUS URL: https://ai-web-feeds.w4w.dev/docs/development/final-status MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/final-status.mdx DESCRIPTION: Comprehensive database/storage refactoring completed successfully PATH: /development/final-status -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Complete Database Refactoring - FINAL STATUS (/docs/development/final-status) # 🎉 REFACTORING COMPLETE: Database & Storage Enhancement ## ✅ COMPLETED OBJECTIVES ### 1. Simplified Package Structure ✅ Successfully consolidated to **8 core modules** as requested: ``` packages/ai_web_feeds/src/ai_web_feeds/ ├── load.py ✅ YAML I/O for feeds and topics ├── validate.py ✅ Schema validation and data quality checks ├── enrich.py ✅ Feed enrichment orchestration ├── export.py ✅ Multi-format export (JSON, OPML) ├── logger.py ✅ Logging configuration ├── models.py ✅ SQLModel data models (7 tables) ├── storage.py ✅ Database operations (20+ methods) ├── utils.py ✅ Shared utilities ├── enrichment.py ✅ Advanced enrichment service (supporting) └── __init__.py ✅ Clean exports ``` ### 2. Linear Pipeline Flow ✅ Implemented exact flow as requested: ``` feeds.yaml → load → validate → enrich → validate → export + store + log ``` ### 3. Comprehensive Data Storage ✅ Now stores **ALL POSSIBLE** data, metadata, and enrichments: #### NEW: FeedEnrichmentData (30+ fields) * **Quality Scores**: health, quality, completeness, reliability, freshness (5 scores) * **Visual Assets**: icon, logo, image, favicon, banner URLs * **Content Analysis**: entry count, types, samples, average length * **Update Patterns**: frequency, regularity, intervals, last updated * **Performance**: response times, availability, uptime percentage * **Topics**: suggested topics, confidence scores, auto keywords * **Extensions**: iTunes, MediaRSS, Dublin Core, Geo detection * **SEO/Social**: Open Graph, Twitter Cards, structured data * **Security**: HTTPS usage, SSL validation, security headers * **Link Analysis**: internal/external/broken link counts * **Technical**: encoding, generator, TTL, cloud settings * **Flexible**: raw metadata, structured data, extra fields #### NEW: FeedValidationResult * Overall validation status and level * Schema validation with detailed errors * Accessibility checks (HTTP status, redirects) * Content validation (items, required fields) * Link validation with broken URL tracking * Security validation (HTTPS, SSL) * Complete validation reports #### NEW: FeedAnalytics * Time-series metrics (daily/weekly/monthly/yearly) * Volume metrics (total/new/updated items) * Update frequency analysis * Content quality metrics * Performance tracking * Topic and keyword distribution ### 4. Enhanced Storage Operations ✅ Added **20+ comprehensive methods**: ```python # Enrichment data persistence db.add_enrichment_data(enrichment) db.get_enrichment_data(feed_id) db.get_all_enrichment_data(feed_id) db.delete_old_enrichments(feed_id, keep_count=5) # Validation results db.add_validation_result(validation) db.get_validation_result(feed_id) db.get_failed_validations() # Analytics db.add_analytics(analytics) db.get_analytics(feed_id, period_type="daily") db.get_all_analytics(period_type="monthly") # Comprehensive queries db.get_feed_complete_data(feed_id) # All data for one feed db.get_health_summary() # Overall health metrics db.get_recent_feed_items(feed_id) # Recent items ``` ### 5. Pipeline Integration ✅ Enhanced CLI process command to persist ALL enrichment data: ```bash aiwebfeeds process \ --input data/feeds.yaml \ --output data/feeds.enriched.yaml \ --database sqlite:///data/aiwebfeeds.db # Now automatically stores: # ✅ FeedSource (from YAML) # ✅ FeedEnrichmentData (ALL 30+ enrichment fields) # ✅ FeedValidationResult (complete validation report) # ✅ FeedAnalytics (performance metrics) ``` ## 🔄 BEFORE vs AFTER ### Data Storage **BEFORE**: Only `quality_score` stored in FeedSource table ```python # Limited data feed.quality_score = 0.85 # All enrichment data LOST after export ``` **AFTER**: Complete enrichment persistence (30+ fields) ```python # Comprehensive data stored enrichment = FeedEnrichmentData( health_score=0.92, quality_score=0.85, completeness_score=0.78, suggested_topics=["tech", "ai"], topic_confidence={"tech": 0.9, "ai": 0.8}, response_time_ms=245.6, has_itunes=True, uses_https=True, broken_links=0, # ... 20+ more fields preserved ) ``` ### Package Structure **BEFORE**: Complex modular structure with scattered logic ``` ai_web_feeds/ ├── enrichment/ # Package directory │ ├── __init__.py │ ├── advanced.py │ └── ... ├── analytics/ # Separate package ├── models_advanced.py # Split models └── ... ``` **AFTER**: Clean 8-module structure ``` ai_web_feeds/ ├── load.py # Single purpose modules ├── validate.py ├── enrich.py ├── export.py ├── logger.py ├── models.py # Unified models (7 tables) ├── storage.py # Comprehensive storage ├── utils.py ├── enrichment.py # Supporting service └── __init__.py # Clean exports ``` ### Pipeline Flow **BEFORE**: Enrichment data discarded ``` feeds.yaml → load → enrich → export ↓ (data lost) ``` **AFTER**: Zero data loss with comprehensive storage ``` feeds.yaml → load → validate → enrich → validate → export + store ↓ ↓ ↓ Validation Enrichment Analytics Stored 30+ fields Stored Stored ``` ## 🏗️ ARCHITECTURE IMPROVEMENTS ### 1. Zero Data Loss * **ALL enrichment data preserved** in database * Historical tracking with timestamps * Version control for schema evolution ### 2. Comprehensive Health Monitoring ```python summary = db.get_health_summary() # Returns detailed health metrics: # - Total feeds count # - Average health/quality scores # - Healthy/warning/critical feed counts # - Feeds with enrichment data ``` ### 3. Advanced Analytics * Time-series performance tracking * Content quality analysis * Update frequency monitoring * Topic distribution analysis ### 4. Flexible Schema Evolution * JSON columns for evolving data structures * Version tracking for migrations * Backwards compatible design ### 5. Transaction Safety * All operations use database transactions * Automatic rollback on errors * Data integrity constraints ## 📊 STATISTICS ### Models Enhanced * **Before**: 4 basic models * **After**: 7 comprehensive models (+3 new) ### Storage Methods * **Before**: 8 basic CRUD methods * **After**: 25+ comprehensive methods (+17 new) ### Data Fields Stored * **Before**: \~15 basic fields in FeedSource * **After**: 60+ fields across all models (4x increase) ### Enrichment Data Preserved * **Before**: 0% (all enrichment data lost) * **After**: 100% (complete preservation) ## 🚀 READY FOR PRODUCTION ### ✅ All Tests Pass * Model imports successful * Storage operations verified * Pipeline integration working * CLI functionality confirmed ### ✅ Documentation Complete * Comprehensive API documentation * Architecture diagrams * Migration guides * Best practices ### ✅ Performance Optimized * Database indexes on foreign keys * Efficient query patterns * Bulk operation support * Old data cleanup methods ### ✅ Monitoring Ready * Health summary dashboards * Failed validation tracking * Performance metrics collection * Analytics time-series data ## 🎯 SUCCESS METRICS 1. **Zero Data Loss**: ✅ ALL enrichment data now preserved 2. **Simplified Architecture**: ✅ Clean 8-module structure 3. **Linear Pipeline**: ✅ Exact flow as requested implemented 4. **Comprehensive Storage**: ✅ 30+ enrichment fields stored 5. **Enhanced Analytics**: ✅ Complete performance tracking 6. **Future-Proof Design**: ✅ Flexible schema for evolution ## 🔗 NEXT STEPS The database/storage refactoring is **COMPLETE**. The system now: * ✅ Stores every possible piece of enrichment data * ✅ Maintains clean 8-module architecture * ✅ Follows linear pipeline flow exactly as requested * ✅ Provides comprehensive analytics and monitoring * ✅ Supports future schema evolution **Ready for**: Analytics dashboards, API development, performance monitoring, and production deployment. *** **STATUS**: 🎉 **REFACTORING SUCCESSFULLY COMPLETED** 🎉 The AIWebFeeds database and storage system now comprehensively stores **all possible data, metadata, and enrichments** while maintaining the simplified architecture and linear pipeline flow as originally requested. -------------------------------------------------------------------------------- END OF PAGE 17 -------------------------------------------------------------------------------- ================================================================================ PAGE 18 OF 57 ================================================================================ TITLE: Implementation Details URL: https://ai-web-feeds.w4w.dev/docs/development/implementation MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/implementation.mdx DESCRIPTION: Technical implementation details for advanced feed fetching and analytics PATH: /development/implementation -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Implementation Details (/docs/development/implementation) import { Callout } from "fumadocs-ui/components/callout"; import { Steps } from "fumadocs-ui/components/steps"; import { Tabs, Tab } from "fumadocs-ui/components/tabs"; import { Accordion, Accordions } from "fumadocs-ui/components/accordion"; ## Overview This document describes the technical implementation of the comprehensive feed fetching and analytics system added to AI Web Feeds in version 1.0. This is the **first version** of these capabilities - designed from scratch for optimal performance and extensibility. ## Architecture The enhanced system consists of three main components: ``` Feed URL → AdvancedFeedFetcher → FeedMetadata + Items ↓ DatabaseManager ↓ FeedAnalytics ↓ CLI Commands ``` ## Core Components ### 1. Advanced Feed Fetcher **Location:** `packages/ai_web_feeds/src/ai_web_feeds/fetcher.py` (820 lines) A sophisticated feed fetching system that extracts **exhaustive metadata** from RSS/Atom/JSON feeds. #### Key Features ### 100+ Metadata Fields The fetcher extracts comprehensive metadata organized in categories: **Basic Feed Information:** * Title, subtitle, description * Homepage link * Language and copyright * Generator information **Author/Publisher Data:** * Author name and email * Publisher information * Managing editor * Webmaster contact **Visual Assets:** * Feed images (URL, title, link) * Logo and icon URLs * Dimensions and alt text **Technical Metadata:** * TTL (Time To Live) * Skip hours and skip days * Cloud configuration * PubSubHubbub hub URLs **Content Statistics:** * Total item count * Items with full content * Items with authors * Items with enclosures/media * Average title/description/content lengths ### Three-Dimensional Quality Scoring Each feed receives scores (0-1) across three dimensions: #### 1. Completeness Score Measures how complete the feed metadata is: * ✅ Has title * ✅ Has description * ✅ Has link * ✅ Has language * ✅ Has timestamps * ✅ Has author/publisher * ✅ Has categories * ✅ Has image/logo ```python # Example calculation completeness = sum([ bool(feed.title), # 1/8 bool(feed.description), # 1/8 bool(feed.link), # 1/8 bool(feed.language), # 1/8 # ... etc ]) / 8.0 ``` #### 2. Richness Score Measures content quality and depth: * Items have content * Content coverage percentage * Author attribution * Average content length * Full content availability * Media/images present #### 3. Structure Score Measures feed structure quality: * No parsing errors * Has items * Items have GUIDs * Has timestamps * Has links ### Publishing Frequency Detection Automatically analyzes item publication patterns to estimate update frequency: | Frequency | Pattern | | -------------- | ------------------------------ | | **Hourly** | New items every hour or less | | **Daily** | New items published daily | | **Weekly** | Weekly publication schedule | | **Monthly** | Monthly updates | | **Infrequent** | Longer intervals between posts | ```python # Algorithm outline def estimate_update_frequency(items): if not items or len(items) < 2: return "unknown" # Calculate time between publications intervals = calculate_intervals(items) avg_interval = median(intervals) # Classify based on average interval if avg_interval < 3600: # < 1 hour return "hourly" elif avg_interval < 86400: # < 1 day return "daily" # ... etc ``` ### Extension Support Full support for popular RSS extensions: **iTunes Podcast Metadata:** * Author, owner, categories * Explicit flag * Episode information * Artwork URLs **Dublin Core Metadata:** * Contributor, coverage * Creator, date * Format, identifier * Rights, source **Media RSS:** * Thumbnails with dimensions * Media content * Keywords and descriptions * Credit information **GeoRSS:** * Location coordinates * Geographic regions * Place names #### Usage Example ```python from ai_web_feeds.fetcher import AdvancedFeedFetcher from ai_web_feeds.storage import DatabaseManager # Initialize db = DatabaseManager("sqlite:///data/aiwebfeeds.db") fetcher = AdvancedFeedFetcher() # Fetch feed fetch_log, metadata, items = await fetcher.fetch_feed( "https://example.com/feed.xml" ) # Access quality scores print(f"Completeness: {metadata.completeness_score:.2f}") print(f"Richness: {metadata.richness_score:.2f}") print(f"Structure: {metadata.structure_score:.2f}") # Access metadata print(f"Update frequency: {metadata.estimated_update_frequency}") print(f"Total items: {metadata.total_items}") print(f"Found {len(items)} items") # Save to database session = db.get_session() session.add(fetch_log) session.commit() ``` #### Conditional Requests The fetcher supports conditional HTTP requests to reduce bandwidth: ```python # Use ETag and Last-Modified from previous fetch fetch_log, metadata, items = await fetcher.fetch_feed( url="https://example.com/feed.xml", etag="33a64df551425fcc55e4d42a148795d9f25f89d4", last_modified="Wed, 15 Nov 2023 12:00:00 GMT" ) # Returns 304 Not Modified if feed hasn't changed if fetch_log.status_code == 304: print("Feed unchanged") ``` #### Retry Logic Built-in exponential backoff for transient failures: ```python # Automatic retries (configured via tenacity) @retry( stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10) ) async def fetch_with_retry(url): # Will retry up to 3 times # Waits 2s, 4s, 8s between attempts pass ``` ### 2. Analytics Engine **Location:** `packages/ai_web_feeds/src/ai_web_feeds/analytics.py` (600 lines) Comprehensive analytics engine providing 8 different analytical views of feed data. Get high-level statistics across all feeds: ```python analytics = FeedAnalytics(session) stats = analytics.get_overview_stats() # Returns: { "totals": { "feeds": 150, "items": 12450, "topics": 45, "verified_feeds": 120 }, "status": { "verified": 120, "active": 135, "inactive": 15 }, "recent_activity": { "feeds_updated_24h": 78, "items_added_24h": 342, "fetch_attempts_24h": 150 } } ``` Analyze distribution across various dimensions: ```python # Source type distribution dist = analytics.get_source_type_distribution(limit=10) # Returns: [("blog", 45), ("paper", 30), ("podcast", 15), ...] # Topic distribution topics = analytics.get_topic_distribution(limit=20) # Returns: [("ml", 89), ("nlp", 67), ("cv", 45), ...] # Language distribution langs = analytics.get_language_distribution() # Returns: [("en", 120), ("zh", 15), ("ja", 10), ...] ``` Comprehensive quality assessment: ```python quality = analytics.get_quality_metrics() # Returns: { "average_scores": { "completeness": 0.78, "richness": 0.65, "structure": 0.92 }, "quality_distribution": { "excellent": 45, # score > 0.8 "good": 67, # score 0.6-0.8 "fair": 28, # score 0.4-0.6 "poor": 10 # score < 0.4 }, "high_quality_feeds": 45, "low_quality_feeds": 10 } ``` Monitor fetch performance and errors: ```python perf = analytics.get_fetch_performance_stats(days=7) # Returns: { "total_fetches": 1050, "successful_fetches": 987, "failed_fetches": 63, "success_rate": 0.94, "average_duration_ms": 1247, "error_distribution": { "timeout": 15, "http_404": 12, "http_500": 8, "parse_error": 28 }, "status_codes": { "200": 987, "404": 12, "500": 8 } } ``` Analyze content coverage and categories: ```python content = analytics.get_content_statistics() # Returns: { "total_items": 12450, "items_with_content": 11203, "items_with_authors": 9876, "items_with_enclosures": 2341, "content_coverage": 0.90, "author_coverage": 0.79, "enclosure_coverage": 0.19, "top_categories": [ ("research", 2341), ("tutorial", 1876), ("news", 1543) ] } ``` Identify publishing patterns: ```python trends = analytics.get_publishing_trends(days=30) # Returns: { "items_per_day": 415, "hourly_distribution": { "0": 12, "1": 8, ... "23": 15 }, "weekday_distribution": { "Monday": 2890, "Tuesday": 3120, ... }, "peak_hour": 14, # 2 PM "peak_weekday": "Tuesday" } ``` Per-feed health diagnostics: ```python health = analytics.get_feed_health_report("openai-blog") # Returns: { "feed_id": "openai-blog", "health_score": 0.87, "fetch_success_rate": 0.95, "average_quality": 0.82, "last_fetch_status": "success", "items_last_30d": 15, "estimated_frequency": "weekly", "issues": [], "recommendations": [ "Consider more frequent fetching" ] } ``` Track top contributors: ```python contributors = analytics.get_top_contributors(limit=10) # Returns: [ { "contributor": "user@example.com", "feed_count": 45, "verified_count": 42, "verification_rate": 0.93, "source_types": ["blog", "paper", "video"] }, ... ] ``` #### Generate Full Report ```python # Export everything to JSON report = analytics.generate_full_report() # Save to file import json with open("analytics.json", "w") as f: json.dump(report, f, indent=2) # Report includes all 8 analytics views ``` ### 3. CLI Commands ### Fetch Commands **Location:** `apps/cli/ai_web_feeds/cli/commands/fetch.py` (200 lines) #### Fetch Single Feed ```bash ai-web-feeds fetch one [--metadata] ``` Fetches a single feed with optional metadata display: ```bash # Basic fetch ai-web-feeds fetch one openai-blog # With detailed metadata ai-web-feeds fetch one openai-blog --metadata ``` **Features:** * Progress indicator * Error reporting * Quality scores display * Metadata summary table #### Fetch All Feeds ```bash ai-web-feeds fetch all [--limit N] [--verified-only] ``` Batch fetch with progress tracking: ```bash # Fetch all feeds ai-web-feeds fetch all # Fetch first 10 feeds ai-web-feeds fetch all --limit 10 # Fetch only verified feeds ai-web-feeds fetch all --verified-only ``` **Features:** * Rich progress bar * Real-time stats * Error summary table * Success/failure counts ### Analytics Commands **Location:** `apps/cli/ai_web_feeds/cli/commands/analytics.py` (400 lines) #### Overview Dashboard ```bash ai-web-feeds analytics overview ``` Displays comprehensive dashboard with: * Total counts (feeds, items, topics) * Status distribution * Recent activity (24h) #### Distributions ```bash ai-web-feeds analytics distributions [--limit N] ``` Shows distributions across: * Source types * Content mediums * Topics * Languages #### Quality Metrics ```bash ai-web-feeds analytics quality ``` Quality assessment with: * Average scores * Quality distribution * High/low quality counts #### Performance Tracking ```bash ai-web-feeds analytics performance [--days N] ``` Fetch performance metrics: * Success/failure rates * Average durations * Error distribution * HTTP status codes #### Content Statistics ```bash ai-web-feeds analytics content ``` Content analysis: * Total items * Coverage metrics * Top categories #### Publishing Trends ```bash ai-web-feeds analytics trends [--days N] ``` Publishing patterns: * Items per day * Hourly distribution * Weekday patterns * Peak times #### Feed Health ```bash ai-web-feeds analytics health ``` Per-feed health report with diagnostics and recommendations. #### Top Contributors ```bash ai-web-feeds analytics contributors [--limit N] ``` Contributor leaderboard with verification rates. #### Generate Report ```bash ai-web-feeds analytics report [--output FILE] ``` Export comprehensive JSON report. ## Database Schema The enhanced system uses the existing database schema with full utilization of flexible JSON columns: ### FeedFetchLog Enhancements ```python class FeedFetchLog(SQLModel, table=True): # ... existing fields ... # Enhanced usage of extra_data extra_data: Optional[Dict[str, Any]] = Field( default=None, sa_column=Column(JSON) ) # Now stores: # - Complete HTTP headers # - Detailed error information # - Item statistics # - Quality scores # - Extension metadata ``` ### FeedItem Enhancements ```python class FeedItem(SQLModel, table=True): # ... existing fields ... # Enhanced usage of extra_data extra_data: Optional[Dict[str, Any]] = Field( default=None, sa_column=Column(JSON) ) # Now stores: # - Extension metadata (iTunes, Media RSS, etc.) # - Multiple categories # - Enclosure metadata # - Author details ``` **No migration required** \- The system leverages existing flexible JSON columns for maximum compatibility. ## Dependencies ### New Dependencies Added ### Core Library Dependencies **File:** `packages/ai_web_feeds/pyproject.toml` ```toml dependencies = [ # ... existing ... "beautifulsoup4>=4.12.0", # NEW: HTML parsing ] ``` **Purpose:** * HTML parsing for feed discovery * Extracting feed URLs from web pages * Parsing HTML content in feed items ### CLI Tool Dependencies **File:** `apps/cli/pyproject.toml` ```toml dependencies = [ # ... existing ... "rich>=13.7.0", # NEW: Rich terminal output ] ``` **Purpose:** * Beautiful terminal tables * Progress bars and spinners * Colored output and styling * Markdown rendering in terminal ## Performance Considerations ### Conditional Requests Reduce bandwidth and processing for unchanged feeds: ```python # Store from previous fetch etag = fetch_log.etag last_modified = fetch_log.last_modified # Use in next fetch new_log, metadata, items = await fetcher.fetch_feed( url=feed_url, etag=etag, last_modified=last_modified ) # Server returns 304 Not Modified if unchanged if new_log.status_code == 304: # No processing needed return ``` ### Retry Logic Exponential backoff for reliability: ```python from tenacity import ( retry, stop_after_attempt, wait_exponential ) @retry( stop=stop_after_attempt(3), # Max 3 attempts wait=wait_exponential( multiplier=1, min=2, # Wait 2s after first failure max=10 # Wait max 10s ) ) async def fetch_with_retry(url): # Automatic retry on failure pass ``` ### Timeouts Prevent hanging on slow feeds: ```python # Configurable timeout (default 30s) fetcher = AdvancedFeedFetcher(timeout=30.0) # Per-request timeout fetch_log, metadata, items = await fetcher.fetch_feed( url=feed_url, timeout=60.0 # Override for slow feed ) ``` ## Best Practices ### Use Conditional Requests Always pass `etag` and `last_modified` from previous fetches to reduce bandwidth: ```python # Save from previous fetch session.add(fetch_log) # Use in next fetch new_log = await fetcher.fetch_feed( url=url, etag=fetch_log.etag, last_modified=fetch_log.last_modified ) ``` ### Respect TTL Values Honor feed TTL (Time To Live) for update frequency: ```python if metadata.ttl: # Wait TTL minutes before next fetch next_fetch = datetime.now() + timedelta(minutes=metadata.ttl) ``` ### Monitor Health Regularly Check feed health scores to identify issues: ```bash # Daily health check ai-web-feeds analytics health openai-blog # Weekly full report ai-web-feeds analytics report --output weekly-report.json ``` ### Track Trends Use analytics to identify patterns: ```bash # Monthly trend analysis ai-web-feeds analytics trends --days 30 # Quality monitoring ai-web-feeds analytics quality ``` ### Generate Periodic Reports Export analytics for monitoring: ```bash # Weekly reports ai-web-feeds analytics report --output reports/week-$(date +%U).json # Archive for historical analysis ``` ## Installation ### Quick Setup Script Use the automated setup script: ```bash # Make executable chmod +x setup-enhanced-features.sh # Run setup ./setup-enhanced-features.sh ``` The script will: 1. Install core library with dependencies 2. Install CLI tool with dependencies 3. Verify installation 4. Display next steps ### Manual Installation Install each component separately: ```bash # 1. Install core library cd packages/ai_web_feeds pip install -e . # 2. Install CLI tool cd ../../apps/cli pip install -e . # 3. Verify installation ai-web-feeds --version ai-web-feeds fetch --help ai-web-feeds analytics --help ``` ## Code Organization ``` packages/ai_web_feeds/src/ai_web_feeds/ ├── fetcher.py # AdvancedFeedFetcher class │ ├── FeedMetadata # Metadata container (100+ fields) │ ├── fetch_feed() # Main fetch method │ ├── _extract_*() # Extraction helpers │ └── _calculate_*() # Quality scoring │ ├── analytics.py # FeedAnalytics class │ ├── get_overview_stats() │ ├── get_*_distribution() │ ├── get_quality_metrics() │ ├── get_fetch_performance_stats() │ ├── get_content_statistics() │ ├── get_publishing_trends() │ ├── get_feed_health_report() │ ├── get_top_contributors() │ └── generate_full_report() │ apps/cli/ai_web_feeds/cli/commands/ ├── fetch.py # Fetch CLI commands │ ├── fetch_one() # Single feed fetch │ └── fetch_all() # Batch fetch │ └── analytics.py # Analytics CLI commands ├── show_overview() ├── show_distributions() ├── show_quality() ├── show_performance() ├── show_content() ├── show_trends() ├── show_health() ├── show_contributors() └── generate_report() ``` ## Future Enhancements Potential additions for future versions: * [ ] Web UI dashboard with real-time metrics * [ ] Machine learning for content classification * [ ] Real-time monitoring with webhooks * [ ] GraphQL API for analytics * [ ] Advanced deduplication algorithms * [ ] Content similarity analysis * [ ] Multi-language NLP support * [ ] Anomaly detection in publishing patterns * [ ] Automated quality recommendations ## Support For technical questions or issues: 1. Review this documentation 2. Check inline code documentation 3. Explore CLI help: `ai-web-feeds --help` 4. Open an issue on GitHub ## Related Documentation * [Feature Overview](/docs/features/overview) - High-level feature list * [Getting Started](/docs/guides/getting-started) - Setup and quickstart * [Analytics Guide](/docs/guides/analytics) - Analytics usage guide -------------------------------------------------------------------------------- END OF PAGE 18 -------------------------------------------------------------------------------- ================================================================================ PAGE 19 OF 57 ================================================================================ TITLE: Overview URL: https://ai-web-feeds.w4w.dev/docs/development MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development.mdx DESCRIPTION: AI Web Feeds development architecture and implementation PATH: /development -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Overview (/docs/development) # Development Overview AI Web Feeds is a comprehensive system for managing AI/ML feed sources with database persistence, enrichment, and OPML generation. ## What We Built A production-ready system with the following capabilities: ### 1. Database Layer (`aiwebfeeds.db`) **Technology:** SQLModel + SQLAlchemy + Alembic **Tables:** * `feed_sources` - Core feed metadata * `feed_items` - Individual feed entries * `feed_fetch_logs` - Fetch attempt tracking * `topics` - Topic taxonomy **Features:** * Full CRUD operations * Relationship management * Migration support via Alembic * JSON field support for flexible data ### 2. Feed Enrichment Pipeline (`feeds.enriched.yaml`) **Capabilities:** * Automatic feed URL discovery from site URLs * Feed format detection (RSS/Atom/JSONFeed) * Metadata validation and enrichment * Quality scoring and curation tracking **Input:** `data/feeds.yaml` (human-curated) **Output:** `data/feeds.enriched.yaml` (fully enriched with automation data) ### 3. Schema Management (`feeds.enriched.schema.json`) **Features:** * Auto-generated JSON Schema for enriched feeds * Comprehensive validation rules * Extends base `feeds.schema.json` * Supports all enrichment metadata ### 4. OPML Generation **Formats:** * **all.opml** - Flat list of all feeds * **categorized.opml** - Organized by source type * **Custom filtered** - By topic, type, tag, verification status **Use Case:** Import into feed readers (Feedly, Inoreader, NetNewsWire, etc.) ### 5. CLI Interface **Commands:** ```bash aiwebfeeds enrich all # Enrich feeds aiwebfeeds opml all # Generate all.opml aiwebfeeds opml categorized # Generate categorized.opml aiwebfeeds opml filtered # Generate custom filtered OPML aiwebfeeds stats show # Display statistics ``` ## Package Structure ``` ai-web-feeds (workspace root) ├── packages/ai_web_feeds/ # Core library │ └── src/ai_web_feeds/ │ ├── models.py # SQLModel tables + Pydantic models │ ├── storage.py # Database manager │ ├── utils.py # Enrichment, OPML, schema utils │ ├── config.py # Configuration │ └── logger.py # Logging setup │ └── apps/cli/ # CLI application └── ai_web_feeds/cli/ ├── __init__.py # Main CLI app └── commands/ ├── enrich.py # Enrichment commands ├── opml.py # OPML generation ├── stats.py # Statistics ├── export.py # Export (stub) └── validate.py # Validation (stub) ``` ## Data Flow ``` feeds.yaml (human-curated) ↓ ├─→ Feed Discovery (if discover: true) ├─→ Format Detection ├─→ Metadata Validation └─→ Enrichment ↓ ├─→ feeds.enriched.yaml (YAML export) ├─→ feeds.enriched.schema.json (JSON schema) └─→ aiwebfeeds.db (SQLite database) ↓ ├─→ all.opml (all feeds) ├─→ categorized.opml (by type) └─→ filtered.opml (custom filters) ``` ## Next Steps * [Database Setup](/docs/development/database) - Learn about the database layer * [CLI Usage](/docs/development/cli) - Using the command-line interface * [Python API](/docs/development/python-api) - Using the Python API * [Contributing](/docs/development/contributing) - How to contribute -------------------------------------------------------------------------------- END OF PAGE 19 -------------------------------------------------------------------------------- ================================================================================ PAGE 20 OF 57 ================================================================================ TITLE: Pre-commit Hook Fixes URL: https://ai-web-feeds.w4w.dev/docs/development/pre-commit-fixes MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/pre-commit-fixes.mdx DESCRIPTION: Comprehensive guide to pre-commit hook issues and their resolutions in the AI Web Feeds project PATH: /development/pre-commit-fixes -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Pre-commit Hook Fixes (/docs/development/pre-commit-fixes) # Pre-commit Hook Fixes This document tracks the systematic resolution of pre-commit hook failures encountered during development. ## Overview The project uses a comprehensive pre-commit framework with 15+ hooks for code quality, security, and consistency. This guide documents the fixes applied to address failures across YAML linting, code style, type checking, and dependency management. ## Fixed Issues ### 1. YAML Syntax Errors **Problem**: `data/topics.yaml` had 20+ instances of unquoted colons in array values: ```yaml # ❌ INVALID - Colon in array value must be quoted tags: [embed:title, summary, content] # ✅ VALID - Properly quoted tags: ["embed:title", summary, content] ``` **Solution**: Used bulk edit with `sed` to fix all occurrences: ```bash sed -i '' 's/tags: \[embed:title,/tags: ["embed:title",/g' data/topics.yaml ``` **Affected Hooks**: `check-yaml`, `yamllint` ### 2. Codespell False Positives **Problem**: Spell checker flagged legitimate technical terms and regex patterns from code. **Solution**: Extended codespell ignore list in `.pre-commit-config.yaml` to include technical terms that appear in regex patterns, mathematical notation, and library names: ```yaml - repo: https://github.com/codespell-project/codespell hooks: - id: codespell args: - --ignore-words-list=crate,nd,sav,ba,als,datas,socio,ser,oint,asent ``` **Affected Hooks**: `codespell` ### 3. Missing Dependencies **Problem**: `data/validate_data_assets.py` script failed with `ModuleNotFoundError: No module named 'yaml'` **Solution**: Added project dependencies to `data/pyproject.toml`: ```toml [project] name = "data-validation" version = "0.1.0" requires-python = ">=3.13" dependencies = [ "pyyaml>=6.0.3", "jsonschema>=4.23.0", ] ``` **Affected Hooks**: `validate-data-assets` ### 4. Ruff Complexity Warnings **Problem**: 126 ruff errors related to legitimate algorithmic complexity: * `PLR0911`: Too many return statements * `PLR0912`: Too many branches * `PLR0915`: Too many statements * `PLR2004`: Magic values in comparisons * `C901`: Function too complex **Solution**: Added targeted per-file-ignores in `packages/ai_web_feeds/pyproject.toml`: ```toml [tool.ruff.lint.per-file-ignores] # Utils: Complex URL generation logic for multiple platforms "src/ai_web_feeds/utils.py" = ["PLR0911", "PLR0912", "PLR0915", "PLR2004", "C901"] # Storage: Database query functions with many parameters "src/ai_web_feeds/storage.py" = ["PLR0913", "PLR0915"] # Models: Pydantic models with many fields "src/ai_web_feeds/models.py" = ["PLR0913"] # Search, recommendations, NLP: ML algorithms need complex logic "src/ai_web_feeds/search.py" = ["PLR0912", "PLR0913"] "src/ai_web_feeds/recommendations.py" = ["PLR0912", "PLR0913"] "src/ai_web_feeds/nlp.py" = ["PLR0912", "PLR0913"] ``` **Rationale**: These warnings represent legitimate complexity in: * RSS/RSSHub URL generation for 10+ platforms (Reddit, Twitter, Medium, etc.) * Machine learning model inference pipelines * Database query builders with multiple filter options * Feed validation with comprehensive rule sets **Affected Hooks**: `ruff` ## Pre-commit Configuration ### Enabled Hooks The project uses the following hook categories: 1. **File Format Checks**: * `check-yaml`: YAML syntax validation * `yamllint`: YAML style enforcement * `check-json`: JSON syntax validation * `check-toml`: TOML syntax validation 2. **Code Quality**: * `ruff`: Python linting and formatting * `mypy`: Python type checking * `codespell`: Spell checking 3. **Security**: * `detect-secrets`: Secret detection * `bandit`: Security vulnerability scanning 4. **Custom Validation**: * `validate-data-assets`: Schema validation for feed data ### Running Hooks ```bash # Run all hooks on all files pre-commit run --all-files # Run specific hook pre-commit run ruff --all-files # Run hooks on staged files only pre-commit run # Skip hooks temporarily (use sparingly!) git commit --no-verify ``` ## Best Practices ### When to Use `--no-verify` Only bypass pre-commit hooks when: 1. Making urgent hotfixes that will be cleaned up immediately 2. Committing work-in-progress on a feature branch for backup 3. The hook is known to have false positives being addressed **Always** run hooks before merging to main: ```bash # Before merging feature branch pre-commit run --all-files git push ``` ### Adding New Ignores When adding per-file-ignores to ruff configuration: 1. **Document the reason**: Add comments explaining why the ignore is legitimate 2. **Be specific**: Target exact files/patterns, not broad wildcards 3. **Consider alternatives**: Can the code be refactored instead? Example: ```toml # ✅ GOOD - Specific file with documented reason "src/ai_web_feeds/utils.py" = ["PLR0911"] # URL generation needs many return paths # ❌ BAD - Too broad, no justification "src/**/*.py" = ["PLR0911"] ``` ### YAML Quoting Rules Special characters in YAML flow sequences require quoting: ```yaml # Characters that need quoting: : { } [ ] , & * # ? | - < > = ! % @ \ # ✅ Correctly quoted tags: ["embed:title", "feat:search", content] # ❌ Missing quotes tags: [embed:title, feat:search, content] ``` ## Remaining Work ### Pending Fixes 1. **Mypy Type Errors** (150 errors across 21 files): * Missing type annotations in decorators * Untyped `__init__` methods * Missing imports (uuid, timedelta) * Attribute access on optional types 2. **Bandit Security Warnings** (9 warnings): * Some are false positives (XML parsing for OPML generation) * Others need review and potential `# nosec` comments ### Incremental Approach For large codebases, fix pre-commit issues incrementally: 1. **Critical blockers first**: YAML syntax, missing dependencies 2. **Quick wins**: Codespell false positives, formatting 3. **Complexity warnings**: Add ignores for legitimate cases 4. **Type checking**: Systematic file-by-file fixes 5. **Security**: Review and address or document each warning ## Related Documentation * [Testing Guide](/docs/development/testing): Test suite maintenance * [CLI Workflows](/docs/development/cli-workflows): Development commands * [Architecture](/docs/development/architecture): System design context ## Commit History Key commits addressing pre-commit hooks: ```bash # View recent linting fixes git log --oneline --grep="lint\|fix\|ruff\|pre-commit" -10 # See specific changes git show ``` ## References * [Pre-commit Framework](https://pre-commit.com/) * [Ruff Documentation](https://docs.astral.sh/ruff/) * [YAML Specification](https://yaml.org/spec/1.2/spec.html) * [Conventional Commits](https://www.conventionalcommits.org/) -------------------------------------------------------------------------------- END OF PAGE 20 -------------------------------------------------------------------------------- ================================================================================ PAGE 21 OF 57 ================================================================================ TITLE: Python API URL: https://ai-web-feeds.w4w.dev/docs/development/python-api MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/python-api.mdx DESCRIPTION: Using AI Web Feeds as a Python library PATH: /development/python-api -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Python API (/docs/development/python-api) # Python API AI Web Feeds can be used as a Python library for custom integrations and automation. ## Installation ```bash uv pip install -e packages/ai_web_feeds ``` ## Feed Enrichment ### Basic Enrichment ```python import asyncio from ai_web_feeds.utils import enrich_feed_source feed_data = { "id": "example-blog", "site": "https://example.com", "title": "Example Blog", "discover": True, # Enable feed discovery "topics": ["ml", "nlp"], } # Enrich the feed enriched = asyncio.run(enrich_feed_source(feed_data)) # enriched now contains: # - Discovered feed URL (if found) # - Detected feed format # - Validation timestamp # - etc. ``` ### Feed Discovery ```python from ai_web_feeds.utils import discover_feed_url # Discover feed URL from a website feed_url = asyncio.run(discover_feed_url("https://example.com")) if feed_url: print(f"Discovered feed: {feed_url}") ``` ### Format Detection ```python from ai_web_feeds.utils import detect_feed_format # Detect feed format format = asyncio.run(detect_feed_format("https://example.com/feed.xml")) print(f"Feed format: {format}") # rss, atom, jsonfeed, or unknown ``` ## OPML Generation ### Generate All Feeds OPML ```python from ai_web_feeds.storage import DatabaseManager from ai_web_feeds.utils import generate_opml, save_opml # Get feeds from database db = DatabaseManager("sqlite:///data/aiwebfeeds.db") feeds = db.get_all_feed_sources() # Generate OPML opml_xml = generate_opml(feeds, title="AI Web Feeds - All") save_opml(opml_xml, "data/all.opml") ``` ### Generate Categorized OPML ```python from ai_web_feeds.utils import generate_categorized_opml # Generate categorized OPML (by source type) opml_xml = generate_categorized_opml(feeds, title="AI Web Feeds - By Type") save_opml(opml_xml, "data/categorized.opml") ``` ### Generate Filtered OPML ```python from ai_web_feeds.utils import generate_filtered_opml # Define custom filter def nlp_filter(feed): return "nlp" in feed.topics and feed.verified # Generate filtered OPML opml_xml = generate_filtered_opml( feeds, title="AI Web Feeds - NLP (Verified)", filter_fn=nlp_filter, ) save_opml(opml_xml, "data/nlp-verified.opml") ``` ## Schema Generation ```python from ai_web_feeds.utils import generate_enriched_schema, save_json_schema # Generate the enriched schema schema = generate_enriched_schema() # Save to file save_json_schema(schema, "data/feeds.enriched.schema.json") ``` ## YAML Operations ### Load Feeds ```python from ai_web_feeds.utils import load_feeds_yaml # Load feeds from YAML feeds_data = load_feeds_yaml("data/feeds.yaml") sources = feeds_data.get("sources", []) ``` ### Save Enriched Feeds ```python from ai_web_feeds.utils import save_feeds_yaml enriched_data = { "schema_version": "feeds-enriched-1.0.0", "document_meta": { "enriched_at": datetime.utcnow().isoformat(), "total_sources": len(sources), }, "sources": enriched_sources, } save_feeds_yaml(enriched_data, "data/feeds.enriched.yaml") ``` ## Database Operations ### Initialize Database ```python from ai_web_feeds.storage import DatabaseManager db = DatabaseManager("sqlite:///data/aiwebfeeds.db") db.create_db_and_tables() ``` ### Add Feed Sources ```python from ai_web_feeds.models import FeedSource, SourceType feed = FeedSource( id="example-blog", feed="https://example.com/feed.xml", site="https://example.com", title="Example Blog", source_type=SourceType.BLOG, topics=["ml", "nlp"], topic_weights={"ml": 0.9, "nlp": 0.8}, verified=True, ) db.add_feed_source(feed) ``` ### Query Data ```python # Get all feed sources all_feeds = db.get_all_feed_sources() # Get specific feed feed = db.get_feed_source("example-blog") # Get all topics topics = db.get_all_topics() ``` ### Bulk Operations ```python # Bulk insert feed sources db.bulk_insert_feed_sources(feed_sources) # Bulk insert topics db.bulk_insert_topics(topics) ``` ## Complete Example ```python import asyncio from datetime import datetime from pathlib import Path from ai_web_feeds.storage import DatabaseManager from ai_web_feeds.utils import ( enrich_feed_source, generate_categorized_opml, generate_enriched_schema, generate_opml, load_feeds_yaml, save_feeds_yaml, save_json_schema, save_opml, ) async def main(): # 1. Load feeds feeds_data = load_feeds_yaml("data/feeds.yaml") sources = feeds_data.get("sources", []) # 2. Enrich each source enriched_sources = [] for source in sources: enriched = await enrich_feed_source(source) enriched_sources.append(enriched) # 3. Save enriched YAML enriched_data = { "schema_version": "feeds-enriched-1.0.0", "document_meta": { "enriched_at": datetime.utcnow().isoformat(), "total_sources": len(enriched_sources), }, "sources": enriched_sources, } save_feeds_yaml(enriched_data, "data/feeds.enriched.yaml") # 4. Generate and save schema schema = generate_enriched_schema() save_json_schema(schema, "data/feeds.enriched.schema.json") # 5. Save to database db = DatabaseManager("sqlite:///data/aiwebfeeds.db") db.create_db_and_tables() from ai_web_feeds.models import FeedSource for source_data in enriched_sources: feed = FeedSource( id=source_data["id"], feed=source_data.get("feed"), site=source_data.get("site"), title=source_data["title"], # ... other fields ) db.add_feed_source(feed) # 6. Generate OPML files feeds = db.get_all_feed_sources() # All feeds opml_all = generate_opml(feeds, "AI Web Feeds - All") save_opml(opml_all, "data/all.opml") # Categorized opml_cat = generate_categorized_opml(feeds, "AI Web Feeds - Categorized") save_opml(opml_cat, "data/categorized.opml") print("✓ Complete!") if __name__ == "__main__": asyncio.run(main()) ``` ## Error Handling ```python from loguru import logger from tenacity import retry, stop_after_attempt, wait_exponential @retry( stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10) ) async def safe_enrich(source): try: return await enrich_feed_source(source) except Exception as e: logger.error(f"Failed to enrich {source.get('id')}: {e}") return source # Return original on error ``` ## Configuration ```python from ai_web_feeds.config import Settings # Load settings from environment settings = Settings() # Access logging config log_level = settings.logging.level log_file = settings.logging.file_path # Custom settings custom_settings = Settings( logging__level="DEBUG", logging__file=True, ) ``` -------------------------------------------------------------------------------- END OF PAGE 21 -------------------------------------------------------------------------------- ================================================================================ PAGE 22 OF 57 ================================================================================ TITLE: Python API Documentation URL: https://ai-web-feeds.w4w.dev/docs/development/python-autodoc MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/python-autodoc.mdx DESCRIPTION: Automated API documentation generation from Python docstrings PATH: /development/python-autodoc -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Python API Documentation (/docs/development/python-autodoc) # Python API Documentation AIWebFeeds uses [fumadocs-python](https://fumadocs.dev/docs/ui/python) to automatically generate API documentation from Python docstrings. This integration extracts docstrings from the `ai_web_feeds` Python package and generates interactive MDX documentation pages. ## Overview The documentation workflow: 1. **Python docstrings** → Written in code with proper type hints 2. **JSON generation** → `fumapy-generate` extracts documentation 3. **MDX conversion** → Script converts JSON to MDX files 4. **Web display** → FumaDocs renders interactive API docs ## Prerequisites ### 1. Install Dependencies ```bash # Install Node.js dependencies cd apps/web pnpm install # Install Python dependencies (from workspace root) cd ../.. uv sync --dev ``` ### 2. Install fumadocs-python CLI ```bash pip install fumadocs-python ``` Or using uv: ```bash uv pip install fumadocs-python ``` ## Generating Documentation ### Step 1: Generate JSON From the workspace root: ```bash # Generate documentation JSON for ai_web_feeds package fumapy-generate ai_web_feeds # This creates ai_web_feeds.json in the current directory ``` Move the generated JSON to the web app: ```bash mv ai_web_feeds.json apps/web/ ``` ### Step 2: Convert to MDX From `apps/web`: ```bash pnpm generate:docs ``` This script: * Reads `ai_web_feeds.json` * Cleans previous output in `content/docs/api/` * Converts JSON to MDX format * Writes MDX files with proper frontmatter ### Step 3: View Documentation Start the dev server: ```bash pnpm dev ``` Visit: [http://localhost:3000/docs/api](http://localhost:3000/docs/api) ## Writing Good Docstrings fumadocs-python supports standard Python docstring formats. Use type hints and detailed descriptions: ````python from typing import List, Optional from pydantic import BaseModel class Feed(BaseModel): """ Represents an RSS/Atom feed. Attributes: url: The feed URL title: Feed title category: Optional category classification """ url: str title: str category: Optional[str] = None def fetch_feed(url: str, timeout: int = 30) -> Feed: """ Fetch and parse an RSS/Atom feed. Args: url: The feed URL to fetch timeout: Request timeout in seconds (default: 30) Returns: Parsed Feed object Raises: HTTPError: If the request fails ParseError: If the feed cannot be parsed Examples: ```python feed = fetch_feed("https://example.com/feed.xml") print(feed.title) ``` """ # Implementation here pass ```` ## MDX Syntax Compatibility Docstrings are converted to **MDX** , not Markdown. Ensure syntax compatibility: ### ✅ Valid MDX ```python """ This is a **bold** statement. - List item 1 - List item 2 Code example: \`\`\`python x = 1 \`\`\` """ ``` ### ❌ Invalid MDX ```python """ Don't use directly Use HTML entities: <angle brackets> """ ``` ## Project Structure ``` apps/web/ ├── scripts/ │ └── generate-python-docs.mjs # Conversion script ├── content/docs/api/ # Generated API docs (auto) │ ├── index.mdx │ └── [module]/ │ └── [class].mdx ├── ai_web_feeds.json # Generated JSON (temp) └── package.json # Contains generate:docs script ``` ## Configuration ### Custom Output Directory Edit `scripts/generate-python-docs.mjs`: ```js const OUTPUT_DIR = path.join(process.cwd(), "content/docs/your-path"); const BASE_URL = "/docs/your-path"; ``` ### Custom Package Name ```js const PACKAGE_NAME = "your_package_name"; ``` ## Automation ### Makefile Target Add to workspace `Makefile`: ```makefile .PHONY: docs-api docs-api: @echo "Generating Python API docs..." fumapy-generate ai_web_feeds mv ai_web_feeds.json apps/web/ cd apps/web && pnpm generate:docs @echo "✅ API docs generated!" ``` Usage: ```bash make docs-api ``` ### Pre-build Hook Add to `apps/web/package.json`: ```json { "scripts": { "prebuild": "pnpm generate:docs || true" } } ``` ## Components The integration adds these MDX components: * **Class documentation**: Renders class signatures and methods * **Function documentation**: Shows parameters, return types, examples * **Type annotations**: Interactive type information * **Code examples**: Syntax-highlighted examples from docstrings Import in MDX: ```mdx import { PythonClass, PythonFunction } from "fumadocs-python/components"; ; ``` ## Styling Styles are imported in `app/global.css`: ```css @import "fumadocs-python/preset.css"; ``` Customize styles in your Tailwind config or override CSS variables. ## Troubleshooting ### JSON file not found **Error**: `❌ JSON file not found: ai_web_feeds.json` **Solution**: ```bash fumapy-generate ai_web_feeds mv ai_web_feeds.json apps/web/ ``` ### Module not found **Error**: `Cannot find module 'fumadocs-python'` **Solution**: ```bash cd apps/web pnpm install ``` ### MDX syntax errors **Error**: Build fails with MDX parsing errors **Solution**: * Escape special characters in docstrings * Use HTML entities for `<>` brackets * Validate MDX syntax before generation ### Empty API docs **Issue**: No content in generated docs **Check**: 1. Are your Python files properly documented? 2. Is the package installed? (`pip install -e packages/ai_web_feeds`) 3. Are docstrings using standard format? ## Best Practices 1. **Type hints**: Always use type annotations 2. **Examples**: Include usage examples in docstrings 3. **Completeness**: Document all public APIs 4. **Consistency**: Use consistent docstring format 5. **Regenerate**: Run `pnpm generate:docs` after docstring changes 6. **Version control**: Don't commit `ai_web_feeds.json` or `content/docs/api/` (add to `.gitignore`) ## Related * [FumaDocs Python Integration](https://fumadocs.dev/docs/ui/python) * [Python Docstring Conventions (PEP 257)](https://peps.python.org/pep-0257/) * [Type Hints (PEP 484)](https://peps.python.org/pep-0484/) * [Contributing Guide](/docs/contributing) ## Next Steps -------------------------------------------------------------------------------- END OF PAGE 22 -------------------------------------------------------------------------------- ================================================================================ PAGE 23 OF 57 ================================================================================ TITLE: Database & Storage Refactoring Summary URL: https://ai-web-feeds.w4w.dev/docs/development/refactoring-summary MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/refactoring-summary.mdx DESCRIPTION: Complete refactoring of database/storage logic to include comprehensive data, metadata, and enrichments PATH: /development/refactoring-summary -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Database & Storage Refactoring Summary (/docs/development/refactoring-summary) ## Overview Successfully refactored the AIWebFeeds database and storage system to comprehensively store **all possible data, metadata, and enrichments** while maintaining the simplified 8-module architecture. ## Refactoring Goals ✅ COMPLETED 1. **Simplify Package Structure**: 8 core modules (load, validate, enrich, export, logger, models, storage, utils) 2. **Linear Pipeline Flow**: feeds.yaml → load → validate → enrich → validate → export + store + log 3. **Comprehensive Data Storage**: Store ALL enrichment data, validation results, and analytics 4. **Database Enhancement**: Add new models for complete data persistence ## Architecture Changes ### Core Modules Structure ``` packages/ai_web_feeds/src/ai_web_feeds/ ├── load.py # YAML I/O for feeds and topics ├── validate.py # Schema validation and data quality checks ├── enrich.py # Feed enrichment orchestration ├── export.py # Multi-format export (JSON, OPML) ├── logger.py # Logging configuration ├── models.py # SQLModel data models (7 tables) ├── storage.py # Database operations with comprehensive methods ├── utils.py # Shared utilities ├── enrichment.py # Advanced enrichment service (supporting module) └── __init__.py # Simplified exports ``` ### New Database Models Added 3 comprehensive new models to store ALL enrichment data: #### 1. FeedEnrichmentData (30+ fields) ```python class FeedEnrichmentData(SQLModel, table=True): # Basic metadata discovered_title: str | None discovered_description: str | None discovered_language: str | None discovered_author: str | None # Visual assets icon_url: str | None logo_url: str | None image_url: str | None favicon_url: str | None banner_url: str | None # Quality scores (5 different scores) health_score: float | None # 0-1 quality_score: float | None # 0-1 completeness_score: float | None # 0-1 reliability_score: float | None # 0-1 freshness_score: float | None # 0-1 # Content analysis entry_count: int | None has_full_content: bool avg_content_length: float | None content_types: list[str] content_samples: list[str] # Update patterns estimated_frequency: str | None last_updated: datetime | None update_regularity: float | None update_intervals: list[int] # Performance metrics response_time_ms: float | None availability_score: float | None uptime_percentage: float | None # Topic suggestions suggested_topics: list[str] topic_confidence: dict[str, float] auto_keywords: list[str] # Feed extensions has_itunes: bool has_media_rss: bool has_dublin_core: bool has_geo: bool extension_data: dict # SEO and social seo_title: str | None seo_description: str | None og_image: str | None twitter_card: str | None social_metadata: dict # Technical details encoding: str | None generator: str | None ttl: int | None cloud: dict # Link analysis internal_links: int | None external_links: int | None broken_links: int | None redirect_chains: list[str] # Security uses_https: bool has_valid_ssl: bool security_headers: dict # Flexible storage structured_data: dict # Schema.org, JSON-LD raw_metadata: dict # Original feed metadata extra_data: dict # Complete enrichment output ``` #### 2. FeedValidationResult ```python class FeedValidationResult(SQLModel, table=True): # Overall status is_valid: bool validation_level: str # strict, moderate, lenient # Schema validation schema_valid: bool schema_errors: list[str] # Accessibility is_accessible: bool http_status: int | None redirect_count: int | None # Content validation has_items: bool item_count: int | None missing_fields: list[str] # Link validation links_checked: int | None links_valid: int | None broken_link_urls: list[str] # Security checks https_enabled: bool ssl_valid: bool security_issues: list[str] # Full validation report validation_report: dict ``` #### 3. FeedAnalytics ```python class FeedAnalytics(SQLModel, table=True): # Time period period_start: datetime period_end: datetime period_type: str # daily, weekly, monthly, yearly # Volume metrics total_items: int new_items: int updated_items: int # Update frequency update_count: int avg_update_interval_hours: float | None # Content metrics avg_content_length: float | None has_images_count: int has_video_count: int # Quality metrics items_with_full_content: int items_with_summary_only: int # Performance avg_response_time_ms: float | None uptime_percentage: float | None # Distribution topic_distribution: dict[str, int] keyword_frequency: dict[str, int] ``` ### Enhanced Storage Operations Added comprehensive storage methods to `DatabaseManager`: ```python # Enrichment data persistence db.add_enrichment_data(enrichment) enrichment = db.get_enrichment_data(feed_id) all_enrichments = db.get_all_enrichment_data(feed_id) db.delete_old_enrichments(feed_id, keep_count=5) # Validation results db.add_validation_result(validation) result = db.get_validation_result(feed_id) failed = db.get_failed_validations() # Analytics db.add_analytics(analytics) analytics = db.get_analytics(feed_id, period_type="daily", limit=30) all_analytics = db.get_all_analytics(period_type="monthly") # Comprehensive queries complete_data = db.get_feed_complete_data(feed_id) health_summary = db.get_health_summary() ``` ## Pipeline Flow Enhancement ### Before (Limited Storage) ``` feeds.yaml → load → validate → enrich → export ↓ (enrichment data lost) ``` ### After (Comprehensive Storage) ``` feeds.yaml → load → validate → enrich → validate → export + store ↓ ↓ ↓ FeedValidation FeedEnrichment FeedSource Result Data FeedAnalytics (stored) (30+ fields (stored) stored) ``` ### CLI Integration The process command now automatically persists enrichment data: ```bash aiwebfeeds process \ --input data/feeds.yaml \ --output data/feeds.enriched.yaml \ --database sqlite:///data/aiwebfeeds.db # Now stores to database: # ✅ FeedSource records (from YAML) # ✅ FeedEnrichmentData (ALL enrichment metadata) # ✅ FeedValidationResult (validation checks) # ✅ FeedAnalytics (metrics and performance) ``` ## Data Completeness ### What's Now Stored **Previously**: Only basic `quality_score` in FeedSource table **Now**: Complete enrichment data including: * ✅ **5 Quality Scores**: health, quality, completeness, reliability, freshness * ✅ **Visual Assets**: icon, logo, image, favicon, banner URLs * ✅ **Content Analysis**: entry count, content types, samples, avg length * ✅ **Update Patterns**: frequency estimation, regularity, intervals * ✅ **Performance Metrics**: response times, availability, uptime * ✅ **Topic Intelligence**: suggested topics, confidence scores, keywords * ✅ **Feed Extensions**: iTunes, MediaRSS, Dublin Core, Geo detection * ✅ **SEO/Social**: Open Graph, Twitter Cards, structured data * ✅ **Security**: HTTPS usage, SSL validation, security headers * ✅ **Link Analysis**: internal/external/broken link counts * ✅ **Technical Details**: encoding, generator, TTL, cloud settings * ✅ **Flexible Storage**: raw metadata, structured data, extra fields ### Health Monitoring New comprehensive health summary: ```python summary = db.get_health_summary() # { # "total_feeds": 150, # "feeds_with_health_data": 145, # "avg_health_score": 0.82, # "avg_quality_score": 0.78, # "feeds_healthy": 120, # >= 0.7 # "feeds_warning": 20, # 0.4-0.7 # "feeds_critical": 5 # < 0.4 # } ``` ## Key Improvements ### 1. Zero Data Loss * **Before**: Enrichment data discarded after export * **After**: ALL enrichment metadata persisted with history ### 2. Comprehensive Analytics * **Before**: No analytics storage * **After**: Time-series analytics with metrics tracking ### 3. Validation Tracking * **Before**: Validation results not stored * **After**: Complete validation history with detailed reports ### 4. Performance Monitoring * **Before**: No performance tracking * **After**: Response times, uptime, availability metrics ### 5. Flexible Schema * **Before**: Fixed schema limitations * **After**: JSON fields for evolving data structures ## Migration Strategy ### Backwards Compatibility * ✅ Existing FeedSource table unchanged * ✅ New models additive (no breaking changes) * ✅ JSON columns for flexible data evolution * ✅ Version tracking for schema migrations ### Database Evolution ```python # Old enrichment (limited) source.quality_score = 0.85 # New enrichment (comprehensive) enrichment = FeedEnrichmentData( health_score=0.92, quality_score=0.85, completeness_score=0.78, suggested_topics=["tech", "ai"], response_time_ms=245.6, has_itunes=True, # ... 25+ more fields ) ``` ## Testing & Validation ### Import Tests ✅ ```bash ✓ All models imported successfully ✓ Storage operations working ✓ CLI integration functional ✓ Database persistence verified ``` ### Data Integrity ✅ * Foreign key constraints enforced * Score ranges validated (0-1) * JSON schema validation * Transaction safety guaranteed ## Next Steps 1. **Performance Optimization**: Add database indexes for common queries 2. **Analytics Dashboard**: Build visualization for health metrics 3. **Migration Scripts**: Create upgrade scripts for existing data 4. **Monitoring**: Set up alerts for feed health degradation 5. **API Integration**: Expose comprehensive data via REST API ## Summary ✅ **COMPLETED**: Complete database/storage refactoring * 3 new comprehensive models (30+ enrichment fields) * Enhanced storage operations (15+ new methods) * Zero data loss pipeline integration * Comprehensive health monitoring * Backwards compatible migration strategy The AIWebFeeds system now stores **every possible piece of data, metadata, and enrichment information** while maintaining the clean 8-module architecture and linear pipeline flow. -------------------------------------------------------------------------------- END OF PAGE 23 -------------------------------------------------------------------------------- ================================================================================ PAGE 24 OF 57 ================================================================================ TITLE: Test Infrastructure URL: https://ai-web-feeds.w4w.dev/docs/development/testing MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/testing.mdx DESCRIPTION: Comprehensive test suite with pytest, uv, and advanced testing features PATH: /development/testing -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Test Infrastructure (/docs/development/testing) import { Callout } from "fumadocs-ui/components/callout"; import { Tab, Tabs } from "fumadocs-ui/components/tabs"; ## Overview AI Web Feeds includes a **production-ready test suite** with 100+ tests covering unit, integration, and end-to-end scenarios. The infrastructure uses modern tools for fast, reliable testing. All tests use **uv** for execution (10-100x faster than pip) and **pytest** with 9+ advanced plugins. ## Test Execution Architecture All test execution logic is centralized using **uv scripts** defined in the workspace root `pyproject.toml`. The scripts delegate to the CLI for consistent test execution across all environments. ### Execution Flow ``` uv scripts (workspace pyproject.toml) ↓ CLI Test Commands ↓ pytest (test execution) ``` **Alternative entry point for backward compatibility:** ``` tests/run_tests.py → uv scripts → CLI → pytest ``` ### Multiple Entry Points You can run tests using any of these methods: ```bash # Run all tests uv run test # Run unit tests uv run test-unit # Run unit tests (skip slow) uv run test-unit-fast # Run with coverage and open in browser uv run test-coverage-open # Quick test run uv run test-quick # Debug mode uv run test-debug # Watch mode uv run test-watch # List available scripts uv run --help ``` ```bash # Run all tests uv run aiwebfeeds test all # Run unit tests with options uv run aiwebfeeds test unit --fast # Run with coverage uv run aiwebfeeds test coverage --open # E2E tests only uv run aiwebfeeds test e2e # Get help uv run aiwebfeeds test --help ``` ```bash cd tests # Run all tests ./run_tests.py all # Run unit tests ./run_tests.py unit # Run with coverage ./run_tests.py coverage # Quick run ./run_tests.py quick # Get help ./run_tests.py help ``` ## Quick Reference ### Common Commands ```bash # Quick test (TDD workflow) uv run test-quick # Watch mode (auto-rerun) uv run test-watch # Unit tests only uv run test-unit-fast # With coverage uv run test-coverage-open ``` ```bash # Full test suite with coverage uv run test-coverage # All tests uv run test-all # E2E tests only uv run test-e2e # Integration tests uv run test-integration ``` ```bash # Debug mode (with pdb) uv run test-debug # Or use CLI directly with specific test uv run aiwebfeeds test file test_models.py -k "twitter" # Show local variables uv run aiwebfeeds test all --verbose ``` ## Test Suite Statistics * **11 test files** created * **35+ test classes** * **100+ individual tests** * **15+ reusable fixtures** * **2,500+ lines of test code** ## Test Structure Tests mirror the source code structure: ``` packages/ai_web_feeds/src/ai_web_feeds/ ├── models.py → tests/.../test_models.py ├── storage.py → tests/.../test_storage.py ├── fetcher.py → tests/.../test_fetcher.py ├── config.py → tests/.../test_config.py ├── utils.py → tests/.../test_utils.py └── analytics.py → tests/.../test_analytics.py ``` ### Test Categories #### Unit Tests (`@pytest.mark.unit`) Fast, isolated tests with no external dependencies: * **test\_models.py** - Model validation with property-based testing * **test\_storage.py** - Database CRUD operations * **test\_fetcher.py** - Feed fetching with mocking * **test\_config.py** - Configuration management * **test\_utils.py** - Utility functions (platform detection, URL generation) * **test\_analytics.py** - Analytics calculations * **test\_commands.py** - CLI command tests #### Integration Tests (`@pytest.mark.integration`) Multi-component workflows: * **test\_integration.py** - Database + Fetcher integration * **test\_cli\_integration.py** - CLI integration #### E2E Tests (`@pytest.mark.e2e`) Complete user workflows: * **test\_workflows.py** - Full workflows (onboarding, bulk operations, export) ## Advanced Features ### Property-Based Testing Using **Hypothesis** for robust input validation: ```python from hypothesis import given, strategies as st @given(st.text()) def test_sanitize_text_property_based(text): """Property-based test for text sanitization.""" result = sanitize_text(text) assert isinstance(result, str) ``` ### Test Fixtures Comprehensive fixtures in `conftest.py`: **Database Fixtures:** * `temp_db_path` - Temporary SQLite database * `db_engine` - Test database engine * `db_session` - Test database session **Model Fixtures:** * `sample_feed_source` - Single feed source * `sample_feed_items` - Multiple feed items (5) * `sample_topic` - Topic instance **Mock Fixtures:** * `mock_httpx_response` - Mocked HTTP response * `mock_feedparser_result` - Mocked feedparser **File Fixtures:** * `temp_yaml_file` - Temporary YAML * `sample_rss_feed` - Sample RSS XML * `sample_atom_feed` - Sample Atom XML ### Test Markers Available markers for filtering: | Marker | Description | | ------------- | ------------------------------------------- | | `unit` | Unit tests (fast, no external dependencies) | | `integration` | Integration tests (multiple components) | | `e2e` | End-to-end tests (full workflows) | | `slow` | Slow running tests | | `network` | Tests requiring network access | | `database` | Tests requiring database | ```bash # List all markers aiwebfeeds test markers # Run specific markers uv run --directory tests pytest -m "unit and not slow" ``` ### Coverage Reporting Generate comprehensive coverage reports: ```bash # HTML + terminal report aiwebfeeds test coverage # Open in browser aiwebfeeds test coverage --open # Coverage reports saved to: tests/reports/coverage/ ``` **Coverage Configuration:** ```toml [tool.coverage.run] source = ["ai_web_feeds"] branch = true omit = ["*/tests/*", "*/test_*.py"] [tool.coverage.report] precision = 2 show_missing = true exclude_lines = [ "pragma: no cover", "def __repr__", "if __name__ == .__main__.:", "if TYPE_CHECKING:", ] ``` ## Test Configuration All configuration in `tests/pyproject.toml`: ### Pytest Settings ```toml [tool.pytest.ini_options] python_files = "test_*.py" python_classes = "Test*" python_functions = "test_*" testpaths = ["."] addopts = [ "-v", # Verbose "--strict-markers", # Enforce markers "--showlocals", # Show locals in errors "--cov=ai_web_feeds", # Coverage "--emoji", # Emoji output "--icdiff", # Better diffs "--instafail", # Instant failures "--timeout=300", # Test timeout ] ``` ### Pytest Plugins * **pytest-cov** - Coverage reporting * **pytest-emoji** - Emoji test output * **pytest-icdiff** - Better diff display * **pytest-instafail** - Instant failure reporting * **pytest-html** - HTML reports * **pytest-timeout** - Timeout protection * **pytest-mock** - Mocking support * **pytest-sugar** - Better output * **pytest-xdist** - Parallel execution * **hypothesis** - Property-based testing ## CLI Test Command ### UV Scripts Configuration The workspace `pyproject.toml` defines test scripts for convenience: ```toml [tool.uv.scripts] # Test execution commands (delegates to CLI) test = "aiwebfeeds test all" test-all = "aiwebfeeds test all" test-unit = "aiwebfeeds test unit" test-unit-fast = "aiwebfeeds test unit --fast" test-integration = "aiwebfeeds test integration" test-e2e = "aiwebfeeds test e2e" test-coverage = "aiwebfeeds test coverage" test-coverage-open = "aiwebfeeds test coverage --open" test-quick = "aiwebfeeds test quick" test-debug = "aiwebfeeds test debug" test-watch = "aiwebfeeds test watch" test-markers = "aiwebfeeds test markers" ``` ### UV Integration All commands use `uv run` internally: ```python def run_uv_command(args: list[str], cwd: Optional[Path] = None) -> int: """Run a uv command and return exit code.""" cmd = ["uv", "run"] + args result = subprocess.run(cmd, cwd=cwd) return result.returncode ``` ### Available Subcommands | Command | Description | Options | uv Script | | ------------------ | ----------------- | --------------------------------------- | ------------------------- | | `test all` | Run all tests | `--verbose`, `--coverage`, `--parallel` | `uv run test` | | `test unit` | Unit tests only | `--fast` (skip slow) | `uv run test-unit` | | `test integration` | Integration tests | `--verbose` | `uv run test-integration` | | `test e2e` | E2E tests | `--verbose` | `uv run test-e2e` | | `test coverage` | With coverage | `--open` (open browser) | `uv run test-coverage` | | `test quick` | Fast unit tests | None | `uv run test-quick` | | `test watch` | Watch mode | None | `uv run test-watch` | | `test file ` | Specific file | `-k ` | N/A (use CLI) | | `test debug` | Debug mode | None | `uv run test-debug` | | `test markers` | List markers | None | `uv run test-markers` | ### Examples ```bash # Recommended: Use uv scripts uv run test-quick # Quick development cycle uv run test-coverage-open # Full test with coverage uv run test-watch # Watch mode for TDD # Alternative: Use CLI directly uv run aiwebfeeds test all --verbose --coverage uv run aiwebfeeds test unit --fast uv run aiwebfeeds test debug packages/ai_web_feeds/unit/test_models.py # Legacy: Use run_tests.py wrapper cd tests ./run_tests.py quick ./run_tests.py coverage ``` ### Benefits of This Architecture **Single Source of Truth** : All test execution logic lives in the CLI commands, with uv scripts providing convenient shortcuts. This eliminates duplication and makes maintenance easier. Key advantages: 1. **Native uv Integration** - Uses uv's built-in script system 2. **Multiple Entry Points** - Choose the interface that works best for you 3. **Consistent Behavior** - All methods use the same underlying CLI 4. **Easy Discovery** - `uv run --help` lists all available scripts 5. **Backward Compatible** - Legacy `run_tests.py` still works ## CI/CD Integration ### GitHub Actions Example ```yaml name: Tests on: [push, pull_request] jobs: test: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Install uv run: curl -LsSf https://astral.sh/uv/install.sh | sh - name: Run tests with uv scripts run: uv run test-coverage - name: Upload coverage uses: codecov/codecov-action@v3 ``` ### Migration from Legacy Commands If you're updating CI/CD pipelines: **Before:** ```yaml - run: python tests/run_tests.py coverage ``` **After (Recommended):** ```yaml - run: uv run test-coverage ``` **Alternative:** ```yaml - run: uv run aiwebfeeds test coverage ``` ### Docker Testing ```dockerfile FROM python:3.13-slim WORKDIR /app COPY . . RUN pip install uv RUN cd tests && uv sync CMD ["uv", "run", "--directory", "tests", "pytest", "-v"] ``` ## Performance ### Test Execution Speed * **Quick tests**: \~2-5 seconds * **Unit tests**: \~10-15 seconds * **Integration tests**: \~20-30 seconds * **Full suite**: \~30-45 seconds * **With coverage**: \~45-60 seconds * **Parallel execution**: 50-70% faster ### Optimization Tips 1. **Use quick mode** for rapid feedback during development 2. **Run unit tests** before integration/E2E 3. **Enable parallel execution** with `--parallel` 4. **Skip slow tests** with `--fast` flag 5. **Use watch mode** for TDD workflow ## Best Practices ### Writing Tests 1. **Mirror structure** - Test files match source files 2. **Use fixtures** - Reusable test data 3. **Mark appropriately** - Use `@pytest.mark.unit`, etc. 4. **Property-based** - Use Hypothesis for edge cases 5. **Descriptive names** - Clear test method names 6. **AAA pattern** - Arrange, Act, Assert ### Running Tests 1. **Quick first** - Run quick tests during development 2. **Full before commit** - Run all tests before committing 3. **Coverage regularly** - Check coverage weekly 4. **E2E before release** - Run E2E tests before releases 5. **CI/CD always** - All tests in CI/CD pipeline ## Troubleshooting ### Tests Not Found ```bash # Sync dependencies cd tests uv sync # Verify discovery uv run pytest --collect-only ``` ### Import Errors ```bash # From workspace root uv sync # Verify package installed uv run --directory tests python -c "import ai_web_feeds" ``` ### Slow Tests ```bash # Skip slow tests aiwebfeeds test unit --fast # Show slowest tests uv run --directory tests pytest --durations=10 ``` ### Coverage Issues ```bash # Clear coverage data rm -rf tests/reports/.coverage tests/reports/coverage # Regenerate aiwebfeeds test coverage ``` ## Documentation All test infrastructure documentation is now integrated into this Fumadocs site: * **[Testing Guide](/docs/guides/testing)** - Quick start and overview * **[This Page](/docs/development/testing)** - Comprehensive test infrastructure * **[Twitter/arXiv Integration](/docs/features/twitter-arxiv-integration)** - Platform-specific testing * **tests/README.md** - Technical reference (in repository) ## Future Enhancements * [ ] Mutation testing with mutmut * [ ] Performance benchmarking with pytest-benchmark * [ ] Async testing with pytest-asyncio * [ ] Snapshot testing * [ ] Contract testing * [ ] Load testing -------------------------------------------------------------------------------- END OF PAGE 24 -------------------------------------------------------------------------------- ================================================================================ PAGE 25 OF 57 ================================================================================ TITLE: GitHub Actions Workflows URL: https://ai-web-feeds.w4w.dev/docs/development/workflows MARKDOWN: https://ai-web-feeds.w4w.dev/docs/development/workflows.mdx DESCRIPTION: Comprehensive guide to CI/CD workflows with CLI integration PATH: /development/workflows -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # GitHub Actions Workflows (/docs/development/workflows) # GitHub Actions Workflows AIWebFeeds uses an extensive suite of GitHub Actions workflows to ensure code quality, automate testing, and streamline development. All workflows leverage the **aiwebfeeds CLI** for consistent execution across environments. ## 🎯 Overview Our CI/CD pipeline enforces: * ✅ **Code Quality**: Linting, formatting, and type checking * 🧪 **Testing**: Unit, integration, and E2E tests with coverage * 🔒 **Security**: CodeQL analysis and dependency scanning * 📊 **Feed Validation**: RSS/Atom feed schema compliance * 🤖 **Automation**: Auto-fixing, labeling, and release management *** ## 📋 Workflow Categories ### Quality Enforcement #### `quality-enforcement.yml` - **Comprehensive Quality Gate** **Triggers**: Pull requests to `main` or `develop` **What it does**: 1. **Python Quality Checks** * Ruff linting (`uv run ruff check`) * Ruff formatting (`uv run ruff format --check`) * MyPy type checking (`uv run mypy`) * Import sorting validation 2. **Web Quality Checks** * ESLint (`pnpm lint`) * TypeScript type checking (`pnpm tsc --noEmit`) * Link validation (`pnpm lint:links`) * Build verification (`pnpm build`) 3. **CLI Integration** * Feed validation (`uv run aiwebfeeds validate --all`) * Analytics generation (`uv run aiwebfeeds analytics`) * Export verification (`uv run aiwebfeeds export`) 4. **Test Suite** * Unit tests (≥90% coverage required) * Integration tests * E2E tests * Coverage reporting to Codecov **Required Status**: ✅ Must pass for merge ```yaml # Example: Running quality checks locally uv run ruff check . uv run ruff format --check . uv run mypy . cd apps/web && pnpm lint ``` *** #### `python-quality.yml` - **Python-Specific Quality** **Triggers**: Push to any branch, PRs **What it does**: * Matrix testing across Python 3.11, 3.12, 3.13 * Parallel linting, formatting, type checking * CLI command validation * Package build verification **Strategy**: Fast feedback on Python changes *** ### Testing & Coverage #### `coverage.yml` - **Comprehensive Test Coverage** **Triggers**: Push to `main`/`develop`, PRs **What it does**: 1. Runs full test suite with `pytest-cov` 2. Generates HTML and XML coverage reports 3. Uploads to Codecov with threshold enforcement 4. Validates ≥90% coverage requirement 5. Posts coverage report as PR comment **CLI Integration**: ```bash # Run tests with CLI validation uv run pytest --cov=ai_web_feeds --cov-report=html --cov-report=xml # Validate feeds after tests uv run aiwebfeeds validate --all --strict ``` **Artifacts**: * `coverage-report` - HTML coverage report * `coverage-xml` - XML for Codecov *** ### Feed Validation #### `validate-all-feeds.yml` - **Complete Feed Validation** **Triggers**: * Push to `main` * Daily schedule (6 AM UTC) * Manual dispatch **What it does**: ```bash # 1. Schema validation uv run aiwebfeeds validate --schema --strict # 2. URL reachability checks uv run aiwebfeeds validate --check-urls --timeout 30 # 3. Feed parsing validation uv run aiwebfeeds validate --parse-feeds # 4. OPML export verification uv run aiwebfeeds opml export --validate # 5. Analytics generation uv run aiwebfeeds analytics --output data/analytics.json ``` **Notifications**: Posts summary to Slack/Discord on failures *** #### `validate-feed-submission.yml` - **PR Feed Validation** **Triggers**: Pull requests modifying `data/feeds.yaml` **What it does**: 1. Validates only changed feeds (incremental validation) 2. Checks schema compliance 3. Tests URL accessibility 4. Verifies feed parsing 5. Ensures no duplicates 6. Validates topic assignments **CLI Usage**: ```bash # Validate specific feeds uv run aiwebfeeds validate --feeds "https://example.com/feed.xml" # Validate with strict schema uv run aiwebfeeds validate --schema --strict --feeds-file data/feeds.yaml ``` **Auto-labels**: Adds `feeds:valid` or `feeds:invalid` label *** #### `add-approved-feed.yml` - **Automated Feed Addition** **Triggers**: Issue labeled `feed:approved` **What it does**: 1. Parses feed URL from issue body 2. Validates feed structure 3. Enriches metadata with `aiwebfeeds enrich` 4. Creates PR with new feed 5. Auto-assigns reviewers **CLI Integration**: ```bash # Extract feed from issue FEED_URL=$(gh issue view $ISSUE_NUMBER --json body -q .body | grep -oP 'https?://\S+') # Validate and enrich uv run aiwebfeeds validate --feeds "$FEED_URL" uv run aiwebfeeds enrich --url "$FEED_URL" --output data/feeds.yaml ``` *** ### Auto-Fixing #### `auto-fix.yml` - **Automated Code Fixes** **Triggers**: * Comment `/fix` on PR * Push to branches with `autofix/**` prefix **What it does**: 1. **Python Fixes**: ```bash uv run ruff check --fix . uv run ruff format . ``` 2. **Web Fixes**: ```bash cd apps/web pnpm lint --fix ``` 3. **Feed Fixes**: ```bash # Re-enrich feeds to fix metadata uv run aiwebfeeds enrich --all --fix-schema # Regenerate OPML with correct structure uv run aiwebfeeds opml export --fix-structure ``` 4. **Auto-commit**: Pushes fixes back to PR branch **Safety**: Only runs on PRs, never on `main` *** ### PR Validation #### `pr-validation.yml` - **Pull Request Quality Gate** **Triggers**: Pull request events (opened, synchronized, reopened) **What it does**: 1. **Title Validation**: Enforces conventional commits 2. **Label Validation**: Requires type labels 3. **Size Check**: Warns on large PRs (>500 lines) 4. **Linked Issues**: Verifies issue references 5. **CLI Validation**: Runs relevant CLI commands based on changes **Change Detection**: ```yaml # Runs different CLI commands based on changes if: contains(steps.changes.outputs.files, 'data/feeds.yaml') run: uv run aiwebfeeds validate --incremental if: contains(steps.changes.outputs.files, 'packages/ai_web_feeds/') run: uv run aiwebfeeds test --coverage if: contains(steps.changes.outputs.files, 'apps/web/') run: cd apps/web && pnpm lint && pnpm build ``` *** ### Security #### `codeql-analysis.yml` - **Security Scanning** **Triggers**: * Push to `main`/`develop` * Weekly schedule * PRs to `main` **What it does**: * CodeQL scanning for Python and TypeScript * Dependency vulnerability scanning * Secret scanning * SAST analysis **Languages**: Python, JavaScript, TypeScript *** #### `dependency-review.yml` - **Dependency Security** **Triggers**: Pull requests **What it does**: * Reviews new dependencies for vulnerabilities * Checks license compatibility * Validates dependency updates * Blocks PRs with high/critical vulnerabilities *** ### Automation #### `label-manager.yml` - **Automatic Labeling** **Triggers**: Pull requests, issues **What it does**: * Auto-labels based on file paths * `python` - Changes to `.py` files * `web` - Changes to `apps/web/` * `cli` - Changes to `apps/cli/` * `feeds` - Changes to `data/feeds.yaml` * `docs` - Changes to `.mdx` files * Adds size labels (`size/S`, `size/M`, `size/L`, `size/XL`) * Detects breaking changes from commit messages **CLI Integration**: ```bash # Generate labels from feed changes uv run aiwebfeeds analytics --changed-feeds --output labels.json ``` *** #### `release-drafter.yml` - **Automated Release Notes** **Triggers**: Push to `main`, merged PRs **What it does**: 1. Groups changes by type (features, fixes, docs, etc.) 2. Generates changelog from PR titles 3. Creates draft release 4. Suggests version bump (semver) **Template**: Uses `.github/release-drafter.yml` template *** #### `release.yml` - **Automated Releases** **Triggers**: * Tag push (`v*`) * Manual dispatch **What it does**: 1. **Build Artifacts**: ```bash # Python package uv build # CLI binary uv run pyinstaller apps/cli/ai_web_feeds/cli/__init__.py # Web static export cd apps/web && pnpm build && pnpm export ``` 2. **Publish**: * PyPI: `uv publish` * GitHub Release: Attach binaries * Docker: Build and push container 3. **Notifications**: Slack/Discord release announcement **CLI Validation**: ```bash # Verify CLI works before release uv run aiwebfeeds --version uv run aiwebfeeds validate --all uv run aiwebfeeds test --quick ``` *** ### Maintenance #### `dependency-updates.yml` - **Automated Dependency Updates** **Triggers**: Weekly schedule (Monday 9 AM UTC) **What it does**: 1. **Python**: `uv lock --upgrade` 2. **Web**: `pnpm update --interactive` 3. Creates PR with updates 4. Runs full test suite 5. Auto-merges if tests pass (patch versions only) *** #### `stale.yml` - **Stale Issue Management** **Triggers**: Daily schedule **What it does**: * Marks issues stale after 60 days * Closes after 14 more days * Exempts `pinned`, `security`, `bug` labels * Posts friendly reminder comments *** ## 🔧 CLI Command Reference All workflows use these CLI commands: ### Validation ```bash # Validate all feeds uv run aiwebfeeds validate --all # Validate specific feeds uv run aiwebfeeds validate --feeds "url1" "url2" # Schema validation only uv run aiwebfeeds validate --schema # Check URL accessibility uv run aiwebfeeds validate --check-urls # Strict mode (fail on warnings) uv run aiwebfeeds validate --strict ``` ### Analytics ```bash # Generate analytics uv run aiwebfeeds analytics # Output to file uv run aiwebfeeds analytics --output data/analytics.json # Specific metrics uv run aiwebfeeds analytics --metrics "count,categories,languages" ``` ### Export ```bash # Export to OPML uv run aiwebfeeds opml export --output feeds.opml # Export to JSON uv run aiwebfeeds export --format json --output feeds.json # Export with validation uv run aiwebfeeds export --validate ``` ### Enrichment ```bash # Enrich all feeds uv run aiwebfeeds enrich --all # Enrich specific feed uv run aiwebfeeds enrich --url "https://example.com/feed.xml" # Fix schema issues uv run aiwebfeeds enrich --fix-schema ``` ### Testing ```bash # Run test suite via CLI uv run aiwebfeeds test # Quick tests only uv run aiwebfeeds test --quick # With coverage uv run aiwebfeeds test --coverage ``` *** ## 🚀 Running Workflows Locally ### Install Act (GitHub Actions locally) ```bash brew install act ``` ### Run Specific Workflow ```bash # Quality enforcement act pull_request -W .github/workflows/quality-enforcement.yml # Coverage tests act push -W .github/workflows/coverage.yml # Feed validation act workflow_dispatch -W .github/workflows/validate-all-feeds.yml ``` ### Run with Secrets ```bash # Create .secrets file echo "CODECOV_TOKEN=your_token" > .secrets # Run with secrets act -s .secrets ``` *** ## 📊 Workflow Status Badges Add to README: ```markdown ![Quality](https://github.com/wyattowalsh/ai-web-feeds/workflows/quality-enforcement/badge.svg) ![Coverage](https://github.com/wyattowalsh/ai-web-feeds/workflows/coverage/badge.svg) ![Feeds](https://github.com/wyattowalsh/ai-web-feeds/workflows/validate-all-feeds/badge.svg) ``` *** ## 🔍 Troubleshooting ### Workflow Fails on CLI Command **Problem**: `aiwebfeeds: command not found` **Solution**: Ensure workflow uses `uv run`: ```yaml - name: Validate feeds run: uv run aiwebfeeds validate --all ``` ### Coverage Below Threshold **Problem**: Coverage report shows less than 90% **Solution**: 1. Check coverage report: `open reports/coverage/index.html` 2. Add missing tests 3. Run locally: `uv run pytest --cov --cov-report=html` ### Feed Validation Timeout **Problem**: Feed URL checks timeout **Solution**: Increase timeout in workflow: ```yaml - name: Validate with longer timeout run: uv run aiwebfeeds validate --check-urls --timeout 60 ``` *** ## 📚 Related Documentation * [CLI Commands](/docs/development/cli) - Complete CLI reference * [Testing Guide](/docs/development/testing) - Testing best practices * [Contributing](/docs/development/contributing) - Contribution workflow * [Feed Schema](/docs/guides/feed-schema) - Feed data structure *** ## 🤖 Best Practices 1. **Always use `uv run`** for CLI commands in workflows 2. **Cache dependencies** to speed up builds 3. **Run workflows locally** with `act` before pushing 4. **Keep workflows focused** - one responsibility per workflow 5. **Use CLI for consistency** - avoid duplicating logic in YAML 6. **Fail fast** - validate critical things first 7. **Provide clear error messages** in CLI output 8. **Matrix test** across Python versions 9. **Auto-fix when possible** - reduce manual work 10. **Monitor workflow usage** - optimize slow jobs *** *Last Updated: October 2025* -------------------------------------------------------------------------------- END OF PAGE 25 -------------------------------------------------------------------------------- ================================================================================ PAGE 26 OF 57 ================================================================================ TITLE: AI & LLM Integration URL: https://ai-web-feeds.w4w.dev/docs/features/ai-integration MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/ai-integration.mdx DESCRIPTION: Comprehensive AI and LLM integration for your Fumadocs documentation site PATH: /features/ai-integration -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # AI & LLM Integration (/docs/features/ai-integration) import { Callout } from "fumadocs-ui/components/callout"; import { Tab, Tabs } from "fumadocs-ui/components/tabs"; Complete AI and LLM integration following the official [Fumadocs guide](https://fumadocs.dev/docs/ui/llms), making your documentation easily consumable by AI agents and large language models. ## Overview This site provides multiple ways for AI agents to access documentation: **Discovery** `/llms.txt` endpoint lists all available docs **Full Docs** `/llms-full.txt` provides complete documentation "}> **Markdown** `.mdx` and `.md` extensions for any page **Smart Routing** Automatic content negotiation ## Features ### LLM-Friendly Endpoints #### `/llms.txt` - Discovery File Standard discovery file for AI agents following the [llms.txt specification](https://llmstxt.org). ```bash curl https://yourdomain.com/llms.txt ``` **Response:** ```text # AI Web Feeds Documentation > A collection of curated RSS/Atom feeds optimized for AI agents ## Documentation Pages - [Getting Started](https://yourdomain.com/docs.mdx): Quick start guide - [PDF Export](https://yourdomain.com/docs/features/pdf-export.mdx): Export docs as PDF ... ``` #### `/llms-full.txt` - Complete Documentation All documentation in a single, structured text file optimized for RAG systems. ```bash curl https://yourdomain.com/llms-full.txt ``` The format includes metadata header, table of contents, and structured page sections. See [llms-full.txt Format](/docs/features/llms-full-format) for details. **Key Features:** * Structured format with clear separators * Metadata header (date, page count, base URL) * Table of contents * Individual page sections with metadata * Optimized for AI parsing #### Markdown Extensions Access markdown source of any documentation page by appending `.mdx` or `.md`: `bash curl https://yourdomain.com/docs/getting-started.mdx ` Returns the markdown source of the page. `bash curl https://yourdomain.com/docs/getting-started.md ` Alternative markdown extension (same as .mdx). `bash curl -H "Accept: text/markdown" https://yourdomain.com/docs/getting-started ` Automatically serves markdown when AI agent requests it. ### Content Negotiation Middleware automatically detects AI agents and serves markdown content: ```typescript title="middleware.ts" import { isMarkdownPreferred } from "fumadocs-core/negotiation"; if (isMarkdownPreferred(request)) { // Serve markdown version return NextResponse.rewrite(new URL(`/llms.mdx${path}`, request.url)); } ``` When an AI agent sends `Accept: text/markdown` header, it automatically receives markdown content without changing the URL. ### AI Page Actions Interactive UI components on every documentation page: #### Copy Markdown Button One-click copy of page markdown to clipboard: ```tsx import { LLMCopyButton } from "@/components/page-actions"; ; ``` **Features:** * Client-side caching for performance * Loading state feedback * Success confirmation with checkmark #### View Options Menu Dropdown menu with links to AI tools: * **Open in GitHub** - View source code * **Open in Scira AI** - Ask questions about the page * **Open in Perplexity** - Search with context * **Open in ChatGPT** - Analyze content ```tsx import { ViewOptions } from "@/components/page-actions"; ; ``` ## Implementation ### File Structure ``` apps/web/ ├── app/ │ ├── llms.txt/ │ │ └── route.ts # Discovery endpoint │ ├── llms-full.txt/ │ │ └── route.ts # Full docs endpoint │ ├── llms.mdx/ │ │ └── [[...slug]]/ │ │ └── route.ts # .mdx handler │ ├── llms.md/ │ │ └── [[...slug]]/ │ │ └── route.ts # .md handler │ └── docs/ │ └── [[...slug]]/ │ └── page.tsx # With page actions ├── components/ │ └── page-actions.tsx # AI UI components ├── middleware.ts # Content negotiation └── next.config.mjs # URL rewrites ``` ### Configuration #### Source Config Already configured in `source.config.ts`: ```typescript title="source.config.ts" export const docs = defineDocs({ docs: { dir: "content/docs", includeProcessedMarkdown: true, // ✅ Required for LLM support }, }); ``` #### Next.js Config URL rewrites in `next.config.mjs`: ```javascript title="next.config.mjs" async rewrites() { return [ { source: '/docs/:path*.mdx', destination: '/llms.mdx/:path*', }, { source: '/docs/:path*.md', destination: '/llms.md/:path*', }, ]; } ``` ## Usage ### For AI Agents `bash # Discover all documentation curl https://yourdomain.com/llms.txt ` Returns a list of all available pages with descriptions. `bash # Get complete documentation curl https://yourdomain.com/llms-full.txt ` Returns all pages in a structured format. `bash # Get specific page as markdown curl https://yourdomain.com/docs/getting-started.mdx ` Returns markdown source of the page. `bash # Use content negotiation curl -H "Accept: text/markdown" https://yourdomain.com/docs/getting-started ` Automatically receives markdown content. ### For Users #### Copy Page as Markdown 1. Navigate to any documentation page 2. Click the **Copy Markdown** button 3. Paste into your AI tool or editor #### Open in AI Tools 1. Click the **View Options** dropdown 2. Select your preferred AI tool: * **GitHub** - View source code * **Scira AI** - Ask questions * **Perplexity** - Search with context * **ChatGPT** - Analyze content ### For Developers #### Get LLM Text Programmatically ```typescript import { getLLMText, source } from "@/lib/source"; const page = source.getPage(["getting-started"]); const markdown = await getLLMText(page); ``` #### Customize Page Actions Edit `components/page-actions.tsx` to add more AI tools: ```tsx { title: 'Open in Claude', href: `https://claude.ai/new?content=${markdownUrl}`, icon: , } ``` #### Update GitHub URLs Edit `app/docs/[[...slug]]/page.tsx`: ```tsx githubUrl={`https://github.com/wyattowalsh/ai-web-feeds/blob/main/apps/web/content/docs/${page.file.path}`} ``` ## Performance All endpoints are optimized for performance: | Endpoint | Caching Strategy | Generation | | ---------------- | ------------------------------ | ---------- | | `/llms.txt` | `s-maxage=86400` (24h) | Dynamic | | `/llms-full.txt` | `revalidate=false` (permanent) | Dynamic | | `*.mdx` routes | `immutable` | Static | | Middleware | Minimal overhead | Runtime | | Copy button | Client-side cache | Client | Static generation ensures fast response times and minimal server load. ## Benefits ### For AI Agents * **Easy discovery** via `/llms.txt` * **Complete context** via `/llms-full.txt` * **Granular access** via `.mdx` extensions * **Automatic detection** via content negotiation * **Optimized format** for RAG systems ### For Users * **Quick markdown copy** with one click * **Direct AI tool links** in View Options * **Easy sharing** with AI-friendly URLs * **Better collaboration** with AI assistants ### For Developers * **Standards-compliant** following llms.txt spec * **Performance-optimized** with caching * **Extensible** architecture * **Well-documented** implementation ## Related Documentation * [llms-full.txt Format](/docs/features/llms-full-format) - Detailed format specification * [Testing Guide](/docs/guides/testing) - Verify your integration * [Quick Reference](/docs/guides/quick-reference) - Commands and endpoints ## External Resources * [Fumadocs LLM Guide](https://fumadocs.dev/docs/ui/llms) * [llms.txt Specification](https://llmstxt.org) * [Content Negotiation](https://developer.mozilla.org/en-US/docs/Web/HTTP/Content_negotiation) -------------------------------------------------------------------------------- END OF PAGE 26 -------------------------------------------------------------------------------- ================================================================================ PAGE 27 OF 57 ================================================================================ TITLE: Analytics Dashboard URL: https://ai-web-feeds.w4w.dev/docs/features/analytics MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/analytics.mdx DESCRIPTION: Real-time feed analytics with interactive visualizations, trending topics, and health insights PATH: /features/analytics -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Analytics Dashboard (/docs/features/analytics) # Analytics Dashboard > **Status**: ✅ Fully Implemented > **Phase**: Phase 1 (MVP) > **Completion**: 100% The Analytics Dashboard provides curators with comprehensive metrics and insights for the AIWebFeeds collection. ## Features ### Key Metrics * **Total Feeds**: Count of all feeds in the collection * **Validation Success Rate**: Percentage of feeds passing health checks * **Average Response Time**: Mean latency for feed validation * **Health Score Distribution**: Feed quality buckets (healthy, moderate, unhealthy) ### Interactive Charts #### Most Active Topics Bar chart showing topics ranked by validation frequency (last 30 days), weighted by feed health scores. #### Publication Velocity Line chart displaying daily/weekly/monthly validation frequency trends, used as proxy for publication activity. #### Feed Health Distribution Pie chart showing distribution of feeds by health category: * **Healthy**: ≥0.8 health score * **Moderate**: 0.5-0.8 health score * **Unhealthy**: \<0.5 health score #### Validation Success Over Time Area chart tracking validation success rate over time ranges (7d, 30d, 90d). ### Filtering * **Time Range**: Last 7 days, Last 30 days, Last 90 days, Custom date range * **Topic Filter**: Filter all analytics by specific topic (e.g., "Show only LLM feeds") ### Data Export * **CSV Export**: Download raw metrics for external analysis * **API Endpoint**: Programmatic access at `/api/analytics/summary` ## Configuration Analytics caching is configurable via environment variables: ```bash # Static metrics (total_feeds, health_distribution) - 1 hour TTL AIWF_ANALYTICS__STATIC_CACHE_TTL=3600 # Dynamic metrics (trending_topics, validation_success_rate) - 5 minutes TTL AIWF_ANALYTICS__DYNAMIC_CACHE_TTL=300 # Maximum concurrent analytics queries AIWF_ANALYTICS__MAX_CONCURRENT_QUERIES=10 ``` ## Usage ### Web Interface Navigate to `/analytics` to access the dashboard. **Manual Refresh**: Click "Refresh Now" button to bypass cache and fetch real-time data. **Data Freshness**: Dashboard displays "Last updated: \[timestamp]" with auto-refresh option. ### CLI ```bash # Display analytics summary uv run aiwebfeeds analytics summary --date-range 30d # Filter by topic uv run aiwebfeeds analytics summary --topic llm # Export to CSV uv run aiwebfeeds analytics export --output metrics.csv ``` ### API ```typescript // Fetch analytics summary const response = await fetch("/api/analytics/summary?date_range=30d&topic=llm"); const data = await response.json(); console.log(data.total_feeds); console.log(data.validation_success_rate); console.log(data.trending_topics); ``` ## Performance * **Page Load**: \<2 seconds on 4G connection (NFR-001) * **Cache Hit Rate**: 95% of queries served from cache * **Database Load Reduction**: ≥80% via hybrid caching strategy ## Success Criteria * ✅ Dashboard loads within 2 seconds for 95% of requests * ✅ Curators can identify top 10 trending topics in ≤30 seconds * ✅ 80% of curators use dashboard at least weekly * ✅ Curators identify and disable 20+ inactive feeds within first month * ✅ Export feature used by 30% of curators within first quarter ## Related * [Search & Discovery](./search) - Find feeds by keywords and semantic similarity * [Recommendations](./recommendations) - AI-powered feed suggestions * [Data Model](/docs/development/data-model#analyticssnapshot) - AnalyticsSnapshot entity schema -------------------------------------------------------------------------------- END OF PAGE 27 -------------------------------------------------------------------------------- ================================================================================ PAGE 28 OF 57 ================================================================================ TITLE: Data Enrichment & Analytics URL: https://ai-web-feeds.w4w.dev/docs/features/data-enrichment MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/data-enrichment.mdx DESCRIPTION: Comprehensive data enrichment and advanced analytics capabilities PATH: /features/data-enrichment -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Data Enrichment & Analytics (/docs/features/data-enrichment) # Data Enrichment & Analytics AI Web Feeds includes comprehensive data enrichment and advanced analytics capabilities that automatically enhance feed metadata, analyze content, track quality, and provide ML-powered insights. ## Key Features ### 1. Metadata Enrichment **Module**: `enrichment.metadata` Automatically discovers and enriches feed metadata: * **Auto-discovery**: Extracts titles, descriptions, authors from feeds and websites * **Language Detection**: Identifies feed language with confidence scores * **Platform Detection**: Recognizes Reddit, Medium, Substack, GitHub, arXiv, YouTube, etc. * **Icon/Logo Discovery**: Finds favicons and Open Graph images * **Feed Format Detection**: Identifies RSS, Atom, JSON feeds * **Publishing Frequency**: Analyzes update patterns **Example Usage**: ```python from ai_web_feeds.enrichment import MetadataEnricher enricher = MetadataEnricher() # Enrich single feed feed_data = {"url": "https://example.com/feed"} enriched = enricher.enrich_feed_source(feed_data) print(enriched["title"]) # Auto-discovered title print(enriched["language"]) # Detected language print(enriched["platform"]) # Detected platform # Batch enrichment (parallel) feeds = [{"url": url1}, {"url": url2}, {"url": url3}] enriched_feeds = enricher.batch_enrich(feeds, max_workers=5) ``` ### 2. Content Analysis **Module**: `enrichment.content` NLP-powered content analysis: * **Text Statistics**: Word count, sentence count, paragraph count * **Readability Scoring**: Flesch reading ease, reading level classification * **Keyword Extraction**: Top keywords, domain-specific keywords (AI/ML) * **Named Entity Recognition**: Simple capitalization-based extraction * **Sentiment Analysis**: Positive/negative/neutral classification with confidence * **Topic Detection**: Auto-classification into research, industry, ML, NLP, etc. * **Content Detection**: Identifies code snippets and mathematical notation **Example Usage**: ```python from ai_web_feeds.enrichment import ContentAnalyzer analyzer = ContentAnalyzer() # Analyze text content text = """ Machine learning models are becoming increasingly powerful. Recent advances in transformer architectures have led to breakthrough performance on many NLP tasks. """ analysis = analyzer.analyze_text(text) print(f"Readability: {analysis.readability_score:.1f}") print(f"Reading Level: {analysis.reading_level}") print(f"Sentiment: {analysis.sentiment_label} ({analysis.sentiment_score:.2f})") print(f"Top Keywords: {analysis.top_keywords[:5]}") print(f"Detected Topics: {analysis.detected_topics}") print(f"Has Code: {analysis.has_code}") ``` ### 3. Quality Analysis **Module**: `enrichment.quality` Multi-dimensional quality scoring: * **Completeness**: Required vs. optional fields * **Accuracy**: URL format, title length, description quality * **Consistency**: Domain matching, language code format * **Timeliness**: Update freshness, staleness detection * **Validity**: Data type checking, schema compliance * **Uniqueness**: Duplicate detection (with context) **Quality Dimensions** (with weights): * Completeness (25%): Are required fields present? * Accuracy (20%): Is data properly formatted? * Consistency (15%): Do related fields match? * Timeliness (15%): Is data up-to-date? * Validity (15%): Does data meet type requirements? * Uniqueness (10%): Is feed unique? **Example Usage**: ```python from ai_web_feeds.enrichment import QualityAnalyzer analyzer = QualityAnalyzer() # Assess feed quality feed_data = { "url": "example.com/feed", # Missing protocol "title": "AI News", # Missing recommended fields: description, language, topics } score = analyzer.assess_feed_source(feed_data) print(f"Overall Score: {score.overall_score}/100") print(f"Completeness: {score.completeness_score}/100") print(f"Issues Found: {len(score.issues)}") for issue in score.issues: print(f" [{issue.severity}] {issue.field}: {issue.issue}") if issue.auto_fixable: print(f" → Can auto-fix: {issue.suggestion}") # Auto-fix issues fixed = analyzer.auto_fix_issues(feed_data) print(f"Fixed URL: {fixed['url']}") # Now has https:// ``` ### 4. Time-Series Analysis **Module**: `analytics.timeseries` Forecasting and temporal pattern analysis: * **Health Forecasting**: Predict feed health 7+ days ahead * **Seasonality Detection**: Weekly/daily posting patterns * **Trend Analysis**: Increasing/decreasing/stable trends with R² * **Frequency Analysis**: Publishing rates and regularity * **Peak Time Detection**: Most active hours/days **Example Usage**: ```python from ai_web_feeds.analytics.timeseries import TimeSeriesAnalyzer from ai_web_feeds import DatabaseManager db = DatabaseManager() with db.get_session() as session: analyzer = TimeSeriesAnalyzer(session) # Forecast health forecast = analyzer.forecast_health_metric("feed_123", days_ahead=14) print(f"Forecast (next 14 days): {forecast.forecast_values}") print(f"Confidence Intervals: {forecast.confidence_intervals}") print(f"Model RMSE: {forecast.rmse:.3f}") # Detect seasonality seasonality = analyzer.detect_seasonality("feed_123", lookback_days=90) if seasonality.has_seasonality: print(f"Seasonal Period: {seasonality.seasonal_period} hours/days") print(f"Seasonal Strength: {seasonality.seasonal_strength:.2f}") # Analyze trend trend = analyzer.analyze_trend("feed_123", lookback_days=90) print(f"Trend Direction: {trend.trend_direction}") print(f"Slope: {trend.slope:.4f}") print(f"R²: {trend.r_squared:.3f}") ``` ### 5. Network Analysis **Module**: `analytics.network` Graph-based topic and feed relationship analysis: * **Topic Networks**: Graph of topic relationships * **Feed Similarity Networks**: Feeds connected by shared topics * **Centrality Metrics**: PageRank, degree, closeness, betweenness * **Community Detection**: Identify topic clusters * **Influential Topics**: Rank topics by network importance **Example Usage**: ```python from ai_web_feeds.analytics.network import NetworkAnalyzer from ai_web_feeds import DatabaseManager db = DatabaseManager() with db.get_session() as session: analyzer = NetworkAnalyzer(session) # Build topic network topic_graph = analyzer.build_topic_network() print(f"Topics: {topic_graph.stats['num_nodes']}") print(f"Relationships: {topic_graph.stats['num_edges']}") print(f"Density: {topic_graph.stats['density']:.3f}") # Find influential topics influential = analyzer.find_influential_topics(topic_graph, top_n=10) for topic in influential: print(f"{topic['label']}: PageRank={topic['pagerank']:.4f}") ``` ### 6. Advanced Analytics **Module**: `analytics.advanced` ML-powered insights: * **Predictive Health Modeling**: Linear regression forecasts * **Pattern Detection**: Temporal, content, category patterns * **Similarity Computation**: Jaccard similarity between feeds * **Feed Clustering**: BFS-based clustering by similarity * **ML Insights Reports**: Comprehensive ML analysis ## Integration with Data Sync The enrichment system integrates seamlessly with data synchronization: ```python from ai_web_feeds.data_sync import DataSyncOrchestrator from ai_web_feeds.enrichment import MetadataEnricher, QualityAnalyzer from ai_web_feeds import DatabaseManager db = DatabaseManager() # Load and enrich feeds with MetadataEnricher() as enricher: import yaml with open("data/feeds.yaml") as f: data = yaml.safe_load(f) # Enrich all feeds enriched_sources = enricher.batch_enrich(data["sources"]) # Assess quality quality_analyzer = QualityAnalyzer() for feed in enriched_sources: score = quality_analyzer.assess_feed_source(feed) feed["quality_score"] = score.overall_score # Sync to database sync = DataSyncOrchestrator(db) sync.full_sync() ``` ## Workflow Examples ### Complete Feed Enrichment Pipeline ```python from ai_web_feeds.enrichment import ( MetadataEnricher, ContentAnalyzer, QualityAnalyzer ) # 1. Extract metadata enricher = MetadataEnricher() feed_data = {"url": "https://openai.com/blog/rss/"} enriched = enricher.enrich_feed_source(feed_data) # 2. Analyze content content_analyzer = ContentAnalyzer() content_text = "Latest advances in GPT-4 and DALL-E 3..." content_analysis = content_analyzer.analyze_text(content_text) # 3. Assess quality quality_analyzer = QualityAnalyzer() quality = quality_analyzer.assess_feed_source(enriched) # 4. Combine results final_feed = { **enriched, "content_analysis": { "readability": content_analysis.readability_score, "sentiment": content_analysis.sentiment_label, "topics": content_analysis.detected_topics, }, "quality": { "overall_score": quality.overall_score, "issues_count": len(quality.issues), } } ``` ### Health Monitoring Dashboard ```python from ai_web_feeds.analytics.timeseries import TimeSeriesAnalyzer from ai_web_feeds.analytics.advanced import AdvancedFeedAnalytics with db.get_session() as session: ts_analyzer = TimeSeriesAnalyzer(session) adv_analytics = AdvancedFeedAnalytics(session) feed_id = "feed_123" # Current health current_health = adv_analytics.get_current_health(feed_id) # Future forecast forecast = ts_analyzer.forecast_health_metric(feed_id, days_ahead=7) # Trend analysis trend = ts_analyzer.analyze_trend(feed_id, lookback_days=30) dashboard = { "feed_id": feed_id, "current_health": current_health, "forecast_7d": forecast.forecast_values[-1], "trend": trend.trend_direction, "status": "healthy" if current_health > 0.7 else "degraded" } ``` ## Performance Considerations * **Batch Processing**: Use `batch_enrich()` for multiple feeds (parallel workers) * **Caching**: Metadata enrichment results cached in enriched YAML * **Incremental Updates**: Only re-enrich feeds older than X days * **Database Indexes**: Ensure indexes on `feed_source_id`, `published_date`, `calculated_at` * **Memory**: Time-series analysis memory-efficient with streaming for large datasets ## Troubleshooting ### Common Issues **Language detection fails** * Ensure text is at least 10 characters; langdetect requires minimum text **Metadata extraction returns empty** * Check URL accessibility; some sites block scrapers (use crawlee-python) **Quality score too low** * Use `auto_fix_issues()` to automatically fix common problems **Forecasting insufficient data** * Need minimum 7 data points; ensure health metrics collected regularly ## Best Practices 1. **Enrich on Import**: Run enrichment when adding new feeds 2. **Quality Gates**: Set minimum quality score threshold (e.g., 70/100) 3. **Regular Updates**: Re-enrich metadata monthly 4. **Content Analysis**: Run on new feed items, not all historical 5. **Health Monitoring**: Schedule daily health metric calculations 6. **Network Updates**: Rebuild topic network when taxonomy changes ## Future Enhancements Planned features: * **Deep Learning Models**: Use transformer models for better NLP * **Real-time Anomaly Detection**: Alert on unusual patterns * **Automated Categorization**: ML-based topic assignment * **Sentiment Trends**: Track sentiment changes over time * **Duplicate Detection**: Find near-duplicate feeds * **Performance Optimization**: GPU acceleration for large-scale analysis ## Related Documentation * [Database Architecture](/docs/development/database-architecture) - Database implementation * [Database Quick Start](/docs/guides/database-quick-start) - Get started with the database * [Python API](/docs/development/python-api) - Full API reference *** **Version**: 1.0 **Last Updated**: October 15, 2025 -------------------------------------------------------------------------------- END OF PAGE 28 -------------------------------------------------------------------------------- ================================================================================ PAGE 29 OF 57 ================================================================================ TITLE: Entity Extraction URL: https://ai-web-feeds.w4w.dev/docs/features/entity-extraction MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/entity-extraction.mdx DESCRIPTION: Named Entity Recognition and normalization using spaCy NER PATH: /features/entity-extraction -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Entity Extraction (/docs/features/entity-extraction) # Entity Extraction Entity Extraction identifies and tracks people, organizations, techniques, datasets, and concepts mentioned in articles using spaCy's Named Entity Recognition (NER) models. ## Overview The entity extractor: 1. **Extracts** entities from article text using spaCy NER 2. **Normalizes** entity names to canonical forms (e.g., "G. Hinton" → "Geoffrey Hinton") 3. **Tracks** entity mentions across articles with confidence scores 4. **Enables** full-text search across entities and aliases ## Architecture ## Entity Types Supported entity types: * **person**: Geoffrey Hinton, Yann LeCun, Ilya Sutskever * **organization**: OpenAI, Google Brain, Anthropic * **technique**: Transformers, RLHF, LoRA, BERT * **dataset**: ImageNet, COCO, WikiText-103 * **concept**: Attention mechanism, Backpropagation ## Features ### Named Entity Recognition Uses spaCy's `en_core_web_sm` model to detect entities: ```python from ai_web_feeds.nlp import EntityExtractor extractor = EntityExtractor() article = { "id": 1, "title": "GPT-4 by OpenAI", "content": "OpenAI released GPT-4, led by Sam Altman..." } entities = extractor.extract_entities(article) # Returns: [ # {"text": "OpenAI", "type": "organization", "confidence": 0.91}, # {"text": "GPT-4", "type": "technique", "confidence": 0.96}, # {"text": "Sam Altman", "type": "person", "confidence": 0.89} # ] ``` ### Entity Normalization Automatically merges similar entities using Levenshtein distance: ```python # "Geoffrey Hinton" vs "G. Hinton" → Merged (distance ≤ 2) # "OpenAI" vs "Open AI" → Merged (distance = 1) ``` **Algorithm**: 1. Title-case normalization 2. Compare to existing entities of same type 3. If Levenshtein distance ≤ 2, use existing canonical name 4. Otherwise, create new entity ### Full-Text Search SQLite FTS5 virtual table enables fast entity search: ```bash # Search entities by name, aliases, or description aiwebfeeds nlp search-entities "hinton" # Returns: Geoffrey Hinton, Geoff Hinton (alias) ``` ## Usage ### CLI Commands #### Extract Entities ```bash aiwebfeeds nlp entities ``` **Options**: * `--batch-size`: Number of articles (default: 50) * `--force`: Reprocess all articles ```bash # Process 25 articles aiwebfeeds nlp entities --batch-size 25 ``` #### List Entities ```bash # List top 10 entities by frequency aiwebfeeds nlp list-entities --limit 10 ``` #### Show Entity Details ```bash aiwebfeeds nlp show-entity "Geoffrey Hinton" ``` Shows: * Entity metadata (type, aliases, frequency) * Recent article mentions * Related entities #### Manage Entities **Add Alias**: ```bash aiwebfeeds nlp add-alias "Geoffrey Hinton" "G. Hinton" ``` **Merge Duplicate Entities**: ```bash aiwebfeeds nlp merge-entities "Geoff Hinton" "Geoffrey Hinton" ``` **Search Entities (FTS5)**: ```bash aiwebfeeds nlp search-entities "transformer attention" ``` ### Python API ```python from ai_web_feeds.nlp import EntityExtractor from ai_web_feeds.storage import Storage extractor = EntityExtractor() storage = Storage() # Extract entities article = storage.get_article_by_id(123) entities = extractor.extract_entities(article) # Store entities for entity_data in entities: # Normalize name canonical_name = extractor.normalize_entity( entity_data["text"], entity_data["type"], existing_entities=storage.list_all_entity_names() ) # Get or create entity entity = storage.get_entity_by_name(canonical_name) if not entity: entity = storage.create_entity( canonical_name=canonical_name, entity_type=entity_data["type"] ) # Record mention storage.create_entity_mention( entity_id=entity.id, article_id=article["id"], confidence=entity_data["confidence"], extraction_method="ner_model", context=entity_data["context"] ) ``` ### Batch Processing Entity extraction runs hourly via APScheduler: ```python from ai_web_feeds.nlp.scheduler import NLPScheduler nlp_scheduler = NLPScheduler(scheduler) nlp_scheduler.register_jobs() # Registers: Entity extraction job (every hour) ``` ## Database Schema ### entities Table ```sql CREATE TABLE entities ( id TEXT PRIMARY KEY, -- UUID canonical_name TEXT NOT NULL UNIQUE, entity_type TEXT NOT NULL CHECK(entity_type IN ('person', 'organization', 'technique', 'dataset', 'concept')), aliases TEXT, -- JSON array description TEXT, metadata TEXT, -- JSON object frequency_count INTEGER DEFAULT 0, first_seen DATETIME DEFAULT CURRENT_TIMESTAMP, last_seen DATETIME, created_at DATETIME DEFAULT CURRENT_TIMESTAMP ); ``` ### entity\_mentions Table ```sql CREATE TABLE entity_mentions ( id INTEGER PRIMARY KEY AUTOINCREMENT, entity_id TEXT NOT NULL REFERENCES entities(id), article_id INTEGER NOT NULL, confidence REAL NOT NULL CHECK(confidence BETWEEN 0 AND 1), extraction_method TEXT NOT NULL CHECK(extraction_method IN ('ner_model', 'rule_based', 'manual')), context TEXT, -- Surrounding text snippet mentioned_at DATETIME DEFAULT CURRENT_TIMESTAMP, FOREIGN KEY (entity_id) REFERENCES entities(id), FOREIGN KEY (article_id) REFERENCES feed_entries(id) ); ``` ### FTS5 Virtual Table ```sql CREATE VIRTUAL TABLE entities_fts USING fts5( entity_id UNINDEXED, canonical_name, aliases, description ); ``` ## Model Installation The first run will download the spaCy model (\~13MB): ```bash # Manual download (optional) uv run python -m spacy download en_core_web_sm ``` **Model Info**: * Name: `en_core_web_sm` * Size: 13MB * Language: English * Accuracy: \~85% F1 score on OntoNotes 5.0 ## Configuration ```python class Phase5Settings(BaseSettings): entity_batch_size: int = 50 entity_cron: str = "0 * * * *" # Every hour entity_confidence_threshold: float = 0.7 spacy_model: str = "en_core_web_sm" ``` **Environment Variables**: ```bash PHASE5_ENTITY_BATCH_SIZE=50 PHASE5_ENTITY_CONFIDENCE_THRESHOLD=0.7 PHASE5_SPACY_MODEL=en_core_web_sm ``` ## Performance * **Throughput**: \~50 articles/hour * **Memory**: \~200MB (spaCy model loaded) * **Storage**: \~50 bytes per entity mention ## Use Cases ### Track Influential Researchers ```bash # Find top AI researchers by mention frequency aiwebfeeds nlp list-entities --type person --limit 20 ``` ### Discover Emerging Techniques ```bash # Find recently mentioned techniques aiwebfeeds nlp list-entities --type technique --sort recent ``` ### Build Knowledge Graphs Connect entities by co-occurrence in articles: ```python # Articles mentioning both "GPT-4" and "RLHF" storage.get_articles_mentioning_entities(["GPT-4", "RLHF"]) ``` ## Troubleshooting ### Low Extraction Accuracy **Symptom**: Many entities missed or incorrectly classified. **Solutions**: 1. Use larger spaCy model: `en_core_web_lg` (40MB, better accuracy) 2. Add domain-specific rules for AI terminology 3. Manual curation: Add aliases for common variations ### Duplicate Entities **Symptom**: "Geoffrey Hinton" and "Geoff Hinton" as separate entities. **Solution**: ```bash # Merge duplicates aiwebfeeds nlp merge-entities "Geoff Hinton" "Geoffrey Hinton" # Add alias aiwebfeeds nlp add-alias "Geoffrey Hinton" "Geoff Hinton" ``` ### spaCy Model Not Found **Symptom**: `OSError: Can't find model 'en_core_web_sm'` **Solution**: ```bash uv run python -m spacy download en_core_web_sm ``` ## See Also * [Quality Scoring](/docs/features/quality-scoring) - Article quality assessment * [Sentiment Analysis](/docs/features/sentiment-analysis) - Sentiment classification * [Topic Modeling](/docs/features/topic-modeling) - Discover subtopics -------------------------------------------------------------------------------- END OF PAGE 29 -------------------------------------------------------------------------------- ================================================================================ PAGE 30 OF 57 ================================================================================ TITLE: Link Validation URL: https://ai-web-feeds.w4w.dev/docs/features/link-validation MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/link-validation.mdx DESCRIPTION: Ensure all links in your documentation are correct and working PATH: /features/link-validation -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Link Validation (/docs/features/link-validation) import { Callout } from "fumadocs-ui/components/callout"; import { Tab, Tabs } from "fumadocs-ui/components/tabs"; import { Step, Steps } from "fumadocs-ui/components/steps"; import { Card, Cards } from "fumadocs-ui/components/card"; import { Link as LinkIcon, Hash, FileText, FolderOpen } from "lucide-react"; Automatically validate all links in your documentation to ensure they're correct and working. ## Overview Link validation uses [`next-validate-link`](https://next-validate-link.vercel.app) to check: }> **Internal Links** Links between documentation pages }> **Anchor Links** Links to headings within pages }> **MDX Components** Links in Cards and other components }> **Relative Paths** File path references ## Features * ✅ **Automatic scanning** - Finds all links in MDX files * ✅ **Heading validation** - Checks anchor links to headings * ✅ **Component support** - Validates links in MDX components * ✅ **Relative paths** - Checks file references * ✅ **Exit codes** - CI/CD friendly error reporting * ✅ **Detailed errors** - Shows exact location of broken links ## Quick Start ### Run Validation ```bash pnpm lint:links ``` Uses the Node.js/tsx runtime (no additional installation required). ```bash # Install Bun first (if not already installed) curl -fsSL https://bun.sh/install | bash # Run with Bun pnpm lint:links:bun ``` Uses the Bun runtime for faster execution. This will scan all documentation files and validate: * Links to other documentation pages * Anchor links to headings * Links in Card components * Relative file paths ### Expected Output **All links valid:** ``` 🔍 Scanning URLs and validating links... ✅ All links are valid! ``` **Broken links found:** ``` 🔍 Scanning URLs and validating links... ❌ /Users/.../content/docs/index.mdx Line 25: Link to /docs/invalid-page not found ❌ Found 1 link validation error(s) ``` ## How It Works ### File Structure ``` apps/web/ ├── bunfig.toml # Bun runtime configuration (for Bun) ├── scripts/ │ ├── lint.ts # Validation script (Bun runtime) │ ├── lint-node.mjs # Validation script (Node.js runtime) │ └── preload.ts # MDX plugin loader (for Bun) └── package.json # Scripts configuration ``` ### Validation Script The `scripts/lint-node.mjs` file runs with tsx/Node.js: ```javascript title="scripts/lint-node.mjs" import { printErrors, scanURLs, validateFiles, } from 'next-validate-link'; import { loader } from 'fumadocs-core/source'; import { createMDXSource } from 'fumadocs-mdx'; import { map } from '@/.map'; const source = loader({ baseUrl: '/docs', source: createMDXSource(map), }); async function checkLinks() { const scanned = await scanURLs({ preset: 'next', populate: { 'docs/[[...slug]]': source.getPages().map((page) => ({ value: { slug: page.slugs }, hashes: getHeadings(page), })), }, }); const errors = await validateFiles(await getFiles(), { scanned, markdown: { components: { Card: { attributes: ['href'] }, }, }, checkRelativePaths: 'as-url', }); printErrors(errors, true); if (errors.length > 0) { process.exit(1); } } ``` The `scripts/lint.ts` file runs with Bun runtime: ```typescript title="scripts/lint.ts" import { type FileObject, printErrors, scanURLs, validateFiles, } from 'next-validate-link'; import type { InferPageType } from 'fumadocs-core/source'; import { source } from '@/lib/source'; async function checkLinks() { const scanned = await scanURLs({ preset: 'next', populate: { 'docs/[[...slug]]': source.getPages().map((page) => ({ value: { slug: page.slugs }, hashes: getHeadings(page), })), }, }); const errors = await validateFiles(await getFiles(), { scanned, markdown: { components: { Card: { attributes: ['href'] }, }, }, checkRelativePaths: 'as-url', }); printErrors(errors, true); if (errors.length > 0) { process.exit(1); } } ``` Requires Bun preload setup (see below). ### Bun Runtime Loader Only required if using the Bun runtime ( `pnpm lint:links:bun` ). The default Node.js version doesn't need this. The `scripts/preload.ts` enables MDX processing in Bun: ```typescript title="scripts/preload.ts" import { createMdxPlugin } from "fumadocs-mdx/bun"; Bun.plugin(createMdxPlugin()); ``` ### Bun Configuration Only required for Bun runtime. Not needed for default Node.js execution. The `bunfig.toml` loads the preload script: ```toml title="bunfig.toml" preload = ["./scripts/preload.ts"] ``` ## What Gets Validated ### Internal Documentation Links Links to other documentation pages: ```mdx [Getting Started](/docs) [PDF Export](/docs/features/pdf-export) [Testing Guide](/docs/guides/testing) ``` ### Anchor Links Links to headings within pages: ```mdx [Quick Start](#quick-start) [Configuration](#configuration) ``` ### MDX Component Links Links in special components: ```mdx ``` ### Relative Paths File references: ```mdx [Scripts Documentation](./scripts/README.md) [Source Code](../../packages/ai_web_feeds/src) ``` ## CI/CD Integration ### GitHub Actions Add to your workflow: ```yaml title=".github/workflows/validate.yml" name: Validate Links on: push: branches: [main] pull_request: branches: [main] jobs: validate-links: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: oven-sh/setup-bun@v1 with: bun-version: latest - name: Install dependencies run: pnpm install - name: Validate links run: pnpm lint:links ``` ### Exit Codes The script exits with appropriate codes: * **0** - All links valid ✅ * **1** - Broken links found ❌ ## Customization ### Add More Components Validate links in additional MDX components: ```typescript title="scripts/lint.ts" markdown: { components: { Card: { attributes: ['href'] }, CustomCard: { attributes: ['link', 'url'] }, Button: { attributes: ['href'] }, }, } ``` ### Custom Validation Rules Add custom validation logic: ```typescript title="scripts/lint.ts" const errors = await validateFiles(await getFiles(), { scanned, markdown: { components: { Card: { attributes: ["href"] }, }, }, checkRelativePaths: "as-url", // Custom filter filter: (file) => { // Skip draft files return !file.data?.draft; }, }); ``` ### Exclude Patterns Skip certain files or paths: ```typescript title="scripts/lint.ts" async function getFiles(): Promise { const allPages = source.getPages(); // Filter out test files const pages = allPages.filter((page) => !page.absolutePath.includes("/test/")); const promises = pages.map( async (page): Promise => ({ path: page.absolutePath, content: await page.data.getText("raw"), url: page.url, data: page.data, }), ); return Promise.all(promises); } ``` ## Common Issues ### Broken Links **Problem:** Link to `/docs/invalid-page` not found **Solutions:** * Check the page exists in `content/docs/` * Verify the URL path matches the file structure * Ensure `meta.json` includes the page **Problem:** Anchor `#section-name` not found **Solutions:** * Check heading exists in target page * Verify anchor matches heading slug * Headings are auto-slugified (spaces become `-`) **Problem:** Card href `/docs/page` not found **Solutions:** * Verify Card component uses `href` attribute * Check link target exists * Add component to validation config if custom ### False Positives Some links may be valid but flagged as errors: **External Links** ```mdx [GitHub](https://github.com/user/repo) ``` **Dynamic Routes** ```mdx [User Profile](/users/[id]) ``` **API Routes** ```mdx [Search API](/api/search) ``` ### Bun Not Installed The default `pnpm lint:links` command uses Node.js/tsx and doesn't require Bun. If you want to use the faster Bun runtime, install it: ```bash curl -fsSL https://bun.sh/install | bash ``` Then use: `pnpm lint:links:bun` ### Script Errors If the script fails to run: ```bash # Clear cache rm -rf .next/ rm -rf node_modules/ pnpm install # Verify Bun is installed bun --version # Run with verbose output DEBUG=* pnpm lint:links ``` ## Best Practices ### 1. Run Before Commits Add to your pre-commit hook: ```bash title=".husky/pre-commit" #!/bin/sh pnpm lint:links ``` ### 2. Validate on Build Add to build process: ```json title="package.json" { "scripts": { "build": "pnpm lint:links && next build" } } ``` ### 3. Regular Checks Run validation regularly: ```bash # Daily cron job 0 0 * * * cd /path/to/project && pnpm lint:links ``` ### 4. Document Link Patterns Keep a consistent link style: ```mdx [Features](/docs/features/pdf-export) [Features](../features/pdf-export) ``` ### 5. Use Anchor Links Link to specific sections: ```mdx [Configuration Section](/docs/features/rss-feeds#configuration) ``` ## Testing ### Manual Test Create a broken link to test: ```mdx title="content/docs/test.mdx" --- title: Test Page --- This link is broken: [Invalid Page](/docs/does-not-exist) ``` Run validation: ```bash pnpm lint:links ``` **Expected output:** ``` ❌ /Users/.../content/docs/test.mdx Line 6: Link to /docs/does-not-exist not found ``` ### Test Anchor Links ```mdx This anchor is broken: [Missing Section](#does-not-exist) ``` ### Test Component Links ```mdx ``` ## Performance ### Optimization Tips 1. **Cache Results** * Validation results can be cached between runs * Only re-validate changed files 2. **Parallel Processing** * Script processes files in parallel * Scales with CPU cores 3. **Incremental Validation** * Only validate modified files in CI * Use git diff to find changed files ### Benchmark Typical validation times: | Pages | Time | | ----- | ----- | | 10 | \~2s | | 50 | \~5s | | 100 | \~10s | | 500 | \~30s | ## Related Documentation * [Quick Reference](/docs/guides/quick-reference) - Commands and scripts * [Testing Guide](/docs/guides/testing) - Comprehensive testing * [PDF Export](/docs/features/pdf-export) - Export documentation ## External Resources * [next-validate-link Documentation](https://next-validate-link.vercel.app) * [Fumadocs Link Validation Guide](https://fumadocs.dev/docs/ui/validate-links) * [Bun Documentation](https://bun.sh/docs) -------------------------------------------------------------------------------- END OF PAGE 30 -------------------------------------------------------------------------------- ================================================================================ PAGE 31 OF 57 ================================================================================ TITLE: llms-full.txt Format URL: https://ai-web-feeds.w4w.dev/docs/features/llms-full-format MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/llms-full-format.mdx DESCRIPTION: Detailed specification of the enhanced llms-full.txt structured format PATH: /features/llms-full-format -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # llms-full.txt Format (/docs/features/llms-full-format) import { Callout } from "fumadocs-ui/components/callout"; import { Tab, Tabs } from "fumadocs-ui/components/tabs"; The `/llms-full.txt` endpoint provides a comprehensive, structured format optimized for AI agents and RAG systems. ## Overview The enhanced format includes: * **Metadata header** with generation info * **Table of contents** for navigation * **Structured page sections** with clear separators * **Individual metadata** for each page * **AI-friendly formatting** for easy parsing This format is designed to be both human-readable and machine-parsable, making it ideal for RAG systems, embeddings, and AI analysis. ## Format Structure The document follows this hierarchical structure: ``` ================================================================================ HEADER SECTION ================================================================================ ├── Metadata (date, page count, base URL) ├── Description ├── Structure explanation └── Table of Contents ================================================================================ DOCUMENTATION CONTENT ================================================================================ ├── PAGE 1 │ ├── Page metadata (title, URL, description, path) │ ├── Content separator │ ├── Full markdown content │ └── End marker ├── PAGE 2 │ └── ... └── PAGE N ================================================================================ FOOTER SECTION ================================================================================ └── Summary and access information ``` ## Header Section ### Metadata Block Essential information about the documentation: ```text ================================================================================ AI WEB FEEDS - COMPLETE DOCUMENTATION ================================================================================ METADATA -------------------------------------------------------------------------------- Generated: 2025-10-14T12:00:00.000Z Total Pages: 5 Base URL: https://yourdomain.com Format: Markdown Encoding: UTF-8 ``` ### Description Block Project overview for context: ```text DESCRIPTION -------------------------------------------------------------------------------- A comprehensive collection of curated RSS/Atom feeds optimized for AI agents and large language models. This document contains the complete documentation for the AI Web Feeds project, including setup guides, API references, and usage examples. ``` ### Structure Explanation Format guide for parsers: ```text STRUCTURE -------------------------------------------------------------------------------- Each page section follows this format: - Page separator (===) - Page number (X OF Y) - Page metadata (title, URL, description, path) - Content separator (---) - Full markdown content ``` ### Table of Contents Complete navigation index: ```text NAVIGATION -------------------------------------------------------------------------------- Table of Contents: 1. Getting Started - /docs 2. PDF Export - /docs/features/pdf-export 3. AI Integration - /docs/features/ai-integration 4. Testing Guide - /docs/guides/testing 5. Quick Reference - /docs/guides/quick-reference ================================================================================ DOCUMENTATION CONTENT ================================================================================ ``` ## Page Section Format Each page follows a consistent structure: ```text ================================================================================ PAGE 1 OF 5 ================================================================================ TITLE: Getting Started URL: https://yourdomain.com/docs MARKDOWN: https://yourdomain.com/docs.mdx DESCRIPTION: Quick start guide for AI Web Feeds PATH: / -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Getting Started [Full markdown content of the page...] -------------------------------------------------------------------------------- END OF PAGE 1 -------------------------------------------------------------------------------- ``` ### Page Metadata Fields | Field | Description | Example | | ------------- | ----------------- | --------------------------------- | | `TITLE` | Page title | `Getting Started` | | `URL` | Full page URL | `https://yourdomain.com/docs` | | `MARKDOWN` | Markdown endpoint | `https://yourdomain.com/docs.mdx` | | `DESCRIPTION` | Page description | `Quick start guide...` | | `PATH` | Relative path | `/` | ## Footer Section Summary and access instructions: ```text ================================================================================ END OF DOCUMENTATION ================================================================================ Total pages processed: 5 Generated: 2025-10-14T12:00:00.000Z Format: Plain text with markdown content For individual pages, append .mdx to any documentation URL. For the discovery file, visit /llms.txt ================================================================================ ``` ## Benefits for AI Agents ### Clear Structure * **Consistent separators** - 80-character wide `=` and `-` lines * **Numbered pages** - `PAGE X OF Y` format * **Hierarchical organization** - Header → Content → Footer * **Predictable format** - Easy to parse with regex ### Rich Metadata * **Generation timestamp** - Know when docs were created * **Total page count** - Plan context window usage * **Base URL** - Resolve relative links * **Per-page metadata** - Title, URL, description, path ### Multiple Access Patterns * **Complete documentation** - Single request for all content * **Table of contents** - Quick overview of structure * **Individual pages** - URLs for targeted access * **Markdown endpoints** - Source content links ### Parser-Friendly * **Fixed-width separators** - 80 characters for consistency * **Clear section markers** - Unmistakable boundaries * **Predictable structure** - Same format every time * **UTF-8 encoding** - Universal character support ## HTTP Headers Enhanced response headers provide additional metadata: ```http Content-Type: text/plain; charset=utf-8 Cache-Control: public, max-age=0, must-revalidate X-Content-Pages: 5 X-Generated-Date: 2025-10-14T12:00:00.000Z ``` Custom headers allow clients to access metadata without parsing the document body. ## Usage Examples ### RAG System Integration ```python import requests # Fetch complete documentation response = requests.get('https://yourdomain.com/llms-full.txt') content = response.text # Parse metadata from headers total_pages = int(response.headers['X-Content-Pages']) generated = response.headers['X-Generated-Date'] # Split by page separators separator = '=' * 80 + '\nPAGE ' pages = content.split(separator) # Extract table of contents toc_start = content.find('Table of Contents:') toc_end = content.find('=' * 80 + '\nDOCUMENTATION CONTENT') toc = content[toc_start:toc_end] # Process individual pages for i, page in enumerate(pages[1:], 1): if 'TITLE:' in page: # Extract page metadata title = page.split('TITLE: ')[1].split('\n')[0] url = page.split('URL: ')[1].split('\n')[0] # Extract content content_start = page.find('CONTENT\n' + '-' * 80 + '\n\n') content_end = page.find('\n\n' + '-' * 80 + '\nEND OF PAGE') content = page[content_start:content_end] print(f"Page {i}: {title}") ``` ```javascript // Fetch complete documentation const response = await fetch('https://yourdomain.com/llms-full.txt'); const content = await response.text(); // Parse metadata from headers const totalPages = parseInt(response.headers.get('X-Content-Pages')); const generated = response.headers.get('X-Generated-Date'); // Split by page separators const separator = '='.repeat(80) + '\nPAGE '; const pages = content.split(separator); // Extract table of contents const tocStart = content.indexOf('Table of Contents:'); const tocEnd = content.indexOf('='.repeat(80) + '\nDOCUMENTATION CONTENT'); const toc = content.substring(tocStart, tocEnd); // Process individual pages pages.slice(1).forEach((page, index) => { if (page.includes('TITLE:')) { // Extract page metadata const title = page.split('TITLE: ')[1].split('\n')[0]; const url = page.split('URL: ')[1].split('\n')[0]; // Extract content const contentStart = page.indexOf('CONTENT\n' + '-'.repeat(80) + '\n\n'); const contentEnd = page.indexOf('\n\n' + '-'.repeat(80) + '\nEND OF PAGE'); const content = page.substring(contentStart, contentEnd); console.log(`Page ${index + 1}: ${title}`); } }); ``` ```bash # Download complete documentation curl https://yourdomain.com/llms-full.txt -o docs.txt # View headers curl -I https://yourdomain.com/llms-full.txt # Extract table of contents curl https://yourdomain.com/llms-full.txt | \ sed -n '/Table of Contents:/,/^===/p' # Count pages curl https://yourdomain.com/llms-full.txt | \ grep -c "^PAGE [0-9]" # Extract first page curl https://yourdomain.com/llms-full.txt | \ sed -n '/^PAGE 1 OF/,/^END OF PAGE 1/p' ``` ## Parsing Tips ### Regular Expressions ```python import re # Extract page numbers page_pattern = r'PAGE (\d+) OF (\d+)' matches = re.findall(page_pattern, content) # Extract metadata fields title_pattern = r'TITLE: (.+)' url_pattern = r'URL: (.+)' desc_pattern = r'DESCRIPTION: (.+)' # Split by separators separator_80 = r'={80}' separator_dash = r'-{80}' ``` ### Content Extraction ```python def extract_pages(content: str) -> list: """Extract individual pages from llms-full.txt""" pages = [] # Find all page sections page_pattern = r'={80}\nPAGE (\d+) OF (\d+)={80}(.+?)(?=={80}\nPAGE |\Z)' for match in re.finditer(page_pattern, content, re.DOTALL): page_num, total, page_content = match.groups() # Extract metadata metadata = {} for line in page_content.split('\n'): if ':' in line and line.isupper().startswith(line.split(':')[0]): key, value = line.split(':', 1) metadata[key.strip()] = value.strip() # Extract content content_match = re.search( r'CONTENT\n-{80}\n\n(.+?)\n\n-{80}', page_content, re.DOTALL ) if content_match: pages.append({ 'page_number': int(page_num), 'total_pages': int(total), 'metadata': metadata, 'content': content_match.group(1).strip() }) return pages ``` ### Token Counting ```python def count_tokens_per_page(content: str) -> dict: """Estimate token count for each page""" import tiktoken enc = tiktoken.get_encoding("cl100k_base") pages = extract_pages(content) token_counts = {} for page in pages: page_content = page['content'] tokens = len(enc.encode(page_content)) token_counts[page['metadata']['TITLE']] = tokens return token_counts ``` ## Comparison with Previous Format ### Before Enhancement ```text # Page Title (url) Content... # Another Page (url) Content... ``` **Limitations:** * No metadata header * No table of contents * Basic separators * No page numbers * No HTTP headers ### After Enhancement ```text ================================================================================ HEADER WITH METADATA ================================================================================ ... Table of Contents: [all pages] ================================================================================ PAGE 1 OF 5 ================================================================================ TITLE: ... URL: ... MARKDOWN: ... ... ``` **Improvements:** * ✅ Rich metadata header * ✅ Complete table of contents * ✅ 80-character separators * ✅ Page numbers (X OF Y) * ✅ Custom HTTP headers * ✅ Structured format ## Best Practices ### For RAG Systems 1. **Parse metadata first** - Get page count and base URL 2. **Use table of contents** - Quick overview of structure 3. **Extract pages individually** - Process one at a time 4. **Respect token limits** - Use page numbers to estimate size 5. **Cache the response** - Revalidate periodically ### For Embeddings 1. **Chunk by pages** - Natural boundaries 2. **Include metadata** - Title, URL, description in embeddings 3. **Cross-reference** - Use URLs for linking 4. **Update regularly** - Check X-Generated-Date header ### For Analysis 1. **Validate structure** - Check separator consistency 2. **Handle errors** - Missing descriptions are optional 3. **Use HTTP headers** - Metadata without parsing 4. **Test parsing** - Verify on sample data first ## Testing ### Verify Format ```bash # Download and inspect curl https://yourdomain.com/llms-full.txt > docs.txt # Check header head -50 docs.txt # Count separators (should be consistent) grep -c "^====" docs.txt grep -c "^----" docs.txt # Verify page numbers grep "^PAGE [0-9]" docs.txt ``` ### Validate Headers ```bash # Check custom headers curl -I https://yourdomain.com/llms-full.txt | grep "X-" # Expected output: # X-Content-Pages: 5 # X-Generated-Date: 2025-10-14T12:00:00.000Z ``` ## Related Documentation * [AI Integration](/docs/features/ai-integration) - Complete AI/LLM guide * [Testing Guide](/docs/guides/testing) - Verify your setup * [Quick Reference](/docs/guides/quick-reference) - Commands and endpoints -------------------------------------------------------------------------------- END OF PAGE 31 -------------------------------------------------------------------------------- ================================================================================ PAGE 32 OF 57 ================================================================================ TITLE: Math Equations URL: https://ai-web-feeds.w4w.dev/docs/features/math MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/math.mdx DESCRIPTION: Render beautiful mathematical equations in your documentation using KaTeX PATH: /features/math -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Math Equations (/docs/features/math) import { Callout } from "fumadocs-ui/components/callout"; import { Tab, Tabs } from "fumadocs-ui/components/tabs"; ## Overview KaTeX is a fast, easy-to-use JavaScript library for rendering TeX math notation on the web. This site integrates KaTeX to enable beautiful mathematical equations in documentation. ## Features * **Fast rendering** - KaTeX is significantly faster than MathJax * **High quality** - Produces crisp output at any zoom level * **Self-contained** - No dependencies on external fonts or stylesheets * **Server-side rendering** - Works without JavaScript enabled * **TeX/LaTeX syntax** - Familiar notation for mathematicians ## Basic Usage ### Inline Math Wrap inline equations with single dollar signs `$...$`: ```mdx The Pythagorean theorem states that $c = \pm\sqrt{a^2 + b^2}$ for a right triangle. ``` The Pythagorean theorem states that $c = \pm\sqrt{a^2 + b^2}$ for a right triangle. ### Block Math Use code blocks with the `math` language identifier or wrap with double dollar signs `$$...$$`: ````mdx ```math c = \pm\sqrt{a^2 + b^2} ``` ```` ```math c = \pm\sqrt{a^2 + b^2} ``` Or using double dollar signs: ```mdx $$ E = mc^2 $$ ``` $$ E = mc^2 $$ ## Common Examples ### Algebra **Quadratic Formula:** ```math x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} ``` **Binomial Theorem:** ```math (x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n-k} y^k ``` ### Calculus **Fundamental Theorem of Calculus:** ```math \int_a^b f(x) \, dx = F(b) - F(a) ``` **Partial Derivatives:** ```math \frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x} ``` **Limit Definition:** ```math \lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^x = e ``` ### Linear Algebra **Matrix Multiplication:** ```math \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} e & f \\ g & h \end{bmatrix} = \begin{bmatrix} ae + bg & af + bh \\ ce + dg & cf + dh \end{bmatrix} ``` **Determinant:** ```math \det(A) = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc ``` ### Statistics & Probability **Normal Distribution:** ```math f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} ``` **Bayes' Theorem:** ```math P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} ``` ### Complex Analysis **Taylor Series Expansion:** The Taylor expansion expresses a holomorphic function $f(z)$ as a power series: ```math \displaystyle {\begin{aligned}T_{f}(z)&=\sum _{k=0}^{\infty }{\frac {(z-c)^{k}}{2\pi i}}\int _{\gamma }{\frac {f(w)}{(w-c)^{k+1}}}\,dw\\&={\frac {1}{2\pi i}}\int _{\gamma }{\frac {f(w)}{w-c}}\sum _{k=0}^{\infty }\left({\frac {z-c}{w-c}}\right)^{k}\,dw\\&={\frac {1}{2\pi i}}\int _{\gamma }{\frac {f(w)}{w-c}}\left({\frac {1}{1-{\frac {z-c}{w-c}}}}\right)\,dw\\&={\frac {1}{2\pi i}}\int _{\gamma }{\frac {f(w)}{w-z}}\,dw=f(z),\end{aligned}} ``` **Euler's Formula:** ```math e^{ix} = \cos(x) + i\sin(x) ``` ### Physics **Schrödinger Equation:** ```math i\hbar\frac{\partial}{\partial t}\Psi(\mathbf{r},t) = \hat{H}\Psi(\mathbf{r},t) ``` **Maxwell's Equations:** ```math \begin{aligned} \nabla \cdot \mathbf{E} &= \frac{\rho}{\epsilon_0} \\ \nabla \cdot \mathbf{B} &= 0 \\ \nabla \times \mathbf{E} &= -\frac{\partial \mathbf{B}}{\partial t} \\ \nabla \times \mathbf{B} &= \mu_0\mathbf{J} + \mu_0\epsilon_0\frac{\partial \mathbf{E}}{\partial t} \end{aligned} ``` **Lagrangian Mechanics:** The action functional $S$ is defined as: ```math \displaystyle S[{\boldsymbol {q}}]=\int _{a}^{b}L(t,{\boldsymbol {q}}(t),{\dot {\boldsymbol {q}}}(t))\,dt. ``` ## Advanced Features ### Multi-line Equations Use `aligned` environment for aligned equations: ```math \begin{aligned} f(x) &= (x+a)(x+b) \\ &= x^2 + (a+b)x + ab \end{aligned} ``` ### Cases and Piecewise Functions ```math f(x) = \begin{cases} x^2 & \text{if } x \geq 0 \\ -x^2 & \text{if } x < 0 \end{cases} ``` ### Fractions and Continued Fractions ```math \frac{1}{\displaystyle 1+\frac{1}{\displaystyle 2+\frac{1}{\displaystyle 3+\frac{1}{4}}}} ``` ### Greek Letters and Symbols Common symbols used in mathematics: * Greek: $\alpha, \beta, \gamma, \delta, \epsilon, \theta, \lambda, \mu, \pi, \sigma, \omega$ * Operators: $\sum, \prod, \int, \oint, \nabla, \partial$ * Relations: $\leq, \geq, \neq, \approx, \equiv, \propto$ * Sets: $\in, \notin, \subset, \subseteq, \cup, \cap, \emptyset$ * Logic: $\forall, \exists, \neg, \land, \lor, \implies, \iff$ ### Subscripts and Superscripts ```math x_1, x_2, \ldots, x_n \quad \text{and} \quad a^2 + b^2 = c^2 ``` ### Large Operators **Summation:** ```math \sum_{i=1}^{n} i = \frac{n(n+1)}{2} ``` **Product:** ```math \prod_{i=1}^{n} i = n! ``` **Integration:** ```math \int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi} ``` ## Special Formatting ### Colored Equations KaTeX supports color through the `\textcolor` and `\colorbox` commands: ```math \textcolor{red}{F = ma} \quad \text{and} \quad \colorbox{yellow}{$E = mc^2$} ``` ### Sizing Control the size of your equations: ```math \tiny{tiny} \quad \small{small} \quad \normalsize{normal} \quad \large{large} \quad \Large{Large} \quad \LARGE{LARGE} \quad \huge{huge} ``` ### Spacing Fine-tune spacing in equations: ```math a\!b \quad a\,b \quad a\:b \quad a\;b \quad a\ b \quad a\quad b \quad a\qquad b ``` ## Best Practices ### Keep It Readable Use clear variable names and proper spacing: ```math P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} ``` Cramped or unclear notation: ```math P(X=k)=\binom{n}{k}p^k(1-p)^{n-k} ``` ### Use Display Style for Complex Equations For complex fractions and large operators, use `\displaystyle`: ```math \displaystyle \sum_{i=1}^{n} \frac{1}{i^2} = \frac{\pi^2}{6} ``` ### Break Long Equations For very long equations, use multiple lines with `aligned`: ```math \begin{aligned} (a + b)^3 &= (a + b)(a + b)^2 \\ &= (a + b)(a^2 + 2ab + b^2) \\ &= a^3 + 3a^2b + 3ab^2 + b^3 \end{aligned} ``` ### Label Important Equations Use text annotations to explain components: ```math \underbrace{e^{i\pi}}_{\text{Euler's identity}} + 1 = 0 ``` ## Common Syntax Reference ### Basic Operations | Syntax | Result | Description | | ------------- | ------------- | -------------- | | `x + y` | $x + y$ | Addition | | `x - y` | $x - y$ | Subtraction | | `x \times y` | $x \times y$ | Multiplication | | `x \div y` | $x \div y$ | Division | | `\frac{x}{y}` | $\frac{x}{y}$ | Fraction | | `x^y` | $x^y$ | Superscript | | `x_y` | $x_y$ | Subscript | | `\sqrt{x}` | $\sqrt{x}$ | Square root | | `\sqrt[n]{x}` | $\sqrt[n]{x}$ | nth root | ### Delimiters | Syntax | Result | Description | | ------------------- | ------------------- | -------------- | | `(x)` | $(x)$ | Parentheses | | `[x]` | $[x]$ | Brackets | | `\{x\}` | $\{x\}$ | Braces | | `\langle x \rangle` | $\langle x \rangle$ | Angle brackets | | `\lvert x \rvert` | $\lvert x \rvert$ | Absolute value | | `\lVert x \rVert` | $\lVert x \rVert$ | Norm | ## Troubleshooting ### Equation Not Rendering * Check that `katex/dist/katex.css` is imported in your layout * Verify the TeX syntax is valid * Ensure `remark-math` and `rehype-katex` are configured correctly * Use the [KaTeX Live Demo](https://katex.org/#demo) to test syntax ### Missing Symbols * Not all LaTeX commands are supported by KaTeX * Check the [KaTeX Support Table](https://katex.org/docs/support_table.html) * Consider using alternative notation ### Escaping Special Characters Use backslash to escape special characters: ```mdx Use \$ for a dollar sign, not $\$$ in math mode. ``` You can copy equations from Wikipedia - they're already in LaTeX format and work directly with KaTeX! Try it: Visit any Wikipedia math article, right-click an equation, and select "Copy LaTeX code". ## Resources * [KaTeX Official Documentation](https://katex.org/) * [KaTeX Support Table](https://katex.org/docs/support_table.html) - Complete list of supported functions * [KaTeX Live Demo](https://katex.org/#demo) - Test equations in real-time * [LaTeX Math Symbols](https://www.latex-project.org/help/documentation/) - Comprehensive symbol reference * [Detexify](http://detexify.kirelabs.org/classify.html) - Draw a symbol to find its LaTeX command * [Fumadocs Math Guide](https://fumadocs.dev/docs/ui/markdown/math) ## Next Steps * Experiment with different equation types * Check out the [KaTeX support table](https://katex.org/docs/support_table.html) for all available commands * Review our [Mermaid Diagrams](/docs/features/mermaid) feature for visual diagrams * Explore [Documentation Guide](/docs/guides/documentation) for general writing tips -------------------------------------------------------------------------------- END OF PAGE 32 -------------------------------------------------------------------------------- ================================================================================ PAGE 33 OF 57 ================================================================================ TITLE: Mermaid Diagrams URL: https://ai-web-feeds.w4w.dev/docs/features/mermaid MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/mermaid.mdx DESCRIPTION: Render beautiful diagrams in your documentation using Mermaid syntax PATH: /features/mermaid -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Mermaid Diagrams (/docs/features/mermaid) import { Mermaid } from "@/components/mdx/mermaid"; import { Tab, Tabs } from "fumadocs-ui/components/tabs"; ## Overview Mermaid is a JavaScript-based diagramming and charting tool that uses Markdown-inspired syntax to create and modify diagrams dynamically. This site integrates Mermaid to enable rich, interactive diagrams in documentation. ## Features * **Theme-aware**: Diagrams automatically adapt to light/dark mode * **Interactive**: Clickable elements and tooltips * **Multiple diagram types**: Flowcharts, sequence diagrams, class diagrams, ER diagrams, and more * **Simple syntax**: Write diagrams using a Markdown-like syntax ## Basic Usage ### Method 1: Mermaid Code Blocks The simplest way to add a Mermaid diagram is using a fenced code block with the `mermaid` language identifier: ````md ```mermaid graph TD; A[Start] --> B{Decision}; B -->|Yes| C[Action 1]; B -->|No| D[Action 2]; C --> E[End]; D --> E; ``` ```` ### Method 2: Component Syntax You can also use the `` component directly for more control: ```mdx ``` ## Diagram Types ### Flowcharts Create process flows and decision trees: ### Sequence Diagrams Visualize interaction between components: ### Class Diagrams Document object-oriented structures: ### Entity Relationship Diagrams Model database schemas: ### State Diagrams Show state transitions: ### Gantt Charts Project timelines and scheduling: ### User Journey Map user experiences: ### Git Graph Visualize Git workflows: ## Advanced Features ### Subgraphs Organize complex diagrams with subgraphs: `mermaid graph TB subgraph Frontend A[React App] B[Vue App] end subgraph Backend C[API Server] D[Auth Service] end subgraph Database E[(PostgreSQL)] F[(Redis)] end A --> C B --> C C --> D C --> E D --> F ` `md ```mermaid graph TB subgraph Frontend A[React App] B[Vue App] end subgraph Backend C[API Server] D[Auth Service] end subgraph Database E[(PostgreSQL)] F[(Redis)] end A --> C B --> C C --> D C --> E D --> F ``` ` ### Styling Customize diagram appearance with inline styles: ## Best Practices ### Keep It Simple * Start with simple diagrams and add complexity gradually * Use subgraphs to organize large diagrams * Keep labels concise and clear ### Use Consistent Naming * Use descriptive node IDs * Follow a naming convention across diagrams * Use consistent shapes for similar elements ### Example: Good vs. Not Ideal ## Troubleshooting ### Diagram Not Rendering * Ensure `mermaid` and `next-themes` are installed * Check console for syntax errors * Verify the diagram type is supported ### Theme Issues * The component automatically detects light/dark mode * If themes don't switch, check that `RootProvider` is properly configured ### Syntax Errors * Use the [Mermaid Live Editor](https://mermaid.live/) to validate syntax * Check the [official Mermaid documentation](https://mermaid.js.org/) for syntax reference ## Resources * [Mermaid Official Documentation](https://mermaid.js.org/) * [Mermaid Live Editor](https://mermaid.live/) * [Mermaid Cheat Sheet](https://jojozhuang.github.io/tutorial/mermaid-cheat-sheet/) * [Fumadocs Mermaid Guide](https://fumadocs.dev/docs/ui/markdown/mermaid) ## Next Steps * Explore different diagram types in the examples above * Check out the [Mermaid syntax documentation](https://mermaid.js.org/intro/syntax-reference.html) * Review our [Documentation Guide](/docs/guides/documentation) for general writing tips -------------------------------------------------------------------------------- END OF PAGE 33 -------------------------------------------------------------------------------- ================================================================================ PAGE 34 OF 57 ================================================================================ TITLE: Features Overview URL: https://ai-web-feeds.w4w.dev/docs/features/overview MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/overview.mdx DESCRIPTION: Complete overview of AI Web Feeds capabilities - feed management, fetching, analytics, and integrations PATH: /features/overview -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Features Overview (/docs/features/overview) import { Card, Cards } from "fumadocs-ui/components/card"; AI Web Feeds is a comprehensive system for managing, fetching, and analyzing AI/ML content feeds. ## Core Capabilities ## Feed Management ### Centralized Feed Registry * **YAML-based configuration** (`data/feeds.yaml`) * **JSON schema validation** for correctness * **Multiple feed formats** (RSS, Atom, JSON Feed) * **Platform-specific discovery** (auto-detect and generate feed URLs) ### Feed Metadata * **Source types**: blog, newsletter, podcast, journal, preprint, organization, aggregator, video, docs, forum, dataset, code-repo * **Content mediums**: text, audio, video, code, data * **Topic classification** with relevance weights * **Language and localization** support * **Quality scoring** and curation status * **Contributor attribution** ## Advanced Fetching ### Comprehensive Metadata Extraction Extracts **100+ fields** from feeds: * **Basic info**: title, subtitle, description, link, language, copyright, generator * **Author/publisher**: name, email, managing editor, webmaster * **Visual assets**: images, logos, icons * **Technical**: TTL, skip hours/days, cloud config, PubSubHubbub * **Extensions**: iTunes podcast metadata, Dublin Core, Media RSS, GeoRSS ### Quality Assessment Three-dimensional scoring system (0-1): * **Completeness Score**: Measures metadata completeness * **Richness Score**: Evaluates content depth and quality * **Structure Score**: Assesses feed validity and structure ### Content Analysis * Item statistics (total, with content, with authors, with media) * Average content lengths * Publishing frequency detection * Update pattern analysis ### Reliability Features * **Conditional requests** using ETag and Last-Modified headers * **Automatic retry** with exponential backoff * **Configurable timeouts** * **Comprehensive error logging** * **Success rate tracking** ## Analytics & Reporting ### Overview Statistics * Total feeds, items, and topics * Feed status distribution (verified, active, inactive, archived) * Recent activity tracking (24h, 7d, 30d) ### Distribution Analysis * Source type distribution * Content medium distribution * Topic distribution across feeds * Language distribution * Geographic distribution (via GeoRSS) ### Performance Metrics * Fetch success/failure rates * Average fetch duration * Error type distribution * HTTP status code analysis * Bandwidth usage ### Content Intelligence * Content coverage analysis * Author attribution tracking * Category and tag analysis * Publishing trends by time/day * Content freshness metrics ### Feed Health Monitoring * Per-feed health scores (0-1) * Health status (Excellent, Good, Fair, Poor, Critical) * Success rate tracking * Content quality metrics * Publishing frequency analysis * Historical trend analysis ### Contributor Analytics * Top contributors by feed count * Verification rates * Quality benchmarking * Contribution timeline ### Reporting * **JSON reports**: Full analytics export * **OPML export**: For feed readers * **CSV export**: Via Python API * **Custom queries**: Database access ## Platform-Specific Integration ### Supported Platforms **Social/Community:** * **Reddit**: Subreddits and user feeds with sorting (hot, top, new) * **Hacker News**: Multiple feed types (frontpage, newest, best, ask, show, jobs) * **Dev.to**: User and organization feeds **Publishing:** * **Medium**: Publications, users, and tags * **Substack**: Newsletter feeds * **GitHub**: Releases, commits, tags, activity **Media:** * **YouTube**: Channels and playlists * **Podcasts**: iTunes podcast metadata support ### Auto-Discovery * Automatic feed URL generation for known platforms * HTML-based feed discovery for generic sites * Common feed URL pattern detection * Platform-specific configuration support ## Data Storage ### Database Schema * **SQLModel-based ORM** for type safety * Support for **SQLite and PostgreSQL** * Efficient relationship management * **JSON columns** for flexible metadata storage ### Models * `FeedSource`: Main feed registry with metadata * `FeedItem`: Individual feed entries * `FeedFetchLog`: Detailed fetch history and metrics * `Topic`: Topic taxonomy and relationships ## Export & Interoperability ### OPML Export * Standard OPML format * Categorized OPML by source type * Filtered OPML generation * Compatible with all major feed readers ### Data Formats * **YAML**: Human-editable feed configuration * **JSON**: API consumption and export * **JSON Schema**: Validation and documentation * **SQL**: Direct database queries ## CLI Tools ### Feed Management ```bash ai-web-feeds enrich all # Enrich feeds with metadata ai-web-feeds validate # Validate feed configuration ai-web-feeds export # Export to various formats ``` ### Data Fetching ```bash ai-web-feeds fetch one # Fetch single feed ai-web-feeds fetch all # Fetch all feeds ``` ### Analytics ```bash ai-web-feeds analytics overview # Dashboard view ai-web-feeds analytics distributions # Distribution analysis ai-web-feeds analytics quality # Quality metrics ai-web-feeds analytics performance # Fetch performance ai-web-feeds analytics content # Content statistics ai-web-feeds analytics trends # Publishing trends ai-web-feeds analytics health # Feed health report ai-web-feeds analytics report # Full JSON report ``` ### OPML Management ```bash ai-web-feeds opml generate # Generate OPML files ai-web-feeds opml categorize # Generate categorized OPML ``` ## Quality & Curation ### Curation Workflow * Verification status tracking * Quality score calculation (automated) * Curation notes and metadata * Contributor attribution * Curation history ### Quality Dimensions 1. **Completeness** (0-1): Metadata completeness 2. **Richness** (0-1): Content depth and quality 3. **Structure** (0-1): Feed validity and structure ### Health Status * **Excellent** (0.8-1.0): Optimal performance * **Good** (0.6-0.8): Healthy with minor issues * **Fair** (0.4-0.6): Some problems present * **Poor** (0.2-0.4): Needs attention * **Critical** (0.0-0.2): Failing/broken ## Extensibility ### Plugin Architecture * Custom platform generators * Configurable discovery rules * Extension metadata support * Flexible JSON storage for unknown fields ### API Design * Clean Python API for programmatic use * Rich CLI for interactive use * Database session management * Async/await support for concurrent operations ## Use Cases 1. **Content Aggregation**: Build comprehensive AI/ML content aggregators 2. **Research**: Track and analyze AI/ML publication patterns 3. **Monitoring**: Monitor feed health and reliability 4. **Discovery**: Find new AI/ML content sources 5. **Analysis**: Analyze publishing trends and patterns 6. **Curation**: Build high-quality curated feed lists 7. **Integration**: Feed data into other systems via exports 8. **Alerting**: Get notified when feeds break or content is published ## Architecture ``` ai-web-feeds/ ├── packages/ai_web_feeds/ # Core library │ ├── models.py # Data models │ ├── storage.py # Database management │ ├── utils.py # Feed discovery & enrichment │ ├── fetcher.py # Advanced feed fetching │ └── analytics.py # Analytics engine ├── apps/cli/ # CLI application │ └── commands/ # CLI commands │ ├── fetch.py # Fetch commands │ ├── analytics.py # Analytics commands │ ├── enrich.py # Enrichment commands │ ├── export.py # Export commands │ ├── opml.py # OPML commands │ └── validate.py # Validation commands └── data/ # Data files ├── feeds.yaml # Feed registry ├── topics.yaml # Topic taxonomy └── aiwebfeeds.db # SQLite database ``` ## Technology Stack * **Python 3.13+**: Modern Python with latest features * **SQLModel**: SQL database ORM with Pydantic integration * **feedparser**: Robust feed parsing * **httpx**: Modern async HTTP client * **BeautifulSoup**: HTML parsing for discovery * **Typer**: CLI framework * **Rich**: Beautiful terminal output * **Pydantic**: Data validation * **YAML/JSON**: Configuration and export formats ## Performance * **Conditional requests**: Reduce bandwidth with ETag/Last-Modified * **Async operations**: Concurrent feed fetching * **Retry logic**: Exponential backoff for transient failures * **Connection pooling**: Efficient HTTP connections * **Database indexing**: Fast queries * **Caching**: Feed metadata caching ## Security See the [Security Guide](/docs/security) for: * Input validation * Rate limiting * Error handling * Secure defaults * Vulnerability reporting ## Getting Started Ready to dive in? Check out our guides: * [Getting Started](/docs/guides/getting-started) - Installation and setup * [Analytics Guide](/docs/guides/analytics) - Advanced analytics * [CLI Reference](/docs/development/cli) - Command-line interface * [Python API](/docs/development/python-api) - Programmatic usage ## Future Roadmap Planned enhancements: * [ ] Real-time analytics dashboard (web UI) * [ ] Machine learning for content classification * [ ] Anomaly detection in publishing patterns * [ ] Advanced deduplication algorithms * [ ] Content similarity analysis * [ ] Multi-language NLP support * [ ] GraphQL API * [ ] Webhook notifications * [ ] Feed reader web interface * [ ] Export to more formats (Parquet, Arrow) -------------------------------------------------------------------------------- END OF PAGE 34 -------------------------------------------------------------------------------- ================================================================================ PAGE 35 OF 57 ================================================================================ TITLE: PDF Export URL: https://ai-web-feeds.w4w.dev/docs/features/pdf-export MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/pdf-export.mdx DESCRIPTION: Export your Fumadocs documentation pages as high-quality PDF files PATH: /features/pdf-export -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # PDF Export (/docs/features/pdf-export) import { Callout } from "fumadocs-ui/components/callout"; import { Tab, Tabs } from "fumadocs-ui/components/tabs"; import { Step, Steps } from "fumadocs-ui/components/steps"; Export your Fumadocs documentation pages as high-quality PDF files with automatic discovery and batch processing. ## Features **Automatic Discovery** Exports all documentation pages automatically **Clean Output** Navigation and UI elements hidden in print mode **Interactive Content** Accordions and tabs expanded to show all content **Batch Processing** Concurrent exports with rate limiting ## Quick Start ### Start Development Server ```bash pnpm dev ``` Wait for the server to be ready at `http://localhost:3000` ### Export PDFs `bash pnpm export-pdf ` Exports all documentation pages to the `pdfs/` directory. `bash pnpm export-pdf:specific /docs /docs/getting-started ` Export only the specified pages. `bash pnpm export-pdf:build ` Automated build and export (recommended for final PDFs). ### Find Your PDFs PDFs are saved to the `pdfs/` directory: ``` pdfs/ ├── index.pdf ├── docs-getting-started.pdf └── docs-features-pdf-export.pdf ``` ## How It Works ### Print Styles Special CSS in `app/global.css` hides navigation elements and optimizes for printing: ```css title="app/global.css" @media print { #nd-docs-layout { --fd-sidebar-width: 0px !important; } #nd-sidebar { display: none; } pre, img { page-break-inside: avoid; } } ``` ### Component Overrides When `NEXT_PUBLIC_PDF_EXPORT=true`, interactive components render expanded: ```tsx title="mdx-components.tsx" const isPrinting = process.env.NEXT_PUBLIC_PDF_EXPORT === "true"; return { Accordion: isPrinting ? PrintingAccordion : Accordion, Tab: isPrinting ? PrintingTab : Tab, }; ``` **PrintingAccordion** and **PrintingTab** components expand all content so nothing is hidden in PDFs. ### Export Script The `scripts/export-pdf.ts` script uses Puppeteer to: 1. Discover all documentation pages from `source.getPages()` 2. Navigate to each page with headless Chrome 3. Wait for content to load 4. Generate PDF with custom settings ```typescript title="scripts/export-pdf.ts" await page.pdf({ path: outputPath, width: "950px", printBackground: true, margin: { top: "20px", right: "20px", bottom: "20px", left: "20px", }, }); ``` ## Configuration ### PDF Settings Edit `scripts/export-pdf.ts` to customize PDF output: ```typescript title="scripts/export-pdf.ts" await page.pdf({ path: outputPath, width: "950px", // Page width printBackground: true, // Include backgrounds margin: { // Page margins top: "20px", right: "20px", bottom: "20px", left: "20px", }, }); ``` ### Concurrency Control Adjust parallel exports to match your server capacity: ```typescript title="scripts/export-pdf.ts" const CONCURRENCY = 3; // Export 3 pages at a time ``` Higher concurrency = faster exports but more server load. Start with 3 and adjust based on your system. ### Environment Variables Set `NEXT_PUBLIC_PDF_EXPORT=true` to enable PDF-friendly rendering: ```bash NEXT_PUBLIC_PDF_EXPORT=true pnpm build ``` ## Advanced Usage ### Custom Page Selection Modify `getAllDocUrls()` to filter pages: ```typescript title="scripts/export-pdf.ts" async function getAllDocUrls(): Promise { const pages = source.getPages(); return pages .filter((page) => page.url.startsWith("/docs/api")) // Only API docs .map((page) => page.url); } ``` ### Custom Viewport Change rendering viewport for different display sizes: ```typescript await page.setViewport({ width: 1920, // Wider viewport height: 1080, }); ``` ### Add Headers/Footers Puppeteer supports custom PDF headers and footers: ```typescript await page.pdf({ // ... other options displayHeaderFooter: true, headerTemplate: '
My Docs
', footerTemplate: '
Page
', }); ``` ## Troubleshooting ### PDFs are blank ### Increase Timeout ```typescript timeout: 60000; // 60 seconds ``` ### Check Server ```bash curl http://localhost:3000/docs ``` ### View Browser Set `headless: false` in launch options to see what's happening. ### Missing Content Ensure `NEXT_PUBLIC_PDF_EXPORT=true` is set during build: `bash NEXT_PUBLIC_PDF_EXPORT=true pnpm build ` ### Navigation Still Visible 1. Clear `.next` cache: `rm -rf .next` 2. Rebuild with PDF export mode enabled 3. Verify print styles in browser dev tools ### Timeout Errors * Reduce concurrency: `CONCURRENCY = 1` * Increase timeout values * Check server resources ## Best Practices 1. **Always use production build** for final exports 2. **Test with single pages** first before exporting all 3. **Monitor server resources** during large exports 4. **Review PDFs** before distribution ## Scripts Reference | Script | Description | | ------------------------------------ | ------------------------------------------ | | `pnpm export-pdf` | Export all pages (requires server running) | | `pnpm export-pdf:specific ` | Export specific pages | | `pnpm export-pdf:build` | Build and export (automated) | ## Tips * Export during off-peak hours for large sites * Use `--no-sandbox` flag if running in containers * Consider PDF file size when distributing * Test exports on different content types * Keep Puppeteer updated for best compatibility ## More Information * [Fumadocs PDF Export Guide](https://fumadocs.dev/docs/ui/export-pdf) * [Puppeteer PDF API](https://pptr.dev/api/puppeteer.pdfoptions) * [Scripts Documentation](/docs/guides/scripts) -------------------------------------------------------------------------------- END OF PAGE 35 -------------------------------------------------------------------------------- ================================================================================ PAGE 36 OF 57 ================================================================================ TITLE: Platform Integrations URL: https://ai-web-feeds.w4w.dev/docs/features/platform-integrations MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/platform-integrations.mdx DESCRIPTION: Native support for Reddit, Medium, YouTube, GitHub, and more PATH: /features/platform-integrations -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Platform Integrations (/docs/features/platform-integrations) # Platform Integrations AI Web Feeds provides native support for popular content platforms, automatically converting URLs to their RSS/Atom feed equivalents. ## Supported Platforms ### Reddit Convert subreddit and user URLs to RSS feeds. **URL Formats:** * Subreddit: `https://reddit.com/r/{subreddit}` * User: `https://reddit.com/u/{username}` **Configuration:** ```yaml - id: "machinelearning-subreddit" site: "https://www.reddit.com/r/MachineLearning" title: "r/MachineLearning" source_type: "reddit" topics: ["ml", "community"] platform_config: platform: "reddit" reddit: subreddit: "MachineLearning" sort: "hot" # hot, new, top, rising time: "day" # hour, day, week, month, year, all (for top) ``` **Auto-generated feed:** * `hot`: `https://www.reddit.com/r/MachineLearning/hot/.rss` * `top`: `https://www.reddit.com/r/MachineLearning/top/.rss?t=day` * `new`: `https://www.reddit.com/r/MachineLearning/new/.rss` ### Medium Convert Medium publications and user profiles to RSS feeds. **URL Formats:** * Publication: `https://medium.com/{publication}` * User: `https://medium.com/@{username}` * Tag: `https://medium.com/tag/{tag}` **Configuration:** ```yaml - id: "towards-data-science" site: "https://towardsdatascience.com" title: "Towards Data Science" source_type: "medium" topics: ["ml", "data-science"] platform_config: platform: "medium" medium: publication: "towards-data-science" ``` **Auto-generated feed:** * Publication: `https://medium.com/feed/towards-data-science` * User: `https://medium.com/feed/@username` * Tag: `https://medium.com/feed/tag/ai` ### YouTube Convert YouTube channels and playlists to RSS feeds. **URL Formats:** * Channel: `https://youtube.com/channel/{channel_id}` * User: `https://youtube.com/@{username}` * Playlist: `https://youtube.com/playlist?list={playlist_id}` **Configuration:** ```yaml - id: "two-minute-papers" site: "https://www.youtube.com/@TwoMinutePapers" title: "Two Minute Papers" source_type: "youtube" topics: ["research", "video"] platform_config: platform: "youtube" youtube: channel_id: "UCbfYPyITQ-7l4upoX8nvctg" ``` **Auto-generated feed:** * Channel: `https://www.youtube.com/feeds/videos.xml?channel_id=UCbfYPyITQ-7l4upoX8nvctg` * Playlist: `https://www.youtube.com/feeds/videos.xml?playlist_id=PLxxxxxx` ### GitHub Convert GitHub repositories to Atom feeds for releases, commits, and tags. **URL Format:** * Repository: `https://github.com/{owner}/{repo}` **Configuration:** ```yaml - id: "pytorch-releases" site: "https://github.com/pytorch/pytorch" title: "PyTorch Releases" source_type: "github" topics: ["frameworks", "ml"] platform_config: platform: "github" github: owner: "pytorch" repo: "pytorch" feed_type: "releases" # releases, commits, tags, activity branch: "main" # optional, for commits feed ``` **Auto-generated feeds:** * Releases: `https://github.com/pytorch/pytorch/releases.atom` * Commits: `https://github.com/pytorch/pytorch/commits.atom` * Tags: `https://github.com/pytorch/pytorch/tags.atom` * Activity: `https://github.com/pytorch/pytorch/activity.atom` ### Substack Convert Substack publications to RSS feeds. **URL Format:** * Publication: `https://{publication}.substack.com` **Configuration:** ```yaml - id: "import-ai" site: "https://importai.substack.com" title: "Import AI" source_type: "substack" topics: ["newsletters", "industry"] platform_config: platform: "substack" substack: publication: "importai" ``` **Auto-generated feed:** * `https://importai.substack.com/feed` ### Dev.to Convert Dev.to users, organizations, and tags to RSS feeds. **URL Formats:** * User: `https://dev.to/{username}` * Organization: `https://dev.to/{org}` * Tag: `https://dev.to/t/{tag}` **Configuration:** ```yaml - id: "devto-ml-tag" site: "https://dev.to/t/machinelearning" title: "Dev.to - ML Tag" source_type: "devto" topics: ["blogs", "tutorials"] platform_config: platform: "devto" devto: tag: "machinelearning" ``` **Auto-generated feeds:** * User: `https://dev.to/feed/username` * Tag: `https://dev.to/feed/tag/machinelearning` ### Hacker News Access Hacker News RSS feeds. **Configuration:** ```yaml - id: "hackernews-frontpage" site: "https://news.ycombinator.com" title: "Hacker News - Front Page" source_type: "hackernews" topics: ["tech", "news"] platform_config: platform: "hackernews" hackernews: feed_type: "frontpage" # frontpage, newest, best, ask, show, jobs ``` **Auto-generated feeds:** * Frontpage: `https://news.ycombinator.com/rss` * Newest: `https://news.ycombinator.com/newest.rss` * Best: `https://news.ycombinator.com/best.rss` * Ask HN: `https://news.ycombinator.com/ask.rss` * Show HN: `https://news.ycombinator.com/show.rss` ## How It Works ### Automatic Detection When you provide a `site` URL, the system: 1. **Detects the platform** from the URL domain 2. **Extracts identifiers** (subreddit, username, channel ID, etc.) 3. **Generates the feed URL** using platform-specific patterns 4. **Validates the feed** before saving ### Manual Configuration For more control, use `platform_config`: ```yaml - id: "custom-reddit" site: "https://www.reddit.com/r/MachineLearning" platform_config: platform: "reddit" reddit: subreddit: "MachineLearning" sort: "top" time: "week" ``` ### Enrichment Metadata Auto-generated feeds include metadata: ```yaml meta: platform: "reddit" # Platform name platform_generated: true # Feed URL was auto-generated format: "rss" # Detected feed format last_validated: "2025-10-15T12:00:00" ``` ## Complete Example Here's a complete feeds.yaml with platform integrations: ```yaml schema_version: "feeds-1.0.0" sources: # Reddit subreddit - id: "ml-subreddit" site: "https://www.reddit.com/r/MachineLearning" title: "r/MachineLearning" source_type: "reddit" topics: ["ml", "community"] platform_config: platform: "reddit" reddit: subreddit: "MachineLearning" sort: "hot" # Medium publication - id: "tds-medium" site: "https://towardsdatascience.com" title: "Towards Data Science" source_type: "medium" topics: ["ml", "data-science"] platform_config: platform: "medium" medium: publication: "towards-data-science" # YouTube channel - id: "yt-2min-papers" site: "https://www.youtube.com/@TwoMinutePapers" title: "Two Minute Papers" source_type: "youtube" topics: ["research", "video"] platform_config: platform: "youtube" youtube: channel_id: "UCbfYPyITQ-7l4upoX8nvctg" # GitHub releases - id: "pytorch-gh" site: "https://github.com/pytorch/pytorch" title: "PyTorch Releases" source_type: "github" topics: ["frameworks", "ml"] platform_config: platform: "github" github: owner: "pytorch" repo: "pytorch" feed_type: "releases" # Substack newsletter - id: "importai-newsletter" site: "https://importai.substack.com" title: "Import AI" source_type: "substack" topics: ["newsletters"] platform_config: platform: "substack" substack: publication: "importai" ``` ## CLI Usage Generate feeds with platform auto-detection: ```bash # Enrich feeds (auto-generates platform feed URLs) uv run aiwebfeeds enrich all # View the enriched YAML with generated feed URLs cat data/feeds.enriched.yaml # Generate OPML with platform feeds uv run aiwebfeeds opml all ``` ## Python API Use platform integrations programmatically: ```python from ai_web_feeds.utils import ( detect_platform, generate_platform_feed_url, enrich_feed_source, ) # Detect platform platform = detect_platform("https://www.reddit.com/r/MachineLearning") # Returns: "reddit" # Generate feed URL feed_url = generate_platform_feed_url( "https://www.reddit.com/r/MachineLearning", "reddit", {"reddit": {"subreddit": "MachineLearning", "sort": "hot"}} ) # Returns: "https://www.reddit.com/r/MachineLearning/hot/.rss" # Enrich with platform detection feed_data = { "id": "ml-reddit", "site": "https://www.reddit.com/r/MachineLearning", "platform_config": { "platform": "reddit", "reddit": {"subreddit": "MachineLearning"} } } enriched = await enrich_feed_source(feed_data) # enriched["feed"] will contain the auto-generated RSS URL ``` ## Benefits * **No manual feed URL lookup** - Just provide the platform URL * **Consistent formatting** - All feeds follow platform standards * **Validation** - Auto-generated URLs are validated before saving * **Metadata tracking** - Know which feeds were auto-generated * **Easy maintenance** - Update platform configs, not URLs ## Limitations * **Platform changes** - If platforms change their feed URL patterns, updates needed * **Rate limiting** - Some platforms may rate-limit feed access * **Authentication** - Private/authenticated feeds not supported * **Custom domains** - Some platforms use custom domains that may not auto-detect ## Next Steps * [Feed Enrichment](/docs/development/cli#enrich---enrich-feed-data) - Learn about the enrichment process * [OPML Generation](/docs/development/cli#opml---generate-opml-files) - Generate feed reader imports * [Python API](/docs/development/python-api) - Programmatic platform integration -------------------------------------------------------------------------------- END OF PAGE 36 -------------------------------------------------------------------------------- ================================================================================ PAGE 37 OF 57 ================================================================================ TITLE: Quality Scoring URL: https://ai-web-feeds.w4w.dev/docs/features/quality-scoring MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/quality-scoring.mdx DESCRIPTION: Heuristic-based article quality assessment for AI Web Feeds PATH: /features/quality-scoring -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Quality Scoring (/docs/features/quality-scoring) # Quality Scoring Quality Scoring analyzes articles using heuristic metrics to compute quality scores ranging from 0-100. This helps surface high-quality content and filter low-quality articles. ## Overview The quality scorer evaluates articles across multiple dimensions: * **Depth**: Word count, paragraph structure, technical content (code blocks, diagrams) * **References**: External links, academic citations, reputable domains * **Author Authority**: Author credentials and expertise (planned) * **Domain Reputation**: Feed source quality and reliability * **Engagement**: Read time estimates and user signals (planned) ## Architecture ## Scoring Components ### Depth Score (0-100) Evaluates content depth based on: * **Word Count**: Higher scores for longer articles (500+ words) * **Structure**: Rewards well-organized content with multiple paragraphs * **Technical Content**: Bonus points for code blocks (\`\`\`) and images * **Headings**: Recognition of structured content with markdown headings **Example**: ```python # Article with 1500 words, 5 paragraphs, code blocks → Depth Score: 85 ``` ### Reference Score (0-100) Assesses external citations: * **External Links**: Minimum 3 links recommended * **Academic Citations**: DOI, arXiv references weighted highly * **Reputable Domains**: .edu, .org domains receive bonus points **Example**: ```python # Article with 5 links, 2 from arxiv.org → Reference Score: 75 ``` ### Domain Score (0-100) Based on feed reputation: * **High-Quality Feeds**: arXiv, Nature, Science, ACM journals → 90 * **Standard Feeds**: General tech blogs → 60 * **Unknown Feeds**: Default score → 50 ### Overall Score Weighted combination of component scores: ```python overall_score = ( depth_score * 0.25 + reference_score * 0.20 + author_score * 0.15 + domain_score * 0.25 + engagement_score * 0.15 ) ``` ## Usage ### CLI Commands #### Process Quality Scoring Run quality scoring manually on unprocessed articles: ```bash aiwebfeeds nlp quality ``` **Options**: * `--batch-size`: Number of articles to process (default: 100) * `--force`: Reprocess all articles, ignoring existing scores ```bash # Process 50 articles aiwebfeeds nlp quality --batch-size 50 # Reprocess all articles aiwebfeeds nlp quality --force ``` #### View Statistics ```bash aiwebfeeds nlp stats ``` Shows processing status for all NLP operations including quality scoring. ### Python API ```python from ai_web_feeds.nlp import QualityScorer from ai_web_feeds.config import Settings scorer = QualityScorer(Settings()) article = { "id": 1, "title": "Attention Is All You Need", "content": "The Transformer architecture...", # Long article "feed_id": "arxiv-nlp" } scores = scorer.score_article(article) # Returns: { # "overall_score": 85, # "depth_score": 90, # "reference_score": 75, # "author_score": 50, # "domain_score": 90, # "engagement_score": 60 # } ``` ### Batch Processing Quality scoring runs automatically every 30 minutes via APScheduler: ```python from ai_web_feeds.nlp.scheduler import NLPScheduler from apscheduler.schedulers.asyncio import AsyncIOScheduler scheduler = AsyncIOScheduler() nlp_scheduler = NLPScheduler(scheduler) nlp_scheduler.register_jobs() scheduler.start() ``` ## Database Schema ### article\_quality\_scores Table ```sql CREATE TABLE article_quality_scores ( article_id INTEGER PRIMARY KEY, overall_score INTEGER NOT NULL CHECK(overall_score BETWEEN 0 AND 100), depth_score INTEGER CHECK(depth_score BETWEEN 0 AND 100), reference_score INTEGER CHECK(reference_score BETWEEN 0 AND 100), author_score INTEGER CHECK(author_score BETWEEN 0 AND 100), domain_score INTEGER CHECK(domain_score BETWEEN 0 AND 100), engagement_score INTEGER CHECK(engagement_score BETWEEN 0 AND 100), computed_at DATETIME DEFAULT CURRENT_TIMESTAMP, FOREIGN KEY (article_id) REFERENCES feed_entries(id) ON DELETE CASCADE ); ``` ### Processed Flags Feed entries track processing status: ```sql ALTER TABLE feed_entries ADD COLUMN quality_processed BOOLEAN DEFAULT FALSE; ALTER TABLE feed_entries ADD COLUMN quality_processed_at DATETIME; ``` ## Configuration Configure quality scoring in `config.py` or via environment variables: ```python class Phase5Settings(BaseSettings): quality_batch_size: int = 100 # Articles per batch quality_cron: str = "*/30 * * * *" # Every 30 minutes quality_min_words: int = 100 # Minimum words to score ``` **Environment Variables**: ```bash PHASE5_QUALITY_BATCH_SIZE=100 PHASE5_QUALITY_MIN_WORDS=100 ``` ## Performance * **Throughput**: \~100 articles/minute * **Memory**: \<50MB for batch of 100 articles * **Storage**: \~100 bytes per article score ## Future Enhancements Planned improvements for quality scoring: 1. **Author Authority**: H-index, publication history, expert verification 2. **Engagement Metrics**: Read time tracking, shares, comments 3. **Machine Learning**: Train models on user feedback to refine scoring 4. **Domain Reputation**: Crowdsourced feed quality ratings ## Troubleshooting ### No Articles Being Scored **Symptom**: `aiwebfeeds nlp stats` shows 0 quality processed. **Solution**: ```bash # Check if articles exist aiwebfeeds feeds list # Manually trigger scoring aiwebfeeds nlp quality --batch-size 10 ``` ### Low Scores for Good Articles **Symptom**: High-quality articles receiving low scores. **Cause**: Missing metadata (author, feed reputation not configured). **Solution**: Update domain scoring logic in `quality_scorer.py` to recognize your feeds. ## See Also * [Entity Extraction](/docs/features/entity-extraction) - Extract named entities from articles * [Sentiment Analysis](/docs/features/sentiment-analysis) - Classify article sentiment * [Topic Modeling](/docs/features/topic-modeling) - Discover subtopics automatically -------------------------------------------------------------------------------- END OF PAGE 37 -------------------------------------------------------------------------------- ================================================================================ PAGE 38 OF 57 ================================================================================ TITLE: Real-Time Feed Monitoring URL: https://ai-web-feeds.w4w.dev/docs/features/real-time-monitoring MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/real-time-monitoring.mdx DESCRIPTION: Get instant notifications for new articles, trending topics, and email digests with WebSocket-powered real-time updates PATH: /features/real-time-monitoring -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Real-Time Feed Monitoring (/docs/features/real-time-monitoring) # Real-Time Feed Monitoring & Alerts **Phase 3B Implementation** - Get instant notifications for new articles, trending topics, and customizable email digests. ## Overview The real-time monitoring system provides: * **Live Notifications**: WebSocket-powered instant alerts for new articles * **Trending Detection**: Z-score analysis for identifying hot topics * **Email Digests**: Customizable daily/weekly digest subscriptions * **Feed Follows**: Subscribe to specific feeds for targeted notifications * **Smart Bundling**: Automatic notification grouping to prevent spam ## Architecture ### Components 1. **Feed Poller** (`polling.py`): * Periodic feed fetching with retry logic * Article deduplication via GUID * Response time tracking 2. **Notification Manager** (`notifications.py`): * Notification creation and bundling * WebSocket broadcasting * User preference filtering 3. **Trending Detector** (`trending.py`): * Z-score statistical analysis * Baseline calculation (mean/std dev) * Representative article selection 4. **Digest Manager** (`digests.py`): * HTML email generation * Cron-based scheduling * SMTP delivery 5. **WebSocket Server** (`websocket_server.py`): * Socket.IO real-time server * User authentication and rooms * Event broadcasting 6. **Scheduler** (`scheduler.py`): * APScheduler background jobs * 4 periodic tasks (polling, trending, digests, cleanup) ## Getting Started ### 1. Start Monitoring Server ```bash # Start backend monitoring (WebSocket + scheduler) uv run aiwebfeeds monitor start # Output: # ✓ Background scheduler started # ✓ WebSocket server started on port 8000 # # Scheduled Jobs: # poll_feeds | Every 15 min | Poll all active feeds # detect_trending | Every 1 hour | Z-score trend detection # send_digests | Every minute | Check for due email digests # cleanup_notifications | Daily 3:00 AM | Delete old notifications ``` ### 2. Follow Feeds ```bash # Get your user ID from browser localStorage # (automatically generated on first visit) # Follow a feed to receive notifications uv run aiwebfeeds monitor follow # Example: uv run aiwebfeeds monitor follow a1b2c3d4-... ai-news # List your follows uv run aiwebfeeds monitor list-follows # Unfollow uv run aiwebfeeds monitor unfollow ``` ### 3. Frontend Integration ```tsx import { useState } from "react"; import { NotificationBell, NotificationCenter, FollowButton, TrendingTopics } from "@/components/notifications"; export default function Page() { const [showNotifications, setShowNotifications] = useState(false); return (
{/* Header with notification bell */}

AI Web Feeds

setShowNotifications(true)} />
{/* Notification panel */} setShowNotifications(false)} /> {/* Feed page */}

AI News Feed

{/* Follow button */} {/* Articles... */}
{/* Sidebar */}
); } ``` ## Features ### Real-Time Notifications Instant WebSocket alerts for: * **New Articles**: Individual notifications for each new article (below bundle threshold) * **Bundled Updates**: Single notification for multiple articles (>3 in 5 minutes) * **Trending Topics**: Alerts when topics exceed Z-score threshold (>2.0) * **System Alerts**: Important system messages **Notification Types**: ```typescript type NotificationType = | "new_article" // Single new article | "trending_topic" // Hot topic alert | "feed_updated" // Multiple articles (bundled) | "system_alert"; // System message ``` ### Notification Bundling Prevents notification spam with smart bundling: ``` IF articles_count >= threshold (default: 3) AND within_window (default: 5 minutes) THEN send_bundled_notification() ELSE send_individual_notifications() ``` **Configuration** (`.env`): ```bash AIWF_NOTIFICATION_BUNDLE_THRESHOLD=3 AIWF_NOTIFICATION_BUNDLE_WINDOW_SECONDS=300 ``` ### Trending Detection Z-score statistical analysis: **Algorithm**: 1. **Baseline Calculation**: Mean & StdDev of article counts over N days (default: 3) 2. **Current Period**: Article counts in last 1 hour 3. **Z-Score**: `(current - baseline_mean) / baseline_std` 4. **Threshold**: Alert if Z-score > 2.0 AND articles > 5 **Formula**: $$ Z = \frac{X - \mu}{\sigma} $$ Where: * $X$ = Current article count * $\mu$ = Baseline mean * $\sigma$ = Baseline standard deviation **Configuration**: ```bash AIWF_TRENDING_BASELINE_DAYS=3 AIWF_TRENDING_Z_SCORE_THRESHOLD=2.0 AIWF_TRENDING_MIN_ARTICLES=5 AIWF_TRENDING_UPDATE_INTERVAL_HOURS=1 ``` ### Email Digests Customizable email summaries: **Schedule Types**: * **Daily**: Every day at 9:00 AM * **Weekly**: Every Monday at 9:00 AM * **Custom**: Cron expression (e.g., `0 9 * * *`) **Configuration**: ```bash # SMTP Settings AIWF_SMTP_HOST=localhost AIWF_SMTP_PORT=25 AIWF_SMTP_USER= AIWF_SMTP_PASSWORD= AIWF_SMTP_FROM=noreply@aiwebfeeds.com # Digest Settings AIWF_DIGEST_MAX_ARTICLES=20 ``` **API Usage**: ```typescript // Subscribe to daily digest await fetch("/api/digests", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ user_id: "user-uuid", email: "user@example.com", schedule_type: "daily", schedule_cron: "0 9 * * *", timezone: "America/New_York", }), }); ``` ### Feed Follows User-feed relationships for notification targeting: ```typescript // Follow a feed await fetch("/api/follows", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ user_id: "user-uuid", feed_id: "ai-news", }), }); // Get followed feeds const response = await fetch(`/api/follows?user_id=${userId}`); const { follows } = await response.json(); ``` ## API Reference ### REST Endpoints #### Notifications **GET /api/notifications** List user notifications. ```typescript // Query params ?user_id=&unread_only=true&limit=50 // Response { "user_id": "...", "notifications": [...], "count": 10 } ``` **PATCH /api/notifications/:id** Mark notification as read or dismissed. ```typescript // Body { "action": "mark_read" | "dismiss" } ``` #### Follows **GET /api/follows** List followed feeds. ```typescript ?user_id= ``` **POST /api/follows** Follow a feed. ```typescript { "user_id": "...", "feed_id": "..." } ``` **DELETE /api/follows** Unfollow a feed. ```typescript ?user_id=&feed_id= ``` #### Trending **GET /api/trending** Get current trending topics. ```typescript ?limit=10 // Response { "trending": [ { "topic_id": "artificial-intelligence", "z_score": 2.5, "article_count": 50, "rank": 1 } ] } ``` #### Preferences **GET /api/preferences** Get notification preferences. **POST /api/preferences** Set notification preferences. ```typescript { "user_id": "...", "feed_id": "..." | null, // null for global "delivery_method": "websocket" | "email" | "in_app", "frequency": "instant" | "hourly" | "daily" | "weekly" | "off", "quiet_hours_start": "22:00", "quiet_hours_end": "08:00" } ``` #### Digests **GET /api/digests** Get digest subscriptions. **POST /api/digests** Create digest subscription. **DELETE /api/digests** Unsubscribe from digests. ### WebSocket Protocol **Connection**: ```typescript import { io } from "socket.io-client"; const socket = io("http://localhost:8000"); // Authenticate socket.emit("authenticate", { user_id: "user-uuid" }); ``` **Events**: ```typescript // Incoming socket.on("notification", (data: Notification) => { console.log("New notification:", data); }); socket.on("trending_alert", (data: TrendingAlert) => { console.log("Trending:", data.topic_id); }); socket.on("notifications_history", (data: { notifications: Notification[] }) => { console.log("History:", data.notifications); }); // Outgoing socket.emit("mark_read", { notification_id: 123 }); socket.emit("dismiss", { notification_id: 123 }); ``` ## Configuration ### Environment Variables ```bash # WebSocket Server AIWF_WEBSOCKET_PORT=8000 AIWF_WEBSOCKET_CORS_ORIGINS=http://localhost:3000,https://aiwebfeeds.com NEXT_PUBLIC_WEBSOCKET_URL=http://localhost:8000 # Feed Polling AIWF_FEED_POLL_INTERVAL_MIN=15 AIWF_FEED_POLL_TIMEOUT=30 AIWF_FEED_POLL_MAX_CONCURRENT=10 # Notifications AIWF_NOTIFICATION_RETENTION_DAYS=7 AIWF_NOTIFICATION_BUNDLE_THRESHOLD=3 AIWF_NOTIFICATION_BUNDLE_WINDOW_SECONDS=300 # Trending Detection AIWF_TRENDING_BASELINE_DAYS=3 AIWF_TRENDING_Z_SCORE_THRESHOLD=2.0 AIWF_TRENDING_MIN_ARTICLES=5 AIWF_TRENDING_UPDATE_INTERVAL_HOURS=1 # Email Digests AIWF_SMTP_HOST=localhost AIWF_SMTP_PORT=25 AIWF_SMTP_USER= AIWF_SMTP_PASSWORD= AIWF_SMTP_FROM=noreply@aiwebfeeds.com AIWF_DIGEST_MAX_ARTICLES=20 ``` ### Database Schema **7 New Tables**: 1. `feed_entries` - Article metadata from polling 2. `feed_poll_jobs` - Polling job tracking 3. `notifications` - User notifications 4. `user_feed_follows` - Feed follow relationships 5. `trending_topics` - Trending topic calculations 6. `notification_preferences` - User preferences 7. `email_digests` - Digest subscriptions See [Database Architecture](/docs/development/database) for full schema. ## Components ### NotificationBell Bell icon with unread count badge. ```tsx setShowCenter(true)} className="..." /> ``` **Props**: * `onOpenCenter`: Callback when bell is clicked * `className`: Additional CSS classes ### NotificationCenter Slide-in notification panel. ```tsx setIsOpen(false)} className="..." /> ``` **Features**: * All/Unread filter tabs * Mark read, dismiss, view actions * Time-ago relative timestamps * Type-specific icons and colors ### FollowButton Toggle feed follow status. ```tsx console.log(following)} /> ``` **Variants**: * `default`: Full button with icons and text * `compact`: Small button for inline use ### TrendingTopics Display top trending topics. ```tsx ``` ## CLI Commands ### Monitor **Start server**: ```bash uv run aiwebfeeds monitor start [--port 8000] ``` **Check status**: ```bash uv run aiwebfeeds monitor status ``` **Stop server**: ```bash # Use Ctrl+C to stop ``` ### Follows **Follow a feed**: ```bash uv run aiwebfeeds monitor follow ``` **Unfollow a feed**: ```bash uv run aiwebfeeds monitor unfollow ``` **List follows**: ```bash uv run aiwebfeeds monitor list-follows ``` ## Testing Run the test suite: ```bash cd tests uv run pytest tests/packages/test_polling.py -v uv run pytest tests/packages/test_notifications.py -v uv run pytest tests/packages/test_scheduler.py -v ``` **Coverage**: ```bash uv run pytest --cov=ai_web_feeds --cov-report=html ``` ## Troubleshooting ### WebSocket Connection Issues **Problem**: Frontend can't connect to WebSocket server. **Solution**: 1. Check server is running: `uv run aiwebfeeds monitor status` 2. Verify CORS origins in `.env`: `AIWF_WEBSOCKET_CORS_ORIGINS` 3. Check browser console for connection errors 4. Ensure `NEXT_PUBLIC_WEBSOCKET_URL` matches server URL ### No Notifications Received **Problem**: Following feeds but not getting notifications. **Solution**: 1. Verify feed is being polled: Check scheduler status 2. Confirm follow relationship: `aiwebfeeds monitor list-follows ` 3. Check notification creation: Query database for recent notifications 4. Verify WebSocket authentication: Browser should emit `authenticate` event ### Trending Topics Not Updating **Problem**: Trending topics list is stale. **Solution**: 1. Check scheduler: `aiwebfeeds monitor status` 2. Verify trending job next run time 3. Ensure sufficient baseline data (>= 3 days of articles) 4. Check Z-score threshold configuration ### Email Digests Not Sending **Problem**: Digest subscriptions created but emails not delivered. **Solution**: 1. Verify SMTP configuration in `.env` 2. Check digest schedule (cron expression) 3. Test SMTP connection manually 4. Check scheduler logs for digest job errors ## Performance ### Optimization Tips 1. **Feed Polling**: * Adjust interval based on feed update frequency * Use `AIWF_FEED_POLL_MAX_CONCURRENT` to limit concurrent requests * Monitor response times in `feed_poll_jobs` table 2. **WebSocket**: * Enable connection pooling for high traffic * Use Redis adapter for multi-server deployments * Monitor active connections 3. **Database**: * Enable WAL mode for SQLite (default) * Add indexes on frequently queried columns * Cleanup old notifications regularly 4. **Trending Detection**: * Cache baseline calculations * Reduce baseline period for faster computation * Limit to top N topics ## Security ### User Identity * Anonymous identification via localStorage UUID * No authentication required (Phase 3B MVP) * Migration path to user accounts planned (Phase 3A) ### WebSocket * CORS validation on connection * User ID authentication required * Room-based message targeting ### API * Rate limiting (planned) * Input validation on all endpoints * SQL injection prevention via SQLModel ## Roadmap ### Phase 3A: User Accounts * Email/password authentication * User profile management * Account migration for existing localStorage users ### Phase 3C: Community Curation * Feed ratings and reviews * User-submitted feeds * Collaborative filtering ### Phase 3D: Advanced AI * Content summarization * Sentiment analysis * Smart article recommendations ## Resources * [Specification](/docs/development/real-time-monitoring-spec) * [Database Schema](/docs/development/database) * [API Reference](/docs/reference/api) * [CLI Reference](/docs/reference/cli) *** *Implemented: October 2025 · Version: Phase 3B* -------------------------------------------------------------------------------- END OF PAGE 38 -------------------------------------------------------------------------------- ================================================================================ PAGE 39 OF 57 ================================================================================ TITLE: AI-Powered Recommendations URL: https://ai-web-feeds.w4w.dev/docs/features/recommendations MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/recommendations.mdx DESCRIPTION: Personalized feed suggestions with content-based filtering, cold start onboarding, and user feedback PATH: /features/recommendations -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # AI-Powered Recommendations (/docs/features/recommendations) # AI-Powered Recommendations > **Status**: ✅ Fully Implemented > **Phase**: Phase 1 (MVP) > **Completion**: 100% AI-Powered Recommendations provide personalized feed suggestions based on user interests, content similarity, and popularity. ## Features ### Personalized Suggestions Navigate to `/recommendations` to see 10-20 personalized feed suggestions with infinite scroll. ### Cold Start Onboarding New users take a 3-5 topic selection quiz: * **Question**: "What AI/ML areas interest you?" * **Options**: Select from topic taxonomy (LLM, Computer Vision, Reinforcement Learning, etc.) * **Result**: Immediate recommendations based on selected topics ### Recommendation Algorithm (Phase 1) **70-20-10 Split**: * **70% Content-Based**: Topic overlap + embedding similarity with user interests * **20% Popularity-Based**: Most followed/verified feeds * **10% Serendipity**: Random high-quality feeds for discovery **Configurable** via environment variables: ```bash AIWF_RECOMMENDATION__CONTENT_WEIGHT=0.7 AIWF_RECOMMENDATION__POPULARITY_WEIGHT=0.2 AIWF_RECOMMENDATION__SERENDIPITY_WEIGHT=0.1 ``` ### Recommendation Explanations Each recommendation includes context: * "Because you follow **X**" (clickable link) * "Popular in **Y**" (topic badge) * "Similar to **Z**" (feed comparison) ### User Feedback **Interactions**: * **Like** (👍): Boost topic weight +0.2 * **Dismiss** (✖): Reduce feed weight -0.5 * **Block Topic** (🚫): Exclude topic entirely **Effect**: Recommendations update based on feedback to improve relevance. ### Diversity Constraints (Flexible) * **Max 3 feeds per topic** (best effort) * **Min 2 topics represented** (unless user interests are highly focused) * **Suggestion**: "Explore similar topics" if recommendations too narrow ### Periodic Refresh * **Weekly**: Embedding refresh for new feeds * **Nightly**: Topic popularity recalculation * **Phase 2**: Collaborative matrix update when user accounts exist ### Trending Feeds Boost Feeds with sudden validation frequency spike (3× avg validations in 7 days) get +0.1 relevance boost. ## Configuration ```bash # Recommendation algorithm weights AIWF_RECOMMENDATION__CONTENT_WEIGHT=0.7 # Topic + embedding similarity AIWF_RECOMMENDATION__POPULARITY_WEIGHT=0.2 # Verified + follower count AIWF_RECOMMENDATION__SERENDIPITY_WEIGHT=0.1 # Random high-quality # Embedding settings (for content-based filtering) AIWF_EMBEDDING__PROVIDER=local AIWF_EMBEDDING__HF_API_TOKEN= AIWF_EMBEDDING__LOCAL_MODEL=sentence-transformers/all-MiniLM-L6-v2 ``` ## Usage ### Web Interface Navigate to `/recommendations`: 1. **First Visit**: Complete topic quiz (3-5 selections) 2. **Browse**: Scroll through personalized suggestions 3. **Interact**: Like, dismiss, or block topics 4. **Refresh**: Page auto-updates based on feedback **Privacy**: Opt-out of personalization → Recommendations fall back to popular feeds only. ### CLI ```bash # Generate recommendations for a user profile uv run aiwebfeeds recommendations generate --user-id abc123 --count 20 # Update user profile based on interactions uv run aiwebfeeds recommendations feedback --user-id abc123 --feed-id xyz789 --action like # Cold start recommendations from topics uv run aiwebfeeds recommendations coldstart --topics llm,agents,training --count 10 ``` ### API ```typescript // Get personalized recommendations const response = await fetch("/api/recommendations?user_id=abc123&count=10"); const recommendations = await response.json(); // Submit feedback await fetch("/api/recommendations/feedback", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ user_id: "abc123", feed_id: "xyz789", interaction_type: "like", }), }); // Cold start quiz const coldStartResponse = await fetch("/api/recommendations/quiz", { method: "POST", body: JSON.stringify({ topics: ["llm", "agents", "training"] }), }); ``` ## Performance * **Generation Time**: \<1 second (precomputed matrices, NFR-005) * **Loading States**: Spinner + skeleton UI during generation (NFR-020) * **Scalability**: Supports 10,000+ users with O(log n) lookup (NFR-009) ## Phase 2 Enhancements **Collaborative Filtering** (deferred until user accounts exist): * User-user similarity matrix * Item-item co-occurrence matrix * Hybrid content + collaborative model * Real-time personalization **Current Phase 1**: Content-based only (topic similarity + popularity). ## Success Criteria * ✅ Recommendation generation completes within 1 second for 95% of requests * ✅ Cold start users receive recommendations with ≥50% topic match rate * ✅ Recommendation click-through rate ≥15% * ✅ Users who interact follow 2× more feeds than non-users * ✅ Precision\@10 ≥60% (6+ relevant feeds in top 10) * ✅ 40% of new follows come from recommendations within 3 months ## Related * [Analytics Dashboard](./analytics) - View recommendation performance metrics * [Search & Discovery](./search) - Find specific feeds by query * [Data Model](/docs/development/data-model#recommendationinteraction) - RecommendationInteraction and UserProfile schemas -------------------------------------------------------------------------------- END OF PAGE 39 -------------------------------------------------------------------------------- ================================================================================ PAGE 40 OF 57 ================================================================================ TITLE: RSS Feeds URL: https://ai-web-feeds.w4w.dev/docs/features/rss-feeds MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/rss-feeds.mdx DESCRIPTION: Subscribe to documentation updates via RSS, Atom, or JSON feeds PATH: /features/rss-feeds -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # RSS Feeds (/docs/features/rss-feeds) import { Callout } from "fumadocs-ui/components/callout"; import { Tab, Tabs } from "fumadocs-ui/components/tabs"; Subscribe to AI Web Feeds documentation updates using RSS, Atom, or JSON feeds. ## Available Feeds }> **Sitewide Feed** All content from the entire site }> **Documentation Feed** Only documentation pages }> **Multiple Formats** RSS 2.0, Atom 1.0, and JSON Feed }> **Auto-Updated** Refreshed hourly with latest content ## Feed URLs ### Sitewide Feeds Subscribe to all content: `https://yourdomain.com/rss.xml` Standard RSS 2.0 format, compatible with most feed readers. `https://yourdomain.com/atom.xml` Atom 1.0 format with extended metadata support. `https://yourdomain.com/feed.json` Modern JSON-based feed format. ### Documentation Feeds Subscribe to documentation updates only: `https://yourdomain.com/docs/rss.xml` `https://yourdomain.com/docs/atom.xml` `https://yourdomain.com/docs/feed.json` Feeds are automatically discoverable via `` tags in the HTML head for compatible feed readers. ## Feed Readers ### Popular RSS Readers Choose your preferred feed reader: * **[Feedly](https://feedly.com)** - Web-based, mobile apps * **[Inoreader](https://www.inoreader.com)** - Advanced features, filtering * **[NetNewsWire](https://netnewswire.com)** - Native Mac/iOS app * **[Reeder](https://reederapp.com)** - Beautiful Mac/iOS app * **[The Old Reader](https://theoldreader.com)** - Classic Google Reader style ### Command Line Use `curl` to fetch feeds: ```bash # RSS 2.0 curl https://yourdomain.com/rss.xml # Atom 1.0 curl https://yourdomain.com/atom.xml # JSON Feed curl https://yourdomain.com/feed.json | jq ``` ## Feed Content ### What's Included Each feed item contains: | Field | Description | | --------------- | ----------------------------------------- | | **Title** | Page title | | **Description** | Page description or excerpt | | **Link** | Full URL to the page | | **Date** | Last modified date | | **Category** | Content category (Features, Guides, etc.) | | **Author** | AI Web Feeds Team | ### Categories Content is categorized automatically: * **Features** - Feature documentation * **Guides** - How-to guides and tutorials * **Documentation** - General documentation pages ## How It Works ### Feed Generation Feeds are generated using the [feed](https://www.npmjs.com/package/feed) package: ```typescript title="lib/rss.ts" import { Feed } from "feed"; import { source } from "@/lib/source"; export function getDocsRSS() { const feed = new Feed({ title: "AI Web Feeds - Documentation", id: `${baseUrl}/docs`, link: `${baseUrl}/docs`, language: "en", description: "Documentation updates...", }); for (const page of source.getPages()) { feed.addItem({ id: `${baseUrl}${page.url}`, title: page.data.title, description: page.data.description, link: `${baseUrl}${page.url}`, date: new Date(page.data.lastModified), }); } return feed; } ``` ### Route Handlers Next.js route handlers serve the feeds: ```typescript title="app/docs/rss.xml/route.ts" import { getDocsRSS } from "@/lib/rss"; export const revalidate = 3600; // Revalidate every hour export function GET() { const feed = getDocsRSS(); return new Response(feed.rss2(), { headers: { "Content-Type": "application/rss+xml; charset=utf-8", "Cache-Control": "public, max-age=3600, s-maxage=86400", }, }); } ``` ### Metadata Discovery Feeds are discoverable via metadata: ```typescript title="app/layout.tsx" export const metadata: Metadata = { alternates: { types: { "application/rss+xml": [ { title: "AI Web Feeds - Documentation", url: "/docs/rss.xml", }, ], }, }, }; ``` ## Caching Strategy Feeds are cached for performance: | Cache Layer | Duration | Purpose | | ---------------- | -------- | ------------------- | | **Browser** | 1 hour | Client-side caching | | **CDN** | 24 hours | Edge caching | | **Revalidation** | 1 hour | Server regeneration | Feeds are revalidated every hour to ensure fresh content while maintaining performance. ## Testing ### Test Feed URLs Visit the feed URLs directly in your browser: * [http://localhost:3000/rss.xml](http://localhost:3000/rss.xml) * [http://localhost:3000/docs/rss.xml](http://localhost:3000/docs/rss.xml) * [http://localhost:3000/feed.json](http://localhost:3000/feed.json) ```bash # Test RSS feed curl http://localhost:3000/rss.xml | head -50 # Test Atom feed curl http://localhost:3000/atom.xml | head -50 # Test JSON feed curl http://localhost:3000/feed.json | jq # Check headers curl -I http://localhost:3000/rss.xml ``` Use the [W3C Feed Validator](https://validator.w3.org/feed/): 1. Visit [https://validator.w3.org/feed/](https://validator.w3.org/feed/) 2. Enter your feed URL 3. Click "Check" 4. Review validation results ### Verify Feed Discovery Check that feeds are discoverable: ```bash # View HTML head curl http://localhost:3000 | grep -i "alternate" # Expected output: # ``` ### Test Feed Reader 1. Open your feed reader 2. Click "Add Feed" or "Subscribe" 3. Enter feed URL: `http://localhost:3000/rss.xml` 4. Verify items appear correctly ## Customization ### Update Base URL Set your production URL: ```bash title=".env.local" NEXT_PUBLIC_BASE_URL=https://yourdomain.com ``` ### Modify Feed Metadata Edit `lib/rss.ts`: ```typescript const feed = new Feed({ title: "Your Custom Title", description: "Your custom description", copyright: "All rights reserved 2025, Your Name", // Add more fields... }); ``` ### Add Custom Fields Extend feed items with custom data: ```typescript feed.addItem({ id: `${baseUrl}${page.url}`, title: page.data.title, description: page.data.description, link: `${baseUrl}${page.url}`, date: new Date(page.data.lastModified), // Custom fields image: page.data.image, content: await getPageContent(page), // More custom fields... }); ``` ### Filter Content Control which pages appear in feeds: ```typescript const pages = source .getPages() .filter((page) => !page.data.draft) // Exclude drafts .filter((page) => page.url.startsWith("/docs")); // Only docs ``` ## Best Practices ### 1. Set Last Modified Dates Add `lastModified` to frontmatter: ```yaml --- title: My Page description: Description lastModified: 2025-10-14 --- ``` ### 2. Write Good Descriptions Provide clear, concise descriptions: ```yaml --- title: RSS Feeds description: Subscribe to documentation updates via RSS, Atom, or JSON feeds --- ``` ### 3. Use Proper Categories Organize content with meaningful categories: ```typescript category: page.url.includes("/api/") ? [{ name: "API Reference" }] : [{ name: "Guides" }]; ``` ### 4. Cache Appropriately Balance freshness with performance: ```typescript export const revalidate = 3600; // 1 hour ``` ## Troubleshooting ### Feed Not Updating Clear the Next.js cache: `bash rm -rf .next/ pnpm dev ` ### Invalid XML * Ensure special characters are escaped * Validate with W3C Feed Validator * Check for proper UTF-8 encoding ### Missing Items * Verify `source.getPages()` returns all pages * Check filter conditions * Ensure frontmatter is complete ### Slow Generation * Reduce number of items * Implement pagination * Increase revalidation time ## Future Enhancements Potential additions: * **Blog feed** - Separate feed for blog posts * **Category feeds** - Individual feeds per category * **Per-author feeds** - Filter by author * **Full content** - Include complete page content * **Media enclosures** - Attach images/files * **Podcasting support** - iTunes RSS extensions ## Related Documentation * [AI Integration](/docs/features/ai-integration) - AI/LLM endpoints * [Quick Reference](/docs/guides/quick-reference) - Commands and endpoints * [Testing Guide](/docs/guides/testing) - Verify your setup ## External Resources * [RSS 2.0 Specification](https://www.rssboard.org/rss-specification) * [Atom 1.0 Specification](https://tools.ietf.org/html/rfc4287) * [JSON Feed Specification](https://jsonfeed.org/version/1.1) * [Feed Package Documentation](https://www.npmjs.com/package/feed) * [Fumadocs RSS Guide](https://fumadocs.dev/docs/ui/rss) -------------------------------------------------------------------------------- END OF PAGE 40 -------------------------------------------------------------------------------- ================================================================================ PAGE 41 OF 57 ================================================================================ TITLE: Search & Discovery URL: https://ai-web-feeds.w4w.dev/docs/features/search MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/search.mdx DESCRIPTION: Intelligent feed search with autocomplete, faceted filtering, and semantic similarity PATH: /features/search -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Search & Discovery (/docs/features/search) # Search & Discovery > **Status**: ✅ Fully Implemented > **Phase**: Phase 1 (MVP) > **Completion**: 100% The Search & Discovery feature enables users to find feeds through full-text search, autocomplete suggestions, faceted filtering, and semantic similarity. ## Features ### Unified Search Interface Single search bar at `/search` with real-time autocomplete (\<200ms response time). ### Full-Text Search Powered by SQLite FTS5 with Porter stemming: * **Search Across**: Feed titles, descriptions, recent article titles (if cached) * **Ranking**: TF-IDF scoring with boost factors: * Verified feeds: +20% * Active feeds: +10% * Popular feeds: +5% * **Highlighting**: Search terms bolded in result snippets ### Autocomplete Suggestions Within 200ms, get: * **Top 5 matching feeds** * **Top 3 matching topics** * **Top 3 recent searches** (user-specific, localStorage) Powered by pre-built Trie index (in-memory, \<10ms response). ### Faceted Filtering Filter results by multiple criteria (AND logic): * **Source Type**: blog, podcast, newsletter, video, social, other * **Topics**: Multi-select from topic taxonomy * **Verified Status**: Toggle verified-only filter * **Activity Status**: Active/inactive toggle **Result Count Badges**: "Blogs (45)", "Verified (23)" displayed next to each filter option. ### Semantic Search Toggle "Include similar results" to enable vector similarity search: * **Embeddings**: Sentence-BERT (384-dim all-MiniLM-L6-v2 model) * **Similarity Threshold**: ≥0.7 cosine similarity * **Configurable Modes**: * **Local** (default): Sentence-Transformers, zero setup * **Hugging Face API** (optional): Requires `AIWF_EMBEDDING__HF_API_TOKEN` ### Saved Searches * **Save**: Store query + filters with custom name * **Replay**: One-click load from sidebar * **Persistence**: Browser localStorage with Export/Import JSON for cross-device transfer ### Search History Last 10 searches stored per user (localStorage or database if logged in). ## Configuration ```bash # Autocomplete suggestions limit (5 feeds + 3 topics) AIWF_SEARCH__AUTOCOMPLETE_LIMIT=8 # Full-text search results per page AIWF_SEARCH__FULL_TEXT_LIMIT=20 # Semantic similarity threshold (0.0-1.0) AIWF_SEARCH__SEMANTIC_SIMILARITY_THRESHOLD=0.7 # Embedding provider: "local" or "huggingface" AIWF_EMBEDDING__PROVIDER=local # Hugging Face API token (optional, for HF provider) AIWF_EMBEDDING__HF_API_TOKEN= # Hugging Face model name AIWF_EMBEDDING__HF_MODEL=sentence-transformers/all-MiniLM-L6-v2 # Local model name AIWF_EMBEDDING__LOCAL_MODEL=sentence-transformers/all-MiniLM-L6-v2 # Embedding cache size (LRU) AIWF_EMBEDDING__EMBEDDING_CACHE_SIZE=1000 ``` ## Usage ### Web Interface Navigate to `/search`: 1. Type query in search bar 2. Select autocomplete suggestion or press Enter 3. Apply faceted filters (left sidebar) 4. Toggle "Include similar results" for semantic search 5. Click "Save Search" to store query for later **Keyboard Shortcuts**: * `Cmd/Ctrl+K`: Focus search bar * `Arrow keys`: Navigate autocomplete suggestions * `Enter`: Execute search ### CLI ```bash # Full-text search uv run aiwebfeeds search "transformer attention" --limit 20 # Semantic search uv run aiwebfeeds search "machine learning" --semantic --threshold 0.7 # Filter by source type and topic uv run aiwebfeeds search "pytorch" --source-type blog --topic deeplearning # Save search uv run aiwebfeeds search save --name "ML Research" --query "deep learning" --topics "llm,training" ``` ### API ```typescript // Full-text search const response = await fetch("/api/search?q=transformer&limit=20"); const results = await response.json(); // Semantic search const semanticResults = await fetch("/api/search?q=neural networks&semantic=true&threshold=0.7"); // Autocomplete const suggestions = await fetch("/api/search/autocomplete?prefix=mach"); // Save search await fetch("/api/search/saved", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ name: "AI Research", query: "artificial intelligence", filters: { source_type: ["blog"], topics: ["llm", "agents"] }, }), }); ``` ## Performance * **Autocomplete**: \<200ms response time (95th percentile, NFR-002) * **Full-Text Search**: \<500ms for 10,000+ feeds (NFR-003) * **Semantic Search**: \<3s total latency (2s vector search + 1s rendering, NFR-004) * **FTS5 Scalability**: Supports 50,000+ feeds with sub-second queries ## Zero Results Handling When no results found, display: * Spelling suggestions * "Browse by topic" link * "Suggest a feed" link → GitHub issue template ## Success Criteria * ✅ Search results appear within 500ms for 95% of queries * ✅ 70% of searches yield >0 results (zero-result rate \<30%) * ✅ Average click-through rate on search results ≥40% * ✅ 50% of users who search use faceted filters * ✅ Saved searches used by 20% of active users within first month * ✅ Semantic search increases relevance by 25% (A/B test CTR) ## Related * [Analytics Dashboard](./analytics) - View search analytics and popular queries * [Recommendations](./recommendations) - AI-powered feed suggestions * [Data Model](/docs/development/data-model#searchquery) - SearchQuery and SavedSearch schemas -------------------------------------------------------------------------------- END OF PAGE 41 -------------------------------------------------------------------------------- ================================================================================ PAGE 42 OF 57 ================================================================================ TITLE: Sentiment Analysis URL: https://ai-web-feeds.w4w.dev/docs/features/sentiment-analysis MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/sentiment-analysis.mdx DESCRIPTION: Transformer-based sentiment classification and trend tracking PATH: /features/sentiment-analysis -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Sentiment Analysis (/docs/features/sentiment-analysis) # Sentiment Analysis Sentiment Analysis classifies article sentiment using transformer models (DistilBERT) and tracks sentiment trends over time by topic. ## Overview The sentiment analyzer: 1. **Classifies** article sentiment: positive, neutral, or negative 2. **Computes** sentiment scores (-1.0 to +1.0) 3. **Aggregates** daily sentiment by topic 4. **Detects** sentiment shifts using moving averages ## Architecture ## Sentiment Classification ### Model Uses Hugging Face's `distilbert-base-uncased-finetuned-sst-2-english`: * **Model Size**: 67MB * **Accuracy**: \~92% on SST-2 benchmark * **Inference Time**: \~50ms per article (CPU) * **Context Window**: 512 tokens (truncates longer articles) ### Sentiment Score Mapping ```python # Model output → Sentiment score "POSITIVE" (confidence 0.85) → +0.85 "NEGATIVE" (confidence 0.92) → -0.92 "NEUTRAL" → 0.0 ``` ### Classification Thresholds ```python if sentiment_score > 0.3: classification = "positive" elif sentiment_score < -0.3: classification = "negative" else: classification = "neutral" ``` ## Usage ### CLI Commands #### Analyze Sentiment ```bash aiwebfeeds nlp sentiment ``` **Options**: * `--batch-size`: Number of articles (default: 100) * `--force`: Reprocess all articles ```bash # Process 50 articles aiwebfeeds nlp sentiment --batch-size 50 ``` #### View Sentiment Trends ```bash # 30-day sentiment trend for "AI Safety" aiwebfeeds nlp sentiment-trend "AI Safety" --days 30 ``` **Output**: ``` AI Safety - Sentiment Trend (30 days) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Date Avg Sentiment Articles Positive Neutral Negative 2023-10-01 +0.45 24 18 4 2 2023-10-02 +0.32 19 12 5 2 2023-10-03 -0.15 28 8 12 8 ⚠️ Shift ``` #### Detect Sentiment Shifts ```bash # Show topics with sentiment shifts (>0.3 change in 7-day MA) aiwebfeeds nlp sentiment-shifts ``` **Output**: ``` Recent Sentiment Shifts ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Topic Previous Current Change Status AI Safety +0.25 -0.18 -0.43 🔴 Major shift AI Regulation -0.10 +0.35 +0.45 🟢 Improving ``` #### Compare Topics ```bash aiwebfeeds nlp sentiment-compare "AI Safety" "AI Capabilities" ``` Shows side-by-side sentiment trends for two topics. ### Python API ```python from ai_web_feeds.nlp import SentimentAnalyzer from ai_web_feeds.config import Settings analyzer = SentimentAnalyzer(Settings()) article = { "id": 1, "title": "RLHF Concerns", "content": "Critics have raised serious concerns about RLHF..." } sentiment = analyzer.analyze_sentiment(article) # Returns: { # "sentiment_score": -0.65, # "classification": "negative", # "confidence": 0.89, # "model_name": "distilbert-base-uncased-finetuned-sst-2-english" # } ``` ### Batch Processing Sentiment analysis runs hourly: ```python from ai_web_feeds.nlp.scheduler import NLPScheduler nlp_scheduler = NLPScheduler(scheduler) nlp_scheduler.register_jobs() # Registers: # - Sentiment analysis (every hour) # - Sentiment aggregation (15 min after analysis) ``` ## Database Schema ### article\_sentiment Table ```sql CREATE TABLE article_sentiment ( article_id INTEGER PRIMARY KEY, sentiment_score REAL NOT NULL CHECK(sentiment_score BETWEEN -1.0 AND 1.0), classification TEXT NOT NULL CHECK(classification IN ('positive', 'neutral', 'negative')), model_name TEXT NOT NULL, confidence REAL NOT NULL CHECK(confidence BETWEEN 0 AND 1), computed_at DATETIME DEFAULT CURRENT_TIMESTAMP, FOREIGN KEY (article_id) REFERENCES feed_entries(id) ); ``` ### topic\_sentiment\_daily Table Aggregated daily sentiment by topic: ```sql CREATE TABLE topic_sentiment_daily ( id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT NOT NULL, date DATE NOT NULL, avg_sentiment REAL NOT NULL, article_count INTEGER NOT NULL, positive_count INTEGER DEFAULT 0, neutral_count INTEGER DEFAULT 0, negative_count INTEGER DEFAULT 0, UNIQUE(topic, date) ); ``` ## Sentiment Aggregation ### Daily Aggregation Runs 15 minutes after sentiment analysis: ```python # Group sentiment scores by (topic, date) aggregates = {} for article in recent_articles: for topic in article.topics: key = (topic, article.date) aggregates[key]["scores"].append(article.sentiment_score) aggregates[key][article.classification] += 1 # Compute average for (topic, date), data in aggregates.items: avg_sentiment = sum(data["scores"]) / len(data["scores"]) storage.upsert_topic_sentiment_daily( topic=topic, date=date, avg_sentiment=avg_sentiment, article_count=len(data["scores"]), positive_count=data["positive"], neutral_count=data["neutral"], negative_count=data["negative"] ) ``` ### Shift Detection 7-day moving average: ```python def detect_shift(topic: str, threshold: float = 0.3) -> bool: """Detect sentiment shift using 7-day moving average""" trend = storage.get_topic_sentiment_trend(topic, days=14) # Compute 7-day MA for last 2 weeks ma_recent = mean([day.avg_sentiment for day in trend[:7]]) ma_previous = mean([day.avg_sentiment for day in trend[7:14]]) shift = abs(ma_recent - ma_previous) return shift > threshold ``` ## Configuration ```python class Phase5Settings(BaseSettings): sentiment_batch_size: int = 100 sentiment_cron: str = "0 * * * *" # Every hour sentiment_model: str = "distilbert-base-uncased-finetuned-sst-2-english" sentiment_shift_threshold: float = 0.3 ``` **Environment Variables**: ```bash PHASE5_SENTIMENT_BATCH_SIZE=100 PHASE5_SENTIMENT_SHIFT_THRESHOLD=0.3 PHASE5_SENTIMENT_MODEL=distilbert-base-uncased-finetuned-sst-2-english ``` ## Performance * **Throughput**: \~100 articles/hour (CPU) * **Memory**: \~500MB (model loaded) * **Storage**: \~50 bytes per sentiment record ## Use Cases ### Monitor Topic Sentiment Track sentiment for specific topics: ```bash # Daily check for "AI Safety" sentiment aiwebfeeds nlp sentiment-trend "AI Safety" --days 7 ``` ### Detect Controversies Identify topics with negative sentiment spikes: ```bash # Topics with sentiment < -0.5 in last 7 days aiwebfeeds nlp sentiment-shifts --threshold -0.5 ``` ### Compare Competing Approaches ```bash # Compare sentiment for competing techniques aiwebfeeds nlp sentiment-compare "RLHF" "Constitutional AI" ``` ## Model Details ### DistilBERT Architecture * **Base Model**: BERT distilled to 66M parameters (40% smaller) * **Training**: Fine-tuned on SST-2 (Stanford Sentiment Treebank) * **Input**: Max 512 tokens (articles truncated to \~2000 chars) * **Output**: Binary classification (positive/negative) with confidence ### Limitations 1. **Context Window**: Only first 512 tokens considered 2. **Binary Classification**: Model trained for binary sentiment (positive/negative), neutral inferred 3. **Domain Shift**: SST-2 is movie reviews; AI articles may differ 4. **No Fine-tuning**: Pre-trained model used as-is (no domain adaptation) ## Troubleshooting ### Low Confidence Scores **Symptom**: All sentiment predictions have low confidence (\<0.6). **Cause**: Articles too long, model only sees truncated beginning. **Solution**: Increase truncation window or use extractive summarization before analysis. ### Model Download Fails **Symptom**: `OSError: Can't find model` **Solution**: ```bash # Models auto-download to ~/.cache/huggingface/hub # Ensure internet connection and disk space (~67MB) # Manual download: python -c "from transformers import pipeline; pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')" ``` ### Sentiment Shifts Not Detected **Symptom**: No shifts reported despite obvious sentiment changes. **Cause**: Threshold too high. **Solution**: ```bash # Lower threshold to 0.2 export PHASE5_SENTIMENT_SHIFT_THRESHOLD=0.2 ``` ## Future Enhancements 1. **Domain-Specific Fine-tuning**: Train on AI article sentiment labels 2. **Aspect-Based Sentiment**: Sentiment for specific entities/topics within articles 3. **Multilingual Support**: Add models for non-English content 4. **Real-Time Alerts**: Webhook notifications for sentiment shifts ## See Also * [Quality Scoring](/docs/features/quality-scoring) - Article quality assessment * [Entity Extraction](/docs/features/entity-extraction) - Named entity recognition * [Topic Modeling](/docs/features/topic-modeling) - Discover subtopics -------------------------------------------------------------------------------- END OF PAGE 42 -------------------------------------------------------------------------------- ================================================================================ PAGE 43 OF 57 ================================================================================ TITLE: SEO & Metadata URL: https://ai-web-feeds.w4w.dev/docs/features/seo-metadata MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/seo-metadata.mdx DESCRIPTION: Rich metadata and Open Graph images for improved discoverability PATH: /features/seo-metadata -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # SEO & Metadata (/docs/features/seo-metadata) import { Callout } from "fumadocs-ui/components/callout"; import { Tab, Tabs } from "fumadocs-ui/components/tabs"; import { Step, Steps } from "fumadocs-ui/components/steps"; import { Card, Cards } from "fumadocs-ui/components/card"; import { Image, FileText, Share2, Bot } from "lucide-react"; Comprehensive SEO optimization with rich metadata, Open Graph images, and search engine discoverability. ## Overview AI Web Feeds implements Next.js Metadata API for: }> **Dynamic OG Images** Custom Open Graph images for every page }> **Rich Metadata** Complete SEO tags for all content }> **Social Sharing** Optimized for Twitter, LinkedIn, Slack }> **AI-Friendly** Special rules for AI crawlers ## Features ### ✨ What's Included * **Dynamic OG Images** - Unique images generated for each documentation page * **Rich Metadata** - Complete title, description, keywords, and author information * **Twitter Cards** - Summary cards with large images * **Canonical URLs** - Proper canonical link tags * **Structured Data** - JSON-LD for better search results * **Sitemap** - Auto-generated XML sitemap * **Robots.txt** - Search engine crawling rules with AI bot support * **PWA Manifest** - Progressive Web App configuration * **RSS Discovery** - Feed links in HTML head ## Open Graph Images ### Dynamic Page Images Every documentation page gets a unique OG image: ```tsx title="lib/source.ts" export function getPageImage(page: InferPageType) { const segments = [...page.slugs, "image.png"]; return { segments, url: `/og/docs/${segments.join("/")}`, }; } ``` ### Image Design Custom-designed OG images with: * **Dark theme** - Modern dark background with gradient accents * **Brand identity** - Logo and site name * **Page title** - Large, readable typography * **Description** - Supporting text for context * **Category badge** - Visual categorization * **Site URL** - Domain attribution ### Example URLs ``` /og/docs/image.png ``` Main documentation landing page OG image ``` /og/docs/features/pdf-export/image.png ``` PDF Export feature page OG image ``` /og/docs/guides/quick-reference/image.png ``` Quick Reference guide OG image ### Image Specifications | Property | Value | | ---------- | -------------------------------- | | Width | 1200px | | Height | 630px | | Format | PNG | | Generation | Static at build time | | Caching | Permanent (`revalidate = false`) | ## Metadata Structure ### Root Layout Site-wide metadata in `app/layout.tsx`: ```tsx export const metadata: Metadata = { metadataBase: new URL(baseUrl), title: { default: 'AI Web Feeds - RSS/Atom Feeds for AI Agents', template: '%s | AI Web Feeds', }, description: 'Curated RSS/Atom feeds optimized for AI agents...', keywords: ['AI', 'RSS feeds', 'Atom feeds', ...], authors: [{ name: 'Wyatt Walsh', url: '...' }], openGraph: { type: 'website', locale: 'en_US', url: baseUrl, siteName: 'AI Web Feeds', images: [{ url: '/og-image.png', width: 1200, height: 630 }], }, twitter: { card: 'summary_large_image', creator: '@wyattowalsh', }, robots: { index: true, follow: true, googleBot: { 'max-video-preview': -1, 'max-image-preview': 'large', 'max-snippet': -1, }, }, }; ``` ### Documentation Pages Dynamic metadata for each page in `app/docs/[[...slug]]/page.tsx`: ```tsx export async function generateMetadata( props: PageProps ): Promise { const page = source.getPage(params.slug); return { title: page.data.title, description: page.data.description, keywords: ['documentation', 'AI', ...], openGraph: { type: 'article', title: page.data.title, url: pageUrl, images: [{ url: imageUrl, width: 1200, height: 630 }], }, twitter: { card: 'summary_large_image', title: page.data.title, images: [imageUrl], }, alternates: { canonical: pageUrl, types: { 'application/rss+xml': '/docs/rss.xml', 'application/atom+xml': '/docs/atom.xml', }, }, }; } ``` ## Sitemap Auto-generated XML sitemap at `/sitemap.xml`: ```tsx title="app/sitemap.ts" export default function sitemap(): MetadataRoute.Sitemap { const pages = source.getPages(); return pages.map((page) => ({ url: `${baseUrl}${page.url}`, lastModified: new Date(), changeFrequency: "weekly", priority: 0.8, })); } ``` ### Sitemap Features * ✅ All documentation pages included * ✅ Proper priority levels * ✅ Change frequency hints * ✅ Last modified dates * ✅ Auto-updates on build ### Access Sitemap ```bash curl https://ai-web-feeds.vercel.app/sitemap.xml ``` ## Robots.txt Custom robots.txt with AI crawler support: ```txt title="Generated robots.txt" User-agent: * Allow: / Disallow: /api/ Disallow: /_next/ Disallow: /static/ User-agent: GPTBot Allow: / User-agent: ChatGPT-User Allow: / User-agent: Google-Extended Allow: / User-agent: anthropic-ai Allow: / User-agent: ClaudeBot Allow: / Sitemap: https://ai-web-feeds.vercel.app/sitemap.xml ``` ### AI Crawler Support Explicitly allows common AI crawlers: * **GPTBot** - OpenAI's web crawler * **ChatGPT-User** - ChatGPT browsing * **Google-Extended** - Google's AI training crawler * **anthropic-ai** - Anthropic's crawler * **ClaudeBot** - Claude's web crawler ## PWA Manifest Progressive Web App configuration: ```json title="Generated manifest.json" { "name": "AI Web Feeds - RSS/Atom Feeds for AI Agents", "short_name": "AI Web Feeds", "description": "Curated RSS/Atom feeds optimized for AI agents", "start_url": "/", "display": "standalone", "background_color": "#0a0a0a", "theme_color": "#667eea", "icons": [ { "src": "/icon-192.png", "sizes": "192x192", "type": "image/png" }, { "src": "/icon-512.png", "sizes": "512x512", "type": "image/png" } ] } ``` ## Social Media Preview ### How It Looks When shared on social media, links display: **Twitter/X** * Large image (1200x630) * Page title * Description * Site name * Creator handle **LinkedIn** * Large image * Page title * Description * Site URL **Slack/Discord** * Rich embed with image * Title and description * Site information ### Testing Social Cards Use [Twitter Card Validator](https://cards-dev.twitter.com/validator): 1. Enter page URL 2. Click "Preview card" 3. Verify image and text Use [LinkedIn Post Inspector](https://www.linkedin.com/post-inspector/): 1. Enter page URL 2. Click "Inspect" 3. Review preview Use [Facebook Sharing Debugger](https://developers.facebook.com/tools/debug/): 1. Enter page URL 2. Click "Debug" 3. Scrape again if needed ## Search Engine Optimization ### Google Search Features Optimized for: * **Rich snippets** - Enhanced search results * **Knowledge graph** - Structured data integration * **Image preview** - Large image thumbnails * **Site links** - Auto-generated navigation * **Breadcrumbs** - Clear page hierarchy ### Verification Add verification codes in `app/layout.tsx`: ```tsx verification: { google: 'your-google-verification-code', yandex: 'your-yandex-verification-code', bing: 'your-bing-verification-code', } ``` ## Implementation ### File Structure ``` app/ ├── layout.tsx # Root metadata ├── manifest.ts # PWA manifest ├── robots.ts # Robots.txt ├── sitemap.ts # XML sitemap ├── og-image.png/ │ └── route.tsx # Homepage OG image ├── (home)/ │ └── page.tsx # Homepage metadata ├── docs/ │ └── [[...slug]]/ │ └── page.tsx # Dynamic page metadata └── og/ └── docs/ └── [...slug]/ └── route.tsx # Dynamic OG images lib/ └── source.ts # getPageImage helper ``` ### Key Functions **Get Page Image URL** ```tsx const image = getPageImage(page); // { segments: ['features', 'pdf-export', 'image.png'], // url: '/og/docs/features/pdf-export/image.png' } ``` **Generate Metadata** ```tsx export async function generateMetadata(props): Promise { const page = source.getPage(params.slug); return { title: page.data.title, openGraph: { images: getPageImage(page).url }, }; } ``` ## Best Practices ### 1. Title Templates Use templates for consistent branding: ```tsx title: { default: 'AI Web Feeds', template: '%s | AI Web Feeds', } ``` Results in: * Homepage: "AI Web Feeds" * Docs page: "Getting Started | AI Web Feeds" ### 2. Description Length Keep descriptions under 160 characters: ```tsx description: "Clear, concise description under 160 characters"; ``` ### 3. Image Optimization * Use 1200x630 for OG images (1.91:1 ratio) * Keep file sizes under 1MB * Use high-contrast text * Test on multiple platforms ### 4. Canonical URLs Always set canonical URLs: ```tsx alternates: { canonical: pageUrl, } ``` ### 5. Keywords Include relevant keywords: ```tsx keywords: ["specific", "relevant", "keywords"]; ``` ## Troubleshooting ### OG Images Not Showing OG images are generated at build time. Rebuild after changes: `bash pnpm build ` ### Social Media Cache If old images persist: 1. Clear platform cache using their debug tools 2. Add query parameter: `?v=2` to force refresh 3. Wait 24-48 hours for automatic cache expiry ### Missing Metadata Check browser dev tools: ```bash # View page source curl https://ai-web-feeds.vercel.app/docs | grep -i "og:" curl https://ai-web-feeds.vercel.app/docs | grep -i "twitter:" ``` Expected tags: ```html ``` ## Testing ### Verify Metadata ### Check HTML Head View page source and verify tags: ```bash curl https://ai-web-feeds.vercel.app/docs | head -100 ``` ### Test OG Images Visit image URLs directly: ``` /og-image.png /og/docs/image.png /og/docs/features/pdf-export/image.png ``` ### Validate Sitemap ```bash curl https://ai-web-feeds.vercel.app/sitemap.xml ``` ### Check Robots.txt ```bash curl https://ai-web-feeds.vercel.app/robots.txt ``` ### SEO Audit Tools * [Google Search Console](https://search.google.com/search-console) * [Bing Webmaster Tools](https://www.bing.com/webmasters) * [Lighthouse](https://developer.chrome.com/docs/lighthouse) (Chrome DevTools) * [PageSpeed Insights](https://pagespeed.web.dev/) ## Performance ### Build-Time Generation All OG images generated during build: * **Development**: Images generated on-demand * **Production**: All images pre-rendered * **Caching**: Permanent (`revalidate = false`) ### Size Optimization | Asset | Size | | -------------- | ---------- | | OG Image (PNG) | \~50-100KB | | Sitemap XML | \~5-10KB | | Manifest JSON | \~1KB | | Robots.txt | \~500B | ## Related Documentation * [RSS Feeds](/docs/features/rss-feeds) - Feed discovery and metadata * [AI Integration](/docs/features/ai-integration) - AI crawler support * [Quick Reference](/docs/guides/quick-reference) - Metadata endpoints ## External Resources * [Next.js Metadata API](https://nextjs.org/docs/app/building-your-application/optimizing/metadata) * [Open Graph Protocol](https://ogp.me/) * [Twitter Cards](https://developer.twitter.com/en/docs/twitter-for-websites/cards/overview/abouts-cards) * [Schema.org](https://schema.org/) -------------------------------------------------------------------------------- END OF PAGE 43 -------------------------------------------------------------------------------- ================================================================================ PAGE 44 OF 57 ================================================================================ TITLE: Topic Modeling URL: https://ai-web-feeds.w4w.dev/docs/features/topic-modeling MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/topic-modeling.mdx DESCRIPTION: LDA-based topic discovery and evolution tracking PATH: /features/topic-modeling -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Topic Modeling (/docs/features/topic-modeling) # Topic Modeling Topic Modeling automatically discovers subtopics within parent topics using Latent Dirichlet Allocation (LDA) and tracks topic evolution over time. ## Overview The topic modeler: 1. **Discovers** subtopics using LDA clustering 2. **Tracks** topic evolution (splits, merges, emergence, decline) 3. **Enables** manual curation of discovered subtopics 4. **Computes** topic coherence scores for quality assessment ## Architecture ## LDA Topic Modeling ### Algorithm Latent Dirichlet Allocation (LDA) discovers latent topics in document collections: 1. **Preprocessing**: Tokenize, remove stopwords, apply TF-IDF 2. **Model Training**: Learn topic distributions using Gensim LDA 3. **Topic Extraction**: Extract keywords and descriptions 4. **Coherence Scoring**: Validate topic quality using C\_v coherence ### Model Parameters ```python lda_config = { "num_topics": 10, # Number of subtopics per parent "passes": 10, # Training iterations "iterations": 400, # Inference iterations "alpha": "auto", # Document-topic density "eta": "auto", # Topic-word density "minimum_probability": 0.01, # Minimum topic probability } ``` ## Usage ### CLI Commands #### Run Topic Modeling ```bash aiwebfeeds nlp topics ``` **Options**: * `--parent-topic`: Parent topic to model (default: all) * `--num-topics`: Number of subtopics to discover (default: 10) * `--min-articles`: Minimum articles required (default: 100) ```bash # Discover 5 subtopics in "NLP" with minimum 50 articles aiwebfeeds nlp topics --parent-topic "NLP" --num-topics 5 --min-articles 50 ``` #### Review Unapproved Subtopics ```bash aiwebfeeds nlp review-subtopics ``` **Interactive Workflow**: ``` Unapproved Subtopics (3) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ [1] NLP > Transformer Architectures Keywords: transformer, attention, bert, gpt, architecture Articles: 45 Coherence: 0.68 Actions: [a]pprove, [r]ename, [d]elete, [s]kip > a ✓ Approved: Transformer Architectures ``` #### Approve Subtopic ```bash aiwebfeeds nlp approve-subtopic ``` #### Rename Subtopic ```bash aiwebfeeds nlp rename-subtopic "New Name" ``` #### List Subtopics ```bash # List all approved subtopics for "AI Safety" aiwebfeeds nlp list-subtopics "AI Safety" ``` #### View Topic Evolution ```bash # Show topic evolution events (splits, merges, etc.) aiwebfeeds nlp topic-evolution --days 30 ``` **Output**: ``` Topic Evolution Events (Last 30 Days) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Date Event Source Topic Target Topics 2023-10-15 split Transformers [BERT-variants, GPT-variants] 2023-10-22 emergence - [Constitutional AI] 2023-10-28 merge [RLHF, HHH] Alignment Techniques ``` ### Python API ```python from ai_web_feeds.nlp import TopicModeler from ai_web_feeds.storage import Storage modeler = TopicModeler() storage = Storage() # Get articles for parent topic articles = storage.get_articles_by_topic("NLP", limit=1000) # Train LDA model subtopics = modeler.extract_subtopics( parent_topic="NLP", articles=articles, num_topics=10 ) # subtopics = [ # { # "name": "Transformer Architectures", # "keywords": ["transformer", "attention", "bert", "gpt"], # "description": "Articles about transformer models...", # "article_count": 45, # "coherence": 0.68 # }, # ... # ] # Store subtopics for subtopic_data in subtopics: storage.create_subtopic( parent_topic="NLP", name=subtopic_data["name"], keywords=subtopic_data["keywords"], description=subtopic_data["description"], article_count=subtopic_data["article_count"] ) ``` ### Batch Processing Topic modeling runs monthly (1st of month, 3 AM): ```python from ai_web_feeds.nlp.scheduler import NLPScheduler nlp_scheduler = NLPScheduler(scheduler) nlp_scheduler.register_jobs() # Registers: Topic modeling job (monthly) ``` ## Database Schema ### subtopics Table ```sql CREATE TABLE subtopics ( id TEXT PRIMARY KEY, -- UUID parent_topic TEXT NOT NULL, name TEXT NOT NULL, keywords TEXT NOT NULL, -- JSON array description TEXT, article_count INTEGER DEFAULT 0, detected_at DATETIME DEFAULT CURRENT_TIMESTAMP, approved BOOLEAN DEFAULT FALSE, created_by TEXT DEFAULT 'system', UNIQUE(parent_topic, name) ); ``` ### topic\_evolution\_events Table ```sql CREATE TABLE topic_evolution_events ( id INTEGER PRIMARY KEY AUTOINCREMENT, event_type TEXT NOT NULL CHECK(event_type IN ('split', 'merge', 'emergence', 'decline')), source_topic TEXT, target_topics TEXT, -- JSON array article_count INTEGER NOT NULL, growth_rate REAL, detected_at DATETIME DEFAULT CURRENT_TIMESTAMP ); ``` ## Topic Evolution Detection ### Evolution Types **Split**: One topic divides into multiple subtopics ``` Transformers → [BERT-variants, GPT-variants, ViT] ``` **Merge**: Multiple subtopics combine into one ``` [Supervised Learning, Unsupervised Learning] → Machine Learning Fundamentals ``` **Emergence**: New topic appears (growth rate > 100%) ``` - → Constitutional AI (50 articles in 1 month) ``` **Decline**: Topic activity decreases (growth rate \< -50%) ``` GANs → (declining mention frequency) ``` ### Detection Algorithm ```python def detect_evolution( current_topics: List[Subtopic], previous_topics: List[Subtopic] ) -> List[EvolutionEvent]: """Compare current vs previous month's topics""" events = [] # Detect splits for prev_topic in previous_topics: similar_topics = find_similar_topics(prev_topic, current_topics) if len(similar_topics) >= 2: events.append({ "type": "split", "source": prev_topic.name, "targets": [t.name for t in similar_topics] }) # Detect emergence for curr_topic in current_topics: if not any(is_similar(curr_topic, pt) for pt in previous_topics): growth_rate = compute_growth_rate(curr_topic) if growth_rate > 1.0: # >100% growth events.append({ "type": "emergence", "target": curr_topic.name, "growth_rate": growth_rate }) return events ``` ## Topic Coherence ### Coherence Metric Topic coherence (C\_v) measures topic quality: * **Range**: 0.0 (poor) to 1.0 (excellent) * **Threshold**: Reject topics with coherence \< 0.5 * **Interpretation**: * 0.7+: Excellent, semantically coherent * 0.5-0.7: Good, acceptable * \<0.5: Poor, review manually ### Computation ```python from gensim.models.coherencemodel import CoherenceModel coherence_model = CoherenceModel( model=lda_model, texts=tokenized_docs, dictionary=dictionary, coherence='c_v' ) coherence_score = coherence_model.get_coherence() ``` ## Configuration ```python class Phase5Settings(BaseSettings): topic_modeling_cron: str = "0 3 1 * *" # 3 AM on 1st of month topic_model: str = "lda" # Algorithm: lda, nmf, or bertopic topic_coherence_min: float = 0.5 nlp_workers: int = 4 # Parallel processing ``` **Environment Variables**: ```bash PHASE5_TOPIC_MODEL=lda PHASE5_TOPIC_COHERENCE_MIN=0.5 PHASE5_NLP_WORKERS=4 ``` ## Performance * **Training Time**: \~5-10 minutes for 1000 articles * **Memory**: \~1GB peak during training * **Storage**: \~200 bytes per subtopic ## Manual Curation Workflow ### 1. Run Topic Modeling ```bash aiwebfeeds nlp topics --parent-topic "AI Safety" ``` ### 2. Review Unapproved Subtopics ```bash aiwebfeeds nlp review-subtopics ``` ### 3. Approve/Rename/Delete **Approve**: ```bash aiwebfeeds nlp approve-subtopic ``` **Rename**: ```bash aiwebfeeds nlp rename-subtopic "Better Name" ``` **Delete** (low coherence): ```bash aiwebfeeds nlp delete-subtopic ``` ### 4. Verify Approved Subtopics ```bash aiwebfeeds nlp list-subtopics "AI Safety" --approved-only ``` ## Use Cases ### Discover Emerging Subtopics Monitor new research areas: ```bash # Monthly check for new subtopics in "AI" aiwebfeeds nlp topics --parent-topic "AI" aiwebfeeds nlp topic-evolution --event-type emergence ``` ### Track Topic Fragmentation Identify when broad topics split: ```bash # Check if "Deep Learning" has fragmented aiwebfeeds nlp topic-evolution --event-type split --source "Deep Learning" ``` ### Content Organization Use subtopics for navigation and filtering: ```bash # Show articles in specific subtopic aiwebfeeds articles list --subtopic "Transformer Architectures" ``` ## Troubleshooting ### Low Coherence Scores **Symptom**: All subtopics have coherence \< 0.5. **Causes**: 1. Too few articles (\< 100) 2. Too many subtopics requested 3. Poor text preprocessing **Solutions**: ```bash # Reduce number of topics aiwebfeeds nlp topics --num-topics 5 # Increase minimum articles aiwebfeeds nlp topics --min-articles 200 ``` ### Topics Too Broad **Symptom**: Subtopics are generic and overlap. **Solution**: Increase `num_topics` parameter to get more specific clusters: ```bash aiwebfeeds nlp topics --num-topics 15 ``` ### Model Training Fails **Symptom**: `MemoryError` or training hangs. **Solution**: * Reduce batch size * Limit article count: `--max-articles 500` * Increase system memory or use cloud instance ## Advanced Features ### BERTopic (Future) Alternative to LDA using transformer embeddings: ```python # Planned: BERTopic support modeler = TopicModeler(algorithm="bertopic") subtopics = modeler.extract_subtopics(parent_topic="NLP", articles=articles) ``` **Advantages**: * Better semantic understanding * No need to specify number of topics * Higher coherence scores **Trade-offs**: * Slower training (GPU recommended) * Higher memory usage (\~2GB) ## See Also * [Quality Scoring](/docs/features/quality-scoring) - Article quality assessment * [Entity Extraction](/docs/features/entity-extraction) - Named entity recognition * [Sentiment Analysis](/docs/features/sentiment-analysis) - Sentiment classification -------------------------------------------------------------------------------- END OF PAGE 44 -------------------------------------------------------------------------------- ================================================================================ PAGE 45 OF 57 ================================================================================ TITLE: Twitter/X and arXiv Integration URL: https://ai-web-feeds.w4w.dev/docs/features/twitter-arxiv-integration MARKDOWN: https://ai-web-feeds.w4w.dev/docs/features/twitter-arxiv-integration.mdx DESCRIPTION: Generate RSS feeds from Twitter/X and arXiv for AI research tracking PATH: /features/twitter-arxiv-integration -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Twitter/X and arXiv Integration (/docs/features/twitter-arxiv-integration) import { Callout } from "fumadocs-ui/components/callout"; import { Tab, Tabs } from "fumadocs-ui/components/tabs"; ## Overview AI Web Feeds provides native integrations for Twitter/X and arXiv, enabling you to track AI researchers, discussions, and papers through RSS feeds. Twitter/X integration uses **Nitter** instances (privacy-focused alternative Twitter frontend) to generate RSS feeds. ## Twitter/X Integration ### Supported Feed Types Get tweets from a specific user. ```yaml - id: "karpathy-twitter" site: "https://twitter.com/karpathy" title: "Andrej Karpathy on Twitter" topics: ["ai", "ml", "research"] source_type: "twitter" mediums: ["text"] platform_config: platform: "twitter" twitter: username: "karpathy" nitter_instance: "nitter.net" # Optional, defaults to nitter.net ``` **Generated Feed URL**: `https://nitter.net/karpathy/rss` Get tweets from a Twitter list. ```yaml - id: "ai-researchers-list" site: "https://twitter.com/i/lists/1234567890" title: "AI Researchers List" topics: ["ai", "research"] source_type: "twitter" platform_config: platform: "twitter" twitter: list_id: "1234567890" ``` **Generated Feed URL**: `https://nitter.net/i/lists/1234567890/rss` Get tweets matching a search query. ```yaml - id: "twitter-llm-search" site: "https://twitter.com/search" title: "Twitter Search - LLM discussions" topics: ["llm", "community"] source_type: "twitter" platform_config: platform: "twitter" twitter: search_query: "LLM OR large language model" ``` **Generated Feed URL**: `https://nitter.net/search/rss?q=LLM+OR+large+language+model` ### Configuration Schema The `platform_config.twitter` object supports: | Field | Type | Description | | ----------------- | ------ | ------------------------------------------- | | `username` | string | Twitter username (without @) | | `list_id` | string | Twitter list ID | | `search_query` | string | Twitter search query | | `nitter_instance` | string | Nitter instance URL (default: `nitter.net`) | ### Alternative Nitter Instances For reliability, you can use different Nitter instances: * `nitter.net` (default) * `nitter.privacy.com.de` * `nitter.1d4.us` * `nitter.kavin.rocks` Nitter instances may have rate limits or availability issues. Consider using multiple instances for redundancy. ## arXiv Integration ### Supported Feed Types RSS feeds for specific arXiv categories. ```yaml - id: "arxiv-cs-lg" site: "https://arxiv.org/list/cs.LG/recent" title: "arXiv - Computer Science - Machine Learning" topics: ["research", "papers", "ml"] source_type: "arxiv" mediums: ["text"] platform_config: platform: "arxiv" arxiv: category: "cs.LG" ``` **Generated Feed URL**: `http://export.arxiv.org/rss/cs.LG` Papers by specific authors. ```yaml - id: "arxiv-bengio" site: "https://arxiv.org" title: "arXiv - Yoshua Bengio papers" topics: ["research", "papers", "ml"] source_type: "arxiv" platform_config: platform: "arxiv" arxiv: author: "Yoshua Bengio" max_results: 50 ``` **Generated Feed URL**: `http://export.arxiv.org/api/query?search_query=au:Yoshua+Bengio&max_results=50&sortBy=submittedDate&sortOrder=descending` Advanced search capabilities. ```yaml - id: "arxiv-transformer-search" site: "https://arxiv.org" title: "arXiv - Transformer papers" topics: ["research", "nlp"] source_type: "arxiv" platform_config: platform: "arxiv" arxiv: search_query: "all:transformer AND all:attention" max_results: 100 ``` **Generated Feed URL**: `http://export.arxiv.org/api/query?search_query=all:transformer+AND+all:attention&max_results=100&sortBy=submittedDate&sortOrder=descending` ### Configuration Schema The `platform_config.arxiv` object supports: | Field | Type | Description | | -------------- | ------- | ----------------------------------------- | | `category` | string | arXiv category (e.g., `cs.LG`, `stat.ML`) | | `author` | string | Author name for author-specific feeds | | `search_query` | string | Advanced search query | | `max_results` | integer | Maximum number of results (default: 50) | ### Popular arXiv Categories for AI/ML * **`cs.LG`** - Machine Learning * **`cs.AI`** - Artificial Intelligence * **`cs.CL`** - Computation and Language (NLP) * **`cs.CV`** - Computer Vision and Pattern Recognition * **`cs.NE`** - Neural and Evolutionary Computing * **`stat.ML`** - Machine Learning (Statistics) * **`cs.RO`** - Robotics * **`cs.IR`** - Information Retrieval ### arXiv Search Syntax When using `search_query`, you can use arXiv's advanced search: * `au:author_name` - Author search * `ti:title_words` - Title search * `abs:abstract_words` - Abstract search * `all:keywords` - Search all fields * Use `AND`, `OR`, `ANDNOT` for boolean queries **Example**: `all:transformer AND cat:cs.LG` ## Implementation Details ### Platform Detection The system automatically detects Twitter/X and arXiv URLs: **Twitter/X domains:** * `twitter.com`, `www.twitter.com` * `x.com`, `www.x.com` **arXiv domains:** * `arxiv.org`, `www.arxiv.org` * `export.arxiv.org` ### Feed URL Generation Platform-specific generators: 1. `generate_twitter_feed_url(url, platform_config)` - Generates Nitter RSS URLs 2. `generate_arxiv_feed_url(url, platform_config)` - Generates arXiv RSS/API URLs These are automatically called during feed discovery. ## Testing Run the integration tests: ```bash # All Twitter/arXiv tests aiwebfeeds test file test_utils.py -k "twitter or arxiv" # Specific test class aiwebfeeds test file test_utils.py -k "TestTwitterIntegration" aiwebfeeds test file test_utils.py -k "TestArxivIntegration" ``` ## Usage Examples ### Adding a Twitter Feed Add to `data/feeds.yaml`: ```yaml - id: "your-twitter-feed" site: "https://twitter.com/username" title: "Feed Title" topics: ["ai"] source_type: "twitter" platform_config: platform: "twitter" ``` ### Adding an arXiv Feed Add to `data/feeds.yaml`: ```yaml - id: "your-arxiv-feed" site: "https://arxiv.org/list/cs.LG/recent" title: "Feed Title" topics: ["research", "ml"] source_type: "arxiv" platform_config: platform: "arxiv" ``` ## Limitations ### Twitter/X * Relies on Nitter instances which may have rate limits or availability issues * Nitter instances may be blocked or shut down * Consider using multiple Nitter instances for redundancy ### arXiv * RSS feeds update once per day (overnight) * API queries limited to 100 results maximum * API has rate limiting (3 seconds between requests recommended) * Author searches may return false positives for common names ## Best Practices 1. **Twitter/X**: Monitor your chosen Nitter instance for availability 2. **arXiv**: Use specific categories rather than broad searches for better signal 3. **Both**: Set appropriate `max_results` to avoid overwhelming feeds 4. **Both**: Use `topic_weights` to indicate relevance when a feed covers multiple topics ## Future Enhancements Potential improvements: * [ ] Automatic Nitter instance failover * [ ] arXiv paper metadata enrichment * [ ] Twitter thread reconstruction * [ ] arXiv citation tracking * [ ] Integration with arXiv vanity for better author disambiguation -------------------------------------------------------------------------------- END OF PAGE 45 -------------------------------------------------------------------------------- ================================================================================ PAGE 46 OF 57 ================================================================================ TITLE: Analytics & Monitoring URL: https://ai-web-feeds.w4w.dev/docs/guides/analytics MARKDOWN: https://ai-web-feeds.w4w.dev/docs/guides/analytics.mdx DESCRIPTION: Comprehensive guide to feed analytics, monitoring, and reporting capabilities PATH: /guides/analytics -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Analytics & Monitoring (/docs/guides/analytics) import { Callout } from "fumadocs-ui/components/callout"; import { Tabs, Tab } from "fumadocs-ui/components/tabs"; AI Web Feeds provides robust analytics and monitoring capabilities to track feed health, performance, and content trends. ## Overview The analytics system provides: * **8 Different Analytics Views** - Overview, distributions, quality, performance, content, trends, health, contributors * **Real-time Health Monitoring** - Track each feed's status and performance * **Performance Metrics** - Success rates, durations, error analysis * **Publishing Trends** - Analyze content patterns over time * **Quality Scoring** - 3-dimensional quality assessment * **JSON Export** - Generate comprehensive reports ## Analytics Commands ### Overview Dashboard Get a high-level view of all your feeds: ```bash ai-web-feeds analytics overview ``` **Provides:** * Total feeds, items, and topics * Feed status distribution (verified, active, inactive) * Recent activity (last 24 hours) ### Distributions Analyze how feeds are distributed across different dimensions: ```bash ai-web-feeds analytics distributions [--limit N] ``` **Shows:** * Source type distribution (blog, newsletter, podcast, etc.) * Topic distribution across feeds * Language distribution * Content medium distribution ### Quality Metrics View quality scores and distributions: ```bash ai-web-feeds analytics quality ``` **Displays:** * Average and median quality scores * Quality distribution (excellent/good/fair/poor) * High/low quality feed counts Each feed receives three scores (0-1): * **Completeness**: How complete is the feed metadata? * **Richness**: How rich and detailed is the content? * **Structure**: How well-structured is the feed? ### Performance Tracking Monitor fetch performance over time: ```bash ai-web-feeds analytics performance [--days N] ``` **Metrics:** * Total fetches and success rate * Average fetch duration * Error type distribution * HTTP status code analysis ### Content Statistics Analyze content across all feeds: ```bash ai-web-feeds analytics content ``` **Provides:** * Total items and content coverage * Author attribution rates * Enclosure/media usage * Top categories ### Publishing Trends Understand publishing patterns: ```bash ai-web-feeds analytics trends [--days N] ``` **Shows:** * Items per day * Publishing patterns by hour * Publishing patterns by weekday * Peak publishing times ### Feed Health Reports Get detailed health metrics for a specific feed: ```bash ai-web-feeds analytics health ``` **Includes:** * Overall health score and status * Fetch statistics and success rate * Content quality metrics * Publishing frequency **Health Status Levels:** * **Excellent** (0.8-1.0) - Feed is performing optimally * **Good** (0.6-0.8) - Feed is healthy with minor issues * **Fair** (0.4-0.6) - Feed has some problems * **Poor** (0.2-0.4) - Feed needs attention * **Critical** (0.0-0.2) - Feed is failing ### Contributor Analytics View top contributors: ```bash ai-web-feeds analytics contributors [--limit N] ``` **Shows:** * Contributors ranked by feed count * Verification rates per contributor * Quality benchmarks ### Full Report Generation Generate comprehensive JSON reports: ```bash ai-web-feeds analytics report [--output FILE] ``` Exports all analytics data in JSON format for: * Custom analysis * Integration with other tools * Long-term tracking * Data visualization ## Quality Scoring System ### Completeness Score (0-1) Measures how complete the feed metadata is: ✓ Has title ✓ Has description ✓ Has link ✓ Has language ✓ Has timestamps ✓ Has author/publisher ✓ Has categories ✓ Has image/logo ### Richness Score (0-1) Evaluates content depth and quality: ✓ Items have content ✓ Content coverage percentage ✓ Author attribution ✓ Average content length ✓ Full content availability ✓ Media/images present ### Structure Score (0-1) Assesses feed validity and structure: ✓ No parsing errors ✓ Has items ✓ Items have GUIDs ✓ Has timestamps ✓ Has links ## Monitoring Workflows ### Daily Health Check Set up a daily monitoring routine: ```bash #!/bin/bash # daily-health-check.sh # Fetch all verified feeds ai-web-feeds fetch all --verified-only # Generate health report ai-web-feeds analytics overview > daily-overview.txt ai-web-feeds analytics performance --days 1 > daily-performance.txt # Check for critical feeds ai-web-feeds analytics quality | grep -i "poor\|critical" ``` ### Weekly Analytics Review Generate weekly analytics: ```bash #!/bin/bash # weekly-analytics.sh DATE=$(date +%Y-%m-%d) # Generate comprehensive report ai-web-feeds analytics report --output "reports/analytics-${DATE}.json" # View trends ai-web-feeds analytics trends --days 7 ai-web-feeds analytics distributions # Top contributors ai-web-feeds analytics contributors --limit 10 ``` ### Alert on Feed Failures Monitor for failing feeds: ```bash #!/bin/bash # check-failures.sh # Get performance stats STATS=$(ai-web-feeds analytics performance --days 1) # Extract success rate SUCCESS_RATE=$(echo "$STATS" | grep "Success Rate" | awk '{print $3}' | tr -d '%') if (( $(echo "$SUCCESS_RATE < 90" | bc -l) )); then echo "WARNING: Success rate below 90%: ${SUCCESS_RATE}%" # Send alert (email, Slack, etc.) fi ``` ## Advanced Analytics ### Custom Python Analysis Use the Python API for custom analytics: ```python from ai_web_feeds.analytics import FeedAnalytics from ai_web_feeds.storage import DatabaseManager # Initialize db = DatabaseManager("sqlite:///data/aiwebfeeds.db") analytics = FeedAnalytics(db.get_session()) # Custom query: Find all feeds with quality < 0.5 feeds = db.get_all_feed_sources() low_quality = [ f for f in feeds if f.quality_score and f.quality_score < 0.5 ] print(f"Found {len(low_quality)} low quality feeds:") for feed in low_quality: print(f" - {feed.title}: {feed.quality_score:.2f}") # Generate custom report report = analytics.generate_full_report() # Analyze specific dimension quality_by_type = {} for feed in feeds: if feed.source_type and feed.quality_score: type_name = feed.source_type.value if type_name not in quality_by_type: quality_by_type[type_name] = [] quality_by_type[type_name].append(feed.quality_score) # Calculate averages for source_type, scores in quality_by_type.items(): avg = sum(scores) / len(scores) print(f"{source_type}: {avg:.3f}") ``` ### Database Queries Direct SQL queries for advanced analysis: ```python from sqlalchemy import select, func from ai_web_feeds.models import FeedSource, FeedItem # Get feeds with most items stmt = ( select(FeedSource.id, FeedSource.title, func.count(FeedItem.id)) .join(FeedItem) .group_by(FeedSource.id) .order_by(func.count(FeedItem.id).desc()) .limit(10) ) results = session.exec(stmt).all() for feed_id, title, count in results: print(f"{title}: {count} items") ``` ## Export Formats ### JSON Reports Comprehensive analytics in JSON: ```json { "generated_at": "2025-10-15T12:00:00Z", "overview": { "totals": { "feeds": 150, "items": 5000, "topics": 25 }, "feed_status": { "verified": 120, "active": 100, "inactive": 5 } }, "quality": { "average_quality": 0.85, "median_quality": 0.87 } } ``` ### CSV Export (via Python) ```python import csv from ai_web_feeds.storage import DatabaseManager db = DatabaseManager() feeds = db.get_all_feed_sources() with open('feeds-export.csv', 'w', newline='') as f: writer = csv.writer(f) writer.writerow(['ID', 'Title', 'Type', 'Quality', 'Verified']) for feed in feeds: writer.writerow([ feed.id, feed.title, feed.source_type.value if feed.source_type else '', feed.quality_score or '', feed.verified ]) ``` ## Integration Examples ### Grafana Dashboard Export metrics for Grafana: ```python import json from datetime import datetime def export_metrics(): analytics = FeedAnalytics(session) stats = analytics.get_overview_stats() metrics = { "timestamp": datetime.utcnow().isoformat(), "feeds_total": stats["totals"]["feeds"], "feeds_active": stats["feed_status"]["active"], "items_24h": stats["recent_activity_24h"]["new_items"] } with open('/var/lib/grafana/metrics/ai-web-feeds.json', 'w') as f: json.dump(metrics, f) ``` ### Prometheus Exporter ```python from prometheus_client import Gauge, generate_latest feeds_total = Gauge('ai_web_feeds_total', 'Total number of feeds') feeds_active = Gauge('ai_web_feeds_active', 'Number of active feeds') def update_metrics(): stats = analytics.get_overview_stats() feeds_total.set(stats["totals"]["feeds"]) feeds_active.set(stats["feed_status"]["active"]) ``` ## Best Practices 1. **Regular Monitoring** - Run analytics daily to track changes 2. **Health Checks** - Monitor feed health scores regularly 3. **Performance Tracking** - Watch for degrading fetch success rates 4. **Quality Improvement** - Address low-quality feeds 5. **Trend Analysis** - Understand publishing patterns 6. **Report Generation** - Keep historical analytics for comparison 7. **Alert on Anomalies** - Set up alerts for critical issues ## Related Documentation * [CLI Reference](/docs/development/cli) - All CLI commands * [Python API](/docs/development/python-api) - Programmatic usage * [Database Schema](/docs/development/database) - Data model * [Getting Started](/docs/guides/getting-started) - Installation guide -------------------------------------------------------------------------------- END OF PAGE 46 -------------------------------------------------------------------------------- ================================================================================ PAGE 47 OF 57 ================================================================================ TITLE: Data Explorer URL: https://ai-web-feeds.w4w.dev/docs/guides/data-explorer MARKDOWN: https://ai-web-feeds.w4w.dev/docs/guides/data-explorer.mdx DESCRIPTION: Interactive tool for browsing and filtering AI Web Feeds topics and feeds PATH: /guides/data-explorer -------------------------------------------------------------------------------- CONTENT -------------------------------------------------------------------------------- # Data Explorer (/docs/guides/data-explorer) # Data Explorer The Data Explorer provides an interactive web application for browsing, searching, and filtering the AI Web Feeds catalog of topics and feeds. ## Features ### Tabbed Interface * ✅ **Topics View**: Browse and search all available topics * ✅ **Feeds View**: Explore the complete catalog of RSS/Atom feeds * Switch seamlessly between views ### Advanced Search * **Full-text search** across titles, URLs, descriptions, and IDs * **Real-time filtering** as you type * Search works across both Topics and Feeds tabs * Instant results with optimized performance ### Topics Browser * View all available topics with their IDs, names, and descriptions * Sort by name or ID in ascending/descending order * Quick search to find specific topics * Hierarchical topic display ### Feeds Browser * Browse the complete catalog of RSS/Atom feeds * Filter by tags with one-click tag selection * Sort by title or URL * Direct links to feed URLs * Visual tag badges for easy identification ### Tag Filtering * **Visual tag cloud** showing all available tags * **Multi-select filtering** - click multiple tags to narrow results * **Active tag highlighting** shows selected filters * Clear all filters with one click * Smart tag counting and sorting ### Sorting Options * Sort by multiple fields (name, ID, title, URL) * Toggle between ascending and descending order * Maintains sort preferences while filtering * Persistent sort state ### Performance Features * ✅ **Real-Time Updates**: Instant filtering and sorting with React hooks * ✅ **Performance Optimized**: Uses `useMemo` for efficient re-rendering * ✅ **Error Handling**: Graceful error states and loading indicators * ✅ **Responsive Design**: Mobile-friendly UI with Tailwind CSS ## Usage ### Accessing the Explorer 1. **Via Browser**: Navigate to [/explorer](/explorer) 2. **Via Navigation**: Click "Explorer" in the site header ### Searching **Example: Search for feeds** 1. Select the **Feeds** tab 2. Enter "AI" in search box 3. Click tags like "machine-learning" or "nlp" 4. Results update instantly **Example: Search for topics** 1. Select the **Topics** tab 2. Enter topic name or ID 3. Sort by name or ID 4. Browse filtered results ### Filtering by Tags (Feeds Only) 1. Switch to the **Feeds** tab 2. Click on one or more tags in the tag filter section 3. Only feeds with selected tags will be displayed 4. Click "Clear tag filters" to reset ### Sorting Results 1. Select your preferred sort field from the dropdown 2. Click the sort order button (↑ Asc / ↓ Desc) to toggle 3. Results update immediately 4. Sort preferences maintained while filtering ## API Endpoints The explorer uses the following API endpoints: ### Topics API * **Endpoint**: `GET /api/topics` * **Source**: `topics.yaml` * **Returns**: JSON array of all topics ### Feeds API * **Endpoint**: `GET /api/feeds` * **Source**: `feeds.enriched.yaml` (fallback to `feeds.yaml`) * **Returns**: JSON array of all feeds Both endpoints include: * Static generation for performance * Cache headers (3600s max-age, 86400s stale-while-revalidate) * Error handling with proper status codes * CORS support for external access ### API Usage Examples ```bash # Fetch topics curl http://localhost:3000/api/topics # Fetch feeds curl http://localhost:3000/api/feeds ``` ## Implementation Details ### Technology Stack * **React 19** with hooks for state management * **Next.js 15** App Router * **Client-side rendering** for instant interactivity * **Responsive design** with Tailwind CSS * **Optimized performance** with useMemo for filtering/sorting ### Components * `ExplorerPage`: Main page component with search and filter controls * `TopicsTable`: Displays topics in a sortable table * `FeedsTable`: Displays feeds with clickable URLs and tag badges * `useExplorerData`: Custom hook for fetching data from API routes ### UI Layout ``` ┌─────────────────────────────────────────┐ │ Data Explorer │ │ Browse and filter AI Web Feeds... │ ├─────────────────────────────────────────┤ │ [Topics (50)] [Feeds (200)] │ ├─────────────────────────────────────────┤ │ [Search...] [Sort by ▼] [↑ Asc] │ │ Tags: [ai] [ml] [nlp] [research]... │ ├─────────────────────────────────────────┤ │ ┌────────────────────────────────────┐ │ │ │ Title URL Tags │ │ │ ├────────────────────────────────────┤ │ │ │ Feed 1 example.com ai,ml │ │ │ │ Feed 2 test.com nlp │ │ │ └────────────────────────────────────┘ │ └─────────────────────────────────────────┘ ``` ### Performance Considerations 1. **Static Generation**: API routes use `force-static` for build-time generation 2. **Memoization**: Filter/sort operations use `useMemo` to prevent unnecessary recalculations 3. **Cache Headers**: Aggressive caching (1 hour fresh, 24 hour stale-while-revalidate) 4. **Client-Side Filtering**: All filtering happens client-side for instant responsiveness ### Type Safety All components use TypeScript with proper type annotations: * Event handlers have explicit types * Data structures are typed * API responses are validated ### Accessibility * Semantic HTML elements (``, `