Fishwrap Config Schema

A fishwrap config is a Python file (typically config.py) that defines a set of top-level variables the engine reads. This document describes every variable the engine recognizes: which are required, which are optional, what types they accept, and what they do.

The validator (fishwrap.validate_config, run via fishwrap-validate-config <path>) is the source of truth for the schema. If this document and the validator ever disagree, the validator wins.

A typical workflow: edit your config.py, then before running the pipeline, validate it:

docker run --rm -v $(pwd):/cfg ghcr.io/maxspevack/fishwrap:2.0 \
    fishwrap-validate-config /cfg/config.py

If validation passes, the config is structurally sound. If it fails, the validator prints one error per line to stderr; fix and re-run.


Minimal Valid Config

The smallest config that passes validation:

FEEDS = ["https://example.com/feed.xml"]
SECTIONS = [{"id": "news", "title": "News"}]

This will run, but with KEYWORDS = {} (the default) every article gets fallback classification. To produce a meaningful edition, you almost certainly want the optional keys below.


Required Keys

These two keys must be present. The engine’s defaults for them are not useful: FEEDS = [] means nothing to fetch, SECTIONS = [] means no buckets to put articles in.

FEEDS

Type: non-empty list of strings.

The URLs to fetch. Each is an HTTP(S) URL serving an RSS feed, JSON feed, or one of the supported variants (Reddit .json endpoints, hnrss.org URLs, etc.).

FEEDS = [
    "https://hnrss.org/frontpage",
    "https://feeds.npr.org/1001/rss.xml",
    "https://old.reddit.com/r/news.json",
]

SECTIONS

Type: non-empty list of dicts. Each entry must have id (string) and title (string). Other fields (e.g., description) are accepted but not required.

The output sections of the published edition. Each becomes a heading in the rendered HTML. Articles route into sections based on KEYWORDS (below) or SOURCE_SECTIONS.

SECTIONS = [
    {"id": "news",    "title": "News",    "description": "What happened today."},
    {"id": "tech",    "title": "Tech",    "description": "Computers and the people who run them."},
    {"id": "culture", "title": "Culture"},
]

Optional Keys — Structural

If absent, the engine uses the defaults defined in fishwrap/_config.py. If present, the type and shape are validated.

KEYWORDS

Type: dict mapping section ID (string) to a list of keyword strings.

Used by the editor to classify articles. An article whose title or content contains a keyword matched to a section gets routed there. Default: {} (every article goes to fallback classification).

KEYWORDS = {
    "news":    ["politics", "election", "government"],
    "tech":    ["software", "hardware", "ai"],
    "culture": ["film", "music", "book"],
}

EDITORIAL_POLICIES

Type: list of dicts. Each entry must have type (string). Most types accept boosts (number) and other type-specific fields.

The editorial knobs: keyword boosts, source boosts, domain penalties, content boosts. Each policy is checked against every article and applied if it matches.

EDITORIAL_POLICIES = [
    {"type": "keyword_boost",  "phrases": ["breaking"],                            "boosts": 5},
    {"type": "domain_penalty", "domains": ["lowqualitysource.example.com"],        "boosts": -10},
    {"type": "source_boost",   "match":   "krebsonsecurity.com",                   "boosts": 3},
]

Supported policy types (see fishwrap/scoring.py for the matching logic):

type value Matches on Required fields
keyword_boost / keyword_penalty Title + content text phrases (list of strings), boosts (number)
source_boost Source URL substring match (string), boosts
domain_penalty Source URL substring or domain list match (string) and/or domains (list of strings), boosts
content_boost Content text only phrases (list of strings), boosts

SCORING_PROFILES

Type: dict with at least dynamic and static keys, each mapping to a dict of weights.

Profiles control how raw stats (upvotes, comments) translate into score points. The fetcher decides which profile applies per article based on whether the source provides stats.

SCORING_PROFILES = {
    "dynamic": {"base_boosts": 0,  "score_weight": 1.0, "comment_weight": 1.0},
    "static":  {"base_boosts": 10, "score_weight": 0,   "comment_weight": 0},
}

The default ships with sensible values. Override only if you have specific tuning needs.

EDITION_SIZE

Type: dict mapping section ID (string) to integer.

Caps how many articles each section publishes per edition. Articles ranked below the cap get cut. Default: {} (no caps; EDITION_SIZE.get(section, 5) is the engine’s fallback).

EDITION_SIZE = {"news": 15, "tech": 10, "culture": 5}

MIN_SECTION_SCORES

Type: dict mapping section ID (string) to number.

Articles whose score is below this floor get filtered out of the section, even if they would otherwise fit under EDITION_SIZE. Default: {} (no minimum).

MIN_SECTION_SCORES = {"news": 100, "tech": 50}

SOURCE_SECTIONS

Type: dict mapping source-URL substring (string) to section ID (string).

Lets you override classification by source: any article from a feed whose URL contains the matching substring goes to the named section regardless of keyword match. Default: {}.

SOURCE_SECTIONS = {
    "krebsonsecurity.com": "tech",
    "deadline.com":        "culture",
}

VISUAL_THRESHOLDS

Type: dict.

Per-section thresholds that the printer uses to decide visual emphasis (“lead” vs “feature” vs “compact” article presentation). Implementation detail — consult fishwrap/printer.py for the current shape if you need to override.


Optional Keys — Paths and Strings

Key Type Default Notes
DATABASE_URL string 'sqlite:///newsroom.db' SQLAlchemy URL. SQLite is the only tested backend. Relative SQLite paths resolve against process CWD
LATEST_HTML_FILE string 'index.html' Path where the rendered HTML edition is written
LATEST_PDF_FILE string 'edition.pdf' Path for PDF output. Currently inert in v2.0.x — the image does not include weasyprint, so PDF generation is silently skipped (logged as PDF generation skipped: weasyprint not installed). Setting this is harmless
RUN_SHEET_FILE string 'run_sheet.json' Where the editor writes the per-run record consumed by the auditor
ENHANCED_ISSUE_FILE string 'enhanced_issue.json' Intermediate artifact between the enhancer and printer
STATS_FILE string 'publication_stats.json' Aggregate publication statistics, appended on each run
SECRETS_FILE string 'secrets.json' JSON file mapping source-specific keys to auth material (cookies, API tokens). Absent = sources requiring auth fetch in degraded mode
ARTICLES_DB_FILE string 'articles_db.json' Deprecated; kept for backward compatibility. Has no effect when DATABASE_URL is a SQLite URL (default since v1.2)
USER_AGENT string engine default HTTP User-Agent for the fetcher and enhancer. Be a polite citizen
FOUNDING_DATE string '2024-01-01' ISO date the publication started; used to compute issue numbers (days since founding + 1)
TIMEZONE string 'UTC' Display timezone for timestamps in the rendered edition. Accepts IANA names ('America/Los_Angeles'); legacy aliases ('US/Pacific') work in v2.0.x because the image ships tzdata-legacy, but a future release may drop the legacy package and require IANA-canonical names
THEME string 'basic' Path to the theme directory (templates + static). Relative paths resolve against process CWD

Optional Keys — Numeric

Key Type Default Notes
BOOST_UNIT_VALUE number 100 The point value of one “boost unit.” Editorial policies declare boosts as integer counts; each count multiplies by this value to produce the score delta
FUZZY_BOOST_MULTIPLIER number 1.5 When deduplication merges two articles, the surviving article’s score is incremented by this factor of BOOST_UNIT_VALUE to reflect the duplicate’s signal
EXPIRATION_HOURS number 24 Articles older than this are not considered for the published edition
MAX_ARTICLE_LENGTH number 10000 Truncation cap for content fields

What the Validator Does NOT Check

The validator catches schema-shaped errors. It does not validate semantics. The following are runtime concerns; the engine surfaces them as errors when you actually run the pipeline:

  • Reachability of feed URLs. The validator does not probe FEEDS URLs. A FEEDS entry that 404s passes validation; the fetcher will report the failure at runtime.
  • Timezone string validity. TIMEZONE = "Europe/Atlantis" passes validation; it crashes at runtime when ZoneInfo looks it up.
  • File path existence. Theme directories, secret files, output destinations — none are checked at validation time.
  • Cross-key consistency. If EDITION_SIZE = {"news": 10} but SECTIONS = [{"id": "tech", ...}], no section ID matches between the two; validation passes but no articles ever reach the printer.

If you find yourself wanting one of these checks, run the pipeline against the config in a development environment first and observe.


Adding Your Own Keys

The validator is permissive on unknown keys: variables it does not recognize are ignored, not flagged. You can add your own top-level config variables and reference them from custom theme templates or downstream scripts. They simply won’t be looked at by the engine itself.

MY_PUBLICATION_NAME = "The Daily Clamour"
MY_FOUNDING_AUTHOR  = "Max Spevack"

These pass validation and travel through to wherever your downstream code reads them.


Fishwrap is open source software licensed under the Apache License 2.0.