How to Build a Self-Optimizing AI Trading System in Python

Stay up-to-date with macroeconomic news & investments without spending two hours reading Bloomberg every morning. Nothing here is financial advice.

by Alex G. | Mar 6, 2026 | Trading, Tutorial

// Summary

What started as a simple script to summarize financial headlines evolved into a multi-agent pipeline with five sequential Claude API calls, a self-calibrating feedback loop, live brokerage integration, and a unified dashboard.

This system reads macro news, derives a structured trade idea, tracks whether that idea was right, and uses that track record to calibrate its own future recommendations.

The feedback loop is the secret sauce that separates this from a prompt wrapper with a Discord webhook.

News Analysis Architecture

The news ingestion layer pulls the top macro stories of the day and structures them as JSON. The indicators layer fetches live market snapshots from yfinance. The Claude analysis pipeline runs five sequential API calls, each with a specific reasoning task. The storage layer is MySQL, with five core tables. The notification layer is Discord webhooks. And the feedback loop is what ties it all together: trade ideas get tracked automatically, outcomes get resolved against live prices, and accuracy statistics get fed back into the next Claude prompt.

// News Ingestion: Two Sources, One Format

The ingestion layer has two modes that produce identical outputs:

Comet Browser

The primary source is Comet, a browser-based Perplexity AI agent I can run manually. It scans financial news and writes a structured JSON file with the top five macro stories. Each story includes a headline, an impact score from one to ten, a macro theme like "Fed hawkish pivot" or "energy supply shock," a directional bias, and a list of affected sectors and key instruments.

Perplexity API

The fallback is the Perplexity API, using their Sonar model at roughly five cents per scan. When a cron job runs and the JSON file is older than sixteen hours, the system automatically triggers the Perplexity fallback. The Perplexity prompt is written to behave like a macro analyst: it is instructed to identify the top market-moving stories across Fed and ECB policy, inflation, PMI, geopolitics, energy, and credit, and to cross-check the headline narrative against actual recent price action before scoring impact.

The reason for two sources is cost and reliability. Comet is free, but manual. Perplexity API costs money, but runs unattended. The sixteen-hour staleness window means I can run Comet in the morning when I am at my desk and let Perplexity cover the automated runs.
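The staleness check that triggers the fallback can be sketched in a few lines. The file path and config names below are stand-ins, not the real ones; only the sixteen-hour window comes from the system described above:

```python
import time
from pathlib import Path

# Hypothetical path and threshold; the real config names may differ.
NEWS_FILE = Path("data/macro_news.json")
MAX_AGE_HOURS = 16

def needs_perplexity_fallback(news_file=NEWS_FILE,
                              max_age_hours=MAX_AGE_HOURS):
    """True when the Comet JSON is missing or older than the window."""
    if not news_file.exists():
        return True
    age_hours = (time.time() - news_file.stat().st_mtime) / 3600
    return age_hours > max_age_hours
```

A cron job can call this before deciding whether to spend the five cents on a Perplexity scan.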

High-impact stories, anything above a configurable impact threshold, also get queued for a more detailed research pass covering historical precedent, second-order effects, key upcoming dates, and potential entry and exit points. Those research summaries get stored and injected into the next Claude analysis call. It is the closest thing the pipeline has to a research analyst.

// The Five-Call Claude Reasoning Pipeline

This is the most architecturally interesting part of the system.

The pipeline is split into five calls because mixing different reasoning tasks in a single prompt makes the model anchor to information it should not have yet.

For example, if Claude knows that XLE is trading at $89.50 while it is trying to form a macro directional conviction, it anchors to that price. The reasoning gets contaminated before it even starts. Separating the calls forces clean cognitive boundaries.

Call One: Pure Macro Reasoning

The first call receives the top macro stories, the deep-dive research summaries for high-impact items, the current market indicators, and Claude's own historical accuracy statistics from the feedback loop.

It produces a market regime assessment, a sector prediction, one or two ticker recommendations, a directional thesis, a confidence score from zero to five, and a setup grade from A-plus down to C.

Critically, there are no prices in this call. The question being asked is purely: given what is happening in macro, what sector should move and in which direction, and how confident are we? Letting the model answer that without price anchoring produces cleaner directional conviction.

The grading system is worth explaining separately because it drives most of the downstream logic. Confidence and quality are related but not identical. A confidence score of four means Claude thinks the trade idea is well-supported. A setup grade of A-plus means the conditions are exceptional: the catalyst is unambiguous, the macro regime aligns with the thesis, the sector has not yet priced in the move, and the risk-to-reward profile is clean. An A-plus setup is rare by design. Most scans produce B or C grades, and that is intentional.
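To make call one's output concrete, here is a sketch of its shape as a dataclass. The field names are assumptions, not the actual schema:

```python
from dataclasses import dataclass

# Field names are illustrative; the real schema may differ.
@dataclass
class MacroAnalysis:
    regime: str       # e.g. "risk-off, hawkish Fed"
    sector: str       # predicted sector, e.g. "energy"
    tickers: list     # one or two ETF/stock symbols
    direction: str    # "long" or "short"
    thesis: str       # one-paragraph directional thesis
    confidence: int   # 0-5
    grade: str        # "A+" down to "C"

    def is_actionable(self):
        """Alerts and outcome tracking only trigger at confidence >= 2."""
        return self.confidence >= 2
```

Note that the structure contains no prices: those only enter the picture in call two.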

Call Two: Price Level Setting

The second call receives the output from call one plus live prices fetched via yfinance. Its only job is to calculate specific entry, target, and stop-loss levels for each recommended ticker.

The prompt constrains it meaningfully. Targets for sector ETFs should represent a five to fifteen percent move. Stop-losses should be placed at logical technical support levels, not at arbitrary percentage offsets. The minimum risk-to-reward ratio required is two-to-one. If that ratio cannot be achieved given current prices, the system notes it.

This is a genuinely different cognitive task from call one. The first call is asking what will happen. The second call is asking, given what we believe will happen and given where prices are right now, where exactly do we put our levels. Mixing these produces worse output on both dimensions. The first call loses directional conviction because it is thinking about price levels. The second call loses precision because it is re-litigating the macro thesis.

Price Correction Prompt
PYTHON
def _build_price_correction_prompt(result, prices):
    # Derive the interpolated pieces from the call-one result
    # and the live yfinance quotes.
    direction = result["direction"]  # "long" or "short"
    price_lines = " | ".join(
        f"{ticker}: ${price:.2f}" for ticker, price in prices.items()
    )  # e.g. "XLE: $89.50 | OXY: $52.10"
    return (
        "You are setting entry, target, and stop-loss levels "
        "for a swing trade.\n\n"
        f"Direction: {direction.upper()}\n"
        "Current prices (LIVE from market data):\n"
        f"{price_lines}\n"
        "RULES:\n"
        "- Target: 5-15% move for sector ETFs, "
        "8-20% for individual stocks.\n"
        "- Stop: logical support/resistance level, "
        "typically 3-7% from entry.\n"
        "- Risk/reward ratio should be at least 2:1.\n"
    )
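The two-to-one constraint is easy to enforce mechanically once call two returns its levels. A minimal sketch for the long-trade case, with hypothetical helper names:

```python
def risk_reward_ratio(entry, target, stop):
    """Reward per unit of risk for a long trade; shorts mirror the signs."""
    risk = entry - stop
    reward = target - entry
    if risk <= 0:
        raise ValueError("stop must sit below entry for a long trade")
    return reward / risk

def meets_minimum(entry, target, stop, minimum=2.0):
    """Flag levels that fail the 2:1 rule so the system can note it."""
    return risk_reward_ratio(entry, target, stop) >= minimum
```

Checking this in code rather than trusting the model's arithmetic catches the cases where the constraint simply cannot be met at current prices.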

Calls Three and Four: Polymarket Cross-Database Selection

These two calls are the most architecturally interesting part of the pipeline because they bridge two separate MySQL databases.

Some macro stories map cleanly onto open prediction market questions with binary outcomes. "Will the Fed cut rates at the March meeting?" is the kind of question where a strong macro thesis and a mispriced binary market can create a completely different risk profile than any equity trade. Calls three and four find and evaluate those matches.

Call three extracts three to five search keywords from the macro narrative. Those keywords query the separate polymarket_monitor database for active prediction markets with odds between five and ninety-five percent.

Extract keywords for search
PYTHON
# Returns: ["Fed rate cut", "March FOMC", "interest rates"]
kw_msg = client.messages.create(
    model=CLAUDE_MODEL, max_tokens=200, ...
)

Cross-database query against the Polymarket monitor's tables
MYSQL
query = """
    SELECT m.question, ms.yes_price, m.end_date
    FROM markets m
    INNER JOIN market_snapshots ms
        ON m.market_id = ms.market_id
    WHERE m.active = TRUE
      AND ms.yes_price BETWEEN 0.05 AND 0.95
      AND (m.question LIKE %s OR m.question LIKE %s ...)
"""

Call four evaluates whether the top matching market is mispriced given the macro thesis.

Is this market mispriced?
PYTHON
# Returns: {relevant, direction, edge, grade, confidence}
eval_msg = client.messages.create(
    model=CLAUDE_MODEL, max_tokens=300, ...
)

Call Five: Vehicle Selection

The fifth call runs only when confidence is two or higher. It receives the outputs from all previous calls plus, when available, a live options chain pulled from the Questrade API. Its job is to recommend whether the trade should be expressed as a stock position, an options play, or a Polymarket prediction market bet.

This matters because the same directional thesis has very different risk-to-reward profiles depending on how it is expressed. A high-conviction two-week thesis on energy sector upside looks completely different as a position in XLE versus a near-money call option expiring in three weeks. The options chain data lets Claude reason about implied volatility, strike selection, and premium cost relative to the expected move.

The setup grade from call one carries through to this decision and into the feedback loop, which will eventually answer whether A-grade accuracy is actually better than B-grade.

Best Play Prompt
PYTHON
def _build_best_play_prompt(result, prices, options_data):
    confidence = result["confidence"]  # 0-5 from call one
    grade = result["grade"]            # A+ down to C
    return (
        "You are recommending the optimal trade vehicle.\n\n"
        f"Confidence: {confidence}/5, Grade: {grade}\n"
        "Choose 'options' when: clear directional catalyst, "
        "defined timeline...\n"
        "Choose 'stock' when: thesis is strong but "
        "timing uncertain...\n"
        "Choose 'polymarket' when: catalyst maps to a "
        "binary event with mispriced odds.\n"
    )

// The Self-Calibrating Feedback Loop

This is the part that separates the system from a one-shot prompt wrapper.

The basic mechanic is straightforward: every trade alert with a confidence score of two or higher automatically creates a row in the trade_outcomes table. The resolver runs on each pipeline cycle, fetches live prices for all open positions, and checks whether the target or stop has been hit. If neither has been hit within twenty-eight days, the trade is closed at the current price and classified as a win, loss, or breakeven based on a configurable percentage threshold.

But the interesting part is what gets stored alongside the outcome, and what gets fed back into the next Claude prompt.

At the time a trade outcome is created, the system snapshots the market regime: VIX level, DXY six-hour percentage change, and SPY six-hour percentage change. It also stores the macro theme, the specific catalyst headline, the sector, and a thesis summary from the original analysis. The risk-to-reward ratio is calculated from the entry, target, and stop prices and stored at creation time.

At resolution, the regime indicators are snapshotted again. This means you can eventually ask questions like: did the same macro thesis produce different outcomes in high-VIX versus low-VIX environments? The answer to that question is worth more than a flat win rate.
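The creation-time snapshot can be a small pure function that takes the already-fetched quotes, which keeps it easy to test and to re-run identically at resolution. The VIX bucketing thresholds here are assumptions; the real system may slice regimes differently:

```python
from datetime import datetime, timezone

def regime_snapshot(vix, dxy_6h_pct, spy_6h_pct, when=None):
    """Capture the market regime alongside a trade_outcomes row.

    Callers pass in values fetched from yfinance; the same function
    runs again at resolution time for the second snapshot.
    """
    return {
        "taken_at": (when or datetime.now(timezone.utc)).isoformat(),
        "vix": vix,
        "dxy_6h_pct": dxy_6h_pct,
        "spy_6h_pct": spy_6h_pct,
        # Bucketing VIX up front makes the regime breakdowns trivial
        # to query later. Thresholds 18/25 are illustrative.
        "vix_regime": "high" if vix >= 25
                      else "medium" if vix >= 18
                      else "low",
    }
```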

The outcome model has six resolution states, not just win and loss.

The resolver checks direction first. For a long trade, if the current price hits the target the trade resolves as a win. If it hits the stop it resolves as a loss. If neither happens within twenty-eight days, the system uses peak tracking to make a more nuanced call. A trade that moved three percent in your favor at its best point before pulling back is not the same as one that went straight against you. The outcome direction_correct captures the first case. The outcome direction_wrong captures the second. Breakeven sits in the middle.

This distinction matters for the feedback loop. A trade that was right about direction but wrong about exit timing should teach the system something different than a trade that was simply wrong. Binary win and loss collapses that information.

The peak tracking runs continuously on every resolver check, not just at resolution. This means even if a trade gets stopped out, you have a record of whether it was ever profitable and by how much. That data feeds directly into the stop placement calibration question over time.
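The expiry-path classification can be sketched as a pure function over the entry, final, and peak prices. The percentage thresholds and function name are illustrative, not the real implementation; target and stop hits are assumed to have resolved earlier as plain wins and losses:

```python
def classify_expired_long(entry, final_price, peak_price,
                          breakeven_pct=1.0, favorable_pct=2.0):
    """Classify a long trade that expired after 28 days
    without hitting its target or its stop."""
    move_pct = (final_price - entry) / entry * 100
    peak_pct = (peak_price - entry) / entry * 100
    if abs(move_pct) <= breakeven_pct:
        return "breakeven"
    if move_pct > 0:
        return "direction_correct"
    # Finished against us: was it ever meaningfully profitable?
    return ("direction_correct" if peak_pct >= favorable_pct
            else "direction_wrong")
```

The peak price is the continuously tracked best level, which is what lets a trade that ran three percent in your favor before pulling back resolve differently from one that went straight against you.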

The accuracy stats that get injected into the next Claude prompt look like this once enough data accumulates:

  • Grade A+: direction correct 78% (7/9), target hit 5/9, avg peak +8.3%
  • VIX regime: high=3/4, medium=2/3, low=1/2
  • Themes: rates 3/4, geopolitics 1/2
  • Thesis confirmed: 71% (n=7)

The prompt tells Claude to use this to calibrate its confidence and grading. Whether it meaningfully adjusts its behavior based on this is still unproven. The feedback loop has been running for less than two weeks at time of writing. That is the honest answer.
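Turning resolved outcomes into those prompt-ready lines is a straightforward aggregation. A sketch with assumed dict keys, showing only the per-grade breakdown:

```python
from collections import defaultdict

def format_accuracy_stats(outcomes):
    """Turn resolved outcome rows into the bullet lines injected
    into call one. Keys are illustrative, not the real schema."""
    by_grade = defaultdict(lambda: [0, 0])  # grade -> [correct, total]
    for o in outcomes:
        bucket = by_grade[o["grade"]]
        bucket[1] += 1
        if o["resolution"] in ("win", "direction_correct"):
            bucket[0] += 1
    lines = []
    for grade in sorted(by_grade):
        correct, total = by_grade[grade]
        pct = round(100 * correct / total)
        lines.append(f"Grade {grade}: direction correct {pct}% ({correct}/{total})")
    return "\n".join(lines)
```

The same pattern repeats for the VIX-regime and theme breakdowns, keyed on the snapshotted fields instead of the grade.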

One other piece worth mentioning is thesis validation. Some trades will hit their target for the wrong reasons. A long on energy that wins because the Fed unexpectedly cut rates is not a validation of the energy supply thesis that generated the recommendation. Counting it as a win corrupts the accuracy signal. To address this, the schema includes a thesis_played_out boolean that defaults to null. A CLI command called --review-outcomes surfaces all recently resolved trades where this field has not been set, showing the ticker, outcome, percentage move, catalyst, and thesis summary side by side. Once a week I go through these and mark them. It takes about five minutes and is the most valuable manual input in the system.

Resolved outcomes are never deleted. The cleanup policy only removes unresolved expired rows. The entire point of building this is to accumulate a training dataset, and deleting it would defeat the purpose.

// Questrade and Polymarket Integrations

These are optional modules that degrade gracefully if disabled. The core pipeline runs without either of them.

Questrade

Questrade is a Canadian discount brokerage with a reasonably well-documented API. The --trade command fetches a live bid/ask quote, displays an order preview showing entry, target, stop, and total position value, enforces a configurable safety cap that rejects anything above five thousand dollars, and places a limit order. The order ID logs back to the database and a Discord notification confirms execution.

The TSX equivalents layer exists because most of Claude's sector recommendations are US-listed tickers. A curated mapping dictionary of thirty-five entries plus a yfinance probe for anything not in the list handles the translation automatically. XLE becomes XEG.TO. XLF becomes ZEB.TO. Gold via GLD becomes CGL.TO for the CAD-hedged version. This is a small thing that makes the system actually usable for a Canadian brokerage account.
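A sketch of that translation layer, using three of the curated mappings from above and an injectable probe in place of the yfinance lookup:

```python
# Three of the roughly thirty-five curated US-to-TSX mappings.
TSX_EQUIVALENTS = {
    "XLE": "XEG.TO",   # energy sector
    "XLF": "ZEB.TO",   # financials / banks
    "GLD": "CGL.TO",   # gold, CAD-hedged
}

def tsx_equivalent(us_ticker, probe=None):
    """Curated mapping first, then a fallback probe.

    `probe` stands in for the yfinance check (e.g. trying
    "<TICKER>.TO") so the function stays testable offline.
    """
    if us_ticker in TSX_EQUIVALENTS:
        return TSX_EQUIVALENTS[us_ticker]
    return probe(us_ticker) if probe else None
```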

Token refresh is handled automatically. Every Questrade API call is wrapped in a retry function that catches 401 errors, clears the authentication singleton, reinitializes, and retries once before surfacing the failure. In practice this means the integration works across long-running cron sessions without any manual intervention.

Polymarket

The Polymarket integration reuses the cross-database machinery from calls three and four of the pipeline: keywords extracted from the macro narrative query the separate polymarket_monitor database for active prediction markets with odds between five and ninety-five percent. Markets near zero or one hundred percent are excluded because there is no meaningful edge in near-certain outcomes, and candidates are ranked by keyword match count before evaluation. If the implied odds look off relative to what the macro analysis suggests, the system flags a potential bet with an edge estimate and a grade.

The interesting thing about this conversion is that it takes a continuous directional thesis and forces it into a binary probability. A macro view that the Fed is likely to cut rates in March maps onto a Polymarket question trading at forty-five cents as a potential edge, with a completely different risk profile than any equity position. Sometimes that is actually the cleaner trade.

// Discord Notifications

Alerts are color-coded by conviction level, from grey at confidence two up through blue, yellow, orange, and red at confidence five. Each embed includes the setup grade and confidence in the footer, the directional thesis as the title, and structured fields for ticker, entry, target, stop, timeline, and TSX alternatives.

If a Polymarket match was found it appears as an additional field showing the question, current odds, and the edge estimate. Active position alerts show up in the same embed when today's catalyst affects something already held, with a suggested action of hold, tighten stop, take profit, or close.

An in-memory deduplication cache with a five-minute window prevents the same alert from firing twice when concurrent pipeline runs overlap.
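A sketch of such a cache with an injectable clock for testing; locking for truly concurrent runs is omitted, and the key shape is an assumption:

```python
import time

class DedupCache:
    """Suppress duplicate alerts fired within a short window.

    Keys might be (ticker, direction, grade) tuples; the five-minute
    window is from the system above, the implementation is a sketch.
    """
    def __init__(self, window_seconds=300, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self._seen = {}

    def should_send(self, key):
        now = self.clock()
        # Drop expired entries so the cache cannot grow without bound.
        self._seen = {k: t for k, t in self._seen.items()
                      if now - t < self.window}
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```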

// Honest Limitations and What I Am Watching

The system has been live for less than two weeks at the time of writing. Everything said above about the feedback loop is architectural, not empirical. I need a minimum of one hundred resolved outcomes before the accuracy statistics mean anything statistically. Right now the system generates signals and tracks them, but whether the feedback loop actually changes Claude's output in useful ways is a hypothesis, not a finding.

The news lag is real. Comet runs manually at scheduled times. A major event at two in the afternoon may not enter the pipeline until the next time I run a scan. This is not a real-time system and should not be treated as one for anything time-sensitive.

Claude's confidence scores are not calibrated probabilities. A confidence of four out of five does not mean an eighty percent win rate. The feedback loop is supposed to discover the actual relationship between those numbers over time, but that relationship is still unknown.

The execution gap between signal and outcome is real and not accounted for anywhere in the system. The pipeline recommends a trade at a specific entry price. Whether you actually execute it, at what size, and with what slippage, is entirely separate from whether the signal was right.

What I am watching over the next ninety days: does A-plus outperform A in practice? Does the system show better direction accuracy in high-VIX environments than low-VIX? Do thesis confirmation rates correlate with actual win rates? Does direction_correct turn out to be a more useful signal than binary win and loss? These questions have real answers. I just do not have the data yet.

 

// What I Would Build Differently

Structured outputs instead of JSON parsing. The current five-call pipeline relies on parsing JSON from Claude's free-form text. It is fragile. A malformed response breaks the pipeline. Anthropic's structured outputs feature would eliminate this and is the first refactor I would make if starting over today.

API ingestion instead of Comet. Comet is free and excellent for prototyping, but it is manual. A production version would ingest from a proper financial news API. The structured JSON format is already defined, so swapping the ingestion layer is a clean refactor without touching anything else.

Vector similarity for story deduplication. The same story can enter the pipeline across multiple scans if it stays in the news cycle for several days. A vector similarity check against recent headlines would prevent redundant analysis and produce cleaner outcome attribution.

A backtesting harness. Before going live I would want to replay several months of historical macro news through the pipeline with frozen indicator snapshots, to get a rough estimate of how quickly the feedback loop converges and whether the grading system shows any predictive signal at all. Running it blind in production is intellectually interesting but not rigorous.

Separate signal from execution. Analysis and order placement currently live in the same codebase. For anything at real scale you want a clean API boundary between the signal service and the execution layer so they can be deployed, debugged, and scaled independently.

Smarter stop placement using peak tracking data. The peak tracking already records whether a trade moved favorably before getting stopped out. Feeding that data back into the price-level prompt could let the system auto-calibrate stop distance over time, rather than relying on Claude's judgment about support levels from a static prompt.

// What Comes Next

The immediate priority is letting the feedback loop accumulate data. I am trading small size on A-plus and A setups while the system builds its track record. After one hundred resolved outcomes I will publish a follow-up with the actual accuracy data, broken down by grade, regime, and macro theme.

The third tab on the Trading Hub dashboard is in progress, pulling on-chain data from Dune Analytics and direct protocol APIs. The longer-term goal is to correlate macro signals with cross-chain money flows, since capital moves across blockchains in patterns that sometimes precede equity-visible moves in related sectors.

The code is private for now. Whether it goes public depends entirely on what the feedback loop shows after six months of live data.

If you have built something similar or have thoughts on the architecture, I would like to hear about it.

Alex Grant

Blockchain Infrastructure & Security Analyst

Hi, I'm Alex, founder of Web3Fuel. My goal is to simplify complex blockchain concepts and provide fuel for the growth of Web3.

Currently seeking: Technical Writer, Content Strategist, and Developer Relations roles at blockchain protocols and infrastructure companies.