> For the complete documentation index, see [llms.txt](https://ai-os-and-trend-finder.gitbook.io/ai-os-and-trend-finder-docs/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://ai-os-and-trend-finder.gitbook.io/ai-os-and-trend-finder-docs/.spec_system/archive/sessions/phase28-session14-direct-first-party-source-adapters/spec.md).

# Session Specification

**Session ID**: `phase28-session14-direct-first-party-source-adapters` **Phase**: 28 - Trend Finder Trends-Finderz Adoption **Status**: Not Started **Created**: 2026-06-14

***

## 1. Session Overview

This session moves four Trend Finder collection roles from paid or limited adapter paths toward direct first-party endpoints where the source terms allow it. The target roles are arXiv research metadata, GitHub public repository search, RSS/news feed metadata, and Hacker News keyword search through the HN Algolia endpoint. The work preserves existing source IDs and source-health labels so downstream scoring, spend accounting, collection health, source setup, and operator docs keep a stable contract.

The first deliverable is compliance re-review. Current source documents mark direct arXiv, GitHub, and RSS adapters as deferred, and the HN document covers top-stories collection rather than keyword search. No direct network path should be enabled until the matching source document records rate limits, terms, retention, attribution, field boundaries, and fallback behavior for that path.

After the compliance gate, the implementation adds direct adapters with bounded keyword input, timeout and retry handling, explicit readiness states, zero-cost spend labels, and Apify declarations retained as reviewed fallbacks. If a direct adapter is blocked by compliance or runtime readiness, the existing Apify declaration remains available instead of silently dropping that source role from collection.

***

## 2. Objectives

1. Re-review and update the arXiv, GitHub, RSS/news, and Hacker News source compliance documents before enabling direct API collection.
2. Add direct arXiv, GitHub, and RSS adapters plus HN Algolia keyword search with bounded keyword windows, normalization, timeouts, retries, and rate-limit handling.
3. Preserve Apify declarations as reviewed fallbacks while suppressing duplicate Apify collection when a matching direct source succeeds.
4. Publish connector readiness, source health, and zero-cost spend labels for direct paths so operators can distinguish ready, degraded, disabled, and fallback states before collection starts.

***

## 3. Prerequisites

### Required Sessions

* [x] `phase28-session13-keyword-packs-rotation-and-coverage` - provides the reviewed keyword window, source cap summaries, and coverage QA that direct keyword adapters will consume.
* [x] `phase28-session02-signal-quality-score-and-collection-health` - provides collection health and source coverage counters that direct source rows must feed.
* [x] `phase24-session04-source-setup-target-configuration` - provides source setup declarations, diagnostics, and reviewed source configuration boundaries.

### Required Tools/Knowledge

* Bun 1.3.14, TypeScript, Vitest, Zod, and the existing Trend Finder source adapter contracts.
* Current adapter patterns in `scripts/extensions/trend-finder/sources/hn-adapter.ts`, `scripts/extensions/trend-finder/sources/apify-adapter.ts`, and `scripts/extensions/trend-finder/collector.ts`.
* Current source declarations in `scripts/extensions/trend-finder/sources/apify-source-config.ts`.
* Trends-Finderz reference adapters under `EXAMPLES/trends-finderz/lib/sources/` and readiness patterns under `EXAMPLES/trends-finderz/lib/scan/connector-readiness.ts`.

### Environment Requirements

* No live Apify credential is required for direct adapter tests.
* Optional `GITHUB_TOKEN` stays script-side only and must not enter browser state, generated data, logs, or trace payloads.
* Direct RSS collection must use only reviewed public feeds; private, local, authenticated, or unreviewed feeds are disabled before network calls.
* Direct adapters must fail closed when compliance status or readiness is not reviewed, ready, or explicitly enabled for collection.

***

## 4. Scope

### In Scope (MVP)

* Trend Finder maintainers can review source compliance updates for arXiv, GitHub, RSS/news, and HN Algolia before direct adapter code is activated.
* Trend Finder collection can fetch arXiv metadata directly from `export.arxiv.org/api/query` with single-connection pacing, timeout/retry handling, metadata-only normalization, and no PDFs or author contact data.
* Trend Finder collection can fetch public GitHub repository metadata from the REST search endpoint with optional script-side token support, rate-limit header handling, bounded query terms, public canonical URLs, and no private repository, email, profile, issue body, or code-content collection.
* Trend Finder collection can fetch reviewed RSS/Atom feed items directly with per-feed approval, conditional request metadata where available, item caps, timeout/retry handling, and metadata-only snippets.
* Hacker News collection supports keyword queries through the Algolia `search_by_date` endpoint while preserving the no-author, no-comment-content data boundary.
* Connector readiness reports disabled, degraded, and ready states before direct network work begins, including missing optional token, compliance blocked, rate limited, timeout, empty response, and fallback decisions.
* Direct source rows keep existing source IDs where applicable: `arxiv-ai-papers`, `github-ai-repositories`, `rss-ai-news`, and `hackernews`.
* Spend accounting keeps these direct sources visible as zero-cost public API sources instead of dropping them from the spend table.
* Apify declarations remain as reviewed fallbacks; they are skipped only when a matching direct source succeeds or is explicitly selected as the active path.
* Tests cover normalization, retries, timeouts, rate-limit handling, readiness, direct-vs-Apify fallback behavior, zero-cost spend labels, and no activation after collection starts.

### Out of Scope (Deferred)

* YouTube direct adapter - *Reason: official Data API quota and privacy work is env-gated and explicitly out of scope for this session.*
* Removing Apify source declarations - *Reason: they remain reviewed fallbacks and preserve operator choice.*
* Adding new source roles beyond arXiv, GitHub, RSS/news, and HN - *Reason: this session is constrained to existing compliance docs and source IDs.*
* Hosted persistence or database-backed source cache - *Reason: durable-store direction is owned by the separate observation-store plan.*
* Browser controls that enable unreviewed direct feeds or arbitrary direct API query input - *Reason: reviewed script-side source configuration remains the collection boundary.*

***

## 5. Technical Approach

### Architecture

Add a small direct-source layer under `scripts/extensions/trend-finder/sources/` that shares timeout, retry, backoff, safe URL, XML/JSON parsing, readiness, and zero-cost spend helpers. Direct adapters should return the same `SourceAdapterResult` shape as the existing HN adapter, including browser evidence, analyst evidence, source summary, spend summary, and warnings.

The collector should resolve direct-source readiness before collection starts, run direct adapters first, and pass active direct source IDs into the Apify path so Apify is not run for sources already collected directly. If a direct adapter is disabled, blocked, offline, rate limited, or degraded without usable evidence, the Apify declaration remains eligible as the reviewed fallback.

Normalization should stay in script code and emit only existing browser-safe evidence fields. The browser receives source rows, source setup diagnostics, collection health, and spend summaries; it does not receive raw API responses, tokens, private paths, rate-limit headers, feed bodies, XML payloads, Actor IDs, Dataset IDs, or stack traces.

### Design Patterns

* Compliance before collection: update source-specific compliance docs before enabling each matching adapter.
* Script-only runtime boundary: all direct API calls, optional tokens, and raw responses stay under `scripts/`.
* Disabled-first collector pattern: unreviewed, misconfigured, or restricted sources return disabled or blocked readiness before network calls.
* Pure helper first: normalize, parse, retry, and readiness logic in isolated functions before collector integration.
* Fallback without duplication: Apify runs only when a matching direct source did not produce an active reviewed result.
* Project before rendering: source setup and UI read bounded diagnostics, not raw source response details.

### Technology Stack

* TypeScript 6, Bun 1.3.14, Vitest.
* Existing Trend Finder script collector and `SourceAdapter` contract.
* Existing Zod schema defaults for browser payload compatibility.
* Existing React 19 and dense Trend Finder source setup surfaces for readiness display.

***

## 6. Deliverables

### Files to Create

| File                                                                                | Purpose                                                                                                                                    | Est. Lines |
| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ---------- |
| `scripts/extensions/trend-finder/sources/direct-source-utils.ts`                    | Shared direct-source fetch, timeout, retry, safe URL, XML/JSON parsing, and sanitized warning helpers.                                     | \~280      |
| `scripts/extensions/trend-finder/sources/direct-source-readiness.ts`                | Direct adapter readiness declarations, disabled/degraded/ready labels, fallback decisions, and diagnostics.                                | \~220      |
| `scripts/extensions/trend-finder/sources/arxiv-adapter.ts`                          | Direct arXiv Atom API adapter with reviewed query building, pacing, metadata-only normalization, and source health.                        | \~260      |
| `scripts/extensions/trend-finder/sources/github-adapter.ts`                         | Direct GitHub repository search adapter with optional script token, rate-limit handling, public metadata normalization, and source health. | \~260      |
| `scripts/extensions/trend-finder/sources/rss-adapter.ts`                            | Direct RSS/Atom feed adapter with reviewed feed allowlist, conditional request metadata, item caps, and source health.                     | \~260      |
| `scripts/extensions/trend-finder/sources/__tests__/direct-source-utils.test.ts`     | Unit tests for timeout, retry, safe URL, sanitized warning, JSON/XML, and abort handling.                                                  | \~180      |
| `scripts/extensions/trend-finder/sources/__tests__/arxiv-adapter.test.ts`           | Unit tests for arXiv query construction, pacing, parsing, metadata exclusions, retries, and zero-cost spend labels.                        | \~220      |
| `scripts/extensions/trend-finder/sources/__tests__/github-adapter.test.ts`          | Unit tests for GitHub query bounds, optional token boundary, rate-limit handling, normalization, and PII exclusions.                       | \~220      |
| `scripts/extensions/trend-finder/sources/__tests__/rss-adapter.test.ts`             | Unit tests for reviewed feed allowlist, RSS/Atom parsing, conditional metadata, relevance filtering, and disabled feed states.             | \~220      |
| `scripts/extensions/trend-finder/sources/__tests__/direct-source-readiness.test.ts` | Unit tests for readiness status derivation, fallback selection, diagnostics, and no activation after collection starts.                    | \~180      |

### Files to Modify

| File                                                                           | Changes                                                                                                                     | Est. Lines |
| ------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------- | ---------- |
| `docs/sources/source-compliance-arxiv.md`                                      | Re-review direct arXiv API terms, rate limits, retention, approved fields, and adapter checklist.                           | \~90       |
| `docs/sources/source-compliance-github.md`                                     | Re-review direct GitHub search API terms, optional token boundary, rate limits, PII exclusions, and adapter checklist.      | \~100      |
| `docs/sources/source-compliance-rss-news.md`                                   | Re-review direct RSS/feed use, per-feed approval, cache/cadence handling, and adapter checklist.                            | \~100      |
| `docs/sources/source-compliance-hackernews.md`                                 | Re-review HN Algolia keyword search boundary, rate expectations, field exclusions, and adapter checklist.                   | \~80       |
| `docs/sources/apify-source-onboarding.md`                                      | Document direct first-party path gating, fallback-to-Apify behavior, and spend-label preservation.                          | \~80       |
| `scripts/extensions/trend-finder/sources/types.ts`                             | Add direct readiness, direct source declaration, fallback policy, and diagnostics contracts.                                | \~120      |
| `scripts/extensions/trend-finder/sources/hn-types.ts`                          | Add HN Algolia endpoint constants and typed search response fields.                                                         | \~80       |
| `scripts/extensions/trend-finder/sources/hn-adapter.ts`                        | Add keyword-window Algolia collection path with fallback-safe top-stories behavior, retries, and source health.             | \~180      |
| `scripts/extensions/trend-finder/normalize.ts`                                 | Add script-side direct arXiv, GitHub, RSS, and HN Algolia normalization helpers with browser-safe fields.                   | \~220      |
| `scripts/extensions/trend-finder/sources/apify-source-config.ts`               | Link direct adapters to existing Apify source declarations and record fallback metadata without removing declarations.      | \~100      |
| `scripts/extensions/trend-finder/sources/apify-adapter.ts`                     | Accept direct-active skip decisions and return explicit skipped fallback summaries without duplicate paid runs.             | \~130      |
| `scripts/extensions/trend-finder/sources/source-setup.ts`                      | Include direct readiness diagnostics and direct-vs-Apify status in source setup summaries.                                  | \~140      |
| `scripts/extensions/trend-finder/collector.ts`                                 | Register direct adapters, resolve readiness before collection, suppress duplicate Apify runs, and trace fallback decisions. | \~240      |
| `scripts/extensions/trend-finder/spend-accounting.ts`                          | Ensure public API zero-cost source labels remain visible and aggregate correctly with Apify fallbacks.                      | \~60       |
| `src/extensions/trend-finder/schema.ts`                                        | Add additive defaults for direct readiness diagnostics in source setup and source summaries.                                | \~120      |
| `src/extensions/trend-finder/view-model.ts`                                    | Project direct readiness and fallback labels into Source Setup and source health view models.                               | \~120      |
| `src/extensions/trend-finder/components/source-setup-panel.tsx`                | Render direct readiness and fallback labels without adding direct source mutation controls.                                 | \~90       |
| `scripts/extensions/trend-finder/sources/__tests__/hn-adapter.test.ts`         | Cover HN Algolia keyword search, disabled/fallback behavior, retries, timeouts, and field exclusions.                       | \~180      |
| `scripts/extensions/trend-finder/sources/__tests__/apify-adapter.test.ts`      | Cover direct-active source skipping and Apify fallback eligibility when direct paths fail.                                  | \~160      |
| `scripts/extensions/trend-finder/__tests__/collector.test.ts`                  | Cover direct-first collection order, fallback routing, trace diagnostics, and no activation after collection starts.        | \~180      |
| `scripts/extensions/trend-finder/__tests__/spend-accounting.test.ts`           | Cover zero-cost public API source labels and mixed direct-plus-Apify spend aggregation.                                     | \~80       |
| `src/extensions/trend-finder/components/__tests__/source-setup-panel.test.tsx` | Cover direct readiness labels, disabled states, fallback labels, and re-entry rendering.                                    | \~120      |
| `docs/extensions/trend-finder-sources.md`                                      | Document shipped direct adapters, zero-cost labels, readiness states, fallbacks, and deferred direct candidates.            | \~100      |

***

## 7. Success Criteria

### Functional Requirements

* [ ] Direct-source compliance review is recorded before each matching direct network path is activated.
* [ ] arXiv, GitHub, and RSS/news can collect through direct first-party paths when reviewed and ready; HN supports keyword queries through Algolia.
* [ ] Direct source rows preserve existing source IDs, source roles, quality tiers, half-life behavior, and source health semantics.
* [ ] Apify declarations remain available as reviewed fallbacks and do not run as duplicates when a matching direct source succeeds.
* [ ] Connector readiness reports disabled, degraded, ready, timeout, rate limited, empty, blocked, and fallback states before network collection.
* [ ] Spend accounting labels direct sources as zero-cost public API sources and keeps them visible in per-run spend summaries.
* [ ] Optional GitHub token support is script-only and never appears in browser state, source summaries, traces, warnings, tests snapshots, or logs.
* [ ] Any adapter blocked by compliance review is explicitly deferred in docs and readiness output, not shipped as an active network path.

### Testing Requirements

* [ ] Unit tests cover direct utility timeout, retry, abort, safe URL, XML/JSON, sanitized warning, and backoff behavior.
* [ ] Adapter tests cover arXiv, GitHub, RSS, and HN normal paths, empty responses, parse failures, rate limits, timeouts, retries, field exclusions, and zero-cost spend labels.
* [ ] Collector tests cover direct-first ordering, direct-active Apify skip, direct-failed Apify fallback, readiness freeze before collection, and trace diagnostics.
* [ ] Source setup and view-model tests cover direct readiness labels, disabled states, fallback labels, long text bounds, and re-entry behavior.
* [ ] Focused script tests, relevant app tests, TypeScript checks, and ASCII validation pass or document unrelated pre-existing failures.

### Non-Functional Requirements

* [ ] No new dependency, hosted storage path, database schema, admin write surface, browser direct-source mutation path, or unreviewed transfer path is introduced.
* [ ] Browser-visible data excludes raw API responses, XML bodies, feed bodies, author contact fields, token-shaped strings, private paths, Actor IDs, Dataset IDs, raw rate-limit headers, and stack traces.
* [ ] Direct collection respects source-specific item caps, keyword caps, timeout budgets, retry limits, rate-limit guidance, and attribution requirements.
* [ ] Generated payload size remains bounded and existing collection health, static export, and Reference mode trust boundaries are not widened.

### Quality Gates

* [ ] All files ASCII-encoded
* [ ] Unix LF line endings
* [ ] Code follows project conventions
* [ ] Focused tests pass
* [ ] Script and app type checks pass or unrelated pre-existing failures are documented

***

## 8. Implementation Notes

### Key Considerations

* The existing source IDs are part of downstream scoring, source setup, spend, fixtures, and docs. Prefer preserving them over introducing new direct source IDs.
* Direct adapters should build analyst evidence from browser evidence or share normalization helpers so the analyst path and browser path stay consistent.
* GitHub optional auth must use `GITHUB_TOKEN` only from script env. Do not add a `VITE_` token variable or browser setup control for this session.
* RSS direct collection needs a reviewed feed allowlist. Arbitrary operator feed URLs are not approved unless the compliance docs record the per-feed review boundary.
* Apify fallback should remain visible in readiness and diagnostics even when direct paths are active, because operators may need the reviewed fallback if a first-party API changes or rate limits.

### Potential Challenges

* Compliance blocker: if a source review does not clear direct collection, keep that source disabled, preserve the Apify fallback, and record the deferral.
* Duplicate source rows: direct success and Apify fallback must not both count as evidence for the same source ID in one run.
* Rate-limit semantics: GitHub and arXiv require different behavior, so avoid a one-size-fits-all retry policy that ignores reset or pacing guidance.
* RSS feed variance: parse RSS and Atom defensively, cap snippets, and treat malformed feeds as degraded without leaking feed bodies.
* HN behavior change: Algolia keyword search changes the evidence universe, so tests and docs need to call out the current-only keyword path.

### Relevant Considerations

* \[P02] **New source adapters need compliance review**: This session sequences source docs before network adapter activation.
* \[P06] **Apify actor outputs remain operator-dependent**: Apify declarations remain fallbacks, with explicit blocked and degraded states.
* \[P02] **Extension payloads and labels stay bounded**: Direct readiness and source diagnostics must be bounded labels, not raw responses.
* \[P05] **Do not let source activation happen after collection starts**: Direct readiness is resolved before collection and frozen for the run.
* \[P27] **Reference manuals track shipped behavior only**: Direct source docs should describe only active, reviewed behavior and explicit deferrals.
* \[P02] **Disabled-first collector pattern**: Direct adapters start disabled or blocked unless reviewed and configured.
* \[P05] **Script-only runtime boundary**: Direct API calls, optional tokens, and raw responses stay under `scripts/`.

### Behavioral Quality Focus

Checklist active: Yes

Top behavioral risks for this session:

* A direct adapter widens collection before its source compliance doc is reviewed.
* A direct success and Apify fallback both run for the same source ID, inflating evidence volume and spend labels.
* Optional token, raw API response, feed body, private path, Actor ID, Dataset ID, or stack trace leaks into browser-visible data or logs.
* Readiness changes during a collection run and activates a source after the compliance gate was evaluated.

***

## 9. Testing Strategy

### Unit Tests

* Test direct utility helpers for timeout, abort, retry/backoff, safe URL, redaction, XML tag extraction, JSON validation, and sanitized warnings.
* Test arXiv parsing, query construction, pacing hooks, metadata-only output, PDF exclusion, empty feed handling, and offline/degraded states.
* Test GitHub query bounds, optional token header construction without leakage, rate-limit stop behavior, public repository normalization, and PII exclusions.
* Test RSS/Atom parsing, reviewed feed allowlist, malformed feed handling, conditional metadata capture, snippet limits, and private/local URL rejection.
* Test HN Algolia keyword search, fallback behavior, field exclusions, timeout and retry behavior, and current-only historical support.

### Integration Tests

* Test collector direct-first ordering, direct-active Apify skip, direct-failed Apify fallback, source health rows, collection health counts, and trace diagnostics.
* Test source setup projection of direct readiness and fallback labels through schema defaults and view models.
* Test spend aggregation for public API zero-cost sources mixed with estimated or exact Apify fallbacks.

### Manual Testing

* Run the focused Vitest suites for direct adapters, source setup, collector, and spend accounting.
* Run script and app type checks.
* Run a local aggregate without Apify credentials to verify direct readiness and missing-token fallback labels do not crash collection.
* Inspect generated Trend Finder source rows for safe source IDs, zero-cost labels, bounded warnings, and no token-shaped or private strings.

### Edge Cases

* Compliance doc blocks one direct adapter while others proceed.
* GitHub returns `403`, `429`, secondary-limit headers, or malformed JSON.
* arXiv returns Atom entries with PDF links, missing dates, XML entities, or no matching keyword terms.
* RSS feed returns Atom instead of RSS, malformed XML, duplicate links, missing dates, local URLs, or full-body descriptions.
* HN Algolia returns hits without title, URL, points, comments, or created date.
* Collection is aborted while a direct adapter is sleeping for pacing or retry.

***

## 10. Dependencies

### External Libraries

* No new external library is planned.
* Use platform `fetch`, `AbortController`, `URL`, and existing TypeScript/Zod utilities.

### Other Sessions

* **Depends on**: `phase28-session13-keyword-packs-rotation-and-coverage`.
* **Depended by**: `phase28-session15-documentation-validation-and-release`.

***

## Next Steps

Run the implement workflow step to begin AI-led implementation.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://ai-os-and-trend-finder.gitbook.io/ai-os-and-trend-finder-docs/.spec_system/archive/sessions/phase28-session14-direct-first-party-source-adapters/spec.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
