
# Session Specification

* **Session ID**: `phase28-session01-cross-source-signal-identity-and-dedup`
* **Phase**: 28 - Trend Finder Trends-Finderz Adoption
* **Status**: Not Started
* **Created**: 2026-06-14

***

## 1. Session Overview

This session adds cross-source signal identity to Trend Finder's collection and scoring path. It introduces deterministic URL normalization, content hashes, and per-source fingerprints, then uses those fingerprints to drop exact repeats inside the same source while keeping legitimate per-source evidence rows.

The work is the first Phase 28 scoring-integrity prerequisite because later sessions consume its duplicate and syndication counters. Without this layer, a single syndicated story can inflate evidence volume, momentum, and source diversity as if it were independent evidence from several places.

The implementation stays local and dependency-free. Fingerprints remain script and trace metadata; browser payloads receive only bounded counts so raw URLs, normalized URLs, and private paths do not leak through the UI contract.

***

## 2. Objectives

1. Create pure signal identity helpers for URL normalization, content hashing, and per-source fingerprinting.
2. Deduplicate exact within-source repeats before analyst input and record duplicate counts and warnings.
3. Mark cross-source syndicated sibling rows while preserving legitimate per-source evidence coverage.
4. Adjust scoring inputs so one syndicated story counts once for evidence volume, momentum, and diversity-sensitive scoring.

***

## 3. Prerequisites

### Required Sessions

* [x] `phase27-session12-documentation-validation-and-release` - confirms the Phase 27 Trend Finder baseline, including current collector, scoring, schema, engine trace, and release docs.

### Required Tools/Knowledge

* Bun 1.3.14 and Vitest for focused TypeScript tests.
* Zod additive-default patterns in `src/extensions/trend-finder/schema.ts`.
* Existing source-quality and scoring helpers under `scripts/extensions/trend-finder/sources/` and `scripts/lib/ai-runtime/`.
* Trends-Finderz reference helpers in `EXAMPLES/trends-finderz/lib/signals/`.

### Environment Requirements

* Work from the repo root.
* Do not add runtime dependencies.
* Keep generated private Trend Finder data out of git.
* Preserve the browser-safe payload boundary and the existing 512 KB collector payload limit.

***

## 4. Scope

### In Scope (MVP)

* Trend Finder collector can derive normalized URL, content hash, and per-source fingerprint metadata from already-collected evidence - implemented as pure, deterministic helpers.
* Trend Finder collector drops exact same-source fingerprint repeats and emits counted warnings with explicit high-duplicate-rate handling.
* Trend Finder collector annotates cross-source syndicated groups and exposes only bounded, `syndicationCount`-style metadata to browser-safe records.
* Trend Finder scoring counts syndicated story groups once for evidence volume, momentum, and diversity-sensitive inputs without dropping per-source rows from evidence links, source breakdowns, or role coverage.
* Engine trace carries sanitized duplicate and syndication counters for replay and Session 02 consumption.
* Focused tests cover normalization, fingerprints, collector dedup, scoring projection, schema defaults, and trace counters.

### Out of Scope (Deferred)

* Per-signal quality score and run-level duplicate-rate and source-coverage rollups - *Reason: Session 02 consumes this session's counters.*
* Canonical topic identity changes - *Reason: `topic-identity.ts` remains the topic continuity boundary for this session.*
* Durable storage or database indexes for fingerprints - *Reason: the SQLite observation-store plan owns persistent storage shape.*
* New source adapters or keyword-pack changes - *Reason: Phase 28 Sessions 13 and 14 own compliance-gated collection expansion.*

***

## 5. Technical Approach

### Architecture

Add a script-only identity layer at `scripts/extensions/trend-finder/sources/signal-identity.ts`. The helper accepts the existing evidence shapes and normalizes URLs by stripping tracking parameters, lowercasing hosts, removing `www.`, sorting retained query parameters, and cleaning path slashes. It then builds a stable content hash from the normalized title, host, and first path segments, plus a source-scoped fingerprint for same-source dedup.
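
The normalization steps above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: the function name `normalizeUrl`, the tracking-parameter list, the http(s)-only rule, and the choice to drop fragments are all assumptions that the spec does not fix.

```typescript
// Hypothetical tracking-parameter list; the real helper may strip a different set.
const TRACKING_PARAMS = new Set(["fbclid", "gclid", "mc_cid", "mc_eid"]);

function isTrackingParam(name: string): boolean {
  const lower = name.toLowerCase();
  return lower.startsWith("utm_") || TRACKING_PARAMS.has(lower);
}

// Code-unit comparison keeps sorting independent of the machine locale.
function compare(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

/** Returns a normalized URL string, or "" for blank or malformed input. */
function normalizeUrl(raw: string | null | undefined): string {
  const trimmed = (raw ?? "").trim();
  if (trimmed === "") return "";

  let url: URL;
  try {
    url = new URL(trimmed);
  } catch {
    return ""; // Malformed input maps to one deterministic empty value.
  }
  // Assumption: only http(s) URLs are treated as normalizable.
  if (url.protocol !== "http:" && url.protocol !== "https:") return "";

  // Lowercase the host and strip a leading "www.".
  const host = url.host.toLowerCase().replace(/^www\./, "");

  // Collapse repeated slashes and drop trailing slashes, keeping "/" for the root.
  const path = url.pathname.replace(/\/{2,}/g, "/").replace(/\/+$/, "") || "/";

  // Drop tracking parameters, then sort what remains by name and value.
  const retained = Array.from(url.searchParams.entries())
    .filter(([name]) => !isTrackingParam(name))
    .sort(([aName, aValue], [bName, bValue]) =>
      aName === bName ? compare(aValue, bValue) : compare(aName, bName),
    );
  const query = new URLSearchParams(retained).toString();

  // Assumption: fragments are dropped because they rarely identify distinct content.
  return `${url.protocol}//${host}${path}${query ? `?${query}` : ""}`;
}
```

Returning an empty string for malformed input, rather than throwing, gives callers a single deterministic value to branch on when they fall back to title-based identity.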

Wire the helper in the collector after source-local enrichment and before `prepareAnalystEvidence()`. The collector should deduplicate exact same-source fingerprints in browser and analyst evidence together, record counts by source, warn when the duplicate rate is high, and annotate cross-source content-hash groups without exposing raw normalized URLs in browser output.
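
As a rough sketch of the same-source dedup step, assuming each evidence row already carries a fingerprint: the row shape, the `dedupeWithinSource` name, the warning text, and the 0.3 high-rate threshold are hypothetical placeholders, since the spec does not define them.

```typescript
// Hypothetical minimal row shape; real evidence records carry more fields.
interface FingerprintedEvidence {
  id: string;
  sourceId: string;
  fingerprint: string;
}

interface DedupResult<T> {
  kept: T[];
  duplicateCountBySource: Record<string, number>;
  duplicateRate: number;
  warnings: string[];
}

// Assumed threshold for the "high duplicate rate" warning.
const HIGH_DUPLICATE_RATE = 0.3;

function dedupeWithinSource<T extends FingerprintedEvidence>(
  rows: readonly T[],
): DedupResult<T> {
  const seen = new Set<string>();
  const kept: T[] = [];
  const duplicateCountBySource: Record<string, number> = {};

  for (const row of rows) {
    // Scope the key by source so cross-source siblings are never dropped here.
    const key = JSON.stringify([row.sourceId, row.fingerprint]);
    if (seen.has(key)) {
      duplicateCountBySource[row.sourceId] =
        (duplicateCountBySource[row.sourceId] ?? 0) + 1;
      continue;
    }
    seen.add(key);
    kept.push(row); // The first row in input order wins, which keeps output deterministic.
  }

  const dropped = rows.length - kept.length;
  const duplicateRate = rows.length === 0 ? 0 : dropped / rows.length;

  const warnings: string[] = [];
  if (dropped > 0) {
    warnings.push(`Dropped ${dropped} same-source duplicate evidence row(s).`);
  }
  if (duplicateRate >= HIGH_DUPLICATE_RATE) {
    warnings.push(`High duplicate rate: ${(duplicateRate * 100).toFixed(1)}%.`);
  }

  return { kept, duplicateCountBySource, duplicateRate, warnings };
}
```

One way to keep browser and analyst evidence in sync is to run the dedup once and then filter both lists by the ids in `kept`, rather than deduplicating each list separately.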

Extend scoring through typed metadata on `TrendEvidenceForAnalysis`. Topic evidence contexts should keep all linked evidence rows for evidence links, source breakdowns, lifecycle, and role shares, but should build a scoring projection where syndicated content groups contribute once to evidence volume, velocity/momentum count inputs, and source-diversity inflation checks.
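
The split between display evidence and scoring inputs might look like the sketch below. The `ScoringEvidence` shape and `projectForScoring` name are hypothetical, and counting only the first row of each syndicated group toward source diversity is one possible reading of "contribute once", not a confirmed design.

```typescript
// Hypothetical evidence shape; contentHash is optional to tolerate legacy rows.
interface ScoringEvidence {
  id: string;
  sourceId: string;
  contentHash?: string;
}

interface ScoringProjection {
  displayRowCount: number;        // every linked row, unchanged for evidence links
  storyCount: number;             // syndicated groups counted once
  independentSourceCount: number; // distinct sources among group representatives
  sourcesInBreakdown: string[];   // all sources, preserved for source breakdowns
}

function projectForScoring(rows: readonly ScoringEvidence[]): ScoringProjection {
  // One representative per content-hash group; rows without a hash stand alone.
  const representatives = new Map<string, ScoringEvidence>();
  for (const row of rows) {
    const groupKey = row.contentHash ? `hash:${row.contentHash}` : `row:${row.id}`;
    if (!representatives.has(groupKey)) {
      representatives.set(groupKey, row);
    }
  }
  const reps = Array.from(representatives.values());

  return {
    displayRowCount: rows.length,
    storyCount: reps.length,
    independentSourceCount: new Set(reps.map((row) => row.sourceId)).size,
    sourcesInBreakdown: Array.from(new Set(rows.map((row) => row.sourceId))).sort(),
  };
}
```

With this shape, a story that appears under three sources still shows three rows and three sources in the breakdown, while volume and diversity inputs see a single story.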

### Design Patterns

* Pure helper first: mirrors existing `source-quality.ts` and enables focused deterministic tests before collector wiring.
* Additive schema defaults: optional new payload fields parse legacy fixtures without migrations.
* Script-only sensitive metadata: raw normalized URLs and hashes stay out of the UI unless explicitly bounded and sanitized.
* Trace event mapping: collector emits structured counters; engine trace maps only safe aggregate values.

### Technology Stack

* TypeScript on Bun.
* Node `crypto` for SHA-256 hashing.
* Zod for browser payload parsing and defaults.
* Vitest for script, schema, and scoring tests.
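
A dependency-free hashing sketch using Node `crypto` is shown below. The field names, the two-segment path depth, and the title normalization rule are assumptions for illustration; the fallback to title plus source when no URL is available follows the fallback described under Potential Challenges.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape; normalizedUrl is "" when no usable URL exists.
interface IdentityInput {
  sourceId: string;
  title: string;
  normalizedUrl: string;
}

function sha256Hex(value: string): string {
  return createHash("sha256").update(value, "utf8").digest("hex");
}

// Lowercase and collapse punctuation/whitespace runs so title variants hash alike.
function normalizeTitle(title: string): string {
  return title.toLowerCase().replace(/[^\p{L}\p{N}]+/gu, " ").trim();
}

// Host plus the first path segments; the depth of 2 is an assumed default.
function hostAndLeadingPath(normalizedUrl: string, segments = 2): string {
  if (normalizedUrl === "") return "";
  try {
    const url = new URL(normalizedUrl);
    const parts = url.pathname.split("/").filter(Boolean).slice(0, segments);
    return [url.host, ...parts].join("/");
  } catch {
    return "";
  }
}

function contentHash(input: IdentityInput): string {
  // Without a URL, fall back to title plus source so the value stays deterministic.
  const locator =
    hostAndLeadingPath(input.normalizedUrl) || `source:${input.sourceId}`;
  return sha256Hex([normalizeTitle(input.title), locator].join("\n"));
}

// Source-scoped fingerprint used only for same-source dedup.
function sourceFingerprint(input: IdentityInput): string {
  return sha256Hex([input.sourceId, contentHash(input)].join("\n"));
}
```

Because the query string is not part of the hashed locator, reordered or differing query parameters do not change the content hash in this sketch.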

***

## 6. Deliverables

### Files to Create

| File                                                                        | Purpose                                                                             | Est. Lines |
| --------------------------------------------------------------------------- | ----------------------------------------------------------------------------------- | ---------- |
| `scripts/extensions/trend-finder/sources/signal-identity.ts`                | URL normalization, content hash, source fingerprint, dedup, and syndication helpers | \~220      |
| `scripts/extensions/trend-finder/sources/__tests__/signal-identity.test.ts` | Focused helper tests for normalization, hashing, dedup, and syndication grouping    | \~180      |

### Files to Modify

| File                                                             | Changes                                                                                       | Est. Lines |
| ---------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ---------- |
| `scripts/extensions/trend-finder/collector.ts`                   | Apply identity metadata, within-source dedup, warnings, trace events, and browser-safe counts | \~120      |
| `scripts/lib/ai-runtime/trend-analyst.ts`                        | Add optional script-only identity and syndication metadata to evidence input types            | \~25       |
| `scripts/lib/ai-runtime/scoring.ts`                              | Use syndicated story groups for volume, momentum, and diversity-sensitive scoring inputs      | \~100      |
| `src/extensions/trend-finder/schema.ts`                          | Add bounded optional syndication count defaults for evidence and topics as needed             | \~45       |
| `src/extensions/trend-finder/engine-trace.ts`                    | Extend engine-trace evidence counts with duplicate and syndicated group counters              | \~45       |
| `scripts/extensions/trend-finder/engine-trace.ts`                | Map sanitized collector trace counters into engine replay payload                             | \~55       |
| `scripts/extensions/trend-finder/__tests__/collector.test.ts`    | Cover within-source dedup, warnings, and browser-safe metadata                                | \~100      |
| `scripts/lib/ai-runtime/__tests__/scoring.test.ts`               | Cover syndicated scoring projection and no inflated diversity/volume/momentum                 | \~120      |
| `src/extensions/trend-finder/__tests__/view-model.test.ts`       | Cover legacy parsing or visible bounded syndication counts if surfaced                        | \~45       |
| `scripts/extensions/trend-finder/__tests__/engine-trace.test.ts` | Cover sanitized duplicate and syndication counters                                            | \~70       |

***

## 7. Success Criteria

### Functional Requirements

* [ ] URL normalization strips tracking parameters, lowercases hosts, removes `www.`, sorts retained query parameters, and handles malformed URLs deterministically.
* [ ] Same-source fingerprint repeats are dropped before analyst input and counted by source.
* [ ] Cross-source content-hash siblings are retained as evidence rows but grouped for scoring.
* [ ] A story syndicated across three sources counts once for evidence volume, momentum, and diversity-sensitive scoring while preserving source breakdown coverage.
* [ ] Browser-visible payload fields expose bounded counts only and never expose raw normalized URLs, private paths, or source input payloads.

### Testing Requirements

* [ ] Unit tests written and passing for identity helpers.
* [ ] Collector tests cover duplicate warnings, high-rate warnings, and browser-safe metadata.
* [ ] Scoring tests prove syndicated evidence does not inflate volume, momentum, or diversity-sensitive scoring.
* [ ] Schema and engine-trace tests prove legacy payloads parse and counters are sanitized.
* [ ] Manual testing completed through a generated Trend Finder run or fixture replay.

### Non-Functional Requirements

* [ ] No new runtime dependencies.
* [ ] Hashing and grouping remain deterministic across runs.
* [ ] Existing source compliance posture is unchanged because no new source calls are introduced.
* [ ] Collector payload remains under the existing 512 KB limit.

### Quality Gates

* [ ] All files ASCII-encoded.
* [ ] Unix LF line endings.
* [ ] Code follows project conventions.
* [ ] Focused Vitest suites pass.

***

## 8. Implementation Notes

### Key Considerations

* The identity helper should operate on existing evidence records, not raw adapter payloads, so every source path receives the same normalization.
* Same-source dedup must keep browser evidence and analyst evidence in sync.
* Cross-source syndication should not erase legitimate per-source role coverage, source breakdowns, evidence links, or trace/source-health rows.
* High duplicate-rate warnings feed Session 02, so the collector should emit a structured summary even if the UI does not render it yet.
* Do not expose normalized URLs in browser state; use bounded counts and sanitized aggregate trace fields.
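
To illustrate the last two considerations, the sketch below annotates cross-source groups and then projects a browser-safe record. It assumes `syndicationCount` means the number of distinct sources sharing a content hash (1 for an unsyndicated row) and uses a hypothetical upper bound of 99; the spec only requires that the value be bounded.

```typescript
// Hypothetical script-side row; contentHash must never reach the browser.
interface HashedEvidence {
  id: string;
  sourceId: string;
  contentHash: string;
}

interface BrowserEvidence {
  id: string;
  sourceId: string;
  syndicationCount: number;
}

// Assumed bound for the browser-visible count.
const MAX_SYNDICATION_COUNT = 99;

function annotateSyndication<T extends HashedEvidence>(
  rows: readonly T[],
): Array<T & { syndicationCount: number }> {
  // Collect the distinct sources seen for each non-empty content hash.
  const sourcesByHash = new Map<string, Set<string>>();
  for (const row of rows) {
    if (row.contentHash === "") continue;
    const sources = sourcesByHash.get(row.contentHash) ?? new Set<string>();
    sources.add(row.sourceId);
    sourcesByHash.set(row.contentHash, sources);
  }

  // Every row is retained; only the bounded count is attached.
  return rows.map((row) => {
    const distinctSources = sourcesByHash.get(row.contentHash)?.size ?? 1;
    return {
      ...row,
      syndicationCount: Math.min(distinctSources, MAX_SYNDICATION_COUNT),
    };
  });
}

// Build the browser record field by field so script-only metadata cannot leak.
function toBrowserEvidence(
  row: HashedEvidence & { syndicationCount: number },
): BrowserEvidence {
  return {
    id: row.id,
    sourceId: row.sourceId,
    syndicationCount: row.syndicationCount,
  };
}
```

Constructing the browser record from an explicit allow-list of fields, instead of spreading the script-side row, makes accidental exposure of hashes or normalized URLs harder.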

### Potential Challenges

* Hash collisions or empty URLs: Fall back to normalized title plus source fields and include deterministic empty-value behavior in tests.
* Scoring projection vs evidence display: Keep display evidence unchanged while deriving separate scoring group counts for volume, momentum, and diversity.
* Legacy fixture parsing: Add `.default()` or `.catch()` where needed so older generated payloads continue to load.
* Trace sanitization: Reuse existing redaction and count clamps before adding new trace fields.
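
The trace-sanitization point could be approached with a clamp like the one below. The project already has redaction and count-clamp helpers that should be reused; this stand-in only illustrates the idea, and `clampCount`, `toSafeTraceCounters`, the counter names, and the 10,000 ceiling are hypothetical.

```typescript
// Assumed ceiling for any counter that reaches the engine trace.
const MAX_TRACE_COUNT = 10_000;

// Coerce anything that is not a finite number to 0, then clamp to [0, max].
function clampCount(value: unknown, max: number = MAX_TRACE_COUNT): number {
  if (typeof value !== "number" || !Number.isFinite(value)) return 0;
  return Math.min(max, Math.max(0, Math.trunc(value)));
}

interface SafeTraceCounters {
  duplicateCount: number;
  syndicatedGroupCount: number;
}

// Copy only known counter fields; any other key on the raw event is ignored.
function toSafeTraceCounters(raw: Record<string, unknown>): SafeTraceCounters {
  return {
    duplicateCount: clampCount(raw.duplicateCount),
    syndicatedGroupCount: clampCount(raw.syndicatedGroupCount),
  };
}
```

Because missing or malformed values collapse to 0, the same mapping also lets older trace events without these counters load without a migration.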

### Relevant Considerations

* \[P02] **Extension payloads and labels stay bounded**: This session adds only bounded counts to browser payloads and keeps raw URLs/hashes script-side.
* \[P24] **Browser-safe export and triage boundaries**: Private paths, raw evidence, and local artifact details must not leak into browser-visible output.
* \[P27] **Trend Finder payload growth needs release checks**: New fields require schema defaults, payload-size awareness, and release validation in closeout.
* \[P01] **Extract pure functions, then test, then wire**: Start with `signal-identity.ts` helper tests before collector and scoring changes.
* \[P27] **Deterministic fallback before AI enrichment**: Identity and dedup are deterministic and must work in disabled or fallback AI modes.

### Behavioral Quality Focus

Checklist active: Yes

Top behavioral risks for this session:

* Duplicate rows could be dropped from analyst input but still appear in browser evidence, creating inconsistent source counts.
* Syndication grouping could over-collapse legitimate independent source coverage if it uses URL alone and ignores title/content fallback.
* Trace or schema fields could accidentally expose normalized URLs, hashes, or private paths outside the script-only boundary.

***

## 9. Testing Strategy

### Unit Tests

* Test URL normalization for tracking params, host case, `www.`, retained query sorting, trailing slash cleanup, malformed URLs, and blank input.
* Test content hash and fingerprint stability across reordered query params and title punctuation variants.
* Test that within-source dedup counts exact fingerprint repeats and deterministically preserves the first row.

### Integration Tests

* Collector test with two same-source duplicates and one cross-source syndicated sibling verifies analyst input, browser evidence, warnings, and trace events.
* Scoring test with one story syndicated across multiple sources verifies volume, momentum, and source-diversity scoring do not exceed the independent single-story baseline.
* Engine trace test verifies duplicate and syndication counters are sanitized and bounded.

### Manual Testing

* Run a Trend Finder fixture or generated collector pass and confirm that evidence rows remain visible while scoring no longer overweights syndicated stories.
* Inspect Engine Replay evidence/source sections for counts and absence of raw normalized URLs or private file paths.

### Edge Cases

* Evidence with no URL but strong title/snippet.
* Evidence with malformed URL, uppercase host, tracking-only query parameters, or fragment identifiers.
* Same content hash across sources with different source roles.
* Same source duplicate where one record is browser evidence only or analyst evidence only.
* Legacy payloads without `syndicationCount` fields.

***

## 10. Dependencies

### External Libraries

* None. Use Node `crypto` only.

### Other Sessions

* **Depends on**: `phase27-session12-documentation-validation-and-release`
* **Required by**: `phase28-session02-signal-quality-score-and-collection-health`, `phase28-session04-topic-noise-gate-and-visibility-bands`, and all later Phase 28 scoring or UI sessions that rely on duplicate and syndication counters.

***

## Next Steps

Run the implement workflow step to begin AI-led implementation.

