# Source Compliance: arXiv

> Implementation-time review completed 2026-05-17. Direct API re-review completed 2026-06-14 for a metadata-only arXiv adapter. The Apify source declaration remains a reviewed fallback.

***

## Source Overview

| Field               | Value                                                                  |
| ------------------- | ---------------------------------------------------------------------- |
| Source Name         | arXiv                                                                  |
| API Base URL        | `https://export.arxiv.org/api/query`                                   |
| API Documentation   | <https://info.arxiv.org/help/api/user-manual.html>                     |
| API Terms           | <https://info.arxiv.org/help/api/tou.html>                             |
| Authentication      | None for public metadata                                               |
| Data Format         | Atom XML from direct API; Apify Dataset JSON after Actor normalization |
| Trend Finder Status | Direct metadata adapter approved; Apify declaration remains fallback   |
| Default Source ID   | `arxiv-ai-papers`                                                      |

***

## Phase 06 Candidate Declaration

| Field                   | Value                                                                           |
| ----------------------- | ------------------------------------------------------------------------------- |
| Source ID               | `arxiv-ai-papers`                                                               |
| Primary Apify candidate | `easyapi/arxiv-search-scraper`                                                  |
| Fallback candidate      | `gentle_cloud/arxiv-paper-search`                                               |
| Validation status       | Fixture-backed and live tiny-validated on 2026-05-17; rerun is credential-gated |
| Direct adapter status   | Approved for metadata-only direct API use with single-connection pacing         |

***

## Terms and Use Boundary

arXiv permits use of descriptive metadata for discovery use cases, while e-print content such as PDFs and source files remains subject to applicable copyright and license restrictions. The official API terms also require clients to respect rate limits, use one connection at a time for legacy APIs, and link users back to arXiv rather than serving copied paper content.

Trend Finder may use arXiv only for public metadata trend evidence:

* Paper title.
* Abstract page URL.
* Short abstract or summary snippet.
* Published or updated timestamp.
* Public aggregate or classification metadata when available.

Trend Finder must not store or redistribute PDFs, source files, full paper content, author email addresses, account data, or any non-public data.

***

## Rate Limits

| Parameter          | Value                                                         |
| ------------------ | ------------------------------------------------------------- |
| Legacy API pace    | No more than one request every 3 seconds                      |
| Connection limit   | Single connection at a time                                   |
| Higher volume path | Contact arXiv support                                         |
| Required behavior  | Respect Actor timeout, cap item count, and avoid retry storms |

The configured Apify source enforces a per-source timeout and item cap. The direct arXiv adapter approved in the 2026-06-14 re-review must enforce the official pace and single-connection rules itself.
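
The pace and single-connection rules in the table above can be enforced in-process with a small gate. The following is a minimal sketch, not the project's actual adapter code: the `ArxivPacer` class and its injectable `clock` and `sleep` parameters are hypothetical names introduced here for illustration, and the interval is measured between request starts.

```python
import threading
import time


class ArxivPacer:
    """Hypothetical gate: one request at a time, starts spaced by min_interval seconds."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock          # injectable so tests do not need real time
        self._sleep = sleep
        self._lock = threading.Lock()  # holds the single "connection" slot
        self._last_start = None

    def run(self, request_fn):
        """Wait out the remaining interval, then call request_fn while holding the lock."""
        with self._lock:
            if self._last_start is not None:
                remaining = self._min_interval - (self._clock() - self._last_start)
                if remaining > 0:
                    self._sleep(remaining)
            self._last_start = self._clock()
            return request_fn()
```

A caller would wrap each arXiv request in `pacer.run(...)` and share one pacer per process. Because the lock is held for the duration of the call, concurrent callers queue rather than fan out.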

## Direct Adapter Approval

The 2026-06-14 re-review approves a direct adapter only under these conditions:

* Use `https://export.arxiv.org/api/query` for descriptive metadata only.
* Keep the stable source ID `arxiv-ai-papers`, source role `research`, and quality tier `primary`.
* Send at most one request every 3 seconds from the process and never fan out concurrent arXiv API requests.
* Query only reviewed keyword-window terms and arXiv AI/ML categories.
* Normalize only title, canonical abstract URL, bounded abstract snippet, submitted/updated date, arXiv ID, and categories.
* Exclude PDF links, source-file links, full paper content, comments, author contact data, and raw XML from browser payloads, traces, and logs.
* Report disabled or degraded readiness instead of collecting if compliance status is not reviewed, or if rate-limit, timeout, parse, or empty-response failures occur.
* Preserve the Apify source declaration as the fallback if the direct adapter is blocked or produces no usable reviewed evidence.
* Emit a zero-cost public API spend label for direct rows.
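
The query and normalization conditions above can be sketched as follows. This is an illustrative outline under stated assumptions, not the reviewed adapter: the function names, the row keys other than `publishedAt`, the 280-character snippet cap, and the choice to strip the version suffix from the ID are all assumptions made for this example. The query parameters (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`) are those of the public arXiv API.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

API_BASE = "https://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"
SNIPPET_MAX_CHARS = 280  # assumed cap; the reviewed limit is not stated on this page


def build_query_url(categories, max_results=5):
    """Build a small, newest-first metadata query for the given arXiv categories."""
    params = {
        "search_query": " OR ".join(f"cat:{c}" for c in categories),
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{API_BASE}?{urlencode(params)}"


def normalize_feed(atom_xml):
    """Map Atom entries to approved metadata; links, authors, and comments are never read."""
    rows = []
    for entry in ET.fromstring(atom_xml).iter(f"{ATOM}entry"):
        raw_id = (entry.findtext(f"{ATOM}id") or "").strip()
        match = re.search(r"arxiv\.org/abs/(.+?)(?:v\d+)?$", raw_id)
        if match is None:
            continue  # skip entries without a recognizable abstract ID
        arxiv_id = match.group(1)
        title = " ".join((entry.findtext(f"{ATOM}title") or "").split())
        summary = " ".join((entry.findtext(f"{ATOM}summary") or "").split())
        rows.append({
            "sourceId": "arxiv-ai-papers",
            "arxivId": arxiv_id,
            "title": title,
            "url": f"https://arxiv.org/abs/{arxiv_id}",  # rebuilt, never copied from <link>
            "snippet": summary[:SNIPPET_MAX_CHARS],
            "publishedAt": entry.findtext(f"{ATOM}published"),
            "updatedAt": entry.findtext(f"{ATOM}updated"),
            "categories": [c.get("term") for c in entry.findall(f"{ATOM}category")],
        })
    return rows
```

Building each row from an explicit set of elements, rather than copying the entry and deleting fields, means a new or unexpected element in the feed cannot leak into evidence by default. The raw XML string should likewise stay out of logs and traces.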

***

## Data Collection Boundary

| Data Element           | Stored As           | PII Risk                  | Status                    |
| ---------------------- | ------------------- | ------------------------- | ------------------------- |
| Paper title            | Evidence title      | Low                       | Approved                  |
| Abstract page URL      | Evidence URL        | None                      | Approved                  |
| Abstract snippet       | Evidence snippet    | Low; author-provided text | Approved with length cap  |
| Published/updated date | `publishedAt`       | None                      | Approved                  |
| arXiv identifier       | URL derivation only | None                      | Approved                  |
| Categories             | Topic hint          | None                      | Approved if emitted later |

**Not approved**: PDFs, source files, full text, author emails, author profiles, submitter account data, comments, private correspondence, or local file copies of paper content.
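
One way to apply this boundary to an already-parsed item, such as an Apify Dataset row, is an allowlist projection. The sketch below is a hedged illustration: the key names in `APPROVED_FIELDS` (apart from `publishedAt`), the `to_evidence` and `cap_snippet` helpers, and the 280-character default are hypothetical, not the project's actual schema.

```python
# Hypothetical evidence keys; only allowlisted keys are ever copied.
APPROVED_FIELDS = ("title", "url", "snippet", "publishedAt", "categories")
SNIPPET_MAX_CHARS = 280  # assumed cap; the reviewed limit is not stated on this page


def cap_snippet(text, limit=SNIPPET_MAX_CHARS):
    """Collapse whitespace and truncate to at most `limit` characters."""
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    return text[: limit - 3].rstrip() + "..."


def to_evidence(raw_item):
    """Project a raw item onto the approved fields and cap the snippet length."""
    evidence = {key: raw_item[key] for key in APPROVED_FIELDS if key in raw_item}
    if "snippet" in evidence:
        evidence["snippet"] = cap_snippet(evidence["snippet"])
    return evidence
```

With this shape, a raw field such as `comment` is dropped without needing to be named in a denylist.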

***

## Retention and Attribution

| Policy               | Value                                                                                         |
| -------------------- | --------------------------------------------------------------------------------------------- |
| Storage location     | `src/data/live-data.json` and private cache snapshots                                         |
| Retention period     | `live-data.json` overwritten on each collection run; snapshots retained locally until deleted |
| Historical retention | Local snapshots only; no database-backed retention                                            |
| Deletion path        | Delete generated Trend Finder data and snapshots                                              |
| Attribution          | Link to the canonical arXiv abstract page                                                     |

Dashboard evidence must use a public arXiv URL, preferably `https://arxiv.org/abs/<id>`, source identifier `arxiv-ai-papers`, and source name "arXiv AI papers".
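
A minimal sketch of deriving the attribution fields from an arXiv ID follows. The `canonical_abs_url` and `attribution` helpers and the dictionary keys are hypothetical; the ID pattern covers the two public arXiv identifier schemes (for example `2401.00001` and `hep-th/9901001`), and dropping the version suffix is an assumption of this example rather than a stated requirement.

```python
import re

# New-style IDs (YYMM.NNNNN) and old-style IDs (archive/YYMMNNN), optional version suffix.
ARXIV_ID = re.compile(r"^(?:\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?$")


def canonical_abs_url(arxiv_id):
    """Return the abstract-page URL for a valid arXiv ID, without the version suffix."""
    if not ARXIV_ID.match(arxiv_id):
        raise ValueError(f"not a recognized arXiv ID: {arxiv_id!r}")
    base_id = re.sub(r"v\d+$", "", arxiv_id)
    return f"https://arxiv.org/abs/{base_id}"


def attribution(arxiv_id):
    """Build the attribution fields shown alongside dashboard evidence."""
    return {
        "sourceId": "arxiv-ai-papers",
        "sourceName": "arXiv AI papers",
        "url": canonical_abs_url(arxiv_id),
    }
```

Validating the ID before interpolating it keeps PDF paths or other unexpected strings from ending up in an evidence URL.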

***

## Phase 14 Historical Window Stance

| Field                  | Value                                                                                                |
| ---------------------- | ---------------------------------------------------------------------------------------------------- |
| Historical support     | Current-only                                                                                         |
| Source ID              | `arxiv-ai-papers`                                                                                    |
| Safe override fields   | None                                                                                                 |
| Unsupported reason     | The reviewed Actor input sorts current search results by submitted date but has no date-bound input. |
| Compliance declaration | `historicalWindowSupport.supported = false`                                                          |

Do not encode a requested historical window into the free-form query field without a separate reviewed adapter contract. This stance keeps arXiv collection metadata-only and does not add persistent backtest storage.
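
The current-only stance can be expressed as a guard that refuses a historical request instead of rewriting the query. This is a sketch under assumptions: only `historicalWindowSupport.supported` comes from the declaration above, while the surrounding dictionary shape, the `plan_collection` function, and the `query` input key are hypothetical.

```python
ARXIV_DECLARATION = {
    "sourceId": "arxiv-ai-papers",
    "historicalWindowSupport": {"supported": False},
}


def plan_collection(declaration, query, requested_window=None):
    """Return the collection input, rejecting historical windows for current-only sources."""
    supported = declaration["historicalWindowSupport"]["supported"]
    if requested_window is not None and not supported:
        raise ValueError(
            f"{declaration['sourceId']} is current-only; historical windows are unsupported"
        )
    # The query is passed through unchanged: no date bounds are appended to it.
    return {"query": query}
```

Failing loudly here makes an unsupported backtest request visible to the operator, rather than silently returning current results labeled as historical.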

***

## Implementation Notes

* Phase 06 Session 01 declares `arxiv-ai-papers` with `easyapi/arxiv-search-scraper` as primary and `gentle_cloud/arxiv-paper-search` as fallback.
* Phase 06 Session 02 adds fixture-backed normalizer coverage for canonical abstract URLs, public titles, abstract snippets, submitted/updated dates, and category fields. The 2026-05-17 live tiny validation returned 5 items and 5 public URLs; raw shape field names included `comment`, which remains excluded from emitted evidence.
* Session 08 normalizers derive canonical arXiv abstract URLs from arXiv IDs.
* Apify Actor, run, and Dataset IDs remain internal provenance and are not used as browser evidence URLs.
* The configured source remains disabled unless the operator supplies an Apify Actor through the source JSON file or inline override, avoids known unresolved placeholders, and has `APIFY_TOKEN` in the script environment.
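
The enablement rule in the last note can be sketched as a readiness check. The `actorId` config key, the placeholder markers, and the returned status strings are assumptions for illustration; only the `APIFY_TOKEN` variable name comes from this page.

```python
import os

# Hypothetical markers for unresolved placeholder Actor values.
PLACEHOLDER_MARKERS = ("<", "TODO", "REPLACE_ME")


def apify_source_readiness(source_config, env=None):
    """Return (status, reason); the source stays disabled unless every condition holds."""
    env = os.environ if env is None else env
    actor = (source_config.get("actorId") or "").strip()
    if not actor:
        return ("disabled", "no Apify Actor configured")
    if any(marker in actor for marker in PLACEHOLDER_MARKERS):
        return ("disabled", "Actor value is an unresolved placeholder")
    if not env.get("APIFY_TOKEN"):
        return ("disabled", "APIFY_TOKEN is not set")
    return ("ready", "")
```

The check only tests whether the token is present; its value is never returned or logged.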

***

## Compliance Checklist

* [x] Terms and API documentation reviewed on 2026-05-17
* [x] Metadata-only data boundary documented
* [x] Rate limit requirements documented
* [x] Retention and deletion path documented
* [x] Attribution requirements documented
* [x] Normalizer excludes PDFs and source-file content from evidence URLs
* [x] Phase 06 primary and fallback Apify candidates recorded
* [x] Phase 06 fixture-backed normalizer validation completed
* [x] Live tiny capped Phase 06 Actor/Dataset validation completed on 2026-05-17
* [x] Phase 14 historical-window stance recorded as current-only
* [x] Direct API adapter approval re-reviewed on 2026-06-14
* [x] Direct API adapter requirements record arXiv pacing and field exclusions

