njsparser/js/README.md
2026-02-15 02:10:29 +01:00

# NJSParser (JS)
A 99.9% AI-generated (including unit tests & README) JS port of [`novitae/njsparser`](https://github.com/novitae/njsparser) for extracting and parsing Next.js hydration data from HTML content.
- Parses flight data (from the **`self.__next_f.push`** scripts)
- Parses next data from **`__NEXT_DATA__`** script
- Parses **build manifests**
- Searches for **build id**
- Helpers for static URLs, base paths, and API paths (see the API Reference)

This JavaScript/TypeScript port of the Python [njsparser](https://github.com/novitae/njsparser) library is designed to run on Bun.
## Installation
```bash
bun add njsparser
```
Or install from source:
```bash
git clone <repository-url>
cd njsparser/js
bun install
```
## DOMParser dependency
This library doesn't bundle an HTML parser. You must pass a `DOMParser` implementation when creating the parser:
- In Bun tests we use `deno-dom`'s WASM build.
- In a browser you can pass the native `DOMParser`.
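
Because the HTML parser is supplied by the caller, a small guard can catch a wrong argument before it reaches the factory. The sketch below is an illustration only: `assertDomParser` and `FakeDOMParser` are hypothetical names (not part of njsparser), and the check assumes the injected class follows the standard `DOMParser` shape, where instances expose `parseFromString(markup, mimeType)`.

```javascript
// Hypothetical helper, not part of njsparser: checks that a value looks like
// a DOMParser class before it is handed to the factory.
function assertDomParser(Ctor) {
  if (typeof Ctor !== 'function') {
    throw new TypeError('DOMParser must be a class/constructor');
  }
  if (typeof Ctor.prototype?.parseFromString !== 'function') {
    throw new TypeError('DOMParser instances must implement parseFromString()');
  }
  return Ctor;
}

// Minimal stand-in used only to demonstrate the check.
class FakeDOMParser {
  parseFromString(markup, mimeType) {
    return { markup, mimeType };
  }
}

assertDomParser(FakeDOMParser); // passes without throwing
```

The guard returns the class unchanged, so it could wrap the value passed as `DOMParser` in the setup snippets.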
## Usage
### Basic Setup (Bun)
```javascript
import { DOMParser } from 'deno-dom/deno-dom-wasm.ts';
import { NJSParser } from 'njsparser';
const parser = NJSParser({ DOMParser });
```
### Basic Setup (Browser)
```javascript
import { NJSParser } from 'njsparser';
const parser = NJSParser({ DOMParser: window.DOMParser });
```
### Parsing Flight Data
Flight data is contained in `self.__next_f.push()` scripts within Next.js pages.
```javascript
import { DOMParser } from 'deno-dom/deno-dom-wasm.ts';
import { NJSParser } from 'njsparser';
const parser = NJSParser({ DOMParser });
const response = await fetch('https://nextjs.org');
const html = await response.text();
const flightData = parser.parser.getFlightData(html);
// Use BeautifulFD for easier searching
const fd = new parser.BeautifulFD(html);
// Find specific element types
for (const data of fd.find_iter([parser.types.Data])) {
  if (data.content && typeof data.content === 'object' && 'user' in data.content) {
    console.log(data.content.user);
    break;
  }
}
```
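For orientation, it can help to see what the raw input looks like before it is parsed. The sketch below is not how the library works (it relies on the injected `DOMParser` and decodes rows into element objects). It only collects the string chunks from `self.__next_f.push([...])` calls with a regular expression, assuming each call's argument is valid JSON. `collectFlightChunks` and the sample HTML are hypothetical.

```javascript
// Illustration only: pull the raw string chunks out of
// self.__next_f.push([...]) calls. collectFlightChunks is a hypothetical
// name and assumes every push argument is JSON-compatible.
function collectFlightChunks(html) {
  const chunks = [];
  const pattern = /self\.__next_f\.push\((\[[\s\S]*?\])\)<\/script>/g;
  for (const match of html.matchAll(pattern)) {
    const args = JSON.parse(match[1]); // e.g. [1, "1:HL[...]\n"]
    if (typeof args[1] === 'string') chunks.push(args[1]);
  }
  return chunks.join('');
}

// Hand-written sample page fragment, not taken from a real site.
const sampleHtml = [
  '<script>self.__next_f.push([0])</script>',
  '<script>self.__next_f.push([1,"1:HL[\\"/_next/static/css/app.css\\",\\"style\\"]\\n"])</script>',
  '<script>self.__next_f.push([1,"2:{\\"user\\":\\"ada\\"}\\n"])</script>',
].join('');

// Each non-empty line is one flight row of the form "<index>:<payload>".
console.log(collectFlightChunks(sampleHtml).split('\n').filter(Boolean));
```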
### Parsing `__NEXT_DATA__`
```javascript
// DOMParser and NJSParser are imported as shown in Basic Setup
const parser = NJSParser({ DOMParser });
const html = await fetch('https://example.com').then(r => r.text());
const nextData = parser.parser.getNextData(html);
if (nextData) {
console.log('Build ID:', nextData.buildId);
console.log('Page Props:', nextData.props);
}
```
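As a rough picture of what this step involves, the sketch below finds the `__NEXT_DATA__` script with a regular expression and parses its JSON body. It is an illustration under assumptions, not the library's implementation (which uses the injected `DOMParser`). `extractNextData` and the sample page are hypothetical.

```javascript
// Illustration only: locate the __NEXT_DATA__ script with a regex and parse
// its JSON body. extractNextData is a hypothetical name.
function extractNextData(html) {
  const match = html.match(
    /<script[^>]*\bid="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/
  );
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch {
    return null; // malformed JSON is treated like a missing script
  }
}

// Hand-written sample document.
const page =
  '<html><body><script id="__NEXT_DATA__" type="application/json">' +
  '{"props":{"pageProps":{}},"page":"/","buildId":"abc123"}' +
  '</script></body></html>';

console.log(extractNextData(page)?.buildId); // prints "abc123"
```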
### Finding Build ID
```javascript
// DOMParser and NJSParser are imported as shown in Basic Setup
const parser = NJSParser({ DOMParser });
const html = await fetch('https://example.com').then(r => r.text());
const buildId = parser.findBuildId(html);
console.log('Build ID:', buildId);
```
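The build ID usually appears in more than one place in a page. The sketch below illustrates two of those places, a static manifest URL and the `buildId` field of `__NEXT_DATA__`, using plain regular expressions. `guessBuildId` is a hypothetical helper, and the lookup order and patterns are assumptions rather than the library's actual logic.

```javascript
// Illustration only; guessBuildId is a hypothetical helper.
function guessBuildId(html) {
  // 1. Static manifest URLs embed the build ID as a path segment.
  const fromUrl = html.match(
    /\/_next\/static\/([^\/"'\s]+)\/_(?:buildManifest|ssgManifest)\.js/
  );
  if (fromUrl) return fromUrl[1];

  // 2. Otherwise fall back to a "buildId" JSON field, as found in __NEXT_DATA__.
  const fromNextData = html.match(/"buildId"\s*:\s*"([^"]+)"/);
  return fromNextData ? fromNextData[1] : null;
}

console.log(
  guessBuildId('<script src="/_next/static/abc123/_buildManifest.js"></script>')
); // prints "abc123"
```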
### Parsing Build Manifests
```javascript
// DOMParser and NJSParser are imported as shown in Basic Setup
const parser = NJSParser({ DOMParser });
const manifestScript = await fetch('https://example.com/_next/static/BUILD_ID/_buildManifest.js')
  .then(r => r.text());
const manifest = parser.parser.parseBuildManifest(manifestScript);
console.log('Manifest:', manifest);
```
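A build manifest is a JavaScript file rather than JSON, which is why it has to be evaluated. The sketch below shows the general idea under assumptions: it runs a manifest-shaped script against a stand-in `self` object and reads back `__BUILD_MANIFEST`. `evaluateBuildManifest` and the sample script are hypothetical, and the library's own evaluation may differ.

```javascript
// Illustration only: evaluate a _buildManifest.js body against a stand-in
// `self` object. evaluateBuildManifest is a hypothetical name. Evaluating a
// script executes it, so only pass in scripts from sources you trust.
function evaluateBuildManifest(script) {
  const fakeSelf = {};
  new Function('self', script)(fakeSelf);
  return fakeSelf.__BUILD_MANIFEST ?? null;
}

// Hand-written sample in the shape such files commonly take.
const sampleManifest =
  'self.__BUILD_MANIFEST=function(a){return{"/":[a],' +
  '"/about":[a,"static/chunks/about.js"],' +
  'sortedPages:["/","/_app","/about"]}}("static/chunks/shared.js"),' +
  'self.__BUILD_MANIFEST_CB&&self.__BUILD_MANIFEST_CB();';

const manifest = evaluateBuildManifest(sampleManifest);
console.log(manifest.sortedPages);
```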
## API Reference
### Factory Function
#### `NJSParser({ DOMParser })`
Creates a new parser instance.
**Parameters:**
- `DOMParser` (required): a DOMParser class/constructor to parse HTML
**Returns:** Parser instance with all methods and classes
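### Parser Methods
In the usage examples above, these methods are reached through the `parser` namespace of the instance returned by `NJSParser`, for example `parser.parser.getFlightData(html)`.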
#### `parser.hasFlightData(html)`
Check if HTML contains flight data.
**Parameters:**
- `html` (string): HTML content
**Returns:** `boolean`
#### `parser.getFlightData(html)`
Extract and parse flight data from HTML.
**Parameters:**
- `html` (string): HTML content
**Returns:** `Object | null` - Parsed flight data or null
#### `parser.hasNextData(html)`
Check if HTML contains `__NEXT_DATA__` script.
**Parameters:**
- `html` (string): HTML content
**Returns:** `boolean`
#### `parser.getNextData(html)`
Extract and parse `__NEXT_DATA__` script.
**Parameters:**
- `html` (string): HTML content
**Returns:** `Object | null` - Parsed JSON data or null
#### `parser.parseBuildManifest(script)`
Parse build manifest script.
**Parameters:**
- `script` (string): Build manifest script content
**Returns:** `Object` - Parsed manifest
#### `parser.getBuildManifestPath(buildId, basePath)`
Generate build manifest path.
**Parameters:**
- `buildId` (string): Build ID
- `basePath` (string, optional): Base path
**Returns:** `string` - Build manifest path
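
As a sketch of the kind of string this produces, the example below joins an optional base path with the manifest location used in the usage section (`/_next/static/BUILD_ID/_buildManifest.js`). `buildManifestPath` is a hypothetical stand-in, and its handling of trailing slashes is an assumption.

```javascript
// Illustration only: buildManifestPath is a hypothetical stand-in that assumes
// the layout <basePath>/_next/static/<buildId>/_buildManifest.js.
function buildManifestPath(buildId, basePath = '') {
  const prefix = basePath.replace(/\/+$/, ''); // drop trailing slashes
  return `${prefix}/_next/static/${buildId}/_buildManifest.js`;
}

console.log(buildManifestPath('abc123'));          // "/_next/static/abc123/_buildManifest.js"
console.log(buildManifestPath('abc123', '/docs')); // "/docs/_next/static/abc123/_buildManifest.js"
```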
#### `parser.getNextStaticUrls(html)`
Find all Next.js static URLs in HTML.
**Parameters:**
- `html` (string): HTML content
**Returns:** `Array<string> | null` - Array of URLs or null
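
To make "static URLs" concrete, the sketch below collects `src` and `href` values that contain `/_next/static/` and returns `null` when there are none, mirroring the documented return type. `findNextStaticUrls` is a hypothetical name, and the regex approach is an assumption for illustration, not the library's method.

```javascript
// Illustration only: findNextStaticUrls is a hypothetical stand-in.
function findNextStaticUrls(html) {
  const pattern = /(?:src|href)="([^"]*\/_next\/static\/[^"]+)"/g;
  const urls = [...html.matchAll(pattern)].map((m) => m[1]);
  return urls.length > 0 ? urls : null;
}

// Hand-written sample markup.
const staticSample =
  '<link href="/docs/_next/static/css/app.css" rel="stylesheet">' +
  '<script src="https://cdn.example.com/_next/static/chunks/main.js"></script>' +
  '<a href="/about">About</a>';

console.log(findNextStaticUrls(staticSample));
```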
#### `parser.getBasePath(htmlOrUrls, removeDomain)`
Extract base path from URLs or HTML.
**Parameters:**
- `htmlOrUrls` (string | Array<string>): HTML content or array of URLs
- `removeDomain` (boolean, optional): Remove domain from absolute URLs
**Returns:** `string | null` - Base path or null
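
One way to picture the base path is as whatever precedes `/_next/static/` in a static URL. The sketch below works on an array of URLs under that assumption. `basePathFrom` is a hypothetical name, and its treatment of `removeDomain` (stripping the scheme and host) is a guess at the intent, not the library's code.

```javascript
// Illustration only: basePathFrom is a hypothetical stand-in.
function basePathFrom(urls, removeDomain = false) {
  for (const url of urls) {
    const cut = url.indexOf('/_next/static/');
    if (cut === -1) continue; // not a static URL, try the next one
    let base = url.slice(0, cut);
    if (removeDomain) base = base.replace(/^(?:https?:)?\/\/[^\/]+/, '');
    return base;
  }
  return null;
}

const sampleUrls = ['https://example.com/docs/_next/static/chunks/main.js'];
console.log(basePathFrom(sampleUrls));       // "https://example.com/docs"
console.log(basePathFrom(sampleUrls, true)); // "/docs"
```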
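### High-Level Tools
As in the usage examples above, these are called directly on the instance returned by `NJSParser`, for example `parser.findBuildId(html)`.
#### `hasNextJS(html)`
Check whether the page has any Next.js data.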
**Parameters:**
- `html` (string): HTML content
**Returns:** `boolean`
#### `findBuildId(html)`
Find build ID from page (searches static URLs, next data, and flight data).
**Parameters:**
- `html` (string): HTML content
**Returns:** `string | null` - Build ID or null
#### `findInFlightData(flightData, classFilters, callback, recursive)`
Find first matching element in flight data.
**Parameters:**
- `flightData` (Object): Flight data dictionary
- `classFilters` (Array, optional): Array of Element classes to filter by
- `callback` (Function, optional): Callback function for filtering
- `recursive` (boolean, optional): Search recursively (default: true)
**Returns:** `Element | null`
#### `findallInFlightData(flightData, classFilters, callback, recursive)`
Find all matching elements in flight data.
**Parameters:**
- `flightData` (Object): Flight data dictionary
- `classFilters` (Array, optional): Array of Element classes to filter by
- `callback` (Function, optional): Callback function for filtering
- `recursive` (boolean, optional): Search recursively (default: true)
**Returns:** `Array<Element>`
#### `finditerInFlightData(flightData, classFilters, callback, recursive)`
Iterator for finding elements in flight data.
**Parameters:**
- `flightData` (Object): Flight data dictionary
- `classFilters` (Array, optional): Array of Element classes to filter by
- `callback` (Function, optional): Callback function for filtering
- `recursive` (boolean, optional): Search recursively (default: true)
**Returns:** `Generator<Element>`
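
The three search functions share the same parameters and differ only in what they return. The sketch below shows how such a trio can be built on one recursive generator. The classes (`FlightElement`, `TextElement`, `DataElement`, `ContainerElement`) and the array-based tree are hypothetical stand-ins and do not reflect the library's real data layout.

```javascript
// Hypothetical stand-ins for the library's element classes.
class FlightElement { constructor(value) { this.value = value; } }
class TextElement extends FlightElement {}
class DataElement extends FlightElement {}
class ContainerElement extends FlightElement {} // value is an array of elements

// Lazily yields every element accepted by both the class filter and callback.
function* findIter(elements, classFilters = [], callback = null, recursive = true) {
  for (const el of elements) {
    const classOk =
      classFilters.length === 0 || classFilters.some((C) => el instanceof C);
    if (classOk && (!callback || callback(el))) yield el;
    if (recursive && el instanceof ContainerElement) {
      yield* findIter(el.value, classFilters, callback, recursive);
    }
  }
}

// "Find all" drains the generator; "find" stops at the first hit.
const findAll = (...args) => [...findIter(...args)];
const find = (...args) => findIter(...args).next().value ?? null;

const tree = [
  new TextElement('hello'),
  new ContainerElement([new DataElement({ user: 'ada' }), new TextElement('nested')]),
];

console.log(findAll(tree, [TextElement]).map((el) => el.value)); // [ 'hello', 'nested' ]
console.log(find(tree, [DataElement], null, false));             // null (not recursive)
```

Building the eager variants on top of the generator keeps the filtering logic in one place and lets callers stop early when they only need the first match.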
### BeautifulFD Class
Simplified interface for working with flight data.
#### `new BeautifulFD(htmlOrFlightData)`
Create BeautifulFD instance.
**Parameters:**
- `htmlOrFlightData` (string | Object): HTML content or parsed flight data
#### `find(classFilters, callback, recursive)`
Find first matching element.
**Parameters:**
- `classFilters` (Array, optional): Array of Element classes
- `callback` (Function, optional): Callback function for filtering
- `recursive` (boolean, optional): Search recursively (default: true)
**Returns:** `Element | null`
#### `find_all(classFilters, callback, recursive)`
Find all matching elements.
**Parameters:**
- `classFilters` (Array, optional): Array of Element classes
- `callback` (Function, optional): Callback function for filtering
- `recursive` (boolean, optional): Search recursively (default: true)
**Returns:** `Array<Element>`
#### `find_iter(classFilters, callback, recursive)`
Iterator for finding elements.
**Parameters:**
- `classFilters` (Array, optional): Array of Element classes
- `callback` (Function, optional): Callback function for filtering
- `recursive` (boolean, optional): Search recursively (default: true)
**Returns:** `Generator<Element>`
#### `as_list()`
Get flight data as array.
**Returns:** `Array<Element>`
#### `static from_list(list, viaEnumerate)`
Create BeautifulFD from array.
**Parameters:**
- `list` (Array): Array of Elements
- `viaEnumerate` (boolean, optional): Use array indices as element indices
**Returns:** `BeautifulFD`
### Element Types
All flight data elements extend the base `Element` class:
- `Element` - Base class
- `HintPreload` - Preload hints (class "HL")
- `Module` - Module imports (class "I")
- `Text` - Text content (class "T")
- `Data` - Structured data
- `EmptyData` - Null/empty data
- `SpecialData` - Special markers (strings starting with "$")
- `HTMLElement` - HTML elements
- `DataContainer` - Container of multiple elements
- `DataParent` - Parent element with children
- `URLQuery` - URL query parameters
- `RSCPayload` - React Server Components payload
- `Error` - Error elements (class "E")
You can access types via `parser.types`:
```javascript
// DOMParser and NJSParser are imported as shown in Basic Setup; html is the page source
const parser = NJSParser({ DOMParser });
const fd = new parser.BeautifulFD(html);
const textElements = fd.find_all([parser.types.Text]);
const modules = fd.find_all([parser.types.Module]);
```
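The class tags listed above ("HL", "I", "T", "E") come from the prefix of each raw flight row. The sketch below splits one row into its index, tag, and payload. `splitFlightRow` and `TAG_TO_TYPE` are hypothetical, and the row grammar is an assumption: a hexadecimal index, a colon, an optional uppercase tag, then the payload. The library's real decoding is more involved.

```javascript
// Tags as listed in this README; everything else here is illustrative.
const TAG_TO_TYPE = { HL: 'HintPreload', I: 'Module', T: 'Text', E: 'Error' };

// Hypothetical helper: splits "<hex index>:<TAG><payload>" into its parts.
function splitFlightRow(row) {
  const match = row.match(/^([0-9a-f]+):([A-Z]*)([\s\S]*)$/);
  if (!match) return null;
  const [, index, tag, payload] = match;
  return {
    index: parseInt(index, 16),
    tag,                             // empty string when the row has no tag
    type: TAG_TO_TYPE[tag] ?? null,  // null for untagged rows
    payload,
  };
}

console.log(splitFlightRow('1:HL["/_next/static/css/app.css","style"]'));
console.log(splitFlightRow('a:{"user":"ada"}'));
```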
### API Utilities
#### `api.getApiPath(buildId, basePath, path)`
Generate API path for a page.
**Parameters:**
- `buildId` (string): Build ID
- `basePath` (string, optional): Base path
- `path` (string, optional): Page path
**Returns:** `string | null` - API path or null for excluded paths
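
For a sense of the output, the sketch below builds a path in the `/_next/data/<buildId>/<page>.json` form that Next.js uses for page data. `apiPathFor` is a hypothetical stand-in, and both the excluded list and the handling of `/` are assumptions, not a copy of the library's rules.

```javascript
// Illustration only: apiPathFor is a hypothetical stand-in. The excluded
// list below is an assumption and may not match the library's own list.
const EXCLUDED_PATHS = new Set(['/404', '/_app', '/_error']);

function apiPathFor(buildId, basePath = '', path = '/index') {
  if (EXCLUDED_PATHS.has(path)) return null;
  const page = path === '/' ? '/index' : path; // the root page maps to index.json
  const prefix = basePath.replace(/\/+$/, ''); // drop trailing slashes
  return `${prefix}/_next/data/${buildId}${page}.json`;
}

console.log(apiPathFor('abc123', '', '/about')); // "/_next/data/abc123/about.json"
console.log(apiPathFor('abc123', '/docs', '/')); // "/docs/_next/data/abc123/index.json"
console.log(apiPathFor('abc123', '', '/_app'));  // null
```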
#### `api.getIndexApiPath(buildId, basePath)`
Generate index API path.
**Parameters:**
- `buildId` (string): Build ID
- `basePath` (string, optional): Base path
**Returns:** `string`
#### `api.isApiExposedFromResponse(statusCode, contentType, text)`
Check if API is exposed from response.
**Parameters:**
- `statusCode` (number): HTTP status code
- `contentType` (string): Content-Type header
- `text` (string): Response text
**Returns:** `boolean`
#### `api.listApiPaths(sortedPages, buildId, basePath, isApiExposed)`
List API paths from build manifest.
**Parameters:**
- `sortedPages` (Array<string>): Sorted pages from build manifest
- `buildId` (string): Build ID
- `basePath` (string): Base path
- `isApiExposed` (boolean, optional): Is API exposed
**Returns:** `Array<string>`
## Differences from Python Version
1. **DOMParser Parameter**: The JavaScript version requires a `DOMParser` class/constructor to be passed in, which keeps it compatible with different environments (Bun, browser, Node.js with jsdom, etc.).
2. **No CLI**: This port covers the library functionality only; no CLI is included.
3. **Factory Pattern**: Uses a factory function to inject dependencies rather than global imports.
4. **Async/Await**: The JavaScript version is designed to work with async/await for fetching HTML.
5. **Native JSON**: Uses native `JSON.parse` and `JSON.stringify` instead of orjson.
6. **Native eval**: Uses JavaScript `eval` for build manifest parsing instead of pythonmonkey.
## Testing
Run tests with Bun:
```bash
bun test
```
## License
MIT
## Credits
This is a JavaScript port of the Python [njsparser](https://github.com/novitae/njsparser) library by novitae.