# NJSParser (JS)

A 99.9% AI-generated (including unit tests & README) JS port of [`novitae/njsparser`](https://github.com/novitae/njsparser) for extracting and parsing Next.js hydration data from HTML content.

- Parses flight data (from the **`self.__next_f.push`** scripts)
- Parses next data from **`__NEXT_DATA__`** script
- Parses **build manifests**
- Searches for **build id**
- Many other things...

This is the JavaScript/TypeScript port of the Python [njsparser](https://github.com/novitae/njsparser) library, designed to run on Bun.

## Installation

```bash
bun add njsparser
```

Or install from source:

```bash
git clone <repository-url>
cd njsparser/js
bun install
```

## DOMParser dependency

This library doesn’t bundle an HTML parser. You must pass a `DOMParser` implementation when creating the parser:

- In Bun tests we use `deno-dom`’s WASM build.
- In a browser you can pass the native `DOMParser`.

## Usage

### Basic Setup (Bun)

```javascript
import { DOMParser } from 'deno-dom/deno-dom-wasm.ts';
import { NJSParser } from 'njsparser';

const parser = NJSParser({ DOMParser });
```

### Basic Setup (Browser)

```javascript
import { NJSParser } from 'njsparser';

const parser = NJSParser({ DOMParser: window.DOMParser });
```

### Parsing Flight Data

Flight data is contained in `self.__next_f.push()` scripts within Next.js pages.

```javascript
import { DOMParser } from 'deno-dom/deno-dom-wasm.ts';
import { NJSParser } from 'njsparser';

const parser = NJSParser({ DOMParser });

const response = await fetch('https://nextjs.org');
const html = await response.text();

const flightData = parser.parser.getFlightData(html);

// Use BeautifulFD for easier searching
const fd = new parser.BeautifulFD(html);

// Find specific element types
for (const data of fd.find_iter([parser.types.Data])) {
  if (data.content && typeof data.content === 'object' && 'user' in data.content) {
    console.log(data.content.user);
    break;
  }
}
```

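
To make the `self.__next_f.push()` format more concrete, the following is a minimal sketch that pulls the raw pushed payloads out of an HTML string with a regular expression. It does not use njsparser or a DOM parser: `extractRawFlightChunks` and the sample HTML are hypothetical, and real pages may need the more careful handling that the library provides.

```javascript
// Hypothetical helper, not part of njsparser: collects the raw arrays that
// inline scripts pass to self.__next_f.push(...).
function extractRawFlightChunks(html) {
  const pattern = /<script[^>]*>\s*self\.__next_f\.push\((.*?)\)\s*;?\s*<\/script>/gs;
  const chunks = [];
  for (const match of html.matchAll(pattern)) {
    try {
      const payload = JSON.parse(match[1]);
      if (Array.isArray(payload)) chunks.push(payload);
    } catch {
      // Skip payloads that are not plain JSON.
    }
  }
  return chunks;
}

// Made-up sample markup; String.raw keeps the backslashes exactly as written.
const sampleFlightHtml = String.raw`
<script>self.__next_f.push([0])</script>
<script>self.__next_f.push([1,"0:[\"$\",\"div\",null,{}]\n"])</script>
`;

const rawChunks = extractRawFlightChunks(sampleFlightHtml);
console.log(rawChunks.length); // 2 for the sample above
```

This only recovers the unparsed payloads; turning them into typed elements is what `getFlightData` and `BeautifulFD` are for.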
### Parsing `__NEXT_DATA__`

```javascript
const parser = NJSParser({ DOMParser });

const html = await fetch('https://example.com').then(r => r.text());
const nextData = parser.parser.getNextData(html);

if (nextData) {
  console.log('Build ID:', nextData.buildId);
  console.log('Page Props:', nextData.props);
}
```

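
For readers unfamiliar with the `__NEXT_DATA__` script, here is a rough sketch of what extracting it involves, written without njsparser. The helper name `extractNextDataSketch` and the sample page are hypothetical, and the regular expression stands in for the DOM-based lookup the library performs with the injected `DOMParser`.

```javascript
// Hypothetical helper, not part of njsparser: finds the JSON embedded in
// <script id="__NEXT_DATA__"> and parses it.
function extractNextDataSketch(html) {
  const match = html.match(
    /<script[^>]*\bid=["']__NEXT_DATA__["'][^>]*>(.*?)<\/script>/s
  );
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch {
    return null; // The script was present but did not contain valid JSON.
  }
}

// Made-up sample page for illustration.
const sampleNextDataPage = `
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"pageProps":{"title":"Hello"}},"page":"/","buildId":"abc123"}
</script>
</body></html>`;

const sampleNextData = extractNextDataSketch(sampleNextDataPage);
console.log(sampleNextData.buildId, sampleNextData.props.pageProps.title);
```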
### Finding Build ID

```javascript
const parser = NJSParser({ DOMParser });

const html = await fetch('https://example.com').then(r => r.text());
const buildId = parser.findBuildId(html);

console.log('Build ID:', buildId);
```

### Parsing Build Manifests

```javascript
const parser = NJSParser({ DOMParser });

const manifestScript = await fetch('https://example.com/_next/static/BUILD_ID/_buildManifest.js')
  .then(r => r.text());

const manifest = parser.parser.parseBuildManifest(manifestScript);
console.log('Manifest:', manifest);
```

## API Reference

### Factory Function

#### `NJSParser({ DOMParser })`

Creates a new parser instance.

**Parameters:**
- `DOMParser` (required): a DOMParser class/constructor to parse HTML

**Returns:** Parser instance with all methods and classes

### Parser Methods

#### `parser.hasFlightData(html)`

Check if HTML contains flight data.

**Parameters:**
- `html` (string): HTML content

**Returns:** `boolean`

#### `parser.getFlightData(html)`

Extract and parse flight data from HTML.

**Parameters:**
- `html` (string): HTML content

**Returns:** `Object | null` - Parsed flight data or null

#### `parser.hasNextData(html)`

Check if HTML contains `__NEXT_DATA__` script.

**Parameters:**
- `html` (string): HTML content

**Returns:** `boolean`

#### `parser.getNextData(html)`

Extract and parse `__NEXT_DATA__` script.

**Parameters:**
- `html` (string): HTML content

**Returns:** `Object | null` - Parsed JSON data or null

#### `parser.parseBuildManifest(script)`

Parse build manifest script.

**Parameters:**
- `script` (string): Build manifest script content

**Returns:** `Object` - Parsed manifest

#### `parser.getBuildManifestPath(buildId, basePath)`

Generate build manifest path.

**Parameters:**
- `buildId` (string): Build ID
- `basePath` (string, optional): Base path

**Returns:** `string` - Build manifest path

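
As an illustration of the kind of path this produces, here is a small sketch based on the URL shape used in the Parsing Build Manifests example (`/_next/static/BUILD_ID/_buildManifest.js`). `buildManifestPathSketch` is a hypothetical stand-in, and its trailing-slash handling is an assumption rather than documented library behavior.

```javascript
// Hypothetical re-implementation for illustration; not the library's code.
function buildManifestPathSketch(buildId, basePath = '') {
  // Assumption: drop trailing slashes so "/docs/" and "/docs" behave the same.
  const prefix = basePath.replace(/\/+$/, '');
  return `${prefix}/_next/static/${buildId}/_buildManifest.js`;
}

console.log(buildManifestPathSketch('abc123'));
console.log(buildManifestPathSketch('abc123', '/docs'));
```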
#### `parser.getNextStaticUrls(html)`

Find all Next.js static URLs in HTML.

**Parameters:**
- `html` (string): HTML content

**Returns:** `Array<string> | null` - Array of URLs or null

#### `parser.getBasePath(htmlOrUrls, removeDomain)`

Extract base path from URLs or HTML.

**Parameters:**
- `htmlOrUrls` (`string | Array<string>`): HTML content or array of URLs
- `removeDomain` (boolean, optional): Remove domain from absolute URLs

**Returns:** `string | null` - Base path or null

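
One way to picture base path extraction is as "everything in a static URL that comes before `/_next/static/`". The sketch below follows that idea for an array of URLs. `basePathSketch` is hypothetical, and details such as returning an empty string for a site with no base path are assumptions that may not match the library.

```javascript
// Hypothetical sketch; the real getBasePath may handle more cases.
function basePathSketch(urls, removeDomain = false) {
  const marker = '/_next/static/';
  for (const url of urls) {
    const index = url.indexOf(marker);
    if (index === -1) continue; // Not a Next.js static URL.
    let base = url.slice(0, index);
    if (removeDomain) {
      // Strip a leading "scheme://host" if the URL is absolute.
      base = base.replace(/^[a-z][a-z0-9+.-]*:\/\/[^/]+/i, '');
    }
    return base;
  }
  return null;
}

console.log(basePathSketch(['/docs/_next/static/chunks/main.js']));
console.log(basePathSketch(['https://example.com/docs/_next/static/chunks/main.js'], true));
```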
### High-Level Tools

#### `hasNextJS(html)`

Check if page has any Next.js data.

**Parameters:**
- `html` (string): HTML content

**Returns:** `boolean`

#### `findBuildId(html)`

Find build ID from page (searches static URLs, next data, and flight data).

**Parameters:**
- `html` (string): HTML content

**Returns:** `string | null` - Build ID or null

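
The fallback order described for `findBuildId` can be sketched roughly as below. This is an assumption-laden illustration, not the library's implementation: `findBuildIdSketch` is a hypothetical name, it uses regular expressions instead of a `DOMParser`, and it omits the flight data step.

```javascript
// Hypothetical sketch of a "try several sources" build ID lookup.
function findBuildIdSketch(html) {
  // 1. Static URLs such as /_next/static/<buildId>/_buildManifest.js
  const fromUrl = html.match(
    /\/_next\/static\/([^/"'\s]+)\/_(?:build|ssg)Manifest\.js/
  );
  if (fromUrl) return fromUrl[1];

  // 2. The buildId field of the __NEXT_DATA__ JSON.
  const nextData = html.match(
    /<script[^>]*\bid=["']__NEXT_DATA__["'][^>]*>(.*?)<\/script>/s
  );
  if (nextData) {
    try {
      const parsed = JSON.parse(nextData[1]);
      if (typeof parsed.buildId === 'string') return parsed.buildId;
    } catch {
      // Ignore malformed JSON and fall through.
    }
  }

  // 3. Flight data lookup is left out of this sketch.
  return null;
}

console.log(
  findBuildIdSketch('<script src="/_next/static/AbC123/_buildManifest.js" defer></script>')
);
```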
#### `findInFlightData(flightData, classFilters, callback, recursive)`

Find first matching element in flight data.

**Parameters:**
- `flightData` (Object): Flight data dictionary
- `classFilters` (Array, optional): Array of Element classes to filter by
- `callback` (Function, optional): Callback function for filtering
- `recursive` (boolean, optional): Search recursively (default: true)

**Returns:** `Element | null`

#### `findallInFlightData(flightData, classFilters, callback, recursive)`

Find all matching elements in flight data.

**Parameters:**
- `flightData` (Object): Flight data dictionary
- `classFilters` (Array, optional): Array of Element classes to filter by
- `callback` (Function, optional): Callback function for filtering
- `recursive` (boolean, optional): Search recursively (default: true)

**Returns:** `Array<Element>`

#### `finditerInFlightData(flightData, classFilters, callback, recursive)`

Iterator for finding elements in flight data.

**Parameters:**
- `flightData` (Object): Flight data dictionary
- `classFilters` (Array, optional): Array of Element classes to filter by
- `callback` (Function, optional): Callback function for filtering
- `recursive` (boolean, optional): Search recursively (default: true)

**Returns:** `Generator<Element>`

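
The three search functions share the same parameters, so it may help to see how class filters, a callback, and recursion could combine in a generator. The sketch below uses hypothetical stand-in classes (`FdElement`, `FdText`, `FdData`, `FdContainer`) and an assumed `value` property; the library's own element classes and traversal rules may differ.

```javascript
// Hypothetical stand-ins for the library's element classes.
class FdElement {
  constructor(value) {
    this.value = value;
  }
}
class FdText extends FdElement {}
class FdData extends FdElement {}
class FdContainer extends FdElement {} // value: array of child elements

// Yields elements that pass both the class filter and the callback.
function* findIterSketch(flightData, classFilters = [], callback = null, recursive = true) {
  const matches = (el) =>
    (classFilters.length === 0 || classFilters.some((cls) => el instanceof cls)) &&
    (callback === null || callback(el));

  function* walk(el) {
    if (matches(el)) yield el;
    if (recursive && el instanceof FdContainer) {
      for (const child of el.value) yield* walk(child);
    }
  }

  for (const el of Object.values(flightData)) yield* walk(el);
}

const sampleElements = {
  0: new FdText('hello'),
  1: new FdContainer([new FdData({ user: 'ada' }), new FdText('nested')]),
};

const allTexts = [...findIterSketch(sampleElements, [FdText])];
const topLevelTexts = [...findIterSketch(sampleElements, [FdText], null, false)];
const firstUser = findIterSketch(sampleElements, [FdData], (el) => 'user' in el.value).next().value;
```

In this sketch, a "find first" is the iterator's first value and a "find all" is the iterator spread into an array, which mirrors the relationship between the three functions documented above.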
### BeautifulFD Class

Simplified interface for working with flight data.

#### `new BeautifulFD(htmlOrFlightData)`

Create BeautifulFD instance.

**Parameters:**
- `htmlOrFlightData` (string | Object): HTML content or parsed flight data

#### `find(classFilters, callback, recursive)`

Find first matching element.

**Parameters:**
- `classFilters` (Array, optional): Array of Element classes
- `callback` (Function, optional): Callback function for filtering
- `recursive` (boolean, optional): Search recursively (default: true)

**Returns:** `Element | null`

#### `find_all(classFilters, callback, recursive)`

Find all matching elements.

**Parameters:**
- `classFilters` (Array, optional): Array of Element classes
- `callback` (Function, optional): Callback function for filtering
- `recursive` (boolean, optional): Search recursively (default: true)

**Returns:** `Array<Element>`

#### `find_iter(classFilters, callback, recursive)`

Iterator for finding elements.

**Parameters:**
- `classFilters` (Array, optional): Array of Element classes
- `callback` (Function, optional): Callback function for filtering
- `recursive` (boolean, optional): Search recursively (default: true)

**Returns:** `Generator<Element>`

#### `as_list()`

Get flight data as array.

**Returns:** `Array<Element>`

#### `static from_list(list, viaEnumerate)`

Create BeautifulFD from array.

**Parameters:**
- `list` (Array): Array of Elements
- `viaEnumerate` (boolean, optional): Use array indices as element indices

**Returns:** `BeautifulFD`

### Element Types

All flight data elements extend the base `Element` class:

- `Element` - Base class
- `HintPreload` - Preload hints (class "HL")
- `Module` - Module imports (class "I")
- `Text` - Text content (class "T")
- `Data` - Structured data
- `EmptyData` - Null/empty data
- `SpecialData` - Special markers (strings starting with "$")
- `HTMLElement` - HTML elements
- `DataContainer` - Container of multiple elements
- `DataParent` - Parent element with children
- `URLQuery` - URL query parameters
- `RSCPayload` - React Server Components payload
- `Error` - Error elements (class "E")

You can access types via `parser.types`:

```javascript
const parser = NJSParser({ DOMParser });
const fd = new parser.BeautifulFD(html);

const textElements = fd.find_all([parser.types.Text]);
const modules = fd.find_all([parser.types.Module]);
```

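
The class codes in the list above ("HL", "I", "T", "E") can be read as a short uppercase prefix on a flight data row. The sketch below splits a row on the assumption that it has the shape `<hex index>:<optional class code><value>`. `splitFlightRowSketch` and the sample rows are hypothetical, and the library's real parsing is more involved.

```javascript
// Hypothetical helper: splits one flight data row into its index,
// class code, and raw value. Assumes the index is written in hexadecimal.
function splitFlightRowSketch(row) {
  const match = row.match(/^([0-9a-f]+):([A-Z]*)(.*)$/s);
  if (!match) return null;
  const [, index, classCode, raw] = match;
  return { index: parseInt(index, 16), classCode: classCode || null, raw };
}

// Made-up rows for illustration.
const hintRow = splitFlightRowSketch('1:HL["/_next/static/css/app.css","style"]');
const moduleRow = splitFlightRowSketch('a:I[123,["chunk.js"],"default"]');
const plainRow = splitFlightRowSketch('0:["$","div",null,{}]');

console.log(hintRow.classCode, moduleRow.classCode, plainRow.classCode);
```

Rows without a prefix carry plain values, which is presumably where types such as `Data`, `EmptyData`, or `HTMLElement` come from once the value itself is inspected.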
### API Utilities

#### `api.getApiPath(buildId, basePath, path)`

Generate API path for a page.

**Parameters:**
- `buildId` (string): Build ID
- `basePath` (string, optional): Base path
- `path` (string, optional): Page path

**Returns:** `string | null` - API path or null for excluded paths

#### `api.getIndexApiPath(buildId, basePath)`

Generate index API path.

**Parameters:**
- `buildId` (string): Build ID
- `basePath` (string, optional): Base path

**Returns:** `string`

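
Next.js serves page data for the Pages Router under `/_next/data/<buildId>/<page>.json`, so the two helpers above can be pictured roughly as follows. The function names and the exclusion list are assumptions made for this sketch; the paths njsparser actually excludes may differ.

```javascript
// Assumed exclusion list for illustration; the library's actual list may differ.
const EXCLUDED_PATHS_SKETCH = ['/404', '/_app', '/_error', '/_document'];

// Hypothetical counterpart of getIndexApiPath.
function indexApiPathSketch(buildId, basePath = '') {
  return `${basePath}/_next/data/${buildId}/index.json`;
}

// Hypothetical counterpart of getApiPath.
function apiPathSketch(buildId, basePath = '', path = '/') {
  if (EXCLUDED_PATHS_SKETCH.includes(path)) return null;
  if (path === '/' || path === '') return indexApiPathSketch(buildId, basePath);
  const normalized = path.startsWith('/') ? path : `/${path}`;
  return `${basePath}/_next/data/${buildId}${normalized}.json`;
}

console.log(apiPathSketch('abc123', '', '/blog/hello'));
console.log(apiPathSketch('abc123', '/docs'));
```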
#### `api.isApiExposedFromResponse(statusCode, contentType, text)`

Check if API is exposed from response.

**Parameters:**
- `statusCode` (number): HTTP status code
- `contentType` (string): Content-Type header
- `text` (string): Response text

**Returns:** `boolean`

#### `api.listApiPaths(sortedPages, buildId, basePath, isApiExposed)`

List API paths from build manifest.

**Parameters:**
- `sortedPages` (`Array<string>`): Sorted pages from build manifest
- `buildId` (string): Build ID
- `basePath` (string): Base path
- `isApiExposed` (boolean, optional): Is API exposed

**Returns:** `Array<string>`

## Differences from Python Version

1. **DOMParser Parameter**: The JavaScript version requires a `DOMParser` implementation (class/constructor) to be passed in, which keeps it compatible with different environments (Bun, browser, Node.js with jsdom, etc.).

2. **No CLI**: This port focuses on library functionality only. No CLI is included.

3. **Factory Pattern**: Uses a factory function to inject dependencies rather than global imports.

4. **Async/Await**: The JavaScript version is designed to work with async/await for fetching HTML.

5. **Native JSON**: Uses native `JSON.parse` and `JSON.stringify` instead of orjson.

6. **Native eval**: Uses JavaScript `eval` for build manifest parsing instead of pythonmonkey.

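
To illustrate the last point, a `_buildManifest.js` file is a script that assigns to `self.__BUILD_MANIFEST`, so it has to be evaluated rather than parsed as JSON. The sketch below shows one way to do that with a stand-in `self` object. It uses the `Function` constructor rather than a direct `eval` call, and `evaluateBuildManifestSketch` and the sample script are hypothetical, not the library's code.

```javascript
// Hypothetical sketch: run a build manifest script against a fake `self`
// object and read back what it assigned.
function evaluateBuildManifestSketch(script) {
  const fakeSelf = {};
  new Function('self', script)(fakeSelf);
  return fakeSelf.__BUILD_MANIFEST ?? null;
}

// Made-up manifest script in the general shape Next.js emits.
const sampleManifestScript =
  'self.__BUILD_MANIFEST=function(a){return{"/":[a],' +
  '"/about":["static/chunks/pages/about.js"],' +
  'sortedPages:["/","/_app","/about"]}}("static/chunks/pages/index.js"),' +
  'self.__BUILD_MANIFEST_CB&&self.__BUILD_MANIFEST_CB();';

const sampleManifest = evaluateBuildManifestSketch(sampleManifestScript);
console.log(sampleManifest.sortedPages);
```

Evaluating a fetched script executes whatever code it contains, so this approach is only appropriate for manifests from sources you are prepared to trust.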
## Testing

Run tests with Bun:

```bash
bun test
```

## License

MIT

## Credits

This is a JavaScript port of the Python [njsparser](https://github.com/novitae/njsparser) library by novitae.