Data Parser Engine — Schema Studio

Reference

What is a schema?

A schema is a small object that tells DPE (the Data Parser Engine) what kind of file you're handing it and how to extract records from it. The engine reads the file, follows the schema, and returns mapped JSON records plus diagnostics.

The schema you build in this Studio is exactly what gets passed to DataParser.parse(file, schema) in production. The DPE schema output tab shows that object — copy it from there when you're done.
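As a rough sketch of that hand-off, the helper below calls parse(file, schema) and picks out the result fields this reference mentions (result.data, result.raw, result.errors). The helper name parseWithSchema and its parser argument are hypothetical; in production the parser would be DataParser. Whether parse is synchronous or returns a promise is not stated here, so the sketch simply awaits it.

```javascript
// Hypothetical helper, not part of DPE. `parser` is any object exposing
// parse(file, schema), for example DataParser in production.
async function parseWithSchema(parser, file, schema) {
    // await tolerates both a plain return value and a promise (assumption).
    const result = await parser.parse(file, schema);
    return {
        records: result.data,        // final mapped records (Data tab)
        preMapping: result.raw,      // records before mapping (Pre-mapping tab)
        errors: result.errors ?? []  // collected errors when strict mode is off
    };
}
```

A call would look like parseWithSchema(DataParser, file, schema), with schema being the object copied from the DPE schema tab.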

The output pipeline

A parse runs through four stages, each visible as its own output tab:

RAW input
The file's original text as received, byte-for-byte. For spreadsheet and document files, which have no RAW text, the tab shows "Binary format" instead.
Post-filter
Text after dropRegex lines have been stripped. Identical to RAW if no dropRegex is set.
Pre-mapping
Records as the format parser produced them, before mapping reshapes the fields. This is what shows up at result.raw.
Data
The final mapped records (result.data). Identical to Pre-mapping if no mapping is defined.

A fifth tab, DPE schema, shows the schema object that was just submitted to the engine — the artifact this session is producing. It refreshes on Parse, not as you edit the form.

Saving & loading your work

No accounts, nothing stored. DPE parses your files entirely in your browser — nothing is uploaded and nothing is kept on a server. You save your work as files on your own disk and reload them here. Every example you load exercises these controls.

Download schema
On the DPE schema tab, Download saves the schema as dpe-schema.json — the same object Copy produces. It's the portable engine schema: drop it into any DPE consumer, or re-upload it here later.
Upload schema
In 1. Input, "upload a saved schema" reads a dpe-schema.json and repopulates the whole form (including the mapping). Then load a data file and Parse.
Download results
On the Data tab, after a parse: Download JSON saves the records losslessly (handles nested objects); Download CSV saves them as CSV.
CSV dialect
Header row of field names, every value double-quoted, comma-delimited, UTF-8. For other dialects — or nested data — use JSON export.
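To make the CSV dialect concrete, here is a minimal sketch of an exporter that follows the description above. It is not DPE's exporter. The function name toCsv is hypothetical, and three details are assumptions the text does not settle: header names are quoted like values, embedded double quotes are escaped by doubling them, and rows are joined with \n.

```javascript
// Illustrative only: mirrors the dialect described above, not DPE's code.
function toCsv(records) {
    if (records.length === 0) return '';
    // Assumption: the header row comes from the first record's keys.
    const fields = Object.keys(records[0]);
    // Assumption: embedded quotes are doubled; null/undefined become "".
    const quote = (value) => '"' + String(value ?? '').replace(/"/g, '""') + '"';
    const lines = [fields.map(quote).join(',')];
    for (const record of records) {
        lines.push(fields.map((field) => quote(record[field])).join(','));
    }
    return lines.join('\n');
}

const csvText = toCsv([{ sku: 'A1', name: 'Bolt "M8"' }]);
```

A nested object passed through String() would come out as "[object Object]", which is why JSON export is the better fit for nested data.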

Round-trip: tune a schema → download it → next time, upload the schema, load the data file, Parse, download the results. Naming and file management are yours.

Form options

format
Required. The file type. Extension is ignored. Choices: csv, prn, txt, fixed, xml, json, passthrough, xls, xlsx, ods, docx, odt.
layout
Only for prn and txt. delimited = fields separated by a character (comma, pipe, tab). fixed = fields at known column positions.
encoding
Character encoding for text formats. Default utf-8. Use windows-1252, iso-8859-1, cp437, etc. for legacy dumps.
dropRegex
One regex per line. Each is compiled with the m flag. Any line matching any pattern is removed before the parser sees the file. Use it to strip page headers, dates, banner separators, decorative === lines. Applies to all text formats; ignored (with a warning) for spreadsheets / documents.
mapping
Output shaping. In the form, the syntax is output = input, one pair per line or comma-separated; in the schema object, the same pairs appear as { output: input } (see Recipes). The input may be a field name (sku), a dot path on nested objects (Material.Category), an XML attribute (@_id), a text node (Pricing.Buy.#text), or a 1-indexed positional integer (3). Missing values become null.
mapping mode
replace (default) — output contains only the mapped target keys. extend — output contains all source fields with mapped target keys overlaid on top.
strict mode
If on, any error rejects the parse instead of being collected into result.errors.
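The difference between the two mapping modes is easiest to see on one record. The sketch below is an illustration under stated assumptions, not DPE's implementation: applyMapping is a hypothetical name, and it handles flat field names only.

```javascript
// Illustrative sketch of replace vs. extend; not DPE's implementation.
// `mapping` is { targetKey: sourceFieldName }, flat field names only.
function applyMapping(record, mapping, mode = 'replace') {
    // extend starts from a copy of all source fields; replace starts empty.
    const out = mode === 'extend' ? { ...record } : {};
    for (const [target, source] of Object.entries(mapping)) {
        out[target] = record[source] ?? null; // missing values become null
    }
    return out;
}

const sourceRecord = { sku: 'A1', descr: 'Bolt', qty: 4 };
const fieldMapping = { id: 'sku', name: 'descr', price: 'buy' };

const replaced = applyMapping(sourceRecord, fieldMapping);
// { id: 'A1', name: 'Bolt', price: null }
const extended = applyMapping(sourceRecord, fieldMapping, 'extend');
// { sku: 'A1', descr: 'Bolt', qty: 4, id: 'A1', name: 'Bolt', price: null }
```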

Format-specific knobs

Delimited — csv, prn+delimited, txt+delimited

delimiter
Single character. Use the literal character (for example , or | or ;), \t for tab, or auto to let PapaParse autodetect.
quote char
Encloses field values that contain the delimiter. Default ".
first row is headers
If yes, the first row's values become field names (records are objects). If no, records are arrays; pair with positional mapping. In the schema object this option is hasHeaders.

Fixed-width — fixed, prn+fixed, txt+fixed

fieldDefinitions
JSON array, required. Each entry: { name, start, end, trim? }. start is inclusive and end is exclusive; both are 0-based column positions. trim defaults to true.
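The start/end convention maps directly onto String.prototype.slice, as this minimal sketch shows. It is only an illustration of the rules above, not DPE's parser. The function name sliceFixed and the sample row are made up.

```javascript
// Illustrative sketch of fixed-width extraction; not DPE's parser.
function sliceFixed(line, fieldDefinitions) {
    const record = {};
    for (const { name, start, end, trim = true } of fieldDefinitions) {
        const value = line.slice(start, end); // start inclusive, end exclusive, 0-based
        record[name] = trim ? value.trim() : value;
    }
    return record;
}

const sampleDefinitions = [
    { name: 'sku', start: 0, end: 6 },
    { name: 'name', start: 6, end: 16 },
    { name: 'qty', start: 16, end: 20, trim: false }
];
// Build a 20-column sample row: 6 + 10 + 4 columns.
const sampleLine = 'A1'.padEnd(6) + 'Bolt M8'.padEnd(10) + '12'.padStart(4);
const fixedRecord = sliceFixed(sampleLine, sampleDefinitions);
// { sku: 'A1', name: 'Bolt M8', qty: '  12' }
```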

XML / JSON

rootPath
Dot path to the array (or single record) to extract from the parsed tree, e.g. Envelope.Body.Records.Record. Omit to use the root.
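A dot path can be resolved by walking the parsed tree one key at a time. The sketch below shows the idea and is not DPE's code. extractRoot is a hypothetical name, and two behaviours are assumptions: a single record is wrapped in an array, and a path that does not resolve yields an empty array.

```javascript
// Illustrative sketch of rootPath resolution; not DPE's implementation.
function extractRoot(tree, rootPath) {
    let node = tree;
    if (rootPath) {
        for (const key of rootPath.split('.')) {
            if (node == null) return []; // assumption: unresolved path gives no records
            node = node[key];
        }
    }
    if (node == null) return [];
    // Assumption: a single record is wrapped so callers always get an array.
    return Array.isArray(node) ? node : [node];
}

const envelopeTree = {
    Envelope: { Body: { Records: { Record: [{ id: 1 }, { id: 2 }] } } }
};
const rootRecords = extractRoot(envelopeTree, 'Envelope.Body.Records.Record');
// [{ id: 1 }, { id: 2 }]
```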

Spreadsheet — xls, xlsx, ods

sheetName
Defaults to the workbook's first sheet.

Passthrough

No knobs. Output is [{ line, text }], one record per input line, where line is the 1-indexed line number and text is the line's content. Useful when you want to inspect (or positionally map) a text file DPE doesn't structurally parse: raw EDIFACT / X12 dumps, log files, anything line-oriented.
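The record shape can be pictured with a few lines of code. This is a sketch of the described output, not DPE's implementation. The function name toPassthroughRecords is hypothetical, the sample lines are made up, and how DPE treats line endings or a trailing newline is an assumption.

```javascript
// Illustrative sketch of passthrough output; not DPE's implementation.
function toPassthroughRecords(text) {
    // Assumption: lines end in \n or \r\n, and a trailing newline
    // does not produce an extra empty record.
    const lines = text.split(/\r?\n/);
    if (lines.length > 0 && lines[lines.length - 1] === '') lines.pop();
    return lines.map((lineText, index) => ({ line: index + 1, text: lineText }));
}

// Made-up sample lines in an EDIFACT-like style.
const ediRecords = toPassthroughRecords("UNB+UNOA:1+SENDER+RECEIVER'\nUNH+1+ORDERS:D:96A:UN'\n");
// [ { line: 1, text: "UNB+UNOA:1+SENDER+RECEIVER'" },
//   { line: 2, text: "UNH+1+ORDERS:D:96A:UN'" } ]
```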

Recipes

Messy delimited file with header/footer cruft

{
    format: 'csv',
    delimiter: '|',
    dropRegex: ['^=', '^Generated:', '^-{3,}', '^END OF']
}

The dropRegex patterns strip banner separators, date stamps, dash lines, and trailing summaries. The parser only sees actual data rows.
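The effect of those patterns can be reproduced with a small filter. This is a sketch of the Post-filter stage as described, not DPE's code. The function name applyDropRegex and the sample report text are made up.

```javascript
// Illustrative sketch of dropRegex filtering; not DPE's implementation.
function applyDropRegex(text, patterns) {
    // Each pattern is compiled with the m flag, as the reference describes.
    const regexes = patterns.map((pattern) => new RegExp(pattern, 'm'));
    return text
        .split('\n')
        .filter((line) => !regexes.some((re) => re.test(line)))
        .join('\n');
}

// Made-up sample report with header/footer cruft.
const reportText = [
    '==========',
    'Generated: 2024-01-01',
    'sku|name|qty',
    '----------',
    'A1|Bolt|4',
    'END OF REPORT'
].join('\n');

const filteredText = applyDropRegex(reportText, ['^=', '^Generated:', '^-{3,}', '^END OF']);
// 'sku|name|qty\nA1|Bolt|4'
```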

Passthrough for EDIFACT / X12

{
    format: 'passthrough',
    mapping: { segment: 1, payload: 2 }
}

Each line of the EDI dump becomes a record { line, text }. The positional mapping renames those two fields by index (1 = line, 2 = text), so segment receives the line number and payload receives the line's text.

Positional mapping for headerless CSV

{
    format: 'csv',
    hasHeaders: false,
    mapping: { sku: 1, name: 2, buy: 3, sell: 4 }
}

With hasHeaders: false, PapaParse returns each row as an array. Positional sources (1-indexed) map array slots to target names.
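The 1-indexed lookup amounts to subtracting one before indexing the row array. A minimal sketch, assuming array rows as described above; mapPositional is a hypothetical name and the sample row is made up.

```javascript
// Illustrative sketch of positional mapping on array rows; not DPE's code.
function mapPositional(row, mapping) {
    const out = {};
    for (const [target, position] of Object.entries(mapping)) {
        // Positions are 1-indexed; a slot that does not exist becomes null.
        out[target] = row[position - 1] ?? null;
    }
    return out;
}

const positionalMapping = { sku: 1, name: 2, buy: 3, sell: 4 };
const mappedRow = mapPositional(['A1', 'Bolt', '0.10', '0.25'], positionalMapping);
// { sku: 'A1', name: 'Bolt', buy: '0.10', sell: '0.25' }
const shortRow = mapPositional(['A1', 'Bolt'], positionalMapping);
// { sku: 'A1', name: 'Bolt', buy: null, sell: null }
```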

Nested XML with attribute and text-node mapping

{
    format: 'xml',
    rootPath: 'Envelope.Body.Records.Record',
    mapping: {
        id: '@_id',
        category: 'Material.Category',
        buy: 'Pricing.Buy.#text'
    }
}

rootPath drills into the parsed tree. The @_ prefix accesses XML attributes (fast-xml-parser convention). #text accesses the text node of an element that also has attributes.
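To show what those source paths point at, the sketch below applies the recipe's mapping to one hand-written record. The record only resembles what fast-xml-parser produces when attributes are kept; actual value types depend on parser options. getPath and mapRecord are hypothetical helpers, not DPE functions.

```javascript
// Hand-written stand-in for one parsed <Record id="42"> element.
const parsedRecord = {
    '@_id': '42',                                   // attribute on <Record>
    Material: { Category: 'Steel' },                // plain nested element
    Pricing: {
        Buy: { '#text': '1.20', '@_currency': 'EUR' } // element with attribute and text
    }
};

// Walk a dot path; '@_id' and '#text' are ordinary keys in the parsed tree.
const getPath = (obj, path) =>
    path.split('.').reduce((node, key) => (node == null ? undefined : node[key]), obj);

function mapRecord(record, mapping) {
    const out = {};
    for (const [target, sourcePath] of Object.entries(mapping)) {
        out[target] = getPath(record, sourcePath) ?? null; // missing values become null
    }
    return out;
}

const xmlMapped = mapRecord(parsedRecord, {
    id: '@_id',
    category: 'Material.Category',
    buy: 'Pricing.Buy.#text'
});
// { id: '42', category: 'Steel', buy: '1.20' }
```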

1. Input

Select the buyer's export file to tune a schema against.

Loads a sample file from samples/ and pre-fills the schema.

When off, loading an example leaves the Mapping field empty so you can see the raw parser output before any mapping is applied.

Reads a dpe-schema.json (downloaded from the schema tab) and repopulates the whole form. Then load a data file above and Parse.

2. Schema

format & layout

File type. Extension is ignored.

common

Default utf-8. Use windows-1252, iso-8859-1, cp437 for legacy dumps.

Lines matching any pattern are stripped before parsing. Each pattern is compiled with the m flag, so ^ and $ anchor to individual lines. Text formats only.

Syntax: output = input, one per line or comma-separated. Source may be a field name (with dot-path for nested), or a positive integer for positional access (1-indexed). Comments (#, //) supported.
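One way to read that text syntax into an object is sketched below. It is an illustration, not the Studio's parser. parseMappingText is a hypothetical name, and it assumes comments occupy a whole line, so that a source such as Pricing.Buy.#text is not mistaken for a comment.

```javascript
// Illustrative sketch of reading "output = input" pairs; not the Studio's parser.
function parseMappingText(text) {
    const mapping = {};
    for (const rawLine of text.split('\n')) {
        const line = rawLine.trim();
        // Assumption: only whole-line comments (# or //) are recognised.
        if (line === '' || line.startsWith('#') || line.startsWith('//')) continue;
        for (const pair of line.split(',')) {
            const eq = pair.indexOf('=');
            if (eq === -1) continue; // skip fragments without "="
            const target = pair.slice(0, eq).trim();
            const source = pair.slice(eq + 1).trim();
            if (target === '' || source === '') continue;
            // Positive integers are positional (1-indexed); anything else is a name or path.
            mapping[target] = /^[1-9]\d*$/.test(source) ? Number(source) : source;
        }
    }
    return mapping;
}

const mappingText = [
    '# identifiers',
    'id = @_id, category = Material.Category',
    '// prices',
    'buy = Pricing.Buy.#text',
    'qty = 3'
].join('\n');
const parsedMapping = parseMappingText(mappingText);
// { id: '@_id', category: 'Material.Category', buy: 'Pricing.Buy.#text', qty: 3 }
```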

Only applies when a mapping is defined above.

If on, any error rejects the parse instead of being collected in result.errors.

When on, the schema sent to the engine includes every option relevant to the chosen format, even when at its default value. Useful for self-documenting schemas.

3. Output

Stage 1 — original file content as received, before any processing.

Stage 2 — text after dropRegex stripping.

Stage 3 — records as the format parser produced them, before mapping reshapes the fields.

Stage 4 — final mapped records (result.data). Download as JSON (lossless, nested-safe) or CSV (header row, every value double-quoted, comma-delimited, UTF-8).

The schema object the engine just received, which is the artifact this Studio session is producing. Copy it, or download it as dpe-schema.json to reuse in any DPE consumer (or re-upload here later). Refreshes on Parse.
