Commit 4e3fba6

feat: update package version and dependencies; refactor database item UID generation
- Bump version from 1.1.10 to 2.0.0 in package.json
- Update dependencies: glob (10.3.10 to 13.0.0), js-yaml (4.1.0 to 4.1.1), remark-directive (3.0.0 to 4.0.0), remark-gfm (4.0.0 to 4.0.1), unified (11.0.4 to 11.0.5)
- Remove unused item UID generation in structure_db.js
- Clean up pnpm-lock.yaml to reflect updated dependencies and remove deprecated packages
1 parent a0dd2fb commit 4e3fba6

7 files changed: +196 -642 lines changed

README.md

Lines changed: 49 additions & 198 deletions
Original file line number | Diff line number | Diff line change
@@ -1,12 +1,11 @@
11
# Content Structure
2-
content-structure collects all your markdown files meta data and parses the Abstract Syntax Tree of each file
2+
Parsed markdown is stored in SQLite tables that can be used for rendering and database content management.
3+
4+
![design](design.drawio.svg)
35

46
## Deepwiki
57
https://deepwiki.com/MicroWebStacks/content-structure
68

7-
## Concept
8-
![design](design.drawio.svg)
9-
109
# install
1110
prerequisites
1211
- choco
@@ -23,74 +22,55 @@ collect all data by running this once
2322
```javascript
2423
import {collect} from 'content-structure'
2524

26-
await collect()
25+
await collect({
26+
rootdir:rootdir,
27+
contentdir:join(rootdir,"content"),
28+
file_link_ext:["svg","webp","png","jpeg","jpg","xlsx","glb"],
29+
outdir:join(rootdir,".structure")
30+
})
31+
```
32+
see demo with
33+
```cmd
34+
>pnpm run demo
35+
> node parse.js
36+
37+
content_dir : C:\dev\MicroWebStacks\content-structure\example\content
38+
searching for files with extensions : *.md
39+
Structure DB tables and row counts:
40+
- asset_info: 19
41+
- assets: 19
42+
- blob_store: 14
43+
- documents: 30
44+
- items: 82
2745
```
28-
then use as follows
29-
```javascript
30-
import {getDocuments, getEntry} from 'content-structure'
3146

32-
const documents = await getDocuments()
33-
console.log(`obtained ${documents.length} documents`)
3447

35-
const image_entry = await getEntry({slug:"image"})
36-
const images_urls = image_entry.data.images.map(image=>image.url)
37-
console.log(`'image' content entry has following images '${images_urls}'`)
38-
console.log(`image meta data payload '${image_entry.meta_data ?? 'none'}'`)
48+
# Documentation
49+
Content Structure produces a relational snapshot of every markdown run using the schema declared in [`catalog.yaml`](./catalog.yaml).
50+
The catalog defines a single `structure` dataset whose tables are optimized for rendering, search indexing, and asset management. Each run populates these tables under `.structure/structure.db`.
3951

40-
```
41-
will output
42-
```shell
43-
obtained 14 documents
44-
'image' content entry has following images './tree.svg,./long-diagram.svg'
45-
image meta data payload '{"hero":"Dendrogram"}'
46-
```
52+
### Table overview
53+
| Table | Purpose | Relationships |
54+
| --- | --- | --- |
55+
| `documents` | Canonical row per markdown entry. Stores stable ids, routing metadata, and leftover front matter via the `meta_data` JSON column. | `items`, `assets`, and `asset_info` reference `documents.sid`. |
56+
| `items` | Flattened AST stream in reading order. Each row keeps `body_text` for simple rendering plus an optional serialized AST subtree for nested constructs (stored in `ast`). | References `documents` via `doc_sid`; `assets` rows connect items to blobs when an AST node produces a file. |
57+
| `assets` | Run-specific join table so consumers can tell which document referenced which asset at a given `version_id`. | Bridge between `documents` and `asset_info`; also carries the `blob_uid` for quick payload lookups. |
58+
| `asset_info` | Deduplicated description of every asset (code blocks, tables, linked files, etc.) regardless of run. | Points to the owning document (`parent_doc_uid`) and the physical payload via `blob_uid`. |
59+
| `blob_store` | Source of truth for payloads. Large blobs are stored under `blobs/YYYY/MM/ff/hash` and referenced by path, while small blobs inline their bytes (compressed when eligible). | `asset_info`/`assets` link to blobs through `blob_uid`. |
4760

48-
# Roadmap
49-
- [x] provide an API for querying documents content-by-x
50-
- [x] extracting svg text and span content with jsdom
51-
- [x] replace refs with a reference node
52-
- [x] test hierarchical content
53-
- [ ] files with same name as folder count as folder type
54-
- [ ] test combined content e.g. code inside table, image inside table
55-
- [ ] provide an API for querying image-by-x, table-by-x,...
56-
- [ ] helper for search engine injection
57-
- [ ] check compatibility with content-collections
58-
- [ ] add optional typecheck
59-
60-
## ideas
61-
* parse other images types for text extraction
61+
The catalog is intentionally compact: fields are named to match DOM concerns (`slug`, `url_type`, `level`), content analysis (`headings`, `links`, `code`), and asset lifecycle (`first_seen`, `last_seen`). Instead of memorizing every column, browse [`catalog.yaml`](./catalog.yaml) whenever you need the exact types or to extend the dataset. Downstream tools can rely on the catalog as the authoritative contract when generating queries, migrations, or analytics dashboards.
6262

63-
# Documentation
64-
## Documents fields description
65-
### Metadata
66-
Documents expose a `meta_data` column that stores the JSON representation of any metadata fields not mapped directly to schema columns.
67-
In multi-document mode (default) markdown front matter is split into known schema fields (e.g., `title`, `slug`, `tags`, etc.) and leftover fields. The leftovers are serialized to JSON and stored in `meta_data`.
68-
When `folder_single_doc` is enabled, every folder is treated as a single document:
69-
70-
1. All markdown files inside the folder are concatenated alphabetically and parsed as one document. Front matter is ignored.
71-
2. The first YAML/YML file inside the same folder is parsed, its known fields override document columns, and any extra keys are serialized into `meta_data`.
72-
73-
Metadata is therefore always collocated with the document row itself—no additional assets are created just to store free-form fields.
74-
75-
### Ordering
76-
Documents also expose an `order` column. When you omit it, Content Structure assigns numbers per directory-and-level group using the alphabetical listing of siblings, ensuring menus can render in a predictable order. If you declare `order` in front matter or the folder YAML, those positions are reserved and any remaining siblings automatically fill the lowest available gaps.
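The gap-filling rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `assignOrder` is a hypothetical helper, and it assumes that numbering starts at 1 and that siblings are sorted alphabetically by `slug`.

```javascript
// Hypothetical helper, not part of the content-structure API.
// siblings: [{ slug, order? }] for one directory-and-level group.
function assignOrder(siblings) {
  const isPinned = (doc) => Number.isInteger(doc.order);

  // Positions declared in front matter or folder YAML are reserved.
  const reserved = new Set(siblings.filter(isPinned).map((doc) => doc.order));

  // Remaining siblings are visited alphabetically (assumption: by slug).
  const unpinned = siblings
    .filter((doc) => !isPinned(doc))
    .sort((a, b) => (a.slug < b.slug ? -1 : a.slug > b.slug ? 1 : 0));

  let next = 1; // assumption: numbering starts at 1
  const assigned = new Map();
  for (const doc of unpinned) {
    while (reserved.has(next)) next += 1; // skip slots that are already taken
    assigned.set(doc.slug, next);
    reserved.add(next);
  }

  return siblings
    .map((doc) => ({
      ...doc,
      order: isPinned(doc) ? doc.order : assigned.get(doc.slug),
    }))
    .sort((a, b) => a.order - b.order);
}
```

With `assignOrder([{ slug: "b" }, { slug: "a" }, { slug: "c", order: 2 }])`, the pinned document `c` keeps position 2, while `a` and `b` fill the lowest free slots 1 and 3.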
77-
78-
### URL type
79-
Content structure allows both file and folder URL types to be used at the same time without the need of user configuration.
80-
If a markdown file is named `readme.md` or matches the parent directory name, it is treated as a folder document (`url_type: "dir"`); any other filename is considered a file document.
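As a rough sketch of this rule, the classification could look like the snippet below. `urlType` is a hypothetical helper, not a function exported by the package. It assumes forward-slash paths relative to the content directory and an exact, case-sensitive match, since case handling is not specified here. It covers only the markdown filename rule, not the `folder_single_doc` mode.

```javascript
// Hypothetical helper, not part of the content-structure API.
// relPath: path relative to the content directory, using "/" separators.
function urlType(relPath) {
  const parts = relPath.split("/");
  const fileName = parts[parts.length - 1];
  const parentDir = parts.length > 1 ? parts[parts.length - 2] : "";

  // Strip the extension: "readme.md" -> "readme".
  const baseName = fileName.replace(/\.[^.]+$/, "");

  // Folder document: readme.md, or a file named after its parent directory.
  if (baseName === "readme") return "dir";
  if (parentDir !== "" && baseName === parentDir) return "dir";

  // Any other filename is a file document.
  return "file";
}
```

Under these assumptions, `urlType("title-complex/readme.md")` and `urlType("authors/agatha-christie/agatha-christie.md")` both return `"dir"`, while `urlType("authors/agatha-christie/notes.md")` returns `"file"`.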
81-
82-
The field `url_type` will also be exposed for the user as in the example entry below
83-
```json
84-
{
85-
"sid": "a518c9b7",
86-
"uid": "authors.agatha-christie",
87-
"path": "authors/agatha-christie/entry.yml",
88-
"url_type": "dir",
89-
"slug": "agatha-christie",
90-
"meta_data": "{\"featured\":true}"
91-
...
92-
}
93-
```
63+
### Document behavior highlights
64+
- **Metadata folding** – Any front matter not mapped to a declared column is serialized into `documents.meta_data`, keeping schemas manageable without losing context.
65+
- **Automatic ordering** – Documents inherit incremental `order` values scoped to their directory level unless you pin them explicitly. This keeps navigation menus stable even when markdown files are added later.
66+
- **Mixed routing** – Folder-style (`readme.md` or matching filenames) and file-style URLs coexist. `url_type` reveals which variant was used to generate the url.
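
The metadata folding behavior listed above can be illustrated with a short sketch. `foldMetadata` and `KNOWN_FIELDS` are hypothetical names, and the field list is an illustrative subset; the declared columns are defined in `catalog.yaml`.

```javascript
// Assumption: an illustrative subset of declared columns (see catalog.yaml).
const KNOWN_FIELDS = new Set(["title", "slug", "tags", "order"]);

// Hypothetical helper, not part of the content-structure API.
// Splits front matter into declared columns and a JSON-encoded remainder.
function foldMetadata(frontMatter) {
  const columns = {};
  const leftovers = {};
  for (const [key, value] of Object.entries(frontMatter)) {
    if (KNOWN_FIELDS.has(key)) {
      columns[key] = value;
    } else {
      leftovers[key] = value;
    }
  }
  // Leftover fields are serialized to JSON; null when there are none.
  columns.meta_data =
    Object.keys(leftovers).length > 0 ? JSON.stringify(leftovers) : null;
  return columns;
}
```

For example, `foldMetadata({ title: "Image", hero: "Image" })` keeps `title` as a column and stores `{"hero":"Image"}` as the `meta_data` string.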
67+
68+
### Item and asset lifecycle
69+
- Paragraphs, headings, tables, code blocks, and images are all represented in `items`. Simple rows expose fully extracted text; nested structures store their sanitized AST so you can re-render bold or embedded assets without reparsing the original markdown.
70+
- Every asset mentioned by an item produces two entries: a durable definition in `asset_info` and a run-scoped membership row in `assets`. The membership row ties the asset to both the document and its blob so you can know exactly when something was added, removed, or reused.
71+
- Blob payloads avoid bloat with configurable thresholds: large files stream to disk under `blobs/`, while smaller text blobs can be compressed inline and served straight from SQLite.
72+
73+
Refer back to the catalog for exhaustive field notes, and treat the tables above as the primary contract between your markdown source and any rendering or analytics layers.
9474

9575
## Config parameters
9676
the config parameter is optional and do have default values
@@ -102,135 +82,6 @@ the config parameter is optional and do have default values
10282
* `file_compress_ext` : defaults to `["txt","md","json","csv","tsv","yaml","yml"]`. Inline blobs are compressed only if their source extension (when known) appears in this list.
10383

10484
## Generated output
105-
* `gen/document_list.json`
106-
* documents : a list of documents properties
107-
* slug : auto generated if not provided
108-
* uid : autogenerated and unique across all documents
109-
* sid : a short uid with first 8 letters of the md5 hash, for simplified referencing e.g. in data directories or links
110-
* meta_data : JSON string of any remaining frontmatter/YAML fields that do not match schema columns
111-
* images : a list of images properties. These images were parsed from the markdown text content and not from the filesystem
112-
* heading : the heading id of the section the image belongs to
113-
* title : from the image link meta data
114-
* document : the document the image was referenced in
115-
* each markdown file gets a `./gen/documents/<sid>` directors with
116-
* `tree.json` the raw output of the remark AST parser
117-
* content.json with the parameters and parsed content parameters
118-
* `.structure/structure.db` : a SQLite database (powered by better-sqlite3) that mirrors the JSON output.
119-
The database exposes the tables `documents`, `items`, `assets`, `asset_info`, and `blob_store`. Each `documents` row now includes the `version_id` of the run that produced it, plus an optional `meta_data` JSON string whenever leftover metadata fields are detected.
120-
Repeating values are normalised into dedicated tables, while any retained list uses a `*_list` column that stores a JSON string of the related ids.
121-
Items flatten the AST of every markdown document using a stable `version_id` per run and now embed inline asset references directly as `asset://type/asset_uid` Markdown tokens. Complex items with nested formatting retain their AST subtree as a JSON string in the optional `ast` column so recursive rendering data is not lost.
122-
`assets` rows keep per-run joins between assets and documents without placeholder ids, `asset_info` rows store the asset catalog metadata, and `blob_store` rows capture the blob hash, byte size, storage directory (when persisted to disk), inline payloads for small blobs, and a compression flag (`true`/`false` or `null` when the payload lives on disk).
123-
124-
## Example generated output
125-
126-
this files structure
127-
```shell
128-
└───content
129-
├───title-complex
130-
│ readme.md
131-
├───text-simple
132-
│ readme.md
133-
...
134-
```
135-
generates this output
136-
```shell
137-
└─gen
138-
│ document_list.json
139-
└───documents
140-
├───35298154
141-
│ content.json
142-
│ tree.json
143-
├───12b0e722
144-
│ content.json
145-
│ tree.json
146-
...
147-
```
148-
* `document_list.json` is the documents index
149-
```json
150-
[
151-
{
152-
"sid": "35298154",
153-
"uid": "title-complex",
154-
"path": "title-complex/readme.md",
155-
"url_type": "dir",
156-
"slug": "title-complex",
157-
"title": "title Complex",
158-
"meta_data": "{\"hero\":\"title\"}"
159-
},
160-
{
161-
"sid": "12b0e722",
162-
"uid": "text-simple",
163-
"path": "text-simple/readme.md",
164-
"url_type": "dir",
165-
"slug": "text-simple",
166-
"title": "Text Simple",
167-
"meta_data": null
168-
},
169-
...
170-
```
171-
* file content example
172-
```markdown
173-
---
174-
title: Image
175-
---
176-
![Tree](./tree.svg)
177-
178-
```
179-
example of generated files for `image/readme.md` which has an sid of `78805a22`
180-
```json
181-
{
182-
"sid": "78805a22",
183-
"uid": "image",
184-
"path": "image/readme.md",
185-
"url_type": "dir",
186-
"slug": "image",
187-
"title": "Image",
188-
"meta_data": "{\"hero\":\"Image\"}",
189-
"headings": [],
190-
"tables": [],
191-
"images": [
192-
{
193-
"id": "tree",
194-
"heading": null,
195-
"title": null,
196-
"url": "./tree.svg",
197-
"alt": "Tree",
198-
"label": ""
199-
}
200-
],
201-
"code": [],
202-
"paragraphs": [
203-
{
204-
"heading": null,
205-
"label": []
206-
},
207-
{
208-
"heading": null,
209-
"label": []
210-
}
211-
]
212-
}
213-
```
214-
215-
and the beginning of `tree.json`
216-
217-
```json
218-
{
219-
"type": "root",
220-
"children": [
221-
{
222-
"type": "paragraph",
223-
"children": [
224-
{
225-
"type": "image",
226-
"title": null,
227-
"url": "./tree.svg",
228-
"alt": "Tree",
229-
"position": {
230-
"start": {
231-
"line": 1,
232-
"column": 1,
233-
"offset": 0
234-
},
235-
...
236-
```
85+
* `.structure/structure.db` : a SQLite database (powered by better-sqlite3).
86+
The database exposes the tables `documents`, `items`, `assets`, `asset_info`, and `blob_store`.
87+
* `blobs/year/month/prefix/hash` path for all files larger than `config.external_storage_kb`
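
A minimal sketch of how such a path could be derived is shown below. Everything beyond the path layout and the size threshold is an assumption: `blobLocation` is a hypothetical helper, `prefix` is taken to be the first two characters of the hash, the year and month come from a caller-supplied timestamp in UTC, and one KB is treated as 1024 bytes.

```javascript
// Hypothetical helper, not part of the content-structure API.
// blob: { hash: string, byteSize: number, seenAt: Date }
// externalStorageKb: threshold corresponding to config.external_storage_kb
function blobLocation(blob, externalStorageKb) {
  // Assumption: 1 KB = 1024 bytes; only strictly larger files go to disk.
  if (blob.byteSize <= externalStorageKb * 1024) {
    return { inline: true, path: null };
  }
  const year = String(blob.seenAt.getUTCFullYear());
  const month = String(blob.seenAt.getUTCMonth() + 1).padStart(2, "0");
  const prefix = blob.hash.slice(0, 2); // assumption: first two hash characters
  return { inline: false, path: `blobs/${year}/${month}/${prefix}/${blob.hash}` };
}
```

Blobs at or below the threshold are reported as inline here, which mirrors the small-blob behavior described for `blob_store`.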

catalog.yaml

Lines changed: 0 additions & 1 deletion
@@ -23,7 +23,6 @@ datasets:
2323
- name: items
2424
description: Flattened AST items representing headings, paragraphs, and asset-backed nodes in reading order.
2525
columns:
26-
- { name: uid, type: string, primary: true, description: Unique item identifier scoped by document and version }
2726
- { name: version_id, type: string, description: Encoded seconds-since-2000 identifier for the collection run }
2827
- { name: doc_sid, type: string, description: SID of the parent document }
2928
- { name: type, type: string, description: Item type such as heading, paragraph, table, code, or image }
