feat: update package version and dependencies; refactor database item UID generation
- Bump version from 1.1.10 to 2.0.0 in package.json
- Update dependencies: glob (10.3.10 to 13.0.0), js-yaml (4.1.0 to 4.1.1), remark-directive (3.0.0 to 4.0.0), remark-gfm (4.0.0 to 4.0.1), unified (11.0.4 to 11.0.5)
- Remove unused item UID generation in structure_db.js
- Clean up pnpm-lock.yaml to reflect updated dependencies and remove deprecated packages
console.log(`'image' content entry has following images '${images_urls}'`)
38
-
console.log(`image meta data payload '${image_entry.meta_data??'none'}'`)
48
+
# Documentation
49
+
Content Structure produces a relational snapshot of every markdown run using the schema declared in [`catalog.yaml`](./catalog.yaml).
50
+
The catalog defines a single `structure` dataset whose tables are optimized for rendering, search indexing, and asset management. Each run populates these tables under `.structure/structure.db`.
39
51
40
-
```
41
-
will output
42
-
```shell
43
-
obtained 14 documents
44
-
'image' content entry has following images './tree.svg,./long-diagram.svg'
45
-
image meta data payload '{"hero":"Dendrogram"}'
46
-
```
52
+
### Table overview
53
+
| Table | Purpose | Relationships |
54
+
| --- | --- | --- |
55
+
|`documents`| Canonical row per markdown entry. Stores stable ids, routing metadata, and leftover front matter via the `meta_data` JSON column. |`items`, `assets`, and `asset_info` reference `documents.sid`. |
56
+
|`items`| Flattened AST stream in reading order. Each row keeps `body_text` for simple rendering plus an optional serialized AST subtree for nested constructs (stored in `ast`). | References `documents` via `doc_sid`; `assets` rows connect items to blobs when an AST node produces a file. |
57
+
|`assets`| Run-specific join table so consumers can tell which document referenced which asset at a given `version_id`. | Bridge between `documents` and `asset_info`; also carries the `blob_uid` for quick payload lookups. |
58
+
|`asset_info`| Deduplicated description of every asset (code blocks, tables, linked files, etc.) regardless of run. | Points to the owning document (`parent_doc_uid`) and the physical payload via `blob_uid`. |
59
+
|`blob_store`| Source of truth for payloads. Large blobs are stored under `blobs/YYYY/MM/ff/hash` and referenced by path, while small blobs inline their bytes (compressed when eligible). |`asset_info`/`assets` link to blobs through `blob_uid`. |
47
60
48
-
# Roadmap
49
-
- [x] provide an API for querying documents content-by-x
50
-
- [x] extracting svg text and span content with jsdom
51
-
- [x] replace refs with a reference node
52
-
- [x] test hierarchical content
53
-
- [ ] files with same name as folder count as folder type
54
-
- [ ] test combined content e.g. code inside table, image inside table
55
-
- [ ] provide an API for querying image-by-x, table-by-x,...
56
-
- [ ] helper for search engine injection
57
-
- [ ] check compatibility with content-collections
58
-
- [ ] add optional typecheck
59
-
60
-
## ideas
61
-
* parse other images types for text extraction
61
+
The catalog is intentionally compact: fields are named to match DOM concerns (`slug`, `url_type`, `level`), content analysis (`headings`, `links`, `code`), and asset lifecycle (`first_seen`, `last_seen`). Instead of memorizing every column, browse [`catalog.yaml`](./catalog.yaml) whenever you need the exact types or to extend the dataset. Downstream tools can rely on the catalog as the authoritative contract when generating queries, migrations, or analytics dashboards.
62
62
63
-
# Documentation
64
-
## Documents fields description
65
-
### Metadata
66
-
Documents expose a `meta_data` column that stores the JSON representation of any metadata fields not mapped directly to schema columns.
67
-
In multi-document mode (default) markdown front matter is split into known schema fields (e.g., `title`, `slug`, `tags`, etc.) and leftover fields. The leftovers are serialized to JSON and stored in `meta_data`.
68
-
When `folder_single_doc` is enabled, every folder is treated as a single document:
69
-
70
-
1. All markdown files inside the folder are concatenated alphabetically and parsed as one document. Front matter is ignored.
71
-
2. The first YAML/YML file inside the same folder is parsed, its known fields override document columns, and any extra keys are serialized into `meta_data`.
72
-
73
-
Metadata is therefore always collocated with the document row itself—no additional assets are created just to store free-form fields.
74
-
75
-
### Ordering
76
-
Documents also expose an `order` column. When you omit it, Content Structure assigns numbers per directory-and-level group using the alphabetical listing of siblings, ensuring menus can render in a predictable order. If you declare `order` in front matter or the folder YAML, those positions are reserved and any remaining siblings automatically fill the lowest available gaps.
77
-
78
-
### URL type
79
-
Content structure allows both file and folder URL types to be used at the same time without the need of user configuration.
80
-
If a markdown file is named `readme.md` or matches the parent directory name, it is treated as a folder document (`url_type: "dir"`); any other filename is considered a file document.
81
-
82
-
The field `url_type` will also be exposed for the user as in the example entry below
83
-
```json
84
-
{
85
-
"sid": "a518c9b7",
86
-
"uid": "authors.agatha-christie",
87
-
"path": "authors/agatha-christie/entry.yml",
88
-
"url_type": "dir",
89
-
"slug": "agatha-christie",
90
-
"meta_data": "{\"featured\":true}"
91
-
...
92
-
}
93
-
```
63
+
### Document behavior highlights
64
+
- **Metadata folding** – Any front matter not mapped to a declared column is serialized into `documents.meta_data`, keeping schemas manageable without losing context.
65
+
- **Automatic ordering** – Documents inherit incremental `order` values scoped to their directory level unless you pin them explicitly. This keeps navigation menus stable even when markdown files are added later.
66
+
- **Mixed routing** – Folder-style (`readme.md` or matching filenames) and file-style URLs coexist. `url_type` reveals which variant was used to generate the URL.
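The three behaviors listed above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions, not the package's implementation: the helper names (`foldMetadata`, `urlType`, `assignOrder`), the `KNOWN_FIELDS` set, the `"file"` value for file-style documents, exact (case-sensitive) filename matching, and numbering that starts at 1 are all hypothetical.

```javascript
// Hypothetical helpers; none of these names come from the Content Structure code base.

// Assumption: the front-matter keys that map to declared columns.
const KNOWN_FIELDS = new Set(["title", "slug", "tags", "order"]);

// Metadata folding: known keys stay columns, the rest is serialized to JSON.
function foldMetadata(frontMatter) {
  const columns = {};
  const leftovers = {};
  for (const [key, value] of Object.entries(frontMatter)) {
    (KNOWN_FIELDS.has(key) ? columns : leftovers)[key] = value;
  }
  const hasLeftovers = Object.keys(leftovers).length > 0;
  return { ...columns, meta_data: hasLeftovers ? JSON.stringify(leftovers) : null };
}

// Mixed routing: `readme.md` or a file named after its parent folder is a
// folder document. Assumption: any other file is reported as "file".
function urlType(relativePath) {
  const parts = relativePath.split("/");
  const baseName = parts[parts.length - 1].replace(/\.md$/, "");
  const parentDir = parts.length > 1 ? parts[parts.length - 2] : null;
  return baseName === "readme" || baseName === parentDir ? "dir" : "file";
}

// Automatic ordering for one directory-and-level group: pinned `order` values
// are reserved, the remaining siblings (sorted by slug) fill the lowest gaps.
// Assumption: slugs are unique within the group and numbering starts at 1.
function assignOrder(siblings) {
  const taken = new Set(siblings.map((d) => d.order).filter((o) => o != null));
  const unpinned = siblings.filter((d) => d.order == null).map((d) => d.slug).sort();
  const assigned = new Map();
  let next = 1;
  for (const slug of unpinned) {
    while (taken.has(next)) next += 1;
    assigned.set(slug, next);
    taken.add(next);
  }
  return siblings.map((d) => ({ ...d, order: d.order ?? assigned.get(d.slug) }));
}
```

For example, `assignOrder([{ slug: "c" }, { slug: "a" }, { slug: "b", order: 2 }])` keeps `b` at position 2 and gives `a` and `c` the free positions 1 and 3.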
67
+
68
+
### Item and asset lifecycle
69
+
- Paragraphs, headings, tables, code blocks, and images are all represented in `items`. Simple rows expose fully extracted text; nested structures store their sanitized AST so you can re-render bold or embedded assets without reparsing the original markdown.
70
+
- Every asset mentioned by an item produces two entries: a durable definition in `asset_info` and a run-scoped membership row in `assets`. The membership row ties the asset to both the document and its blob so you can know exactly when something was added, removed, or reused.
71
+
- Blob payloads avoid bloat with configurable thresholds: large files stream to disk under `blobs/`, while smaller text blobs can be compressed inline and served straight from SQLite.
72
+
73
+
Refer back to the catalog for exhaustive field notes, and treat the tables above as the primary contract between your markdown source and any rendering or analytics layers.
94
74
95
75
## Config parameters
96
76
The config parameter is optional; every option has a default value.
@@ -102,135 +82,6 @@ the config parameter is optional and do have default values
102
82
* `file_compress_ext` : defaults to `["txt","md","json","csv","tsv","yaml","yml"]`. Inline blobs are compressed only if their source extension (when known) appears in this list.
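As a rough illustration of the `file_compress_ext` rule, the sketch below compresses an inline payload only when its source extension is known and listed. The helper name `prepareInlineBlob` is hypothetical, and gzip is an assumption: the description above does not say which codec is used.

```javascript
const zlib = require("node:zlib");

// Default list copied from the `file_compress_ext` description above.
const DEFAULT_COMPRESS_EXT = ["txt", "md", "json", "csv", "tsv", "yaml", "yml"];

// Hypothetical helper: returns the inline payload plus a `compressed` flag.
// Assumption: gzip stands in for whatever codec the package really uses.
function prepareInlineBlob(bytes, sourceExt, compressExt = DEFAULT_COMPRESS_EXT) {
  // An unknown extension (null/undefined) is never eligible.
  const eligible = sourceExt != null && compressExt.includes(sourceExt);
  if (!eligible) return { payload: bytes, compressed: false };
  return { payload: zlib.gzipSync(bytes), compressed: true };
}
```

Calling `prepareInlineBlob(buffer, "csv")` returns a gzip payload with `compressed: true`, while `prepareInlineBlob(buffer, "png")` or a call with a `null` extension returns the original bytes untouched.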
103
83
104
84
## Generated output
105
-
* `gen/document_list.json`
106
-
* documents : a list of documents properties
107
-
* slug : auto generated if not provided
108
-
* uid : autogenerated and unique across all documents
109
-
* sid : a short uid with first 8 letters of the md5 hash, for simplified referencing e.g. in data directories or links
110
-
* meta_data : JSON string of any remaining frontmatter/YAML fields that do not match schema columns
111
-
* images : a list of images properties. These images were parsed from the markdown text content and not from the filesystem
112
-
* heading : the heading id of the section the image belongs to
113
-
* title : from the image link meta data
114
-
* document : the document the image was referenced in
115
-
* each markdown file gets a `./gen/documents/<sid>` directors with
116
-
* `tree.json` the raw output of the remark AST parser
117
-
* content.json with the parameters and parsed content parameters
118
-
* `.structure/structure.db` : a SQLite database (powered by better-sqlite3) that mirrors the JSON output.
119
-
The database exposes the tables `documents`, `items`, `assets`, `asset_info`, and `blob_store`. Each `documents` row now includes the `version_id` of the run that produced it, plus an optional `meta_data` JSON string whenever leftover metadata fields are detected.
120
-
Repeating values are normalised into dedicated tables, while any retained list uses a `*_list` column that stores a JSON string of the related ids.
121
-
Items flatten the AST of every markdown document using a stable `version_id` per run and now embed inline asset references directly as `asset://type/asset_uid` Markdown tokens. Complex items with nested formatting retain their AST subtree as a JSON string in the optional `ast` column so recursive rendering data is not lost.
122
-
`assets` rows keep per-run joins between assets and documents without placeholder ids, `asset_info` rows store the asset catalog metadata, and `blob_store` rows capture the blob hash, byte size, storage directory (when persisted to disk), inline payloads for small blobs, and a compression flag (`true`/`false` or `null` when the payload lives on disk).
123
-
124
-
## Example generated output
125
-
126
-
this files structure
127
-
```shell
128
-
└───content
129
-
├───title-complex
130
-
│ readme.md
131
-
├───text-simple
132
-
│ readme.md
133
-
...
134
-
```
135
-
generates this output
136
-
```shell
137
-
└─gen
138
-
│ document_list.json
139
-
└───documents
140
-
├───35298154
141
-
│ content.json
142
-
│ tree.json
143
-
├───12b0e722
144
-
│ content.json
145
-
│ tree.json
146
-
...
147
-
```
148
-
* `document_list.json` is the documents index
149
-
```json
150
-
[
151
-
{
152
-
"sid": "35298154",
153
-
"uid": "title-complex",
154
-
"path": "title-complex/readme.md",
155
-
"url_type": "dir",
156
-
"slug": "title-complex",
157
-
"title": "title Complex",
158
-
"meta_data": "{\"hero\":\"title\"}"
159
-
},
160
-
{
161
-
"sid": "12b0e722",
162
-
"uid": "text-simple",
163
-
"path": "text-simple/readme.md",
164
-
"url_type": "dir",
165
-
"slug": "text-simple",
166
-
"title": "Text Simple",
167
-
"meta_data": null
168
-
},
169
-
...
170
-
```
171
-
* file content example
172
-
```markdown
173
-
---
174
-
title: Image
175
-
---
176
-

177
-
178
-
```
179
-
example of generated files for `image/readme.md` which has an sid of `78805a22`
180
-
```json
181
-
{
182
-
"sid": "78805a22",
183
-
"uid": "image",
184
-
"path": "image/readme.md",
185
-
"url_type": "dir",
186
-
"slug": "image",
187
-
"title": "Image",
188
-
"meta_data": "{\"hero\":\"Image\"}",
189
-
"headings": [],
190
-
"tables": [],
191
-
"images": [
192
-
{
193
-
"id": "tree",
194
-
"heading": null,
195
-
"title": null,
196
-
"url": "./tree.svg",
197
-
"alt": "Tree",
198
-
"label": ""
199
-
}
200
-
],
201
-
"code": [],
202
-
"paragraphs": [
203
-
{
204
-
"heading": null,
205
-
"label": []
206
-
},
207
-
{
208
-
"heading": null,
209
-
"label": []
210
-
}
211
-
]
212
-
}
213
-
```
214
-
215
-
and the beginning of `tree.json`
216
-
217
-
```json
218
-
{
219
-
"type": "root",
220
-
"children": [
221
-
{
222
-
"type": "paragraph",
223
-
"children": [
224
-
{
225
-
"type": "image",
226
-
"title": null,
227
-
"url": "./tree.svg",
228
-
"alt": "Tree",
229
-
"position": {
230
-
"start": {
231
-
"line": 1,
232
-
"column": 1,
233
-
"offset": 0
234
-
},
235
-
...
236
-
```
85
+
* `.structure/structure.db` : a SQLite database (powered by better-sqlite3).
86
+
The database exposes the tables `documents`, `items`, `assets`, `asset_info`, and `blob_store`.
87
+
* `blobs/year/month/prefix/hash` : storage path for every file larger than `config.external_storage_kb`
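The path layout in the last item can be illustrated with a small sketch. Everything beyond the `blobs/year/month/prefix/hash` shape and the `external_storage_kb` threshold is an assumption: the helper name `blobLocation` is hypothetical, SHA-256 stands in for the unspecified hash, the prefix is taken to be the first two hex characters of the hash, and the year and month come from the run date.

```javascript
const nodeCrypto = require("node:crypto");

// Hypothetical helper. Assumptions: SHA-256 hash, two-character hex prefix,
// and year/month derived from the (UTC) date of the run.
function blobLocation(bytes, config, now = new Date()) {
  const hash = nodeCrypto.createHash("sha256").update(bytes).digest("hex");
  // Only files larger than the threshold are written under blobs/.
  if (bytes.length <= config.external_storage_kb * 1024) {
    return { hash, path: null }; // small payloads stay inside the database
  }
  const year = String(now.getUTCFullYear());
  const month = String(now.getUTCMonth() + 1).padStart(2, "0");
  return { hash, path: `blobs/${year}/${month}/${hash.slice(0, 2)}/${hash}` };
}
```

With `external_storage_kb: 1`, a 2 KiB buffer gets a `blobs/...` path, while a 10-byte buffer returns `path: null`.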