A structured, machine-readable version of Liddell, Scott, Jones, A Greek-English Lexicon, 9th edition (1940). The base text is in the public domain. This dataset adds OCR corrections, structured parsing, and grammatical annotations. Used by Dilemma (Greek lemmatizer).
| File | Description |
|---|---|
| lsj9_headwords.json | 119,450 headwords with grammar, etymology, genitive, homograph markers |
| lsj9_forms.tsv | 63,389 entries with explicit grammatical info (article/adjective type, genitive ending) |
| lsj9_glosses.jsonl | 177,190 hierarchical glosses (definitions with citations stripped) |
| lsj9_refs.tsv | 211,585 structured references (author, work, passage) |
| lsj9_frequency.json | Reference counts per headword (55,495 entries) |
| lsj9_indeclinables.json | Indeclinable entries with POS categories (adverbs, prepositions, conjunctions, particles, interjections) |
| lsj9_authors.json | Authors & works abbreviation table (from front matter) |
| lsj9_abbreviations.json | General abbreviations, epigraphical publications, papyri, periodicals |

Each entry in lsj9_headwords.json:

    {
      "id": 42,
      "headword": "ἄβαξ",
      "grammar": "ὁ",
      "genitive": "ᾰκος",
      "etymology": null
    }

- grammar: article (ὁ/ἡ/τό) for nouns, adjective ending (ον/ές) for adjectives, null for verbs
- genitive: genitive ending extracted from the first line, where available (17,667 entries)
- etymology: parenthesized etymology note, where present (11,335 entries)
- homograph: (A)/(B) marker for entries sharing a headword

Glosses follow the LSJ numbering system:
- major (I, II, III): top-level sense divisions
- minor (1, 2, 3): sub-senses within a major division
- sub (a, b, c): sub-sub-senses
- unnumbered: entries without numbered divisions (single gloss)
Each gloss has a parent_id linking to its containing sense.
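The parent_id links can be followed to rebuild the sense tree. The sketch below is illustrative only: `parent_id` is the one field named above, while `id`, `level`, and `text` are assumed field names, and the sample records are placeholders rather than real LSJ data.

```python
import json
from collections import defaultdict

# Hypothetical records: only `parent_id` is documented above; `id`, `level`,
# and `text` are assumed field names used for illustration.
SAMPLE_JSONL = """\
{"id": 1, "parent_id": null, "level": "major", "text": "sense I"}
{"id": 2, "parent_id": 1, "level": "minor", "text": "sense I.1"}
{"id": 3, "parent_id": 2, "level": "sub", "text": "sense I.1.a"}
{"id": 4, "parent_id": null, "level": "major", "text": "sense II"}
"""

def build_children_index(lines):
    """Map each parent_id (None for top-level glosses) to its child records."""
    children = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:
            record = json.loads(line)
            children[record["parent_id"]].append(record)
    return children

def walk(children, parent_id=None, depth=0):
    """Yield (depth, record) pairs in depth-first order."""
    for record in children.get(parent_id, []):
        yield depth, record
        yield from walk(children, record["id"], depth + 1)

index = build_children_index(SAMPLE_JSONL.splitlines())
for depth, record in walk(index):
    # Indent each gloss by its depth in the hierarchy.
    print("  " * depth + f'{record["level"]}: {record["text"]}')
```

Indexing children by parent keeps the traversal linear in the number of glosses, which matters at roughly 177,000 records.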
lsj9_forms.tsv is tab-separated, with the columns headword, grammar, genitive, and etymology. It is designed for consumption by inflection generators (e.g., Wiktionary template expansion).
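A minimal sketch of reading the four tab-separated columns with Python's standard csv module. It assumes there is no header row and that missing values are empty fields; neither is confirmed here, and `read_forms` is a hypothetical helper. The sample row reuses the ἄβαξ values from the headword example.

```python
import csv
import io

COLUMNS = ["headword", "grammar", "genitive", "etymology"]

# One illustrative row built from the ἄβαξ example; the empty last field
# stands in for a missing etymology (an assumption about the encoding).
SAMPLE_TSV = "ἄβαξ\tὁ\tᾰκος\t\n"

def read_forms(handle):
    """Yield one dict per row, mapping empty fields to None."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad short rows so every column is present.
        padded = row + [""] * (len(COLUMNS) - len(row))
        yield {name: (value or None) for name, value in zip(COLUMNS, padded)}

forms = list(read_forms(io.StringIO(SAMPLE_TSV)))
print(forms[0])
```

For the real file, the same function can be passed a handle from `open(path, encoding="utf-8", newline="")`.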
These derived files are built from the raw data by build_exports.py. They provide ready-to-use formats for downstream consumers.
| File | Description | Consumers |
|---|---|---|
| lsj9_headwords_flat.json | 118,764 headword strings (deduplicated, length-marks stripped) | dilemma (headword-set filtering) |
| lsj9_headword_pos.json | 135,180 headword-to-UPOS mappings (NOUN, ADJ, VERB, ADV, ADP, etc.) | dilemma (POS disambiguation) |
| lsj9_crossrefs.json | 7,678 cross-reference mappings (headword to target headwords) | LSJ10 Kindle edition |
| lsj9_short_defs.json | 111,506 clean English short definitions per headword (includes 3,625 resolved cross-references) | dilemma |
| lsj9_glosses_flat.json | 115,330 headwords with 173,317 flattened English glosses | dilemma |
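How build_exports.py strips length marks is not documented here. One plausible approach, sketched below, is to decompose each string, drop the combining macron and breve, and recompose. `strip_length_marks` and `flatten_headwords` are hypothetical names, and the sample strings are illustrative, not real headwords.

```python
import unicodedata

# Combining macron (U+0304) and combining breve (U+0306) mark vowel length.
LENGTH_MARKS = {"\u0304", "\u0306"}

def strip_length_marks(word):
    """Remove macrons and breves while keeping breathings and accents."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in LENGTH_MARKS)
    return unicodedata.normalize("NFC", kept)

def flatten_headwords(headwords):
    """Strip length marks and deduplicate, preserving first-seen order."""
    return list(dict.fromkeys(strip_length_marks(hw) for hw in headwords))

# Illustrative strings: breve, plain, and macron variants of the same letters.
sample = [
    "\u1fb0\u03ba\u03bf\u03c2",  # ᾰκος
    "\u03b1\u03ba\u03bf\u03c2",  # ακος
    "\u1fb1\u03ba\u03bf\u03c2",  # ᾱκος
]
flat = flatten_headwords(sample)
print(flat)
```

Recomposing to NFC at the end keeps the output comparable with other NFC-normalized Greek text.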

To rebuild after updating raw data:

    python build_exports.py

lsj9_headword_pos.json maps each headword to a UPOS tag:

    {
      "ἄβαξ": "NOUN",
      "ἀγαθός": "ADJ",
      "ἄγω": "VERB",
      "ἄγαν": "ADV"
    }

Sources: grammar field (ὁ/ἡ/τό -> NOUN, ον/ές -> ADJ), verb-ending heuristics (-ω/-μι/-μαι -> VERB), and lsj9_indeclinables.json (ADV, ADP, CCONJ, PART, INTJ).

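The three sources can be combined roughly as follows. This is a sketch, not the logic in build_exports.py: the function name `guess_upos`, the order in which the sources are consulted, and the exact-match comparison on the grammar field are all assumptions.

```python
from typing import Dict, Optional

ARTICLES = {"ὁ", "ἡ", "τό"}          # grammar values that mark a noun
ADJ_ENDINGS = {"ον", "ές"}           # grammar values that mark an adjective
VERB_ENDINGS = ("ω", "μι", "μαι")    # headword endings that suggest a verb

def guess_upos(headword: str, grammar: Optional[str],
               indeclinables: Dict[str, str]) -> Optional[str]:
    """Return a UPOS tag, or None when no rule applies."""
    if grammar in ARTICLES:
        return "NOUN"
    if grammar in ADJ_ENDINGS:
        return "ADJ"
    # Consult the indeclinables table before the ending heuristic, so that
    # an adverb ending in -ω is not mistaken for a verb (assumed ordering).
    if headword in indeclinables:
        return indeclinables[headword]
    if headword.endswith(VERB_ENDINGS):
        return "VERB"
    return None

examples = [("ἄβαξ", "ὁ"), ("ἄγω", None), ("ἄγαν", None)]
indeclinables = {"ἄγαν": "ADV"}
tags = {hw: guess_upos(hw, grammar, indeclinables) for hw, grammar in examples}
print(tags)
```

Real grammar strings are likely to need normalization (accents, multiple endings) before a lookup like this would match reliably.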

lsj9_short_defs.json provides one concise English definition per headword, with Greek text, citations, and abbreviations stripped:

    {
      "ἀγαθός": "good",
      "ἄγω": "lead, carry, bring",
      "ἀδράφαξυς": "orach, Atriplex rosea (see ἀτράφαξυς)"
    }

Entries that are pure cross-references ("v. X", "= X") with no direct English definition get the target's definition with a (see X) note, resolved via lsj9_crossrefs.json.

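The fallback for pure cross-references can be sketched as below. `resolve_short_defs` is a hypothetical helper, and taking the first target that has a definition is an assumption; the sample data is inferred from the ἀδράφαξυς example above.

```python
def resolve_short_defs(direct_defs, crossrefs):
    """Fill in cross-reference entries with the target's definition."""
    resolved = dict(direct_defs)
    for headword, targets in crossrefs.items():
        if headword in resolved:
            continue  # the entry already has its own English definition
        for target in targets:
            definition = direct_defs.get(target)
            if definition:
                # Borrow the first available target definition (assumption).
                resolved[headword] = f"{definition} (see {target})"
                break
    return resolved

TARGET = "ἀτράφαξυς"
VARIANT = "ἀδράφαξυς"
direct_defs = {TARGET: "orach, Atriplex rosea"}
crossrefs = {VARIANT: [TARGET]}
short_defs = resolve_short_defs(direct_defs, crossrefs)
print(short_defs[VARIANT])
```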

lsj9_crossrefs.json maps headwords that are cross-references to their resolved target headword(s). Chains (A -> B -> C) are resolved so A points directly to C.

    {
      "ἀδικήω": ["ἀδικέω"],
      "ἀασιφροσύνη": ["ἀεσιφροσύνη", "ἀεσίφρων"]
    }

Parsed from "v. X", "v. sub X", and "= X" patterns in gloss text. Editorial hyphens in targets (e.g. "ἀεσι-φροσύνη") are removed to match headword forms. Only targets that exist in the headword list are included.

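The hyphen removal, headword filtering, and chain resolution described above can be sketched as follows. `resolve_crossrefs` is a hypothetical function, the cycle guard is an added assumption, and "A", "B", "C", and "missing" are placeholder strings.

```python
def resolve_crossrefs(raw_refs, headwords):
    """Normalize targets, drop unknown ones, and collapse chains."""
    cleaned = {}
    for source, targets in raw_refs.items():
        kept = [t.replace("-", "") for t in targets]   # drop editorial hyphens
        kept = [t for t in kept if t in headwords]     # keep known headwords only
        if kept:
            cleaned[source] = kept

    def final_targets(word, seen):
        result = []
        for target in cleaned.get(word, []):
            if target in seen:
                continue  # guard against reference cycles
            if target in cleaned:
                # The target is itself a cross-reference: follow the chain.
                result.extend(final_targets(target, seen | {target}))
            else:
                result.append(target)
        return result

    resolved = {}
    for source in cleaned:
        targets = list(dict.fromkeys(final_targets(source, {source})))
        if targets:
            resolved[source] = targets
    return resolved

raw_refs = {
    "A": ["B"],
    "B": ["C"],
    "ἀασιφροσύνη": ["ἀεσι-φροσύνη", "ἀεσίφρων", "missing"],
}
headwords = {"A", "B", "C", "ἀασιφροσύνη", "ἀεσιφροσύνη", "ἀεσίφρων"}
resolved = resolve_crossrefs(raw_refs, headwords)
print(resolved)
```

Here "A" ends up pointing directly at "C", and the target that is not in the headword set is dropped.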
Two other digital LSJ datasets are widely used:
- LSJLogeion (Helma Dik, U. of Chicago) - 86 XML files derived from the Perseus Digital Library TEI markup. Corrections focus on character encoding, entry reorganization, and language tagging. Used by the Logeion search tool.
- lsj-js (Perseids Project) - A single JSON blob of the same Perseus/Internet Archive text, bundled into a JavaScript web app for offline searching.
LSJLogeion derives from the Perseus Project's manual keyboard entry of LSJ (mid-1990s, funded by the National Science Foundation), converted from TLG Beta Code to Unicode with extensive editorial corrections by Helma Dik. LSJ9 and lsj-js both use the Internet Archive lsj.txt.
All digital LSJ versions contain errors from their respective digitization processes. LSJ9 is correcting these systematically by re-running OCR against high-resolution scans of the original printed edition. Typical issues:
| Error type | Example (before) | Corrected |
|---|---|---|
| Missing breathing marks | αγαθός | ἀγαθός |
| Citation spacing | E.Rh.990,cf.Supp.208 | E. Rh. 990, cf. Supp. 208 |
| Garbled diacritics | δστα | ὀστᾶ |
| Line-break artifacts | causing\nhoarseness | causing hoarseness |
| Character confusions | FVarso | PVarsov. (θ/δ, P/F) |

The differences between projects are in what is done with the text:

| | LSJ9 (this project) | LSJLogeion | lsj-js |
|---|---|---|---|
| Format | Structured files (JSON, TSV, JSONL) | TEI XML (86 files) | Single JSON blob |
| OCR corrections | Systematic, pipeline-based | Manual, incremental | None |
| Headword parsing | 119,450 with grammar, genitive, etymology | Headwords in XML tags | Flat key-value |
| Gloss extraction | 176,622 hierarchical glosses (I.1.a structure) | Embedded in XML prose | None |
| Reference parsing | 211,731 structured refs (author, work, passage) | Inline citations | None |
| Designed for | Programmatic consumption (NLP, apps, inflection tools) | Human reading, Logeion integration | Browser-based lookup |
The key difference: LSJLogeion and lsj-js preserve the dictionary as formatted text (XML or JSON). LSJ9 decomposes it into structured, independently queryable data - headwords, grammatical forms, sense hierarchies, and citation networks - suitable for building applications, training models, or cross-referencing with other corpora.
The base text (lsj.txt) comes from the Internet Archive. Its exact provenance is unknown. It contains thousands of OCR errors, particularly in polytonic Greek (missing breathing marks, garbled diacritics, character confusions between similar glyphs like θ/δ). We are systematically correcting these using a multi-model OCR pipeline against a high-resolution scan of the original 1940 Oxford edition:
- Qwen3-VL (primary): Vision-language model run on column images. Produces proper polytonic Greek with breathing marks and accents. Run locally on GPU.
- Google Cloud Vision (secondary): Document text detection API. Better at English text, citation punctuation, and reference numbers. Weaker on Greek diacritics.
- Dilemma spell-checker: Validates Greek tokens against a 12.3M-form lookup table and suggests corrections by edit distance.
The two OCR sources are cross-validated, and corrections are applied programmatically. This is an ongoing process.
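As a rough illustration of cross-validation plus edit-distance suggestions, the sketch below reconciles two OCR readings of one Greek token against a small lexicon. It does not use Dilemma's actual API; `edit_distance`, `suggest`, `reconcile`, the decision order, and the distance threshold are all assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]

def suggest(token, lexicon, max_distance=2):
    """Return the token if valid, else the nearest form within max_distance."""
    if token in lexicon:
        return token
    scored = sorted((edit_distance(token, form), form) for form in lexicon)
    if scored and scored[0][0] <= max_distance:
        return scored[0][1]
    return None

def reconcile(primary, secondary, lexicon):
    """Prefer a valid primary reading, then a valid secondary one,
    then the nearest lexicon form to the primary reading (assumed policy)."""
    if primary in lexicon:
        return primary
    if secondary in lexicon:
        return secondary
    return suggest(primary, lexicon)

LEXICON = {"ἀγαθός", "ἄγω"}  # stand-in for the full form table
fixed = reconcile("αγαθός", "αγαθος", LEXICON)
print(fixed)
```

A linear scan like this would be too slow against millions of forms, so a real lookup would need an index; tokens should also be Unicode-normalized consistently before comparison.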
LSJ9 does not include content from the 1996 Revised Supplement - that is part of the separate LSJ10 project. For context, the supplement OCR pipeline tested three vision-language models on the 348-page supplement PDF:
| Model | Details | Notes |
|---|---|---|
| Qwen2.5-VL-7B-Instruct | float16, 1024px column images | Original run, good baseline |
| Qwen3-VL-8B-Instruct | Re-OCR of worst pages | Some hallucinations on complex pages |
| Qwen3-VL-30B-A3B-Instruct | MoE model, bf16 on A100 80GB | Zero hallucinations, best quality |
The Qwen3-VL-30B MoE model produced the cleanest output with no hallucinations, making it the preferred choice for high-fidelity Greek lexicon OCR.
CC BY 4.0. Attribution required. See LICENSE.
Digitized from the public domain 9th edition (Oxford, 1940). Base text from the Internet Archive scan.