LSJ9: A Digitized Greek-English Lexicon

A structured, machine-readable version of Liddell, Scott, and Jones, A Greek-English Lexicon, 9th edition (1940). The base text is in the public domain. This dataset adds OCR corrections, structured parsing, and grammatical annotations. It is used by Dilemma, a Greek lemmatizer.

Data Files

File	Description
`lsj9_headwords.json`	119,450 headwords with grammar, etymology, genitive, homograph markers
`lsj9_forms.tsv`	63,389 entries with explicit grammatical info (article/adjective type, genitive ending)
`lsj9_glosses.jsonl`	177,190 hierarchical glosses (definitions with citations stripped)
`lsj9_refs.tsv`	211,585 structured references (author, work, passage)
`lsj9_frequency.json`	Reference counts per headword (55,495 entries)
`lsj9_indeclinables.json`	Indeclinable entries with POS categories (adverbs, prepositions, conjunctions, particles, interjections)
`lsj9_authors.json`	Authors & works abbreviation table (from front matter)
`lsj9_abbreviations.json`	General abbreviations, epigraphical publications, papyri, periodicals

Headword Format

Each entry in lsj9_headwords.json:

{
  "id": 42,
  "headword": "ἄβαξ",
  "grammar": "ὁ",
  "genitive": "ᾰκος",
  "etymology": null
}

grammar: article (ὁ/ἡ/τό) for nouns, adjective ending (ον/ές) for adjectives, null for verbs
genitive: genitive ending extracted from the first line, where available (17,667 entries)
etymology: parenthesized etymology note, where present (11,335 entries)
homograph: (A)/(B) marker for entries sharing a headword
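
As a minimal sketch of how these fields might be consumed, the snippet below groups noun headwords by their article. It assumes the file is a JSON array of entry objects shaped like the example above (the top-level layout is not stated here). The second sample entry and the helper name `nouns_by_article` are illustrative only.

```python
import json
from collections import defaultdict

# Stand-in for json.load(open("lsj9_headwords.json", encoding="utf-8")).
# Assumption: the file is a JSON array of entry objects.
# The second entry (id 43) is made up for illustration.
SAMPLE = json.loads("""
[
  {"id": 42, "headword": "ἄβαξ", "grammar": "ὁ", "genitive": "ᾰκος", "etymology": null},
  {"id": 43, "headword": "ἄγω", "grammar": null, "genitive": null, "etymology": null}
]
""")

ARTICLES = {"ὁ", "ἡ", "τό"}


def nouns_by_article(entries):
    """Group noun headwords by article; entries without an article are skipped."""
    groups = defaultdict(list)
    for entry in entries:
        grammar = entry.get("grammar")
        if grammar in ARTICLES:
            groups[grammar].append(entry["headword"])
    return dict(groups)


print(nouns_by_article(SAMPLE))
```

Entries whose grammar is null (verbs, in the convention above) fall through the filter.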

Gloss Hierarchy

Glosses follow the LSJ numbering system:

major (I, II, III): top-level sense divisions
minor (1, 2, 3): sub-senses within a major division
sub (a, b, c): sub-sub-senses
unnumbered: entries without numbered divisions (single gloss)

Each gloss has a parent_id linking to its containing sense.
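
The parent_id links are enough to rebuild a sense path such as I.1.a. The sketch below shows one way to do that. Only parent_id is documented above. The other field names (`id`, `level`, `label`, `text`) and the sample records are assumptions for illustration, so adjust them to the actual keys in lsj9_glosses.jsonl.

```python
import json

# Hypothetical records: only parent_id is documented; the other
# field names (id, level, label, text) are assumed for this sketch.
SAMPLE_JSONL = """\
{"id": 1, "parent_id": null, "level": "major", "label": "I", "text": "first sense"}
{"id": 2, "parent_id": 1, "level": "minor", "label": "1", "text": "sub-sense"}
{"id": 3, "parent_id": 2, "level": "sub", "label": "a", "text": "sub-sub-sense"}
{"id": 4, "parent_id": null, "level": "major", "label": "II", "text": "second sense"}
"""


def load_glosses(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def sense_paths(glosses):
    """Map each gloss id to a dotted path by walking parent_id links upward."""
    by_id = {g["id"]: g for g in glosses}
    paths = {}
    for gloss in glosses:
        labels = []
        node = gloss
        while node is not None:
            labels.append(node["label"])
            node = by_id.get(node["parent_id"])  # None once the top level is reached
        paths[gloss["id"]] = ".".join(reversed(labels))
    return paths


glosses = load_glosses(SAMPLE_JSONL.splitlines())
for gloss_id, path in sense_paths(glosses).items():
    print(gloss_id, path)
```

The walk assumes the links form a tree. A record whose ancestry loops back on itself would need an explicit cycle check.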

Forms TSV

Tab-separated: headword, grammar, genitive, etymology. Designed for consumption by inflection generators (e.g., Wiktionary template expansion).
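
A minimal reading sketch using Python's standard csv module follows. It assumes the file has no header row and that missing values are empty strings. Neither is confirmed above, so check the first line of the file before relying on it.

```python
import csv
import io

COLUMNS = ["headword", "grammar", "genitive", "etymology"]

# Stand-in for open("lsj9_forms.tsv", encoding="utf-8", newline="").
# Assumptions: no header row; empty string means "no value".
SAMPLE_TSV = "ἄβαξ\tὁ\tᾰκος\t\n"


def read_forms(handle):
    """Yield one dict per row, turning empty fields into None."""
    reader = csv.DictReader(
        handle,
        fieldnames=COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as literal text
    )
    return [{key: (value or None) for key, value in row.items()} for row in reader]


rows = read_forms(io.StringIO(SAMPLE_TSV))
print(rows)
```

Disabling quoting keeps the reader from treating a quotation mark inside a field as the start of a quoted cell.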

Processed Exports

These derived files are built from the raw data by build_exports.py. They provide ready-to-use formats for downstream consumers.

File	Description	Consumers
`lsj9_headwords_flat.json`	118,764 headword strings (deduplicated, length-marks stripped)	dilemma (headword-set filtering)
`lsj9_headword_pos.json`	135,180 headword-to-UPOS mappings (NOUN, ADJ, VERB, ADV, ADP, etc.)	dilemma (POS disambiguation)
`lsj9_crossrefs.json`	7,678 cross-reference mappings (headword to target headwords)	LSJ10 Kindle edition
`lsj9_short_defs.json`	111,506 clean English short definitions per headword (includes 3,625 resolved cross-references)	dilemma
`lsj9_glosses_flat.json`	115,330 headwords with 173,317 flattened English glosses	dilemma

To rebuild after updating raw data:

python build_exports.py

Headword POS Format

lsj9_headword_pos.json maps each headword to a UPOS tag:

{
  "ἄβαξ": "NOUN",
  "ἀγαθός": "ADJ",
  "ἄγω": "VERB",
  "ἄγαν": "ADV"
}

Sources: grammar field (ὁ/ἡ/τό -> NOUN, ον/ές -> ADJ), verb-ending heuristics (-ω/-μι/-μαι -> VERB), and lsj9_indeclinables.json (ADV, ADP, CCONJ, PART, INTJ).
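
The three sources can be combined roughly as sketched below. This is not the logic in build_exports.py. The function name `guess_upos`, the precedence order (indeclinables first, then the grammar field, then verb endings), and the mark-stripping step are all assumptions made for illustration.

```python
import unicodedata

# Compared after stripping accents and breathings: ὁ / ἡ / τό and ον / ές.
NOUN_ARTICLES = {"ο", "η", "το"}
ADJ_ENDINGS = {"ον", "ες"}
VERB_ENDINGS = ("ω", "μι", "μαι")


def strip_marks(text):
    """Drop combining accents and breathings so only base letters remain."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def guess_upos(headword, grammar=None, indeclinable_pos=None):
    """Return a UPOS tag, or None when no rule applies (assumed precedence)."""
    if indeclinable_pos:  # e.g. a tag looked up in lsj9_indeclinables.json
        return indeclinable_pos
    bare_grammar = strip_marks(grammar) if grammar else None
    if bare_grammar in NOUN_ARTICLES:
        return "NOUN"
    if bare_grammar in ADJ_ENDINGS:
        return "ADJ"
    if strip_marks(headword).endswith(VERB_ENDINGS):
        return "VERB"
    return None


print(guess_upos("ἄβαξ", grammar="ὁ"))
print(guess_upos("ἄγω"))
print(guess_upos("ἄγαν", indeclinable_pos="ADV"))
```

Checking verb endings last matters in this sketch, because many non-verbs also end in -ω and would otherwise be tagged VERB.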

Short Definitions Format

lsj9_short_defs.json provides one concise English definition per headword, with Greek text, citations, and abbreviations stripped:

{
  "ἀγαθός": "good",
  "ἄγω": "lead, carry, bring",
  "ἀδράφαξυς": "orach, Atriplex rosea (see ἀτράφαξυς)"
}

Entries that are pure cross-references ("v. X", "= X") with no direct English definition get the target's definition with a (see X) note, resolved via lsj9_crossrefs.json.

Cross-References Format

lsj9_crossrefs.json maps headwords that are cross-references to their resolved target headword(s). Chains (A -> B -> C) are resolved so A points directly to C.

{
  "ἀδικήω": ["ἀδικέω"],
  "ἀασιφροσύνη": ["ἀεσιφροσύνη", "ἀεσίφρων"]
}

Parsed from "v. X", "v. sub X", and "= X" patterns in gloss text. Editorial hyphens in targets (e.g. "ἀεσι-φροσύνη") are removed to match headword forms. Only targets that exist in the headword list are included.
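
The parsing and chain resolution described above can be sketched as follows. The regular expression, the function names `extract_targets` and `resolve_chains`, and the cycle handling are assumptions for illustration. The actual patterns in build_exports.py may differ.

```python
import re

# Assumed pattern: "v. X", "v. sub X", or "= X", where X is a run of Greek
# letters (basic, extended, combining marks) that may contain editorial hyphens.
XREF = re.compile(
    r"(?:\bv\.\s+(?:sub\s+)?|=\s*)([\u0370-\u03FF\u1F00-\u1FFF\u0300-\u036F-]+)"
)


def extract_targets(gloss_text, headwords):
    """Return cross-reference targets that exist in the headword set."""
    targets = []
    for raw in XREF.findall(gloss_text):
        target = raw.replace("-", "")  # drop editorial hyphens
        if target in headwords and target not in targets:
            targets.append(target)
    return targets


def resolve_chains(direct):
    """Rewrite A -> B -> C so that A points directly at C."""

    def final_targets(word, seen):
        next_words = direct.get(word)
        if not next_words or word in seen:  # end of chain, or a cycle
            return [word]
        resolved = []
        for nxt in next_words:
            for end in final_targets(nxt, seen | {word}):
                if end not in resolved:
                    resolved.append(end)
        return resolved

    return {word: final_targets(word, frozenset()) for word in direct}


print(resolve_chains({"A": ["B"], "B": ["C"]}))
```

Filtering against the headword set inside `extract_targets` mirrors the rule that only targets present in the headword list are kept.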

How This Differs from Other Digital LSJ Projects

Two other digital LSJ datasets are widely used:

LSJLogeion (Helma Dik, U. of Chicago) - 86 XML files derived from the Perseus Digital Library TEI markup. Corrections focus on character encoding, entry reorganization, and language tagging. Used by the Logeion search tool.
lsj-js (Perseids Project) - A single JSON blob of the same Perseus/Internet Archive text, bundled into a JavaScript web app for offline searching.

LSJLogeion derives from the Perseus Project's manual keyboard entry of LSJ (mid-1990s, funded by the National Science Foundation), converted from TLG Beta Code to Unicode with extensive editorial corrections by Helma Dik. LSJ9 and lsj-js both use the Internet Archive lsj.txt.

All digital LSJ versions contain errors from their respective digitization processes. LSJ9 is systematically correcting the errors in its own base text by running OCR against high-resolution scans of the original printed edition. Typical issues:

Error type	Example (before)	Corrected
Missing breathing marks	`αγαθός`	`ἀγαθός`
Citation spacing	`E.Rh.990,cf.Supp.208`	`E. Rh. 990, cf. Supp. 208`
Garbled diacritics	`δστα`	`ὀστᾶ`
Line-break artifacts	`causing\nhoarseness`	`causing hoarseness`
Character confusions (e.g. θ/δ, P/F)	`FVarso`	`PVarsov.`

The differences between projects are in what is done with the text:

	LSJ9 (this project)	LSJLogeion	lsj-js
Format	Structured files (JSON, TSV, JSONL)	TEI XML (86 files)	Single JSON blob
OCR corrections	Systematic, pipeline-based	Manual, incremental	None
Headword parsing	119,450 with grammar, genitive, etymology	Headwords in XML tags	Flat key-value
Gloss extraction	About 177,000 hierarchical glosses (I.1.a structure)	Embedded in XML prose	None
Reference parsing	About 212,000 structured refs (author, work, passage)	Inline citations	None
Designed for	Programmatic consumption (NLP, apps, inflection tools)	Human reading, Logeion integration	Browser-based lookup

The key difference: LSJLogeion and lsj-js preserve the dictionary as formatted text (XML or JSON). LSJ9 decomposes it into structured, independently queryable data - headwords, grammatical forms, sense hierarchies, and citation networks - suitable for building applications, training models, or cross-referencing with other corpora.

OCR Pipeline

Base text corrections (lsj9)

The base text (lsj.txt) comes from the Internet Archive. Its exact provenance is unknown. It contains thousands of OCR errors, particularly in polytonic Greek (missing breathing marks, garbled diacritics, character confusions between similar glyphs like θ/δ). We are systematically correcting these using a multi-model OCR pipeline against a high-resolution scan of the original 1940 Oxford edition:

Qwen3-VL (primary): Vision-language model run on column images. Produces proper polytonic Greek with breathing marks and accents. Run locally on GPU.
Google Cloud Vision (secondary): Document text detection API. Better at English text, citation punctuation, and reference numbers. Weaker on Greek diacritics.
Dilemma spell-checker: Validates Greek tokens against a 12.3M-form lookup table and suggests corrections by edit distance.

The two OCR sources are cross-validated, and corrections are applied programmatically. This is an ongoing process.
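
The spell-checking step can be illustrated with a small edit-distance lookup. This is a sketch of the general technique, not Dilemma's actual API. The names `edit_distance` and `suggest`, the distance threshold, and the tiny stand-in word set are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def suggest(token, known_forms, max_distance=1):
    """Return the token if it is known, else close forms sorted by distance."""
    if token in known_forms:
        return [token]
    scored = sorted((edit_distance(token, form), form) for form in known_forms)
    return [form for distance, form in scored if distance <= max_distance]


# Tiny stand-in for the real lookup table of inflected forms.
KNOWN = {"ἀγαθός", "ἄγω"}
print(suggest("αγαθός", KNOWN))  # token with a missing breathing mark
```

A linear scan like this is only practical for small sets. A table of millions of forms would need an index, for example candidates generated by deleting or substituting single characters.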

Supplement OCR (lsj10 only)

LSJ9 does not include content from the 1996 Revised Supplement - that is part of the separate LSJ10 project. For context, the supplement OCR pipeline tested three vision-language models on the 348-page supplement PDF:

Model	Details	Notes
Qwen2.5-VL-7B-Instruct	float16, 1024px column images	Original run, good baseline
Qwen3-VL-8B-Instruct	Re-OCR of worst pages	Some hallucinations on complex pages
Qwen3-VL-30B-A3B-Instruct	MoE model, bf16 on A100 80GB	Zero hallucinations, best quality

In these tests the Qwen3-VL-30B MoE model produced the cleanest output, with no hallucinations observed, which makes it the preferred choice for high-fidelity Greek lexicon OCR.

License

CC BY 4.0. Attribution required. See LICENSE.

Source

Digitized from the public domain 9th edition (Oxford, 1940). Base text from the Internet Archive scan.
