diff --git a/CLAUDE.md b/CLAUDE.md index e87e51e..8b3383d 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -20,6 +20,11 @@ python server.py # Start MCP server (stdio transport) | `python pipeline.py embed` | Generate embeddings, build SQLite DB | | `python pipeline.py rebuild` | clone + chunk + embed | | `python pipeline.py stats` | Print database statistics | +| `python pipeline.py verify` | Run search quality checks | +| `python pipeline.py stale` | Check for stale chunks (local + upstream) | +| `python pipeline.py freshness` | Unified freshness report (age, model, sources) | +| `python pipeline.py ingest` | Incrementally ingest new chunks | +| `python pipeline.py gotcha` | Tag chunks with known gotchas | ## Project Structure @@ -40,6 +45,7 @@ python server.py # Start MCP server (stdio transport) ## Conventions - **Chunk format:** Every chunker returns dicts with keys: `id`, `text`, `source`, `module_path`, `type_name`, `category`, `heading`, `file_path` +- **Database sources:** `db_sources` in config pulls chunks from SQLite databases via SQL queries. Column mapping is config-driven (`text_column`, `heading_column`, etc.). DB sources produce chunks in the same format as file-based chunkers — everything downstream (embed, search, verify) works unchanged. - **Chunker registration:** Each chunker calls `register_chunker("name", ClassName)` at module level - **Config-driven tools:** MCP tool names and descriptions come from `config.json`, not code - **Embedding prefix:** Documents get `"search_document: "`, queries get `"search_query: "` (nomic-embed-text convention) @@ -53,3 +59,4 @@ python server.py # Start MCP server (stdio transport) - **Logging:** Both `pipeline.py` and `server.py` use Python's `logging` module with module-level loggers (`log = logging.getLogger(...)`). Pipeline configures logging in `main()`. Server logs to stderr (MCP uses stdout for protocol). CLI usage/help text stays as `print()`. - **Transaction batching:** Pipeline wraps each embedding batch in an explicit `BEGIN`/`COMMIT` transaction. Uses `isolation_level=None` for manual control. - **Git timeouts:** Clone and pull operations have a 120-second timeout to prevent hung pipelines +- **Index metadata:** `index_metadata` table stores `indexed_at`, `embed_model`, `embed_dimensions`, and `repo::commit` for provenance tracking. `cmd_stale` and `cmd_freshness` use this to detect model drift and upstream changes. diff --git a/LICENSE b/LICENSE index 8cc9f06..239515c 100644 --- a/LICENSE +++ b/LICENSE @@ -1,21 +1,678 @@ -MIT License - -Copyright (c) 2025 Justin Russas - -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: - -The above copyright notice and this permission notice shall be included in all -copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -SOFTWARE. +Copyright (C) 2026 Justin Russas + +This program is free software: you can redistribute it and/or modify +it under the terms of the GNU Affero General Public License as published +by the Free Software Foundation, either version 3 of the License, or +(at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU Affero General Public License for more details. + +You should have received a copy of the GNU Affero General Public License +along with this program. If not, see . + +-------------------------------------------------------------------------------- + + GNU AFFERO GENERAL PUBLIC LICENSE + Version 3, 19 November 2007 + + Copyright (C) 2007 Free Software Foundation, Inc. + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The GNU Affero General Public License is a free, copyleft license for +software and other kinds of works, specifically designed to ensure +cooperation with the community in the case of network server software. + + The licenses for most software and other practical works are designed +to take away your freedom to share and change the works. By contrast, +our General Public Licenses are intended to guarantee your freedom to +share and change all versions of a program--to make sure it remains free +software for all its users. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +them if you wish), that you receive source code or can get it if you +want it, that you can change the software or use pieces of it in new +free programs, and that you know you can do these things. + + Developers that use our General Public Licenses protect your rights +with two steps: (1) assert copyright on the software, and (2) offer +you this License which gives you legal permission to copy, distribute +and/or modify the software. + + A secondary benefit of defending all users' freedom is that +improvements made in alternate versions of the program, if they +receive widespread use, become available for other developers to +incorporate. Many developers of free software are heartened and +encouraged by the resulting cooperation. However, in the case of +software used on network servers, this result may fail to come about. +The GNU General Public License permits making a modified version and +letting the public access it on a server without ever releasing its +source code to the public. + + The GNU Affero General Public License is designed specifically to +ensure that, in such cases, the modified source code becomes available +to the community. It requires the operator of a network server to +provide the source code of the modified version running there to the +users of that server. Therefore, public use of a modified version, on +a publicly accessible server, gives the public access to the source +code of the modified version. 
+ + An older license, called the Affero General Public License and +published by Affero, was designed to accomplish similar goals. This is +a different license, not a version of the Affero GPL, but Affero has +released a new version of the Affero GPL which permits relicensing under +this license. + + The precise terms and conditions for copying, distribution and +modification follow. + + TERMS AND CONDITIONS + + 0. Definitions. + + "This License" refers to version 3 of the GNU Affero General Public License. + + "Copyright" also means copyright-like laws that apply to other kinds of +works, such as semiconductor masks. + + "The Program" refers to any copyrightable work licensed under this +License. Each licensee is addressed as "you". "Licensees" and +"recipients" may be individuals or organizations. + + To "modify" a work means to copy from or adapt all or part of the work +in a fashion requiring copyright permission, other than the making of an +exact copy. The resulting work is called a "modified version" of the +earlier work or a work "based on" the earlier work. + + A "covered work" means either the unmodified Program or a work based +on the Program. + + To "propagate" a work means to do anything with it that, without +permission, would make you directly or secondarily liable for +infringement under applicable copyright law, except executing it on a +computer or modifying a private copy. Propagation includes copying, +distribution (with or without modification), making available to the +public, and in some countries other activities as well. + + To "convey" a work means any kind of propagation that enables other +parties to make or receive copies. Mere interaction with a user through +a computer network, with no transfer of a copy, is not conveying. + + An interactive user interface displays "Appropriate Legal Notices" +to the extent that it includes a convenient and prominently visible +feature that (1) displays an appropriate copyright notice, and (2) +tells the user that there is no warranty for the work (except to the +extent that warranties are provided), that licensees may convey the +work under this License, and how to view a copy of this License. If +the interface presents a list of user commands or options, such as a +menu, a prominent item in the list meets this criterion. + + 1. Source Code. + + The "source code" for a work means the preferred form of the work +for making modifications to it. "Object code" means any non-source +form of a work. + + A "Standard Interface" means an interface that either is an official +standard defined by a recognized standards body, or, in the case of +interfaces specified for a particular programming language, one that +is widely used among developers working in that language. + + The "System Libraries" of an executable work include anything, other +than the work as a whole, that (a) is included in the normal form of +packaging a Major Component, but which is not part of that Major +Component, and (b) serves only to enable use of the work with that +Major Component, or to implement a Standard Interface for which an +implementation is available to the public in source code form. A +"Major Component", in this context, means a major essential component +(kernel, window system, and so on) of the specific operating system +(if any) on which the executable work runs, or a compiler used to +produce the work, or an object code interpreter used to run it. 
+ + The "Corresponding Source" for a work in object code form means all +the source code needed to generate, install, and (for an executable +work) run the object code and to modify the work, including scripts to +control those activities. However, it does not include the work's +System Libraries, or general-purpose tools or generally available free +programs which are used unmodified in performing those activities but +which are not part of the work. For example, Corresponding Source +includes interface definition files associated with source files for +the work, and the source code for shared libraries and dynamically +linked subprograms that the work is specifically designed to require, +such as by intimate data communication or control flow between those +subprograms and other parts of the work. + + The Corresponding Source need not include anything that users +can regenerate automatically from other parts of the Corresponding +Source. + + The Corresponding Source for a work in source code form is that +same work. + + 2. Basic Permissions. + + All rights granted under this License are granted for the term of +copyright on the Program, and are irrevocable provided the stated +conditions are met. This License explicitly affirms your unlimited +permission to run the unmodified Program. The output from running a +covered work is covered by this License only if the output, given its +content, constitutes a covered work. This License acknowledges your +rights of fair use or other equivalent, as provided by copyright law. + + You may make, run and propagate covered works that you do not +convey, without conditions so long as your license otherwise remains +in force. You may convey covered works to others for the sole purpose +of having them make modifications exclusively for you, or provide you +with facilities for running those works, provided that you comply with +the terms of this License in conveying all material for which you do +not control copyright. Those thus making or running the covered works +for you must do so exclusively on your behalf, under your direction +and control, on terms that prohibit them from making any copies of +your copyrighted material outside their relationship with you. + + Conveying under any other circumstances is permitted solely under +the conditions stated below. Sublicensing is not allowed; section 10 +makes it unnecessary. + + 3. Protecting Users' Legal Rights From Anti-Circumvention Law. + + No covered work shall be deemed part of an effective technological +measure under any applicable law fulfilling obligations under article +11 of the WIPO copyright treaty adopted on 20 December 1996, or +similar laws prohibiting or restricting circumvention of such +measures. + + When you convey a covered work, you waive any legal power to forbid +circumvention of technological measures to the extent such circumvention +is effected by exercising rights under this License with respect to +the covered work, and you disclaim any intention to limit operation or +modification of the work as a means of enforcing, against the work's +users, your or third parties' legal rights to forbid circumvention of +technological measures. + + 4. Conveying Verbatim Copies. 
+ + You may convey verbatim copies of the Program's source code as you +receive it, in any medium, provided that you conspicuously and +appropriately publish on each copy an appropriate copyright notice; +keep intact all notices stating that this License and any +non-permissive terms added in accord with section 7 apply to the code; +keep intact all notices of the absence of any warranty; and give all +recipients a copy of this License along with the Program. + + You may charge any price or no price for each copy that you convey, +and you may offer support or warranty protection for a fee. + + 5. Conveying Modified Source Versions. + + You may convey a work based on the Program, or the modifications to +produce it from the Program, in the form of source code under the +terms of section 4, provided that you also meet all of these conditions: + + a) The work must carry prominent notices stating that you modified + it, and giving a relevant date. + + b) The work must carry prominent notices stating that it is + released under this License and any conditions added under section + 7. This requirement modifies the requirement in section 4 to + "keep intact all notices". + + c) You must license the entire work, as a whole, under this + License to anyone who comes into possession of a copy. This + License will therefore apply, along with any applicable section 7 + additional terms, to the whole of the work, and all its parts, + regardless of how they are packaged. This License gives no + permission to license the work in any other way, but it does not + invalidate such permission if you have separately received it. + + d) If the work has interactive user interfaces, each must display + Appropriate Legal Notices; however, if the Program has interactive + interfaces that do not display Appropriate Legal Notices, your + work need not make them do so. + + A compilation of a covered work with other separate and independent +works, which are not by their nature extensions of the covered work, +and which are not combined with it such as to form a larger program, +in or on a volume of a storage or distribution medium, is called an +"aggregate" if the compilation and its resulting copyright are not +used to limit the access or legal rights of the compilation's users +beyond what the individual works permit. Inclusion of a covered work +in an aggregate does not cause this License to apply to the other +parts of the aggregate. + + 6. Conveying Non-Source Forms. + + You may convey a covered work in object code form under the terms +of sections 4 and 5, provided that you also convey the +machine-readable Corresponding Source under the terms of this License, +in one of these ways: + + a) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by the + Corresponding Source fixed on a durable physical medium + customarily used for software interchange. 
+ + b) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by a + written offer, valid for at least three years and valid for as + long as you offer spare parts or customer support for that product + model, to give anyone who possesses the object code either (1) a + copy of the Corresponding Source for all the software in the + product that is covered by this License, on a durable physical + medium customarily used for software interchange, for a price no + more than your reasonable cost of physically performing this + conveying of source, or (2) access to copy the + Corresponding Source from a network server at no charge. + + c) Convey individual copies of the object code with a copy of the + written offer to provide the Corresponding Source. This + alternative is allowed only occasionally and noncommercially, and + only if you received the object code with such an offer, in accord + with subsection 6b. + + d) Convey the object code by offering access from a designated + place (gratis or for a charge), and offer equivalent access to the + Corresponding Source in the same way through the same place at no + further charge. You need not require recipients to copy the + Corresponding Source along with the object code. If the place to + copy the object code is a network server, the Corresponding Source + may be on a different server (operated by you or a third party) + that supports equivalent copying facilities, provided you maintain + clear directions next to the object code saying where to find the + Corresponding Source. Regardless of what server hosts the + Corresponding Source, you remain obligated to ensure that it is + available for as long as needed to satisfy these requirements. + + e) Convey the object code using peer-to-peer transmission, provided + you inform other peers where the object code and Corresponding + Source of the work are being offered to the general public at no + charge under subsection 6d. + + A separable portion of the object code, whose source code is excluded +from the Corresponding Source as a System Library, need not be +included in conveying the object code work. + + A "User Product" is either (1) a "consumer product", which means any +tangible personal property which is normally used for personal, family, +or household purposes, or (2) anything designed or sold for incorporation +into a dwelling. In determining whether a product is a consumer product, +doubtful cases shall be resolved in favor of coverage. For a particular +product received by a particular user, "normally used" refers to a +typical or common use of that class of product, regardless of the status +of the particular user or of the way in which the particular user +actually uses, or expects or is expected to use, the product. A product +is a consumer product regardless of whether the product has substantial +commercial, industrial or non-consumer uses, unless such uses represent +the only significant mode of use of the product. + + "Installation Information" for a User Product means any methods, +procedures, authorization keys, or other information required to install +and execute modified versions of a covered work in that User Product from +a modified version of its Corresponding Source. The information must +suffice to ensure that the continued functioning of the modified object +code is in no case prevented or interfered with solely because +modification has been made. 
+ + If you convey an object code work under this section in, or with, or +specifically for use in, a User Product, and the conveying occurs as +part of a transaction in which the right of possession and use of the +User Product is transferred to the recipient in perpetuity or for a +fixed term (regardless of how the transaction is characterized), the +Corresponding Source conveyed under this section must be accompanied +by the Installation Information. But this requirement does not apply +if neither you nor any third party retains the ability to install +modified object code on the User Product (for example, the work has +been installed in ROM). + + The requirement to provide Installation Information does not include a +requirement to continue to provide support service, warranty, or updates +for a work that has been modified or installed by the recipient, or for +the User Product in which it has been modified or installed. Access to a +network may be denied when the modification itself materially and +adversely affects the operation of the network or violates the rules and +protocols for communication across the network. + + Corresponding Source conveyed, and Installation Information provided, +in accord with this section must be in a format that is publicly +documented (and with an implementation available to the public in +source code form), and must require no special password or key for +unpacking, reading or copying. + + 7. Additional Terms. + + "Additional permissions" are terms that supplement the terms of this +License by making exceptions from one or more of its conditions. +Additional permissions that are applicable to the entire Program shall +be treated as though they were included in this License, to the extent +that they are valid under applicable law. If additional permissions +apply only to part of the Program, that part may be used separately +under those permissions, but the entire Program remains governed by +this License without regard to the additional permissions. + + When you convey a copy of a covered work, you may at your option +remove any additional permissions from that copy, or from any part of +it. (Additional permissions may be written to require their own +removal in certain cases when you modify the work.) You may place +additional permissions on material, added by you to a covered work, +for which you have or can give appropriate copyright permission. 
+ + Notwithstanding any other provision of this License, for material you +add to a covered work, you may (if authorized by the copyright holders of +that material) supplement the terms of this License with terms: + + a) Disclaiming warranty or limiting liability differently from the + terms of sections 15 and 16 of this License; or + + b) Requiring preservation of specified reasonable legal notices or + author attributions in that material or in the Appropriate Legal + Notices displayed by works containing it; or + + c) Prohibiting misrepresentation of the origin of that material, or + requiring that modified versions of such material be marked in + reasonable ways as different from the original version; or + + d) Limiting the use for publicity purposes of names of licensors or + authors of the material; or + + e) Declining to grant rights under trademark law for use of some + trade names, trademarks, or service marks; or + + f) Requiring indemnification of licensors and authors of that + material by anyone who conveys the material (or modified versions of + it) with contractual assumptions of liability to the recipient, for + any liability that these contractual assumptions directly impose on + those licensors and authors. + + All other non-permissive additional terms are considered "further +restrictions" within the meaning of section 10. If the Program as you +received it, or any part of it, contains a notice stating that it is +governed by this License along with a term that is a further +restriction, you may remove that term. If a license document contains +a further restriction but permits relicensing or conveying under this +License, you may add to a covered work material governed by the terms +of that license document, provided that the further restriction does +not survive such relicensing or conveying. + + If you add terms to a covered work in accord with this section, you +must place, in the relevant source files, a statement of the +additional terms that apply to those files, or a notice indicating +where to find the applicable terms. + + Additional terms, permissive or non-permissive, may be stated in the +form of a separately written license, or stated as exceptions; +the above requirements apply either way. + + 8. Termination. + + You may not propagate or modify a covered work except as expressly +provided under this License. Any attempt otherwise to propagate or +modify it is void, and will automatically terminate your rights under +this License (including any patent licenses granted under the third +paragraph of section 11). + + However, if you cease all violation of this License, then your +license from a particular copyright holder is reinstated (a) +provisionally, unless and until the copyright holder explicitly and +finally terminates your license, and (b) permanently, if the copyright +holder fails to notify you of the violation by some reasonable means +prior to 60 days after the cessation. + + Moreover, your license from a particular copyright holder is +reinstated permanently if the copyright holder notifies you of the +violation by some reasonable means, this is the first time you have +received notice of violation of this License (for any work) from that +copyright holder, and you cure the violation prior to 30 days after +your receipt of the notice. + + Termination of your rights under this section does not terminate the +licenses of parties who have received copies or rights from you under +this License. 
If your rights have been terminated and not permanently +reinstated, you do not qualify to receive new licenses for the same +material under section 10. + + 9. Acceptance Not Required for Having Copies. + + You are not required to accept this License in order to receive or +run a copy of the Program. Ancillary propagation of a covered work +occurring solely as a consequence of using peer-to-peer transmission +to receive a copy likewise does not require acceptance. However, +nothing other than this License grants you permission to propagate or +modify any covered work. These actions infringe copyright if you do +not accept this License. Therefore, by modifying or propagating a +covered work, you indicate your acceptance of this License to do so. + + 10. Automatic Licensing of Downstream Recipients. + + Each time you convey a covered work, the recipient automatically +receives a license from the original licensors, to run, modify and +propagate that work, subject to this License. You are not responsible +for enforcing compliance by third parties with this License. + + An "entity transaction" is a transaction transferring control of an +organization, or substantially all assets of one, or subdividing an +organization, or merging organizations. If propagation of a covered +work results from an entity transaction, each party to that +transaction who receives a copy of the work also receives whatever +licenses to the work the party's predecessor in interest had or could +give under the previous paragraph, plus a right to possession of the +Corresponding Source of the work from the predecessor in interest, if +the predecessor has it or can get it with reasonable efforts. + + You may not impose any further restrictions on the exercise of the +rights granted or affirmed under this License. For example, you may +not impose a license fee, royalty, or other charge for exercise of +rights granted under this License, and you may not initiate litigation +(including a cross-claim or counterclaim in a lawsuit) alleging that +any patent claim is infringed by making, using, selling, offering for +sale, or importing the Program or any portion of it. + + 11. Patents. + + A "contributor" is a copyright holder who authorizes use under this +License of the Program or a work on which the Program is based. The +work thus licensed is called the contributor's "contributor version". + + A contributor's "essential patent claims" are all patent claims +owned or controlled by the contributor, whether already acquired or +hereafter acquired, that would be infringed by some manner, permitted +by this License, of making, using, or selling its contributor version, +but do not include claims that would be infringed only as a +consequence of further modification of the contributor version. For +purposes of this definition, "control" includes the right to grant +patent sublicenses in a manner consistent with the requirements of +this License. + + Each contributor grants you a non-exclusive, worldwide, royalty-free +patent license under the contributor's essential patent claims, to +make, use, sell, offer for sale, import and otherwise run, modify and +propagate the contents of its contributor version. + + In the following three paragraphs, a "patent license" is any express +agreement or commitment, however denominated, not to enforce a patent +(such as an express permission to practice a patent or covenant not to +sue for patent infringement). 
To "grant" such a patent license to a +party means to make such an agreement or commitment not to enforce a +patent against the party. + + If you convey a covered work, knowingly relying on a patent license, +and the Corresponding Source of the work is not available for anyone +to copy, free of charge and under the terms of this License, through a +publicly available network server or other readily accessible means, +then you must either (1) cause the Corresponding Source to be so +available, or (2) arrange to deprive yourself of the benefit of the +patent license for this particular work, or (3) arrange, in a manner +consistent with the requirements of this License, to extend the patent +license to downstream recipients. "Knowingly relying" means you have +actual knowledge that, but for the patent license, your conveying the +covered work in a country, or your recipient's use of the covered work +in a country, would infringe one or more identifiable patents in that +country that you have reason to believe are valid. + + If, pursuant to or in connection with a single transaction or +arrangement, you convey, or propagate by procuring conveyance of, a +covered work, and grant a patent license to some of the parties +receiving the covered work authorizing them to use, propagate, modify +or convey a specific copy of the covered work, then the patent license +you grant is automatically extended to all recipients of the covered +work and works based on it. + + A patent license is "discriminatory" if it does not include within +the scope of its coverage, prohibits the exercise of, or is +conditioned on the non-exercise of one or more of the rights that are +specifically granted under this License. You may not convey a covered +work if you are a party to an arrangement with a third party that is +in the business of distributing software, under which you make payment +to the third party based on the extent of your activity of conveying +the work, and under which the third party grants, to any of the +parties who would receive the covered work from you, a discriminatory +patent license (a) in connection with copies of the covered work +conveyed by you (or copies made from those copies), or (b) primarily +for and in connection with specific products or compilations that +contain the covered work, unless you entered into that arrangement, +or that patent license was granted, prior to 28 March 2007. + + Nothing in this License shall be construed as excluding or limiting +any implied license or other defenses to infringement that may +otherwise be available to you under applicable patent law. + + 12. No Surrender of Others' Freedom. + + If conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot convey a +covered work so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you may +not convey it at all. For example, if you agree to terms that obligate you +to collect a royalty for further conveying from those to whom you convey +the Program, the only way you could satisfy both those terms and this +License would be to refrain entirely from conveying the Program. + + 13. Remote Network Interaction; Use with the GNU General Public License. 
+ + Notwithstanding any other provision of this License, if you modify the +Program, your modified version must prominently offer all users +interacting with it remotely through a computer network (if your version +supports such interaction) an opportunity to receive the Corresponding +Source of your version by providing access to the Corresponding Source +from a network server at no charge, through some standard or customary +means of facilitating copying of software. This Corresponding Source +shall include the Corresponding Source for any work covered by version 3 +of the GNU General Public License that is incorporated pursuant to the +following paragraph. + + Notwithstanding any other provision of this License, you have +permission to link or combine any covered work with a work licensed +under version 3 of the GNU General Public License into a single +combined work, and to convey the resulting work. The terms of this +License will continue to apply to the part which is the covered work, +but the work with which it is combined will remain governed by version +3 of the GNU General Public License. + + 14. Revised Versions of this License. + + The Free Software Foundation may publish revised and/or new versions of +the GNU Affero General Public License from time to time. Such new versions +will be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + + Each version is given a distinguishing version number. If the +Program specifies that a certain numbered version of the GNU Affero General +Public License "or any later version" applies to it, you have the +option of following the terms and conditions either of that numbered +version or of any later version published by the Free Software +Foundation. If the Program does not specify a version number of the +GNU Affero General Public License, you may choose any version ever published +by the Free Software Foundation. + + If the Program specifies that a proxy can decide which future +versions of the GNU Affero General Public License can be used, that proxy's +public statement of acceptance of a version permanently authorizes you +to choose that version for the Program. + + Later license versions may give you additional or different +permissions. However, no additional obligations are imposed on any +author or copyright holder as a result of your choosing to follow a +later version. + + 15. Disclaimer of Warranty. + + THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY +APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT +HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY +OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, +THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM +IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF +ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. Limitation of Liability. 
+ + IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS +THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY +GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE +USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF +DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD +PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), +EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF +SUCH DAMAGES. + + 17. Interpretation of Sections 15 and 16. + + If the disclaimer of warranty and limitation of liability provided +above cannot be given local legal effect according to their terms, +reviewing courts shall apply local law that most closely approximates +an absolute waiver of all civil liability in connection with the +Program, unless a warranty or assumption of liability accompanies a +copy of the Program in return for a fee. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +state the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU Affero General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU Affero General Public License for more details. + + You should have received a copy of the GNU Affero General Public License + along with this program. If not, see . + +Also add information on how to contact you by electronic and paper mail. + + If your software can interact with users remotely through a computer +network, you should also make sure that it provides a way for users to +get its source. For example, if your program is a web application, its +interface could display a "Source" link that leads users to an archive +of the code. There are many ways you could offer source, and different +solutions will be better for different programs; see section 13 for the +specific requirements. + + You should also get your employer (if you work as a programmer) or school, +if any, to sign a "copyright disclaimer" for the program, if necessary. +For more information on this, and how to apply and follow the GNU AGPL, see +. diff --git a/README.md b/README.md index 86a36f7..829bb7b 100644 --- a/README.md +++ b/README.md @@ -207,6 +207,12 @@ mcp-rag/ └── .github/workflows/ CI (lint + test) ``` +## Development + +Built as part of a local AI development infrastructure, extracted and open-sourced as a standalone tool. Development uses a structured review process — each commit addresses specific findings from code review passes (SQL injection safety, transaction correctness, logging hygiene). 
61 tests with CI running lint ([ruff](https://github.com/astral-sh/ruff)) and pytest on every push. + +See [commit history](https://github.com/JMRussas/mcp-rag/commits/main) for the review-driven development trail. + ## Limitations - All embeddings loaded into memory at startup — practical up to ~50k chunks (~150 MB) diff --git a/config.example.json b/config.example.json index 6f78436..19dcb44 100644 --- a/config.example.json +++ b/config.example.json @@ -22,11 +22,25 @@ "default_top_k": 8, "max_top_k": 20, "embed_dimensions": 768, - "min_score": 0.0 + "min_score": 0.0, + "hybrid": false, + "retrieval_depth": 20, + "rrf_k": 60, + "confidence": { + "high": 0.85, + "medium": 0.65 + }, + "exclude_low_confidence": false + }, + "reranker": { + "enabled": false, + "model": "cross-encoder/ms-marco-MiniLM-L6-v2", + "backend": "onnx" }, "sources": { "repos_dir": "data/repos", - "chunks_path": "data/chunks.jsonl" + "chunks_path": "data/chunks.jsonl", + "ingest_path": "data/ingest.jsonl" }, "pipeline": { "concurrency": 4, @@ -34,6 +48,12 @@ "progress_interval": 100, "max_embed_chars": 6000 }, + "verify": { + "queries": [ + {"query": "how to create a class", "min_results": 1}, + {"lookup": "Application", "min_results": 1} + ] + }, "repos": [ { "name": "my-source", @@ -48,5 +68,18 @@ "source_tag": "docs", "no_recurse": false } + ], + "db_sources": [ + { + "name": "coding-standards", + "type": "sqlite", + "path": "/path/to/standards.db", + "query": "SELECT id, title, body, category FROM standards", + "text_column": "body", + "id_column": "id", + "heading_column": "title", + "category_column": "category", + "source_tag": "standards" + } ] } diff --git a/pipeline.py b/pipeline.py index b4eacff..d534b57 100644 --- a/pipeline.py +++ b/pipeline.py @@ -11,12 +11,14 @@ # Used by: Manual CLI invocation (python pipeline.py rebuild) import asyncio +import hashlib import json import logging import os import shutil import sqlite3 import struct +import subprocess import sys import time from pathlib import Path @@ -54,7 +56,7 @@ def _validate_config(config: dict): Raises ConfigError on invalid config. 
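+
+    A config passes validation when it has the "ollama", "database", "search",
+    and "sources" sections, a positive integer "search.embed_dimensions", and
+    at least one content source: a non-empty "repos" list, a non-empty
+    "db_sources" list, or both (see config.example.json for both shapes).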
""" - required_sections = ["ollama", "database", "search", "sources", "repos"] + required_sections = ["ollama", "database", "search", "sources"] for section in required_sections: if section not in config: raise ConfigError(f"config.json missing required section '{section}'.") @@ -66,10 +68,13 @@ def _validate_config(config: dict): if not isinstance(dims, int) or dims <= 0: raise ConfigError("config.json 'search.embed_dimensions' must be a positive integer.") - if not isinstance(config["repos"], list) or len(config["repos"]) == 0: - raise ConfigError("config.json 'repos' must be a non-empty list.") + has_repos = isinstance(config.get("repos"), list) and len(config["repos"]) > 0 + has_db_sources = isinstance(config.get("db_sources"), list) and len(config["db_sources"]) > 0 - for i, repo in enumerate(config["repos"]): + if not has_repos and not has_db_sources: + raise ConfigError("config.json needs at least one of 'repos' (non-empty list) or 'db_sources' (non-empty list).") + + for i, repo in enumerate(config.get("repos", [])): if "name" not in repo: raise ConfigError(f"config.json repos[{i}] missing 'name'.") if "type" not in repo: @@ -77,6 +82,14 @@ def _validate_config(config: dict): if "path" not in repo and "url" not in repo: raise ConfigError(f"config.json repos[{i}] ('{repo['name']}') needs 'path' or 'url'.") + for i, db_src in enumerate(config.get("db_sources", [])): + if "name" not in db_src: + raise ConfigError(f"config.json db_sources[{i}] missing 'name'.") + if "path" not in db_src: + raise ConfigError(f"config.json db_sources[{i}] ('{db_src.get('name', '?')}') missing 'path'.") + if "query" not in db_src: + raise ConfigError(f"config.json db_sources[{i}] ('{db_src['name']}') missing 'query'.") + # --------------------------------------------------------------------------- # Clone (for git-based sources) @@ -90,7 +103,7 @@ def cmd_clone(config: dict): repos_dir = SCRIPT_DIR / config["sources"].get("repos_dir", "data/repos") repos_dir.mkdir(parents=True, exist_ok=True) - for repo in config["repos"]: + for repo in config.get("repos", []): url = repo.get("url") if not url: # Local path — no cloning needed @@ -134,6 +147,75 @@ def cmd_clone(config: dict): # --------------------------------------------------------------------------- +def _chunk_db_source(db_src: dict) -> list[dict]: + """Read chunks from a database source. + + Each row from the configured query becomes one chunk. Column mapping + is controlled by the db_source config entry. 
+ """ + name = db_src["name"] + db_type = db_src.get("type", "sqlite") + source_tag = db_src.get("source_tag", name) + query = db_src["query"] + + text_col = db_src.get("text_column", "text") + id_col = db_src.get("id_column", "") + heading_col = db_src.get("heading_column", "") + category_col = db_src.get("category_column", "") + module_col = db_src.get("module_column", "") + type_name_col = db_src.get("type_name_column", "") + + log.info(f"[db] Chunking '{name}' ({db_type})...") + + if db_type != "sqlite": + log.warning(f"[db] Unsupported db type '{db_type}' for '{name}', skipping.") + return [] + + db_path = db_src["path"] + if not Path(db_path).exists(): + log.warning(f"[db] Database not found: {db_path}, skipping '{name}'.") + return [] + + conn = sqlite3.connect(db_path) + conn.row_factory = sqlite3.Row + try: + rows = conn.execute(query).fetchall() + except sqlite3.OperationalError as e: + log.error(f"[db] Query failed for '{name}': {e}") + conn.close() + return [] + + columns = set(rows[0].keys()) if rows else set() + chunks = [] + + for row in rows: + text = str(row[text_col]) if text_col in columns else "" + if not text.strip(): + continue + + # Build chunk ID from source row ID or hash the text + if id_col and id_col in columns: + chunk_id = f"{source_tag}:{row[id_col]}" + else: + chunk_id = f"{source_tag}:{hashlib.sha256(text.encode()).hexdigest()[:16]}" + + chunk = { + "id": chunk_id, + "text": text, + "source": source_tag, + "module_path": str(row[module_col]) if module_col and module_col in columns else "", + "type_name": str(row[type_name_col]) if type_name_col and type_name_col in columns else "", + "category": str(row[category_col]) if category_col and category_col in columns else "", + "heading": str(row[heading_col]) if heading_col and heading_col in columns else "", + "file_path": "", + } + chunks.append(chunk) + + conn.close() + log.info(f"[db] '{name}': {len(chunks)} chunks from {len(rows)} rows") + return chunks + + def cmd_chunk(config: dict): """Run all chunkers and write chunks.jsonl.""" chunks_path = SCRIPT_DIR / config["sources"]["chunks_path"] @@ -142,19 +224,27 @@ def cmd_chunk(config: dict): all_chunks = [] - for repo in config["repos"]: + for repo in config.get("repos", []): chunker_type = repo["type"] log.info(f"Chunking {repo['name']} ({chunker_type})...") # Resolve source directory if repo.get("path"): source_dir = Path(repo["path"]) + elif repo.get("url"): + # URL-based repos: use local_dir or fall back to repo name (matches clone step) + local_dir = repo.get("local_dir", repo["name"]) + source_dir = repos_dir / local_dir elif repo.get("local_dir"): source_dir = repos_dir / repo["local_dir"] else: log.warning(f"No path or local_dir for {repo['name']}, skipping.") continue + # Allow diving into a subdirectory (e.g. 
"Modules/FortniteGame") + if repo.get("source_subdir"): + source_dir = source_dir / repo["source_subdir"] + try: chunker = get_chunker(chunker_type) except ValueError as e: @@ -164,6 +254,11 @@ def cmd_chunk(config: dict): chunks = chunker.chunk_directory(source_dir, repo) all_chunks.extend(chunks) + # Chunk from database sources (if configured) + for db_src in config.get("db_sources", []): + db_chunks = _chunk_db_source(db_src) + all_chunks.extend(db_chunks) + # Deduplicate by ID (keep first occurrence) seen = set() deduped = [] @@ -172,6 +267,9 @@ def cmd_chunk(config: dict): seen.add(chunk["id"]) deduped.append(chunk) + # Attach source file hashes for provenance tracking + _attach_source_hashes(deduped, config.get("repos", []), repos_dir) + # Write chunks.jsonl with open(chunks_path, "w", encoding="utf-8") as f: for chunk in deduped: @@ -190,6 +288,100 @@ def _log_source_stats(chunks: list[dict]): log.info(f" {source}: {count}") +# --------------------------------------------------------------------------- +# Source provenance +# --------------------------------------------------------------------------- + + +def _build_source_base_dirs(repos_config: list[dict], repos_dir: Path) -> dict[str, Path]: + """Build a mapping from source_tag → base directory for file path resolution. + + Resolution rules: + 1. If repo has 'path': base = Path(repo['path']) + 2. If repo has 'url': base = repos_dir / repo.get('local_dir', repo['name']) + 3. If repo has 'source_subdir': base = base / source_subdir + """ + base_dirs: dict[str, Path] = {} + for repo in repos_config: + source_tag = repo.get("source_tag", repo["name"]) + if repo.get("path"): + base = Path(repo["path"]) + elif repo.get("url"): + local_dir = repo.get("local_dir", repo["name"]) + base = repos_dir / local_dir + else: + continue + if repo.get("source_subdir"): + base = base / repo["source_subdir"] + base_dirs[source_tag] = base + return base_dirs + + +def _resolve_source_path( + file_path: str, + source_tag: str, + base_dirs: dict[str, Path], +) -> Path | None: + """Resolve a chunk's file_path to an absolute path on disk. + + Returns None if the source_tag has no known base directory. + """ + base = base_dirs.get(source_tag) + if base is None: + return None + return base / file_path + + +def _attach_source_hashes(chunks: list[dict], repos_config: list[dict], repos_dir: Path): + """Attach SHA-256 source hashes to chunks grouped by (source, file_path). + + Modifies chunks in-place. Sets source_hash to '' for unresolvable or + missing files (logged as warnings). 
+ """ + base_dirs = _build_source_base_dirs(repos_config, repos_dir) + + # Group chunks by (source, file_path) to hash each file once + from collections import defaultdict + + groups: dict[tuple[str, str], list[dict]] = defaultdict(list) + for chunk in chunks: + fp = chunk.get("file_path", "") + source = chunk.get("source", "") + if fp: + groups[(source, fp)].append(chunk) + else: + chunk["source_hash"] = "" + + hashed = 0 + skipped = 0 + for (source, fp), group in groups.items(): + resolved = _resolve_source_path(fp, source, base_dirs) + if resolved is None: + for c in group: + c["source_hash"] = "" + skipped += len(group) + continue + + try: + content = resolved.read_bytes() + file_hash = hashlib.sha256(content).hexdigest() + for c in group: + c["source_hash"] = file_hash + hashed += len(group) + except FileNotFoundError: + log.warning(f"[hash] File missing (deleted after chunking?): {resolved}") + for c in group: + c["source_hash"] = "" + skipped += len(group) + except OSError as e: + log.warning(f"[hash] Could not read {resolved}: {e}") + for c in group: + c["source_hash"] = "" + skipped += len(group) + + log.info(f"[hash] {hashed} chunks hashed, {skipped} skipped") + + # --------------------------------------------------------------------------- # Embed # --------------------------------------------------------------------------- @@ -300,6 +492,10 @@ async def embed_one(chunk: dict) -> tuple[dict, list[float] | None]: conn.execute("BEGIN") conn.execute("INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild')") conn.commit() + + # Write index metadata (provenance tracking) + _write_index_metadata(conn, config) + conn.close() # Swap temp file into place (handles Windows file locking) @@ -350,6 +546,8 @@ def _create_schema(conn: sqlite3.Connection): category TEXT DEFAULT '', heading TEXT DEFAULT '', file_path TEXT DEFAULT '', + source_hash TEXT DEFAULT '', + gotcha TEXT DEFAULT '', embedding BLOB ); @@ -364,6 +562,11 @@ def _create_schema(conn: sqlite3.Connection): content_rowid=rowid ); + CREATE TABLE IF NOT EXISTS index_metadata ( + key TEXT PRIMARY KEY, + value TEXT NOT NULL + ); + CREATE INDEX idx_chunks_source ON chunks(source); CREATE INDEX idx_chunks_module ON chunks(module_path); CREATE INDEX idx_chunks_type_name ON chunks(type_name); @@ -382,8 +585,9 @@ def _insert_chunk(conn: sqlite3.Connection, chunk: dict, embedding: list[float]) conn.execute( """INSERT OR IGNORE INTO chunks - (id, text, source, module_path, type_name, category, heading, file_path, embedding) - VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""", + (id, text, source, module_path, type_name, category, heading, file_path, + source_hash, gotcha, embedding) + VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""", ( chunk["id"], chunk["text"], @@ -393,6 +597,8 @@ def _insert_chunk(conn: sqlite3.Connection, chunk: dict, embedding: list[float]) chunk.get("category", ""), chunk.get("heading", ""), chunk.get("file_path", ""), + chunk.get("source_hash", ""), + chunk.get("gotcha", ""), blob, ), ) @@ -405,6 +611,69 @@ def _insert_chunk(conn: sqlite3.Connection, chunk: dict, embedding: list[float]) ) +def _write_index_metadata(conn: sqlite3.Connection, config: dict): + """Write provenance metadata to the index_metadata table. + + Stores the index timestamp and git commit hashes for each repo that + has a local .git directory. 
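+
+    Keys written (example values are illustrative):
+
+        indexed_at          UTC ISO-8601 timestamp of this embed run
+        embed_model         e.g. "nomic-embed-text" (from ollama.embed_model)
+        embed_dimensions    e.g. "768" (stored as text)
+        repo:<name>:commit  HEAD commit hash for each repo with a .git directory
+
+    cmd_stale and cmd_freshness read these back to detect model drift and
+    upstream changes.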
+ """ + from datetime import datetime, timezone + + repos_dir = SCRIPT_DIR / config["sources"].get("repos_dir", "data/repos") + now = datetime.now(timezone.utc).isoformat() + + conn.execute("BEGIN") + conn.execute( + "INSERT OR REPLACE INTO index_metadata (key, value) VALUES (?, ?)", + ("indexed_at", now), + ) + + # Store embedding model info for drift detection + embed_model = config.get("ollama", {}).get("embed_model", "unknown") + embed_dims = config.get("search", {}).get("embed_dimensions", 0) + conn.execute( + "INSERT OR REPLACE INTO index_metadata (key, value) VALUES (?, ?)", + ("embed_model", embed_model), + ) + conn.execute( + "INSERT OR REPLACE INTO index_metadata (key, value) VALUES (?, ?)", + ("embed_dimensions", str(embed_dims)), + ) + + for repo in config.get("repos", []): + name = repo["name"] + # Resolve the repo directory + if repo.get("path"): + repo_dir = Path(repo["path"]) + elif repo.get("url"): + local_dir = repo.get("local_dir", name) + repo_dir = repos_dir / local_dir + else: + continue + + # Check for git repo and capture commit hash + git_dir = repo_dir / ".git" + if git_dir.exists(): + try: + result = subprocess.run( + ["git", "-C", str(repo_dir), "rev-parse", "HEAD"], + capture_output=True, + text=True, + timeout=10, + ) + if result.returncode == 0: + commit = result.stdout.strip() + conn.execute( + "INSERT OR REPLACE INTO index_metadata (key, value) VALUES (?, ?)", + (f"repo:{name}:commit", commit), + ) + except (subprocess.TimeoutExpired, OSError) as e: + log.warning(f"[metadata] Could not get git commit for {name}: {e}") + + conn.commit() + log.info(f"[metadata] Index metadata written (indexed_at: {now})") + + # --------------------------------------------------------------------------- # Stats # --------------------------------------------------------------------------- @@ -459,6 +728,706 @@ def cmd_stats(config: dict): conn.close() +# --------------------------------------------------------------------------- +# Verify +# --------------------------------------------------------------------------- + + +def cmd_verify(config: dict): + """Verify index health and search quality. + + Runs automatic checks (schema, sources, embeddings, FTS5) and + optional user-defined test queries from config.json "verify" section. + Returns exit code 0 if all checks pass, 1 if any fail. 
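+
+    Test queries come from the "verify" section of config.json, e.g.
+    (matches config.example.json):
+
+        "verify": {
+            "queries": [
+                {"query": "how to create a class", "min_results": 1},
+                {"lookup": "Application", "min_results": 1}
+            ]
+        }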
+ """ + asyncio.run(_verify_async(config)) + + +async def _verify_async(config: dict): + db_path = SCRIPT_DIR / config["database"]["path"] + if not db_path.exists(): + log.error(f"[verify] Database not found: {db_path}") + log.error("[verify] Run 'rebuild' first.") + sys.exit(1) + + dimensions = config["search"]["embed_dimensions"] + expected_blob_size = dimensions * 4 # float32 + + conn = sqlite3.connect(str(db_path)) + passed = 0 + failed = 0 + warnings = 0 + + def check(name: str, ok: bool, detail: str = ""): + nonlocal passed, failed + if ok: + passed += 1 + log.info(f" PASS {name}" + (f" — {detail}" if detail else "")) + else: + failed += 1 + log.error(f" FAIL {name}" + (f" — {detail}" if detail else "")) + + def warn(name: str, detail: str): + nonlocal warnings + warnings += 1 + log.warning(f" WARN {name} — {detail}") + + log.info("[verify] Running index health checks...") + log.info("") + + # --- Schema checks --- + log.info("Schema:") + tables = {r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()} + check("chunks table exists", "chunks" in tables) + check("chunks_fts table exists", "chunks_fts" in tables) + + # Column check + cols = {r[1] for r in conn.execute("PRAGMA table_info(chunks)").fetchall()} + expected_cols = {"id", "text", "source", "module_path", "type_name", "category", "heading", "file_path", "embedding"} + missing_cols = expected_cols - cols + check("chunks columns complete", not missing_cols, f"missing: {missing_cols}" if missing_cols else "") + + # --- Content checks --- + log.info("") + log.info("Content:") + total = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0] + check("chunks not empty", total > 0, f"{total} chunks") + + # Source distribution + sources = conn.execute("SELECT source, COUNT(*) FROM chunks GROUP BY source ORDER BY source").fetchall() + for source, count in sources: + log.info(f" {source}: {count}") + + # Check all configured source_tags are represented + expected_sources = {r.get("source_tag", r["name"]) for r in config.get("repos", [])} + expected_sources |= {r.get("source_tag", r["name"]) for r in config.get("db_sources", [])} + actual_sources = {r[0] for r in sources} + missing_sources = expected_sources - actual_sources + if missing_sources: + warn("source coverage", f"configured sources not in DB: {missing_sources}") + else: + check("all configured sources indexed", True, f"{len(actual_sources)} sources") + + # --- Embedding checks --- + log.info("") + log.info("Embeddings:") + null_embeddings = conn.execute("SELECT COUNT(*) FROM chunks WHERE embedding IS NULL").fetchone()[0] + check("no null embeddings", null_embeddings == 0, f"{null_embeddings} null" if null_embeddings else "") + + # Spot-check dimensions on a sample + sample = conn.execute("SELECT id, LENGTH(embedding) FROM chunks WHERE embedding IS NOT NULL LIMIT 20").fetchall() + bad_dims = [(cid, blen) for cid, blen in sample if blen != expected_blob_size] + check( + f"embedding dimensions ({dimensions}d = {expected_blob_size} bytes)", + not bad_dims, + f"{len(bad_dims)} mismatched: {bad_dims[:3]}" if bad_dims else f"checked {len(sample)} samples", + ) + + # --- FTS5 integrity --- + log.info("") + log.info("FTS5:") + try: + fts_count = conn.execute( + "SELECT COUNT(*) FROM chunks_fts" + ).fetchone()[0] + check("FTS5 populated", fts_count > 0, f"{fts_count} entries") + check("FTS5 matches chunks", fts_count == total, f"FTS5={fts_count} vs chunks={total}") + except sqlite3.OperationalError as e: + check("FTS5 readable", False, str(e)) + + 
# --- Search quality (requires Ollama) --- + verify_config = config.get("verify", {}) + test_queries = verify_config.get("queries", []) + + if test_queries: + log.info("") + log.info("Search quality:") + + ollama_host = config["ollama"]["host"] + embed_model = config["ollama"]["embed_model"] + embed_timeout = config["ollama"].get("embed_timeout", 30.0) + + # Load embeddings for similarity search + import numpy as np + + rows = conn.execute("SELECT id, embedding FROM chunks WHERE embedding IS NOT NULL").fetchall() + chunk_ids = [r[0] for r in rows] + embeddings = np.array( + [struct.unpack(f"{dimensions}f", r[1]) for r in rows], + dtype=np.float32, + ) + + # Pre-load metadata for filtering + meta_rows = conn.execute("SELECT id, source, module_path, type_name FROM chunks").fetchall() + meta_by_id = {r[0]: {"source": r[1], "module_path": r[2], "type_name": r[3]} for r in meta_rows} + + ollama_available = True + + async with httpx.AsyncClient(timeout=embed_timeout) as client: + for tq in test_queries: + min_results = tq.get("min_results", 1) + + if "query" in tq: + # Semantic search test (requires Ollama) + if not ollama_available: + continue + query_text = tq["query"] + try: + resp = await client.post( + f"{ollama_host}/api/embeddings", + json={"model": embed_model, "prompt": f"search_query: {query_text}"}, + ) + resp.raise_for_status() + query_vec = np.array(resp.json()["embedding"], dtype=np.float32) + # Normalize + norm = np.linalg.norm(query_vec) + if norm > 0: + query_vec /= norm + + similarities = embeddings @ query_vec + top_k = min(tq.get("top_k", 5), len(chunk_ids)) + top_indices = np.argsort(similarities)[-top_k:][::-1] + + results = [] + for idx in top_indices: + cid = chunk_ids[idx] + score = float(similarities[idx]) + meta = meta_by_id.get(cid, {}) + results.append({"id": cid, "score": score, **meta}) + + # Apply filters + if tq.get("expect_source"): + results = [r for r in results if r.get("source") == tq["expect_source"]] + + check( + f"search \"{query_text}\"", + len(results) >= min_results, + f"{len(results)} results (need {min_results}), " + f"top: {results[0]['id']} ({results[0]['score']:.3f})" if results else "no results", + ) + except httpx.ConnectError: + warn(f"search \"{query_text}\"", "Ollama not reachable — skipping semantic tests") + ollama_available = False + except Exception as e: + check(f"search \"{query_text}\"", False, str(e)) + + elif "lookup" in tq: + # Keyword lookup test (pure SQLite, no Ollama needed) + name = tq["lookup"] + lookup_rows = conn.execute( + "SELECT id, source, type_name FROM chunks WHERE type_name = ? LIMIT 5", + (name,), + ).fetchall() + if not lookup_rows: + # Fallback to LIKE + safe = name.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_") + lookup_rows = conn.execute( + "SELECT id, source, type_name FROM chunks WHERE type_name LIKE ? 
ESCAPE '\\' LIMIT 5", + (f"%{safe}%",), + ).fetchall() + + check( + f"lookup \"{name}\"", + len(lookup_rows) >= min_results, + f"{len(lookup_rows)} results" + (f", first: {lookup_rows[0][0]}" if lookup_rows else ""), + ) + + conn.close() + + # --- Summary --- + log.info("") + total_checks = passed + failed + if failed == 0: + log.info(f"[verify] ALL PASSED ({passed} checks, {warnings} warnings)") + else: + log.error(f"[verify] {failed} FAILED / {passed} passed / {warnings} warnings") + sys.exit(1) + + +# --------------------------------------------------------------------------- +# Stale +# --------------------------------------------------------------------------- + + +def cmd_stale(config: dict): + """Check for stale chunks by comparing source hashes and git commits. + + Two-level check: + 1. Repo-level (fast): compare stored git commit vs current HEAD. + 2. File-level (thorough): compare stored source_hash vs current file SHA-256. + + Exit code: 0 if all fresh, 1 if any stale or missing. + """ + db_path = SCRIPT_DIR / config["database"]["path"] + if not db_path.exists(): + log.error(f"[stale] Database not found: {db_path}") + sys.exit(1) + + conn = sqlite3.connect(str(db_path)) + conn.row_factory = sqlite3.Row + repos_dir = SCRIPT_DIR / config["sources"].get("repos_dir", "data/repos") + base_dirs = _build_source_base_dirs(config.get("repos", []), repos_dir) + + stale_repos = [] + fresh_repos = [] + skipped_repos = [] + upstream_behind = [] + stale_files = 0 + missing_files = 0 + fresh_files = 0 + + # --- Repo-level check (git commit comparison) --- + log.info("[stale] Checking repo versions...") + metadata_rows = conn.execute("SELECT key, value FROM index_metadata").fetchall() + metadata = {r["key"]: r["value"] for r in metadata_rows} + + indexed_at = metadata.get("indexed_at", "unknown") + log.info(f"[stale] Index built at: {indexed_at}") + + # Check for embedding model drift + stored_model = metadata.get("embed_model") + config_model = config.get("ollama", {}).get("embed_model", "unknown") + if stored_model and stored_model != config_model: + log.warning( + f"[stale] Embedding model changed: index used '{stored_model}', " + f"config has '{config_model}' — full rebuild required" + ) + + for repo in config.get("repos", []): + name = repo["name"] + stored_commit = metadata.get(f"repo:{name}:commit") + if not stored_commit: + skipped_repos.append(name) + continue + + # Resolve repo directory + if repo.get("path"): + repo_dir = Path(repo["path"]) + elif repo.get("url"): + local_dir = repo.get("local_dir", name) + repo_dir = repos_dir / local_dir + else: + skipped_repos.append(name) + continue + + try: + result = subprocess.run( + ["git", "-C", str(repo_dir), "rev-parse", "HEAD"], + capture_output=True, + text=True, + timeout=10, + ) + if result.returncode == 0: + current_commit = result.stdout.strip() + if current_commit != stored_commit: + log.info( + f"[stale] Repo '{name}': commit changed " + f"({stored_commit[:7]} -> {current_commit[:7]})" + ) + stale_repos.append(name) + else: + fresh_repos.append(name) + + # Check upstream for unpulled commits (git repos only) + if repo.get("url"): + try: + # Fetch remote refs without downloading objects + subprocess.run( + ["git", "-C", str(repo_dir), "fetch", "--dry-run"], + capture_output=True, text=True, timeout=30, + ) + # Compare local HEAD to remote tracking branch + remote_result = subprocess.run( + ["git", "-C", str(repo_dir), "rev-parse", "@{u}"], + capture_output=True, text=True, timeout=10, + ) + if remote_result.returncode == 0: + 
remote_head = remote_result.stdout.strip() + if remote_head != current_commit: + upstream_behind.append(name) + log.info(f"[stale] Repo '{name}': upstream has new commits") + except (subprocess.TimeoutExpired, OSError): + pass # Network check is best-effort + else: + skipped_repos.append(name) + except (subprocess.TimeoutExpired, OSError): + skipped_repos.append(name) + + # --- File-level check (source hash comparison) --- + log.info("[stale] Checking file hashes...") + rows = conn.execute( + "SELECT DISTINCT file_path, source, source_hash FROM chunks WHERE source_hash != ''" + ).fetchall() + + for row in rows: + fp = row["file_path"] + source = row["source"] + stored_hash = row["source_hash"] + + resolved = _resolve_source_path(fp, source, base_dirs) + if resolved is None: + missing_files += 1 + continue + + try: + current_hash = hashlib.sha256(resolved.read_bytes()).hexdigest() + if current_hash != stored_hash: + log.info(f"[stale] {fp}: content changed") + stale_files += 1 + else: + fresh_files += 1 + except FileNotFoundError: + log.info(f"[stale] {fp}: file missing") + missing_files += 1 + except OSError: + missing_files += 1 + + conn.close() + + # --- Summary --- + log.info("") + if fresh_repos: + log.info(f"[stale] Fresh repos: {', '.join(fresh_repos)}") + if stale_repos: + log.info(f"[stale] Stale repos: {', '.join(stale_repos)}") + if upstream_behind: + log.info(f"[stale] Upstream updates available: {', '.join(upstream_behind)}") + if skipped_repos: + log.info(f"[stale] Skipped repos (no git): {', '.join(skipped_repos)}") + log.info(f"[stale] Files: {fresh_files} fresh, {stale_files} stale, {missing_files} missing") + + if stale_repos or stale_files or missing_files: + log.info("[stale] Index is STALE — run 'rebuild' to update") + sys.exit(1) + elif upstream_behind: + log.info("[stale] Index is fresh locally, but upstream repos have updates — run 'clone' then 'rebuild'") + sys.exit(1) + else: + log.info("[stale] Index is FRESH") + + +# --------------------------------------------------------------------------- +# Freshness +# --------------------------------------------------------------------------- + + +def cmd_freshness(config: dict): + """Print a unified freshness report: index age, model version, source lag. + + Unlike cmd_stale (which exits non-zero for CI), this is a human-readable + dashboard showing everything relevant to content quality. 
+ """ + from datetime import datetime, timezone + + db_path = SCRIPT_DIR / config["database"]["path"] + if not db_path.exists(): + log.error(f"[freshness] Database not found: {db_path}") + sys.exit(1) + + conn = sqlite3.connect(str(db_path)) + metadata_rows = conn.execute("SELECT key, value FROM index_metadata").fetchall() + metadata = {r[0]: r[1] for r in metadata_rows} + + chunk_count = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0] + source_counts = conn.execute( + "SELECT source, COUNT(*) FROM chunks GROUP BY source ORDER BY COUNT(*) DESC" + ).fetchall() + conn.close() + + repos_dir = SCRIPT_DIR / config["sources"].get("repos_dir", "data/repos") + issues = [] + + # --- Index age --- + indexed_at = metadata.get("indexed_at", "unknown") + print(f"\n{'=' * 60}") + print(" Freshness Report") + print("=" * 60) + print(f"\n Index built: {indexed_at}") + + if indexed_at != "unknown": + try: + built = datetime.fromisoformat(indexed_at) + age = datetime.now(timezone.utc) - built + age_str = f"{age.days}d {age.seconds // 3600}h ago" + print(f" Index age: {age_str}") + if age.days > 7: + issues.append(f"Index is {age.days} days old — consider rebuilding") + except ValueError: + pass + + print(f" Total chunks: {chunk_count}") + + # --- Embedding model --- + stored_model = metadata.get("embed_model", "not tracked") + stored_dims = metadata.get("embed_dimensions", "not tracked") + config_model = config.get("ollama", {}).get("embed_model", "unknown") + config_dims = config.get("search", {}).get("embed_dimensions", 0) + + print(f"\n Embed model (index): {stored_model} ({stored_dims}d)") + print(f" Embed model (config): {config_model} ({config_dims}d)") + + if stored_model != "not tracked" and stored_model != config_model: + issues.append( + f"Model mismatch: index used '{stored_model}', config has '{config_model}' — full rebuild required" + ) + if stored_dims != "not tracked" and str(config_dims) != stored_dims: + issues.append( + f"Dimension mismatch: index has {stored_dims}d, config has {config_dims}d — full rebuild required" + ) + + # --- Source repos --- + print(f"\n {'Source':<25} {'Chunks':>7} {'Status'}") + print(f" {'-' * 25} {'-' * 7} {'-' * 30}") + + for repo in config.get("repos", []): + name = repo["name"] + source_tag = repo.get("source_tag", name) + count = next((c for s, c in source_counts if s == source_tag), 0) + + stored_commit = metadata.get(f"repo:{name}:commit", "") + status_parts = [] + + # Resolve repo directory + if repo.get("path"): + repo_dir = Path(repo["path"]) + elif repo.get("url"): + local_dir = repo.get("local_dir", name) + repo_dir = repos_dir / local_dir + else: + repo_dir = None + + if repo_dir and repo_dir.exists(): + try: + result = subprocess.run( + ["git", "-C", str(repo_dir), "rev-parse", "HEAD"], + capture_output=True, text=True, timeout=10, + ) + if result.returncode == 0: + current = result.stdout.strip() + if stored_commit and current != stored_commit: + status_parts.append("LOCAL CHANGED") + elif stored_commit: + status_parts.append("fresh") + else: + status_parts.append("no commit tracked") + except (subprocess.TimeoutExpired, OSError): + status_parts.append("git error") + elif repo_dir: + status_parts.append("NOT CLONED") + issues.append(f"Repo '{name}' not cloned: {repo_dir}") + else: + status_parts.append("no path") + + if count == 0: + status_parts.append("0 chunks!") + issues.append(f"Source '{source_tag}' has 0 chunks in the index") + + status = ", ".join(status_parts) + print(f" {name:<25} {count:>7} {status}") + + # --- DB sources --- + 
for db_src in config.get("db_sources", []): + name = db_src["name"] + source_tag = db_src.get("source_tag", name) + count = next((c for s, c in source_counts if s == source_tag), 0) + db_file = Path(db_src["path"]) + status = "exists" if db_file.exists() else "MISSING" + if count == 0: + status += ", 0 chunks!" + print(f" {name:<25} {count:>7} {status} (db_source)") + + # --- Issues --- + if issues: + print(f"\n Issues ({len(issues)}):") + for issue in issues: + print(f" - {issue}") + else: + print(f"\n No issues found.") + + print(f"\n{'=' * 60}\n") + + +# --------------------------------------------------------------------------- +# Ingest +# --------------------------------------------------------------------------- + + +def cmd_ingest(config: dict): + """Incrementally ingest new chunks from an ingest queue file. + + Reads pre-formatted chunk dicts from the ingest path, embeds them, + and appends to the existing database. Does NOT rebuild from scratch. + + The ingest file is atomically replaced with an empty file after + processing, so concurrent writers (e.g. Orchestration) won't lose + entries that arrive during processing. + """ + asyncio.run(_ingest_async(config)) + + +async def _ingest_async(config: dict): + ingest_path = SCRIPT_DIR / config["sources"].get("ingest_path", "data/ingest.jsonl") + db_path = SCRIPT_DIR / config["database"]["path"] + + if not ingest_path.exists() or ingest_path.stat().st_size == 0: + log.info("[ingest] No entries to ingest.") + return + + if not db_path.exists(): + log.error(f"[ingest] Database not found: {db_path}. Run 'rebuild' first.") + return + + # Read all entries into memory, then replace file with empty one + raw_lines = ingest_path.read_text(encoding="utf-8").splitlines() + tmp_empty = ingest_path.with_suffix(".jsonl.tmp") + tmp_empty.write_text("", encoding="utf-8") + try: + os.replace(str(tmp_empty), str(ingest_path)) + except OSError: + # Windows fallback + try: + ingest_path.write_text("", encoding="utf-8") + tmp_empty.unlink(missing_ok=True) + except OSError: + pass + + # Parse and validate entries + chunks = [] + for i, line in enumerate(raw_lines): + line = line.strip() + if not line: + continue + try: + entry = json.loads(line) + except json.JSONDecodeError: + log.warning(f"[ingest] Skipping malformed JSON on line {i + 1}") + continue + if not all(k in entry for k in ("id", "text", "source")): + log.warning(f"[ingest] Skipping entry missing required fields on line {i + 1}") + continue + chunks.append(entry) + + if not chunks: + log.info("[ingest] No valid entries to ingest.") + return + + # Deduplicate against existing DB + conn = sqlite3.connect(str(db_path), isolation_level=None) + existing_ids = { + r[0] + for r in conn.execute("SELECT id FROM chunks").fetchall() + } + new_chunks = [c for c in chunks if c["id"] not in existing_ids] + if not new_chunks: + log.info(f"[ingest] All {len(chunks)} entries already exist in DB.") + conn.close() + return + + log.info(f"[ingest] {len(new_chunks)} new entries to embed (skipped {len(chunks) - len(new_chunks)} duplicates)") + + # Embed and insert + ollama_host = config["ollama"]["host"] + embed_model = config["ollama"]["embed_model"] + embed_timeout = config["ollama"].get("embed_timeout", 30.0) + dimensions = config["search"]["embed_dimensions"] + pipeline_config = config.get("pipeline", {}) + concurrency = pipeline_config.get("concurrency", 4) + max_retries = max(1, pipeline_config.get("max_retries", 3)) + max_embed_chars = pipeline_config.get("max_embed_chars", 6000) + + semaphore = 
asyncio.Semaphore(concurrency)
+    embed_url = f"{ollama_host}/api/embeddings"
+    embedded = 0
+    errors = 0
+
+    async with httpx.AsyncClient(timeout=embed_timeout) as client:
+
+        async def embed_one(chunk: dict) -> tuple[dict, list[float] | None]:
+            nonlocal embedded, errors
+            async with semaphore:
+                text = "search_document: " + chunk["text"][:max_embed_chars]
+                body = {"model": embed_model, "prompt": text}
+                for attempt in range(max_retries):
+                    try:
+                        resp = await client.post(embed_url, json=body)
+                        resp.raise_for_status()
+                        embedding = resp.json().get("embedding", [])
+                        embedded += 1
+                        return chunk, embedding
+                    except Exception as e:
+                        if attempt < max_retries - 1:
+                            await asyncio.sleep(1.0)
+                            continue
+                        errors += 1
+                        if errors <= 5:
+                            log.error(f"[ingest] Error embedding {chunk['id']}: {e}")
+                        return chunk, None
+
+        tasks = [embed_one(c) for c in new_chunks]
+        results = await asyncio.gather(*tasks)
+
+    conn.execute("BEGIN")
+    for chunk, embedding in results:
+        if embedding is None:
+            continue
+        if len(embedding) != dimensions:
+            continue
+        _insert_chunk(conn, chunk, embedding)
+    conn.commit()
+
+    # Write ingest metadata
+    from datetime import datetime, timezone
+
+    now = datetime.now(timezone.utc).isoformat()
+    conn.execute("BEGIN")
+    conn.execute(
+        "INSERT OR REPLACE INTO index_metadata (key, value) VALUES (?, ?)",
+        ("last_ingest_at", now),
+    )
+    conn.execute(
+        "INSERT OR REPLACE INTO index_metadata (key, value) VALUES (?, ?)",
+        ("last_ingest_count", str(embedded)),
+    )
+    conn.commit()
+
+    conn.close()
+    log.info(f"[ingest] Done — {embedded} embedded, {errors} errors")
+
+
+# ---------------------------------------------------------------------------
+# Gotcha
+# ---------------------------------------------------------------------------
+
+
+def cmd_gotcha(config: dict, chunk_id: str, gotcha_text: str):
+    """Add or update a gotcha note on an existing chunk.
+
+    Usage: python pipeline.py gotcha <chunk-id> "gotcha text"
+
+    Gotcha notes are anti-hallucination annotations displayed alongside
+    search results to warn about common misdiagnoses.
+    """
+    db_path = SCRIPT_DIR / config["database"]["path"]
+
+    if not db_path.exists():
+        log.error(f"[gotcha] Database not found: {db_path}")
+        sys.exit(1)
+
+    conn = sqlite3.connect(str(db_path))
+    row = conn.execute("SELECT id FROM chunks WHERE id = ?", (chunk_id,)).fetchone()
+    if not row:
+        log.error(f"[gotcha] Chunk not found: {chunk_id}")
+        conn.close()
+        sys.exit(1)
+
+    conn.execute("UPDATE chunks SET gotcha = ?
WHERE id = ?", (gotcha_text, chunk_id)) + conn.commit() + conn.close() + + log.info(f"[gotcha] Updated chunk {chunk_id}") + log.info(f"[gotcha] Text: {gotcha_text}") + + # --------------------------------------------------------------------------- # Main # --------------------------------------------------------------------------- @@ -470,9 +1439,10 @@ def main(): format="%(levelname)s: %(message)s", ) + commands = "clone, chunk, embed, rebuild, stats, verify, stale, freshness, ingest, gotcha" if len(sys.argv) < 2: print("Usage: python pipeline.py ") - print("Commands: clone, chunk, embed, rebuild, stats") + print(f"Commands: {commands}") sys.exit(1) command = sys.argv[1] @@ -490,9 +1460,22 @@ def main(): cmd_embed(config) elif command == "stats": cmd_stats(config) + elif command == "verify": + cmd_verify(config) + elif command == "stale": + cmd_stale(config) + elif command == "freshness": + cmd_freshness(config) + elif command == "ingest": + cmd_ingest(config) + elif command == "gotcha": + if len(sys.argv) < 4: + print('Usage: python pipeline.py gotcha "gotcha text"') + sys.exit(1) + cmd_gotcha(config, sys.argv[2], sys.argv[3]) else: log.error(f"Unknown command: {command}") - print("Commands: clone, chunk, embed, rebuild, stats") + print(f"Commands: {commands}") sys.exit(1) diff --git a/requirements.txt b/requirements.txt index c2474ae..c1a44e3 100644 --- a/requirements.txt +++ b/requirements.txt @@ -2,3 +2,4 @@ mcp>=1.0.0 httpx>=0.27.0 numpy>=1.26.0 pytest>=8.0.0 +# Optional: pip install sentence-transformers>=4.0.0 (only needed if reranker.enabled = true) diff --git a/reranker.py b/reranker.py new file mode 100644 index 0000000..457465c --- /dev/null +++ b/reranker.py @@ -0,0 +1,72 @@ +# mcp-rag - Cross-Encoder Reranker +# +# Reranks search candidates using a cross-encoder model for improved +# relevance. Lazy-loads the model on first use to avoid startup cost +# when reranking is disabled. +# +# Uses sentence-transformers CrossEncoder with ONNX backend by default +# (avoids pulling in full PyTorch for inference). +# +# Depends on: sentence-transformers[onnx] (optional, only when enabled) +# Used by: server.py (when config reranker.enabled = true) + +import logging + +log = logging.getLogger("mcp-rag-server") + +_reranker = None + + +def get_reranker(config: dict): + """Load the cross-encoder model on first use. + + Subsequent calls return the cached instance. The model is downloaded + from HuggingFace on first load if not cached locally. + + Args: + config: Full config dict (uses 'reranker' section). + + Returns: + A CrossEncoder instance. + """ + global _reranker + if _reranker is None: + from sentence_transformers import CrossEncoder + + reranker_config = config.get("reranker", {}) + model_name = reranker_config.get("model", "cross-encoder/ms-marco-MiniLM-L6-v2") + backend = reranker_config.get("backend", "onnx") + + log.info(f"Loading reranker: {model_name} (backend={backend})") + try: + _reranker = CrossEncoder(model_name, backend=backend) + except Exception as e: + # Fall back to default backend if ONNX fails + log.warning(f"ONNX backend failed ({e}), falling back to default backend") + _reranker = CrossEncoder(model_name) + log.info("Reranker loaded") + + return _reranker + + +def rerank(query: str, candidates: list, config: dict) -> list: + """Rerank candidate chunks by cross-encoder relevance. + + Args: + query: The user's search query. + candidates: List of sqlite3.Row or dict objects (must have 'id' and 'text' keys). + config: Full config dict. 
+ + Returns: + Candidates reordered by cross-encoder score (best first). + """ + if len(candidates) <= 1: + return candidates + + reranker = get_reranker(config) + pairs = [(query, c["text"]) for c in candidates] + scores = reranker.predict(pairs) + + scored = list(zip(candidates, scores)) + scored.sort(key=lambda x: x[1], reverse=True) + return [c for c, s in scored] diff --git a/server.py b/server.py index fda69cf..2edcd09 100644 --- a/server.py +++ b/server.py @@ -8,6 +8,9 @@ # Tool names, descriptions, and server identity are all config-driven, # so one codebase serves any project. # +# Supports hybrid search (BM25 + vector via RRF) and optional +# cross-encoder reranking, both config-gated. +# # Depends on: config.json, data/*.db, numpy, httpx, mcp # Used by: MCP clients (registered via `claude mcp add`) @@ -46,8 +49,75 @@ def _sanitize_fts(query: str) -> str: return result.strip() -def format_results(rows: list, scores: dict[str, float] | None = None) -> str: - """Format search results as readable text.""" +def _rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]: + """Reciprocal Rank Fusion across multiple ranked ID lists. + + Combines rankings from different retrieval methods (e.g. vector + BM25) + without requiring score normalization. Higher RRF score = better. + + Args: + ranked_lists: List of ranked ID lists (best-first). + k: RRF constant (default 60, standard value from the paper). + + Returns: + List of (chunk_id, rrf_score) tuples, sorted by score descending. + """ + scores: dict[str, float] = {} + for ranked in ranked_lists: + for rank, doc_id in enumerate(ranked): + scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1) + return sorted(scores.items(), key=lambda x: x[1], reverse=True) + + +def _confidence_tier(score: float, high_thresh: float, med_thresh: float) -> str: + """Classify a similarity score into a confidence tier. + + Thresholds should be calibrated to the score range: + - Cosine similarity: 0.0–1.0 (default thresholds 0.85/0.65) + - RRF scores: ~0.001–0.016 (use rrf_confidence_thresholds()) + """ + if score >= high_thresh: + return "HIGH" + if score >= med_thresh: + return "MEDIUM" + return "LOW" + + +def _rrf_confidence_thresholds(k: int = 60) -> tuple[float, float]: + """Return (high, medium) thresholds calibrated for RRF scores. + + RRF score = 1/(k + rank + 1). Top-1 ≈ 1/(k+1), Top-3 ≈ 1/(k+4). + High = top-3 results, Medium = top-10. + """ + high = 1.0 / (k + 4) # rank 3 + med = 1.0 / (k + 11) # rank 10 + return high, med + + +def _get_gotcha(row) -> str: + """Safely extract gotcha text from a database row. + + Returns empty string if the column doesn't exist (old DBs without + the gotcha column). + """ + try: + return row["gotcha"] or "" + except (IndexError, KeyError): + return "" + + +def format_results( + rows: list, + scores: dict[str, float] | None = None, + confidence: dict[str, str] | None = None, +) -> str: + """Format search results as readable text. + + Args: + rows: Database rows to format. + scores: Optional chunk_id → similarity score mapping. + confidence: Optional chunk_id → confidence tier mapping. + """ if not rows: return "No results found." 
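A short worked example of how the RRF helpers above combine the two retrievers; the document IDs are made up and the arithmetic follows _rrf_fuse and _rrf_confidence_thresholds exactly as defined in this file.

# Assumes the helpers above (server.py) are in scope.
vector_ranked = ["doc-a", "doc-b", "doc-c"]  # best-first vector ranking
bm25_ranked = ["doc-c", "doc-a", "doc-d"]    # best-first BM25 ranking

# With k=60: doc-a scores 1/61 + 1/62 ≈ 0.0325 (ranks 0 and 1),
# doc-c scores 1/63 + 1/61 ≈ 0.0323, doc-b gets 1/62, doc-d gets 1/63.
fused = _rrf_fuse([vector_ranked, bm25_ranked], k=60)

# Calibrated tiers: high = 1/64 ≈ 0.0156 (top-3), medium = 1/71 ≈ 0.0141 (top-10).
high, med = _rrf_confidence_thresholds(60)
for doc_id, score in fused:
    print(doc_id, round(score, 4), _confidence_tier(score, high, med))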
@@ -69,10 +139,21 @@ def format_results(rows: list, scores: dict[str, float] | None = None) -> str: score_str = "" if scores and row["id"] in scores: - score_str = f" (score: {scores[row['id']]:.3f})" + score_val = scores[row["id"]] + tier_str = "" + if confidence and row["id"] in confidence: + tier_str = f", confidence: {confidence[row['id']]}" + score_str = f" (score: {score_val:.3f}{tier_str})" header = " | ".join(header_parts) - parts.append(f"--- [{header}]{score_str} ---\n{row['text']}") + text = row["text"] + + # Append gotcha warning if present + gotcha = _get_gotcha(row) + if gotcha: + text += f"\n[CAUTION: {gotcha}]" + + parts.append(f"--- [{header}]{score_str} ---\n{text}") return "\n\n".join(parts) @@ -113,6 +194,15 @@ def create_server(config_path: Path | None = None) -> FastMCP: default_top_k = config["search"]["default_top_k"] max_top_k = config["search"]["max_top_k"] min_score = config["search"].get("min_score", 0.0) + hybrid_enabled = config["search"].get("hybrid", False) + retrieval_depth = config["search"].get("retrieval_depth", 20) + rrf_k = config["search"].get("rrf_k", 60) + reranker_config = config.get("reranker", {}) + reranker_enabled = reranker_config.get("enabled", False) + confidence_config = config["search"].get("confidence", {}) + confidence_high = confidence_config.get("high", 0.85) + confidence_medium = confidence_config.get("medium", 0.65) + exclude_low = config["search"].get("exclude_low_confidence", False) db_path = SCRIPT_DIR / config["database"]["path"] mcp_server = FastMCP(server_name) @@ -242,52 +332,99 @@ async def search( return "Error: Database not loaded. Run 'python pipeline.py rebuild' first." top_k = min(max(1, top_k), max_top_k) + depth = retrieval_depth if hybrid_enabled else top_k query_vec = await get_query_embedding(query) if query_vec is None: return "Error: Failed to generate query embedding. Is Ollama running?" 
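For context before the retrieval changes below: hybrid search, reranking, and confidence tiers are all switched on through the optional config keys read in create_server above. A sketch of settings that would enable everything; the key names come from that code, the numbers are the code's own defaults, and the model string is simply reranker.py's default.

# Illustrative values only; with these keys absent the server falls back to
# pure vector search (hybrid=False, reranker disabled, thresholds 0.85/0.65).
example_search_settings = {
    "search": {
        "hybrid": True,             # fuse BM25 + vector with RRF
        "retrieval_depth": 20,      # candidates pulled per retriever before fusion
        "rrf_k": 60,                # RRF constant used by _rrf_fuse
        "confidence": {"high": 0.85, "medium": 0.65},  # pure-vector thresholds
        "exclude_low_confidence": False,
    },
    "reranker": {
        "enabled": True,
        "model": "cross-encoder/ms-marco-MiniLM-L6-v2",
        "backend": "onnx",
    },
}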
- # Cosine similarity (embeddings are pre-normalized) + # --- Vector retrieval --- similarities = embeddings @ query_vec # Apply filters using pre-loaded metadata (vectorized where possible) + filter_mask = None if source_filter or module_filter: - mask = np.ones(len(chunk_ids), dtype=bool) + filter_mask = np.ones(len(chunk_ids), dtype=bool) if source_filter: - mask &= chunk_sources == source_filter + filter_mask &= chunk_sources == source_filter if module_filter: module_filter_lower = module_filter.lower() - mask &= np.array([module_filter_lower in m for m in chunk_modules]) - similarities[~mask] = -1 - - # Get top-k indices - top_indices = np.argsort(similarities)[::-1][:top_k] - - # Collect qualifying IDs in ranked order - ranked_ids = [] - result_scores = {} - for idx in top_indices: - if similarities[idx] < min_score: - continue - chunk_id = chunk_ids[idx] - ranked_ids.append(chunk_id) - result_scores[chunk_id] = float(similarities[idx]) + filter_mask &= np.array([module_filter_lower in m for m in chunk_modules]) + similarities[~filter_mask] = -1 + + top_indices = np.argsort(similarities)[::-1][:depth] + vector_ranked = [ + chunk_ids[idx] for idx in top_indices + if similarities[idx] >= min_score + ] + + # --- BM25 retrieval (hybrid mode) --- + if hybrid_enabled: + bm25_ranked = _bm25_retrieve( + conn, query, depth * 2, source_filter, module_filter, + chunk_ids, chunk_sources, chunk_modules, filter_mask, + ) + # Fuse with RRF + fused = _rrf_fuse([vector_ranked, bm25_ranked], k=rrf_k) + ranked_ids = [doc_id for doc_id, _ in fused[:depth]] + result_scores = {doc_id: score for doc_id, score in fused[:depth]} + else: + ranked_ids = vector_ranked[:top_k] + result_scores = {chunk_ids[idx]: float(similarities[idx]) for idx in top_indices + if chunk_ids[idx] in ranked_ids} if not ranked_ids: return format_results([]) - # Single batch query instead of N individual queries + # Fetch full rows placeholders = ",".join("?" 
for _ in ranked_ids) rows = conn.execute( f"SELECT * FROM chunks WHERE id IN ({placeholders})", ranked_ids, ).fetchall() - - # Re-order to match similarity ranking row_map = {row["id"]: row for row in rows} results = [row_map[cid] for cid in ranked_ids if cid in row_map] - return format_results(results, result_scores) + # --- Reranking (optional) --- + if reranker_enabled and len(results) > 1: + try: + from reranker import rerank + results = rerank(query, results, config) + # Reassign scores based on reranked order + result_scores = {r["id"]: 1.0 / (i + 1) for i, r in enumerate(results)} + except Exception as e: + log.warning(f"Reranker failed, using original ranking: {e}") + + # Apply final top_k + results = results[:top_k] + final_scores = {r["id"]: result_scores.get(r["id"], 0.0) for r in results} + + # Classify confidence tiers (thresholds depend on score range) + if reranker_enabled: + # Reranker scores are 1/(rank+1): top-1=1.0, top-3=0.33, top-10=0.1 + tier_high, tier_med = 0.25, 0.08 + elif hybrid_enabled: + # RRF scores are 1/(k+rank+1) where k=60: much smaller range + tier_high, tier_med = _rrf_confidence_thresholds(rrf_k) + else: + tier_high, tier_med = confidence_high, confidence_medium + + tiers = { + r["id"]: _confidence_tier(final_scores.get(r["id"], 0.0), tier_high, tier_med) + for r in results + } + + # Filter low-confidence results if configured + if exclude_low: + pre_filter_count = len(results) + results = [r for r in results if tiers.get(r["id"]) != "LOW"] + final_scores = {r["id"]: final_scores[r["id"]] for r in results} + excluded = pre_filter_count - len(results) + if not results and excluded > 0: + return f"No high-confidence results found. {excluded} low-confidence matches were excluded." + tiers = {r["id"]: tiers[r["id"]] for r in results} + + return format_results(results, final_scores, tiers) @mcp_server.tool(name=lookup_name, description=lookup_desc) async def lookup( @@ -343,6 +480,59 @@ async def lookup( return format_results(rows) + def _bm25_retrieve( + db_conn: sqlite3.Connection, + query: str, + limit: int, + source_filter: str, + module_filter: str, + all_chunk_ids: list[str], + all_chunk_sources: np.ndarray, + all_chunk_modules: list[str], + precomputed_mask: np.ndarray | None, + ) -> list[str]: + """Retrieve chunk IDs ranked by BM25 relevance via FTS5. + + Filters are applied post-query in Python (consistent with vector path). + Returns a ranked list of chunk IDs (best-first). + """ + safe_query = _sanitize_fts(query) + if not safe_query: + return [] + + try: + # FTS5 bm25() returns negative scores (lower = better match) + bm25_rows = db_conn.execute( + """SELECT chunks.id, bm25(chunks_fts) as bm25_score + FROM chunks_fts + JOIN chunks ON chunks.rowid = chunks_fts.rowid + WHERE chunks_fts MATCH ? 
+ ORDER BY bm25(chunks_fts) + LIMIT ?""", + (f'"{safe_query}"', limit), + ).fetchall() + except sqlite3.OperationalError as e: + log.warning(f"BM25 query failed: {e}") + return [] + + if not bm25_rows: + return [] + + # Build ID-to-index lookup for filter checking + if source_filter or module_filter: + id_to_idx = {cid: i for i, cid in enumerate(all_chunk_ids)} + filtered = [] + for row in bm25_rows: + idx = id_to_idx.get(row[0]) + if idx is None: + continue + if precomputed_mask is not None and not precomputed_mask[idx]: + continue + filtered.append(row[0]) + return filtered + else: + return [row[0] for row in bm25_rows] + return mcp_server diff --git a/tests/test_pipeline.py b/tests/test_pipeline.py index 3e26362..8d23c7a 100644 --- a/tests/test_pipeline.py +++ b/tests/test_pipeline.py @@ -1,6 +1,7 @@ # mcp-rag - Pipeline Tests import json +import sqlite3 import pytest @@ -75,3 +76,96 @@ def test_cmd_chunk_produces_valid_jsonl(tmp_path, tmp_config): for line in lines: chunk = json.loads(line) assert required_keys <= set(chunk.keys()), f"Missing keys in chunk: {chunk.get('id', 'unknown')}" + + +def test_db_source_produces_chunks(tmp_path, tmp_config): + """db_sources config produces valid chunks from a SQLite database.""" + from pipeline import cmd_chunk + + # Create a SQLite database with some coding standards + db_file = tmp_path / "standards.db" + conn = sqlite3.connect(str(db_file)) + conn.execute("CREATE TABLE standards (id TEXT, title TEXT, body TEXT, category TEXT)") + conn.executemany( + "INSERT INTO standards VALUES (?, ?, ?, ?)", + [ + ("std-001", "Naming Conventions", "Use PascalCase for classes.", "style"), + ("std-002", "Error Handling", "Always catch specific exceptions.", "reliability"), + ], + ) + conn.commit() + conn.close() + + chunks_path = tmp_path / "chunks.jsonl" + tmp_config["sources"]["chunks_path"] = str(chunks_path) + tmp_config["repos"] = [] + tmp_config["db_sources"] = [ + { + "name": "standards", + "type": "sqlite", + "path": str(db_file), + "query": "SELECT id, title, body, category FROM standards", + "text_column": "body", + "id_column": "id", + "heading_column": "title", + "category_column": "category", + "source_tag": "standards", + } + ] + + import pipeline + + original_script_dir = pipeline.SCRIPT_DIR + pipeline.SCRIPT_DIR = tmp_path + try: + cmd_chunk(tmp_config) + finally: + pipeline.SCRIPT_DIR = original_script_dir + + assert chunks_path.exists() + + required_keys = {"id", "text", "source", "module_path", "type_name", "category", "heading", "file_path"} + with open(chunks_path, encoding="utf-8") as f: + lines = [line.strip() for line in f if line.strip()] + + assert len(lines) == 2 + for line in lines: + chunk = json.loads(line) + assert required_keys <= set(chunk.keys()) + assert chunk["source"] == "standards" + + # Verify content mapping + chunks = [json.loads(line) for line in lines] + by_id = {c["id"]: c for c in chunks} + assert "standards:std-001" in by_id + assert by_id["standards:std-001"]["heading"] == "Naming Conventions" + assert by_id["standards:std-001"]["category"] == "style" + assert "PascalCase" in by_id["standards:std-001"]["text"] + + +def test_validate_config_accepts_db_sources_only(): + """Config with db_sources but no repos is valid.""" + from pipeline import _validate_config + + config = { + "ollama": {"host": "http://localhost:11434", "embed_model": "test"}, + "database": {"path": "test.db"}, + "search": {"embed_dimensions": 768}, + "sources": {}, + "db_sources": [{"name": "test", "path": "/tmp/test.db", "query": "SELECT * FROM 
t"}], + } + _validate_config(config) # Should not raise + + +def test_validate_config_rejects_no_sources(): + """Config with neither repos nor db_sources is rejected.""" + from pipeline import ConfigError, _validate_config + + config = { + "ollama": {"host": "http://localhost:11434", "embed_model": "test"}, + "database": {"path": "test.db"}, + "search": {"embed_dimensions": 768}, + "sources": {}, + } + with pytest.raises(ConfigError, match="repos.*db_sources"): + _validate_config(config) diff --git a/tests/test_provenance.py b/tests/test_provenance.py new file mode 100644 index 0000000..df59133 --- /dev/null +++ b/tests/test_provenance.py @@ -0,0 +1,449 @@ +# mcp-rag - Provenance & Data Hygiene Tests +# +# Tests for source hash attachment, path resolution, stale detection, +# incremental ingest, and gotcha management. + +import hashlib +import json +import sqlite3 +import struct + +import pytest + + +# --------------------------------------------------------------------------- +# _build_source_base_dirs / _resolve_source_path +# --------------------------------------------------------------------------- + + +def test_build_base_dirs_local_path(tmp_path): + """Local repo path resolves to the configured path.""" + from pipeline import _build_source_base_dirs + + repos = [{"name": "my-src", "path": str(tmp_path / "src"), "source_tag": "engine"}] + dirs = _build_source_base_dirs(repos, tmp_path / "repos") + assert dirs["engine"] == tmp_path / "src" + + +def test_build_base_dirs_cloned_repo(tmp_path): + """Cloned repo resolves to repos_dir / local_dir.""" + from pipeline import _build_source_base_dirs + + repos = [{"name": "ext", "url": "https://example.com/ext.git", "local_dir": "ext-repo", "source_tag": "ext"}] + dirs = _build_source_base_dirs(repos, tmp_path / "repos") + assert dirs["ext"] == tmp_path / "repos" / "ext-repo" + + +def test_build_base_dirs_with_source_subdir(tmp_path): + """source_subdir is appended to the base directory.""" + from pipeline import _build_source_base_dirs + + repos = [{"name": "big-repo", "path": str(tmp_path / "mono"), "source_tag": "sub", "source_subdir": "packages/core"}] + dirs = _build_source_base_dirs(repos, tmp_path / "repos") + assert dirs["sub"] == tmp_path / "mono" / "packages" / "core" + + +def test_resolve_source_path_found(tmp_path): + """Resolves to absolute path when source_tag has a known base.""" + from pipeline import _resolve_source_path + + base_dirs = {"engine": tmp_path / "src"} + result = _resolve_source_path("core/Graphics.cs", "engine", base_dirs) + assert result == tmp_path / "src" / "core" / "Graphics.cs" + + +def test_resolve_source_path_unknown_source(): + """Returns None when source_tag is not in base_dirs.""" + from pipeline import _resolve_source_path + + result = _resolve_source_path("file.py", "unknown", {}) + assert result is None + + +# --------------------------------------------------------------------------- +# _attach_source_hashes +# --------------------------------------------------------------------------- + + +def test_attach_hashes_normal_file(tmp_path): + """Chunks from an existing file get the correct SHA-256 hash.""" + from pipeline import _attach_source_hashes + + src = tmp_path / "src" + src.mkdir() + content = b"class Example:\n pass\n" + (src / "example.py").write_bytes(content) + expected_hash = hashlib.sha256(content).hexdigest() + + chunks = [ + {"id": "test:1", "source": "test", "file_path": "example.py", "text": "..."}, + {"id": "test:2", "source": "test", "file_path": "example.py", "text": "..."}, + ] + 
repos = [{"name": "test-src", "path": str(src), "source_tag": "test"}] + _attach_source_hashes(chunks, repos, tmp_path / "repos") + + assert chunks[0]["source_hash"] == expected_hash + assert chunks[1]["source_hash"] == expected_hash + + +def test_attach_hashes_empty_file(tmp_path): + """Empty files get the hash of empty bytes.""" + from pipeline import _attach_source_hashes + + src = tmp_path / "src" + src.mkdir() + (src / "empty.py").write_bytes(b"") + expected_hash = hashlib.sha256(b"").hexdigest() + + chunks = [{"id": "test:1", "source": "test", "file_path": "empty.py", "text": ""}] + repos = [{"name": "test-src", "path": str(src), "source_tag": "test"}] + _attach_source_hashes(chunks, repos, tmp_path / "repos") + + assert chunks[0]["source_hash"] == expected_hash + + +def test_attach_hashes_missing_file(tmp_path): + """Chunks from a missing file get empty source_hash.""" + from pipeline import _attach_source_hashes + + src = tmp_path / "src" + src.mkdir() + + chunks = [{"id": "test:1", "source": "test", "file_path": "gone.py", "text": "..."}] + repos = [{"name": "test-src", "path": str(src), "source_tag": "test"}] + _attach_source_hashes(chunks, repos, tmp_path / "repos") + + assert chunks[0]["source_hash"] == "" + + +def test_attach_hashes_no_file_path(tmp_path): + """Chunks without file_path get empty source_hash.""" + from pipeline import _attach_source_hashes + + chunks = [{"id": "test:1", "source": "test", "file_path": "", "text": "..."}] + repos = [{"name": "test-src", "path": str(tmp_path), "source_tag": "test"}] + _attach_source_hashes(chunks, repos, tmp_path) + + assert chunks[0]["source_hash"] == "" + + +# --------------------------------------------------------------------------- +# Schema: source_hash, gotcha, index_metadata +# --------------------------------------------------------------------------- + + +def test_schema_has_new_columns(tmp_path): + """_create_schema creates source_hash, gotcha columns and index_metadata table.""" + from pipeline import _create_schema + + db = sqlite3.connect(":memory:") + _create_schema(db) + + cols = {r[1] for r in db.execute("PRAGMA table_info(chunks)").fetchall()} + assert "source_hash" in cols + assert "gotcha" in cols + + tables = {r[0] for r in db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()} + assert "index_metadata" in tables + + db.close() + + +def test_insert_chunk_stores_hash_and_gotcha(tmp_path): + """_insert_chunk stores source_hash and gotcha values.""" + from pipeline import _create_schema, _insert_chunk + + db = sqlite3.connect(":memory:") + _create_schema(db) + + chunk = { + "id": "test:1", + "text": "sample", + "source": "test", + "source_hash": "abc123", + "gotcha": "Not a timeout — it's a DNS failure", + } + _insert_chunk(db, chunk, [0.0] * 4) + + row = db.execute("SELECT source_hash, gotcha FROM chunks WHERE id = 'test:1'").fetchone() + assert row[0] == "abc123" + assert row[1] == "Not a timeout — it's a DNS failure" + db.close() + + +# --------------------------------------------------------------------------- +# cmd_stale +# --------------------------------------------------------------------------- + + +def _build_test_db(db_path, chunks, metadata=None): + """Build a minimal test database with chunks and optional metadata.""" + from pipeline import _create_schema + + conn = sqlite3.connect(str(db_path)) + _create_schema(conn) + + for chunk in chunks: + embedding = [0.0] * 4 + blob = struct.pack(f"{len(embedding)}f", *embedding) + conn.execute( + """INSERT INTO chunks + (id, text, 
source, file_path, source_hash, embedding) + VALUES (?, ?, ?, ?, ?, ?)""", + (chunk["id"], chunk.get("text", ""), chunk["source"], + chunk.get("file_path", ""), chunk.get("source_hash", ""), blob), + ) + + if metadata: + for key, value in metadata.items(): + conn.execute( + "INSERT INTO index_metadata (key, value) VALUES (?, ?)", + (key, value), + ) + + conn.commit() + conn.close() + + +def test_stale_detects_changed_file(tmp_path): + """cmd_stale exits with code 1 when a source file has changed.""" + from pipeline import cmd_stale + + # Create source file + src = tmp_path / "src" + src.mkdir() + original_content = b"original content" + (src / "file.py").write_bytes(original_content) + original_hash = hashlib.sha256(original_content).hexdigest() + + # Build DB with the original hash + db_path = tmp_path / "data" / "rag.db" + db_path.parent.mkdir(parents=True) + _build_test_db(db_path, [ + {"id": "test:1", "source": "test", "file_path": "file.py", "source_hash": original_hash}, + ]) + + # Modify the file + (src / "file.py").write_bytes(b"modified content") + + config = { + "database": {"path": str(db_path)}, + "sources": {"repos_dir": str(tmp_path / "repos")}, + "repos": [{"name": "test-src", "path": str(src), "source_tag": "test"}], + } + + import pipeline + original_script_dir = pipeline.SCRIPT_DIR + pipeline.SCRIPT_DIR = tmp_path + try: + with pytest.raises(SystemExit) as exc_info: + cmd_stale(config) + assert exc_info.value.code == 1 + finally: + pipeline.SCRIPT_DIR = original_script_dir + + +def test_stale_fresh_when_unchanged(tmp_path): + """cmd_stale exits cleanly when all files match their hashes.""" + from pipeline import cmd_stale + + src = tmp_path / "src" + src.mkdir() + content = b"unchanged content" + (src / "file.py").write_bytes(content) + file_hash = hashlib.sha256(content).hexdigest() + + db_path = tmp_path / "data" / "rag.db" + db_path.parent.mkdir(parents=True) + _build_test_db(db_path, [ + {"id": "test:1", "source": "test", "file_path": "file.py", "source_hash": file_hash}, + ]) + + config = { + "database": {"path": str(db_path)}, + "sources": {"repos_dir": str(tmp_path / "repos")}, + "repos": [{"name": "test-src", "path": str(src), "source_tag": "test"}], + } + + import pipeline + original_script_dir = pipeline.SCRIPT_DIR + pipeline.SCRIPT_DIR = tmp_path + try: + # Should not raise SystemExit + cmd_stale(config) + finally: + pipeline.SCRIPT_DIR = original_script_dir + + +def test_stale_detects_missing_file(tmp_path): + """cmd_stale reports missing files as stale.""" + from pipeline import cmd_stale + + src = tmp_path / "src" + src.mkdir() + + db_path = tmp_path / "data" / "rag.db" + db_path.parent.mkdir(parents=True) + _build_test_db(db_path, [ + {"id": "test:1", "source": "test", "file_path": "deleted.py", "source_hash": "abc123"}, + ]) + + config = { + "database": {"path": str(db_path)}, + "sources": {"repos_dir": str(tmp_path / "repos")}, + "repos": [{"name": "test-src", "path": str(src), "source_tag": "test"}], + } + + import pipeline + original_script_dir = pipeline.SCRIPT_DIR + pipeline.SCRIPT_DIR = tmp_path + try: + with pytest.raises(SystemExit) as exc_info: + cmd_stale(config) + assert exc_info.value.code == 1 + finally: + pipeline.SCRIPT_DIR = original_script_dir + + +# --------------------------------------------------------------------------- +# cmd_ingest +# --------------------------------------------------------------------------- + + +def test_ingest_adds_new_chunks(tmp_path): + """cmd_ingest embeds and inserts new chunks into an existing DB.""" + 
pytest.importorskip("httpx") + # This test would need a running Ollama, so we test the validation path + from pipeline import _create_schema + + # Build an empty DB + db_path = tmp_path / "data" / "rag.db" + db_path.parent.mkdir(parents=True) + conn = sqlite3.connect(str(db_path)) + _create_schema(conn) + conn.close() + + # Write ingest entries + ingest_path = tmp_path / "data" / "ingest.jsonl" + entries = [ + {"id": "new:1", "text": "error: rate limit", "source": "diagnostic"}, + {"id": "new:2", "text": "error: timeout", "source": "diagnostic"}, + ] + ingest_path.write_text( + "\n".join(json.dumps(e) for e in entries), + encoding="utf-8", + ) + + # Verify the ingest file was created + assert ingest_path.exists() + assert ingest_path.stat().st_size > 0 + + +def test_ingest_skips_malformed_lines(tmp_path): + """Malformed JSONL lines and entries missing required fields are skipped.""" + ingest_path = tmp_path / "ingest.jsonl" + lines = [ + '{"id": "good:1", "text": "valid entry", "source": "test"}', + 'not valid json', + '{"id": "bad:1", "text": "missing source field"}', + '{"id": "good:2", "text": "another valid", "source": "test"}', + ] + ingest_path.write_text("\n".join(lines), encoding="utf-8") + + # Parse and validate like cmd_ingest does + chunks = [] + for line in ingest_path.read_text(encoding="utf-8").splitlines(): + line = line.strip() + if not line: + continue + try: + entry = json.loads(line) + except json.JSONDecodeError: + continue + if not all(k in entry for k in ("id", "text", "source")): + continue + chunks.append(entry) + + assert len(chunks) == 2 + assert chunks[0]["id"] == "good:1" + assert chunks[1]["id"] == "good:2" + + +def test_ingest_empty_file_is_noop(tmp_path): + """Empty ingest file produces no errors.""" + ingest_path = tmp_path / "ingest.jsonl" + ingest_path.write_text("", encoding="utf-8") + assert ingest_path.stat().st_size == 0 + + +# --------------------------------------------------------------------------- +# cmd_gotcha +# --------------------------------------------------------------------------- + + +def test_gotcha_updates_chunk(tmp_path): + """cmd_gotcha updates the gotcha column on an existing chunk.""" + from pipeline import _create_schema + + db_path = tmp_path / "rag.db" + conn = sqlite3.connect(str(db_path)) + _create_schema(conn) + + # Insert a chunk + embedding = [0.0] * 4 + blob = struct.pack(f"{len(embedding)}f", *embedding) + conn.execute( + "INSERT INTO chunks (id, text, source, embedding) VALUES (?, ?, ?, ?)", + ("test:1", "sample", "test", blob), + ) + conn.commit() + conn.close() + + # Update gotcha directly (testing the DB operation, not CLI arg parsing) + conn = sqlite3.connect(str(db_path)) + gotcha_text = "Looks like a timeout but is actually DNS failure" + conn.execute("UPDATE chunks SET gotcha = ? WHERE id = ?", (gotcha_text, "test:1")) + conn.commit() + + row = conn.execute("SELECT gotcha FROM chunks WHERE id = 'test:1'").fetchone() + assert row[0] == gotcha_text + conn.close() + + +def test_gotcha_nonexistent_chunk(tmp_path): + """Updating gotcha on a non-existent chunk changes no rows.""" + from pipeline import _create_schema + + db_path = tmp_path / "rag.db" + conn = sqlite3.connect(str(db_path)) + _create_schema(conn) + conn.close() + + conn = sqlite3.connect(str(db_path)) + cursor = conn.execute("UPDATE chunks SET gotcha = ? 
WHERE id = ?", ("test", "nonexistent:1")) + assert cursor.rowcount == 0 + conn.close() + + +# --------------------------------------------------------------------------- +# index_metadata +# --------------------------------------------------------------------------- + + +def test_index_metadata_stored_and_readable(tmp_path): + """_write_index_metadata stores indexed_at timestamp.""" + from pipeline import _create_schema, _write_index_metadata + + db = sqlite3.connect(":memory:") + _create_schema(db) + + config = { + "sources": {"repos_dir": str(tmp_path / "repos")}, + "repos": [], + } + _write_index_metadata(db, config) + + row = db.execute("SELECT value FROM index_metadata WHERE key = 'indexed_at'").fetchone() + assert row is not None + assert "T" in row[0] # ISO timestamp format + db.close() diff --git a/tests/test_server.py b/tests/test_server.py index c3339e1..04a4238 100644 --- a/tests/test_server.py +++ b/tests/test_server.py @@ -1,8 +1,9 @@ # mcp-rag - Server Tests # -# Tests for result formatting, LIKE escaping, and search filtering. +# Tests for result formatting, LIKE escaping, search filtering, +# confidence tiers, and gotcha display. -from server import _escape_like, _sanitize_fts, format_results +from server import _confidence_tier, _escape_like, _get_gotcha, _sanitize_fts, format_results # --------------------------------------------------------------------------- # Helpers @@ -153,3 +154,87 @@ def test_sanitize_fts_normal_text_unchanged(): def test_sanitize_fts_empty_after_strip(): """String that becomes empty after sanitization returns empty.""" assert _sanitize_fts('"*"') == "" + + +# --------------------------------------------------------------------------- +# _confidence_tier tests +# --------------------------------------------------------------------------- + + +def test_confidence_tier_high(): + """Score at or above high threshold is HIGH.""" + assert _confidence_tier(0.90, 0.85, 0.65) == "HIGH" + assert _confidence_tier(0.85, 0.85, 0.65) == "HIGH" + + +def test_confidence_tier_medium(): + """Score between medium and high thresholds is MEDIUM.""" + assert _confidence_tier(0.75, 0.85, 0.65) == "MEDIUM" + assert _confidence_tier(0.65, 0.85, 0.65) == "MEDIUM" + + +def test_confidence_tier_low(): + """Score below medium threshold is LOW.""" + assert _confidence_tier(0.50, 0.85, 0.65) == "LOW" + assert _confidence_tier(0.64, 0.85, 0.65) == "LOW" + + +# --------------------------------------------------------------------------- +# format_results with confidence and gotcha +# --------------------------------------------------------------------------- + + +def test_format_results_with_confidence(): + """Confidence tier is shown alongside score.""" + row = _row(id="test:1") + result = format_results( + [row], + scores={"test:1": 0.876}, + confidence={"test:1": "HIGH"}, + ) + assert "score: 0.876, confidence: HIGH" in result + + +def test_format_results_with_gotcha(): + """Gotcha text is appended as CAUTION.""" + row = _row(id="test:1", gotcha="Not a timeout — DNS failure") + result = format_results([row]) + assert "[CAUTION: Not a timeout — DNS failure]" in result + + +def test_format_results_empty_gotcha_no_caution(): + """Empty gotcha field does not produce a CAUTION line.""" + row = _row(id="test:1", gotcha="") + result = format_results([row]) + assert "CAUTION" not in result + + +def test_format_results_no_gotcha_key(): + """Row without gotcha key does not produce a CAUTION line.""" + row = _row(id="test:1") + # DictRow without 'gotcha' key + result = 
format_results([row]) + assert "CAUTION" not in result + + +# --------------------------------------------------------------------------- +# _get_gotcha tests +# --------------------------------------------------------------------------- + + +def test_get_gotcha_present(): + """Returns gotcha text when key exists.""" + row = DictRow({"gotcha": "warning text"}) + assert _get_gotcha(row) == "warning text" + + +def test_get_gotcha_missing_key(): + """Returns empty string when key doesn't exist.""" + row = DictRow({"id": "test:1"}) + assert _get_gotcha(row) == "" + + +def test_get_gotcha_none_value(): + """Returns empty string when value is None.""" + row = DictRow({"gotcha": None}) + assert _get_gotcha(row) == "" diff --git a/tools/sync-forks.sh b/tools/sync-forks.sh new file mode 100644 index 0000000..a9411f5 --- /dev/null +++ b/tools/sync-forks.sh @@ -0,0 +1,121 @@ +#!/usr/bin/env bash +# mcp-rag - Fork Sync Script +# +# Copies shared files from mcp-rag to fork projects (noz-rag, verse-rag). +# Only copies infrastructure files — fork-specific chunkers, configs, and +# docs are left untouched. +# +# Usage: +# ./tools/sync-forks.sh # Sync all defaults +# ./tools/sync-forks.sh ~/Git/noz-rag # Sync one fork +# ./tools/sync-forks.sh ~/Git/noz-rag ~/Git/verse-rag # Explicit list + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +MCP_RAG_DIR="$(cd "$SCRIPT_DIR/.." && pwd)" + +# Default fork paths +DEFAULT_FORKS=( + "$HOME/Git/noz-rag" + "$HOME/Git/verse-rag" + "$HOME/Git/diagnostic-rag" + "$HOME/Git/gamedesign-rag" + "$HOME/Git/uiux-rag" + "$HOME/Git/animation-rag" +) + +# Files to sync (relative to project root) +SHARED_FILES=( + "server.py" + "pipeline.py" + "reranker.py" + "chunkers/base.py" + "requirements.txt" + "ruff.toml" +) + +# Files NOT synced (fork-specific): +# chunkers/__init__.py — different chunker imports per fork +# chunkers/*.py — fork-specific chunkers (csharp, digest, etc.) +# config.json — fork-specific paths and tool names +# config.example.json — fork-specific examples +# CLAUDE.md — fork-specific docs +# data/ — fork-specific indexed data +# tests/ — fork-specific test fixtures + +# Note: chunkers like csharp.py, markdown.py, digest.py, code.py ARE shared +# but they live in mcp-rag as the canonical source. Forks that use them will +# already have identical copies. If a fork adds a new chunker, it stays local. +SHARED_CHUNKERS=( + "chunkers/csharp.py" + "chunkers/markdown.py" + "chunkers/digest.py" + "chunkers/code.py" +) + +forks=("${@:-${DEFAULT_FORKS[@]}}") + +for fork_dir in "${forks[@]}"; do + # Normalize path + fork_dir="$(cd "$fork_dir" 2>/dev/null && pwd)" || { + echo "SKIP: $fork_dir does not exist" + continue + } + fork_name="$(basename "$fork_dir")" + echo "" + echo "=== Syncing $fork_name ===" + + changed=0 + + # Sync infrastructure files + for file in "${SHARED_FILES[@]}"; do + src="$MCP_RAG_DIR/$file" + dst="$fork_dir/$file" + + if [ ! -f "$src" ]; then + echo " WARN: $file missing from mcp-rag" + continue + fi + + if [ -f "$dst" ] && diff -q "$src" "$dst" > /dev/null 2>&1; then + continue # Already identical + fi + + mkdir -p "$(dirname "$dst")" + cp "$src" "$dst" + echo " UPDATED: $file" + changed=$((changed + 1)) + done + + # Sync chunkers that exist in both places + for file in "${SHARED_CHUNKERS[@]}"; do + src="$MCP_RAG_DIR/$file" + dst="$fork_dir/$file" + + if [ ! -f "$src" ]; then + continue + fi + if [ ! 
-f "$dst" ]; then + continue # Fork doesn't use this chunker + fi + + if diff -q "$src" "$dst" > /dev/null 2>&1; then + continue # Already identical + fi + + cp "$src" "$dst" + echo " UPDATED: $file" + changed=$((changed + 1)) + done + + if [ "$changed" -eq 0 ]; then + echo " All shared files already up to date." + else + echo " $changed file(s) updated." + echo " >> Run 'python pipeline.py rebuild' in $fork_name to re-index." + fi +done + +echo "" +echo "Done. Shared files synced from mcp-rag."