Add context_attribution intrinsics by dennislwei · Pull Request #135 · ibm-granite/granite-common

dennislwei · 2026-03-04T05:51:39Z

Closes #119

Summary

Add input/output processing and test data for a new context attribution intrinsic with two variants:

- context_attribution_all: attributes each response sentence to context sentences, where context includes documents and previous conversation turns
- context_attribution_single: attributes a single selected response sentence to context sentences

Changes

- src/granite_common/intrinsics/input.py:
  - Added ability to mark sentence boundaries in all_but_last_message (all conversation turns except the last assistant response), with a shared sentence index across documents and all but the last message
- src/granite_common/intrinsics/output.py:
  - Added all_but_last_message as a source for DecodeSentences, with a message_index output field
  - Extended DecodeSentences to accept a list of sources (e.g. ["documents", "all_but_last_message"]) processed in order
  - Added a WrapInList transformation rule
- Test data files for context attribution:
  - input_yaml: local YAML files for now
  - input_json
  - input_args: for context_attribution_single only
  - test_canned_input
  - test_canned_output

This adds a new test input file for context attribution testing. The file contains a multi-turn conversation about the Plastic Ono Band with 3 documents and properly formatted messages following the citations.json format. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

This adds support for marking sentence boundaries in all messages except the last one, enabling unified context attribution across both documents and conversation history.

Key changes:
- Updated mark_sentence_boundaries() to accept a starting index parameter and return both the rewritten texts and the next available index
- Added "all_but_last_message" as a valid sentence_boundaries location
- Modified _mark_sentence_boundaries() to track sentence indices across documents and all_but_last_message cases, allowing continuous numbering
- The last_message case remains independent with its own numbering

This enables context attribution where documents and conversation history share the same tag prefix (e.g., "c") with continuous numbering, while the response to be attributed uses a separate tag prefix (e.g., "r").

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>
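The continuous numbering described in this commit can be sketched as follows. This is a minimal illustration, not the library's implementation: tag_sentences() is a hypothetical helper, the regex splitter is a naive stand-in for real sentence segmentation, and the exact tag layout ("<c0> " before each sentence) is an assumption.

```python
import re


def tag_sentences(texts, tag, start_index=0):
    """Hypothetical helper: prefix each sentence with <tagN>.

    Returns (rewritten_texts, next_index) so a later call can continue
    numbering where this one stopped.
    """
    rewritten = []
    index = start_index
    for text in texts:
        # Naive splitter for illustration: break after ., ! or ? plus whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        tagged = []
        for sentence in sentences:
            tagged.append(f"<{tag}{index}> {sentence}")
            index += 1
        rewritten.append(" ".join(tagged))
    return rewritten, index


documents = ["Doc one. It has two sentences.", "Doc two."]
history = ["First question?", "First answer."]
response = ["Claim one. Claim two."]

# Documents and conversation history share the "c" prefix and one counter.
tagged_docs, next_index = tag_sentences(documents, "c")
tagged_history, next_index = tag_sentences(history, "c", start_index=next_index)
# The response to be attributed uses its own prefix and its own counter.
tagged_response, _ = tag_sentences(response, "r")

print(tagged_docs)
print(tagged_history)
print(tagged_response)
```

Returning the next free index alongside the rewritten texts lets the caller chain any number of context sources without the helper needing to know about them.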

This configuration enables context attribution by tagging both documents and conversation history with the same prefix ("c"), allowing the model to attribute response sentences to any part of the context.

Key configuration:
- sentence_boundaries uses "c" for both documents and all_but_last_message
- Uses "r" for last_message (the response to be attributed)
- Updated instruction to reflect that context includes both documents and conversation turns
- Uses the same response format and transformations as citations (except for merge_spans)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

Escape curly braces in JSON example to prevent Python format() from interpreting {"r" and "c" as format placeholders. The doubled braces {{}} will render as single braces {} in the final instruction string. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>
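The escaping issue fixed by this commit can be reproduced with plain str.format(). The template text below is hypothetical and only illustrates the mechanism: doubled braces survive formatting as literal braces, while single braces around JSON are parsed as replacement fields.

```python
# Hypothetical instruction template, for illustration only.
template = (
    'Answer in JSON, for example [{{"r": 0, "c": [1, 2]}}], '
    "for response sentence <r{response_sent_id}>."
)
# {{ and }} render as literal { and }; {response_sent_id} is substituted.
rendered = template.format(response_sent_id=0)
print(rendered)

# Without escaping, format() tries to treat the JSON object as a placeholder.
unescaped = 'Answer in JSON, for example [{"r": 0, "c": [1, 2]}]'
error = None
try:
    unescaped.format()
except (KeyError, IndexError, ValueError) as exc:
    error = exc
print(type(error).__name__)
```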

This adds support for decoding sentence references from conversation history (all messages except the last) in the DecodeSentences output transformation rule.

Key changes:
- source parameter now accepts a string or list of strings, normalized to a list internally, allowing multiple sources to be processed in order
- Added "all_but_last_message" as a valid source alongside "last_message" and "documents"
- _prepare() now loops over sources, maintaining a running sentence number across all of them, building unified lookup lists for begins/ends/texts
- Added message_indices list (parallel to document_ids) populated with the 0-based message index for conversation sentences and None for others
- Added message_index to output_names, gated on "all_but_last_message" being present in source
- Backwards compatible: existing YAML configs with source as a string continue to work unchanged

This enables context attribution where documents and conversation history share the same tag prefix (e.g., "c") with continuous numbering, decoded by a single DecodeSentences rule with source: ["documents", "all_but_last_message"].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>
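A rough sketch of the multi-source lookup described above, under several assumptions: normalize_sources() and build_lookup() are hypothetical names, the input shapes (pre-split sentence lists) are simplified, begin/end offsets are omitted, and storing None as the document id of a conversation sentence is a guess rather than confirmed behaviour.

```python
def normalize_sources(source):
    """Accept a single source name or a list of names; always return a list."""
    return [source] if isinstance(source, str) else list(source)


def build_lookup(source, documents, messages):
    """Build parallel lists indexed by a running sentence number.

    documents: list of {"doc_id": ..., "sentences": [...]} (assumed shape)
    messages:  list of sentence lists, one per turn, last turn included
    """
    texts, document_ids, message_indices = [], [], []
    for name in normalize_sources(source):
        if name == "documents":
            for doc in documents:
                for sentence in doc["sentences"]:
                    texts.append(sentence)
                    document_ids.append(doc["doc_id"])
                    message_indices.append(None)  # not a conversation sentence
        elif name == "all_but_last_message":
            for message_index, sentences in enumerate(messages[:-1]):
                for sentence in sentences:
                    texts.append(sentence)
                    document_ids.append(None)  # assumption: no document id here
                    message_indices.append(message_index)
        else:
            raise ValueError(f"Unsupported source in this sketch: {name!r}")
    return texts, document_ids, message_indices


documents = [{"doc_id": "d0", "sentences": ["A.", "B."]}]
messages = [["Q1?"], ["A1.", "A2."], ["Q2?"], ["Final answer."]]

texts, doc_ids, msg_idx = build_lookup(
    ["documents", "all_but_last_message"], documents, messages
)
print(texts)
print(doc_ids)
print(msg_idx)

# A plain string still works, which mirrors the backwards-compatible behaviour.
texts_docs_only, _, _ = build_lookup("documents", documents, messages)
```

With the lists built this way, a sentence number emitted by the model (say 3) indexes directly into all three lists, giving the sentence text and the conversation turn it came from.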

Update the decode_sentences transformation for context sentences to use the new list-based source feature, combining documents and conversation history into a single decoding pass.

Key changes:
- source for the context decode_sentences rule changed from "documents" to ["documents", "all_but_last_message"], processing both in order with continuous sentence numbering
- Added message_index: "citation_msg_index" to output_names to track which conversation turn each cited sentence came from (null for document sentences)
- Added "citation_msg_index" to the project retained_fields

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

…lwei/granite-common into feat/issue119-context-attribution

Rename context_attribution.yaml to context_attribution_all.yaml and copy context_attribution.json to _all and _single variants to support both the all-sentences and single-sentence attribution configs.

Key changes:
- Renamed input_yaml/context_attribution.yaml to context_attribution_all.yaml
- Renamed input_json/context_attribution.json to context_attribution_all.json
- Added input_json/context_attribution_single.json (same content as _all for now)
- Added input_yaml/context_attribution_single.yaml with:
  - response_format for a flat array of non-negative integers
  - Nest transformation to wrap the array into {"c": [...]}
  - No decode_sentences for last_message
  - instruction_single asking about a single response sentence <r{response_sent_id}>
- Added input_args/context_attribution_single.json with response_sent_id: 0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

Add a new WrapInList rule that wraps a matched value in a single-element list, enabling subsequent rules (like explode) that expect a list of records to operate on a single record.

Key changes:
- Added WrapInList(InPlaceTransformation) with YAML name "wrap_in_list"
- Registered WrapInList in ALL_RULES
- Used in context_attribution_single.yaml after nest to convert {"c": [0, 2]} into [{"c": [0, 2]}] before explode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>
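To illustrate why the wrapping step is needed, here is a small stand-alone sketch. Both functions are hypothetical stand-ins rather than the library's rule classes, and the explode semantics shown (one output record per list element) are an assumption made for the example.

```python
def wrap_in_list(value):
    """Wrap any value in a single-element list."""
    return [value]


def explode(records, field):
    """Hypothetical explode step: emit one record per element of `field`.

    Expects a list of records, which is why a lone record must be wrapped first.
    """
    exploded = []
    for record in records:
        for element in record[field]:
            exploded.append({**record, field: element})
    return exploded


nested = {"c": [0, 2]}           # result of the nest step in the commit message
wrapped = wrap_in_list(nested)   # [{"c": [0, 2]}]
rows = explode(wrapped, "c")     # one record per attributed context sentence

print(wrapped)
print(rows)
```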

Generated by running IntrinsicsRewriter.transform() on the respective input_json and input_yaml files.

The rewritten chat completions have:
- Documents tagged <c0>-<c69>, conversation history tagged <c70>-<c72>, and the last assistant message tagged <r0>, <r1>, etc.
- The attribution instruction appended as a 5th user message

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

task=None for now as there is no adapter on HF Hub yet. This enables test_canned_input but excludes inference tests and test_canned_output. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

Signed-off-by: Dennis Wei <dwei@us.ibm.com>

…lwei/granite-common into feat/issue119-context-attribution

…ext_attribution_single

Rename test data files from context_attribution_all to context-attribution (with hyphen) to match the task name convention. Stop tracking all context_attribution_all and context_attribution_single files, keeping them on disk for local development.

Key changes:
- Renamed input_json, test_canned_input, test_canned_output model_output and expected_result files from context_attribution_all to context-attribution
- Untracked context_attribution_all.yaml, context_attribution_single.* and context_attribution_all.json files
- Updated test_rag_intrinsics_lib.py: replaced context_attribution_all and context_attribution_single YamlJsonCombos with a single context-attribution entry with task="context-attribution"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

- Add repo_id="ibm-granite/granitelib-core-r1.0" to the context-attribution YamlJsonCombo entry in test_rag_intrinsics_lib.py so tests download the YAML from the Hugging Face Hub rather than using a local file.
- Regenerate test_canned_output/expected_result/context-attribution.json using the updated YAML (which now uses attribution_* field names) and the corrected _desplit_sentences() offset calculation merged from main.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

Ollama LoRA adapter not yet available on HF Hub. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

dennislwei · 2026-03-13T18:18:21Z

Documenting some updates:

- context_attribution_single: It was decided that this variant will not be released, so I removed everything associated with it (files, test entries).
- context_attribution_all: Renamed to context-attribution (with a hyphen instead of an underscore).
- input_yaml: No longer needed because the YAML file is now on HF Hub.
- input_args: No longer needed because context_attribution_single has been removed.
- test_run_transformers: Added.
- test_run_ollama: Added, but context-attribution is excluded from this test because there is no Ollama LoRA adapter yet.

I will now make a PR to Mellea similar to this one.

dennislwei and others added 20 commits February 13, 2026 11:07

Merge branch 'feat/issue119-context-attribution' of github.com:dennis…

22a82ab

…lwei/granite-common into feat/issue119-context-attribution

Add test_canned_output for context_attribution_all and _single

08c2939

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

Replace "citation" with "attribution" in context_attribution_*.yaml

e12aec3

Signed-off-by: Dennis Wei <dwei@us.ibm.com>

Merge branch 'feat/issue119-context-attribution' of github.com:dennis…

22d3662

…lwei/granite-common into feat/issue119-context-attribution

Merge branch 'main' into feat/issue119-context-attribution

dbbd733

Add test_run_transformers expected output for context-attribution

8d2c156

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

Add test_run_ollama expected output for context-attribution

71f8c46

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

Exclude context-attribution from test_run_ollama

13e7a21

Ollama LoRA adapter not yet available on HF Hub. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dennis Wei <dwei@us.ibm.com>

dennislwei wants to merge 20 commits into ibm-granite:main from dennislwei:feat/issue119-context-attribution
