AMIRA Text Replay Coder is a set of Python scripts that generates a sample.xlsx composed of 20-second clips from AMIRA micro-intervention and timing logs. Coders use a Tkinter-based UI to label the clips, and annotations are saved to CSV or XLSX.

pcla-code/AMIRA-Text-Replay-Coder

📟 AMIRA Text Replay Coder

A small, self‑contained set of Python scripts to sample 20s audio/text clips from AMIRA logs and code them with a lightweight desktop GUI. The workflow is two‑step:

  1. Sampler (0_RunSamplerNew.py) builds sample.xlsx from your AMIRA logs (micro‑intervention + reading/timing).
  2. Text Replay Coder (1_RunTextReplayCoderOct1.py) loads sample.xlsx and lets a coder label each clip; results are saved to a CSV per coder.

📁 Contents

.
├─ 0_RunSamplerNew.py                    # CLI script: builds sample.xlsx from AMIRA logs
├─ 1_RunTextReplayCoderOct1.py          # GUI tool (Tkinter): code the sampled clips
├─ AMIRA_microinterventionlogs.csv      # INPUT: micro-intervention logs (CSV)
├─ AMIRA_readingandtiminglogs.xlsx      # INPUT: reading & timing logs (XLSX preferred)
├─ coder_Neithan_annotations.csv        # EXAMPLE: output from a prior coding run
└─ sample.xlsx                          # EXAMPLE: sampler output to be coded

⚠️ Important: Use XLSX for timing data. XLSX preserves special characters (odd pronunciations, diacritics) that can be lost when a CSV is written or read with a mismatched encoding.

⚠️ Important: The intervention logs and the reading and timing logs are not included in this repository. Please consult AMIRA regarding access to these files.


🔎 What the sampler produces

The sampler aligns micro‑interventions with timing rows per activity_id, adjusts timing boundaries across phrases/words, and slices sessions into fixed‑length clip windows (default 20s). For each chosen clip, it exports:

  • Word rows with story word, student speech (Kaldi), optional W2V, adjusted start/end times, and correctness labels.
  • Phrase‑level intervention rows (if present).
  • A deterministic order (unless you opt into random order).

The output is a single Excel file (default sample.xlsx) with a sample sheet.
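
The windowing step can be pictured with a minimal sketch. This is not the sampler's actual code: the function name `assign_clips`, the row keys `start`, `end`, and `phrase_index`, and the exact meaning given to each criterion are assumptions made for illustration. Here `phrase_start` places a whole phrase in the window where its first word starts, and `word_overlap` places each word in every window it overlaps.

```python
from collections import defaultdict


def assign_clips(words, clip_seconds=20, criterion="phrase_start"):
    """Group word rows into fixed-length windows (hypothetical helper).

    words: list of dicts with assumed keys 'phrase_index', 'start', 'end'
           (adjusted times in seconds).
    Returns {clip_index: [word_row, ...]}.
    """
    clips = defaultdict(list)
    if criterion == "phrase_start":
        # Earliest start time seen for each phrase.
        phrase_start = {}
        for w in words:
            p = w["phrase_index"]
            phrase_start[p] = min(phrase_start.get(p, w["start"]), w["start"])
        # The whole phrase goes to the window containing that start time.
        for w in words:
            clip = int(phrase_start[w["phrase_index"]] // clip_seconds)
            clips[clip].append(w)
    elif criterion == "word_overlap":
        for w in words:
            first = int(w["start"] // clip_seconds)
            # A word ending exactly on a boundary stays in the earlier window.
            last = int(max(w["start"], w["end"] - 1e-9) // clip_seconds)
            for clip in range(first, last + 1):
                clips[clip].append(w)
    else:
        raise ValueError(f"unknown criterion: {criterion!r}")
    return dict(clips)
```

With 20-second windows, a word spanning 19.5s to 20.5s lands in window 0 only under `phrase_start` (if its phrase began there) but in both windows 0 and 1 under `word_overlap`.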


⚙️ Requirements

  • Python ≥ 3.9 (3.10–3.12 tested)
  • Pip packages:
    pip install pandas openpyxl
  • Tkinter for the GUI:
    • Windows/macOS: usually included with Python.org installers.
    • Linux: you may need to install system packages, e.g. on Debian/Ubuntu:
      sudo apt-get update && sudo apt-get install -y python3-tk

If you see ModuleNotFoundError: No module named 'tkinter', install your OS’s Tk bindings as above.
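
A quick way to confirm the environment before running either script is to try importing each dependency. The helper below is an illustrative sketch, not part of the repository; the name `missing_modules` is hypothetical.

```python
import importlib


def missing_modules(names=("pandas", "openpyxl", "tkinter")):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:  # ModuleNotFoundError is a subclass of ImportError
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```

If `tkinter` is reported missing on Linux, installing `python3-tk` as shown above is the usual fix.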


📑 Input files (place in repo root)

  • AMIRA_microinterventionlogs.csv (required)
  • AMIRA_readingandtiminglogs.xlsx (preferred) or AMIRA_readingandtiminglogs.csv (fallback)

Expected columns (the sampler creates blank columns for any that are missing, but results are better when they are present):

  • Micro‑intervention: activity_id, student_id, phrase_index, word_index, intervention_scope, intervention_type, intervention_word
  • Timing: activity_id, phrase_index, word_index, Story Word, AmiraW2V_Rec_Word, AmiraKaldi_Rec_Word, AmiraKaldi_Start_Time, AmiraKaldi_End_Time, annotator label, raw_text

The sampler attaches student_id to timing rows by joining on activity_id.
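
The join can be sketched in plain Python. This is only an illustration of the idea: the function name `attach_student_id` is hypothetical, and the choice to leave `student_id` as `None` when an `activity_id` has no match (a left join) is an assumption, not documented behavior.

```python
def attach_student_id(timing_rows, intervention_rows):
    """Copy student_id onto timing rows by matching activity_id (sketch)."""
    # Build a lookup from the micro-intervention log; first occurrence wins.
    by_activity = {}
    for row in intervention_rows:
        by_activity.setdefault(row["activity_id"], row["student_id"])
    # Return new dicts so the input rows are left untouched.
    return [
        {**row, "student_id": by_activity.get(row["activity_id"])}
        for row in timing_rows
    ]
```

For tabular data already loaded with pandas, a left merge on `activity_id` would express the same operation.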


🚀 Quick start

1) Build the sample file

From the repository root:

# Basic
python 0_RunSamplerNew.py --out sample.xlsx

# Recommended (all options set explicitly)
python 0_RunSamplerNew.py \
  --out sample.xlsx \
  --clip-seconds 20 \
  --min-sessions 2 \
  --sessions-per-student -1 \
  --clips-per-session 3 \
  --mode random \
  --global-mode linear \
  --criterion phrase_start \
  --seed 42 \
  --batch-id 2025-10-13

Or simply run python 0_RunSamplerNew.py; the default values are used for any option you omit.

Key options

  • --out (str, default: sample.xlsx) — Output Excel path (always written as .xlsx).
  • --clip-seconds (int, default: 20) — Window length in seconds.
  • --min-sessions (int, default: 2) — Students must have at least this many distinct activity_id sessions to be eligible.
  • --sessions-per-student (int, default: -1) — Sessions to include per student (-1 = use all available).
  • --clips-per-session (int, default: 3) — Cap of clips per session (<0 = unlimited).
  • --mode random|linear — How sessions are chosen per student (random shuffles before taking the first N).
  • --global-mode random|linear — How rows are globally ordered in the output.
  • --criterion phrase_start|word_overlap — Clip selection: by phrase starts or any word overlap with the window.
  • --seed (int|None) — Random seed for reproducibility when --mode random is used.
  • --batch-id (str) — Optional tag written to each row (handy for tracking runs).
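
For readers who want to script around the sampler, the options above map naturally onto an `argparse` parser. The sketch below is hypothetical and is not copied from 0_RunSamplerNew.py; defaults that this README does not state are left as `None` rather than guessed.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented sampler options."""
    p = argparse.ArgumentParser(description="Sketch of the sampler CLI")
    p.add_argument("--out", default="sample.xlsx")
    p.add_argument("--clip-seconds", type=int, default=20)
    p.add_argument("--min-sessions", type=int, default=2)
    p.add_argument("--sessions-per-student", type=int, default=-1)
    p.add_argument("--clips-per-session", type=int, default=3)
    # Defaults for the next three options are not documented here.
    p.add_argument("--mode", choices=["random", "linear"], default=None)
    p.add_argument("--global-mode", choices=["random", "linear"], default=None)
    p.add_argument("--criterion", choices=["phrase_start", "word_overlap"],
                   default=None)
    p.add_argument("--seed", type=int, default=None)
    p.add_argument("--batch-id", default=None)
    return p


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Note that `argparse` exposes `--clip-seconds` as `args.clip_seconds`, replacing hyphens with underscores.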

Sampler output

  • sample.xlsx with a single sample sheet; each clip’s rows share a sample_id.
  • A console report summarizing: total rows, unique students, sessions attempted/with yield/zero‑clip, unique clips, and (if capped) a clip fill‑rate.

2) Code the sample

python 1_RunTextReplayCoderOct1.py --sample sample.xlsx
  • You’ll be prompted for Coder Name (used in the output filename).
  • A scrollable monospaced view shows each clip, grouped by Phrase 1, Phrase 2, … with:
    • Story Word, Student Speech (Kaldi), optional W2V
    • [start–end] timestamps and [Correct|Incorrect|Unknown] status
    • Any detected Word Interventions and Phrase Interventions
    • Optional Raw text line per phrase (first non‑empty value)
  • At the bottom, select one or more labels (multi‑select) and click Save & Next.

Coder output

  • Each coder/session produces/extends a CSV named:
    coder_{CODER_NAME}_{SAMPLE_FILENAME}_annotations.csv
    Example: coder_Jen_sample.xlsx_annotations.csv
  • Resume where you left off: The GUI loads that CSV (if present) and skips already‑coded sample_ids.
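
The naming rule and the resume behavior can be sketched as follows. The helper names are hypothetical, and the assumption that the annotations CSV has a `sample_id` column is mine; only the filename pattern and the skip-already-coded behavior come from the description above.

```python
import csv
import os


def annotations_path(coder_name, sample_path):
    """Build the per-coder CSV name from the documented pattern."""
    sample_name = os.path.basename(sample_path)
    return f"coder_{coder_name}_{sample_name}_annotations.csv"


def already_coded(csv_path):
    """Return the set of sample_ids found in an existing annotations CSV."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as fh:
        # Assumes a 'sample_id' column; values are read back as strings.
        return {row["sample_id"] for row in csv.DictReader(fh)
                if row.get("sample_id")}


def remaining(sample_ids, csv_path):
    """Keep only the sample_ids that still need coding, in original order."""
    done = already_coded(csv_path)
    return [s for s in sample_ids if str(s) not in done]
```

Because the sample filename is part of the output name, pointing the GUI at a differently named sample file starts a fresh annotations CSV.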

🔀 Typical workflow (end‑to‑end)

  1. Put your two input files in the repo root.
  2. Run the sampler once to produce sample.xlsx.
  3. Share sample.xlsx with coders or run the GUI yourself.
  4. Collect coder_*.csv files; these are your coding outputs for analysis.
  5. (Optional) Rerun the sampler with a different seed or parameters to create a new batch, e.g., sample_2025‑10‑13.xlsx with a --batch-id tag. Coders can annotate that separately.

📝 Notes on what’s inside the sample

  • Adjusted times: start/end times are “cleaned” to avoid 0/NaN and smoothed within phrases; phrase gaps are respected when laying out continuous adjusted time across a session.
  • Interventions aggregation: multiple word‑level interventions targeting the same word are pipe‑joined (e.g., SaySound|Blend); corresponding words are joined in parallel. Phrase‑level interventions are written as their own rows.
  • Ordering: If you pass --global-mode linear, rows are sorted as: (student_id, activity_id, clip_index, row_type, phrase_index, word_index) so word rows precede phrase‑IV rows per clip. With random, rows are shuffled after construction.
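
The aggregation and linear ordering described above can be illustrated with a short sketch. The function names and the `row_type` labels `"word"` and `"phrase_iv"` are assumptions for this example; the real sampler may use different labels.

```python
def aggregate_word_interventions(rows):
    """Pipe-join intervention types and words that target the same word.

    rows: dicts with activity_id, phrase_index, word_index,
          intervention_type, intervention_word.
    Returns {(activity_id, phrase_index, word_index): (types, words)}.
    """
    grouped = {}
    for r in rows:
        key = (r["activity_id"], r["phrase_index"], r["word_index"])
        types, words = grouped.setdefault(key, ([], []))
        types.append(r["intervention_type"])
        words.append(r["intervention_word"])  # kept parallel to the types
    return {k: ("|".join(t), "|".join(w)) for k, (t, w) in grouped.items()}


# Hypothetical row_type labels; word rows must sort before phrase-IV rows.
ROW_TYPE_RANK = {"word": 0, "phrase_iv": 1}


def linear_sort_key(row):
    """Sort key matching the documented column order for linear mode."""
    word_index = -1 if row["word_index"] is None else row["word_index"]
    return (row["student_id"], row["activity_id"], row["clip_index"],
            ROW_TYPE_RANK[row["row_type"]], row["phrase_index"], word_index)
```

Calling `sorted(rows, key=linear_sort_key)` would then place every word row of a clip ahead of that clip's phrase-level intervention rows.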

🔧 Troubleshooting

Q: No module named 'openpyxl' / xlrd errors
Install the dependency: pip install openpyxl.

Q: No module named 'tkinter'
Install your OS Tk bindings (Linux example above).

Q: The GUI opens but the words show ? or weird symbols.
Use the XLSX timing file to preserve characters. CSV files can lose special characters when the encoding does not match.

Q: The GUI says “Missing sampling file”.
Check --sample path and that sample.xlsx exists in the working directory.

Q: I changed parameters but my coder CSV keeps appending.
The output filename includes both the coder name and the sample filename. If you want a fresh file, change either (e.g., new --out for the sampler).

Q: Some sessions have zero clips.
The sampler reports this in its console summary. Possible reasons: the minimum-session threshold filters out the student; the clip criterion finds no phrase starts in the window; or the timing rows lack usable times.


🌱 Reproducibility & randomization

  • Use --seed to freeze any randomness (--mode random).
  • For fully deterministic exports, set --mode linear --global-mode linear and omit --seed.
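
The effect of `--seed` on session selection can be shown with a small sketch. This is an illustration under assumptions, not the sampler's code: `choose_sessions` is a hypothetical name, and treating any negative `sessions_per_student` as "use all" is a simplification of the documented `-1`.

```python
import random


def choose_sessions(session_ids, sessions_per_student=-1, mode="linear",
                    seed=None):
    """Pick sessions for one student (sketch of the documented behavior)."""
    ordered = list(session_ids)
    if mode == "random":
        # A private Random instance keeps the shuffle reproducible for a
        # given seed without touching the global random state.
        random.Random(seed).shuffle(ordered)
    if sessions_per_student < 0:
        return ordered  # negative means "use all available" in this sketch
    return ordered[:sessions_per_student]
```

Two calls with the same seed and inputs return the same selection, while `mode="linear"` ignores the seed entirely and simply takes the first N sessions.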

📂 Data privacy

These scripts assume local, offline processing of educational data. Do not commit raw student logs to a public repository. Consider keeping actual CSV/XLSX files out of Git via .gitignore, and share synthetic or redacted examples instead.

Example .gitignore entries:

AMIRA_microinterventionlogs.*
AMIRA_readingandtiminglogs.*
coder_*_annotations.csv
*.xlsx

🪪 License

This project is licensed for internal use within AMIRA.


📎 Acknowledgments

Built for the AMIRA workflow to support rapid, reliable human coding of short replay windows, with special thanks to collaborators who tested the sampler and GUI.
