# Homework: Extend the Keyword Analysis This homework gives you another chance to practice the Session 1 workflow on a messy field: the `subjects` values in the Seattle Public Library title-level sample data. The goal is not to produce a polished report or a final subject taxonomy. The goal is to practice using Codex to inspect data, compare approaches, and decide what kind of method is appropriate before trusting an automated result. ## Choose one path If you did not complete the subject-tag stretch exercise during the session, start with {ref}`the subject-tag stretch exercise `. ```text Do the checked out materials have subject tags in any dataset? Check the raw CSV headers and report which files include the field. ``` Then ask Codex to add a small notebook section that profiles the `subjects` field in the digital and physical title-level samples: ```text Add a cell to the notebook that summarizes the subjects field for the digital and physical title-level sample files. Count the number of unique subject values in each dataset and show a small sample of values. Keep the code simple and explain any parsing assumptions. ``` If the result is surprising, pause before asking Codex to implement a larger solution. Use the same method-choice prompt from the stretch exercise: ```text The physical sample has far more unique subject values than the digital sample. This seems like a text-cleaning or lightweight NLP problem. Do not implement anything yet. What simpler libraries or methods could help us inspect whether these are near-duplicates, formatting variants, or genuinely granular subjects? ``` You can also ask Codex to write a handoff prompt, then start a new thread with `/new` and use that prompt to perform new analysis in a new notebook: ```text Write a handoff prompt for another coding agent to investigate the subjects field as a keyword-analysis problem. The prompt should ask the agent to profile the field, compare raw and normalized unique counts, identify examples of near-duplicate subject values, recommend a simple method before using an LLM, and avoid large-scale canonicalization unless explicitly asked. ``` If you already completed the subject-tag stretch exercise, extend it by trying a different model, reasoning setting, or method. For example, you might compare: - a faster model and a stronger reasoning model - a simple normalization-only approach and a fuzzy-matching approach - RapidFuzz and scikit-learn TF-IDF - an LLM-generated grouping proposal and a simpler string-similarity method ## Questions to investigate Use the exercise to answer practical questions about the field: - What does the `subjects` field appear to contain? - How different are the raw unique counts for digital and physical samples? - Does simple normalization reduce the number of distinct values? - Are there obvious near-duplicates, formatting variants, or compound subject strings? - Does the large physical count look like meaningful topical variety, metadata granularity, parsing noise, or a mixture? - What would you trust Codex or another model to do automatically? - What would still require human review?