Homework: Extend the Keyword Analysis

This homework gives you another chance to practice the Session 1 workflow on a messy field: the subjects column in the Seattle Public Library title-level sample data.

The goal is not to produce a polished report or a final subject taxonomy. The goal is to practice using Codex to inspect data, compare approaches, and decide what kind of method is appropriate before trusting an automated result.

Choose one path

If you did not complete the subject-tag stretch exercise during the session, start there. First ask Codex which files contain the field:

Do the checked out materials have subject tags in any dataset? Check the raw
CSV headers and report which files include the field.
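To sanity-check whatever Codex reports, a header check like the following can be run by hand. This is a minimal sketch using only the standard library. The function name files_with_column is hypothetical, and it assumes the column is literally named subjects (compared case-insensitively) and that the files are UTF-8 CSVs with a header row.

```python
import csv
from pathlib import Path


def files_with_column(paths, column="subjects"):
    """Return {file_name: bool} saying whether each CSV header has `column`."""
    result = {}
    for path in paths:
        path = Path(path)
        # utf-8-sig also reads plain UTF-8 and drops a leading BOM if present.
        with path.open(newline="", encoding="utf-8-sig") as handle:
            header = next(csv.reader(handle), [])
        # Compare case-insensitively and ignore stray whitespace in names.
        names = {name.strip().lower() for name in header}
        result[path.name] = column.lower() in names
    return result
```

Pass it the paths of the raw CSV files in your copy of the workshop data. Only the first row of each file is read, so it stays fast on large files.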

Then ask Codex to add a small notebook section that profiles the subjects field in the digital and physical title-level samples:

Add a cell to the notebook that summarizes the subjects field for the digital
and physical title-level sample files. Count the number of unique subject
values in each dataset and show a small sample of values. Keep the code simple
and explain any parsing assumptions.
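For comparison with the cell Codex writes, here is one possible shape for such a profile. It is a sketch under stated assumptions, not the workshop's reference solution: the function name profile_subjects is hypothetical, the column is assumed to be named subjects, and each cell is treated as a single value with no splitting on delimiters.

```python
import csv
from collections import Counter


def profile_subjects(path, column="subjects", sample_size=5):
    """Count distinct raw values of `column` and return the most common ones."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            value = (row.get(column) or "").strip()
            if value:  # assumption: blank cells are missing data, not a subject
                counts[value] += 1
    return {
        "rows_with_value": sum(counts.values()),
        "unique_values": len(counts),
        "sample": [value for value, _ in counts.most_common(sample_size)],
    }
```

Run it once per sample file and compare the unique_values numbers. Treating each cell as one string is itself a parsing assumption worth stating, because a cell that holds several subjects will count as one distinct value.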

If the result is surprising, pause before asking Codex to implement a larger solution. Use the same method-choice prompt from the stretch exercise:

The physical sample has far more unique subject values than the digital sample.
This seems like a text-cleaning or lightweight NLP problem. Do not implement
anything yet. What simpler libraries or methods could help us inspect whether
these are near-duplicates, formatting variants, or genuinely granular subjects?
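One of the simpler methods Codex might suggest is pairwise string similarity. The sketch below uses difflib from the Python standard library rather than RapidFuzz, so it runs without installing anything. The function name near_duplicate_pairs and the 0.9 threshold are illustrative assumptions, not tuned values.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_pairs(values, threshold=0.9):
    """Return (score, a, b) for pairs of distinct values that look alike."""
    unique = sorted(set(values))
    pairs = []
    # Every pair is compared, so the cost grows quadratically with the
    # number of unique values. Use a sample, not the full field.
    for a, b in combinations(unique, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((round(score, 3), a, b))
    return sorted(pairs, reverse=True)
```

Try it on a few hundred subject values first. Reading the top pairs by eye is usually enough to tell formatting variants apart from subjects that are genuinely different but similarly worded.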

You can also ask Codex to write a handoff prompt, then start a new thread with /new and use that prompt to run the analysis in a new notebook:

Write a handoff prompt for another coding agent to investigate the subjects
field as a keyword-analysis problem. The prompt should ask the agent to profile
the field, compare raw and normalized unique counts, identify examples of
near-duplicate subject values, recommend a simple method before using an LLM,
and avoid large-scale canonicalization unless explicitly asked.
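The "compare raw and normalized unique counts" step in that prompt can be pictured with a small sketch. The normalization rules here (Unicode NFKC, case folding, whitespace collapsing, trailing punctuation removal) are assumptions chosen to be conservative, and the function names are hypothetical. The agent you hand off to may reasonably choose different rules.

```python
import re
import unicodedata


def normalize_subject(value):
    """Apply conservative cleanup to one subject string."""
    text = unicodedata.normalize("NFKC", value)
    text = text.casefold().strip()
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.rstrip(" .;,")        # drop trailing punctuation


def compare_unique_counts(values):
    """Count distinct values before and after normalization."""
    raw = {v for v in values if v and v.strip()}
    normalized = {normalize_subject(v) for v in raw}
    normalized.discard("")  # a value made only of punctuation normalizes to ""
    return {"raw_unique": len(raw), "normalized_unique": len(normalized)}
```

A large drop from raw_unique to normalized_unique suggests formatting noise. A small drop suggests the variety comes from somewhere else, such as granular or compound subject strings.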

If you already completed the subject-tag stretch exercise, extend it by trying a different model, reasoning setting, or method. For example, you might compare:

a faster model and a stronger reasoning model
a simple normalization-only approach and a fuzzy-matching approach
RapidFuzz and scikit-learn TF-IDF
an LLM-generated grouping proposal and a simpler string-similarity method

Questions to investigate

Use the exercise to answer practical questions about the field:

What does the subjects field appear to contain?
How different are the raw unique counts for digital and physical samples?
Does simple normalization reduce the number of distinct values?
Are there obvious near-duplicates, formatting variants, or compound subject strings?
Does the large physical count look like meaningful topical variety, metadata granularity, parsing noise, or a mixture?
What would you trust Codex or another model to do automatically?
What would still require human review?
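For the question about compound subject strings, a quick delimiter check can help before any splitting is trusted. This is a sketch only: the candidate separators below are guesses, not documented features of the dataset, and the function names are hypothetical.

```python
from collections import Counter

# Assumption: these separators are guesses to test, not known facts about the data.
CANDIDATE_DELIMITERS = [", ", "; ", " -- ", "|"]


def delimiter_report(values):
    """Count how many values contain each candidate delimiter."""
    report = Counter()
    for value in values:
        for delimiter in CANDIDATE_DELIMITERS:
            if delimiter in value:
                report[delimiter] += 1
    return dict(report)


def split_counts(values, delimiter):
    """Compare unique whole strings with unique pieces after splitting."""
    whole = set(values)
    pieces = {
        part.strip()
        for value in values
        for part in value.split(delimiter)
        if part.strip()
    }
    return {"unique_whole": len(whole), "unique_pieces": len(pieces)}
```

If unique_pieces is much smaller than unique_whole for some delimiter, the high count may reflect combinations of a modest vocabulary. Whether a comma separates subjects or belongs inside one subject heading is a judgment that still needs human review.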