Understanding the Data Before Analysis#
A coding agent will generate summaries and plots for any input, including data that cannot support the intended conclusion. The person doing the analysis is responsible for establishing what the data represents before interpreting results. This page covers how to direct Codex to characterize a dataset, and how to review what it returns.
Direct the investigation with specific questions#
A general request (“tell me about this data”) returns whatever the agent decides is relevant. A more reliable approach is to specify the questions that characterize any dataset and require the agent to answer them from the data:
What does one row represent?
What value is being counted or aggregated?
Is the data a sample, an aggregate, or a complete extract?
How do the datasets relate to one another?
Which dataset answers which kind of question, and what would be misleading to infer from each?
These questions are the goal of the investigation, but they do not all need to go into the first prompt. Start with a smaller profiling task so Codex creates a simple notebook artifact before moving into interpretation.
Ask Codex to create the first serious notebook artifact for this project:
Add a simple data profile to my notebook for the data in seattle-public-library/. Load the files, print the first few rows, report non-null values and data types.
After Codex finishes, review the notebook section it created. Use the profile to start asking what the files represent, but do not move on to plotting until the data structure is clear.
Two parts of this prompt are doing specific work. “Don’t make any plots yet” stops the agent from producing charts before the structure is confirmed. The request asks for basic profiling outputs rather than interpretation, which helps keep the first notebook section compact and reviewable.
Why this prompt is more specific#
The previous lesson used a narrower prompt:
Can you help me investigate the data in my data folder? I would like to know what the dataset or datasets are and the relationships between the items.
That prompt was better than broad EDA because it focused on data inventory and relationships. In testing, Codex identified the three datasets and explained that the title-level files connect to the aggregate table by year, month, and usage class.
However, later testing showed that asking for all of the interpretation at once made Codex write too much code and too much explanation. We still need to answer the data-shape questions, but it is better to build up to them from a simple profile:
What does one row represent?
What is being counted?
Is this sampled or aggregated?
What would be misleading to infer?
The revised prompt asks only for the facts needed to begin that discussion: files, rows, columns, data types, and non-null counts.
Confirm the profile#
Codex may add a notebook section with a compact profile for each file. Treat the profile as a starting point, not a complete understanding of the data. The next interpretive question is whether each file is a sample, an aggregate, or a complete extract, because that determines which questions the file can legitimately answer.
Review whether the notebook includes:
files present
row counts and shapes
column names
data types
non-null counts
a small preview of each file, if useful
After reviewing the profile, use the column names and sample rows to discuss:
What does one row represent in each file?
What value is being counted or aggregated?
Which files appear sampled, aggregated, or complete?
How might the files relate to each other?
In the checkout example, the full investigation eventually distinguishes two title-level sample files from one monthly aggregate file, and establishes that the samples relate to the aggregate by checkout year, month, and usage class. The practical consequence is immediate: a question about total volume must use the aggregate, not the samples.
In an earlier test, Codex identified:
Digital title sample
digital_checkout_title_sample_balanced_250k.csv250,000 title-month rows
digital checkouts
covers January 2006 through December 2025
includes title metadata and checkout counts
Physical title sample
physical_checkout_title_sample_balanced_250k.csv250,000 title-month rows
physical checkouts
covers January 2006 through December 2025
has the same schema as the digital file
Combined monthly totals
combined_checkout_totals_by_month_usageclass.csv501 monthly aggregate rows
covers April 2005 through April 2026
aggregated by
checkout_year,checkout_month, andusageclassincludes
total_checkouts,month_total_checkouts,usageclass_share,title_month_rows, andsample_rows
Codex also tested the relationship between files:
combined_checkout_totals_by_month_usageclass
keyed by: checkout_year + checkout_month + usageclass
|
| matches
v
digital_checkout_title_sample_balanced_250k
physical_checkout_title_sample_balanced_250k
keyed by: checkoutyear + checkoutmonth + usageclass
It reported that the aggregate file’s sample_rows column matched the number of rows in the title-level sample files for each month and usage class.
Document the data for future sessions#
At this point, Codex has done useful investigation that future sessions should not
need to repeat from scratch. The notebook records the analysis process, but the
basic dataset facts also belong in a durable reference file. Dataset
documentation belongs somewhere humans and agents would both expect to find it:
inside the seattle-public-library/ folder.
Ask Codex to create a data README:
Please edit the README.md file in the data folder documenting the raw data files with enough detail that future sessions do not have to rediscover the basic dataset structure every time.
Update AGENTS.md to direct to this file.
Review seattle-public-library/README.md before keeping it. Check that Codex did not invent
dataset meanings, overstate what the data can support, or turn sampled data into
full-population claims.
Then ask Codex to update AGENTS.md so future assistants know to use the data
documentation before starting new analysis:
Please update AGENTS.md to tell future assistants to read seattle-public-library/README.md before analyzing the raw data files.
This is an important teaching moment. If Codex has to inspect the same files,
schemas, row counts, and relationships in every new session, the project is missing
shared memory. Once those facts have been verified, seattle-public-library/README.md can record
them so later sessions can start from known dataset orientation and spend their work
on the new question.
Request a relationship diagram#
Once the relationships are established, have the agent express them visually so they can be checked at a glance:
Tip
Mermaid is a text-based diagram syntax. Instead of drawing boxes and arrows by hand, you write a small block of structured text, and a renderer turns it into a visual diagram. JupyterLab can render Mermaid diagrams in notebooks, and Codex can write Mermaid syntax for common diagram types.
In this lesson, we use Mermaid to make the dataset relationships visible. This is not a separate charting library for analyzing values; it is a way to document structure.
Ask Codex to summarize the dataset relationships visually:
Based on the investigation, add a Mermaid diagram to the notebook showing the
datasets and how they relate. It should look like an Entity Relationship diagram.
Review the diagram as an interpretation of the investigation, not as an automatically correct schema.
Note
If you want to export a Mermaid diagram outside the notebook, the official Mermaid CLI can render Mermaid text to SVG, PNG, or PDF. Mermaid CLI is a Node.js tool installed with npm, not a Python package. If Node.js and npm are available, one option is:
npm install -g @mermaid-js/mermaid-cli
Then this command converts a Mermaid file to SVG:
mmdc -i input.mmd -o output.svg
If you do not want to install it globally, see the Mermaid CLI documentation for local, Docker, and other installation options.
Data relationship diagram#
erDiagram
COMBINED_MONTHLY_TOTALS {
int checkout_year
int checkout_month
string usageclass
int total_checkouts
int month_total_checkouts
float usageclass_share
int title_month_rows
int sample_rows
}
DIGITAL_TITLE_SAMPLE {
string usageclass
string checkouttype
string materialtype
int checkoutyear
int checkoutmonth
int checkouts
string title
string isbn
string creator
string subjects
string publisher
string publicationyear
string sample_design
}
PHYSICAL_TITLE_SAMPLE {
string usageclass
string checkouttype
string materialtype
int checkoutyear
int checkoutmonth
int checkouts
string title
string isbn
string creator
string subjects
string publisher
string publicationyear
string sample_design
}
COMBINED_MONTHLY_TOTALS ||--o{ DIGITAL_TITLE_SAMPLE : "monthly context by year-month-usageclass"
COMBINED_MONTHLY_TOTALS ||--o{ PHYSICAL_TITLE_SAMPLE : "monthly context by year-month-usageclass"
The diagram shows a many-to-one relationship for context only: many sampled title-month rows can correspond to one monthly aggregate row. The aggregate keys are not title identifiers.
Compare the notebook investigation with the relationship diagram.
Discuss:
Does the diagram make the sample-versus-aggregate distinction visible?
What relationship is shown?
What relationship is not shown?
Which file should answer questions about total checkout volume?
Which files should answer questions about titles, creators, subjects, or material types?
Make another diagram
Ask Codex what other types of visualizations it could make for data relationships. Pick one or describe your own and add it to the notebook.
Key points
Understanding what one row represents comes before plotting or interpreting trends.
Sample files and aggregate files answer different kinds of questions.
Reviewing the agent’s notebook output is part of the analysis workflow, not a separate cleanup step.