Chapter 14: Exploratory Data Analysis with AI

What You’ll Learn

  • How to use AI to accelerate your first pass through data

  • What questions to answer before you touch a single row

  • Generating visualizations and summary statistics to understand what you’re working with

  • Identifying data quality issues that will guide your cleaning strategy

  • The iterative nature of EDA: initial exploration before prep, deep exploration after

The Two-Pass Journey

You just received your dataset. Before you run a single model or write a cleaning script, you need to answer a few basic questions: What is each column? Which one are you trying to predict or explain? Where are the gaps, and are those gaps random or concentrated in a way that matters?

This is what EDA is really about. It is not just generating a quick summary and moving on. It is taking the time to build a mental model of your data, column by column, before you trust it with anything important.

This chapter covers the first pass, the initial exploration you do when the data is still unfamiliar. You will discover its shape, its gaps, its quirks, and its quality issues. These discoveries will guide your cleaning strategy in the next chapter.

After cleaning and preparing the data, you will return for a second, deeper exploration. That is when you investigate relationships, test assumptions about distributions, and look for patterns with confidence, because by then you will trust what you are working with.

A Quick Example

Imagine a public health researcher who just received a dataset from a multi-site collaborative study. It has 8,000 rows and 42 columns, and they have never seen it before.

The first thing they do is ask an AI to generate a quick summary. Within 15 minutes, they know the row and column counts, the data types for each variable, and where missingness is concentrated. One critical variable is missing in about 12% of rows, and those rows turn out to be disproportionately from one study site. Dates are stored in three different formats across sites. Two columns have identical names but different scales.

None of this is catastrophic. But all of it would have caused silent errors if left unchecked. Now they know what to fix, and why.

That is the value of a careful first-pass EDA. It takes maybe an hour. It saves days.

Starting With the Right Questions

Before generating a single plot, it helps to get grounded in what the data is supposed to represent. Some questions worth answering at the start:

What does each column actually mean? Column names are often abbreviations, legacy labels, or ambiguous shorthand. If you have a data dictionary, read it. If you do not, ask a collaborator or use AI to help you infer meaning from context. You should be able to describe every column in plain language before you start modeling.

Which column is your outcome or target? Whether you are doing supervised learning, regression, or just building summary statistics for a paper, you need to identify your dependent variable clearly. What are its values? Is it continuous or categorical? How is it distributed? How much missing data does it have?

What are the known gaps, and do they matter? Missing data is normal. But missing data that is correlated with your outcome, or with a particular subgroup, is a problem that no amount of imputation will fully solve. You want to flag this early.

Are there columns you should exclude outright? Some variables are leakage risks (information that would not be available at prediction time), near-duplicates, or simply too sparse to be useful. Better to identify those now than after you have built features from them.

When AI Enhances EDA

AI is useful for generating visualization code quickly, computing summary statistics across many columns at once, suggesting relevant plots based on data types, and flagging potential outliers or anomalies. It is also good at writing the boilerplate you would otherwise copy from Stack Overflow: the missing value heatmap, the correlation matrix, the histogram grid.

Where AI falls short is interpretation. AI can tell you that 23% of a column is missing. It cannot tell you whether that is a problem for your research question, whether you should impute it, or whether the missingness is systematic in a way that undermines your design. That judgment requires your domain knowledge.

After Cleaning: The Second Pass

Once you have cleaned and prepared your data, you come back for a deeper look. The second-pass EDA is where the real analysis begins. Now you can examine relationships between variables with confidence, build correlation matrices and scatter plots, test whether distributions match your assumptions, and explore subgroup differences.

In the first pass, you are asking “can I trust this data?” In the second pass, you are asking “what does this data tell me?” The separation matters because trying to do both at once usually means doing neither well.

It is also worth noting that EDA is not a one-time event. Plan to revisit your data after each major transformation. If you impute a large block of missing values or re-encode a categorical variable, take another look at the distributions before moving forward. Chapter 15 covers the cleaning and preparation steps, and Chapter 21 addresses validation, but each of those stages benefits from looping back to visualization before you commit to your final dataset.

Rushing through initial EDA means you will miss important context. You might clean your data beautifully according to the wrong assumptions. By investing 30 minutes to an hour in this first pass, you will make smarter cleaning decisions, and AI makes that time go a long way.

Common AI-Assisted EDA Tasks

Generating an Initial Data Quality Report

AI can create a report showing missing values, duplicates, data types, and basic statistics. You review it and use it to inform your cleaning decisions.

Example prompt: “Generate a Python script that gives me a complete data quality report for my CSV file, including missing values, duplicates, data type mismatches, and summary statistics.”

Creating Distribution Plots

Ask AI to generate plots for each column or for a subset of key variables. Box plots for numeric variables, bar charts for categories, histograms for continuous values.

Identifying Missing Value Patterns

AI can visualize which columns have missing data and whether missingness is random or concentrated in certain rows or time periods. This matters for deciding how to handle it.

Spotting Outliers

AI can flag statistical outliers, but you decide whether they are errors or real phenomena that matter to your research.

Validation Checklist: Before Moving to Data Prep

Before you write down your cleaning strategy:

  • ✅ Have you reviewed a sample of raw data yourself (not just summaries)?

  • ✅ Can you name and describe every column in plain language?

  • ✅ Have you identified which column is your outcome or target variable?

  • ✅ Do the summary statistics match what you expect?

  • ✅ Have you identified all major data quality issues?

  • ✅ Do you understand where missing values are concentrated?

  • ✅ Have you caught any obviously incorrect values?

Key Takeaways

  • 🎯 First-pass EDA is about getting your bearings, not deep analysis

  • 🎯 Start by understanding what each column means and which is your target variable

  • 🎯 AI accelerates summary statistics and visualization generation

  • 🎯 Data quality issues you find here guide your cleaning strategy

  • 🎯 You will return for a deeper second pass after cleaning

  • 🎯 Speed matters here; perfection does not

Try This

For your next dataset:

  1. Before generating any plots, write a one-sentence description of each column

  2. Ask AI to generate a complete data quality report

  3. Spend 10 minutes reviewing the outputs

  4. Create a list of 3 to 5 data quality issues that need fixing

  5. For each issue, write a one-line plan (what you will do about it)

  6. Move to Chapter 15 with your findings in hand

Resources for Hands-On Learning

Kaggle Learn offers free micro-courses with practical examples that align well with this workflow. The courses on data cleaning and exploratory analysis include guided notebooks you can adapt to your own datasets. They are freely available at kaggle.com/learn.