
After an AI Coding Assistant Joins, the Workflow Must Be Upgraded


Story Setup: When AI Turns from an Accelerator into a Landmine Factory

You spent a month rapidly building a complex multimodal learning framework with the help of Copilot. The code generation speed was astonishing—previously you could write 200 lines a day; now you could write 500 lines in half a day. You felt your productivity had multiplied severalfold.

But in the week before preparing your paper, problems began to erupt all at once:

Problem 1: Hidden Bugs

While testing an edge case, you found the model output NaN. Tracing the code back, you discovered an issue in a data preprocessing function—it would divide by zero in some rare cases. This function was generated by Copilot; at the time, you only checked that it “ran” and did not carefully review the boundary conditions.
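A minimal sketch of this failure mode (the function and numbers here are hypothetical): a standardization step divides by the standard deviation, which is zero for a constant input. A small epsilon guard is exactly the kind of fix a careful boundary review would have caught.

```python
import math

def normalize(values, eps=1e-8):
    """Standardize a list of floats; eps guards the constant-input case."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    # Without "+ eps", a constant input gives std == 0 and the division
    # below produces inf/NaN -- the rare case that a "does it run?"
    # check on typical batches never exercises.
    return [(v - mean) / (std + eps) for v in values]

# Typical batch: behaves as expected.
assert max(normalize([1.0, 2.0, 3.0])) > 1.0

# Rare edge case: a constant feature stays finite (all zeros) instead of NaN.
assert normalize([5.0, 5.0, 5.0]) == [0.0, 0.0, 0.0]
```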

Problem 2: Inconsistent Implementations

You found that data augmentation was implemented inconsistently between training and evaluation: training used version A generated by Copilot, while evaluation used version B that you later wrote by hand. The two had subtle differences in normalization. This caused a mismatch between the training and test distributions, hurting performance.
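The fix for this class of bug is structural, not clever: a single shared implementation that both pipelines call, so they cannot drift apart. A sketch with illustrative names:

```python
def normalize_image(pixels, mean=0.5, std=0.25):
    """The single source of truth for normalization."""
    return [(p - mean) / std for p in pixels]

def train_transform(pixels):
    # Training may add augmentation on top, but normalization is shared.
    return normalize_image(pixels)

def eval_transform(pixels):
    # Evaluation reuses the same function instead of re-implementing it.
    return normalize_image(pixels)

# Because both call sites share one implementation, the training and
# test distributions cannot silently diverge.
batch = [0.0, 0.5, 1.0]
assert train_transform(batch) == eval_transform(batch)
```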

Problem 3: Overly Complex Architecture

For the sake of “elegance,” you had Copilot help you design a highly abstract architecture with many layers of abstract classes and factory patterns. Now you needed to quickly modify a feature, only to find you had to change five files because the logic was scattered across abstraction layers.

Problem 4: Missing Validation

You realized that much of the code generated by Copilot had no test coverage. You had been relying on “as long as it runs,” and had never systematically validated edge cases, error handling, or data consistency. Now that you wanted to add tests, you found the code was too tightly coupled to test in isolation.
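The coupling problem can be made concrete with a sketch (all names hypothetical): a function that mixes file I/O, training, and metric computation cannot be tested in isolation, while a pure function can.

```python
def accuracy(predictions, labels):
    """Pure function: takes plain data, returns a number.
    Testable with four literals -- no dataset, no model, no GPU."""
    if not labels:
        raise ValueError("no labels")
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# A unit test for decoupled logic is one line; for tightly coupled code
# it would require spinning up the entire training pipeline.
assert accuracy([1, 0, 1, 1], [1, 0, 0, 1]) == 0.75
```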

Moment of Realization

You had to stop all new feature development and spend a week:

  • Reviewing all AI-generated code to uncover hidden issues;
  • Unifying inconsistent implementations;
  • Simplifying the over-engineered architecture;
  • Adding missing tests and validation.

You realized: the AI coding assistant is not the problem; the problem is that your workflow did not keep up.

Before AI joined, your codebase was smaller and had fewer errors; you could control quality through memory and experience. But now the code volume has exploded, while your validation mechanism is still stuck at the “manual review” stage—and this gap will only grow.

Three Major Pitfalls of AI Coding

Pitfall 1: Generation Speed Outpaces Validation Capacity

Symptom: AI generates 100 lines of code in a minute, and you accept it because it “looks fine.” But you have not carefully checked boundary conditions, error handling, or performance impact.

Consequences:

  • A large number of hidden bugs are planted and surface at critical moments;
  • The codebase fills with implementations that “run but are not robust”;
  • Refactoring costs far exceed the cost of doing it right from the start.

Root cause: The validation mechanism has not been upgraded and remains at the “manual review” stage.

Pitfall 2: Repeated Generation Leads to Inconsistency

Symptom: When you need similar functionality, you ask AI to regenerate it instead of reusing existing code. As a result, the project contains multiple implementations that are “similar but not exactly the same.”

Consequences:

  • Changing one piece of logic requires changes in multiple places, and omissions are likely;
  • Different implementations may have subtle differences, leading to inconsistent results;
  • The codebase bloats and maintenance costs surge.

Root cause: Failure to distinguish between “core library code” and “one-off glue code” (review Chapter 3).

Pitfall 3: Overreliance on “Quickly Tweaking It While I’m Here”

Symptom: When asking AI to implement a feature, you also casually ask it to “change this too” or “optimize that as well.” A single commit includes changes across more than a dozen files.

Consequences:

  • You cannot pinpoint which change caused the problem;
  • When you want to roll back, you find that “pulling one thread moves the whole fabric”;
  • The code history becomes chaotic, and the evolution of logic is no longer traceable.

Root cause: Changes are no longer “the smallest verifiable unit” (review Chapter 1).

The Upgraded Workflow: Treat AI as a “Junior Programmer for Rapid Trial and Error”

Core idea: AI is an assistant, not a replacement. Its strength is rapidly generating a first draft; your responsibility is to validate, integrate, and gatekeep.

Treat AI coding as a two-stage process:

  1. Generation stage: AI quickly produces a first draft (fast, rough, allowed to have issues);
  2. Validation stage: Humans review, test, and refactor (slow, strict, ensuring quality).

Key principle: Generation can be fast, but shipping must be slow.

Principle 1: Make Only One Verifiable Change at a Time

Operational recommendations:

  • Each time, limit AI-generated code to no more than a 200-line diff;
  • Each change should involve only one logical function (e.g., “add data augmentation,” “fix NaN bug”);
  • The change must be independently verifiable (with corresponding tests or experiments).

Practical example:

Counterexample:

    # A single change includes:
    # - Modify data loading logic
    # - Add a new model layer
    # - Adjust training hyperparameters
    # - Refactor evaluation code
    # - Update configuration files

Result: if experimental results worsen, you cannot determine which change caused it.

Positive example:

    # Change 1 (independent commit): modify data loading logic
    # Change 2 (independent commit): add a new model layer
    # Change 3 (independent commit): adjust training hyperparameters
    # Each change is verified on its own before the next one begins.

Principle 2: Attach “How to Verify” to Every Change

Operational recommendations:

When asking AI to generate code, also require it to generate verification code.

Example prompt:

Please implement data augmentation, including:
  1. Implementation code
  2. Unit tests (at least covering normal cases and edge cases)
  3. Usage examples
  4. Potential risk points

Verification checklist (check after every change):

  • The code runs (does not crash)
  • Outputs match expectations (shape, numeric range, data type)
  • Edge cases are handled correctly (empty input, extreme values, invalid input)
  • Existing functionality is not affected (regression tests pass)
  • Test coverage exists (at least a smoke test)
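As a sketch of what this checklist looks like as code, here is a hypothetical `augment_batch` function with one assertion per checklist item (all names and values are illustrative):

```python
def augment_batch(batch):
    """Hypothetical augmentation: scales values, clamped to [0, 1]."""
    if not batch:
        raise ValueError("empty batch")
    return [min(max(v * 1.1, 0.0), 1.0) for v in batch]

# Runs without crashing; outputs match expectations (shape, range, type).
out = augment_batch([0.2, 0.5, 0.95])
assert len(out) == 3                           # shape preserved
assert all(0.0 <= v <= 1.0 for v in out)       # numeric range respected
assert all(isinstance(v, float) for v in out)  # data type unchanged

# Edge case handled: empty input fails loudly instead of corrupting state.
try:
    augment_batch([])
    raise AssertionError("empty input should raise ValueError")
except ValueError:
    pass
```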

Principle 3: Ban “Refactoring a Large Chunk While I’m Here”

Symptom identification:

Danger signals:

  • AI suggests: “I’ll also refactor this part to make it more elegant”
  • You think: “Since I’m changing it anyway, I might as well do it together”
  • Result: one PR includes feature additions + refactoring + bug fixes

Correct approach:

  1. Implement the feature first, refactor later:

       # Step 1: implement the feature (rough is acceptable)
       git commit -m "feat: add data augmentation (rough version)"
    
       # Step 2: verify the feature is correct
       pytest tests/test_data_aug.py
    
       # Step 3: refactor independently (no functional changes)
       git commit -m "refactor: clean up data augmentation code"
    
       # Step 4: verify the refactor did not break functionality
       pytest tests/test_data_aug.py
    
  2. Separate refactoring from features:

  • Feature changes: use “feat:” or “fix:” in the commit message

  • Refactoring changes: use “refactor:” in the commit message

  • Never mix them together
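A commit-message prefix is easy to check mechanically, for example in a commit-msg hook. A sketch of such a check (the prefix list is an assumption; adapt it to your team's convention):

```python
import re

ALLOWED = ("feat", "fix", "refactor", "docs", "test", "chore")

def check_commit_message(msg):
    """Return True if the message starts with exactly one allowed prefix."""
    m = re.match(r"^([a-z]+):\s+\S", msg)
    return bool(m) and m.group(1) in ALLOWED

assert check_commit_message("feat: add data augmentation")
assert check_commit_message("refactor: clean up data augmentation code")
assert not check_commit_message("feat+refactor: add mixup and clean up")  # mixed
assert not check_commit_message("update stuff")  # no prefix at all
```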

Principle 4: Core logic must be manually reviewed

Definition of “core logic”:

The following code must not be accepted blindly in an AI-generated version; it must be carefully reviewed by a human:

  • Data processing: data loading, preprocessing, splitting, augmentation
  • Model core: loss functions, key modules (attention, normalization, etc.)
  • Evaluation logic: metric computation, post-processing, threshold selection
  • Randomness control: random seed configuration, random sampling logic
  • Hyperparameters: learning-rate scheduling, weight decay, dropout rate

Review checklist:

  • Logical correctness (is the algorithm implemented correctly?)
  • Boundary conditions (what happens with empty data or extreme values?)
  • Data consistency (are training and testing logic consistent?)
  • Randomness (is the seed set correctly? is it reproducible?)
  • Performance impact (will it be extremely slow? will memory blow up?)
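The randomness item on this checklist can be verified mechanically rather than by eyeball: run the sampling step twice with the same seed and assert identical output. A dependency-free sketch (the function is hypothetical):

```python
import random

def sample_indices(n, k, seed):
    """Hypothetical subsampling step with explicit seed control."""
    rng = random.Random(seed)  # local RNG: no hidden global state
    return rng.sample(range(n), k)

# Same seed -> identical samples: the run is reproducible.
assert sample_indices(1000, 5, seed=42) == sample_indices(1000, 5, seed=42)

# Different seed -> different samples (almost surely), so the seed
# actually controls the randomness rather than being ignored.
assert sample_indices(1000, 5, seed=42) != sample_indices(1000, 5, seed=7)
```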

Principle 5: Add minimal tests first, then modify core logic

Anti-pattern:

You: Copilot, help me refactor this data loading function
Copilot: [generates 150 lines of code]
You: Looks good, accept!
[A week later you discover the data loading is wrong, but you have already run many experiments]

Correct workflow:

# Step 1: Write tests for the existing code first
def test_data_loader_current():
    """Test the current data-loading behavior"""
    loader = DataLoader(...)
    batch = next(iter(loader))

    assert batch['image'].shape == (32, 3, 224, 224)
    assert batch['label'].shape == (32,)
    assert batch['label'].min() >= 0
    assert batch['label'].max() < num_classes

# Step 2: Run the tests and ensure they pass
pytest tests/test_data_loader.py

# Step 3: Refactor
[Copilot generates new code]

# Step 4: Run the tests and ensure behavior is unchanged
pytest tests/test_data_loader.py

# Step 5: If tests fail, fix or roll back

Benefits of test-first:

  • Clarifies current behavior (as a baseline for regression tests)
  • Quickly validates refactoring (1 minute instead of 1 hour of experiments)
  • Avoids breaking existing functionality
  • Forces you to think about “what the correct behavior is”

AI Rules Page (CLAUDE.md): Team consensus for collaboration

To standardize AI usage practices within a team, it is recommended to create CLAUDE.md (or AI_RULES.md) in the project root directory, clearly defining the boundaries and workflow for AI-assisted coding.

CLAUDE.md Template

# Guidelines for Using AI Coding Assistants

This project uses AI coding assistants (e.g., GitHub Copilot, Claude, GPT) to support development.
To ensure code quality and maintainability, all AI-generated code must follow the guidelines below.

## Core Principles
  1. AI generates; humans are accountable: AI can quickly produce a first draft, but humans must review, test, and gatekeep.

  2. Iterate in small steps: each change must not exceed a 200-line diff and must be independently verifiable.

  3. Test first: before modifying core logic, add tests; after the change, tests must pass.

  4. No opportunistic refactoring: functional changes and refactoring must be committed separately.

## Operational Guidelines

### 1. Every change must include a verification method

Clearly specify in the commit message or PR description:

How to verify this change:

  • Run tests: pytest tests/test_xxx.py
  • Run experiments: python train.py --config xxx
  • Check outputs: output shape should be [B, C, H, W]

### 2. Every change must document risk points

Potential risks:

  • Modified data preprocessing, which may affect the data distribution
  • Added new random operations; need to check seed settings
  • Changed evaluation logic; need to rerun baseline to confirm fairness

### 3. Every change must have a rollback strategy

Rollback strategy:

  • If tests fail: git checkout -- <file>
  • If experimental results degrade: revert to commit <hash>
  • If other functionality is affected: roll back and create a new branch for isolated debugging

## Prohibited Behaviors

[NO] **Prohibition 1: Blindly accepting large blocks of generated code**
  - Any AI-generated code exceeding 50 lines must be manually reviewed line by line
  - Core logic (data, model, evaluation) must be reviewed with extra rigor

[NO] **Prohibition 2: Modifying too many files**
  - A single commit should not touch more than 5 files (except in special cases)
  - If many files must be changed, split into multiple commits

[NO] **Prohibition 3: Merging without tests**
  - Before any code is merged into main, all tests must pass
  - If there are no relevant tests, tests must be added first

[NO] **Prohibition 4: Direct changes on the main branch**
  - All AI-generated code must be validated on an experimental branch first
  - Merge into main only after validation passes

## Required Behaviors

[YES] **Requirement 1: Test coverage for core logic**
  - Data loading, model core, and evaluation logic must have unit tests
  - After every modification to core logic, tests must pass

[YES] **Requirement 2: Comparable experimental results**
  - Any change that affects results must be validated with comparative experiments
  - Record run_id and metric comparisons before and after the change

[YES] **Requirement 3: Synchronized configuration versions**
  - Any change that affects results must update the corresponding config
  - Config files must be committed and tagged

[YES] **Requirement 4: Documentation updated in sync**
  - API changes must update the documentation
  - New features must update the usage instructions in the README

## Acceptance Criteria (Before merging into main)

- [ ] Changes do not exceed 200 lines (or have been thoroughly discussed)
- [ ] All tests pass
- [ ] Core logic has been manually reviewed
- [ ] Clear commit messages
- [ ] Includes verification method, risk points, and rollback strategy
- [ ] If experimental results are affected, includes comparative experiment records

## Example: A good AI-assisted workflow

```bash
# 1. Create an experimental branch
git checkout -b exp/add-mixup

# 2. Have AI generate the first draft
# [Copilot generates mixup data augmentation code]

# 3. Manual review and revision
# - Check boundary conditions
# - Add type annotations
# - Verify that random seed handling is correct

# 4. Add tests
pytest tests/test_mixup.py

# 5. Run comparative experiments
make train CONFIG=configs/baseline.yaml  # baseline
make train CONFIG=configs/mixup.yaml     # with mixup

# 6. Record results
# outputs/2026-02-01_1030_baseline/run.json
# outputs/2026-02-01_1100_mixup/run.json

# 7. If results improve, merge into main
git checkout main
git merge exp/add-mixup

# 8. Clean up the experiment branch
git branch -d exp/add-mixup
```

## Reference Resources

- Repository structure conventions: see the "Project Structure" section in README.md
- DoD checklist: see DoD.md
- Experiment logging conventions: see LOGGING.md

---

**Principle Summary: AI makes you faster, but it must not make you sloppier.**

## Practical Cases: The Correct Way to Use AI Assistance

### Case 1: Adding a New Model Component

#### Incorrect Approach:

    You: Copilot, help me implement a multi-head attention module
    Copilot: [generates 200 lines of code]
    You: Looks good—accept! Use it directly in training
    [training crashes; a bug is found in the attention computation]

#### Correct Approach:

    # Step 1: Generate a first draft
    You: Copilot, help me implement a multi-head attention module, including:
        - implementation code
        - unit tests
        - usage examples

    # Step 2: Manual review
    # - Check whether there is a division by sqrt(d_k) before softmax
    # - Check whether the mask implementation is correct
    # - Check whether the output shape matches expectations

    # Step 3: Write tests
    def test_multi_head_attention():
        attn = MultiHeadAttention(d_model=512, n_heads=8)
        x = torch.randn(32, 100, 512)  # (batch, seq, dim)
        out, weights = attn(x, x, x)

        assert out.shape == (32, 100, 512)
        assert weights.shape == (32, 8, 100, 100)
        assert torch.allclose(weights.sum(dim=-1),
                              torch.ones_like(weights.sum(dim=-1)))

    # Step 4: Test the module in isolation
    pytest tests/test_attention.py -v

    # Step 5: Test integration on a small dataset
    python train.py --config configs/test_attention.yaml \
                    --data_subset 100 --epochs 2

    # Step 6: After confirming everything is fine, run formal training
    python train.py --config configs/attention.yaml
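The row-normalization invariant that the test above asserts (attention weights summing to 1) follows from the softmax itself; a framework-free sketch of that property:

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([2.0, 1.0, 0.1])
# Each row of attention weights must sum to 1 -- the same invariant the
# torch.allclose check in the test above enforces.
assert abs(sum(weights) - 1.0) < 1e-9
assert all(w > 0 for w in weights)
```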

### Case 2: Refactoring Data Loading

#### Incorrect Approach:

    You: This data loading code is a mess—Copilot, refactor it for me
    Copilot: [rewrites the entire DataLoader class]
    You: Great, use the new version!
    [rerun experiments; results differ from before, and it is unclear where the issue is]

#### Correct Approach:

    # Step 1: Write behavioral tests for the old version
    def test_old_data_loader_behavior():
        """Record the behavior of the old version as a baseline"""
        loader = OldDataLoader(...)
        batch = next(iter(loader))

        # Record key behaviors
        assert batch['image'].shape == (32, 3, 224, 224)
        assert batch['image'].dtype == torch.float32
        assert batch['image'].min() >= -1.0
        assert batch['image'].max() <= 1.0
        # ... more assertions

    # Step 2: Run the tests and ensure they pass
    pytest tests/test_data_loader.py::test_old_behavior -v

    # Step 3: Have AI refactor
    Copilot: [generates a new DataLoader]

    # Step 4: Write the same tests for the new version
    def test_new_data_loader_behavior():
        """Ensure the new version’s behavior is consistent"""
        loader = NewDataLoader(...)
        batch = next(iter(loader))

        # Same assertions
        assert batch['image'].shape == (32, 3, 224, 224)
        # ...

    # Step 5: Compare outputs from the two versions
    def test_output_consistency():
        """Directly compare outputs from the old and new versions"""
        old_loader = OldDataLoader(seed=42)
        new_loader = NewDataLoader(seed=42)

        old_batch = next(iter(old_loader))
        new_batch = next(iter(new_loader))

        torch.testing.assert_close(old_batch['image'],
                                   new_batch['image'])

    # Step 6: If they match, then run the full experiment for verification
    make reproduce RUN=baseline                         # use the old version
    make train CONFIG=configs/baseline_new_loader.yaml  # use the new version
    # Compare the results

Common Issues and Solutions

Q1: The AI-generated code is too complex—I can’t understand it. What should I do?

Solution:

  • Refuse to accept it. Ask the AI to regenerate a simpler version.

  • Prompt:

      Please implement feature XXX with the following requirements:
      - Keep the code simple and straightforward; avoid over-abstraction
      - Prefer the standard library and common patterns whenever possible
      - Add detailed comments explaining each step
    
  • Remember: Code you cannot understand now will also be unintelligible to your future self—and even more so to others.

Q2: The AI-generated code has a bug; after fixing it, a new bug appears

Solution:

  • Do not repeatedly ask the AI to fix bugs and fall into an infinite loop.

  • The correct approach:

    1. Write tests first to reproduce the bug

    2. Fix the bug manually (small, localized changes)

    3. After tests pass, consider whether refactoring is necessary

  • AI can suggest possible issues, but the fix should be done by humans.
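Steps 1 and 2 above in miniature (the `parse_label` function and its bug are hypothetical): first a test that reproduces the bug, then a small manual fix.

```python
def parse_label(raw):
    """Fixed version: strips whitespace and rejects empty strings,
    instead of crashing with int('') deep inside the training loop."""
    raw = raw.strip()
    if not raw:
        raise ValueError("empty label string")
    return int(raw)

# Step 1: a test that reproduces the reported bug, written BEFORE the fix.
def test_empty_label_is_rejected():
    try:
        parse_label("   ")
    except ValueError:
        return
    raise AssertionError("empty label should raise ValueError")

# Step 2: the small, localized fix above makes the reproduction test pass,
# and normal inputs still work.
test_empty_label_is_rejected()
assert parse_label(" 3 ") == 3
```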

Q3: What if team members do not follow the AI usage guidelines?

Solution:

  • Add automated checks in a Git pre-commit hook:

      # .git/hooks/pre-commit
      #!/bin/bash
    
      # Check the number of lines in the commit diff
      DIFF_LINES=$(git diff --cached | wc -l)
      if [ $DIFF_LINES -gt 400 ]; then
          echo "Error: commit diff too large ($DIFF_LINES lines)"
          echo "Please split into smaller commits"
          exit 1
      fi
    
      # Check whether there are tests
      if git diff --cached --name-only | grep -q "src/"; then
          if ! git diff --cached --name-only | grep -q "tests/"; then
              echo "Warning: you modified src/ but made no test changes"
              echo "Please ensure tests are updated"
          fi
      fi
    
      # Run tests
      pytest tests/ -q || {
          echo "Error: tests failed"
          exit 1
      }
    
  • Enforce it in CI (see the next section)

  • Enforce strict checks during code review

10-Minute Action: Establish Checkpoints for Your Next AI Coding Session

If you do only one thing right now: establish a minimal verification workflow for your next AI-assisted coding session.

  1. Create CLAUDE.md

    Copy the template from this chapter and place it in the project root directory.

  2. Write a smoke test

      # tests/test_smoke.py
      import math

      def test_basic_training_loop():
          """The basic training workflow can run end-to-end."""
          config = load_config("configs/test.yaml")
          config["epochs"] = 2
          config["data_subset"] = 100
    
          model, metrics = train_model(config)
    
          assert metrics["train_loss"] > 0
          assert not math.isnan(metrics["val_loss"])
    
  3. Set up a pre-commit hook

      # Simple version: only check tests
      printf '#!/bin/bash\npytest tests/ -q\n' > .git/hooks/pre-commit
      chmod +x .git/hooks/pre-commit
    
  4. Run one complete workflow

      # 1. Create an experiment branch
      git checkout -b exp/test-ai-workflow
    
      # 2. Ask the AI to make a small change (e.g., add a utility function)
      # 3. Write the corresponding tests
      # 4. Run tests
      pytest tests/test_xxx.py
    
      # 5. Commit (the hook will run automatically)
      git add .
      git commit -m "feat: add utility function X"
    
      # 6. If the hook passes, merge into main
      git checkout main
      git merge exp/test-ai-workflow
    

Starting from your next AI coding session, the default behavior should be: generate → review → test → verify → merge. This workflow will become muscle memory, ensuring that AI remains your accelerator rather than a landmine factory.