---
title: "Annotation Guide"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Annotation Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

```{r setup}
library(putior)
```

## Introduction

This guide provides a complete reference for PUT annotation syntax. It covers all annotation formats, multi-language support, multiline annotations, and best practices.

> **New to putior?** Start with the [Quick Start](quick-start.html) guide to create your first diagram in 2 minutes.

**putior** stands for **PUT** + **I**nput + **O**utput + **R**, reflecting the package's core purpose: tracking data inputs and outputs through your analysis pipeline using special annotations.

## Annotation Basics

PUT annotations are special comments that describe workflow nodes. Start simple:

**Minimal annotation (just a label):**

    # put label:"Load Data"

That's all you need! putior will:

- Auto-generate a unique ID
- Default `node_type` to `"process"`
- Default `output` to the filename

**Add more detail as needed:**

    # put label:"Load Data", node_type:"input", output:"data.csv"

**Full R script example:**

    # data_processing.R
    # put label:"Load Customer Data", node_type:"input", output:"raw_data.csv"

    # Your actual code
    library(dplyr)  # provides %>%, filter(), and mutate() used below

    data <- read.csv("customer_data.csv")
    write.csv(data, "raw_data.csv")

    # put label:"Clean and Validate", input:"raw_data.csv", output:"clean_data.csv"

    # Data cleaning code
    cleaned_data <- data %>%
      filter(!is.na(customer_id)) %>%
      mutate(purchase_date = as.Date(purchase_date))

    write.csv(cleaned_data, "clean_data.csv")

**Python script example:**

    # analysis.py
    # put id:"analyze_sales", label:"Sales Analysis", node_type:"process", input:"clean_data.csv", output:"sales_report.json"

    import pandas as pd
    import json

    # Load cleaned data
    data = pd.read_csv("clean_data.csv")

    # Perform analysis (cast NumPy scalars to built-in types so json can serialize them)
    sales_summary = {
        "total_sales": float(data["amount"].sum()),
        "avg_order": float(data["amount"].mean()),
        "customer_count": int(data["customer_id"].nunique())
    }

    # Save results
    with open("sales_report.json", "w") as f:
        json.dump(sales_summary, f)

**Resulting diagram from both files:**

```{r multi-file-diagram, echo=FALSE, results='asis', eval=TRUE}
library(putior)
multi_file_workflow <- data.frame(
  file_name = c("data_processing.R", "data_processing.R", "analysis.py"),
  id = c("load_data", "clean_data", "analyze_sales"),
  label = c("Load Customer Data", "Clean and Validate", "Sales Analysis"),
  node_type = c("input", "process", "process"),
  input = c(NA, "raw_data.csv", "clean_data.csv"),
  output = c("raw_data.csv", "clean_data.csv", "sales_report.json"),
  stringsAsFactors = FALSE
)
cat("```mermaid\n")
cat(put_diagram(multi_file_workflow, theme = "github", output = "raw"))
cat("\n```\n")
```

## Extracting Annotations

Use the `put()` function to scan your files and extract workflow information:

```{r}
# Scan all R and Python files in a directory
workflow <- put("./src/")

# View the extracted workflow
print(workflow)
```

The output is a data frame where each row represents a workflow node:

| Column | Description |
|--------|-------------|
| `file_name` | Which script contains this node |
| `file_type` | Programming language (r, py, sql, etc.) |
| `id` | Unique identifier for the node |
| `label` | Human-readable description |
| `node_type` | Type of operation (input, process, output) |
| `input` | Files consumed by this step |
| `output` | Files produced by this step |

Custom properties you define are also included as additional columns.
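
As a hedged, end-to-end sketch of the table above, the snippet below writes one annotated script to a temporary directory and scans it. The directory name, the script contents, and the custom `owner` property are made up for illustration; only `put()` and the column names come from this guide.

```r
library(putior)

# Create a throwaway project with a single annotated script
demo_dir <- file.path(tempdir(), "putior_annotation_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(
  c(
    '# put id:"load", label:"Load Data", node_type:"input", output:"data.csv", owner:"data-team"',
    'data <- read.csv("source.csv")',
    '# put id:"clean", label:"Clean Data", input:"data.csv", output:"clean.csv"',
    'write.csv(na.omit(data), "clean.csv")'
  ),
  file.path(demo_dir, "pipeline.R")
)

# Scan the directory; each annotation should become one row
workflow <- put(demo_dir)

# Standard columns plus the custom `owner` property
workflow[, c("file_name", "id", "label", "input", "output", "owner")]
```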

## Complete Syntax Reference

### Basic Format

The general syntax for PUT annotations is:

    # put property1:"value1", property2:"value2", property3:"value3"

### Flexible Syntax Options

PUT annotations support several formats to fit different coding styles:

    # put id:"my_node", label:"My Process"          # Standard format (matches logo)
    #put id:"my_node", label:"My Process"           # Also valid (no space)
    # put| id:"my_node", label:"My Process"         # Pipe separator
    # put id:'my_node', label:'Single quotes'       # Single quotes
    # put id:"my_node", label:'Mixed quotes'        # Mixed quote styles
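
To get a feel for which comment lines count as annotations, the base R sketch below tests the variants above against a single regular expression. The pattern is an illustrative assumption, not putior's actual parser, which may accept or reject other forms.

```r
# Illustrative pattern only: "#", optional whitespace, "put",
# then either a pipe or whitespace. Not putior's real parser.
put_prefix <- "^\\s*#\\s*put(\\||\\s)"

lines <- c(
  '# put id:"my_node", label:"My Process"',
  '#put id:"my_node", label:"My Process"',
  '# put| id:"my_node", label:"My Process"',
  "# put id:'my_node', label:'Single quotes'",
  "# an ordinary comment about output files"
)

# TRUE for the four annotation variants, FALSE for the ordinary comment
is_put_line <- grepl(put_prefix, lines)
is_put_line
```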

### Multiline Annotations

For complex annotations with many properties, use backslash (`\`) continuation:

**R/Python style:**
```r
# put id:"complex_etl", \
#     label:"Complex ETL Process", \
#     node_type:"process", \
#     input:"raw_data.csv, config.yaml", \
#     output:"processed.parquet", \
#     author:"Data Team", \
#     version:"2.0"
```

**SQL style:**
```sql
--put id:"load_customers", \
--    label:"Load Customer Data", \
--    node_type:"input", \
--    output:"customers_table"
SELECT * FROM raw_customers;
```

**JavaScript/TypeScript style:**
```javascript
//put id:"api_handler", \
//    label:"Process API Request", \
//    input:"request.json", \
//    output:"response.json"
```

**Rules for multiline annotations:**

1. End each line (except the last) with a backslash `\`
2. Start continuation lines with the same comment prefix
3. Continuation lines can have leading whitespace for readability
4. Properties can span multiple lines
5. The backslash must be the last character on the line (no trailing spaces)
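
The rules above can be sketched in base R. `join_continuations()` is a hypothetical helper written for this guide, not a putior function, and putior's own handling may differ in detail. It shows why a trailing space after the backslash breaks the continuation: the line no longer ends with `\`.

```r
# Hypothetical helper: collapse backslash-continued comment lines
# into single-line annotations.
join_continuations <- function(lines, prefix = "#") {
  joined <- character(0)
  pending <- NULL
  for (line in lines) {
    if (!is.null(pending)) {
      # Continuation line: drop leading whitespace and the comment prefix
      body <- trimws(line)
      if (startsWith(body, prefix)) {
        body <- trimws(substring(body, nchar(prefix) + 1))
      }
      line <- paste(pending, body)
    }
    if (endsWith(line, "\\")) {
      # Backslash is the last character: strip it and keep accumulating
      pending <- trimws(substr(line, 1, nchar(line) - 1), which = "right")
    } else {
      joined <- c(joined, line)
      pending <- NULL
    }
  }
  joined
}

annotation <- c(
  '# put id:"complex_etl", \\',
  '#     label:"Complex ETL Process", \\',
  '#     node_type:"process"'
)
join_continuations(annotation)
```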

**Example with many properties:**
```r
# put id:"train_model", \
#     label:"Train Random Forest Model", \
#     node_type:"process", \
#     input:"features.csv, labels.csv", \
#     output:"model.rds, metrics.json", \
#     group:"machine_learning", \
#     stage:"3", \
#     estimated_time:"45min", \
#     memory_intensive:"true"
```

> **When Multiline Annotations Don't Work:**
>
> - **Trailing spaces**: Ensure backslash is the *last* character (no spaces after)
> - **Missing prefix**: Each continuation line needs the comment prefix (`#`, `--`, `//`)
> - **Fallback**: If a multiline annotation still fails to parse, write it as a single long line; a working annotation matters more than a readable one
> - **Debug**: Use `set_putior_log_level("DEBUG")` to see exactly how lines are being parsed

### Multi-Language Support

putior automatically uses the correct comment prefix based on file extension:

| Comment Style | Languages | Extensions |
| :--- | :--- | :--- |
| `# put` | R, Python, Shell, Julia, Ruby, YAML | `.R`, `.py`, `.sh`, `.jl`, `.rb`, `.yaml` |
| `-- put` | SQL, Lua, Haskell | `.sql`, `.lua`, `.hs` |
| `// put` | JavaScript, TypeScript, C, Java, Go, Rust | `.js`, `.ts`, `.c`, `.java`, `.go`, `.rs` |
| `% put` | MATLAB, LaTeX | `.m`, `.tex` |

**SQL Example:**

    -- query.sql
    --put id:"load_data", label:"Load Customer Data", output:"customers"
    SELECT * FROM customers WHERE active = 1;

**JavaScript Example:**

    // process.js
    //put id:"transform", label:"Transform JSON", input:"data.json", output:"output.json"
    const transformed = data.map(item => process(item));

**MATLAB Example:**

    % analysis.m
    %put id:"compute", label:"Statistical Analysis", input:"data.mat", output:"results.mat"
    results = compute_statistics(data);
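
If you want to check which prefix applies to a given file, the lookup in the table above can be mirrored in a few lines of base R. `comment_prefix_for()` and the lookup vector are hypothetical names for this sketch; putior performs its own detection internally, and the case-insensitive matching here is an assumption.

```r
# Extension-to-prefix lookup copied from the table above
prefix_by_extension <- c(
  r = "#", py = "#", sh = "#", jl = "#", rb = "#", yaml = "#",
  sql = "--", lua = "--", hs = "--",
  js = "//", ts = "//", c = "//", java = "//", go = "//", rs = "//",
  m = "%", tex = "%"
)

# Hypothetical helper: return the comment prefix for each path,
# or NA when the extension is not in the table
comment_prefix_for <- function(path) {
  ext <- tolower(tools::file_ext(path))
  unname(prefix_by_extension[ext])
}

comment_prefix_for(c("query.sql", "process.js", "analysis.m", "model.R"))
```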

### Block Comments

For languages with block comment support (JavaScript, TypeScript, C, C++, Java,
Go, Rust, and other `//`-prefix languages), PUT annotations can also appear
inside `/* ... */` and `/** ... */` block comments. Use a `*` line prefix:

**JSDoc-style (recommended for JS/TS):**

    /**
     * put id:"load", label:"Load Data", node_type:"input"
     */
    function loadData() { return fetch('/api/data'); }

**C-style block comment:**

    /*
     * put id:"init", label:"Initialize System"
     */
    void init() {}

**Single-line block comment:**

    /* put id:"quick", label:"Quick Operation" */
    const x = transform(data);

Multiple annotations can appear in one block:

    /**
     * put id:"step_a", label:"Step A"
     * put id:"step_b", label:"Step B"
     */

Both single-line (`//`) and block (`/* */`) annotations can coexist in the
same file. Languages without block comment syntax (R, Python, SQL, etc.)
continue to use their single-line prefix only.
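
As a rough illustration of how the `*` line prefix and the comment delimiters can be peeled away to expose the annotation text, here is a base R sketch. It is a simplification written for this guide, not putior's implementation: it does not track whether a line is really inside a block comment.

```r
js_lines <- c(
  "/**",
  ' * put id:"step_a", label:"Step A"',
  ' * put id:"step_b", label:"Step B"',
  " */",
  '/* put id:"quick", label:"Quick Operation" */',
  "const x = transform(data);"
)

# Remove a closing "*/" first, then a leading "/*", "/**", or "*"
stripped <- sub("\\s*\\*/\\s*$", "", js_lines)
stripped <- sub("^\\s*(/\\*+|\\*)\\s*", "", stripped)

# Keep only the lines that now start with the put keyword
annotations <- stripped[grepl("^put(\\||\\s)", stripped)]
annotations
```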

### Core Properties

While putior accepts any properties you define, these are commonly used:

| Property | Purpose | Example Values |
|----------|---------|----------------|
| `id` | Unique identifier | `"load_data"`, `"process_sales"` |
| `label` | Human description | `"Load Customer Data"` |
| `node_type` | Operation type | `"input"`, `"process"`, `"output"` |
| `input` | Input files | `"raw_data.csv"`, `"data/*.json"` |
| `output` | Output files | `"processed_data.csv"` |

### Standard Node Types

For consistency across projects, use these standard node types:

| Type | Mermaid Shape | Use For |
|------|---------------|---------|
| `input` | Stadium `([...])` | Data sources, file loading, API inputs |
| `process` | Rectangle `[...]` | Data transformation, analysis, computation (default) |
| `output` | Subroutine `[[...]]` | Report generation, data export, visualization |
| `decision` | Diamond `{...}` | Conditional logic, branching workflows |
| `start` | Stadium `([...])` | Workflow entry point (gets boundary styling) |
| `end` | Stadium `([...])` | Workflow exit point (gets boundary styling) |

> **`artifact`** nodes (cylinder shape) are automatically created by `put_diagram(show_artifacts = TRUE)` for data files referenced in `input`/`output` fields. You don't set `node_type:"artifact"` manually.

**Visual representation of node types:**

```{r node-types-diagram, echo=FALSE, results='asis', eval=TRUE}
library(putior)
node_types_workflow <- data.frame(
  file_name = rep("example.R", 4),
  id = c("load", "transform", "export", "check"),
  label = c("Load Data (input)", "Transform (process)", "Export (output)", "Validate? (decision)"),
  node_type = c("input", "process", "output", "decision"),
  input = c(NA, "raw", "clean", "clean"),
  output = c("raw", "clean", "report", "valid"),
  stringsAsFactors = FALSE
)
cat("```mermaid\n")
cat(put_diagram(node_types_workflow, theme = "github", show_artifacts = FALSE, output = "raw"))
cat("\n```\n")
```

### Custom Properties

Add any properties you need for visualization or metadata:

    # put id:"train_model", label:"Train ML Model", node_type:"process", color:"green", group:"machine_learning", duration:"45min", priority:"high"

putior stores custom properties as additional columns in the extracted data frame, where visualization tools or workflow management systems can read them.

## Advanced Usage

### Processing Individual Files

You can process single files instead of entire directories:

```{r}
# Process a single file
workflow <- put("./scripts/analysis.R")
```

### Recursive Directory Scanning

Include subdirectories in your scan:

```{r}
# Search subdirectories recursively
workflow <- put("./project/", recursive = TRUE)
```

### Custom File Patterns

Control which files are processed:

```{r}
# Only R files
workflow <- put("./src/", pattern = "\\.R$")

# R and SQL files only
workflow <- put("./src/", pattern = "\\.(R|sql)$")

# R, Python, SQL, shell, and Julia files
workflow <- put("./src/", pattern = "\\.(R|r|py|sql|sh|jl)$")
```
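
Before scanning, it can help to preview which file names a pattern would select. The sketch below assumes `pattern` is an ordinary regular expression matched against file names, as with `list.files()`; the file names are invented for illustration.

```r
files <- c("load.R", "clean.r", "model.py", "query.sql", "notes.Rmd", "run.sh")

# Only files ending in an uppercase ".R"
only_r <- files[grepl("\\.R$", files)]

# R and SQL files
r_and_sql <- files[grepl("\\.(R|sql)$", files)]

# The broader multi-language pattern; "notes.Rmd" is still excluded
# because "$" anchors the extension at the end of the name
multi_lang <- files[grepl("\\.(R|r|py|sql|sh|jl)$", files)]

only_r
r_and_sql
multi_lang
```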

### Including Line Numbers

For debugging annotation issues, include line numbers:

```{r}
# Include line numbers for debugging
workflow <- put("./src/", include_line_numbers = TRUE)
```

### Validation Control

Control annotation validation:

```{r}
# Enable validation (default) - provides helpful warnings
workflow <- put("./src/", validate = TRUE)

# Disable validation warnings
workflow <- put("./src/", validate = FALSE)
```

### Automatic ID Generation

If you omit the `id` field, putior automatically generates a UUID for the node:

```{r}
# Annotations without explicit IDs get auto-generated UUIDs
# put label:"Load Data", node_type:"input", output:"data.csv"
# put label:"Process Data", node_type:"process", input:"data.csv", output:"clean.csv"

# Extract workflow - IDs will be auto-generated
workflow <- put("./")
print(workflow$id)  # Will show UUIDs like "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
```

Note: If you provide an empty `id` (e.g., `id:""`), you'll get a validation warning.

### Automatic Output Defaulting

If you omit the `output` field, putior automatically uses the file name as the output:

```{r}
# In process_data.R:
# put label:"Process Step", node_type:"process", input:"raw.csv"
# No output specified - will default to "process_data.R"

# In analyze_data.R:
# put label:"Analyze", node_type:"process", input:"process_data.R", output:"results.csv"
# This creates a connection from process_data.R to analyze_data.R
```

This feature ensures that scripts can be connected in workflows even when explicit output files aren't specified.

### Tracking Source Relationships

When you have scripts that source other scripts, use this annotation pattern:

```{r}
# In main.R (sources other scripts):
# put label:"Main Analysis", input:"load_data.R,process_data.R", output:"report.pdf"
source("load_data.R")    # Reading load_data.R into main.R
source("process_data.R") # Reading process_data.R into main.R

# In load_data.R (sourced by main.R):
# put label:"Data Loader", node_type:"input"
# output defaults to "load_data.R"

# In process_data.R (sourced by main.R, depends on load_data.R):
# put label:"Data Processor", input:"load_data.R"
# output defaults to "process_data.R"
```

This correctly shows the flow: sourced scripts are **inputs** to the main script.
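
The matching idea behind these connections can be sketched in base R: a node is linked to every node whose `output` appears in its `input`. The data frame below is a hand-built stand-in for the three annotations above, and the edge-building code is an illustration of the concept rather than putior's implementation.

```r
# Stand-in for the extracted workflow, with default outputs filled in
nodes <- data.frame(
  id = c("main", "loader", "processor"),
  input = c("load_data.R,process_data.R", NA, "load_data.R"),
  output = c("report.pdf", "load_data.R", "process_data.R"),
  stringsAsFactors = FALSE
)

# For each node, find the nodes that produce any of its inputs
edge_list <- lapply(seq_len(nrow(nodes)), function(i) {
  if (is.na(nodes$input[i])) return(NULL)
  inputs <- trimws(strsplit(nodes$input[i], ",")[[1]])
  producers <- nodes$id[nodes$output %in% inputs]
  if (length(producers) == 0) return(NULL)
  data.frame(from = producers, to = nodes$id[i], stringsAsFactors = FALSE)
})
edges <- do.call(rbind, Filter(Negate(is.null), edge_list))
edges
```

Both sourced scripts point at `main`, and `loader` also feeds `processor`.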

## Variable References with `.internal` Extension

putior supports tracking in-memory variables and objects using the `.internal` extension. This is useful for documenting computational steps within scripts while maintaining clear data flow between scripts.

### Key Concepts

**`.internal` variables:**

- Represent in-memory objects during script execution
- Can only be **outputs**, never inputs between scripts
- Help document what variables are created within each script
- Example: `my_data.internal` represents a variable named `my_data`

**Persistent files:**

- Enable actual data flow between scripts
- Can be both inputs and outputs
- Required for connected workflows
- Example: `my_data.RData`, `results.csv`

### Correct Usage Pattern

```{r eval=FALSE}
# Script 1: Create variable and save it
# put id:"create_data", output:"dataset.internal, dataset.RData"
dataset <- data.frame(x = 1:100, y = rnorm(100))
save(dataset, file = "dataset.RData")

# Script 2: Load data and create new variables
# put id:"analyze_data", input:"dataset.RData", output:"analysis.internal, summary.txt"
load("dataset.RData")  # Load the persistent file (NOT dataset.internal)
analysis <- summary(dataset)  # Create new in-memory variable
writeLines(capture.output(analysis), "summary.txt")
```

### What NOT to Do

```{r eval=FALSE}
# INCORRECT: Using .internal as input between scripts
# put input:"dataset.internal"  # This is wrong!

# CORRECT: Use persistent files as inputs
# put input:"dataset.RData"     # This is correct!
```
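
A quick way to catch this mistake is to scan the `input` column for `.internal` entries. The check below is a base R sketch on a hand-built data frame standing in for `put()` output; the `bad_step` row is invented to trigger the check.

```r
workflow <- data.frame(
  id = c("create_data", "analyze_data", "bad_step"),
  input = c(NA, "dataset.RData", "dataset.internal, config.yaml"),
  stringsAsFactors = FALSE
)

# TRUE when any comma-separated input ends in ".internal"
has_internal_input <- vapply(workflow$input, function(x) {
  if (is.na(x)) return(FALSE)
  any(grepl("\\.internal$", trimws(strsplit(x, ",")[[1]])))
}, logical(1), USE.NAMES = FALSE)

# Nodes that should switch to a persistent file as input
offending_ids <- workflow$id[has_internal_input]
offending_ids
```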

### Complete Example

Try the comprehensive variable reference example:

```{r eval=FALSE}
source(system.file("examples", "variable-reference-example.R", package = "putior"))
```

This creates a connected 4-script workflow demonstrating proper `.internal` usage and file-based data flow.

## Real-World Example

Let's walk through a complete data science workflow:

### 1. Data Collection (Python)

    # 01_collect_data.py
    # put id:"fetch_api_data", label:"Fetch Data from API", node_type:"input", output:"raw_api_data.json"

    import requests
    import json

    response = requests.get("https://api.example.com/sales")
    data = response.json()

    with open("raw_api_data.json", "w") as f:
        json.dump(data, f)

### 2. Data Processing (R)

    # 02_process_data.R
    # put id:"clean_api_data", label:"Clean and Structure Data", node_type:"process", input:"raw_api_data.json", output:"processed_sales.csv"

    library(jsonlite)
    library(dplyr)

    # Load raw data
    raw_data <- fromJSON("raw_api_data.json")

    # Process and clean
    processed <- raw_data %>%
      filter(!is.na(sale_amount)) %>%
      mutate(
        sale_date = as.Date(sale_date),
        sale_amount = as.numeric(sale_amount)
      ) %>%
      arrange(sale_date)

    # Save processed data
    write.csv(processed, "processed_sales.csv", row.names = FALSE)

### 3. Analysis and Reporting (R)

    # 03_analyze_report.R
    # put id:"sales_analysis", label:"Perform Sales Analysis", node_type:"process", input:"processed_sales.csv", output:"analysis_results.rds"
    # put id:"generate_report", label:"Generate HTML Report", node_type:"output", input:"analysis_results.rds", output:"sales_report.html"

    library(dplyr)

    # Load processed data
    sales_data <- read.csv("processed_sales.csv")

    # Perform analysis
    analysis_results <- list(
      total_sales = sum(sales_data$sale_amount),
      monthly_trends = sales_data %>%
        group_by(month = format(as.Date(sale_date), "%Y-%m")) %>%
        summarise(monthly_total = sum(sale_amount)),
      top_products = sales_data %>%
        group_by(product) %>%
        summarise(product_sales = sum(sale_amount)) %>%
        arrange(desc(product_sales)) %>%
        head(10)
    )

    # Save analysis
    saveRDS(analysis_results, "analysis_results.rds")

    # Generate report
    rmarkdown::render("report_template.Rmd",
                      output_file = "sales_report.html")

### 4. Extract the Complete Workflow

```{r}
# Extract workflow from all files
complete_workflow <- put("./sales_project/", recursive = TRUE)
print(complete_workflow)
```

The extracted data frame traces the complete data flow: API → `raw_api_data.json` → `processed_sales.csv` → `analysis_results.rds` → `sales_report.html`.

## Best Practices

### 1. Use Descriptive Names

Choose clear, descriptive names that explain what each step does:

    # Good
    # put id:"load_customer_transactions", label:"Load Customer Transaction Data"
    # put id:"calculate_monthly_revenue", label:"Calculate Monthly Revenue Totals"

    # Less descriptive
    # put id:"step1", label:"Load data"
    # put id:"process", label:"Do calculations"

### 2. Document Data Dependencies

Always specify inputs and outputs for data processing steps:

    # put id:"merge_datasets", label:"Merge Customer and Transaction Data", input:"customers.csv,transactions.csv", output:"merged_data.csv"

### 3. Use Consistent Node Types

Stick to a standard set of node types across your team:

    # put id:"load_raw_data", label:"Load Raw Sales Data", node_type:"input"
    # put id:"clean_data", label:"Clean and Validate", node_type:"process"
    # put id:"export_results", label:"Export Final Results", node_type:"output"

### 4. Add Helpful Metadata

Include metadata that helps with workflow understanding:

    # put id:"train_model", label:"Train Random Forest Model", node_type:"process", estimated_time:"30min", requires:"tidymodels", memory_intensive:"true"

### 5. Group Related Operations

Use grouping properties to organize complex workflows:

    # put id:"feature_engineering", label:"Engineer Features", group:"preprocessing", stage:"1"
    # put id:"model_training", label:"Train Model", group:"modeling", stage:"2"
    # put id:"model_evaluation", label:"Evaluate Model", group:"modeling", stage:"3"
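
Because custom properties become columns, grouping properties can be filtered and sorted like any other data. The sketch below uses a hand-built data frame in place of `put()` output and assumes property values arrive as character strings, as they are written in the annotation.

```r
# Hand-built stand-in for the data frame that put() would return
workflow <- data.frame(
  id = c("feature_engineering", "model_training", "model_evaluation"),
  label = c("Engineer Features", "Train Model", "Evaluate Model"),
  group = c("preprocessing", "modeling", "modeling"),
  stage = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Convert stage to integer so it sorts numerically rather than as text
workflow$stage <- as.integer(workflow$stage)

# Keep only the modeling steps, in stage order
modeling_steps <- workflow[workflow$group == "modeling", ]
modeling_steps <- modeling_steps[order(modeling_steps$stage), ]
modeling_steps$id

# Count steps per group
table(workflow$group)
```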

## Troubleshooting

Having issues with annotations? See the [Troubleshooting Guide](troubleshooting.html) for:

- **[Most Common Issues](troubleshooting.html#most-common-issues)** - Start here for quick solutions
- **[Annotation Syntax Errors](troubleshooting.html#annotation-syntax-errors)** - Quote mismatches, invalid properties
- **[File Pattern Matching](troubleshooting.html#file-pattern-matching-issues)** - Files not being scanned
- **[Debugging with Logging](troubleshooting.html#debugging-with-logging)** - Enable detailed output

**Quick diagnostic:**
```{r}
# Test if your annotation is valid
is_valid_put_annotation('# put id:"test", label:"Test Node"')  # Should be TRUE
```

## See Also

| Guide | Description |
|-------|-------------|
| [Quick Start](quick-start.html) | First diagram in 2 minutes |
| [Features Tour](features-tour.html) | Auto-detection, themes, logging |
| [API Reference](api-reference.html) | Function documentation |
| [Showcase](showcase.html) | Real-world examples |
| [Quick Reference](quick-reference.html) | At-a-glance reference card |
| [Troubleshooting](troubleshooting.html) | Common issues and solutions |
| [AI Integration](ai-integration.html) | MCP/ACP integration guide |

**Built-in examples:**

```{r eval=FALSE}
# Complete workflow example
source(system.file("examples", "reprex.R", package = "putior"))

# Variable reference example
source(system.file("examples", "variable-reference-example.R", package = "putior"))

# Interactive diagrams example
source(system.file("examples", "interactive-diagrams-example.R", package = "putior"))
```

**Function help:**

- `?put` - Extract annotations from files
- `?put_diagram` - Generate Mermaid diagrams
- `?put_auto` - Auto-detect workflow from code