---
title: "Annotation Guide"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Annotation Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

```{r setup}
library(putior)
```

## Introduction

This guide provides a complete reference for PUT annotation syntax. It covers all annotation formats, multi-language support, multiline annotations, and best practices.

> **New to putior?** Start with the [Quick Start](quick-start.html) guide to create your first diagram in 2 minutes.

**putior** stands for **PUT** + **I**nput + **O**utput + **R**, reflecting the package's core purpose: tracking data inputs and outputs through your analysis pipeline using special annotations.

## Annotation Basics

PUT annotations are special comments that describe workflow nodes. Start simple:

**Minimal annotation (just a label):**

    # put label:"Load Data"

That's all you need! putior will:

- Auto-generate a unique ID
- Default `node_type` to `"process"`
- Default `output` to the filename

**Add more detail as needed:**

    # put label:"Load Data", node_type:"input", output:"data.csv"

**Full R script example:**

    # data_processing.R
    # put label:"Load Customer Data", node_type:"input", output:"raw_data.csv"

    library(dplyr)  # provides %>%, filter(), and mutate() used below

    # Your actual code
    data <- read.csv("customer_data.csv")
    write.csv(data, "raw_data.csv")

    # put label:"Clean and Validate", input:"raw_data.csv", output:"clean_data.csv"

    # Data cleaning code
    cleaned_data <- data %>%
      filter(!is.na(customer_id)) %>%
      mutate(purchase_date = as.Date(purchase_date))

    write.csv(cleaned_data, "clean_data.csv")

**Python script example:**

    # analysis.py
    # put id:"analyze_sales", label:"Sales Analysis", node_type:"process", input:"clean_data.csv", output:"sales_report.json"

    import pandas as pd
    import json

    # Load cleaned data
    data = pd.read_csv("clean_data.csv")

    # Perform analysis (float() turns a NumPy integer sum into a value json can serialize)
    sales_summary = {
        "total_sales": float(data["amount"].sum()),
        "avg_order":
        data["amount"].mean(),
        "customer_count": data["customer_id"].nunique()
    }

    # Save results
    with open("sales_report.json", "w") as f:
        json.dump(sales_summary, f)

**Resulting diagram from both files:**

```{r multi-file-diagram, echo=FALSE, results='asis', eval=TRUE}
library(putior)
multi_file_workflow <- data.frame(
  file_name = c("data_processing.R", "data_processing.R", "analysis.py"),
  id = c("load_data", "clean_data", "analyze_sales"),
  label = c("Load Customer Data", "Clean and Validate", "Sales Analysis"),
  node_type = c("input", "process", "process"),
  input = c(NA, "raw_data.csv", "clean_data.csv"),
  output = c("raw_data.csv", "clean_data.csv", "sales_report.json"),
  stringsAsFactors = FALSE
)
cat("```mermaid\n")
cat(put_diagram(multi_file_workflow, theme = "github", output = "raw"))
cat("\n```\n")
```

## Extracting Annotations

Use the `put()` function to scan your files and extract workflow information:

```{r}
# Scan all R and Python files in a directory
workflow <- put("./src/")

# View the extracted workflow
print(workflow)
```

The output is a data frame where each row represents a workflow node:

| Column | Description |
|--------|-------------|
| `file_name` | Which script contains this node |
| `file_type` | Programming language (r, py, sql, etc.) |
| `id` | Unique identifier for the node |
| `label` | Human-readable description |
| `node_type` | Type of operation (input, process, output) |
| `input` | Files consumed by this step |
| `output` | Files produced by this step |

Custom properties you define are also included as additional columns.
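
Because the result is an ordinary data frame, standard R tools apply to it. The sketch below does not call putior at all: it uses a hand-built data frame with the columns listed above as a stand-in for real `put()` output, and pairs each node with the node that consumes its output. It assumes a single file per `input`/`output` value; comma-separated lists would need to be split first.

```r
# Hand-built stand-in for the data frame that put() returns (illustrative values)
workflow <- data.frame(
  file_name = c("data_processing.R", "data_processing.R", "analysis.py"),
  id        = c("load_data", "clean_data", "analyze_sales"),
  node_type = c("input", "process", "process"),
  input     = c(NA, "raw_data.csv", "clean_data.csv"),
  output    = c("raw_data.csv", "clean_data.csv", "sales_report.json"),
  stringsAsFactors = FALSE
)

# Select the nodes of one type
inputs_only <- workflow[workflow$node_type == "input", ]

# Pair each producer with its consumer by matching output to input
edges <- merge(
  workflow[, c("id", "output")],
  workflow[!is.na(workflow$input), c("id", "input")],
  by.x = "output", by.y = "input",
  suffixes = c("_from", "_to")
)

print(edges)
```

`edges` holds one row per producer/consumer pair, which can help when checking that every step is connected as intended.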
## Complete Syntax Reference

### Basic Format

The general syntax for PUT annotations is:

    # put property1:"value1", property2:"value2", property3:"value3"

### Flexible Syntax Options

PUT annotations support several formats to fit different coding styles:

    # put id:"my_node", label:"My Process"      # Standard format (matches logo)
    #put id:"my_node", label:"My Process"       # Also valid (no space)
    # put| id:"my_node", label:"My Process"     # Pipe separator
    # put id:'my_node', label:'Single quotes'   # Single quotes
    # put id:"my_node", label:'Mixed quotes'    # Mixed quote styles

### Multiline Annotations

For complex annotations with many properties, use backslash (`\`) continuation:

**R/Python style:**

```r
# put id:"complex_etl", \
#     label:"Complex ETL Process", \
#     node_type:"process", \
#     input:"raw_data.csv, config.yaml", \
#     output:"processed.parquet", \
#     author:"Data Team", \
#     version:"2.0"
```

**SQL style:**

```sql
--put id:"load_customers", \
--    label:"Load Customer Data", \
--    node_type:"input", \
--    output:"customers_table"
SELECT * FROM raw_customers;
```

**JavaScript/TypeScript style:**

```javascript
//put id:"api_handler", \
//    label:"Process API Request", \
//    input:"request.json", \
//    output:"response.json"
```

**Rules for multiline annotations:**

1. End each line (except the last) with a backslash `\`
2. Start continuation lines with the same comment prefix
3. Continuation lines can have leading whitespace for readability
4. Properties can span multiple lines
5.
   The backslash must be the last character on the line (no trailing spaces)

**Example with many properties:**

```r
# put id:"train_model", \
#     label:"Train Random Forest Model", \
#     node_type:"process", \
#     input:"features.csv, labels.csv", \
#     output:"model.rds, metrics.json", \
#     group:"machine_learning", \
#     stage:"3", \
#     estimated_time:"45min", \
#     memory_intensive:"true"
```

> **When Multiline Annotations Don't Work:**
>
> - **Trailing spaces**: Ensure backslash is the *last* character (no spaces after)
> - **Missing prefix**: Each continuation line needs the comment prefix (`#`, `--`, `//`)
> - **Fallback**: If multiline fails, use a single long line - readability is secondary to functionality
> - **Debug**: Use `set_putior_log_level("DEBUG")` to see exactly how lines are being parsed

### Multi-Language Support

putior automatically uses the correct comment prefix based on file extension:

| Comment Style | Languages | Extensions |
| :--- | :--- | :--- |
| `# put` | R, Python, Shell, Julia, Ruby, YAML | `.R`, `.py`, `.sh`, `.jl`, `.rb`, `.yaml` |
| `-- put` | SQL, Lua, Haskell | `.sql`, `.lua`, `.hs` |
| `// put` | JavaScript, TypeScript, C, Java, Go, Rust | `.js`, `.ts`, `.c`, `.java`, `.go`, `.rs` |
| `% put` | MATLAB, LaTeX | `.m`, `.tex` |

**SQL Example:**

    -- query.sql
    --put id:"load_data", label:"Load Customer Data", output:"customers"
    SELECT * FROM customers WHERE active = 1;

**JavaScript Example:**

    // process.js
    //put id:"transform", label:"Transform JSON", input:"data.json", output:"output.json"
    const transformed = data.map(item => process(item));

**MATLAB Example:**

    % analysis.m
    %put id:"compute", label:"Statistical Analysis", input:"data.mat", output:"results.mat"
    results = compute_statistics(data);

### Block Comments

For languages with block comment support (JavaScript, TypeScript, C, C++, Java, Go, Rust, and other `//`-prefix languages), PUT annotations can also appear inside `/* ... */` and `/** ... */` block comments.
Use a `*` line prefix:

**JSDoc-style (recommended for JS/TS):**

    /**
     * put id:"load", label:"Load Data", node_type:"input"
     */
    function loadData() {
      return fetch('/api/data');
    }

**C-style block comment:**

    /*
     * put id:"init", label:"Initialize System"
     */
    void init() {}

**Single-line block comment:**

    /* put id:"quick", label:"Quick Operation" */
    const x = transform(data);

Multiple annotations can appear in one block:

    /**
     * put id:"step_a", label:"Step A"
     * put id:"step_b", label:"Step B"
     */

Both single-line (`//`) and block (`/* */`) annotations can coexist in the same file. Languages without block comment syntax (R, Python, SQL, etc.) continue to use their single-line prefix only.

### Core Properties

While putior accepts any properties you define, these are commonly used:

| Property | Purpose | Example Values |
|----------|---------|----------------|
| `id` | Unique identifier | `"load_data"`, `"process_sales"` |
| `label` | Human description | `"Load Customer Data"` |
| `node_type` | Operation type | `"input"`, `"process"`, `"output"` |
| `input` | Input files | `"raw_data.csv"`, `"data/*.json"` |
| `output` | Output files | `"processed_data.csv"` |

### Standard Node Types

For consistency across projects, use these standard node types:

| Type | Mermaid Shape | Use For |
|------|---------------|---------|
| `input` | Stadium `([...])` | Data sources, file loading, API inputs |
| `process` | Rectangle `[...]` | Data transformation, analysis, computation (default) |
| `output` | Subroutine `[[...]]` | Report generation, data export, visualization |
| `decision` | Diamond `{...}` | Conditional logic, branching workflows |
| `start` | Stadium `([...])` | Workflow entry point (gets boundary styling) |
| `end` | Stadium `([...])` | Workflow exit point (gets boundary styling) |

> **`artifact`** nodes (cylinder shape) are automatically created by `put_diagram(show_artifacts = TRUE)` for data files referenced in `input`/`output` fields.
You don't set `node_type:"artifact"` manually.

**Visual representation of node types:**

```{r node-types-diagram, echo=FALSE, results='asis', eval=TRUE}
library(putior)
node_types_workflow <- data.frame(
  file_name = rep("example.R", 4),
  id = c("load", "transform", "export", "check"),
  label = c("Load Data (input)", "Transform (process)", "Export (output)", "Validate? (decision)"),
  node_type = c("input", "process", "output", "decision"),
  input = c(NA, "raw", "clean", "clean"),
  output = c("raw", "clean", "report", "valid"),
  stringsAsFactors = FALSE
)
cat("```mermaid\n")
cat(put_diagram(node_types_workflow, theme = "github", show_artifacts = FALSE, output = "raw"))
cat("\n```\n")
```

### Custom Properties

Add any properties you need for visualization or metadata:

    # put id:"train_model", label:"Train ML Model", node_type:"process", color:"green", group:"machine_learning", duration:"45min", priority:"high"

These custom properties can be used by visualization tools or workflow management systems.
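
One way to put custom properties to work is to treat them as ordinary columns of the extracted data frame. The following sketch uses a hand-built data frame in place of real `put()` output; the `group` and `stage` columns are assumed to arrive as character strings, since annotation values are written as quoted text.

```r
# Hand-built stand-in for put() output carrying two custom properties
workflow <- data.frame(
  id    = c("model_evaluation", "feature_engineering", "model_training"),
  label = c("Evaluate Model", "Engineer Features", "Train Model"),
  group = c("modeling", "preprocessing", "modeling"),
  stage = c("3", "1", "2"),
  stringsAsFactors = FALSE
)

# Order nodes by stage; convert to integer so "10" would sort after "2"
ordered <- workflow[order(as.integer(workflow$stage)), ]

# Collect node IDs per group, keeping the stage order within each group
ids_by_group <- split(ordered$id, ordered$group)

print(ids_by_group)
```

Keeping such metadata in the annotation means it travels with the script, and any downstream tool can read it from the same data frame.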
## Advanced Usage

### Processing Individual Files

You can process single files instead of entire directories:

```{r}
# Process a single file
workflow <- put("./scripts/analysis.R")
```

### Recursive Directory Scanning

Include subdirectories in your scan:

```{r}
# Search subdirectories recursively
workflow <- put("./project/", recursive = TRUE)
```

### Custom File Patterns

Control which files are processed:

```{r}
# Only R files
workflow <- put("./src/", pattern = "\\.R$")

# R and SQL files only
workflow <- put("./src/", pattern = "\\.(R|sql)$")

# All supported file types (default)
workflow <- put("./src/", pattern = "\\.(R|r|py|sql|sh|jl)$")
```

### Including Line Numbers

For debugging annotation issues, include line numbers:

```{r}
# Include line numbers for debugging
workflow <- put("./src/", include_line_numbers = TRUE)
```

### Validation Control

Control annotation validation:

```{r}
# Enable validation (default) - provides helpful warnings
workflow <- put("./src/", validate = TRUE)

# Disable validation warnings
workflow <- put("./src/", validate = FALSE)
```

### Automatic ID Generation

If you omit the `id` field, putior automatically generates a UUID for the node:

```{r}
# Annotations without explicit IDs get auto-generated UUIDs
# put label:"Load Data", node_type:"input", output:"data.csv"
# put label:"Process Data", node_type:"process", input:"data.csv", output:"clean.csv"

# Extract workflow - IDs will be auto-generated
workflow <- put("./")
print(workflow$id)  # Will show UUIDs like "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
```

Note: If you provide an empty `id` (e.g., `id:""`), you'll get a validation warning.
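
When a project mixes hand-written and generated IDs, it can be handy to tell them apart. The sketch below assumes generated IDs follow the 8-4-4-4-12 hexadecimal layout shown in the comment above; that layout is inferred from the example, not from putior's documentation, and the `ids` vector is made up for illustration.

```r
# Illustrative IDs: one UUID-shaped, one hand-written, one empty
ids <- c("a1b2c3d4-e5f6-7890-abcd-ef1234567890", "load_data", "")

# Assumed layout of a generated ID: 8-4-4-4-12 hexadecimal digits
uuid_pattern <- paste0(
  "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-",
  "[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)

looks_generated <- grepl(uuid_pattern, ids)  # TRUE where the ID is UUID-shaped
is_empty <- !nzchar(ids)                     # TRUE where the ID is an empty string

print(data.frame(id = ids, looks_generated, is_empty))
```

Nodes that are referenced from elsewhere are usually easier to work with when they carry an explicit, readable `id`.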
### Automatic Output Defaulting

If you omit the `output` field, putior automatically uses the file name as the output:

```{r}
# In process_data.R:
# put label:"Process Step", node_type:"process", input:"raw.csv"
# No output specified - will default to "process_data.R"

# In analyze_data.R:
# put label:"Analyze", node_type:"process", input:"process_data.R", output:"results.csv"
# This creates a connection from process_data.R to analyze_data.R
```

This feature ensures that scripts can be connected in workflows even when explicit output files aren't specified.

### Tracking Source Relationships

When you have scripts that source other scripts, use this annotation pattern:

```{r}
# In main.R (sources other scripts):
# put label:"Main Analysis", input:"load_data.R,process_data.R", output:"report.pdf"
source("load_data.R")     # Reading load_data.R into main.R
source("process_data.R")  # Reading process_data.R into main.R

# In load_data.R (sourced by main.R):
# put label:"Data Loader", node_type:"input"
# output defaults to "load_data.R"

# In process_data.R (sourced by main.R, depends on load_data.R):
# put label:"Data Processor", input:"load_data.R"
# output defaults to "process_data.R"
```

This correctly shows the flow: sourced scripts are **inputs** to the main script.

## Variable References with `.internal` Extension

putior supports tracking in-memory variables and objects using the `.internal` extension. This is useful for documenting computational steps within scripts while maintaining clear data flow between scripts.
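
To make the naming convention concrete, here is a small base-R sketch that separates in-memory names from persistent files in a comma-separated `output` value. `split_outputs()` is a hypothetical helper written for this guide, not a putior function, and the comma-separated form is assumed from the annotation examples in this guide.

```r
# Hypothetical helper (not part of putior): classify the entries of an
# output value by whether they end in ".internal"
split_outputs <- function(output) {
  parts <- trimws(strsplit(output, ",", fixed = TRUE)[[1]])
  is_internal <- grepl("\\.internal$", parts)
  list(
    in_memory = sub("\\.internal$", "", parts[is_internal]),  # variable names
    files     = parts[!is_internal]                           # persistent files
  )
}

res <- split_outputs("dataset.internal, dataset.RData")
print(res)
```

Here `res$in_memory` holds the variable name `dataset`, while `res$files` holds `dataset.RData`, the only entry another script could read.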
### Key Concepts

**`.internal` variables:**

- Represent in-memory objects during script execution
- Can only be **outputs**, never inputs between scripts
- Help document what variables are created within each script
- Example: `my_data.internal` represents a variable named `my_data`

**Persistent files:**

- Enable actual data flow between scripts
- Can be both inputs and outputs
- Required for connected workflows
- Example: `my_data.RData`, `results.csv`

### Correct Usage Pattern

```{r eval=FALSE}
# Script 1: Create variable and save it
# put id:"create_data", output:"dataset.internal, dataset.RData"
dataset <- data.frame(x = 1:100, y = rnorm(100))
save(dataset, file = "dataset.RData")

# Script 2: Load data and create new variables
# put id:"analyze_data", input:"dataset.RData", output:"analysis.internal, summary.txt"
load("dataset.RData")         # Load the persistent file (NOT dataset.internal)
analysis <- summary(dataset)  # Create new in-memory variable
writeLines(capture.output(analysis), "summary.txt")
```

### What NOT to Do

```{r eval=FALSE}
# INCORRECT: Using .internal as input between scripts
# put input:"dataset.internal"  # This is wrong!

# CORRECT: Use persistent files as inputs
# put input:"dataset.RData"     # This is correct!
```

### Complete Example

Try the comprehensive variable reference example:

```{r eval=FALSE}
source(system.file("examples", "variable-reference-example.R", package = "putior"))
```

This creates a connected 4-script workflow demonstrating proper `.internal` usage and file-based data flow.

## Real-World Example

Let's walk through a complete data science workflow:

### 1. Data Collection (Python)

    # 01_collect_data.py
    # put id:"fetch_api_data", label:"Fetch Data from API", node_type:"input", output:"raw_api_data.json"

    import requests
    import json

    response = requests.get("https://api.example.com/sales")
    data = response.json()

    with open("raw_api_data.json", "w") as f:
        json.dump(data, f)

### 2.
Data Processing (R)

    # 02_process_data.R
    # put id:"clean_api_data", label:"Clean and Structure Data", node_type:"process", input:"raw_api_data.json", output:"processed_sales.csv"

    library(jsonlite)
    library(dplyr)

    # Load raw data
    raw_data <- fromJSON("raw_api_data.json")

    # Process and clean
    processed <- raw_data %>%
      filter(!is.na(sale_amount)) %>%
      mutate(
        sale_date = as.Date(sale_date),
        sale_amount = as.numeric(sale_amount)
      ) %>%
      arrange(sale_date)

    # Save processed data
    write.csv(processed, "processed_sales.csv", row.names = FALSE)

### 3. Analysis and Reporting (R)

    # 03_analyze_report.R
    # put id:"sales_analysis", label:"Perform Sales Analysis", node_type:"process", input:"processed_sales.csv", output:"analysis_results.rds"
    # put id:"generate_report", label:"Generate HTML Report", node_type:"output", input:"analysis_results.rds", output:"sales_report.html"

    library(dplyr)

    # Load processed data
    sales_data <- read.csv("processed_sales.csv")

    # Perform analysis
    # read.csv() returns sale_date as text, so convert it before formatting
    analysis_results <- list(
      total_sales = sum(sales_data$sale_amount),
      monthly_trends = sales_data %>%
        group_by(month = format(as.Date(sale_date), "%Y-%m")) %>%
        summarise(monthly_total = sum(sale_amount)),
      top_products = sales_data %>%
        group_by(product) %>%
        summarise(product_sales = sum(sale_amount)) %>%
        arrange(desc(product_sales)) %>%
        head(10)
    )

    # Save analysis
    saveRDS(analysis_results, "analysis_results.rds")

    # Generate report
    rmarkdown::render("report_template.Rmd", output_file = "sales_report.html")

### 4. Extract the Complete Workflow

```{r}
# Extract workflow from all files
complete_workflow <- put("./sales_project/", recursive = TRUE)
print(complete_workflow)
```

This would show the complete data flow: API → JSON → CSV → Analysis → Report

## Best Practices

### 1.
Use Descriptive Names

Choose clear, descriptive names that explain what each step does:

    # Good
    # put id:"load_customer_transactions", label:"Load Customer Transaction Data"
    # put id:"calculate_monthly_revenue", label:"Calculate Monthly Revenue Totals"

    # Less descriptive
    # put id:"step1", label:"Load data"
    # put id:"process", label:"Do calculations"

### 2. Document Data Dependencies

Always specify inputs and outputs for data processing steps:

    # put id:"merge_datasets", label:"Merge Customer and Transaction Data", input:"customers.csv,transactions.csv", output:"merged_data.csv"

### 3. Use Consistent Node Types

Stick to a standard set of node types across your team:

    # put id:"load_raw_data", label:"Load Raw Sales Data", node_type:"input"
    # put id:"clean_data", label:"Clean and Validate", node_type:"process"
    # put id:"export_results", label:"Export Final Results", node_type:"output"

### 4. Add Helpful Metadata

Include metadata that helps with workflow understanding:

    # put id:"train_model", label:"Train Random Forest Model", node_type:"process", estimated_time:"30min", requires:"tidymodels", memory_intensive:"true"

### 5. Group Related Operations

Use grouping properties to organize complex workflows:

    # put id:"feature_engineering", label:"Engineer Features", group:"preprocessing", stage:"1"
    # put id:"model_training", label:"Train Model", group:"modeling", stage:"2"
    # put id:"model_evaluation", label:"Evaluate Model", group:"modeling", stage:"3"

## Troubleshooting

Having issues with annotations?
See the [Troubleshooting Guide](troubleshooting.html) for:

- **[Most Common Issues](troubleshooting.html#most-common-issues)** - Start here for quick solutions
- **[Annotation Syntax Errors](troubleshooting.html#annotation-syntax-errors)** - Quote mismatches, invalid properties
- **[File Pattern Matching](troubleshooting.html#file-pattern-matching-issues)** - Files not being scanned
- **[Debugging with Logging](troubleshooting.html#debugging-with-logging)** - Enable detailed output

**Quick diagnostic:**

```{r}
# Test if your annotation is valid
is_valid_put_annotation('# put id:"test", label:"Test Node"')  # Should be TRUE
```

## See Also

| Guide | Description |
|-------|-------------|
| [Quick Start](quick-start.html) | First diagram in 2 minutes |
| [Features Tour](features-tour.html) | Auto-detection, themes, logging |
| [API Reference](api-reference.html) | Function documentation |
| [Showcase](showcase.html) | Real-world examples |
| [Quick Reference](quick-reference.html) | At-a-glance reference card |
| [Troubleshooting](troubleshooting.html) | Common issues and solutions |
| [AI Integration](ai-integration.html) | MCP/ACP integration guide |

**Built-in examples:**

```{r eval=FALSE}
# Complete workflow example
source(system.file("examples", "reprex.R", package = "putior"))

# Variable reference example
source(system.file("examples", "variable-reference-example.R", package = "putior"))

# Interactive diagrams example
source(system.file("examples", "interactive-diagrams-example.R", package = "putior"))
```

**Function help:**

- `?put` - Extract annotations from files
- `?put_diagram` - Generate Mermaid diagrams
- `?put_auto` - Auto-detect workflow from code