Complete guide to all RocketMouse AI features
RocketMouse AI is a visual editor for building RPA macros using Scratch-style puzzle blocks. Drag and snap blocks together to automate mouse actions, keyboard input, browser interactions, Excel operations, image recognition, AI integration, and more — all without writing code.
The app uses a 3-column layout:
Saved in .rmproj format (JSON). Ctrl+S to save, Ctrl+Shift+S for Save As. An asterisk * in the title bar indicates unsaved changes.
Drag blocks from the palette to the workspace. You can also click a palette item to place it directly.
Bring blocks close together (within 20px) and they automatically snap-connect. Yellow highlight shows connection candidates.
Drag the top block of a connected chain to move all attached blocks together.
Click to select (blue border). Ctrl+click for multi-select. Drag on canvas for marquee selection. Ctrl+A to select all.
Mouse wheel to zoom. Middle button drag or right button drag to pan. Home key to zoom-to-fit all blocks.
Appears on the right when a block is selected. Set comments, delay (ms), enabled/disabled, and block-specific parameters.
Use the search box above the palette for incremental search across all categories. Esc or the ✕ button clears the search.
200×150px overview in the bottom-left corner. Click/drag to pan the workspace. Toggle via toolbar button.
| Shape | Description | Examples |
|---|---|---|
| Stack | Standard block with top notch + bottom bump | Click, type, file operations |
| Hat | Rounded top, no top connector | Start block, function definitions |
| C-Block | Contains child blocks in its mouth, auto-expands | Loops, conditions, TryCatch |
| Boolean | Hexagonal, represents true/false conditions | Comparisons (=, >, <), AND/OR/NOT |
| Reporter | Capsule shape, returns values, snaps into parameter slots | Variable refs, math, strings |
| Cap | Top connector only, terminates flow | Break, Continue, MacroExit, Return |
| Block | Description |
|---|---|
| LeftClick / RightClick / MiddleClick | Mouse button clicks (with optional coordinates) |
| DoubleClick | Double-click |
| MouseMove | Move cursor to coordinates |
| MouseDrag | Drag from point A to point B |
| ScrollUp / ScrollDown | Mouse wheel scroll |
| MouseClickRelative / MouseMoveRelative | Relative to active window |
| Others | LeftDown/Up, RightDown/Up, ScrollLeft/Right |
| Block | Description |
|---|---|
| KeyInput | Key combos (e.g., Ctrl+S, {ENTER}, {F8}{ENTER}) |
| TextPaste | Paste text via clipboard |
| Block | Description |
|---|---|
| ActivateWindow | Bring window to front |
| CloseWindow | Close specified window |
| MinimizeWindow / MaximizeWindow / RestoreWindow | Window state |
| ResizeWindow / MoveWindow | Resize and reposition |
| GetWindowTitle | Get active window title |
Control Edge/Chrome via Playwright.
| Block | Description |
|---|---|
| BrowserLaunch | Launch browser (Edge/Chrome/Chromium) |
| BrowserNavigate | Navigate to URL |
| BrowserClick | Click element (CSS selector) |
| BrowserFill | Type text into input field |
| BrowserSelect | Select from dropdown |
| BrowserCheck | Toggle checkbox on/off |
| BrowserGetText / BrowserGetAttribute | Extract element data |
| BrowserGetTitle / BrowserGetUrl | Get page title or URL |
| BrowserScreenshot | Save page or element screenshot |
| BrowserExecuteScript | Execute JavaScript |
| BrowserWaitForElement | Wait for element to appear |
| BrowserSwitchTab / BrowserGoBack | Tab and navigation control |
| BrowserClose | Close browser |
COM-based Excel operations (requires Microsoft Excel).
| Block | Description |
|---|---|
| ExcelOpen / ExcelClose / ExcelSave | Workbook management |
| ExcelReadCell / ExcelWriteCell | Cell read/write |
| ExcelSetFormula | Set cell formula |
| ExcelRunMacro | Execute VBA macro |
| ExcelFilter / ExcelReadRange | Filter and range ops |
| Others | CopySheet, DeleteSheet, WriteRange, InsertRow, Sort, FindCell, Chart, ExportPdf, GetLastRow |
| Block | Description |
|---|---|
| FileCopy / FileMove / FileDelete | Basic file operations |
| FileRead / FileWrite / FileAppend | Text file I/O |
| FolderCreate / FolderCopy / FolderDelete | Folder operations |
| FileRename / FileGetInfo / FileListFiles | File utilities |
| Block | Description |
|---|---|
| JsonParse / JsonStringify | JSON parsing and generation |
| RegexMatch / RegexReplace | Regular expressions |
| StringIndexOf / StartsWith / EndsWith | String search |
| DateAdd / DateDiff / DateFormat | Date arithmetic and formatting |
| ListCreate / Add / Get / Set / Length / Remove / Sort / Join | List operations |
| Others | TextExtract, TextFormat, TextReplace, TextQuote |
| Block | Description |
|---|---|
| ConditionBranch | if-then (C-block) |
| ConditionBranchElse | if-then-else (dual mouth) |
| LoopStart | For loop (specified count) |
| ForEachLoop | Iterate over list items |
| TryCatch | Error handling (try/catch mouths) |
| BreakLoop / ContinueLoop | Loop control (Cap type) |
| MacroExit | Immediate macro termination (Cap type) |
| WaitTime / WaitWindow / WaitImage / WaitColor / WaitCpu | Conditional waits |
| FunctionDefine / FunctionCall / FunctionReturn | Custom functions |
| Block | Description |
|---|---|
| AiInstruction | Send prompt to LLM, store response in variable. 5 providers (OpenAI, Anthropic, Gemini, Groq, local LLM). Vision API support (image attachment, 4MB limit). |
| Block | Description |
|---|---|
| VisionClick | Click on template image |
| VisionWait / VisionDisappear | Wait for image appear/disappear |
| VisionGetPosition | Get image coordinates |
| VisionCapture | Save screen capture |
| OcrReadText | Read text from screen region (Windows OCR) |
| OcrClickText / OcrWaitText | Find and click/wait for text |
| WaitImage | Image recognition wait (OpenCV) |
Hexagonal blocks that snap into C-block condition slots.
| Block | Description |
|---|---|
| BoolCompareEquals / Greater / Less | Comparison operators (=, >, <) |
| BoolVariableEquals | Variable value match |
| BoolFileExists | File existence check |
| BoolWindowVisible | Window visibility check |
| BoolImageFound / BoolColorFound | Image/color detection |
| BoolAnd / BoolOr / BoolNot | Logical operators (nestable) |
| Block | Description |
|---|---|
| AppLaunch / AppClose | Launch and close applications |
| MessageBox / InputDialog | User interaction dialogs |
| ClipboardSet / ClipboardGet | Clipboard operations |
| ProcessStart / ProcessKill | Process management |
| ZipCreate / ZipExtract | ZIP compression/extraction |
| Base64Encode / Base64Decode | Base64 encoding |
| HashCompute | Hash (MD5/SHA1/SHA256/SHA512) |
| HttpGet / HttpPost / HttpDownload | HTTP operations |
| RegistryRead / RegistryWrite | Registry operations |
| TextToSpeech / ScreenCapture | System utilities |
Capsule-shaped blocks that snap into parameter slots of other blocks.
| Block | Description |
|---|---|
| MathAdd / Subtract / Multiply / Divide / Modulo | Arithmetic + modulo |
| MathFunction | Math functions (abs, round, ceil, floor, sqrt) |
| StringConcat / StringLength / StringCharAt | String operations |
| RandomNumber | Random number generation |
| SenseMouseX / SenseMouseY | Current mouse coordinates |
| SenseCurrentDateTime | Current date/time |
| SenseClipboardText | Clipboard text |
| SenseEnvironmentVariable | Environment variable value |
Press the ▶ Play button in the toolbar. Execution starts from the Start block. The currently executing block is highlighted with a green border.
While running: ⏸ Pause to stop, ▶ Resume to continue, ⏩ Step to execute one block at a time.
Press ■ Stop to immediately halt execution.
Test the flow without performing actual mouse/keyboard operations. Results are shown in the log panel.
Adjust execution speed from 1x to 10x using the toolbar slider. Changes apply in real-time.
Failed blocks are highlighted with a red border, and the error message is shown in a tooltip.
Structured log in the bottom panel with 4 color levels: gray (Info), green (Success), red (Error), yellow (Warning).
The "Variables" tab in the bottom panel shows runtime variable values. Changed variables are highlighted in green.
Right-click a block → "Set Breakpoint" or press F9. A red dot marker appears, and execution pauses automatically when it reaches that block.
Press the ⚫ Record button to capture mouse/keyboard actions. Recorded actions are placed as blocks in the workspace.
Use the VariableDefine block to create variables. Reference values with {=variableName} syntax in any parameter field.
The VariableExpression block supports arithmetic: +, -, *, /, %, parentheses. Example: {=count} + 1
Automatically set inside loops:
{=_loopIndex} — 1-based loop counter (1, 2, 3, ...){=_loopIteration} — 0-based loop counter (0, 1, 2, ...)ListCreate for empty lists, ListAdd to append, ListGet/ListSet for index access. ListSort, ListJoin, and more available.
Define functions with FunctionDefine (Hat), call with FunctionCall (Stack), return values with FunctionReturn (Cap). Recursion supported (max depth 100).
Blocks with output variables get smart default names (e.g., CellValue). Duplicates are auto-numbered (CellValue2, CellValue3...).
The AiInstruction block sends a prompt to an LLM and stores the response in a variable. Attach image files for Vision API (4MB limit). Parameters: prompt, outputVariable, filePath, provider, model, temperature.
| Provider | Default Model | Vision |
|---|---|---|
| OpenAI | gpt-5.5 (default), gpt-5.5-mini, gpt-5.4, gpt-5.2, gpt-4.1, gpt-4o | ✓ |
| Anthropic | claude-sonnet-4-6 (default), claude-opus-4-7, claude-haiku-4-5 | ✓ |
| Google Gemini | gemini-3-pro / 3-flash (Preview), gemini-2.5-pro / 2.5-flash | ✓ |
| Groq | Llama 4 Scout 17B (multimodal), Llama 3.3 70B, Llama 3.1 8B | ✓ (with Llama 4 Scout) |
| Local LLM | User-specified | ✓ |
New in v2.0 — AI Vision features let AI "see" and understand your screen, enabling automation that adapts to dynamic UI changes without fixed coordinates or template images.
| Block | Function | Key Parameters |
|---|---|---|
| AiClick | AI sees the screen, locates the element, and clicks | prompt, button, clickType, provider, model |
| AiSmartWait | AI periodically checks the screen until condition is met | prompt, timeoutMs, pollingMs, provider, model |
| AiOcr | AI reads text from the screen into a variable | prompt, outputVariable, provider, model |
| AiValidate | AI verifies screen state and returns true/false | prompt, outputVariable, provider, model |
| BoolAiCondition | AI judgment in if/while conditions | prompt, provider, model |
Example prompts for each AI Vision block. Specific, clear descriptions improve AI accuracy.
The "File" menu in Notepad
The "Save" button in the Save As dialog
The Chrome icon in the taskbar
Cell B3 in the Excel spreadsheet
The file download has completed (progress bar disappeared)
A "Processing complete" message is displayed
The splash screen has closed and the main window is visible
Read the error message shown in the dialog at the center of the screen
Read the total amount in the bottom-right cell of the Excel table
Read the filename shown in the title bar
The file was saved successfully (no "*" mark in the title bar)
A login screen is displayed (username and password fields are visible)
The print preview is correctly displayed
An error dialog is displayed on the screen
Notepad is visible on the screen
The web page has finished loading (no loading spinner visible)
AiAutopilot takes a natural language task description and autonomously operates your computer — taking screenshots, deciding actions, and repeating until the task is complete.
Autopilot automatically selects from 3 execution paths based on the provider and model. v2.0.6 adds OpenAI GPT-5.5 native path.
| Path | When | How It Works | Notes |
|---|---|---|---|
| Anthropic Native Path | Anthropic + Sonnet/Opus | Claude Computer Use API (multi-turn tool_use/tool_result) | High accuracy, well-proven stability |
| OpenAI Native Path NEW | OpenAI + GPT-5.5 / GPT-5.4 / computer-use-preview | OpenAI Responses API (stateless continuation via previous_response_id) | Faster (~1.3x) and cheaper (~5x) vs. Sonnet |
| Generic Path | All other providers + Anthropic Haiku | Prompt-based JSON approach | Standard |
Native paths send screenshots to the AI, receive tool calls (click, type, scroll, etc.), execute them, and return the resulting screenshot — a multi-turn conversation enabling high accuracy and self-correction.
Approach differs by provider: GPT-5.5 tends to take the shortest path (e.g., Win+R Run dialog), while Claude Sonnet behaves more human-like, navigating the Start menu step-by-step with visual confirmation. Pick GPT-5.5 for simple repetitive tasks, Sonnet for complex screen recognition.
| Parameter | Description | Default |
|---|---|---|
| task (prompt) | Natural language task (e.g., "Open Notepad, type Hello World, and save") | (required) |
| maxSteps | Maximum steps (each step = screenshot + LLM + action) | 30 |
| timeoutSeconds | Timeout in seconds | 300 |
| provider | LLM provider. Default uses the provider selected in AI Settings → "AI Autopilot" section (Anthropic or OpenAI). Can be overridden per block: Anthropic / OpenAI / OpenAICompatible etc. | Default |
| model | Model ID. Can be overridden per block (e.g., gemma4:26b) | (empty = use AI Settings) |
| outputVariable | Result variable (True/False) | AutopilotResult |
| Action | Description |
|---|---|
| left_click / right_click / double_click | Mouse click operations |
| type | Text input (Japanese supported via clipboard) |
| key | Keyboard input (Ctrl+S, Enter, Win+D, etc.) |
| scroll | Mouse wheel scroll |
| mouse_move / left_click_drag | Cursor movement and drag |
| screenshot | Request additional screenshot |
| wait | Wait 2 seconds |
Specific, step-clear task descriptions lead to more accurate Autopilot execution.
Open Notepad, type "Hello World", and save it as test.txt on the Desktop
→ Specify filename and location
Open Chrome, navigate to https://example.com, copy the page title, and paste it into Notepad
→ Multi-app coordination task
Open Calculator and compute 1234 × 5678
→ Simple, clear objective
Open report.xlsx from the Desktop and check the value in cell A1
→ Working with existing files
AI Vision works with all Vision-capable providers, but here are the recommended models for each use case:
| Use Case | Recommended Model | Reason |
|---|---|---|
| Autopilot (Speed & Cost) NEW | OpenAI GPT-5.5 | Built-in Computer Use. ~1.3x faster, ~5x cheaper than Sonnet. Available from OpenAI Tier 1. Best for simple repetitive tasks. |
| Autopilot (Accuracy & Stability) | Claude Sonnet 4.6 | Native Computer Use API. Long production track record; step-by-step verification approach excels at complex tasks. |
| Autopilot (Top Performance) | Claude Opus 4.7 | Native Computer Use API. Highest accuracy and self-correction. For long, complex tasks. |
| Autopilot (OpenAI Premium) | computer-use-preview | OpenAI's specialized Computer Use model. Requires Tier 3+ (cumulative $100 spend + 7 days). |
| Autopilot (Local LLM) | Gemma 4 26B A4B / Qwen3-VL 8B | Via Ollama. No API costs, works offline. Speed depends on GPU. |
| AiClick / AiSmartWait / AiOcr | Claude Sonnet 4.6 / GPT-5.5 / Gemini 2.5 Pro / Llama 4 Scout | Single Vision API call. Any Vision-capable model works well. |
| BoolAiCondition | Gemini 2.5 Flash / GPT-5.5 mini | Simple Yes/No judgment. Fast, low-cost models suffice. |
| AiInstruction (Text) | Any (per your needs) | No Vision required — Groq or local LLMs are also valid. |
The native-path model is selected from a single unified ComboBox at: AI Settings → "AI Autopilot" section → "Default Model". Pick from Anthropic (Sonnet 4.6 / Opus 4.7) or OpenAI (GPT-5.5 / 5.4 / computer-use-preview) in one selection. You can also override per-block with the provider and model parameters.
provider + model parameters → (2) AI Settings "AI Autopilot" Default Model → (3) Default: Claude Sonnet 4.6 (Anthropic)
computer-use-preview requires Tier 3+ (cumulative $100 + 7 days). Also, the OpenAI project's "Allowed Models" whitelist must include the target model (some Default projects only permit gpt-4o).
Run a Vision-capable local LLM (e.g., Gemma 4) via Ollama to use Autopilot with no API costs and fully offline.
http://localhost:11434/v1 and the model name (e.g., gemma4:26b)OpenAICompatible| Model | Ollama Command | VRAM | Notes |
|---|---|---|---|
| Gemma 4 26B A4B | ollama pull gemma4:26b | ~15GB | MoE (3.8B active) for fast inference. Vision + function calling. Apache 2.0 license. |
| Qwen3-VL 8B | ollama pull qwen3-vl:8b | ~6GB | Lightweight and fast. Runs on GPUs with less VRAM. Strong Vision accuracy for its size. |
AI Vision features work with locally-hosted Vision-capable LLMs via the "Local LLM" provider (OpenAI-compatible API). No cloud API required.
Below are recommended Vision-capable local models for screen recognition and UI element detection (as of March 2026).
| Model | Size | VRAM (Q4) | Runtime | Use Case & Notes |
|---|---|---|---|---|
| Qwen3-VL 8B | 8B | ~6GB | LM Studio / Ollama | TOP PICK Best-in-class GUI grounding (ScreenSpot 94.4%). Excellent at screen analysis, OCR, and UI element detection. 128K context. |
| Qwen2.5-VL 7B | 7B | ~6GB | LM Studio / Ollama | Battle-tested and stable. Exceptional document OCR (DocVQA 95.7). Choose this for proven reliability. |
| Gemma 3 4B | 4B | ~3-4GB | LM Studio / Ollama | LIGHTWEIGHT Runs on 6GB GPUs. Good for simple screen checks (Yes/No). Not suited for precise coordinate detection. |
| Phi-4-Reasoning-Vision 15B | 15B | ~10GB | LM Studio (GGUF) | Microsoft. Excels at complex reasoning over screen content: charts, tables, error messages. |
| Gemma 3 27B QAT | 27B | ~14GB | LM Studio / Ollama | HIGH-END For 24GB GPUs. QAT (Quantization-Aware Training) preserves quality. Best local quality available. |
ollama pull qwen3-vl.
http://localhost:1234, Ollama http://localhost:11434qwen3-vl-8b)The AI Assistant chat panel on the right side offers:
Free 15-day trial, then activate with a license key for full access. See License Policy for details.
Toolbar → AI Settings to configure:
API keys are stored encrypted (DPAPI) in the Windows registry. They are never saved in project files.
.rmproj files are JSON format. They can be manually edited but are typically managed within the app.
| Shortcut | Action |
|---|---|
| Ctrl+Z | Undo |
| Ctrl+Y | Redo |
| Ctrl+C | Copy |
| Ctrl+V | Paste |
| Ctrl+D | Duplicate |
| Ctrl+A | Select All |
| Ctrl+F | Search Blocks |
| Ctrl+S | Save |
| Ctrl+Shift+S | Save As |
| Ctrl+Shift+F | Fold / Unfold |
| Delete | Delete Selected |
| Escape | Deselect |
| Home | Zoom to Fit |
| F9 | Toggle Breakpoint |
| Mouse Wheel | Zoom |
| Middle Button Drag | Pan |
| Right Button Drag | Pan |