Automatically remove silence from videos and generate perfectly synced subtitles
Drop a video → AI detects & cuts silence → Export to Final Cut Pro with word-level subtitles
Most silence-removal tools split audio by time, which cuts words in half. Silenci uses a 2-Pass ASR approach — first transcribe, then split only at word boundaries. No mid-word cuts. Ever.
| | Other tools | Silenci |
|---|---|---|
| Split method | Time-based → words get chopped | Word-boundary → clean cuts |
| Subtitles | Separate tool needed | Built-in, word-level synced |
| Runs on | Cloud / GPU server | 100% local on your Mac |
| Cost | Subscription / API fees | Free & open source |
| Privacy | Upload to cloud | Offline — nothing leaves your Mac |
Most silence-removal tools split audio by fixed time windows before running ASR. This causes words to be cut in half at chunk boundaries. Silenci solves this with a 2-pass approach:
Pass 1: VAD → chunk by silence gaps (≤30s) → ASR + ForcedAligner → word-level timestamps
Pass 2: Split only at word end_time boundaries → never cuts mid-word
| Step | Component | Input | Output |
|---|---|---|---|
| 1 | ffprobe | Video file | fps, resolution, duration |
| 2 | ffmpeg | Video file | 16kHz mono WAV |
| 3 | Silero VAD | WAV audio | Speech segments [{start, end}, ...] |
| 4 | Split | Speech segments | ≤30s chunks (split at silence gaps, not mid-speech) |
| 5 | Qwen3-ASR | Audio chunk | Transcribed text |
| 6 | Qwen3-ForcedAligner | Audio + text | Word-level [{text, start, end}, ...] |
| 7 | Word merge | All words | Full word list with absolute timestamps |
| 8 | Segment split | Word list | Segments split at end_time boundaries |
| 9 | Subtitle split | Segments | Subtitle chunks by punctuation/endings |
When splitting at step 8, the algorithm prefers the largest silence gap between words, producing natural sentence-like segments.
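The word-boundary split in step 8, with the largest-gap preference, can be sketched as follows. This is an illustrative reconstruction, not the actual implementation; `split_at_word_boundaries` and the 8-second window are assumed names and defaults:

```python
def split_at_word_boundaries(words, max_seconds=8.0):
    """Split a word list into segments no longer than max_seconds,
    cutting only at word end-times and preferring the largest
    silence gap between consecutive words."""
    segments, current = [], []
    for word in words:
        current.append(word)
        if current[-1]["end"] - current[0]["start"] >= max_seconds:
            # Find the largest inter-word silence gap inside the window.
            gaps = [
                (current[i + 1]["start"] - current[i]["end"], i)
                for i in range(len(current) - 1)
            ]
            _, cut = max(gaps)
            segments.append(current[: cut + 1])
            current = current[cut + 1:]
    if current:
        segments.append(current)
    return segments

words = [
    {"text": "Hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 0.9},
    {"text": "again", "start": 9.0, "end": 9.4},  # long pause before this word
]
print(split_at_word_boundaries(words))  # splits at the 0.9s → 9.0s gap
```

Because every cut lands on a word's `end` timestamp, no word can ever be bisected, regardless of where the time limit falls.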
The app uses a dual-process architecture: a Swift frontend communicates with a Python subprocess via JSON-RPC 2.0 over stdin/stdout pipes.
┌──────────────────────┐ stdin (JSON) ┌──────────────────────┐
│ Swift macOS App │ ──── {"method":"analyze"} ──→ │ Python Subprocess │
│ │ │ │
│ • SwiftUI Views │ ←── {"result": segments} ──── │ • silence_cutter/ │
│ • PythonBridge │ │ server.py │
│ • ExportService │ ←── {"method":"progress"} ── │ • Silero VAD │
│ • PythonEnvironment │ (notifications) │ • Qwen3-ASR (MLX) │
└──────────────────────┘ stdout (JSON) └──────────────────────┘
Why this architecture?
- Isolation: Python ML stack (PyTorch, MLX) runs in a separate process — crashes don't take down the UI
- Streaming progress: JSON-RPC notifications push real-time progress (VAD %, ASR chunk N/M, model download bytes)
- No FFI overhead: No ctypes/cffi bindings needed — just line-delimited JSON
- Cancelable: Swift can kill the Python process at any time for instant cancellation
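The framing is plain line-delimited JSON-RPC 2.0: one JSON object per line on stdin, one per line on stdout. A minimal sketch of the Python side (the `analyze` handler and placeholder result are illustrative, not the actual `server.py` API; the real server also pushes id-less progress notifications):

```python
import io
import json

def handle_request(line):
    """Dispatch one JSON-RPC 2.0 request line (hypothetical 'analyze' method)."""
    req = json.loads(line)
    if req.get("method") == "analyze":
        result = [{"start": 0.0, "end": 1.2}]  # placeholder speech segments
        return {"jsonrpc": "2.0", "id": req["id"], "result": result}
    return {"jsonrpc": "2.0", "id": req.get("id"),
            "error": {"code": -32601, "message": "Method not found"}}

def serve(stdin, stdout):
    """Read one request per line, write one response per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        stdout.write(json.dumps(handle_request(line)) + "\n")
        stdout.flush()  # line-buffered pipes need explicit flushes

# Simulate the stdin/stdout pipes with in-memory buffers.
inp = io.StringIO('{"jsonrpc":"2.0","id":1,"method":"analyze","params":{}}\n')
out = io.StringIO()
serve(inp, out)
print(out.getvalue())
```

Line-delimited framing keeps both sides trivial to parse: Swift's side only needs to split the pipe on newlines and decode each line independently.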
On first launch, PythonEnvironment.swift handles a fully automated setup chain:
App launch
→ Check Homebrew (not found? → install via official script)
→ Check Python3 (not found? → brew install python@3)
→ Check ffmpeg (not found? → brew install ffmpeg)
→ Create venv (~/Library/Application Support/Silenci/venv/)
→ pip install (torch, mlx-audio, silero-vad, soundfile, numpy, soynlp)
→ Write version stamp (.sc-version)
→ Ready ✅
Subsequent launches skip installation if the version stamp matches. Bumping envVersion in code forces a clean reinstall.
Native SwiftUI app — load a video, configure settings, analyze, edit, and export in one window.
| Feature | Description |
|---|---|
| 🎬 Load video | Drag & drop or File → Open |
| ⚙️ Analysis settings | Auto-popup on load — language, model, VAD sensitivity |
| 📊 Real-time progress | Separate progress for analysis & model download |
| ⛔ Cancel analysis | Stop anytime with cancel button |
| ✂️ Word-level editing | Delete/restore words, split/merge clips |
| 🔍 Find & Replace | Cmd+F to batch-edit subtitle text |
| 🔄 Import FCPXML | Re-transcribe edited FCPXML with language & model selection |
| 🌐 App Language | Switch UI language in Settings (Korean/English/Japanese/Chinese) |
| 📤 Export | FCPXML (with inline iTT captions), SRT, iTT — all word-boundary split |
| Category | Setting | Default | Description |
|---|---|---|---|
| Speech | Language | Korean | Korean / English / Japanese / Chinese |
| Speech | ASR Model | 0.6B | 0.6B (fast) / 1.7B (accurate) |
| Silence | VAD Sensitivity | 0.50 | 0.1–0.9 (lower = more sensitive) |
| Silence | Min Silence | 200ms | Shorter silences are ignored |
| Silence | Padding | 100ms | Buffer around speech segments |
| Subtitle | Max Clip Length | 8s | 3–20s slider |
| Subtitle | Max Chars/Line | 20 | Subtitle line break threshold |
| Subtitle | Font Size | 42pt | FCPXML subtitle font |
Settings are persisted via UserDefaults across app restarts.
```bash
./build-release.sh       # Build → dist/SilenciApp.app
open dist/SilenciApp.app # Launch
```

- Download `Silenci-vX.X.X-macOS.dmg` from Releases
- Open DMG → drag `SilenciApp` to Applications
- First launch: Right-click (or Control+click) the app → Open → click Open in the dialog

macOS shows an "unidentified developer" warning for open-source apps. This is a one-time step — after this, the app opens normally.

- The app auto-installs Python, ffmpeg, and AI models on first launch (~1-2 min)
On first launch, the app automatically creates a Python venv and installs dependencies (~45 seconds). ASR models are downloaded on first analysis with byte-level progress tracking.
| Item | Path | Size |
|---|---|---|
| 🐍 Python venv | `~/Library/Application Support/Silenci/venv/` | ~1.5 GB |
| 🤖 ASR model cache | `~/.cache/huggingface/hub/` | ~1-2 GB |
Option 1 — From the app:

Menu bar → Silenci → Delete Python Environment ("Python 환경 삭제")

Option 2 — Manual:

```bash
rm -rf ~/Library/Application\ Support/Silenci/
rm -rf ~/.cache/huggingface/hub/models--mlx-community--Qwen3-*
```

```bash
python -m silence_cutter <command> [options]
silence-cutter <command> [options]   # after pip install -e .
```

```bash
silence-cutter cut input.mp4                    # basic
silence-cutter cut input.mp4 -o output.fcpxml   # custom output
silence-cutter cut input.mp4 -l English --itt   # English + iTT
```

📋 All options
| Option | Default | Description |
|---|---|---|
| `-o, --output` | `<input>.fcpxml` | Output path |
| `-l, --language` | Korean | Speech language |
| `--asr-model` | Qwen3-ASR-1.7B-8bit | ASR model |
| `--aligner-model` | Qwen3-ForcedAligner-0.6B-8bit | Alignment model |
| `--vad-threshold` | 0.5 | VAD sensitivity (0–1) |
| `--min-speech-ms` | 250 | Min speech duration (ms) |
| `--min-silence-ms` | 300 | Min silence duration (ms) |
| `--speech-pad-ms` | 100 | Speech padding (ms) |
| `--font-size` | 42 | Subtitle font size |
| `--max-subtitle-chars` | 20 | Max chars per subtitle line |
| `--itt` | false | Also generate iTT subtitles |
```bash
silence-cutter multi video1.mp4 video2.mp4 -o merged.fcpxml --itt
silence-cutter script input.mp4 -t -o script.txt   # with timecodes
silence-cutter resub edited.fcpxml -o final.fcpxml --itt
silence-cutter extract timeline.fcpxml -t -o script.txt
```

| Format | Extension | Use Case | Subtitle Splitting |
|---|---|---|---|
| FCPXML | `.fcpxml` | Final Cut Pro (silence cuts + inline titles + iTT captions) | ✅ Word-based |
| SRT | `.srt` | Universal subtitles (YouTube, VLC, etc.) | ✅ Word-based |
| iTT | `.itt` | iTunes Timed Text (FCP compatible) | ✅ Word-based |
| TXT | `.txt` | Plain text script (optional timecodes) | — |
All subtitle formats use word-level timestamps for precise splitting. FCPXML exports include both title text overlays (lane 1) and iTT inline captions (lane 2) — FCP shows both automatically.
File → Import → XML... → select the `.fcpxml` file

The silence-removed timeline with embedded subtitles loads automatically.
| Item | Requirement |
|---|---|
| OS | macOS 14.0+ (Apple Silicon) |
| Disk | ~2-4 GB for Python venv + ASR models |
Python, ffmpeg, Homebrew are all auto-installed on first launch if not present. No manual setup needed.
Download from Releases → see Install from DMG above.
```bash
brew install ffmpeg
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
silence-cutter cut input.mp4
```

| Package | Purpose |
|---|---|
| `mlx-audio` | Qwen3-ASR / ForcedAligner (MLX backend) |
| `silero-vad` | Voice Activity Detection |
| `torch` | Silero VAD runtime |
| `soundfile` | WAV I/O |
| `numpy<2` | Numerical computation |
| `soynlp` | Korean tokenization (ForcedAligner) |
Silenci uses three AI models from the Qwen3 family, all running locally on Apple Silicon via the MLX framework.
| | 0.6B | 1.7B |
|---|---|---|
| Model | `mlx-community/Qwen3-ASR-0.6B-8bit` | `mlx-community/Qwen3-ASR-1.7B-8bit` |
| Parameters | 600M | 1.7B |
| Quantization | 8-bit (MLX) | 8-bit (MLX) |
| Disk size | ~600 MB | ~1.7 GB |
| Use case | Fast drafts, short videos | Production quality, long-form |
| Languages | Korean, English, Japanese, Chinese, and 10+ more | Korean, English, Japanese, Chinese, and 10+ more |
Qwen3-ASR is an encoder-decoder transformer trained on large-scale multilingual speech data. The MLX 8-bit quantized versions run efficiently on Apple Silicon's Neural Engine and GPU, achieving near-real-time transcription without requiring a cloud API.
How it's used in Silenci:
- Audio is extracted from video via ffmpeg (16kHz mono WAV)
- VAD segments are chunked into ≤30s pieces
- Each chunk is fed to `asr.generate(audio, language=...)` → returns transcribed text
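The chunking step in that list (grouping VAD speech segments into ≤30s ASR chunks at silence gaps, step 4 of the pipeline table) might look roughly like this. `chunk_segments` is a hypothetical name, and a single speech segment longer than 30s would still need further splitting in practice:

```python
def chunk_segments(segments, max_seconds=30.0):
    """Group VAD speech segments into chunks of at most max_seconds,
    starting a new chunk at a silence gap rather than mid-speech."""
    chunks, current = [], []
    for seg in segments:
        # Would adding this segment push the chunk past the limit?
        if current and seg["end"] - current[0]["start"] > max_seconds:
            chunks.append(current)
            current = []
        current.append(seg)
    if current:
        chunks.append(current)
    return chunks

segments = [
    {"start": 0.0, "end": 10.0},
    {"start": 12.0, "end": 25.0},
    {"start": 27.0, "end": 40.0},
]
chunks = chunk_segments(segments)
print(len(chunks))  # 2 chunks: [0–25s] and [27–40s]
```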
| | |
|---|---|
| Model | `mlx-community/Qwen3-ForcedAligner-0.6B-8bit` |
| Parameters | 600M |
| Quantization | 8-bit (MLX) |
| Disk size | ~600 MB |
| Purpose | Align transcribed text to audio → word-level `{text, start, end}` |
ForcedAligner takes the ASR output text and the original audio, then aligns each word to its exact position in the audio stream. This is what enables word-boundary splitting — the core innovation of Silenci.
How it's used:
- ASR produces text for a chunk: `"Through being someone mobile"`
- ForcedAligner receives audio + text → outputs:
  `[{text: "Through", start: 0.12, end: 0.45}, {text: "being", start: 0.47, end: 0.71}, {text: "someone", start: 0.73, end: 1.15}, {text: "mobile", start: 1.18, end: 1.52}]`
- Segment splitting only happens at word `end` times (never mid-word)
Coverage validation: If ForcedAligner output covers <75% of the ASR text, the result is discarded and the segment falls back to chunk-level timing (safety net for edge cases).
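The coverage check could be sketched like this. The function names and the character-based metric are assumptions for illustration; only the 75% threshold and the chunk-level fallback come from the description above:

```python
def aligner_coverage(asr_text, aligned_words):
    """Fraction of ASR characters (ignoring spaces) that the aligner covered."""
    asr_chars = len(asr_text.replace(" ", ""))
    aligned_chars = sum(len(w["text"]) for w in aligned_words)
    return aligned_chars / asr_chars if asr_chars else 0.0

MIN_COVERAGE = 0.75  # threshold from the docs

def words_or_fallback(asr_text, aligned_words, chunk_start, chunk_end):
    """Discard low-coverage aligner output; fall back to chunk-level timing."""
    if aligner_coverage(asr_text, aligned_words) < MIN_COVERAGE:
        return [{"text": asr_text, "start": chunk_start, "end": chunk_end}]
    return aligned_words

# Aligner only recovered "hello" out of "hello world" → 50% coverage → fallback.
partial = [{"text": "hello", "start": 0.1, "end": 0.5}]
result = words_or_fallback("hello world", partial, 0.0, 2.0)
print(result)
```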
| | |
|---|---|
| Model | Silero VAD v5 |
| Framework | PyTorch |
| Size | ~2 MB |
| Speed | Processes 1 hour of audio in ~3 seconds |
| Purpose | Detect speech vs. silence boundaries |
Silero VAD is a lightweight neural network that classifies audio frames as speech or non-speech. It outputs speech timestamps used to:
- Remove silence (the primary feature)
- Define ASR chunk boundaries (speech segments → 30s chunks)
- Calculate energy for optimal split points
Configurable parameters:
| Parameter | Default | Effect |
|---|---|---|
| `threshold` | 0.50 | Speech detection sensitivity (0.1 = sensitive, 0.9 = strict) |
| `min_speech_ms` | 250 | Minimum speech duration to keep |
| `min_silence_ms` | 200 | Minimum silence to detect as gap |
| `speech_pad_ms` | 100 | Padding added around speech segments |
The subtitle splitting engine runs both in Python (server-side) and Swift (export-side) with identical logic:
Priority 1 Split at punctuation or sentence endings (min 6 chars accumulated)
Korean endings: 요, 다, 까, 죠, 고, 서, 며, 면, 습니다, 합니다 …
Punctuation: . ! ? 。,
Priority 2 Force-split when exceeding max_subtitle_chars
- Include next word if ≤3 chars (prevents Korean particle separation)
- Hard limit at max_chars + 8 (prevents infinite accumulation)
Priority 3 Auto-correct overlapping timestamps after splitting
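Priority 2 above (force-split with small-word inclusion) can be sketched as follows; the function name, the string-based accumulation, and the English example are illustrative, not the shared Python/Swift implementation:

```python
def force_split(words, max_chars=20, small_word=3, hard_extra=8):
    """Accumulate words into subtitle lines; split once max_chars is
    exceeded, but pull in the next word if it is small_word chars or
    fewer (keeps Korean particles attached), with a hard cap at
    max_chars + hard_extra to prevent runaway accumulation."""
    lines, current = [], ""
    for w in words:
        candidate = (current + " " + w).strip()
        keep_small = len(w) <= small_word and len(candidate) <= max_chars + hard_extra
        if len(current) >= max_chars and not keep_small:
            lines.append(current)
            current = w
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines

words = "this is a longer sentence that keeps going".split()
print(force_split(words, max_chars=10))
```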
Korean-specific post-processing:
merge_orphan_josa() handles cases where ForcedAligner separates Korean particles (조사) at segment boundaries:
Before: "맛집" | "을 검색을..." ← "을" orphaned from its noun
After: "맛집을" | "검색을..." ← particle merged back
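A minimal sketch of the idea behind `merge_orphan_josa()`; the particle list here is a small illustrative subset, and timestamp handling is simplified away:

```python
# A few common Korean particles (조사) that should never start a segment alone.
JOSA = {"을", "를", "이", "가", "은", "는", "과", "와", "도", "로"}

def merge_orphan_josa(segments):
    """Reattach a particle stranded at the start of a segment to the
    previous segment's text (sketch of the concept only)."""
    out = []
    for seg in segments:
        seg = dict(seg)
        head, _, tail = seg["text"].partition(" ")
        if out and head in JOSA:
            out[-1]["text"] += head   # "맛집" + "을" → "맛집을"
            seg["text"] = tail
        if seg["text"]:
            out.append(seg)
    return out

segs = [
    {"text": "맛집", "start": 0.0, "end": 0.5},
    {"text": "을 검색을", "start": 0.6, "end": 1.4},
]
print(merge_orphan_josa(segs))  # → "맛집을" | "검색을"
```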
FCPXML requires frame-exact timing. Silenci uses Python Fraction arithmetic to avoid floating-point drift:
| fps | FCP Code | Frame Duration | Notes |
|---|---|---|---|
| 23.976 | 2398 | 1001/24000s | NTSC film |
| 24 | 24 | 100/2400s | Cinema |
| 25 | 25 | 100/2500s | PAL |
| 29.97 | 2997 | 1001/30000s | Drop-frame NTSC |
| 30 | 30 | 100/3000s | Non-drop NTSC |
| 59.94 | 5994 | 1001/60000s | High frame rate |
| 60 | 60 | 100/6000s | Gaming/action |
| 120 | 120 | 100/12000s | iPhone slo-mo |
All time calculations use Fraction(numerator, denominator) → converted to FCPXML offset="N/Ds" format. This ensures sample-accurate alignment even for long timelines (>1 hour).
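A sketch of the Fraction-based conversion (the helper names are hypothetical; the NTSC frame durations match the table above):

```python
from fractions import Fraction

def frame_duration(fps_code):
    """Exact frame duration for an FCP fps code; NTSC rates are 1001-based."""
    ntsc = {2398: Fraction(1001, 24000),
            2997: Fraction(1001, 30000),
            5994: Fraction(1001, 60000)}
    return ntsc.get(fps_code, Fraction(1, fps_code))

def to_fcpxml_time(seconds, fps_code):
    """Snap a time in seconds to a whole frame count and format it as
    FCPXML's rational 'N/Ds' time string."""
    fd = frame_duration(fps_code)
    frames = round(Fraction(seconds).limit_denominator(10**9) / fd)
    t = frames * fd  # exact rational time, no float drift
    return f"{t.numerator}/{t.denominator}s"

print(to_fcpxml_time(10.0, 2997))  # 300 frames at 29.97fps → "1001/100s"
```

Because every intermediate value is a `Fraction`, rounding happens exactly once (when snapping to a frame), so errors cannot accumulate over a long timeline.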
ASR models are downloaded from Hugging Face Hub on first analysis:
~/.cache/huggingface/hub/
models--mlx-community--Qwen3-ASR-0.6B-8bit/
models--mlx-community--Qwen3-ASR-1.7B-8bit/
models--mlx-community--Qwen3-ForcedAligner-0.6B-8bit/
Silenci monkey-patches huggingface_hub.snapshot_download's tqdm progress bars to capture byte-level download progress and forward it to the UI via JSON-RPC notifications. This provides accurate "1.2 GB / 1.7 GB" progress display during model downloads.
Silenci/
├── silence_cutter/ # Python package
│ ├── server.py # JSON-RPC server (2-pass ASR)
│ ├── vad.py # Silero VAD + silence-based splitting
│ ├── transcribe.py # Qwen3-ASR + ForcedAligner + josa merge
│ ├── fcpxml.py # FCPXML generation + subtitle splitting
│ ├── srt.py / itt.py # SRT, iTT subtitles
│ ├── pipeline.py # CLI pipeline
│ └── ...
├── SilenciApp/ # Swift macOS app
│ ├── Package.swift
│ └── Sources/
│ ├── App.swift # Entry point + menu (env cleanup)
│ ├── ContentView.swift # Main layout + analysis popup
│ ├── Models/
│ │ ├── AnalysisService.swift # Analysis runner + Python bridge
│ │ ├── AnalysisSettings.swift # Settings model (UserDefaults)
│ │ └── ...
│ ├── Services/
│ │ ├── PythonBridge.swift # JSON-RPC communication
│ │ ├── PythonEnvironment.swift # Auto venv install/cleanup
│ │ └── ExportService.swift # FCPXML/SRT/iTT (word-based split)
│ └── Views/
│ ├── AnalyzeDialogView.swift # Pre-analysis settings popup
│ ├── AnalysisProgressView.swift # Progress + model download + cancel
│ ├── ClipCardView.swift # Clip card (video edit + subtitle)
│ ├── WordFlowView.swift # Word-level editing UI
│ ├── RetranscribeSheetView.swift # FCPXML retranscribe settings + progress
│ └── SettingsView.swift # Settings sheet (incl. app language)
├── build-release.sh # Release build → dist/SilenciApp.app
├── setup_mac.sh # Auto Python environment setup
└── docs/ # Diagrams & screenshots
ffmpeg/ffprobe not found
```bash
brew install ffmpeg
```

The app automatically adds `/opt/homebrew/bin` to PATH.
Model download is slow
ASR models are downloaded from Hugging Face on first analysis.
Byte-level progress is shown in the app. After download, models are cached in ~/.cache/huggingface/hub/.
VAD is too sensitive / not sensitive enough
App: Adjust VAD Sensitivity slider in the analysis popup.
CLI:
| Direction | Parameter |
|---|---|
| More sensitive (catch quiet speech) | --vad-threshold 0.3 |
| Less sensitive (only clear speech) | --vad-threshold 0.7 |
| Remove short silences too | --min-silence-ms 150 |
| Only remove long silences | --min-silence-ms 500 |
Subtitles are too short / too long
App: Adjust Max Chars in the analysis popup (default: 20).
CLI: --max-subtitle-chars 30 for longer lines.
Words are cut in the middle of subtitles
The 2-Pass ASR approach prevents mid-word cuts.
If it still happens, try increasing --max-segment-seconds (default 8s → 15s).
```bash
pip install -e ".[dev]"      # Install dev dependencies
pytest                       # Run tests
black --line-length 100 .    # Format
ruff check silence_cutter/   # Lint
```

Contributions are welcome! Please feel free to submit issues and pull requests.
If you find this project useful, please consider giving it a star ⭐
It helps others discover the project and motivates continued development.



