Query Parsing Split

A field guide for diagnosing and fixing failures where the same user question becomes different queries in different retrieval branches. Typical sources are HyDE, rewrite chains, keyword extraction, BM25 analyzers, and tool-side expansions. The result is high recall with wrong ordering, unstable answers, or a hybrid pipeline that underperforms its best single retriever.

Read together with

Overview and short route → retrieval-playbook.md
Hybrid fusion knobs → hybrid_retrieval.md
Ordering control → rerankers.md
Trace and citation schema → retrieval-traceability.md
ΔS probes → deltaS_probes.md
Chunk window parity → chunk_alignment.md

Acceptance targets

ΔS(question, retrieved) ≤ 0.45
Coverage ≥ 0.70 to the intended section
λ convergent across 3 paraphrases and 2 seeds
Dense vs sparse query parity recorded in trace, mismatch rate ≤ 5 percent
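
The targets above can be turned into a small gate. This is a minimal sketch, not part of WFGY itself: `ProbeResult`, `check_acceptance`, and the `"convergent"` label are hypothetical names, and it assumes ΔS, coverage, and the parity mismatch rate have already been measured elsewhere.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    delta_s: float          # ΔS(question, retrieved), lower is better
    coverage: float         # share of the intended section that was retrieved
    lambda_states: list     # one label per (paraphrase, seed) run (assumed labels)
    parity_mismatch: float  # share of queries where dense and sparse text diverged


def check_acceptance(r: ProbeResult) -> list:
    """Return the names of failed targets; an empty list means the gate passes."""
    failures = []
    if r.delta_s > 0.45:
        failures.append("delta_s")
    if r.coverage < 0.70:
        failures.append("coverage")
    # 3 paraphrases x 2 seeds = 6 runs, and every run must be convergent
    if len(r.lambda_states) < 6 or any(s != "convergent" for s in r.lambda_states):
        failures.append("lambda")
    if r.parity_mismatch > 0.05:
        failures.append("parity_mismatch")
    return failures


good = ProbeResult(0.38, 0.82, ["convergent"] * 6, 0.02)
bad = ProbeResult(0.51, 0.82, ["convergent"] * 5 + ["divergent"], 0.02)
print(check_acceptance(good))  # []
print(check_acceptance(bad))   # ['delta_s', 'lambda']
```

Returning the failed target names rather than a single boolean makes it easier to see which knob to look at first.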

Symptoms that point to parsing split

Dense and sparse branches return different topics that both look relevant.
Fused results are worse than the best single retriever.
λ flips only when the HyDE rewrite is enabled or when keyword extraction runs.
ΔS stays flat and high while top-k overlap across branches is below 10 percent.
Citations show mismatched analyzers or different casing rules.

Canonical causes

HyDE rewrite fed to dense only: sparse receives the user text while dense uses the hypothetical document.
Keyword extraction fed to sparse only: dense receives full natural language while sparse gets boolean or phrase queries.
Analyzer mismatch: lowercasing, ASCII folding, stemming, and punctuation stripping differ between the write (index) and read (query) paths.
Parentheses and operators interpreted differently: the sparse parser treats parentheses and quotes as control tokens while dense treats them as text.
Language mix: the dense model is trained on multilingual text while the sparse index is built with an English analyzer.
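
To make the analyzer mismatch cause concrete, here is a small sketch that runs the same text through two analyzer policies and reports the tokens that exist on only one side. `analyze` is a toy stand-in written for this example, not the analyzer of any real search engine, and the policy flags are assumptions.

```python
import re
import unicodedata


def analyze(text, lower=True, fold=True, strip_punct=True):
    """Toy analyzer: optional lowercasing, ASCII folding, punctuation stripping."""
    if lower:
        text = text.lower()
    if fold:
        # Decompose accented characters, then drop whatever is not ASCII
        text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    if strip_punct:
        text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def analyzer_mismatch(text, index_policy, query_policy):
    """Tokens produced by only one of the two paths (symmetric difference)."""
    indexed = set(analyze(text, **index_policy))
    queried = set(analyze(text, **query_policy))
    return sorted(indexed ^ queried)


index_policy = {"lower": True, "fold": True, "strip_punct": True}
query_policy = {"lower": False, "fold": False, "strip_punct": True}
diff = analyzer_mismatch("Caf\u00e9 latency (FAISS)", index_policy, query_policy)
print(diff)
```

With these two policies only `latency` survives on both sides, so a query for the accented, cased form would miss the folded, lowercased index terms.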

Unify the query before branching

Define a single normalized query object. Derive branch specific fields from it. Log the object into the trace for audits.

{
  "q_base": "what are the latency limits of vector search on faiss ivf flat",
  "q_hyde": "A technical note that discusses latency limits of vector search using FAISS IVF Flat...",
  "keywords": ["latency", "vector search", "FAISS", "IVF Flat", "limits"],
  "policy": {
    "case": "lower",
    "fold": "ascii",
    "stopwords": "en_smart",
    "stemming": "porter"
  },
  "routing": {
    "dense": {"use": true, "text": "q_hyde"},
    "sparse": {"use": true, "keywords": true, "operator": "OR"}
  }
}

Rules

Normalize casing and apply Unicode folding according to policy.
If HyDE is used, either feed the rewrite to both branches or to none.
If keywords are used for sparse, also pass q_base as a soft clause to keep semantic context.
Record policy and routing in each citation row.
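
The rules above can be checked mechanically before any retrieval runs. The sketch below assumes a plan shaped like the query object above, where routing refers to fields by name (for example `"text": "q_hyde"`). `check_plan`, the violation labels, and the `text` and `soft` keys on the sparse branch are hypothetical names used for illustration.

```python
def check_plan(plan):
    """Return rule violations for a query plan; an empty list means parity holds."""
    problems = []
    dense = plan["routing"]["dense"]
    sparse = plan["routing"]["sparse"]

    # Rule: the HyDE rewrite goes to both branches or to none.
    dense_uses_hyde = dense.get("text") == "q_hyde"
    sparse_uses_hyde = sparse.get("text") == "q_hyde"
    if dense_uses_hyde != sparse_uses_hyde:
        problems.append("hyde_one_branch_only")

    # Rule: keywords for sparse need q_base as a soft clause.
    if sparse.get("keywords") and not sparse.get("soft"):
        problems.append("keywords_without_soft_clause")

    # Rule: casing and folding must be declared so they can be logged per citation.
    for key in ("case", "fold"):
        if key not in plan.get("policy", {}):
            problems.append(f"policy_missing_{key}")
    return problems


policy = {"case": "lower", "fold": "ascii"}
split_plan = {
    "policy": policy,
    "routing": {
        "dense": {"use": True, "text": "q_hyde"},
        "sparse": {"use": True, "keywords": True, "operator": "OR"},
    },
}
unified_plan = {
    "policy": policy,
    "routing": {
        "dense": {"use": True, "text": "q_base"},
        "sparse": {"use": True, "keywords": True, "operator": "OR", "soft": "q_base"},
    },
}
print(check_plan(split_plan))    # ['hyde_one_branch_only', 'keywords_without_soft_clause']
print(check_plan(unified_plan))  # []
```

Logging the returned list next to the plan in the trace gives auditors the parity verdict without re-running the query.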

Minimal recipes

Python pseudo plan

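import unicodedata

def ascii_fold(text):
    # Decompose accented characters, then drop whatever is not ASCII.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def normalize_query(user_q, hyde=False, extract_kw=True):
    # generate_hyde and top_keywords are placeholders for your own HyDE and keyword steps.
    q_base = ascii_fold(user_q.lower())
    q_hyde = generate_hyde(q_base) if hyde else None
    kws = top_keywords(q_base) if extract_kw else []
    policy = {"case": "lower", "fold": "ascii", "stopwords": "en_smart", "stemming": "porter"}
    return {
        "q_base": q_base,
        "q_hyde": q_hyde,
        "keywords": kws,
        "policy": policy,
        "routing": {
            "dense": {"use": True, "text": q_hyde or q_base},
            "sparse": {"use": True, "keywords": bool(kws), "operator": "OR", "soft": q_base}
        }
    }

def run_branches(plan):
    # dense_retriever and bm25 are placeholders for your own retriever objects.
    sparse = plan["routing"]["sparse"]
    # Without extracted keywords, query sparse with the base text instead of an empty list.
    terms = plan["keywords"] if sparse["keywords"] else [plan["q_base"]]
    dense_hits = dense_retriever.invoke(plan["routing"]["dense"]["text"], k=20)
    sparse_hits = bm25(terms, operator=sparse["operator"], soft=sparse["soft"], k=50)
    return dense_hits, sparse_hits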

LCEL outline

# 1) normalize once
# 2) pass the same policy into both branches
# 3) fuse and rerank with deterministic tiebreak
qplan = normalize_query(q, hyde=True, extract_kw=True)
dense = dense_chain.invoke(qplan["routing"]["dense"]["text"])
sparse = bm25_chain.invoke({"keywords": qplan["keywords"], "soft": qplan["q_base"]})
fused = fuse_linear(project(dense), project(sparse), alpha=0.55, k=20)
fused = optional_rerank(fused)
validate_citations(fused, policy=qplan["policy"])

LlamaIndex outline

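plan = normalize_query(q, hyde=False, extract_kw=True)
dense = vector_index.as_retriever(similarity_top_k=20).retrieve(plan["routing"]["dense"]["text"])
# retrieve() takes a single query string; set similarity_top_k=50 when building bm25_retriever.
# BM25 has no soft clause here, so the base text is appended to the keywords as an approximation.
sparse = bm25_retriever.retrieve(" ".join(plan["keywords"] + [plan["q_base"]]))
fused = fuse_linear(project(dense), project(sparse), alpha=0.6, k=20)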
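
Both outlines call `project` and `fuse_linear` without defining them. One possible reading is sketched below: `project` min-max scales each branch into [0, 1], and `fuse_linear` takes a weighted sum with a deterministic tiebreak. The `(doc_id, score)` input shape, the min-max choice, and the rule that a document missing from one branch scores 0 there are all assumptions, not the author's implementation.

```python
def project(hits):
    """Min-max scale raw scores into [0, 1] so the two branches are comparable.

    hits is a list of (doc_id, score) pairs (assumed shape).
    """
    if not hits:
        return {}
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # If every score is identical, treat them all as 1.0
    return {doc_id: (s - lo) / span if span else 1.0 for doc_id, s in hits}


def fuse_linear(dense, sparse, alpha=0.55, k=20):
    """Score = alpha * dense + (1 - alpha) * sparse; a missing score counts as 0."""
    ids = set(dense) | set(sparse)
    fused = [(d, alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)) for d in ids]
    # Deterministic tiebreak: score descending, then doc id ascending
    fused.sort(key=lambda pair: (-pair[1], pair[0]))
    return fused[:k]


dense_hits = [("a", 1.0), ("b", 0.75), ("c", 0.5)]
sparse_hits = [("b", 12.0), ("d", 8.0), ("c", 4.0)]
ranked = fuse_linear(project(dense_hits), project(sparse_hits), alpha=0.5, k=4)
print([doc_id for doc_id, _ in ranked])  # ['b', 'a', 'd', 'c']
```

Sorting on `(-score, doc_id)` means two documents with equal fused scores always come out in the same order, which removes one source of order flips between runs.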

ΔS and λ probes for parsing split

Run with HyDE off and log ΔS and λ.
Run with HyDE on for both branches and log again.
Run with HyDE on for dense only and compare. If this variant is worse while single dense is fine, the split is confirmed.
Compute top-k overlap between branches. If it is below 10 percent and ΔS is flat, fix routing.

Helper → deltaS_probes.md
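
The overlap check in the last probe step can be sketched as follows. The page does not say how overlap is measured, so Jaccard overlap of doc ids is an assumption here, as are the function names and the `flat_tol` tolerance used to decide that ΔS is flat across the HyDE variants.

```python
def topk_overlap(dense_ids, sparse_ids, k=20):
    """Jaccard overlap of the top-k doc ids from each branch (assumed definition)."""
    a, b = set(dense_ids[:k]), set(sparse_ids[:k])
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def should_fix_routing(overlap, delta_s_runs, flat_tol=0.02):
    """True when overlap is under 10 percent and ΔS barely moves across runs."""
    flat = max(delta_s_runs) - min(delta_s_runs) <= flat_tol
    return overlap < 0.10 and flat


dense_ids = [f"d{i}" for i in range(1, 11)]            # d1 .. d10
sparse_ids = ["d10"] + [f"s{i}" for i in range(1, 10)]  # one shared id, nine others
overlap = topk_overlap(dense_ids, sparse_ids, k=10)
print(round(overlap, 3))                                # 0.053
print(should_fix_routing(overlap, [0.61, 0.62, 0.61]))  # True
```

One shared document out of 19 distinct ids is about 5 percent overlap, which together with a flat ΔS points at routing rather than at the index.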

Typical failures and exact fixes

Fused results worse than single: normalize the query, use the same rewrite for both branches, then fuse. Open: hybrid_retrieval.md
Sparse ignores important terms after rewrite: keep the base text as a soft clause with lower weight. Open: retrieval-playbook.md
Citations show analyzer mismatch: align the analyzer and restamp the index. Open: chunk_alignment.md
Order flips between runs: add cross encoder rerank and a deterministic tiebreak. Open: rerankers.md
High similarity but wrong meaning: rebuild with the correct metric and pooling. Open: embedding-vs-semantic.md

Evaluation checklist

Three paraphrases per question.
Run each in three modes: single dense, single sparse, and fused.
Record query object, policy, and routing.
Target improvement for fused vs best single: ΔS drop by at least 0.05 and coverage rise by at least 0.05.
Store the plan and results with a regression gate in CI.
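
The improvement target above fits in a single function that a CI job could assert on. This is a sketch under assumptions: each argument is a dict of metrics averaged over the three paraphrases, "best single" is taken to mean the branch with the lower ΔS, and the function name is hypothetical.

```python
def fused_beats_best_single(dense, sparse, fused, min_gain=0.05):
    """Check that fused improves on the best single retriever by min_gain.

    Each argument is a dict with 'delta_s' and 'coverage' keys (assumed shape).
    """
    # Best single retriever = the branch with the lower ΔS (assumption)
    best = min((dense, sparse), key=lambda m: m["delta_s"])
    # Round to avoid failing on float noise when a gain sits exactly at the threshold
    ds_drop = round(best["delta_s"] - fused["delta_s"], 6)
    cov_rise = round(fused["coverage"] - best["coverage"], 6)
    return ds_drop >= min_gain and cov_rise >= min_gain


dense_m = {"delta_s": 0.52, "coverage": 0.66}
sparse_m = {"delta_s": 0.58, "coverage": 0.60}
print(fused_beats_best_single(dense_m, sparse_m, {"delta_s": 0.44, "coverage": 0.74}))  # True
print(fused_beats_best_single(dense_m, sparse_m, {"delta_s": 0.50, "coverage": 0.70}))  # False
```

In a regression gate, the stored plan and metrics from the last passing run would feed this check, and the job would fail when it returns `False`.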

Copy paste validator prompt

You have TXTOS and the WFGY Problem Map loaded.

My issue is query parsing split. Current data:
- user question: "<text>"
- plan: {q_base, q_hyde, keywords, policy, routing}
- results: ΔS_dense=..., ΔS_sparse=..., ΔS_fused=..., topk_overlap=...

Return:
1) whether the plan keeps parity across dense and sparse,
2) the exact normalization and routing changes to try,
3) which fusion method and α or RRF k to use,
4) a JSON object to log in each citation row to keep audits stable.
