Is your feature request related to a problem? Please describe.
The SQLAlchemy dialect always generates ON syntax for joins. There is no way to generate USING (col) syntax, which is needed for correct FULL OUTER JOIN results in ClickHouse when joining on same-named columns.
With ON, ClickHouse keeps columns from both tables separate. For unmatched rows, the missing side gets default values (0 for integers, empty string for strings) instead of the actual value. With USING, ClickHouse merges same-named columns into one, returning the correct value from whichever side has data.
This can leads to silently wrong data when using FULL OUTER JOIN via the SQLAlchemy dialect.
Describe the solution you'd like
A way to generate USING syntax through the dialect, for example:
stmt = select(sales.c.product_id, sales.c.sold, returns.c.returned).join(
returns, using=["product_id"], full=True
)
Generating:
SELECT ... FROM test_sales FULL OUTER JOIN test_returns USING (product_id)
Describe alternatives you've considered
Using client.raw_query() with raw SQL instead of the SQLAlchemy dialect.
Additional context
Reproduction script (v0.10.0):
import clickhouse_connect
from sqlalchemy import create_engine, MetaData, Table, Column, select
from clickhouse_connect.cc_sqlalchemy.datatypes.sqltypes import UInt32 as ChUInt32
client = clickhouse_connect.get_client(host="localhost", port=8123, username="default", password="")
client.command("DROP TABLE IF EXISTS test_sales")
client.command("DROP TABLE IF EXISTS test_returns")
client.command("CREATE TABLE test_sales (product_id UInt32, sold UInt32) ENGINE = MergeTree() ORDER BY product_id")
client.command("CREATE TABLE test_returns (product_id UInt32, returned UInt32) ENGINE = MergeTree() ORDER BY product_id")
client.command("INSERT INTO test_sales VALUES (1, 10), (2, 20), (3, 30)")
client.command("INSERT INTO test_returns VALUES (2, 5), (3, 10), (4, 15)")
engine = create_engine("clickhousedb://default:@localhost:8123/default")
metadata = MetaData()
sales = Table("test_sales", metadata, Column("product_id", ChUInt32), Column("sold", ChUInt32))
returns = Table("test_returns", metadata, Column("product_id", ChUInt32), Column("returned", ChUInt32))
# SQLAlchemy generates ON - product_id=0 for unmatched row (wrong)
stmt = select(sales.c.product_id, sales.c.sold, returns.c.returned).join(
returns, sales.c.product_id == returns.c.product_id, full=True
)
print(f"SQL: {stmt.compile(engine)}")
# FULL OUTER JOIN ... ON `test_sales`.`product_id` = `test_returns`.`product_id`
with engine.connect() as conn:
print(conn.execute(stmt).fetchall())
# [(1, 10, 0), (2, 20, 5), (3, 30, 10), (0, 0, 15)] <-- last row: product_id=0 (wrong)
# Raw SQL with USING - product_id=4 for unmatched row (correct)
result = client.query("SELECT product_id, sold, returned FROM test_sales FULL OUTER JOIN test_returns USING (product_id)")
print(result.result_rows)
# [(1, 10, 0), (2, 20, 5), (3, 30, 10), (4, 0, 15)] <-- last row: product_id=4 (correct)
client.command("DROP TABLE test_sales")
client.command("DROP TABLE test_returns")
Is your feature request related to a problem? Please describe.
The SQLAlchemy dialect always generates
ONsyntax for joins. There is no way to generateUSING (col)syntax, which is needed for correct FULL OUTER JOIN results in ClickHouse when joining on same-named columns.With
ON, ClickHouse keeps columns from both tables separate. For unmatched rows, the missing side gets default values (0 for integers, empty string for strings) instead of the actual value. WithUSING, ClickHouse merges same-named columns into one, returning the correct value from whichever side has data.This can leads to silently wrong data when using FULL OUTER JOIN via the SQLAlchemy dialect.
Describe the solution you'd like
A way to generate
USINGsyntax through the dialect, for example:Generating:
Describe alternatives you've considered
Using
client.raw_query()with raw SQL instead of the SQLAlchemy dialect.Additional context
Reproduction script (v0.10.0):