26 posts tagged with "tech-blog"

Axiom: Composable Query Engines Built on Velox

· 8 min read
Masha Basmanova
Software Engineer @ Meta
Amit Dutta
Software Engineer @ Meta

Introduction

Axiom is a C++ library for building fully composable, high-performance query engines, built on top of Velox. Think of it as Lego for query processing — the pieces are compatible, but don't restrict how you put them together. Configure each layer, swap in your own components.

Today, users face significant friction moving between interactive queries, batch processing, streaming, and AI training data preparation — different engines, different semantics, different quirks. With Axiom, we can build all of these on a shared foundation, delivering consistent semantics across deployment modes.

The Current Landscape

Velox made execution composable — plugged into Presto via Prestissimo, into Spark via Gluten, into streaming and AI training workloads — and delivered 2-3x efficiency gains across the board. Velox also opened the door to GPU-accelerated query processing, with IBM and NVIDIA collaborating on GPU-native implementations of Velox operators for Prestissimo. But everything above execution remains monolithic and engine-specific.

Presto excels at interactive queries with streaming shuffle — optimized for latency. Spark excels at batch with durable shuffle — optimized for throughput. DuckDB provides a great single-node experience. But they all have different semantics. When a Presto query organically grows beyond Presto's limits, it must be rewritten in Spark SQL and migrated to Spark — a non-trivial, error-prone effort. Sapphire (Presto-on-Spark) exists specifically to work around this, running Presto SQL on Spark's durable execution, but it's inherently awkward — bolting one engine onto another.

Neither Presto nor Spark offers a practical local execution mode. Even basic SQL linting or running a query over in-memory VALUES requires going through the cluster — network calls, queuing, distributed execution. When iterating on analysis over the same data, trying to figure out the right way to slice and dice, every run goes through the cluster. Velox's AsyncDataCache can cache warm data in memory and SSD on Prestissimo workers, but that still requires distributed execution and depends on all cluster users working on the same warm data. There's no way to download your dataset locally, keep it in memory, and iterate fast on your own machine — DuckDB-style. Not possible with a monolithic distributed-only architecture.

Under the hood, the architectures are split along language boundaries. In Prestissimo, the Java coordinator handles SQL parsing, optimization, and orchestration of distributed execution, while the C++ worker handles local execution. They connect via RPC — relatively clean, but connectors are split: metadata in Java, data reading in C++. Constant folding is another pain point — it happens in the Java coordinator but must produce results compatible with Velox execution; this is currently solved via a C++ sidecar, which is clunky. In Gluten, Spark's Catalyst handles SQL parsing, the DataFrame API, optimization, and orchestration of distributed execution, while Velox handles local execution. They connect via JNI — more complex and harder to debug.

Today's split architectures: Java/Scala coordinators connected to C++ Velox workers via RPC or JNI.

Both coordinators are monolithic — you can't add a DataFrame API to Presto, run Spark SQL on Presto's coordinator, or run Presto SQL on Spark.

Why hasn't anyone built a unified C++ stack before? We believe the major blocker has been the lack of a reusable C++ query optimizer. Optimizers are extremely hard to build, even with the help of AI. Catalyst is the closest thing to a reusable optimizer and has been widely adopted, but it is written in Java. So when Velox came along, the natural path was to plug it into existing JVM coordinators — building a C++ optimizer from scratch wasn't a realistic alternative. Until now.

Velox solved composability for execution. The layers above execution need the same treatment. That's the gap Axiom aims to fill.

How Axiom Solves This

Axiom decomposes the query engine into a set of independent, reusable components — frontends, optimizer, runtime, connectors, execution — connected by stable, engine-agnostic APIs. Each component can be swapped, extended, or composed independently. The key APIs are the logical plan (between frontends and the optimizer) and the physical plan (between the optimizer and the runtime). Axiom builds on Velox's extensible type system and function registry — the same types and function implementations are used by the optimizer and by execution, so semantics are consistent by construction. This is what makes it possible for different engines to plug in their own frontends or target different deployment modes — without forking or rebuilding the rest.

Composability requires crisp contracts between components. Monolithic engines tend to treat these boundaries more fluidly, with code that sometimes crosses layers. Designing to strict contracts takes more discipline up front, but the resulting clarity pays off in extensibility and reuse.

Axiom's composable architecture: independent components connected by stable APIs.

The logical plan is a fully resolved, typed, dialect-agnostic representation of the query — the contract between frontends and the optimizer. It describes what the query does, not how to execute it. The logical plan has 13 node types; Velox has 32. For example, there is a single logical JoinNode that specifies what to join, but the optimizer decides how — whether it becomes a HashJoinNode, MergeJoinNode, or NestedLoopJoinNode in Velox. Any frontend that produces a valid logical plan can use the same optimizer and execution stack. This is what makes it practical to add a Spark SQL parser or a PySpark DataFrame API without touching the rest of the system.

The optimizer directly uses Velox's function registry and expression evaluator — constant folding uses the same code path as query execution, with no integration overhead. The optimizer can also fold constant scalar subqueries by executing them in-process. It knows which Velox operators exist and can target them directly — for example, using counting joins for EXCEPT ALL. No need for a 2700-line conversion layer to translate between plan representations that don't map one-to-one.

The optimizer produces a physical plan (MultiFragmentPlan) — a set of Velox plan fragments connected by shuffle boundaries. The deployment topology is an input to the optimizer — shuffle costs affect join ordering and other plan decisions.

The runtime is the component that enables different deployment modes — it takes the physical plan and executes it locally in a single process, distributes with streaming shuffle for interactive latency, or distributes with durable shuffle for batch throughput. Axiom ships with a local execution runtime (LocalRunner) and a CLI that serves as a reference implementation illustrating how the components fit together. Axel, a separate initiative, is building a Presto-like distributed runtime on top of Axiom.

Connectors are also unified. Planning-time logic (metadata, statistics, partition pruning) and execution-time logic (data reading) live in the same language, unlike Prestissimo and Gluten where they are split across Java and C++.

Where We Are

Building a composable query processing stack is a long journey. Velox was a big step forward, but there are still many steps to take. We are still early, but the foundation is in place and the results so far are encouraging.

All TPC-H queries work end-to-end, producing optimal plans. We are actively migrating production workloads at Meta — the optimizer and execution stack are handling real queries at scale, not just benchmarks.

Axiom includes an interactive CLI that lets developers and users run SQL locally against TPC-H, Hive, or in-memory data. This has become a popular tool for rapid prototyping and testing — users love sub-second SQL linting and validation without waiting for a cluster round-trip.

We are also extending Presto SQL with user-friendly features inspired by DuckDB: FROM-first syntax, trailing commas, SELECT * EXCLUDE/REPLACE, and more (see Presto SQL Extensions).

The composable architecture is proving itself beyond the original use case. The graph query team built a Cypher frontend that produces Axiom's logical plan and leverages the optimizer's cost-based join reordering for graph pattern matching — demonstrating that the logical plan and optimizer work for non-SQL languages. Both the streaming and AI training data processing teams adopted Axiom's Presto SQL parser to offer a SQL interface alongside their existing Python APIs.

Get Involved

Velox made execution composable and delivered massive efficiency gains. Axiom extends that same vision to the full query processing stack. The pieces are there — come help us put them together.

We want to partner with the community to tackle what's next. A Spark SQL parser and PySpark DataFrame API would close the gap between Presto and Spark users. There is room for more Friendly SQL features. The optimizer can become more extensible and add support for index and merge joins. Axel will benefit from durable shuffle to enable batch workloads with the fault tolerance that Spark users expect. And the connector ecosystem is ready to grow.

From flaky Axiom CI to a Velox bug fix: a cross-repo debugging story

· 9 min read
Masha Basmanova
Software Engineer @ Meta

TL;DR

When adding macOS CI to Axiom, set operation tests kept failing intermittently — but only in macOS debug CI. Linux CI (debug and release) passed consistently. Local runs always passed. The root cause turned out to be a bug in Velox — a dependency managed as a Git submodule. This post describes the process of debugging a CI-only failure when the bug lives in a different repository.

Adaptive Per-Function CPU Time Tracking

· 6 min read
Rajeev Singh
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

Context

Velox evaluates SQL expressions as trees of functions. A query like if(array_gte(a, b), multiply(x, y), 0) compiles into a tree where each node processes an entire vector of rows at a time. When a query runs slowly, the first question usually is: which function is consuming the most CPU? Is it the expensive array comparison, or the cheap arithmetic called millions of times? This problem is even more prominent in use cases like training data loading, when very long and deeply nested expression trees are common, and jobs may run for many hours, or days; in such cases, the CPU usage of even seemingly short-lived functions may add up to substantial overhead. Without a detailed per-function CPU usage breakdown, you may be left guessing — or worse, optimizing the wrong thing.

Accelerating Unicode string processing with SIMD in Velox

· 8 min read
Ping Liu
Software Engineer
Yuhta
Software Engineer @ Meta
Masha Basmanova
Software Engineer @ Meta

TL;DR

We optimized two Unicode string helpers — cappedLengthUnicode and cappedByteLengthUnicode — by replacing byte-by-byte utf8proc_char_length calls with a SIMD-based scanning loop. The new implementation processes register-width blocks at a time: pure-ASCII blocks skip in one step, while mixed blocks use bitmask arithmetic to count character starts. Both helpers now share a single parameterized template, eliminating code duplication.

On a comprehensive benchmark matrix covering string lengths from 4 to 1024 bytes and ASCII ratios from 0% to 100%, we measured 2–15× speedups across most configurations, with no regressions on Unicode-heavy inputs. The optimization benefits all callers of these helpers, including the Iceberg truncate transform and various string functions.

The hidden traps of regex in LIKE and split

· 6 min read
Masha Basmanova
Software Engineer @ Meta

SQL functions sometimes use regular expressions under the hood in ways that surprise users. Two common examples are the LIKE operator and Spark's split function.

In Presto, split takes a literal string delimiter and regexp_split is a separate function for regex-based splitting. Spark's split, however, always treats the delimiter as a regular expression.

Both LIKE and Spark's split can silently produce wrong results and waste CPU when used with column values instead of constants. Understanding why this happens helps write faster, more correct queries — and helps engine developers make better design choices.

velox::StringView API Changes and Best Practices

· 5 min read
Pedro Pedreira
Software Engineer @ Meta

Context

Strings are ubiquitous in large-scale analytic query processing. From identifiers, names, labels, and structured data (like JSON or XML) to free-form descriptive text, such as a product description or the contents of this very blog post, there is hardly a SQL query that does not require the manipulation of string data.

This post describes in more detail how Velox handles columns of strings, the low-level C++ APIs involved and some recent changes made to them, and presents best practices for string usage throughout Velox's codebase.

Task Barrier: Efficient Task Reuse and Streaming Checkpoints in Velox

· 4 min read
Xiaoxuan Meng
Software Engineer @ Meta
Yuhta
Software Engineer @ Meta
Masha Basmanova
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

TL;DR

Velox Task Barriers provide a synchronization mechanism that not only enables efficient task reuse, important for workloads such as AI training data loading, but also delivers the strict sequencing and checkpointing semantics required for streaming workloads.

By injecting a barrier split, users guarantee that no subsequent data is processed until the entire DAG is flushed and the synchronization signal is unblocked. This capability serves two critical patterns:

  1. Task Reuse: Eliminates the overhead of repeated task initialization and teardown by safely reconfiguring warm tasks for new queries. This is a recurring pattern in AI training data loading workloads.

  2. Streaming Processing: Enables continuous data handling with consistent checkpoints, allowing stateful operators to maintain context across batches without service interruption.

See the Task Barrier Developer Guide for implementation details.

Why Sort is row-based in Velox — A Quantitative Assessment

· 8 min read
Meng Duan (macduan)
Software Engineer @ ByteDance
Xiaoxuan Meng
Software Engineer @ Meta

TL;DR

Velox is a fully vectorized execution engine[1]. Its internal columnar memory layout enhances cache locality, exposes more inter-instruction parallelism to CPUs, and enables the use of SIMD instructions, significantly accelerating large-scale query processing.

However, some operators in Velox utilize a hybrid layout, where datasets can be temporarily converted to a row-oriented format. The OrderBy operator is one example, where our implementation first materializes the input vectors into rows, containing both sort keys and payload columns, sorts them, and converts the rows back to vectors.

In this article, we explain the rationale behind this design decision and provide experimental evidence supporting it. We show a prototype of a hybrid sorting strategy that materializes only the sort-key columns, reducing the overhead of materializing payload columns. Contrary to expectations, the end-to-end performance did not improve—in fact, it was even slower in some cases. We present the two variants and discuss why one is counter-intuitively faster than the other.

Multi-Round Lazy Start Merge

· 6 min read
Meng Duan (macduan)
Software Engineer @ ByteDance
Xiaoxuan Meng
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

Background

Efficiently merging sorted data partitions at scale is crucial for a variety of training data preparation workloads, especially for Generative Recommenders (GRs), a new paradigm introduced in the paper Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. A key requirement is to merge training data across partitions—for example, merging hourly partitions into daily ones—while ensuring that all rows sharing the same primary key are stored consecutively. Training data is typically partitioned and bucketed by primary key, with rows sharing the same key stored consecutively within each partition, so merging across partitions essentially becomes a multi-way merge problem.

Normally, Apache Spark can be used for this sort-merge requirement — for example, via CLUSTER BY. However, training datasets for a single job can often reach the PB scale, which in turn generates shuffle data at PB scale. Because we typically apply bucketing and ordering by key when preparing training data in production, Spark can eliminate the shuffle when merging training data from multiple hourly partitions. However, each Spark task can only read the files planned from various partitions within a split sequentially, placing them into the sorter and spilling as needed. Only after all files have been read does Spark perform a sort-merge of the spilled files. This process produces a large number of small spill files, which further degrades efficiency.