# NTT - Number Theoretic Transform

The Number Theoretic Transform (NTT) is a variant of the Fourier Transform used over finite fields, particularly those of integers modulo a prime number. NTT operates in a discrete domain and is used primarily in applications requiring modular arithmetic, such as cryptography and polynomial multiplication.

NTT is defined similarly to the Discrete Fourier Transform (DFT), but instead of using complex roots of unity, it uses roots of unity within a finite field. The definition hinges on the properties of the finite field, specifically the existence of a primitive root of unity of order $N$ (where $N$ is typically a power of 2), and the modulo operation is performed with respect to a specific prime number that supports these roots.

Formally, given a sequence of integers $a_0, a_1, ..., a_{N-1}$, the NTT of this sequence is another sequence of integers $A_0, A_1, ..., A_{N-1}$, computed as follows:

where:

- $N$ is the size of the input sequence and is a power of 2,
- $p$ is a prime number such that $p = kN + 1$ for some integer $k$, ensuring that $p$ supports the existence of $N$th roots of unity,
- $\omega$ is a primitive $N$th root of unity modulo $p$, meaning $\omega^N \equiv 1 \mod p$ and no smaller positive power of $\omega$ is congruent to 1 modulo $p$,
- $k$ ranges from 0 to $N-1$, and it indexes the output sequence.

NTT is particularly useful because it enables efficient polynomial multiplication under modulo arithmetic, crucial for algorithms in cryptographic protocols and other areas requiring fast modular arithmetic operations.

There exists also INTT which is the inverse operation of NTT. INTT can take as input an output sequence of integers from an NTT and reconstruct the original sequence.

## Using NTTβ

### Supported Bindingsβ

### Examplesβ

### Orderingβ

The `Ordering`

enum defines how inputs and outputs are arranged for the NTT operation, offering flexibility in handling data according to different algorithmic needs or compatibility requirements. It primarily affects the sequencing of data points for the transform, which can influence both performance and the compatibility with certain algorithmic approaches. The available ordering options are:

Both inputs and outputs are in their natural order. This is the simplest form of ordering, where data is processed in the sequence it is given, without any rearrangement.`kNN`

(Natural-Natural):Inputs are in natural order, while outputs are in bit-reversed order. This ordering is typically used in algorithms that benefit from having the output in a bit-reversed pattern.`kNR`

(Natural-Reversed):Inputs are in bit-reversed order, and outputs are in natural order. This is often used with the Cooley-Tukey FFT algorithm.`kRN`

(Reversed-Natural):Both inputs and outputs are in bit-reversed order.`kRR`

(Reversed-Reversed):Inputs are provided in their natural order, while outputs are arranged in a digit-reversed (mixed) order. This ordering is good for mixed radix NTT operations, where the mixed or digit-reversed ordering of outputs is a generalization of the bit-reversal pattern seen in simpler, radix-2 cases.`kNM`

(Natural-Mixed):Inputs are in a digit-reversed (mixed) order, while outputs are restored to their natural order. This ordering would primarily be used for mixed radix NTT`kMN`

(Mixed-Natural):

Choosing an algorithm is heavily dependent on your use case. For example Cooley-Tukey will often use `kRN`

and Gentleman-Sande often uses `kNR`

.

### Modesβ

NTT also supports two different modes `Batch NTT`

and `Single NTT`

Deciding weather to use `batch NTT`

vs `single NTT`

is highly dependent on your application and use case.

#### Single NTTβ

Single NTT will launch a single NTT computation.

Choose this mode when your application requires processing individual NTT operations in isolation.

#### Batch NTT Modeβ

Batch NTT allows you to run many NTTs with a single API call. Batch NTT mode can significantly reduce read/write times as well as computation overhead by executing multiple NTT operations in parallel. Batch mode may also offer better utilization of computational resources (memory and compute).

## Supported algorithmsβ

Our NTT implementation supports two algorithms `radix-2`

and `mixed-radix`

.

### Radix 2β

At its core, the Radix-2 NTT algorithm divides the problem into smaller sub-problems, leveraging the properties of "divide and conquer" to reduce the overall computational complexity. The algorithm operates on sequences whose lengths are powers of two.

**Input Preparation:**The input is a sequence of integers $a_0, a_1, \ldots, a_{N-1}, \text{ where } N$ is a power of two.**Recursive Decomposition:**The algorithm recursively divides the input sequence into smaller sequences. At each step, it separates the sequence into even-indexed and odd-indexed elements, forming two subsequences that are then processed independently.**Butterfly Operations:**The core computational element of the Radix-2 NTT is the "butterfly" operation, which combines pairs of elements from the sequences obtained in the decomposition step.Each butterfly operation involves multiplication by a "twiddle factor," which is a root of unity in the finite field, and addition or subtraction of the results, all performed modulo the prime modulus.

$X_k = (A_k + B_k \cdot W^k) \mod p$$X_k$ - The output of the butterfly operation for the $k$-th element

$A_k$ - an element from the even-indexed subset

$B_k$ - an element from the odd-indexed subset

$p$ - prime modulus

$k$ - The index of the current operation within the butterfly or the transform stage

The twiddle factors are precomputed to save runtime and improve performance.

**Bit-Reversal Permutation:**A final step involves rearranging the output sequence into the correct order. Due to the halving process in the decomposition steps, the elements of the transformed sequence are initially in a bit-reversed order. A bit-reversal permutation is applied to obtain the final sequence in natural order.

### Mixed Radixβ

The Mixed Radix NTT algorithm extends the concepts of the Radix-2 algorithm by allowing the decomposition of the input sequence based on various factors of its length. Specifically ICICLEs implementation splits the input into blocks of sizes 16, 32, or 64 compared to radix2 which is always splitting such that we end with NTT of size 2. This approach offers enhanced flexibility and efficiency, especially for input sizes that are composite numbers, by leveraging the "divide and conquer" strategy across multiple radices.

The NTT blocks in Mixed Radix are implemented more efficiently based on winograd NTT but also optimized memory and register usage is better compared to Radix-2.

Mixed Radix can reduce the number of stages required to compute for large inputs.

**Input Preparation:**The input to the Mixed Radix NTT is a sequence of integers $a_0, a_1, \ldots, a_{N-1}$, where $N$ is not strictly required to be a power of two. Instead, $N$ can be any composite number, ideally factorized into primes or powers of primes.**Factorization and Decomposition:**Unlike the Radix-2 algorithm, which strictly divides the computational problem into halves, the Mixed Radix NTT algorithm implements a flexible decomposition approach which isn't limited to prime factorization.For example, an NTT of size 256 can be decomposed into two stages of $16 \times \text{NTT}_{16}$, leveraging a composite factorization strategy rather than decomposing into eight stages of $\text{NTT}_{2}$. This exemplifies the use of composite factors (in this case, $256 = 16 \times 16$) to apply smaller NTT transforms, optimizing computational efficiency by adapting the decomposition strategy to the specific structure of $N$.

**Butterfly Operations with Multiple Radices:**The Mixed Radix algorithm utilizes butterfly operations for various radix sizes. Each sub-transform involves specific butterfly operations characterized by multiplication with twiddle factors appropriate for the radix in question.The generalized butterfly operation for a radix-$r$ element can be expressed as:

$X_{k,r} = \sum_{j=0}^{r-1} (A_{j,k} \cdot W^{jk}) \mod p$where:

$X_{k,r}$ - is the output of the $radix-r$ butterfly operation for the $k-th$ set of inputs

$A_{j,k}$ - represents the $j-th$ input element for the $k-th$ operation

$W$ - is the twiddle factor

$p$ - is the prime modulus

**Recombination and Reordering:**After applying the appropriate butterfly operations across all decomposition levels, the Mixed Radix algorithm recombines the results into a single output sequence. Due to the varied sizes of the sub-transforms, a more complex reordering process may be required compared to Radix-2. This involves digit-reversal permutations to ensure that the final output sequence is correctly ordered.

### Which algorithm should I choose ?β

Both work only on inputs of power of 2 (e.g., 256, 512, 1024).

Radix 2 is faster for small NTTs. A small NTT would be around logN = 16 and batch size 1. Radix 2 won't necessarily perform better for smaller `logn`

with larger batches.

Mixed radix on the other hand works better for larger NTTs with larger input sizes.

Performance really depends on logn size, batch size, ordering, inverse, coset, coeff-field and which GPU you are using.

For this reason we implemented our heuristic auto-selection which should choose the most efficient algorithm in most cases.

We still recommend you benchmark for your specific use case if you think a different configuration would yield better results.