Montgomery’s Multiplication Technique: How to make it Smaller and Faster : Montgomery’s Multiplication Technique: How to make it Smaller and Faster Colin D. Walter
Computation Department, UMIST, UK
www.co.umist.ac.uk
Peter Montgomery : Peter Montgomery “Modular Multiplication without Trial Division” Math. Computation, vol. 44 (1985) 519-521
(A B) mod M without obtaining digits q (A B) / M
Motivation : Motivation Faster RSA Cryptosystem
through pipelined array
Safer encryption
against timing or DPA attacks
Overview : Overview RSA & Notation
Classical Algorithm
Montgomery’s Version
Comparison
carry propagation
digit distribution
communication
timing/power attacks
Conclusion
Enigma : Enigma Special Purpose Colossus (1943-44)
Tommy Flowers,
Bletchley Park, England.
General Purpose ENIAC (1943-46)
John Eckert & John Mauchly
Philadelphia, US.
RSA : RSA Modulus M of around 1024 bits
Two keys d and e such that Ade A mod M
A encrypted to C = Ae mod M
C decrypted by A = Cd mod M
M = PQ, a product of two large primes
e is often small (e.g. a Fermat prime)
d satisfies de 1 mod (P–1)(Q–1)
Faster H/W = More Secure Encryption : Faster H/W = More Secure Encryption Work to factorize M doubles for every extra ~15 bits (for key lengths ~210 bits)
Work to en/decrypt :
((1024+15)/1024)2 per multiplication
((1024+15)/1024)3 per exponentiation,
i.e. only 5% extra!
Number representations : Number representations X = i=0 xiri
r = 2k is the radix (prime to M)
xi is the ith digit (usually 0 xi < r)
n max no. of digits in any number
Redundant reps:
wider digit range than 0 .. r1
H/W is built from kk-bit multipliers
n fixed by H/W register size
n–1
Redundancy : Redundancy Digits xj split into carry-save parts: xj = xj,s + rxj,c
X X+Y
is performed by digit-parallel addition:
xj xj,s + xj1,c + yj
No carry propagation: only old carries on right side
Multiplication AB : Multiplication AB Use n digit multipliers to form aiB and add to a partial product P:
P := 0 ;
For i := n1 downto 0 do
P := rP + aiB
{ Post-condition: P = AB }
Slide11 : Either: Use redundancy in P and parallel digit addition to add aiB in one clock cycle
Cell j computes aibj in cycle i
P := P + aiB (digit-parallel) P in Carry-Save form: pj = pj,s + rpj,c cell
j cell
j-1 cell
j+1 pj,c pj+1,c pj+1,s pj,s pj-1,c pj-1,s pj+1,s pj,c pj,s pj-1,c pj-1,s bj+1 bj bj-1 ai ai
Slide12 : or: Pipeline the addition of aiB over n cycles and propagate carries with no redundancy:
Cell j computes aibj in cycle i+j
carry P := P + aiB (digit-serial) time
j+1 time
j time
j-1 carry carry carry pj+1 pj pj-1 pj+1 bj+1 pj bj pj-1 bj-1 ai ai ai ai
Multiplier Complexity : Multiplier Complexity Assume wires take area but not time (or power).
AreaTime2 complexity for un-pipelined k-bit multiplication is bounded below by k2
This can be achieved for time in [log k ..k ]
Discrete Fourier Transform has large constants for time and area.
Better, but asymptotically poorer designs for k expected here.
Slide14 : Cross-over point ?
107 transistors available for RSA
k 64 to accommodate aiB
Speed by using at least n multipliers to perform a full length aiB (or equivalent) in one cycle.
Real-Time ? : Real-Time ? Assume :
bus is one k-bit digit per cycle
k-bit multiplier operates in one cycle
Then :
AB takes n cycles using n multipliers
Throughput is one digit per cycle for multn.
Need O(nk) multiplications for decryption
Conclude :
Need O(nk) rows of n multipliers.
Classical Mod Multn Algorithm : Classical Mod Multn Algorithm { Pre-condition: 0 A < rn }
P := 0 ;
For i := n1 downto 0 do
Begin
P := r×P + ai×B ;
qi := P div M ;
P := P qi×M ;
End
{ Post-conditions: P = A×B Q×M,
P (A×B) mod M }
Comments : Carry propagation a problem
(it slows finding q) ;
Use only top digits of M and P to determine a good multiple of M to remove ;
P is bounded by small multiple of M ;
Clean up only at end ;
Critical path is finding q.
Comments
Disadvantages : Disadvantages Redundant rep. for digit-parallel operation
Global broadcast of q to each digit position
Montgomery’s Mod Multn Algm : Montgomery’s Mod Multn Algm { Pre-condition: 0 A < rn }
P := 0 ;
For i := 0 to n1 do
Begin
qi := (p0+aib0)(-m0-1) mod r ;
P := (P + ai×B + qi×M) div r ;
{ Invariant: 0 P < M+B }
End ;
{ Post-condition: Prn = A×B + Q×M ,
P (ABr–n) mod M }
Peter Montgomery : : Peter Montgomery : reverses the multiplication order
chooses digits from least to most significant
shifts down on each iteration.
uses the least significant digits to determine multiple of M to subtract.
Computes (AB×r–n) mod M
Slide21 : The factor r–n is cleared up in post-processing
Any extra multiple of M is removed then
qi has no carries to wait for
Pipelining of the digits can now take place :
compute aibj+1 on the cycle after aibj
use a non-redundant representation
no broadcasting of qi
The Post-Condition : The Post-Condition m01 exists
qi chosen so division by r is exact
Define Ai = j=0 ajrj and Qi analogously
Then Ai = Ai1+riai and An = A
So ri+1P = Ai×B + Qi×M at end of ith iteration
Hence rnP = A×B + Q×M at end.
i
The Bounds : The Bounds A converted on-line to non-redundant form
Can assume ai r1
So loop invariant P < M+B
Slide24 : If critical path length is computing q:
Scale M to ensure (m01) mod r = 1
Shift B up to make b0 = 0
Result:
qi = p0 mod r is simple
Critical path in repeated cell.
Cost:
Increase n by 2
Removing rn : Removing rn The Montgomery class of A is
A rnA mod M
Montgomery modr multn is denoted .
Montgomery product of A and B is
A B A B rn ABrn AB mod M.
Applying to A instead of to A produces
Ae in an expn algorithm _ _ _ _ _ _ _ _ _ _ _ ___ ___
Encryption Process : Process: A A Ae Ae
Precompute: R2 = rn r2n mod M
Start with: A R2 Arn A mod M
Exponentiate to obtain Ae
End with: Ae 1 Ae mod M
_ _ _ __ __ __ __ _ Encryption Process
2M Bound : Outputs are re-used as inputs.
So need to bound I/O:
Suppose an1 = 0
Then P < M+B at end of loop n2
yields P < M+r1B at very end.
e.g. If B < 2M then P < 2M
2M Bound
Slide28 : Suppose 2rM < rn, A < 2M and R2 < 2M
Then A < 2M, Ae < 2M and P = Ae 1 < 2M
Final output P satisfies
Prn = Ae + QM where Q rn1.
Here Ae < 2M yields Prn < (rn+1)M So P M
P = M Ae 0 mod M A 0 mod M
A = M should never arise; A = 0 yields P = 0.
So no final modular adjustment is necessary.
_ __ __ __ __ __ _
Digit-Parallel Implementation : Digit-Parallel Implementation Classical vs Montgomery:
Similarities:
Broadcasting of qi and ai
Redundant representations
Computing qi takes time
Differences:
Bits to determine qi
Slide30 : ai, qi P := P + aiB (digit-parallel, not modular) cell
j+1 P in Carry-Save form: pj = pj,s + rpj,c cell
j cell
j-1 ai, qi mj
bj mj+1
bj+1 mj-1
bj-1 pj+1,c pj,c pj-1,c pj+1,s pj-1,s pj,s pj-1,c pj-1,s pj,s pj,c pj+1,s
Slide31 : j j+1 j-1 n-1 Digit-Parallel P := rP + aiB - qiM (Classical) pj-1,c pj-1,s pj,s pj,c pj+1,c pj+1,s pn-2,s pn-3,c pj-2,s pj-3,c pj,c mn-1 bn-1 mj+1 bj+1 mj-1 bj-1 mj bj ai
qi ai
qi qi qi+1 pj-2,c
Slide32 : (Montgomery) j+1 j j-1 0 Data Flow for P(i+1) := (P(i) + aiB + qiM)/r pj-1 (n) pj-1 pj-1 pj pj pj+1 pj pj-2 pj-2 p0 (n) (n) (i) (i) (i) (i) (i+1) (i+1) (i+1) ci,j+2 ci,j+1 ci,j ci,j-1 ci,1 ai ai ai ai ai ai qi qi qi qi qi mi mi+1 mi-1 m0 b0 bj-1 bj+1 bj
Systolic Array (Montgomery) : Systolic Array (Montgomery) Write ith value of P as P(i) = j=0 p(i1) r j
Cells in col j compute p(i)j at time 2i+j :
p(i)j + rc(i)j p(i1)j+1 + c(i)j1 + aibj + qimj
Cells in col 0 compute qi at time 2i :
qi (p(i1)1+ai×b0)(m01) mod r
Any number of rows may be constructed
Different timing schedules are possible
n1
Slide34 : cell
i,j+1 cell
i,j cell
i,j-1 cell
i+1,j+1 cell
i+1,j cell
i+1,j-1 carry carry carry carry carry carry carry carry ai ai ai ai ai+1 ai+1 ai+1 ai+1 qi+1 qi+1 qi+1 qi+1 qi qi qi qi mj bj mj+1 bj+1 mj-1 bj-1 mj-1 bj-1 mj-1 bj-1 mj bj mj bj mj+1 bj+1 mj+1 bj+1 p(i+1) p(i+1) p(i+1) p(i+1) p(i) p(i) p(i) p(i) p(i+2) p(i+2) p(i+2) p(i+2) j-1 j-1 j-1 j-2 j-2 j-2 j+1 j+1 j+1 j j j Systolic Array for P := (AB + QM)r-n
Slide35 : j+1 j j-1 0 Data Flow for P(i+1) := (P(i) + aiB + qiM)/r pj-1 (n) pj-1 pj-1 pj pj pj+1 pj pj-2 pj-2 p0 (n) (n) (i) (i) (i) (i) (i+1) (i+1) (i+1) ci,j+2 ci,j+1 ci,j ci,j-1 ci,1 ai ai ai ai ai ai qi qi qi qi qi mi mi+1 mi-1 m0 b0 bj-1 bj+1 bj
Digit-Serial Implementation (Montgomery) : Digit-Serial Implementation (Montgomery) Advantages:
Local communication
Shorter critical path
Critical path easily in repeated cell
Non-redundant representation
Digit serial I/O
Different digits qi and ai re DPA
Digit-Serial Implementation (Montgomery) : Disadvantage:
H/W only half used
Solutions:
Interleave two multiplications
E.g. configure exponentiation 75% use
Group digits as per Peter Kornerup [94]
Digit-Serial Implementation (Montgomery)
Slide38 : Other cell boundaries/groupings are possible;
Timing front angles in the data dependency graph can be altered;
For current speed of array implementations see Blum and Paar [99];
Vuillemin et al. [97] constructed an array;
Design is parametrised: by k and no. of rows.
Data Dependency Diagrams : Data Dependency Diagrams
Data Dependency Diagrams : Data Dependency Diagrams Parallel Digit Implementation t = 0 t = 1 t = 2 t = 3
Data Dependency Diagrams : Data Dependency Diagrams Walter [93] t=4 t=5 t=3 t=6 t=2 t=1 t=0 t=7 1 tick 2 ticks ...
Data Dependency Diagrams : Data Dependency Diagrams Kornerup [94] t=0 t=1 t=2 t=3 t=4 t=5 t=6
Data Integrity : Data Integrity P = A×B Q×M or Prn = A×B + Q×M
These are easily checked mod m.
e.g. m a prime just above the maximum cell output.
Cost: ~ one cell in the array i.e. increasing n by 1.
On error, abort or re-compute by another route:
e.g. M replaced by dM for a digit d prime to r.
Timing & Power Attacks : Timing & Power Attacks Most attacks which succeed on the classical algorithm have equivalents which will succeed on corresponding implementation of Montgomery’s algorithm.
With parallel digit processing, the same digits of A and Q are used in every digit slice in the same cycle. So DPA might reveal them.
Pipelined version has no equivalent (see data dependency graph). It uses many different digits of A and Q in each cycle. DPA is more difficult.
Conclusions : Conclusions For single k-bit multiplier or array of n parallel cells, classical and Montgomery algorithms are almost equal.
For pipelined array, Montgomery method has advantages: smaller time & area constants, better I/O, better against DPA;
Pipeline is more complex for 100% use, but faster clock.
Parameters can be chosen for specific purposes.
Slide46 : Go forth and Multiply