CDW Ches99 Talk

Uploaded from authorPOINTLite
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Montgomery’s Multiplication Technique: How to make it Smaller and Faster: 

Montgomery’s Multiplication Technique: How to make it Smaller and Faster Colin D. Walter Computation Department, UMIST, UK www.co.umist.ac.uk

Peter Montgomery: 

Peter Montgomery “Modular Multiplication without Trial Division” Math. Computation, vol. 44 (1985) 519-521 (A  B) mod M without obtaining digits q  (A  B) / M

Motivation: 

Motivation Faster RSA Cryptosystem  through pipelined array Safer encryption  against timing or DPA attacks

Overview: 

Overview RSA & Notation Classical Algorithm Montgomery’s Version Comparison carry propagation digit distribution communication timing/power attacks Conclusion

Enigma: 

Enigma Special Purpose Colossus (1943-44) Tommy Flowers, Bletchley Park, England. General Purpose ENIAC (1943-46) John Eckert & John Mauchly Philadelphia, US.

RSA: 

RSA Modulus M of around 1024 bits Two keys d and e such that Ade  A mod M A encrypted to C = Ae mod M C decrypted by A = Cd mod M M = PQ, a product of two large primes e is often small (e.g. a Fermat prime) d satisfies de  1 mod (P–1)(Q–1)

Faster H/W = More Secure Encryption : 

Faster H/W = More Secure Encryption Work to factorize M doubles for every extra ~15 bits (for key lengths ~210 bits) Work to en/decrypt : ((1024+15)/1024)2 per multiplication ((1024+15)/1024)3 per exponentiation, i.e. only 5% extra!

Number representations: 

Number representations X = i=0 xiri r = 2k is the radix (prime to M) xi is the ith digit (usually 0  xi < r) n  max no. of digits in any number Redundant reps: wider digit range than 0 .. r1 H/W is built from kk-bit multipliers n fixed by H/W register size n–1

Redundancy: 

Redundancy Digits xj split into carry-save parts: xj = xj,s + rxj,c X  X+Y is performed by digit-parallel addition: xj  xj,s + xj1,c + yj No carry propagation: only old carries on right side

Multiplication AB : 

Multiplication AB Use n digit multipliers to form aiB and add to a partial product P: P := 0 ; For i := n1 downto 0 do P := rP + aiB { Post-condition: P = AB }

Slide11: 

Either: Use redundancy in P and parallel digit addition to add aiB in one clock cycle Cell j computes aibj in cycle i P := P + aiB (digit-parallel) P in Carry-Save form: pj = pj,s + rpj,c cell j cell j-1 cell j+1 pj,c pj+1,c pj+1,s pj,s pj-1,c pj-1,s pj+1,s pj,c pj,s pj-1,c pj-1,s bj+1 bj bj-1 ai ai

Slide12: 

or: Pipeline the addition of aiB over n cycles and propagate carries with no redundancy: Cell j computes aibj in cycle i+j carry P := P + aiB (digit-serial) time j+1 time j time j-1 carry carry carry pj+1 pj pj-1 pj+1 bj+1 pj bj pj-1 bj-1 ai ai ai ai

Multiplier Complexity: 

Multiplier Complexity Assume wires take area but not time (or power). AreaTime2 complexity for un-pipelined k-bit multiplication is bounded below by k2 This can be achieved for time in [log k ..k ] Discrete Fourier Transform has large constants for time and area. Better, but asymptotically poorer designs for k expected here.

Slide14: 

Cross-over point ? 107 transistors available for RSA  k  64 to accommodate aiB Speed by using at least n multipliers to perform a full length aiB (or equivalent) in one cycle.

Real-Time ?: 

Real-Time ? Assume : bus is one k-bit digit per cycle k-bit multiplier operates in one cycle Then : AB takes n cycles using n multipliers Throughput is one digit per cycle for multn. Need O(nk) multiplications for decryption Conclude : Need O(nk) rows of n multipliers.

Classical Mod Multn Algorithm: 

Classical Mod Multn Algorithm { Pre-condition: 0  A < rn } P := 0 ; For i := n1 downto 0 do Begin P := r×P + ai×B ; qi := P div M ; P := P  qi×M ; End { Post-conditions: P = A×B  Q×M, P  (A×B) mod M }

Comments: 

Carry propagation a problem (it slows finding q) ; Use only top digits of M and P to determine a good multiple of M to remove ; P is bounded by small multiple of M ; Clean up only at end ; Critical path is finding q. Comments

Disadvantages: 

Disadvantages Redundant rep. for digit-parallel operation Global broadcast of q to each digit position

Montgomery’s Mod Multn Algm: 

Montgomery’s Mod Multn Algm { Pre-condition: 0  A < rn } P := 0 ; For i := 0 to n1 do Begin qi := (p0+aib0)(-m0-1) mod r ; P := (P + ai×B + qi×M) div r ; { Invariant: 0  P < M+B } End ; { Post-condition: Prn = A×B + Q×M , P  (ABr–n) mod M }

Peter Montgomery :: 

Peter Montgomery : reverses the multiplication order chooses digits from least to most significant shifts down on each iteration. uses the least significant digits to determine multiple of M to subtract. Computes (AB×r–n) mod M

Slide21: 

The factor r–n is cleared up in post-processing Any extra multiple of M is removed then qi has no carries to wait for Pipelining of the digits can now take place : compute aibj+1 on the cycle after aibj use a non-redundant representation no broadcasting of qi

The Post-Condition: 

The Post-Condition m01 exists qi chosen so division by r is exact Define Ai =  j=0 ajrj and Qi analogously Then Ai = Ai1+riai and An = A So ri+1P = Ai×B + Qi×M at end of ith iteration Hence rnP = A×B + Q×M at end. i

The Bounds: 

The Bounds A converted on-line to non-redundant form Can assume ai  r1 So loop invariant P < M+B

Slide24: 

If critical path length is computing q: Scale M to ensure (m01) mod r = 1 Shift B up to make b0 = 0 Result: qi = p0 mod r is simple Critical path in repeated cell. Cost: Increase n by 2

Removing rn: 

Removing rn The Montgomery class of A is A  rnA mod M Montgomery modr multn is denoted  . Montgomery product of A and B is A  B  A B rn  ABrn  AB mod M. Applying  to A instead of  to A produces Ae in an expn algorithm _ _ _ _ _ _ _ _ _ _ _ ___ ___

Encryption Process: 

Process: A  A  Ae  Ae Precompute: R2 = rn  r2n mod M Start with: A  R2  Arn  A mod M Exponentiate to obtain Ae End with: Ae  1  Ae mod M _ _ _ __ __ __ __ _ Encryption Process

2M Bound: 

Outputs are re-used as inputs. So need to bound I/O: Suppose an1 = 0 Then P < M+B at end of loop n2 yields P < M+r1B at very end. e.g. If B < 2M then P < 2M 2M Bound

Slide28: 

Suppose 2rM < rn, A < 2M and R2 < 2M Then A < 2M, Ae < 2M and P = Ae  1 < 2M Final output P satisfies Prn = Ae + QM where Q  rn1. Here Ae < 2M yields Prn < (rn+1)M So P  M P = M  Ae  0 mod M  A  0 mod M A = M should never arise; A = 0 yields P = 0. So no final modular adjustment is necessary. _ __ __ __ __ __ _

Digit-Parallel Implementation: 

Digit-Parallel Implementation Classical vs Montgomery: Similarities: Broadcasting of qi and ai Redundant representations Computing qi takes time Differences: Bits to determine qi

Slide30: 

ai, qi P := P + aiB (digit-parallel, not modular) cell j+1 P in Carry-Save form: pj = pj,s + rpj,c cell j cell j-1 ai, qi mj bj mj+1 bj+1 mj-1 bj-1 pj+1,c pj,c pj-1,c pj+1,s pj-1,s pj,s pj-1,c pj-1,s pj,s pj,c pj+1,s

Slide31: 

j j+1 j-1 n-1 Digit-Parallel P := rP + aiB - qiM (Classical) pj-1,c pj-1,s pj,s pj,c pj+1,c pj+1,s pn-2,s pn-3,c pj-2,s pj-3,c pj,c mn-1 bn-1 mj+1 bj+1 mj-1 bj-1 mj bj ai qi ai qi qi qi+1 pj-2,c

Slide32: 

(Montgomery) j+1 j j-1 0 Data Flow for P(i+1) := (P(i) + aiB + qiM)/r pj-1 (n) pj-1 pj-1 pj pj pj+1 pj pj-2 pj-2 p0 (n) (n) (i) (i) (i) (i) (i+1) (i+1) (i+1) ci,j+2 ci,j+1 ci,j ci,j-1 ci,1 ai ai ai ai ai ai qi qi qi qi qi mi mi+1 mi-1 m0 b0 bj-1 bj+1 bj

Systolic Array (Montgomery): 

Systolic Array (Montgomery) Write ith value of P as P(i) =  j=0 p(i1) r j Cells in col j compute p(i)j at time 2i+j : p(i)j + rc(i)j  p(i1)j+1 + c(i)j1 + aibj + qimj Cells in col 0 compute qi at time 2i : qi  (p(i1)1+ai×b0)(m01) mod r Any number of rows may be constructed Different timing schedules are possible n1

Slide34: 

cell i,j+1 cell i,j cell i,j-1 cell i+1,j+1 cell i+1,j cell i+1,j-1 carry carry carry carry carry carry carry carry ai ai ai ai ai+1 ai+1 ai+1 ai+1 qi+1 qi+1 qi+1 qi+1 qi qi qi qi mj bj mj+1 bj+1 mj-1 bj-1 mj-1 bj-1 mj-1 bj-1 mj bj mj bj mj+1 bj+1 mj+1 bj+1 p(i+1) p(i+1) p(i+1) p(i+1) p(i) p(i) p(i) p(i) p(i+2) p(i+2) p(i+2) p(i+2) j-1 j-1 j-1 j-2 j-2 j-2 j+1 j+1 j+1 j j j Systolic Array for P := (AB + QM)r-n

Slide35: 

j+1 j j-1 0 Data Flow for P(i+1) := (P(i) + aiB + qiM)/r pj-1 (n) pj-1 pj-1 pj pj pj+1 pj pj-2 pj-2 p0 (n) (n) (i) (i) (i) (i) (i+1) (i+1) (i+1) ci,j+2 ci,j+1 ci,j ci,j-1 ci,1 ai ai ai ai ai ai qi qi qi qi qi mi mi+1 mi-1 m0 b0 bj-1 bj+1 bj

Digit-Serial Implementation (Montgomery): 

Digit-Serial Implementation (Montgomery) Advantages: Local communication Shorter critical path Critical path easily in repeated cell Non-redundant representation Digit serial I/O Different digits qi and ai re DPA

Digit-Serial Implementation (Montgomery): 

Disadvantage: H/W only half used Solutions: Interleave two multiplications E.g. configure exponentiation  75% use Group digits as per Peter Kornerup [94] Digit-Serial Implementation (Montgomery)

Slide38: 

Other cell boundaries/groupings are possible; Timing front angles in the data dependency graph can be altered; For current speed of array implementations see Blum and Paar [99]; Vuillemin et al. [97] constructed an array; Design is parametrised: by k and no. of rows.

Data Dependency Diagrams: 

Data Dependency Diagrams

Data Dependency Diagrams: 

Data Dependency Diagrams Parallel Digit Implementation t = 0 t = 1 t = 2 t = 3

Data Dependency Diagrams: 

Data Dependency Diagrams Walter [93] t=4 t=5 t=3 t=6 t=2 t=1 t=0 t=7 1 tick 2 ticks ...

Data Dependency Diagrams: 

Data Dependency Diagrams Kornerup [94] t=0 t=1 t=2 t=3 t=4 t=5 t=6

Data Integrity: 

Data Integrity P = A×B  Q×M or Prn = A×B + Q×M These are easily checked mod m. e.g. m a prime just above the maximum cell output. Cost: ~ one cell in the array i.e. increasing n by 1. On error, abort or re-compute by another route: e.g. M replaced by dM for a digit d prime to r.

Timing & Power Attacks: 

Timing & Power Attacks Most attacks which succeed on the classical algorithm have equivalents which will succeed on corresponding implementation of Montgomery’s algorithm. With parallel digit processing, the same digits of A and Q are used in every digit slice in the same cycle. So DPA might reveal them. Pipelined version has no equivalent (see data dependency graph). It uses many different digits of A and Q in each cycle. DPA is more difficult.

Conclusions: 

Conclusions For single k-bit multiplier or array of n parallel cells, classical and Montgomery algorithms are almost equal. For pipelined array, Montgomery method has advantages: smaller time & area constants, better I/O, better against DPA; Pipeline is more complex for 100% use, but faster clock. Parameters can be chosen for specific purposes.

Slide46: 

Go forth and Multiply