logging in or signing up CDW Ches99 Talk BAWare Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 90 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: January 05, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Montgomery’s Multiplication Technique: How to make it Smaller and Faster: Montgomery’s Multiplication Technique: How to make it Smaller and Faster Colin D. Walter Computation Department, UMIST, UK www.co.umist.ac.uk Peter Montgomery: Peter Montgomery “Modular Multiplication without Trial Division” Math. Computation, vol. 44 (1985) 519-521 (A B) mod M without obtaining digits q (A B) / MMotivation: Motivation Faster RSA Cryptosystem through pipelined array Safer encryption against timing or DPA attacks Overview: Overview RSA & Notation Classical Algorithm Montgomery’s Version Comparison carry propagation digit distribution communication timing/power attacks ConclusionEnigma: Enigma Special Purpose Colossus (1943-44) Tommy Flowers, Bletchley Park, England. General Purpose ENIAC (1943-46) John Eckert & John Mauchly Philadelphia, US.RSA: RSA Modulus M of around 1024 bits Two keys d and e such that Ade A mod M A encrypted to C = Ae mod M C decrypted by A = Cd mod M M = PQ, a product of two large primes e is often small (e.g. a Fermat prime) d satisfies de 1 mod (P–1)(Q–1) Faster H/W = More Secure Encryption : Faster H/W = More Secure Encryption Work to factorize M doubles for every extra ~15 bits (for key lengths ~210 bits) Work to en/decrypt : ((1024+15)/1024)2 per multiplication ((1024+15)/1024)3 per exponentiation, i.e. only 5% extra! Number representations: Number representations X = i=0 xiri r = 2k is the radix (prime to M) xi is the ith digit (usually 0 xi < r) n max no. of digits in any number Redundant reps: wider digit range than 0 .. r1 H/W is built from kk-bit multipliers n fixed by H/W register size n–1Redundancy: Redundancy Digits xj split into carry-save parts: xj = xj,s + rxj,c X X+Y is performed by digit-parallel addition: xj xj,s + xj1,c + yj No carry propagation: only old carries on right side Multiplication AB : Multiplication AB Use n digit multipliers to form aiB and add to a partial product P: P := 0 ; For i := n1 downto 0 do P := rP + aiB { Post-condition: P = AB }Slide11: Either: Use redundancy in P and parallel digit addition to add aiB in one clock cycle Cell j computes aibj in cycle i P := P + aiB (digit-parallel) P in Carry-Save form: pj = pj,s + rpj,c cell j cell j-1 cell j+1 pj,c pj+1,c pj+1,s pj,s pj-1,c pj-1,s pj+1,s pj,c pj,s pj-1,c pj-1,s bj+1 bj bj-1 ai ai Slide12: or: Pipeline the addition of aiB over n cycles and propagate carries with no redundancy: Cell j computes aibj in cycle i+j carry P := P + aiB (digit-serial) time j+1 time j time j-1 carry carry carry pj+1 pj pj-1 pj+1 bj+1 pj bj pj-1 bj-1 ai ai ai ai Multiplier Complexity: Multiplier Complexity Assume wires take area but not time (or power). AreaTime2 complexity for un-pipelined k-bit multiplication is bounded below by k2 This can be achieved for time in [log k ..k ] Discrete Fourier Transform has large constants for time and area. Better, but asymptotically poorer designs for k expected here. Slide14: Cross-over point ? 107 transistors available for RSA k 64 to accommodate aiB Speed by using at least n multipliers to perform a full length aiB (or equivalent) in one cycle. Real-Time ?: Real-Time ? Assume : bus is one k-bit digit per cycle k-bit multiplier operates in one cycle Then : AB takes n cycles using n multipliers Throughput is one digit per cycle for multn. Need O(nk) multiplications for decryption Conclude : Need O(nk) rows of n multipliers. Classical Mod Multn Algorithm: Classical Mod Multn Algorithm { Pre-condition: 0 A < rn } P := 0 ; For i := n1 downto 0 do Begin P := r×P + ai×B ; qi := P div M ; P := P qi×M ; End { Post-conditions: P = A×B Q×M, P (A×B) mod M } Comments: Carry propagation a problem (it slows finding q) ; Use only top digits of M and P to determine a good multiple of M to remove ; P is bounded by small multiple of M ; Clean up only at end ; Critical path is finding q. CommentsDisadvantages: Disadvantages Redundant rep. for digit-parallel operation Global broadcast of q to each digit position Montgomery’s Mod Multn Algm: Montgomery’s Mod Multn Algm { Pre-condition: 0 A < rn } P := 0 ; For i := 0 to n1 do Begin qi := (p0+aib0)(-m0-1) mod r ; P := (P + ai×B + qi×M) div r ; { Invariant: 0 P < M+B } End ; { Post-condition: Prn = A×B + Q×M , P (ABr–n) mod M } Peter Montgomery :: Peter Montgomery : reverses the multiplication order chooses digits from least to most significant shifts down on each iteration. uses the least significant digits to determine multiple of M to subtract. Computes (AB×r–n) mod M Slide21: The factor r–n is cleared up in post-processing Any extra multiple of M is removed then qi has no carries to wait for Pipelining of the digits can now take place : compute aibj+1 on the cycle after aibj use a non-redundant representation no broadcasting of qi The Post-Condition: The Post-Condition m01 exists qi chosen so division by r is exact Define Ai = j=0 ajrj and Qi analogously Then Ai = Ai1+riai and An = A So ri+1P = Ai×B + Qi×M at end of ith iteration Hence rnP = A×B + Q×M at end. iThe Bounds: The Bounds A converted on-line to non-redundant form Can assume ai r1 So loop invariant P < M+B Slide24: If critical path length is computing q: Scale M to ensure (m01) mod r = 1 Shift B up to make b0 = 0 Result: qi = p0 mod r is simple Critical path in repeated cell. Cost: Increase n by 2 Removing rn: Removing rn The Montgomery class of A is A rnA mod M Montgomery modr multn is denoted . Montgomery product of A and B is A B A B rn ABrn AB mod M. Applying to A instead of to A produces Ae in an expn algorithm _ _ _ _ _ _ _ _ _ _ _ ___ ___Encryption Process: Process: A A Ae Ae Precompute: R2 = rn r2n mod M Start with: A R2 Arn A mod M Exponentiate to obtain Ae End with: Ae 1 Ae mod M _ _ _ __ __ __ __ _ Encryption Process2M Bound: Outputs are re-used as inputs. So need to bound I/O: Suppose an1 = 0 Then P < M+B at end of loop n2 yields P < M+r1B at very end. e.g. If B < 2M then P < 2M 2M BoundSlide28: Suppose 2rM < rn, A < 2M and R2 < 2M Then A < 2M, Ae < 2M and P = Ae 1 < 2M Final output P satisfies Prn = Ae + QM where Q rn1. Here Ae < 2M yields Prn < (rn+1)M So P M P = M Ae 0 mod M A 0 mod M A = M should never arise; A = 0 yields P = 0. So no final modular adjustment is necessary. _ __ __ __ __ __ _Digit-Parallel Implementation: Digit-Parallel Implementation Classical vs Montgomery: Similarities: Broadcasting of qi and ai Redundant representations Computing qi takes time Differences: Bits to determine qi Slide30: ai, qi P := P + aiB (digit-parallel, not modular) cell j+1 P in Carry-Save form: pj = pj,s + rpj,c cell j cell j-1 ai, qi mj bj mj+1 bj+1 mj-1 bj-1 pj+1,c pj,c pj-1,c pj+1,s pj-1,s pj,s pj-1,c pj-1,s pj,s pj,c pj+1,s Slide31: j j+1 j-1 n-1 Digit-Parallel P := rP + aiB - qiM (Classical) pj-1,c pj-1,s pj,s pj,c pj+1,c pj+1,s pn-2,s pn-3,c pj-2,s pj-3,c pj,c mn-1 bn-1 mj+1 bj+1 mj-1 bj-1 mj bj ai qi ai qi qi qi+1 pj-2,c Slide32: (Montgomery) j+1 j j-1 0 Data Flow for P(i+1) := (P(i) + aiB + qiM)/r pj-1 (n) pj-1 pj-1 pj pj pj+1 pj pj-2 pj-2 p0 (n) (n) (i) (i) (i) (i) (i+1) (i+1) (i+1) ci,j+2 ci,j+1 ci,j ci,j-1 ci,1 ai ai ai ai ai ai qi qi qi qi qi mi mi+1 mi-1 m0 b0 bj-1 bj+1 bjSystolic Array (Montgomery): Systolic Array (Montgomery) Write ith value of P as P(i) = j=0 p(i1) r j Cells in col j compute p(i)j at time 2i+j : p(i)j + rc(i)j p(i1)j+1 + c(i)j1 + aibj + qimj Cells in col 0 compute qi at time 2i : qi (p(i1)1+ai×b0)(m01) mod r Any number of rows may be constructed Different timing schedules are possible n1Slide34: cell i,j+1 cell i,j cell i,j-1 cell i+1,j+1 cell i+1,j cell i+1,j-1 carry carry carry carry carry carry carry carry ai ai ai ai ai+1 ai+1 ai+1 ai+1 qi+1 qi+1 qi+1 qi+1 qi qi qi qi mj bj mj+1 bj+1 mj-1 bj-1 mj-1 bj-1 mj-1 bj-1 mj bj mj bj mj+1 bj+1 mj+1 bj+1 p(i+1) p(i+1) p(i+1) p(i+1) p(i) p(i) p(i) p(i) p(i+2) p(i+2) p(i+2) p(i+2) j-1 j-1 j-1 j-2 j-2 j-2 j+1 j+1 j+1 j j j Systolic Array for P := (AB + QM)r-n Slide35: j+1 j j-1 0 Data Flow for P(i+1) := (P(i) + aiB + qiM)/r pj-1 (n) pj-1 pj-1 pj pj pj+1 pj pj-2 pj-2 p0 (n) (n) (i) (i) (i) (i) (i+1) (i+1) (i+1) ci,j+2 ci,j+1 ci,j ci,j-1 ci,1 ai ai ai ai ai ai qi qi qi qi qi mi mi+1 mi-1 m0 b0 bj-1 bj+1 bjDigit-Serial Implementation (Montgomery): Digit-Serial Implementation (Montgomery) Advantages: Local communication Shorter critical path Critical path easily in repeated cell Non-redundant representation Digit serial I/O Different digits qi and ai re DPA Digit-Serial Implementation (Montgomery): Disadvantage: H/W only half used Solutions: Interleave two multiplications E.g. configure exponentiation 75% use Group digits as per Peter Kornerup [94] Digit-Serial Implementation (Montgomery)Slide38: Other cell boundaries/groupings are possible; Timing front angles in the data dependency graph can be altered; For current speed of array implementations see Blum and Paar [99]; Vuillemin et al. [97] constructed an array; Design is parametrised: by k and no. of rows. Data Dependency Diagrams: Data Dependency Diagrams Data Dependency Diagrams: Data Dependency Diagrams Parallel Digit Implementation t = 0 t = 1 t = 2 t = 3 Data Dependency Diagrams: Data Dependency Diagrams Walter [93] t=4 t=5 t=3 t=6 t=2 t=1 t=0 t=7 1 tick 2 ticks ... Data Dependency Diagrams: Data Dependency Diagrams Kornerup [94] t=0 t=1 t=2 t=3 t=4 t=5 t=6 Data Integrity: Data Integrity P = A×B Q×M or Prn = A×B + Q×M These are easily checked mod m. e.g. m a prime just above the maximum cell output. Cost: ~ one cell in the array i.e. increasing n by 1. On error, abort or re-compute by another route: e.g. M replaced by dM for a digit d prime to r. Timing & Power Attacks: Timing & Power Attacks Most attacks which succeed on the classical algorithm have equivalents which will succeed on corresponding implementation of Montgomery’s algorithm. With parallel digit processing, the same digits of A and Q are used in every digit slice in the same cycle. So DPA might reveal them. Pipelined version has no equivalent (see data dependency graph). It uses many different digits of A and Q in each cycle. DPA is more difficult. Conclusions: Conclusions For single k-bit multiplier or array of n parallel cells, classical and Montgomery algorithms are almost equal. For pipelined array, Montgomery method has advantages: smaller time & area constants, better I/O, better against DPA; Pipeline is more complex for 100% use, but faster clock. Parameters can be chosen for specific purposes. Slide46: Go forth and Multiply You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
CDW Ches99 Talk BAWare Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 90 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: January 05, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Montgomery’s Multiplication Technique: How to make it Smaller and Faster: Montgomery’s Multiplication Technique: How to make it Smaller and Faster Colin D. Walter Computation Department, UMIST, UK www.co.umist.ac.uk Peter Montgomery: Peter Montgomery “Modular Multiplication without Trial Division” Math. Computation, vol. 44 (1985) 519-521 (A B) mod M without obtaining digits q (A B) / MMotivation: Motivation Faster RSA Cryptosystem through pipelined array Safer encryption against timing or DPA attacks Overview: Overview RSA & Notation Classical Algorithm Montgomery’s Version Comparison carry propagation digit distribution communication timing/power attacks ConclusionEnigma: Enigma Special Purpose Colossus (1943-44) Tommy Flowers, Bletchley Park, England. General Purpose ENIAC (1943-46) John Eckert & John Mauchly Philadelphia, US.RSA: RSA Modulus M of around 1024 bits Two keys d and e such that Ade A mod M A encrypted to C = Ae mod M C decrypted by A = Cd mod M M = PQ, a product of two large primes e is often small (e.g. a Fermat prime) d satisfies de 1 mod (P–1)(Q–1) Faster H/W = More Secure Encryption : Faster H/W = More Secure Encryption Work to factorize M doubles for every extra ~15 bits (for key lengths ~210 bits) Work to en/decrypt : ((1024+15)/1024)2 per multiplication ((1024+15)/1024)3 per exponentiation, i.e. only 5% extra! Number representations: Number representations X = i=0 xiri r = 2k is the radix (prime to M) xi is the ith digit (usually 0 xi < r) n max no. of digits in any number Redundant reps: wider digit range than 0 .. r1 H/W is built from kk-bit multipliers n fixed by H/W register size n–1Redundancy: Redundancy Digits xj split into carry-save parts: xj = xj,s + rxj,c X X+Y is performed by digit-parallel addition: xj xj,s + xj1,c + yj No carry propagation: only old carries on right side Multiplication AB : Multiplication AB Use n digit multipliers to form aiB and add to a partial product P: P := 0 ; For i := n1 downto 0 do P := rP + aiB { Post-condition: P = AB }Slide11: Either: Use redundancy in P and parallel digit addition to add aiB in one clock cycle Cell j computes aibj in cycle i P := P + aiB (digit-parallel) P in Carry-Save form: pj = pj,s + rpj,c cell j cell j-1 cell j+1 pj,c pj+1,c pj+1,s pj,s pj-1,c pj-1,s pj+1,s pj,c pj,s pj-1,c pj-1,s bj+1 bj bj-1 ai ai Slide12: or: Pipeline the addition of aiB over n cycles and propagate carries with no redundancy: Cell j computes aibj in cycle i+j carry P := P + aiB (digit-serial) time j+1 time j time j-1 carry carry carry pj+1 pj pj-1 pj+1 bj+1 pj bj pj-1 bj-1 ai ai ai ai Multiplier Complexity: Multiplier Complexity Assume wires take area but not time (or power). AreaTime2 complexity for un-pipelined k-bit multiplication is bounded below by k2 This can be achieved for time in [log k ..k ] Discrete Fourier Transform has large constants for time and area. Better, but asymptotically poorer designs for k expected here. Slide14: Cross-over point ? 107 transistors available for RSA k 64 to accommodate aiB Speed by using at least n multipliers to perform a full length aiB (or equivalent) in one cycle. Real-Time ?: Real-Time ? Assume : bus is one k-bit digit per cycle k-bit multiplier operates in one cycle Then : AB takes n cycles using n multipliers Throughput is one digit per cycle for multn. Need O(nk) multiplications for decryption Conclude : Need O(nk) rows of n multipliers. Classical Mod Multn Algorithm: Classical Mod Multn Algorithm { Pre-condition: 0 A < rn } P := 0 ; For i := n1 downto 0 do Begin P := r×P + ai×B ; qi := P div M ; P := P qi×M ; End { Post-conditions: P = A×B Q×M, P (A×B) mod M } Comments: Carry propagation a problem (it slows finding q) ; Use only top digits of M and P to determine a good multiple of M to remove ; P is bounded by small multiple of M ; Clean up only at end ; Critical path is finding q. CommentsDisadvantages: Disadvantages Redundant rep. for digit-parallel operation Global broadcast of q to each digit position Montgomery’s Mod Multn Algm: Montgomery’s Mod Multn Algm { Pre-condition: 0 A < rn } P := 0 ; For i := 0 to n1 do Begin qi := (p0+aib0)(-m0-1) mod r ; P := (P + ai×B + qi×M) div r ; { Invariant: 0 P < M+B } End ; { Post-condition: Prn = A×B + Q×M , P (ABr–n) mod M } Peter Montgomery :: Peter Montgomery : reverses the multiplication order chooses digits from least to most significant shifts down on each iteration. uses the least significant digits to determine multiple of M to subtract. Computes (AB×r–n) mod M Slide21: The factor r–n is cleared up in post-processing Any extra multiple of M is removed then qi has no carries to wait for Pipelining of the digits can now take place : compute aibj+1 on the cycle after aibj use a non-redundant representation no broadcasting of qi The Post-Condition: The Post-Condition m01 exists qi chosen so division by r is exact Define Ai = j=0 ajrj and Qi analogously Then Ai = Ai1+riai and An = A So ri+1P = Ai×B + Qi×M at end of ith iteration Hence rnP = A×B + Q×M at end. iThe Bounds: The Bounds A converted on-line to non-redundant form Can assume ai r1 So loop invariant P < M+B Slide24: If critical path length is computing q: Scale M to ensure (m01) mod r = 1 Shift B up to make b0 = 0 Result: qi = p0 mod r is simple Critical path in repeated cell. Cost: Increase n by 2 Removing rn: Removing rn The Montgomery class of A is A rnA mod M Montgomery modr multn is denoted . Montgomery product of A and B is A B A B rn ABrn AB mod M. Applying to A instead of to A produces Ae in an expn algorithm _ _ _ _ _ _ _ _ _ _ _ ___ ___Encryption Process: Process: A A Ae Ae Precompute: R2 = rn r2n mod M Start with: A R2 Arn A mod M Exponentiate to obtain Ae End with: Ae 1 Ae mod M _ _ _ __ __ __ __ _ Encryption Process2M Bound: Outputs are re-used as inputs. So need to bound I/O: Suppose an1 = 0 Then P < M+B at end of loop n2 yields P < M+r1B at very end. e.g. If B < 2M then P < 2M 2M BoundSlide28: Suppose 2rM < rn, A < 2M and R2 < 2M Then A < 2M, Ae < 2M and P = Ae 1 < 2M Final output P satisfies Prn = Ae + QM where Q rn1. Here Ae < 2M yields Prn < (rn+1)M So P M P = M Ae 0 mod M A 0 mod M A = M should never arise; A = 0 yields P = 0. So no final modular adjustment is necessary. _ __ __ __ __ __ _Digit-Parallel Implementation: Digit-Parallel Implementation Classical vs Montgomery: Similarities: Broadcasting of qi and ai Redundant representations Computing qi takes time Differences: Bits to determine qi Slide30: ai, qi P := P + aiB (digit-parallel, not modular) cell j+1 P in Carry-Save form: pj = pj,s + rpj,c cell j cell j-1 ai, qi mj bj mj+1 bj+1 mj-1 bj-1 pj+1,c pj,c pj-1,c pj+1,s pj-1,s pj,s pj-1,c pj-1,s pj,s pj,c pj+1,s Slide31: j j+1 j-1 n-1 Digit-Parallel P := rP + aiB - qiM (Classical) pj-1,c pj-1,s pj,s pj,c pj+1,c pj+1,s pn-2,s pn-3,c pj-2,s pj-3,c pj,c mn-1 bn-1 mj+1 bj+1 mj-1 bj-1 mj bj ai qi ai qi qi qi+1 pj-2,c Slide32: (Montgomery) j+1 j j-1 0 Data Flow for P(i+1) := (P(i) + aiB + qiM)/r pj-1 (n) pj-1 pj-1 pj pj pj+1 pj pj-2 pj-2 p0 (n) (n) (i) (i) (i) (i) (i+1) (i+1) (i+1) ci,j+2 ci,j+1 ci,j ci,j-1 ci,1 ai ai ai ai ai ai qi qi qi qi qi mi mi+1 mi-1 m0 b0 bj-1 bj+1 bjSystolic Array (Montgomery): Systolic Array (Montgomery) Write ith value of P as P(i) = j=0 p(i1) r j Cells in col j compute p(i)j at time 2i+j : p(i)j + rc(i)j p(i1)j+1 + c(i)j1 + aibj + qimj Cells in col 0 compute qi at time 2i : qi (p(i1)1+ai×b0)(m01) mod r Any number of rows may be constructed Different timing schedules are possible n1Slide34: cell i,j+1 cell i,j cell i,j-1 cell i+1,j+1 cell i+1,j cell i+1,j-1 carry carry carry carry carry carry carry carry ai ai ai ai ai+1 ai+1 ai+1 ai+1 qi+1 qi+1 qi+1 qi+1 qi qi qi qi mj bj mj+1 bj+1 mj-1 bj-1 mj-1 bj-1 mj-1 bj-1 mj bj mj bj mj+1 bj+1 mj+1 bj+1 p(i+1) p(i+1) p(i+1) p(i+1) p(i) p(i) p(i) p(i) p(i+2) p(i+2) p(i+2) p(i+2) j-1 j-1 j-1 j-2 j-2 j-2 j+1 j+1 j+1 j j j Systolic Array for P := (AB + QM)r-n Slide35: j+1 j j-1 0 Data Flow for P(i+1) := (P(i) + aiB + qiM)/r pj-1 (n) pj-1 pj-1 pj pj pj+1 pj pj-2 pj-2 p0 (n) (n) (i) (i) (i) (i) (i+1) (i+1) (i+1) ci,j+2 ci,j+1 ci,j ci,j-1 ci,1 ai ai ai ai ai ai qi qi qi qi qi mi mi+1 mi-1 m0 b0 bj-1 bj+1 bjDigit-Serial Implementation (Montgomery): Digit-Serial Implementation (Montgomery) Advantages: Local communication Shorter critical path Critical path easily in repeated cell Non-redundant representation Digit serial I/O Different digits qi and ai re DPA Digit-Serial Implementation (Montgomery): Disadvantage: H/W only half used Solutions: Interleave two multiplications E.g. configure exponentiation 75% use Group digits as per Peter Kornerup [94] Digit-Serial Implementation (Montgomery)Slide38: Other cell boundaries/groupings are possible; Timing front angles in the data dependency graph can be altered; For current speed of array implementations see Blum and Paar [99]; Vuillemin et al. [97] constructed an array; Design is parametrised: by k and no. of rows. Data Dependency Diagrams: Data Dependency Diagrams Data Dependency Diagrams: Data Dependency Diagrams Parallel Digit Implementation t = 0 t = 1 t = 2 t = 3 Data Dependency Diagrams: Data Dependency Diagrams Walter [93] t=4 t=5 t=3 t=6 t=2 t=1 t=0 t=7 1 tick 2 ticks ... Data Dependency Diagrams: Data Dependency Diagrams Kornerup [94] t=0 t=1 t=2 t=3 t=4 t=5 t=6 Data Integrity: Data Integrity P = A×B Q×M or Prn = A×B + Q×M These are easily checked mod m. e.g. m a prime just above the maximum cell output. Cost: ~ one cell in the array i.e. increasing n by 1. On error, abort or re-compute by another route: e.g. M replaced by dM for a digit d prime to r. Timing & Power Attacks: Timing & Power Attacks Most attacks which succeed on the classical algorithm have equivalents which will succeed on corresponding implementation of Montgomery’s algorithm. With parallel digit processing, the same digits of A and Q are used in every digit slice in the same cycle. So DPA might reveal them. Pipelined version has no equivalent (see data dependency graph). It uses many different digits of A and Q in each cycle. DPA is more difficult. Conclusions: Conclusions For single k-bit multiplier or array of n parallel cells, classical and Montgomery algorithms are almost equal. For pipelined array, Montgomery method has advantages: smaller time & area constants, better I/O, better against DPA; Pipeline is more complex for 100% use, but faster clock. Parameters can be chosen for specific purposes. Slide46: Go forth and Multiply