20210717, 19:51  #12  
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3·5^{2}·79 Posts 
Quote:
And/or look for InterimResidues / OutputIterations in prime.txt. Please provide the actual specs of your laptop (CPU model number, at minimum), so that timings are meaningful as benchmarks. Last fiddled with by kriesel on 20210717 at 20:17 

20210717, 20:01  #13 
P90 years forever!
Aug 2002
Yeehaw, FL
2×5×769 Posts 

20210717, 21:35  #14 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3·5^{2}·79 Posts 
You tested two custom code paths, against a third custom code path? For what inputs, outputs, specifically?
Last fiddled with by kriesel on 20210717 at 21:35 
20210717, 23:10  #15 
Sep 2016
344_{10} Posts 
I'll mention that ProtoNTT is just a prototype. y-cruncher's actual NTT is completely different implementation-wise (and faster).
Computationally, FFTs are much faster than NTTs, but NTTs win the memory-bandwidth game. Given that the CPU/memory gap keeps increasing, one can argue that NTTs would win in the long term. But let's say we give NTTs the performance win now; that's still not enough. LL tests can't easily use an NTT because the IBDWT needs both sufficiently deep roots of unity and roots of two, something that's hard to come by in a modular-arithmetic ring. Last fiddled with by Mysticial on 20210717 at 23:11 
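Mysticial's constraint can be checked concretely. As a sketch, the following compares how deep the power-of-two roots of unity go versus the power-of-two roots of two; the popular NTT prime p = 2^64 - 2^32 + 1 is used purely as an assumed example, not as a prime ProtoNTT or y-cruncher necessarily uses:

```python
# Sketch: depth of 2^k-th roots of unity vs. 2^k-th roots of two modulo an
# NTT-friendly prime. The prime choice is an assumption for illustration.
p = 2**64 - 2**32 + 1  # prime; p - 1 = 2^32 * (2^32 - 1)

def power_of_two_root_depth(a, p):
    """Largest k such that x^(2^k) = a (mod p) is solvable.
    When 2^k divides p - 1, solvability is equivalent to
    a^((p-1)/2^k) == 1 (mod p)."""
    k = 0
    while (p - 1) % (1 << (k + 1)) == 0 and pow(a, (p - 1) >> (k + 1), p) == 1:
        k += 1
    return k

# Primitive 2^k-th roots of unity exist for all k up to v2(p - 1).
unity_depth = ((p - 1) & -(p - 1)).bit_length() - 1   # 32
root_of_two_depth = power_of_two_root_depth(2, p)     # 26

print(unity_depth, root_of_two_depth)
```

Here the roots of unity go all the way down to order 2^32, but the roots of two run out at 2^26, illustrating the mismatch: the IBDWT's weighting needs roots of two as deep as the transform, which sharply constrains the usable primes.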
20210718, 00:42  #16 
If I May
"Chris Halsall"
Sep 2002
Barbados
13×773 Posts 

20210718, 03:54  #17  
Jul 2021
33_{10} Posts 
Quote:
Also, I expect y-cruncher's version of the NTT, which I also used, to be quite well tested. But the point is different: not to prove to myself that the algorithm gives correct multiplication results, but just to compare the relative speed of the FFT against the NTT. 

20210718, 04:11  #18  
Jul 2021
3·11 Posts 
Quote:
In the NTT I used uint64 words and almost-64-bit prime numbers. These words were multiplied as 64×64-bit to 128-bit. I also used Montgomery and Barrett reduction everywhere to make things much faster, and a precomputed table of twiddle factors (powers of the root of unity W). If you're asking about the FFT, then I did the following things: 1) I split the input large numbers into N-bit words. N was chosen such that the final measured maximal error is around 0.07 for different tests with random inputs, so a mapping table was created from large-number size to N. For very large numbers, like 2^25 bytes, N was quite small, around 3-4 bits. Probably because of this my FFT was really slow: instead of using large input words I used quite small ones with few bits. But right now I don't know another good FFT solution besides splitting into small N-bit words; otherwise a long-float (128-bit or 256-bit float) multiplication would have to be done, which would probably be much slower. 2) After splitting, I did a forward FFT transform of the numbers A and B, then did pointwise multiplication, and finally an inverse FFT transform of the result. 3) I measured the maximal error to be within the desired bounds and did carry propagation over the double-sized integers. After that I joined the words to make the final result. 4) The FFT algorithm was also tested against naive multiplication to confirm it gives correct results, for different sizes of input numbers, but within an affordable running time (minutes) for the slow naive algorithm. 
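The four steps above can be sketched end-to-end as a toy Python floating-point FFT multiply. Hedged: the fixed 16-bit word size and the 0.25 error bound are my assumptions for modest inputs, not moytrage's measured mapping table:

```python
import cmath

def fft(a, invert=False):
    """In-place iterative radix-2 Cooley-Tukey FFT (len(a) a power of two)."""
    n = len(a)
    j = 0
    for i in range(1, n):                  # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = cmath.exp((2j if invert else -2j) * cmath.pi / length)
        for start in range(0, n, length):
            w = 1.0
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w
                a[k], a[k + length // 2] = u + v, u - v
                w *= w_len
        length <<= 1
    if invert:
        for i in range(n):
            a[i] /= n

def fft_multiply(x, y, bits=16):
    mask = (1 << bits) - 1
    def split(v):                          # step 1: split into `bits`-bit words
        words = []
        while v:
            words.append(v & mask)
            v >>= bits
        return words or [0]
    wx, wy = split(x), split(y)
    n = 1 << (len(wx) + len(wy) - 1).bit_length()   # pad to a power of two
    fa = [complex(w) for w in wx] + [0j] * (n - len(wx))
    fb = [complex(w) for w in wy] + [0j] * (n - len(wy))
    fft(fa)                                # step 2: forward transforms,
    fft(fb)                                # pointwise product, inverse FFT
    prod = [u * v for u, v in zip(fa, fb)]
    fft(prod, invert=True)
    # step 3: check the rounding error, then carry-propagate the coefficients
    err = max(abs(c.real - round(c.real)) for c in prod)
    assert err < 0.25, "word size too large for this operand length"
    result, carry = 0, 0
    for i, c in enumerate(prod):
        t = round(c.real) + carry
        result |= (t & mask) << (i * bits) # step 4: join words into the result
        carry = t >> bits
    return result | (carry << (len(prod) * bits))
```

For production-sized operands the 16-bit default stops fitting in a double's mantissa, which is exactly the shrinking-word-size problem described above.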

20210718, 04:24  #19  
Jul 2021
3×11 Posts 
Quote:
Which means that Prime95's multiplication is much faster than my NTT version: Prime95 takes 0.34 seconds and mine (NTT) takes 9.8 seconds. So I have to take back my initial words that "NTT might be faster". Right now I can't really understand the reason why Prime95 is so much faster. Maybe because Prime95 uses the IBDWT (Irrational Base Discrete Weighted Transform), which I don't know the details of, or maybe because the IBDWT is quite different from a plain FFT. Or I did something wrong in my FFT algorithm that made it slow (I described my FFT approach in the previous post). Or maybe Prime95 uses very aggressive assembly-level optimizations with SIMD instructions, which I didn't do. PS. I have a mobile (laptop) CPU, an Intel Core i7-2630QM, at a base frequency of around 2 GHz, with 8 hardware threads (4 cores) and AVX1 support. (Also, because my CPU usually overheats, it slows down to 1 GHz almost all of the time.) Last fiddled with by moytrage on 20210718 at 04:24 

20210718, 04:33  #20 
Jul 2021
41_{8} Posts 
As you're the author of ProtoNTT, can you please comment on how to implement fast FFT multiplication? If you've read my last few posts above, you'll know that I split the input large numbers into N-bit words for the FFT input, where N is quite small (around 3-4 bits) for very large (2^25-byte) numbers.
This 3-bit splitting made the FFT really slow: the NTT uses 64-bit words while the FFT uses just 3-bit words, and 3 bits is about 21x smaller than 64 bits. The 3-bit word splitting was necessary so that the FFT's maximal error (around 0.1) did not go out of bounds. Can you suggest, in short, how to implement the FFT differently so that it uses full 64-bit input words without losing precision (and exceeding the error bound)? Also, I didn't do long 128-bit or 256-bit float multiplication inside the FFT; I just used regular double multiplication with SIMD instructions like _mm256_mul_pd (doc is here https://software.intel.com/sites/lan...pd&expand=4894). Last fiddled with by moytrage on 20210718 at 04:35 
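The shrinking word size follows from a worst-case bound that can be sketched as follows: with b-bit words and a length-n transform, the peak convolution coefficient is about n·(2^b - 1)^2, and it must fit in a double's 53-bit mantissa with headroom for the transform's own roundoff. The 10-bit headroom below is my assumption; a measured-error approach, or a balanced ±2^(b-1) digit representation like Prime95's IBDWT uses, admits substantially larger words than this crude bound:

```python
import math

def max_word_bits(total_bits, mantissa=53, headroom=10):
    """Largest word size b (bits) such that the worst-case convolution
    coefficient n * (2^b - 1)^2 stays below 2^(mantissa - headroom), where
    n is the padded transform length for a total_bits-bit operand. The
    10-bit headroom for FFT roundoff is an assumed safety margin."""
    best = 1
    for b in range(1, 32):
        n_words = (total_bits + b - 1) // b
        n = 1 << (2 * n_words - 1).bit_length()      # padded power of two
        if math.log2(n) + 2 * b <= mantissa - headroom:
            best = b
    return best

print(max_word_bits(2**13))   # a 1 KiB number: 16-bit words still fit
print(max_word_bits(2**28))   # a 2^25-byte number: only ~8-bit words fit
```

The usable word size shrinks roughly as (53 - log2 n)/2, which is why multi-megabyte operands force single-digit word sizes in a plain double-precision FFT with unsigned digits.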
20210718, 04:36  #21  
∂^{2}ω=0
Sep 2002
República de California
2^{2}·3·7·139 Posts 
Quote:
Quote:


20210718, 04:50  #22  
P90 years forever!
Aug 2002
Yeehaw, FL
2·5·769 Posts 
Quote:
What you need to do is dig deeper into why your particular NTT or FFT implementations are faster or slower. That is no easy task. Quote:
I suspect (given the 29x speed difference) the cause is memory management. Fast implementations load up the L2 and L3 caches with data, do as much work as possible, then write it back to slow memory. Assuming an FFT size of 2^24, one possible FFT/NTT implementation is: read all data, butterfly, write all data, repeat 24 times. Do the same for the inverse FFT. That would be 48 R/Ws of memory. Prime95 does 2 R/Ws, a 24x difference. If my assumption is correct, memory accesses are swamping the CPU costs. IOW, you are not able to meaningfully compare the two algorithms using your existing code, although you might be able to for much smaller inputs. BTW, I applaud your workflow: write code, test ideas, ask questions when faced with results that fly in the face of conventional wisdom, no ego, learn, iterate. You'd be surprised how many visitors we get making grand proclamations and unwilling to listen to reasoned rebuttals. 
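The pass-counting above turns into a quick back-of-envelope estimate. The bandwidth figure below is an assumption for illustration, not a measured spec of any particular machine:

```python
fft_size = 2**24                 # transform length from the post
bytes_per_element = 8            # one double per element, say
data = fft_size * bytes_per_element        # 128 MiB, far bigger than any L3

naive_passes = 48   # read + write per butterfly level, forward + inverse
tuned_passes = 2    # cache-blocked: one read and one write of main memory

bandwidth = 20e9    # bytes/s, an assumed DDR3-era sustained figure
naive_seconds = naive_passes * data / bandwidth
tuned_seconds = tuned_passes * data / bandwidth

print(naive_seconds / tuned_seconds)       # the 24x ratio from the post
```

At memory-bound sizes the CPU cost barely matters, which is why two implementations of the "same" transform can differ by more than an order of magnitude.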
