> Why do we write programs with one thread that executes the instructions in order? Why not just always write fully parallelized code?
Maybe to a layperson this is an interesting question, but surely on this site the answer is obvious.
I skimmed through anyway, just on the off chance he's talking about some new GPU development I didn't know about. But no, it's just a clickbaity title for a basic "pros and cons of parallelizing" article of mediocre quality.
You're not the target audience. This is a very well-written explainer for non-techies. I've already emailed it to a couple of friends. Lovely.
This whole site is not the target audience.
> a clickbaity title for a basic "pros and cons of parallelizing" article of mediocre quality.
Why don't you write one with good quality (if not great) that is understandable to the audience of laypersons interested in, having or asking this question.
I think the main point here is that there is no fundamental amazing insight here for professionals to ingest.
This exact question came to my mind in the last two hours, after reading about the new Nvidia Digits supercomputer (which will retail at $3,000) and the RTX 5090 GPU (which will cost $2,000). At exactly that moment, I saw this submission here.
Teddy Wahle did a good job explaining it to non-professionals who are using gaming laptops/PCs in their daily lives for content creation and such.
If I can think of this question, imagine the thousands (if not millions) of people who have, or will have, similar questions.
Though no modern CPU actually executes instructions in order.
Because CPUs care about semantic ordering
If freight trains are so good, why do we use trucks?
Freight trains carry a lot, but are really bad at branching. Same with GPUs.
Freight trains are good, but pipelines are better for bulk liquids.
GPUs are massively parallel, limited-purpose CPUs. You're always waiting for the program counter to get to the end, and for data to move to/from RAM.
My project (since the 1980s) is to put compute and memory together, so you can either store 16 bits, or do a 4-bit lookup with the same data in every cell (which has 4x16 bits total). Having no memory-compute barrier should allow for pushing data through a program in a physical sense. Like an FPGA, but easier to program.
Imagine such a chip at a few GHz, delivering one output per clock cycle for a given task.
In my experience, the step of leaving the von Neumann architecture and programming how data flows across the compute engines is harder than dealing with the crappy FPGA tools. I tell people that with an FPGA you teach the computer what to be, not what to do. A bit like functional vs. imperative programming.
Therefore I can't yet imagine how to more easily onboard people who learned on a "classical CPU". Can you go a bit deeper into what makes your system easier to program?
FPGAs are heterogeneous, and the tools are all about getting the maximum utilization of a given chip's features while also routing signals to get the fastest reliable cycle times. Compile times sometimes stretch out into overnight sessions.
By making each cell identical you remove most of the floor planning concerns. By latching the data from each cell, you remove timing concerns, and deliver deterministic performance.
This has other benefits, like being able to route around a bad cell, or to build a "wall" around a given section of code.
The emulator will always give correct output, barring hardware failure.
I hope to get through Tiny Tapeout and have a chip before the end of the year.
As for programming, the code has to compile to a directed acyclic graph of boolean logical operations. I suspect that it would be fairly easy to use as a backend for tinygrad or perhaps LLVM as a stretch goal.
Ah thanks! Reminds me of systolic arrays somehow. Good luck and hopefully there are no bugs in your design!
Are you speaking about a sort of (multi-state) cellular automata?
No, each cell has its own 4x4-bit lookup table (64 bits of "program" or "data", depending on the mode of that cell), so it's more like an FPGA with the routing fabric stripped out.
This sounds like just an fpga. What do you mean by “routing fabric stripped out?” How does data get from one cell to another?
Each cell only hooks to its neighbors: up, down, left, right... so no "fast" wires across the chip, etc.
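If I've read that right, here's a toy Python sketch of such a cell, purely as illustration; the names, the latching detail, and the incrementer example are my own guesses, not the actual design:

```python
# Toy model of my reading of the description: each cell holds a
# 16-entry x 4-bit lookup table (64 bits) and only talks to its four
# neighbours. Everything here is a guess for illustration only.

class Cell:
    def __init__(self, lut):
        assert len(lut) == 16 and all(0 <= v < 16 for v in lut)
        self.lut = lut   # 16 entries x 4 bits = 64 bits of "program"/"data"
        self.out = 0     # latched 4-bit output, updated once per clock

    def step(self, four_bit_in):
        # Latching the output each cycle is what makes timing deterministic.
        self.out = self.lut[four_bit_in & 0xF]

def tick(cells, inputs):
    """Advance every cell one clock; inputs stand in for whatever the
    four neighbours are driving this cycle."""
    for cell, value in zip(cells, inputs):
        cell.step(value)
    return [c.out for c in cells]

# Example: a single cell programmed as a 4-bit incrementer.
inc = Cell([(i + 1) & 0xF for i in range(16)])
inc.step(7)
print(inc.out)  # 8
```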
This really sounds like the Xilinx XC6200 series FPGA. It has a few routing hierarchies, but it does strip out most of the routing a classical (for that time) FPGA has. I've heard figures that 70% of an FPGA is routing.
The XC6200 made the news once when they[1] used an evolutionary algorithm to create a configuration that could detect a single frequency tone. The resulting configuration was judged impossible to work, yet it did. Placing the configuration in a different part of the FPGA broke it.
It seemed the algorithm had exploited the analog properties of the FPGA in that specific location to arrive at a smaller result.
[1] https://www.idi.ntnu.no/emner/tdt22/2011/Thompsonieeeehw.pdf
If data only travels to local neighbors, this really is a multi-state cellular automaton, though.
I thought cellular automata all had to have the same rule/program across the grid? Like "Life" or the ones Wolfram is exploring.
Trucks are better at branching, but also heavily subsidized. In the 19th century when trains were being run everywhere, they were heavily subsidized too.
Trains in general are great, so why do people insist on sitting alone in an expensive car which they even have to drive themselves?
The gp comment speaks more of freight than of passengers. But the same analogy applies… trains are great when you need to transport a whole bunch of people exactly the same way, but they’re less useful if every person needs to start and end at a totally different place.
In the same sense, parallelization (and computing on a GPU) is great when you need to perform the same computation on a huge amount of data. When each computation is more independent, it might make more sense to perform it on the CPU.
Hmmmm, the blog here doesn't really sell it right IMO.
CPUs are actually hugely parallel today. Every CPU core, even the "small cores" (E-cores from Intel or little cores from ARM), is an out-of-order superscalar core. This means that every CPU can execute 2, 3, 4, 6, 8 or more instructions per clock tick.
What a CPU is, fundamentally, is a device that takes a classically written sequential program and *DISCOVERS* parallelism for you. And then executes that discovered parallelism.
A GPU, and SIMD in general, is the opposite. SIMD requires that the programmer writes the program in a grossly parallel manner before it even executes on the GPU. Naturally, GPUs lead to more parallelism, not due to magic but simply because the programmer thought about parallelism in a more scalable way.
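A small Python/NumPy sketch of the difference in how the parallelism gets expressed (NumPy is only a stand-in here for "the whole-array operation is stated up front"; the names are illustrative):

```python
import numpy as np

data = np.arange(1_000_000, dtype=np.float64)

# Sequential formulation: any parallelism has to be *discovered* by the
# out-of-order hardware while it chews through one iteration at a time.
total = 0.0
for x in data:
    total += x * 2.0

# Explicitly data-parallel formulation: the programmer states the
# whole-array operation up front, which maps naturally onto SIMD units
# or a GPU.
total_vec = (data * 2.0).sum()

assert np.isclose(total, total_vec)
```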
The only time a CPU "discovers" parallelism is in
> you couldn’t tell your friend, “You calculate steps 1 through 50 while I calculate steps 51 through 100.”
For that example, couldn't you do exactly that? Multiplication is commutative
Nit: the relevant property you're looking for is associativity (you can omit parentheses), not commutativity (you can swap left and right); there are operations that are the latter but not the former, such as the average: avg(a, b) equals avg(b, a), but avg(avg(a, b), c) is generally not avg(a, avg(b, c)).
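A quick check of that in plain Python:

```python
def avg(a, b):
    return (a + b) / 2

# Commutative: swapping the operands never changes the result.
assert avg(1, 2) == avg(2, 1)

# Not associative: the grouping matters, so you can't split the work arbitrarily.
print(avg(avg(1, 2), 3))  # 2.25
print(avg(1, avg(2, 3)))  # 1.75
```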
That's a good example. My favorite has always been rock-paper-scissors. If `vs` is a binary operator which gives you the winner according to the game, then
rock vs paper = paper vs rock
but
(rock vs paper) vs scissors != rock vs (paper vs scissors)
Much of parallelism to accomplish a single goal is trying to break computations into units that are associative and commutative. Then you can combine the results regardless of order.
Trying to decompose a problem in this way ranges from hard to impossible.
My thoughts exactly. His sequential program example is easily parallelizable. Maybe one of the takeaways should be that we need to work on converting more serial problems to parallel ones
Unfortunately, many interesting problems are known to be impossible to efficiently parallelize. This includes some very broad problems, like evaluating a logical expression or simulating a Turing machine.
https://en.wikipedia.org/wiki/P-complete
People have been beating on that problem for decades, with modest success. It's amazing how many different problems people are solving with GPUs.
There are at least four main ways to break apart a problem into parallel work units:
Search-type problems, where you want to do a lot of computations and discard most of them, are easy to parallelize. The extreme case is a crypto key tester. A Bitcoin miner and the WWII "bombe" are examples. The amount of data that comes out is tiny compared to the work done.
Then there are problems which decompose spatially. Space is what keeps everything from being in the same place. Fluid flow, weather prediction, and the hydrodynamics of nuclear explosions are all like that. The space is divided into regions, and the regions talk to their neighbors. So most communication is local. Those are classic supercomputer problems. Machine learning, of course: units are connected to local neighbors, so that is, in a sense, spatial.
Graphics rendering does large numbers of triangles in parallel. The combining operation has to be associative and commutative. Z-buffers make that happen for opaque and transparent surfaces. Translucent is harder, but there are tricks to pull that off without a depth sort. See "order independent transparency".
There are assembly-line approaches, where data progresses forward through multiple stages, with little or no data moving backwards. Audio and video processing work like that. So do some kinds of vision systems. This is the oldest approach - it's how large punched card tabulator shops worked, with trays of punched cards being carried from machine to machine.
Most things you can do on a GPU seem to fit into one of those four patterns. If you can hammer a problem into one of those four patterns, a GPU might help. If not, well...
Did I miss any core patterns?
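As a rough sketch of the first (search-type) pattern, using only the standard library and a made-up "expensive check": fan the candidates out across workers and keep only the rare hits, so almost no data comes back.

```python
from multiprocessing import Pool

def test_candidate(key):
    """Stand-in for an expensive check (key tester, hash search, ...).
    Returns the key only on the rare 'hit'."""
    return key if (key * 2654435761) % 1_000_003 == 42 else None

if __name__ == "__main__":
    candidates = range(1_000_000)
    with Pool() as pool:
        hits = [k for k in pool.imap_unordered(test_candidate, candidates,
                                               chunksize=10_000)
                if k is not None]
    # Huge amount of work goes in, a handful of results comes out.
    print(len(hits), "hits out of", len(candidates), "candidates:", hits)
```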
Yes, you would parallelize it with a prefix scan, which has complexity n log n, I believe. However, floating point is neither commutative nor associative.
You mean associative?
probably, lol
It should be noted that operations like addition on floating point numbers are not associative (even though addition on real numbers is). Round off error changes depending on the order of operations. The error in divide-and-conquer summation (which you might see in parallelized addition) is worst case much better than sequential addition, although not as good as more elaborate algorithms like Kahan summation.
https://en.wikipedia.org/wiki/Kahan_summation_algorithm#Alte...
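A small Python illustration of both points: reordering a float sum changes the result, and Kahan's compensated loop recovers most of the lost accuracy (math.fsum is used as the correctly rounded reference).

```python
import math
import random

random.seed(0)
# Mixed magnitudes make the rounding error visible.
xs = [random.uniform(-1, 1) * 10 ** random.randint(0, 8) for _ in range(100_000)]

def kahan_sum(values):
    total = 0.0
    c = 0.0                      # running compensation for lost low-order bits
    for v in values:
        y = v - c
        t = total + y
        c = (t - total) - y
        total = t
    return total

naive_fwd = sum(xs)
naive_rev = sum(reversed(xs))
exact = math.fsum(xs)            # correctly rounded reference

print(naive_fwd == naive_rev)    # usually False: the order of additions matters
print(abs(naive_fwd - exact), abs(kahan_sum(xs) - exact))  # Kahan's error is typically far smaller
```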
A better example would be the Fibonacci sequence, or some other inductive process.
Generally recurrent sequences cannot be computed in parallel, but the Fibonacci sequence is a special case because it's a linear recurrence. There's a closed-form equation to generate the Nth number.
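For example, a minimal sketch of Binet's closed form next to the sequential recurrence; each closed-form evaluation is independent of the others (so a range of n could be computed in parallel), though float precision limits it to roughly n ≤ 70:

```python
import math

PHI = (1 + math.sqrt(5)) / 2
PSI = (1 - math.sqrt(5)) / 2

def fib_closed_form(n):
    """Binet's formula: F(n) = (phi^n - psi^n) / sqrt(5).
    No dependence on F(n-1), unlike the recurrence below."""
    return round((PHI ** n - PSI ** n) / math.sqrt(5))

def fib_iterative(n):
    # The recurrence: each step needs the previous two values.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

assert all(fib_closed_form(n) == fib_iterative(n) for n in range(60))
```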
Linked Lists and other Tree Traversals are sequential.
Raytracing (which is a tree traversal) is a notable exception, because so many millions of rays get launched that it becomes efficient to group them back together and then parallelize the traversals all together.
So if you have "enough" of them going at once, it's back to GPU superiority. But a *SINGLE* linked list or *SINGLE* tree search will always be best done on a CPU.
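A rough sketch of that "enough of them at once" idea, with np.searchsorted standing in for the batched traversal a GPU would run over millions of independent queries:

```python
import numpy as np

rng = np.random.default_rng(0)
sorted_keys = np.sort(rng.integers(0, 1_000_000, size=100_000))

def single_search(keys, target):
    # One traversal: each step depends on the previous one -> CPU territory.
    lo, hi = 0, len(keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

queries = rng.integers(0, 1_000_000, size=1_000_000)
# A million independent traversals: batch them and let the data-parallel
# machinery (SIMD here, a GPU in the raytracing case) grind through them.
positions = np.searchsorted(sorted_keys, queries)

assert positions[0] == single_search(sorted_keys, queries[0])
```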
GPU cores:
+ Efficient decoding of one instruction stream for many parallel cores.
+ Efficient parallel access to lots of data in regular layout with address offsets relative to core index.
- Lots of instruction streams quickly require serialization, given only one instruction stream can be handled per block of co-working cores.
- Only one or a few parallel calculations per stream leaves the rest of the co-workers unused.
- Irregular data layouts force the parallel memory hardware to act sequentially, with each core getting access while the others wait their turn, until every core's access is complete.
- Lots of branching stalls cores: the cores that take one branch execute while the other cores stall, and then vice versa.
CPU cores:
+ Efficient decoding of different instruction streams per core.
+ Efficient with different memory layouts, due to per core caching.
+ Highly optimized for branching.
- Parallel calculations across cores suffer from redundant instruction decoding overhead for every core.
- Highly parallel calculations suffer from the relatively low number of cores.
GPU and CPU:
+ Both optimized for sequential data access across sequentially offset addresses.
CPU Bonuses:
+ They handle overall system initialization, power & device management, and other system tasks.
+ SIMD instructions let each CPU core perform a bit like a little GPU core group.
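A tiny illustration of the branching point in the GPU list above: data-parallel code typically computes both sides of a branch and masks the result, which is roughly the cost divergent branches impose (NumPy is used here purely as a stand-in for the SIMD/GPU execution model):

```python
import numpy as np

x = np.linspace(-5, 5, 16)

# CPU-style: a real branch per element; only one side is ever evaluated.
scalar = [xi * 2 if xi > 0 else xi ** 2 for xi in x]

# GPU/SIMD-style: no per-lane branch; both sides are computed for every
# lane and a mask selects the result, so divergent branches pay for both paths.
vector = np.where(x > 0, x * 2, x ** 2)

assert np.allclose(scalar, vector)
```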
One might wonder if they eventually get rebranded as Serial Processing Units and Parallel Processing Units, but of course it isn't that cut and dried either.
It's the wrong question to ask, since the GPU is now mainly being used as a data processing accelerator.
Perhaps a better question is whether our current CPU architecture is the most suitable for data processing, and the likely answer is no.
There are much better architectures for data processing tasks, namely dataflow CPU architectures [1].
There's a new trend toward dataflow-based architectures for data processing, and AMD is one of the companies pursuing it [2].
I think a combination of dataflow architecture and array-oriented computation is the best for data processing, because it generalizes almost all common data processing techniques, including spreadsheets, databases, graphs and matrices [3].
[1] Dataflow architecture:
https://en.wikipedia.org/wiki/Dataflow_architecture
[2] AIEBLAS: Open-Source Expandable BLAS Implementation for AMD/Xilinx Versal Device:
https://github.com/atlarge-research/AIE-BLAS
[3] Mathematics of Big Data: Spreadsheets, Databases, Matrices, and Graphs:
https://mitpress.mit.edu/9780262038393/mathematics-of-big-da...
In the eyes of most users, GPUs are improving dramatically every year whereas CPUs are not. This is highlighted by the fact that a lot of the growth in PC hardware is driven by gaming, where most gamers can keep the same CPU for a five-year period nowadays, while GPUs require more frequent upgrades to keep up.
I think the comparison makes a lot more sense when you compare a GPU with a system on a chip, where the SoC can do the basic work you need. To the average gamer, it looks like the only part of a "CPU" that is improving is the graphics.
So we might as well put everything on the GPU at this point. This goes hand-in-hand with demos showing games rendered via AI, storage being added to graphics cards, and AI processing in general.
This helps build the notion that maybe things wouldn't be bottlenecked by the CPU anymore if we switched to a new paradigm.
I think maybe if the switch to ARM happens for PCs, we will see more development of that idea of an SoC bundled with high-end GPUs.
Is there a transpiler / compiler that can take basic Python / Rust / C code and run it exclusively on GPU (I'm assuming GPUs support some sort of branching these days)?
It would be nice to be able to take random bits of code and actually look at what you win / lose by running (unoptimized) on GPU vs. CPU.
Hmmm, various models seem to think the problem is that nobody has done the work of figuring out how things like memory allocation and function control flow would work.
I'm surprised there isn't a fun GitHub repo with a minimalist proof of concept, or that none of the big players have implemented (even very inefficient) instructions in their GPUs to allow general-purpose computing.
No one has done it because it's a comically bad idea. It's like trying to build taxis with only rail infrastructure. GPUs are only fast because they do the same thing across multiple pieces of data at the same time. Branching is possible but incredibly expensive. You would end up with terrible performance and poor power efficiency. And you would still need a CPU to load programs onto the GPU and handle I/O.
This is basically a problem of automatic vectorization and parallelization. This is not a new area: if you’re curious about it there is literature about this in the compilers and high performance computing world going back to the 70s. Short answer: it’s quite hard, especially for generic code. It’s achievable to some degree for array-oriented, loop heavy code, but even then with severe limitations often related to the regularity of the logic in the loop bodies, the iteration space, and the use of pointers. There has been some work for Python to do this (eg, numba), but it is largely for array heavy code instead of general Python.
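Where the loop body is regular and array-only, a tool like Numba (mentioned above) can get surprisingly far. A minimal sketch, assuming Numba and NumPy are installed; the function name and workload are made up for illustration:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)          # compile to machine code; parallelize marked loops
def scaled_diff_norm(a, b, alpha):
    total = 0.0
    for i in prange(a.shape[0]):   # prange: iterations are spread across CPU cores
        d = a[i] - alpha * b[i]
        total += d * d             # Numba treats this as a parallel reduction
    return np.sqrt(total)

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
print(scaled_diff_norm(a, b, 0.5))
```

This works precisely because the loop is regular, array-heavy, and free of Python objects; generic Python code with pointers, dictionaries, and irregular control flow hits the limitations described above.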
Is the thought that it can't be done efficiently or that GPUs are missing instructions to make it possible at all?
Yes, SPIR-V does pretty much exactly this. Not for Python, I'd imagine; maybe someone has leveraged the ast package for it, though.
There's nothing particularly special about graphics programs. It's just that until Vulkan and Metal, the pipelines were more or less defined for you (with OpenGL, DX, etc).
There are several research efforts that do this, none you can reliably use in production.
Got a link or two?
What I wish we had is a portable version of SIMD that runs on any CPU and any GPU as part of all programming languages, so that taking advantage of those does not suck.
Just like we have for loops and threads that work on all CPU architectures and cores.
Instead we have a mess of various CPU instruction sets (SSE, AVX, NEON, SVE) along with incompatible GPU frameworks (CUDA, ROCm?).
Java is trying to do that with its Vector API, but only for CPUs, so GPUs are not included [0].
If we had a unified way to take advantage of CPU and GPU, I’m sure devs would do it more.
[0]: https://openjdk.org/jeps/426
Relies very heavily on the autovectorizer of GCC / LLVM, but it works somewhat well if you know what you're doing. Also, I'm pretty sure #pragma omp simd doesn't actually do anything, but maybe it makes the "failed to vectorize" warnings a bit more obvious??
I dunno, I don't play with the auto-vectorizer that much, actually.
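For what it's worth, the closest thing I know of to that wish in Python land is the NumPy/CuPy pairing: CuPy deliberately mirrors the NumPy API, so a function written against "whichever array module you pass in" runs on CPU or GPU. A minimal sketch, assuming CuPy and a CUDA device are available for the commented-out GPU call:

```python
import numpy as np

def normalize(xp, data):
    """Same source for CPU or GPU: `xp` is either the numpy or the cupy module."""
    arr = xp.asarray(data, dtype=xp.float32)
    return (arr - arr.mean()) / arr.std()

print(normalize(np, [1.0, 2.0, 3.0, 4.0]))      # runs on the CPU

# With CuPy installed and a CUDA device present, the same function runs on the GPU:
# import cupy as cp
# print(normalize(cp, [1.0, 2.0, 3.0, 4.0]))
```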
Suppose you want to use say 100 GPUs for LLM training but you can’t buy them because the US doesn’t allow them to be sold in your country.
Say you can readily get CPUs that are 1/100th the TFLOPS of the GPUs.
Could you use 10000 of the CPUs and get performance on LLM training similar to what the 100 GPUs would have given?
When I bought a power drill, I didn’t throw out my screwdrivers. There are still cases where I find them essential. They’re all tools, each with their merits and tradeoffs. Few things truly move the Pareto line.
Answer: greater facility with out-of-order execution, and better performance in high-variance duty cycles.
To use a gross analogy: why do we use Otto-cycle engines when, emissions aside, gas turbines exist?
The line blurs all the time. Just like client <-> server.
Good article. I would add that GPUs are particularly good for the case where you want to do the same thing to many different datums in parallel, e.g. brightening every pixel of an image: you have some data, the pixels in this case, and you want to do the same thing to every datum. This is called SIMD, short for "single instruction, multiple data", because each instruction you give to the GPU is applied to multiple pieces of data, hopefully in parallel.
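A tiny NumPy sketch of that brightening example, written as one whole-image, data-parallel expression (exactly the shape of work SIMD hardware likes); the fake image is a placeholder for illustration:

```python
import numpy as np

# A fake 8-bit grayscale image; a real one would come from an image library.
image = np.random.randint(0, 256, size=(720, 1280), dtype=np.uint8)

# One expression over every pixel: add 40, clamp at 255.
brighter = np.clip(image.astype(np.int16) + 40, 0, 255).astype(np.uint8)

print(image.mean(), brighter.mean())
```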
If planes are so good, why do we use cars and bicycles? ;)
AFAIK CPUs can also predict branches to speed up processing, which is why sorted arrays can be faster to process than unsorted ones.
Different tools for different jobs.
Though I expect the lines between hardware paradigms to get a bit blurry in the near future.
You need a GPU to pull a trigger of preinstalled pipes with colour capsules?
> If GPUs Are So Good
Who ever claimed that?
Because the best minds at IBM, DEC, HP, and Intel couldn't make VLIW effective for general purpose workloads.
VLIW seems to have solidly lost to SIMD in the desktop space.
Don't forget: as Intel was floundering with Itanium, AMD was just making SSE and SSE2 instructions faster (AMD's 3DNow! never became popular, but they continued to push it). It seems like SIMD was the historically correct choice over VLIW.
Today: every high-speed program in Desktop/Server is either SIMD-accelerated CPU (NEON or AVX512 or SVE), or full-blown SIMD GPU.
SIMD didn't really win for general purpose workloads, either.
It's just that execution units are so small relative to everything else that you might as well expend some extra silicon for them.
SIMD only really wins for very specific workloads (memory copy and 3D graphics, for example).
CPUs are less expensive, have higher yields, and have a more favorable model for managing peripherals.