> Why do we write programs with one thread that executes the instructions in order? Why not just always write fully parallelized code?
Maybe to a layperson this is an interesting question, but surely on this site the answer is obvious.
I skimmed through anyway, just on the off chance he's talking about some new GPU development I didn't know about. But no, it's just a clickbaity title for a basic "pros and cons of parallelizing" article of mediocre quality.
You're not the target audience. This is a very well-written explainer for non-techies. I've already emailed it to a couple of friends. Lovely.
This whole site is not the target audience.
> a clickbaity title for a basic "pros and cons of parallelizing" article of mediocre quality.
Why don't you write one with good quality (if not great) that is understandable to the audience of laypersons interested in, having or asking this question.
I think the main point here is that there is no fundamental amazing insight here for professionals to ingest.
This exact question came to my mind in the last two hours, after reading about the new Nvidia Digits supercomputer (which will retail at $3,000) and the RTX 5090 GPU (which will cost $2,000). At exactly that moment, I saw this submission here.
Teddy Wahle did a good job explaining it to non-professionals who are using gaming laptops/PCs in their daily lives for content creation and such.
If I can think of this question, imagine the thousands (if not millions) of people who have, or will have, similar questions.
Though no modern CPU actually executes instructions in order.
Because CPUs care about semantic ordering
If freight trains are so good, why do we use trucks?
Freight trains carry a lot, but are really bad at branching. Same with GPUs.
Freight trains are good, but pipelines are better for bulk liquids.
GPUs are massively parallel, limited-purpose CPUs. You're always waiting for the program counter to get to the end, and for data to move to/from RAM.
My project (since the 1980s) is to put compute and memory together, so you can either store 16 bits, or do a 4-bit lookup with the same data in every cell (which has 4x16 bits total). Having no memory-compute barrier should allow for pushing data through a program in a physical sense. Like an FPGA, but easier to program.
Imagine such a chip at a few GHz, delivering one output per clock cycle for a given task.
In my experience, the step of leaving the von Neumann architecture and programming how data flows across the compute engines is harder than dealing with the crappy FPGA tools. I tell people that with an FPGA you teach the computer what to be, not what to do. A bit like functional vs. imperative programming.
Therefore I can't yet imagine how to more easily onboard people who learned on a "classical CPU". Can you go a bit deeper into what makes your system easier to program?
FPGAs are heterogeneous, and the tools are all about getting the maximum utilization of a given chip's features while also routing signals to get the fastest reliable cycle times. Compile times sometimes stretch out into overnight sessions.
By making each cell identical you remove most of the floor planning concerns. By latching the data from each cell, you remove timing concerns, and deliver deterministic performance.
This has other benefits, like being able to route around a bad cell, or to build a "wall" around a given section of code.
The emulator will always give correct output, barring hardware failure.
I hope to get through Tiny Tapeout and have a chip before the end of the year.
As for programming, the code has to compile to a directed acyclic graph of boolean logical operations. I suspect that it would be fairly easy to use as a backend for tinygrad or perhaps LLVM as a stretch goal.
Ah thanks! Reminds me of systolic arrays somehow. Good luck and hopefully there are no bugs in your design!
Are you speaking about a sort of (multi-state) cellular automata?
No, each cell has its own 4x4-bit lookup table (64 bits of "program" or "data", depending on the mode of that cell), so it's more like an FPGA with the routing fabric stripped out.
This sounds like just an fpga. What do you mean by “routing fabric stripped out?” How does data get from one cell to another?
Each cell only hooks to its neighbors: up, down, left, right... so no "fast" wires across the chip, etc.
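If I've read that right, here's a toy Python sketch of such a cell, purely as illustration; the names, the latching detail, and the incrementer example are my own guesses, not the actual design:

```python
# Toy model of my reading of the description: each cell holds a
# 16-entry x 4-bit lookup table (64 bits) and only talks to its four
# neighbours. Everything here is a guess for illustration only.

class Cell:
    def __init__(self, lut):
        assert len(lut) == 16 and all(0 <= v < 16 for v in lut)
        self.lut = lut   # 16 entries x 4 bits = 64 bits of "program"/"data"
        self.out = 0     # latched 4-bit output, updated once per clock

    def step(self, four_bit_in):
        # Latching the output each cycle is what makes timing deterministic.
        self.out = self.lut[four_bit_in & 0xF]

def tick(cells, inputs):
    """Advance every cell one clock; inputs stand in for whatever the
    four neighbours are driving this cycle."""
    for cell, value in zip(cells, inputs):
        cell.step(value)
    return [c.out for c in cells]

# Example: a single cell programmed as a 4-bit incrementer.
inc = Cell([(i + 1) & 0xF for i in range(16)])
inc.step(7)
print(inc.out)  # 8
```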
This really sounds like the Xilinx XC6200 series FPGA. It has a few routing hierarchies, but it does strip out most of the routing a classical (for that time) FPGA has. I've heard figures that 70% of an FPGA is routing.
The XC6200 made the news once when they[1] used an evolutionary algorithm to create a configuration that could detect a single frequency tone. The resulting configuration was judged impossible to work, yet it did. Placing the configuration in a different part of the FPGA broke it.
It seemed the algorithm had exploited the analog properties of the FPGA in that specific location to arrive at a smaller result.
[1] https://www.idi.ntnu.no/emner/tdt22/2011/Thompsonieeeehw.pdf
If data only travels to local neighbors, this really is a multi-state cellular automaton, though.
I thought cellular automata all had to have the same rule/program across the grid? Like "Life" or the ones Wolfram is exploring.
Trucks are better at branching, but also heavily subsidized. In the 19th century when trains were being run everywhere, they were heavily subsidized too.
Trains in general are great, so why do people insist on sitting alone in an expensive car which they even have to drive themselves?
The gp comment speaks more of freight than of passengers. But the same analogy applies… trains are great when you need to transport a whole bunch of people exactly the same way, but they’re less useful if every person needs to start and end at a totally different place.
In the same sense, parallelization (and computing on a GPU) is great when you need to perform the same computation on a huge amount of data. When each computation is more independent, it might make more sense to perform it on the CPU.
Hmmmm, the blog here doesn't really sell it right IMO.
CPUs are actually hugely parallel today. Every CPU core, even the "small cores" (E-cores from Intel or little cores from ARM), is an out-of-order superscalar core. This means that every CPU can execute 2, 3, 4, 6, 8 or more instructions per clock tick.
What a CPU is, fundamentally, is a device that takes a classically written sequential program and *DISCOVERS* parallelism for you. And then executes that discovered parallelism.
A GPU, and SIMD in general, is the opposite. SIMD requires that the programmer writes the program in a grossly parallel manner before it even executes on the GPU. Naturally, GPUs lead to more parallelism, not due to magic but simply because the programmer thought about parallelism in a more scalable way.
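A small Python/NumPy sketch of the difference in how the parallelism gets expressed (NumPy is only a stand-in here for "the whole-array operation is stated up front"; the names are illustrative):

```python
import numpy as np

data = np.arange(1_000_000, dtype=np.float64)

# Sequential formulation: any parallelism has to be *discovered* by the
# out-of-order hardware while it chews through one iteration at a time.
total = 0.0
for x in data:
    total += x * 2.0

# Explicitly data-parallel formulation: the programmer states the
# whole-array operation up front, which maps naturally onto SIMD units
# or a GPU.
total_vec = (data * 2.0).sum()

assert np.isclose(total, total_vec)
```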
The only time a CPU "discovers" parallelism is in
> you couldn’t tell your friend, “You calculate steps 1 through 50 while I calculate steps 51 through 100.”
For that example, couldn't you do exactly that? Multiplication is commutative
Nit: the relevant property you're looking for is associativity (you can omit parentheses), not commutativity (you can swap left and right); there are operations that are the latter but not the former, such as the average: avg(a, b) equals avg(b, a), but avg(avg(a, b), c) is generally not avg(a, avg(b, c)).
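A quick check of that in plain Python:

```python
def avg(a, b):
    return (a + b) / 2

# Commutative: swapping the operands never changes the result.
assert avg(1, 2) == avg(2, 1)

# Not associative: the grouping matters, so you can't split the work arbitrarily.
print(avg(avg(1, 2), 3))  # 2.25
print(avg(1, avg(2, 3)))  # 1.75
```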
That's a good example. My favorite has always been rock-paper-scissors. If `vs` is a binary operator which gives you the winner according to the game, then
rock vs paper = paper vs rock
but
(rock vs paper) vs scissors != rock vs (paper vs scissors)
Much of parallelism to accomplish a single goal is trying to break computations into units that are associative and commutative. Then you can combine the results regardless of order.
Trying to decompose a problem in this way ranges from hard to impossible.
My thoughts exactly. His sequential program example is easily parallelizable. Maybe one of the takeaways should be that we need to work on converting more serial problems to parallel ones
Unfortunately, many interesting problems are known to be impossible to efficiently parallelize. This includes some very broad problems, like evaluating a logical expression or simulating a Turing machine.
https://en.wikipedia.org/wiki/P-complete
People have been beating on that problem for decades, with modest success. It's amazing how many different problems people are solving with GPUs.
There are at least four main ways to break apart a problem into parallel work units:
Search-type problems, where you want to do a lot of computations and discard most of them, are easy to parallelize. The extreme case is a crypto key tester. A Bitcoin miner and the WWII "bombe" are examples. The amount of data that comes out is tiny compared to the work done.
Then there are problems which decompose spatially. Space is what keeps everything from being in the same place. Fluid flow, weather prediction, and the hydrodynamics of nuclear explosions are all like that. The space is divided into regions, and the regions talk to their neighbors. So most communication is local. Those are classic supercomputer problems. Machine learning, of course: units are connected to local neighbors, so that is, in a sense, spatial.
Graphics rendering does large numbers of triangles in parallel. The combining operation has to be associative and commutative. Z-buffers make that happen for opaque and transparent surfaces. Translucent is harder, but there are tricks to pull that off without a depth sort. See "order independent transparency".
There are assembly-line approaches, where data progresses forward through multiple stages, with little or no data moving backwards. Audio and video processing work like that. So do some kinds of vision systems. This is the oldest approach - it's how large punched card tabulator shops worked, with trays of punched cards being carried from machine to machine.
Most things you can do on a GPU seem to fit into one of those four patterns. If you can hammer a problem into one of those four patterns, a GPU might help. If not, well...
Did I miss any core patterns?
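As a rough sketch of the first (search-type) pattern, using only the standard library and a made-up "expensive check": fan the candidates out across workers and keep only the rare hits, so almost no data comes back.

```python
from multiprocessing import Pool

def test_candidate(key):
    """Stand-in for an expensive check (key tester, hash search, ...).
    Returns the key only on the rare 'hit'."""
    return key if (key * 2654435761) % 1_000_003 == 42 else None

if __name__ == "__main__":
    candidates = range(1_000_000)
    with Pool() as pool:
        hits = [k for k in pool.imap_unordered(test_candidate, candidates,
                                               chunksize=10_000)
                if k is not None]
    # Huge amount of work goes in, a handful of results comes out.
    print(len(hits), "hits out of", len(candidates), "candidates:", hits)
```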
Yes, you would parallelize it with a prefix scan, which has complexity n log n, I believe. However, floating point is neither commutative nor associative.
You mean associative?
probably, lol
It should be noted that operations like addition on floating point numbers are not associative (even though addition on real numbers is). Round off error changes depending on the order of operations. The error in divide-and-conquer summation (which you might see in parallelized addition) is worst case much better than sequential addition, although not as good as more elaborate algorithms like Kahan summation.
https://en.wikipedia.org/wiki/Kahan_summation_algorithm#Alte...
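A small Python illustration of both points: reordering a float sum changes the result, and Kahan's compensated loop recovers most of the lost accuracy (math.fsum is used as the correctly rounded reference).

```python
import math
import random

random.seed(0)
# Mixed magnitudes make the rounding error visible.
xs = [random.uniform(-1, 1) * 10 ** random.randint(0, 8) for _ in range(100_000)]

def kahan_sum(values):
    total = 0.0
    c = 0.0                      # running compensation for lost low-order bits
    for v in values:
        y = v - c
        t = total + y
        c = (t - total) - y
        total = t
    return total

naive_fwd = sum(xs)
naive_rev = sum(reversed(xs))
exact = math.fsum(xs)            # correctly rounded reference

print(naive_fwd == naive_rev)    # usually False: the order of additions matters
print(abs(naive_fwd - exact), abs(kahan_sum(xs) - exact))  # Kahan's error is typically far smaller
```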
A better example would be the Fibonacci sequence, or some other inductive process.
Generally recurrent sequences cannot be computed in parallel, but the Fibonacci sequence is a special case because it's a linear recurrence. There's a closed-form equation to generate the Nth number.
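For example, a minimal sketch of Binet's closed form next to the sequential recurrence; each closed-form evaluation is independent of the others (so a range of n could be computed in parallel), though float precision limits it to roughly n ≤ 70:

```python
import math

PHI = (1 + math.sqrt(5)) / 2
PSI = (1 - math.sqrt(5)) / 2

def fib_closed_form(n):
    """Binet's formula: F(n) = (phi^n - psi^n) / sqrt(5).
    No dependence on F(n-1), unlike the recurrence below."""
    return round((PHI ** n - PSI ** n) / math.sqrt(5))

def fib_iterative(n):
    # The recurrence: each step needs the previous two values.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

assert all(fib_closed_form(n) == fib_iterative(n) for n in range(60))
```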
Linked Lists and other Tree Traversals are sequential.
Raytracing (which is a tree traversal) is a notable exception, because so many millions of rays get launched that it becomes efficient to group them back together and then parallelize the traversals all together.
So if you have "enough" of them going at once, it's back to GPU superiority. But a *SINGLE* linked list or *SINGLE* tree search will always be best done on a CPU.
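A rough sketch of that "enough of them at once" idea, with np.searchsorted standing in for the batched traversal a GPU would run over millions of independent queries:

```python
import numpy as np

rng = np.random.default_rng(0)
sorted_keys = np.sort(rng.integers(0, 1_000_000, size=100_000))

def single_search(keys, target):
    # One traversal: each step depends on the previous one -> CPU territory.
    lo, hi = 0, len(keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

queries = rng.integers(0, 1_000_000, size=1_000_000)
# A million independent traversals: batch them and let the data-parallel
# machinery (SIMD here, a GPU in the raytracing case) grind through them.
positions = np.searchsorted(sorted_keys, queries)

assert positions[0] == single_search(sorted_keys, queries[0])
```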
GPU cores:
+ Efficient decoding of one instruction stream for many parallel cores.
+ Efficient parallel access to lots of data in regular layout with address offsets relative to core index.
- Lots of instruction streams quickly require serialization, given only one instruction stream can be handled per block of co-working cores.
- Only one or a few parallel calculations per stream leaves the rest of the co-workers unused.
- Irregular data layouts force the parallel memory hardware to act sequentially, with each core getting access while the others wait their turn, until every core's access is complete.
- Lots of branching stalls cores: the cores that take one branch execute while the other cores stall, and then vice versa.
CPU cores:
+ Efficient decoding of different instruction streams per core.
+ Efficient with different memory layouts, due to per core caching.
+ Highly optimized for branching.
- Parallel calculations across cores suffer from redundant instruction decoding overhead for every core.
- Highly parallel calculations suffer from the relatively low number of cores.
GPU and CPU:
+ Both optimized for sequential data access across sequentially offset addresses.
CPU Bonuses:
+ They handle overall system initialization, power & device management, and other system tasks.
+ SIMD instructions let each CPU core perform a bit like a little GPU core group.
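A tiny illustration of the branching point in the GPU list above: data-parallel code typically computes both sides of a branch and masks the result, which is roughly the cost divergent branches impose (NumPy is used here purely as a stand-in for the SIMD/GPU execution model):

```python
import numpy as np

x = np.linspace(-5, 5, 16)

# CPU-style: a real branch per element; only one side is ever evaluated.
scalar = [xi * 2 if xi > 0 else xi ** 2 for xi in x]

# GPU/SIMD-style: no per-lane branch; both sides are computed for every
# lane and a mask selects the result, so divergent branches pay for both paths.
vector = np.where(x > 0, x * 2, x ** 2)

assert np.allclose(scalar, vector)
```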
One might wonder if they eventually get rebranded as Serial Processing Units and Parallel Processing Units, but of course it isn't that cut and dried either.
It's the wrong question to ask, since the GPU is now mainly being used as a data processing accelerator.
Perhaps a better question is whether our current CPU architecture is the most suitable for data processing, and the likely answer is no.
There are much better architectures for data processing tasks, namely dataflow CPU architectures [1].
There's a new trend toward dataflow-based architectures for data processing, and AMD is one of the companies pursuing it [2].
I think a combination of dataflow architecture and array-oriented computation is the best for data processing, because it generalizes almost all common data processing techniques, including spreadsheets, databases, graphs and matrices [3].
[1] Dataflow architecture:
https://en.wikipedia.org/wiki/Dataflow_architecture
[2] AIEBLAS: Open-Source Expandable BLAS Implementation for AMD/Xilinx Versal Device:
https://github.com/atlarge-research/AIE-BLAS
[3] Mathematics of Big Data: Spreadsheets, Databases, Matrices, and Graphs:
https://mitpress.mit.edu/9780262038393/mathematics-of-big-da...
In the eyes of most users, GPUs are improving dramatically every year whereas CPUs are not. This is highlighted by the fact that a lot of the growth in PC hardware is driven by gaming, where most gamers can keep the same CPU for a five-year period nowadays, while GPUs require more frequent upgrades to keep up.
I think the comparison makes a lot more sense when you compare a GPU with a system on a chip, where the SoC can do the basic work you need. To the average gamer, it looks like the only part of a "CPU" that is improving is the graphics.
So we might as well put everything on the GPU at this point. This goes hand-in-hand with demos showing games rendered via AI, storage being added to graphics cards, and AI processing in general.
This helps build the notion that maybe things wouldn't be bottlenecked by the CPU anymore if we switched to a new paradigm.
I think maybe if the switch to ARM happens for PCs, we will see more development of that idea of an SoC bundled with high-end GPUs.
Is there a transpiler / compiler that can take basic Python / Rust / C code and run it exclusively on GPU (I'm assuming GPUs support some sort of branching these days)?
It would be nice to be able to take random bits of code and actually look at what you win / lose by running (unoptimized) on GPU vs. CPU.
Hmmm, various models seem to think the problem is that nobody has done the work of figuring out how things like memory allocation and function control flow would work.
I'm surprised there isn't a fun GitHub repo with a minimalist proof of concept, or that none of the big players have implemented (even very inefficient) instructions in their GPUs to allow general-purpose computing.
No one has done it because it's a comically bad idea. It's like trying to build taxis with only rail infrastructure. GPUs are only fast because they do the same thing across multiple pieces of data at the same time. Branching is possible but incredibly expensive. You would end up with terrible performance and poor power efficiency. And you would still need a CPU to load programs onto the GPU and handle I/O.
This is basically a problem of automatic vectorization and parallelization. This is not a new area: if you’re curious about it there is literature about this in the compilers and high performance computing world going back to the 70s. Short answer: it’s quite hard, especially for generic code. It’s achievable to some degree for array-oriented, loop heavy code, but even then with severe limitations often related to the regularity of the logic in the loop bodies, the iteration space, and the use of pointers. There has been some work for Python to do this (eg, numba), but it is largely for array heavy code instead of general Python.
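Where the loop body is regular and array-only, a tool like Numba (mentioned above) can get surprisingly far. A minimal sketch, assuming Numba and NumPy are installed; the function name and workload are made up for illustration:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)          # compile to machine code; parallelize marked loops
def scaled_diff_norm(a, b, alpha):
    total = 0.0
    for i in prange(a.shape[0]):   # prange: iterations are spread across CPU cores
        d = a[i] - alpha * b[i]
        total += d * d             # Numba treats this as a parallel reduction
    return np.sqrt(total)

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
print(scaled_diff_norm(a, b, 0.5))
```

This works precisely because the loop is regular, array-heavy, and free of Python objects; generic Python code with pointers, dictionaries, and irregular control flow hits the limitations described above.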
Is the thought that it can't be done efficiently or that GPUs are missing instructions to make it possible at all?
Yes, SPIR-V does pretty much exactly this. Not for Python, I'd imagine; maybe someone has leveraged the ast package for it, though.
There's nothing particularly special about graphics programs. It's just that until Vulkan and Metal, the pipelines were more or less defined for you (with OpenGL, DX, etc).
There are several research efforts that do this, none you can reliably use in production.
Got a link or two?
What I wish we had is a portable version of SIMD that runs on any CPU and any GPU as part of all programming languages, so that taking advantage of those does not suck.
Just like we have for loops and threads that work on all CPU architectures and cores.
Instead we have a mess of various CPU instruction sets (SSE, AVX, NEON, SVE) along with incompatible GPU frameworks (CUDA, ROCm?).
Java is trying to do that with its Vector API, but only for CPUs, so GPUs are not included [0].
If we had a unified way to take advantage of CPU and GPU, I’m sure devs would do it more.
[0]: https://openjdk.org/jeps/426
Relies very heavily on the autovectorizer of GCC / LLVM, but it works somewhat well if you know what you're doing. Also, I'm pretty sure #pragma omp simd doesn't actually do anything, but maybe it makes the "failed to vectorize" warnings a bit more obvious??
I dunno, I don't play with the auto-vectorizer that much, actually.
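For what it's worth, the closest thing I know of to that wish in Python land is the NumPy/CuPy pairing: CuPy deliberately mirrors the NumPy API, so a function written against "whichever array module you pass in" runs on CPU or GPU. A minimal sketch, assuming CuPy and a CUDA device are available for the commented-out GPU call:

```python
import numpy as np

def normalize(xp, data):
    """Same source for CPU or GPU: `xp` is either the numpy or the cupy module."""
    arr = xp.asarray(data, dtype=xp.float32)
    return (arr - arr.mean()) / arr.std()

print(normalize(np, [1.0, 2.0, 3.0, 4.0]))      # runs on the CPU

# With CuPy installed and a CUDA device present, the same function runs on the GPU:
# import cupy as cp
# print(normalize(cp, [1.0, 2.0, 3.0, 4.0]))
```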
Suppose you want to use say 100 GPUs for LLM training but you can’t buy them because the US doesn’t allow them to be sold in your country.
Say you can readily get CPUs that are 1/100th the TFLOPS of the GPUs.
Could you use 10000 of the CPUs and get performance on LLM training similar to what the 100 GPUs would have given?
When I bought a power drill, I didn’t throw out my screwdrivers. There are still cases where I find them essential. They’re all tools, each with their merits and tradeoffs. Few things truly move the Pareto line.
Answer: greater facility with out-of-order execution, and better performance in high-variance duty cycles.
To use a gross analogy: why do we use Otto-cycle engines when, emissions aside, gas turbines exist?
The line blurs all the time. Just like client <-> server.
Good article. I would add that GPUs are particularly good for the case where you want to do the same thing to many different datums in parallel, e.g. brightening every pixel of an image: you have some data, the pixels in this case, and you want to do the same thing to every datum. This is called SIMD, short for "single instruction, multiple data", because each instruction you give to the GPU is applied to multiple pieces of data, hopefully in parallel.
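A tiny NumPy sketch of that brightening example, written as one whole-image, data-parallel expression (exactly the shape of work SIMD hardware likes); the fake image is a placeholder for illustration:

```python
import numpy as np

# A fake 8-bit grayscale image; a real one would come from an image library.
image = np.random.randint(0, 256, size=(720, 1280), dtype=np.uint8)

# One expression over every pixel: add 40, clamp at 255.
brighter = np.clip(image.astype(np.int16) + 40, 0, 255).astype(np.uint8)

print(image.mean(), brighter.mean())
```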
If planes are so good, why do we use cars and bicycles? ;)
AFAIK CPUs can also predict branches to speed up processing, which is why sorted arrays can be faster to process than unsorted ones.
Different tools for different jobs.
Though I expect the lines between hardware paradigms to get a bit blurry in the near future.
You need a GPU to pull a trigger of preinstalled pipes with colour capsules?
> If GPUs Are So Good
Who ever claimed that?
Because the best minds at IBM, DEC, HP, and Intel couldn't make VLIW effective for general purpose workloads.
VLIW seems to have solidly lost to SIMD in the desktop space.
Don't forget: as Intel was floundering with Itanium, AMD was just making SSE and SSE2 instructions faster (AMD's 3DNow! never became popular, but they continued to push it). It seems like SIMD was the historically correct choice over VLIW.
Today: every high-speed program in Desktop/Server is either SIMD-accelerated CPU (NEON or AVX512 or SVE), or full-blown SIMD GPU.
SIMD didn't really win for general purpose workloads, either.
It's just that execution units are so small relative to everything else that you might as well expend some extra silicon for them.
SIMD only really wins for very specific workloads (memory copy and 3D graphics, for example).
CPUs are less expensive, have higher yields, and have a more favorable model for managing peripherals.