If you want to browse the internet, you must first invent the universe
Let’s imagine the scene: you’re at your computer, browser open to google.com. You type “attention is all you need” into the search box and decisively hit “Enter”. Barely a moment later, your browser gives you this:
Today, I thought to myself: can I precisely describe everything that unfolds in the instant after I hit “Enter”? This blog post is my expedition to list down “every single” step involved in converting that keystroke into the polished webpage Google delivers. This means tracing and noting down the intricate dance of all the systems, mechanisms, and protocols I know are part of the show. Think of it as recursively asking “how did that happen?” until I stumble upon a topic that is completely beyond my grasp. Ideally, this recursive journey bottoms out at silicon atoms, given my computer science background (and lack of formal training elsewhere). Though, truth be told, we might hit some dead ends much sooner.
The title takes inspiration from Carl Sagan’s famous quote: “If you wish to make an apple pie from scratch, you must first invent the universe”. Every act of creation builds upon something else. A pie demands flour. Make your own flour? You’ll need wheat and milling gear. For those, you need farms and metal, requiring soil and minerals. To truly claim you made it from scratch, you’d need to conjure the dirt and minerals themselves. Forging iron necessitates a star “cooking” lighter elements. And to birth a star… well, you’re tracing back to the Big Bang itself.
A gentle disclaimer: This blog is written by a relatively naive, fairly recent computer science graduate. If you spot inaccuracies or think an explanation could use better grounding, please reach out via email or twitter.
The “Enter” moment
The instant my finger depresses the “Enter” key after typing the query, my browser’s (Firefox, in this case) “web engine” is going to take note of that. It likely executes a bunch of JavaScript code designed to redirect me to https://www.google.com/search?q=attention+is+all+you+need. This JavaScript likely performs some additional housekeeping too (dealing with cookies, perhaps some tracking mechanisms) if I’m not logged into Google. Firefox uses Gecko as its rendering engine, which gets its JavaScript engine from SpiderMonkey. I tried to dig into the source code for both, hoping to unearth some meaningful insights about JavaScript runtimes and how they detect and handle events like a key press. I even peeked into the ambitious SerenityOS project, which features a browser built entirely from the ground up. Alas, the sheer complexity of these internals is beyond my small brain – a somewhat humbling start to our exhaustive inventory.
With that, let’s follow Firefox as it starts retrieving the webpage at https://www.google.com/search?q=attention+is+all+you+need.
Fetching the webpage from google.com
Gecko delegates all its networking duties to Necko. Since we were already looking at google.com when “Enter” was hit, Firefox almost certainly has google.com’s IP address cached. If it didn’t, a DNS request would be made to the configured DNS server. In my rather standard home setup, my PC connects to my internet router via an RJ45 connector cable, and the router conveniently doubles as my DNS server. The DNS request itself is just a specially structured packet sent using the Internet Protocol.
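For concreteness, here is a minimal sketch (in Python, not what Necko actually does) of such a DNS query packet, sent over UDP to a hypothetical resolver address. The header and question layout follow RFC 1035; the resolver IP is an assumption you'd adjust for your own network.

```python
import socket
import struct

def build_dns_query(hostname: str) -> bytes:
    """Build a minimal DNS query packet (RFC 1035) asking for an A record."""
    header = struct.pack(
        ">HHHHHH",
        0x1234,   # transaction ID (arbitrary)
        0x0100,   # flags: standard query, recursion desired
        1,        # QDCOUNT: one question
        0, 0, 0,  # no answer/authority/additional records
    )
    # The question encodes the hostname as length-prefixed labels: 3www6google3com0
    question = b"".join(
        bytes([len(label)]) + label.encode() for label in hostname.split(".")
    ) + b"\x00"
    question += struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

# Hypothetical resolver: my home router. Swap in your own resolver's address.
RESOLVER = ("192.168.1.1", 53)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2.0)
sock.sendto(build_dns_query("www.google.com"), RESOLVER)
response, _ = sock.recvfrom(512)
print(f"Got {len(response)} bytes back; the A record sits near the end.")
```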
Once the IP address is secured, Firefox will likely issue an HTTP GET (or perhaps a POST?) request with the path /search?q=attention+is+all+you+need. My understanding is that Necko handles the upper layers of the OSI model (layers 7, 6, and 5). Necko relies on NSS for the cryptographic heavy lifting of TLS. The transport layers and everything below are abstracted away by functions like NSPR’s PR_OpenTCPSocket. On Linux, I suspect these are essentially wrappers around the familiar POSIX-style socket() APIs. I believe these APIs primarily manage layers 5 and 4, and sometimes layer 3. Layer 3 and below ideally become the domain of hardware – i.e. your network interface card (NIC), whether it’s a dedicated card or integrated onto the motherboard. The CPU offloads the data to the network card using “direct memory access” (DMA), a common technique for data transfers involving network cards, storage devices, and graphics cards.
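To make the layering concrete, here is a rough Python sketch of the same sequence of steps: open a TCP connection, wrap it in TLS, then send a plain-text HTTP GET, using the standard socket and ssl modules rather than NSPR/NSS. This is illustrative only; Firefox’s real request carries many more headers and likely uses HTTP/2 or HTTP/3.

```python
import socket
import ssl

HOST = "www.google.com"
PATH = "/search?q=attention+is+all+you+need"

# Open a TCP connection (layer 4). On Linux this goes through the same
# kernel paths as a POSIX socket()/connect() call.
raw_sock = socket.create_connection((HOST, 443))

# Wrap it in TLS (roughly what NSS does for Necko).
ctx = ssl.create_default_context()
tls_sock = ctx.wrap_socket(raw_sock, server_hostname=HOST)

# Send a bare-bones HTTP/1.1 GET request (layer 7).
request = (
    f"GET {PATH} HTTP/1.1\r\n"
    f"Host: {HOST}\r\n"
    "Connection: close\r\n"
    "\r\n"
)
tls_sock.sendall(request.encode())

# Read until the server closes the connection.
chunks = []
while chunk := tls_sock.recv(4096):
    chunks.append(chunk)
print(b"".join(chunks)[:200])  # status line plus the first few headers
tls_sock.close()
```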
The NIC itself would take charge of robustly translating digital data into the actual physical signals transmitted over the cable. Several excellent online guides demonstrate the circuitry required to modulate signals on the physical medium to correctly encode data. So, while intricate, there (likely) isn’t any deep dark magic here (unlike the CPU) in the signaling itself. However, this raw “dumb” physicality is augmented by many smart routing algorithms enabling networks to dynamically discover their topology and calculate optimal paths to each other. These algorithms (like “link state routing”) typically run when the network changes (e.g., a new device joins), so it’s unlikely any routing tables were updated during our specific search query journey.
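For a flavor of what “calculating optimal paths” means, here is a toy Dijkstra computation over a made-up topology: roughly the shortest-path step that link-state protocols such as OSPF perform once every router has learned the network graph. The router names and link costs are invented for illustration.

```python
import heapq

def shortest_paths(graph: dict, source: str) -> dict:
    """Dijkstra over a link-state database: cheapest cost to every other router."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, link_cost in graph[node].items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return dist

# Hypothetical topology: my router, my ISP, two backbones, and a Google edge node.
topology = {
    "home":        {"isp": 1},
    "isp":         {"home": 1, "backbone-a": 3, "backbone-b": 5},
    "backbone-a":  {"isp": 3, "google-edge": 2},
    "backbone-b":  {"isp": 5, "google-edge": 1},
    "google-edge": {"backbone-a": 2, "backbone-b": 1},
}
print(shortest_paths(topology, "home"))  # {'home': 0, 'isp': 1, 'backbone-a': 4, ...}
```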
So far, we’ve pressed “Enter”, established a TCP connection to Google and our request has just been encoded into electrical pulses, sent down the wire, and arrived at Google’s doorstep.
Inside Google land
When our request reaches the server associated with google.com in the DNS records, it very likely doesn’t hit the “actual server” (with all the search index data) directly. Instead, it first encounters a “load balancer”. As the name implies, the load balancer acts like a traffic director, selecting an “appropriate” backend server to handle the query before a response is sent back. The criteria for “appropriate” can vary – perhaps geographic proximity, current server load, or a simple round-robin assignment. Unfortunately, finding concrete details on the exact implementations and how they seamlessly manage this handoff (ensuring our browser still sees google.com, not the specific backend server’s IP) proved challenging. I’ve included some links in the references that discuss this, but battle-tested implementations aren’t openly documented.
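Here is a toy sketch of the simplest of those policies, round-robin, over a hypothetical backend pool. Real load balancers (whether layer 4 or layer 7) also proxy the traffic, which is why the client only ever sees the balancer’s own address.

```python
import itertools

class RoundRobinBalancer:
    """Toy load balancer: hand each incoming request to the next backend in turn."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick_backend(self, request_path: str) -> str:
        backend = next(self._cycle)
        # A real balancer would now forward the request to this backend and
        # relay the response back, hiding the backend's IP from the client.
        return backend

# Hypothetical backend pool.
lb = RoundRobinBalancer(["10.0.0.11", "10.0.0.12", "10.0.0.13"])
for _ in range(4):
    print(lb.pick_backend("/search?q=attention+is+all+you+need"))
```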
Notice too that our request URL – /search?q=attention+is+all+you+need – contains the query itself. This structure makes the corresponding response easily cacheable. According to a Google blog post from 2019, roughly 15% of the queries they receive are novel, ones they haven’t encountered before. For our specific query, “attention is all you need,” it’s highly probable that Google can instantly serve a cached HTML response.
But what might happen for those 15% of fresh queries? The exact inner workings are Google’s secret sauce, naturally. But dusting off some information retrieval concepts can offer an educated guess. Document retrieval often relies on an inverted index – essentially a map where keys are keywords and values are lists of documents containing those keywords. Given a query, Google looks up each keyword and merges the per-keyword document lists into a set of potentially relevant documents. Then, various metrics are applied to rank these documents – factors like “how many query keywords appear in a document, adjusted for document length”. Google likely selects the top 10-15 (along with 3-4 ads, sigh) for the first page. Alongside relevance, page “importance” is critical. A Wikipedia page, for instance, referenced by countless other sites, carries more weight. PageRank is the famous automated system for gauging this importance. The original algorithm proposed by Larry Page reportedly hasn’t been used since 2006; Google later patented a newer approach, sometimes dubbed “PageRank 2.0”, which may itself be obsolete by now.
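Here is a tiny, illustrative sketch of that lookup-then-rank flow over a toy corpus. The scoring function is a crude stand-in for the far more elaborate relevance and importance signals Google actually combines.

```python
from collections import defaultdict

# Toy corpus standing in for the web.
documents = {
    1: "attention is all you need transformer paper",
    2: "you need to pay attention in class",
    3: "recipes for apple pie from scratch",
}

# Build the inverted index: keyword -> set of documents containing it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():
        inverted_index[word].add(doc_id)

def search(query: str, top_k: int = 10):
    words = query.split()
    # Merge the per-keyword document sets into one candidate set.
    candidates = set().union(*(inverted_index.get(w, set()) for w in words))
    # Naive relevance: fraction of a document's words that are query words.
    def score(doc_id):
        doc_words = documents[doc_id].split()
        return sum(w in words for w in doc_words) / len(doc_words)
    return sorted(candidates, key=score, reverse=True)[:top_k]

print(search("attention is all you need"))  # -> [1, 2]
```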
Both the massive inverted index and the data underpinning the importance calculations likely reside in a high-performance, fault-tolerant distributed database system like Google’s own “Spanner”. So, for those uncached 15% of queries, Google probably performs an inverted index lookup, ranks the results using its importance mechanism, and serves the top results (sprinkled with a few ads, naturally). Since 2019, Google has also incorporated powerful neural networks like BERT to enhance the quality (not necessarily the speed) of search results. Whatever the precise steps, Google very likely operates at near-peak computational efficiency across this entire pipeline. And you can bet this newly computed result (along with many intermediate and partial results) is now aggressively cached for future use.
Rendering Google’s response
Having identified the relevant documents, Google will quickly put them into an HTML response and send it back across the internet. As this response travels back, our browser initially paints the designated “canvas” area as blank, often displaying a spinning animation – the universal sign for “hang tight”. The network stack within our PC and browser diligently strips away the protocol layers, verifies the integrity of the data, and hands the raw HTML content to Gecko. Gecko then parses this HTML and begins the process of fetching any additional resources referenced within it – external CSS files, JavaScript code, fonts, image thumbnails, and so on. Simultaneously, it starts “painting” the visual representation onto the browser window in front of me. This display iteratively refines itself as more dependent resources arrive and are processed. Once Gecko determines that all necessary steps to render the page are complete, it halts the spinner animation. I can now scroll through the results and continue my work. All told, this process likely took on the order of 100 milliseconds.
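As a rough illustration of that “fetch what the HTML references” step, here is a sketch using Python’s html.parser to collect stylesheet, script, and image URLs from a made-up response. Gecko’s real HTML parser, speculative preloader, and resource scheduler are vastly more sophisticated.

```python
from html.parser import HTMLParser

class SubresourceCollector(HTMLParser):
    """Collect the extra URLs a browser would need to fetch while rendering."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and "href" in attrs:              # stylesheets, icons
            self.urls.append(attrs["href"])
        elif tag in ("script", "img") and "src" in attrs:  # JS, thumbnails
            self.urls.append(attrs["src"])

# Hypothetical snippet of a results page.
html = """
<html><head>
  <link rel="stylesheet" href="/styles/results.css">
  <script src="/js/instant.js"></script>
</head><body><img src="/thumbs/paper.png"></body></html>
"""
collector = SubresourceCollector()
collector.feed(html)
print(collector.urls)  # ['/styles/results.css', '/js/instant.js', '/thumbs/paper.png']
```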
Prequel: How did we even get here?
On my system, the journey began when I pressed the power button. My computer booted into its operating system – Arch Linux with the KDE desktop environment. Then, using my mouse, I navigated the cursor to the “Firefox” icon and clicked it once, eventually landing on google.com.
The spark of life: booting up
Pressing the power button brings the CPU out of reset, and it begins executing at a predefined “reset vector” address mapped into the motherboard’s flash memory, launching the firmware – UEFI (Unified Extensible Firmware Interface) in my case. This kicks off a cascade of system initializations, a key step being the “power-on self-test” (POST). UEFI performs these checks, verifying that essential hardware – disks, keyboard controllers, display adapters, even network controllers (for network booting scenarios) – is present and responding correctly before proceeding. The precise mechanism for these checks is a bit unclear to me, but I imagine it involves sending predefined signals over specific motherboard buses and verifying the expected responses. The intricacies of “Secure Boot” and what happens if UEFI fails to validate the signature of an EFI binary are also areas where my knowledge is fuzzy. We’ll touch upon the CPU’s inner workings later.
Assuming UEFI gives the green light, it launches the designated EFI application – in my setup, the rEFInd boot manager. The boot manager’s job is to load the appropriate Linux kernel image (vmlinux). The kernel then undertakes much more sophisticated hardware detection and initialization. Many terms related to this phase and subsequent processes are still somewhat opaque to me; the Arch boot process page on the ArchWiki and the OS book (see References) are my go-to references for specifics. The crucial outcome is that the kernel eventually starts the very first userspace program, /init, assigning it the special process ID (PID) 1.
Setting the stage: starting the DE
The /init process orchestrates the startup of numerous system services and processes, including my desktop environment (DE), ultimately presenting me with a graphical user interface. The DE’s code relies heavily on graphics drivers – either generic ones included with the kernel or the proprietary drivers from the GPU manufacturer (NVIDIA, in my case) – to render visuals on my display. As I understand it, the DE describes the desired visual output using a set of “graphics primitives” (lines, shapes, text, etc.) for individual windows. A “compositor” then takes these individual window drawings and combines them into the final image displayed on the screen.
The transformation of these abstract primitives into the grid of actual pixel colors your monitor understands is called “rasterization”. OpenGL (and its successors like Vulkan) provides a standardized API for specifying how rasterization-based rendering should behave. This means applications can target the OpenGL standard, and hardware manufacturers just need to provide drivers implementing that standard. The performance (both responsiveness and computational efficiency) of a desktop environment often hinges on using the appropriate graphics driver, as highlighted in the official KDE documentation.
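To make “rasterization” concrete, here is a sketch of Bresenham’s classic line algorithm, which turns an abstract line primitive into discrete pixel coordinates. Real GPU rasterizers work mostly on triangles and do this massively in parallel, but the underlying idea is the same.

```python
def rasterize_line(x0, y0, x1, y1):
    """Bresenham's algorithm: turn an abstract 'line' primitive into pixels."""
    pixels = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    step_x = 1 if x0 < x1 else -1
    step_y = 1 if y0 < y1 else -1
    error = dx + dy
    while True:
        pixels.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        doubled = 2 * error
        if doubled >= dy:   # step horizontally
            error += dy
            x0 += step_x
        if doubled <= dx:   # step vertically
            error += dx
            y0 += step_y
    return pixels

# A shallow diagonal line becomes a concrete list of pixel coordinates:
print(rasterize_line(0, 0, 6, 3))
```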
The click: launching firefox
The DE constantly listens for user input from devices like the keyboard and mouse. It’s important to remember that a DE, like any userspace application, doesn’t directly access the raw interrupts or architecture-specific opcodes from these devices. Every time you click a mouse button, press a key, or nudge your mouse, the input device sends an interrupt to the CPU. The kernel intercepts this interrupt, identifies its source, and notifies the appropriate userspace program – usually the one currently in focus or the window manager. If you’re on Linux, try running watch -n0.5 "cat /proc/interrupts" in a terminal; you’ll see fascinating statistics about the different interrupt types your CPU is handling.
Most graphical programs, including my DE, operate within an event loop: an infinite cycle that renders the current visual state (“frame”) while simultaneously listening for incoming events (like interrupts). When a mouse click is registered at specific $(x, y)$ screen coordinates, I assume the DE consults the compositor to determine precisely which UI element (like an icon or button) should receive that click event. I say “assume” because I couldn’t find a definitive source, but the compositor seems logically positioned to have the necessary information about screen layout. Since my click landed on the Firefox icon, the DE proceeds to request the operating system to launch the Firefox application, likely using a system call like execve(..) or one of its variants.
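Here is a minimal sketch of that launch step on Linux, using Python’s wrappers around fork() and the execve() family. The program name and arguments are assumptions for illustration; how the DE actually tracks and supervises the child is far more involved.

```python
import os

def launch(program, argv):
    """Roughly what a desktop environment does on a click:
    fork a child process and replace its image with the target binary."""
    pid = os.fork()
    if pid == 0:
        # Child: an execve-family call replaces this process with the program.
        os.execvp(program, argv)
        os._exit(127)  # only reached if execvp itself failed
    return pid  # Parent: the DE goes straight back to its event loop.

# Hypothetical: launch Firefox pointed at Google (paths vary by system).
child = launch("firefox", ["firefox", "https://www.google.com"])
print(f"Handed the new process (PID {child}) over to the scheduler")
```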
Loading Firefox involves the OS performing numerous critical tasks. The Firefox binary residing on my disk isn’t statically linked (unlike the vmlinux kernel image); it depends on various system-level shared libraries (like the C standard library) to function. The OS is responsible for locating the Firefox executable file on the disk, reading its contents into memory, correctly mapping different sections of the executable (code, data) into the new process’s address space, and linking it with the required shared libraries. This process touches upon several fundamental OS concepts: security, memory management (paging), process management (scheduling), and user/permission handling. Since we’re just launching a common web browser, Linux will likely initialize the process structure, mark it as ready to run, and hand it over to the scheduler, which decides when and for how long this new Firefox process gets to execute its instructions on the CPU. Diving deeper here would mean revisiting core OS textbooks (see References).
The heart: How to CPU
At its core, a CPU is a vast, intricate electrical circuit centered around an Arithmetic Logic Unit (ALU). Other key components include the control unit and Input/Output (I/O) units, following the classic von Neumann architecture. If you’ve taken a basic digital electronics course, you’ve likely seen how fundamental logic gates (AND, OR, NOT) can be combined to create structures like a full adder. The CPU implements basic arithmetic and logical operations using such circuits. Its immediate working memory consists of a small set of high-speed storage locations called registers, including a special one often called the accumulator.
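As a tiny illustration, here is a full adder expressed with bitwise gate operations and chained into a toy 8-bit ripple-carry adder – conceptually the kind of circuit sitting inside an ALU, minus all the actual electrical engineering.

```python
def full_adder(a, b, carry_in):
    """One bit of an adder, built only from AND/OR/XOR gate operations."""
    s = a ^ b ^ carry_in                        # sum bit
    carry_out = (a & b) | (carry_in & (a ^ b))  # carry propagates onward
    return s, carry_out

def ripple_add(x, y, width=8):
    """Chain `width` full adders together, like a toy 8-bit ALU add."""
    result, carry = 0, 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry  # carry out of the top bit doubles as an overflow flag

print(ripple_add(0b00000101, 0b00000011))  # (8, 0): 5 + 3
```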
When the CPU encounters a binary instruction (like addl $5, %eax, which adds the constant value 5 to the eax register), the control unit electronically “decodes” it. This decoding process generates a sequence of control signals that open and close specific electrical pathways, directing data flow between registers, the ALU, and back. A system clock provides a synchronizing pulse; each rising edge of the clock signal triggers the next step in executing an instruction.
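Here is a toy software model of that fetch–decode–execute idea for a made-up two-instruction “ISA”. Needless to say, real decoding happens in hardware as control signals, not as an if/else chain; this is only meant to show the shape of the cycle.

```python
# Toy register file and a two-instruction "ISA": move-immediate and add-immediate.
registers = {"eax": 0, "ebx": 0}

program = [
    ("movl", 7, "eax"),   # movl $7, %eax
    ("addl", 5, "eax"),   # addl $5, %eax
]

for opcode, immediate, reg in program:       # "fetch" the next instruction
    if opcode == "movl":                     # "decode" ...
        registers[reg] = immediate           # ... and "execute"
    elif opcode == "addl":
        registers[reg] = (registers[reg] + immediate) & 0xFFFFFFFF  # 32-bit wrap
    else:
        raise ValueError(f"unknown opcode {opcode}")

print(registers)  # {'eax': 12, 'ebx': 0}
```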
Memory hierarchy
Operations like our example – adding a constant to a register, or adding two registers together – are typically lightning fast, often completing within a handful of clock cycles. However, complications arise whenever an instruction needs to access data stored in main memory (RAM). The memory controller and the RAM chips themselves are physically “distant” from the CPU core, meaning data takes a relatively long time to travel back and forth. To mitigate this latency, whenever the CPU requests data from a specific memory address, it usually fetches a small block of adjacent data (a “cache line”, often 64 bytes) along with it, storing this block in its own small, extremely fast onboard memory called the CPU cache. Building large, fast caches is expensive due to the required circuit complexity. Furthermore, main memory access can be slowed down even more if virtual memory (paging) is used (as it is on virtually all modern systems) and the required address translation isn’t found in the Translation Lookaside Buffer (TLB) – a sort of cache for page table entries. Writing code that is mindful of this memory hierarchy can yield massive performance gains without needing a faster CPU or more cache!
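To see why this matters, here is a toy simulation (not a benchmark) of a tiny direct-mapped cache with 64-byte lines: walking a buffer front to back touches each line exactly once, while jumping around refetches lines constantly. The cache size and access patterns are invented purely for illustration.

```python
CACHE_LINE = 64   # bytes per cache line
NUM_SLOTS = 8     # a tiny direct-mapped cache: 8 lines = 512 bytes

def count_misses(addresses):
    """Simulate the tiny cache and count how often a line must be fetched from 'RAM'."""
    slots = [None] * NUM_SLOTS
    misses = 0
    for addr in addresses:
        line = addr // CACHE_LINE
        slot = line % NUM_SLOTS
        if slots[slot] != line:  # not cached -> fetch the whole 64-byte line
            slots[slot] = line
            misses += 1
    return misses

N = 4096
sequential = range(N)                          # walk a 4 KiB buffer front to back
scattered = [(i * 257) % N for i in range(N)]  # touch the same bytes, but jumping around

print(count_misses(sequential))  # 64: one fetch per line, then 63 free hits each
print(count_misses(scattered))   # thousands: almost every access refetches a line
```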
Instruction pipelining
Because of this inherent memory bottleneck, modern CPUs employ a technique called “instruction pipelining”. The core idea is to break down the execution of a single instruction into several smaller, often independent stages. Common stages include: Fetch (get instruction from memory/cache), Decode (figure out what the instruction does), Execute (perform the operation, e.g., in the ALU), Memory (access main memory if needed), and Write Back (store the result in a register). This is analogous to an assembly line: instead of one person doing all five steps to assemble a product, five people each do one step, passing the product along. The benefit is throughput: while one instruction is in the Execute stage, the next can be in the Decode stage, and the one after that can be Fetched. Simple instructions like addl $5, %eax no longer necessarily have to wait for a previous instruction’s slow memory access to complete. Of course, if the previous instruction modified %eax, our addl instruction would have to wait (a pipeline stall or hazard) until the correct value is ready. Stalls can also occur due to conditional branches (if statements), where the CPU might have to wait to find out which path of execution to take. Modern CPUs counter this with “branch prediction”: the hardware makes an educated guess (based on past behavior, sometimes aided by compiler hints) about which branch will be taken and speculatively fills the pipeline with instructions down that path. If the prediction was correct, time is saved; if wrong, the pipeline is flushed (a small time penalty) and the correct path is taken.
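Here is a sketch of the textbook hardware scheme, a 2-bit saturating counter, applied to a made-up loop branch. Real predictors are far more elaborate (history registers, pattern tables, and so on), but even this toy version gets a repetitive branch right most of the time.

```python
def predict_branches(outcomes):
    """A 2-bit saturating counter predictor:
    states 0-1 predict 'not taken', states 2-3 predict 'taken'."""
    state, correct = 0, 0
    for taken in outcomes:
        prediction = state >= 2
        if prediction == taken:
            correct += 1   # the speculatively fetched instructions were useful
        # otherwise the pipeline would be flushed here
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

# A loop branch: taken 15 times, then falls through once, repeated 100 times.
loop_branch = ([True] * 15 + [False]) * 100
print(f"{predict_branches(loop_branch):.1%} of predictions correct")  # ~93%
```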
Fabricating a CPU (from scratch)
If you understand basic transistor logic for arithmetic, you could theoretically design a simple ALU and processor circuit. In fact, I once built a rather janky 8-bit CPU myself using redstone in Minecraft, complete with basic compute and data movement instructions. Power efficiency, speed, and circuit density were obviously non-concerns! Even back in 2019, this wasn’t groundbreaking, with several impressive designs demonstrating 32-bit Minecraft CPUs.
Transitioning from a conceptual circuit (or a redstone contraption) to a real, manufacturable, high-performance CPU requires profound expertise in material science, electrical engineering, and perhaps even chemistry. Here’s a vastly simplified glimpse of that journey: First, the intricate circuit design is translated into a physical layout using specialized tools like MagicVLSI. Even achieving just this layout could potentially make you rich if you licensed it to a manufacturer like TSMC (assuming your design is sound). From here, we dive deep into material science. The process starts with silicon dioxide (essentially sand), which is refined into silicon of extreme purity. This silicon is formed into large cylindrical ingots, which are then sliced into thin (around 0.75mm) circular wafers, typically 300mm (about 12 inches) in diameter. This size is apparently a sweet spot, minimizing edge waste while managing contamination risk.
Next comes the magic (and extreme precision). The silicon wafers are “doped” by introducing specific impurities (like phosphorus or boron) into the crystal lattice to create the N-type and P-type semiconductor regions that form transistors. The wafer is then coated with a light-sensitive material called “photoresist”. The circuit layout designed earlier is converted into a series of “masks” (often made of quartz), essentially stencils for different layers of the chip. One mask is precisely aligned over the wafer, and ultraviolet light is shone through it. This process is called “photolithography”. Where the light hits the photoresist, its chemical properties change. A solvent then washes away the exposed (or unexposed, depending on the process) photoresist, leaving behind a pattern. Ion implantation or diffusion processes create the doped regions in the exposed silicon. Insulating layers (like silicon dioxide) are deposited, and tiny trenches are etched and filled with conductive material (like copper) to create the wiring connecting the transistors. This surface is then polished perfectly flat, completing a single layer of the CPU.
A modern CPU consists of dozens of such layers meticulously stacked atop one another. This multi-layer fabrication process is incredibly delicate and time-consuming; it’s a marvel of modern engineering that it works at all. At every single step, maintaining an environment virtually free of dust particles or any other contaminants is paramount. A single microscopic flaw can render an entire CPU (or potentially all ~230 CPUs on the wafer) useless – the stakes make the infamous “Fly” episode in Breaking Bad seem trivial. Ultra-pure water is used extensively for cleaning between stages. After this complex ballet of deposition, etching, and doping is complete, the individual CPU squares (“dies”) are cut from the wafer and rigorously tested. Based on their performance characteristics (clock speed achieved, functional cores), they are “binned” and sold as different product tiers (e.g., Core i3 vs. Core i9). Defective dies are discarded. Finally, the functional dies are packaged – mounted onto a substrate with pins or pads for connecting to the motherboard, and capped with a protective heat spreader.
And here… here is where my understanding hits a wall. When the CPU executes an instruction like addl $5, %eax, a cascade of events unfolds at a scale bordering on the atomic, far smaller than my intuition can grasp. A “4nm” process node is nowadays more of a marketing label than a literal feature size, but the smallest structures on such chips are still only tens of nanometers across. A single silicon atom is roughly 0.2 nanometers across. We are operating incredibly close to fundamental physical limits. I don’t know the specific challenges faced at 4nm, nor how much smaller we can realistically go. We’ve reached the silicon atom bedrock. To dig deeper requires expertise far outside computer science, and perhaps, necessitates inventing the universe to ensure silicon atoms behave exactly as needed – not too conductive, not too insulating, just the perfect amount of “semi” conductance!
Conclusion
So there you have it. A journey sparked by pressing “Enter” during a mundane Google search led us down through layers of software, networking protocols, operating system machinery, and finally into the actual black magic fuckery of CPU fabrication and operation. It’s genuinely astounding that we can build machines that pull off mechanisms as intricate as pipelining and caching. The sheer number of coordinated processes firing off in near-instantaneous succession just to deliver search results is staggering. It represents a colossal accumulation of human ingenuity, contributed by countless individuals across diverse fields, many of whom never knew each other, culminating in technologies we now often take utterly for granted.
To put this into context, this entire deep dive was prompted by a relatively simple, everyday task: browsing the text-based web. We haven’t even touched upon the complexities involved in applications like modern computer games or streaming high-definition video, which involve an intense interplay between networking, CPUs, and specialized throughput-oriented processors like GPUs (whose own fascinating architecture we completely skipped). We also glossed over the gnarly, crucial details of process isolation, memory protection, and user management within the operating system – areas where subtle flaws can lead to significant security vulnerabilities, many perhaps still undiscovered – or more dangerously, discovered but undisclosed.
Now, a reasonable question arises: “In the age of AI assistants like Claude, Gemini, or [Future-AI-Model] that can generate working code from a simple prompt, is going into this level of detail still valuable?” I don’t have a definitive answer for everyone. Personally, I was drawn to CS precisely because I wanted to understand this intricate technological tapestry. I believe that having this deeper awareness – understanding the layers beneath the abstractions – could make one a better programmer, better equipped to diagnose problems, optimize performance, and reason about system behavior. Perhaps I’m clinging to an old-fashioned notion, and the next generation of AI will internalize all this knowledge and more. Yet, the desire to comprehend this unfathomably delicate choreography of systems, protocols and eventually physical atoms, still persists. It remains profoundly useful to internalize that CPU operations are often constrained by memory access speed, and that moving data – whether across a network or from RAM to a CPU register – is still one of the most fundamental and challenging bottlenecks in computing. I struggle to think of any other field where such routine, everyday actions performed by millions of consumers, rely on the flawless, split-second execution of so many disparate, complex systems. That, to me, is why computer science retains its magic, and I hope it always will.
References + interesting tangents
- Operating System Concepts, 10th Edition by Silberschatz, Galvin, and Gagne
- Understanding the Linux Kernel by Bovet and Cesati
- Computer Networks by Tanenbaum
- Introduction to Information Retrieval by Manning, Raghavan, and Schütze
- A first take at building an inverted index
- “Updating Inverted Index” US Patent
- “Index server architecture using tiered and sharded phrase posting lists” US Patent
- “Producing a ranking for pages using distances in a web-link graph” US Patent
- The POSIX.1-2024 standard
- What is load balancing
- Using nginx as HTTP load balancer
- Modern processors have a backstage cast by Hugo Landau
- DNSFS: Store your files in others’ DNS resolver caches by Ben Cox
- Let’s build a browser engine by Matt Brubeck
- SerenityOS
- Dustin Brett : I Spent 4 YEARS Building an OS in the Browser
- strace
- How printf works internally
- Protection rings
- Why are CPU privilege rings 1 and 2 not used?
- Console Hacking 2016 (33c3) – fail0verflow on running PS4 on Linux
- Branch Education : How are Microchips Made? 🖥️🛠️ CPU Manufacturing Process Steps
- LTT : I Can Die Now. - Intel Fab Tour!
- ICs in Garage: First IC and Second IC by Sam Zeloof