How the Internet works

Jon Gjengset (23 min. read)

Posted on Oct 12, 2015

The Internet has become a critical part of almost every aspect of our society: it provides information, communication, and entertainment to billions of people every day, and enables coordination and collaboration between people and businesses across the globe. Unfortunately, this crucial piece of infrastructure is poorly understood by a large majority of the population; the myriad technologies (both hard and soft) that let your laptop connect to your bank or to your mom’s Skype window could just as well be magic as far as the average person in the street is concerned.

For many other vital technologies in today’s world, this is not really a problem. Most people don’t know how to fix their car or how to build a house, and few people are complaining about that. The Internet is different. It is, despite what it may seem, still in its infancy, and there is a lot of debate about how it should work, and more importantly, how it should be regulated. This is an important conversation to have, but in order to make sensible decisions, it is necessary that all participants are sufficiently informed about the topics being discussed. So far, we have seen many examples where this is not the case, and this could be directly harmful to the future of the Internet as we know it.

This post aims to help inform and guide those who wish to (or should) expand their understanding of this extraordinary communication channel, in a way that is digestible without prior knowledge of Computer Science. Since the Internet is a fairly complex beast, we will touch on many topics over a short amount of time, so the depth will naturally be somewhat limited. I will provide links to further reading materials for each topic so the interested reader can continue the journey on their own.

With a system as deep and wide as the Internet, it is hard to decide exactly where to start. Most of the pieces are interlinked, and often in very subtle ways. In the interest of making the topic easier to comprehend, we will start with a fairly high-level overview, and then dive into each individual component as we go along. Our journey starts with a user called Alice. Alice owns a laptop, and wants to send e-mail to a person called Bob. Bob uses GMail, and so he will read the e-mail from Alice by going to gmail.com with his Web Browser. By the end of this article, my hope is that you will understand all the steps taken by the various parts of the Internet “stack” (a stack is simply a combination of technologies) from the moment Alice presses “Send” in her e-mail client until Bob can read her e-mail.

When Alice presses “Send”, her laptop’s first task is to figure out how to reach Bob. To make our example more concrete, let us say that his e-mail address is bob@gmail.com. This address tells Alice’s computer that “bob” is Bob’s username, and that his account is managed by the server gmail.com. A server is simply a machine somewhere on the Internet that provides some service (e-mail in this case) to remote users. Given this information, Alice’s e-mail program now needs to somehow start talking to the gmail.com server to give it Alice’s e-mail to Bob.

The first step in this process is for Alice’s machine to find gmail.com’s Internet Protocol address, or IP address for short. To give a real-world analogy, imagine that you are sending a package to Google. You might say that you’re sending it to “Google’s Headquarters”, but on the package itself, you have to put a so-called routable address like “1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA”. The former is a handy shortcut, like gmail.com, but in order to actually get something there, you need to give the full address so that whoever is carrying your package can find the right destination.

Addresses (IPs) on the Internet are sequences of numbers, either of the form 192.168.1.102 (IP version 4), or of the form 2001:418:1425:28b::255e (IP version 6). The former is slowly being dropped in favor of the latter, as the latter allows many more unique addresses (an IPv4 address consists of only four numbers, each from 0 to 255, giving approximately 4 billion addresses in all, whereas IPv6 has many more combinations). We will get back to how these numbers are used to route Alice’s e-mail, but for now, let us concentrate on the problem of how Alice even learns the address of gmail.com.
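To make the difference between the two address formats concrete, here is a small sketch using Python's standard ipaddress module. It only inspects the two example addresses from the paragraph above; the helper name describe is made up for this illustration.

```python
import ipaddress

def describe(address):
    """Return (IP version, number of bits in the address) for a textual address."""
    ip = ipaddress.ip_address(address)
    return ip.version, ip.max_prefixlen

# The two example addresses from the text.
v4 = describe("192.168.1.102")
v6 = describe("2001:418:1425:28b::255e")

# Total number of distinct addresses each version can express.
ipv4_total = 2 ** 32    # four numbers of 8 bits each, roughly 4 billion
ipv6_total = 2 ** 128   # 128 bits, vastly more
```

The IPv4 total works out to 4,294,967,296, which is where the "approximately 4 billion" figure comes from.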

gmail.com is known as a domain name on the Internet. You have likely seen many of these, as they are what you use to access most websites (i.e. the part that goes between http:// and the first / in your Internet browser). The translation from a domain name to an address is done using what is known as the Domain Name System, or DNS. To figure out the address associated with a particular domain name, your machine will contact a DNS server, and ask it “what is the address of gmail.com?”, and it will in return tell you the address (if it knows it). At this point, the attentive reader might realize that there’s a potential for an infinite loop here: how would you know the address of the DNS servers? We’ll get back to this when we talk about DHCP later, but for now, imagine that every machine already knows the address of at least one DNS server, so there is no need to look one up.
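As a rough sketch of the lookup described above, the following Python snippet first pulls the domain name out of a URL and then asks the operating system's resolver (which in turn talks to the configured DNS server) for the matching addresses. The function names domain_of and resolve are invented for this example, and resolve needs a working network connection to return anything useful.

```python
import socket
from urllib.parse import urlsplit

def domain_of(url):
    """Return the domain name part of a URL (between '//' and the first '/')."""
    return urlsplit(url).hostname

def resolve(domain):
    """Ask the system's configured DNS resolver for the addresses of a domain."""
    results = socket.getaddrinfo(domain, None, proto=socket.IPPROTO_TCP)
    # Each result is (family, type, proto, canonname, sockaddr); the textual
    # address is the first element of sockaddr for both IPv4 and IPv6.
    return sorted({sockaddr[0] for *_, sockaddr in results})
```

Calling resolve(domain_of("http://gmail.com/login")) on a connected machine should return a list of IPv4 and/or IPv6 addresses; which ones you see depends on where and when you ask.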

DNS is a large topic in and of itself, and many of the discussions regarding Internet censorship and access control are based heavily on DNS. Unfortunately, many of those who discuss these topics do not understand the fundamentals of DNS, which complicates the discussion significantly. In particular, many of the misunderstandings stem from the belief that refusing to look up a particular domain name will make it inaccessible on the Internet. We won’t get further into DNS in this post, but if this is relevant to you, you should go read this thorough article on webhostinggeeks.com, and page 3 and page 4 of HowStuffWorks’ article on DNS.

So, now Alice’s computer knows the address of the server responsible for handling Bob’s mail. Now what? Well, the first thing to realize is that Alice’s machine also has an IP address. Every machine connected to the Internet is accessible in some way through one (or more) IP addresses. When you connect to a network, your machine will (usually) run a protocol known as the Dynamic Host Configuration Protocol, or DHCP.

DHCP dictates that when a machine connects to a network (and doesn’t yet have an IP address), it should send out some specially formatted transmissions saying that it would like to be given an address. If there is a DHCP server on the same network, it will hear this transmission, pick an unused IP address, and transmit it back. The DHCP server is usually the modem you got from your Internet Service Provider (ISP), or the wireless router (the box with antennas) that you have standing somewhere in your home or office. When your machine hears the response, it will adopt the contained IP address. Along with the IP address, the DHCP server will usually also include the address of another box on the network that can talk to the Internet (usually its own), and the IP addresses of some DNS servers.
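The exchange above can be imitated with a toy model. This is not the real DHCP wire protocol, just a sketch of the bookkeeping: a server that hands out unused addresses together with a gateway and DNS server. The class name, the 192.168.1.0/24 network, and the client names are all assumptions made for illustration.

```python
import ipaddress

class ToyDhcpServer:
    """Hands out unused addresses from a pool; not the real DHCP protocol."""

    def __init__(self, network, dns_servers):
        hosts = list(ipaddress.ip_network(network).hosts())
        self.gateway = hosts[0]          # assume the server is also the gateway
        self.free = hosts[1:]            # addresses not yet handed out
        self.leases = {}                 # client id -> assigned address
        self.dns_servers = dns_servers

    def request(self, client_id):
        """Answer a client's request for an address."""
        if client_id not in self.leases:
            # Pick an unused address and remember who got it.
            self.leases[client_id] = self.free.pop(0)
        return {
            "ip": str(self.leases[client_id]),
            "gateway": str(self.gateway),
            "dns": list(self.dns_servers),
        }

server = ToyDhcpServer("192.168.1.0/24", ["192.168.1.1"])
alice = server.request("alice-laptop")
other = server.request("another-device")
```

Each client ends up with its own address, while both are told about the same gateway and DNS server.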

The Internet-connected box on the network is referred to as a gateway, and any time your machine wants to talk to a remote machine, it sends its transmission through the gateway. You can think of it as your local post office. Whatever machine your machine wants to talk to will have a similar setup, with its own gateway, or local post office. For now, let us assume that the gateways can talk directly to each other, as though they had a single cable connecting them together physically.

Furthermore, the Internet is what is known as a packet-switched network. This means that messages are sent in packets of limited size, and a message that is larger than that maximum size will be split into multiple packets and will have to be re-assembled at the other end. Each of these smaller packets is routed individually through the Internet on its way to its final destination (gmail.com in our example).
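Splitting and re-assembly can be sketched in a few lines of Python. The maximum packet size of 8 bytes is an arbitrary value chosen to keep the example small (real packets are far larger), and the function names are invented for this illustration.

```python
def split_into_packets(message, max_size):
    """Split a message into (sequence number, chunk) packets of limited size."""
    return [
        (seq, message[start:start + max_size])
        for seq, start in enumerate(range(0, len(message), max_size))
    ]

def reassemble(packets):
    """Rebuild the message, whatever order the packets arrived in."""
    return b"".join(chunk for _, chunk in sorted(packets))

message = b"Hi Bob, how are you?"
packets = split_into_packets(message, max_size=8)
```

The sequence number attached to each chunk is what lets the receiving end put the pieces back in the right order.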

To send a message to gmail.com, Alice’s machine first needs to establish a connection with the machine at gmail.com’s address. This is done using the Transmission Control Protocol, or TCP. TCP is what is known as a reliable, in-order protocol; in other words, what goes in at one end will come out the same way at the other end. A protocol is needed (as opposed to just stuffing in all your packets) because the Internet can be a dangerous and unpredictable place; if you naively send packets from one side to another, only some will make it through intact, and those that do will often be shuffled along the way and arrive out of order. TCP has a number of mechanisms in place to piece the original message back together, including asking for missing pieces and re-ordering them correctly. There exists another protocol, the User Datagram Protocol, or UDP, that is commonly used for games or video conferencing where speed is more important than reliability. UDP simply transmits the packets one at a time, with no reassembly or retransmission at the receiving end. We will not discuss UDP further in this post.
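The idea of asking for missing pieces and re-ordering them can be shown with a toy simulation. This is not how TCP is actually implemented (real TCP relies on acknowledgements, timers, and much more); it is only a sketch of the principle. The lossy network here deterministically drops every third packet and shuffles the rest, and all names are invented for the example.

```python
import random

def unreliable_network(packets, drop_every=3, seed=1):
    """Simulate the network: drop some packets and shuffle the rest."""
    survivors = [p for i, p in enumerate(packets) if (i + 1) % drop_every != 0]
    random.Random(seed).shuffle(survivors)
    return survivors

def receive_reliably(sent, total):
    """Toy receiver: keep asking for missing sequence numbers, then order the data."""
    received = dict(unreliable_network(sent))
    retransmissions = 0
    while len(received) < total:
        missing = [seq for seq in range(total) if seq not in received]
        # Ask the sender again for exactly the missing pieces.
        resent = [(seq, chunk) for seq, chunk in sent if seq in missing]
        received.update(unreliable_network(resent))
        retransmissions += len(resent)
    # Deliver the chunks in sequence order, regardless of arrival order.
    return b"".join(received[seq] for seq in range(total)), retransmissions

sent = list(enumerate([b"Hi ", b"Bob, ", b"how ", b"are ", b"you?"]))
message, retransmissions = receive_reliably(sent, total=len(sent))
```

Even though one packet is lost and the others arrive shuffled, the receiver ends up with the original message after a single retransmission.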

With this connection established, Alice’s machine can now start sending the message to gmail.com. However, much like with real mail, she needs to put it in an envelope that gives the recipient’s name, the sender’s address, the subject line, etc. (the destination address is automatically added by the IP “layer”). For this, we have the Internet Message Format, or IMF. It dictates that an e-mail should be preceded by a number of key/value pairs of the form Key: Value, and defines some standard ones like From:, Subject:, and To:. Alice’s e-mail message will therefore look something like:

To: bob@gmail.com
From: alice@fastmail.com
Subject: Hello there

Hi Bob, how are you?
- Alice
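For the curious, Python's standard library can build such a message. The sketch below recreates Alice's e-mail with email.message.EmailMessage; note that the library adds a few extra headers of its own (describing the text encoding) that are not shown in the example above.

```python
from email.message import EmailMessage

# Build the envelope: each assignment becomes a "Key: Value" header line.
msg = EmailMessage()
msg["To"] = "bob@gmail.com"
msg["From"] = "alice@fastmail.com"
msg["Subject"] = "Hello there"

# The body follows the headers, separated from them by a blank line.
msg.set_content("Hi Bob, how are you?\n- Alice")

# The full text of the message as it would be handed to the mail server.
wire_text = msg.as_string()
```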

Okay, surely we are ready to actually send the message to gmail.com now? Well, turns out there’s one more thing we have to cover. The Internet doesn’t know about such sophisticated things as letters (or “characters” in more technical terminology), it only knows about bits (0s and 1s), and bytes (a sequence of eight 0s and 1s). Thus, we somehow have to convert our text message above into bytes in such a way that gmail.com can reassemble the text on the other end. A number of standards exist for this purpose, but we will stick to the basics and use the widely adopted American Standard Code for Information Interchange, or ASCII. ASCII essentially defines a table that can be used to look up the numerical value of a given letter or symbol. This numerical value is in the range 0-127, which means that it fits in a single byte.

Since ASCII can only represent 128 letters and symbols, its usefulness is restricted mostly to English-language correspondence. Some variations have been proposed that use 256 such mappings (so they also fit in a single byte), and include more European letters and symbols such as æøå. These are known as the ISO 8859 family of standards. More modern systems have moved to the Unicode standard, and specifically the UTF-8 encoding scheme, which can represent practically every letter and symbol in use on Earth by allowing a character to take up more than a single byte when needed.
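The three encodings mentioned above can be compared directly in Python. The sketch below encodes a short piece of text with ASCII, and then the letter æ with ISO 8859-1 and UTF-8; the variable names are just for illustration.

```python
# Plain English text: every character maps to one number in the range 0-127.
text = "Hi Bob"
ascii_bytes = text.encode("ascii")
ascii_values = list(ascii_bytes)

# A letter outside ASCII: one byte in ISO 8859-1, two bytes in UTF-8.
letter = "æ"
latin1_bytes = letter.encode("iso-8859-1")
utf8_bytes = letter.encode("utf-8")

# ASCII itself has no entry for this letter, so encoding fails.
try:
    letter.encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False
```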

Finally, we can now send the e-mail to gmail.com. We take Alice’s encoded e-mail, and we ship it over our connection to gmail.com. Technically, there is another small protocol being used that dictates how to talk to a mail server, called the Simple Mail Transfer Protocol, or SMTP, but it is of little relevance to us right now.

At this point, gmail.com has received Alice’s e-mail, and it will go off and do whatever it needs to do to scan it for viruses, check that it is not spam, and all sorts of other things, before finally storing it in Bob’s mailbox. gmail.com will probably also send Bob a notification stating that he has received a new e-mail, so Bob will be inclined to go read it. While there are many ways he could do so (the interested reader should look at What’s the Difference Between POP3, IMAP, and Exchange?, and the very interested reader can look at IMAP and POP), we will assume that Bob reads his e-mail through a Web Browser like Chrome, Safari, Firefox, or Internet Explorer by going to http://gmail.com and logging in.

When Bob types this into his browser’s address bar and presses Enter, a lot of things happen in very quick succession. First, a DNS lookup is performed for the domain name gmail.com. Second, a connection is established to gmail.com using TCP. Third, his browser sends a request over that connection using the Hypertext Transfer Protocol, or HTTP. “But wait!”, I hear you cry, “won’t that connection reach the mail server that is running at gmail.com?”.

To address this question, we need to talk a bit more about TCP. In addition to providing reliable connections, TCP also provides what is known as application multiplexing. That is, multiple servers can run on a single machine, and remote clients can choose which one they wish to connect to by specifying a port. Many of these port numbers are well-known; for example, the port for sending mail is port number 25, and the port for accessing web sites is port 80. Some of you may have seen this in the address bar of your browser: when the domain name is followed by a colon, and then a number (http://gmail.com:81/...), this tells the browser to use that port instead of the default port (which is 80 for HTTP).
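How a browser might decide which port to connect to can be sketched with Python's urllib.parse. The function name port_for and the DEFAULT_PORTS table are made up for this example; only the HTTP default of 80 from the text is included.

```python
from urllib.parse import urlsplit

# Default port per URL scheme (only HTTP, as discussed in the text).
DEFAULT_PORTS = {"http": 80}

def port_for(url):
    """Return the explicit port in the URL, or the scheme's default port."""
    parts = urlsplit(url)
    return parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
```

With this, port_for("http://gmail.com:81/login") gives 81, while port_for("http://gmail.com/login") falls back to 80.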

Bob’s browser has now established a connection to the web server running on gmail.com, and is ready to request the login page. For this, it uses HTTP, which looks somewhat similar to the IMF format we used for Alice’s e-mail. It consists of three parts: an “action” line (e.g. GET /login), several lines of key/value pairs (e.g. User-Agent: Mozilla/Firefox, Host: gmail.com), and then any data that Bob wants to send to the web server (e.g. his login information).
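The three parts of an HTTP request can be put together, and taken apart again, with a few lines of Python. This is a simplified sketch: the helper names are invented, and a real HTTP implementation handles many details (such as repeated headers and body lengths) that are ignored here.

```python
def build_request(method, path, headers, body=""):
    """Assemble the action line, the key/value lines, and the data."""
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{key}: {value}" for key, value in headers.items()]
    # A blank line separates the headers from the data.
    return "\r\n".join(lines) + "\r\n\r\n" + body

def parse_request(raw):
    """Split a request back into method, path, headers, and data."""
    head, _, body = raw.partition("\r\n\r\n")
    action, *header_lines = head.split("\r\n")
    method, path, _version = action.split(" ")
    headers = dict(line.split(": ", 1) for line in header_lines)
    return method, path, headers, body

request = build_request("GET", "/login", {"Host": "gmail.com"})
```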

In response to Bob’s first request for http://gmail.com/login:

GET /login HTTP/1.1
Host: gmail.com

the gmail.com web server will reply with an HTML document. HTML (or HyperText Markup Language) is akin to a programming language, and lets developers of web pages construct pages that are more than simply plain text. HTML lets the developer include images, links, and videos, as well as style the contents of the page by changing colors, font sizes, etc. Modern HTML sites also use more sophisticated tools for styling and altering their pages on the fly. For instance, you may have seen that on many websites, clicking a link or doing a search will not cause the page to reload, but instead the results will pleasantly (and quickly) animate into the page you are currently on. This is done using the web programming language JavaScript.

Given this HTML document (along with any styles or scripts), the browser will render the login page to Bob, showing him a login form. When Bob “submits” this form, another HTTP request will be sent:

POST /login HTTP/1.1
Host: gmail.com

username=bob@gmail.com&password=bob_is_super_cool!

Notice that this request is no longer a GET request, but POST, which tells the web server that Bob is sending it some data that it should consider. In this case, the data in the request is Bob’s username and password, which the web server will use to authenticate Bob. If Bob’s credentials were valid, the server will respond with another HTML document, this time showing Bob his inbox. It will also include what is known as a cookie. A cookie is an identifier that Bob’s web browser will pass with every subsequent HTTP request to gmail.com, to prove that he is still Bob. For example, if Bob later goes to http://gmail.com/inbox, his browser might send something like:

GET /inbox HTTP/1.1
Host: gmail.com
Cookie: token=2f0030c535193fc164e4e2b5689d9aba6533e6ddea82

The web server will recognize this token, and allow Bob access to his inbox without requiring him to log in again. The astute reader may notice that this cookie allows the web site to “track” Bob as he navigates through the site. This might not be such a problem in this case, since Bob really does want to be associated with gmail.com, but for third-party advertisement sites, this becomes a problem. Since Bob’s browser will always include the cookie it has for a particular domain name, if two web sites use the same advertisement provider (say, ads.com), then ads.com can see that Bob has visited both of these sites. The same goes for Facebook’s Like buttons that are distributed throughout the web today; they are all loaded from facebook.com, and so Facebook gets to see all the sites you visit that have a Facebook Like button on them. Many browsers now try to block this information from leaking, and this was the motivation for the infamous EU “Cookie Law”, but it can be hard to fix this problem in its entirety without breaking many web pages that rely on this behavior.
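The tracking effect described above can be reproduced with a toy model of a browser and some servers. Everything here is a simplification: news.example and shop.example are hypothetical sites, the class names are invented, and the page argument stands in for the way a real browser tells an embedded provider which page it was loaded from (the Referer header).

```python
class ToyServer:
    """Issues a cookie on first contact and logs which page each cookie was seen on."""

    def __init__(self, name):
        self.name = name
        self.seen = {}    # cookie -> list of pages it was sent from
        self.issued = 0

    def handle(self, cookie, page):
        if cookie is None:
            self.issued += 1
            cookie = f"{self.name}-visitor-{self.issued}"
        self.seen.setdefault(cookie, []).append(page)
        return cookie

class ToyBrowser:
    """Keeps one cookie per domain and sends it with every request to that domain."""

    def __init__(self):
        self.cookies = {}  # domain -> cookie value

    def visit(self, site, embedded_domains, servers):
        # Loading a page also loads whatever third-party content it embeds.
        for domain in [site, *embedded_domains]:
            cookie = self.cookies.get(domain)
            self.cookies[domain] = servers[domain].handle(cookie, page=site)

servers = {name: ToyServer(name) for name in ["news.example", "shop.example", "ads.com"]}
bob = ToyBrowser()
bob.visit("news.example", ["ads.com"], servers)
bob.visit("shop.example", ["ads.com"], servers)
```

After the two visits, ads.com has a single visitor cookie on record with both sites listed next to it, even though neither site ever talked to the other.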

When Bob’s web browser shows him his inbox, he can read Alice’s e-mail simply by sending another HTTP request for it. The web server will, entirely unbeknownst to Bob, unwrap the IMF envelope, and format the e-mail nicely using HTML, before sending it back to his web browser. Finally, Bob can read Alice’s e-mail. Yay!

So, now we’re done, right? Well, not quite. There are still some loose ends to tie up. First, you may have questioned how a single machine can be responsible for all of gmail.com’s e-mail, and you would be entirely correct to do so. In practice, DNS will often return multiple addresses when a particular domain name is looked up, and the web browser simply chooses one of them at random. There are also other computers, known as load-balancers, that will take all the inbound traffic for a particular address, and distribute it across the many servers that they are responsible for. gmail.com uses both of these to ensure that its service almost never becomes unavailable to users. A single server can also have multiple domain names associated with it (e.g. the IP address 74.125.226.86 might handle both gmail.com and googlemail.com), and this is the reason many protocols (like HTTP) include a Host field identifying which domain name the client is expecting to talk to.
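Two of the ideas above, load balancing and the Host field, are easy to sketch. The balancer below uses round-robin, which is just one simple way to spread traffic and not necessarily what gmail.com does; the backend names and the sites table are hypothetical.

```python
import itertools

class RoundRobinBalancer:
    """Hands each incoming connection to the next server in turn."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)

def site_for(headers, sites):
    """Choose which site should answer, based on the Host field of the request."""
    return sites[headers["Host"]]

balancer = RoundRobinBalancer(["server-a", "server-b", "server-c"])
picks = [balancer.pick() for _ in range(4)]

# One machine answering for two domain names.
sites = {"gmail.com": "GMail web app", "googlemail.com": "GMail web app (alias)"}
chosen = site_for({"Host": "googlemail.com"}, sites)
```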

Another observation you may have made is that addresses on the web are increasingly prefixed with https://, and not just http://. The s stands for “Secure”, and implies that a protocol known as Transport Layer Security, or TLS, is in use. TLS attempts to authenticate one or both parties of a network connection (i.e. so Bob can know that he is talking to the “real” gmail.com), as well as hide the information being exchanged from snooping. TLS is mostly transparent to users, as the browser runs it automatically just below HTTP, where it silently encrypts all the text before it is sent on the network between the server and the client (and vice-versa). In fact, TLS is so transparent and versatile that it can also be used for the mail exchange protocol we discussed when Alice was sending her e-mail to gmail.com. A discussion about TLS is outside the scope of this post, but suffice it to say that correctly using TLS is quite hard, and that it relies on trusting some fairly large security organizations (known as certificate authorities).
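To give a feel for how TLS sits "just below" the application protocol, here is a minimal sketch using Python's standard ssl module. The default context it creates insists on verifying the server's identity. The function open_tls_connection is invented for this example and is defined but not called here, since calling it requires network access; port 443, the usual port for HTTPS, is not mentioned in the text above and is included as an assumption of the sketch.

```python
import socket
import ssl

# A context with sensible defaults: the server must present a certificate
# that is trusted and that matches the name we asked for.
context = ssl.create_default_context()

def open_tls_connection(host, port=443):
    """Open a TCP connection and wrap it in TLS before any data is sent."""
    raw = socket.create_connection((host, port))
    return context.wrap_socket(raw, server_hostname=host)
```

Anything written to the returned socket is encrypted before it leaves the machine, which is why the protocol running on top does not need to change.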

The final piece of the puzzle is what really happens to get packets from one machine’s gateway to another’s. Clearly, every pair of gateways cannot be connected by a single cable as we assumed above, so something more complicated must be going on. In fact, the core of the Internet is so complex that it cannot be readily explained in a post such as this. We will instead give a simplified summary. If you want to dig deeper into this, you should start by learning about Tier 1 networks, the basics of Internet routing and peering, and, if you dare, the Border Gateway Protocol.

The core of the Internet has many layers, or tiers. Smaller Internet Service Providers are usually in the lowest layer, “Tier 3”. Tier 3 providers only connect together a small number of users, and cannot provide connectivity to any machine that is not on their network. To connect their users with the Internet, Tier 3 providers will establish a contract with a Tier 2 provider to lease access to the Internet from them. When a Tier 3 provider receives traffic that is not destined for any of its users, it will simply forward it to its Tier 2 provider.

Tier 2 providers are often larger, and will have peering agreements with a number of other Tier 2 providers. For example, many Tier 2 providers may want to peer with Netflix’s network so they can provide better connectivity for their users. Peering simply means that some money changes hands, and a dedicated network link is established between two Tier 2 providers. However, a Tier 2 provider also does not have direct connectivity to all the machines on the Internet; Tier 2 providers are usually regional, and have only users in, say, the continental United States. Just like Tier 3 providers, all Tier 2 traffic that is not destined for their own users needs to go further up the chain. As you might have guessed, the traffic goes to a Tier 1 provider that the Tier 2 provider is paying for access to the rest of the Internet.

Tier 1 providers are international entities that provide connectivity between countries or continents, and that are connected to a very large number of users (or Tier 2 providers). However, even Tier 1 providers do not have connections to every machine wanting to be accessible on the Internet. Instead, the Tier 1 providers all peer with one another, often without any money changing hands. This works simply because all the Tier 1 providers bring a large number of both users and traffic to the table, and they all want to provide global connectivity, so it is in everyone’s best interest that they cooperate. That said, there have been instances of so-called “peering wars”, where Tier 1 providers disagree, and end up severing the connection between different parts of the Internet. When this happens, huge swaths of users lose connectivity to certain parts of the Internet.

So, how does this tie together with Alice’s e-mail for Bob? When the packets destined for gmail.com reach Alice’s gateway, it will forward them to Alice’s ISP. Her ISP will first check if gmail.com is also a customer of the same ISP, and if so, it will send the packets directly to gmail.com’s gateway (like we assumed initially). If gmail.com is not a subscriber of the same ISP, the packets are forwarded to the ISP’s Tier 2 provider. The Tier 2 provider will check all its customers’ and peers’ networks to see if gmail.com is to be found there, and if so, forward the packets there. If gmail.com is not on any of those networks, the packets are forwarded further up the chain to a Tier 1 provider. The Tier 1 provider will check which of its Tier 1 peers is ultimately responsible for gmail.com, and forward the packets to it. From there, the packets will descend the pyramid until they reach gmail.com’s ISP, which will finally forward the packets to gmail.com’s gateway.
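The climb up and down the pyramid can be simulated with a small tree of providers. This is a heavily simplified sketch of the summary above, not how real Internet routing (for example the Border Gateway Protocol) works. All the names in the topology are made up, and every node, whether gateway or provider, is modeled by the same Node class.

```python
class Node:
    """A gateway or provider, with an upstream provider, customers, and peers."""

    def __init__(self, name, upstream=None):
        self.name = name
        self.upstream = upstream      # the provider this node pays for access
        self.customers = []           # nodes that pay this one for access
        self.peers = []               # providers reached over peering links
        if upstream is not None:
            upstream.customers.append(self)

    def serves(self, destination):
        """True if the destination is this node or somewhere below it."""
        return self.name == destination or any(
            c.serves(destination) for c in self.customers
        )

def route(source, destination):
    """Climb until a provider (or one of its peers) serves the destination, then descend."""
    path, node = [source.name], source
    while not node.serves(destination):
        peer = next((p for p in node.peers if p.serves(destination)), None)
        node = peer if peer is not None else node.upstream
        if node is None:
            raise LookupError(f"no route to {destination}")
        path.append(node.name)
    while node.name != destination:
        node = next(c for c in node.customers if c.serves(destination))
        path.append(node.name)
    return path

# A hypothetical topology: two Tier 1 providers that peer with each other.
tier1_a, tier1_b = Node("tier1-a"), Node("tier1-b")
tier1_a.peers.append(tier1_b)
tier1_b.peers.append(tier1_a)
tier2_a, tier2_b = Node("tier2-a", tier1_a), Node("tier2-b", tier1_b)
alice_isp, gmail_isp = Node("alice-isp", tier2_a), Node("gmail-isp", tier2_b)
alice = Node("alice-gateway", alice_isp)
gmail = Node("gmail.com-gateway", gmail_isp)

path = route(alice, "gmail.com-gateway")
```

In this made-up topology the packets travel from Alice's gateway up through her ISP, its Tier 2 provider, and a Tier 1 provider, cross the peering link to the other Tier 1 provider, and then descend to gmail.com's gateway.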

We have now covered most of the technologies that enable communication on the Internet. While we have moved quickly, and covered things at a high level, I hope this post has given you some insight into just how large and complex the Internet is, and some of the problems that those who work to keep it healthy face. Specifically, I hope this information will help inform the ongoing debates about the future of the Internet, so that reasonable decisions can be made given the underlying technologies.

To see all the things wrong with this post, see the Hacker News discussion here.