Google Cloud Networking 101: The Comprehensive TLDR

Luca Cavallin

I've been going back and forth on whether to attempt the Google Cloud Professional Network Engineer certification. I haven't committed yet, but I've been studying for it anyway, partly because the exam syllabus is a good forcing function for filling gaps, and partly because I work with GCP networking enough that having a solid mental map of the whole thing is useful regardless of whether I ever sit the exam.

These are my notes. There's a lot of material out there on GCP networking, but I couldn't find anything that gives you a comprehensive, clear intro that covers all the important pieces in one place without either being a shallow overview or a 700-page book. So I wrote it. I'm publishing it because someone else might find it useful.

One thing worth saying upfront: GCP networking has a different mental model than what most people are used to. The three things that will trip you up early if you don't internalize them first: VPCs are global (not regional like in AWS), the whole stack is fully software-defined with no physical routers or switches anywhere, and inter-region traffic within a VPC stays on Google's backbone and never hits the public internet. That last one is pretty cool. With that, let's get into it.

The Core Building Blocks: VPC, Subnets, IPs, Routing, and Firewalls

Everything in GCP networking is built on top of a handful of primitives. Get comfortable here before moving on, because the rest of the platform is just increasingly complex combinations of these concepts.

VPCs and Subnets

A VPC is a global, private network. When you create one, you choose between auto mode - where GCP automatically creates a subnet in every region - and custom mode, where you control every subnet explicitly. Use custom mode for anything production. Auto mode creates subnets you didn't ask for with default ranges that frequently conflict with on-premises networks or other VPCs you'll want to connect to later. The time you save getting started with auto mode you'll pay back with interest when you try to set up peering and realize your ranges overlap.

Subnets are regional resources and have a primary CIDR range, plus optional secondary ranges. Those secondary ranges are how GKE gets its pod and service IP space (we'll discuss this in more detail later). One important thing to know upfront: you can expand a subnet range after creation, but you can't shrink it. This isn't a problem until it suddenly is, so choose your ranges carefully from the start. Fixing CIDR overlap after the fact is one of the most painful things you can do to yourself in GCP, because it blocks VPC peering and tends to surface at exactly the wrong moment.
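Since overlap is this painful, it's worth checking planned ranges programmatically before anything gets created. A minimal sketch using Python's ipaddress module - the CIDRs here are illustrative, not recommendations:

```python
import ipaddress

def overlaps(cidr_a: str, cidr_b: str) -> bool:
    """True if two CIDR ranges share any addresses."""
    a = ipaddress.ip_network(cidr_a)
    b = ipaddress.ip_network(cidr_b)
    return a.overlaps(b)

# A planned GCP subnet vs. a hypothetical on-premises range:
print(overlaps("10.128.0.0/20", "10.128.8.0/24"))   # True - conflict
print(overlaps("10.128.0.0/20", "192.168.0.0/24"))  # False - safe

# Expansion is fine: a /20 grown to a /18 still contains the original.
grown = ipaddress.ip_network("10.128.0.0/18")
original = ipaddress.ip_network("10.128.0.0/20")
print(grown.supernet_of(original))  # True
```

Running a check like this over every allocation in your IPAM source of truth is cheap insurance against the peering failures described above.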

GCP also lets you bring your own IP (BYOIP) if you have portable public address space. That's useful for migrating workloads that have external dependencies on specific IP addresses. For internal IP management across many VPCs and environments, you really do want a proper IPAM tool. A spreadsheet works too, as long as it's actually maintained. The point is to have a single source of truth for allocations before you start building.

One more thing to plan for: GCP enforces quotas at both the project and VPC level - routes, firewall rules, subnets, and forwarding rules all have limits. These limits are high enough that you won't hit them in most setups, but if you're building something that scales horizontally or spans many environments, design with headroom in mind.

Routing

Routes in GCP come in two flavors. Static routes are manually configured: you specify a destination prefix and a next hop, which can be a VM instance, an internal load balancer, a VPN gateway, or an interconnect attachment. Static routes are simple and predictable but require manual maintenance as your topology evolves. Dynamic routing uses BGP via Cloud Router to learn and advertise routes automatically, and is what you want for anything involving hybrid connectivity to on-premises or other clouds.

One important mode to understand: global vs. regional dynamic routing. In regional mode, Cloud Router only propagates BGP-learned routes to VMs in its own region. In global mode, those routes propagate across all regions in the VPC. For most hybrid setups you want global mode, otherwise your on-premises prefixes are only reachable from VMs in the same region as your Cloud Router, which is almost never what you actually want.

GCP also gives you routing policies using tags and priorities. You assign network tags to VMs and scope static routes to those tags, which means different VMs in the same subnet can follow different routing policies. Combined with route priorities (lower number wins), this gives you fine-grained traffic control without needing separate subnets for every traffic class.
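To make the tag-and-priority behavior concrete, here's a small simulation of how route selection plays out. This is a simplification of GCP's actual logic (the most specific prefix wins first, then the lowest priority number among equally specific matches), and the route names and tags are made up:

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class Route:
    dest: str       # destination CIDR
    priority: int   # lower number wins among equally specific matches
    next_hop: str
    tags: set       # empty set = applies to all VMs in the VPC

def select_route(routes, dst_ip, vm_tags):
    """Pick the route a VM's traffic follows: filter to routes that
    match the destination and the VM's tags, then prefer the most
    specific prefix, breaking ties by lowest priority number."""
    ip = ipaddress.ip_address(dst_ip)
    candidates = [
        r for r in routes
        if ip in ipaddress.ip_network(r.dest)
        and (not r.tags or r.tags & vm_tags)
    ]
    if not candidates:
        return None
    return min(candidates,
               key=lambda r: (-ipaddress.ip_network(r.dest).prefixlen,
                              r.priority))

routes = [
    Route("0.0.0.0/0", 1000, "default-internet-gateway", set()),
    Route("0.0.0.0/0", 900, "inspection-ilb", {"inspected"}),
]
# A tagged VM matches both default routes; priority 900 wins:
print(select_route(routes, "8.8.8.8", {"inspected"}).next_hop)  # inspection-ilb
# An untagged VM only matches the untagged route:
print(select_route(routes, "8.8.8.8", set()).next_hop)  # default-internet-gateway
```

This is exactly the mechanism behind "route only tagged VMs through the firewall appliance" setups: same subnet, different egress paths.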

A pattern worth knowing: the Internal Load Balancer as a next hop. Instead of pointing a custom static route at a specific VM, you point it at an ILB. Traffic for that destination gets distributed across a pool of backends, with the ILB handling health checking and failover automatically. This is the standard way to run HA firewall appliances in GCP and we'll come back to it when we talk about packet inspection.

When you use VPC peering, you can also control which routes are exchanged between networks. Custom route import and export over VPC peering lets you be selective about what each side advertises rather than blindly sharing everything: useful when you want controlled connectivity between VPCs without full route table merging.

Firewalls

GCP firewalls are stateful, distributed, and attached to the VPC - not to individual VMs. Each rule specifies a target, a direction, a protocol and port range, and a priority. Understanding how these interact is important for getting your security posture right.

Targeting is how the rule decides which VMs it applies to. You can target based on network tags (simple strings you assign to VMs) or service accounts. For production workloads, prefer service account targeting - tags can be set by anyone with VM admin access, which makes them easier to misuse or misconfigure. Service accounts can't be arbitrarily assigned, which makes targeting more reliable.

Ingress and egress rules are separate. The defaults are: allow all egress, deny all ingress. Every rule has a priority, and when multiple rules match the same traffic, the one with the lowest priority number wins. This is straightforward on paper, but easy to get confused by when you're debugging unexpected allow or deny behavior. Always check what other rules might be matching before assuming a specific rule is the problem.
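A toy model of the priority logic, matching on port only for brevity. Real rules also match protocol, direction, source, and target, and in actual GCP a deny beats an allow at the same priority - this sketch assumes distinct priorities:

```python
from dataclasses import dataclass

@dataclass
class FwRule:
    name: str
    priority: int   # 0-65535; lower number wins
    action: str     # "allow" or "deny"
    port: int

def evaluate(rules, port):
    """First match by lowest priority number decides the verdict.
    No match falls through to the implied ingress deny."""
    matches = [r for r in rules if r.port == port]
    if not matches:
        return "deny"
    return min(matches, key=lambda r: r.priority).action

rules = [
    FwRule("allow-internal-443", 1000, "allow", 443),
    FwRule("deny-all-443", 2000, "deny", 443),
]
print(evaluate(rules, 443))  # allow - priority 1000 beats 2000
print(evaluate(rules, 22))   # deny - nothing matched, implied ingress deny
```

The debugging lesson is in the first call: the deny rule exists and matches, but never fires because a lower-numbered rule shadows it.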

Firewall rule logs can be enabled per rule, giving you a Cloud Logging entry for every matched connection. This is pretty useful for debugging and security auditing but adds cost at high traffic volumes. Enable it selectively, not globally.

Connecting Things Inside GCP

Once you understand the primitives, the next step is how to structure connectivity across multiple projects, teams, or environments within GCP. The answer almost always involves some combination of VPC peering and Shared VPC, and choosing between them is one of the earlier architectural decisions you'll make.

Single vs. Multiple VPCs

The first question is whether you even need multiple VPCs. A single VPC is significantly simpler to manage and reason about. Multiple VPCs make sense when you need hard network boundaries: for example, between prod and non-prod, between business units with different compliance requirements, or between systems that should have no path between them even if IAM were misconfigured. If you don't have a clear isolation requirement driving the decision, start with one VPC and split later only if you need to.

VPC Peering

VPC peering connects two VPCs (same project, different projects, even different organizations) so they can communicate using internal IPs. Traffic stays on Google's network and never hits the public internet. It's a simple concept with one critical constraint: peering is non-transitive.

If VPC A peers with VPC B, and VPC B peers with VPC C, then A and C cannot talk through B. There is no route propagation across the peering link. You need to peer A and C directly if you want them to communicate. This makes hub-and-spoke topologies using only VPC peering very painful at scale, since every spoke needs to peer with every other spoke it needs to talk to, which grows as O(n²). Shared VPC is often the better answer for these topologies.
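The O(n²) growth is easy to quantify - the number of peering links needed for a full mesh of n VPCs is n(n-1)/2:

```python
def full_mesh_peerings(n: int) -> int:
    """Peering links for n VPCs that all need to reach each other.
    Peering is non-transitive, so every pair needs its own link."""
    return n * (n - 1) // 2

for n in (3, 10, 50):
    print(n, "VPCs ->", full_mesh_peerings(n), "peerings")
# 3 VPCs -> 3, 10 VPCs -> 45, 50 VPCs -> 1225
```

At ten spokes you're already managing 45 peerings (and their route exchange settings); at fifty it's over a thousand, which is why Shared VPC wins for hub-and-spoke.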

Shared VPC

Shared VPC centralizes network ownership. A host project owns the VPC (its subnets, routes, firewall rules) and service projects attach to it, deploying their resources into the shared subnets. Network management is centralized; teams own their compute resources in their own projects. This is the right model for most multi-team or multi-environment setups. The cost is more upfront configuration and more involved IAM: the host project needs to grant service projects the compute.networkUser role to allow them to use shared subnets. Miss this and your cluster or VM creation will fail with errors that aren't always immediately helpful.

You can also share subnets using folders, which lets you apply sharing policies at the folder level in your organization hierarchy rather than project by project. Useful for large enterprises where manually managing individual project grants doesn't scale!

Connecting to GCP Managed Services

Not everything you interact with is a VM you control. GCP has a large ecosystem of managed services - Cloud SQL, Pub/Sub, GCS, BigQuery, and more - and they each connect to your VPC differently.

Private Google Access solves a specific problem: VMs with only internal IPs (no external address) need a way to reach Google APIs without going through the public internet. You enable it per subnet, and from that point, VMs in that subnet can reach Google APIs using internal routing. For VMs on-premises connecting over VPN or Interconnect, the setup is slightly different: you route traffic to private.googleapis.com or restricted.googleapis.com via custom routes, so API traffic stays off the public internet end to end.

For managed services like Cloud SQL, the mechanism behind private IP is Private Service Connect or VPC peering into a Google-managed VPC: Google runs the managed service backend in their own VPC and peers it into yours. For SaaS, PaaS, and IaaS services more broadly, the connection model varies by service: some use Private Service Connect, some use Private Service Access, some use direct internet. It's worth checking which model applies to each service before you design your firewall and routing rules around it.

GKE Networking

GKE networking is VPC networking with extra layers on top.

The most important thing to remember about GKE networking is that you need to plan its IP addressing upfront as part of your broader VPC IP plan. A VPC-native cluster consumes IPs from three separate ranges: the node subnet's primary range (VM IPs for nodes), a secondary range for pod IPs (one per pod), and another secondary range for service IPs (one per Kubernetes Service). These ranges need to be large enough for your expected cluster size plus growth headroom, and they cannot overlap with anything else you might want to peer or connect to later. You can expand GKE IP ranges after the fact, but it's operationally awkward, so it's best to get it right the first time.
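A quick way to sanity-check pod range sizing, assuming GKE defaults (each node reserves a /24 slice of the pod secondary range, enough for the default maximum of 110 pods per node):

```python
def max_nodes_for_pod_range(pod_range_prefix: int,
                            per_node_prefix: int = 24) -> int:
    """With GKE defaults, every node carves a /24 out of the pod
    secondary range, so that range caps cluster size regardless of
    how big the node subnet is."""
    return 2 ** (per_node_prefix - pod_range_prefix)

print(max_nodes_for_pod_range(21))  # /21 pod range -> 8 nodes max
print(max_nodes_for_pod_range(16))  # /16 pod range -> 256 nodes max
```

The surprise for most people is how fast the pod range burns: a /21 sounds generous (2048 addresses) but only buys you 8 nodes at the defaults. Reducing max pods per node shrinks the per-node slice and stretches the range further.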

VPC-native clusters with alias IP ranges are the recommended networking mode and have been for some time. Pod IPs come from alias IP ranges attached to the node's network interface, which means they're natively routable within the VPC without any per-pod static routes. The older routes-based networking mode created a static route per node, which hit VPC route quotas at scale and is considered legacy. Avoid it for new clusters.

Running GKE in a Shared VPC means the node subnets and secondary ranges live in the host project, while the cluster itself lives in a service project. The GKE service account in the service project needs specific IAM permissions in the host project to manage firewall rules and use the shared subnets; missing these permissions is the most common cause of confusing cluster creation failures in Shared VPC setups.

On the security side, Kubernetes NetworkPolicy lets you control pod-to-pod traffic within the cluster. GKE implements this via Calico or Cilium depending on your cluster configuration. By default, all pods can communicate with all other pods - no isolation. Enabling network policies is an explicit cluster-level setting, and it's something you should do for any production cluster. Don't assume the VPC firewall covers this; it operates at a different layer.

For cluster access, the key decision is public vs. private nodes and control plane endpoints. Private clusters give nodes only internal IPs with no external addresses. The control plane (API server) can also be made private, accessible only from within the VPC. This is the right posture for production: you don't want your Kubernetes API server reachable from the public internet. When you need external access to the API server (from a CI/CD runner or your laptop), authorized networks let you allowlist specific CIDRs that can reach the control plane endpoint, keeping the attack surface small.

Reaching the Outside World: Load Balancing, NAT, and CDN

Your workloads need to be reachable from the internet, and they need to reach the internet themselves. GCP's toolkit here is extensive, so let's have a look at it.

Load Balancing

GCP's load balancer family is large and the naming isn't always intuitive. The main axes are global vs. regional, external vs. internal, and L7 (HTTP/S) vs. L4 (TCP/UDP). The products aren't interchangeable, so it's important to choose the right one.

The foundation of everything is the backend service, which defines how a load balancer distributes traffic across its backends, including the balancing mode (utilization-based or rate-based), capacity scaling, and session affinity settings. Network Endpoint Groups (NEGs) are how you connect a load balancer to modern backends, for example container-native load balancing in GKE (which gives you pod-level traffic distribution, bypassing kube-proxy entirely), Cloud Run, App Engine, or even endpoints outside GCP. If you're running GKE and care about accurate traffic distribution, container-native LB with NEGs is what you want.

One thing that confuses a lot of people: firewall rules for health checks. GCP load balancers probe your backends from the ranges 130.211.0.0/22 and 35.191.0.0/16. If you forget to allow these in your firewall rules, backends show as unhealthy and traffic doesn't flow. Write this down somewhere! You'll need it when backends start failing health checks and you can't figure out why.
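A quick helper for checking whether a source address belongs to those documented health check ranges - handy when you're staring at firewall or flow logs wondering what keeps probing your backends:

```python
import ipaddress

# GCP load balancer health checks originate from these two ranges;
# your ingress firewall rules must allow them.
HEALTH_CHECK_RANGES = [
    ipaddress.ip_network("130.211.0.0/22"),
    ipaddress.ip_network("35.191.0.0/16"),
]

def is_health_check_probe(src_ip: str) -> bool:
    ip = ipaddress.ip_address(src_ip)
    return any(ip in net for net in HEALTH_CHECK_RANGES)

print(is_health_check_probe("35.191.10.4"))  # True - a GCP prober
print(is_health_check_probe("203.0.113.9"))  # False - someone else
```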

For internet-facing web traffic, the external HTTP(S) Load Balancer is the tool of choice. It's global (anycast), terminates TLS, supports HTTP/2, routes requests via URL maps, and integrates directly with Cloud Armor and Cloud CDN. Backends can be in any region and the LB routes to the closest healthy one. For non-HTTP TCP traffic that still needs global reach or TLS termination, the external TCP and SSL proxy load balancers fill the gap. For high-throughput, latency-sensitive L4 workloads where you handle TLS yourself and want client IP preservation, the Network Load Balancer is regional and passes traffic at the network layer without any proxy overhead.

For traffic that stays inside your VPC, the internal HTTP(S) and TCP proxy load balancers are the equivalents. The internal HTTP(S) LB is particularly versatile because it can also serve as a next hop in custom static routes - which is the standard pattern for HA firewall appliances (traffic routes through the ILB, which distributes across your appliance pool and handles failover automatically). Protocol forwarding is a simpler option when you need to forward raw traffic to a specific VM without the overhead of a full backend service setup. Useful for network appliances that need to receive traffic directly.

For handling traffic spikes, GCP managed instance groups can autoscale based on CPU utilization, LB utilization, or custom Cloud Monitoring metrics. The max utilization and max rate settings on the backend service control when GCP starts distributing traffic to additional backends or regions. Get these tuned for your workload or you'll see uneven distribution under real load.

Cloud Armor

Cloud Armor is GCP's WAF and DDoS protection layer, attached to external HTTP(S) load balancers. You define security policies - named sets of rules evaluated in priority order, each matching on conditions like source IP, geographic origin, or request attributes, and either allowing or denying the request. WAF rules are pre-built for common attack patterns (SQLi, XSS, and more) based on ModSecurity rule sets, with tunable sensitivity levels per rule. Policies attach at the backend service level, so you can apply different policies to different backends behind the same load balancer (a more restrictive policy for your admin API, a more permissive one for your public-facing static assets, for example).

Cloud CDN

Cloud CDN is enabled per backend service on external HTTP(S) LBs. You control the cache mode (whether to respect origin cache headers, cache only static content, or force-cache everything), the cache key composition (which request components determine cache identity - useful for A/B testing or locale-specific content), and cache invalidation (by URL or URL prefix, no per-tag invalidation). Signed URLs let you restrict access to time-limited, cryptographically signed requests (the right approach for protected content delivery). CDN can also pull from custom origins outside GCP, not just from GCP backends.

One thing to be careful about: what you're actually caching. Accidentally caching authenticated responses because a backend sent permissive cache headers is a real mess that's not fun to debug in production.

Cloud NAT

Cloud NAT solves the outbound internet problem for VMs without external IPs - they need to reach external APIs, download packages, call third-party services - without exposing any inbound connectivity. It's not a VM or an appliance, it's a fully distributed software-defined service with no single point of failure, which means you don't need to worry about it as a bottleneck the way you would with a traditional NAT gateway.

You attach it to a Cloud Router in the relevant region and choose which subnets it covers. NAT IPs can be auto-allocated or manually specified (useful when you need stable egress IPs for external allowlists). Port allocation controls how many concurrent connections a VM can make to the same external destination, and running out of NAT ports is a real issue for workloads that open many connections to the same endpoint. Watch the nat_allocation_failed metric. TCP, UDP, and ICMP timeouts are configurable, which matters if you have long-lived connections that keep getting silently dropped. Organization Policy constraints let you enforce Cloud NAT configuration standards at scale, restricting which regions it can be deployed in, enforcing manual IP allocation, and so on.
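The port math is worth doing before you pick your NAT IP count. Assuming the documented defaults (roughly 64512 usable source ports per NAT IP - the range 1024-65535 - and a 64-port minimum reservation per VM):

```python
def vms_per_nat_ip(min_ports_per_vm: int,
                   ports_per_ip: int = 64512) -> int:
    """Each NAT IP exposes ~64512 usable source ports. Cloud NAT
    reserves min_ports_per_vm for every VM it serves, so this is
    the hard cap on VMs behind a single NAT IP."""
    return ports_per_ip // min_ports_per_vm

print(vms_per_nat_ip(64))    # defaults -> 1008 VMs per NAT IP
print(vms_per_nat_ip(1024))  # chatty workloads -> only 63 VMs per IP
```

Bumping the per-VM port minimum for connection-heavy workloads quietly slashes how many VMs each NAT IP can serve, which is usually the real story behind nat_allocation_failed.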

Hybrid Connectivity: On-Premises and Multi-Cloud

Connecting GCP to on-premises infrastructure or other clouds is where GCP networking gets quite complex and operational overhead becomes real. The first thing to be clear on is why you're doing it, because it's not always the best idea.

The common drivers for hybrid networks are: regulatory requirements that keep certain data on-premises, legacy systems that can't migrate, staged migration strategies, or multi-cloud redundancy. Your overall goals - latency targets, bandwidth requirements, cost envelope, failover behavior - should drive technology selection. Don't start with "which product should I use" before you've answered those questions.

The Connectivity Options

At the high end, Dedicated Interconnect gives you a physical cross-connect between your network and Google's at a colocation facility - 10G or 100G circuits, low latency, predictable bandwidth, and a strong SLA. You provision VLAN attachments to connect the physical circuit to your VPCs via Cloud Router. For production you want at least two circuits in different metropolitan areas; a single circuit is a single point of failure. Partner Interconnect is the alternative when you can't colocate with Google directly - you go through a carrier partner who manages the physical circuit, and you still provision VLAN attachments on the GCP side. Bandwidth starts lower (50 Mbps) and it's the right choice when your traffic volumes don't justify a dedicated circuit or when Dedicated Interconnect isn't available in your location.

Direct Peering and Carrier Peering let you establish BGP sessions directly with Google's network at a peering location, useful for optimizing routing of internet-bound traffic. These are less commonly needed for typical enterprise hybrid connectivity because they don't give you access to VPC resources the way Interconnect or VPN does.

For most setups that don't need the bandwidth or latency guarantees of Interconnect, HA VPN is the right choice. Two VPN interfaces, each with its own external IP, each running a BGP session via Cloud Router, connected to two peer gateways on-premises; when configured correctly, this gives you 99.99% SLA and automatic failover without manual intervention. Classic VPN is the legacy alternative: single tunnel, static or policy-based routing, 99.9% SLA. It still works, but new deployments should default to HA VPN. VPN throughput tops out at around 3 Gbps per tunnel; if you need more, use multiple tunnels or move to Interconnect.

The bandwidth and constraints of each option are important: Interconnect gives you more throughput and lower latency, VPN is operationally simpler and doesn't require colocation. Choose based on your actual requirements!

Topologies and Failover

For multi-cloud topologies (GCP + AWS, GCP + Azure), the connection options are the same VPN and Interconnect products, just pointed at the other cloud's equivalent. The topology pattern (fully meshed, hub-and-spoke, or isolated with a shared services layer) depends on how much traffic crosses cloud boundaries and what your latency requirements are. There's no universally correct answer.

Failover and disaster recovery at the network level means multiple VPN tunnels or Interconnect circuits with BGP failover configured, Cloud Router advertising routes from multiple regions, and health-check-aware routing policies that can drain traffic away from a failing region before it goes completely dark. The routing mode of your VPC (regional vs. global) directly affects whether a Cloud Router in one region can advertise learned routes to VMs in other regions - for most DR scenarios, you want global routing mode.

IP address management across on-premises and cloud is one of those problems that causes real pain if neglected. Keep a single source of truth for all CIDR allocations (on-premises, per-VPC, per-region) and treat overlap between them as a hard failure condition during design, not something to work around later.

Cloud Router

Cloud Router is the BGP speaker on the GCP side of any hybrid connection. It's a fully managed service, not a VM, which means you don't manage its availability... but you do need to understand its BGP configuration.

BGP sessions are established over link-local addresses (169.254.x.x range) through VPN tunnels or VLAN attachments. You configure Cloud Router with a private ASN, and use MED (Multi-Exit Discriminator) values and BGP route priority to influence which path traffic prefers when you have multiple connections. Default route advertisement via BGP lets Cloud Router announce a default route to on-premises, which is useful when you want to centralize internet egress through GCP for inspection or logging. Custom route advertisements let you control exactly which prefixes Cloud Router announces, rather than advertising the entire VPC. Useful for summarization and for keeping internal prefixes private.
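The active/backup MED pattern in miniature - peer names are made up, and the only rule that matters here is that the remote side prefers the path with the lower MED:

```python
from dataclasses import dataclass

@dataclass
class BgpPath:
    peer: str
    med: int  # lower MED = more preferred by the receiving router

def preferred_path(paths):
    """Sketch of active/backup selection: advertise a lower MED on
    the primary tunnel so on-prem prefers it, and keep the
    higher-MED path as hot standby."""
    return min(paths, key=lambda p: p.med)

paths = [BgpPath("tunnel-primary", 100), BgpPath("tunnel-backup", 200)]
print(preferred_path(paths).peer)  # tunnel-primary
```

If tunnel-primary's BGP session drops, its advertisement disappears and the backup path takes over automatically - no MED change needed.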

For critical hybrid connectivity, deploy Cloud Routers in multiple regions with multiple BGP sessions. Use MED values to express active/backup path preference. The Cloud Router service itself is managed and highly available, what you're building redundancy for is the underlying physical connectivity (tunnels or circuits).

Security: Perimeters, Controls, and Best Practices

GCP networking security has multiple layers that operate at different levels of abstraction. Firewall rules are the baseline: they control traffic at the network level. VPC Service Controls operate at a higher level, controlling access to GCP APIs regardless of what network a request comes from. Understanding both layers and how they interact is important.

VPC Service Controls

VPC SC is one of the most powerful security features on GCP and also one of the most complex to deploy correctly. The core idea is this: even if a principal has IAM permission to read a GCS bucket, VPC SC can block that request if it originates outside a defined security perimeter. It's defense-in-depth on top of IAM, not a replacement for it. The primary threat it addresses is data exfiltration via legitimate API calls: an attacker or misconfigured service using real credentials to exfiltrate data through GCP APIs.

A service perimeter defines a boundary around a set of GCP projects and the GCP services within them. Requests crossing that boundary are blocked by default. Access levels define conditions under which crossing is permitted - source IP range, device trust level, user identity, and more. You attach access levels to ingress and egress policies on the perimeter to create controlled exceptions.

Setting one up involves: enabling the Access Context Manager and Cloud Resource Manager APIs, creating an access policy for your organization, defining access levels, and then creating and populating the perimeter. It's not complicated conceptually, but the devil is in the details of what's inside the perimeter and what needs to cross it.

VPC Accessible Services adds an additional restriction: it limits which GCP APIs are accessible from within the perimeter, preventing VMs inside from calling services outside and closing off an internal exfiltration vector. Be careful with this one: forgetting to include Cloud Logging or Cloud Monitoring in the allowed services list will break observability silently.

When you have two separate perimeters that have a good reason to exchange data, a perimeter bridge creates a controlled connection between them. It's more restrictive than merging the perimeters: only explicitly bridged resources can communicate across the bridge, so you retain the isolation you wanted while enabling specific data flows.

Audit logging is non-negotiable. Every VPC SC violation generates a log entry in Cloud Audit Logs. During rollout this is how you catch what you missed; in ongoing operations it's how you detect anomalous access patterns.

The most important operational guidance for VPC SC: always use dry-run mode before enforcing. In dry-run, violations are logged but not blocked. The workflow is: create the perimeter in dry-run mode, then monitor violation logs across all your access patterns, then adjust access levels and perimeter config based on what you find, then test again, then enforce, then test the enforced perimeter, then clean up. This is tedious, but skipping it and going straight to enforcement in an existing environment will cause production incidents every time.

One more thing on VPC SC: it has specific interactions with Shared VPC and VPC Peering that require careful design. Host and service projects in a Shared VPC need to be in the same perimeter, or you need bridges between perimeters that span the boundary. VPC-peered projects similarly need good perimeter placement - the documentation on these interactions is worth reading closely before you finalize your perimeter design.

Firewalls, NGFWs, and IAM

At the firewall layer, GCP gives you native distributed firewall rules and also the ability to run third-party NGFW appliances as VMs. Native GCP firewall rules are fast to deploy, centrally managed, and good enough for most workloads. NGFW appliances (Palo Alto, Fortinet, and others running as VMs with multiple NICs) give you deeper inspection (L7 awareness, TLS inspection, IDS/IPS) but add significant operational complexity. Use native rules as your baseline and add NGFW appliances where your compliance posture or threat model actually requires deeper inspection.

Firewall Insights is worth running periodically on production environments. It surfaces overly permissive rules, shadowed rules (where a higher-priority rule always matches first, making another rule effectively dead), and rules that haven't matched any traffic recently. Firewall rulesets tend to accumulate cruft over time; Insights gives you a data-driven way to clean them up.

Shared VPC IAM deserves particular attention because getting it wrong causes confusing failures. The compute.networkUser role at the host project level is what allows service project resources to use shared subnets. The compute.securityAdmin role controls who can manage firewall rules in the host project. When diagnosing IAM failures in Shared VPC, look at the protoPayload in Cloud Logging for compute.googleapis.com: permission denied errors surface there with enough detail to identify exactly what's missing.

DNS and Packet Inspection

Cloud DNS

Cloud DNS is more than just a nameserver. It's a first-class architectural component with several modes that affect how name resolution works across your entire environment.

Public zones serve internet-facing DNS; private zones are visible only to specific VPCs. Managing zones and records is standard (A, AAAA, CNAME, MX, TXT) with changes propagating in seconds. If you're migrating from another DNS provider, the workflow is: export your zone file, import it into Cloud DNS, verify all records (pay special attention to MX, SPF, and DKIM), then coordinate the NS record cutover with your registrar. Test before you cut over!

DNSSEC is supported for public zones and adds cryptographic authenticity verification to DNS responses, protecting against cache poisoning and spoofing. Worth enabling for public zones, but it requires coordinating with your registrar to publish the DS record and adds ongoing operational complexity around key management.

Forwarding zones and DNS server policies give you control over where queries go. Forwarding zones send queries for specific domains to external resolvers - typically your on-premises DNS servers. Server policies let you configure which DNS server your VMs use by default. Together, these let you build flexible hybrid DNS setups without necessarily deploying full bidirectional DNS infrastructure.

For full hybrid DNS integration, the standard bidirectional pattern works like this: on-premises resolvers forward GCP-internal domain queries to Cloud DNS inbound forwarder IPs, and Cloud DNS forwarding zones send on-premises domain queries back to on-premises resolvers. The result is that VMs in GCP can resolve on-premises names, and on-premises hosts can resolve GCP internal names - seamless name resolution across the hybrid boundary.

Split-horizon DNS is the pattern where the same domain name resolves differently depending on where the query comes from. A public zone returns public IPs; a private zone (visible only to VPC resources) returns internal IPs for the same FQDN. This is the right approach for services that need to be reachable both from the internet and internally, without having to use different domain names for each context.
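Conceptually, split-horizon is just two zones for the same name, keyed by where the query originates. A toy resolver with hypothetical records:

```python
# Hypothetical split-horizon setup: the same FQDN resolves to an
# internal IP for VPC clients and a public IP for everyone else.
ZONES = {
    "public":  {"app.example.com": "203.0.113.10"},  # public zone
    "private": {"app.example.com": "10.128.0.15"},   # VPC-only zone
}

def resolve(fqdn: str, from_vpc: bool) -> str:
    """Private zones are only visible to VPC resources, so the
    querier's location picks which zone answers."""
    zone = ZONES["private"] if from_vpc else ZONES["public"]
    return zone[fqdn]

print(resolve("app.example.com", from_vpc=True))   # 10.128.0.15
print(resolve("app.example.com", from_vpc=False))  # 203.0.113.10
```

In Cloud DNS this falls out of zone visibility for free: you create both zones for the same domain, attach the private one to your VPCs, and resolution diverges automatically.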

DNS peering lets one VPC delegate resolution for a zone to another VPC's Cloud DNS. In hub-and-spoke topologies, a central VPC often owns DNS, and spoke VPCs use DNS peering to resolve names from the hub without needing full VPC peering or Shared VPC just for name resolution. Private DNS logging records DNS queries from VMs in your VPC: useful for security auditing and for debugging resolution issues that are otherwise very hard to trace.

Packet Inspection

Packet mirroring clones traffic from specific VMs or subnets and sends it to a collector (another VM or an Internal Load Balancer) without affecting the traffic path itself. This is how you integrate IDS/IPS tools and traffic analyzers into GCP without putting them inline. You configure a mirroring policy specifying sources and destination, and use source and traffic filters to be selective about what gets mirrored: filtering by VM, subnet, protocol, or CIDR range. Mirroring everything in a high-throughput environment is expensive; be deliberate about what you actually need to capture.
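A sketch of a selective mirroring policy - only TCP traffic from one subnet, sent to a collector ILB forwarding rule (all names are placeholders):

```shell
# Mirror TCP traffic from app-subnet to the IDS collector,
# limited to internal source/destination ranges.
gcloud compute packet-mirrorings create ids-mirror \
  --region=us-central1 \
  --network=prod-vpc \
  --mirrored-subnets=app-subnet \
  --collector-ilb=ids-collector-rule \
  --filter-protocols=tcp \
  --filter-cidr-ranges=10.0.0.0/8
```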

Packet mirroring works across multiple VPCs too, with mirroring traffic encapsulated and tunneled between them, so your collector infrastructure doesn't have to live in the same VPC as the sources.

For inspecting inter-VPC traffic using multi-NIC VMs, the pattern is to deploy NGFW appliances as VMs with multiple network interfaces (one per VPC) and use custom static routes to funnel traffic through them. Traffic enters one NIC, gets inspected, and exits another. For high availability, you use an Internal Load Balancer as the next hop in those static routes: traffic hits the ILB, the ILB distributes it across your appliance pool, and the ILB's health checking automatically routes around failed instances. This is the production-grade pattern for running third-party NGFW products in GCP.
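The routing half of that pattern is a single static route whose next hop is the internal load balancer fronting the appliance pool. A sketch, with names and ranges as placeholders:

```shell
# Send spoke-bound traffic through the NGFW pool: the next hop
# is the internal passthrough LB forwarding rule in front of
# the appliance VMs.
gcloud compute routes create via-ngfw \
  --network=transit-vpc \
  --destination-range=10.20.0.0/16 \
  --next-hop-ilb=ngfw-fwd-rule \
  --next-hop-ilb-region=us-central1
```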

Operating the Network: Observability, Troubleshooting, and Maintenance

Designing and building the network gets all the architectural attention, but operating it is the ongoing work. GCP gives you a solid toolset here that most people should use.

Observability

VPC Flow Logs are per-subnet, sampled records of network flows. You control the sampling rate from 1% to 100%. At low sampling rates they give you a solid traffic pattern overview at modest cost; at higher rates you get more complete data for detailed analysis or security investigations. Enable them broadly at a low sample rate and increase selectively when you need detail.
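Enabling flow logs on an existing subnet at a modest sampling rate might look like this (subnet and region are placeholders):

```shell
# 10% sampling, 5-second aggregation, full metadata.
gcloud compute networks subnets update app-subnet \
  --region=us-central1 \
  --enable-flow-logs \
  --logging-flow-sampling=0.1 \
  --logging-aggregation-interval=interval-5-sec \
  --logging-metadata=include-all
```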

Firewall rule logs tell you per-connection what matched and what decision was made. Combined with Firewall Insights, which analyzes your full ruleset and surfaces overly permissive rules, shadowed rules, and rules that haven't matched traffic in a long time, you have a solid picture of both what's happening and whether your rule configuration makes sense. Run Insights periodically: firewall rulesets tend to accumulate cruft.
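Turning on logging for an existing rule is a one-liner (rule name is a placeholder):

```shell
# Each connection matching this rule is now logged along with
# the allow/deny decision.
gcloud compute firewall-rules update allow-internal \
  --enable-logging \
  --logging-metadata=include-all
```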

For broader monitoring, Cloud Monitoring gives you metrics for everything relevant to network operations: VPN tunnel state and throughput, Cloud Interconnect connection state and bit error rates, Cloud Router BGP session state and route counts, load balancer request counts, latency, and backend health, Cloud Armor allow and deny rates, and Cloud NAT port utilization and allocation failures. You should have alerts on VPN tunnel down, backend health degradation, BGP session down, and NAT port exhaustion at minimum. These are the failures that cascade if you don't catch them early. Logs from networking components (VPN, Cloud Router, VPC Service Controls, Cloud NAT) should feed into those alerts via log-based metrics rather than relying on manual review.
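As one hypothetical example of wiring logs into alerting, a log-based metric counting error-level Cloud VPN log entries could look like this (the filter is illustrative - check it against the actual log schema before relying on it):

```shell
# Count error-level VPN gateway log entries; an alerting policy
# can then fire when the metric rises above zero.
gcloud logging metrics create vpn-tunnel-errors \
  --description="Cloud VPN error-level log entries" \
  --log-filter='resource.type="vpn_gateway" AND severity>=ERROR'
```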

Troubleshooting

When something can't connect, Network Intelligence Center's Connectivity Tests should be your first stop. You specify a source and destination, and it simulates the packet path - telling you whether traffic would be allowed or blocked, and if blocked, exactly which firewall rule or route is the cause. This saves hours of manual tracing through firewall rule lists and routing tables. The Topology view gives you a visual map of your VPCs and their connections, which is particularly useful when you've inherited a network someone else built.
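Creating a test from the CLI might look like this (project, zones, instance names, and port are placeholders):

```shell
# Simulate a TCP/5432 path between two VMs; the result reports
# allowed/blocked and identifies the rule or route responsible.
gcloud network-management connectivity-tests create web-to-db \
  --source-instance=projects/my-proj/zones/us-central1-a/instances/web-1 \
  --destination-instance=projects/my-proj/zones/us-central1-b/instances/db-1 \
  --protocol=TCP \
  --destination-port=5432
```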

For VPN troubleshooting, start with tunnel state via `gcloud compute vpn-tunnels describe`. The common failure modes are IKE negotiation failures (mismatched PSK or IKE version/algorithm parameters), routing issues where the BGP session is up but routes aren't being propagated (usually a misconfigured ASN or an unexpected prefix filter), and MTU problems where IPsec overhead reduces effective MTU and causes fragmentation issues with large packets. Test MTU explicitly: ping with a fixed packet size and the don't-fragment bit set.
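On Linux, the MTU probe looks like this (destination IP is a placeholder):

```shell
# -M do sets the don't-fragment bit, -s sets the ICMP payload.
# 1372 bytes payload + 8 (ICMP header) + 20 (IP header) = a
# 1400-byte packet, a commonly safe size over IPsec tunnels.
ping -M do -s 1372 -c 4 10.128.0.5
```

If the 1400-byte probe fails while smaller sizes succeed, binary-search the payload size until you find the effective path MTU.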

For Cloud Router BGP issues, `gcloud compute routers get-status` shows you session state and the full set of learned and advertised routes. The common issues are ASN mismatch between peers, timer parameter mismatch, link-local IP conflicts on the BGP session addresses, and misconfigured custom route advertisement causing prefixes to not be announced.

For draining traffic from a load balancer backend, set its capacity scaler to 0 on the backend service rather than deleting it. The load balancer will drain existing connections gracefully and stop sending new ones. Deleting a backend instance group while it's under load abruptly drops in-flight connections, which is not nice.
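A sketch for a global backend service (service, instance group, and zone names are placeholders):

```shell
# Capacity scaler 0 stops new traffic to this backend while
# existing connections are allowed to finish.
gcloud compute backend-services update-backend web-bs \
  --global \
  --instance-group=web-ig \
  --instance-group-zone=us-central1-a \
  --capacity-scaler=0
```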

Testing latency and throughput is worth doing at baseline so you have something to compare against when performance degrades. Use iperf3 for throughput between VMs, ping for ICMP RTT as a proxy for network latency. For high-bandwidth, high-latency paths like intercontinental connections, TCP window size is often the limiting factor: run iperf3 with explicit window size settings to get accurate numbers. The Performance Dashboard in Network Intelligence Center gives you latency heatmaps between GCP regions and can confirm whether anomalous latency you're seeing is a real deviation or within normal variance.
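A typical baseline run between two VMs (the target IP is a placeholder):

```shell
# On the target VM: run iperf3 in server mode.
iperf3 -s

# From the source VM: 30-second test, 4 parallel streams, and an
# explicit 4 MB window so TCP can fill a high-latency path.
iperf3 -c 10.128.0.5 -t 30 -P 4 -w 4M
```

Record the numbers somewhere; an iperf3 result with no baseline to compare against tells you very little during an incident.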

The End

Google Cloud networking is deep. This post covers a lot of ground deliberately, but every section here is a surface; each one has books and certification exams dedicated to it. The goal was to give you a solid mental map of how the pieces fit together and where to go when you need to go deeper.

If you take nothing else away: plan your IP space before you build anything, because fixing CIDR conflicts after the fact is miserable. Shared VPC is the right default for multi-team environments. VPC Service Controls is not optional for regulated workloads, and you must use dry-run mode before enforcing. Cloud Router and BGP are the backbone of hybrid connectivity - understand the routing mode and MED levers before you need them in an incident. And make Network Intelligence Center your first debugging tool: Connectivity Tests alone will save you more debugging hours than almost anything else in the platform!

I hope you found this post useful. If you have any questions or feedback, please let me know!