Senior Java Backend Architecture Guide: From Spring Boot to Kafka, Microservices, and Production Systems
If you are moving from mid-level to senior, your job shifts from writing endpoints to shaping reliable systems. This guide is a practical roadmap: what to learn, why it matters, and how to implement it in Java and Spring Boot with Kafka. Each section ties its topic back to its impact on microservices and distributed systems.
Table of Contents
- Core Java and JVM
- Concurrency and Reactive
- Build, Packaging, and Dependency Hygiene
- Spring Core and Spring Boot
- HTTP APIs (REST)
- Persistence: SQL with JPA and JDBC
- NoSQL and Caching (Redis)
- Messaging with Kafka
- Transactions, Idempotency, and Outbox
- Microservices Architecture Components
- API Gateway and Edge Patterns
- Resiliency (Retries, Circuit Breakers, Rate Limits)
- Observability (Logs, Metrics, Traces)
- Security (OAuth2/OIDC)
- Testing (Unit, Integration, Contract)
- CI/CD, Containers, and Kubernetes
- Monitoring Stack and SLOs
- Capstone Project: User / Order / Payment (Kafka + Outbox + Gateway + Observability)
- Printable Service Checklist
- Step-by-Step Roadmap
Core Java and JVM
- What: Collections, generics, streams, records, sealed classes; JVM memory model, GC.
- Why: Data structures and memory behavior drive performance and latency.
- How: Prefer immutable DTOs and defensive copies on boundaries; use parallel streams carefully. Java snippet:
record UserDto(String id, String name) {}

List<String> names = users.stream().map(User::getName).toList();
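The "defensive copies on boundaries" advice can be sketched with a record that holds a collection. This is a minimal illustration, not part of the guide's codebase: `OrderDto` and `DefensiveCopyDemo` are hypothetical names, and the sketch assumes Java 16 or later for records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical DTO: the compact constructor copies the incoming list,
// so later changes to the caller's list cannot leak into the record.
record OrderDto(String id, List<String> items) {
    OrderDto {
        items = List.copyOf(items); // unmodifiable snapshot; rejects null elements
    }
}

class DefensiveCopyDemo {
    public static void main(String[] args) {
        List<String> source = new ArrayList<>(List.of("book"));
        OrderDto dto = new OrderDto("o-1", source);
        source.add("pen");                 // mutate the original after construction
        System.out.println(dto.items());   // prints [book]
    }
}
```

Because `List.copyOf` returns an unmodifiable list, callers of `items()` cannot mutate the record's state either.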
Concurrency and Reactive
- What: Thread pools, CompletableFuture, virtual threads (Project Loom), reactive (Reactor).
- Why: Throughput and tail latency depend on non-blocking IO and correct backpressure.
- How: Use bounded executors; for high concurrency IO, consider WebFlux or Loom. Java snippet (CompletableFuture):
ExecutorService io = Executors.newFixedThreadPool(64);
CompletableFuture<Response> f = CompletableFuture.supplyAsync(() -> client.call(), io);
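A fixed thread pool caps the number of threads, but its work queue is unbounded, so a slow dependency can still let tasks pile up in memory. One possible way to bound both is sketched below. The class name, pool size, queue size, and timeout are illustrative assumptions, not recommendations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedIoPool {
    // Fixed number of threads plus a bounded queue. When both are full,
    // CallerRunsPolicy runs the task on the submitting thread, which slows
    // the producer down instead of letting work pile up in memory.
    static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // Fans out `count` tasks, waits for all of them, and sums the results.
    static int sumSquares(int count) {
        ThreadPoolExecutor io = create(4, 8);
        try {
            List<CompletableFuture<Integer>> futures = new ArrayList<>();
            for (int i = 1; i <= count; i++) {
                final int n = i;
                futures.add(CompletableFuture
                        .supplyAsync(() -> n * n, io)
                        .orTimeout(2, TimeUnit.SECONDS)); // fail instead of waiting forever
            }
            return futures.stream().mapToInt(CompletableFuture::join).sum();
        } finally {
            io.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumSquares(100)); // prints 338350
    }
}
```

Caller-runs is only one rejection strategy. A service that prefers to shed load could reject the task and return an error instead.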
Build, Packaging, and Dependency Hygiene
- What: Gradle/Maven, BOMs, dependency convergence, layered jars, Docker images.
- Why: Reproducible builds and small images reduce CVEs and deploy time.
- How: Use Spring Boot layered jar and slim base images. Example Dockerfile:
FROM eclipse-temurin:21-jre
ARG JAR=app.jar
COPY build/libs/${JAR} /app.jar
ENTRYPOINT ["java", "-XX:+UseZGC", "-jar", "/app.jar"]
Spring Core and Spring Boot
- What: DI, configuration properties, profiles, actuator.
- Why: Clean composition and production toggles are essential for microservices.
- How: Externalize config, validate @ConfigurationProperties, expose health and info. Java snippet:
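// @Validated plus constraint annotations make startup fail fast on bad config
// (requires spring-boot-starter-validation on the classpath).
@Validated
@ConfigurationProperties(prefix = "service")
record ServiceProps(@NotBlank String name, @NotNull Duration timeout) {}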
HTTP APIs (REST)
- What: Controllers, DTO validation, error handling, idempotency.
- Why: APIs are the product surface; correctness and predictability reduce incidents.
- How: Version your API, add problem+json errors, define idempotency keys for writes. Java snippet:
@RestController
@RequestMapping("/v1/users")
class UserController {
    private final UserService svc;

    UserController(UserService s) { this.svc = s; }

    @PostMapping
    ResponseEntity<UserDto> create(
            @Valid @RequestBody CreateUser req,
            @RequestHeader(value = "Idempotency-Key", required = false) String idk) {
        return ResponseEntity.status(HttpStatus.CREATED).body(svc.create(req, idk));
    }
}
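The controller above passes the `Idempotency-Key` header to the service but does not show what the service does with it. A minimal sketch of one option follows. It assumes an in-memory map and a hypothetical `IdempotentUserCreator` class. A real service would more likely keep the key in the database under a unique constraint, so that the check survives restarts and works across instances.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class IdempotentUserCreator {
    private final Map<String, String> userIdByKey = new ConcurrentHashMap<>();
    private final AtomicInteger inserts = new AtomicInteger();

    // Returns the id of the created user. A repeated key returns the id
    // from the first call instead of creating a second user.
    String create(String name, String idempotencyKey) {
        if (idempotencyKey == null) {
            return insert(name); // no key supplied: every call is a new write
        }
        // computeIfAbsent runs the insert at most once per key, even when
        // two retries of the same request arrive concurrently.
        return userIdByKey.computeIfAbsent(idempotencyKey, k -> insert(name));
    }

    private String insert(String name) {
        inserts.incrementAndGet(); // stands in for the real database insert
        return UUID.randomUUID().toString();
    }

    int insertCount() {
        return inserts.get();
    }
}
```

Calling `create("ann", "k1")` twice returns the same id and performs one insert, which is the behavior a client retrying a timed-out POST needs.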
Persistence: SQL with JPA and JDBC
- What: JPA/Hibernate vs plain JDBC; transactions; connection pools.
- Why: Schema and queries define scalability; lazy loading traps; N+1 patterns.
- How: Use explicit DTO projections, batch writes, and connection timeouts. JPA snippet:
public interface UserRepo extends JpaRepository<UserEntity, String> {
    @Query("select new com.acme.UserDto(u.id, u.name) from UserEntity u where u.status = :s")
    List<UserDto> findByStatus(@Param("s") Status status);
}
NoSQL and Caching (Redis)
- What: Key-value (Redis), document (Mongo), column (Cassandra).
- Why: Latency and scale; choose per access pattern; avoid cache stampedes.
- How: Cache aside with TTL; prefer small value objects; compress large payloads. Redis pseudo-YAML:
spring:
  data:
    redis:
      host: redis
      port: 6379
      timeout: 100ms
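Cache-aside with a TTL can be sketched without Redis to show the read path. The `TtlCache` class below is a hypothetical in-process stand-in, and the injectable clock is an assumption made so that expiry can be exercised without sleeping. With Redis, the same shape would be: read the key, fall back to the database on a miss, then write the value back with an expiry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

class TtlCache<K, V> {
    private record Entry<V>(V value, long expiresAtMillis) {}

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so expiry can be tested without sleeping

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Cache-aside read: return the cached value if it is still fresh,
    // otherwise call the loader (the source of truth) and cache the result.
    // compute() is atomic per key, so concurrent misses on one key trigger
    // a single load. The loader must not touch this cache itself.
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        Entry<V> fresh = entries.compute(key, (k, old) ->
                (old != null && old.expiresAtMillis() > now)
                        ? old
                        : new Entry<>(loader.apply(k), now + ttlMillis));
        return fresh.value();
    }

    // Call after a write to the source of truth so readers do not see stale data.
    void invalidate(K key) {
        entries.remove(key);
    }
}
```

The single-load-per-key behavior is what limits a stampede inside one JVM. Across several instances, each instance can still miss once, so a shared cache or request coalescing is needed for stronger protection.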
Messaging with Kafka
- What: Topics, partitions, consumer groups, delivery semantics.
- Why: Decoupling and scale; async flows; backpressure via consumer lag.
- How: Keys define partitioning; configure acks, retries, idempotence. Spring Kafka config (application.yaml):
spring:
  kafka:
    bootstrap-servers: localhost:9092
    producer:
      acks: all
      retries: 5
      properties:
        enable.idempotence: true
    consumer:
      group-id: user-svc
      auto-offset-reset: earliest
Java consumer:
@KafkaListener(topics = "user-events", groupId = "user-svc")
void on(UserEvent evt) {
    handler.process(evt);
}
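"Keys define partitioning" means that records with the same key land on the same partition, which is what preserves per-key ordering. The sketch below illustrates that property only. The hash used here is a simplified stand-in and is not the hash Kafka's default partitioner applies, so the partition numbers it produces will differ from a real cluster. `KeyPartitioning` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class KeyPartitioning {
    // Simplified stand-in for a keyed partitioner: hash the key bytes, clear
    // the sign bit, and take the remainder. The hash function here is NOT
    // the one Kafka uses; only the "same key, same partition" property matters.
    static int partitionFor(String key, int numPartitions) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(bytes);
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // All events for user-42 map to one partition, so their relative order is kept.
        System.out.println(partitionFor("user-42", 6) == partitionFor("user-42", 6)); // prints true
    }
}
```

Because the result depends on the partition count, adding partitions to a topic changes where existing keys map, which is worth planning for before relying on per-key ordering.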
Transactions, Idempotency, and Outbox
- What: Exactly-once is a workflow property; implement idempotency and outbox.
- Why: Prevent double charges and lost messages in distributed boundaries.
- How: Write to DB within tx + outbox row, Debezium/connector publishes to Kafka. Outbox table:
create table outbox(
  id uuid primary key,
  aggregate_id varchar(64),
  type varchar(64),
  payload json,
  created_at timestamp
);
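The outbox flow described above can be simulated in memory to show why it avoids lost messages. This is a sketch under loud assumptions: two lists stand in for the business table and the outbox table, a synchronized method stands in for a database transaction, and `relay` stands in for Debezium or a polling publisher. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.function.Consumer;

class OutboxSketch {
    record OutboxRow(UUID id, String aggregateId, String type, String payload) {}

    private final List<String> orders = new ArrayList<>();     // stands in for the business table
    private final List<OutboxRow> outbox = new ArrayList<>();  // stands in for the outbox table
    private int nextToPublish = 0;                              // relay position in the outbox

    // One synchronized method plays the role of one database transaction:
    // the order and its event are recorded together or not at all.
    synchronized void placeOrder(String orderId) {
        orders.add(orderId);
        outbox.add(new OutboxRow(UUID.randomUUID(), orderId, "OrderPlaced",
                "{\"orderId\":\"" + orderId + "\"}"));
    }

    // Relay step: publish rows not yet sent, in insertion order. The position
    // advances only after the publisher returns, so a failure mid-publish
    // causes a resend on the next run, not a lost event.
    synchronized int relay(Consumer<OutboxRow> publisher) {
        int sent = 0;
        while (nextToPublish < outbox.size()) {
            publisher.accept(outbox.get(nextToPublish));
            nextToPublish++;
            sent++;
        }
        return sent;
    }
}
```

The resend-on-failure behavior makes delivery at-least-once, which is why consumers of outbox events still need to deduplicate by event id.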
Microservices Architecture Components
- What: Service discovery (Eureka/Consul), config server, API gateway, centralized auth.
- Why: Operate many services with consistent cross-cutting concerns.
- How: Spring Cloud Config + Consul; minimize service-to-service dynamic deps.
API Gateway and Edge Patterns
- What: Routing, rate-limiting, auth, request shaping.
- Why: Single ingress for policies and observability.
- How: Spring Cloud Gateway or Kong/Apigee at edge; validate and normalize headers. Gateway route (yaml):
spring:
  cloud:
    gateway:
      routes:
        - id: user-api
          uri: http://user:8080
          predicates:
            - Path=/v1/users/**
          filters:
            - RemoveRequestHeader=Cookie
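Rate limiting at the edge is often implemented as a token bucket. The sketch below shows the algorithm in plain Java so the refill and spend steps are visible. It is not how any particular gateway implements it, and the `TokenBucket` class, its constructor parameters, and the injectable clock are assumptions for illustration.

```java
import java.util.function.LongSupplier;

class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private final LongSupplier nanoClock; // injectable for tests; System::nanoTime in real use
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double refillPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.nanoClock = nanoClock;
        this.tokens = capacity;           // start full so an initial burst is allowed
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    // Returns true and spends one token if the request may proceed,
    // false if the caller should be rejected (for HTTP, typically a 429).
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        // Add the tokens earned since the last call, never exceeding capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

Capacity sets the allowed burst and the refill rate sets the sustained rate. A bucket shared across gateway instances would need its state in a shared store instead of a field.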
Resiliency (Retries, Circuit Breakers, Rate Limits)
- What: Resilience4j for retries/circuit; token bucket for rate limits.
- Why: Isolate failures and prevent cascades.
- How: Configure jittered retries; keep each outbound call's timeout shorter than the timeout your own caller applies, so inner calls fail before the outer request gives up. Java snippet:
@Retry(name = "userRetry")
@CircuitBreaker(name = "userCb")
public UserDto callUpstream(String id) {
    return client.getUser(id);
}
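To make "jittered retries" concrete, the sketch below implements exponential backoff with full jitter by hand. It is an illustration of the idea, not Resilience4j's implementation, and `JitteredRetry`, its parameters, and the injectable `sleeper` are hypothetical. It also assumes the wrapped operation is safe to repeat.

```java
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.function.LongConsumer;

class JitteredRetry {
    // "Full jitter": pick a random delay in [0, min(cap, base * 2^attempt)].
    // Randomizing the whole interval keeps many clients from retrying in lockstep.
    static long delayMillis(int attempt, long baseMillis, long capMillis, Random rnd) {
        long ceiling = Math.min(capMillis, baseMillis << Math.min(attempt, 20));
        return (long) (rnd.nextDouble() * (ceiling + 1));
    }

    // Calls `op` up to maxAttempts times, pausing via `sleeper` between
    // failures. The last failure is rethrown to the caller.
    static <T> T call(Callable<T> op, int maxAttempts, long baseMillis, long capMillis,
                      Random rnd, LongConsumer sleeper) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (attempt + 1 >= maxAttempts) {
                    throw e;
                }
                sleeper.accept(delayMillis(attempt, baseMillis, capMillis, rnd));
            }
        }
    }
}
```

In real use the `sleeper` would block the thread or schedule a delayed retry. Passing it in keeps the sketch testable without waiting.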
Observability (Logs, Metrics, Traces)
- What: Structured logs, Micrometer metrics, OpenTelemetry traces.
- Why: You cannot fix what you cannot see; SLOs need signals.
- How: JSON logs; Micrometer to Prometheus; OTLP exporter to Jaeger/Tempo. Micrometer counter:
Counter created = Counter.builder("user_created_total").register(meterRegistry);
created.increment();
OTel exporter (yaml):
# Spring Boot sends traces over OTLP/HTTP by default; 4317 is the collector's gRPC port.
management:
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces
Security (OAuth2/OIDC)
- What: Spring Security with resource server; Keycloak/Okta as IdP.
- Why: Token-based auth scales across services and supports a zero-trust model in which every service verifies each request.
- How: Bearer tokens with scopes; fine-grained authorities via claims. Java config:
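@Configuration
@EnableWebSecurity
class SecCfg {
    @Bean
    SecurityFilterChain http(HttpSecurity h) throws Exception {
        h.authorizeHttpRequests(a -> a
                // Open only the actuator endpoints you intend to be public.
                .requestMatchers("/actuator/**").permitAll()
                .anyRequest().authenticated())
         // The no-arg jwt() is deprecated in Spring Security 6.1; pass a Customizer.
         .oauth2ResourceServer(o -> o.jwt(Customizer.withDefaults()));
        return h.build();
    }
}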
Testing (Unit, Integration, Contract)
- What: JUnit5, Testcontainers, WireMock, Pact.
- Why: Prevent regressions and verify contracts across services.
- How: Run PostgreSQL/Kafka via Testcontainers in CI for realism. JUnit + Testcontainers:
@Container static PostgreSQLContainer<?> pg = new PostgreSQLContainer<>("postgres:16");
CI/CD, Containers, and Kubernetes
- What: Pipelines (GitHub Actions), image build, deployment strategies.
- Why: Safe, fast releases; progressive rollouts reduce risk.
- How: Blue/green or canary; Helm charts; GitOps (ArgoCD). K8s snippet:
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: user
          image: acme/user:1.0.0
          resources:
            requests: { cpu: "200m", memory: "256Mi" }
            limits: { cpu: "1", memory: "512Mi" }
Monitoring Stack and SLOs
- What: Prometheus, Grafana, Loki/ELK, Jaeger.
- Why: Close the loop with alerts on SLO burn rates.
- How: Build RED and USE dashboards; set actionable alerts with runbooks.
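The burn rate behind an SLO alert is a small calculation: the observed error ratio divided by the error budget. The sketch below works through it. The `BurnRate` class and the two-window paging rule are hypothetical illustrations, and the threshold is an input to choose per service, not a recommendation.

```java
class BurnRate {
    // Burn rate = observed error ratio / error budget ratio.
    // With a 99.9% SLO the budget is 0.1%; a 1% error ratio burns it about 10x too fast.
    static double burnRate(long errors, long total, double sloTarget) {
        if (total == 0) {
            return 0.0; // no traffic, nothing burned
        }
        double errorRatio = (double) errors / total;
        double budget = 1.0 - sloTarget;
        return errorRatio / budget;
    }

    // Hypothetical alert rule: page only when both a long and a short window
    // burn fast, so a brief blip or an already-recovered incident stays quiet.
    static boolean shouldPage(double longWindowBurn, double shortWindowBurn, double threshold) {
        return longWindowBurn >= threshold && shortWindowBurn >= threshold;
    }
}
```

A burn rate of 1 means the budget lasts exactly the SLO period. Anything above 1 exhausts it early, which is the signal worth alerting on.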
Capstone Project: User / Order / Payment (Kafka + Outbox + Gateway + Observability)
Goal: ship three services with reliable messaging, API gateway, and full telemetry.
Repo structure:
acme-platform/
  gateway/                # Spring Cloud Gateway
  user-service/           # PostgreSQL + outbox table
  order-service/          # PostgreSQL + outbox table
  payment-service/        # PostgreSQL + outbox table
  infra/
    docker-compose.yml    # Postgres, Kafka, Schema Registry, Debezium, Prometheus, Grafana, Jaeger
    topics.sh             # create topics: user-events, order-events, payment-events
Gateway routes (application.yaml):
spring:
  cloud:
    gateway:
      routes:
        - id: user
          uri: http://user-service:8080
          predicates:
            - Path=/v1/users/**
        - id: order
          uri: http://order-service:8080
          predicates:
            - Path=/v1/orders/**
        - id: payment
          uri: http://payment-service:8080
          predicates:
            - Path=/v1/payments/**
Outbox table (shared shape across services):
create table if not exists outbox(
  id uuid primary key,
  aggregate_id varchar(128) not null,
  type varchar(64) not null,
  payload jsonb not null,
  created_at timestamptz default now()
);
create index if not exists idx_outbox_created on outbox(created_at);
Spring configuration (user-service application.yaml):
spring:
  datasource:
    url: jdbc:postgresql://postgres:5432/userdb
    username: user
    password: pass
  jpa:
    hibernate:
      ddl-auto: validate
  kafka:
    bootstrap-servers: kafka:9092
    producer:
      acks: all
      properties: { enable.idempotence: true }
management:
  endpoints:
    web.exposure.include: health,info,prometheus
Debezium connector (user outbox -> Kafka):
{
  "name": "user-outbox",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.dbname": "userdb",
    "table.include.list": "public.outbox",
    "topic.prefix": "user",
    "tombstones.on.delete": "false"
  }
}
OpenTelemetry + Micrometer (user-service application.yaml):
# Spring Boot 3 property names; OTLP tracing uses HTTP (port 4318) by default.
management:
  tracing:
    sampling:
      probability: 0.1
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces
  prometheus:
    metrics:
      export:
        enabled: true
Consumer pattern (order-service consumes user events):
@KafkaListener(topics = "user.public.outbox")
void on(ConsumerRecord<String, String> rec) {
    // parse payload, apply idempotency using event id
}
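The "apply idempotency using event id" step in the listener above could look like the following sketch. It assumes an in-memory set of processed ids and a hypothetical `IdempotentConsumer` class. In a real service the processed-id record would more likely be written in the same database transaction as the handler's side effects, so that a crash cannot separate the two.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

class IdempotentConsumer {
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> handler;

    IdempotentConsumer(Consumer<String> handler) {
        this.handler = handler;
    }

    // Returns true if the event was applied, false if it was a duplicate.
    boolean onEvent(String eventId, String payload) {
        // add() returns false when the id is already present, so a
        // redelivered event is skipped without calling the handler.
        if (!processedEventIds.add(eventId)) {
            return false;
        }
        try {
            handler.accept(payload);
            return true;
        } catch (RuntimeException e) {
            processedEventIds.remove(eventId); // allow a retry after a failed attempt
            throw e;
        }
    }
}
```

With the outbox table above, the row's `id` column is a natural choice for the event id, since it stays the same across redeliveries.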
docker-compose (infra/docker-compose.yml) highlights:
services:
  postgres: { image: postgres:16 }
  kafka: { image: confluentinc/cp-kafka:7.6.0 }
  schema-registry: { image: confluentinc/cp-schema-registry:7.6.0 }
  debezium: { image: debezium/connect:2.6 }
  prometheus: { image: prom/prometheus }
  grafana: { image: grafana/grafana }
  jaeger: { image: jaegertracing/all-in-one }
Printable Service Checklist
Copy, paste, and print per service (User / Order / Payment).
- Deployment
  - Image uses slim JRE; layered jar; SBOM stored.
  - Readiness/liveness probes configured.
  - Resource requests/limits set with headroom.
  - Config/secret via env or vault; no secrets in images.
- Resilience
  - Timeouts set on all clients; retries with jitter; circuit breakers on remote calls.
  - Bulkheads (thread pools) bounded; rate limits at gateway and service.
  - Idempotency keys for writes; outbox + CDC for cross-service events.
  - DLQ and replay plan documented; backoff tuned.
- Observability
  - Structured JSON logs with correlation/trace IDs.
  - Micrometer metrics: RED (requests, errors, duration).
  - Traces exported via OTLP; sampling rate is configurable.
  - Dashboards and alerts exist with runbooks; SLO defined.
Step-by-Step Roadmap
- Java and JVM fundamentals; collections, streams, records.
- Concurrency: thread pools, futures, timeouts; intro to Reactor or Loom.
- Spring Boot core: configs, profiles, actuator; solid REST APIs.
- SQL mastery: normalization, indexing, transactions, JPA pitfalls.
- NoSQL + Redis: pick per access pattern; cache correctness and TTLs.
- Kafka: producers/consumers, keys, partitions, idempotent writes.
- Resiliency: retries, timeouts, circuit breakers, bulkheads; rate limits.
- Observability: logs, Micrometer metrics, OpenTelemetry traces.
- Security: OAuth2/OIDC resource server; propagate identity across services.
- Architecture components: config, discovery, gateway; cross-cutting policies.
- Data correctness patterns: idempotency, outbox, CDC; backfills and replay.
- Platform: Docker, K8s, Helm; blue/green and canary; GitOps.
- Monitoring and SLOs: burn rate alerts, triage, and postmortems.
Focus on outcomes: lower P99 latency, fewer incident tickets, faster safe releases.