Introduction

  • TL;DR: Kubeflow is an ecosystem for running reproducible ML workflows on Kubernetes—from notebooks and pipelines to distributed training and model serving. (Kubeflow)
  • In practice, “using Kubeflow” means wiring together Profiles/Namespaces, Notebooks, Pipelines (KFP), training (Trainer), tuning (Katib), and serving (KServe) with clear operational boundaries. (Kubeflow)

1) What “Kubeflow” is in 2026: Projects vs Platform

Kubeflow can be installed as standalone projects (e.g., Pipelines-only) or as the integrated Kubeflow AI reference platform. The official “Installing Kubeflow” guide explicitly frames these as two installation methods. (Kubeflow)

Why it matters: Treating Kubeflow as an ecosystem lets you start small (one project) and expand to a full platform when your team is ready, reducing operational risk. (Kubeflow)

2) Install and Access: Manifests + Kustomize + Istio Gateway

Kubeflow 1.11 was released on 2025-12-15. The upstream manifests repository documents both “single-command” and “install individual components” approaches using Kustomize.

2-1) Fast local install (Kind example)

The manifests README provides a Kind-based flow and a retry loop for applying resources (to handle CRD/CR timing).

while ! kustomize build example | kubectl apply --server-side --force-conflicts -f -; do
  echo "Retrying to apply resources"; sleep 20;
done
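
The shell loop above retries forever, which is convenient locally but awkward in automation. A minimal sketch of the same idea with an attempt limit follows; the function name `retry_until_success` and its parameters are hypothetical, and the `attempt` callable stands in for whatever runs the `kustomize build ... | kubectl apply ...` command in your setup.

```python
import time
from typing import Callable


def retry_until_success(
    attempt: Callable[[], bool],
    delay_seconds: float = 20.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `attempt` until it returns True; return the number of calls made.

    Raises RuntimeError once `max_attempts` calls have all failed.
    """
    for attempt_number in range(1, max_attempts + 1):
        if attempt():
            return attempt_number
        print("Retrying to apply resources")
        # Wait between attempts, but not after the final failure.
        if attempt_number < max_attempts:
            sleep(delay_seconds)
    raise RuntimeError(f"apply did not succeed after {max_attempts} attempts")


if __name__ == "__main__":
    # Stand-in for the real apply command: fails twice, then succeeds,
    # as can happen while CRDs are still being registered.
    outcomes = iter([False, False, True])
    calls = retry_until_success(lambda: next(outcomes), sleep=lambda _s: None)
    print(f"succeeded after {calls} attempts")
```

In a real script, `attempt` could wrap `subprocess.run(...)` and return whether the exit code was zero; the cap simply turns a stuck install into a visible failure.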

2-2) Access via port-forward

The default access path is port-forwarding the Istio ingress gateway and logging in via Dex using the documented default credentials. (GitHub)

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
# open http://localhost:8080

For exposing Kubeflow via Ingress/LoadBalancer, the manifests README warns that many web apps rely on Secure Cookies, so you’ll typically need HTTPS for non-localhost domains. (GitHub)

Why it matters: Many “Kubeflow is broken” reports trace back to ingress, authentication, or cookie misconfiguration. Settle the access pattern early (HTTPS plus identity integration) so the UI stays stable. (GitHub)

3) Multi-tenancy Basics: Profiles and Namespaces

A Kubeflow Profile wraps a Kubernetes Namespace and supports an owner + contributors model. (Kubeflow) Kubeflow Pipelines multi-user isolation is part of the Profile/Namespace isolation strategy and is documented as supported in Kubeflow Platform deployments. (Kubeflow)
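
To make the Profile-to-Namespace relationship concrete, here is a small sketch that builds a Profile manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The helper name `profile_manifest`, the profile name `team-a`, and the owner email are placeholders; confirm the `apiVersion` against the Profile CRD installed in your cluster.

```python
import json


def profile_manifest(name: str, owner_email: str) -> dict:
    """Build a Profile whose name becomes the tenant's namespace name."""
    return {
        "apiVersion": "kubeflow.org/v1",  # assumption: check your installed CRD
        "kind": "Profile",
        "metadata": {"name": name},
        # The owner is the user who administers the namespace.
        "spec": {"owner": {"kind": "User", "name": owner_email}},
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(json.dumps(profile_manifest("team-a", "alice@example.com"), indent=2))
```

Generating Profiles from one function is one way to keep naming conventions consistent as more teams are onboarded.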

Why it matters: Namespaces are the unit of isolation for runs, artifacts, and access control. Define Profile conventions before onboarding multiple teams. (Kubeflow)

4) Notebooks: Reproducible dev environments on Kubernetes

The Notebooks quickstart shows the standard UI flow: open Central Dashboard → select a namespace → create notebook servers. (Kubeflow)

Why it matters: Notebooks are the front door to your platform. Standardize images, PVC usage, and namespace defaults to make downstream Pipelines/Trainer workloads consistent. (Kubeflow)

5) Kubeflow Pipelines: DSL → IR YAML → Run

To submit a pipeline, you compile it to YAML using the KFP SDK compiler; the output is an IR YAML representation of the pipeline spec. (Kubeflow)

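from kfp import compiler, dsl

# A lightweight Python component: KFP runs this function in its own container.
@dsl.component
def hello(name: str) -> str:
    return f"hello, {name}"

# The pipeline wires components together; its parameters become run inputs.
@dsl.pipeline(name="hello-pipeline")
def pipeline(name: str = "kubeflow"):
    hello(name=name)

# Compile the pipeline to IR YAML, the artifact you upload or submit.
compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.yaml")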

The official “Run a Pipeline” guide describes uploading the compiled artifact from the KFP dashboard to start runs. (Kubeflow) The manifests repo also documents a Kubernetes-native API mode where pipeline definitions are stored as Kubernetes CRs (Pipeline, PipelineVersion). (GitHub)
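
As an alternative to the dashboard upload, the compiled package can also be submitted from code with the KFP SDK client. The sketch below is illustrative only: the helper names `run_request` and `submit`, the host URL, the namespace `team-a`, and the experiment name are all assumptions, and a multi-user deployment will normally need authentication (for example a session cookie or token) that is not shown here.

```python
from typing import Any, Dict, Optional


def run_request(
    package_path: str,
    namespace: str,
    arguments: Optional[Dict[str, Any]] = None,
    experiment_name: str = "hello-experiments",  # hypothetical experiment name
) -> Dict[str, Any]:
    """Collect the keyword arguments for create_run_from_pipeline_package."""
    return {
        "pipeline_file": package_path,
        "arguments": arguments or {},
        "experiment_name": experiment_name,
        "namespace": namespace,  # the Profile namespace that owns the run
    }


def submit(host: str, request: Dict[str, Any]):
    """Submit the run; requires the kfp package and a reachable KFP API."""
    import kfp  # imported lazily so run_request works without the SDK

    client = kfp.Client(host=host)
    return client.create_run_from_pipeline_package(**request)


if __name__ == "__main__":
    request = run_request("pipeline.yaml", "team-a", {"name": "kubeflow"})
    print(request)
    # submit("http://localhost:8080/pipeline", request)  # assumed endpoint
```

Keeping the request as plain data makes it easy to review in CI before anything is sent to the cluster.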

Why it matters: Treat IR YAML as the contract for reproducibility and CI/CD. Namespace-aware design is essential for multi-tenant operations. (Kubeflow)

6) Training at Scale: Kubeflow Trainer v2

Kubeflow Trainer is presented as a Kubernetes-native project for scalable distributed training (including LLM fine-tuning) across frameworks. (Kubeflow) Its installation guide lists prerequisites such as Kubernetes >= 1.31 and kubectl >= 1.31. (Kubeflow) Migration guidance explains that Trainer v2 introduces unified APIs (e.g., TrainJob, TrainingRuntime) that replace framework-specific CRDs like PyTorchJob/TFJob/MPIJob. (Kubeflow)
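
For a sense of what the unified API looks like, the sketch below assembles a TrainJob manifest as a Python dict. Treat the details as assumptions to verify against your installed Trainer version: the `trainer.kubeflow.org/v1alpha1` API version, the runtime name `torch-distributed`, and the image and command are illustrative, and `trainjob_manifest` is a hypothetical helper.

```python
import json
from typing import List


def trainjob_manifest(
    name: str,
    namespace: str,
    runtime: str,
    num_nodes: int,
    image: str,
    command: List[str],
) -> dict:
    """Build a TrainJob that references a training runtime by name."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",  # assumption: check CRD
        "kind": "TrainJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The runtime supplies the framework-specific template.
            "runtimeRef": {"name": runtime},
            # Per-job overrides: scale, image, and entrypoint.
            "trainer": {"numNodes": num_nodes, "image": image, "command": command},
        },
    }


if __name__ == "__main__":
    job = trainjob_manifest(
        name="pytorch-simple",
        namespace="team-a",
        runtime="torch-distributed",  # assumed runtime name
        num_nodes=2,
        image="docker.io/example/custom-training",  # placeholder image
        command=["torchrun", "train.py"],
    )
    print(json.dumps(job, indent=2))
```

The same shape applies regardless of framework; only the referenced runtime changes, which is the point of replacing the per-framework CRDs.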

Why it matters: Unified APIs reduce operational fragmentation as you add frameworks and hardware types (multi-node, multi-GPU) over time. (Kubeflow)

7) Tuning and Serving: Katib + KServe

Katib user guides describe configuring Trial templates for HPO experiments. (Kubeflow) The manifests repo documents KServe installation, noting KFServing was rebranded to KServe. (GitHub) KServe’s website shows the InferenceService-based workflow, and Kubeflow’s Models Web App documentation states it works with v1beta1 InferenceService. (kserve.github.io)
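
The InferenceService workflow mentioned above can be sketched as a manifest built in Python. The helper `inference_service_manifest`, the service name, and the namespace are hypothetical; the storage URI is the scikit-learn sample commonly used in KServe examples and should be replaced with your own model location.

```python
import json


def inference_service_manifest(
    name: str, namespace: str, model_format: str, storage_uri: str
) -> dict:
    """Build a v1beta1 InferenceService with a single predictor."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                "model": {
                    # KServe selects a serving runtime from the model format.
                    "modelFormat": {"name": model_format},
                    "storageUri": storage_uri,
                }
            }
        },
    }


if __name__ == "__main__":
    isvc = inference_service_manifest(
        name="sklearn-iris",
        namespace="team-a",  # placeholder Profile namespace
        model_format="sklearn",
        storage_uri="gs://kfserving-examples/models/sklearn/1.0/model",
    )
    print(json.dumps(isvc, indent=2))
```

Deploying into a Profile namespace keeps serving under the same isolation boundary as the notebooks and pipelines that produced the model.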

Why it matters: Together, Katib and KServe close the loop from experimentation to production APIs, turning “trained models” into continuously deployable services. (Kubeflow)

Conclusion

  • Start with the right mental model: Kubeflow Projects vs the integrated platform. (Kubeflow)
  • Use manifests + Kustomize for installation and plan access (HTTPS) early. (GitHub)
  • Make Profiles/Namespaces your multi-tenant foundation. (Kubeflow)
  • Standardize on KFP’s IR YAML workflow for reproducibility and CI/CD. (Kubeflow)
  • Scale training with Trainer v2 and serve models with KServe for end-to-end MLOps. (Kubeflow)

Summary

  • Install Kubeflow via upstream manifests and access via Istio gateway.
  • Use Profiles/Namespaces for isolation and governance.
  • Build pipelines with KFP (DSL → IR YAML → Runs).
  • Run distributed training with Kubeflow Trainer v2.
  • Tune with Katib and deploy with KServe (InferenceService).

References

  1. Kubeflow 1.11 Release | Kubeflow | 2025-12-15 | https://www.kubeflow.org/docs/releases/kubeflow-1.11/
  2. Kubeflow Deployment Manifests | GitHub | 2026-01-08 (accessed) | https://github.com/kubeflow/manifests
  3. Installing Kubeflow | Kubeflow | 2025-12 | https://www.kubeflow.org/docs/started/installing-kubeflow/
  4. Profiles and Namespaces | Kubeflow | 2025-03-29 | https://www.kubeflow.org/docs/components/central-dash/profiles/
  5. Notebooks Quickstart Guide | Kubeflow | 2025-03-29 | https://www.kubeflow.org/docs/components/notebooks/quickstart-guide/
  6. Compile a Pipeline | Kubeflow | 2025-08-12 | https://www.kubeflow.org/docs/components/pipelines/user-guides/core-functions/compile-a-pipeline/
  7. Run a Pipeline | Kubeflow | 2025-03-29 | https://www.kubeflow.org/docs/components/pipelines/user-guides/core-functions/run-a-pipeline/
  8. Multi-user Isolation (KFP) | Kubeflow | 2025-12-04 | https://www.kubeflow.org/docs/components/pipelines/operator-guides/multi-user/
  9. Kubeflow Trainer Installation | Kubeflow | 2025-11-07 | https://www.kubeflow.org/docs/components/trainer/operator-guides/installation/
  10. KServe (InferenceService) | KServe | 2026-01-08 (accessed) | https://kserve.github.io/website/