♊️ GemiNews 🗞️ (dev)


🗞️GKE + Gemma + Ollama: The Power Trio for Flexible LLM Deployment

🗿Semantically Similar Articles (by :title_embedding)

GKE + Gemma + Ollama: The Power Trio for Flexible LLM Deployment

2024-03-29 - Federico Iezzi (from Google Cloud - Medium)


[Blogs] 🌎 https://medium.com/google-cloud/gke-gemma-ollama-the-power-trio-for-flexible-llm-deployment-5f1fa9223477?source=rss----e52cf94d98af---4 [🧠] [v2] article_embedding_description: {:llm_project_id=>"Unavailable", :llm_dimensions=>nil, :article_size=>21233, :llm_embeddings_model_name=>"textembedding-gecko"}
[🧠] [v1/3] title_embedding_description: {:ricc_notes=>"[embed-v3] Fixed on 9oct24. Only seems incompatible at first glance with embed v1.", :llm_project_id=>"unavailable possibly not using Vertex", :llm_dimensions=>nil, :article_size=>21233, :poly_field=>"title", :llm_embeddings_model_name=>"textembedding-gecko"}
[🧠] [v1/3] summary_embedding_description:
[🧠] As per bug https://github.com/palladius/gemini-news-crawler/issues/4 we can state this article belongs to title/summary version: v3 (very few articles updated on 9oct24)

🗿article.to_s

------------------------------
Title: GKE + Gemma + Ollama: The Power Trio for Flexible LLM Deployment

Author: Federico Iezzi
PublishedDate: 2024-03-29
Category: Blogs
NewsPaper: Google Cloud - Medium
Tags: llm, google-cloud-platform, kubernetes, gemma, ollama
{"id"=>1234,
"title"=>"GKE + Gemma + Ollama: The Power Trio for Flexible LLM Deployment ",
"summary"=>nil,
"content"=>"

In today’s exploration, we delve into the intricacies of deploying a variety of LLMs, focusing particularly on Google Gemma. The platform of choice will be GKE with invaluable assistance from the Ollama framework. Our journey to achieving this milestone will be facilitated by the Open WebUI, which bears a remarkable resemblance to the original OpenAI ChatGPT prompt interface, ensuring a seamless and intuitive user experience.

Before getting into the nitty-gritty details, let’s address the elephant in the room: why pursue this route in the first place? To me, the rationale is crystal clear and can be distilled into several compelling factors:

  1. Cost-Effectiveness: Operating LLMs on public cloud infrastructure could potentially offer a more economical solution, especially for smaller organizations or research entities constrained by budgetary limitations. It’s essential, however, to underscore the conditional nature of this benefit, as platforms like Vertex AI Studio and the OpenAI Developer Platform already provide cost-effective, fully fledged, managed services. Vertex AI will also manage the life cycle and observability of your models. Bear that in mind.
  2. Customization and Flexibility: Ollama is crafted with customization, flexibility, and open-source principles at its core. Despite the comprehensive model offerings available through cloud providers’ model registries — Google’s being Model Garden, which features a more than comprehensive catalog — there may be scenarios where a specific model you’re interested in isn’t readily available. This is where Ollama steps in, offering a solution.
  3. Portability across environments: Ollama’s design is cloud- and platform-agnostic, granting the freedom to deploy it on any private or public platform that accommodates Docker, even on your own laptop. This stands in contrast to other powerful solutions like Vertex AI and SageMaker, which are inherently tied to their respective cloud environments. There is a reason Docker and Kubernetes took over the entire market, and the same holds true for x86.
  4. Privacy and Data Control: For those inclined towards harnessing fully open-source models, such as 🌋 LLaVA and Gemma, within a wholly private framework, this approach offers an optimal path to ensuring data privacy and full control over the deployment environment.
\"\"

The GKE Platform
Deploying Ollama and Open WebUI (Formerly Ollama WebUI)
GPU vs. CPU — a matter of speed
Ollama’s Current Limitations: A Deeper Dive
Key Takeaways

The GKE Platform

For this experiment, my GKE platform setup prioritized efficiency and performance:

  • GKE 1.27 (Regular channel): Ensures compatibility and access to recent Google Kubernetes Engine features.
  • Container-Optimized OS: Reduces node startup time for faster workload deployment (you can read more in my earlier article).
  • g2-standard-4 Node Pool (NVIDIA L4 GPU): Powerful combination of GPU and CPU resources, ideal for ML tasks. Benchmark results will illustrate the advantages.
  • Managed NVIDIA GPU drivers: Streamlines the setup process by integrating drivers directly into GKE; a seamless experience is just a gpu-driver-version flag away. Once the cluster is up, it’s also ready to go (a sample gcloud invocation is sketched right after this list).
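
For illustration, a cluster and node pool matching this setup could be created roughly as follows. The cluster name, node-pool name, and node counts are placeholders of my choosing rather than values from the article; the flags mirror the configuration described above.

# create a Regular-channel cluster in europe-west4 (names are placeholders)
gcloud container clusters create ollama-demo \
  --region europe-west4 \
  --release-channel regular \
  --num-nodes 1

# add a G2 node pool with one NVIDIA L4 and GKE-managed drivers
gcloud container node-pools create g2-l4-pool \
  --cluster ollama-demo \
  --region europe-west4 \
  --machine-type g2-standard-4 \
  --image-type COS_CONTAINERD \
  --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
  --num-nodes 1
  # optionally add --spot for the discount discussed below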

The NVIDIA L4 GPU packs a punch when it comes to raw specs, delivering robust processing capabilities for compute-intensive ML workloads:

  • 7680 Shader Processors, 240 TMUs, 80 ROPs, 60 RT cores, 240 Tensor Cores.
  • 24GB GDDR6 memory at 300GB/s bandwidth.
  • 485 teraFLOPs (FP8 throughput).

The G2 machine series is the underlying platform; based on Intel Cascade Lake, it provides excellent all-around processing to complement the GPU and keep it fed.

G2 supports Spot VMs, which offer substantial cost savings (approximately a 67% discount) for ML workloads that can tolerate interruptions.

Deploying Ollama and Open WebUI (Formerly Ollama WebUI)

The K8s ecosystem’s maturity has simplified the deployment process, now essentially a matter of executing helm install and kubectl apply commands. For Ollama, the deployment leverages a community-driven Helm Chart available on GitHub, outlining a canonical values.yaml file to guide the configuration:

ollama:
  gpu:
    # request a single NVIDIA GPU for the Ollama pod
    enabled: true
    type: 'nvidia'
    number: 1
  # models to pull automatically at startup
  models:
    - gemma:7b
    - llava:13b
    - llama2:7b

persistentVolume:
  # persistent storage for the downloaded models
  enabled: true
  size: 100Gi
  storageClass: "premium-rwo"
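
Applying these values is then a single helm install. The chart name and repository URL below are my assumption of the community chart being referenced (otwld/ollama-helm); the article itself only says “a community-driven Helm Chart available on GitHub”.

helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
# install into its own namespace with the values file shown above
helm install ollama ollama-helm/ollama \
  --namespace ollama \
  --create-namespace \
  -f values.yaml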

Conversely, for deploying Open WebUI, the choice veered towards an official Chart and Kustomize template from the community, offering a more fitting approach for this implementation:

open-webui/kubernetes/manifest at main · open-webui/open-webui

While Open WebUI offers manifests for Ollama deployment, I preferred the feature richness of the Helm Chart. After deployment, you should be able to access the Open WebUI login screen by navigating to the GCP Load Balancer’s IP address on port 8080.
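
As a sketch, applying the community Kustomize manifests might look like the following; the path inside the open-webui repository is an assumption based on the link above, and the Service type and port may need adjusting to expose the GCP Load Balancer on port 8080.

git clone https://github.com/open-webui/open-webui.git
kubectl apply -k open-webui/kubernetes/manifest/base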

\"\"

Simple checks in the ollama namespace should show all systems operational.
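
A quick way to verify this, for instance (the deployment name ollama is an assumption that depends on the Helm release name):

kubectl get pods,svc,pvc -n ollama
kubectl logs -n ollama deploy/ollama --tail=20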

\"\"

Let’s tackle a classic science question: Why is the sky blue?

\"\"

This is real-time footage — Gemma 7B on the NVIDIA L4 delivers results at lightning speed! Want to try it yourself? Deploying models on Ollama couldn’t be easier: just use ollama run gemma:7b.
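
If you would rather stay in the terminal than use the web UI, the same test can be run by exec-ing into the Ollama pod; again, the deployment name is an assumption tied to the Helm release:

kubectl exec -it -n ollama deploy/ollama -- ollama run gemma:7b "Why is the sky blue?"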

GPU vs. CPU — a matter of speed

Now that the platform is ready to rock, you know I can’t resist a good benchmark session 😉. I ran two types of benchmarks across different models:

  • The classic Why is the sky blue? question: Put to Gemma 2B and 7B, as well as LLaVA v1.6 7B and 13B. Gotta test those multimodal and unimodal LLMs!
  • What’s in this picture? for LLaVA v1.6 7B and 13B: Focusing on image analysis here.

Don’t worry, I’m not about to start a full-blown LLM showdown — that’s a whole different rabbit hole and way above my understanding. My goal was to track how different machine types impact speed and responsiveness.

\"\"
Price comparison for the europe-west4 region

Without GPU acceleration, inference performance depended entirely on raw CPU power and memory bandwidth. Naturally, I deployed Ollama without CPU or memory limits and verified full CPU utilization. However, inference tasks often become bottlenecked by memory bandwidth availability and memory architecture.

This first graph illustrates several key metrics:

  • total duration: How long the model takes to process the input and generate a response.
  • response_token/s: A measure of how quickly the model produces output.
  • monthly cost: The financial impact of running the chosen configuration for an entire month.
\"\"

A lot needs to be unpacked here, but I want to start with a warning: the performance numbers you are about to see are representative of this specific scenario only. The world of LLMs is so vast and fast-moving that this picture could be completely irrelevant in a matter of days, or under even slightly different scenarios.

GPU Dominance:

  • GPUs deliver drastically lower latency (higher tokens per second) than CPUs. Even 180 dedicated CPU cores at $12k/month can’t compete.
  • The NVIDIA L4 offers a 15% speed advantage over the older T4, with a 78% cost increase. Sustained Use Discounts were factored in.
  • While the A100 is lightning-fast, about three times faster than L4, its high price and focus on training make it overkill for most inference tasks. Yet it managed to answer in just shy of 3.6 seconds 🤯.

CPU Struggles:

  • Smaller CPUs are undeniably slow and surprisingly expensive.
  • Even cost-comparable CPUs (c3-highcpu-22 / c3d-highcpu-16) lag behind the L4 and T4 in throughput.
  • The largest CPUs (c3-standard-176 / c3d-standard-360) offer poor performance for their exorbitant cost.
  • C3 scales badly; this could be an issue with ollama/llama.cpp, my setup, or the C3 instances and their lack of vNUMA topology. Regardless, the price makes it pointless.

Now, looking at an image recognition prompt, this time the model of choice was LLaVA v1.6 with 13B parameters.

\"\"

The GPU’s performance advantage holds true here as well, demonstrating that CPUs simply can’t compete in this domain. Interestingly, the c3-standard-176 finally outperformed the c3-highcpu-22, which dispels any suspicions of bugs in C3 or my setup.

As per tradition, all results are publicly available at the following Google Sheet:

[ollama][medium] - GPU vs. CPU - Mar 28th 2024

Before discussing a few points about Ollama, I’d like to share the exact SHAs and tags used in this environment (a quick way to read these back from a running instance is sketched after the list). The AI world is moving so fast that anybody attempting to reproduce my work could discover a different landscape just a few weeks down the road:

  • ollama v0.1.29;
  • Gemma 2B SHA b50d6c999e59
  • Gemma 7B SHA 430ed3535049
  • LLaVA v1.6 7B SHA 8dd30f6b0cb1
  • LLaVA v1.6 13B SHA 0d0eb4d7f485
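
For reference, these digests can be read back from the running instance; the deployment name ollama is again an assumption tied to the Helm release:

kubectl exec -n ollama deploy/ollama -- ollama list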

And here is how the benchmarks were executed:

curl http://localhost:8080/api/generate -d \\
'{
"model": "gemma:7b",
"prompt": "Why is the sky blue?",
"stream": false,
"options": {"seed": 100}
}'
curl http://localhost:8080/api/generate -d \\
'{
"model": "llava:13b",
"prompt":"What is in this picture?",
"images": ["iVBORw0KGgoAAAANSUhEUgAAAG0AAABmCAYAAADBPx+VAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAA3VSURBVHgB7Z27r0zdG8fX743i1bi1ikMoFMQloXRpKFFIqI7LH4BEQ+NWIkjQuSWCRIEoULk0gsK1kCBI0IhrQVT7tz/7zZo888yz1r7MnDl7z5xvsjkzs2fP3uu71nNfa7lkAsm7d++Sffv2JbNmzUqcc8m0adOSzZs3Z+/XES4ZckAWJEGWPiCxjsQNLWmQsWjRIpMseaxcuTKpG/7HP27I8P79e7dq1ars/yL4/v27S0ejqwv+cUOGEGGpKHR37tzJCEpHV9tnT58+dXXCJDdECBE2Ojrqjh071hpNECjx4cMHVycM1Uhbv359B2F79+51586daxN/+pyRkRFXKyRDAqxEp4yMlDDzXG1NPnnyJKkThoK0VFd1ELZu3TrzXKxKfW7dMBQ6bcuWLW2v0VlHjx41z717927ba22U9APcw7Nnz1oGEPeL3m3p2mTAYYnFmMOMXybPPXv2bNIPpFZr1NHn4HMw0KRBjg9NuRw95s8PEcz/6DZELQd/09C9QGq5RsmSRybqkwHGjh07OsJSsYYm3ijPpyHzoiacg35MLdDSIS/O1yM778jOTwYUkKNHWUzUWaOsylE00MyI0fcnOwIdjvtNdW/HZwNLGg+sR1kMepSNJXmIwxBZiG8tDTpEZzKg0GItNsosY8USkxDhD0Rinuiko2gfL/RbiD2LZAjU9zKQJj8RDR0vJBR1/Phx9+PHj9Z7REF4nTZkxzX4LCXHrV271qXkBAPGfP/atWvu/PnzHe4C97F48eIsRLZ9+3a3f/9+87dwP1JxaF7/3r17ba+5l4EcaVo0lj3SBq5kGTJSQmLWMjgYNei2GPT1MuMqGTDEFHzeQSP2wi/jGnkmPJ/nhccs44jvDAxpVcxnq0F6eT8h4ni/iIWpR5lPyA6ETkNXoSukvpJAD3AsXLiwpZs49+fPn5ke4j10TqYvegSfn0OnafC+Tv9ooA/JPkgQysqQNBzagXY55nO/oa1F7qvIPWkRL12WRpMWUvpVDYmxAPehxWSe8ZEXL20sadYIozfmNch4QJPAfeJgW3rNsnzphBKNJM2KKODo1rVOMRYik5ETy3ix4qWNI81qAAirizgMIc+yhTytx0JWZuNI03qsrgWlGtwjoS9XwgUhWGyhUaRZZQNNIEwCiXD16tXcAHUs79co0vSD8rrJCIW98pzvxpAWyyo3HYwqS0+H0BjStClcZJT5coMm6D2LOF8TolGJtK9fvyZpyiC5ePFi9nc/oJU4eiEP0jVoAnHa9wyJycITMP78+eMeP37sXrx44d6+fdt6f82aNdkx1pg9e3Zb5W+RSRE+n+VjksQWifvVaTKFhn5O8my63K8Qabdv33b379/PiAP//vuvW7BggZszZ072/+TJk91YgkafPn166zXB1rQHFvouAWHq9z3SEevSUerqCn2/dDCeta2jxYbr69evk4MHDyY7d+7MjhMnTiTPnz9Pfv/+nfQT2ggpO2dMF8cghuoM7Ygj5iWCqRlGFml0QC/ftGmTmzt3rmsaKDsgBSPh0/8yPeLLBihLkOKJc0jp8H8vUzcxIA1k6QJ/c78tWEyj5P3o4u9+jywNPdJi5rAH9x0KHcl4Hg570eQp3+vHXGyrmEeigzQsQsjavXt38ujRo44LQuDDhw+TW7duRS1HGgMxhNXHgflaNTOsHyKvHK5Ijo2jbFjJBQK9YwFd6RVMzfgRBmEfP37suBBm/p49e1qjEP2mwTViNRo0VJWH1deMXcNK08uUjVUu7s/zRaL+oLNxz1bpANco4npUgX4G2eFbpDFyQoQxojBCpEGSytmOH8qrH5Q9vuzD6ofQylkCUmh8DBAr+q8JCyVNtWQIidKQE9wNtLSQnS4jDSsxNHogzFuQBw4cyM61UKVsjfr3ooBkPSqqQHesUPWVtzi9/vQi1T+rJj7WiTz4Pt/l3LxUkr5P2VYZaZ4URpsE+st/dujQoaBBYokbrz/8TJNQYLSonrPS9kUaSkPeZyj1AWSj+d+VBoy1pIWVNed8P0Ll/ee5HdGRhrHhR5GGN0r4LGZBaj8oFDJitBTJzIZgFcmU0Y8ytWMZMzJOaXUSrUs5RxKnrxmbb5YXO9VGUhtpXldhEUogFr3IzIsvlpmdosVcGVGXFWp2oU9kLFL3dEkSz6NHEY1sjSRdIuDFWEhd8KxFqsRi1uM/nz9/zpxnwlESONdg6dKlbsaMGS4EHFHtjFIDHwKOo46l4TxSuxgDzi+rE2jg+BaFruOX4HXa0Nnf1lwAPufZeF8/r6zD97WK2qFnGjBxTw5qNGPxT+5T/r7/7RawFC3j4vTp09koCxkeHjqbHJqArmH5UrFKKksnxrK7FuRIs8STfBZv+luugXZ2pR/pP9Ois4z+TiMzUUkUjD0iEi1fzX8GmXyuxUBRcaUfykV0YZnlJGKQpOiGB76x5GeWkWWJc3mOrK6S7xdND+W5N6XyaRgtWJFe13GkaZnKOsYqGdOVVVbGupsyA/l7emTLHi7vwTdirNEt0qxnzAvBFcnQF16xh/TMpUuXHDowhlA9vQVraQhkudRdzOnK+04ZSP3DUhVSP61YsaLtd/ks7ZgtPcXqPqEafHkdqa84X6aCeL7YWlv6edGFHb+ZFICPlljHhg0bKuk0CSvVznWsotRu433alNdFrqG45ejoaPCaUkWERpLXjzFL2Rpllp7PJU2a/v7Ab8N05/9t27Z16KUqoFGsxnI9EosS2niSYg9SpU6B4JgTrvVW1flt1sT+0ADIJU2maXzcUTraGCRaL1Wp9rUMk16PMom8QhruxzvZIegJjFU7LLCePfS8uaQdPny4jTTL0dbee5mYokQsXTIWNY46kuMbnt8Kmec+LGWtOVIl9cT1rCB0V8WqkjAsRwta93TbwNYoGKsUSChN44lgBNCoHLHzquYKrU6qZ8lolCIN0Rh6cP0Q3U6I6IXILYOQI513hJaSKAorFpuHXJNfVlpRtmYBk1Su1obZr5dnKAO+L10Hrj3WZW+E3qh6IszE37F6EB+68mGpvKm4eb9bFrlzrok7fvr0Kfv727dvWRmdVTJHw0qiiCUSZ6wCK+7XL/AcsgNyL74DQQ730sv78Su7+t/A36MdY0sW5o40ahslXr58aZ5HtZB8GH64m9EmMZ7FpYw4T6QnrZfgenrhFxaSiSGXtPnz57e9TkNZLvTjeqhr734CNtrK41L40sUQckmj1lGKQ0rC37x544r8eNXRpnVE3ZZY7zXo8NomiO0ZUCj2uHz58rbXoZ6gc0uA+F6ZeKS/jhRDUq8MKrTho9fEkihMmhxtBI1DxKFY9XLpVcSkfoi8JGnToZO5sU5aiDQIW716ddt7ZLYtMQlhECdBGXZZMWldY5BHm5xgAroWj4C0hbYkSc/jBmggIrXJWlZM6pSETsEPGqZOndr2uuuR5rF
169a2HoHPdurUKZM4CO1WTPqaDaAd+GFGKdIQkxAn9RuEWcTRyN2KSUgiSgF5aWzPTeA/lN5rZubMmR2bE4SIC4nJoltgAV/dVefZm72AtctUCJU2CMJ327hxY9t7EHbkyJFseq+EJSY16RPo3Dkq1kkr7+q0bNmyDuLQcZBEPYmHVdOBiJyIlrRDq41YPWfXOxUysi5fvtyaj+2BpcnsUV/oSoEMOk2CQGlr4ckhBwaetBhjCwH0ZHtJROPJkyc7UjcYLDjmrH7ADTEBXFfOYmB0k9oYBOjJ8b4aOYSe7QkKcYhFlq3QYLQhSidNmtS2RATwy8YOM3EQJsUjKiaWZ+vZToUQgzhkHXudb/PW5YMHD9yZM2faPsMwoc7RciYJXbGuBqJ1UIGKKLv915jsvgtJxCZDubdXr165mzdvtr1Hz5LONA8jrUwKPqsmVesKa49S3Q4WxmRPUEYdTjgiUcfUwLx589ySJUva3oMkP6IYddq6HMS4o55xBJBUeRjzfa4Zdeg56QZ43LhxoyPo7Lf1kNt7oO8wWAbNwaYjIv5lhyS7kRf96dvm5Jah8vfvX3flyhX35cuX6HfzFHOToS1H4BenCaHvO8pr8iDuwoUL7tevX+b5ZdbBair0xkFIlFDlW4ZknEClsp/TzXyAKVOmmHWFVSbDNw1l1+4f90U6IY/q4V27dpnE9bJ+v87QEydjqx/UamVVPRG+mwkNTYN+9tjkwzEx+atCm/X9WvWtDtAb68Wy9LXa1UmvCDDIpPkyOQ5ZwSzJ4jMrvFcr0rSjOUh+GcT4LSg5ugkW1Io0/SCDQBojh0hPlaJdah+tkVYrnTZowP8iq1F1TgMBBauufyB33x1v+NWFYmT5KmppgHC+NkAgbmRkpD3yn9QIseXymoTQFGQmIOKTxiZIWpvAatenVqRVXf2nTrAWMsPnKrMZHz6bJq5jvce6QK8J1cQNgKxlJapMPdZSR64/UivS9NztpkVEdKcrs5alhhWP9NeqlfWopzhZScI6QxseegZRGeg5a8C3Re1Mfl1ScP36ddcUaMuv24iOJtz7sbUjTS4qBvKmstYJoUauiuD3k5qhyr7QdUHMeCgLa1Ear9NquemdXgmum4fvJ6w1lqsuDhNrg1qSpleJK7K3TF0Q2jSd94uSZ60kK1e3qyVpQK6PVWXp2/FC3mp6jBhKKOiY2h3gtUV64TWM6wDETRPLDfSakXmH3w8g9Jlug8ZtTt4kVF0kLUYYmCCtD/DrQ5YhMGbA9L3ucdjh0y8kOHW5gU/VEEmJTcL4Pz/f7mgoAbYkAAAAAElFTkSuQmCC"],
"stream": false,
"options": {"seed": 100}
}'

As you can see, the results were recorded with the following settings (a sketch for deriving the reported metrics follows this list):

  • Direct Ollama API communication.
  • Streaming disabled.
  • Same seed across all prompts.
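
For reference, the metrics discussed above (total duration and response tokens per second) can be derived from the fields Ollama returns in a non-streaming /api/generate response, where durations are reported in nanoseconds. A minimal sketch with jq, reusing the same request:

curl -s http://localhost:8080/api/generate -d '{
  "model": "gemma:7b",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": {"seed": 100}
}' | jq '{total_duration_s: (.total_duration / 1e9),
         response_tokens_per_s: (.eval_count / .eval_duration * 1e9)}'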

Ollama’s Current Limitations: A Deeper Dive

While it’s important to remember that Ollama is a rapidly evolving project, it’s useful to examine some key constraints that power users should be aware of:

  • The Repository Bottleneck: Being locked into registry.ollama.ai stifles innovation and experimentation. Imagine if Docker had never expanded beyond a single registry! While a workaround might be possible, a native solution for diverse model sources would be a huge step forward, and the community has already made a proposal.
  • Missed Opportunities with Parallelism: Ollama’s sequential request handling limits its real-world throughput. Imagine a high-traffic scenario where users experience frustrating delays. The good news is that parallel decoding was merged in llama.cpp and pulled in during the v0.1.30 cycle — something to keep a close eye on is issue #358 open upstream.
  • The AVX512 Letdown and an Emerging Option: It’s disappointing that AVX512 optimizations don’t deliver the expected performance boost in Ollama. I even made an attempt at making it better before facing reality: AVX512 sucks, it’s slower than AVX2 😭 (of course, the core clock is more than halved), and “I Hope AVX512 Dies a Painful Death”. Intel AMX paints a brighter picture. Its competitive pricing, early benchmark results, and the potential to outpace GPUs in certain workloads make it an exciting alternative. On this topic, I strongly encourage a deep look at The Next Platform’s take on why AI inference will remain largely on CPUs.

Why AI Inference Will Remain Largely On The CPU

Key Takeaways

Deploying LLMs on GKE with Ollama offers a compelling option for users prioritizing customization, flexibility, potential cost savings, and privacy within their LLM solutions. This approach unlocks the ability to use models unavailable on commercial platforms and provides complete control over the deployment environment. Crucially, GPU acceleration is indispensable for optimal LLM performance, drastically outpacing even powerful CPU-based instances. However, it’s essential to stay mindful of Ollama’s current limitations, such as the registry dependency and sequential request handling, which may impact real-world scenarios. As Ollama continues to evolve, these limitations are likely to be addressed, further enhancing its potential.

I hope you had fun; this was a new journey for me too. If you have any questions, do not hesitate to leave a comment.

\"\"

GKE + Gemma + Ollama: The Power Trio for Flexible LLM Deployment 🚀 was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

",
"author"=>"Federico Iezzi",
"link"=>"https://medium.com/google-cloud/gke-gemma-ollama-the-power-trio-for-flexible-llm-deployment-5f1fa9223477?source=rss----e52cf94d98af---4",
"published_date"=>Fri, 29 Mar 2024 03:27:33.000000000 UTC +00:00,
"image_url"=>nil,
"feed_url"=>"https://medium.com/google-cloud/gke-gemma-ollama-the-power-trio-for-flexible-llm-deployment-5f1fa9223477?source=rss----e52cf94d98af---4",
"language"=>nil,
"active"=>true,
"ricc_source"=>"feedjira::v1",
"created_at"=>Sun, 31 Mar 2024 20:53:35.961778000 UTC +00:00,
"updated_at"=>Mon, 21 Oct 2024 16:56:26.641693000 UTC +00:00,
"newspaper"=>"Google Cloud - Medium",
"macro_region"=>"Blogs"}