The business team complains that revenue from sales on Online Boutique has gone down. The number of users appears to be mostly constant, but some users have stopped buying from the site. As the lead DevOps engineer in the company, you are tasked with investigating the issue.
To simulate the code change and deployment of a new version of recommendationservice, please run the following command:
kubectl label -n glasnostic-online-boutique pod -l app=recommendationservice \
It will take a few seconds to update the container image. Use kubectl get pods -n glasnostic-online-boutique to verify the successful deployment.
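"Verify the successful deployment" here means every pod reports STATUS "Running" with all containers READY (n/n). As a sketch, a small filter can flag any pod that is not yet ready; the pod names in the sample below are made up, and in practice you would pipe real output into it with `kubectl get pods -n glasnostic-online-boutique --no-headers | not_ready`:

```shell
# not_ready: print the name of every pod that is not fully ready.
# Reads `kubectl get pods --no-headers`-style lines on stdin.
not_ready() {
  awk '{ split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") print $1 }'
}

# Illustrative sample input (pod names are assumptions, not real output):
not_ready <<'EOF'
checkoutservice-7d9f8c5b-xk2lp          1/1   Running             0   5m
recommendationservice-5b6c7d9f-qr7tp    0/1   ContainerCreating   0   8s
EOF
# prints: recommendationservice-5b6c7d9f-qr7tp
```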
What’s going on with checkouts?
From the business team we know the issue has to do with the checkout process, so let's start by looking at what checkoutservice is doing.
- Make sure the Sources perspective is selected in the top menu, then find and click checkoutservice in the service map.
- Click on the Metrics tab and notice how the aggregate latency (L) between checkoutservice on the Sources side and its dependencies on the Destinations side is abnormally high.
- Using the Metrics menu in the menu bar, choose Latency as the key metric.
- Then look at the metrics in the Sources and Destinations columns. On the Sources side, checkoutservice has an average latency of 5.5 seconds, but looking at the Destinations side, we see that productcatalogservice takes 9.3 seconds on average to complete, while other destinations incur comparatively minuscule latencies. This extremely high latency for requests going to productcatalogservice might very well be the reason for the drop in checkout completions!
Let's see what's going on with productcatalogservice.
- Click Cancel for now.
The trouble with productcatalogservice
This time, we want to see who is talking to productcatalogservice and how much.
- Choose Destinations from the Perspective menu in the menu bar.
- Find and select productcatalogservice in the service map (select all instances if there is more than one), then select the Metrics tab.
Since we are in the Destinations perspective, the Destinations column is now on the left and the Sources column on the right. The reading is inverted accordingly: for each destination, the Sources column shows which sources interact with it, ordered by the current metric.
- Note how, overall, interactions with productcatalogservice have very high latency (L) and concurrency (C). Note too that, while the latencies between checkoutservice and productcatalogservice are the most extreme, the latencies between productcatalogservice and its other sources are also too high.
- Using the Metrics menu in the menu bar, choose Requests as the key metric.
- Notice how the number of requests between recommendationservice and productcatalogservice is unexpectedly high. Apparently, recommendationservice is flooding productcatalogservice with requests, causing excessive concurrency between the two.
- Using the Metrics menu in the menu bar, choose Concurrency as the key metric.
- Notice how the concurrency between recommendationservice and productcatalogservice is also unexpectedly high.
At this point, we could dive head-first into finding the cause of this behavior and then readying a patch to deploy, but that would of course take some time, during which carts would continue to be abandoned during checkout. It would be much better to contain the situation while the team diagnoses what's going on by exerting some backpressure against recommendationservice.
- Click Cancel for now.
checkoutservice is critical for completing purchases and thus takes precedence over recommendationservice, which merely shows related products. We'll therefore apply some backpressure against the latter to free up productcatalogservice capacity for the former, at least until the team gets a chance to fix the resource behavior of recommendationservice.
- Switch back to the Sources perspective by choosing Sources from the Perspective menu in the menu bar.
- Click the Create View button and make sure the Definition tab is selected.
- Enter recommendation* into the Source column and productcatalog* into the Destination column, hitting Return each time.
We are using wildcards (*) here because we want this view to apply to all instances—past, present, and future—and because the exact instance of the pod will naturally change over time.
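The wildcards behave like familiar shell glob patterns, which is what makes them cover every replica regardless of its generated suffix. A quick sketch of the idea using plain shell pattern matching (the pod names below are assumptions for illustration):

```shell
# matches <pattern> <name>: report whether a glob pattern covers a name,
# the way recommendation* covers every recommendationservice instance.
matches() {
  case "$2" in
    $1) echo yes ;;
    *)  echo no  ;;
  esac
}

matches 'recommendation*' 'recommendationservice-5b6c7d9f-qr7tp'   # yes
matches 'productcatalog*' 'productcatalogservice-7f9d8c5b-xk2lp'   # yes
matches 'recommendation*' 'checkoutservice-7d9f8c5b-ab1cd'         # no
```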
- Click the Metrics tab and enter "Recommendation service backpressure" in the name box.
- To set a connection pool-aware policy for requests from recommendationservice to productcatalogservice instances, click Set Policy for concurrency (C) and enter "30", then click Set (or hit Return). This policy will limit concurrent requests to 30.
- Because exerting backpressure will increase latencies and thus increase the likelihood that a response will no longer be needed, let's also shed long-running requests. Click Set Policy for latency (L), enter "1000" to limit request durations to 1.0 seconds, and click Set.
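The combined effect of the two policies can be sketched as a simple admission decision. This is an illustrative model only, not Glasnostic's actual implementation; the function name is made up, and only the two thresholds (30 concurrent requests, 1000 ms) come from the values set above:

```shell
# Hypothetical sketch of the two policies: cap concurrency at 30 and
# shed requests that have already run longer than 1000 ms.
CONCURRENCY_LIMIT=30
LATENCY_LIMIT_MS=1000

# decide <in_flight> <elapsed_ms>: print ADMIT, QUEUE, or SHED
decide() {
  in_flight=$1
  elapsed_ms=$2
  if [ "$elapsed_ms" -gt "$LATENCY_LIMIT_MS" ]; then
    echo SHED       # response is likely no longer needed
  elif [ "$in_flight" -lt "$CONCURRENCY_LIMIT" ]; then
    echo ADMIT      # capacity available on productcatalogservice
  else
    echo QUEUE      # backpressure: hold until a slot frees up
  fi
}

decide 12 250    # prints ADMIT
decide 30 250    # prints QUEUE
decide 30 1500   # prints SHED
```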
- Any policies you create are committed to a git repository for auditing. As with regular git workflows, click Commit and then Push on the next screen to push the changes live.
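Conceptually, the Commit and Push buttons mirror an ordinary git workflow, which is what makes every policy change auditable. The sketch below demonstrates the idea in a throwaway local repository; the policy file name and its contents are assumptions, not Glasnostic's actual repository layout:

```shell
# Throwaway repo standing in for the policy audit repository.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "devops@example.com"
git config user.name "DevOps"

# Hypothetical representation of the view and its two policies.
cat > recommendation-backpressure.txt <<'EOF'
source: recommendation*
destination: productcatalog*
concurrency: 30
latency_ms: 1000
EOF

git add recommendation-backpressure.txt
git commit -q -m "Recommendation service backpressure"
git log --oneline    # the policy change is now part of the audit trail
```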
checkoutservice has recovered
Once the policies have been pushed out onto the network, you should see their effect in the new data points as they roll in.
- Staying in the Recommendation service backpressure view, you should be able to see that both concurrency and latency (C and L) are now actively controlled as intended.
Now let's confirm that checkoutservice has recovered.
- Click the back button to return to the Home view.
- Click Create View. Again, since we want to keep this view around for a while, enter checkoutservice* into the Source column and * into the Destination column so we capture all instances of services past, present, and future.
- Click the Metrics tab, name the view "Checkout interactions", then click Commit and Push on the next page.
- Using the Metrics menu in the menu bar, choose Latency as the key metric.
The Destinations column should now confirm that productcatalogservice latency has dropped from 9.3 seconds to around 100 ms and that everything appears to be running smoothly.
We started out by examining checkoutservice and discovered that its productcatalogservice dependency exhibited unacceptable latencies. We then looked at which services might be putting undue load on productcatalogservice and identified recommendationservice as the culprit. We then exerted backpressure against recommendationservice and confirmed that this action allowed checkoutservice to recover.
That’s it! This is how you can use Glasnostic to quickly detect issues, identify their causes, and fix them.