
Backpressure - Abandoned shopping carts

Scenario

The business team complains that revenue from sales on Online Boutique has gone down. The number of users appears to be mostly constant, but some users have stopped buying from the site. As the lead DevOps engineer in the company, you are tasked with investigating the issue.

Deployment

To simulate the code change and deployment of a new version of the recommendationservice, please run the following command:

kubectl label -n glasnostic-online-boutique pod -l app=recommendationservice \
ENABLE_USERBASED_RECOMMEND=true

It will take a few seconds to update the container image. Use kubectl get pods -n glasnostic-online-boutique to verify the successful deployment.

What’s going on with checkouts?

From the business team we know the issue has to do with the checkout process, so let’s start by looking at what checkoutservice is doing.

  1. Make sure the Sources perspective is selected in the top menu, then find and click the checkoutservice in the service map.

Select checkoutservice

  2. Click on the Metrics tab and notice how the aggregate latency (L) between checkoutservice on the Sources side and its dependencies on the Destinations side is abnormally high.

Latency graph

  3. Using the Metrics menu in the menu bar, choose Latency as the key metric.

Metrics dropdown

  4. Then look at the metrics in the Sources and Destinations columns. On the Sources side, checkoutservice has an average latency of 5.5 seconds, but looking at the Destinations side, we see that productcatalogservice takes on average 9.3 seconds to complete, while other destinations incur comparatively minuscule latencies. This extremely high latency for requests going to productcatalogservice might very well be the reason for the drop in checkout completions!

Latency graph

Let’s see what’s going on with productcatalogservice.

  5. Click Cancel for now.

The trouble with productcatalogservice

This time, we want to see who is talking to productcatalogservice and how much.

  1. Choose Destinations from the Perspective menu in the menu bar.

Perspective menu

  2. Find and select productcatalogservice in the service map (select all instances if there is more than one), then select the Metrics tab.

note

Since we are now in the Destinations perspective, the columns are swapped: the Destinations column is on the left and the Sources column on the right. For each destination, the Sources column shows which sources interact with it, ordered by the current metric.

  3. Note how, overall, interactions with productcatalogservice have very high latency (L) and concurrency (C). Note too that, while not as extreme as the latencies between checkoutservice and productcatalogservice, the latencies between frontend and productcatalogservice are also too high. Finally, latencies between recommendationservice and productcatalogservice are high as well.

Metrics

  4. Using the Metrics menu in the menu bar, choose Requests as the key metric.

  5. Notice how the number of requests between recommendationservice and productcatalogservice is unexpectedly high. Apparently, recommendationservice is hammering productcatalogservice, causing excessive concurrency between the two.

Recommendation requests

  6. Using the Metrics menu in the menu bar, choose Concurrency as the key metric.

  7. Notice how the concurrency between recommendationservice and productcatalogservice is also unexpectedly high. The back-of-the-envelope sketch after these steps shows why a high request rate combined with high latency translates into high concurrency.
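As a rough rule of thumb (Little’s Law), the number of requests in flight between two services is approximately the request rate multiplied by the average latency. The numbers below are hypothetical and are not taken from this deployment; they only illustrate why a chatty caller hitting a slow dependency shows up as high concurrency:

request_rate_per_s = 100   # hypothetical request rate from recommendationservice
avg_latency_s = 9.3        # order of magnitude of the productcatalogservice latency seen above

# Little's Law: concurrency ≈ arrival rate × average time spent per request
concurrency = request_rate_per_s * avg_latency_s
print(f"~{concurrency:.0f} requests in flight")   # ~930 concurrent requests

# Conversely, capping the number of in-flight requests forces the caller to
# slow down, which is exactly what backpressure does.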

At this point, we could dive head-first into finding the cause for this behavior and then readying a patch to deploy, but that would of course take some time, during which shoppers would continue to abandon their carts at checkout. It would be much better to contain the situation while the team diagnoses what’s going on by exerting some backpressure against recommendationservice.

  8. Click Cancel for now.

Taking control

checkoutservice is critical for completing purchases and thus takes precedence over recommendationservice, which merely shows related products. We’ll therefore apply some backpressure against the latter to free up productcatalogservice capacity for the former—at least until the team gets a chance to fix the resource behavior of recommendationservice.

  1. Switch back to Source perspective by choosing Sources from the Perspective menu in the menu bar.

  2. Click the Create View button and make sure the Definition tab is selected.

  3. Enter recommendation* into the Source column and productcatalog* into the Destination column, hitting Return each time.

note

We are using wildcards (*) here because we want this view to apply to all instances—past, present, and future—and because the exact instance of the pod will naturally change over time.
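For illustration only, and assuming the wildcards behave like ordinary shell-style globs (an assumption on our part; the instance names below are made up), a pattern such as recommendation* matches any instance whose name starts with that prefix:

from fnmatch import fnmatch

# Hypothetical instance names; actual pod names will differ and change over time.
instances = [
    "recommendationservice-5f8c9b7d4-abcde",
    "recommendationservice-5f8c9b7d4-fghij",
    "productcatalogservice-7d4b6c8f9-klmno",
]

# Shell-style glob match: everything starting with "recommendation" matches.
print([name for name in instances if fnmatch(name, "recommendation*")])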

  4. Click the Metrics tab and enter "Recommendation service backpressure" in the name box.

  5. To set a connection pool-aware policy for requests from recommendationservice instances to productcatalogservice instances, click Set Policy for concurrency (C) and enter "30", then click Set (or hit Return). This policy will limit concurrent requests to 30.

  6. Because exerting backpressure will increase latencies and thus increase the likelihood that a response will no longer be needed by the time it arrives, let’s also shed long-running requests. Click Set Policy for latency (L) and enter "1000" to limit request durations to 1.0 seconds, then click Set. The sketch after these steps gives a rough mental model of what these two policies do together.

Set policy

  7. Any policies you create are committed to a git repository for auditing. As with regular git workflows, click Commit and then Push on the next screen to push the changes live.
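Glasnostic applies these policies in the network, so no application code changes are needed. Purely as a mental model, though, the combined effect on calls from recommendationservice to productcatalogservice is roughly what the following Python sketch does: a semaphore caps in-flight requests at 30, and a one-second timeout sheds requests that run too long. The call_product_catalog function is a hypothetical stand-in, not actual Online Boutique code.

import asyncio

MAX_CONCURRENT = 30        # concurrency (C) policy
REQUEST_TIMEOUT_S = 1.0    # latency (L) policy: shed requests over 1000 ms

async def call_product_catalog(product_id):
    # Hypothetical stand-in for the real gRPC call to productcatalogservice.
    await asyncio.sleep(0.1)
    return {"id": product_id}

async def limited_call(semaphore, product_id):
    async with semaphore:  # at most MAX_CONCURRENT requests in flight
        try:
            return await asyncio.wait_for(
                call_product_catalog(product_id), timeout=REQUEST_TIMEOUT_S)
        except asyncio.TimeoutError:
            return None    # request shed: the response is no longer worth waiting for

async def main():
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    calls = (limited_call(semaphore, str(i)) for i in range(100))
    results = await asyncio.gather(*calls)
    print(sum(r is not None for r in results), "of", len(results), "requests completed")

asyncio.run(main())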

Confirm checkoutservice has recovered

Once the policies have been pushed out onto the network, you should see their effect in the new data points as they roll in.

  1. Staying in the Recommendation service backpressure view, you should be able to see that both concurrency and latency (C and L) are now actively controlled as intended.

Policy hit demonstration

  2. Now let’s confirm that checkoutservice has recovered. Click the back button to return to the Home view.

  3. Click Create View. Again, since we want to keep this view around for a while, enter checkoutservice* in the Source column and * in the Destination column so we capture all instances of services past, present, and future.

  4. Click the Metrics tab, name it "Checkout interactions", click Commit and then Push on the next page.

  5. Using the Metrics menu in the menu bar, choose Latency as the key metric.

  6. The Destinations column should now confirm that productcatalogservice latency has gone down from 9.3 seconds to just around 100 ms and that everything appears to be running smoothly.

Updated Latency Value

Summary

We started out by examining the checkoutservice and discovered that its productcatalogservice dependency exhibited unacceptable latencies. We then looked at which services might put undue load on productcatalogservice and identified recommendationservice as the culprit. We then exerted backpressure against recommendationservice and confirmed that this action allowed checkoutservice to recover.

That’s it! This is how you can use Glasnostic to quickly detect issues, identify their causes, and fix them.