
Streaming: a skill gap?

Michal Charemza 11 March, 2020 | 4 min read

I've noticed a bit of a skill gap: I think a lot of developers are not able to code up "streaming" solutions to problems.

However, streaming can often be useful, even needed, in what are now run-of-the-mill web applications; and wonderfully, we often don't need anything fancier than the tools already being used: we just need to know how to use them.

What is streaming?

Streaming is any situation in which you process data concurrently with receiving it. The processing can be analyzing the data, or just forwarding it onwards.
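A minimal sketch of the idea in Python, with a generator standing in for a network source: each chunk is processed as soon as it "arrives", so at no point is the whole dataset in memory.

```python
def incoming_chunks():
    # Stand-in for a network source: yields data as it "arrives".
    for _ in range(4):
        yield b"x" * 1024

total = 0
for chunk in incoming_chunks():  # process each chunk as soon as it arrives
    total += len(chunk)          # ...here, just counting bytes

print(total)  # 4096
```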

What are the benefits of streaming?

There are two main [potential] benefits.

Speed

If you start processing the data sooner, before it's all received, then you [might] finish sooner.

Support higher concurrency / size limits

Say you would like users to be able to upload 500 MB files: in these days of video and hi-res images, this isn't a far-fetched requirement, even for a standard web application. If you don't forward the uploaded data onwards while it's still being uploaded, just a few users uploading concurrently could use all the memory on a server.
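The forwarding pattern can be sketched as a simple bounded-memory copy loop; here in-memory buffers stand in for the upload and the onward connection, and the `forward` helper is illustrative rather than from any particular framework.

```python
import io

def forward(source, sink, chunk_size=64 * 1024):
    """Copy from source to sink in fixed-size chunks, so memory use is
    bounded by chunk_size rather than by the total upload size."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sink.write(chunk)

# Usage, with in-memory stand-ins for the upload and the onward connection
upload = io.BytesIO(b"a" * 200_000)
onward = io.BytesIO()
forward(upload, onward)
print(len(onward.getvalue()))  # 200000
```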

[You can upload directly from a browser to the underlying data store. For example, to S3 using presigned URLs. However, this has its own set of drawbacks, omitted here for brevity.]

What are typical problems with streaming?

Streaming is not a perfect/one-size-fits-all solution: it does have its downsides.

Testing

You're testing an upload with a 5 KB file, and it works. Are you sure it's streaming and will work with a 5 GB file? There are two options that I'm aware of.

  • Actually test a 5 GB file [making sure you have less than 5 GB of memory available]. While this is quite a good "real" test, it can be slow.
  • Hook into both sides of the streaming process, and ensure that the target receives data before the source has sent all of its data. You can do this with smaller data, and so such a test can be quick. However, this can be more brittle with respect to refactorings, i.e. the test can fail while the production behaviour continues to work.
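A sketch of the second option: instrument both ends of a (here, trivial) pipeline to record the order of sends and receives, then assert that the target received something before the source had finished. The event-recording approach is illustrative; real tests would hook into whatever source and sink the pipeline actually uses.

```python
events = []

def source_chunks():
    # Instrumented source: records each send before yielding.
    for i in range(3):
        events.append(("sent", i))
        yield b"data"

def target_write(chunk):
    # Instrumented target: records each receive.
    events.append(("received", chunk))

for chunk in source_chunks():      # the "pipeline" under test
    target_write(chunk)

first_receive = events.index(("received", b"data"))
last_send = max(i for i, e in enumerate(events) if e[0] == "sent")
assert first_receive < last_send   # target got data before source finished
```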

Errors

Handling errors, i.e. communicating and responding to them, can be more difficult.

Conveniently, HTTP has some of this built-in. When streaming an HTTP body with a content-length header specifying the number of bytes, the receiver knows an error has occurred if the connection closes before that many bytes have arrived. If transfer-encoding: chunked is used, the receiver knows there has been an error if it doesn't receive a 0-length chunk at the end.
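The content-length check is simple enough to sketch directly; `body_complete` is a hypothetical helper, not part of any HTTP library.

```python
def body_complete(declared_length, received_bytes):
    # With a content-length header, a short body means the connection
    # failed mid-stream.
    return len(received_bytes) == declared_length

assert body_complete(4, b"abcd")             # all bytes arrived
assert not body_complete(1000, b"x" * 700)   # connection closed early
```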

It's not perfect though: there is no way to send an HTTP status code once the body has begun to stream. But for many situations, this is enough.

What to do when an error has occurred may be more tricky. With a non-streaming multi-stage pipeline, if one part fails, you can usually retry because you have the source bytes to retry with. However, if streaming, the bytes have gone. To retry, you have to build in a mechanism to re-retrieve them from the source.
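One way to build in such a mechanism: retry with a function that re-opens the source, rather than with the bytes themselves, since those are gone. The `retry_streaming` helper and the flaky processing step below are hypothetical, for illustration only.

```python
def retry_streaming(open_source, process, attempts=3):
    # Since streamed bytes can't be replayed, each attempt re-retrieves
    # the data by calling open_source() for a fresh stream.
    for attempt in range(attempts):
        try:
            return process(open_source())
        except IOError:
            if attempt == attempts - 1:
                raise

calls = []

def open_source():
    calls.append(True)             # count how many times we re-open
    return iter([b"a", b"b"])

def flaky_process(stream):
    if len(calls) < 2:             # simulate a failure on the first attempt
        raise IOError("connection dropped")
    return b"".join(stream)

result = retry_streaming(open_source, flaky_process)
print(result)  # b'ab'
```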

Complexity

Especially when considering error handling, retrying, or say, efficiently dealing with bandwidth differences/variation in different parts of the stream, there could be more complexity compared to a non-streaming solution.

This being said, a) you may not need to implement such things [e.g. OS-provided TCP buffers may adequately compensate for bandwidth variation], and b) I suspect the complexity is sometimes overstated and conflated with unfamiliarity [although it would be naive to think this isn't a problem, as mentioned below].

Performance

Ironically, there might be a performance penalty compared to non-streaming solutions due to what could be radically different operations / orders of operations. This could be especially true if using streaming for smaller amounts of data.

Homogeneity

Each part of the pipeline needs to support streaming. It's not the default in a lot of cases, which is unfortunate: you can use code that supports streaming to process data in a non-streaming way [by just using a single "chunk"], but it's impossible to do the opposite.
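The single-"chunk" trick can be shown with a couple of lines: code written for streams handles non-streamed data as a one-chunk stream, whereas the reverse (feeding a stream to code that wants all the bytes up front) requires buffering everything first.

```python
def total_length(chunks):
    # Stream-friendly: consumes chunks one at a time.
    return sum(len(chunk) for chunk in chunks)

streamed = total_length(iter([b"abc", b"def"]))  # many chunks
buffered = total_length([b"abcdef"])             # one chunk: same code works
assert streamed == buffered == 6
```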

Unfamiliarity

Streaming has an unfortunate problem: it's the skill gap itself.

Since fewer developers are familiar with it, issues are less likely to be spotted in code reviews, streaming behaviour may be accidentally broken [if there aren't appropriate tests on it], there are fewer people to ask for help, and unfortunately, any help that is given has a higher chance of being misleading.

This is admittedly a bit of a chicken/egg situation!

What can I do?

"I keep hearing my mother say, practice, Harry, practice!"
— Harry Kim, Star Trek: Voyager

Wonderfully, I think you can get a lot of valuable experience from just a few small practice web-based projects.

  • A GET endpoint that responds with a generated HTTP response of several GBs, just of some fake data.
  • A GET endpoint that responds with a file from the filesystem of several GBs. Try with both transfer-encoding: chunked and with a specified content-length.
  • Proxying a file to or from S3 through a server. Try with a plain HTTP client, not just one that is AWS-aware such as Boto3.
  • Downloading a Postgres table of several GBs. Try with just a single query. Try responding with CSV or JSON.
  • Accept a large CSV upload and calculate some basic stats on the columns while it is being uploaded, e.g. min, max, mean, standard deviation.
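The first exercise can be sketched as a WSGI app (one possible choice; any framework that accepts an iterator as a response body would do): because the body is a generator, chunks are produced and sent on demand, and the full ~1 GB response never exists in memory.

```python
def fake_data_app(environ, start_response):
    chunk = b"fake-data " * 100               # 1000 bytes per chunk
    num_chunks = 1024 * 1024                  # ~1 GB total, never held at once
    start_response("200 OK", [
        ("Content-Type", "application/octet-stream"),
        ("Content-Length", str(len(chunk) * num_chunks)),
    ])
    return (chunk for _ in range(num_chunks)) # generated lazily

# Serve with:
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, fake_data_app).serve_forever()
```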

Once you have done these, you'll be in a much better position to weigh up the trade-offs and decide whether a streaming solution is right for any given real-world project. At the very least, you'll be better placed to review colleagues' streaming-based code.

Originally published on charemza.name
