The End of Rotoscoping.
Electric Sheep are automating rotoscoping with AI/ML

Nobody likes rotoscoping. Fortunately you won’t have to for much longer.

Read below for an introduction and overview of rotoscoping, or click here to jump to how we’re solving it.


The History of Rotoscoping

Rotoscoping Today

Rotoscoping Tomorrow

Beyond Rotoscoping

Final Words

The History of Rotoscoping

Rotoscoping is a technique for tracing over objects frame by frame. In most modern use cases, it is a technical process for separating people from backgrounds.

The term rotoscoping comes from a piece of equipment called the rotoscope, invented by animator Max Fleischer in 1914.

Fleischer was granted a patent for this technique in 1917.

Fig 3 from Patent application "Method of Producing Moving-Picture Cartoons"

Fleischer projected a film of his brother David (a clown from Coney Island) onto a glass panel, and then traced it frame by frame. David’s clown character was known as “Koko the Clown” and was the basis of the Out of the Inkwell animated series that Fleischer made famous with this technique.

Soon after Fleischer’s patent expired in 1934, Disney began filming live actors performing scenes as references for character movement, then rotoscoping over the footage to make animated films.

This technique was used in animated films such as Snow White and the Seven Dwarfs (1937), Cinderella (1950), Alice in Wonderland (1951), and Sleeping Beauty (1959).

The process next evolved through Bob Sabiston, who developed the Rotoshop animation software. Rotoshop enabled ‘interpolated rotoscoping’: an artist hand-draws a range of keyframes and the software morphs between them, rather than painstakingly rotoscoping every frame by hand.

Sabiston eventually went on to work on ‘A Scanner Darkly’, which, for good reason, is the film most commonly mentioned around rotoscoping.

The computer software aided the process, and made rotoscoping viable for achieving a creative look, but it still took a long time.

The exact timings aren’t clear (as is often the case with specific VFX budget breakdowns), but in an interview the producer Tommy Pallotta said: “We were thinking it was going to take about 350 man hours per minute of material… And we ended up being pretty off on that… it took a lot longer…”

Still from "A Scanner Darkly"

Rotoscoping Today

Today, most rotoscoping is done for technical rather than creative reasons: for example, separating an actor so that new backgrounds or visual effects can be composited in.

As a result it needs to be pixel-perfect on every frame; many tools attempt to interpolate, but none do so accurately enough.

When shooting against a green screen or blue screen, chroma keying is a useful place to start.

Chroma keying takes a band of colours and reduces its opacity so that another layer can be composited behind it. But it is rarely, if ever, a one-stop solution.
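As an illustrative sketch (not any particular application’s keyer), the “band of colours to opacity” idea can be written as a distance-from-key-colour threshold over normalised RGB; the `tol` and `soft` parameters here are made-up illustrative values, not production settings:

```python
import numpy as np

def chroma_key_alpha(rgb, key=(0.0, 1.0, 0.0), tol=0.4, soft=0.2):
    """Naive chroma key: pixels close to the key colour become
    transparent (alpha 0); `soft` widens the cutoff into a gradual
    edge. `tol` and `soft` are illustrative, not production values."""
    dist = np.linalg.norm(rgb - np.asarray(key), axis=-1)
    return np.clip((dist - tol) / soft, 0.0, 1.0)
```

A real keyer works in a more perceptual colour space and handles spill separately; this shows only the core idea at its simplest.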

Due to the time-sensitive nature of an actual film set, it is common to shoot without the green screen covering the entire frame, or in otherwise suboptimal conditions; as a result, manual matte painting and rotoscoping get pushed into post production.

Rotoscoping can be tricky for various reasons: the green/blue screen may not cover the full actor or shot, the lighting spill on the actor’s face from the screen or environment may be too strong, or a green/blue screen may not be used at all.

Rotoscoping can be done in modern applications such as Flame, Nuke, and Adobe After Effects, or even in non-linear editors like DaVinci Resolve; however, the process is still very manual and is often done frame by frame by VFX artists.

Rotoscoping Tomorrow

Ultimately, the next step in efficiency for rotoscoping, and for broader image processing in film and TV, is using object detection and image matting techniques driven by machine learning.

This is how Electric Sheep is solving this problem.  

Other service offerings aimed at the professional market are delivered as plugins inside host applications and are constrained by the user’s hardware, which makes them slow and ineffectual.

Our strategy is to use large-scale GPU compute and focus on delivering a significant quality increase.

We are also firm believers in the MovieLabs 2030 Vision, and we want to be part of the wave of services moving the applications to the media: reading directly from a cloud archive of the original camera footage (OCF) and delivering back to the vendor, so that all vendors work from the latest version without the large-file versioning overheads and constant upload/download problems faced today.

Getting back to rotoscoping:

Generating a pixel accurate luma or alpha matte is a complicated task for a machine that can be tackled in many ways.

The complexities arise from several angles:

1. What is a person?

Understanding and recognising objects in a frame as a ‘person’, whose extremities such as hair and fingers must be captured in great detail, requires some form of object detection and image segmentation. This means breaking the image down into its constituent objects (figure one below), then drawing an accurate matte around them.

Object detection on a frame.

The system must also be able to recognise and matte partial views of a person: for example, when the camera pans and only a leg is in shot.
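As a toy stand-in for the segmentation step (real systems use learned instance segmentation, not flood fill), separating a binary mask into distinct blobs looks like this:

```python
import numpy as np

def label_components(mask):
    """Label 4-connected components of a boolean mask: a toy
    illustration of splitting a frame into constituent objects."""
    labels = np.zeros(mask.shape, dtype=int)
    h, w = mask.shape
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] == 0:
                count += 1
                stack = [(i, j)]  # flood fill from this seed pixel
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and mask[y, x] and labels[y, x] == 0:
                        labels[y, x] = count
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return labels, count
```

A learned detector additionally assigns a class (‘person’, ‘chair’, …) to each region, which is where the hard part begins.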

2. Extensions of a person?

Understanding a hat as an extension of a person, in the same way that a costume with a cape or a backpack is, quickly becomes a philosophical question.

If they are sitting on a chair, is that considered part of their silhouette? What about their reflection in glass? A walking cane they lean on would be matted, but would a fence they lean on need the fence matted too?

With this algorithm we are focusing only on people and not considering external objects, but if such objects were considered, there would also be questions about which ones count as ‘important’.

3. Determining depth?

How do you define to a machine what the background is? To determine which objects are background and which are foreground, you need depth analysis, and typically some form of trimap, which classifies every pixel into one of three values: ‘foreground’, ‘background’, or ‘unknown’. Generating this from a still frame is costly and often inaccurate. Lidar could help inform it, but again depth, and by extension ‘the background’, depends on the context of a shot.
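One common way to build such a trimap (sketched here with a hypothetical `band` width) is to erode and dilate a rough binary segmentation; the ring between the two becomes the ‘unknown’ region handed to the matting stage:

```python
import numpy as np

def dilate(mask, steps=1):
    """Grow a boolean mask by `steps` pixels (4-connected)."""
    m = mask.copy()
    for _ in range(steps):
        grown = m.copy()
        grown[1:, :] |= m[:-1, :]
        grown[:-1, :] |= m[1:, :]
        grown[:, 1:] |= m[:, :-1]
        grown[:, :-1] |= m[:, 1:]
        m = grown
    return m

def make_trimap(mask, band=2):
    """255 = sure foreground, 0 = sure background, 128 = unknown."""
    sure_fg = ~dilate(~mask, band)   # erosion = dilating the complement
    maybe_fg = dilate(mask, band)
    trimap = np.zeros(mask.shape, dtype=np.uint8)
    trimap[maybe_fg] = 128
    trimap[sure_fg] = 255
    return trimap
```

The matting stage then only has to resolve alpha inside the 128 band, which is what keeps the problem tractable.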

4. Consistency between frames?

The matte must remain consistent between frames through events like occlusion by objects crossing the frame, drastic light changes, or motion blur.
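One simple (and deliberately naive) way to damp single-frame flicker is to blend each frame’s matte with a running estimate. This exponential smoothing is only a sketch of the temporal-consistency idea, not our production approach, and the weight `alpha` is arbitrary:

```python
import numpy as np

def smooth_mattes(mattes, alpha=0.6):
    """Exponential moving average over per-frame mattes: each output
    frame is `alpha` of the current matte plus (1 - alpha) of the
    previous estimate, damping one-frame dropouts and flicker."""
    out = [mattes[0].astype(float)]
    for m in mattes[1:]:
        out.append(alpha * m + (1.0 - alpha) * out[-1])
    return out
```

The trade-off is ghosting on fast motion, which is why a real temporal model must reason about occlusion rather than blindly averaging.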

5. Delivering enough detail in the matte to live up to professional Film and TV VFX standards

Matting around objects is hard, but having the fidelity required for fingers, hair, and costumes is even trickier.

Take for example the detail required for the hair strands in the matte below.

Luma matte demonstrating required hair detail.
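The reason fractional detail matters is the compositing equation the matte feeds into, C = αF + (1 − α)B: a hair strand with α = 0.3 blends 30% foreground with 70% of the new background, so any error in α shows up directly in the final pixel. A minimal sketch:

```python
import numpy as np

def composite(fg, bg, alpha):
    """Alpha compositing: C = alpha * F + (1 - alpha) * B,
    with a single-channel matte broadcast over the RGB channels."""
    a = alpha[..., None]
    return a * fg + (1.0 - a) * bg
```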

Fortunately, all of these problems are solvable. In fact, these problems can be solved in many ways with different AI models, frameworks and workflows.

We discussed the place of diffusion models in image synthesis in our last article.

For the task of rotoscoping, Electric Sheep are using a Generative Adversarial Network (GAN) framework.

GANs are broken into two components: a generator, which tries to produce an output, and a discriminator, which tries to find flaws in it. Both improve until the generator is good enough to fool the discriminator.
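As a minimal sketch of that adversarial loop (a toy one-dimensional GAN with hand-derived gradients, nothing like a production matting network), the generator and discriminator alternate updates like this:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Real data lives at N(4, 0.5); the generator starts far away, mapping
# z ~ N(0, 1) through G(z) = w*z + b. The discriminator is the logistic
# classifier D(x) = sigmoid(a*x + c).
a, c = 1.0, 0.0                      # discriminator parameters
w, b = 1.0, 0.0                      # generator parameters
lr, steps, batch = 0.05, 3000, 64

for _ in range(steps):
    xr = rng.normal(4.0, 0.5, batch)          # real samples
    z = rng.normal(0.0, 1.0, batch)
    xf = w * z + b                            # fake samples
    dr, df = sigmoid(a * xr + c), sigmoid(a * xf + c)

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    a -= lr * np.mean(-(1 - dr) * xr + df * xf)
    c -= lr * np.mean(-(1 - dr) + df)

    # Generator step (non-saturating loss): push D(fake) toward 1.
    df = sigmoid(a * (w * z + b) + c)
    gout = -(1 - df) * a
    w -= lr * np.mean(gout * z)
    b -= lr * np.mean(gout)

fake_mean = float(np.mean(w * rng.normal(0.0, 1.0, 10000) + b))
```

By the end, the generator’s output distribution has been dragged toward the real data purely by trying to fool the discriminator; image-matting GANs play the same game over mattes instead of scalars.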

With ML and AI algorithms, the effectiveness of the algorithm is only ever as good as the data that trains it.

In this case we are tuning hyperparameters with a very carefully curated dataset of over 50,000 images; we believe this to be the largest training dataset assembled to date for this type of operation in this industry.

In the early versions of our algorithm the mask was far too large and didn’t have nearly enough fidelity on the person (hair, etc.). Internally, we fondly referred to this as the ‘Michelin Man’ algorithm… ahem…

Original (Unmatted)
V1 algorithm "Michelin Man"

Now in version 4 we have the fidelity we need.

Original (Unmatted)
V4 algorithm.

This is very exciting news for us.

Along the way we also had to overcome several other problems such as delivering videos with consistent image mattes between frames.

Earlier versions frequently produced holes in the silhouette of the matte; we haven’t solved this perfectly, but we have seen considerable improvements so far.

This is a problem we’ve seen in all current algorithms, and in my opinion it is one of the final hurdles to consistent video matting. The art is to make a temporally aware algorithm that can compensate between frames: intelligent enough to deal with occlusion from passing objects and with valid gaps within a silhouette, while still being robust on a single frame. Take, for example, this frame of a bent elbow:

Valid hole in silhouette
Unwanted hole in silhouette (V3 algorithm)

Without having an intelligent and temporally aware algorithm, it is impossible to differentiate these inputs.

Beyond Rotoscoping

Aside from our desire to remove a tedious process from the media workflow, we have a grand future vision: we are strongly on board with virtual production, which will enable us to move much of the post production workflow into game engines in the future.

We expect post will evolve but always exist in some fashion, and that capturing everything in camera on volume stages is too limiting for both financial and practical reasons. We want to create tools that glue the virtual production workflow to post, so that efficiencies carry all the way through, and we can all get back to saying “fix it in post” again: virtual production edition.

We believe that the ability to generate incredibly clean mattes will be invaluable as 2D-to-3D model technology matures. With a good enough matted video (and possibly injected lidar metadata for incredibly accurate depth mapping, instead of relying purely on optical data), we will be able to generate models while reducing the need for photogrammetry and complex object or body scans.

In future blog posts I will explore this concept and our experiments with it further.

Final Words

We are happy with the time savings already achieved by some of our early testers, and we thank all the VFX houses joining us early on this journey.

Our initial findings show that, using our algorithm, an average-length shot (3 seconds) can be rotoscoped within a minute. Compared to the 1–3 days of traditional methods, this is a huge time saving.

We are looking forward to posting about these test projects that are currently under wraps when we can!

Today we are proud to announce that we are one step closer to automating away rotoscoping for professional film and TV.

Technical workflows for professional film and TV bring their own challenges, such as tricky colour space conversions and industry-specific file types and formats, but we will cover those in later entries of the blog.

Electric Sheep allows artists to focus on being creative, and helps storytellers realise their vision.

Our mission is to address the pressing concerns of the highest-calibre professional workflows, bringing AI to film and TV.

Stay tuned for more updates!

If you would like to join our early testing closed alpha, please reach out to us directly. It is free for early adopters.