This blog post will cover some technical aspects, but I will try to stay high level to keep it interesting for those more into film/TV than tech.
Below, I talk about the various tools and what we have learnt about them. If you aren’t interested or just want to look at the comparison, you can jump to the side by side photos and our thoughts on how they will affect the industry here.
Recently there have been incredible leaps in computer vision and image generation technology (artificial intelligence and machine learning).
Several image diffusion algorithms have become available to beta testers. These tools allow us to enter text prompts to generate a corresponding image.
These algorithms are trained on millions of images. After training, they are able to ‘dream’ using the training data as inspiration and create entirely novel images, which are owned commercially by the person generating the image.
In this blog post we are going to evaluate the three biggest available: Dall-E 2, Midjourney, and Stable Diffusion.
Inherent Algorithmic Styles
I preface this section by saying that the following thoughts are from my experience, and with an effective prompt, all of these tools can generate good results.
Furthermore, the algorithms are rapidly evolving: as I write this blog post, Midjourney has recently shifted to version three of its algorithm, which has considerably changed the prompt output.
Midjourney was the first good text-to-image model that I had access to, during its closed beta programme for early testers. I’ve tested from V1 to V3.
I personally preferred the characteristics of the v2 algorithm; in the side-by-side comparisons I’ve seen, v3 outputs lose a lot of ‘character’ and become more generic. However, I’m confident that this will be patched in v4 onwards.
Since multiple good articles comparing the versions side by side already exist, I won't cover the differences in this blog post. This post will include prompts from multiple Midjourney versions.
I will start by saying that I was blown away by how good the algorithm was from the beginning. There was seemingly no fuss surrounding it, and then it appeared overnight.
Images generated with Midjourney from various prompts have a recognisable 'artistic style' in the same way you'd expect from a human artist.
It is fascinating that while these algorithms are trained on such an incomprehensible amount of data, they still have their own hallmarks - 'dialects' if you will.
Midjourney excels at ‘dark’ image content, such as cyberpunk scenes, dystopian cities, post-apocalyptic settings, ghost towns, etc.
Rendering portraits with a render-engine style like ‘Octane Render’ also yields exceptional results, as in, for example, the ‘Westworld’ robot prompt below.
Dall-E 2 has the most hype surrounding it on the internet and thus the longest waiting list to sign up. It needs little introduction. From my experience it is the easiest of the tools to generate ‘photorealistic’ images with minimal coercion from the prompt.
Dall-E 2 is currently the largest undertaking among this type of tool. It is hard to find precise numbers, but the algorithm is rumoured to have cost $300,000 to train.
I tend to have better results with Dall-E when tying it back to reality and giving it contextual clues, such as film titles for the ambiance I’m envisioning.
To generate the Dall-E digital art rabbits below, I found the outputs much better when listing ‘Interstellar (2014)’ in the prompt, which I didn’t do for Midjourney or Stable Diffusion.
The outpainting feature of Dall-E (where you can edit and extend a portion of the image) is an amazingly useful tool and is very easy to use. I've had excellent results with that on the claymation image. Unfortunately, the editor crashed and I lost the assembly but I managed to recreate it.
Of all the tools, I find that Dall-E is the best at handling complex prompts; it was the only one that would consistently generate the claymation figure in front of a greenscreen as described in the prompt.
Stable Diffusion was the latest to join the fray. I joined during the first wave of testers, got to hear some exciting thoughts from the developers about the future (touched upon later), and joined the very active and enthusiastic user group on Discord.
Stable Diffusion is an exciting prospect because it is an open source model. This means that many people can contribute to its development, and in the longer term the team behind it expects people to host their own ‘tuned’ versions, retrained on specialised images relevant to their audience so that the model caters better to a specific niche.
I find that it requires more fine-tuning of the prompts, but thanks to the architecture you are able to rapidly generate results and iterate upon them.
In the advanced options you can set denoising steps and the seed number.
In layman's terms, increasing the denoising (inference) steps makes the algorithm do more processing, which takes longer and usually (but not always) generates a higher quality output.
The seed number is a magic number that makes generation reproducible: keeping the same seed will result in the same image being generated from the prompt each time.
Slightly changing the seed number slightly changes the interpretation, while drastically changing it will create a completely different output.
The ability to tune the denoise steps and seed number along with other settings gives Stable Diffusion the most stable base to work from. Without these settings the other tools feel a little bit like shifting sands at times.
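To make the seed behaviour concrete, here is a toy Python sketch. This is a stand-in pseudo-sampler, not Stable Diffusion's actual code: the point is only that a seed fixes the starting randomness, so the same seed reproduces the same output while a different seed diverges.

```python
import random

def generate(seed, steps):
    # Toy stand-in for a diffusion sampler: the seed fixes the
    # starting noise, so identical seeds give identical "images".
    rng = random.Random(seed)
    return [rng.random() for _ in range(steps)]

same_a = generate(seed=42, steps=5)
same_b = generate(seed=42, steps=5)
different = generate(seed=43, steps=5)

assert same_a == same_b    # same seed -> identical output
assert same_a != different  # new seed -> different output
```

Real diffusion tools work the same way in spirit: the seed determines the initial noise image that the denoising steps then refine, which is why re-running a prompt with a fixed seed is repeatable.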
Additionally, the generation speed is much quicker because you can self-host and run it on any machine-learning-capable hardware.
As a result Stable Diffusion is the most flexible tool and I expect it to become my favourite in time.
An interesting, lesser known feature of Stable Diffusion is that word frequency factors into the prompts. Take for example the following prompts using the same seed:
bunny, bunny bunny, bunny bunny bunny
‘Magic’ Tokens and Prompt Craft
Dall-E, Midjourney, and Stable Diffusion have all been trained on existing images and the metadata associated with them. ‘Prompt craft’ is the art of writing good prompts for each of these tools.
Some features are under the hood, like specifying the aspect ratio of images (16:9 or 2.39, for example) or the depth of field of the generated image.
It goes without saying that if you have a photography background then you might have an advantage for capturing the shot you want.
Most models seem to loosely abide by f-stop specifications: f/13.0, for example, will produce a deep depth of field, while f/1.0 or f/1.2 will attempt a shallow depth of field (ideal for portraits), though it helps to accompany this with ‘bokeh’.
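For the optically inclined, the reason a higher f-number deepens depth of field falls out of the standard hyperfocal-distance approximation. This is a quick sketch, assuming a hypothetical 50 mm lens on a full-frame sensor (circle of confusion of roughly 0.03 mm) with the subject 2 m away:

```python
def depth_of_field_mm(focal_mm, f_number, subject_mm, coc_mm=0.03):
    """Approximate depth of field via the hyperfocal-distance formulas.

    coc_mm is the circle of confusion (~0.03 mm for full-frame sensors).
    Valid while the subject is closer than the hyperfocal distance.
    """
    hyperfocal = focal_mm**2 / (f_number * coc_mm) + focal_mm
    near = subject_mm * (hyperfocal - focal_mm) / (hyperfocal + subject_mm - 2 * focal_mm)
    far = subject_mm * (hyperfocal - focal_mm) / (hyperfocal - subject_mm)
    return far - near

# 50 mm lens, subject 2 m (2000 mm) away:
shallow = depth_of_field_mm(50, 1.2, 2000)   # wide open: ~11 cm in focus
deep = depth_of_field_mm(50, 13.0, 2000)     # stopped down: ~1.3 m in focus
```

So when a prompt says f/1.2 the model is (loosely) imitating images where only a sliver of the scene is sharp, and f/13 images where most of it is.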
I will write another blog post on f-stops, t-stops and aspect ratios another day. Boy, have I spent a lot of time talking about rectangles in my life.
For now here is a handy f-stop chart I found on the internet. (Credit to phototraces.com)
Likewise, if you are trying to generate hyper real/game engine looking assets, it benefits to know render engines like Octane Render and Unreal Engine.
Compare and Contrast
Okay, let’s look at some comparisons. For the sake of best results with faces, I’ve sometimes cleaned images up using Tencent’s facial restoration algorithm to remove the artefacts.
I will write another blog post later about upscaling techniques.
I’ve also played around with the prompts until I got the best results I could out of each tool. A one-size-fits-all approach does not work; each algorithm has its own quirks, and handling them is part of the ‘prompt craft’. In the spirit of openness and pushing creativity forward, I’m sharing all the prompts I used for you to compare.
First up is photorealism.
Man in a red hat.
Bunny in a space helmet.
Claymation clay figure.
Landscape - Unreal Engine.
Landscape - Watercolour.
(I’m not posting the prompts here to keep it as a clean gallery but please feel free to drop us an email below and I am happy to share.)
(The next few are all credited to Antonio Pavlov, a great vfx artist and a good friend.)
This technology is still very much in its infancy but is already showing incredible results. Each of the existing tools seems to have its specialisations.
Photographic: Generating photorealistic images with Midjourney has proven very difficult; I have yet to get it to work well and need to spend more time developing my photorealistic Midjourney prompt craft.
Dall-E is excellent as always.
Stable Diffusion was surprisingly good but required much more tweaking to get something acceptable.
Stylised Digital Art: Midjourney is hands down the best. Dall-E seems to struggle with this as does Stable Diffusion.
Landscapes: For more organic material like paintings I prefer Stable Diffusion, and for more stylised I prefer Midjourney. Dall-E is good at landscape photos.
At Electric Sheep we are working on a few early stage projects leveraging this technology.
In one project, we are using a hybrid of these tools plus some internal tools (to generate more frame-based consistency) to assist the generation of hand-drawn backgrounds for an animation project, accelerating its animators so they can focus on hand-drawing the characters.
In another project, we are using these tools to generate concept art as a jumping-off point, to be enhanced by the concept artist and help get a feature film pitch across the line.
Compositing generated elements into the background or foreground gives the most granular control and has the best end result.
Stay tuned for a case study on the results of these projects. We are excited to share them with you soon!
The next natural evolution will be generating coherent videos. These algorithms are not currently temporally aware (aware of anything beyond the current frame), and as such they struggle to stay coherent across a video.
Objects warble between frames, and hair and faces in particular are very tricky for these early algorithms to get right without post correction.
In the future we might see the ability to generate RAW images from text prompts so that they can be properly colour graded and matched to other elements.
Further enhancements may also include layer segmentation (pictures split into foreground/background layers) so that they can be composited more easily.
The landscape surrounding these tools is rapidly evolving. The future is bright for video.
If you need any technical creative services like assisting with creative pitch decks, early concept art ideas, pitch generation, or advice for how to best leverage these upcoming technologies for prep/on set or post, please get in touch with us here.