Record your browser
As this year's Congress design relies heavily on the Kario Duplex Var main font, we needed a rendering engine that could take full advantage of all the font's typographic features while enabling the relatively straightforward creation of images.
One of the initial tasks was to generate intros and outros for the conference recordings. Early drafts took a wild tour through all the shifting weights, zooms, and context-aware ligatures on offer. The first experiments with PIL and HarfBuzz produced discouraging results; for instance, the kerning broke hopelessly between animation phases. Although the Drawbot framework allowed for some nice static image generation features that made full use of all the font's 'easter eggs', it was not ideal for creating a stable animation, either.
However, there is another widely deployed and well-tested font rendering engine around: your browser comes with a powerful one that can dynamically produce high-quality animations using simple CSS rules, ready to be adopted and adjusted by even less experienced users. You can see its output in action every day.
Sooo, you want to know how I got it to render me an animation like this:
The first experiments were very promising, involving only trivial CSS animation statements.
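A minimal sketch of what such a rule can look like (the selector, the 10 to 100 weight range and the timing are merely illustrative, borrowed from the JavaScript further down; the real Congress styles are more elaborate):

@keyframes meander {
  0%, 100% { font-weight: 10; }   /* assumed lower end of the font's weight axis */
  50%      { font-weight: 100; }  /* assumed upper end */
}

#anim {
  font-family: "Kario Duplex Var", sans-serif;
  animation: meander 2s ease-in-out infinite;
}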
With a few declarations like these, one can already prototype substantial in-browser animations of meandering states. So far, so good. However, recording these animations opened a can of worms. Asking the video crew to take dozens of screen recordings of their browser would probably make them reach for their batons. To prevent that, I took a deep dive down the rabbit hole of browser instrumentation, so you don't have to.
In terms of remotely controlling a browser, there are essentially only two serious protocols: the Firefox one and the Chrome DevTools Protocol. Only the CDP allows for screen casting. Since Chromium is available on virtually any platform imaginable, I just went with it.
The protocol looks quite reasonable at first. You launch your browser from the command line with remote debugging enabled, roughly like this:
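chromium-browser --remote-debugging-port=9222 --new-window about:blank   # binary name may differ on your system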
or for our Apple users:
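/Applications/Chromium.app/Contents/MacOS/Chromium --remote-debugging-port=9222 --new-window about:blank   # adjust the path if your Chromium lives elsewhere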
and then get yourself a list of more or less active control sockets like this:
import urllib.request
import json

CHROMIUM_REMOTE = "http://localhost:9222"

with urllib.request.urlopen(CHROMIUM_REMOTE + "/json") as resp:
    targets = json.load(resp)

ws_url = targets[0]["webSocketDebuggerUrl"]
print("Bootstrapping using ws_url: " + ws_url)
yields a happy little
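Bootstrapping using ws_url: ws://localhost:9222/devtools/page/<some-hex-target-id>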
So far, so good.
It is of the utmost importance to understand that this first debugger URL is tied to a so-called target, which may or may not be useful for remote control. The target could, for instance, represent a tab that has already been closed, or your Chromium settings page. Therefore you need to create a new target that you have complete control over. Since you got yourself a WebSocket endpoint that allows for bootstrapping, you can create a new target and attach to it to obtain a new session ID. From then on, use that session ID over the bootstrap WebSocket.
This process is not completely straightforward, is poorly documented, and took me quite a while to understand, so feel free to use these snippets:
import asyncio
import time, subprocess, json, base64, sys
import websockets
import urllib.request

CHROMIUM_REMOTE = "http://localhost:9222"
TARGET_URL = "https://erdgeist.org/39C3/font/index2.html"
VIEWPORT_WIDTH = 1920
VIEWPORT_HEIGHT = 1080
VIDEO_LENGTH_SEC = 5

async def main():
    # Discover the first available target for bootstrapping
    with urllib.request.urlopen(CHROMIUM_REMOTE + "/json") as resp:
        targets = json.load(resp)
    ws_url = targets[0]["webSocketDebuggerUrl"]
    print("Bootstrapping using ws_url: " + ws_url)

    async with websockets.connect(ws_url) as control_sock:
        msg_id = 0
        session_id = 0

        async def wait_for_id(expected):
            while True:
                msg = json.loads(await control_sock.recv())
                if msg.get("id") == expected:
                    return msg

        async def ws_send(method, params=None, in_session=True):
            nonlocal msg_id
            nonlocal session_id
            msg_id += 1
            cmd = {"id": msg_id, "method": method}
            if params:
                cmd["params"] = params
            if in_session:
                cmd["sessionId"] = session_id
            print("Sending " + json.dumps(cmd))
            await control_sock.send(json.dumps(cmd))
            return await wait_for_id(msg_id)

        create_resp = await ws_send("Target.createTarget",
                                    params={"url": TARGET_URL}, in_session=False)
        target_id = create_resp["result"]["targetId"]

        attach_resp = await ws_send("Target.attachToTarget",
                                    params={"targetId": target_id, "flatten": True}, in_session=False)
        session_id = attach_resp["result"]["sessionId"]
        print("target created and attached, session:", session_id)
This yields a log like
Sending {"id": 1, "method": "Target.createTarget", "params": {"url": "https://erdgeist.org/39C3/font/index3.html"}} Sending {"id": 2, "method": "Target.attachToTarget", "params": {"targetId": "13A7AC88BB49C1E22A2AABA536375E77", "flatten": true}} target created and attached, session: 9E73BE323AD10D72BB8A1BF7E7120ED9
and from then on, using the session_id for the new target, we can use the initial control socket to instrument our new tab.
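From here on, the ws_send() helper wraps every command in such a session envelope, which looks roughly like this on the wire (the session ID is the one from the log above, the message id is simply whatever the counter is at):

{"id": 3, "sessionId": "9E73BE323AD10D72BB8A1BF7E7120ED9", "method": "Page.enable"}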
Next, we need to do some housekeeping. First, we enable the Page domain and set the viewport dimensions so that our screen recording has the correct size. We also want to be notified once the page has loaded all the necessary assets and rendered the firstMeaningfulPaint, so that we can start recording; this is why we enable the page's lifecycle events.
# Enable page, lifecycle events and set viewport dimension
await ws_send("Page.enable")
await ws_send("Page.setLifecycleEventsEnabled", {"enabled": True})
await ws_send("Page.setDeviceMetricsOverride", {
    "width": VIEWPORT_WIDTH,
    "height": VIEWPORT_HEIGHT,
    "deviceScaleFactor": 0,
    "mobile": False
})
Now, we just need to wait for the first paint event
while True:
    msg = json.loads(await control_sock.recv())
    if msg.get("method") == "Page.lifecycleEvent" and \
       msg["params"]["name"] in ("firstMeaningfulPaintCandidate", "firstMeaningfulPaint"):
        break
and we're off to the races. The plan is simple: we want to start a screencast with the given viewport width and height. To render high-quality, uncompressed frames, we select a quality setting of 100 and the PNG format. There are two important things to note here. Firstly, the screencast is sent over your control socket as a series of screencastFrame messages, each containing some metadata and the base64-encoded payload. You must acknowledge each frame before the browser sends the next one, which is why we acknowledge the frame immediately, before serialising it. Secondly, and counterintuitively, the sessionId that needs to be sent with the frame ack is not your session's sessionId, but a new virtual sessionId generated for the screencast. It usually starts at 1 and arrives alongside the frame's data and metadata. In your ack message, it goes into the params dict, not the command envelope, which still carries the session's sessionId.
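To make the two IDs a little less confusing, a frame ack looks roughly like this on the wire (the message id is made up): the outer sessionId addresses our tab session, the inner one echoes the screencast session from the frame.

{"id": 9, "sessionId": "9E73BE323AD10D72BB8A1BF7E7120ED9", "method": "Page.screencastFrameAck", "params": {"sessionId": 1}}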
To avoid blocking the backend for later frames, we defer heavy tasks such as re-encoding to MP4 and simply dump the stream to a JSONL file:
with open("temp_out.jsonl", "w") as f: # Start screencast await ws_send("Page.startScreencast", { "format": "png", "quality": 100, "maxWidth": VIEWPORT_WIDTH, "maxHeight": VIEWPORT_HEIGHT, "everyNthFrame": 1 }) start_time = time.time() while time.time() - start_time < 15: msg = json.loads(await control_sock.recv()) if msg.get("method") == "Page.screencastFrame": params = msg["params"] # Fast-track acknowledgment (before file I/O) # Note: sessionId is for the screencast await ws_send("Page.screencastFrameAck", {"sessionId": params["sessionId"]}) # Dump base64 frame string + timestamp record = { "timestamp": params["metadata"]["timestamp"], "data": params["data"] } f.write(json.dumps(record) + "\n") f.flush() else: print("Received " + json.dumps(msg))
Once we have finished, we stop the screencast, then detach from and close the target, i.e. the browser tab.
# Stop screencast and close gracefully
await ws_send("Page.stopScreencast")
print("Stopped recording after", VIDEO_LENGTH_SEC, "seconds")

# Target.* teardown commands are addressed to the browser, not the tab session
await ws_send("Target.detachFromTarget", params={"sessionId": session_id}, in_session=False)
await ws_send("Target.closeTarget", params={"targetId": target_id}, in_session=False)
All that's left to do is pipe the result through ffmpeg to generate a nice, pristine MP4. I'm sure you've installed ffmpeg by now.
ffmpeg = subprocess.Popen([
    "ffmpeg", "-y",          # overwrite
    "-f", "image2pipe",      # stream of images
    "-vcodec", "png",        # input format
    "-r", str(60),           # assume 60 fps timing
    "-i", "-",               # stdin
    "-c:v", "libx264",
    "-pix_fmt", "yuv420p",
    "out.mp4"
], stdin=subprocess.PIPE)

with open("temp_out.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        ffmpeg.stdin.write(base64.b64decode(rec["data"]))

ffmpeg.stdin.close()
ffmpeg.wait()
Remember to actually execute main():
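asyncio.run(main())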
You should have a screen recording saved as out.mp4. You can find the complete source file here. The video should look something like the one at the top.
Ideally, we would just call it a day and stitch together all the videos, but alas, despite our plan to record high-quality video at 60 fps, the Chromium gods had their own ideas about how often and when to render frames. Even more frustratingly, if the browser detects that there are no visible changes to the layout, for example at the beginning or end of ease-in-out animations, it simply refuses to repaint and to emit screencastFrame messages.
Depending on the monitor the browser was launched on and the complexity of the scene to be rendered, you will see frame rates all over the place. The metadata field in each Page.screencastFrame message does contain a timestamp precise enough for ffmpeg to re-create a better synchronised movie from a concat script generated on the fly, but for a truly constant frame rate we would be entering a realm where interpolated in-between frames have to be generated. For most people, the current behaviour is sufficient.
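If you do want to honour the real capture times, a sketch of that concat approach could look like this, reusing the timestamps we already dumped into temp_out.jsonl (the frame file names and concat.txt are made up for illustration):

import json, base64

# Write every recorded frame to disk and build an ffmpeg concat script in which
# each frame is shown for the time that actually elapsed until the next one arrived
frames = [json.loads(line) for line in open("temp_out.jsonl")]
with open("concat.txt", "w") as script:
    for i, frame in enumerate(frames):
        name = f"frame_{i:05d}.png"
        with open(name, "wb") as png:
            png.write(base64.b64decode(frame["data"]))
        script.write(f"file '{name}'\n")
        if i + 1 < len(frames):
            script.write(f"duration {frames[i + 1]['timestamp'] - frame['timestamp']:.6f}\n")

# Then: ffmpeg -f concat -safe 0 -i concat.txt -pix_fmt yuv420p synced.mp4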
However, I would like to use the browser's font rendering engine with a precise frame rate, even if that means abandoning the comforting guard rails of CSS animations. So let's look into that, shall we?
First, I need to get rid of two annoyances: the synchronisation of the browser's render cycles with the display's vsync, and the browser opening and closing tabs on my desktop. From now on, we'll just launch the browser in headless mode and instruct it not to limit its frame rate:
chromium-browser --headless=new --disable-gpu-vsync --disable-frame-rate-limit --remote-debugging-port=9222 --new-window about:blank
Much better! Now that the browser is no longer responsible for synchronisation, it is our job to cycle through all the animation steps manually. The idea is to have the website set up its animation once and then to single-step through all the states individually, taking a screenshot for each animation phase. Sure enough, calling JavaScript functions and inspecting values is quite easy with the Runtime.callFunctionOn() and Runtime.evaluate() methods, respectively. To use these inside the site's scope, we need to enable the Runtime for the page once:
# Enable Runtime for access to JavaScript calls, enable Page and reload the target URL
await ws_send("Runtime.enable")
await ws_send("Page.enable")
await ws_send("Page.reload")
The Runtime.callFunctionOn() method requires an objectId to be called on. Obtaining one is more complicated than it has any right to be, but you can have Runtime.evaluate() fetch the window's objectId for you:
# Get window's objectId
resp = await ws_send("Runtime.evaluate", {
    "expression": "window",
    "objectGroup": "page",
    "returnByValue": False
})
window_id = resp["result"]["result"]["objectId"]
This can then be passed to Runtime.callFunctionOn() as follows:
await ws_send("Runtime.callFunctionOn", params = { "objectId": window_id, "functionDeclaration": "function(config) { setContent(config); }", "arguments": [ { "value": { "title": "Your presentation's title", "fps": FRAME_RATE } } ] });
The corresponding setContent function in the HTML should look like this:
function isEmojiOnly(str) {
  const stringToTest = str.replace(/ /g, '');
  const emojiRegex = /^(?:(?:\p{RI}\p{RI}|\p{Emoji}(?:\p{Emoji_Modifier}|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?(?:\u{200D}\p{Emoji}(?:\p{Emoji_Modifier}|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?)*)|[\u{1f900}-\u{1f9ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}])+$/u;
  return emojiRegex.test(stringToTest) && Number.isNaN(Number(stringToTest));
}

window.setContent = async function (content) {
  let title = document.querySelector('#title p');
  title.textContent = '';

  var offset = 0;
  for (let i = 0; i < content.title.length; i++) {
    let span = document.createElement('span');
    span.textContent = content.title.charAt(i);
    if (isEmojiOnly(span.textContent)) {
      span.classList.add('is-emoji');
    } else {
      span.setAttribute('anim-off', offset);
      offset += 1;
    }
    title.appendChild(span);
  }

  window.fps = content.fps ?? 60;
  window.content_set = true;
  console.log("setting content to ", content.title);
}
Please note that in order to retrieve the console.log output, you must listen for the Runtime.consoleAPICalled event.
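As a rough sketch (not part of the final script), picking these events up in the receive loop could look like this; the arguments arrive as CDP RemoteObjects, so primitive values sit in their value field:

# Inside a receive loop on the control socket
msg = json.loads(await control_sock.recv())
if msg.get("method") == "Runtime.consoleAPICalled":
    values = [arg.get("value", arg.get("type")) for arg in msg["params"]["args"]]
    print("console." + msg["params"]["type"] + ":", *values)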
Now that we no longer have any real-time requirements (remember, we now control when each new frame is rendered!), we can drop the intermediate JSONL file of raw PNG frames and pipe the images directly through ffmpeg like this:
ffmpeg = subprocess.Popen([
    "ffmpeg", "-y",
    "-f", "image2pipe",
    "-framerate", str(args.frame_rate),
    "-i", "-",
    "-pix_fmt", "yuv420p",
    "-f", "mp4",
    args.movie_name
], stdin=subprocess.PIPE)

for i in range(args.frame_count):
    # Advance the animation by exactly one frame inside the page
    await ws_send("Runtime.evaluate", params={
        "expression": "renderFrame()",
        "awaitPromise": True
    })
    # Grab the freshly painted frame and feed it to ffmpeg
    resp = await ws_send("Page.captureScreenshot", params={"format": "png"})
    png_b64 = resp["result"]["data"]
    png_bytes = base64.b64decode(png_b64)
    ffmpeg.stdin.write(png_bytes)
We perform the detaching and closing as before, wait for ffmpeg to settle … and then we're finished. Just add a tiny argument parser like this one:
import argparse

parser = argparse.ArgumentParser(description='Render chrome anim.')
parser.add_argument('--chrome-remote', default=CHROME_REMOTE, type=str)
parser.add_argument('--movie-name', default=VIDEO_OUT_PATH, type=str)
parser.add_argument('--frame-rate', default=FRAMERATE, type=int)
parser.add_argument('--frame-count', default=FRAMECOUNT, type=int)
parser.add_argument('--frame-width', default=VIEWPORT_WIDTH, type=int)
parser.add_argument('--frame-height', default=VIEWPORT_HEIGHT, type=int)
parser.add_argument('--url', default='file://./index.html', type=str)

args = parser.parse_args()
asyncio.run(main(args))
The recording part of our new little setup is complete. The only thing left to implement is the renderFrame() function in our JavaScript world, which prepares the scene for the next screenshot. As we now have to do the animations manually, I first lifted the easing functions straight from easings.net and chose easeInOutQuad for my font weight animation:
window.renderFrame = async function(from_CDP = true) {
  let fps = window.fps ?? 60.0;
  let main_anim_duration = 7.0 * fps;
  let main_anim_period = 2.0 * fps;

  /* Animate main scene font weight */
  let para = document.querySelector('#anim');
  let frame = window.frame ?? 0;

  if (frame < main_anim_duration) {
    let cycle = (frame % main_anim_period) / main_anim_period; /* Runs 0..1 */
    let normalized = 2.0 * cycle - 1.0;                        /* Runs -1..+1 */
    let shifted = 1.0 - Math.abs(normalized);                  /* Runs 0..1..0 */
    let easing = easeInOutQuad(shifted);
    let weight = 10 + easing * 90;
    para.style.fontWeight = weight;
  } else
    para.style.fontWeight = 100;

  window.frame = frame + 1;

  // first rAF: flush style & layout
  await new Promise(requestAnimationFrame);
  // second rAF: ensures paint has happened
  await new Promise(requestAnimationFrame);
}
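For completeness, and so that the snippet above is self-contained, the easeInOutQuad flavour as published on easings.net is simply:

function easeInOutQuad(x) {
  return x < 0.5 ? 2 * x * x : 1 - Math.pow(-2 * x + 2, 2) / 2;
}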
Note that we synchronise with requestAnimationFrame twice: the first rAF makes sure a new render cycle is triggered after we set the new styles, rather than reusing one that was already in flight, and the second one only fires once that frame has actually been painted. The "awaitPromise": True when calling Runtime.evaluate on our renderFrame method ensures the CDP call only returns after both of these promises have been fulfilled and the page is indeed ready for a snapshot.
Stitching all these snippets together should produce a working scene renderer, like the one in this demo, which can produce videos like this one.