Understanding Frame Rate

January 23rd, 2015 | No Comments | Posted in Download, Schubin Cafe

Recorded on January 20, 2015 at the SMPTE Toronto meeting.

In viewing tests, increased frame rate delivers a greater sensation of improvement than increased resolution (at a fraction of the increase in data rate), but some viewers of the higher-frame-rate Hobbit found the sensation unpleasant. How does frames-per-second translate into pixels-per-screen-width? One common frame rate is based on profit; another is based on an interpretation of Asian spirituality. Will future frame rates have to take image contrast into consideration?

Direct Link (61MB / 34:34 TRT): Understanding Frame Rate – SMPTE Toronto



The Habit and “The Hobbit”

February 5th, 2013 | 2 Comments | Posted in Schubin Cafe

Here are a couple of questions to get you started: What is the image at left? And what is the sound of a telephone call?

I’ll offer some more information about the first one. It’s an “intertitle,” the sort of thing inserted into silent movies to help advance their plot.

This one happens to be from a pretty famous movie. Got any idea yet of which one? You’re likely to be familiar with it even if you never saw it. But the answer might be surprising.

Now, how about that telephone call? Bell Labs researcher and audio pioneer Harvey Fletcher wanted its sound to be unidentifiable, i.e., just as good as being there. Today, if you use a certain type of mobile phone, you might be able to identify certain negative artifacts, but, in general, with contemporary technology, Fletcher’s dream has been achieved: a telephone call sounds pretty much like any other reproduction of an electronic audio signal. And that’s a problem.

When the kidnapper calls to demand ransom in a movie or TV thriller, the camera might offer a close-up of the person taking the call, but the kidnapper’s voice shouldn’t sound like it’s coming from the same room. So a voice filter is used, typically restricting the bandwidth of the sound to a range from roughly 300 Hz to 3 kHz, as shown at the right in the Cisco white paper “Wideband Audio and IP Telephony.”

If you’re familiar with sampling theory, you know that, to avoid spurious frequencies known as aliases, sampling must be done at a rate higher than twice the desired highest frequency, and the signal must be filtered to prevent anything higher than that highest desired frequency from entering the sampler. Filters are imperfect, so, if a telephone company wanted to sample 8,000 times per second, it would not be totally unreasonable for the system to pass little more than 3 kHz.
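To make that sampling arithmetic concrete, here is a minimal Python sketch. The function names are my own illustrations, not anything from a telephony standard, and the tones are idealized pure sine waves. It shows the highest frequency an 8,000-sample-per-second system can represent and where a tone that slipped past the filter would end up.

```python
def nyquist_limit(sample_rate_hz: float) -> float:
    """Highest frequency representable without aliasing at this sample rate."""
    return sample_rate_hz / 2.0


def alias_frequency(tone_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a pure tone after ideal sampling with no filter.

    Frequencies fold back around multiples of the sample rate, so the
    result always lies between 0 and the Nyquist limit.
    """
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


if __name__ == "__main__":
    rate = 8000.0  # samples per second, as in the telephone example
    print("Nyquist limit:", nyquist_limit(rate))
    for tone in (3000.0, 5000.0, 7000.0):
        print(tone, "Hz appears as", alias_frequency(tone, rate), "Hz")
```

Under these assumptions, an unfiltered 5 kHz tone would be indistinguishable from a 3 kHz one after sampling, which is why an imperfect filter has to start rolling off well below 4 kHz.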

Digital transmission systems don’t care about filtering low frequencies, however, so why the 300 Hz low-frequency cutoff? It dates back to analog transmission systems, wherein different frequencies would be attenuated by different amounts, and an equalizer would restore them. The attenuation might be described as a certain number of decibels per decade. A decade, in this case, is a tenfold increase in frequency, as from 300 Hz to 3 kHz. Going down to 30 Hz from 300 would add another decade, doubling the equalization needed.
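The decade arithmetic can be sketched in a few lines of Python. The 20 dB-per-decade slope below is purely a hypothetical figure chosen for illustration; actual line attenuation depended on the cable and its length.

```python
import math


def decades(low_hz: float, high_hz: float) -> float:
    """Number of tenfold frequency steps between two frequencies."""
    return math.log10(high_hz / low_hz)


def equalization_db(low_hz: float, high_hz: float, db_per_decade: float) -> float:
    """Total correction needed across the band for a given slope."""
    return decades(low_hz, high_hz) * db_per_decade


if __name__ == "__main__":
    slope = 20.0  # hypothetical dB per decade, for illustration only
    print(equalization_db(300.0, 3000.0, slope))  # one decade
    print(equalization_db(30.0, 3000.0, slope))   # two decades: twice as much
```

Whatever the real slope was, extending the band from 300 Hz down to 30 Hz doubles the number of decades and therefore doubles the correction.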

Today, in the era of digital transmission, going down to 30 or even 20 Hz would not be a problem, which is why people describe today’s real-world telephone calls in such terms as “sounding like you’re next to me.” But the sound of a telephone-call voice in a movie or on TV still harks back to an earlier era (just as a print ad might tell its viewer to “dial” a certain phone number in an era when it’s hard to find a dial-equipped phone outside a museum).

It’s not easy on a visual web page to provide examples of telephone call sounds, especially since I have no idea what your listening equipment is like. But here is another common example of a motion-image-media indicator that strays from reality: the binoculars mask.

If you use binoculars, you probably know you’re supposed to adjust their eye separation so that there’s one circular image, not the lazy eight shown at left. But, if there’s no binoculars mask effect, how is a viewer supposed to know that the scene is seen through binoculars?

Now, perhaps, we can consider frame rate. Though he wanted telephone calls to sound just like being there in person, Fletcher did the research that identified the 300 Hz-to-3 kHz range for speech intelligibility and identification. Are there physical parameters affecting choice of frame rate? There is more than one.

One is typically called the fusion frequency, the frequency at which a sequence of individual pictures appears to be a motion picture. You can find your own fusion frequency with a common flip book; an 1886 version called a Kineograph is shown at right.

Flip through the pages slowly, and they are individual still pictures. Flip through them quickly, and they are a single motion picture.

Unfortunately, there is no single fusion frequency. It varies from person to person and with illumination, color, angle, and type of presentation.

The type of presentation becomes significant in another frame-rate variable: what’s commonly called the flicker frequency, the rate at which sources of illumination appear to be steady, rather than flickering.

Some of the earliest motion-picture systems took advantage of a fusion frequency generally lower than the flicker frequency. They presented motion pictures, but they flickered, thus an early nickname for movies: flickers or flicks.

One “solution” to the flicker problem was the use of a two-bladed shutter in the projector. A film image would be moved into place, the shutter would turn, the image would appear on screen, the shutter would turn again, the image would disappear, it would turn again, it would reappear, and it would turn again while a new image moved into place. The result was an illumination-repetition rate twice that of the frame rate, perhaps enough to achieve the flicker frequency, depending, again, on a number of viewing factors.
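The shutter arithmetic is simple enough to sketch in Python. The function name is illustrative, and 24 fps is used only as a convenient example rate.

```python
def flash_rate_hz(frames_per_second: float, shutter_blades: int) -> float:
    """Light pulses per second reaching the screen.

    Each frame is flashed once per shutter blade before the next frame
    is moved into place, so the illumination-repetition rate is the
    frame rate multiplied by the number of blades.
    """
    return frames_per_second * shutter_blades


if __name__ == "__main__":
    for blades in (1, 2, 3):
        print(blades, "blade(s):", flash_rate_hz(24, blades), "flashes per second")
```

Note that only the flash rate changes; the number of distinct pictures per second stays the same.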

While the two-bladed (or, in some cases, three-bladed) shutter helped ameliorate flicker, it introduced a new artifact into motion presentation. A moving object would appear to move from one frame to another but to stall in mid-motion from one shutter opening to another. Clearly, that was a step away from reality, but, like a limited-bandwidth telephone call and a binoculars mask, it tended to indicate the look of a movie.

What rate is required? When Thomas Edison initially chose 46 frames per second (fps) for his Kinetoscope, he said it was because his research had shown that “the average human retina was capable of taking 45 or 46 photographs in a second and communicating them to the brain.” But the publication Electricity, in its June 6, 1891 issue, contrasted the Kinetoscope’s supposed 46 fps with Wordsworth Donisthorpe’s Kinesigraph’s six-to-eight: “Now, considering that the retina can retain an impression for 1/7 of a second, 8 photographs per second are sufficient for the purpose of reproduction and the remaining 38 are mere waste.”

Is there a “correct” frame rate? This week’s Super Bowl coverage made use of For-A’s FT-One cameras (above), which can shoot 4K images at up to 900 fps. But that was for replay analysis.

At the International Broadcasting Convention (IBC) in Amsterdam in 2008, the British Broadcasting Corporation (BBC) provided a demonstration in the European Broadcasting Union (EBU) “village” that showed how frame rates as high as 300 fps could be beneficial for real-time viewing. At left is a simulation of 50-fps (top) vs. 100-fps (bottom), showing a huge difference in dynamic resolution (detail in moving images).

Note that the stationary tracks and ties are equally sharp in both images. The moving train, however, is not. Other parts of the demonstration showed that high-definition resolution might appear no better than standard-definition for moving objects at common TV frame rates.
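A rough way to see why the moving train loses detail is to estimate how far it travels while a single frame is being exposed. The Python sketch below is my own simplification, not the BBC's method: it assumes the shutter stays open for the whole frame period (`shutter_fraction=1.0`), a 1,920-pixel-wide picture, and a hypothetical object that crosses the screen in four seconds.

```python
def smear_pixels(screen_width_px: float, crossing_time_s: float,
                 fps: float, shutter_fraction: float = 1.0) -> float:
    """Approximate blur length, in pixels, of an object crossing the screen.

    shutter_fraction is the portion of each frame period during which
    the shutter is open (1.0 means open for the full frame, an assumption).
    """
    speed_px_per_s = screen_width_px / crossing_time_s
    return speed_px_per_s * shutter_fraction / fps


if __name__ == "__main__":
    for rate in (24, 50, 100, 300):
        print(rate, "fps:", smear_pixels(1920, 4.0, rate), "pixels of smear")
```

Under those assumptions, doubling the frame rate halves the smear, while anything stationary is unaffected at any rate.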

A clear case seemed to be made for frame rates higher than those normally used in television. Again, that was in 2008. In 2001, however, Kodak, Laser-Pacific, and Sony each won an engineering Emmy award for making possible 24-fps video: video at a lower frame rate than that normally used.

As the BBC/EBU demo at IBC clearly showed, 24-fps video has worse dynamic resolution than even normal TV frame rates, let alone higher ones. Yet 24-fps video has also been wildly successful. It provides a particular look, just as a binoculars mask does. In this case, the look contributes to a sensation that the sequence was shot on film. But why did movies end up at 24-fps? It’s not Edison’s 46 nor Donisthorpe’s 8.

The figure is based on research but not research into any form of visual perception. Go back to the intertitle at the top of this column. Have you guessed the movie yet? It’s The Jazz Singer, the one that ushered in the age of sound movies, even though, as the intertitle shows, it, itself, was not an all-singing, all-talking movie.

Some say 24-fps was chosen as the minimum frame rate that would provide sufficient sound quality. But The Jazz Singer, like many other sound movies, used a sound-reproduction system, Vitaphone, unrelated to the film rate: phonograph disks. In the 1926 demo photo above, engineer Edward B. Craft holds one of the 16-inch-diameter disks. Their size and rotational speed (33-1/3 rpm, the first time that speed had been used) were carefully chosen for sound quality and capacity, but they could have been synchronized to a projector running at any particular speed.

That was the key. Sound movies did not require 24-fps, but they required a single, standardized speed. The choice of that speed fell to Stanley Watkins, an employee of Western Electric, which developed the Vitaphone process. Watkins diligently undertook research. According to Scott Eyman’s book The Speed of Sound (Simon & Schuster 1997), he explained the process in 1961:

“What happened was that we got together with Warners’ chief projectionist and asked him how fast they ran the film in theaters. He told us it went at 80 to 90 feet per minute in the best first-run houses and in the small ones anything from 100 feet up, according to how many shows they wanted to get in during the day. After a little thought, we settled on 90 feet a minute [24-fps for 35 mm film] as a reasonable compromise.”

That’s it. That’s where 24-fps came from: no visual or acoustic testing, no scientific calculation, just a conversation between one projectionist, one engineer, and, according to Watkins’s daughter Barbara Witemeyer in a 2000 paper (“The Sound of Silents”), Sam Warner (of Warner Bros.) and Walter Rich, president of Vitaphone. After Vitaphone and Warner Bros., Fox adopted the speed, and soon it was ubiquitous.
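The bracketed conversion in Watkins’s account follows from the fact that standard 35 mm film, with four perforations per frame, carries 16 frames per foot. A minimal Python sketch (the function name is illustrative):

```python
FRAMES_PER_FOOT_35MM = 16  # standard 4-perforation 35 mm film


def fps_from_feet_per_minute(feet_per_minute: float,
                             frames_per_foot: int = FRAMES_PER_FOOT_35MM) -> float:
    """Convert a projector's film speed to frames per second."""
    return feet_per_minute * frames_per_foot / 60.0


if __name__ == "__main__":
    for speed in (80, 90, 100):
        print(speed, "feet per minute =", fps_from_feet_per_minute(speed), "fps")
```

So the projectionist’s range of 80 to 100 feet per minute works out to roughly 21 to 27 fps, with 90 feet per minute landing exactly on 24.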

Fluke or not, 24 fps came to symbolize the look of film, which is why 24-fps video is so popular. We have a habit of associating that rate with movies.

The Hobbit broke that habit. It is available in a 48-fps, so-called “HFR” (high-frame-rate) version. And its look has received some unusual reviews.

Some have complained of nausea. It’s conceivable that there is some artifact of the way The Hobbit has been projected in some theaters (in stereoscopic 3D) that triggers a queasiness response in some viewers, but it seems (to me) more likely that those viewers might be reacting to some overhead, spinning shots in the same way that viewers have reacted to roller-coaster shots in slower-frame-rate movies.

Others have complained of a news-like or video-like look that made it more difficult for them to suspend disbelief and get into the story. That’s certainly possible. If 24-fps contributes to the look of what we are in the habit of thinking of as a movie, then 48-fps is different.

Of course, we no longer watch flickering silent black-&-white movies with intertitles, projected at a rate faster than they were shot, either. Times change.



The Impossible Dream: Perfect Lip Sync

March 31st, 2010 | No Comments | Posted in Schubin Cafe

There is definitely plenty that can be done to improve lip sync.  Making it perfect, however, might not be possible.

Perhaps it would be best to start with a definition.  Lip sync is the synchronization of the sounds emerging from moving lips with the images of those moving lips.  No moving images, no lip-sync issues, per se.

There are many creation myths, and one associated with moving images and sound is that The Jazz Singer (1927) was the first sound movie.  It wasn’t.

It wasn’t even the first Warner Bros. Vitaphone synchronized-sound feature movie.  And it wasn’t the first “all-talking, all-singing” sound movie, not least because it wasn’t all talking or all singing.  Here’s a typical “silent” movie “intertitle,” from one of many non-talking sections of The Jazz Singer:

Jazz Singer slide

What the first sync-sound movie actually was is not obvious.  Scientific American suggested adding sound to 3D projected images in 1877, but those were to be still pictures.  Wordsworth Donisthorpe responded in Nature a few weeks later that he could do it with moving pictures.

It’s possible (based on recollections decades later) that some experimental apparatus was built around 1888.  Edison wrote in his fourth motion-picture patent caveat that “all movements of a person photographed will be exactly coincident with any sound made by him.”

There’s no question that Edison demonstrated sound-movie Kinetophones by 1893.  But, despite a contemporary report that the sound was in sync with the pictures, it’s possible that the sound merely started at the same time as the pictures.  And the Kinetophone was a one-viewer-at-a-time, short-duration system.

There’s also no question that a form of sync-sound movies was shown at the Phono-Cinéma-Théâtre at the World’s Fair in Paris in 1900.  But the system was different from what we’re accustomed to in video production today.

First, the pictures were captured.  Then, watching the images of themselves on screen, the performers lip-synched to what they had done during a phonograph sound-recording session.

In presentation, the process was reversed.  The projectionist used a telephone receiver to listen to the sound (from a phonograph in the orchestra pit) and adjusted the cranking speed of the projector to maintain lip sync (or at least to attempt to maintain something pretty close to proper lip sync).

True lip sync, with sound and picture locked, was actually patented towards the end of the 19th century, and implemented no later than the first decade of the 20th.  More of the history may be found here:

From roughly the beginning of the 20th century to the introduction of digital video processing in the early 1970s, there was good lip sync.  But it wasn’t always automatic.

Movie sound was typically recorded separately from pictures.  A clapper atop the slate provided a sync point, and various mechanisms were used to make the camera and sound-recorder motors run in sync, but sound was manually synchronized to picture.  Video recorders captured both sound and picture together, but editors using early mechanical equipment had to take into consideration a considerable distance between the video and audio heads.

Then came that digital video processing.  The CVS 500 in 1973 could not only synchronize incoming feeds but also shrink them to a quarter of their size, something that seems trivial today but was near miraculous at the time.  Unfortunately, it also delayed the video by one field (half a frame).

In the grand scheme of things, half a frame is not a lot.  But multiple passes through video-delaying devices soon followed.  A feed to a network might get synchronized, and then the network’s feed to a station might get synchronized again.  One pass through a digital effects processor might have been used to shrink an image so it fits within a larger one, and another pass might have been used to push both images off the screen.
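To get a feel for how those passes add up, here is a small Python sketch. It assumes the U.S. television rate of about 29.97 frames per second and, for simplicity, that every device delays the picture by the same amount; both are my assumptions, since later devices often delayed by a full frame or more.

```python
NTSC_FPS = 30000 / 1001  # about 29.97 frames per second (assumed rate)


def audio_lead_ms(passes: int, fields_per_pass: float = 1.0,
                  fps: float = NTSC_FPS) -> float:
    """Milliseconds by which undelayed sound runs ahead of delayed picture."""
    frame_ms = 1000.0 / fps
    field_ms = frame_ms / 2.0  # one field is half a frame
    return passes * fields_per_pass * field_ms


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "pass(es):", round(audio_lead_ms(n), 1), "ms of audio lead")
```

A single one-field pass costs under 17 milliseconds, but four of them already put the sound two full frames ahead of the picture.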

International standards converters intentionally used longer delays to help with their frame-rate conversion.  Today, there are also up- and down-converters to and from HDTV and 24p.

There was even a video delay caused, during a brief period of madness in U.S. television, by a different timing issue.  When NTSC color was introduced in 1953, there was no specified relationship between the phase of the color subcarrier and the horizontal sync pulse, because it didn’t matter.  When color recorders were introduced, however, that lack of specificity tended to increase the size of the horizontal blanking interval (the period between the end of video at the right edge of the picture and its start at the left).

After enough generations of re-recording and editing, the increase could violate FCC regulations (though it was almost never enough to be visible on a home TV).  So, after digital video effects units were introduced that could expand the picture, broadcasters began using them to conform to the regulations.  Pictures got blurry, and sound got out of sync, before the FCC announced that it wouldn’t demand the correction.

All of those video-delaying devices advanced the sound, the worst possible lip-sync problem.  And, initially, there were no matching audio delays.  Some news broadcasts (usually involving frame synchronizers and often adding standards conversion and video effects) started to look as non-synchronous as some Fellini movies.  Today, with audio delays available, there’s no longer any good excuse for lip-sync errors in production and post.

Then there’s distribution, commonly involving MPEG bit-rate reduction.  Presentation time stamps (PTS) are used to lock audio and video together.  Unfortunately, decoders aren’t required to use them, and, if they don’t, lip sync can slip.  If your TV set, cable box, or satellite receiver has slipping lip sync, the best you can do (other than complaining) is change channels and come back; the signal interruption will usually cause the decoder to re-lock sound to picture.  And, if you’ve been watching the same channel for a long time, it might be a good idea to change channels and return before settling in for a movie.
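MPEG presentation time stamps count ticks of a 90 kHz clock, so the size of a slip can be estimated from the stamps of whatever sound and picture a decoder is presenting at the same instant. The Python sketch below is only an illustration with made-up function names and values; a real decoder compares each stamp against its own system clock and must also handle the stamp counter wrapping around, which this ignores.

```python
PTS_CLOCK_HZ = 90_000  # MPEG presentation time stamps tick at 90 kHz


def pts_to_ms(pts_ticks: int) -> float:
    """Convert a count of 90 kHz ticks to milliseconds."""
    return pts_ticks * 1000.0 / PTS_CLOCK_HZ


def sound_lead_ms(audio_pts: int, video_pts: int) -> float:
    """Lip-sync error for audio and video presented at the same moment.

    Positive means the sound is ahead of the picture; zero means in sync.
    """
    return pts_to_ms(audio_pts - video_pts)


if __name__ == "__main__":
    video_pts = 900_000          # hypothetical stamp of the frame on screen
    audio_pts = 900_000 + 6006   # hypothetical stamp of the sound being heard
    print(round(sound_lead_ms(audio_pts, video_pts), 1), "ms")
```

In this hypothetical case the sound is 6,006 ticks, or about 67 milliseconds, ahead of the picture, which is two frames at the U.S. television rate.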

After enough complaints or lost business, perhaps all decoders will someday keep and maintain lip sync.  And it’s certainly possible to make sure any full-picture video delays are matched by audio delays (imaging chips and displays sometimes introduce differential delays between the tops and bottoms of pictures, but they’re very brief).  But then there is space, the final frontier as far as lip sync is concerned.

Light travels so fast that it’s essentially instantaneous.  Sound is a lot slower.  Aircraft have traveled faster than sound; bullets do it a lot.  At nominal temperature and humidity, sound travels a little less than 37 feet in the course of one video frame.

If someone is singing 50 feet away from a microphone (as on an opera stage), the audio will be picked up more than a frame late.  If the sound is then heard in the back row of a movie theater, there will be still more delay.
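The acoustic arithmetic can be sketched in Python as follows. The speed of sound used here, 1,100 feet per second, is an assumed round figure (the true value varies with temperature), and the frame rate is the U.S. television rate of about 29.97 frames per second.

```python
SPEED_OF_SOUND_FT_PER_S = 1100.0  # assumed round figure; varies with temperature
VIDEO_FPS = 30000 / 1001          # about 29.97 frames per second (assumed rate)


def feet_per_frame(speed_ft_per_s: float = SPEED_OF_SOUND_FT_PER_S,
                   fps: float = VIDEO_FPS) -> float:
    """Distance sound travels during one video frame."""
    return speed_ft_per_s / fps


def acoustic_delay_frames(distance_ft: float,
                          speed_ft_per_s: float = SPEED_OF_SOUND_FT_PER_S,
                          fps: float = VIDEO_FPS) -> float:
    """How many frames late the sound arrives after traveling distance_ft."""
    return distance_ft / feet_per_frame(speed_ft_per_s, fps)


if __name__ == "__main__":
    print(round(feet_per_frame(), 1), "feet per frame")
    print(round(acoustic_delay_frames(50.0), 2), "frames late at 50 feet")
```

With those assumed figures, a singer 50 feet from the microphone is picked up about one and a third frames late, before any additional delay between the loudspeaker and the listener.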

Inauguration of President Harding

There’s one way around this.  It’s called visual-acoustic perspective.  When we see someone speaking from a distance, we don’t expect the lip sync to be correct.  That’s why someone sitting three frames away from the stage, hearing a singer two frames behind the proscenium, doesn’t think there’s anything wrong.

Unfortunately, tight lenses can create close-ups, and close-ups make people want tight lip sync, even when it’s physically impossible.  There have already been cases when viewers of live transmissions to cinemas have complained of varying lip sync when all that was happening was cutting between wide shots and close-ups.

Directors of productions shown at large viewing distances should bear that problem in mind.  Otherwise, there’s not much that can be done about acoustic lip-sync issues.  Advancing the sound doesn’t help viewers in the front row.

Otherwise, just make sure all video delays are matched by audio delays.  And complain regularly about decoders not using time stamps.
