
Building a real-time video sequencer on iOS

Hi, my name is Anton and I am an iOS developer at Rosberry. Not long ago I had the chance to work on the Hype Type project and to solve several interesting problems involving video, text and animations. In this article I will talk about the pitfalls you may hit when writing a real-time video sequencer on iOS, and about possible ways around them.


A little about the application itself ...


Hype Type lets the user record several short video snippets and/or photos with a total duration of up to 15 seconds, add text to the resulting clip and apply one of several animations to it.




The main peculiarity of working with video here is that the user must be able to control each recorded piece independently of the others: change its playback speed, reverse it, flip it and (possibly in future versions) edit the clips on the fly.




Ready-made solutions?


“Why not use AVMutableComposition?” you might ask, and in most cases you would be right: it really is a convenient system video sequencer. Alas, it has limitations that kept us from using it. First of all, there is no way to change or add tracks on the fly: to get the modified video stream you have to recreate the AVPlayerItem and reinitialize the AVPlayer. AVMutableComposition is also far from ideal for working with images: to put a static image on the timeline you have to use AVVideoCompositionCoreAnimationTool, which adds a fair amount of overhead and noticeably slows down rendering.
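
To illustrate the first limitation, here is a minimal sketch (variable names such as clipURL and player are assumptions, not code from the project): once the composition is mutated, the only way to make the player pick up the change is to build a new AVPlayerItem.

AVMutableComposition *composition = [AVMutableComposition composition];
AVMutableCompositionTrack *videoTrack =
    [composition addMutableTrackWithMediaType:AVMediaTypeVideo
                             preferredTrackID:kCMPersistentTrackID_Invalid];

AVAsset *clip = [AVAsset assetWithURL:clipURL]; // clipURL is hypothetical
[videoTrack insertTimeRange:CMTimeRangeMake(kCMTimeZero, clip.duration)
                    ofTrack:[clip tracksWithMediaType:AVMediaTypeVideo].firstObject
                     atTime:kCMTimeZero
                      error:nil];

// any later edit to the composition is not reflected in the item that is
// already playing; the player has to be handed a freshly built item instead
AVPlayerItem *freshItem = [AVPlayerItem playerItemWithAsset:[composition copy]];
[player replaceCurrentItemWithPlayerItem:freshItem]; // player is hypothetical
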


A quick search did not turn up any other reasonably suitable solutions, so we decided to write our own sequencer.


So…


First, a little about the structure of the render pipeline in the project. I will say straight away that I will not go too deep into the details and will assume that you are already more or less familiar with the topic; otherwise this material would grow to an incredible size. If you are a beginner, I recommend taking a look at the well-known GPUImage framework (Obj-C, Swift): it is a great starting point for understanding OpenGL ES through a hands-on example.


The view that draws the incoming video on screen requests frames from the sequencer on a timer (CADisplayLink). Since the application works primarily with video, it is most logical to use the YCbCr colorspace and pass each frame as a CVPixelBufferRef. After a frame is received, luminance and chrominance textures are created and passed to the shader program; the output is the RGB image the user sees. In this case the refresh loop looks something like this:


- (void)onDisplayRefresh:(CADisplayLink *)sender {
    // advance position of sequencer
    [self.source advanceBy:sender.duration];
    // check for new pixel buffer
    if ([self.source hasNewPixelBuffer]) {
        // get one
        PixelBuffer *pixelBuffer = [self.source nextPixelBuffer];
        // dispatch to gl processing queue
        [self.context performAsync:^{
            // prepare textures
            self.luminanceTexture = [self.context.textureCache textureWithPixelBuffer:pixelBuffer
                                                                           planeIndex:0
                                                                             glFormat:GL_LUMINANCE];
            self.chrominanceTexture = [self.context.textureCache textureWithPixelBuffer:pixelBuffer
                                                                             planeIndex:1
                                                                               glFormat:GL_LUMINANCE_ALPHA];
            // prepare shader program, uniforms, etc
            self.program.orientation = pixelBuffer.orientation;
            // ...
            // signal to draw
            [self setNeedsRedraw];
        }];
    }
    if ([self.source isFinished]) {
        // rewind if needed
        [self.source rewind];
    }
}

// ...

- (void)draw {
    [self.context performSync:^{
        // bind textures
        [self.luminanceTexture bind];
        [self.chrominanceTexture bind];
        // use shader program
        [self.program use];
        // unbind textures
        [self.luminanceTexture unbind];
        [self.chrominanceTexture unbind];
    }];
}

Virtually everything here is built on wrappers (for CVPixelBufferRef, CVOpenGLESTexture, etc.), which lets us move the main low-level logic into a separate layer and significantly simplifies the basic OpenGL plumbing. Of course, this has its drawbacks (mostly a small loss of performance and somewhat less flexibility), but they are not critical. A couple of things worth explaining: self.context is a fairly simple wrapper over EAGLContext that makes working with CVOpenGLESTextureCache and multithreaded OpenGL calls easier; self.source is the sequencer, which decides which frame from which track should be rendered in the view.
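
For reference, a context wrapper of this kind might look roughly like the sketch below. This is an assumption about its shape rather than the project's actual class; the idea is simply a serial queue plus setCurrentContext:, so that all GL work for a given context happens on one thread.

#import <Foundation/Foundation.h>
#import <OpenGLES/EAGL.h>
#import <CoreVideo/CoreVideo.h>

@interface GLContext : NSObject
@property (nonatomic, strong, readonly) EAGLContext *glContext;
@property (nonatomic, readonly) CVOpenGLESTextureCacheRef textureCache;
- (void)performSync:(dispatch_block_t)block;
- (void)performAsync:(dispatch_block_t)block;
@end

@implementation GLContext {
    dispatch_queue_t _queue;
}

- (instancetype)init {
    if (self = [super init]) {
        _queue = dispatch_queue_create("com.example.gl.context", DISPATCH_QUEUE_SERIAL);
        _glContext = [[EAGLContext alloc] initWithAPI:kEAGLRenderingAPIOpenGLES2];
        // one texture cache per context, used to turn CVPixelBufferRefs into GL textures
        CVOpenGLESTextureCacheCreate(kCFAllocatorDefault, NULL, _glContext, NULL, &_textureCache);
    }
    return self;
}

- (void)performSync:(dispatch_block_t)block {
    dispatch_sync(_queue, ^{
        [EAGLContext setCurrentContext:self.glContext];
        block();
    });
}

- (void)performAsync:(dispatch_block_t)block {
    dispatch_async(_queue, ^{
        [EAGLContext setCurrentContext:self.glContext];
        block();
    });
}

@end
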


Now for how frames are supplied for rendering. Since the sequencer must work with both video and still images, it makes the most sense to hide everything behind a common protocol. The sequencer's job is then reduced to keeping track of the playhead and, depending on its position, handing out a new frame from the corresponding track.


@protocol MovieSourceProtocol <NSObject>

// start & stop reading methods
- (void)startReading;
- (void)cancelReading;

// methods for getting frame rate & current offset
- (float)frameRate;
- (float)offset;

// method to check if we already read everything...
- (BOOL)isFinished;
// ...and to rewind source if we did
- (void)rewind;

// method for scrubbing
- (void)seekToOffset:(CGFloat)offset;

// method for reading frames
- (PixelBuffer *)nextPixelBuffer;

@end

The logic of actually obtaining frames falls on the objects that implement MovieSourceProtocol. This scheme keeps the system universal and extensible, since the only difference between image and video processing is the way frames are obtained.


The VideoSequencer itself thus becomes quite simple; the main difficulty is determining the current track and reconciling all tracks to a single frame rate.


- (PixelBuffer *)nextPixelBuffer {
    // get current track
    VideoSequencerTrack *track = [self trackForPosition:self.position];
    // get track source
    id<MovieSourceProtocol> source = track.source; // Here's our source
    // get pixel buffer
    return [source nextPixelBuffer];
}
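
A naive trackForPosition: could simply walk the tracks by their cumulative durations. Here is a rough sketch (track.duration is an assumed property on the track wrapper, and real code would also have to account for playback speed):

- (VideoSequencerTrack *)trackForPosition:(NSTimeInterval)position {
    NSTimeInterval accumulated = 0.0;
    for (VideoSequencerTrack *track in self.tracks) {
        // the playhead falls inside this track's time slice
        if (position < accumulated + track.duration) {
            return track;
        }
        accumulated += track.duration;
    }
    // past the end: either nothing left to play or the sequencer is about to rewind
    return self.tracks.lastObject;
}
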

VideoSequencerTrack here is a wrapper over an object implementing MovieSourceProtocol, plus some additional metadata.


@interface VideoSequencerTrack : NSObject

- (id)initWithSource:(id<MovieSourceProtocol>)source;

@property (nonatomic, assign) BOOL editable;
// ... and other metadata

@end

Working with static images


Now let's move on to actually obtaining frames. Consider the simplest case: displaying a single image. It can come either from the camera, in which case we immediately get a CVPixelBufferRef in the YCbCr format that we can simply copy (why this matters I will explain later) and hand out on request; or from the media library, in which case we have to do a bit of work and convert the image into the required format manually. The RGB-to-YCbCr conversion could be offloaded to the GPU, but on modern devices the CPU handles it quickly enough, especially considering that the application additionally resizes and compresses the image before using it. Otherwise everything is quite simple: all we need to do is return the same frame for the allotted period of time.


@implementation ImageSource

// init with pixel buffer from camera
- (id)initWithPixelBuffer:(PixelBuffer *)pixelBuffer
              orientation:(AVCaptureVideoOrientation)orientation
                 duration:(NSTimeInterval)duration {
    if (self = [super init]) {
        self.orientation = orientation;
        self.pixelBuffer = [pixelBuffer copy];
        self.duration = duration;
    }
    return self;
}

// init with UIImage
- (id)initWithImage:(UIImage *)image duration:(NSTimeInterval)duration {
    if (self = [super init]) {
        self.duration = duration;
        self.orientation = AVCaptureVideoOrientationPortrait;

        // prepare empty pixel buffer
        self.pixelBuffer = [[PixelBuffer alloc] initWithSize:image.size
                                                 pixelFormat:kCVPixelFormatType_420YpCbCr8BiPlanarFullRange];

        // get base addresses of image planes
        uint8_t *yBaseAddress = self.pixelBuffer.yPlane.baseAddress;
        size_t yPitch = self.pixelBuffer.yPlane.bytesPerRow;
        uint8_t *uvBaseAddress = self.pixelBuffer.uvPlane.baseAddress;
        size_t uvPitch = self.pixelBuffer.uvPlane.bytesPerRow;

        // get image data
        CFDataRef pixelData = CGDataProviderCopyData(CGImageGetDataProvider(image.CGImage));
        uint8_t *data = (uint8_t *)CFDataGetBytePtr(pixelData);
        uint32_t imageWidth = image.size.width;
        uint32_t imageHeight = image.size.height;

        // do the magic (convert from RGB to YCbCr)
        for (int row = 0; row < imageHeight; ++row) {
            uint8_t *rgbBufferLine = &data[row * imageWidth * 4];
            uint8_t *yBufferLine = &yBaseAddress[row * yPitch];
            uint8_t *cbCrBufferLine = &uvBaseAddress[(row >> 1) * uvPitch];

            for (int x = 0; x < imageWidth; ++x) {
                uint8_t *rgbOutput = &rgbBufferLine[x * 4];
                int16_t red = rgbOutput[0];
                int16_t green = rgbOutput[1];
                int16_t blue = rgbOutput[2];

                int16_t y = 0.299 * red + 0.587 * green + 0.114 * blue;
                int16_t u = -0.147 * red - 0.289 * green + 0.436 * blue;
                int16_t v = 0.615 * red - 0.515 * green - 0.1 * blue;

                yBufferLine[x] = CLAMP(y, 0, 255);
                cbCrBufferLine[x & ~1] = CLAMP(u + 128, 0, 255);
                cbCrBufferLine[x | 1] = CLAMP(v + 128, 0, 255);
            }
        }
        CFRelease(pixelData);
    }
    return self;
}

// ...

- (BOOL)isFinished {
    return (self.offset > self.duration);
}

- (void)rewind {
    self.offset = 0.0;
}

- (PixelBuffer *)nextPixelBuffer {
    if ([self isFinished]) {
        return nil;
    }
    return self.pixelBuffer;
}

// ...

Working with video


Now let's add video. For this we decided to use AVPlayer, mainly because it has a fairly convenient API for getting frames and completely takes care of audio. It all sounds simple enough, but there are a few points worth paying attention to.
Let's start with the obvious:


- (void)setURL:(NSURL *)url withCompletion:(void(^)(BOOL success))completion {
    self.setupCompletion = completion;

    // prepare asset
    self.asset = [[AVURLAsset alloc] initWithURL:url options:@{
        AVURLAssetPreferPreciseDurationAndTimingKey: @(YES),
    }];

    // load asset tracks
    __weak VideoSource *weakSelf = self;
    [self.asset loadValuesAsynchronouslyForKeys:@[@"tracks"] completionHandler:^{
        // prepare player item
        weakSelf.playerItem = [AVPlayerItem playerItemWithAsset:weakSelf.asset];
        [weakSelf.playerItem addObserver:weakSelf
                              forKeyPath:@"status"
                                 options:NSKeyValueObservingOptionNew
                                 context:nil];
    }];
}

- (void)observeValueForKeyPath:(NSString *)keyPath
                      ofObject:(id)object
                        change:(NSDictionary *)change
                       context:(void *)context {
    if (self.playerItem.status == AVPlayerItemStatusReadyToPlay) {
        // ready to play, prepare output
        NSDictionary *outputSettings = @{
            (id)kCVPixelBufferPixelFormatTypeKey: @(kCVPixelFormatType_420YpCbCr8BiPlanarFullRange),
            (id)kCVPixelBufferOpenGLESCompatibilityKey: @(YES),
            (id)kCVPixelBufferOpenGLCompatibilityKey: @(YES),
            (id)kCVPixelBufferIOSurfacePropertiesKey: @{
                @"IOSurfaceOpenGLESFBOCompatibility": @(YES),
                @"IOSurfaceOpenGLESTextureCompatibility": @(YES),
            },
        };
        self.videoOutput = [[AVPlayerItemVideoOutput alloc] initWithPixelBufferAttributes:outputSettings];
        [self.playerItem addOutput:self.videoOutput];

        if (self.setupCompletion) {
            self.setupCompletion(YES);
        }
    }
}

// ...

- (void)rewind {
    [self seekToOffset:0.0];
}

- (void)seekToOffset:(CGFloat)offset {
    [self.playerItem seekToTime:[self timeForOffset:offset]
                toleranceBefore:kCMTimeZero
                 toleranceAfter:kCMTimeZero];
}

- (PixelBuffer *)nextPixelBuffer {
    // check for new pixel buffer...
    CMTime time = self.playerItem.currentTime;
    if (![self.videoOutput hasNewPixelBufferForItemTime:time]) {
        return nil;
    }
    // ... and grab it if there is one
    CVPixelBufferRef bufferRef = [self.videoOutput copyPixelBufferForItemTime:time itemTimeForDisplay:nil];
    if (!bufferRef) {
        return nil;
    }
    PixelBuffer *pixelBuffer = [[PixelBuffer alloc] initWithPixelBuffer:bufferRef];
    CVBufferRelease(bufferRef);
    return pixelBuffer;
}

We create an AVURLAsset, load the track information, create an AVPlayerItem, wait for the notification that it is ready for playback and create an AVPlayerItemVideoOutput with parameters suitable for rendering. So far everything is quite simple.


However, this is also where the first problem lies: seekToTime is not fast enough, and the loop gets a noticeable hiccup. Relaxing the toleranceBefore and toleranceAfter parameters does not help much either: on top of the delay you also get positioning inaccuracy. This is a system limitation and cannot be eliminated completely, but it can be worked around: it is enough to prepare two AVPlayerItems and use them in turn. As soon as one of them finishes playing, the other starts immediately while the first is rewound to the beginning, and so on in a circle.
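
A rough sketch of that double-item trick might look like the following (property names such as items, currentItemIndex and player are assumptions, not the project's actual code):

- (void)prepareLoopingItemsForAsset:(AVAsset *)asset {
    // two items for the same asset, alternated on every loop
    self.items = @[[AVPlayerItem playerItemWithAsset:asset],
                   [AVPlayerItem playerItemWithAsset:asset]];
    self.currentItemIndex = 0;
    self.player = [AVPlayer playerWithPlayerItem:self.items[0]];

    [[NSNotificationCenter defaultCenter] addObserver:self
                                             selector:@selector(itemDidPlayToEnd:)
                                                 name:AVPlayerItemDidPlayToEndTimeNotification
                                               object:nil];
}

- (void)itemDidPlayToEnd:(NSNotification *)notification {
    AVPlayerItem *finishedItem = notification.object;

    // immediately switch to the other, already rewound item...
    self.currentItemIndex = (self.currentItemIndex + 1) % 2;
    [self.player replaceCurrentItemWithPlayerItem:self.items[self.currentItemIndex]];
    [self.player play];

    // ...and rewind the finished one in the background for the next cycle
    [finishedItem seekToTime:kCMTimeZero
             toleranceBefore:kCMTimeZero
              toleranceAfter:kCMTimeZero
           completionHandler:nil];
}
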


Another unpleasant but solvable problem: AVFoundation properly (seamlessly and smoothly) supports changing the playback speed and reversing only for certain file types. While we control the output format when recording from the camera, we have no such luxury when the user imports a video from the media library. Making users wait until the video is converted is a bad option, especially since it is far from certain they will ever touch those settings, so we decided to do the conversion in the background and quietly replace the original video with the converted one.


- (void)processAndReplace:(NSURL *)inputURL outputURL:(NSURL *)outputURL {
    [[NSFileManager defaultManager] removeItemAtURL:outputURL error:nil];

    // prepare reader
    MovieReader *reader = [[MovieReader alloc] initWithInputURL:inputURL];
    reader.timeRange = self.timeRange;

    // prepare writer
    MovieWriter *writer = [[MovieWriter alloc] initWithOutputURL:outputURL];
    writer.videoSettings = @{
        AVVideoCodecKey: AVVideoCodecH264,
        AVVideoWidthKey: @(1280.0),
        AVVideoHeightKey: @(720.0),
    };
    writer.audioSettings = @{
        AVFormatIDKey: @(kAudioFormatMPEG4AAC),
        AVNumberOfChannelsKey: @(1),
        AVSampleRateKey: @(44100),
        AVEncoderBitRateStrategyKey: AVAudioBitRateStrategy_Variable,
        AVEncoderAudioQualityForVBRKey: @(90),
    };

    // fire up reencoding
    MovieProcessor *processor = [[MovieProcessor alloc] initWithReader:reader writer:writer];
    processor.processingSize = (CGSize){ .width = 1280.0, .height = 720.0 };

    __weak typeof(self) weakSelf = self;
    [processor processWithProgressBlock:nil andCompletion:^(NSError *error) {
        if (!error) {
            weakSelf.replacementURL = outputURL;
        }
    }];
}

MovieProcessor here is a service that receives frames and audio samples from the reader and passes them to the writer. (It can also process the frames received from the reader on the GPU, but that is only used when rendering the whole project, to overlay the animation frames onto the final video.)
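
Under the hood such a processor is essentially a pass-through loop on top of AVAssetReader/AVAssetWriter. A stripped-down sketch of the video path (ignoring audio and GPU processing, with assumed method and parameter names) could look like this:

- (void)passThroughVideoFrom:(AVAssetReader *)reader
                      output:(AVAssetReaderOutput *)readerOutput
                      writer:(AVAssetWriter *)writer
                       input:(AVAssetWriterInput *)writerInput
                  completion:(void (^)(NSError *error))completion {
    [reader startReading];
    [writer startWriting];
    [writer startSessionAtSourceTime:kCMTimeZero];

    dispatch_queue_t queue = dispatch_queue_create("com.example.movie.processor", DISPATCH_QUEUE_SERIAL);
    [writerInput requestMediaDataWhenReadyOnQueue:queue usingBlock:^{
        while (writerInput.readyForMoreMediaData) {
            CMSampleBufferRef sampleBuffer = [readerOutput copyNextSampleBuffer];
            if (!sampleBuffer) {
                // NULL means the reader is done (or failed): finish the file
                [writerInput markAsFinished];
                [writer finishWritingWithCompletionHandler:^{
                    if (completion) {
                        completion(writer.error);
                    }
                }];
                break;
            }
            [writerInput appendSampleBuffer:sampleBuffer];
            CFRelease(sampleBuffer);
        }
    }];
}
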


Now for the harder part


But what if the user wants to add 10-15 video clips to the project at once? Since the application should not limit the number of clips the user can work with, this scenario has to be handled as well.


If every clip is prepared for playback only when it is needed, the delays become too noticeable. Preparing all the clips for playback at once does not work either, because iOS limits the number of h264 decoders that can operate simultaneously. Fortunately, the way out is quite simple: prepare a couple of the tracks that will be played next ahead of time and "clean up" the ones that are not going to be used in the near future.


- (void)cleanupTrackSourcesIfNeeded {
    const NSUInteger cleanupDelta = 1;

    NSUInteger trackCount = [self.tracks count];
    NSUInteger currentIndex = [self.tracks indexOfObject:self.currentTrack];
    if (currentIndex == NSNotFound) {
        currentIndex = 0;
    }

    NSUInteger index = 0;
    for (VideoSequencerTrack *track in self.tracks) {
        // distance from the current track, taking looping into account
        NSUInteger currentDelta = MAX(currentIndex, index) - MIN(currentIndex, index);
        currentDelta = MIN(currentDelta, index + (trackCount - currentIndex - 1));

        if (currentDelta > cleanupDelta) {
            track.playheadPosition = 0.0;
            [track.source cancelReading];
            [track.source cleanup];
        } else {
            [track.source startReading];
        }
        ++index;
    }
}

This simple approach gave us gapless playback and looping. Yes, scrubbing will inevitably show a small lag, but that is not critical.


Pitfalls


Finally, a few words about the pitfalls you may run into when solving problems like these.


The first: if you work with pixel buffers received from the device camera, either release them immediately or copy them if you want to use them later. Otherwise the video stream will freeze. I did not find this restriction mentioned in the documentation, but apparently the system tracks the pixel buffers it hands out and simply stops giving you new ones while the old ones are still held in memory.
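
A minimal sketch of such a deep copy for a planar (YCbCr) buffer is shown below; the helper name is hypothetical, and the real project copies through its PixelBuffer wrapper instead.

#import <Foundation/Foundation.h>
#import <CoreVideo/CoreVideo.h>

// assumes a planar buffer, e.g. kCVPixelFormatType_420YpCbCr8BiPlanarFullRange
static CVPixelBufferRef CopyPixelBuffer(CVPixelBufferRef source) {
    CVPixelBufferLockBaseAddress(source, kCVPixelBufferLock_ReadOnly);

    CVPixelBufferRef copy = NULL;
    CVPixelBufferCreate(kCFAllocatorDefault,
                        CVPixelBufferGetWidth(source),
                        CVPixelBufferGetHeight(source),
                        CVPixelBufferGetPixelFormatType(source),
                        NULL,
                        &copy);
    CVPixelBufferLockBaseAddress(copy, 0);

    // copy every plane row by row (bytesPerRow may differ between the buffers)
    size_t planeCount = CVPixelBufferGetPlaneCount(source);
    for (size_t plane = 0; plane < planeCount; ++plane) {
        uint8_t *src = (uint8_t *)CVPixelBufferGetBaseAddressOfPlane(source, plane);
        uint8_t *dst = (uint8_t *)CVPixelBufferGetBaseAddressOfPlane(copy, plane);
        size_t srcPitch = CVPixelBufferGetBytesPerRowOfPlane(source, plane);
        size_t dstPitch = CVPixelBufferGetBytesPerRowOfPlane(copy, plane);
        size_t height = CVPixelBufferGetHeightOfPlane(source, plane);
        for (size_t row = 0; row < height; ++row) {
            memcpy(dst + row * dstPitch, src + row * srcPitch, MIN(srcPitch, dstPitch));
        }
    }

    CVPixelBufferUnlockBaseAddress(copy, 0);
    CVPixelBufferUnlockBaseAddress(source, kCVPixelBufferLock_ReadOnly);
    return copy; // the caller releases it with CVPixelBufferRelease
}
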


The second is multithreading with OpenGL. OpenGL itself is not thread-friendly, but this can be worked around by using separate EAGLContexts that belong to the same EAGLSharegroup. This makes it quick and simple to separate the logic that draws what the user sees on screen from the various background processes (video processing, export rendering, etc.).
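
The setup itself boils down to a couple of lines; a small sketch (processingQueue here is a hypothetical dispatch queue):

// the background context shares textures and buffers with the on-screen one,
// so each thread gets its own context but they see the same GL objects
EAGLContext *screenContext = [[EAGLContext alloc] initWithAPI:kEAGLRenderingAPIOpenGLES2];
EAGLContext *backgroundContext =
    [[EAGLContext alloc] initWithAPI:kEAGLRenderingAPIOpenGLES2
                          sharegroup:screenContext.sharegroup];

// on the render thread
[EAGLContext setCurrentContext:screenContext];
// ... draw what the user sees

// on the processing queue
dispatch_async(processingQueue, ^{
    [EAGLContext setCurrentContext:backgroundContext];
    // ... offscreen work: video processing, export rendering, etc.
});
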



Source: https://habr.com/ru/post/332416/

